Re: RBD fio Performance concerns

* Re: RBD fio Performance concerns
       [not found] ` <CAOLwVUmQa4C_vs_Mbi3b2LeO=wx8_EMVWX5Pyu0y-JnG8nyz+Q@mail.gmail.com>
@ 2012-11-16 22:59   ` Mark Kampe
  2012-11-19 14:56     ` Sébastien Han
  0 siblings, 1 reply; 51+ messages in thread
From: Mark Kampe @ 2012-11-16 22:59 UTC (permalink / raw)
  To: Sébastien Han; +Cc: ceph-devel

On 11/15/2012 12:23 PM, Sébastien Han wrote:

> First of all, I would like to thank you for this well explained,
> structured and clear answer. I guess I got better IOPS thanks to the 10K disks.

10K RPM would bring your per-drive throughput (for 4K random writes)
up to 142 IOPS and your aggregate cluster throughput up to 1700.
This would predict a corresponding RADOSbench throughput somewhere
above 425 (how much better depending on write aggregation and cylinder 
affinity).  Your RADOSbench 708 now seems even more reasonable.
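
The arithmetic behind these figures can be sketched in a few lines of Python. This is only a back-of-envelope model: it assumes the same 4 ms seek used in the 7200 RPM estimate quoted further down, half a rotation of latency, and a negligible transfer time. The function and constant names are illustrative and not part of any Ceph tool.

```python
def drive_iops(seek_ms, rpm, transfer_ms=0.0):
    """Estimate sustained random-write IOPS for one spinning drive."""
    # Average rotational latency is half a revolution.
    half_rotation_ms = 60_000.0 / rpm / 2.0
    service_ms = seek_ms + half_rotation_ms + transfer_ms
    return 1000.0 / service_ms


DRIVES = 12             # 4 servers x 3 OSDs
WRITES_PER_CREATE = 4   # two copies, two writes each (per the earlier estimate)

per_drive = drive_iops(seek_ms=4.0, rpm=10_000)
cluster = DRIVES * per_drive
rados_floor = cluster / WRITES_PER_CREATE

print(f"per drive: {per_drive:.0f} IOPS")
print(f"cluster:   {cluster:.0f} IOPS")
print(f"4K object creates: {rados_floor:.0f} per second")
```

The unrounded results land slightly above the rounded 142 / 1700 / 425 quoted above, which is well within the precision of this kind of estimate.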

> To be really honest I wasn't so concerned about the RADOS benchmarks
> but more about the RBD fio benchmarks and the amount of IOPS that comes
> out of them, which I found a bit too low.

Sticking with 4K random writes, it looks to me like you were running
fio with libaio (which means direct, no buffer cache).  Because it
is direct, every I/O operation is really happening and the best
sustained throughput you should expect from this cluster is
the aggregate raw fio 4K write throughput (1700 IOPS) divided
by two copies = 850 random 4K writes per second.  If I read the
output correctly you got 763 or about 90% of back-of-envelope.
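
As a worked version of that estimate, here is a small Python sketch. It only restates the numbers above (1700 raw IOPS, two copies, 763 measured); the names are illustrative assumptions.

```python
def replicated_write_ceiling(raw_iops, copies):
    """Every client write lands on `copies` drives, so divide the raw budget."""
    return raw_iops / copies


RAW_CLUSTER_IOPS = 1700   # aggregate raw 4K random-write estimate
COPIES = 2                # replication level of the pool
MEASURED_IOPS = 763       # fio result for 4K random writes

ceiling = replicated_write_ceiling(RAW_CLUSTER_IOPS, COPIES)
efficiency = MEASURED_IOPS / ceiling

print(f"ceiling: {ceiling:.0f} IOPS, measured: {efficiency:.0%} of ceiling")
```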

BUT, there are some footnotes (there always are with performance):

If you had been doing buffered I/O you would have seen a lot more
(up front) benefit from page caching ... but you wouldn't have been
measuring real (and hence sustainable) I/O throughput ... which is
ultimately limited by the heads on those twelve disk drives, where
all of those writes ultimately wind up.  It is easy to be fast
if you aren't really doing the writes :-)

I would have expected write aggregation and cylinder affinity to
have eliminated some seeks and improved rotational latency resulting
in better than theoretical random write throughput.  Against those
expectations 763/850 IOPS is not so impressive.  But, it looks to
me like you were running fio in a 1G file with 100 parallel requests.
The default RBD stripe width is 4M.  This means that those 100
parallel requests were being spread across 256 (1G/4M) objects.
People in the know tell me that writes to a single object are
serialized, which means that many of those (potentially) parallel
writes were to the same object, and hence serialized.  This would
increase the average request time for the colliding operations,
and reduce the aggregate throughput correspondingly.  Use a
bigger file (or a narrower stripe) and this will get better.
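
To get a feel for how often requests collide, here is a rough Python sketch. It assumes the 100 in-flight requests pick objects uniformly and independently, which is a simplification of what fio actually does, and it uses a ten-times larger file purely as an illustration. All names are hypothetical.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def objects_in_file(file_bytes, object_bytes):
    """Number of RADOS objects backing a file of the given size."""
    return file_bytes // object_bytes


def expected_busy_objects(num_objects, in_flight):
    """Expected number of distinct objects hit by `in_flight` random requests."""
    return num_objects * (1.0 - (1.0 - 1.0 / num_objects) ** in_flight)


IN_FLIGHT = 100
small = objects_in_file(1 * GiB, 4 * MiB)    # the 1G test file
large = objects_in_file(10 * GiB, 4 * MiB)   # illustrative bigger file

for label, count in (("1G file", small), ("10G file", large)):
    busy = expected_busy_objects(count, IN_FLIGHT)
    queued = IN_FLIGHT - busy
    print(f"{label}: {count} objects, ~{busy:.0f} busy, ~{queued:.0f} requests queued")
```

Under these assumptions a noticeable share of the 100 requests is waiting behind another request to the same object with the 1G file, and almost none with the larger one.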

Thus, getting 763 random 4K write IOPS out of those 12 drives
still sounds about right to me.

> On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@inktank.com> wrote:
>
>> Dear Sebastien,
>>
>> Ross Turn forwarded me your e-mail.  You sent a great deal
>> of information, but it was not immediately obvious to me
>> what your specific concern was.
>>
>> You have 4 servers, 3 OSDs per, 2 copy, and you measured a
>> radosbench (4K object creation) throughput of 2.9MB/s
>> (or 708 IOPS).  I infer that you were disappointed by
>> this number, but it looks right to me.
>>
>> Assuming typical 7200 RPM drives, I would guess that each
>> of them would deliver a sustained direct 4K random write
>> performance in the general neighborhood of:
>>     4ms seek (short seeks with write-settle-downs)
>>     4ms latency (1/2 rotation)
>>     0ms write (4K/144MB/s ~ 30us)
>>     -----
>>     8ms or about 125 IOPS
>>
>> Your twelve drives should therefore have a sustainable
>> aggregate direct 4K random write throughput of 1500 IOPS.
>>
>> Each 4K object create involves four writes (two copies,
>> each getting one data write and one data update).  Thus
>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
>>
>> You are getting almost twice the expected raw IOPS ...
>> and we should expect that a large number of parallel
>> operations would realize some write/seek aggregation
>> benefits ... so these numbers look right to me.
>>
>> Is this the number you were concerned about, or have I
>> misunderstood?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: RBD fio Performance concerns
  2012-11-16 22:59   ` RBD fio Performance concerns Mark Kampe
@ 2012-11-19 14:56     ` Sébastien Han
  2012-11-19 15:28       ` Alexandre DERUMIER
  0 siblings, 1 reply; 51+ messages in thread
From: Sébastien Han @ 2012-11-19 14:56 UTC (permalink / raw)
  To: Mark Kampe; +Cc: ceph-devel

Hello Mark,

First of all, thank you again for another accurate answer :-).

> I would have expected write aggregation and cylinder affinity to
> have eliminated some seeks and improved rotational latency resulting
> in better than theoretical random write throughput.  Against those
> expectations 763/850 IOPS is not so impressive.  But, it looks to
> me like you were running fio in a 1G file with 100 parallel requests.
> The default RBD stripe width is 4M.  This means that those 100
> parallel requests were being spread across 256 (1G/4M) objects.
> People in the know tell me that writes to a single object are
> serialized, which means that many of those (potentially) parallel
> writes were to the same object, and hence serialized.  This would
> increase the average request time for the colliding operations,
> and reduce the aggregate throughput correspondingly.  Use a
> bigger file (or a narrower stripe) and this will get better.


I followed your advice and used a bigger file (10G) and an iodepth of
128. I was able to reach ~27k IOPS for random reads, but I could not
get past 870 IOPS for random writes, which is more or less expected.
What I still don't understand is why the sequential reads/writes are
lower than the random ones. Or should I only care about bandwidth
for those values?

Thank you.

Regards.
--
Kind regards.
Sébastien HAN.



* Re: RBD fio Performance concerns
  2012-11-19 14:56     ` Sébastien Han
@ 2012-11-19 15:28       ` Alexandre DERUMIER
  2012-11-19 15:42         ` Sébastien Han
  0 siblings, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-19 15:28 UTC (permalink / raw)
  To: Sébastien Han; +Cc: ceph-devel, Mark Kampe

>>why are the sequential reads/writes lower than the random ones? Or do I
>>just need to care about the bandwidth for those values?

If I remember correctly, you ran fio with a 4MB block size for the
sequential tests, so it is normal to see fewer I/Os but more bandwidth.
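
The relationship is simply bandwidth = IOPS x block size. A small Python sketch, using a hypothetical 100 MiB/s of sustained bandwidth (not a figure from this thread) and illustrative names:

```python
KiB = 1024
MiB = 1024 * KiB


def iops_for_bandwidth(bandwidth_bytes_per_s, block_bytes):
    """IOPS needed to move a given bandwidth at a given block size."""
    return bandwidth_bytes_per_s / block_bytes


BANDWIDTH = 100 * MiB  # hypothetical sustained throughput

small_block = iops_for_bandwidth(BANDWIDTH, 4 * KiB)
large_block = iops_for_bandwidth(BANDWIDTH, 4 * MiB)

print(f"4K blocks: {small_block:.0f} IOPS")
print(f"4M blocks: {large_block:.0f} IOPS")
```

The same bandwidth shows up as three orders of magnitude fewer IOPS with 4M blocks, so IOPS figures are only comparable between runs that use the same block size.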




* Re: RBD fio Performance concerns
  2012-11-19 15:28       ` Alexandre DERUMIER
@ 2012-11-19 15:42         ` Sébastien Han
  2012-11-19 16:44           ` Sage Weil
  2012-11-19 16:54           ` Mark Kampe
  0 siblings, 2 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-19 15:42 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Kampe

> If I remember correctly, you ran fio with a 4MB block size for the
> sequential tests, so it is normal to see fewer I/Os but more bandwidth.

That's correct for some of the benchmarks. However, even with a 4K block
size for the sequential jobs, I still get fewer IOPS. See my latest fio
run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99  iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
  read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
    slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
    clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
     lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
    bw (KB/s) : min=    0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
  cpu          : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=200473/0/0, short=0/0/0

     lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
  read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
    slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
    clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
     lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
    bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
  cpu          : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=1632349/0/0, short=0/0/0

     lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
  write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
    slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
    clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
     lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
    bw (KB/s) : min=    7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
  cpu          : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/11171/0, short=0/0/0

     lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
     lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
  write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
    slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
    clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
     lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
    bw (KB/s) : min=    0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
  cpu          : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued r/w/d: total=0/52147/0, short=0/0/0

     lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
     lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
   READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
   READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec

Run status group 2 (all jobs):
  WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec

Run status group 3 (all jobs):
  WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec

Disk stats (read/write):
  rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
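
One way to sanity-check the report above is to cross-check each job two ways: IOPS should equal bandwidth divided by block size, and, with a fixed queue depth, roughly iodepth divided by mean latency (Little's law). A minimal Python sketch with the figures copied from the output; the helper names are illustrative assumptions:

```python
IODEPTH = 256
BLOCK_KB = 4

# (job, reported IOPS, reported bandwidth in KB/s, mean total latency in ms)
JOBS = [
    ("seq-read", 3338, 13353.0, 76.67),
    ("rand-read", 27203, 108814.0, 9.408),
    ("seq-write", 183, 753502 / 1024, 1389.62),  # fio printed this one in B/s
    ("rand-write", 857, 3429.5, 298.52),
]


def iops_from_bandwidth(bw_kb_s, block_kb):
    """IOPS implied by the reported bandwidth."""
    return bw_kb_s / block_kb


def iops_from_latency(depth, lat_ms):
    """Little's law: throughput = requests in flight / time per request."""
    return depth * 1000.0 / lat_ms


for name, reported, bw_kb_s, lat_ms in JOBS:
    by_bw = iops_from_bandwidth(bw_kb_s, BLOCK_KB)
    by_lat = iops_from_latency(IODEPTH, lat_ms)
    print(f"{name:10s} reported={reported:6d} bw/bs={by_bw:8.1f} depth/lat={by_lat:8.1f}")
```

Because the queue depth is the same for all four jobs, the lower sequential IOPS correspond directly to the higher per-request latency in those jobs.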

Cheers!
--
Kind regards.
Sébastien HAN.



* Re: RBD fio Performance concerns
  2012-11-19 15:42         ` Sébastien Han
@ 2012-11-19 16:44           ` Sage Weil
  2012-11-19 16:54           ` Mark Kampe
  1 sibling, 0 replies; 51+ messages in thread
From: Sage Weil @ 2012-11-19 16:44 UTC (permalink / raw)
  To: Sébastien Han; +Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe

On Mon, 19 Nov 2012, Sébastien Han wrote:
> > If I remember correctly, you ran fio with a 4MB block size for the
> > sequential tests, so it is normal to see fewer I/Os but more bandwidth.
> 
> That's correct for some of the benchmarks. However, even with a 4K block
> size for the sequential jobs, I still get fewer IOPS. See my latest fio
> run below:

Small IOs striped over large objects tend to pile up behind a single
object at a time.  There is a new striping feature in RBD that lets you
stripe small blocks over larger objects to mitigate this, but it means
slower performance the rest of the time, and it is only really useful
for specific workloads (e.g., a database journal file/device).
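
To illustrate the effect, here is a rough Python sketch of how an image offset maps to an object under a plain layout and under a striped one. The mapping follows the usual stripe_unit / stripe_count / object_size scheme; the 64K stripe unit and stripe count of 16 are illustrative values only, not a recommendation, and the function names are hypothetical.

```python
KiB = 1024
MiB = 1024 * KiB


def object_index(offset, object_size, stripe_unit, stripe_count):
    """Map a byte offset in the image to the index of the object holding it."""
    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit          # which stripe unit overall
    stripe = block // stripe_count         # which row of stripe units
    stripe_pos = block % stripe_count      # column within the object set
    object_set = stripe // stripes_per_object
    return object_set * stripe_count + stripe_pos


def objects_touched(start, io_size, depth, layout):
    """Distinct objects hit by `depth` back-to-back sequential IOs."""
    return {object_index(start + i * io_size, **layout) for i in range(depth)}


# Plain layout: one 4M stripe unit per 4M object.
DEFAULT_LAYOUT = dict(object_size=4 * MiB, stripe_unit=4 * MiB, stripe_count=1)
# Illustrative striped layout: 64K units spread over 16 objects.
STRIPED_LAYOUT = dict(object_size=4 * MiB, stripe_unit=64 * KiB, stripe_count=16)

for name, layout in (("default", DEFAULT_LAYOUT), ("striped", STRIPED_LAYOUT)):
    hit = objects_touched(0, 4 * KiB, 256, layout)
    print(f"{name}: 256 sequential 4K IOs touch {len(hit)} object(s)")
```

With the plain layout, a window of 256 sequential 4K requests falls inside a single object, so the requests queue behind one another; with the striped layout the same window is spread over several objects.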

sage

> 
> # fio rbd-bench.fio
> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> fio 1.59
> Starting 4 processes
> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99  iops] [eta 02m:59s]
> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>   read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>     slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>     clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>      lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>     bw (KB/s) : min=    0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
>   cpu          : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued r/w/d: total=200473/0/0, short=0/0/0
> 
>      lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>   read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>     slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>     clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>      lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>     bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
>   cpu          : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued r/w/d: total=1632349/0/0, short=0/0/0
> 
>      lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>   write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>     slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>     clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>      lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>     bw (KB/s) : min=    7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
>   cpu          : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued r/w/d: total=0/11171/0, short=0/0/0
> 
>      lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>      lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>   write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>     slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>     clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>      lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>     bw (KB/s) : min=    0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
>   cpu          : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued r/w/d: total=0/52147/0, short=0/0/0
> 
>      lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>      lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
> 
> Run status group 0 (all jobs):
>    READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
> mint=60053msec, maxt=60053msec
> 
> Run status group 1 (all jobs):
>    READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
> maxb=111425KB/s, mint=60005msec, maxt=60005msec
> 
> Run status group 2 (all jobs):
>   WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
> mint=60725msec, maxt=60725msec
> 
> Run status group 3 (all jobs):
>   WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
> mint=60822msec, maxt=60822msec
> 
> Disk stats (read/write):
>   rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
> in_queue=33434120, util=99.79%
> 
> Cheers!
> --
> Kind regards.
> Sébastien HAN.
> 
> 
> On Mon, Nov 19, 2012 at 4:28 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
> >>>why are the
> >>>sequential reads/writes lower than the random ones? Or maybe do I
> >>>just need to care about the bandwidth for those values?
> >
> > If I remember correctly, you used fio with a 4MB block size for sequential.
> > So it's normal that you have fewer I/Os, but more bandwidth.
> >
> >
> >
> > ----- Original Message -----
> >
> > From: "Sébastien Han" <han.sebastien@gmail.com>
> > To: "Mark Kampe" <mark.kampe@inktank.com>
> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> > Sent: Monday, 19 November 2012 15:56:35
> > Subject: Re: RBD fio Performance concerns
> >
> > Hello Mark,
> >
> > First of all, thank you again for another accurate answer :-).
> >
> >> I would have expected write aggregation and cylinder affinity to
> >> have eliminated some seeks and improved rotational latency resulting
> >> in better than theoretical random write throughput. Against those
> >> expectations 763/850 IOPS is not so impressive. But, it looks to
> >> me like you were running fio in a 1G file with 100 parallel requests.
> >> The default RBD stripe width is 4M. This means that those 100
> >> parallel requests were being spread across 256 (1G/4M) objects.
> >> People in the know tell me that writes to a single object are
> >> serialized, which means that many of those (potentially) parallel
> >> writes were to the same object, and hence serialized. This would
> >> increase the average request time for the colliding operations,
> >> and reduce the aggregate throughput correspondingly. Use a
> >> bigger file (or a narrower stripe) and this will get better.
> >
> >
> > I followed your advice and used a bigger file (10G) and an iodepth of
> > 128, and I've been able to reach ~27k IOPS for random reads, but I
> > couldn't reach more than 870 IOPS for random writes... That's kind of
> > expected. But the thing I still don't understand is: why are the
> > sequential reads/writes lower than the random ones? Or maybe do I
> > just need to care about the bandwidth for those values?
> >
> > Thank you.
> >
> > Regards.
> > --
> > Kind regards.
> > Sébastien HAN.
> >
> >
> > On Fri, Nov 16, 2012 at 11:59 PM, Mark Kampe <mark.kampe@inktank.com> wrote:
> >> On 11/15/2012 12:23 PM, Sébastien Han wrote:
> >>
> >>> First of all, I would like to thank you for this well explained,
> >>> structured and clear answer. I guess I got better IOPS thanks to the 10K
> >>> disks.
> >>
> >>
> >> 10K RPM would bring your per-drive throughput (for 4K random writes)
> >> up to 142 IOPS and your aggregate cluster throughput up to 1700.
> >> This would predict a corresponding RADOSbench throughput somewhere
> >> above 425 (how much better depending on write aggregation and cylinder
> >> affinity). Your RADOSbench 708 now seems even more reasonable.
> >>
> >>> To be really honest I wasn't so concerned about the RADOS benchmarks
> >>> but more about the RBD fio benchmarks and the amount of IOPS that comes
> >>> out of it, which I found a bit too low.
> >>
> >>
> >> Sticking with 4K random writes, it looks to me like you were running
> >> fio with libaio (which means direct, no buffer cache). Because it
> >> is direct, every I/O operation is really happening and the best
> >> sustained throughput you should expect from this cluster is
> >> the aggregate raw fio 4K write throughput (1700 IOPS) divided
> >> by two copies = 850 random 4K writes per second. If I read the
> >> output correctly you got 763 or about 90% of back-of-envelope.
> >>
> >> BUT, there are some footnotes (there always are with performance)
> >>
> >> If you had been doing buffered I/O you would have seen a lot more
> >> (up front) benefit from page caching ... but you wouldn't have been
> >> measuring real (and hence sustainable) I/O throughput ... which is
> >> ultimately limited by the heads on those twelve disk drives, where
> >> all of those writes ultimately wind up. It is easy to be fast
> >> if you aren't really doing the writes :-)
> >>
> >> I would have expected write aggregation and cylinder affinity to
> >> have eliminated some seeks and improved rotational latency resulting
> >> in better than theoretical random write throughput. Against those
> >> expectations 763/850 IOPS is not so impressive. But, it looks to
> >> me like you were running fio in a 1G file with 100 parallel requests.
> >> The default RBD stripe width is 4M. This means that those 100
> >> parallel requests were being spread across 256 (1G/4M) objects.
> >> People in the know tell me that writes to a single object are
> >> serialized, which means that many of those (potentially) parallel
> >> writes were to the same object, and hence serialized. This would
> >> increase the average request time for the colliding operations,
> >> and reduce the aggregate throughput correspondingly. Use a
> >> bigger file (or a narrower stripe) and this will get better.
> >>
> >> Thus, getting 763 random 4K write IOPs out of those 12 drives
> >> still sounds about right to me.
> >>
> >>
> >>> On 15 nov. 2012, at 19:43, Mark Kampe <mark.kampe@inktank.com> wrote:
> >>>
> >>>> Dear Sebastien,
> >>>>
> >>>> Ross Turn forwarded me your e-mail. You sent a great deal
> >>>> of information, but it was not immediately obvious to me
> >>>> what your specific concern was.
> >>>>
> >>>> You have 4 servers, 3 OSDs per, 2 copy, and you measured a
> >>>> radosbench (4K object creation) throughput of 2.9MB/s
> >>>> (or 708 IOPS). I infer that you were disappointed by
> >>>> this number, but it looks right to me.
> >>>>
> >>>> Assuming typical 7200 RPM drives, I would guess that each
> >>>> of them would deliver a sustained direct 4K random write
> >>>> performance in the general neighborhood of:
> >>>> 4ms seek (short seeks with write-settle-downs)
> >>>> 4ms latency (1/2 rotation)
> >>>> 0ms write (4K/144MB/s ~ 30us)
> >>>> -----
> >>>> 8ms or about 125 IOPS
> >>>>
> >>>> Your twelve drives should therefore have a sustainable
> >>>> aggregate direct 4K random write throughput of 1500 IOPS.
> >>>>
> >>>> Each 4K object create involves four writes (two copies,
> >>>> each getting one data write and one data update). Thus
> >>>> I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
> >>>>
> >>>> You are getting almost twice the expected raw IOPS ...
> >>>> and we should expect that a large number of parallel
> >>>> operations would realize some write/seek aggregation
> >>>> benefits ... so these numbers look right to me.
> >>>>
> >>>> Is this the number you were concerned about, or have I
> >>>> misunderstood?
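
The back-of-envelope estimate quoted above can be reproduced with a few
lines of Python. This is only a sketch of that arithmetic: the 4 ms
seek, the 144 MB/s media rate, the twelve drives and the write
multipliers are taken from the quoted messages, while applying the same
4 ms seek to 10K RPM drives is an assumption made here.

```python
# Sketch of the per-drive and cluster-wide 4K random write estimate.

def drive_iops(rpm, seek_ms=4.0, io_bytes=4096, media_rate_bytes_per_s=144e6):
    """Estimate sustained 4K random write IOPS for one rotating drive."""
    half_rotation_ms = 0.5 * 60_000.0 / rpm  # average rotational latency
    transfer_ms = io_bytes / media_rate_bytes_per_s * 1000.0
    return 1000.0 / (seek_ms + half_rotation_ms + transfer_ms)


def cluster_iops(rpm, drives=12, writes_per_op=4):
    """Aggregate drive IOPS divided by the disk writes each operation costs."""
    return drives * drive_iops(rpm) / writes_per_op


for rpm in (7200, 10000):
    per_drive = drive_iops(rpm)
    creates = cluster_iops(rpm, writes_per_op=4)      # 4K object create
    rbd_writes = cluster_iops(rpm, writes_per_op=2)   # two copies per write
    print(f"{rpm} RPM: drive={per_drive:.0f} "
          f"create={creates:.0f} rbd_write={rbd_writes:.0f}")
```

The results land close to the rounded figures in the quoted messages
(about 125 and 375 for 7200 RPM, about 142, 425 and 850 for 10K RPM);
the small differences come from rounding the half rotation to 4 ms.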
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-19 15:42         ` Sébastien Han
  2012-11-19 16:44           ` Sage Weil
@ 2012-11-19 16:54           ` Mark Kampe
  2012-11-19 18:03             ` Sébastien Han
  1 sibling, 1 reply; 51+ messages in thread
From: Mark Kampe @ 2012-11-19 16:54 UTC (permalink / raw)
  To: Sébastien Han; +Cc: Alexandre DERUMIER, ceph-devel

Recall:
    1. RBD volumes are striped (4M wide) across RADOS objects
    2. distinct writes to a single RADOS object are serialized
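
A minimal sketch of what these two points imply for a window of
in-flight 4K requests, written in Python. The 4M object size, the 4K
block size and the depth of 256 come from this thread; the 10G image
size is taken from an earlier message and is only assumed to apply to
this run, and the fixed random seed is there purely for repeatability.

```python
import random

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

OBJECT_SIZE = 4 * MIB   # stripe width from point 1
BLOCK_SIZE = 4 * KIB    # fio bs=4K
IODEPTH = 256           # fio iodepth=256
IMAGE_SIZE = 10 * GIB   # assumed size of the test file


def distinct_objects(offsets):
    """Count how many different RADOS objects the given offsets fall into."""
    return len({off // OBJECT_SIZE for off in offsets})


# One full queue of sequential requests starting at offset 0.
sequential = [i * BLOCK_SIZE for i in range(IODEPTH)]

# One full queue of uniformly random, block-aligned requests.
rng = random.Random(0)
blocks_in_image = IMAGE_SIZE // BLOCK_SIZE
random_window = [rng.randrange(blocks_in_image) * BLOCK_SIZE
                 for _ in range(IODEPTH)]

seq_objects = distinct_objects(sequential)
rand_objects = distinct_objects(random_window)
print("sequential window:", seq_objects, "object(s)")
print("random window:    ", rand_objects, "object(s)")
```

The sequential window falls into one object, so its writes queue behind
each other, while the random window is spread over a couple of hundred
objects that can be written in parallel.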

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object.  All of
your writes are waiting through a very long line, which is adding
horrendous latency.
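
The link between that latency and the reported IOPS can be cross-checked
with Little's law (requests in flight = throughput x mean latency). The
Python sketch below assumes the queue stays full at 256 for the whole
run; the latency and IOPS figures are copied from the fio summary posted
in this thread.

```python
# Little's law: IOPS is roughly iodepth / mean completion latency.
IODEPTH = 256

# (job name, mean total latency in seconds, IOPS reported by fio)
jobs = [
    ("seq-read", 76.67e-3, 3338),
    ("rand-read", 9408.00e-6, 27203),
    ("seq-write", 1389.62e-3, 183),
    ("rand-write", 298.52e-3, 857),
]


def predicted_iops(iodepth, mean_latency_s):
    """Throughput implied by a full queue and a given mean latency."""
    return iodepth / mean_latency_s


for name, latency, reported in jobs:
    predicted = predicted_iops(IODEPTH, latency)
    print(f"{name:10s} predicted={predicted:8.0f} reported={reported}")
```

Under that assumption the predicted and reported values agree to within
about one percent for all four jobs, so the low sequential write IOPS
follows directly from the roughly 1.4 s mean latency of the queued
writes.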

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize).  Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-19 16:54           ` Mark Kampe
@ 2012-11-19 18:03             ` Sébastien Han
  2012-11-19 19:11               ` Alexandre DERUMIER
       [not found]               ` <50AA763A.1050709@inktank.com>
  0 siblings, 2 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-19 18:03 UTC (permalink / raw)
  To: Mark Kampe; +Cc: Alexandre DERUMIER, ceph-devel

@Sage, thanks for the info :)
@Mark:

> If you want to do sequential I/O, you should do it buffered
> (so that the writes can be aggregated) or with a 4M block size
> (very efficient and avoiding object serialization).

The original benchmark was performed with a 4M block size. And as
you can see I still get more IOPS with rand than seq... I just tried
with 4M without direct I/O, and it is still the same. I can post the fio
results if needed.

> We do direct writes for benchmarking, not because it is a reasonable
> way to do I/O, but because it bypasses the buffer cache and enables
> us to directly measure cluster I/O throughput (which is what we are
> trying to optimize).  Applications should usually do buffered I/O,
> to get the (very significant) benefits of caching and write aggregation.

I know why I use direct I/O. These are synthetic benchmarks, far
from a real-life scenario and from how common applications work. I am
just trying to see the maximum I/O throughput that I can get from my RBD.
All my applications use buffered I/O.

@Alexandre: is it the same for you? Or do you always get more IOPS with seq?

Thanks to all of you.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-19 18:03             ` Sébastien Han
@ 2012-11-19 19:11               ` Alexandre DERUMIER
  2012-11-19 20:57                 ` Sébastien Han
       [not found]               ` <50AA763A.1050709@inktank.com>
  1 sibling, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-19 19:11 UTC (permalink / raw)
  To: Sébastien Han; +Cc: ceph-devel, Mark Kampe

>>@Alexandre: is it the same for you? or do you always get more IOPS with seq?

rand read 4K  : 6000 iops
seq read 4K   : 3500 iops
seq read 4M   : 31 iops (1 gigabit client bandwidth limit)

rand write 4K : 6000 iops (tmpfs journal)
seq write 4K  : 1600 iops
seq write 4M  : 31 iops (1 gigabit client bandwidth limit)
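
As a rough check of the bandwidth limit on the 4M results, the ceiling
for a 1 gigabit link can be computed directly. This Python sketch
assumes that 4M means 4 MiB and ignores protocol overhead, so it is only
a ballpark figure.

```python
# Upper bound on 4M IOPS through a 1 gigabit client link.
LINK_BITS_PER_S = 1_000_000_000  # 1 gigabit per second
BLOCK_BYTES = 4 * 1024 * 1024    # 4M block, assumed to mean 4 MiB


def max_iops(link_bits_per_s, block_bytes):
    """IOPS ceiling imposed by link bandwidth for a given block size."""
    return link_bits_per_s / 8 / block_bytes


ceiling = max_iops(LINK_BITS_PER_S, BLOCK_BYTES)
print(f"about {ceiling:.1f} IOPS at 4M over 1 Gbit/s")
```

The ceiling comes out at roughly 30 IOPS, in the same ballpark as the
31 iops measured for both 4M tests, which supports reading those two
results as network-bound rather than cluster-bound.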


I really don't understand why I can't get more rand read iops with 4K blocks ...

I tried with a high-end CPU for the client; it doesn't change anything.
But the test cluster uses old 8-core E5420 @ 2.50GHz CPUs (though CPU usage is around 15% on the cluster during the read bench).



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-19 19:11               ` Alexandre DERUMIER
@ 2012-11-19 20:57                 ` Sébastien Han
  2012-11-20  7:32                   ` Alexandre DERUMIER
  2012-11-21 15:52                   ` Mark Nelson
  0 siblings, 2 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-19 20:57 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Kampe

Which iodepth did you use for those benchmarks?


> I really don't understand why I can't get more rand read iops with 4K block ...

Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Best regards.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>@Alexandre: is it the same for you? or do you always get more IOPS with seq?
>
> rand read 4K : 6000 iops
> seq read 4K : 3500 iops
> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>
> rand write 4K : 6000 iops (tmpfs journal)
> seq write 4K : 1600 iops
> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>
>
> I really don't understand why I can't get more rand read iops with 4K block ...
>
> I tried with a high-end CPU for the client; it doesn't change anything.
> But the test cluster uses old 8-core E5420 CPUs @ 2.50GHz (though CPU usage is around 15% on the cluster during the read bench)
>
>
> ----- Original Message -----
>
> From: "Sébastien Han" <han.sebastien@gmail.com>
> To: "Mark Kampe" <mark.kampe@inktank.com>
> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday, November 19, 2012 19:03:40
> Subject: Re: RBD fio Performance concerns
>
> @Sage, thanks for the info :)
> @Mark:
>
>> If you want to do sequential I/O, you should do it buffered
>> (so that the writes can be aggregated) or with a 4M block size
>> (very efficient and avoiding object serialization).
>
> The original benchmark has been performed with 4M block size. And as
> you can see I still get more IOPS with rand than seq... I just tried
> with 4M without direct I/O, still the same. I can print fio results if
> it's needed.
>
>> We do direct writes for benchmarking, not because it is a reasonable
>> way to do I/O, but because it bypasses the buffer cache and enables
>> us to directly measure cluster I/O throughput (which is what we are
>> trying to optimize). Applications should usually do buffered I/O,
>> to get the (very significant) benefits of caching and write aggregation.
>
> I know why I use direct I/O. These are synthetic benchmarks, far from
> a real-life scenario and from how common applications work. I am just
> trying to see the maximum I/O throughput that I can get from my RBD. All
> my applications use buffered I/O.
>
> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>
> Thanks to all of you..
>
>
> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com> wrote:
>> Recall:
>> 1. RBD volumes are striped (4M wide) across RADOS objects
>> 2. distinct writes to a single RADOS object are serialized
>>
>> Your sequential 4K writes are direct, depth=256, so there are
>> (at all times) 256 writes queued to the same object. All of
>> your writes are waiting through a very long line, which is adding
>> horrendous latency.
>>
>> If you want to do sequential I/O, you should do it buffered
>> (so that the writes can be aggregated) or with a 4M block size
>> (very efficient and avoiding object serialization).
>>
>> We do direct writes for benchmarking, not because it is a reasonable
>> way to do I/O, but because it bypasses the buffer cache and enables
>> us to directly measure cluster I/O throughput (which is what we are
>> trying to optimize). Applications should usually do buffered I/O,
>> to get the (very significant) benefits of caching and write aggregation.
>>
>>
>>> That's correct for some of the benchmarks. However even with 4K for
>>> seq, I still get less IOPS. See below my last fio:
>>>
>>> # fio rbd-bench.fio
>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>>> iodepth=256
>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>>> iodepth=256
>>> fio 1.59
>>> Starting 4 processes
>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>>> 02m:59s]
>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
>>> stdev=6239.06
>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>> >=64=100.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.1%
>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>
>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
>>> stdev=648.62
>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>> >=64=100.0%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.1%
>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>
>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65,
>>> stdev=353.97
>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%,
>>> >=64=99.4%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.1%
>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>
>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21,
>>> stdev=2000.45
>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>> >=64=99.9%
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.0%
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> >=64=0.1%
>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>
>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>
>>> Run status group 0 (all jobs):
>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>>> mint=60053msec, maxt=60053msec
>>>
>>> Run status group 1 (all jobs):
>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>
>>> Run status group 2 (all jobs):
>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>>> mint=60725msec, maxt=60725msec
>>>
>>> Run status group 3 (all jobs):
>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>>> mint=60822msec, maxt=60822msec
>>>
>>> Disk stats (read/write):
>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>> in_queue=33434120, util=99.79%
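
Mark's two points quoted above (4M striping, serialized writes per object) can be illustrated with a small sketch. It assumes a plain layout where byte offset N of the image lives in object N // 4M, and it uses a hypothetical 10G image size; the names are illustrative and are not Ceph APIs. It counts how many distinct objects a window of 256 in-flight 4K writes touches:

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024   # 4M stripe unit, as described by Mark
BLOCK_SIZE = 4 * 1024           # 4K writes
QUEUE_DEPTH = 256               # iodepth used in the quoted fio run
IMAGE_SIZE = 10 * 1024 ** 3     # hypothetical 10G image (assumption)


def object_index(offset):
    """Return the index of the object backing a given byte offset."""
    return offset // OBJECT_SIZE


def distinct_objects(offsets):
    """Count how many different objects a set of in-flight writes touches."""
    return len({object_index(off) for off in offsets})


# Sequential: 256 consecutive 4K writes starting at offset 0.
seq_offsets = [i * BLOCK_SIZE for i in range(QUEUE_DEPTH)]

# Random: 256 block-aligned offsets drawn uniformly from the image.
rng = random.Random(0)
blocks = IMAGE_SIZE // BLOCK_SIZE
rand_offsets = [rng.randrange(blocks) * BLOCK_SIZE for _ in range(QUEUE_DEPTH)]

seq_objects = distinct_objects(seq_offsets)
rand_objects = distinct_objects(rand_offsets)
print("sequential window touches", seq_objects, "object(s)")
print("random window touches", rand_objects, "object(s)")
```

Under these assumptions the sequential window covers 1M of contiguous space, so it falls within a single object (two at most when it straddles a 4M boundary), while the random window spreads over many objects. That would be consistent with the serialization explanation for 4K sequential writes at depth 256.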

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
       [not found]               ` <50AA763A.1050709@inktank.com>
@ 2012-11-19 21:01                 ` Sébastien Han
  0 siblings, 0 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-19 21:01 UTC (permalink / raw)
  To: Mark Kampe; +Cc: ceph-devel

Hello Mark,

See my benchmark results below:

-RADOS Bench with 4M block size write:

# rados -p bench bench 300 write -t 32 --no-cleanup
Maintaining 32 concurrent writes of 4194304 bytes for at least 300 seconds.

2012-11-19 21:35:01.722143min lat: 0.255396 max lat: 8.40212 avg lat: 1.14076
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   300      32      8414      8382   111.737       104  0.502774   1.14076
 Total time run:         300.814954
Total writes made:      8414
Write size:             4194304
Bandwidth (MB/sec):     111.883

Stddev Bandwidth:       7.4274
Max bandwidth (MB/sec): 132
Min bandwidth (MB/sec): 56
Average Latency:        1.14352
Stddev Latency:         1.18344
Max latency:            8.40212
Min latency:            0.255396



-RADOS Bench with 4M block size seq:

# rados -p bench bench 300 seq -t 32 --no-cleanup

2012-11-19 21:40:35.128728min lat: 0.224415 max lat: 6.14781 avg lat: 1.1591
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   300      31      8284      8253   110.021       108   1.87698    1.1591
 Total time run:        300.931287
Total reads made:     8285
Read size:            4194304
Bandwidth (MB/sec):    110.125

Average Latency:       1.16177
Max latency:           6.14781
Min latency:           0.224415
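
As a sanity check on the two rados bench summaries above, the reported bandwidth can be recomputed from the operation count and the run time. This is only arithmetic on the numbers printed above. It assumes rados bench counts 1 MB as 1048576 bytes, and the conversion to Mbit/s is added here only to compare against a 1 gigabit link, which is an assumption about the client network:

```python
OP_BYTES = 4194304            # "Write size" / "Read size" from the output above
MB = 1024 * 1024              # assumed definition of MB in rados bench


def bandwidth_mb_s(ops, seconds):
    """Average bandwidth in MB/s for `ops` operations of OP_BYTES each."""
    return ops * OP_BYTES / MB / seconds


def to_mbit_s(mb_per_s):
    """Convert MB/s (1048576-byte MB) to decimal megabits per second."""
    return mb_per_s * MB * 8 / 1e6


write_bw = bandwidth_mb_s(8414, 300.814954)   # total writes, total time run
read_bw = bandwidth_mb_s(8285, 300.931287)    # total reads, total time run

print(f"write: {write_bw:.3f} MB/s, about {to_mbit_s(write_bw):.0f} Mbit/s")
print(f"read:  {read_bw:.3f} MB/s, about {to_mbit_s(read_bw):.0f} Mbit/s")
```

Both values should land close to the "Bandwidth (MB/sec)" lines reported above. If the client really is on a 1 gigabit link, figures in this range would point to the network rather than the disks as the limit for 4M operations.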


-RBD fio test: as you recommended, I used a 4M block size for the seq
tests in the first run. The fio configuration file used is below:

[global]
ioengine=libaio
iodepth=4
size=1G
runtime=60
filename=/dev/rbd1

[seq-read]
rw=read
bs=4M
stonewall
direct=1

[rand-read]
rw=randread
bs=4K
stonewall
direct=1

[seq-write]
rw=write
bs=4M
stonewall
direct=1

[rand-write]
rw=randwrite
bs=4K
stonewall
direct=1


Results iodepth 4 and 1G file:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=4
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
seq-write: (g=2): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=4
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=4
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [64.2% done] [0K/2588K /s] [0 /632  iops] [eta 01m:18s]
seq-read: (groupid=0, jobs=1): err= 0: pid=10586
  read : io=1024.0MB, bw=110656KB/s, iops=27 , runt=  9476msec
    slat (usec): min=250 , max=1812 , avg=389.88, stdev=178.26
    clat (msec): min=37 , max=615 , avg=147.42, stdev=102.77
     lat (msec): min=38 , max=615 , avg=147.81, stdev=102.77
    bw (KB/s) : min=84216, max=122390, per=99.60%, avg=110208.50, stdev=9149.98
  cpu          : usr=0.00%, sys=0.97%, ctx=1552, majf=0, minf=4119
  IO depths    : 1=0.4%, 2=0.8%, 4=98.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=256/0/0, short=0/0/0

     lat (msec): 50=4.69%, 100=31.64%, 250=50.78%, 500=11.72%, 750=1.17%
rand-read: (groupid=1, jobs=1): err= 0: pid=10868
  read : io=161972KB, bw=2697.1KB/s, iops=674 , runt= 60036msec
    slat (usec): min=12 , max=346 , avg=39.89, stdev=10.04
    clat (usec): min=570 , max=50215 , avg=5885.64, stdev=12119.46
     lat (usec): min=601 , max=50258 , avg=5926.07, stdev=12117.44
    bw (KB/s) : min= 2015, max= 3356, per=100.15%, avg=2701.03, stdev=276.41
  cpu          : usr=0.51%, sys=2.14%, ctx=66054, majf=0, minf=26
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=40493/0/0, short=0/0/0
     lat (usec): 750=3.69%, 1000=60.21%
     lat (msec): 2=19.37%, 4=1.49%, 10=1.30%, 20=0.30%, 50=13.64%
     lat (msec): 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=12619
  write: io=1024.0MB, bw=112412KB/s, iops=27 , runt=  9328msec
    slat (usec): min=510 , max=1683 , avg=820.63, stdev=150.32
    clat (msec): min=47 , max=744 , avg=144.21, stdev=73.99
     lat (msec): min=48 , max=744 , avg=145.03, stdev=74.00
    bw (KB/s) : min=103193, max=124830, per=100.87%, avg=113390.71,
stdev=6178.93
  cpu          : usr=1.46%, sys=0.81%, ctx=267, majf=0, minf=21
  IO depths    : 1=0.4%, 2=0.8%, 4=98.8%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/256/0, short=0/0/0

     lat (msec): 50=0.78%, 100=17.97%, 250=75.39%, 500=5.08%, 750=0.78%
rand-write: (groupid=3, jobs=1): err= 0: pid=12934
  write: io=125352KB, bw=2088.1KB/s, iops=522 , runt= 60007msec
    slat (usec): min=13 , max=388 , avg=50.47, stdev=13.73
    clat (msec): min=1 , max=1271 , avg= 7.60, stdev=22.16
     lat (msec): min=1 , max=1271 , avg= 7.66, stdev=22.16
    bw (KB/s) : min=  155, max= 2944, per=102.13%, avg=2132.45, stdev=563.22
  cpu          : usr=0.45%, sys=1.87%, ctx=51594, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/31338/0, short=0/0/0

     lat (msec): 2=5.84%, 4=59.28%, 10=12.47%, 20=15.83%, 50=5.72%
     lat (msec): 100=0.30%, 250=0.44%, 500=0.07%, 750=0.02%, 1000=0.02%
     lat (msec): 2000=0.01%

Run status group 0 (all jobs):
   READ: io=1024.0MB, aggrb=110655KB/s, minb=113311KB/s,
maxb=113311KB/s, mint=9476msec, maxt=9476msec

Run status group 1 (all jobs):
   READ: io=161972KB, aggrb=2697KB/s, minb=2762KB/s, maxb=2762KB/s,
mint=60036msec, maxt=60036msec

Run status group 2 (all jobs):
  WRITE: io=1024.0MB, aggrb=112411KB/s, minb=115109KB/s,
maxb=115109KB/s, mint=9328msec, maxt=9328msec

Run status group 3 (all jobs):
  WRITE: io=125352KB, aggrb=2088KB/s, minb=2139KB/s, maxb=2139KB/s,
mint=60007msec, maxt=60007msec

Disk stats (read/write):
  rbd1: ios=42707/33325, merge=0/0, ticks=439568/438892,
in_queue=878724, util=99.57%

With an iodepth of 64 and 10G file:

seq-read: (g=0): rw=read, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
seq-write: (g=2): rw=write, bs=4M-4M/4M-4M, ioengine=libaio, iodepth=64
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [58.1% done] [0K/0K /s] [0 /0  iops] [eta 02m:57s]
seq-read: (groupid=0, jobs=1): err= 0: pid=25257
  read : io=6564.0MB, bw=110816KB/s, iops=27 , runt= 60655msec
    slat (usec): min=204 , max=287661 , avg=36605.14, stdev=63984.12
    clat (msec): min=573 , max=5910 , avg=2305.13, stdev=715.03
     lat (msec): min=712 , max=5938 , avg=2341.74, stdev=716.44
    bw (KB/s) : min=    0, max=116819, per=61.34%, avg=67975.54, stdev=54174.75
  cpu          : usr=0.00%, sys=1.08%, ctx=10644, majf=0, minf=65559
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.5%, 16=1.0%, 32=2.0%, >=64=96.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=1641/0/0, short=0/0/0

     lat (msec): 750=0.30%, 1000=0.37%, 2000=41.13%, >=2000=58.20%
rand-read: (groupid=1, jobs=1): err= 0: pid=27045
  read : io=6170.6MB, bw=105242KB/s, iops=26310 , runt= 60039msec
    slat (usec): min=9 , max=2456 , avg=26.68, stdev= 9.23
    clat (usec): min=501 , max=42630 , avg=2403.11, stdev=1136.11
     lat (usec): min=544 , max=42654 , avg=2430.20, stdev=1135.87
    bw (KB/s) : min=    0, max=107376, per=65.94%, avg=69395.09, stdev=50034.68
  cpu          : usr=9.62%, sys=53.77%, ctx=1804080, majf=0, minf=86
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=1579662/0/0, short=0/0/0
     lat (usec): 750=0.34%, 1000=2.07%
     lat (msec): 2=38.96%, 4=51.17%, 10=7.37%, 20=0.08%, 50=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=28845
  write: io=6776.0MB, bw=114538KB/s, iops=27 , runt= 60579msec
    slat (usec): min=419 , max=237721 , avg=35415.33, stdev=60635.70
    clat (msec): min=572 , max=6468 , avg=2229.49, stdev=935.01
     lat (msec): min=695 , max=6469 , avg=2264.91, stdev=931.13
    bw (KB/s) : min=    0, max=136533, per=61.73%, avg=70705.08, stdev=56037.47
  cpu          : usr=1.96%, sys=0.75%, ctx=623, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.5%, 16=0.9%, 32=1.9%, >=64=96.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/1694/0, short=0/0/0

     lat (msec): 750=0.30%, 1000=0.47%, 2000=63.64%, >=2000=35.60%
rand-write: (groupid=3, jobs=1): err= 0: pid=30722
  write: io=203724KB, bw=3250.5KB/s, iops=812 , runt= 62675msec
    slat (usec): min=12 , max=589 , avg=50.66, stdev=12.44
    clat (msec): min=1 , max=3603 , avg=78.65, stdev=242.01
     lat (msec): min=1 , max=3603 , avg=78.70, stdev=242.01
    bw (KB/s) : min=    0, max= 7001, per=70.93%, avg=2305.36, stdev=2413.85
  cpu          : usr=0.59%, sys=2.66%, ctx=81900, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/50931/0, short=0/0/0

     lat (msec): 2=9.94%, 4=34.46%, 10=6.95%, 20=11.34%, 50=14.86%
     lat (msec): 100=7.69%, 250=7.06%, 500=3.90%, 750=1.59%, 1000=0.73%
     lat (msec): 2000=1.15%, >=2000=0.33%

Run status group 0 (all jobs):
   READ: io=6564.0MB, aggrb=110815KB/s, minb=113475KB/s,
maxb=113475KB/s, mint=60655msec, maxt=60655msec

Run status group 1 (all jobs):
   READ: io=6170.6MB, aggrb=105242KB/s, minb=107768KB/s,
maxb=107768KB/s, mint=60039msec, maxt=60039msec

Run status group 2 (all jobs):
  WRITE: io=6776.0MB, aggrb=114538KB/s, minb=117287KB/s,
maxb=117287KB/s, mint=60579msec, maxt=60579msec

Run status group 3 (all jobs):
  WRITE: io=203724KB, aggrb=3250KB/s, minb=3328KB/s, maxb=3328KB/s,
mint=62675msec, maxt=62675msec

Disk stats (read/write):
  rbd1: ios=1592951/64482, merge=0/0, ticks=12415028/12528984,
in_queue=24945216, util=99.68%
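
The fio numbers above can be cross-checked with Little's law (throughput is roughly requests in flight divided by average latency). The sketch below is only arithmetic on the values reported in the two runs above, and it assumes the queue stays full at the configured iodepth:

```python
def predicted_iops(iodepth, avg_lat_s):
    """Little's law: throughput = requests in flight / average latency."""
    return iodepth / avg_lat_s


def relative_error(predicted, reported):
    """Relative difference between the predicted and reported IOPS."""
    return abs(predicted - reported) / reported


# (job, iodepth, avg "lat" converted to seconds, iops reported by fio)
RUNS = [
    ("seq-read   4M qd4", 4, 147.81e-3, 27),
    ("rand-read  4K qd4", 4, 5926.07e-6, 674),
    ("seq-write  4M qd4", 4, 145.03e-3, 27),
    ("rand-write 4K qd4", 4, 7.66e-3, 522),
    ("seq-read   4M qd64", 64, 2341.74e-3, 27),
    ("rand-read  4K qd64", 64, 2430.20e-6, 26310),
    ("rand-write 4K qd64", 64, 78.70e-3, 812),
]

for job, depth, lat, reported in RUNS:
    est = predicted_iops(depth, lat)
    err = relative_error(est, reported)
    print(f"{job}: predicted {est:.1f}, reported {reported}, error {err:.1%}")
```

If the predictions track the reported IOPS, then the 4M jobs going from iodepth 4 to 64 with unchanged IOPS (27) and much higher latency would suggest those jobs are bandwidth-bound: the extra queued requests only wait longer.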


Thank you in advance.


On Mon, Nov 19, 2012 at 7:11 PM, Mark Kampe <mark.kampe@inktank.com> wrote:
>
>
> On 11/19/2012 10:03 AM, Sébastien Han wrote:
>
>> The original benchmark has been performed with 4M block size. And as
>> you can see I still get more IOPS with rand than seq... I just tried
>> with 4M without direct I/O, still the same. I can print fio results if
>> it's needed.
>
>
> Yes, please send me your 4M random and sequential write results
> both radosbench (or better smalliobench, which is more directly
> comparable) and fio to an RBD.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-19 20:57                 ` Sébastien Han
@ 2012-11-20  7:32                   ` Alexandre DERUMIER
  2012-11-20 10:37                     ` Sébastien Han
  2012-11-21 15:52                   ` Mark Nelson
  1 sibling, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-20  7:32 UTC (permalink / raw)
  To: Sébastien Han; +Cc: ceph-devel, Mark Kampe

>>Which iodepth did you use for those benchmarks?

iodepth = 100

filesize = 1G, 10G, 30G  , same result

(3 nodes, 8 cores at 2.5GHz, 32GB RAM, 6 OSDs each (15k drives) + journal on tmpfs)


Note that I can't get more than 6000 iops on a single rbd device, but it scales with more devices (each fio is at 6000 iops).

(I get the same result with the rbd module or with a KVM guest)



----- Original Message -----

From: "Sébastien Han" <han.sebastien@gmail.com>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>
Sent: Monday, November 19, 2012 21:57:59
Subject: Re: RBD fio Performance concerns

Which iodepth did you use for those benchmarks?


> I really don't understand why I can't get more rand read iops with 4K block ... 

Me neither, hope to get some clarification from the Inktank guys. It 
doesn't make any sense to me... 
-- 
Best regards.
Sébastien HAN. 


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote: 
>>>@Alexandre: is it the same for you? or do you always get more IOPS with seq? 
> 
> rand read 4K : 6000 iops
> seq read 4K : 3500 iops
> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>
> rand write 4K : 6000 iops (tmpfs journal)
> seq write 4K : 1600 iops
> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
> 
> 
> I really don't understand why I can't get more rand read iops with 4K block ... 
> 
> I tried with a high-end CPU for the client; it doesn't change anything.
> But the test cluster uses old 8-core E5420 CPUs @ 2.50GHz (though CPU usage is around 15% on the cluster during the read bench)
> 
> 
> ----- Original Message -----
>
> From: "Sébastien Han" <han.sebastien@gmail.com>
> To: "Mark Kampe" <mark.kampe@inktank.com>
> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Sent: Monday, November 19, 2012 19:03:40
> Subject: Re: RBD fio Performance concerns
> 
> @Sage, thanks for the info :) 
> @Mark: 
> 
>> If you want to do sequential I/O, you should do it buffered 
>> (so that the writes can be aggregated) or with a 4M block size 
>> (very efficient and avoiding object serialization). 
> 
> The original benchmark has been performed with 4M block size. And as 
> you can see I still get more IOPS with rand than seq... I just tried 
> with 4M without direct I/O, still the same. I can print fio results if 
> it's needed. 
> 
>> We do direct writes for benchmarking, not because it is a reasonable 
>> way to do I/O, but because it bypasses the buffer cache and enables 
>> us to directly measure cluster I/O throughput (which is what we are 
>> trying to optimize). Applications should usually do buffered I/O, 
>> to get the (very significant) benefits of caching and write aggregation. 
> 
> I know why I use direct I/O. These are synthetic benchmarks, far from
> a real-life scenario and from how common applications work. I am just
> trying to see the maximum I/O throughput that I can get from my RBD. All
> my applications use buffered I/O.
> 
> @Alexandre: is it the same for you? or do you always get more IOPS with seq? 
> 
> Thanks to all of you.. 
> 
> 
> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com> wrote: 
>> Recall: 
>> 1. RBD volumes are striped (4M wide) across RADOS objects 
>> 2. distinct writes to a single RADOS object are serialized 
>> 
>> Your sequential 4K writes are direct, depth=256, so there are 
>> (at all times) 256 writes queued to the same object. All of 
>> your writes are waiting through a very long line, which is adding 
>> horrendous latency. 
>> 
>> If you want to do sequential I/O, you should do it buffered 
>> (so that the writes can be aggregated) or with a 4M block size 
>> (very efficient and avoiding object serialization). 
>> 
>> We do direct writes for benchmarking, not because it is a reasonable 
>> way to do I/O, but because it bypasses the buffer cache and enables 
>> us to directly measure cluster I/O throughput (which is what we are 
>> trying to optimize). Applications should usually do buffered I/O, 
>> to get the (very significant) benefits of caching and write aggregation. 
>> 
>> 
>>> That's correct for some of the benchmarks. However even with 4K for 
>>> seq, I still get less IOPS. See below my last fio: 
>>> 
>>> # fio rbd-bench.fio 
>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 
>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, 
>>> iodepth=256 
>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 
>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, 
>>> iodepth=256 
>>> fio 1.59 
>>> Starting 4 processes 
>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 
>>> 02m:59s] 
>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096 
>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec 
>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90 
>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63 
>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62 
>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, 
>>> stdev=6239.06 
>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
>>> >=64=100.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.1% 
>>> issued r/w/d: total=200473/0/0, short=0/0/0 
>>> 
>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10% 
>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846 
>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec 
>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87 
>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40 
>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21 
>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, 
>>> stdev=648.62 
>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
>>> >=64=100.0% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.1% 
>>> issued r/w/d: total=1632349/0/0, short=0/0/0 
>>> 
>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01% 
>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653 
>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec 
>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97 
>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19 
>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17 
>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, 
>>> stdev=353.97 
>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, 
>>> >=64=99.4% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.1% 
>>> issued r/w/d: total=0/11171/0, short=0/0/0 
>>> 
>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60% 
>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20% 
>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446 
>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec 
>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37 
>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27 
>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84 
>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, 
>>> stdev=2000.45 
>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19 
>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, 
>>> >=64=99.9% 
>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.0% 
>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>>> >=64=0.1% 
>>> issued r/w/d: total=0/52147/0, short=0/0/0 
>>> 
>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10% 
>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33% 
>>> 
>>> Run status group 0 (all jobs): 
>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, 
>>> mint=60053msec, maxt=60053msec 
>>> 
>>> Run status group 1 (all jobs): 
>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, 
>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec 
>>> 
>>> Run status group 2 (all jobs): 
>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, 
>>> mint=60725msec, maxt=60725msec 
>>> 
>>> Run status group 3 (all jobs): 
>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, 
>>> mint=60822msec, maxt=60822msec 
>>> 
>>> Disk stats (read/write): 
>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, 
>>> in_queue=33434120, util=99.79% 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-20  7:32                   ` Alexandre DERUMIER
@ 2012-11-20 10:37                     ` Sébastien Han
  0 siblings, 0 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-20 10:37 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel, Mark Kampe

@Alexandre: thanks for publishing your results as well :)

I also tried with different sizes and saw no difference.


On Tue, Nov 20, 2012 at 8:32 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>Which iodepth did you use for those benchmarks?
>
> iodepth = 100
>
> filesize = 1G, 10G, 30G  , same result
>
> (3 nodes, 8 cores at 2.5GHz, 32GB RAM, 6 OSDs each (15k drives) + journal on tmpfs)
>
>
> Note that I can't get more than 6000 iops on a single rbd device, but it scales with more devices (each fio is at 6000 iops).
>
> (I get the same result with the rbd module or with a KVM guest)
>
>
>
> ----- Original Message -----
>
> From: "Sébastien Han" <han.sebastien@gmail.com>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>
> Sent: Monday, November 19, 2012 21:57:59
> Subject: Re: RBD fio Performance concerns
>
> Which iodepth did you use for those benchmarks?
>
>
>> I really don't understand why I can't get more rand read iops with 4K block ...
>
> Me neither, hope to get some clarification from the Inktank guys. It
> doesn't make any sense to me...
> --
> Best regards.
> Sébastien HAN.
>
>
> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>>@Alexandre: is it the same for you? or do you always get more IOPS with seq?
>>
>> rand read 4K : 6000 iops
>> seq read 4K : 3500 iops
>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>
>> rand write 4K : 6000 iops (tmpfs journal)
>> seq write 4K : 1600 iops
>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>
>>
>> I really don't understand why I can't get more rand read iops with 4K block ...
>>
>> I try with high end cpu for client, it doesn't change nothing.
>> But test cluster use old 8 cores E5420 @ 2.50GHZ (But cpu is around 15% on cluster during read bench)
>>
>>
>> ----- Mail original -----
>>
>> De: "Sébastien Han" <han.sebastien@gmail.com>
>> À: "Mark Kampe" <mark.kampe@inktank.com>
>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Lundi 19 Novembre 2012 19:03:40
>> Objet: Re: RBD fio Performance concerns
>>
>> @Sage, thanks for the info :)
>> @Mark:
>>
>>> If you want to do sequential I/O, you should do it buffered
>>> (so that the writes can be aggregated) or with a 4M block size
>>> (very efficient and avoiding object serialization).
>>
>> The original benchmark has been performed with 4M block size. And as
>> you can see I still get more IOPS with rand than seq... I just tried
>> with 4M without direct I/O, still the same. I can print fio results if
>> it's needed.
>>
>>> We do direct writes for benchmarking, not because it is a reasonable
>>> way to do I/O, but because it bypasses the buffer cache and enables
>>> us to directly measure cluster I/O throughput (which is what we are
>>> trying to optimize). Applications should usually do buffered I/O,
>>> to get the (very significant) benefits of caching and write aggregation.
>>
>> I know why I use direct I/O. These are synthetic benchmarks, far from
>> a real-life scenario and from how common applications work. I'm just
>> trying to see the maximum I/O throughput that I can get from my RBD. All
>> my applications use buffered I/O.
>>
>> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>>
>> Thanks to all of you..
>>
>>
>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com> wrote:
>>> Recall:
>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>> 2. distinct writes to a single RADOS object are serialized
>>>
>>> Your sequential 4K writes are direct, depth=256, so there are
>>> (at all times) 256 writes queued to the same object. All of
>>> your writes are waiting through a very long line, which is adding
>>> horrendous latency.
>>>
>>> If you want to do sequential I/O, you should do it buffered
>>> (so that the writes can be aggregated) or with a 4M block size
>>> (very efficient and avoiding object serialization).
>>>
>>> We do direct writes for benchmarking, not because it is a reasonable
>>> way to do I/O, but because it bypasses the buffer cache and enables
>>> us to directly measure cluster I/O throughput (which is what we are
>>> trying to optimize). Applications should usually do buffered I/O,
>>> to get the (very significant) benefits of caching and write aggregation.
>>>
>>>
>>>> That's correct for some of the benchmarks. However even with 4K for
>>>> seq, I still get less IOPS. See below my last fio:
>>>>
>>>> # fio rbd-bench.fio
>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>>>> iodepth=256
>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>>>> iodepth=256
>>>> fio 1.59
>>>> Starting 4 processes
>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>>>> 02m:59s]
>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
>>>> stdev=6239.06
>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>> >=64=100.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>>
>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
>>>> stdev=648.62
>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>> >=64=100.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>
>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65,
>>>> stdev=353.97
>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%,
>>>> >=64=99.4%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>>
>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21,
>>>> stdev=2000.45
>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>> >=64=99.9%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>>
>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>>>> mint=60053msec, maxt=60053msec
>>>>
>>>> Run status group 1 (all jobs):
>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>
>>>> Run status group 2 (all jobs):
>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>>>> mint=60725msec, maxt=60725msec
>>>>
>>>> Run status group 3 (all jobs):
>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>>>> mint=60822msec, maxt=60822msec
>>>>
>>>> Disk stats (read/write):
>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>>> in_queue=33434120, util=99.79%
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-19 20:57                 ` Sébastien Han
  2012-11-20  7:32                   ` Alexandre DERUMIER
@ 2012-11-21 15:52                   ` Mark Nelson
  2012-11-21 16:34                     ` Mark Nelson
  1 sibling, 1 reply; 51+ messages in thread
From: Mark Nelson @ 2012-11-21 15:52 UTC (permalink / raw)
  To: Sébastien Han; +Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are 
getting higher performance with random reads/writes vs sequential!  It 
would be interesting to see what kind of throughput smalliobench reports 
(should be packaged in bobtail) and also see if this behavior happens 
with cephfs.  It's still too early in the morning for me right now to 
come up with a reasonable explanation for what's going on.  It might be 
worth running blktrace and seekwatcher to see what the io patterns on 
the underlying disk look like in each case.  Maybe something unexpected 
is going on.
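
As a quick sanity check on the fio figures quoted below, here is a small
Python sketch (not something from the original runs) that applies Little's
law: with a full queue, IOPS should be roughly iodepth divided by mean
latency. The job names, latencies and IOPS are copied from the quoted fio
1.59 output; the helper names are made up for illustration.

```python
# Little's law: mean outstanding I/Os = IOPS x mean latency,
# so at a fixed queue depth, IOPS ~= iodepth / mean latency.
IODEPTH = 256  # from the fio job lines quoted below

# (job name, mean total latency in ms, IOPS reported by fio),
# copied from the quoted fio output.
JOBS = [
    ("seq-read", 76.67, 3338),
    ("rand-read", 9.408, 27203),
    ("seq-write", 1389.62, 183),
    ("rand-write", 298.52, 857),
]


def predicted_iops(iodepth, mean_latency_ms):
    """IOPS implied by a constantly full queue of `iodepth` requests."""
    return iodepth / (mean_latency_ms / 1000.0)


def check(jobs=JOBS, iodepth=IODEPTH):
    """Return (name, reported, predicted, predicted/reported) per job."""
    rows = []
    for name, lat_ms, reported in jobs:
        predicted = predicted_iops(iodepth, lat_ms)
        rows.append((name, reported, predicted, predicted / reported))
    return rows


if __name__ == "__main__":
    for name, reported, predicted, ratio in check():
        print(f"{name:10s} reported={reported:6d} "
              f"predicted={predicted:8.0f} ratio={ratio:.3f}")
```

If the predicted and reported values line up, the IOPS in each job are set
by per-request latency at depth 256, so the seq vs rand question becomes
"why is the latency so much higher in the sequential case?".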

Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:
> Which iodepth did you use for those benchs?
>
>
>> I really don't understand why I can't get more rand read iops with 4K block ...
>
> Me neither, hope to get some clarification from the Inktank guys. It
> doesn't make any sense to me...
> --
> Best regards,
> Sébastien HAN.
>
>
> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>>
>> rand read 4K : 6000 iops
>> seq read 4K : 3500 iops
>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>
>> rand write 4K : 6000 iops (tmpfs journal)
>> seq write 4K : 1600 iops
>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>
>>
>> I really don't understand why I can't get more rand read iops with 4K block ...
>>
>> I tried with a high-end CPU for the client; it doesn't change anything.
>> But the test cluster uses old 8-core E5420 CPUs @ 2.50 GHz (CPU usage is only around 15% on the cluster during the read bench)
>>
>>
>> ----- Original message -----
>>
>> From: "Sébastien Han" <han.sebastien@gmail.com>
>> To: "Mark Kampe" <mark.kampe@inktank.com>
>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Sent: Monday, 19 November 2012 19:03:40
>> Subject: Re: RBD fio Performance concerns
>>
>> @Sage, thanks for the info :)
>> @Mark:
>>
>>> If you want to do sequential I/O, you should do it buffered
>>> (so that the writes can be aggregated) or with a 4M block size
>>> (very efficient and avoiding object serialization).
>>
>> The original benchmark has been performed with 4M block size. And as
>> you can see I still get more IOPS with rand than seq... I just tried
>> with 4M without direct I/O, still the same. I can print fio results if
>> it's needed.
>>
>>> We do direct writes for benchmarking, not because it is a reasonable
>>> way to do I/O, but because it bypasses the buffer cache and enables
>>> us to directly measure cluster I/O throughput (which is what we are
>>> trying to optimize). Applications should usually do buffered I/O,
>>> to get the (very significant) benefits of caching and write aggregation.
>>
>> I know why I use direct I/O. These are synthetic benchmarks, far from
>> a real-life scenario and from how common applications work. I'm just
>> trying to see the maximum I/O throughput that I can get from my RBD. All
>> my applications use buffered I/O.
>>
>> @Alexandre: is it the same for you? or do you always get more IOPS with seq?
>>
>> Thanks to all of you..
>>
>>
>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com> wrote:
>>> Recall:
>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>> 2. distinct writes to a single RADOS object are serialized
>>>
>>> Your sequential 4K writes are direct, depth=256, so there are
>>> (at all times) 256 writes queued to the same object. All of
>>> your writes are waiting through a very long line, which is adding
>>> horrendous latency.
>>>
>>> If you want to do sequential I/O, you should do it buffered
>>> (so that the writes can be aggregated) or with a 4M block size
>>> (very efficient and avoiding object serialization).
>>>
>>> We do direct writes for benchmarking, not because it is a reasonable
>>> way to do I/O, but because it bypasses the buffer cache and enables
>>> us to directly measure cluster I/O throughput (which is what we are
>>> trying to optimize). Applications should usually do buffered I/O,
>>> to get the (very significant) benefits of caching and write aggregation.
>>>
>>>
>>>> That's correct for some of the benchmarks. However even with 4K for
>>>> seq, I still get less IOPS. See below my last fio:
>>>>
>>>> # fio rbd-bench.fio
>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>>>> iodepth=256
>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>>>> iodepth=256
>>>> fio 1.59
>>>> Starting 4 processes
>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>>>> 02m:59s]
>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
>>>> stdev=6239.06
>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>> >=64=100.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>>
>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
>>>> stdev=648.62
>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>> >=64=100.0%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>
>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65,
>>>> stdev=353.97
>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%,
>>>> >=64=99.4%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>>
>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21,
>>>> stdev=2000.45
>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>> >=64=99.9%
>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.0%
>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>> >=64=0.1%
>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>>
>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>
>>>> Run status group 0 (all jobs):
>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>>>> mint=60053msec, maxt=60053msec
>>>>
>>>> Run status group 1 (all jobs):
>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>
>>>> Run status group 2 (all jobs):
>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>>>> mint=60725msec, maxt=60725msec
>>>>
>>>> Run status group 3 (all jobs):
>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>>>> mint=60822msec, maxt=60822msec
>>>>
>>>> Disk stats (read/write):
>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>>> in_queue=33434120, util=99.79%

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-21 15:52                   ` Mark Nelson
@ 2012-11-21 16:34                     ` Mark Nelson
  2012-11-21 21:47                       ` Sébastien Han
  2012-11-22 10:19                       ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 51+ messages in thread
From: Mark Nelson @ 2012-11-21 16:34 UTC (permalink / raw)
  To: Sébastien Han; +Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe

Responding to my own message. :)

Talked to Sage a bit offline about this.  I think there are two opposing 
forces:

On one hand, random IO may be spreading reads/writes out across more 
OSDs than sequential IO that presumably would be hitting a single OSD 
more regularly.

On the other hand, you'd expect that sequential writes would be getting 
coalesced either at the RBD layer or on the OSD, and that the 
drive/controller/filesystem underneath the OSD would be doing some kind 
of readahead or prefetching.

On the third hand, maybe coalescing/prefetching is in fact happening but 
we are IOP limited by some per-osd limitation.

It could be interesting to do the test with a single OSD and see what 
happens.
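
To make the first point concrete, here is a rough Python sketch (an
illustration only, not how librbd or the kernel client is implemented) of
how many distinct RADOS objects a window of 256 in-flight 4K requests
touches. It assumes the simple layout Mark Kampe described (4M-wide objects,
object index = offset / 4M) and a hypothetical 1 GiB test area; the function
names are invented for the example.

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024   # 4 MiB object width, per Mark Kampe's note
BLOCK_SIZE = 4 * 1024           # 4K I/O size from the fio job
IODEPTH = 256                   # queue depth from the fio job
IMAGE_SIZE = 1024 ** 3          # assumption: 1 GiB test area


def object_index(offset, object_size=OBJECT_SIZE):
    """Simplified mapping: which 4 MiB object a byte offset falls into."""
    return offset // object_size


def sequential_window(start_block, depth=IODEPTH):
    """Byte offsets of `depth` consecutive 4K requests in flight."""
    return [(start_block + i) * BLOCK_SIZE for i in range(depth)]


def random_window(rng, depth=IODEPTH):
    """Byte offsets of `depth` uniformly random 4K requests in flight."""
    blocks = IMAGE_SIZE // BLOCK_SIZE
    return [rng.randrange(blocks) * BLOCK_SIZE for _ in range(depth)]


def distinct_objects(offsets):
    """Number of different objects hit by the given offsets."""
    return len({object_index(o) for o in offsets})


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print("sequential:", distinct_objects(sequential_window(0)))
    print("random:    ", distinct_objects(random_window(rng)))
```

A sequential window covers only 1 MiB, so it lands on one object (two when
it straddles a boundary), while the random window is spread over many
objects and therefore over many OSDs. Whether that alone explains the gap
is exactly what the single-OSD test should tell us.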

Mark

On 11/21/2012 09:52 AM, Mark Nelson wrote:
> Hi Guys,
>
> I'm late to this thread but thought I'd chime in.  Crazy that you are
> getting higher performance with random reads/writes vs sequential!  It
> would be interesting to see what kind of throughput smalliobench reports
> (should be packaged in bobtail) and also see if this behavior happens
> with cephfs.  It's still too early in the morning for me right now to
> come up with a reasonable explanation for what's going on.  It might be
> worth running blktrace and seekwatcher to see what the io patterns on
> the underlying disk look like in each case.  Maybe something unexpected
> is going on.
>
> Mark
>
> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>> Which iodepth did you use for those benchs?
>>
>>
>>> I really don't understand why I can't get more rand read iops with 4K
>>> block ...
>>
>> Me neither, hope to get some clarification from the Inktank guys. It
>> doesn't make any sense to me...
>> --
>> Best regards,
>> Sébastien HAN.
>>
>>
>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>>>>> with seq?
>>>
>>> rand read 4K : 6000 iops
>>> seq read 4K : 3500 iops
>>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>>
>>> rand write 4K : 6000 iops (tmpfs journal)
>>> seq write 4K : 1600 iops
>>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>>
>>>
>>> I really don't understand why I can't get more rand read iops with 4K
>>> block ...
>>>
>>> I tried with a high-end CPU for the client; it doesn't change anything.
>>> But the test cluster uses old 8-core E5420 CPUs @ 2.50 GHz (CPU usage is
>>> only around 15% on the cluster during the read bench)
>>>
>>>
>>> ----- Original message -----
>>>
>>> From: "Sébastien Han" <han.sebastien@gmail.com>
>>> To: "Mark Kampe" <mark.kampe@inktank.com>
>>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
>>> <ceph-devel@vger.kernel.org>
>>> Sent: Monday, 19 November 2012 19:03:40
>>> Subject: Re: RBD fio Performance concerns
>>>
>>> @Sage, thanks for the info :)
>>> @Mark:
>>>
>>>> If you want to do sequential I/O, you should do it buffered
>>>> (so that the writes can be aggregated) or with a 4M block size
>>>> (very efficient and avoiding object serialization).
>>>
>>> The original benchmark has been performed with 4M block size. And as
>>> you can see I still get more IOPS with rand than seq... I just tried
>>> with 4M without direct I/O, still the same. I can print fio results if
>>> it's needed.
>>>
>>>> We do direct writes for benchmarking, not because it is a reasonable
>>>> way to do I/O, but because it bypasses the buffer cache and enables
>>>> us to directly measure cluster I/O throughput (which is what we are
>>>> trying to optimize). Applications should usually do buffered I/O,
>>>> to get the (very significant) benefits of caching and write
>>>> aggregation.
>>>
>>> I know why I use direct I/O. These are synthetic benchmarks, far from
>>> a real-life scenario and from how common applications work. I'm just
>>> trying to see the maximum I/O throughput that I can get from my RBD. All
>>> my applications use buffered I/O.
>>>
>>> @Alexandre: is it the same for you? or do you always get more IOPS
>>> with seq?
>>>
>>> Thanks to all of you..
>>>
>>>
>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com>
>>> wrote:
>>>> Recall:
>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>>> 2. distinct writes to a single RADOS object are serialized
>>>>
>>>> Your sequential 4K writes are direct, depth=256, so there are
>>>> (at all times) 256 writes queued to the same object. All of
>>>> your writes are waiting through a very long line, which is adding
>>>> horrendous latency.
>>>>
>>>> If you want to do sequential I/O, you should do it buffered
>>>> (so that the writes can be aggregated) or with a 4M block size
>>>> (very efficient and avoiding object serialization).
>>>>
>>>> We do direct writes for benchmarking, not because it is a reasonable
>>>> way to do I/O, but because it bypasses the buffer cache and enables
>>>> us to directly measure cluster I/O throughput (which is what we are
>>>> trying to optimize). Applications should usually do buffered I/O,
>>>> to get the (very significant) benefits of caching and write
>>>> aggregation.
>>>>
>>>>
>>>>> That's correct for some of the benchmarks. However even with 4K for
>>>>> seq, I still get less IOPS. See below my last fio:
>>>>>
>>>>> # fio rbd-bench.fio
>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>> iodepth=256
>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>> iodepth=256
>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>> iodepth=256
>>>>> fio 1.59
>>>>> Starting 4 processes
>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>>>>> 02m:59s]
>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
>>>>> stdev=6239.06
>>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>>> >=64=100.0%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.0%
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.1%
>>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>>>
>>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
>>>>> stdev=648.62
>>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>>> >=64=100.0%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.0%
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.1%
>>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>>
>>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65,
>>>>> stdev=353.97
>>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%,
>>>>> >=64=99.4%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.0%
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.1%
>>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>>>
>>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21,
>>>>> stdev=2000.45
>>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>>> >=64=99.9%
>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.0%
>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>> >=64=0.1%
>>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>>>
>>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>>
>>>>> Run status group 0 (all jobs):
>>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>>>>> mint=60053msec, maxt=60053msec
>>>>>
>>>>> Run status group 1 (all jobs):
>>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>>
>>>>> Run status group 2 (all jobs):
>>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>>>>> mint=60725msec, maxt=60725msec
>>>>>
>>>>> Run status group 3 (all jobs):
>>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>>>>> mint=60822msec, maxt=60822msec
>>>>>
>>>>> Disk stats (read/write):
>>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>>>> in_queue=33434120, util=99.79%

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-21 16:34                     ` Mark Nelson
@ 2012-11-21 21:47                       ` Sébastien Han
  2012-11-21 22:05                         ` Mark Kampe
                                           ` (2 more replies)
  2012-11-22 10:19                       ` Stefan Priebe - Profihost AG
  1 sibling, 3 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-21 21:47 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe

Hi Mark,

Well, the most concerning thing is that I have 2 Ceph clusters and both
of them show better rand than seq...
I don't have enough background to argue with your assumptions, but I
could try to shrink my test platform to a single OSD and see how it
performs. I'll keep you posted on that one.

But it seems that Alexandre and I have the same results (more rand
than seq): he has (at least) one cluster and I have 2. So I'm starting to
think that it's not an isolated issue.

Is it different for you? Do you usually get more seq IOPS from an RBD
than rand?


On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
> Responding to my own message. :)
>
> Talked to Sage a bit offline about this.  I think there are two opposing
> forces:
>
> On one hand, random IO may be spreading reads/writes out across more OSDs
> than sequential IO that presumably would be hitting a single OSD more
> regularly.
>
> On the other hand, you'd expect that sequential writes would be getting
> coalesced either at the RBD layer or on the OSD, and that the
> drive/controller/filesystem underneath the OSD would be doing some kind of
> readahead or prefetching.
>
> On the third hand, maybe coalescing/prefetching is in fact happening but we
> are IOP limited by some per-osd limitation.
>
> It could be interesting to do the test with a single OSD and see what
> happens.
>
> Mark
>
>
> On 11/21/2012 09:52 AM, Mark Nelson wrote:
>>
>> Hi Guys,
>>
>> I'm late to this thread but thought I'd chime in.  Crazy that you are
>> getting higher performance with random reads/writes vs sequential!  It
>> would be interesting to see what kind of throughput smalliobench reports
>> (should be packaged in bobtail) and also see if this behavior happens
>> with cephfs.  It's still too early in the morning for me right now to
>> come up with a reasonable explanation for what's going on.  It might be
>> worth running blktrace and seekwatcher to see what the io patterns on
>> the underlying disk look like in each case.  Maybe something unexpected
>> is going on.
>>
>> Mark
>>
>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>>>
>>> Which iodepth did you use for those benchs?
>>>
>>>
>>>> I really don't understand why I can't get more rand read iops with 4K
>>>> block ...
>>>
>>>
>>> Me neither, hope to get some clarification from the Inktank guys. It
>>> doesn't make any sense to me...
>>> --
>>> Best regards,
>>> Sébastien HAN.
>>>
>>>
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
>>> <aderumier@odiso.com> wrote:
>>>>>>
>>>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>>>>>> with seq?
>>>>
>>>>
>>>> rand read 4K : 6000 iops
>>>> seq read 4K : 3500 iops
>>>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>> rand write 4K : 6000 iops (tmpfs journal)
>>>> seq write 4K : 1600 iops
>>>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>>
>>>> I really don't understand why I can't get more rand read iops with 4K
>>>> block ...
>>>>
>>>> I tried with a high-end CPU for the client; it doesn't change anything.
>>>> But the test cluster uses old 8-core E5420 CPUs @ 2.50 GHz (CPU usage is
>>>> only around 15% on the cluster during the read bench)
>>>>
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Sébastien Han" <han.sebastien@gmail.com>
>>>> To: "Mark Kampe" <mark.kampe@inktank.com>
>>>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
>>>> <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 19 November 2012 19:03:40
>>>> Subject: Re: RBD fio Performance concerns
>>>>
>>>> @Sage, thanks for the info :)
>>>> @Mark:
>>>>
>>>>> If you want to do sequential I/O, you should do it buffered
>>>>> (so that the writes can be aggregated) or with a 4M block size
>>>>> (very efficient and avoiding object serialization).
>>>>
>>>>
>>>> The original benchmark has been performed with 4M block size. And as
>>>> you can see I still get more IOPS with rand than seq... I just tried
>>>> with 4M without direct I/O, still the same. I can print fio results if
>>>> it's needed.
>>>>
>>>>> We do direct writes for benchmarking, not because it is a reasonable
>>>>> way to do I/O, but because it bypasses the buffer cache and enables
>>>>> us to directly measure cluster I/O throughput (which is what we are
>>>>> trying to optimize). Applications should usually do buffered I/O,
>>>>> to get the (very significant) benefits of caching and write
>>>>> aggregation.
>>>>
>>>>
>>>> I know why I use direct I/O. It's synthetic benchmarks, it's far away
>>>> from a real life scenario and how common applications works. I just
>>>> try to see the maximum I/O throughput that I can get from my RBD. All
>>>> my applications use buffered I/O.
>>>>
>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>>>> with seq?
>>>>
>>>> Thanks to all of you..
>>>>
>>>>
>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com>
>>>> wrote:
>>>>>
>>>>> Recall:
>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>>>> 2. distinct writes to a single RADOS object are serialized
>>>>>
>>>>> Your sequential 4K writes are direct, depth=256, so there are
>>>>> (at all times) 256 writes queued to the same object. All of
>>>>> your writes are waiting through a very long line, which is adding
>>>>> horrendous latency.
>>>>>
>>>>> If you want to do sequential I/O, you should do it buffered
>>>>> (so that the writes can be aggregated) or with a 4M block size
>>>>> (very efficient and avoiding object serialization).
>>>>>
>>>>> We do direct writes for benchmarking, not because it is a reasonable
>>>>> way to do I/O, but because it bypasses the buffer cache and enables
>>>>> us to directly measure cluster I/O throughput (which is what we are
>>>>> trying to optimize). Applications should usually do buffered I/O,
>>>>> to get the (very significant) benefits of caching and write
>>>>> aggregation.
>>>>>
>>>>>
>>>>>> That's correct for some of the benchmarks. However even with 4K for
>>>>>> seq, I still get less IOPS. See below my last fio:
>>>>>>
>>>>>> # fio rbd-bench.fio
>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> fio 1.59
>>>>>> Starting 4 processes
>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>>>>>> 02m:59s]
>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
>>>>>> stdev=6239.06
>>>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
>>>>>> stdev=648.62
>>>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65,
>>>>>> stdev=353.97
>>>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21,
>>>>>> stdev=2000.45
>>>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>>>>>> mint=60053msec, maxt=60053msec
>>>>>>
>>>>>> Run status group 1 (all jobs):
>>>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>>>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>>>
>>>>>> Run status group 2 (all jobs):
>>>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>>>>>> mint=60725msec, maxt=60725msec
>>>>>>
>>>>>> Run status group 3 (all jobs):
>>>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>>>>>> mint=60822msec, maxt=60822msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>>>>> in_queue=33434120, util=99.79%
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RBD fio Performance concerns
  2012-11-21 21:47                       ` Sébastien Han
@ 2012-11-21 22:05                         ` Mark Kampe
  2012-11-22  5:46                         ` Alexandre DERUMIER
  2012-11-23 13:36                         ` Chen, Xiaoxi
  2 siblings, 0 replies; 51+ messages in thread
From: Mark Kampe @ 2012-11-21 22:05 UTC (permalink / raw)
  To: Sébastien Han; +Cc: ceph-devel

Sequential is faster than random on a single disk, but here we are not
doing I/O to a disk; we are doing it to a distributed storage cluster:

   small random operations are striped over multiple objects and
   servers, and so can proceed in parallel and take advantage of
   more nodes and disks.  This parallelism can overcome the added
   latencies of network I/O to yield very good throughput.

   small sequential read and write operations are serialized on
   a single server, NIC, and drive.  This serialization eliminates
   parallelism, and the network and other queuing delays are no
   longer compensated for.

This striping is a good idea for the small random I/O that is
typical of the way Linux systems talk to their disks.  But for
other I/O patterns, it is not optimal.
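
As a rough illustration of the point above (an editorial sketch, not Ceph
code), the following Python snippet maps request offsets to 4M objects and
counts how many distinct objects one queue-depth window of requests touches.
The 4M object size, 4K request size, and depth of 256 come from this thread;
the 10 GiB image size, the helper names, and the plain offset // object_size
mapping are assumptions made only for this example.

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024   # 4M stripe width, as stated in this thread
BLOCK_SIZE = 4 * 1024           # 4K requests, as in the fio jobs
QUEUE_DEPTH = 256               # iodepth used in the fio jobs
IMAGE_SIZE = 10 * 1024 ** 3     # assumed 10 GiB image (hypothetical)


def object_index(offset):
    """Return the index of the 4M object that holds a byte offset."""
    return offset // OBJECT_SIZE


def objects_touched(offsets):
    """Count the distinct objects hit by a window of in-flight requests."""
    return len({object_index(off) for off in offsets})


# One queue-depth window of sequential 4K requests starting at offset 0.
sequential = [i * BLOCK_SIZE for i in range(QUEUE_DEPTH)]

# One window of random, 4K-aligned requests (seeded for repeatability).
rng = random.Random(0)
blocks_in_image = IMAGE_SIZE // BLOCK_SIZE
random_window = [
    rng.randrange(blocks_in_image) * BLOCK_SIZE for _ in range(QUEUE_DEPTH)
]

print("sequential window touches", objects_touched(sequential), "object(s)")
print("random window touches", objects_touched(random_window), "object(s)")
```

Under these assumptions the sequential window lands entirely in a single
object (256 x 4K = 1M, well inside one 4M object), while the random window is
spread over many distinct objects, which is where the parallelism comes from.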

On 11/21/2012 01:47 PM, Sébastien Han wrote:
> Hi Mark,
>
> Well the most concerning thing is that I have 2 Ceph clusters and both
> of them show better rand than seq...
> I don't have enough background to argue with your assumptions but I
> could try to shrink my test platform to a single OSD and see how it
> performs. Let's keep in touch on that one.
>
> But it seems that Alexandre and I have the same results (more rand
> than seq), he has (at least) one cluster and I have 2. Thus I start to
> think that's not an isolated issue.
>
> Is it different for you? Do you usually get more seq IOPS from an RBD
> than rand?
>

* Re: RBD fio Performance concerns
  2012-11-21 21:47                       ` Sébastien Han
  2012-11-21 22:05                         ` Mark Kampe
@ 2012-11-22  5:46                         ` Alexandre DERUMIER
  2012-11-23 13:36                         ` Chen, Xiaoxi
  2 siblings, 0 replies; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-22  5:46 UTC (permalink / raw)
  To: Sébastien Han; +Cc: ceph-devel, Mark Kampe, Mark Nelson

>>but it seems that Alexandre and I have the same results (more rand 
>>than seq), he has (at least) one cluster and I have 2. Thus I start to 
>>think that's not an isolated issue. 

Hi, I have bought new servers with more powerful CPUs to build a new 3-node cluster for comparison.
I'll redo the tests in 1 or 2 weeks.
I hope performance will improve.

I'll keep you posted!

Alexandre


----- Original message -----

From: "Sébastien Han" <han.sebastien@gmail.com>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>
Sent: Wednesday 21 November 2012 22:47:08
Subject: Re: RBD fio Performance concerns

Hi Mark, 

Well the most concerning thing is that I have 2 Ceph clusters and both 
of them show better rand than seq... 
I don't have enough background to argue with your assumptions but I
could try to shrink my test platform to a single OSD and see how it
performs. Let's keep in touch on that one.

But it seems that Alexandre and I have the same results (more rand 
than seq), he has (at least) one cluster and I have 2. Thus I start to 
think that's not an isolated issue. 

Is it different for you? Do you usually get more seq IOPS from an RBD 
than rand?


On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson <mark.nelson@inktank.com> wrote: 
> Responding to my own message. :) 
> 
> Talked to Sage a bit offline about this. I think there are two opposing 
> forces: 
> 
> On one hand, random IO may be spreading reads/writes out across more OSDs 
> than sequential IO that presumably would be hitting a single OSD more 
> regularly. 
> 
> On the other hand, you'd expect that sequential writes would be getting 
> coalesced either at the RBD layer or on the OSD, and that the 
> drive/controller/filesystem underneath the OSD would be doing some kind of 
> readahead or prefetching. 
> 
> On the third hand, maybe coalescing/prefetching is in fact happening but we 
> are IOP limited by some per-osd limitation. 
> 
> It could be interesting to do the test with a single OSD and see what 
> happens. 
> 
> Mark 
> 
> 
> On 11/21/2012 09:52 AM, Mark Nelson wrote: 
>> 
>> Hi Guys, 
>> 
>> I'm late to this thread but thought I'd chime in. Crazy that you are 
>> getting higher performance with random reads/writes vs sequential! It 
>> would be interesting to see what kind of throughput smalliobench reports 
>> (should be packaged in bobtail) and also see if this behavior happens 
>> with cephfs. It's still too early in the morning for me right now to 
>> come up with a reasonable explanation for what's going on. It might be 
>> worth running blktrace and seekwatcher to see what the io patterns on 
>> the underlying disk look like in each case. Maybe something unexpected 
>> is going on. 
>> 
>> Mark 
>> 
>> On 11/19/2012 02:57 PM, Sébastien Han wrote: 
>>> 
>>> Which iodepth did you use for those benchs? 
>>> 
>>> 
>>>> I really don't understand why I can't get more rand read iops with 4K 
>>>> block ... 
>>> 
>>> 
>>> Me neither, hope to get some clarification from the Inktank guys. It 
>>> doesn't make any sense to me... 
>>> -- 
>>> Bien cordialement. 
>>> Sébastien HAN. 
>>> 
>>> 
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER 
>>> <aderumier@odiso.com> wrote: 
>>>>>> 
>>>>>> @Alexandre: is it the same for you? or do you always get more IOPS 
>>>>>> with seq? 
>>>> 
>>>> 
>>>> rand read 4K : 6000 iops
>>>> seq read 4K : 3500 iops
>>>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>> rand write 4K : 6000 iops (tmpfs journal)
>>>> seq write 4K : 1600 iops
>>>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>>
>>>> I really don't understand why I can't get more rand read iops with 4K
>>>> block ...
>>>>
>>>> I tried a high-end CPU for the client; it doesn't change anything.
>>>> But the test cluster uses old 8-core E5420s @ 2.50GHz (though CPU is
>>>> around 15% on the cluster during the read bench)
>>>>
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Sébastien Han" <han.sebastien@gmail.com>
>>>> To: "Mark Kampe" <mark.kampe@inktank.com>
>>>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
>>>> <ceph-devel@vger.kernel.org>
>>>> Sent: Monday 19 November 2012 19:03:40
>>>> Subject: Re: RBD fio Performance concerns
>>>> 
>>>> @Sage, thanks for the info :) 
>>>> @Mark: 
>>>> 
>>>>> If you want to do sequential I/O, you should do it buffered 
>>>>> (so that the writes can be aggregated) or with a 4M block size 
>>>>> (very efficient and avoiding object serialization). 
>>>> 
>>>> 
>>>> The original benchmark has been performed with 4M block size. And as 
>>>> you can see I still get more IOPS with rand than seq... I just tried 
>>>> with 4M without direct I/O, still the same. I can print fio results if 
>>>> it's needed. 
>>>> 
>>>>> We do direct writes for benchmarking, not because it is a reasonable 
>>>>> way to do I/O, but because it bypasses the buffer cache and enables 
>>>>> us to directly measure cluster I/O throughput (which is what we are 
>>>>> trying to optimize). Applications should usually do buffered I/O, 
>>>>> to get the (very significant) benefits of caching and write 
>>>>> aggregation. 
>>>> 
>>>> 
>>>> I know why I use direct I/O. It's synthetic benchmarks, it's far away 
>>>> from a real life scenario and how common applications works. I just 
>>>> try to see the maximum I/O throughput that I can get from my RBD. All 
>>>> my applications use buffered I/O. 
>>>> 
>>>> @Alexandre: is it the same for you? or do you always get more IOPS 
>>>> with seq? 
>>>> 
>>>> Thanks to all of you.. 
>>>> 
>>>> 
>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com> 
>>>> wrote: 
>>>>> 
>>>>> Recall: 
>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects 
>>>>> 2. distinct writes to a single RADOS object are serialized 
>>>>> 
>>>>> Your sequential 4K writes are direct, depth=256, so there are 
>>>>> (at all times) 256 writes queued to the same object. All of 
>>>>> your writes are waiting through a very long line, which is adding 
>>>>> horrendous latency. 
>>>>> 
>>>>> If you want to do sequential I/O, you should do it buffered 
>>>>> (so that the writes can be aggregated) or with a 4M block size 
>>>>> (very efficient and avoiding object serialization). 
>>>>> 
>>>>> We do direct writes for benchmarking, not because it is a reasonable 
>>>>> way to do I/O, but because it bypasses the buffer cache and enables 
>>>>> us to directly measure cluster I/O throughput (which is what we are 
>>>>> trying to optimize). Applications should usually do buffered I/O, 
>>>>> to get the (very significant) benefits of caching and write 
>>>>> aggregation. 
>>>>> 
>>>>> 
>>>>>> That's correct for some of the benchmarks. However even with 4K for 
>>>>>> seq, I still get less IOPS. See below my last fio: 
>>>>>> 
>>>>>> # fio rbd-bench.fio 
>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 
>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, 
>>>>>> iodepth=256 
>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, 
>>>>>> iodepth=256 
>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, 
>>>>>> iodepth=256 
>>>>>> fio 1.59 
>>>>>> Starting 4 processes 
>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 
>>>>>> 02m:59s] 
>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096 
>>>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec 
>>>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90 
>>>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63 
>>>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62 
>>>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, 
>>>>>> stdev=6239.06 
>>>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=200473/0/0, short=0/0/0 
>>>>>> 
>>>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10% 
>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846 
>>>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec 
>>>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87 
>>>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40 
>>>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21 
>>>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, 
>>>>>> stdev=648.62 
>>>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=1632349/0/0, short=0/0/0 
>>>>>> 
>>>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01% 
>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653 
>>>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec 
>>>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97 
>>>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19 
>>>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17 
>>>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, 
>>>>>> stdev=353.97 
>>>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=0/11171/0, short=0/0/0 
>>>>>> 
>>>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60% 
>>>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20% 
>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446 
>>>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec 
>>>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37 
>>>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27 
>>>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84 
>>>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, 
>>>>>> stdev=2000.45 
>>>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19 
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=0/52147/0, short=0/0/0 
>>>>>> 
>>>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10% 
>>>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33% 
>>>>>> 
>>>>>> Run status group 0 (all jobs): 
>>>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, 
>>>>>> mint=60053msec, maxt=60053msec 
>>>>>> 
>>>>>> Run status group 1 (all jobs): 
>>>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, 
>>>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec 
>>>>>> 
>>>>>> Run status group 2 (all jobs): 
>>>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, 
>>>>>> mint=60725msec, maxt=60725msec 
>>>>>> 
>>>>>> Run status group 3 (all jobs): 
>>>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, 
>>>>>> mint=60822msec, maxt=60822msec 
>>>>>> 
>>>>>> Disk stats (read/write): 
>>>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, 
>>>>>> in_queue=33434120, util=99.79% 
>>> 
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
>>> the body of a message to majordomo@vger.kernel.org 
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html 
>>> 
>> 
> 

* Re: RBD fio Performance concerns
  2012-11-21 16:34                     ` Mark Nelson
  2012-11-21 21:47                       ` Sébastien Han
@ 2012-11-22 10:19                       ` Stefan Priebe - Profihost AG
       [not found]                         ` <CAOLwVUmp7wrfead8qX2BZPbyeN_JY_XBN+wkEWmbY6q1-5u0fw@mail.gmail.com>
  1 sibling, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 10:19 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Sébastien Han, Alexandre DERUMIER, ceph-devel, Mark Kampe


Same for me:
rand 4k: 23,000 iops
seq 4k: 13,000 iops

Even in writeback mode, where seq 4k would normally be merged into
bigger requests.
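
To make "merged into bigger requests" concrete, here is a minimal Python
sketch of request coalescing. It is not how RBD writeback caching or the OSD
actually implements merging; the function name coalesce, the (offset, length)
tuple representation, and the 4M cap on a merged request are assumptions
chosen purely for illustration.

```python
def coalesce(requests, max_size=4 * 1024 * 1024):
    """Merge contiguous (offset, length) requests, up to max_size bytes each."""
    merged = []
    for offset, length in sorted(requests):
        if merged:
            last_off, last_len = merged[-1]
            contiguous = last_off + last_len == offset
            if contiguous and last_len + length <= max_size:
                # Extend the previous request instead of issuing a new one.
                merged[-1] = (last_off, last_len + length)
                continue
        merged.append((offset, length))
    return merged


# 256 sequential 4K writes, i.e. one queue-depth window from the fio jobs.
seq = [(i * 4096, 4096) for i in range(256)]
print(len(coalesce(seq)))  # prints 1: a single 1M request
```

If merging of this kind were effective, the backend would see far fewer and
larger requests for the sequential case, which is why the lower sequential
numbers are surprising.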

Stefan

On 21.11.2012 17:34, Mark Nelson wrote:
> Responding to my own message. :)
>
> Talked to Sage a bit offline about this.  I think there are two opposing
> forces:
>
> On one hand, random IO may be spreading reads/writes out across more
> OSDs than sequential IO that presumably would be hitting a single OSD
> more regularly.
>
> On the other hand, you'd expect that sequential writes would be getting
> coalesced either at the RBD layer or on the OSD, and that the
> drive/controller/filesystem underneath the OSD would be doing some kind
> of readahead or prefetching.
>
> On the third hand, maybe coalescing/prefetching is in fact happening but
> we are IOP limited by some per-osd limitation.
>
> It could be interesting to do the test with a single OSD and see what
> happens.
>
> Mark
>
> On 11/21/2012 09:52 AM, Mark Nelson wrote:
>> Hi Guys,
>>
>> I'm late to this thread but thought I'd chime in.  Crazy that you are
>> getting higher performance with random reads/writes vs sequential!  It
>> would be interesting to see what kind of throughput smalliobench reports
>> (should be packaged in bobtail) and also see if this behavior happens
>> with cephfs.  It's still too early in the morning for me right now to
>> come up with a reasonable explanation for what's going on.  It might be
>> worth running blktrace and seekwatcher to see what the io patterns on
>> the underlying disk look like in each case.  Maybe something unexpected
>> is going on.
>>
>> Mark
>>
>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>>> Which iodepth did you use for those benchs?
>>>
>>>
>>>> I really don't understand why I can't get more rand read iops with 4K
>>>> block ...
>>>
>>> Me neither, hope to get some clarification from the Inktank guys. It
>>> doesn't make any sense to me...
>>> --
>>> Bien cordialement.
>>> Sébastien HAN.
>>>
>>>
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
>>> <aderumier@odiso.com> wrote:
>>>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>>>>>> with seq?
>>>>
>>>> rand read 4K : 6000 iops
>>>> seq read 4K : 3500 iops
>>>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>> rand write 4K : 6000 iops (tmpfs journal)
>>>> seq write 4K : 1600 iops
>>>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>>
>>>> I really don't understand why I can't get more rand read iops with 4K
>>>> block ...
>>>>
>>>> I tried a high-end CPU for the client; it doesn't change anything.
>>>> But the test cluster uses old 8-core E5420s @ 2.50GHz (though CPU is
>>>> around 15% on the cluster during the read bench)
>>>>
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Sébastien Han" <han.sebastien@gmail.com>
>>>> To: "Mark Kampe" <mark.kampe@inktank.com>
>>>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
>>>> <ceph-devel@vger.kernel.org>
>>>> Sent: Monday 19 November 2012 19:03:40
>>>> Subject: Re: RBD fio Performance concerns
>>>>
>>>> @Sage, thanks for the info :)
>>>> @Mark:
>>>>
>>>>> If you want to do sequential I/O, you should do it buffered
>>>>> (so that the writes can be aggregated) or with a 4M block size
>>>>> (very efficient and avoiding object serialization).
>>>>
>>>> The original benchmark has been performed with 4M block size. And as
>>>> you can see I still get more IOPS with rand than seq... I just tried
>>>> with 4M without direct I/O, still the same. I can print fio results if
>>>> it's needed.
>>>>
>>>>> We do direct writes for benchmarking, not because it is a reasonable
>>>>> way to do I/O, but because it bypasses the buffer cache and enables
>>>>> us to directly measure cluster I/O throughput (which is what we are
>>>>> trying to optimize). Applications should usually do buffered I/O,
>>>>> to get the (very significant) benefits of caching and write
>>>>> aggregation.
>>>>
>>>> I know why I use direct I/O. It's synthetic benchmarks, it's far away
>>>> from a real life scenario and how common applications works. I just
>>>> try to see the maximum I/O throughput that I can get from my RBD. All
>>>> my applications use buffered I/O.
>>>>
>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>>>> with seq?
>>>>
>>>> Thanks to all of you..
>>>>
>>>>
>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com>
>>>> wrote:
>>>>> Recall:
>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>>>> 2. distinct writes to a single RADOS object are serialized
>>>>>
>>>>> Your sequential 4K writes are direct, depth=256, so there are
>>>>> (at all times) 256 writes queued to the same object. All of
>>>>> your writes are waiting through a very long line, which is adding
>>>>> horrendous latency.
>>>>>
>>>>> If you want to do sequential I/O, you should do it buffered
>>>>> (so that the writes can be aggregated) or with a 4M block size
>>>>> (very efficient and avoiding object serialization).
>>>>>
>>>>> We do direct writes for benchmarking, not because it is a reasonable
>>>>> way to do I/O, but because it bypasses the buffer cache and enables
>>>>> us to directly measure cluster I/O throughput (which is what we are
>>>>> trying to optimize). Applications should usually do buffered I/O,
>>>>> to get the (very significant) benefits of caching and write
>>>>> aggregation.
>>>>>
>>>>>
>>>>>> That's correct for some of the benchmarks. However even with 4K for
>>>>>> seq, I still get less IOPS. See below my last fio:
>>>>>>
>>>>>> # fio rbd-bench.fio
>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>>>>>> iodepth=256
>>>>>> fio 1.59
>>>>>> Starting 4 processes
>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>>>>>> 02m:59s]
>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
>>>>>> stdev=6239.06
>>>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48,
>>>>>> stdev=648.62
>>>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
>>>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
>>>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>>>>>>
>>>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>>>>>> mint=60053msec, maxt=60053msec
>>>>>>
>>>>>> Run status group 1 (all jobs):
>>>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>>>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>>>
>>>>>> Run status group 2 (all jobs):
>>>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>>>>>> mint=60725msec, maxt=60725msec
>>>>>>
>>>>>> Run status group 3 (all jobs):
>>>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>>>>>> mint=60822msec, maxt=60822msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>>>>>> in_queue=33434120, util=99.79%
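
A rough cross-check of the latencies in the fio output quoted above, offered only as a sketch: with a constant number of I/Os in flight, Little's law says mean latency is roughly iodepth / IOPS. The IOPS and "lat" averages below are copied from that output; iodepth=256 is taken from the job definitions quoted elsewhere in this thread, and the function name is made up for illustration.

```python
# Little's law for a closed queue: mean latency = outstanding I/Os / throughput.
# All figures are copied from the quoted fio run; nothing here is measured anew.

def expected_latency_ms(iodepth: int, iops: float) -> float:
    """Mean per-I/O latency (ms) implied by a constant queue depth."""
    return iodepth / iops * 1000.0

IODEPTH = 256  # from the quoted fio job definitions

results = {
    # job: (reported IOPS, reported mean "lat" in ms)
    "seq-read": (3338, 76.67),
    "rand-read": (27203, 9.408),
    "seq-write": (183, 1389.62),
    "rand-write": (857, 298.52),
}

for job, (iops, lat_ms) in results.items():
    predicted = expected_latency_ms(IODEPTH, iops)
    print(f"{job:10s} predicted {predicted:8.1f} ms, fio reported {lat_ms:8.1f} ms")
```

If the prediction and the reported value agree closely, the long latencies are mostly time spent waiting in the queue, which would fit the explanation that deep queues of small sequential writes line up behind one another.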
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
       [not found]                         ` <CAOLwVUmp7wrfead8qX2BZPbyeN_JY_XBN+wkEWmbY6q1-5u0fw@mail.gmail.com>
@ 2012-11-22 11:48                           ` Stefan Priebe - Profihost AG
  2012-11-22 12:50                             ` Sébastien Han
  2012-11-22 14:34                           ` Mark Nelson
  1 sibling, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 11:48 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Mark Nelson, Alexandre DERUMIER, ceph-devel, Mark Kampe

On 22.11.2012 11:49, Sébastien Han wrote:
> @Alexandre: cool!
>
> @ Stefan: Full SSD cluster and 10G switches?
Yes

> Couple of weeks ago I saw
> that you use journal aio, did you notice performance improvement with it?
The journal is running on tmpfs for me, but that changes nothing.

Stefan

* Re: RBD fio Performance concerns
  2012-11-22 11:48                           ` Stefan Priebe - Profihost AG
@ 2012-11-22 12:50                             ` Sébastien Han
  2012-11-22 13:14                               ` Stefan Priebe - Profihost AG
  2012-11-23 10:31                               ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 51+ messages in thread
From: Sébastien Han @ 2012-11-22 12:50 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mark Nelson, Alexandre DERUMIER, ceph-devel, Mark Kampe

> journal is running on tmpfs to me but that changes nothing.

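I don't think it works then. According to the doc, journal aio "enables
using libaio for asynchronous writes to the journal" and "requires
journal dio set to true". As far as I know, tmpfs does not support
direct I/O, so journal aio would have no effect there.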


On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> On 22.11.2012 11:49, Sébastien Han wrote:
>
>> @Alexandre: cool!
>>
>> @ Stefan: Full SSD cluster and 10G switches?
>
> Yes
>
>
>> Couple of weeks ago I saw
>> that you use journal aio, did you notice performance improvement with it?
>
> journal is running on tmpfs to me but that changes nothing.
>
> Stefan

* Re: RBD fio Performance concerns
  2012-11-22 12:50                             ` Sébastien Han
@ 2012-11-22 13:14                               ` Stefan Priebe - Profihost AG
       [not found]                                 ` <CAOLwVUkwVSv-Ven2CTjnTN2J573TBTD2SLDY7df0h7ncJZQgpQ@mail.gmail.com>
  2012-11-23 10:31                               ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 13:14 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Mark Nelson, Alexandre DERUMIER, ceph-devel, Mark Kampe

On 22.11.2012 13:50, Sébastien Han wrote:
>> journal is running on tmpfs to me but that changes nothing.
>
> I don't think it works then. According to the doc: Enables using
> libaio for asynchronous writes to the journal. Requires journal dio
> set to true.

Ah, that might be, but as the SSDs are pretty fast I don't know which device to 
use for the journal other than tmpfs.

And RAMDISK devices are too expensive.

Greets,
Stefan

> On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> On 22.11.2012 11:49, Sébastien Han wrote:
>>
>>> @Alexandre: cool!
>>>
>>> @ Stefan: Full SSD cluster and 10G switches?
>>
>> Yes
>>
>>
>>> Couple of weeks ago I saw
>>> that you use journal aio, did you notice performance improvement with it?
>>
>> journal is running on tmpfs to me but that changes nothing.
>>
>> Stefan

* Re: RBD fio Performance concerns
       [not found]                                 ` <CAOLwVUkwVSv-Ven2CTjnTN2J573TBTD2SLDY7df0h7ncJZQgpQ@mail.gmail.com>
@ 2012-11-22 13:29                                   ` Stefan Priebe - Profihost AG
  2012-11-22 14:20                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 13:29 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Mark Nelson, Alexandre DERUMIER, ceph-devel, Mark Kampe

On 22.11.2012 14:22, Sébastien Han wrote:
>     And RAMDISK devices are too expensive.
>
> It would make sense in your infra, but yes they are really expensive.

We need something like tmpfs: running in local memory, but with support for dio.

Stefan

* Re: RBD fio Performance concerns
  2012-11-22 13:29                                   ` Stefan Priebe - Profihost AG
@ 2012-11-22 14:20                                     ` Alexandre DERUMIER
  2012-11-22 14:22                                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-22 14:20 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mark Nelson, ceph-devel, Mark Kampe, Sébastien Han

>>We need something like tmpfs - running in local memory but support dio. 

Maybe with a ramdisk, /dev/ram0?

We can format it with a standard filesystem (ext3, ext4, ...), so maybe dio works with it?
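
One way to find out would be to probe the mount point directly. The sketch below is only an illustration: it creates a scratch file in a directory and tries to reopen it with O_DIRECT, which filesystems without direct I/O support typically refuse with EINVAL. The helper name and the example paths are assumptions, and the result can differ between kernel versions.

```python
import os
import tempfile

def supports_direct_io(directory: str) -> bool:
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    if not hasattr(os, "O_DIRECT"):
        # O_DIRECT is not exposed on this platform (it is Linux-specific).
        return False
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        dfd = os.open(path, os.O_RDWR | os.O_DIRECT)
    except OSError:
        # Typically EINVAL when the filesystem has no direct I/O support.
        return False
    else:
        os.close(dfd)
        return True
    finally:
        os.unlink(path)  # always remove the scratch file

if __name__ == "__main__":
    # Example paths only; replace them with the journal's mount point.
    for d in ("/dev/shm", "/var/tmp"):
        if os.path.isdir(d):
            print(d, supports_direct_io(d))
```

A successful open only shows that the flag is accepted; actual direct reads and writes still need properly aligned buffers and sizes.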



----- Original message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Sébastien Han" <han.sebastien@gmail.com>
Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>
Sent: Thursday, 22 November 2012 14:29:03
Subject: Re: RBD fio Performance concerns

On 22.11.2012 14:22, Sébastien Han wrote:
> And RAMDISK devices are too expensive. 
> 
> It would make sense in your infra, but yes they are really expensive. 

We need something like tmpfs - running in local memory but support dio. 

Stefan 

* Re: RBD fio Performance concerns
  2012-11-22 14:20                                     ` Alexandre DERUMIER
@ 2012-11-22 14:22                                       ` Stefan Priebe - Profihost AG
  2012-11-22 14:37                                         ` Mark Nelson
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 14:22 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Mark Nelson, ceph-devel, Mark Kampe, Sébastien Han

Hi,

Can someone from Inktank comment on this? Might using /dev/ram0 with a 
filesystem on it be better than tmpfs, since we could then use dio?

Greets,
Stefan

> ----- Original message -----
>
> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
> To: "Sébastien Han" <han.sebastien@gmail.com>
> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>
> Sent: Thursday, 22 November 2012 14:29:03
> Subject: Re: RBD fio Performance concerns
>
> On 22.11.2012 14:22, Sébastien Han wrote:
>> And RAMDISK devices are too expensive.
>>
>> It would make sense in your infra, but yes they are really expensive.
>
> We need something like tmpfs - running in local memory but support dio.
>
> Stefan
>

* Re: RBD fio Performance concerns
       [not found]                         ` <CAOLwVUmp7wrfead8qX2BZPbyeN_JY_XBN+wkEWmbY6q1-5u0fw@mail.gmail.com>
  2012-11-22 11:48                           ` Stefan Priebe - Profihost AG
@ 2012-11-22 14:34                           ` Mark Nelson
  1 sibling, 0 replies; 51+ messages in thread
From: Mark Nelson @ 2012-11-22 14:34 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Stefan Priebe - Profihost AG, Alexandre DERUMIER, ceph-devel,
	Mark Kampe

On 11/22/2012 04:49 AM, Sébastien Han wrote:
> @Alexandre: cool!
>
> @ Stefan: Full SSD cluster and 10G switches? Couple of weeks ago I saw
> that you use journal aio, did you notice performance improvement with it?
>
> @Mark Kampe
>
>  > If I read the above correctly, your random operations are 4K and your
>  > sequential operations are 4M.
>
>
> As you recommend. (see below what you previously said):
>
>  > If you want to do sequential I/O, you should do it buffered
>  > (so that the writes can be aggregated) or with a 4M block size
>  > (very efficient and avoiding object serialization).
>
>
>
>  > The block-size difference makes the random and sequential
>  > results incomparable.
>
>
> Ok let's do it again then (short output that fits on a screen), with a
> single OSD and 4K blocks:
>
> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
>
> seq-read: (groupid=0, jobs=1): err= 0: pid=7542
>    read : io=912716KB, bw=15210KB/s, iops=3802 , runt= 60009msec
> rand-read: (groupid=1, jobs=1): err= 0: pid=7546
>    read : io=980504KB, bw=16339KB/s, iops=4084 , runt= 60009msec
> seq-write: (groupid=2, jobs=1): err= 0: pid=7547
>    write: io=54216KB, bw=922718 B/s, iops=225 , runt= 60167msec
> rand-write: (groupid=3, jobs=1): err= 0: pid=7557
>    write: io=66116KB, bw=1098.5KB/s, iops=274 , runt= 60192msec
>
> Sequential and random operations are getting closer to each other, but
> random operations remain higher.
>
>  > the more data you send me, the longer it takes me to find
>  > time to review your results.  If you send me a message that
>  > fits on a single screen, I will try to answer it immediately.
>
>
> I just don't want to miss any information that you may find useful.
>
> @ Mark Nelson
>
> See below the new blkparse trace with 4K block for all operations:
>
> Inline image 1

Thanks for doing this!  Unfortunately it's only showing writes so we 
don't know what read behavior looks like in these graphs.  That might be 
important.  Also, do you know approximately how the tests line up with 
timestamps on the seekwatcher results?  Seekwatcher only seems to be 
going for 228 seconds, but the 4 tests should be lasting 240+ seconds?

If I just break this down into 60s chunks with seq-read, rand-read, 
seq-write, rand-write in that order, my wildly speculative guess is that 
there's a chunk of time missing at the beginning where the read tests 
are happening that extend out to about second ~80-85 in the graph. 
After that it's the write tests from ~85 out to ~205.  After that, I 
guess that there are no more incoming writes, but existing writes are 
being flushed and that with no incoming writes we see a bump in 
performance (possibly due to a reduction in lock contention? What 
version of ceph is this again?)
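
The arithmetic behind that guess can be laid out explicitly. This is only a sketch: the runtimes are the ones from the fio summary quoted above, while the 80-85 s and 205 s marks are the eyeballed estimates mentioned in the previous paragraph, not measurements.

```python
# Runtimes (ms) of the four fio jobs quoted above, in execution order.
runtimes_ms = {
    "seq-read": 60009,
    "rand-read": 60009,
    "seq-write": 60167,
    "rand-write": 60192,
}
TRACE_LENGTH_S = 228    # length of the seekwatcher graph
READS_END_S = (80, 85)  # estimated point in the graph where the reads end
WRITES_END_S = 205      # estimated point where incoming writes stop

total_s = sum(runtimes_ms.values()) / 1000.0
read_s = (runtimes_ms["seq-read"] + runtimes_ms["rand-read"]) / 1000.0
write_s = (runtimes_ms["seq-write"] + runtimes_ms["rand-write"]) / 1000.0

print(f"fio ran for {total_s:.1f} s, the trace covers {TRACE_LENGTH_S} s")
for end in READS_END_S:
    # If the reads end at `end` seconds, the start of the read phase is missing.
    print(f"reads end at {end} s: {read_s - end:.1f} s of reads missing, "
          f"writes would end near {end + write_s:.1f} s")
print(f"tail after the writes stop: {TRACE_LENGTH_S - WRITES_END_S} s")
```

With the reads ending near 85 s, roughly 120 s of writes would run out to about 205 s, which is why the remaining tail of the graph looks like flushing rather than incoming writes.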

A couple of thoughts:

- You may want to pause between the tests for a while (or even reformat 
between every test!)

- That spike in performance at the end is interesting.  I'd really like 
to know if that happened during the rand-write test or after the test 
completed (once the data hit the journal the writes would have been 
acknowledged, letting the test end while the data was still being flushed 
out to disk).

- If my interpretation is right, it looks like the typical seq-write 
throughput is slightly higher according to blktrace but with regular 
dips, while the random write performance is typically lower but with no 
dips (and maybe a big increase at the end).  In both cases we have a 
very high number of seeks! Do you have WB cache on your controller? 
These are 10K RPM drives?

- Read behavior would be really useful!

- You can pretty clearly see the different allocation groups (AGs) in XFS 
doing their thing.  I wonder if 1 AG would be better here.  On the other 
hand, there's a pretty long thread that discusses IOPS-heavy workloads on 
XFS here:

http://xfs.9218.n7.nabble.com/XFS-Abysmal-write-performance-because-of-excessive-seeking-allocation-groups-to-blame-td15501.html

- It would be very interesting to try this test on EXT4 or BTRFS and see 
if the results are the same.  I forget, did someone already do this?

Mark

> Thanks again everyone, for your help.
>
> Cheers!
>
> On Thu, Nov 22, 2012 at 11:19 AM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>  >
>  >
>  > Same to me:
>  > rand 4k: 23.000 iops
>  > seq 4k: 13.000 iops
>  >
>  > Even in writeback mode where normally seq 4k should be merged into
>  > bigger requests.
>  >
>  > Stefan
>  >
>  > On 21.11.2012 17:34, Mark Nelson wrote:
>  >
>  >> Responding to my own message. :)
>  >>
>  >> Talked to Sage a bit offline about this.  I think there are two opposing
>  >> forces:
>  >>
>  >> On one hand, random IO may be spreading reads/writes out across more
>  >> OSDs than sequential IO that presumably would be hitting a single OSD
>  >> more regularly.
>  >>
>  >> On the other hand, you'd expect that sequential writes would be getting
>  >> coalesced either at the RBD layer or on the OSD, and that the
>  >> drive/controller/filesystem underneath the OSD would be doing some kind
>  >> of readahead or prefetching.
>  >>
>  >> On the third hand, maybe coalescing/prefetching is in fact happening but
>  >> we are IOP limited by some per-osd limitation.
>  >>
>  >> It could be interesting to do the test with a single OSD and see what
>  >> happens.
>  >>
>  >> Mark
>  >>
>  >> On 11/21/2012 09:52 AM, Mark Nelson wrote:
>  >>>
>  >>> Hi Guys,
>  >>>
>  >>> I'm late to this thread but thought I'd chime in.  Crazy that you are
>  >>> getting higher performance with random reads/writes vs sequential!  It
>  >>> would be interesting to see what kind of throughput smalliobench reports
>  >>> (should be packaged in bobtail) and also see if this behavior happens
>  >>> with cephfs.  It's still too early in the morning for me right now to
>  >>> come up with a reasonable explanation for what's going on.  It might be
>  >>> worth running blktrace and seekwatcher to see what the io patterns on
>  >>> the underlying disk look like in each case.  Maybe something unexpected
>  >>> is going on.
>  >>>
>  >>> Mark
>  >>>
>  >>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>  >>>>
>  >>>> Which iodepth did you use for those benchs?
>  >>>>
>  >>>>
>  >>>>> I really don't understand why I can't get more rand read iops with 4K
>  >>>>> block ...
>  >>>>
>  >>>>
>  >>>> Me neither, hope to get some clarification from the Inktank guys. It
>  >>>> doesn't make any sense to me...
>  >>>> --
>  >>>> Bien cordialement.
>  >>>> Sébastien HAN.
>  >>>>
>  >>>>
>  >>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
>  >>>> <aderumier@odiso.com> wrote:
>  >>>>>>>
>  >>>>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>  >>>>>>> with seq?
>  >>>>>
>  >>>>>
>  >>>>> rand read 4K : 6000 iops
>  >>>>> seq read 4K : 3500 iops
>  >>>>> seq read 4M : 31 iops (1 Gbit client bandwidth limit)
>  >>>>>
>  >>>>> rand write 4K : 6000 iops (tmpfs journal)
>  >>>>> seq write 4K : 1600 iops
>  >>>>> seq write 4M : 31 iops (1 Gbit client bandwidth limit)
>  >>>>>
>  >>>>>
>  >>>>> I really don't understand why I can't get more rand read iops with 4K
>  >>>>> block ...
>  >>>>>
>  >>>>> I tried a high-end CPU for the client; it doesn't change anything.
>  >>>>> But the test cluster uses old 8-core E5420 @ 2.50GHz CPUs (though CPU
>  >>>>> usage is around 15% on the cluster during the read bench)
>  >>>>>
>  >>>>>
>  >>>>> ----- Original message -----
>  >>>>>
>  >>>>> From: "Sébastien Han" <han.sebastien@gmail.com>
>  >>>>> To: "Mark Kampe" <mark.kampe@inktank.com>
>  >>>>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
>  >>>>> <ceph-devel@vger.kernel.org>
>  >>>>> Sent: Monday, 19 November 2012 19:03:40
>  >>>>> Subject: Re: RBD fio Performance concerns
>  >>>>>
>  >>>>> @Sage, thanks for the info :)
>  >>>>> @Mark:
>  >>>>>
>  >>>>>> If you want to do sequential I/O, you should do it buffered
>  >>>>>> (so that the writes can be aggregated) or with a 4M block size
>  >>>>>> (very efficient and avoiding object serialization).
>  >>>>>
>  >>>>>
>  >>>>> The original benchmark has been performed with 4M block size. And as
>  >>>>> you can see I still get more IOPS with rand than seq... I just tried
>  >>>>> with 4M without direct I/O, still the same. I can print fio results if
>  >>>>> it's needed.
>  >>>>>
>  >>>>>> We do direct writes for benchmarking, not because it is a reasonable
>  >>>>>> way to do I/O, but because it bypasses the buffer cache and enables
>  >>>>>> us to directly measure cluster I/O throughput (which is what we are
>  >>>>>> trying to optimize). Applications should usually do buffered I/O,
>  >>>>>> to get the (very significant) benefits of caching and write
>  >>>>>> aggregation.
>  >>>>>
>  >>>>>
>  >>>>> I know why I use direct I/O. It's synthetic benchmarks, it's far away
>  >>>>> from a real life scenario and how common applications works. I just
>  >>>>> try to see the maximum I/O throughput that I can get from my RBD. All
>  >>>>> my applications use buffered I/O.
>  >>>>>
>  >>>>> @Alexandre: is it the same for you? or do you always get more IOPS
>  >>>>> with seq?
>  >>>>>
>  >>>>> Thanks to all of you..
>  >>>>>
>  >>>>>
>  >>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe <mark.kampe@inktank.com>
>  >>>>> wrote:
>  >>>>>>
>  >>>>>> Recall:
>  >>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>  >>>>>> 2. distinct writes to a single RADOS object are serialized
>  >>>>>>
>  >>>>>> Your sequential 4K writes are direct, depth=256, so there are
>  >>>>>> (at all times) 256 writes queued to the same object. All of
>  >>>>>> your writes are waiting through a very long line, which is adding
>  >>>>>> horrendous latency.
>  >>>>>>
>  >>>>>> If you want to do sequential I/O, you should do it buffered
>  >>>>>> (so that the writes can be aggregated) or with a 4M block size
>  >>>>>> (very efficient and avoiding object serialization).
>  >>>>>>
>  >>>>>> We do direct writes for benchmarking, not because it is a reasonable
>  >>>>>> way to do I/O, but because it bypasses the buffer cache and enables
>  >>>>>> us to directly measure cluster I/O throughput (which is what we are
>  >>>>>> trying to optimize). Applications should usually do buffered I/O,
>  >>>>>> to get the (very significant) benefits of caching and write
>  >>>>>> aggregation.
>  >>>>>>
>  >>>>>>
>  >>>>>>> That's correct for some of the benchmarks. However even with 4K for
>  >>>>>>> seq, I still get less IOPS. See below my last fio:
>  >>>>>>>
>  >>>>>>> # fio rbd-bench.fio
>  >>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio,
>  >>>>>>> iodepth=256
>  >>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
>  >>>>>>> iodepth=256
>  >>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio,
>  >>>>>>> iodepth=256
>  >>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
>  >>>>>>> iodepth=256
>  >>>>>>> fio 1.59
>  >>>>>>> Starting 4 processes
>  >>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
>  >>>>>>> 02m:59s]
>  >>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>  >>>>>>> read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>  >>>>>>> slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>  >>>>>>> clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>  >>>>>>> lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>  >>>>>>> bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
>  >>>>>>> cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>  >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>  >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>  >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>  >>>>>>> issued r/w/d: total=200473/0/0, short=0/0/0
>  >>>>>>>
>  >>>>>>> lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>  >>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>  >>>>>>> read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>  >>>>>>> slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>  >>>>>>> clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>  >>>>>>> lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>  >>>>>>> bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
>  >>>>>>> cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>  >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>  >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>  >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>  >>>>>>> issued r/w/d: total=1632349/0/0, short=0/0/0
>  >>>>>>>
>  >>>>>>> lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>  >>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>  >>>>>>> write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>  >>>>>>> slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>  >>>>>>> clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>  >>>>>>> lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>  >>>>>>> bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
>  >>>>>>> cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>  >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>  >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>  >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>  >>>>>>> issued r/w/d: total=0/11171/0, short=0/0/0
>  >>>>>>>
>  >>>>>>> lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>  >>>>>>> lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>  >>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>  >>>>>>> write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>  >>>>>>> slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>  >>>>>>> clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>  >>>>>>> lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>  >>>>>>> bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
>  >>>>>>> cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>  >>>>>>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>  >>>>>>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>  >>>>>>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>  >>>>>>> issued r/w/d: total=0/52147/0, short=0/0/0
>  >>>>>>>
>  >>>>>>> lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>  >>>>>>> lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>  >>>>>>>
>  >>>>>>> Run status group 0 (all jobs):
>  >>>>>>> READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
>  >>>>>>> mint=60053msec, maxt=60053msec
>  >>>>>>>
>  >>>>>>> Run status group 1 (all jobs):
>  >>>>>>> READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
>  >>>>>>> maxb=111425KB/s, mint=60005msec, maxt=60005msec
>  >>>>>>>
>  >>>>>>> Run status group 2 (all jobs):
>  >>>>>>> WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s,
>  >>>>>>> mint=60725msec, maxt=60725msec
>  >>>>>>>
>  >>>>>>> Run status group 3 (all jobs):
>  >>>>>>> WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s,
>  >>>>>>> mint=60822msec, maxt=60822msec
>  >>>>>>>
>  >>>>>>> Disk stats (read/write):
>  >>>>>>> rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132,
>  >>>>>>> in_queue=33434120, util=99.79%


-- 
Mark Nelson
Performance Engineer
Inktank

* Re: RBD fio Performance concerns
  2012-11-22 14:22                                       ` Stefan Priebe - Profihost AG
@ 2012-11-22 14:37                                         ` Mark Nelson
  2012-11-22 14:42                                           ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 51+ messages in thread
From: Mark Nelson @ 2012-11-22 14:37 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

I don't think we recommend tmpfs at all for anything other than playing 
around. :)

On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
> Hi,
>
> can someone from inktank comment this? Might be using /dev/ram0 with an
> fs on it be better than tmpfs as we can use dio?
>
> Greets,
> Stefan
>
>> ----- Original message -----
>>
>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>> To: "Sébastien Han" <han.sebastien@gmail.com>
>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>> "Mark Kampe" <mark.kampe@inktank.com>
>> Sent: Thursday, 22 November 2012 14:29:03
>> Subject: Re: RBD fio Performance concerns
>>
>> On 22.11.2012 14:22, Sébastien Han wrote:
>>> And RAMDISK devices are too expensive.
>>>
>>> It would make sense in your infra, but yes they are really expensive.
>>
>> We need something like tmpfs - running in local memory but support dio.
>>
>> Stefan
>>


-- 
Mark Nelson
Performance Engineer
Inktank

* Re: RBD fio Performance concerns
  2012-11-22 14:37                                         ` Mark Nelson
@ 2012-11-22 14:42                                           ` Stefan Priebe - Profihost AG
  2012-11-22 14:46                                             ` Mark Nelson
  2012-11-22 14:52                                             ` Alexandre DERUMIER
  0 siblings, 2 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 14:42 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

On 22.11.2012 15:37, Mark Nelson wrote:
> I don't think we recommend tmpfs at all for anything other than playing
> around. :)

I discussed this with somebody from Inktank; I would have to search the 
mailing list for it. It might be OK if you're working with enough replicas 
and a UPS.

I see no other option when working with SSDs. The only alternative would be 
the ability to deactivate the journal altogether, but Ceph does not support this.

Stefan

> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
>> Hi,
>>
>> can someone from inktank comment this? Might be using /dev/ram0 with an
>> fs on it be better than tmpfs as we can use dio?
>>
>> Greets,
>> Stefan
>>
>>> ----- Mail original -----
>>>
>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>> À: "Sébastien Han" <han.sebastien@gmail.com>
>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>>> "Mark Kampe" <mark.kampe@inktank.com>
>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03
>>> Objet: Re: RBD fio Performance concerns
>>>
>>> Am 22.11.2012 14:22, schrieb Sébastien Han:
>>>> And RAMDISK devices are too expensive.
>>>>
>>>> It would make sense in your infra, but yes they are really expensive.
>>>
>>> We need something like tmpfs - running in local memory but support dio.
>>>
>>> Stefan
>>>
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 14:42                                           ` Stefan Priebe - Profihost AG
@ 2012-11-22 14:46                                             ` Mark Nelson
  2012-11-22 15:01                                               ` Stefan Priebe - Profihost AG
  2012-11-22 14:52                                             ` Alexandre DERUMIER
  1 sibling, 1 reply; 51+ messages in thread
From: Mark Nelson @ 2012-11-22 14:46 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

I haven't played a whole lot with SSD-only OSDs yet (other than noting
last summer that IOPS performance wasn't as high as I wanted it).  Is a
second partition on the SSD for the journal not an option for you?

Mark

On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:
> Am 22.11.2012 15:37, schrieb Mark Nelson:
>> I don't think we recommend tmpfs at all for anything other than playing
>> around. :)
>
> I discussed this with somebody frmo inktank. Had to search the
> mailinglist. It might be OK if you're working with enough replicas and UPS.
>
> I see no other option while working with SSDs - the only Option would be
> to be able to deaktivate the journal at all. But ceph does not support
> this.
>
> Stefan
>
>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
>>> Hi,
>>>
>>> can someone from inktank comment this? Might be using /dev/ram0 with an
>>> fs on it be better than tmpfs as we can use dio?
>>>
>>> Greets,
>>> Stefan
>>>
>>>> ----- Mail original -----
>>>>
>>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>>> À: "Sébastien Han" <han.sebastien@gmail.com>
>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>> "Mark Kampe" <mark.kampe@inktank.com>
>>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03
>>>> Objet: Re: RBD fio Performance concerns
>>>>
>>>> Am 22.11.2012 14:22, schrieb Sébastien Han:
>>>>> And RAMDISK devices are too expensive.
>>>>>
>>>>> It would make sense in your infra, but yes they are really expensive.
>>>>
>>>> We need something like tmpfs - running in local memory but support dio.
>>>>
>>>> Stefan
>>>>
>>
>>


-- 
Mark Nelson
Performance Engineer
Inktank

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 14:42                                           ` Stefan Priebe - Profihost AG
  2012-11-22 14:46                                             ` Mark Nelson
@ 2012-11-22 14:52                                             ` Alexandre DERUMIER
  2012-11-22 15:00                                               ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-22 14:52 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

>>I discussed this with somebody frmo inktank. Had to search the
>>mailinglist. It might be OK if you're working with enough replicas and UPS.
>>
>>I see no other option while working with SSDs - the only Option would be
>>to be able to deaktivate the journal at all. But ceph does not support this.

Do you see a big difference when putting one journal per OSD on each SSD drive?


Another alternative (though indeed costly) could be:

- STEC ZeusRAM SSD drive, around $2000 for 8 GB (I have benchmarked it at around 100,000 IOPS ;)
- DDRdrive (http://www.ddrdrive.com/) (around 200,000 IOPS, don't know the price)
- Fusion-io card (ioDrive2, 360 GB, around 3000€, but there is a 160 GB model at maybe half the price)
- maybe OCZ Talos: around 600€ for the OCZ Talos 2 R 200 GB (I haven't benchmarked it, but the spec says around 35,000 random IOPS)





----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
Sent: Thursday, 22 November 2012 15:42:14
Subject: Re: RBD fio Performance concerns

On 22.11.2012 15:37, Mark Nelson wrote:
> I don't think we recommend tmpfs at all for anything other than playing 
> around. :) 

I discussed this with somebody from Inktank. Had to search the
mailing list. It might be OK if you're working with enough replicas and a UPS.

I see no other option while working with SSDs - the only alternative would be
the ability to deactivate the journal entirely. But Ceph does not support this.

Stefan 

> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
>> Hi, 
>> 
>> can someone from inktank comment this? Might be using /dev/ram0 with an 
>> fs on it be better than tmpfs as we can use dio? 
>> 
>> Greets, 
>> Stefan 
>> 
>>> ----- Mail original ----- 
>>> 
>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
>>> À: "Sébastien Han" <han.sebastien@gmail.com> 
>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER" 
>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, 
>>> "Mark Kampe" <mark.kampe@inktank.com> 
>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
>>> Objet: Re: RBD fio Performance concerns 
>>> 
>>> Am 22.11.2012 14:22, schrieb Sébastien Han: 
>>>> And RAMDISK devices are too expensive. 
>>>> 
>>>> It would make sense in your infra, but yes they are really expensive. 
>>> 
>>> We need something like tmpfs - running in local memory but support dio. 
>>> 
>>> Stefan 
>>> 
> 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 14:52                                             ` Alexandre DERUMIER
@ 2012-11-22 15:00                                               ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 15:00 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

On 22.11.2012 15:52, Alexandre DERUMIER wrote:
>>> I discussed this with somebody frmo inktank. Had to search the
>>> mailinglist. It might be OK if you're working with enough replicas and UPS.
>>>
>>> I see no other option while working with SSDs - the only Option would be
>>> to be able to deaktivate the journal at all. But ceph does not support this.
>
> Do you have a big difference with putting 1 journal by osd on each ssd drive ?

Not tested.

> another alternative can be (but indeed costly):
>
> - stec zeus ram ssd drive, around 2000$ for 8G (I have benched it around 100000 iops ;)
> - ddrdrive (http://www.ddrdrive.com/) (around 200000iops, don't know the price)
> - fusionio card (iodrive2, 360GB, around 3000€ , but they are 160GB model, maybe half the price)

All too expensive.

> - maybe ocz talos, around 600€ for OCZ Talos 2 R 200 Go (don't have benched them, but spec say around 35000iops random)
Not usable, as each OSD can do 35,000 random IOPS in my case and I have 8
of them in each node...
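
A back-of-envelope sketch of that objection, using only the figures quoted in this thread. It assumes (my reading, not stated explicitly) that one such journal device would be shared by all 8 OSDs in a node; the function and constant names are made up for illustration.

```python
# Figures quoted in this thread; nothing here is measured by this script.
OSD_RANDOM_IOPS = 35_000      # per-OSD random IOPS reported above
OSDS_PER_NODE = 8             # OSDs in each node
JOURNAL_DEVICE_IOPS = 35_000  # OCZ Talos 2 R spec figure (not benchmarked)


def journal_headroom(osd_iops, osds, journal_iops):
    """Return (aggregate OSD IOPS, fraction one journal device could absorb).

    Assumption: a single journal device is shared by every OSD in the node.
    """
    aggregate = osd_iops * osds
    return aggregate, journal_iops / aggregate


aggregate, fraction = journal_headroom(
    OSD_RANDOM_IOPS, OSDS_PER_NODE, JOURNAL_DEVICE_IOPS
)
print(f"aggregate OSD IOPS per node: {aggregate}")
print(f"share a single journal device could cover: {fraction:.1%}")
```

Under that assumption the journal device, not the OSDs, would set the ceiling for the node.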

Stefan

> ----- Mail original -----
>
> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
> À: "Mark Nelson" <mark.nelson@inktank.com>
> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
> Envoyé: Jeudi 22 Novembre 2012 15:42:14
> Objet: Re: RBD fio Performance concerns
>
> Am 22.11.2012 15:37, schrieb Mark Nelson:
>> I don't think we recommend tmpfs at all for anything other than playing
>> around. :)
>
> I discussed this with somebody frmo inktank. Had to search the
> mailinglist. It might be OK if you're working with enough replicas and UPS.
>
> I see no other option while working with SSDs - the only Option would be
> to be able to deaktivate the journal at all. But ceph does not support this.
>
> Stefan
>
>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
>>> Hi,
>>>
>>> can someone from inktank comment this? Might be using /dev/ram0 with an
>>> fs on it be better than tmpfs as we can use dio?
>>>
>>> Greets,
>>> Stefan
>>>
>>>> ----- Mail original -----
>>>>
>>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>>> À: "Sébastien Han" <han.sebastien@gmail.com>
>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>> "Mark Kampe" <mark.kampe@inktank.com>
>>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03
>>>> Objet: Re: RBD fio Performance concerns
>>>>
>>>> Am 22.11.2012 14:22, schrieb Sébastien Han:
>>>>> And RAMDISK devices are too expensive.
>>>>>
>>>>> It would make sense in your infra, but yes they are really expensive.
>>>>
>>>> We need something like tmpfs - running in local memory but support dio.
>>>>
>>>> Stefan
>>>>
>>
>>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 14:46                                             ` Mark Nelson
@ 2012-11-22 15:01                                               ` Stefan Priebe - Profihost AG
  2012-11-22 15:26                                                 ` Alexandre DERUMIER
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 15:01 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

On 22.11.2012 15:46, Mark Nelson wrote:
> I haven't played a whole lot with SSD only OSDs yet (other than noting
> last summer that iop performance wasn't as high as I wanted it).  Is a
> second partition on the SSD for the journal not an option for you?

Haven't tested that. But does this make sense? I mean, data goes to the
journal on the disk - the same disk then has to copy the data from
partition A to partition B.

Why is this an advantage?
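
To put a rough number on that concern, here is a minimal sketch. It assumes, purely for illustration, that with the journal on the same device every client write costs two device writes (journal, then data) and that nothing is coalesced; real behaviour would have to be benchmarked. The function name is hypothetical.

```python
def effective_write_iops(device_iops: float, journal_on_same_device: bool) -> float:
    """Rough client-visible write IOPS for one OSD.

    Assumption (not measured): a co-located journal doubles the number of
    device writes per client write, so the usable rate is halved.
    """
    writes_per_client_write = 2 if journal_on_same_device else 1
    return device_iops / writes_per_client_write


# 35,000 is the per-OSD random IOPS figure quoted in this thread.
print(effective_write_iops(35_000, journal_on_same_device=True))
print(effective_write_iops(35_000, journal_on_same_device=False))
```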

Stefan

> Mark
>
> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:
>> Am 22.11.2012 15:37, schrieb Mark Nelson:
>>> I don't think we recommend tmpfs at all for anything other than playing
>>> around. :)
>>
>> I discussed this with somebody frmo inktank. Had to search the
>> mailinglist. It might be OK if you're working with enough replicas and
>> UPS.
>>
>> I see no other option while working with SSDs - the only Option would be
>> to be able to deaktivate the journal at all. But ceph does not support
>> this.
>>
>> Stefan
>>
>>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
>>>> Hi,
>>>>
>>>> can someone from inktank comment this? Might be using /dev/ram0 with an
>>>> fs on it be better than tmpfs as we can use dio?
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>>> ----- Mail original -----
>>>>>
>>>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>>>> À: "Sébastien Han" <han.sebastien@gmail.com>
>>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>>> "Mark Kampe" <mark.kampe@inktank.com>
>>>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03
>>>>> Objet: Re: RBD fio Performance concerns
>>>>>
>>>>> Am 22.11.2012 14:22, schrieb Sébastien Han:
>>>>>> And RAMDISK devices are too expensive.
>>>>>>
>>>>>> It would make sense in your infra, but yes they are really expensive.
>>>>>
>>>>> We need something like tmpfs - running in local memory but support
>>>>> dio.
>>>>>
>>>>> Stefan
>>>>>
>>>
>>>
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:01                                               ` Stefan Priebe - Profihost AG
@ 2012-11-22 15:26                                                 ` Alexandre DERUMIER
  2012-11-22 15:28                                                   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-22 15:26 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

>>Haven't tested that. But does this makes sense? I mean data goes to Disk 
>>journal - same disk then has to copy the Data from part A to part B. 
>>
>>Why is this an advantage? 

Well, if you are CPU-limited, I don't think you can use all 8 * 35,000 IOPS per node.
So maybe a benchmark can tell us whether the difference is really big.

Using tmpfs and a UPS can be OK, but if you have a kernel panic or a hardware problem, you'll lose your journal.
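
A toy model of what losing the journal means on a single node. This is not Ceph code: the class and method names are invented, and it ignores replication entirely, so it does not settle whether the loss matters once other replicas exist.

```python
class ToyOSD:
    """Toy journaled store: writes are acknowledged once journaled."""

    def __init__(self, journal_survives_crash: bool):
        # False models a journal on tmpfs, True a journal on a persistent device.
        self.journal_survives_crash = journal_survives_crash
        self.journal = []  # acknowledged, not yet on the data disk
        self.store = []    # flushed to the data disk

    def write(self, item):
        self.journal.append(item)

    def flush(self):
        self.store.extend(self.journal)
        self.journal.clear()

    def crash_and_restart(self):
        if not self.journal_survives_crash:
            self.journal.clear()  # tmpfs contents are gone after a reboot
        self.flush()              # replay whatever journal is left
        return list(self.store)


for survives in (True, False):
    osd = ToyOSD(journal_survives_crash=survives)
    osd.write("a")
    osd.flush()
    osd.write("b")  # journaled but not yet flushed when the node dies
    print(survives, osd.crash_and_restart())
```

In the volatile case the acknowledged write "b" is missing locally after the restart, so it would have to be recovered from another replica.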



----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
Sent: Thursday, 22 November 2012 16:01:56
Subject: Re: RBD fio Performance concerns

On 22.11.2012 15:46, Mark Nelson wrote:
> I haven't played a whole lot with SSD only OSDs yet (other than noting 
> last summer that iop performance wasn't as high as I wanted it). Is a 
> second partition on the SSD for the journal not an option for you? 

Haven't tested that. But does this make sense? I mean, data goes to the
journal on the disk - the same disk then has to copy the data from
partition A to partition B.

Why is this an advantage?

Stefan 

> Mark 
> 
> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote: 
>> Am 22.11.2012 15:37, schrieb Mark Nelson: 
>>> I don't think we recommend tmpfs at all for anything other than playing 
>>> around. :) 
>> 
>> I discussed this with somebody frmo inktank. Had to search the 
>> mailinglist. It might be OK if you're working with enough replicas and 
>> UPS. 
>> 
>> I see no other option while working with SSDs - the only Option would be 
>> to be able to deaktivate the journal at all. But ceph does not support 
>> this. 
>> 
>> Stefan 
>> 
>>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
>>>> Hi, 
>>>> 
>>>> can someone from inktank comment this? Might be using /dev/ram0 with an 
>>>> fs on it be better than tmpfs as we can use dio? 
>>>> 
>>>> Greets, 
>>>> Stefan 
>>>> 
>>>>> ----- Mail original ----- 
>>>>> 
>>>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
>>>>> À: "Sébastien Han" <han.sebastien@gmail.com> 
>>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER" 
>>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, 
>>>>> "Mark Kampe" <mark.kampe@inktank.com> 
>>>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
>>>>> Objet: Re: RBD fio Performance concerns 
>>>>> 
>>>>> Am 22.11.2012 14:22, schrieb Sébastien Han: 
>>>>>> And RAMDISK devices are too expensive. 
>>>>>> 
>>>>>> It would make sense in your infra, but yes they are really expensive. 
>>>>> 
>>>>> We need something like tmpfs - running in local memory but support 
>>>>> dio. 
>>>>> 
>>>>> Stefan 
>>>>> 
>>> 
>>> 
> 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:26                                                 ` Alexandre DERUMIER
@ 2012-11-22 15:28                                                   ` Stefan Priebe - Profihost AG
  2012-11-22 15:35                                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 15:28 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

On 22.11.2012 16:26, Alexandre DERUMIER wrote:
>>> Haven't tested that. But does this makes sense? I mean data goes to Disk
>>> journal - same disk then has to copy the Data from part A to part B.
>>>
>>> Why is this an advantage?
>
> Well, if you are cpu limited, I don't think you can use all 8*35000iops by node.
> So, maybe a benchmark can tell us if the difference is really big.
>
> Using tmpfs and ups can be ok, but if you have a kernel panic or hardware problem, you'll lost your journal.

But who cares? It's also on the 2nd node, or even on the 3rd if you have
3 replicas.

Stefan


> ----- Mail original -----
>
> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
> À: "Mark Nelson" <mark.nelson@inktank.com>
> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
> Envoyé: Jeudi 22 Novembre 2012 16:01:56
> Objet: Re: RBD fio Performance concerns
>
> Am 22.11.2012 15:46, schrieb Mark Nelson:
>> I haven't played a whole lot with SSD only OSDs yet (other than noting
>> last summer that iop performance wasn't as high as I wanted it). Is a
>> second partition on the SSD for the journal not an option for you?
>
> Haven't tested that. But does this makes sense? I mean data goes to Disk
> journal - same disk then has to copy the Data from part A to part B.
>
> Why is this an advantage?
>
> Stefan
>
>> Mark
>>
>> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:
>>> Am 22.11.2012 15:37, schrieb Mark Nelson:
>>>> I don't think we recommend tmpfs at all for anything other than playing
>>>> around. :)
>>>
>>> I discussed this with somebody frmo inktank. Had to search the
>>> mailinglist. It might be OK if you're working with enough replicas and
>>> UPS.
>>>
>>> I see no other option while working with SSDs - the only Option would be
>>> to be able to deaktivate the journal at all. But ceph does not support
>>> this.
>>>
>>> Stefan
>>>
>>>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi,
>>>>>
>>>>> can someone from inktank comment this? Might be using /dev/ram0 with an
>>>>> fs on it be better than tmpfs as we can use dio?
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>>> ----- Mail original -----
>>>>>>
>>>>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>>>>> À: "Sébastien Han" <han.sebastien@gmail.com>
>>>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>>>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>>>> "Mark Kampe" <mark.kampe@inktank.com>
>>>>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03
>>>>>> Objet: Re: RBD fio Performance concerns
>>>>>>
>>>>>> Am 22.11.2012 14:22, schrieb Sébastien Han:
>>>>>>> And RAMDISK devices are too expensive.
>>>>>>>
>>>>>>> It would make sense in your infra, but yes they are really expensive.
>>>>>>
>>>>>> We need something like tmpfs - running in local memory but support
>>>>>> dio.
>>>>>>
>>>>>> Stefan
>>>>>>
>>>>
>>>>
>>
>>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:28                                                   ` Stefan Priebe - Profihost AG
@ 2012-11-22 15:35                                                     ` Alexandre DERUMIER
  2012-11-22 15:49                                                       ` Sébastien Han
  2012-11-22 15:59                                                       ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-22 15:35 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

>>But who cares? it's also on the 2nd node. or even on the 3rd if you have 
>>replicas 3. 
Yes, but rebuilding a dead node uses CPU and I/O. (It should be benchmarked too, to see the impact on production.)



----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>, "Mark Nelson" <mark.nelson@inktank.com>
Sent: Thursday, 22 November 2012 16:28:57
Subject: Re: RBD fio Performance concerns

On 22.11.2012 16:26, Alexandre DERUMIER wrote:
>>> Haven't tested that. But does this makes sense? I mean data goes to Disk 
>>> journal - same disk then has to copy the Data from part A to part B. 
>>> 
>>> Why is this an advantage? 
> 
> Well, if you are cpu limited, I don't think you can use all 8*35000iops by node. 
> So, maybe a benchmark can tell us if the difference is really big. 
> 
> Using tmpfs and ups can be ok, but if you have a kernel panic or hardware problem, you'll lost your journal. 

But who cares? It's also on the 2nd node, or even on the 3rd if you have
3 replicas.

Stefan 


> ----- Mail original ----- 
> 
> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
> À: "Mark Nelson" <mark.nelson@inktank.com> 
> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com> 
> Envoyé: Jeudi 22 Novembre 2012 16:01:56 
> Objet: Re: RBD fio Performance concerns 
> 
> Am 22.11.2012 15:46, schrieb Mark Nelson: 
>> I haven't played a whole lot with SSD only OSDs yet (other than noting 
>> last summer that iop performance wasn't as high as I wanted it). Is a 
>> second partition on the SSD for the journal not an option for you? 
> 
> Haven't tested that. But does this makes sense? I mean data goes to Disk 
> journal - same disk then has to copy the Data from part A to part B. 
> 
> Why is this an advantage? 
> 
> Stefan 
> 
>> Mark 
>> 
>> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote: 
>>> Am 22.11.2012 15:37, schrieb Mark Nelson: 
>>>> I don't think we recommend tmpfs at all for anything other than playing 
>>>> around. :) 
>>> 
>>> I discussed this with somebody frmo inktank. Had to search the 
>>> mailinglist. It might be OK if you're working with enough replicas and 
>>> UPS. 
>>> 
>>> I see no other option while working with SSDs - the only Option would be 
>>> to be able to deaktivate the journal at all. But ceph does not support 
>>> this. 
>>> 
>>> Stefan 
>>> 
>>>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
>>>>> Hi, 
>>>>> 
>>>>> can someone from inktank comment this? Might be using /dev/ram0 with an 
>>>>> fs on it be better than tmpfs as we can use dio? 
>>>>> 
>>>>> Greets, 
>>>>> Stefan 
>>>>> 
>>>>>> ----- Mail original ----- 
>>>>>> 
>>>>>> De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
>>>>>> À: "Sébastien Han" <han.sebastien@gmail.com> 
>>>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER" 
>>>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, 
>>>>>> "Mark Kampe" <mark.kampe@inktank.com> 
>>>>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
>>>>>> Objet: Re: RBD fio Performance concerns 
>>>>>> 
>>>>>> Am 22.11.2012 14:22, schrieb Sébastien Han: 
>>>>>>> And RAMDISK devices are too expensive. 
>>>>>>> 
>>>>>>> It would make sense in your infra, but yes they are really expensive. 
>>>>>> 
>>>>>> We need something like tmpfs - running in local memory but support 
>>>>>> dio. 
>>>>>> 
>>>>>> Stefan 
>>>>>> 
>>>> 
>>>> 
>> 
>> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:35                                                     ` Alexandre DERUMIER
@ 2012-11-22 15:49                                                       ` Sébastien Han
  2012-11-22 15:54                                                         ` Stefan Priebe - Profihost AG
  2012-11-22 15:59                                                       ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 51+ messages in thread
From: Sébastien Han @ 2012-11-22 15:49 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Stefan Priebe - Profihost AG, ceph-devel, Mark Kampe, Mark Nelson

>>But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>replicas 3.

Yes but you could also suffer a crash while writing the first replica.
If the journal is in tmpfs, there is nothing to replay.



On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>
> >>But who cares? it's also on the 2nd node. or even on the 3rd if you have
> >>replicas 3.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:49                                                       ` Sébastien Han
@ 2012-11-22 15:54                                                         ` Stefan Priebe - Profihost AG
  2012-11-22 15:55                                                           ` Sébastien Han
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 15:54 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Mark Nelson

I thought the client would then write to the 2nd replica. Is this wrong?

Stefan

On 22.11.2012 at 16:49, Sébastien Han <han.sebastien@gmail.com> wrote:

>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>> replicas 3.
> 
> Yes but you could also suffer a crash while writing the first replica.
> If the journal is in tmpfs, there is nothing to replay.
> 
> 
> 
> On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>> 
>>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>>> replicas 3.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:54                                                         ` Stefan Priebe - Profihost AG
@ 2012-11-22 15:55                                                           ` Sébastien Han
  2012-11-22 15:57                                                             ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 51+ messages in thread
From: Sébastien Han @ 2012-11-22 15:55 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Mark Nelson

Hum sorry, you're right. Forget about what I said :)


On Thu, Nov 22, 2012 at 4:54 PM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> I thought the Client would then write to the 2nd is this wrong?
>
> Stefan
>
> Am 22.11.2012 um 16:49 schrieb Sébastien Han <han.sebastien@gmail.com>:
>
>>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>>> replicas 3.
>>
>> Yes but you could also suffer a crash while writing the first replica.
>> If the journal is in tmpfs, there is nothing to replay.
>>
>>
>>
>> On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>
>>>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>>>> replicas 3.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:55                                                           ` Sébastien Han
@ 2012-11-22 15:57                                                             ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 15:57 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Mark Nelson

Otherwise you would have the same problem when a disk crashes.

On 22.11.2012 at 16:55, Sébastien Han <han.sebastien@gmail.com> wrote:

> Hum sorry, you're right. Forget about what I said :)
> 
> 
> On Thu, Nov 22, 2012 at 4:54 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> I thought the Client would then write to the 2nd is this wrong?
>> 
>> Stefan
>> 
>> Am 22.11.2012 um 16:49 schrieb Sébastien Han <han.sebastien@gmail.com>:
>> 
>>>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>>>> replicas 3.
>>> 
>>> Yes but you could also suffer a crash while writing the first replica.
>>> If the journal is in tmpfs, there is nothing to replay.
>>> 
>>> 
>>> 
>>> On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>> 
>>>>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>>>>> replicas 3.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 15:35                                                     ` Alexandre DERUMIER
  2012-11-22 15:49                                                       ` Sébastien Han
@ 2012-11-22 15:59                                                       ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-22 15:59 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

In my test it was only recovering some replicas, not the whole OSD.

On 22.11.2012 at 16:35, Alexandre DERUMIER <aderumier@odiso.com> wrote:

>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have 
>>> replicas 3.
> Yes, but rebuilding a dead node use cpu and ios. (but it should be benched too, to see the impact on the production)
> 
> 
> 
> ----- Original Message -----
>
> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>, "Mark Nelson" <mark.nelson@inktank.com>
> Sent: Thursday, 22 November 2012 16:28:57
> Subject: Re: RBD fio Performance concerns
>
> On 22.11.2012 16:26, Alexandre DERUMIER wrote:
>>>> Haven't tested that. But does this make sense? I mean, data goes to the disk
>>>> journal, and the same disk then has to copy the data from part A to part B.
>>>> 
>>>> Why is this an advantage?
>> 
>> Well, if you are CPU limited, I don't think you can use all 8*35000 IOPS per node.
>> So maybe a benchmark can tell us whether the difference is really big.
>>
>> Using tmpfs and a UPS can be OK, but if you have a kernel panic or a hardware problem, you'll lose your journal.
> 
> But who cares? it's also on the 2nd node. or even on the 3rd if you have 
> replicas 3. 
> 
> Stefan 
> 
> 
>> ----- Original Message -----
>>
>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>> To: "Mark Nelson" <mark.nelson@inktank.com>
>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
>> Sent: Thursday, 22 November 2012 16:01:56
>> Subject: Re: RBD fio Performance concerns
>>
>> On 22.11.2012 15:46, Mark Nelson wrote:
>>> I haven't played a whole lot with SSD only OSDs yet (other than noting 
>>> last summer that iop performance wasn't as high as I wanted it). Is a 
>>> second partition on the SSD for the journal not an option for you?
>> 
>> Haven't tested that. But does this make sense? I mean, data goes to the disk
>> journal, and the same disk then has to copy the data from part A to part B.
>> 
>> Why is this an advantage? 
>> 
>> Stefan 
>> 
>>> Mark 
>>> 
>>> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote: 
>>>> On 22.11.2012 15:37, Mark Nelson wrote:
>>>>> I don't think we recommend tmpfs at all for anything other than playing
>>>>> around. :)
>>>>
>>>> I discussed this with somebody from Inktank. Had to search the
>>>> mailing list. It might be OK if you're working with enough replicas and a
>>>> UPS.
>>>>
>>>> I see no other option when working with SSDs - the only other option would be
>>>> to be able to deactivate the journal entirely. But Ceph does not support
>>>> this.
>>>> 
>>>> Stefan 
>>>> 
>>>>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
>>>>>> Hi, 
>>>>>> 
>>>>>> can someone from Inktank comment on this? Might using /dev/ram0 with a
>>>>>> filesystem on it be better than tmpfs, since we could then use dio?
>>>>>> 
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>>>>>> To: "Sébastien Han" <han.sebastien@gmail.com>
>>>>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER"
>>>>>>> <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>,
>>>>>>> "Mark Kampe" <mark.kampe@inktank.com>
>>>>>>> Sent: Thursday, 22 November 2012 14:29:03
>>>>>>> Subject: Re: RBD fio Performance concerns
>>>>>>>
>>>>>>> On 22.11.2012 14:22, Sébastien Han wrote:
>>>>>>>> And RAMDISK devices are too expensive. 
>>>>>>>> 
>>>>>>>> It would make sense in your infra, but yes they are really expensive.
>>>>>>> 
>>>>>>> We need something like tmpfs - running in local memory but supporting
>>>>>>> dio.
>>>>>>> 
>>>>>>> Stefan
>>> 
>>> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-22 12:50                             ` Sébastien Han
  2012-11-22 13:14                               ` Stefan Priebe - Profihost AG
@ 2012-11-23 10:31                               ` Stefan Priebe - Profihost AG
  2012-11-23 10:47                                 ` Alexandre DERUMIER
  1 sibling, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-23 10:31 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Mark Nelson, Alexandre DERUMIER, ceph-devel, Mark Kampe

Hi,

When I switch the journal to a separate partition on each OSD disk
(/dev/sdX1, 1 GB, for the journal and /dev/sdX2 for the OSD), I go down from 23,000
IOPS to 200 IOPS for random 4K.

Greets,
Stefan

On 22.11.2012 13:50, Sébastien Han wrote:
>> journal is running on tmpfs to me but that changes nothing.
>
> I don't think it works then. According to the doc: Enables using
> libaio for asynchronous writes to the journal. Requires journal dio
> set to true.
>
>
> On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> On 22.11.2012 11:49, Sébastien Han wrote:
>>
>>> @Alexandre: cool!
>>>
>>> @ Stefan: Full SSD cluster and 10G switches?
>>
>> Yes
>>
>>
>>> Couple of weeks ago I saw
>>> that you use journal aio, did you notice performance improvement with it?
>>
>> journal is running on tmpfs to me but that changes nothing.
>>
>> Stefan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 10:31                               ` Stefan Priebe - Profihost AG
@ 2012-11-23 10:47                                 ` Alexandre DERUMIER
  2012-11-23 10:49                                   ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-23 10:47 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mark Nelson, ceph-devel, Mark Kampe, Sébastien Han

>>when i switch the journal to the OSD Disk seperate partiton on each disk 
>>(/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from 23.000 
>>iops to 200 iops random 4k. 
O_o, that seems crazy...
Are you sure that your partitions are correctly aligned? (Starting the first partition at sector 2048 is best for SSDs.)


----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Sébastien Han" <han.sebastien@gmail.com>
Cc: "Mark Nelson" <mark.nelson@inktank.com>, "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>
Sent: Friday, 23 November 2012 11:31:15
Subject: Re: RBD fio Performance concerns

Hi, 

when i switch the journal to the OSD Disk seperate partiton on each disk 
(/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from 23.000 
iops to 200 iops random 4k. 

Greets, 
Stefan 
On 22.11.2012 13:50, Sébastien Han wrote:
>> journal is running on tmpfs to me but that changes nothing. 
> 
> I don't think it works then. According to the doc: Enables using 
> libaio for asynchronous writes to the journal. Requires journal dio 
> set to true. 
> 
> 
> On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG 
> <s.priebe@profihost.ag> wrote: 
>> On 22.11.2012 11:49, Sébastien Han wrote:
>> 
>>> @Alexandre: cool! 
>>> 
>>> @ Stefan: Full SSD cluster and 10G switches? 
>> 
>> Yes 
>> 
>> 
>>> Couple of weeks ago I saw 
>>> that you use journal aio, did you notice performance improvement with it? 
>> 
>> journal is running on tmpfs to me but that changes nothing. 
>> 
>> Stefan 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 10:47                                 ` Alexandre DERUMIER
@ 2012-11-23 10:49                                   ` Stefan Priebe - Profihost AG
  2012-11-23 11:03                                     ` Alexandre DERUMIER
  0 siblings, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-23 10:49 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Mark Nelson, ceph-devel, Mark Kampe, Sébastien Han

On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>> when i switch the journal to the OSD Disk seperate partiton on each disk
>>> (/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from 23.000
>>> iops to 200 iops random 4k.
> O_o , that's seem crazy...
> Are you sure that your partitions are correctly aligned ? (starting first partition at sector 2048 is best for ssd)

Model: ATA INTEL SSDSC2CW24 (scsi)
Disk /dev/sdb: 468862128s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start     End         Size        File system  Name  Flags
  1      2048s     2342911s    2340864s    xfs          pri
  2      2342912s  468860927s  466518016s  xfs          pri

Stefan
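
The alignment question can be checked mechanically from the parted output above. A minimal sketch, assuming the goal is a 1 MiB boundary (which is what "start at sector 2048" amounts to with 512-byte sectors); the function name and the 1 MiB target are illustrative assumptions, only the start sectors come from the output:

```python
# Sketch: check the partition start sectors reported by parted against
# a 1 MiB boundary. The 1 MiB target is an assumption, not something
# parted or Ceph mandates.

SECTOR_SIZE = 512          # bytes, from "Sector size (logical/physical): 512B/512B"
ALIGNMENT = 1024 * 1024    # bytes; assumed alignment target (1 MiB)


def is_aligned(start_sector, sector_size=SECTOR_SIZE, alignment=ALIGNMENT):
    """Return True if the partition's byte offset is a multiple of `alignment`."""
    return (start_sector * sector_size) % alignment == 0


# Start sectors copied from the parted listing for /dev/sdb.
starts = {1: 2048, 2: 2342912}
results = {number: is_aligned(start) for number, start in starts.items()}

for number, ok in results.items():
    print(f"partition {number}: start={starts[number]}s aligned={ok}")
```

Both partitions start on a multiple of 2048 sectors, so under that assumption misalignment does not look like the cause of the drop.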

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 10:49                                   ` Stefan Priebe - Profihost AG
@ 2012-11-23 11:03                                     ` Alexandre DERUMIER
  2012-11-23 13:12                                       ` Stefan Priebe - Profihost AG
  2012-11-23 13:18                                       ` Mark Nelson
  0 siblings, 2 replies; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-23 11:03 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Mark Nelson, ceph-devel, Mark Kampe, Sébastien Han

So, correctly aligned...

Maybe try putting the journal directly on the raw partition, without XFS?


----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Mark Nelson" <mark.nelson@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
Sent: Friday, 23 November 2012 11:49:10
Subject: Re: RBD fio Performance concerns

On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>> when i switch the journal to the OSD Disk seperate partiton on each disk 
>>> (/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from 23.000 
>>> iops to 200 iops random 4k. 
> O_o , that's seem crazy... 
> Are you sure that your partitions are correctly aligned ? (starting first partition at sector 2048 is best for ssd) 

Model: ATA INTEL SSDSC2CW24 (scsi) 
Disk /dev/sdb: 468862128s 
Sector size (logical/physical): 512B/512B 
Partition Table: gpt 

Number  Start     End         Size        File system  Name  Flags
  1      2048s     2342911s    2340864s    xfs          pri
  2      2342912s  468860927s  466518016s  xfs          pri

Stefan 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 11:03                                     ` Alexandre DERUMIER
@ 2012-11-23 13:12                                       ` Stefan Priebe - Profihost AG
  2012-11-23 13:18                                       ` Mark Nelson
  1 sibling, 0 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-23 13:12 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Mark Nelson, ceph-devel, Mark Kampe, Sébastien Han

On 23.11.2012 12:03, Alexandre DERUMIER wrote:
> so correcly aligned...
>
> Maybe try to use journal directly on the full partition, without xfs ?

The same - just 200 iops for rand 4k.

Stefan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 11:03                                     ` Alexandre DERUMIER
  2012-11-23 13:12                                       ` Stefan Priebe - Profihost AG
@ 2012-11-23 13:18                                       ` Mark Nelson
  2012-11-23 13:24                                         ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 51+ messages in thread
From: Mark Nelson @ 2012-11-23 13:18 UTC (permalink / raw)
  To: Alexandre DERUMIER
  Cc: Stefan Priebe - Profihost AG, ceph-devel, Mark Kampe,
	Sébastien Han

Agreed with Alexandre, try putting the journal on a raw partition. 
That's pretty insane!  What controller are you using again?

Mark

On 11/23/2012 05:03 AM, Alexandre DERUMIER wrote:
> so correcly aligned...
>
> Maybe try to use journal directly on the full partition, without xfs ?
>
>
> ----- Original Message -----
>
> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
> To: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
> Sent: Friday, 23 November 2012 11:49:10
> Subject: Re: RBD fio Performance concerns
>
> On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>>> when i switch the journal to the OSD Disk seperate partiton on each disk
>>>> (/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from 23.000
>>>> iops to 200 iops random 4k.
>> O_o , that's seem crazy...
>> Are you sure that your partitions are correctly aligned ? (starting first partition at sector 2048 is best for ssd)
>
> Model: ATA INTEL SSDSC2CW24 (scsi)
> Disk /dev/sdb: 468862128s
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
>
> Number  Start     End         Size        File system  Name  Flags
>   1      2048s     2342911s    2340864s    xfs          pri
>   2      2342912s  468860927s  466518016s  xfs          pri
>
> Stefan
>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 13:18                                       ` Mark Nelson
@ 2012-11-23 13:24                                         ` Stefan Priebe - Profihost AG
  2012-11-23 13:32                                           ` Alexandre DERUMIER
  2012-11-23 13:33                                           ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-23 13:24 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

On 23.11.2012 14:18, Mark Nelson wrote:
> Agreed with Alexandre, try putting the journal on a raw partition.
> That's pretty insane!  What controller are you using again?

Makes no difference. No controller. I'm just using the onboard SATA 3.0
ports for each SSD.

fio directly on the SSD gives me 45,000 random 4K write IOPS and 270 MB/s sequential
write speed. So that's OK.

Stefan
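
To put the IOPS figures in this subthread on the same scale as the sequential bandwidth, here is a small sketch that converts 4K IOPS to MiB/s. The helper name is illustrative; the IOPS values are the ones quoted in the thread (raw SSD, Ceph with the tmpfs journal, Ceph with the on-disk journal), and a 4 KiB block size is assumed for "4k":

```python
# Sketch: convert small-block IOPS into bandwidth, assuming "4k" means
# 4096-byte blocks. Only the IOPS figures come from the thread.

BLOCK_SIZE = 4096  # bytes, assumed


def iops_to_mib_per_s(iops, block_size=BLOCK_SIZE):
    """Bandwidth in MiB/s implied by `iops` operations of `block_size` bytes."""
    return iops * block_size / (1024 * 1024)


cases = {
    "raw SSD, fio rand 4k write": 45000,
    "RBD, journal on tmpfs": 23000,
    "RBD, journal on SSD partition": 200,
}

for label, iops in cases.items():
    print(f"{label:32s} {iops:6d} IOPS = {iops_to_mib_per_s(iops):8.2f} MiB/s")
```

Seen this way, 200 IOPS is under 1 MiB/s on a device that sustains well over 100 MiB/s of random 4K writes on its own, which is why the result looks like a software or configuration problem rather than a hardware limit.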

> Mark
>
> On 11/23/2012 05:03 AM, Alexandre DERUMIER wrote:
>> so correcly aligned...
>>
>> Maybe try to use journal directly on the full partition, without xfs ?
>>
>>
>> ----- Original Message -----
>>
>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>> To: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "ceph-devel"
>> <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>,
>> "Sébastien Han" <han.sebastien@gmail.com>
>> Sent: Friday, 23 November 2012 11:49:10
>> Subject: Re: RBD fio Performance concerns
>>
>> On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>>>> when i switch the journal to the OSD Disk seperate partiton on each
>>>>> disk
>>>>> (/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from
>>>>> 23.000
>>>>> iops to 200 iops random 4k.
>>> O_o , that's seem crazy...
>>> Are you sure that your partitions are correctly aligned ? (starting
>>> first partition at sector 2048 is best for ssd)
>>
>> Model: ATA INTEL SSDSC2CW24 (scsi)
>> Disk /dev/sdb: 468862128s
>> Sector size (logical/physical): 512B/512B
>> Partition Table: gpt
>>
>> Number  Start     End         Size        File system  Name  Flags
>>   1      2048s     2342911s    2340864s    xfs          pri
>>   2      2342912s  468860927s  466518016s  xfs          pri
>>
>> Stefan
>>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 13:24                                         ` Stefan Priebe - Profihost AG
@ 2012-11-23 13:32                                           ` Alexandre DERUMIER
  2012-11-23 13:33                                           ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 51+ messages in thread
From: Alexandre DERUMIER @ 2012-11-23 13:32 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: ceph-devel, Mark Kampe, Sébastien Han, Mark Nelson

Maybe try playing with the I/O scheduler? Maybe noop can help?
 (I'm out of ideas.)
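
Before changing the scheduler it helps to know which one is active. A minimal sketch, assuming a Linux host where `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets; the function names and the sample string are illustrative, and `read_scheduler` is not called here because the device name would be a guess:

```python
# Sketch: find the active I/O scheduler for a block device by parsing
# the sysfs "scheduler" file, where the active entry is bracketed,
# e.g. "noop deadline [cfq]".
import re
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed scheduler name from sysfs text, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_scheduler(device):
    """Read and parse the scheduler file for `device` (e.g. "sdb")."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Hypothetical sample of what the sysfs file may contain.
sample = "noop deadline [cfq]"
print(active_scheduler(sample))
```

Switching is done by writing the desired name (for example `noop`) to the same file as root; whether that changes the 200 IOPS result here is untested.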


----- Original Message -----

From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
To: "Mark Nelson" <mark.nelson@inktank.com>
Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>, "Sébastien Han" <han.sebastien@gmail.com>
Sent: Friday, 23 November 2012 14:24:26
Subject: Re: RBD fio Performance concerns

On 23.11.2012 14:18, Mark Nelson wrote:
> Agreed with Alexandre, try putting the journal on a raw partition. 
> That's pretty insane! What controller are you using again? 

Makes no difference. No Controller. Just using the SATA 3.0 onboard 
Ports for each SSD. 

fio directly on SSD gives me 45.000 rand 4k write iops and 270Mb/s seq 
write speed. So that's ok. 

Stefan 

> Mark 
> 
> On 11/23/2012 05:03 AM, Alexandre DERUMIER wrote: 
>> so correcly aligned... 
>> 
>> Maybe try to use journal directly on the full partition, without xfs ? 
>> 
>> 
>> ----- Original Message -----
>>
>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>> To: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "ceph-devel"
>> <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>,
>> "Sébastien Han" <han.sebastien@gmail.com>
>> Sent: Friday, 23 November 2012 11:49:10
>> Subject: Re: RBD fio Performance concerns
>>
>> On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>>>> when i switch the journal to the OSD Disk seperate partiton on each 
>>>>> disk 
>>>>> (/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from 
>>>>> 23.000 
>>>>> iops to 200 iops random 4k. 
>>> O_o , that's seem crazy... 
>>> Are you sure that your partitions are correctly aligned ? (starting 
>>> first partition at sector 2048 is best for ssd) 
>> 
>> Model: ATA INTEL SSDSC2CW24 (scsi) 
>> Disk /dev/sdb: 468862128s 
>> Sector size (logical/physical): 512B/512B 
>> Partition Table: gpt 
>> 
>> Number  Start     End         Size        File system  Name  Flags
>>   1      2048s     2342911s    2340864s    xfs          pri
>>   2      2342912s  468860927s  466518016s  xfs          pri
>> 
>> Stefan 
>> 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: RBD fio Performance concerns
  2012-11-23 13:24                                         ` Stefan Priebe - Profihost AG
  2012-11-23 13:32                                           ` Alexandre DERUMIER
@ 2012-11-23 13:33                                           ` Stefan Priebe - Profihost AG
  2012-11-23 13:43                                             ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-23 13:33 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

On 23.11.2012 14:24, Stefan Priebe - Profihost AG wrote:
> On 23.11.2012 14:18, Mark Nelson wrote:
>> Agreed with Alexandre, try putting the journal on a raw partition.
>> That's pretty insane!  What controller are you using again?

Uh, crazy: now I get the same with tmpfs... no idea what's going on here.

I've updated Ceph to the latest git next. Maybe I should roll back.

Stefan

>
> Makes no difference. No Controller. Just using the SATA 3.0 onboard
> Ports for each SSD.
>
> fio directly on SSD gives me 45.000 rand 4k write iops and 270Mb/s seq
> write speed. So that's ok.
>
> Stefan
>
>> Mark
>>
>> On 11/23/2012 05:03 AM, Alexandre DERUMIER wrote:
>>> so correcly aligned...
>>>
>>> Maybe try to use journal directly on the full partition, without xfs ?
>>>
>>>
>>> ----- Original Message -----
>>>
>>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>> To: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "ceph-devel"
>>> <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>,
>>> "Sébastien Han" <han.sebastien@gmail.com>
>>> Sent: Friday, 23 November 2012 11:49:10
>>> Subject: Re: RBD fio Performance concerns
>>>
>>> On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>>>>> when i switch the journal to the OSD Disk seperate partiton on each
>>>>>> disk
>>>>>> (/dev/sdX1 for journal 1GB and /dev/sdX2 for OSD) i go down from
>>>>>> 23.000
>>>>>> iops to 200 iops random 4k.
>>>> O_o , that's seem crazy...
>>>> Are you sure that your partitions are correctly aligned ? (starting
>>>> first partition at sector 2048 is best for ssd)
>>>
>>> Model: ATA INTEL SSDSC2CW24 (scsi)
>>> Disk /dev/sdb: 468862128s
>>> Sector size (logical/physical): 512B/512B
>>> Partition Table: gpt
>>>
>>> Number  Start     End         Size        File system  Name  Flags
>>>   1      2048s     2342911s    2340864s    xfs          pri
>>>   2      2342912s  468860927s  466518016s  xfs          pri
>>>
>>> Stefan
>>>
>>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* RE: RBD fio Performance concerns
  2012-11-21 21:47                       ` Sébastien Han
  2012-11-21 22:05                         ` Mark Kampe
  2012-11-22  5:46                         ` Alexandre DERUMIER
@ 2012-11-23 13:36                         ` Chen, Xiaoxi
  2012-11-24 16:59                           ` Gregory Farnum
  2 siblings, 1 reply; 51+ messages in thread
From: Chen, Xiaoxi @ 2012-11-23 13:36 UTC (permalink / raw)
  To: Sébastien Han, Mark Nelson
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe

Hi Han,
      I have a cluster with 8 nodes (each node has 1 SSD as journal and three 7200 rpm SATA disks as data disks); each OSD consists of 1 SATA disk together with one 30G partition from the SSD. So in total I have 24 OSDs.
	My test method is to start 24 VMs and 24 RBD volumes, with the VMs and volumes paired 1:1. Aiostress is then used as the test tool.
	In total, I get ~1000 IOPS for sequential 4K writes on each volume and ~60 IOPS for random 4K writes.
But there are still some strange things on my cluster that I cannot explain: if I clear the page cache on the Ceph cluster nodes BEFORE the test, performance drops to half. I don't understand why an old page cache has any connection with write performance.
                                                                                   Xiaoxi
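
For comparison with the other clusters in this thread, the per-volume figures above can be turned into cluster-wide numbers. A rough sketch, assuming all 24 volumes were driven concurrently and that the pool keeps 2 replicas; the replica count is not stated in the message and is purely an assumption, as are the helper names:

```python
# Sketch: aggregate the per-volume IOPS reported above. The replica
# count is an assumption (not given in the message); everything else
# is copied from it and is approximate ("~").

NODES = 8
DATA_DISKS_PER_NODE = 3
VOLUMES = 24
SEQ_4K_IOPS_PER_VOLUME = 1000
RAND_4K_IOPS_PER_VOLUME = 60
ASSUMED_REPLICAS = 2  # assumption

osds = NODES * DATA_DISKS_PER_NODE


def aggregate_iops(per_volume_iops, volumes=VOLUMES):
    """Client-side IOPS summed over all volumes."""
    return per_volume_iops * volumes


def backend_writes_per_osd(client_iops, n_osds, replicas):
    """Data writes each OSD absorbs per second once replication is counted."""
    return client_iops * replicas / n_osds


rand_total = aggregate_iops(RAND_4K_IOPS_PER_VOLUME)
seq_total = aggregate_iops(SEQ_4K_IOPS_PER_VOLUME)

print(f"OSDs: {osds}")
print(f"aggregate random 4K writes:     {rand_total} IOPS")
print(f"aggregate sequential 4K writes: {seq_total} IOPS")
print(f"random writes per OSD at {ASSUMED_REPLICAS} replicas: "
      f"{backend_writes_per_osd(rand_total, osds, ASSUMED_REPLICAS):.0f}/s")
```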

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sébastien Han
Sent: 22 November 2012 5:47
To: Mark Nelson
Cc: Alexandre DERUMIER; ceph-devel; Mark Kampe
Subject: Re: RBD fio Performance concerns

Hi Mark,

Well, the most concerning thing is that I have 2 Ceph clusters and both of them show better rand than seq...
I don't have enough background to argue about your assumptions, but I could try to shrink my test platform to a single OSD and see how it performs. Let's keep in touch on that one.

But it seems that Alexandre and I have the same results (more rand than seq); he has (at least) one cluster and I have 2. Thus I'm starting to think that it's not an isolated issue.

Is it different for you? Do you usually get more seq IOPS from an RBD than rand?
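
As a sanity check on the fio numbers quoted further down, here is a minimal sketch applying Little's law (throughput is roughly outstanding I/Os divided by average latency). It assumes the queue stayed full at iodepth=256 for the whole run; the names are illustrative, and the IOPS and average "lat" figures are copied from the quoted fio 1.59 output:

```python
# Sketch: compare reported IOPS with iodepth / average latency.
# Assumes a constantly full queue of 256 outstanding requests.

IODEPTH = 256

# job name -> (reported iops, average total latency in seconds)
reported = {
    "seq-read": (3338, 76.67e-3),
    "rand-read": (27203, 9408.00e-6),
    "seq-write": (183, 1389.62e-3),
    "rand-write": (857, 298.52e-3),
}


def predicted_iops(depth, avg_latency_s):
    """Little's law: throughput = concurrency / latency."""
    return depth / avg_latency_s


def relative_error(predicted, measured):
    return abs(predicted - measured) / measured


for job, (iops, latency) in reported.items():
    estimate = predicted_iops(IODEPTH, latency)
    print(f"{job:10s} reported={iops:6d} predicted={estimate:8.0f} "
          f"error={relative_error(estimate, iops):.2%}")
```

The predictions land within about 1% of the reported IOPS for all four jobs, so the low sequential numbers are fully accounted for by per-request latency, which fits the explanation that queued writes to the same object wait in line.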


On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
> Responding to my own message. :)
>
> Talked to Sage a bit offline about this.  I think there are two 
> opposing
> forces:
>
> On one hand, random IO may be spreading reads/writes out across more 
> OSDs than sequential IO that presumably would be hitting a single OSD 
> more regularly.
>
> On the other hand, you'd expect that sequential writes would be 
> getting coalesced either at the RBD layer or on the OSD, and that the 
> drive/controller/filesystem underneath the OSD would be doing some 
> kind of readahead or prefetching.
>
> On the third hand, maybe coalescing/prefetching is in fact happening 
> but we are IOP limited by some per-osd limitation.
>
> It could be interesting to do the test with a single OSD and see what 
> happens.
>
> Mark
>
>
> On 11/21/2012 09:52 AM, Mark Nelson wrote:
>>
>> Hi Guys,
>>
>> I'm late to this thread but thought I'd chime in.  Crazy that you are 
>> getting higher performance with random reads/writes vs sequential!  
>> It would be interesting to see what kind of throughput smalliobench 
>> reports (should be packaged in bobtail) and also see if this behavior 
>> happens with cephfs.  It's still too early in the morning for me 
>> right now to come up with a reasonable explanation for what's going 
>> on.  It might be worth running blktrace and seekwatcher to see what 
>> the io patterns on the underlying disk look like in each case.  Maybe 
>> something unexpected is going on.
>>
>> Mark
>>
>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>>>
>>> Which iodepth did you use for those benchs?
>>>
>>>
>>>> I really don't understand why I can't get more rand read iops with 
>>>> 4K block ...
>>>
>>>
>>> Me neither, hope to get some clarification from the Inktank guys. It 
>>> doesn't make any sense to me...
>>> --
>>> Best regards,
>>> Sébastien HAN.
>>>
>>>
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER 
>>> <aderumier@odiso.com> wrote:
>>>>>>
>>>>>> @Alexandre: is it the same for you? or do you always get more 
>>>>>> IOPS with seq?
>>>>
>>>>
>>>> rand read 4K : 6000 iops
>>>> seq read 4K : 3500 iops
>>>> seq read 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>> rand write 4K : 6000 iops (tmpfs journal)
>>>> seq write 4K : 1600 iops
>>>> seq write 4M : 31 iops (1 gigabit client bandwidth limit)
>>>>
>>>>
>>>> I really don't understand why I can't get more rand read iops with 
>>>> 4K block ...
>>>>
>>>> I tried with a high-end CPU for the client; it doesn't change anything.
>>>> But the test cluster uses old 8-core E5420s @ 2.50GHz (though CPU is
>>>> around 15% on the cluster during the read bench).
>>>>
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Sébastien Han" <han.sebastien@gmail.com>
>>>> To: "Mark Kampe" <mark.kampe@inktank.com>
>>>> Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
>>>> <ceph-devel@vger.kernel.org>
>>>> Sent: Monday, 19 November 2012 19:03:40
>>>> Subject: Re: RBD fio Performance concerns
>>>>
>>>> @Sage, thanks for the info :)
>>>> @Mark:
>>>>
>>>>> If you want to do sequential I/O, you should do it buffered (so 
>>>>> that the writes can be aggregated) or with a 4M block size (very 
>>>>> efficient and avoiding object serialization).
>>>>
>>>>
>>>> The original benchmark has been performed with 4M block size. And 
>>>> as you can see I still get more IOPS with rand than seq... I just 
>>>> tried with 4M without direct I/O, still the same. I can print fio 
>>>> results if it's needed.
>>>>
>>>>> We do direct writes for benchmarking, not because it is a 
>>>>> reasonable way to do I/O, but because it bypasses the buffer cache 
>>>>> and enables us to directly measure cluster I/O throughput (which 
>>>>> is what we are trying to optimize). Applications should usually do 
>>>>> buffered I/O, to get the (very significant) benefits of caching 
>>>>> and write aggregation.
>>>>
>>>>
>>>> I know why I use direct I/O. It's synthetic benchmarks, it's far 
>>>> away from a real life scenario and how common applications works. I 
>>>> just try to see the maximum I/O throughput that I can get from my 
>>>> RBD. All my applications use buffered I/O.
>>>>
>>>> @Alexandre: is it the same for you? or do you always get more IOPS 
>>>> with seq?
>>>>
>>>> Thanks to all of you..
>>>>
>>>>
>>>> On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe 
>>>> <mark.kampe@inktank.com>
>>>> wrote:
>>>>>
>>>>> Recall:
>>>>> 1. RBD volumes are striped (4M wide) across RADOS objects
>>>>> 2. distinct writes to a single RADOS object are serialized
>>>>>
>>>>> Your sequential 4K writes are direct, depth=256, so there are (at 
>>>>> all times) 256 writes queued to the same object. All of your 
>>>>> writes are waiting through a very long line, which is adding 
>>>>> horrendous latency.
>>>>>
>>>>> If you want to do sequential I/O, you should do it buffered (so 
>>>>> that the writes can be aggregated) or with a 4M block size (very 
>>>>> efficient and avoiding object serialization).
>>>>>
>>>>> We do direct writes for benchmarking, not because it is a 
>>>>> reasonable way to do I/O, but because it bypasses the buffer cache 
>>>>> and enables us to directly measure cluster I/O throughput (which 
>>>>> is what we are trying to optimize). Applications should usually do 
>>>>> buffered I/O, to get the (very significant) benefits of caching 
>>>>> and write aggregation.
>>>>>
>>>>>
>>>>>> That's correct for some of the benchmarks. However even with 4K 
>>>>>> for seq, I still get less IOPS. See below my last fio:
>>>>>>
>>>>>> # fio rbd-bench.fio
>>>>>> seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
>>>>>> fio 1.59
>>>>>> Starting 4 processes
>>>>>> Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
>>>>>> seq-read: (groupid=0, jobs=1): err= 0: pid=15096
>>>>>>   read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
>>>>>>     slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
>>>>>>     clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
>>>>>>      lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
>>>>>>     bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
>>>>>>   cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=200473/0/0, short=0/0/0
>>>>>>      lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
>>>>>> rand-read: (groupid=1, jobs=1): err= 0: pid=16846
>>>>>>   read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
>>>>>>     slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
>>>>>>     clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
>>>>>>      lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
>>>>>>     bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
>>>>>>   cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=1632349/0/0, short=0/0/0
>>>>>>      lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
>>>>>> seq-write: (groupid=2, jobs=1): err= 0: pid=18653
>>>>>>   write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
>>>>>>     slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
>>>>>>     clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
>>>>>>      lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
>>>>>>     bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
>>>>>>   cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=0/11171/0, short=0/0/0
>>>>>>      lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
>>>>>>      lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
>>>>>> rand-write: (groupid=3, jobs=1): err= 0: pid=20446
>>>>>>   write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
>>>>>>     slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
>>>>>>     clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
>>>>>>      lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
>>>>>>     bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
>>>>>>   cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
>>>>>>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>>>>>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>>>>>      issued r/w/d: total=0/52147/0, short=0/0/0
>>>>>>      lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
>>>>>>      lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
>>>>>>
>>>>>> Run status group 0 (all jobs):
>>>>>>    READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec
>>>>>>
>>>>>> Run status group 1 (all jobs):
>>>>>>    READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec
>>>>>>
>>>>>> Run status group 2 (all jobs):
>>>>>>   WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec
>>>>>>
>>>>>> Run status group 3 (all jobs):
>>>>>>   WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec
>>>>>>
>>>>>> Disk stats (read/write):
>>>>>>   rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe 
>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>
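As a rough sanity check on the fio output quoted above: with a fixed queue depth, Little's law says IOPS should be close to the number of requests in flight divided by the mean time each one spends in flight. The sketch below is only an illustration; it takes iodepth=256 and the mean `lat` and `iops` values from the quoted run, and the names `RUNS` and `predicted_iops` are made up for this example.

```python
# (job name, iodepth, mean total latency in seconds, iops reported by fio)
# All four rows are copied from the quoted fio run.
RUNS = [
    ("seq-read",   256, 76.67e-3,   3338),
    ("rand-read",  256, 9408.00e-6, 27203),
    ("seq-write",  256, 1389.62e-3, 183),
    ("rand-write", 256, 298.52e-3,  857),
]


def predicted_iops(iodepth, mean_latency_s):
    """Little's law: throughput = requests in flight / time spent in flight."""
    return iodepth / mean_latency_s


for name, depth, latency, reported in RUNS:
    predicted = predicted_iops(depth, latency)
    print(f"{name:10s} predicted={predicted:9.0f} reported={reported}")
```

Each prediction lands within about 1% of the figure fio reported, so the low sequential-write IOPS and the roughly 1.4 second mean latency are two views of the same thing: 256 requests waiting in a slow queue.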
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RBD fio Performance concerns
  2012-11-23 13:33                                           ` Stefan Priebe - Profihost AG
@ 2012-11-23 13:43                                             ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 51+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-23 13:43 UTC (permalink / raw)
  To: Mark Nelson
  Cc: Alexandre DERUMIER, ceph-devel, Mark Kampe, Sébastien Han

*arg* don't do bandwidth limit tests to 10Mbit/s in parallel...

On 23.11.2012 14:33, Stefan Priebe - Profihost AG wrote:
> On 23.11.2012 14:24, Stefan Priebe - Profihost AG wrote:
>> On 23.11.2012 14:18, Mark Nelson wrote:
>>> Agreed with Alexandre, try putting the journal on a raw partition.
>>> That's pretty insane!  What controller are you using again?
>
> Uh, crazy: now I get the same with tmpfs... no idea what's going on here.
>
> I've updated ceph to the latest git next. Maybe I should roll back.
>
> Stefan
>
>>
>> Makes no difference. No controller. Just using the onboard SATA 3.0
>> ports for each SSD.
>>
>> fio directly on the SSD gives me 45,000 random 4k write IOPS and 270 MB/s
>> sequential write speed. So that's OK.
>>
>> Stefan
>>
>>> Mark
>>>
>>> On 11/23/2012 05:03 AM, Alexandre DERUMIER wrote:
>>>> so correctly aligned...
>>>>
>>>> Maybe try to use the journal directly on the full partition, without xfs?
>>>>
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag>
>>>> To: "Alexandre DERUMIER" <aderumier@odiso.com>
>>>> Cc: "Mark Nelson" <mark.nelson@inktank.com>, "ceph-devel"
>>>> <ceph-devel@vger.kernel.org>, "Mark Kampe" <mark.kampe@inktank.com>,
>>>> "Sébastien Han" <han.sebastien@gmail.com>
>>>> Sent: Friday 23 November 2012 11:49:10
>>>> Subject: Re: RBD fio Performance concerns
>>>>
>>>> On 23.11.2012 11:47, Alexandre DERUMIER wrote:
>>>>>>> when I switch the journal to a separate partition on each OSD disk
>>>>>>> (/dev/sdX1 for a 1GB journal and /dev/sdX2 for the OSD) I go down from
>>>>>>> 23,000 IOPS to 200 IOPS random 4k.
>>>>> O_o, that seems crazy...
>>>>> Are you sure that your partitions are correctly aligned? (starting
>>>>> the first partition at sector 2048 is best for SSDs)
>>>>
>>>> Model: ATA INTEL SSDSC2CW24 (scsi)
>>>> Disk /dev/sdb: 468862128s
>>>> Sector size (logical/physical): 512B/512B
>>>> Partition Table: gpt
>>>>
>>>> Number  Start     End         Size        File system  Name  Flags
>>>>  1      2048s     2342911s    2340864s    xfs          pri
>>>>  2      2342912s  468860927s  466518016s  xfs          pri
>>>>
>>>> Stefan
>>>>
>>>
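The "correctly aligned" conclusion can be checked against the parted output quoted above with a little arithmetic. This is only a sketch: it assumes that "aligned" means starting on a 1 MiB boundary (which is what sector 2048 corresponds to at 512 bytes per sector), and the helper name `is_aligned` is made up for this example.

```python
SECTOR_BYTES = 512               # logical sector size from the parted output
ALIGNMENT_BYTES = 1024 * 1024    # assumed target: 1 MiB boundary (sector 2048)

# Partition number -> start sector, copied from the parted output.
PARTITION_STARTS = {1: 2048, 2: 2342912}


def is_aligned(start_sector, sector_bytes=SECTOR_BYTES, alignment=ALIGNMENT_BYTES):
    """True if the partition's first byte falls on an alignment boundary."""
    return (start_sector * sector_bytes) % alignment == 0


for number, start in PARTITION_STARTS.items():
    print(f"partition {number}: start={start}s aligned={is_aligned(start)}")
```

Both start sectors are whole multiples of 2048, so under this assumption misalignment does not look like the cause of the drop to 200 IOPS.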
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RBD fio Performance concerns
  2012-11-23 13:36                         ` Chen, Xiaoxi
@ 2012-11-24 16:59                           ` Gregory Farnum
  0 siblings, 0 replies; 51+ messages in thread
From: Gregory Farnum @ 2012-11-24 16:59 UTC (permalink / raw)
  To: Chen, Xiaoxi
  Cc: Sébastien Han, Mark Nelson, Alexandre DERUMIER, ceph-devel,
	Mark Kampe

On Friday, November 23, 2012 at 5:36 AM, Chen, Xiaoxi wrote:
> Hi Han,
> I have a cluster with 8 nodes (each node with 1 SSD as journal and 3 7200 rpm SATA disks as data disks); each OSD consists of 1 SATA disk together with one 30G partition from the SSD. So in total I have 24 OSDs.
> My test method is to start 24 VMs and 24 RBD volumes, pairing VMs and volumes 1:1. Aiostress is then used as the test tool.
> In total, I get ~1000 IOPS for sequential 4K writes on each volume and ~60 IOPS for random 4K writes.
> But there is still something strange on my cluster that I cannot explain: if I clear the page cache on the Ceph cluster BEFORE the test, performance drops by half. I don't understand why old page cache contents have any connection with write performance.
> Xiaoxi

That's because when you dump out the page cache you're clearing out all of the OSD's data directory inodes from cache, so it needs to do a bunch of random IO disk hops to read them in, but normally they'd be in-memory since there aren't that many of them and they're accessed pretty frequently. ;)
-Greg
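One way to test this explanation would be to drop only one kind of cache before each run and compare. The sketch below assumes a Linux OSD host and root privileges; it relies on the kernel's `/proc/sys/vm/drop_caches` interface (1 = page cache, 2 = dentries and inodes, 3 = both). The function and constant names are made up for this example, and nothing in the thread says this is how the original test was run.

```python
import os

# Values accepted by /proc/sys/vm/drop_caches on Linux.
DROP_PAGECACHE = 1
DROP_DENTRIES_INODES = 2
DROP_ALL = 3


def drop_caches(mode, path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to drop clean caches; needs root on a real host.

    The path is a parameter only so the function can be exercised
    against an ordinary file.
    """
    if mode not in (DROP_PAGECACHE, DROP_DENTRIES_INODES, DROP_ALL):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # flush dirty pages first so that more of the cache is droppable
    with open(path, "w") as f:
        f.write(f"{mode}\n")
```

If the slowdown shows up after `drop_caches(DROP_DENTRIES_INODES)` but not after `drop_caches(DROP_PAGECACHE)`, that would point at the inode and dentry caches rather than at cached file data.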

  
>  
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sébastien Han
> Sent: 22 November 2012 5:47
> To: Mark Nelson
> Cc: Alexandre DERUMIER; ceph-devel; Mark Kampe
> Subject: Re: RBD fio Performance concerns
>  
> Hi Mark,
>  
> Well the most concerning thing is that I have 2 Ceph clusters and both of them show better rand than seq...
> I don't have enough background to argue with your assumptions, but I could try to shrink my test platform to a single OSD and see how it performs. Let's keep in touch on that one.
>
> But it seems that Alexandre and I have the same results (more rand than seq); he has (at least) one cluster and I have 2. Thus I am starting to think that it's not an isolated issue.
>
> Is it different for you? Do you usually get more seq IOPS from an RBD than rand?
>  
>  
> On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson <mark.nelson@inktank.com (mailto:mark.nelson@inktank.com)> wrote:
> > Responding to my own message. :)
> >  
> > Talked to Sage a bit offline about this. I think there are two  
> > opposing
> > forces:
> >  
> > On one hand, random IO may be spreading reads/writes out across more  
> > OSDs than sequential IO that presumably would be hitting a single OSD  
> > more regularly.
> >  
> > On the other hand, you'd expect that sequential writes would be  
> > getting coalesced either at the RBD layer or on the OSD, and that the  
> > drive/controller/filesystem underneath the OSD would be doing some  
> > kind of readahead or prefetching.
> >  
> > On the third hand, maybe coalescing/prefetching is in fact happening  
> > but we are IOP limited by some per-osd limitation.
> >  
> > It could be interesting to do the test with a single OSD and see what  
> > happens.
> >  
> > Mark
> >  
> >  
> > On 11/21/2012 09:52 AM, Mark Nelson wrote:
> > >  
> > > Hi Guys,
> > >  
> > > I'm late to this thread but thought I'd chime in. Crazy that you are  
> > > getting higher performance with random reads/writes vs sequential!  
> > > It would be interesting to see what kind of throughput smalliobench  
> > > reports (should be packaged in bobtail) and also see if this behavior  
> > > happens with cephfs. It's still too early in the morning for me  
> > > right now to come up with a reasonable explanation for what's going  
> > > on. It might be worth running blktrace and seekwatcher to see what  
> > > the io patterns on the underlying disk look like in each case. Maybe  
> > > something unexpected is going on.
> > >  
> > > Mark
> > >  
> > > > On 11/19/2012 02:57 PM, Sébastien Han wrote:
> > > >  
> > > > Which iodepth did you use for those benchs?
> > > >  
> > > >  
> > > > > I really don't understand why I can't get more rand read iops with  
> > > > > 4K block ...
> > > >  
> > > >  
> > > >  
> > > >  
> > > > Me neither, hope to get some clarification from the Inktank guys. It  
> > > > doesn't make any sense to me...
> > > > --
> > > > > Kind regards,
> > > > > Sébastien HAN.
> > > >  
> > > >  
> > > > On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER  
> > > > <aderumier@odiso.com (mailto:aderumier@odiso.com)> wrote:
> > > > > > >  
> > > > > > > @Alexandre: is it the same for you? or do you always get more  
> > > > > > > IOPS with seq?
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > > rand read 4K : 6000 iops
> > > > > > seq read 4K : 3500 iops
> > > > > > seq read 4M : 31 iops (1 gigabit client bandwidth limit)
> > > > > >
> > > > > > rand write 4K : 6000 iops (tmpfs journal)
> > > > > > seq write 4K : 1600 iops
> > > > > > seq write 4M : 31 iops (1 gigabit client bandwidth limit)
> > > > >  
> > > > >  
> > > > > I really don't understand why I can't get more rand read iops with  
> > > > > 4K block ...
> > > > >  
> > > > > > I tried with a high-end CPU for the client; it doesn't change anything.
> > > > > > But the test cluster uses old 8-core E5420 @ 2.50GHz CPUs (though CPU usage
> > > > > > is around 15% on the cluster during the read bench)
> > > > >  
> > > > >  
> > > > > > ----- Original message -----
> > > > > >
> > > > > > From: "Sébastien Han" <han.sebastien@gmail.com>
> > > > > > To: "Mark Kampe" <mark.kampe@inktank.com>
> > > > > > Cc: "Alexandre DERUMIER" <aderumier@odiso.com>, "ceph-devel"
> > > > > > <ceph-devel@vger.kernel.org>
> > > > > > Sent: Monday 19 November 2012 19:03:40
> > > > > > Subject: Re: RBD fio Performance concerns
> > > > >  
> > > > > @Sage, thanks for the info :)
> > > > > @Mark:
> > > > >  
> > > > > > If you want to do sequential I/O, you should do it buffered (so  
> > > > > > that the writes can be aggregated) or with a 4M block size (very  
> > > > > > efficient and avoiding object serialization).
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > The original benchmark has been performed with 4M block size. And  
> > > > > as you can see I still get more IOPS with rand than seq... I just  
> > > > > tried with 4M without direct I/O, still the same. I can print fio  
> > > > > results if it's needed.
> > > > >  
> > > > > > We do direct writes for benchmarking, not because it is a  
> > > > > > reasonable way to do I/O, but because it bypasses the buffer cache  
> > > > > > and enables us to directly measure cluster I/O throughput (which  
> > > > > > is what we are trying to optimize). Applications should usually do  
> > > > > > buffered I/O, to get the (very significant) benefits of caching  
> > > > > > and write aggregation.
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > > I know why I use direct I/O. These are synthetic benchmarks, far
> > > > > > from a real-life scenario and from how common applications work. I
> > > > > > am just trying to see the maximum I/O throughput that I can get from my
> > > > > > RBD. All my applications use buffered I/O.
> > > > >  
> > > > > @Alexandre: is it the same for you? or do you always get more IOPS  
> > > > > with seq?
> > > > >  
> > > > > Thanks to all of you..
> > > > >  
> > > > >  
> > > > > On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe  
> > > > > <mark.kampe@inktank.com (mailto:mark.kampe@inktank.com)>
> > > > > wrote:
> > > > > >  
> > > > > > Recall:
> > > > > > > 1. RBD volumes are striped (4M wide) across RADOS objects
> > > > > > > 2. distinct writes to a single RADOS object are serialized
> > > > > >  
> > > > > > Your sequential 4K writes are direct, depth=256, so there are (at  
> > > > > > all times) 256 writes queued to the same object. All of your  
> > > > > > writes are waiting through a very long line, which is adding  
> > > > > > horrendous latency.
> > > > > >  
> > > > > > If you want to do sequential I/O, you should do it buffered (so  
> > > > > > that the writes can be aggregated) or with a 4M block size (very  
> > > > > > efficient and avoiding object serialization).
> > > > > >  
> > > > > > We do direct writes for benchmarking, not because it is a  
> > > > > > reasonable way to do I/O, but because it bypasses the buffer cache  
> > > > > > and enables us to directly measure cluster I/O throughput (which  
> > > > > > is what we are trying to optimize). Applications should usually do  
> > > > > > buffered I/O, to get the (very significant) benefits of caching  
> > > > > > and write aggregation.
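The serialization argument quoted above can be made concrete by counting how many distinct objects a window of 256 in-flight 4K writes touches. This is only a sketch under simplifying assumptions: each consecutive 4M of the volume is taken to map to one object (as the "striped 4M wide" description suggests), the 10 GiB volume size is hypothetical, and all function names are made up for this example.

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024   # 4M stripe width, per the quoted explanation
BLOCK_SIZE = 4 * 1024           # 4K writes
QUEUE_DEPTH = 256               # fio iodepth from the benchmark


def object_index(offset, object_size=OBJECT_SIZE):
    """Index of the object holding a given byte offset of the volume."""
    return offset // object_size


def objects_touched(offsets):
    """Number of distinct objects hit by a set of in-flight writes."""
    return len({object_index(o) for o in offsets})


def sequential_window(start_block, depth=QUEUE_DEPTH):
    """Offsets of `depth` consecutive 4K writes starting at start_block."""
    return [(start_block + i) * BLOCK_SIZE for i in range(depth)]


def random_window(volume_bytes, depth=QUEUE_DEPTH, seed=0):
    """Offsets of `depth` 4K writes placed uniformly over the volume."""
    rng = random.Random(seed)
    blocks = volume_bytes // BLOCK_SIZE
    return [rng.randrange(blocks) * BLOCK_SIZE for _ in range(depth)]


if __name__ == "__main__":
    volume = 10 * 1024 ** 3  # hypothetical 10 GiB volume
    print("sequential:", objects_touched(sequential_window(0)))
    print("random:    ", objects_touched(random_window(volume)))
```

A sequential window spans only 256 x 4K = 1M, so it lands on one object (two if it straddles a boundary) and every write queues behind the others, while a random window spreads over many objects that can be serviced in parallel.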
> > > > > >  
> > > > > >  
> > > > > > > That's correct for some of the benchmarks. However even with 4K  
> > > > > > > for seq, I still get less IOPS. See below my last fio:
> > > > > > >  
> > > > > > > > # fio rbd-bench.fio
> > > > > > > > seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > > rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > > seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > > rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
> > > > > > > > fio 1.59
> > > > > > > > Starting 4 processes
> > > > > > > > Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s]
> > > > > > > > seq-read: (groupid=0, jobs=1): err= 0: pid=15096
> > > > > > > >   read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
> > > > > > > >     slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
> > > > > > > >     clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
> > > > > > > >      lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
> > > > > > > >     bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
> > > > > > > >   cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
> > > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> > > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > > >      issued r/w/d: total=200473/0/0, short=0/0/0
> > > > > > > >      lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
> > > > > > > > rand-read: (groupid=1, jobs=1): err= 0: pid=16846
> > > > > > > >   read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
> > > > > > > >     slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
> > > > > > > >     clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
> > > > > > > >      lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
> > > > > > > >     bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62
> > > > > > > >   cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
> > > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> > > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > > >      issued r/w/d: total=1632349/0/0, short=0/0/0
> > > > > > > >      lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
> > > > > > > > seq-write: (groupid=2, jobs=1): err= 0: pid=18653
> > > > > > > >   write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
> > > > > > > >     slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
> > > > > > > >     clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
> > > > > > > >      lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
> > > > > > > >     bw (KB/s) : min= 7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
> > > > > > > >   cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
> > > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
> > > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > > >      issued r/w/d: total=0/11171/0, short=0/0/0
> > > > > > > >      lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
> > > > > > > >      lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
> > > > > > > > rand-write: (groupid=3, jobs=1): err= 0: pid=20446
> > > > > > > >   write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
> > > > > > > >     slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
> > > > > > > >     clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
> > > > > > > >      lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
> > > > > > > >     bw (KB/s) : min= 0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
> > > > > > > >   cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
> > > > > > > >   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
> > > > > > > >      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > > > > > > >      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
> > > > > > > >      issued r/w/d: total=0/52147/0, short=0/0/0
> > > > > > > >      lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
> > > > > > > >      lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%
> > > > > > > >
> > > > > > > > Run status group 0 (all jobs):
> > > > > > > >    READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec
> > > > > > > >
> > > > > > > > Run status group 1 (all jobs):
> > > > > > > >    READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s, maxb=111425KB/s, mint=60005msec, maxt=60005msec
> > > > > > > >
> > > > > > > > Run status group 2 (all jobs):
> > > > > > > >   WRITE: io=44684KB, aggrb=735KB/s, minb=753KB/s, maxb=753KB/s, mint=60725msec, maxt=60725msec
> > > > > > > >
> > > > > > > > Run status group 3 (all jobs):
> > > > > > > >   WRITE: io=208588KB, aggrb=3429KB/s, minb=3511KB/s, maxb=3511KB/s, mint=60822msec, maxt=60822msec
> > > > > > > >
> > > > > > > > Disk stats (read/write):
> > > > > > > >   rbd1: ios=1832984/63270, merge=0/0, ticks=16374236/17012132, in_queue=33434120, util=99.79%
> > > > > >  
> > > > >  
> > > >  
> > > >  
> > > >  
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe  
> > > > ceph-devel" in the body of a message to majordomo@vger.kernel.org (mailto:majordomo@vger.kernel.org)  
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >  
> >  
>  
>  
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


end of thread, other threads:[~2012-11-24 16:59 UTC | newest]

Thread overview: 51+ messages
     [not found] <50A537EA.5090409@inktank.com>
     [not found] ` <CAOLwVUmQa4C_vs_Mbi3b2LeO=wx8_EMVWX5Pyu0y-JnG8nyz+Q@mail.gmail.com>
2012-11-16 22:59   ` RBD fio Performance concerns Mark Kampe
2012-11-19 14:56     ` Sébastien Han
2012-11-19 15:28       ` Alexandre DERUMIER
2012-11-19 15:42         ` Sébastien Han
2012-11-19 16:44           ` Sage Weil
2012-11-19 16:54           ` Mark Kampe
2012-11-19 18:03             ` Sébastien Han
2012-11-19 19:11               ` Alexandre DERUMIER
2012-11-19 20:57                 ` Sébastien Han
2012-11-20  7:32                   ` Alexandre DERUMIER
2012-11-20 10:37                     ` Sébastien Han
2012-11-21 15:52                   ` Mark Nelson
2012-11-21 16:34                     ` Mark Nelson
2012-11-21 21:47                       ` Sébastien Han
2012-11-21 22:05                         ` Mark Kampe
2012-11-22  5:46                         ` Alexandre DERUMIER
2012-11-23 13:36                         ` Chen, Xiaoxi
2012-11-24 16:59                           ` Gregory Farnum
2012-11-22 10:19                       ` Stefan Priebe - Profihost AG
     [not found]                         ` <CAOLwVUmp7wrfead8qX2BZPbyeN_JY_XBN+wkEWmbY6q1-5u0fw@mail.gmail.com>
2012-11-22 11:48                           ` Stefan Priebe - Profihost AG
2012-11-22 12:50                             ` Sébastien Han
2012-11-22 13:14                               ` Stefan Priebe - Profihost AG
     [not found]                                 ` <CAOLwVUkwVSv-Ven2CTjnTN2J573TBTD2SLDY7df0h7ncJZQgpQ@mail.gmail.com>
2012-11-22 13:29                                   ` Stefan Priebe - Profihost AG
2012-11-22 14:20                                     ` Alexandre DERUMIER
2012-11-22 14:22                                       ` Stefan Priebe - Profihost AG
2012-11-22 14:37                                         ` Mark Nelson
2012-11-22 14:42                                           ` Stefan Priebe - Profihost AG
2012-11-22 14:46                                             ` Mark Nelson
2012-11-22 15:01                                               ` Stefan Priebe - Profihost AG
2012-11-22 15:26                                                 ` Alexandre DERUMIER
2012-11-22 15:28                                                   ` Stefan Priebe - Profihost AG
2012-11-22 15:35                                                     ` Alexandre DERUMIER
2012-11-22 15:49                                                       ` Sébastien Han
2012-11-22 15:54                                                         ` Stefan Priebe - Profihost AG
2012-11-22 15:55                                                           ` Sébastien Han
2012-11-22 15:57                                                             ` Stefan Priebe - Profihost AG
2012-11-22 15:59                                                       ` Stefan Priebe - Profihost AG
2012-11-22 14:52                                             ` Alexandre DERUMIER
2012-11-22 15:00                                               ` Stefan Priebe - Profihost AG
2012-11-23 10:31                               ` Stefan Priebe - Profihost AG
2012-11-23 10:47                                 ` Alexandre DERUMIER
2012-11-23 10:49                                   ` Stefan Priebe - Profihost AG
2012-11-23 11:03                                     ` Alexandre DERUMIER
2012-11-23 13:12                                       ` Stefan Priebe - Profihost AG
2012-11-23 13:18                                       ` Mark Nelson
2012-11-23 13:24                                         ` Stefan Priebe - Profihost AG
2012-11-23 13:32                                           ` Alexandre DERUMIER
2012-11-23 13:33                                           ` Stefan Priebe - Profihost AG
2012-11-23 13:43                                             ` Stefan Priebe - Profihost AG
2012-11-22 14:34                           ` Mark Nelson
     [not found]               ` <50AA763A.1050709@inktank.com>
2012-11-19 21:01                 ` Sébastien Han
