Object Write Latency

All of lore.kernel.org
 help / color / mirror / Atom feed

* Object Write Latency
@ 2013-09-20 12:27 Andreas Joachim Peters
  2013-09-20 13:11 ` Mark Nelson
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Andreas Joachim Peters @ 2013-09-20 12:27 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi, 

we made some benchmarks about object read/write latencies on the CERN ceph installation.

The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies. 
Client & Server is 0.67.

The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:

write full object(sync) ~65-80ms
append to object ~60-75ms
set xattr object ~65-80ms
lock object ~65-80ms
stat object ~1ms

We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).

Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.

Then I tried with the async API to see if there is a difference in the measurement between wait_for_complete or wait_for_safe ... shouldn't wait_for_complete be much shorter, but I get always comparable results ...

Thanks, Andreas.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-20 12:27 Object Write Latency Andreas Joachim Peters
@ 2013-09-20 13:11 ` Mark Nelson
  2013-09-23  7:34   ` Dan van der Ster
  2013-09-20 14:47 ` Gregory Farnum
  2013-09-20 15:34 ` Sage Weil
  2 siblings, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2013-09-20 13:11 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel@vger.kernel.org

On 09/20/2013 07:27 AM, Andreas Joachim Peters wrote:
> Hi,

Hi Andreas!

>
> we made some benchmarks about object read/write latencies on the CERN ceph installation.
>
> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies.
> Client & Server is 0.67.
>
> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:
>
> write full object(sync) ~65-80ms
> append to object ~60-75ms
> set xattr object ~65-80ms
> lock object ~65-80ms
> stat object ~1ms
>
> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).

Out of curiosity, how much difference do you see with write latencies if 
you do the same thing to a pool with 1 copy?

>
> Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
> I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.

I've been wanting to really dig into object write latency in RADOS but 
just haven't had the time to devote to it yet.  I've been doing some 
simple rados bench tests to a 8-SSD test node and am topping out at 
about 8-9K write IOPS and 26K read IOPS (no replication) though with 
little tuning.  I suspect there are many areas in the code where we 
could improve things.

>
> Then I tried with the async API to see if there is a difference in the measurement between wait_for_complete or wait_for_safe ... shouldn't wait_for_complete be much shorter, but I get always comparable results ...

Hrm, I'm going to let Sage or someone else comment on this.
>
> Thanks, Andreas.--
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-20 12:27 Object Write Latency Andreas Joachim Peters
  2013-09-20 13:11 ` Mark Nelson
@ 2013-09-20 14:47 ` Gregory Farnum
  2013-09-23  7:37   ` Dan van der Ster
  2013-09-20 15:34 ` Sage Weil
  2 siblings, 1 reply; 20+ messages in thread
From: Gregory Farnum @ 2013-09-20 14:47 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel@vger.kernel.org

On Fri, Sep 20, 2013 at 5:27 AM, Andreas Joachim Peters
<Andreas.Joachim.Peters@cern.ch> wrote:
> Hi,
>
>
> we made some benchmarks about object read/write latencies on the CERN ceph installation.
>
> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies.
> Client & Server is 0.67.
>
> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:

Does that mean you have non-idle pools in the same cluster? Unless
you've got physical separation, the fact that the pool is idle doesn't
mean much unless the cluster is as well. (Though if you're getting 60
object writes/hard drive/second I think it probably is idle.)

> write full object(sync) ~65-80ms
> append to object ~60-75ms
> set xattr object ~65-80ms
> lock object ~65-80ms
> stat object ~1ms

Anecdotally those write times look a little high to me, but my
expectations are probably set for 2x and I'm not sure how much
difference that makes (I would expect not much, but maybe there's
something happening I haven't considered).


> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).
>
> Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
> I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.
Ah, and that's also higher than I would normally expect for a disk
access, so that's probably why the above numbers seem a little large.
Separately, what's your journal config? Does each spindle have a
partition? This math all works out to about what I'd expect if so.

> Then I tried with the async API to see if there is a difference in the measurement between wait_for_complete or wait_for_safe ... shouldn't wait_for_complete be much shorter, but I get always comparable results ...

You're presumably on xfs? With non-btrfs FSes, the OSDs have to use
write-ahead journaling so they always commit the op to disk before
applying to the local FS.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-20 12:27 Object Write Latency Andreas Joachim Peters
  2013-09-20 13:11 ` Mark Nelson
  2013-09-20 14:47 ` Gregory Farnum
@ 2013-09-20 15:34 ` Sage Weil
  2013-09-23  7:39   ` Dan van der Ster
  2 siblings, 1 reply; 20+ messages in thread
From: Sage Weil @ 2013-09-20 15:34 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel@vger.kernel.org

On Fri, 20 Sep 2013, Andreas Joachim Peters wrote:
> Hi, 
> 
> 
> we made some benchmarks about object read/write latencies on the CERN ceph installation.
> 
> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies. 
> Client & Server is 0.67.
> 
> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:
> 
> write full object(sync) ~65-80ms
> append to object ~60-75ms
> set xattr object ~65-80ms
> lock object ~65-80ms
> stat object ~1ms

How are the individual OSDs configured?  Are they purely HDD's with a 
journal partition, or is there an SSD journal?  If it's pure HDD, this 
will be a signnificant source of latency.

That said, Dieter just recently pointed out to me that he's observing 
significant time in a request turnaround (single request, 1 request in 
flight) that is not be related to the storage backend.  I've been 
traveling this week and haven't had time to look into it yet.  Tracking 
this down should just be a matter of turning up the logs and looking 
carefully at the timestamps to see where things are being delayed.

> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).
> 
> Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
> I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.
> 
> Then I tried with the async API to see if there is a difference in the 
> measurement between wait_for_complete or wait_for_safe ... shouldn't 
> wait_for_complete be much shorter, but I get always comparable results 
> ...

If you are using XFS or ext4 on teh backend, the OSD is doing write-ahead 
journaling, which means that in reality the commit happens before the op 
is applied to the fs and is readable.  (The 'commit/ondisk' reply implies 
an ack so the OSD has some internal locking to maintain this illusion from 
the client's perspective.)

sage

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-20 13:11 ` Mark Nelson
@ 2013-09-23  7:34   ` Dan van der Ster
  0 siblings, 0 replies; 20+ messages in thread
From: Dan van der Ster @ 2013-09-23  7:34 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Andreas Joachim Peters, ceph-devel@vger.kernel.org

On Fri, Sep 20, 2013 at 3:11 PM, Mark Nelson <mark.nelson@inktank.com> wrote:
> On 09/20/2013 07:27 AM, Andreas Joachim Peters wrote:
>>
>> Hi,
>
>
> Hi Andreas!
>
>
>>
>> we made some benchmarks about object read/write latencies on the CERN ceph
>> installation.
>>
>> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool
>> configuration has 3 copies.
>> Client & Server is 0.67.
>>
>> The latencies we observe (using tiny objects ... 5 bytes) on the idle
>> pool:
>>
>> write full object(sync) ~65-80ms
>> append to object ~60-75ms
>> set xattr object ~65-80ms
>> lock object ~65-80ms
>> stat object ~1ms
>>
>> We seem to saturate the pools writing ~ 20k objects/s (= internally
>> 60k/s).
>
>
> Out of curiosity, how much difference do you see with write latencies if you
> do the same thing to a pool with 1 copy?

# 3 copies:
[root@p05151113777233 ~]# rados bench -p test 10 write -t 1 -b 1
 Maintaining 1 concurrent writes of 1 bytes for up to 10 seconds or 0 objects
...
Average Latency:        0.0655107
Stddev Latency:         0.0156095
Max latency:            0.113482
Min latency:            0.033944


# 1 copy:
[root@p01001532149022 ~]# rados bench -p test 10 write -t 1 -b 5
 Maintaining 1 concurrent writes of 5 bytes for up to 10 seconds or 0 objects
...
Average Latency:        0.0470315
Stddev Latency:         0.0204646
Max latency:            0.097039
Min latency:            0.004141


Cheers, Dan

>
>
>>
>> Is there an easy explanation for 80 ms (quasi without payload) and a
>> possible tuning to reduce that?
>> I measured (append few bytes +fsync) on such a disk around 33ms which
>> explains probably part of the latency.
>
>
> I've been wanting to really dig into object write latency in RADOS but just
> haven't had the time to devote to it yet.  I've been doing some simple rados
> bench tests to a 8-SSD test node and am topping out at about 8-9K write IOPS
> and 26K read IOPS (no replication) though with little tuning.  I suspect
> there are many areas in the code where we could improve things.
>
>
>>
>> Then I tried with the async API to see if there is a difference in the
>> measurement between wait_for_complete or wait_for_safe ... shouldn't
>> wait_for_complete be much shorter, but I get always comparable results ...
>
>
> Hrm, I'm going to let Sage or someone else comment on this.
>
>>
>> Thanks, Andreas.--
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-20 14:47 ` Gregory Farnum
@ 2013-09-23  7:37   ` Dan van der Ster
  2013-09-23 18:18     ` Sage Weil
  0 siblings, 1 reply; 20+ messages in thread
From: Dan van der Ster @ 2013-09-23  7:37 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Andreas Joachim Peters, ceph-devel@vger.kernel.org

On Fri, Sep 20, 2013 at 4:47 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Fri, Sep 20, 2013 at 5:27 AM, Andreas Joachim Peters
> <Andreas.Joachim.Peters@cern.ch> wrote:
>> Hi,
>>
>>
>> we made some benchmarks about object read/write latencies on the CERN ceph installation.
>>
>> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies.
>> Client & Server is 0.67.
>>
>> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:
>
> Does that mean you have non-idle pools in the same cluster? Unless
> you've got physical separation, the fact that the pool is idle doesn't
> mean much unless the cluster is as well. (Though if you're getting 60
> object writes/hard drive/second I think it probably is idle.)

The cluster is pretty idle -- we'll separate with some new hardware to
exclude other usage from this test.


>
>> write full object(sync) ~65-80ms
>> append to object ~60-75ms
>> set xattr object ~65-80ms
>> lock object ~65-80ms
>> stat object ~1ms
>
> Anecdotally those write times look a little high to me, but my
> expectations are probably set for 2x and I'm not sure how much
> difference that makes (I would expect not much, but maybe there's
> something happening I haven't considered).
>
>
>> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).
>>
>> Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
>> I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.
> Ah, and that's also higher than I would normally expect for a disk
> access, so that's probably why the above numbers seem a little large.
> Separately, what's your journal config? Does each spindle have a
> partition? This math all works out to about what I'd expect if so.


The journals are on the same spinning disk & partition as the OSD
data. Each spinning disk is 1 OSD.

Cheers, Dan


>
>> Then I tried with the async API to see if there is a difference in the measurement between wait_for_complete or wait_for_safe ... shouldn't wait_for_complete be much shorter, but I get always comparable results ...
>
> You're presumably on xfs? With non-btrfs FSes, the OSDs have to use
> write-ahead journaling so they always commit the op to disk before
> applying to the local FS.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-20 15:34 ` Sage Weil
@ 2013-09-23  7:39   ` Dan van der Ster
  2013-09-23 15:38     ` Andreas Joachim Peters
  2013-09-23 15:40     ` Sage Weil
  0 siblings, 2 replies; 20+ messages in thread
From: Dan van der Ster @ 2013-09-23  7:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: Andreas Joachim Peters, ceph-devel@vger.kernel.org

On Fri, Sep 20, 2013 at 5:34 PM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 20 Sep 2013, Andreas Joachim Peters wrote:
>> Hi,
>>
>>
>> we made some benchmarks about object read/write latencies on the CERN ceph installation.
>>
>> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies.
>> Client & Server is 0.67.
>>
>> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:
>>
>> write full object(sync) ~65-80ms
>> append to object ~60-75ms
>> set xattr object ~65-80ms
>> lock object ~65-80ms
>> stat object ~1ms
>
> How are the individual OSDs configured?  Are they purely HDD's with a
> journal partition, or is there an SSD journal?  If it's pure HDD, this
> will be a signnificant source of latency.

No SSD journal, same HDD for journal and partition. It still seems
large to have >45ms for 1 copy though, no?


>
> That said, Dieter just recently pointed out to me that he's observing
> significant time in a request turnaround (single request, 1 request in
> flight) that is not be related to the storage backend.  I've been
> traveling this week and haven't had time to look into it yet.  Tracking
> this down should just be a matter of turning up the logs and looking
> carefully at the timestamps to see where things are being delayed.

Which logs.. debug_ms or debug_osd or both or something else?


>
>> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).
>>
>> Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
>> I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.
>>
>> Then I tried with the async API to see if there is a difference in the
>> measurement between wait_for_complete or wait_for_safe ... shouldn't
>> wait_for_complete be much shorter, but I get always comparable results
>> ...
>
> If you are using XFS or ext4 on teh backend, the OSD is doing write-ahead
> journaling, which means that in reality the commit happens before the op
> is applied to the fs and is readable.  (The 'commit/ondisk' reply implies
> an ack so the OSD has some internal locking to maintain this illusion from
> the client's perspective.)

It's XFS.

Cheers, Dan


>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Object Write Latency
  2013-09-23  7:39   ` Dan van der Ster
@ 2013-09-23 15:38     ` Andreas Joachim Peters
  2013-09-23 16:03       ` Mark Nelson
  2013-09-23 16:19       ` Sage Weil
  2013-09-23 15:40     ` Sage Weil
  1 sibling, 2 replies; 20+ messages in thread
From: Andreas Joachim Peters @ 2013-09-23 15:38 UTC (permalink / raw)
  To: Dan van der Ster, Sage Weil; +Cc: ceph-devel@vger.kernel.org

We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.

The FS does 140k/s append+sync and the latency is now:

~1 ms for few byte objects with single replica
~2 ms for few byte objects three replica  (instead of 65-80ms)

This gives probably the base-line of the best you can do with the current implementation.

==> the 80ms are probably just a 'feature' of the hardware (JBOD disks/controller) and we might try to find some tuning parameters to improve the latency slightly.

Could you just explain how the async api functions (is_complete, is_safe) map to the three states 

1) object is transferred from client to all OSDs and is present in memory there
2) object is written to the OSD journal
3) object is committed from OSD journal to the OSD filesystem

Is it correct that the object is visible by clients only when 3) has happened? 

Thanks for your help, 
Andreas.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-23  7:39   ` Dan van der Ster
  2013-09-23 15:38     ` Andreas Joachim Peters
@ 2013-09-23 15:40     ` Sage Weil
  1 sibling, 0 replies; 20+ messages in thread
From: Sage Weil @ 2013-09-23 15:40 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Andreas Joachim Peters, ceph-devel@vger.kernel.org

On Mon, 23 Sep 2013, Dan van der Ster wrote:
> On Fri, Sep 20, 2013 at 5:34 PM, Sage Weil <sage@inktank.com> wrote:
> > On Fri, 20 Sep 2013, Andreas Joachim Peters wrote:
> >> Hi,
> >>
> >>
> >> we made some benchmarks about object read/write latencies on the CERN ceph installation.
> >>
> >> The cluster has 44 nodes and ~1k disks, all on 10GE and the pool configuration has 3 copies.
> >> Client & Server is 0.67.
> >>
> >> The latencies we observe (using tiny objects ... 5 bytes) on the idle pool:
> >>
> >> write full object(sync) ~65-80ms
> >> append to object ~60-75ms
> >> set xattr object ~65-80ms
> >> lock object ~65-80ms
> >> stat object ~1ms
> >
> > How are the individual OSDs configured?  Are they purely HDD's with a
> > journal partition, or is there an SSD journal?  If it's pure HDD, this
> > will be a signnificant source of latency.
> 
> No SSD journal, same HDD for journal and partition. It still seems
> large to have >45ms for 1 copy though, no?

Yeah. 
 
> >
> > That said, Dieter just recently pointed out to me that he's observing
> > significant time in a request turnaround (single request, 1 request in
> > flight) that is not be related to the storage backend.  I've been
> > traveling this week and haven't had time to look into it yet.  Tracking
> > this down should just be a matter of turning up the logs and looking
> > carefully at the timestamps to see where things are being delayed.
> 
> Which logs.. debug_ms or debug_osd or both or something else?

Let's do

 debug ms = 1
 debug osd = 20
 debug filestore = 20
 debug journal = 20

to get the full picture on a single write.

sage

> 
> 
> >
> >> We seem to saturate the pools writing ~ 20k objects/s (= internally 60k/s).
> >>
> >> Is there an easy explanation for 80 ms (quasi without payload) and a possible tuning to reduce that?
> >> I measured (append few bytes +fsync) on such a disk around 33ms which explains probably part of the latency.
> >>
> >> Then I tried with the async API to see if there is a difference in the
> >> measurement between wait_for_complete or wait_for_safe ... shouldn't
> >> wait_for_complete be much shorter, but I get always comparable results
> >> ...
> >
> > If you are using XFS or ext4 on teh backend, the OSD is doing write-ahead
> > journaling, which means that in reality the commit happens before the op
> > is applied to the fs and is readable.  (The 'commit/ondisk' reply implies
> > an ack so the OSD has some internal locking to maintain this illusion from
> > the client's perspective.)
> 
> It's XFS.
> 
> Cheers, Dan
> 
> 
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-23 15:38     ` Andreas Joachim Peters
@ 2013-09-23 16:03       ` Mark Nelson
  2013-09-23 21:40         ` Andreas Joachim Peters
  2013-09-23 16:19       ` Sage Weil
  1 sibling, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2013-09-23 16:03 UTC (permalink / raw)
  To: Andreas Joachim Peters
  Cc: Dan van der Ster, Sage Weil, ceph-devel@vger.kernel.org

On 09/23/2013 10:38 AM, Andreas Joachim Peters wrote:
> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
>
> The FS does 140k/s append+sync and the latency is now:
>
> ~1 ms for few byte objects with single replica
> ~2 ms for few byte objects three replica  (instead of 65-80ms)

Interesting!  If you look at the slowest operations in the ceph admin 
socket now with dump_historic_ops, where are those operations spending 
their time?

>
> This gives probably the base-line of the best you can do with the current implementation.
>
> ==> the 80ms are probably just a 'feature' of the hardware (JBOD disks/controller) and we might try to find some tuning parameters to improve the latency slightly.

Hardware definitely plays a huge part in terms of Ceph performance.  You 
can run Ceph on just about anything, but it's surprising how different 
two roughly similar systems can perform.

>
> Could you just explain how the async api functions (is_complete, is_safe) map to the three states
>
> 1) object is transferred from client to all OSDs and is present in memory there
> 2) object is written to the OSD journal
> 3) object is committed from OSD journal to the OSD filesystem
>
> Is it correct that the object is visible by clients only when 3) has happened?

Yes, afaik.

>
> Thanks for your help,
> Andreas.
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Object Write Latency
  2013-09-23 15:38     ` Andreas Joachim Peters
  2013-09-23 16:03       ` Mark Nelson
@ 2013-09-23 16:19       ` Sage Weil
  1 sibling, 0 replies; 20+ messages in thread
From: Sage Weil @ 2013-09-23 16:19 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org

On Mon, 23 Sep 2013, Andreas Joachim Peters wrote:
> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
> 
> The FS does 140k/s append+sync and the latency is now:
> 
> ~1 ms for few byte objects with single replica
> ~2 ms for few byte objects three replica  (instead of 65-80ms)
> 
> This gives probably the base-line of the best you can do with the 
> current implementation.
> 
> ==> the 80ms are probably just a 'feature' of the hardware (JBOD 
> disks/controller) and we might try to find some tuning parameters to 
> improve the latency slightly.
> 
> Could you just explain how the async api functions (is_complete, 
> is_safe) map to the three states
> 
> 1) object is transferred from client to all OSDs and is present in memory there

Nothing happens yet..

> 2) object is written to the OSD journal

Client gets a COMMIT, which implies ACK

> 3) object is committed from OSD journal to the OSD filesystem

OSD now allows subsequent reads, or read/modify/write operations.

> Is it correct that the object is visible by clients only when 3) has 
> happened?

Yeah.

The ACK (operation is serialized and visible) vs COMMIT (operation is now 
durable) was conceived under the assumption that the serialized+visible 
step would be cheaper than making it durable.  This is the case for btrfs.  
Because of this, the COMMIT message implies ACK, so the client will see 
either ACK + COMMIT or COMMIT, but never COMMIT + ACK.

For ext4 and xfs, we need to do write-ahead journaling just for 
consistency, so the commit happens first.

Hope that helps!

I still think you should look at the logs for the JBOD hardware to see 
where the time is spent; it sounds like there is room for improvement.

sage

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-23  7:37   ` Dan van der Ster
@ 2013-09-23 18:18     ` Sage Weil
  2013-09-24  9:13       ` Dan van der Ster
  0 siblings, 1 reply; 20+ messages in thread
From: Sage Weil @ 2013-09-23 18:18 UTC (permalink / raw)
  To: Dan van der Ster
  Cc: Gregory Farnum, Andreas Joachim Peters,
	ceph-devel@vger.kernel.org

On Mon, 23 Sep 2013, Dan van der Ster wrote:
> The journals are on the same spinning disk & partition as the OSD
> data. Each spinning disk is 1 OSD.

Quick clarification: do you mean that the journal is a file on the same fs 
as osd_data?  If so, moving it to a separate partition on the same disk 
usually helps quite a bit.  The overhead of a write+fsync inside the fs 
tends to be much higher than an direct-io write to another partition. You 
might try measuring that directly and comparing it to the 33ms 
append+fsync that you previously saw.

sage

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Object Write Latency
  2013-09-23 16:03       ` Mark Nelson
@ 2013-09-23 21:40         ` Andreas Joachim Peters
  2013-09-24  8:55           ` Dan van der Ster
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Joachim Peters @ 2013-09-23 21:40 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Dan van der Ster, Sage Weil, ceph-devel@vger.kernel.org

Here is the output for one write_full on the in-memory OSD (the others look more or less the same e.g. most time is waiting for op_applied/op_commit):

{ "description": "osd_op(client.1764910.0:1 eos-root--0 [writefull 0~5] 15.fef59585 e3818)",
          "rmw_flags": 4,
          "received_at": "2013-09-23 23:33:16.536547",
          "age": "3.698191",
          "duration": "0.002225",
          "flag_point": "commit sent; apply or cleanup",
          "client_info": { "client": "client.1764910",
              "tid": 1},
          "events": [
                { "time": "2013-09-23 23:33:16.536706",
                  "event": "waiting_for_osdmap"},
                { "time": "2013-09-23 23:33:16.536807",
                  "event": "reached_pg"},
                { "time": "2013-09-23 23:33:16.536915",
                  "event": "started"},
                { "time": "2013-09-23 23:33:16.536936",
                  "event": "started"},
                { "time": "2013-09-23 23:33:16.537029",
                  "event": "waiting for subops from [1056,1057]"},
                { "time": "2013-09-23 23:33:16.537110",
                  "event": "commit_queued_for_journal_write"},
                { "time": "2013-09-23 23:33:16.537158",
                  "event": "write_thread_in_journal_buffer"},
                { "time": "2013-09-23 23:33:16.537242",
                  "event": "journaled_completion_queued"},
                { "time": "2013-09-23 23:33:16.537269",
                  "event": "op_commit"},
                { "time": "2013-09-23 23:33:16.538547",
                  "event": "sub_op_commit_rec"},
                { "time": "2013-09-23 23:33:16.538573",
                  "event": "op_applied"},
                { "time": "2013-09-23 23:33:16.538715",
                  "event": "sub_op_commit_rec"},
                { "time": "2013-09-23 23:33:16.538754",
                  "event": "commit_sent"},
                { "time": "2013-09-23 23:33:16.538772",
                  "event": "done"}]},

We should probably look at the same output for the JBOD configuration ...

Cheers Andreas.

________________________________________
From: Mark Nelson [mark.nelson@inktank.com]
Sent: 23 September 2013 18:03
To: Andreas Joachim Peters
Cc: Dan van der Ster; Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: Object Write Latency

On 09/23/2013 10:38 AM, Andreas Joachim Peters wrote:
> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
>
> The FS does 140k/s append+sync and the latency is now:
>
> ~1 ms for few byte objects with single replica
> ~2 ms for few byte objects three replica  (instead of 65-80ms)

Interesting!  If you look at the slowest operations in the ceph admin
socket now with dump_historic_ops, where are those operations spending
their time?

>
> This gives probably the base-line of the best you can do with the current implementation.
>
> ==> the 80ms are probably just a 'feature' of the hardware (JBOD disks/controller) and we might try to find some tuning parameters to improve the latency slightly.

Hardware definitely plays a huge part in terms of Ceph performance.  You
can run Ceph on just about anything, but it's surprising how different
two roughly similar systems can perform.

>
> Could you just explain how the async api functions (is_complete, is_safe) map to the three states
>
> 1) object is transferred from client to all OSDs and is present in memory there
> 2) object is written to the OSD journal
> 3) object is committed from OSD journal to the OSD filesystem
>
> Is it correct that the object is visible by clients only when 3) has happened?

Yes, afaik.

>
> Thanks for your help,
> Andreas.
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-23 21:40         ` Andreas Joachim Peters
@ 2013-09-24  8:55           ` Dan van der Ster
  2013-09-24 15:20             ` Sage Weil
  0 siblings, 1 reply; 20+ messages in thread
From: Dan van der Ster @ 2013-09-24  8:55 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: Mark Nelson, Sage Weil, ceph-devel@vger.kernel.org

Hi,
Yes, the journal is a file on same XFS partition as the data.

I have some results from a rados bench to a single-copy pool on the JBOD disks.

[root@p05151113777233 ~]# rados bench -p test 30 write -t 1 -b 5
...
 Total time run:         30.027296
Total writes made:      612
Write size:             5
Bandwidth (MB/sec):     0.000

Stddev Bandwidth:       1.92864e-05
Max bandwidth (MB/sec): 0.000119209
Min bandwidth (MB/sec): 0
Average Latency:        0.0490627
Stddev Latency:         0.0197453
Max latency:            0.136101
Min latency:            0.003991

The osd log and dump of historic ops are here:
 http://box.cern.ch/owncloud/public.php?service=files&t=a53c229bab2179c9e1a03899a859129f

The logs are quite verbose for me, but the journal do_write latencies
certainly stand out:

2013-09-24 10:28:52.853011 7f03b78dd700 20 journal do_write latency 0.044686
2013-09-24 10:28:56.610670 7f03b78dd700 20 journal do_write latency 0.025085
2013-09-24 10:29:09.273017 7f03b78dd700 20 journal do_write latency 0.031082
2013-09-24 10:29:22.703697 7f03b78dd700 20 journal do_write latency 0.045090

If you confirm that that really is the journal write+sync latency (and
not including the final commit to disk in that time), then I'll find
some new boxes where I can try out the direct IO to a partitioned
journal.

Cheers, Dan


On Mon, Sep 23, 2013 at 11:40 PM, Andreas Joachim Peters
<Andreas.Joachim.Peters@cern.ch> wrote:
> Here is the output for one write_full on the in-memory OSD (the others look more or less the same e.g. most time is waiting for op_applied/op_commit):
>
> { "description": "osd_op(client.1764910.0:1 eos-root--0 [writefull 0~5] 15.fef59585 e3818)",
>           "rmw_flags": 4,
>           "received_at": "2013-09-23 23:33:16.536547",
>           "age": "3.698191",
>           "duration": "0.002225",
>           "flag_point": "commit sent; apply or cleanup",
>           "client_info": { "client": "client.1764910",
>               "tid": 1},
>           "events": [
>                 { "time": "2013-09-23 23:33:16.536706",
>                   "event": "waiting_for_osdmap"},
>                 { "time": "2013-09-23 23:33:16.536807",
>                   "event": "reached_pg"},
>                 { "time": "2013-09-23 23:33:16.536915",
>                   "event": "started"},
>                 { "time": "2013-09-23 23:33:16.536936",
>                   "event": "started"},
>                 { "time": "2013-09-23 23:33:16.537029",
>                   "event": "waiting for subops from [1056,1057]"},
>                 { "time": "2013-09-23 23:33:16.537110",
>                   "event": "commit_queued_for_journal_write"},
>                 { "time": "2013-09-23 23:33:16.537158",
>                   "event": "write_thread_in_journal_buffer"},
>                 { "time": "2013-09-23 23:33:16.537242",
>                   "event": "journaled_completion_queued"},
>                 { "time": "2013-09-23 23:33:16.537269",
>                   "event": "op_commit"},
>                 { "time": "2013-09-23 23:33:16.538547",
>                   "event": "sub_op_commit_rec"},
>                 { "time": "2013-09-23 23:33:16.538573",
>                   "event": "op_applied"},
>                 { "time": "2013-09-23 23:33:16.538715",
>                   "event": "sub_op_commit_rec"},
>                 { "time": "2013-09-23 23:33:16.538754",
>                   "event": "commit_sent"},
>                 { "time": "2013-09-23 23:33:16.538772",
>                   "event": "done"}]},
>
> We should probably look at the same output for the JBOD configuration ...
>
> Cheers Andreas.
>
> ________________________________________
> From: Mark Nelson [mark.nelson@inktank.com]
> Sent: 23 September 2013 18:03
> To: Andreas Joachim Peters
> Cc: Dan van der Ster; Sage Weil; ceph-devel@vger.kernel.org
> Subject: Re: Object Write Latency
>
> On 09/23/2013 10:38 AM, Andreas Joachim Peters wrote:
>> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
>>
>> The FS does 140k/s append+sync and the latency is now:
>>
>> ~1 ms for few byte objects with single replica
>> ~2 ms for few byte objects three replica  (instead of 65-80ms)
>
> Interesting!  If you look at the slowest operations in the ceph admin
> socket now with dump_historic_ops, where are those operations spending
> their time?
>
>>
>> This gives probably the base-line of the best you can do with the current implementation.
>>
>> ==> the 80ms are probably just a 'feature' of the hardware (JBOD disks/controller) and we might try to find some tuning parameters to improve the latency slightly.
>
> Hardware definitely plays a huge part in terms of Ceph performance.  You
> can run Ceph on just about anything, but it's surprising how different
> two roughly similar systems can perform.
>
>>
>> Could you just explain how the async api functions (is_complete, is_safe) map to the three states
>>
>> 1) object is transferred from client to all OSDs and is present in memory there
>> 2) object is written to the OSD journal
>> 3) object is committed from OSD journal to the OSD filesystem
>>
>> Is it correct that the object is visible by clients only when 3) has happened?
>
> Yes, afaik.
>
>>
>> Thanks for your help,
>> Andreas.
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-23 18:18     ` Sage Weil
@ 2013-09-24  9:13       ` Dan van der Ster
  2013-09-24 10:04         ` Dan van der Ster
  0 siblings, 1 reply; 20+ messages in thread
From: Dan van der Ster @ 2013-09-24  9:13 UTC (permalink / raw)
  To: Sage Weil
  Cc: Gregory Farnum, Andreas Joachim Peters,
	ceph-devel@vger.kernel.org

On Mon, Sep 23, 2013 at 8:18 PM, Sage Weil <sage@inktank.com> wrote:
> You
> might try measuring that directly and comparing it to the 33ms
> append+fsync that you previously saw.

dd with fsync is quite slow...

[root@p05151113777233 fio]# time dd if=/dev/zero
of=/var/lib/ceph/osd/osd.1045/testtest bs=5 count=1
1+0 records in
1+0 records out
5 bytes (5 B) copied, 8.884e-05 s, 56.3 kB/s

real 0m0.001s
user 0m0.000s
sys 0m0.001s


[root@p05151113777233 fio]# time dd if=/dev/zero
of=/var/lib/ceph/osd/osd.1045/testtest bs=5 count=1 conv=fsync
1+0 records in
1+0 records out
5 bytes (5 B) copied, 0.0435103 s, 0.1 kB/s

real 0m0.045s
user 0m0.001s
sys 0m0.001s

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-24  9:13       ` Dan van der Ster
@ 2013-09-24 10:04         ` Dan van der Ster
       [not found]           ` <etPan.52417ad3.3804823e.f19a@leseb-mba.local>
  0 siblings, 1 reply; 20+ messages in thread
From: Dan van der Ster @ 2013-09-24 10:04 UTC (permalink / raw)
  To: Sage Weil
  Cc: Gregory Farnum, Andreas Joachim Peters,
	ceph-devel@vger.kernel.org

Some more tests...

~journal in XFS:
[root@p05151113777233 fio]# time dd if=/dev/zero
of=/var/lib/ceph/osd/osd.1045/testtest bs=512 count=1 conv=fsync
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0423365 s, 12.1 kB/s

real 0m0.044s
user 0m0.000s
sys 0m0.001s


~directio via the FS
[root@p05151113777233 fio]# time dd if=/dev/zero
of=/var/lib/ceph/osd/osd.1045/testtest bs=512 count=1
oflag=direct,dsync
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0426422 s, 12.0 kB/s

real 0m0.044s
user 0m0.000s
sys 0m0.001s


~directio to the bdev seems to be faster (a few attempts below):

[root@p05151113760120 ~]# time dd if=/dev/zero of=/dev/sdo1 bs=512
count=1 seek=$(($RANDOM*1000)) oflag=direct,dsync
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0113362 s, 45.2 kB/s

real 0m0.013s
user 0m0.001s
sys 0m0.001s
[root@p05151113760120 ~]# time dd if=/dev/zero of=/dev/sdo1 bs=512
count=1 seek=$(($RANDOM*1000)) oflag=direct,dsync
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0129222 s, 39.6 kB/s

real 0m0.014s
user 0m0.001s
sys 0m0.000s
[root@p05151113760120 ~]# time dd if=/dev/zero of=/dev/sdo1 bs=512
count=1 seek=$(($RANDOM*1000)) oflag=direct,dsync
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00484823 s, 106 kB/s

real 0m0.006s
user 0m0.001s
sys 0m0.001s
[root@p05151113760120 ~]#
[root@p05151113760120 ~]# time dd if=/dev/zero of=/dev/sdo1 bs=512
count=1 seek=$(($RANDOM*1000)) oflag=direct,dsync
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00441663 s, 116 kB/s

real 0m0.006s
user 0m0.001s
sys 0m0.001s


So it looks like I really should try those O_DIRECT journals on a
separate partition.

Cheers, Dan

On Tue, Sep 24, 2013 at 11:13 AM, Dan van der Ster <dan@vanderster.com> wrote:
> On Mon, Sep 23, 2013 at 8:18 PM, Sage Weil <sage@inktank.com> wrote:
>> You
>> might try measuring that directly and comparing it to the 33ms
>> append+fsync that you previously saw.
>
> dd with fsync is quite slow...
>
> [root@p05151113777233 fio]# time dd if=/dev/zero
> of=/var/lib/ceph/osd/osd.1045/testtest bs=5 count=1
> 1+0 records in
> 1+0 records out
> 5 bytes (5 B) copied, 8.884e-05 s, 56.3 kB/s
>
> real 0m0.001s
> user 0m0.000s
> sys 0m0.001s
>
>
> [root@p05151113777233 fio]# time dd if=/dev/zero
> of=/var/lib/ceph/osd/osd.1045/testtest bs=5 count=1 conv=fsync
> 1+0 records in
> 1+0 records out
> 5 bytes (5 B) copied, 0.0435103 s, 0.1 kB/s
>
> real 0m0.045s
> user 0m0.001s
> sys 0m0.001s

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
       [not found]           ` <etPan.52417ad3.3804823e.f19a@leseb-mba.local>
@ 2013-09-24 12:30             ` Sébastien Han
  2013-09-24 12:33               ` Dan van der Ster
  0 siblings, 1 reply; 20+ messages in thread
From: Sébastien Han @ 2013-09-24 12:30 UTC (permalink / raw)
  To: Sage Weil, Dan van der Ster
  Cc: Andreas Joachim Peters, Gregory Farnum,
	ceph-devel@vger.kernel.org

Ideally a partition using the first sectors of the disk.
I usually do a tiny partition at the beginning of the device and leave the rest for ods_data.

––––
Sébastien Han
Cloud Engineer

"Always give 100%. Unless you're giving blood.”

Phone: +33 (0)1 49 70 99 72
Mail: sebastien.han@enovance.com
Address : 10, rue de la Victoire - 75009 Paris
Web : www.enovance.com - Twitter : @enovance

On September 24, 2013 at 1:43:15 PM, Sébastien Han (sebastien.han@enovance.com) wrote:

Ideally a partition using the first sectors of the disk.
I usually do a tiny partition at the beginning of the device and leave the rest for ods_data.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-24 12:30             ` Sébastien Han
@ 2013-09-24 12:33               ` Dan van der Ster
  2013-09-24 12:39                 ` Sébastien Han
  0 siblings, 1 reply; 20+ messages in thread
From: Dan van der Ster @ 2013-09-24 12:33 UTC (permalink / raw)
  To: Sébastien Han
  Cc: Sage Weil, Andreas Joachim Peters, Gregory Farnum,
	ceph-devel@vger.kernel.org

Thanks Sebastien, we will try that. BTW, as you know we're using the
puppet modules of enovance, which set up a file-based journal. Did you
guys already puppetize a blockdev journal? I'll be doing that now if
not...
Cheers, Dan
CERN IT

On Tue, Sep 24, 2013 at 2:30 PM, Sébastien Han
<sebastien.han@enovance.com> wrote:
> Ideally a partition using the first sectors of the disk.
> I usually do a tiny partition at the beginning of the device and leave the rest for ods_data.
>
> ––––
> Sébastien Han
> Cloud Engineer
>
> "Always give 100%. Unless you're giving blood.”
>
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien.han@enovance.com
> Address : 10, rue de la Victoire - 75009 Paris
> Web : www.enovance.com - Twitter : @enovance
>
> On September 24, 2013 at 1:43:15 PM, Sébastien Han (sebastien.han@enovance.com) wrote:
>
> Ideally a partition using the first sectors of the disk.
> I usually do a tiny partition at the beginning of the device and leave the rest for ods_data.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-24 12:33               ` Dan van der Ster
@ 2013-09-24 12:39                 ` Sébastien Han
  0 siblings, 0 replies; 20+ messages in thread
From: Sébastien Han @ 2013-09-24 12:39 UTC (permalink / raw)
  To: Dan van der Ster
  Cc: Andreas Joachim Peters, Gregory Farnum,
	ceph-devel@vger.kernel.org, Sage Weil

Hi Dan,

Yes I noticed :), no we haven’t done anything on that side yet.
But I’ll be happy to see this happening, that way better than a file of the fs.

Would be nice if you could push something then :)

Cheers.
––––
Sébastien Han
Cloud Engineer

"Always give 100%. Unless you're giving blood.”

Phone: +33 (0)1 49 70 99 72
Mail: sebastien.han@enovance.com
Address : 10, rue de la Victoire - 75009 Paris
Web : www.enovance.com - Twitter : @enovance

On September 24, 2013 at 2:33:57 PM, Dan van der Ster (dan@vanderster.com) wrote:

Thanks Sebastien, we will try that. BTW, as you know we're using the  
puppet modules of enovance, which set up a file-based journal. Did you  
guys already puppetize a blockdev journal? I'll be doing that now if  
not...  
Cheers, Dan  
CERN IT  

On Tue, Sep 24, 2013 at 2:30 PM, Sébastien Han  
<sebastien.han@enovance.com> wrote:  
> Ideally a partition using the first sectors of the disk.  
> I usually do a tiny partition at the beginning of the device and leave the rest for ods_data.  
>  
> ––––  
> Sébastien Han  
> Cloud Engineer  
>  
> "Always give 100%. Unless you're giving blood.”  
>  
> Phone: +33 (0)1 49 70 99 72  
> Mail: sebastien.han@enovance.com  
> Address : 10, rue de la Victoire - 75009 Paris  
> Web : www.enovance.com - Twitter : @enovance  
>  
> On September 24, 2013 at 1:43:15 PM, Sébastien Han (sebastien.han@enovance.com) wrote:  
>  
> Ideally a partition using the first sectors of the disk.  
> I usually do a tiny partition at the beginning of the device and leave the rest for ods_data.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Object Write Latency
  2013-09-24  8:55           ` Dan van der Ster
@ 2013-09-24 15:20             ` Sage Weil
  0 siblings, 0 replies; 20+ messages in thread
From: Sage Weil @ 2013-09-24 15:20 UTC (permalink / raw)
  To: Dan van der Ster
  Cc: Andreas Joachim Peters, Mark Nelson, ceph-devel@vger.kernel.org

On Tue, 24 Sep 2013, Dan van der Ster wrote:
> Hi,
> Yes, the journal is a file on same XFS partition as the data.
> 
> I have some results from a rados bench to a single-copy pool on the JBOD disks.
> 
> [root@p05151113777233 ~]# rados bench -p test 30 write -t 1 -b 5
> ...
>  Total time run:         30.027296
> Total writes made:      612
> Write size:             5
> Bandwidth (MB/sec):     0.000
> 
> Stddev Bandwidth:       1.92864e-05
> Max bandwidth (MB/sec): 0.000119209
> Min bandwidth (MB/sec): 0
> Average Latency:        0.0490627
> Stddev Latency:         0.0197453
> Max latency:            0.136101
> Min latency:            0.003991
> 
> The osd log and dump of historic ops are here:
>  http://box.cern.ch/owncloud/public.php?service=files&t=a53c229bab2179c9e1a03899a859129f
> 
> The logs are quite verbose for me, but the journal do_write latencies
> certainly stand out:
> 
> 2013-09-24 10:28:52.853011 7f03b78dd700 20 journal do_write latency 0.044686
> 2013-09-24 10:28:56.610670 7f03b78dd700 20 journal do_write latency 0.025085
> 2013-09-24 10:29:09.273017 7f03b78dd700 20 journal do_write latency 0.031082
> 2013-09-24 10:29:22.703697 7f03b78dd700 20 journal do_write latency 0.045090
> 
> If you confirm that that really is the journal write+sync latency (and
> not including the final commit to disk in that time), then I'll find
> some new boxes where I can try out the direct IO to a partitioned
> journal.

Yes.. this is write(2) and then fdatasync(2) to the journal file only.  
With a partition (well, raw block device) you will get direct-io and aio 
(several in-flight requests at once), so these latencies should go down 
quite a bit.  On a standalone disk and lots of writes you should be able 
to see the rotational timing (either sub-ms latencies or ~6-9ms or 
whatever).  For a partition, you're competing with the fs so it will be 
noisier than that.

sage

> 
> Cheers, Dan
> 
> 
> On Mon, Sep 23, 2013 at 11:40 PM, Andreas Joachim Peters
> <Andreas.Joachim.Peters@cern.ch> wrote:
> > Here is the output for one write_full on the in-memory OSD (the others look more or less the same e.g. most time is waiting for op_applied/op_commit):
> >
> > { "description": "osd_op(client.1764910.0:1 eos-root--0 [writefull 0~5] 15.fef59585 e3818)",
> >           "rmw_flags": 4,
> >           "received_at": "2013-09-23 23:33:16.536547",
> >           "age": "3.698191",
> >           "duration": "0.002225",
> >           "flag_point": "commit sent; apply or cleanup",
> >           "client_info": { "client": "client.1764910",
> >               "tid": 1},
> >           "events": [
> >                 { "time": "2013-09-23 23:33:16.536706",
> >                   "event": "waiting_for_osdmap"},
> >                 { "time": "2013-09-23 23:33:16.536807",
> >                   "event": "reached_pg"},
> >                 { "time": "2013-09-23 23:33:16.536915",
> >                   "event": "started"},
> >                 { "time": "2013-09-23 23:33:16.536936",
> >                   "event": "started"},
> >                 { "time": "2013-09-23 23:33:16.537029",
> >                   "event": "waiting for subops from [1056,1057]"},
> >                 { "time": "2013-09-23 23:33:16.537110",
> >                   "event": "commit_queued_for_journal_write"},
> >                 { "time": "2013-09-23 23:33:16.537158",
> >                   "event": "write_thread_in_journal_buffer"},
> >                 { "time": "2013-09-23 23:33:16.537242",
> >                   "event": "journaled_completion_queued"},
> >                 { "time": "2013-09-23 23:33:16.537269",
> >                   "event": "op_commit"},
> >                 { "time": "2013-09-23 23:33:16.538547",
> >                   "event": "sub_op_commit_rec"},
> >                 { "time": "2013-09-23 23:33:16.538573",
> >                   "event": "op_applied"},
> >                 { "time": "2013-09-23 23:33:16.538715",
> >                   "event": "sub_op_commit_rec"},
> >                 { "time": "2013-09-23 23:33:16.538754",
> >                   "event": "commit_sent"},
> >                 { "time": "2013-09-23 23:33:16.538772",
> >                   "event": "done"}]},
> >
> > We should probably look at the same output for the JBOD configuration ...
> >
> > Cheers Andreas.
> >
> > ________________________________________
> > From: Mark Nelson [mark.nelson@inktank.com]
> > Sent: 23 September 2013 18:03
> > To: Andreas Joachim Peters
> > Cc: Dan van der Ster; Sage Weil; ceph-devel@vger.kernel.org
> > Subject: Re: Object Write Latency
> >
> > On 09/23/2013 10:38 AM, Andreas Joachim Peters wrote:
> >> We deployed 3 OSDs with an EXT4 using RapidDisk in-memory.
> >>
> >> The FS does 140k/s append+sync and the latency is now:
> >>
> >> ~1 ms for few byte objects with single replica
> >> ~2 ms for few byte objects three replica  (instead of 65-80ms)
> >
> > Interesting!  If you look at the slowest operations in the ceph admin
> > socket now with dump_historic_ops, where are those operations spending
> > their time?
> >
> >>
> >> This gives probably the base-line of the best you can do with the current implementation.
> >>
> >> ==> the 80ms are probably just a 'feature' of the hardware (JBOD disks/controller) and we might try to find some tuning parameters to improve the latency slightly.
> >
> > Hardware definitely plays a huge part in terms of Ceph performance.  You
> > can run Ceph on just about anything, but it's surprising how different
> > two roughly similar systems can perform.
> >
> >>
> >> Could you just explain how the async api functions (is_complete, is_safe) map to the three states
> >>
> >> 1) object is transferred from client to all OSDs and is present in memory there
> >> 2) object is written to the OSD journal
> >> 3) object is committed from OSD journal to the OSD filesystem
> >>
> >> Is it correct that the object is visible by clients only when 3) has happened?
> >
> > Yes, afaik.
> >
> >>
> >> Thanks for your help,
> >> Andreas.
> >>
> >>
> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-09-24 15:20 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-20 12:27 Object Write Latency Andreas Joachim Peters
2013-09-20 13:11 ` Mark Nelson
2013-09-23  7:34   ` Dan van der Ster
2013-09-20 14:47 ` Gregory Farnum
2013-09-23  7:37   ` Dan van der Ster
2013-09-23 18:18     ` Sage Weil
2013-09-24  9:13       ` Dan van der Ster
2013-09-24 10:04         ` Dan van der Ster
     [not found]           ` <etPan.52417ad3.3804823e.f19a@leseb-mba.local>
2013-09-24 12:30             ` Sébastien Han
2013-09-24 12:33               ` Dan van der Ster
2013-09-24 12:39                 ` Sébastien Han
2013-09-20 15:34 ` Sage Weil
2013-09-23  7:39   ` Dan van der Ster
2013-09-23 15:38     ` Andreas Joachim Peters
2013-09-23 16:03       ` Mark Nelson
2013-09-23 21:40         ` Andreas Joachim Peters
2013-09-24  8:55           ` Dan van der Ster
2013-09-24 15:20             ` Sage Weil
2013-09-23 16:19       ` Sage Weil
2013-09-23 15:40     ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.