* Cache tier READ_FORWARD transition
@ 2014-07-07 16:29 Luis Pabon
  2014-07-07 19:29 ` Sage Weil
  0 siblings, 1 reply; 11+ messages in thread
From: Luis Pabon @ 2014-07-07 16:29 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi all,
     I am working on OSDMonitor.cc:5325 and wanted to confirm the 
following read_forward cache tier transition:

     readforward -> forward || writeback || (any && num_objects_dirty == 0)
     forward -> writeback || readforward || (any && num_objects_dirty == 0)
     writeback -> readforward || forward
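
     A minimal Python sketch of the three rules above, for illustration
     only.  It is not the OSDMonitor.cc code: the function name and the
     table layout are assumptions, and "any" is read here as "any target
     mode".

```python
# Unconditional transitions, copied from the three rules in the thread.
UNCONDITIONAL = {
    "readforward": {"forward", "writeback"},
    "forward": {"writeback", "readforward"},
    "writeback": {"readforward", "forward"},
}

# Source modes whose rule carries the "(any && num_objects_dirty == 0)" clause.
ANY_TARGET_WHEN_CLEAN = {"readforward", "forward"}


def transition_allowed(current: str, target: str, num_objects_dirty: int) -> bool:
    """Return True if the proposed rules permit current -> target."""
    if current not in UNCONDITIONAL:
        raise ValueError(f"rules in the thread do not cover mode {current!r}")
    if target in UNCONDITIONAL[current]:
        return True
    # Any other target is allowed only from a clean readforward/forward cache.
    return current in ANY_TARGET_WHEN_CLEAN and num_objects_dirty == 0
```

     Under this reading, leaving writeback for anything other than
     readforward or forward is always refused, since the writeback rule
     has no clean-cache clause.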

Is this the correct cache tier state transition?

- Luis


* Re: Cache tier READ_FORWARD transition
  2014-07-07 16:29 Cache tier READ_FORWARD transition Luis Pabon
@ 2014-07-07 19:29 ` Sage Weil
  2014-07-07 19:38   ` Mark Nelson
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Sage Weil @ 2014-07-07 19:29 UTC (permalink / raw)
  To: Luis Pabon; +Cc: ceph-devel@vger.kernel.org

On Mon, 7 Jul 2014, Luis Pabon wrote:
> Hi all,
>     I am working on OSDMonitor.cc:5325 and wanted to confirm the following
> read_forward cache tier transition:
> 
>     readforward -> forward || writeback || (any && num_objects_dirty == 0)
>     forward -> writeback || readforward || (any && num_objects_dirty == 0)
>     writeback -> readforward || forward
> 
> Is this the correct cache tier state transition?

That looks right to me.

By the way, I had a thought after we spoke that we probably want something 
that is somewhere in between the current writeback behavior (promote on 
first read) and the read_forward behavior (never promote on read).  I 
suspect a good all-around policy is something like promote on second read?  
This should probably be rolled into the writeback mode as a tunable...
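
As a rough illustration of "promote on the Nth read" as a tunable, here is 
a Python sketch.  The class name, the reads_before_promote parameter, and 
the in-memory counter are all hypothetical; none of them is an existing 
Ceph option or data structure.

```python
from collections import Counter


class ReadPromotionPolicy:
    """Promote an object once it has missed the cache N times on read."""

    def __init__(self, reads_before_promote: int = 2):
        if reads_before_promote < 1:
            raise ValueError("reads_before_promote must be >= 1")
        # 1 mimics "promote on first read"; 2 is "promote on second read".
        self.reads_before_promote = reads_before_promote
        self._read_counts = Counter()

    def on_read_miss(self, object_id: str) -> bool:
        """Record a read miss; return True if the object should be promoted."""
        self._read_counts[object_id] += 1
        if self._read_counts[object_id] >= self.reads_before_promote:
            # Forget the count once promotion is requested.
            del self._read_counts[object_id]
            return True
        return False
```

A real implementation would also need to bound or age the counters; the 
sketch keeps them forever for objects that never reach the threshold.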

sage




* Re: Cache tier READ_FORWARD transition
  2014-07-07 19:29 ` Sage Weil
@ 2014-07-07 19:38   ` Mark Nelson
  2014-07-07 19:43     ` Sage Weil
  2014-07-07 19:45     ` Sage Weil
  2014-07-07 21:03   ` Luis Pabón
  2014-07-07 21:31   ` Luis Pabón
  2 siblings, 2 replies; 11+ messages in thread
From: Mark Nelson @ 2014-07-07 19:38 UTC (permalink / raw)
  To: Sage Weil, Luis Pabon; +Cc: ceph-devel@vger.kernel.org

On 07/07/2014 02:29 PM, Sage Weil wrote:
> On Mon, 7 Jul 2014, Luis Pabon wrote:
>> Hi all,
>>      I am working on OSDMonitor.cc:5325 and wanted to confirm the following
>> read_forward cache tier transition:
>>
>>      readforward -> forward || writeback || (any && num_objects_dirty == 0)
>>      forward -> writeback || readforward || (any && num_objects_dirty == 0)
>>      writeback -> readforward || forward
>>
>> Is this the correct cache tier state transition?
>
> That looks right to me.
>
> By the way, I had a thought after we spoke that we probably want something
> that is somewhere inbetween the current writeback behavior (promote on
> first read) and the read_forward behavior (never promote on read).  I
> suspect a good all-around policy is something like promote on second read?
> This should probably be rolled into the writeback mode as a tunable...

That would be a good start I think.  What about some kind of scheme that 
also favours promoting small objects over larger ones?  It could be as 
simple as increasing the number of reads necessary to do a promotion 
based on the object size.

i.e. something like:

<= 64k object: 1 read
<= 512k object: 2 reads
otherwise: 3 reads

That would make the behaviour for default RBD object sizes always 3 
reads, but could keep big objects out of the cache tier for RGW.
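
The size table above can be written as a small Python function.  This is 
only a sketch of the suggestion: the function name is made up, and "k" is 
assumed to mean KiB.

```python
KIB = 1024


def reads_required(object_size_bytes: int) -> int:
    """Number of reads needed before promotion, per the size table above."""
    if object_size_bytes < 0:
        raise ValueError("object size cannot be negative")
    if object_size_bytes <= 64 * KIB:
        return 1
    if object_size_bytes <= 512 * KIB:
        return 2
    return 3
```

With the default 4 MiB RBD object size this returns 3, which matches the 
"always 3 reads" remark.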

Mark

>
> sage
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



* Re: Cache tier READ_FORWARD transition
  2014-07-07 19:38   ` Mark Nelson
@ 2014-07-07 19:43     ` Sage Weil
  2014-07-07 21:02       ` Mark Nelson
  2014-07-07 19:45     ` Sage Weil
  1 sibling, 1 reply; 11+ messages in thread
From: Sage Weil @ 2014-07-07 19:43 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Luis Pabon, ceph-devel@vger.kernel.org

On Mon, 7 Jul 2014, Mark Nelson wrote:
> On 07/07/2014 02:29 PM, Sage Weil wrote:
> > On Mon, 7 Jul 2014, Luis Pabon wrote:
> > > Hi all,
> > >      I am working on OSDMonitor.cc:5325 and wanted to confirm the
> > > following
> > > read_forward cache tier transition:
> > > 
> > >      readforward -> forward || writeback || (any && num_objects_dirty ==
> > > 0)
> > >      forward -> writeback || readforward || (any && num_objects_dirty ==
> > > 0)
> > >      writeback -> readforward || forward
> > > 
> > > Is this the correct cache tier state transition?
> > 
> > That looks right to me.
> > 
> > By the way, I had a thought after we spoke that we probably want something
> > that is somewhere inbetween the current writeback behavior (promote on
> > first read) and the read_forward behavior (never promote on read).  I
> > suspect a good all-around policy is something like promote on second read?
> > This should probably be rolled into the writeback mode as a tunable...
> 
> That would be a good start I think.  What about some kind of scheme that also
> favours promoting small objects over larger ones?  It could be as simple as
> increasing the number of reads necessary to do a promotion based on the object
> size.
> 
> ie something like:
> 
> <= 64k object = 1 read
> <= 512k object = 2 read
> else 3 read
> 
> That would make the behaviour for default RBD object sizes always 3 read, but
> could keep big objects out of the cache tier for RGW.

We don't have enough information to do that right now, since on a miss we 
redirect the client instead of proxying the request, so we never learn 
what the actual object size is.

If/after we start doing proxying for the reads, then lots of other stuff 
becomes possible... but I think we'll need to be careful about choosing 
where to add complexity.

sage


* Re: Cache tier READ_FORWARD transition
  2014-07-07 19:38   ` Mark Nelson
  2014-07-07 19:43     ` Sage Weil
@ 2014-07-07 19:45     ` Sage Weil
  1 sibling, 0 replies; 11+ messages in thread
From: Sage Weil @ 2014-07-07 19:45 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Luis Pabon, ceph-devel@vger.kernel.org

On Mon, 7 Jul 2014, Mark Nelson wrote:
> On 07/07/2014 02:29 PM, Sage Weil wrote:
> > On Mon, 7 Jul 2014, Luis Pabon wrote:
> > > Hi all,
> > >      I am working on OSDMonitor.cc:5325 and wanted to confirm the
> > > following
> > > read_forward cache tier transition:
> > > 
> > >      readforward -> forward || writeback || (any && num_objects_dirty ==
> > > 0)
> > >      forward -> writeback || readforward || (any && num_objects_dirty ==
> > > 0)
> > >      writeback -> readforward || forward
> > > 
> > > Is this the correct cache tier state transition?
> > 
> > That looks right to me.
> > 
> > By the way, I had a thought after we spoke that we probably want something
> > that is somewhere inbetween the current writeback behavior (promote on
> > first read) and the read_forward behavior (never promote on read).  I
> > suspect a good all-around policy is something like promote on second read?
> > This should probably be rolled into the writeback mode as a tunable...
> 
> That would be a good start I think.  What about some kind of scheme that also
> favours promoting small objects over larger ones?  It could be as simple as
> increasing the number of reads necessary to do a promotion based on the object
> size.
> 
> ie something like:
> 
> <= 64k object = 1 read
> <= 512k object = 2 read
> else 3 read
> 
> That would make the behaviour for default RBD object sizes always 3 read, but
> could keep big objects out of the cache tier for RGW.

Hmm, FWIW, in the RBD vs RGW case those are different pools, so we can set 
different policies.  I think the small vs big object distinction might make 
sense in other contexts, though!
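
A per-pool policy could be as simple as a lookup table.  The Python sketch 
below is purely illustrative: the pool names and the per-pool values are 
invented, and the default of 1 stands for the current promote-on-first-read 
writeback behavior.

```python
# Hypothetical default: promote on first read, as writeback does today.
DEFAULT_READS_BEFORE_PROMOTE = 1

# Hypothetical cache pools with their own promotion thresholds.
POOL_READS_BEFORE_PROMOTE = {
    "rbd-cache": 1,  # promote RBD objects eagerly
    "rgw-cache": 3,  # keep large RGW objects out unless re-read
}


def reads_before_promote_for_pool(pool_name: str) -> int:
    """Look up the promotion threshold for a cache pool."""
    return POOL_READS_BEFORE_PROMOTE.get(pool_name, DEFAULT_READS_BEFORE_PROMOTE)
```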

sage


* Re: Cache tier READ_FORWARD transition
  2014-07-07 19:43     ` Sage Weil
@ 2014-07-07 21:02       ` Mark Nelson
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Nelson @ 2014-07-07 21:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Luis Pabon, ceph-devel@vger.kernel.org

On 07/07/2014 02:43 PM, Sage Weil wrote:
> On Mon, 7 Jul 2014, Mark Nelson wrote:
>> On 07/07/2014 02:29 PM, Sage Weil wrote:
>>> On Mon, 7 Jul 2014, Luis Pabon wrote:
>>>> Hi all,
>>>>       I am working on OSDMonitor.cc:5325 and wanted to confirm the
>>>> following
>>>> read_forward cache tier transition:
>>>>
>>>>       readforward -> forward || writeback || (any && num_objects_dirty ==
>>>> 0)
>>>>       forward -> writeback || readforward || (any && num_objects_dirty ==
>>>> 0)
>>>>       writeback -> readforward || forward
>>>>
>>>> Is this the correct cache tier state transition?
>>>
>>> That looks right to me.
>>>
>>> By the way, I had a thought after we spoke that we probably want something
>>> that is somewhere inbetween the current writeback behavior (promote on
>>> first read) and the read_forward behavior (never promote on read).  I
>>> suspect a good all-around policy is something like promote on second read?
>>> This should probably be rolled into the writeback mode as a tunable...
>>
>> That would be a good start I think.  What about some kind of scheme that also
>> favours promoting small objects over larger ones?  It could be as simple as
>> increasing the number of reads necessary to do a promotion based on the object
>> size.
>>
>> ie something like:
>>
>> <= 64k object = 1 read
>> <= 512k object = 2 read
>> else 3 read
>>
>> That would make the behaviour for default RBD object sizes always 3 read, but
>> could keep big objects out of the cache tier for RGW.
>
> We don't have enough information to do that right now, since on a miss we
> redirect the client instead of proxying them and never learn what the
> actual object size is.
>
> If/after we start doing proxying for the reads, then lots of other stuff
> becomes possible... but I think we'll need to be careful about choosing
> where to add complexity.

Ok, that makes sense.  Ignoring RGW for the moment, on the RBD side can 
we infer the object sizes from the image order?  Can we provide a hint 
in some way?  I guess my assumptions specifically for RBD are:

1) For large reads from any object:

Very low promotion priority, since spinning disks can do this fast.  Can 
we get this just from the read length?

2) For small reads from (presumed) large objects:

Sequential IO: Probably not at all (especially if we have big enough 
read-ahead on the base pool OSD fs)?  Can we save/check the previous read 
position(s) of the same object in addition to a previous attempt?  Too 
complex?

Random IO: Maybe even on the 3rd read attempt?  The worst reads will come 
out of buffer cache anyway.  Given how expensive promotion is for large 
objects, it seems to me we need to promote very slowly and infrequently.

3) For reads from (presumed) small objects:

Do the promotion right away, since the promotion is small and the SSDs 
can do small writes faster than the spinning disks can do small reads?
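
The three cases above can be sketched in Python as follows.  This is only 
one way to read the proposal.  The byte thresholds and function names are 
assumptions, returning None stands in for "do not promote on read", and the 
only firm fact used is that an RBD object is 2^order bytes (order 22, the 
default, gives 4 MiB).

```python
from typing import Optional

# Hypothetical cut-offs; the thread does not fix these numbers.
LARGE_READ_BYTES = 512 * 1024
SMALL_OBJECT_BYTES = 64 * 1024


def object_size_from_order(order: int) -> int:
    """RBD objects are 2**order bytes, so the image order implies the size."""
    return 1 << order


def reads_before_promote(read_len: int, object_size: int,
                         sequential: bool) -> Optional[int]:
    """Return how many reads to wait before promoting, or None for never."""
    if read_len >= LARGE_READ_BYTES:
        return None  # case 1: large reads are cheap on spinning disks
    if object_size <= SMALL_OBJECT_BYTES:
        return 1     # case 3: small objects are cheap to promote
    if sequential:
        return None  # case 2, sequential: rely on read-ahead instead
    return 3         # case 2, random: promote slowly
```

Detecting "sequential" would need the previous read position of the same 
object, which is the extra state the message asks about.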



* Re: Cache tier READ_FORWARD transition
  2014-07-07 19:29 ` Sage Weil
  2014-07-07 19:38   ` Mark Nelson
@ 2014-07-07 21:03   ` Luis Pabón
  2014-07-07 21:31   ` Luis Pabón
  2 siblings, 0 replies; 11+ messages in thread
From: Luis Pabón @ 2014-07-07 21:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

I think so, but I am not sure what kind of workload would benefit from 
that tunable.  Do you have any in mind?  Is the reason for this tunable 
value to have a more storage-efficient caching tier?

- Luis

On 07/07/2014 03:29 PM, Sage Weil wrote:
> On Mon, 7 Jul 2014, Luis Pabon wrote:
>> Hi all,
>>      I am working on OSDMonitor.cc:5325 and wanted to confirm the following
>> read_forward cache tier transition:
>>
>>      readforward -> forward || writeback || (any && num_objects_dirty == 0)
>>      forward -> writeback || readforward || (any && num_objects_dirty == 0)
>>      writeback -> readforward || forward
>>
>> Is this the correct cache tier state transition?
> That looks right to me.
>
> By the way, I had a thought after we spoke that we probably want something
> that is somewhere inbetween the current writeback behavior (promote on
> first read) and the read_forward behavior (never promote on read).  I
> suspect a good all-around policy is something like promote on second read?
> This should probably be rolled into the writeback mode as a tunable...
>
> sage
>
>



* Re: Cache tier READ_FORWARD transition
  2014-07-07 19:29 ` Sage Weil
  2014-07-07 19:38   ` Mark Nelson
  2014-07-07 21:03   ` Luis Pabón
@ 2014-07-07 21:31   ` Luis Pabón
  2014-07-08 16:01     ` Sage Weil
  2 siblings, 1 reply; 11+ messages in thread
From: Luis Pabón @ 2014-07-07 21:31 UTC (permalink / raw)
  To: Sage Weil, Mark Nelson; +Cc: ceph-devel@vger.kernel.org

What about the following use case (please forgive some of my Ceph 
architecture ignorance):

If it were possible to set up an OSD caching tier at the host (if the host 
had a dedicated SSD for accelerating I/O), then caching pools could be 
created to cache VM rbds, since they are inherently exclusive to a 
single host.  Using a write-through (or readonly, depending on the 
workload) policy would give a major increase in VM IOPS.  Using a 
write-through or readonly policy would also ensure any writes are first 
written to the back end storage tier.  Enabling hosts to service most of 
their VM I/O reads would also increase the overall IOPS of the back end 
storage tier.

Does this make sense?

- Luis

On 07/07/2014 03:29 PM, Sage Weil wrote:
> On Mon, 7 Jul 2014, Luis Pabon wrote:
>> Hi all,
>>      I am working on OSDMonitor.cc:5325 and wanted to confirm the following
>> read_forward cache tier transition:
>>
>>      readforward -> forward || writeback || (any && num_objects_dirty == 0)
>>      forward -> writeback || readforward || (any && num_objects_dirty == 0)
>>      writeback -> readforward || forward
>>
>> Is this the correct cache tier state transition?
> That looks right to me.
>
> By the way, I had a thought after we spoke that we probably want something
> that is somewhere inbetween the current writeback behavior (promote on
> first read) and the read_forward behavior (never promote on read).  I
> suspect a good all-around policy is something like promote on second read?
> This should probably be rolled into the writeback mode as a tunable...
>
> sage
>
>



* Re: Cache tier READ_FORWARD transition
  2014-07-07 21:31   ` Luis Pabón
@ 2014-07-08 16:01     ` Sage Weil
  2014-07-09 17:46       ` Luis Pabon
  2014-07-10  4:34       ` Alexandre DERUMIER
  0 siblings, 2 replies; 11+ messages in thread
From: Sage Weil @ 2014-07-08 16:01 UTC (permalink / raw)
  To: Luis Pabón; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On Mon, 7 Jul 2014, Luis Pabón wrote:
> What about the following usecase (please forgive some of my ceph architecture
> ignorance):
> 
> If it was possible to setup OSD caching tier at the host (if the host had a
> dedicated SSD for accelerating I/O), then caching pools could be created to
> cache VM rbds, since they are inherently exclusive to a single host.  Using a
> write through (or a readonly, depending on the workload) policy would have a
> major increase in VM IOPs.   Using writethrough or readonly policy would also
> ensure any writes are first written to the back end storage tier.  Enabling
> hosts to service most of their VM I/O reads would also increases the overall
> IOPs of the back end storage tier.

This could be accomplished by doing a rados pool per client host.  The 
rados caching only works as a writeback cache, though, not 
write-through, so you really need to replicate it for it to be usable in 
practice.  So although it's possible, this isn't a particularly attractive 
approach.

What you're describing is really a client-side write-through cache, either 
for librbd or librados.  We've discussed this in the past (mostly in the 
context of shared host-wide read-only data, not as write-through), but 
in both cases the caching would plug into the client libraries.  There are 
some CDS notes from Emperor:

	http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache
	http://pad.ceph.com/p/rbd-shared-read-cache
	http://www.youtube.com/watch?v=SVgBdUv_Lv4&t=70m11s

Note that you can also accomplish this with the kernel rbd driver by 
layering dm-cache or bcache or something similar on top and running it in 
write-through mode.  Most clients are (KVM+)librbd, though, so eventually 
a userspace implementation for librbd (or maybe librados) makes sense.
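
To make "write-through" concrete, here is a toy Python sketch of a 
client-side cache.  It is not librbd, librados, dm-cache, or bcache code; 
the class names and the read/write interface are invented for the example, 
and it assumes a single client owns the data, as in the per-host VM case 
described above.

```python
class DictBackend:
    """Stand-in for the backing store (hypothetical interface)."""

    def __init__(self):
        self._objects = {}

    def read(self, key):
        return self._objects[key]

    def write(self, key, data):
        self._objects[key] = data


class WriteThroughCache:
    """Serve reads locally when possible; always write to the backend first."""

    def __init__(self, backend):
        self.backend = backend
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def write(self, key, data):
        # The write reaches the backend before the local copy is updated,
        # so the cache never holds dirty data.
        self.backend.write(key, data)
        self._cache[key] = data

    def read(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        data = self.backend.read(key)
        self._cache[key] = data
        return data
```

Because the cache never holds dirty data, losing the local SSD loses no 
writes, which is why it does not need to be replicated the way a writeback 
tier does.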

sage


> Does this make sense?
> 
> - Luis
> 
> On 07/07/2014 03:29 PM, Sage Weil wrote:
> > On Mon, 7 Jul 2014, Luis Pabon wrote:
> > > Hi all,
> > >      I am working on OSDMonitor.cc:5325 and wanted to confirm the
> > > following
> > > read_forward cache tier transition:
> > > 
> > >      readforward -> forward || writeback || (any && num_objects_dirty ==
> > > 0)
> > >      forward -> writeback || readforward || (any && num_objects_dirty ==
> > > 0)
> > >      writeback -> readforward || forward
> > > 
> > > Is this the correct cache tier state transition?
> > That looks right to me.
> > 
> > By the way, I had a thought after we spoke that we probably want something
> > that is somewhere inbetween the current writeback behavior (promote on
> > first read) and the read_forward behavior (never promote on read).  I
> > suspect a good all-around policy is something like promote on second read?
> > This should probably be rolled into the writeback mode as a tunable...
> > 
> > sage
> > 
> > 
> 
> 


* Re: Cache tier READ_FORWARD transition
  2014-07-08 16:01     ` Sage Weil
@ 2014-07-09 17:46       ` Luis Pabon
  2014-07-10  4:34       ` Alexandre DERUMIER
  1 sibling, 0 replies; 11+ messages in thread
From: Luis Pabon @ 2014-07-09 17:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

This is great information.

Thanks, Sage.

- Luis


On 07/08/2014 12:01 PM, Sage Weil wrote:
> On Mon, 7 Jul 2014, Luis Pabón wrote:
>> What about the following usecase (please forgive some of my ceph architecture
>> ignorance):
>>
>> If it was possible to setup OSD caching tier at the host (if the host had a
>> dedicated SSD for accelerating I/O), then caching pools could be created to
>> cache VM rbds, since they are inherently exclusive to a single host.  Using a
>> write through (or a readonly, depending on the workload) policy would have a
>> major increase in VM IOPs.   Using writethrough or readonly policy would also
>> ensure any writes are first written to the back end storage tier.  Enabling
>> hosts to service most of their VM I/O reads would also increases the overall
>> IOPs of the back end storage tier.
> This could be accomplished by doing a rados pool per client host.  The
> rados caching only works in as a writeback cache, though, not
> write-through, so you really need to replicate it for it to be usable in
> practice.  So although it's possible, this isn't a particularly attractive
> approach.
>
> What you're describing is really a client-side write-through cache, either
> for librbd or librados.  We've discussed this in the past (mostly in the
> context of a shared host-wide read-only data, not as write-through), but
> in both cases the caching would plug into the client libraries.  There are
> some CDS notes from emperor:
>
> 	http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache
> 	http://pad.ceph.com/p/rbd-shared-read-cache
> 	http://www.youtube.com/watch?v=SVgBdUv_Lv4&t=70m11s
>
> Note that you can also accomplish this with the kernel rbd driver by
> layering dm-cache or bcache or something similar on top and running it in
> write-through mode.  Most clients are (KVM+)librbd, though, so eventually
> a userspace implementation for librbd (or maybe librados) makes sense.
>
> sage
>
>
>> Does this make sense?
>>
>> - Luis
>>
>> On 07/07/2014 03:29 PM, Sage Weil wrote:
>>> On Mon, 7 Jul 2014, Luis Pabon wrote:
>>>> Hi all,
>>>>       I am working on OSDMonitor.cc:5325 and wanted to confirm the
>>>> following
>>>> read_forward cache tier transition:
>>>>
>>>>       readforward -> forward || writeback || (any && num_objects_dirty ==
>>>> 0)
>>>>       forward -> writeback || readforward || (any && num_objects_dirty ==
>>>> 0)
>>>>       writeback -> readforward || forward
>>>>
>>>> Is this the correct cache tier state transition?
>>> That looks right to me.
>>>
>>> By the way, I had a thought after we spoke that we probably want something
>>> that is somewhere inbetween the current writeback behavior (promote on
>>> first read) and the read_forward behavior (never promote on read).  I
>>> suspect a good all-around policy is something like promote on second read?
>>> This should probably be rolled into the writeback mode as a tunable...
>>>
>>> sage
>>>
>>>



* Re: Cache tier READ_FORWARD transition
  2014-07-08 16:01     ` Sage Weil
  2014-07-09 17:46       ` Luis Pabon
@ 2014-07-10  4:34       ` Alexandre DERUMIER
  1 sibling, 0 replies; 11+ messages in thread
From: Alexandre DERUMIER @ 2014-07-10  4:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel, Luis Pabón

>>Note that you can also accomplish this with the kernel rbd driver by 
>>layering dm-cache or bcache or something similar on top and running it in 
>>write-through mode.  Most clients are (KVM+)librbd, though, so eventually 
>>a userspace implementation for librbd (or maybe librados) makes sense.

I vote for this; it would be wonderful to have a client cache at the 
librbd level!


----- Original Message -----

From: "Sage Weil" <sweil@redhat.com>
To: "Luis Pabón" <lpabon@redhat.com>
Cc: "Mark Nelson" <mnelson@redhat.com>, ceph-devel@vger.kernel.org
Sent: Tuesday, 8 July 2014 18:01:46
Subject: Re: Cache tier READ_FORWARD transition

On Mon, 7 Jul 2014, Luis Pabón wrote:
> What about the following usecase (please forgive some of my ceph architecture 
> ignorance): 
> 
> If it was possible to setup OSD caching tier at the host (if the host had a 
> dedicated SSD for accelerating I/O), then caching pools could be created to 
> cache VM rbds, since they are inherently exclusive to a single host. Using a 
> write through (or a readonly, depending on the workload) policy would have a 
> major increase in VM IOPs. Using writethrough or readonly policy would also 
> ensure any writes are first written to the back end storage tier. Enabling 
> hosts to service most of their VM I/O reads would also increases the overall 
> IOPs of the back end storage tier. 

This could be accomplished by doing a rados pool per client host. The 
rados caching only works in as a writeback cache, though, not 
write-through, so you really need to replicate it for it to be usable in 
practice. So although it's possible, this isn't a particularly attractive 
approach. 

What you're describing is really a client-side write-through cache, either 
for librbd or librados. We've discussed this in the past (mostly in the 
context of a shared host-wide read-only data, not as write-through), but 
in both cases the caching would plug into the client libraries. There are 
some CDS notes from emperor: 

http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache 
http://pad.ceph.com/p/rbd-shared-read-cache 
http://www.youtube.com/watch?v=SVgBdUv_Lv4&t=70m11s 

Note that you can also accomplish this with the kernel rbd driver by 
layering dm-cache or bcache or something similar on top and running it in 
write-through mode. Most clients are (KVM+)librbd, though, so eventually 
a userspace implementation for librbd (or maybe librados) makes sense. 

sage 


> Does this make sense? 
> 
> - Luis 
> 
> On 07/07/2014 03:29 PM, Sage Weil wrote: 
> > On Mon, 7 Jul 2014, Luis Pabon wrote: 
> > > Hi all, 
> > > I am working on OSDMonitor.cc:5325 and wanted to confirm the 
> > > following 
> > > read_forward cache tier transition: 
> > > 
> > > readforward -> forward || writeback || (any && num_objects_dirty == 
> > > 0) 
> > > forward -> writeback || readforward || (any && num_objects_dirty == 
> > > 0) 
> > > writeback -> readforward || forward 
> > > 
> > > Is this the correct cache tier state transition? 
> > That looks right to me. 
> > 
> > By the way, I had a thought after we spoke that we probably want something 
> > that is somewhere inbetween the current writeback behavior (promote on 
> > first read) and the read_forward behavior (never promote on read). I 
> > suspect a good all-around policy is something like promote on second read? 
> > This should probably be rolled into the writeback mode as a tunable... 
> > 
> > sage 
> > 
> > 


Thread overview: 11+ messages
2014-07-07 16:29 Cache tier READ_FORWARD transition Luis Pabon
2014-07-07 19:29 ` Sage Weil
2014-07-07 19:38   ` Mark Nelson
2014-07-07 19:43     ` Sage Weil
2014-07-07 21:02       ` Mark Nelson
2014-07-07 19:45     ` Sage Weil
2014-07-07 21:03   ` Luis Pabón
2014-07-07 21:31   ` Luis Pabón
2014-07-08 16:01     ` Sage Weil
2014-07-09 17:46       ` Luis Pabon
2014-07-10  4:34       ` Alexandre DERUMIER
