Re: Cache tier READ_FORWARD transition

From: Mark Nelson <mark.nelson@inktank.com>
To: Sage Weil <sweil@redhat.com>
Cc: Luis Pabon <lpabon@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Cache tier READ_FORWARD transition
Date: Mon, 07 Jul 2014 16:02:26 -0500
Message-ID: <53BB0AE2.6000903@inktank.com>
In-Reply-To: <alpine.DEB.2.00.1407071242050.24973@cobra.newdream.net>

On 07/07/2014 02:43 PM, Sage Weil wrote:
> On Mon, 7 Jul 2014, Mark Nelson wrote:
>> On 07/07/2014 02:29 PM, Sage Weil wrote:
>>> On Mon, 7 Jul 2014, Luis Pabon wrote:
>>>> Hi all,
>>>>       I am working on OSDMonitor.cc:5325 and wanted to confirm the following
>>>> read_forward cache tier transition:
>>>>
>>>>       readforward -> forward || writeback || (any && num_objects_dirty == 0)
>>>>       forward -> writeback || readforward || (any && num_objects_dirty == 0)
>>>>       writeback -> readforward || forward
>>>>
>>>> Is this the correct cache tier state transition?
>>>
>>> That looks right to me.
>>>
>>> By the way, I had a thought after we spoke that we probably want something
>>> that is somewhere inbetween the current writeback behavior (promote on
>>> first read) and the read_forward behavior (never promote on read).  I
>>> suspect a good all-around policy is something like promote on second read?
>>> This should probably be rolled into the writeback mode as a tunable...
>>
>> That would be a good start I think.  What about some kind of scheme that also
>> favours promoting small objects over larger ones?  It could be as simple as
>> increasing the number of reads necessary to do a promotion based on the object
>> size.
>>
>> ie something like:
>>
>> <= 64k object = 1 read
>> <= 512k object = 2 read
>> else 3 read
>>
>> That would make the behaviour for default RBD object sizes always 3 read, but
>> could keep big objects out of the cache tier for RGW.
>
> We don't have enough information to do that right now, since on a miss we
> redirect the client instead of proxying them and never learn what the
> actual object size is.
>
> If/after we start doing proxying for the reads, then lots of other stuff
> becomes possible... but I think we'll need to be careful about choosing
> where to add complexity.
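
For reference, the transition rules quoted at the top of the thread can be
written down as a small check.  This is only a sketch of the rules as stated
in this thread, not the code in OSDMonitor.cc; the function name, the mode
strings, and the choice to reject source modes the rules do not mention are
all assumptions made for illustration.

```python
# Sketch only: mirrors the rules quoted in this thread, not OSDMonitor.cc.
# Targets that are allowed regardless of how many dirty objects remain.
ALWAYS_ALLOWED = {
    "readforward": {"forward", "writeback"},
    "forward": {"writeback", "readforward"},
    "writeback": {"readforward", "forward"},
}

# Source modes that may move to any mode once num_objects_dirty == 0.
ANY_WHEN_CLEAN = {"readforward", "forward"}


def transition_allowed(current, target, num_objects_dirty):
    """Return True if the quoted rules permit current -> target."""
    if target in ALWAYS_ALLOWED.get(current, set()):
        return True
    # Assumption: source modes not listed in the quoted rules are rejected.
    return current in ANY_WHEN_CLEAN and num_objects_dirty == 0
```

With these rules, writeback can only move to readforward or forward, while
the two forwarding modes can go anywhere once the cache has no dirty objects.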

Ok, that makes sense.  Ignoring RGW for the moment, on the RBD side can
we infer the object sizes from the image order?  Can we provide a hint
in some way?  I guess my assumptions specifically for RBD are:

1) For large reads from any object:

Very low promotion priority, since spinning disks can do this fast.  Can
we get this just from the read length?

2) For small reads from (presumed) large objects:

Sequential IO: probably not at all (especially if we have big enough
read-ahead on the base pool OSD fs)?  Can we save/check the previous read
position(s) of the same object in addition to the previous attempt?  Too
complex?

Random IO: maybe only on the 3rd read attempt?  The worst reads will come
out of buffer cache anyway.  Given how expensive promotion is for large
objects, it seems to me we need to promote very slowly and infrequently.

3) For reads from (presumed) small objects:

Do the promotion right away, since the promotion is small and the SSDs
can do small writes faster than the spinning disks can do small reads?
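
To make the three cases concrete, here is a rough sketch of how such a
policy might look.  It assumes the object size is supplied as a hint
derived from the RBD image order (objects are 2^order bytes, so the
default order of 22 gives 4 MiB), since the cache tier does not learn the
real size on a miss.  All thresholds, names, and the use of None for
"do not promote" are hypothetical and only for illustration.

```python
LARGE_READ_BYTES = 512 * 1024    # hypothetical cutoff for a "large" read
SMALL_OBJECT_BYTES = 64 * 1024   # hypothetical cutoff for a "small" object
LOW_PRIORITY_READS = 8           # hypothetical stand-in for "very low priority"


def rbd_object_size(order):
    """Object size hint from the RBD image order: 2**order bytes."""
    return 1 << order


def is_sequential(prev_offset, prev_len, offset):
    """True if this read starts exactly where the previous one ended."""
    return prev_offset is not None and offset == prev_offset + prev_len


def reads_before_promotion(read_len, object_size, sequential):
    """Reads needed before promoting, or None for "do not promote"."""
    # Case 1: large read from any object, very low priority.
    if read_len >= LARGE_READ_BYTES:
        return LOW_PRIORITY_READS
    # Case 2: small read from a (presumed) large object.
    if object_size > SMALL_OBJECT_BYTES:
        return None if sequential else 3
    # Case 3: read from a (presumed) small object, promote right away.
    return 1
```

For a default-order RBD image, a 4k random read would need three attempts
here, and a 4k sequential read would never trigger a promotion.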

