From: "Matt W. Benjamin" <matt@cohortfs.com>
To: Zhiqiang Wang <zhiqiang.wang@intel.com>
Cc: sjust@redhat.com, ceph-devel@vger.kernel.org,
Sage Weil <sweil@redhat.com>
Subject: Re: The design of the eviction improvement
Date: Wed, 22 Jul 2015 00:06:46 -0400 (EDT) [thread overview]
Message-ID: <1876849988.146.1437538006685.JavaMail.root@thunderbeast.private.linuxbox.com> (raw)
In-Reply-To: <06E7D85B3BA36C4DB207FEDE871C5348A001D9@SHSMSX101.ccr.corp.intel.com>
Hi,
----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:
> Hi Matt,
>
> > -----Original Message-----
> > From: Matt W. Benjamin [mailto:matt@cohortfs.com]
> > Sent: Tuesday, July 21, 2015 10:16 PM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
> > Subject: Re: The design of the eviction improvement
> >
> > Hi,
> >
> > Couple of points.
> >
> > 1) a successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU
> > implementation with 2 levels currently, plus a pinned queue, that
> addresses
> > stuff like partitioning ("sharding"), scan resistance, and
> coordination w/lookup
> > tables. We might extend/re-use it.
>
> The MQ algorithm is more complex, and seems like it has more overhead
> than 2Q. The approximate 2Q algorithm here combines some benefits of
> the clock algorithm, and works well on the linux kernel. MQ could be
> another choice. There are some other candidates like LIRS, ARC, etc.,
> which have been deployed in some practical systems.
MQ has been deployed in practical systems, and is more general.
>
> >
> > 2) I'm a bit confused by active/inactive vocabulary, dimensioning of
> cache
> > segments (are you proposing to/do we now always cache whole
> objects?), and
> > cost of "looking for" dirty objects; I suspect that it makes sense
> to amortize
> > the cost of locating segments eligible to be flushed, rather than
> minimize
> > bookkeeping.
>
> Though currently the caching decisions are made in the unit of object
> as Greg pointed out in another mail, I think we still have something
> to improve here. I'll come back to this some time later.
>
> >
> > Matt
> >
> > ----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage
> Weil
> > > > Sent: Tuesday, July 21, 2015 6:38 AM
> > > > To: Wang, Zhiqiang
> > > > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > > > Subject: Re: The design of the eviction improvement
> > > >
> > > > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > > > Hi all,
> > > > >
> > > > > This is a follow-up of one of the CDS session at
> > > >
> > >
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti
> > > eri
> > > > ng_eviction. We discussed the drawbacks of the current eviction
> > > algorithm and
> > > > several ways to improve it. Seems like the LRU variants is the
> right
> > > way to go. I
> > > > come up with some design points after the CDS, and want to
> discuss
> > > it with you.
> > > > It is an approximate 2Q algorithm, combining some benefits of
> the
> > > clock
> > > > algorithm, similar to what the linux kernel does for the page
> > > cache.
> > > >
> > > > Unfortunately I missed this last CDS so I'm behind on the
> > > discussion. I have a
> > > > few questions though...
> > > >
> > > > > # Design points:
> > > > >
> > > > > ## LRU lists
> > > > > - Maintain LRU lists at the PG level.
> > > > > The SharedLRU and SimpleLRU implementation in the current
> code
> > > have a
> > > > > max_size, which limits the max number of elements in the
> list.
> > > This
> > > > > mostly looks like a MRU, though its name implies they are
> LRUs.
> > > Since
> > > > > the object size may vary in a PG, it's not possible to
> caculate
> > > the
> > > > > total number of objects which the cache tier can hold ahead
> of
> > > time.
> > > > > We need a new LRU implementation with no limit on the size.
> > > >
> > > > This last sentence seems to me to be the crux of it. Assuming
> we
> > > have an
> > > > OSD based by flash storing O(n) objects, we need a way to
> maintain
> > > an LRU of
> > > > O(n) objects in memory. The current hitset-based approach was
> taken
> > > based
> > > > on the assumption that this wasn't feasible--or at least we
> didn't
> > > know how to
> > > > implmement such a thing. If it is, or we simply want to
> stipulate
> > > that cache
> > > > tier OSDs get gobs of RAM to make it possible, then lots of
> better
> > > options
> > > > become possible...
> > > >
> > > > Let's say you have a 1TB SSD, with an average object size of 1MB
> --
> > > that's
> > > > 1 million objects. At maybe ~100bytes per object of RAM for an
> LRU
> > > entry
> > > > that's 100MB... so not so unreasonable, perhaps!
> > >
> > > I was having the same question before proposing this. I did the
> > > similar calculation and thought it would be ok to use this many
> memory
> > > :-)
> > >
> > > >
> > > > > - Two lists for each PG: active and inactive Objects are
> first
> > > put
> > > > > into the inactive list when they are accessed, and moved
> between
> > > these two
> > > > lists based on some criteria.
> > > > > Object flag: active, referenced, unevictable, dirty.
> > > > > - When an object is accessed:
> > > > > 1) If it's not in both of the lists, it's put on the top of
> the
> > > > > inactive list
> > > > > 2) If it's in the inactive list, and the referenced flag is
> not
> > > set, the referenced
> > > > flag is set, and it's moved to the top of the inactive list.
> > > > > 3) If it's in the inactive list, and the referenced flag is
> set,
> > > the referenced flag
> > > > is cleared, and it's removed from the inactive list, and put on
> top
> > > of the active
> > > > list.
> > > > > 4) If it's in the active list, and the referenced flag is not
> set,
> > > the referenced
> > > > flag is set, and it's moved to the top of the active list.
> > > > > 5) If it's in the active list, and the referenced flag is
> set,
> > > it's moved to the top
> > > > of the active list.
> > > > > - When selecting objects to evict:
> > > > > 1) Objects at the bottom of the inactive list are selected to
> > > evict. They are
> > > > removed from the inactive list.
> > > > > 2) If the number of the objects in the inactive list becomes
> low,
> > > some of the
> > > > objects at the bottom of the active list are moved to the
> inactive
> > > list. For those
> > > > objects which have the referenced flag set, they are given one
> more
> > > chance in
> > > > the active list. They are moved to the top of the active list
> with
> > > the referenced
> > > > flag cleared. For those objects which don't have the referenced
> flag
> > > set, they
> > > > are moved to the inactive list, with the referenced flag set.
> So
> > > that they can be
> > > > quickly promoted to the active list when necessary.
> > > > >
> > > > > ## Combine flush with eviction
> > > > > - When evicting an object, if it's dirty, it's flushed first.
> > > After flushing, it's
> > > > evicted. If not dirty, it's evicted directly.
> > > > > - This means that we won't have separate activities and won't
> set
> > > different
> > > > ratios for flush and evict. Is there a need to do so?
> > > > > - Number of objects to evict at a time. 'evict_effort' acts as
> the
> > > priority, which
> > > > is used to calculate the number of objects to evict.
> > > >
> > > > As someone else mentioned in a follow-up, the reason we let the
> > > dirty level be
> > > > set lower than the full level is that it provides headroom so
> that
> > > objects can be
> > > > quickly evicted (delete, no flush) to make room for new writes
> or
> > > new
> > > > promotions.
> > > >
> > > > That said, we probably can/should streamline the flush so that
> an
> > > evict can
> > > > immediately follow without waiting for the agent to come around
> > > again.
> > > > (I don't think we do that now?)
> > >
> > > I was afraid of having to maintain different lists for
> flush/evict. So
> > > I proposed to combine flush with evict. But seems like we don't
> need
> > > to. When reaching the dirty ratio and selecting objects to flush,
> we
> > > look for dirty objects from the inactive list. The objects are
> marked
> > > clean and kept in the same position in the LRU list after
> flushing.
> > > When evicting objects, clean objects in the inactive list are
> > > selected. Dirty objects are ignored. This will keep the
> flush/evict
> > > the same as current implementation. Make sense?
> > >
> > > >
> > > > sage
> > > >
> > > >
> > > > > ## LRU lists Snapshotting
> > > > > - The two lists are snapshotted persisted periodically.
> > > > > - Only one copy needs to be saved. The old copy is removed
> when
> > > persisting
> > > > the lists. The saved lists are used to restore the LRU lists
> when
> > > OSD reboots.
> > > > >
> > > > > Any comments/feedbacks are welcomed.
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > > info at http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel" in the body
> > > > of a message to majordomo@vger.kernel.org More majordomo info
> at
> > > > http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> > > in
> > > the body of a message to majordomo@vger.kernel.org More majordomo
> info
> > > at http://vger.kernel.org/majordomo-info.html
> >
> > --
> > Matt Benjamin
> > CohortFS, LLC.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> >
> > http://cohortfs.com
> >
> > tel. 734-761-4689
> > fax. 734-769-8938
> > cel. 734-216-5309
Matt
--
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://cohortfs.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
next prev parent reply other threads:[~2015-07-22 4:06 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1539008200.49.1437488109510.JavaMail.root@thunderbeast.private.linuxbox.com>
2015-07-21 14:15 ` The design of the eviction improvement Matt W. Benjamin
2015-07-21 14:25 ` Gregory Farnum
2015-07-22 3:57 ` Wang, Zhiqiang
2015-07-22 4:06 ` Matt W. Benjamin [this message]
[not found] <639332441.73.1437495369873.JavaMail.root@thunderbeast.private.linuxbox.com>
2015-07-21 16:24 ` Matt W. Benjamin
2015-07-20 8:47 Wang, Zhiqiang
2015-07-20 20:53 ` Allen Samuels
2015-07-20 22:38 ` Sage Weil
2015-07-21 4:08 ` Wang, Zhiqiang
2015-07-21 13:28 ` Sage Weil
2015-07-22 3:34 ` Wang, Zhiqiang
2015-07-22 12:56 ` Sage Weil
2015-07-22 18:40 ` Allen Samuels
2015-07-22 18:45 ` Matt W. Benjamin
2015-07-22 18:51 ` Sage Weil
2015-07-22 19:14 ` Allen Samuels
2015-07-22 21:53 ` Sage Weil
2015-07-22 22:45 ` Allen Samuels
2015-07-23 3:46 ` Wang, Zhiqiang
2015-07-23 5:52 ` Wang, Zhiqiang
[not found] ` <08fcbe8a.9Ro.9Gf.1U.hCy4iF@mailjet.com>
2015-07-21 3:55 ` Wang, Zhiqiang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1876849988.146.1437538006685.JavaMail.root@thunderbeast.private.linuxbox.com \
--to=matt@cohortfs.com \
--cc=ceph-devel@vger.kernel.org \
--cc=sjust@redhat.com \
--cc=sweil@redhat.com \
--cc=zhiqiang.wang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.