All of lore.kernel.org
 help / color / mirror / Atom feed
* The design of the eviction improvement
@ 2015-07-20  8:47 Wang, Zhiqiang
  2015-07-20 20:53 ` Allen Samuels
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-20  8:47 UTC (permalink / raw)
  To: Sage Weil, sjust@redhat.com, ceph-devel@vger.kernel.org

Hi all,

This is a follow-up of one of the CDS session at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants is the right way to go. I come up with some design points after the CDS, and want to discuss it with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache.

# Design points:

## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementation in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like a MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to caculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size.
- Two lists for each PG: active and inactive
Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in both of the lists, it's put on the top of the inactive list
2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list.
4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list.
2) If the number of the objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. For those objects which have the referenced flag set, they are given one more chance in the active list. They are moved to the top of the active list with the referenced flag cleared. For those objects which don't have the referenced flag set, they are moved to the inactive list, with the referenced flag set. So that they can be quickly promoted to the active list when necessary.

## Combine flush with eviction
- When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time. 'evict_effort' acts as the priority, which is used to calculate the number of objects to evict.

## LRU lists Snapshotting
- The two lists are snapshotted persisted periodically.
- Only one copy needs to be saved. The old copy is removed when persisting the lists. The saved lists are used to restore the LRU lists when OSD reboots.

Any comments/feedbacks are welcomed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-20  8:47 Wang, Zhiqiang
@ 2015-07-20 20:53 ` Allen Samuels
  2015-07-20 22:38 ` Sage Weil
       [not found] ` <08fcbe8a.9Ro.9Gf.1U.hCy4iF@mailjet.com>
  2 siblings, 0 replies; 21+ messages in thread
From: Allen Samuels @ 2015-07-20 20:53 UTC (permalink / raw)
  To: Wang, Zhiqiang, Sage Weil, sjust@redhat.com,
	ceph-devel@vger.kernel.org

This seems much better than the current mechanism. Do you have an estimate of the memory consumption of the two lists? (In terms of bytes/object?)


Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Wang, Zhiqiang
Sent: Monday, July 20, 2015 1:47 AM
To: Sage Weil; sjust@redhat.com; ceph-devel@vger.kernel.org
Subject: The design of the eviction improvement

Hi all,

This is a follow-up of one of the CDS session at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants is the right way to go. I come up with some design points after the CDS, and want to discuss it with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache.

# Design points:

## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementation in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like a MRU, though its name implies they are LRUs. Since the object size may vary in a PG, it's not possible to caculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size.
- Two lists for each PG: active and inactive Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in both of the lists, it's put on the top of the inactive list
2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list.
4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list.
2) If the number of the objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. For those objects which have the referenced flag set, they are given one more chance in the active list. They are moved to the top of the active list with the referenced flag cleared. For those objects which don't have the referenced flag set, they are moved to the inactive list, with the referenced flag set. So that they can be quickly promoted to the active list when necessary.

## Combine flush with eviction
- When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time. 'evict_effort' acts as the priority, which is used to calculate the number of objects to evict.

## LRU lists Snapshotting
- The two lists are snapshotted persisted periodically.
- Only one copy needs to be saved. The old copy is removed when persisting the lists. The saved lists are used to restore the LRU lists when OSD reboots.

Any comments/feedbacks are welcomed.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: The design of the eviction improvement
  2015-07-20  8:47 Wang, Zhiqiang
  2015-07-20 20:53 ` Allen Samuels
@ 2015-07-20 22:38 ` Sage Weil
  2015-07-21  4:08   ` Wang, Zhiqiang
       [not found] ` <08fcbe8a.9Ro.9Gf.1U.hCy4iF@mailjet.com>
  2 siblings, 1 reply; 21+ messages in thread
From: Sage Weil @ 2015-07-20 22:38 UTC (permalink / raw)
  To: Wang, Zhiqiang; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org

On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> Hi all,
> 
> This is a follow-up of one of the CDS session at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variants is the right way to go. I come up with some design points after the CDS, and want to discuss it with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache.

Unfortunately I missed this last CDS so I'm behind on the discussion.  I 
have a few questions though...
 
> # Design points:
> 
> ## LRU lists
> - Maintain LRU lists at the PG level.
> The SharedLRU and SimpleLRU implementation in the current code have a 
> max_size, which limits the max number of elements in the list. This 
> mostly looks like a MRU, though its name implies they are LRUs. Since 
> the object size may vary in a PG, it's not possible to caculate the 
> total number of objects which the cache tier can hold ahead of time. We 
> need a new LRU implementation with no limit on the size.

This last sentence seems to me to be the crux of it.  Assuming we have an 
OSD based by flash storing O(n) objects, we need a way to maintain an LRU 
of O(n) objects in memory.  The current hitset-based approach was taken 
based on the assumption that this wasn't feasible--or at least we 
didn't know how to implmement such a thing.  If it is, or we simply want 
to stipulate that cache tier OSDs get gobs of RAM to make it possible, 
then lots of better options become possible...

Let's say you have a 1TB SSD, with an average object size of 1MB -- that's 
1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry 
that's 100MB... so not so unreasonable, perhaps!

> - Two lists for each PG: active and inactive
> Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria.
> Object flag: active, referenced, unevictable, dirty.
> - When an object is accessed:
> 1) If it's not in both of the lists, it's put on the top of the inactive list
> 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list.
> 3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list.
> 4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list.
> 5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list.
> - When selecting objects to evict:
> 1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list.
> 2) If the number of the objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. For those objects which have the referenced flag set, they are given one more chance in the active list. They are moved to the top of the active list with the referenced flag cleared. For those objects which don't have the referenced flag set, they are moved to the inactive list, with the referenced flag set. So that they can be quickly promoted to the active list when necessary.
> 
> ## Combine flush with eviction
> - When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly.
> - This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so?
> - Number of objects to evict at a time. 'evict_effort' acts as the priority, which is used to calculate the number of objects to evict.

As someone else mentioned in a follow-up, the reason we let the dirty 
level be set lower than the full level is that it provides headroom so 
that objects can be quickly evicted (delete, no flush) to make room for 
new writes or new promotions.

That said, we probably can/should streamline the flush so that an evict 
can immediately follow without waiting for the agent to come around again.  
(I don't think we do that now?)

sage

 
> ## LRU lists Snapshotting
> - The two lists are snapshotted persisted periodically.
> - Only one copy needs to be saved. The old copy is removed when persisting the lists. The saved lists are used to restore the LRU lists when OSD reboots.
> 
> Any comments/feedbacks are welcomed.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
       [not found] ` <08fcbe8a.9Ro.9Gf.1U.hCy4iF@mailjet.com>
@ 2015-07-21  3:55   ` Wang, Zhiqiang
  0 siblings, 0 replies; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-21  3:55 UTC (permalink / raw)
  To: Nick Fisk
  Cc: 'Sage Weil', sjust@redhat.com, ceph-devel@vger.kernel.org

Hi Nick,

> -----Original Message-----
> From: Nick Fisk [mailto:nick@fisk.me.uk]
> Sent: Monday, July 20, 2015 5:28 PM
> To: Wang, Zhiqiang; 'Sage Weil'; sjust@redhat.com;
> ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> Hi,
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Wang, Zhiqiang
> > Sent: 20 July 2015 09:47
> > To: Sage Weil <sweil@redhat.com>; sjust@redhat.com; ceph-
> > devel@vger.kernel.org
> > Subject: The design of the eviction improvement
> >
> > Hi all,
> >
> > This is a follow-up of one of the CDS session at
> > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_
> > tiering_eviction. We discussed the drawbacks of the current eviction
> > algorithm and several ways to improve it. Seems like the LRU variants
> > is
> the
> > right way to go. I come up with some design points after the CDS, and
> > want to discuss it with you. It is an approximate 2Q algorithm,
> > combining some benefits of the clock algorithm, similar to what the
> > linux kernel does for
> the
> > page cache.
> >
> > # Design points:
> >
> > ## LRU lists
> > - Maintain LRU lists at the PG level.
> > The SharedLRU and SimpleLRU implementation in the current code have a
> > max_size, which limits the max number of elements in the list. This
> > mostly looks like a MRU, though its name implies they are LRUs. Since
> > the object
> size
> > may vary in a PG, it's not possible to caculate the total number of
> objects
> > which the cache tier can hold ahead of time. We need a new LRU
> > implementation with no limit on the size.
> > - Two lists for each PG: active and inactive Objects are first put
> > into
> the
> > inactive list when they are accessed, and moved between these two
> > lists based on some criteria.
> > Object flag: active, referenced, unevictable, dirty.
> > - When an object is accessed:
> > 1) If it's not in both of the lists, it's put on the top of the
> > inactive
> list
> > 2) If it's in the inactive list, and the referenced flag is not set,
> > the
> referenced
> > flag is set, and it's moved to the top of the inactive list.
> > 3) If it's in the inactive list, and the referenced flag is set, the
> referenced flag
> > is cleared, and it's removed from the inactive list, and put on top of
> > the
> active
> > list.
> > 4) If it's in the active list, and the referenced flag is not set, the
> referenced
> > flag is set, and it's moved to the top of the active list.
> > 5) If it's in the active list, and the referenced flag is set, it's
> > moved
> to the top
> > of the active list.
> > - When selecting objects to evict:
> > 1) Objects at the bottom of the inactive list are selected to evict.
> > They
> are
> > removed from the inactive list.
> > 2) If the number of the objects in the inactive list becomes low, some
> > of
> the
> > objects at the bottom of the active list are moved to the inactive list.
> For
> > those objects which have the referenced flag set, they are given one
> > more chance in the active list. They are moved to the top of the
> > active list
> with the
> > referenced flag cleared. For those objects which don't have the
> > referenced flag set, they are moved to the inactive list, with the
> > referenced flag
> set. So
> > that they can be quickly promoted to the active list when necessary.
> >
> 
> I really like this idea but just out of interest, there must be a point where the
> overhead of managing much larger lists of very cold objects starts to impact on
> the gains of having exactly the right objects in each tier. If 90% of your hot IO is
> in 10% of the total data, how much extra benefit would you get by tracking all
> objects vs just tracking the top 10,20,30%...etc and evicting randomly after
> that?  If these objects are being accessed infrequently, the impact of
> re-promoting is probably minimal and if the promotion code can get to a point
> where it is being a bit more intelligent about what objects are promoted then
> this is probably even more so?

The idea is that the lists only hold the objects in the cache tier. For those objects which are cold enough, it's evicted from the cache tier and removed from the lists. Also, the lists are maintained at the PG level. I guess the lists won't be too extremely large? In your example of the 90%/10% data access, it may be right that randomly evicting the 90% cold data is good enough. But we need a way to know what the 10% of the hot data are. Also, we can't assume the 90%/10% pattern for every workload.

> 
> > ## Combine flush with eviction
> > - When evicting an object, if it's dirty, it's flushed first. After
> flushing, it's
> > evicted. If not dirty, it's evicted directly.
> > - This means that we won't have separate activities and won't set
> different
> > ratios for flush and evict. Is there a need to do so?
> > - Number of objects to evict at a time. 'evict_effort' acts as the
> priority, which
> > is used to calculate the number of objects to evict.
> >
> 
> I think you want to flush at a lower limit because flushes can be more expensive
> that evictions. Also you want more headroom for flushes so that you don't run
> the risk of filling up the tier with dirty objects and then running out of space.
> Have you also seen the recent two speed flushing commit which provides high
> and low speeds for tier flushing?

Yes, what you said are right. My previous thought is that we may need to maintain different LRU lists for flush and evict if we separate them. But seems not. Please check my reply to Sage in another mail.

> 
> On a related note, I wonder if it would make sense to make sure a read
> promotion can't trigger a dirty object flush. You probably don't want to start
> doing flushing if you get a read storm, something like large sequential read on a
> file. I think on most other implementations of caching I can think of, the write
> cache is normally ring fenced in this sense.

Currently the flush is triggered when the dirty ratio reaches the threshold when handling op request. It's not triggered by promotion. For sequential read, maybe a better way is to proxy them. Even if we decides to do a read promotion, this doesn't increase the dirty ratio.

> 
> > ## LRU lists Snapshotting
> > - The two lists are snapshotted persisted periodically.
> > - Only one copy needs to be saved. The old copy is removed when
> > persisting the lists. The saved lists are used to restore the LRU
> > lists when OSD
> reboots.
> >
> > Any comments/feedbacks are welcomed.
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in
> the
> > body of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> 
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-20 22:38 ` Sage Weil
@ 2015-07-21  4:08   ` Wang, Zhiqiang
  2015-07-21 13:28     ` Sage Weil
  0 siblings, 1 reply; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-21  4:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org



> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, July 21, 2015 6:38 AM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: Re: The design of the eviction improvement
> 
> On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > Hi all,
> >
> > This is a follow-up of one of the CDS session at
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
> ng_eviction. We discussed the drawbacks of the current eviction algorithm and
> several ways to improve it. Seems like the LRU variants is the right way to go. I
> come up with some design points after the CDS, and want to discuss it with you.
> It is an approximate 2Q algorithm, combining some benefits of the clock
> algorithm, similar to what the linux kernel does for the page cache.
> 
> Unfortunately I missed this last CDS so I'm behind on the discussion.  I have a
> few questions though...
> 
> > # Design points:
> >
> > ## LRU lists
> > - Maintain LRU lists at the PG level.
> > The SharedLRU and SimpleLRU implementation in the current code have a
> > max_size, which limits the max number of elements in the list. This
> > mostly looks like a MRU, though its name implies they are LRUs. Since
> > the object size may vary in a PG, it's not possible to caculate the
> > total number of objects which the cache tier can hold ahead of time.
> > We need a new LRU implementation with no limit on the size.
> 
> This last sentence seems to me to be the crux of it.  Assuming we have an
> OSD based by flash storing O(n) objects, we need a way to maintain an LRU of
> O(n) objects in memory.  The current hitset-based approach was taken based
> on the assumption that this wasn't feasible--or at least we didn't know how to
> implmement such a thing.  If it is, or we simply want to stipulate that cache
> tier OSDs get gobs of RAM to make it possible, then lots of better options
> become possible...
> 
> Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
> 1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
> that's 100MB... so not so unreasonable, perhaps!

I was having the same question before proposing this. I did the similar calculation and thought it would be ok to use this many memory :-)

> 
> > - Two lists for each PG: active and inactive Objects are first put
> > into the inactive list when they are accessed, and moved between these two
> lists based on some criteria.
> > Object flag: active, referenced, unevictable, dirty.
> > - When an object is accessed:
> > 1) If it's not in both of the lists, it's put on the top of the
> > inactive list
> > 2) If it's in the inactive list, and the referenced flag is not set, the referenced
> flag is set, and it's moved to the top of the inactive list.
> > 3) If it's in the inactive list, and the referenced flag is set, the referenced flag
> is cleared, and it's removed from the inactive list, and put on top of the active
> list.
> > 4) If it's in the active list, and the referenced flag is not set, the referenced
> flag is set, and it's moved to the top of the active list.
> > 5) If it's in the active list, and the referenced flag is set, it's moved to the top
> of the active list.
> > - When selecting objects to evict:
> > 1) Objects at the bottom of the inactive list are selected to evict. They are
> removed from the inactive list.
> > 2) If the number of the objects in the inactive list becomes low, some of the
> objects at the bottom of the active list are moved to the inactive list. For those
> objects which have the referenced flag set, they are given one more chance in
> the active list. They are moved to the top of the active list with the referenced
> flag cleared. For those objects which don't have the referenced flag set, they
> are moved to the inactive list, with the referenced flag set. So that they can be
> quickly promoted to the active list when necessary.
> >
> > ## Combine flush with eviction
> > - When evicting an object, if it's dirty, it's flushed first. After flushing, it's
> evicted. If not dirty, it's evicted directly.
> > - This means that we won't have separate activities and won't set different
> ratios for flush and evict. Is there a need to do so?
> > - Number of objects to evict at a time. 'evict_effort' acts as the priority, which
> is used to calculate the number of objects to evict.
> 
> As someone else mentioned in a follow-up, the reason we let the dirty level be
> set lower than the full level is that it provides headroom so that objects can be
> quickly evicted (delete, no flush) to make room for new writes or new
> promotions.
> 
> That said, we probably can/should streamline the flush so that an evict can
> immediately follow without waiting for the agent to come around again.
> (I don't think we do that now?)

I was afraid of having to maintain different lists for flush/evict. So I proposed to combine flush with evict. But seems like we don't need to. When reaching the dirty ratio and selecting objects to flush, we look for dirty objects from the inactive list. The objects are marked clean and kept in the same position in the LRU list after flushing. When evicting objects, clean objects in the inactive list are selected. Dirty objects are ignored. This will keep the flush/evict the same as current implementation. Make sense?

> 
> sage
> 
> 
> > ## LRU lists Snapshotting
> > - The two lists are snapshotted persisted periodically.
> > - Only one copy needs to be saved. The old copy is removed when persisting
> the lists. The saved lists are used to restore the LRU lists when OSD reboots.
> >
> > Any comments/feedbacks are welcomed.
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-21  4:08   ` Wang, Zhiqiang
@ 2015-07-21 13:28     ` Sage Weil
  2015-07-22  3:34       ` Wang, Zhiqiang
  0 siblings, 1 reply; 21+ messages in thread
From: Sage Weil @ 2015-07-21 13:28 UTC (permalink / raw)
  To: Wang, Zhiqiang; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org

On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Tuesday, July 21, 2015 6:38 AM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > Subject: Re: The design of the eviction improvement
> > 
> > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > Hi all,
> > >
> > > This is a follow-up of one of the CDS session at
> > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
> > ng_eviction. We discussed the drawbacks of the current eviction algorithm and
> > several ways to improve it. Seems like the LRU variants is the right way to go. I
> > come up with some design points after the CDS, and want to discuss it with you.
> > It is an approximate 2Q algorithm, combining some benefits of the clock
> > algorithm, similar to what the linux kernel does for the page cache.
> > 
> > Unfortunately I missed this last CDS so I'm behind on the discussion.  I have a
> > few questions though...
> > 
> > > # Design points:
> > >
> > > ## LRU lists
> > > - Maintain LRU lists at the PG level.
> > > The SharedLRU and SimpleLRU implementation in the current code have a
> > > max_size, which limits the max number of elements in the list. This
> > > mostly looks like a MRU, though its name implies they are LRUs. Since
> > > the object size may vary in a PG, it's not possible to caculate the
> > > total number of objects which the cache tier can hold ahead of time.
> > > We need a new LRU implementation with no limit on the size.
> > 
> > This last sentence seems to me to be the crux of it.  Assuming we have an
> > OSD based by flash storing O(n) objects, we need a way to maintain an LRU of
> > O(n) objects in memory.  The current hitset-based approach was taken based
> > on the assumption that this wasn't feasible--or at least we didn't know how to
> > implmement such a thing.  If it is, or we simply want to stipulate that cache
> > tier OSDs get gobs of RAM to make it possible, then lots of better options
> > become possible...
> > 
> > Let's say you have a 1TB SSD, with an average object size of 1MB -- that's
> > 1 million objects.  At maybe ~100bytes per object of RAM for an LRU entry
> > that's 100MB... so not so unreasonable, perhaps!
> 
> I was having the same question before proposing this. I did the similar 
> calculation and thought it would be ok to use this many memory :-)

The part that worries me now is the speed with which we can load and 
manage such a list.  Assuming it is several hundred MB, it'll take a while 
to load that into memory and set up all the pointers (assuming 
a conventional linked list structure).  Maybe tens of seconds...

I wonder if instead we should construct some sort of flat model where we 
load slabs of contiguous memory, 10's of MB each, and have the 
next/previous pointers be a (slab,position) pair.  That way we can load it 
into memory in big chunks, quickly, and be able to operate on it (adjust 
links) immediately.

Another thought: currently we use the hobject_t hash only instead of the 
full object name.  We could continue to do the same, or we could do a hash 
pair (hobject_t hash + a different hash of the rest of the object) to keep 
the representation compact.  With a model lke the above, that could get 
the object representation down to 2 u32's.  A link could be a slab + 
position (2 more u32's), and if we have prev + next that'd be just 6x4=24 
bytes per object.

With fixed-sized slots on the slabs, the slab allocator could be very 
simple.. maybe just a bitmap, a free counter, and any other trivial 
optimizations to make finding a slab's next free a slot nice and quick.

> > > - Two lists for each PG: active and inactive Objects are first put
> > > into the inactive list when they are accessed, and moved between these two
> > lists based on some criteria.
> > > Object flag: active, referenced, unevictable, dirty.
> > > - When an object is accessed:
> > > 1) If it's not in both of the lists, it's put on the top of the
> > > inactive list
> > > 2) If it's in the inactive list, and the referenced flag is not set, the referenced
> > flag is set, and it's moved to the top of the inactive list.
> > > 3) If it's in the inactive list, and the referenced flag is set, the referenced flag
> > is cleared, and it's removed from the inactive list, and put on top of the active
> > list.
> > > 4) If it's in the active list, and the referenced flag is not set, the referenced
> > flag is set, and it's moved to the top of the active list.
> > > 5) If it's in the active list, and the referenced flag is set, it's moved to the top
> > of the active list.
> > > - When selecting objects to evict:
> > > 1) Objects at the bottom of the inactive list are selected to evict. They are
> > removed from the inactive list.
> > > 2) If the number of the objects in the inactive list becomes low, some of the
> > objects at the bottom of the active list are moved to the inactive list. For those
> > objects which have the referenced flag set, they are given one more chance in
> > the active list. They are moved to the top of the active list with the referenced
> > flag cleared. For those objects which don't have the referenced flag set, they
> > are moved to the inactive list, with the referenced flag set. So that they can be
> > quickly promoted to the active list when necessary.
> > >
> > > ## Combine flush with eviction
> > > - When evicting an object, if it's dirty, it's flushed first. After flushing, it's
> > evicted. If not dirty, it's evicted directly.
> > > - This means that we won't have separate activities and won't set different
> > ratios for flush and evict. Is there a need to do so?
> > > - Number of objects to evict at a time. 'evict_effort' acts as the priority, which
> > is used to calculate the number of objects to evict.
> > 
> > As someone else mentioned in a follow-up, the reason we let the dirty level be
> > set lower than the full level is that it provides headroom so that objects can be
> > quickly evicted (delete, no flush) to make room for new writes or new
> > promotions.
> > 
> > That said, we probably can/should streamline the flush so that an evict can
> > immediately follow without waiting for the agent to come around again.
> > (I don't think we do that now?)
> 
> I was afraid of having to maintain different lists for flush/evict. So I 
> proposed to combine flush with evict. But seems like we don't need to. 
> When reaching the dirty ratio and selecting objects to flush, we look 
> for dirty objects from the inactive list. The objects are marked clean 
> and kept in the same position in the LRU list after flushing. When 
> evicting objects, clean objects in the inactive list are selected. Dirty 
> objects are ignored. This will keep the flush/evict the same as current 
> implementation. Make sense?

Yep!

sage

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: The design of the eviction improvement
       [not found] <1539008200.49.1437488109510.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2015-07-21 14:15 ` Matt W. Benjamin
  2015-07-21 14:25   ` Gregory Farnum
  2015-07-22  3:57   ` Wang, Zhiqiang
  0 siblings, 2 replies; 21+ messages in thread
From: Matt W. Benjamin @ 2015-07-21 14:15 UTC (permalink / raw)
  To: Zhiqiang Wang; +Cc: sjust, ceph-devel, Sage Weil

Hi,

Couple of points.

1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU implementation
with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning ("sharding"), scan resistance, and coordination w/lookup tables.  We might extend/re-use it.

2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
segments (are you proposing to/do we now always cache whole objects?), and cost
of "looking for" dirty objects;  I suspect that it makes sense to amortize the
cost of locating segments eligible to be flushed, rather than minimize
bookkeeping.

Matt

----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:

> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Tuesday, July 21, 2015 6:38 AM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > Subject: Re: The design of the eviction improvement
> > 
> > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > Hi all,
> > >
> > > This is a follow-up of one of the CDS session at
> >
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
> > ng_eviction. We discussed the drawbacks of the current eviction
> algorithm and
> > several ways to improve it. Seems like the LRU variants is the right
> way to go. I
> > come up with some design points after the CDS, and want to discuss
> it with you.
> > It is an approximate 2Q algorithm, combining some benefits of the
> clock
> > algorithm, similar to what the linux kernel does for the page
> cache.
> > 
> > Unfortunately I missed this last CDS so I'm behind on the
> discussion.  I have a
> > few questions though...
> > 
> > > # Design points:
> > >
> > > ## LRU lists
> > > - Maintain LRU lists at the PG level.
> > > The SharedLRU and SimpleLRU implementation in the current code
> have a
> > > max_size, which limits the max number of elements in the list.
> This
> > > mostly looks like a MRU, though its name implies they are LRUs.
> Since
> > > the object size may vary in a PG, it's not possible to caculate
> the
> > > total number of objects which the cache tier can hold ahead of
> time.
> > > We need a new LRU implementation with no limit on the size.
> > 
> > This last sentence seems to me to be the crux of it.  Assuming we
> have an
> > OSD based by flash storing O(n) objects, we need a way to maintain
> an LRU of
> > O(n) objects in memory.  The current hitset-based approach was taken
> based
> > on the assumption that this wasn't feasible--or at least we didn't
> know how to
> > implmement such a thing.  If it is, or we simply want to stipulate
> that cache
> > tier OSDs get gobs of RAM to make it possible, then lots of better
> options
> > become possible...
> > 
> > Let's say you have a 1TB SSD, with an average object size of 1MB --
> that's
> > 1 million objects.  At maybe ~100bytes per object of RAM for an LRU
> entry
> > that's 100MB... so not so unreasonable, perhaps!
> 
> I was having the same question before proposing this. I did the
> similar calculation and thought it would be ok to use this many memory
> :-)
> 
> > 
> > > - Two lists for each PG: active and inactive Objects are first
> put
> > > into the inactive list when they are accessed, and moved between
> these two
> > lists based on some criteria.
> > > Object flag: active, referenced, unevictable, dirty.
> > > - When an object is accessed:
> > > 1) If it's not in both of the lists, it's put on the top of the
> > > inactive list
> > > 2) If it's in the inactive list, and the referenced flag is not
> set, the referenced
> > flag is set, and it's moved to the top of the inactive list.
> > > 3) If it's in the inactive list, and the referenced flag is set,
> the referenced flag
> > is cleared, and it's removed from the inactive list, and put on top
> of the active
> > list.
> > > 4) If it's in the active list, and the referenced flag is not set,
> the referenced
> > flag is set, and it's moved to the top of the active list.
> > > 5) If it's in the active list, and the referenced flag is set,
> it's moved to the top
> > of the active list.
> > > - When selecting objects to evict:
> > > 1) Objects at the bottom of the inactive list are selected to
> evict. They are
> > removed from the inactive list.
> > > 2) If the number of the objects in the inactive list becomes low,
> some of the
> > objects at the bottom of the active list are moved to the inactive
> list. For those
> > objects which have the referenced flag set, they are given one more
> chance in
> > the active list. They are moved to the top of the active list with
> the referenced
> > flag cleared. For those objects which don't have the referenced flag
> set, they
> > are moved to the inactive list, with the referenced flag set. So
> that they can be
> > quickly promoted to the active list when necessary.
> > >
> > > ## Combine flush with eviction
> > > - When evicting an object, if it's dirty, it's flushed first.
> After flushing, it's
> > evicted. If not dirty, it's evicted directly.
> > > - This means that we won't have separate activities and won't set
> different
> > ratios for flush and evict. Is there a need to do so?
> > > - Number of objects to evict at a time. 'evict_effort' acts as the
> priority, which
> > is used to calculate the number of objects to evict.
> > 
> > As someone else mentioned in a follow-up, the reason we let the
> dirty level be
> > set lower than the full level is that it provides headroom so that
> objects can be
> > quickly evicted (delete, no flush) to make room for new writes or
> new
> > promotions.
> > 
> > That said, we probably can/should streamline the flush so that an
> evict can
> > immediately follow without waiting for the agent to come around
> again.
> > (I don't think we do that now?)
> 
> I was afraid of having to maintain different lists for flush/evict. So
> I proposed to combine flush with evict. But seems like we don't need
> to. When reaching the dirty ratio and selecting objects to flush, we
> look for dirty objects from the inactive list. The objects are marked
> clean and kept in the same position in the LRU list after flushing.
> When evicting objects, clean objects in the inactive list are
> selected. Dirty objects are ignored. This will keep the flush/evict
> the same as current implementation. Make sense?
> 
> > 
> > sage
> > 
> > 
> > > ## LRU lists Snapshotting
> > > - The two lists are snapshotted persisted periodically.
> > > - Only one copy needs to be saved. The old copy is removed when
> persisting
> > the lists. The saved lists are used to restore the LRU lists when
> OSD reboots.
> > >
> > > Any comments/feedbacks are welcomed.
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> > > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> ceph-devel" in the body
> > of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: The design of the eviction improvement
  2015-07-21 14:15 ` Matt W. Benjamin
@ 2015-07-21 14:25   ` Gregory Farnum
  2015-07-22  3:57   ` Wang, Zhiqiang
  1 sibling, 0 replies; 21+ messages in thread
From: Gregory Farnum @ 2015-07-21 14:25 UTC (permalink / raw)
  To: Matt W. Benjamin
  Cc: Zhiqiang Wang, sjust@redhat.com, ceph-devel@vger.kernel.org,
	Sage Weil

On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin <matt@cohortfs.com> wrote:
> Hi,
>
> Couple of points.
>
> 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU implementation
> with 2 levels currently, plus a pinned queue, that addresses stuff like partitioning ("sharding"), scan resistance, and coordination w/lookup tables.  We might extend/re-use it.
>
> 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
> segments (are you proposing to/do we now always cache whole objects?), and cost
> of "looking for" dirty objects;  I suspect that it makes sense to amortize the
> cost of locating segments eligible to be flushed, rather than minimize
> bookkeeping.

We make caching decisions in terms of whole objects right now, yeah.
There's really nothing in the system that's capable of doing segments
within an object, and it's not just about tracking a little more
metadata about dirty objects — the way we handle snapshots, etc would
have to be reworked if we were allowing partial-object caching. Plus
keep in mind the IO cost of the bookkeeping — it needs to be either
consistently persisted to disk or reconstructable from whatever
happens to be in the object. That can get expensive really fast.
-Greg

>
> Matt
>
> ----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:
>
>> > -----Original Message-----
>> > From: ceph-devel-owner@vger.kernel.org
>> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>> > Sent: Tuesday, July 21, 2015 6:38 AM
>> > To: Wang, Zhiqiang
>> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
>> > Subject: Re: The design of the eviction improvement
>> >
>> > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
>> > > Hi all,
>> > >
>> > > This is a follow-up of one of the CDS session at
>> >
>> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
>> > ng_eviction. We discussed the drawbacks of the current eviction
>> algorithm and
>> > several ways to improve it. Seems like the LRU variants is the right
>> way to go. I
>> > come up with some design points after the CDS, and want to discuss
>> it with you.
>> > It is an approximate 2Q algorithm, combining some benefits of the
>> clock
>> > algorithm, similar to what the linux kernel does for the page
>> cache.
>> >
>> > Unfortunately I missed this last CDS so I'm behind on the
>> discussion.  I have a
>> > few questions though...
>> >
>> > > # Design points:
>> > >
>> > > ## LRU lists
>> > > - Maintain LRU lists at the PG level.
>> > > The SharedLRU and SimpleLRU implementation in the current code
>> have a
>> > > max_size, which limits the max number of elements in the list.
>> This
>> > > mostly looks like a MRU, though its name implies they are LRUs.
>> Since
>> > > the object size may vary in a PG, it's not possible to caculate
>> the
>> > > total number of objects which the cache tier can hold ahead of
>> time.
>> > > We need a new LRU implementation with no limit on the size.
>> >
>> > This last sentence seems to me to be the crux of it.  Assuming we
>> have an
>> > OSD based by flash storing O(n) objects, we need a way to maintain
>> an LRU of
>> > O(n) objects in memory.  The current hitset-based approach was taken
>> based
>> > on the assumption that this wasn't feasible--or at least we didn't
>> know how to
>> > implmement such a thing.  If it is, or we simply want to stipulate
>> that cache
>> > tier OSDs get gobs of RAM to make it possible, then lots of better
>> options
>> > become possible...
>> >
>> > Let's say you have a 1TB SSD, with an average object size of 1MB --
>> that's
>> > 1 million objects.  At maybe ~100bytes per object of RAM for an LRU
>> entry
>> > that's 100MB... so not so unreasonable, perhaps!
>>
>> I was having the same question before proposing this. I did the
>> similar calculation and thought it would be ok to use this many memory
>> :-)
>>
>> >
>> > > - Two lists for each PG: active and inactive Objects are first
>> put
>> > > into the inactive list when they are accessed, and moved between
>> these two
>> > lists based on some criteria.
>> > > Object flag: active, referenced, unevictable, dirty.
>> > > - When an object is accessed:
>> > > 1) If it's not in both of the lists, it's put on the top of the
>> > > inactive list
>> > > 2) If it's in the inactive list, and the referenced flag is not
>> set, the referenced
>> > flag is set, and it's moved to the top of the inactive list.
>> > > 3) If it's in the inactive list, and the referenced flag is set,
>> the referenced flag
>> > is cleared, and it's removed from the inactive list, and put on top
>> of the active
>> > list.
>> > > 4) If it's in the active list, and the referenced flag is not set,
>> the referenced
>> > flag is set, and it's moved to the top of the active list.
>> > > 5) If it's in the active list, and the referenced flag is set,
>> it's moved to the top
>> > of the active list.
>> > > - When selecting objects to evict:
>> > > 1) Objects at the bottom of the inactive list are selected to
>> evict. They are
>> > removed from the inactive list.
>> > > 2) If the number of the objects in the inactive list becomes low,
>> some of the
>> > objects at the bottom of the active list are moved to the inactive
>> list. For those
>> > objects which have the referenced flag set, they are given one more
>> chance in
>> > the active list. They are moved to the top of the active list with
>> the referenced
>> > flag cleared. For those objects which don't have the referenced flag
>> set, they
>> > are moved to the inactive list, with the referenced flag set. So
>> that they can be
>> > quickly promoted to the active list when necessary.
>> > >
>> > > ## Combine flush with eviction
>> > > - When evicting an object, if it's dirty, it's flushed first.
>> After flushing, it's
>> > evicted. If not dirty, it's evicted directly.
>> > > - This means that we won't have separate activities and won't set
>> different
>> > ratios for flush and evict. Is there a need to do so?
>> > > - Number of objects to evict at a time. 'evict_effort' acts as the
>> priority, which
>> > is used to calculate the number of objects to evict.
>> >
>> > As someone else mentioned in a follow-up, the reason we let the
>> dirty level be
>> > set lower than the full level is that it provides headroom so that
>> objects can be
>> > quickly evicted (delete, no flush) to make room for new writes or
>> new
>> > promotions.
>> >
>> > That said, we probably can/should streamline the flush so that an
>> evict can
>> > immediately follow without waiting for the agent to come around
>> again.
>> > (I don't think we do that now?)
>>
>> I was afraid of having to maintain different lists for flush/evict. So
>> I proposed to combine flush with evict. But seems like we don't need
>> to. When reaching the dirty ratio and selecting objects to flush, we
>> look for dirty objects from the inactive list. The objects are marked
>> clean and kept in the same position in the LRU list after flushing.
>> When evicting objects, clean objects in the inactive list are
>> selected. Dirty objects are ignored. This will keep the flush/evict
>> the same as current implementation. Make sense?
>>
>> >
>> > sage
>> >
>> >
>> > > ## LRU lists Snapshotting
>> > > - The two lists are snapshotted persisted periodically.
>> > > - Only one copy needs to be saved. The old copy is removed when
>> persisting
>> > the lists. The saved lists are used to restore the LRU lists when
>> OSD reboots.
>> > >
>> > > Any comments/feedbacks are welcomed.
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe
>> ceph-devel"
>> > > in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>> > > info at  http://vger.kernel.org/majordomo-info.html
>> > >
>> > >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe
>> ceph-devel" in the body
>> > of a message to majordomo@vger.kernel.org More majordomo info at
>> > http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Matt Benjamin
> CohortFS, LLC.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://cohortfs.com
>
> tel.  734-761-4689
> fax.  734-769-8938
> cel.  734-216-5309
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: The design of the eviction improvement
       [not found] <639332441.73.1437495369873.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2015-07-21 16:24 ` Matt W. Benjamin
  0 siblings, 0 replies; 21+ messages in thread
From: Matt W. Benjamin @ 2015-07-21 16:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Zhiqiang Wang, sjust, ceph-devel, Sage Weil

Thanks for the explanations, Greg.

----- "Gregory Farnum" <greg@gregs42.com> wrote:

> On Tue, Jul 21, 2015 at 3:15 PM, Matt W. Benjamin <matt@cohortfs.com>
> wrote:
> > Hi,
> >
> > Couple of points.
> >
> > 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
> implementation
> > with 2 levels currently, plus a pinned queue, that addresses stuff
> like partitioning ("sharding"), scan resistance, and coordination
> w/lookup tables.  We might extend/re-use it.
> >
> > 2) I'm a bit confused by active/inactive vocabulary, dimensioning of
> cache
> > segments (are you proposing to/do we now always cache whole
> objects?), and cost
> > of "looking for" dirty objects;  I suspect that it makes sense to
> amortize the
> > cost of locating segments eligible to be flushed, rather than
> minimize
> > bookkeeping.
> 
> We make caching decisions in terms of whole objects right now, yeah.
> There's really nothing in the system that's capable of doing segments
> within an object, and it's not just about tracking a little more
> metadata about dirty objects — the way we handle snapshots, etc would
> have to be reworked if we were allowing partial-object caching. Plus
> keep in mind the IO cost of the bookkeeping — it needs to be either
> consistently persisted to disk or reconstructable from whatever
> happens to be in the object. That can get expensive really fast.
> -Greg

For current semantics/structure of PGs + specific tier held fixed, makes
sense.  For our object addressing currently, we have a greater requirement
for partial object caching.  (Partly, we did this to achieve periodicity
w/sequential I/O.)  I think broadly, there are large performance
tradeoffs here.  In AFS and DCE, there is full consistency in materialized
caches.  Also, caches are dimensioned by chunks.  If the cache is materialized
in memory, the semantics aren't those of "disk."  Basically, consistency
guarantees are policy.  Different snapshot mechanisms, or omtting them, e.g.,
should logically enable relaxed consistency, modulo policy.

Matt

> 
> >
> > Matt
> >
> > ----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:
> >
> >> > -----Original Message-----
> >> > From: ceph-devel-owner@vger.kernel.org
> >> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> >> > Sent: Tuesday, July 21, 2015 6:38 AM
> >> > To: Wang, Zhiqiang
> >> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> >> > Subject: Re: The design of the eviction improvement
> >> >
> >> > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> >> > > Hi all,
> >> > >
> >> > > This is a follow-up of one of the CDS session at
> >> >
> >>
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tieri
> >> > ng_eviction. We discussed the drawbacks of the current eviction
> >> algorithm and
> >> > several ways to improve it. Seems like the LRU variants is the
> right
> >> way to go. I
> >> > come up with some design points after the CDS, and want to
> discuss
> >> it with you.
> >> > It is an approximate 2Q algorithm, combining some benefits of
> the
> >> clock
> >> > algorithm, similar to what the linux kernel does for the page
> >> cache.
> >> >
> >> > Unfortunately I missed this last CDS so I'm behind on the
> >> discussion.  I have a
> >> > few questions though...
> >> >
> >> > > # Design points:
> >> > >
> >> > > ## LRU lists
> >> > > - Maintain LRU lists at the PG level.
> >> > > The SharedLRU and SimpleLRU implementation in the current code
> >> have a
> >> > > max_size, which limits the max number of elements in the list.
> >> This
> >> > > mostly looks like a MRU, though its name implies they are
> LRUs.
> >> Since
> >> > > the object size may vary in a PG, it's not possible to
> caculate
> >> the
> >> > > total number of objects which the cache tier can hold ahead of
> >> time.
> >> > > We need a new LRU implementation with no limit on the size.
> >> >
> >> > This last sentence seems to me to be the crux of it.  Assuming
> we
> >> have an
> >> > OSD based by flash storing O(n) objects, we need a way to
> maintain
> >> an LRU of
> >> > O(n) objects in memory.  The current hitset-based approach was
> taken
> >> based
> >> > on the assumption that this wasn't feasible--or at least we
> didn't
> >> know how to
> >> > implmement such a thing.  If it is, or we simply want to
> stipulate
> >> that cache
> >> > tier OSDs get gobs of RAM to make it possible, then lots of
> better
> >> options
> >> > become possible...
> >> >
> >> > Let's say you have a 1TB SSD, with an average object size of 1MB
> --
> >> that's
> >> > 1 million objects.  At maybe ~100bytes per object of RAM for an
> LRU
> >> entry
> >> > that's 100MB... so not so unreasonable, perhaps!
> >>
> >> I was having the same question before proposing this. I did the
> >> similar calculation and thought it would be ok to use this many
> memory
> >> :-)
> >>
> >> >
> >> > > - Two lists for each PG: active and inactive Objects are first
> >> put
> >> > > into the inactive list when they are accessed, and moved
> between
> >> these two
> >> > lists based on some criteria.
> >> > > Object flag: active, referenced, unevictable, dirty.
> >> > > - When an object is accessed:
> >> > > 1) If it's not in both of the lists, it's put on the top of
> the
> >> > > inactive list
> >> > > 2) If it's in the inactive list, and the referenced flag is
> not
> >> set, the referenced
> >> > flag is set, and it's moved to the top of the inactive list.
> >> > > 3) If it's in the inactive list, and the referenced flag is
> set,
> >> the referenced flag
> >> > is cleared, and it's removed from the inactive list, and put on
> top
> >> of the active
> >> > list.
> >> > > 4) If it's in the active list, and the referenced flag is not
> set,
> >> the referenced
> >> > flag is set, and it's moved to the top of the active list.
> >> > > 5) If it's in the active list, and the referenced flag is set,
> >> it's moved to the top
> >> > of the active list.
> >> > > - When selecting objects to evict:
> >> > > 1) Objects at the bottom of the inactive list are selected to
> >> evict. They are
> >> > removed from the inactive list.
> >> > > 2) If the number of the objects in the inactive list becomes
> low,
> >> some of the
> >> > objects at the bottom of the active list are moved to the
> inactive
> >> list. For those
> >> > objects which have the referenced flag set, they are given one
> more
> >> chance in
> >> > the active list. They are moved to the top of the active list
> with
> >> the referenced
> >> > flag cleared. For those objects which don't have the referenced
> flag
> >> set, they
> >> > are moved to the inactive list, with the referenced flag set. So
> >> that they can be
> >> > quickly promoted to the active list when necessary.
> >> > >
> >> > > ## Combine flush with eviction
> >> > > - When evicting an object, if it's dirty, it's flushed first.
> >> After flushing, it's
> >> > evicted. If not dirty, it's evicted directly.
> >> > > - This means that we won't have separate activities and won't
> set
> >> different
> >> > ratios for flush and evict. Is there a need to do so?
> >> > > - Number of objects to evict at a time. 'evict_effort' acts as
> the
> >> priority, which
> >> > is used to calculate the number of objects to evict.
> >> >
> >> > As someone else mentioned in a follow-up, the reason we let the
> >> dirty level be
> >> > set lower than the full level is that it provides headroom so
> that
> >> objects can be
> >> > quickly evicted (delete, no flush) to make room for new writes
> or
> >> new
> >> > promotions.
> >> >
> >> > That said, we probably can/should streamline the flush so that
> an
> >> evict can
> >> > immediately follow without waiting for the agent to come around
> >> again.
> >> > (I don't think we do that now?)
> >>
> >> I was afraid of having to maintain different lists for flush/evict.
> So
> >> I proposed to combine flush with evict. But seems like we don't
> need
> >> to. When reaching the dirty ratio and selecting objects to flush,
> we
> >> look for dirty objects from the inactive list. The objects are
> marked
> >> clean and kept in the same position in the LRU list after
> flushing.
> >> When evicting objects, clean objects in the inactive list are
> >> selected. Dirty objects are ignored. This will keep the
> flush/evict
> >> the same as current implementation. Make sense?
> >>
> >> >
> >> > sage
> >> >
> >> >
> >> > > ## LRU lists Snapshotting
> >> > > - The two lists are snapshotted persisted periodically.
> >> > > - Only one copy needs to be saved. The old copy is removed
> when
> >> persisting
> >> > the lists. The saved lists are used to restore the LRU lists
> when
> >> OSD reboots.
> >> > >
> >> > > Any comments/feedbacks are welcomed.
> >> > > --
> >> > > To unsubscribe from this list: send the line "unsubscribe
> >> ceph-devel"
> >> > > in the body of a message to majordomo@vger.kernel.org More
> >> majordomo
> >> > > info at  http://vger.kernel.org/majordomo-info.html
> >> > >
> >> > >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe
> >> ceph-devel" in the body
> >> > of a message to majordomo@vger.kernel.org More majordomo info at
> >> > http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> >> in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> > Matt Benjamin
> > CohortFS, LLC.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> >
> > http://cohortfs.com
> >
> > tel.  734-761-4689
> > fax.  734-769-8938
> > cel.  734-216-5309
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-21 13:28     ` Sage Weil
@ 2015-07-22  3:34       ` Wang, Zhiqiang
  2015-07-22 12:56         ` Sage Weil
  0 siblings, 1 reply; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-22  3:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org


> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, July 21, 2015 9:29 PM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Tuesday, July 21, 2015 6:38 AM
> > > To: Wang, Zhiqiang
> > > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > > Subject: Re: The design of the eviction improvement
> > >
> > > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > > Hi all,
> > > >
> > > > This is a follow-up of one of the CDS session at
> > > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_
> > > tieri ng_eviction. We discussed the drawbacks of the current
> > > eviction algorithm and several ways to improve it. Seems like the
> > > LRU variants is the right way to go. I come up with some design
> > > points after the CDS, and want to discuss it with you.
> > > It is an approximate 2Q algorithm, combining some benefits of the
> > > clock algorithm, similar to what the linux kernel does for the page cache.
> > >
> > > Unfortunately I missed this last CDS so I'm behind on the
> > > discussion.  I have a few questions though...
> > >
> > > > # Design points:
> > > >
> > > > ## LRU lists
> > > > - Maintain LRU lists at the PG level.
> > > > The SharedLRU and SimpleLRU implementation in the current code
> > > > have a max_size, which limits the max number of elements in the
> > > > list. This mostly looks like a MRU, though its name implies they
> > > > are LRUs. Since the object size may vary in a PG, it's not
> > > > possible to caculate the total number of objects which the cache tier can
> hold ahead of time.
> > > > We need a new LRU implementation with no limit on the size.
> > >
> > > This last sentence seems to me to be the crux of it.  Assuming we
> > > have an OSD based by flash storing O(n) objects, we need a way to
> > > maintain an LRU of
> > > O(n) objects in memory.  The current hitset-based approach was taken
> > > based on the assumption that this wasn't feasible--or at least we
> > > didn't know how to implmement such a thing.  If it is, or we simply
> > > want to stipulate that cache tier OSDs get gobs of RAM to make it
> > > possible, then lots of better options become possible...
> > >
> > > Let's say you have a 1TB SSD, with an average object size of 1MB --
> > > that's
> > > 1 million objects.  At maybe ~100bytes per object of RAM for an LRU
> > > entry that's 100MB... so not so unreasonable, perhaps!
> >
> > I was having the same question before proposing this. I did the
> > similar calculation and thought it would be ok to use this many memory
> > :-)
> 
> The part that worries me now is the speed with which we can load and manage
> such a list.  Assuming it is several hundred MB, it'll take a while to load that
> into memory and set up all the pointers (assuming a conventional linked list
> structure).  Maybe tens of seconds...

I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100 MB lists are split among different OSD nodes. Perhaps it does not need such long time to load them?

> 
> I wonder if instead we should construct some sort of flat model where we load
> slabs of contiguous memory, 10's of MB each, and have the next/previous
> pointers be a (slab,position) pair.  That way we can load it into memory in big
> chunks, quickly, and be able to operate on it (adjust
> links) immediately.
> 
> Another thought: currently we use the hobject_t hash only instead of the full
> object name.  We could continue to do the same, or we could do a hash pair
> (hobject_t hash + a different hash of the rest of the object) to keep the
> representation compact.  With a model lke the above, that could get the
> object representation down to 2 u32's.  A link could be a slab + position (2
> more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object.

Looks like for an object, the head and the snapshot version have the same hobject hash. Thus we have to use the hash pair instead of just the hobject hash. But I still have two questions if we use the hash pair to represent an object.
1) Does the hash pair uniquely identify an object? That's to say, is it possible for two objects to have the same hash pair?
2) We need a way to get the full object name from the hash pair, so that we know what objects to evict. But seems like we don't have a good way to do this?

> 
> With fixed-sized slots on the slabs, the slab allocator could be very simple..
> maybe just a bitmap, a free counter, and any other trivial optimizations to
> make finding a slab's next free a slot nice and quick.
> 
> > > > - Two lists for each PG: active and inactive Objects are first put
> > > > into the inactive list when they are accessed, and moved between
> > > > these two
> > > lists based on some criteria.
> > > > Object flag: active, referenced, unevictable, dirty.
> > > > - When an object is accessed:
> > > > 1) If it's not in both of the lists, it's put on the top of the
> > > > inactive list
> > > > 2) If it's in the inactive list, and the referenced flag is not
> > > > set, the referenced
> > > flag is set, and it's moved to the top of the inactive list.
> > > > 3) If it's in the inactive list, and the referenced flag is set,
> > > > the referenced flag
> > > is cleared, and it's removed from the inactive list, and put on top
> > > of the active list.
> > > > 4) If it's in the active list, and the referenced flag is not set,
> > > > the referenced
> > > flag is set, and it's moved to the top of the active list.
> > > > 5) If it's in the active list, and the referenced flag is set,
> > > > it's moved to the top
> > > of the active list.
> > > > - When selecting objects to evict:
> > > > 1) Objects at the bottom of the inactive list are selected to
> > > > evict. They are
> > > removed from the inactive list.
> > > > 2) If the number of the objects in the inactive list becomes low,
> > > > some of the
> > > objects at the bottom of the active list are moved to the inactive
> > > list. For those objects which have the referenced flag set, they are
> > > given one more chance in the active list. They are moved to the top
> > > of the active list with the referenced flag cleared. For those
> > > objects which don't have the referenced flag set, they are moved to
> > > the inactive list, with the referenced flag set. So that they can be quickly
> promoted to the active list when necessary.
> > > >
> > > > ## Combine flush with eviction
> > > > - When evicting an object, if it's dirty, it's flushed first.
> > > > After flushing, it's
> > > evicted. If not dirty, it's evicted directly.
> > > > - This means that we won't have separate activities and won't set
> > > > different
> > > ratios for flush and evict. Is there a need to do so?
> > > > - Number of objects to evict at a time. 'evict_effort' acts as the
> > > > priority, which
> > > is used to calculate the number of objects to evict.
> > >
> > > As someone else mentioned in a follow-up, the reason we let the
> > > dirty level be set lower than the full level is that it provides
> > > headroom so that objects can be quickly evicted (delete, no flush)
> > > to make room for new writes or new promotions.
> > >
> > > That said, we probably can/should streamline the flush so that an
> > > evict can immediately follow without waiting for the agent to come around
> again.
> > > (I don't think we do that now?)
> >
> > I was afraid of having to maintain different lists for flush/evict. So
> > I proposed to combine flush with evict. But seems like we don't need to.
> > When reaching the dirty ratio and selecting objects to flush, we look
> > for dirty objects from the inactive list. The objects are marked clean
> > and kept in the same position in the LRU list after flushing. When
> > evicting objects, clean objects in the inactive list are selected.
> > Dirty objects are ignored. This will keep the flush/evict the same as
> > current implementation. Make sense?
> 
> Yep!
> 
> sage

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-21 14:15 ` Matt W. Benjamin
  2015-07-21 14:25   ` Gregory Farnum
@ 2015-07-22  3:57   ` Wang, Zhiqiang
  2015-07-22  4:06     ` Matt W. Benjamin
  1 sibling, 1 reply; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-22  3:57 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org, Sage Weil

Hi Matt,

> -----Original Message-----
> From: Matt W. Benjamin [mailto:matt@cohortfs.com]
> Sent: Tuesday, July 21, 2015 10:16 PM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
> Subject: Re: The design of the eviction improvement
> 
> Hi,
> 
> Couple of points.
> 
> 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
> implementation with 2 levels currently, plus a pinned queue, that addresses
> stuff like partitioning ("sharding"), scan resistance, and coordination w/lookup
> tables.  We might extend/re-use it.

The MQ algorithm is more complex, and seems like it has more overhead than 2Q. The approximate 2Q algorithm here combines some benefits of the clock algorithm, and works well on the linux kernel. MQ could be another choice. There are some other candidates like LIRS, ARC, etc., which have been deployed in some practical systems.

> 
> 2) I'm a bit confused by active/inactive vocabulary, dimensioning of cache
> segments (are you proposing to/do we now always cache whole objects?), and
> cost of "looking for" dirty objects;  I suspect that it makes sense to amortize
> the cost of locating segments eligible to be flushed, rather than minimize
> bookkeeping.

Though currently the caching decisions are made in the unit of object as Greg pointed out in another mail, I think we still have something to improve here. I'll come back to this some time later.

> 
> Matt
> 
> ----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:
> 
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Tuesday, July 21, 2015 6:38 AM
> > > To: Wang, Zhiqiang
> > > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > > Subject: Re: The design of the eviction improvement
> > >
> > > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > > Hi all,
> > > >
> > > > This is a follow-up of one of the CDS session at
> > >
> > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti
> > eri
> > > ng_eviction. We discussed the drawbacks of the current eviction
> > algorithm and
> > > several ways to improve it. Seems like the LRU variants is the right
> > way to go. I
> > > come up with some design points after the CDS, and want to discuss
> > it with you.
> > > It is an approximate 2Q algorithm, combining some benefits of the
> > clock
> > > algorithm, similar to what the linux kernel does for the page
> > cache.
> > >
> > > Unfortunately I missed this last CDS so I'm behind on the
> > discussion.  I have a
> > > few questions though...
> > >
> > > > # Design points:
> > > >
> > > > ## LRU lists
> > > > - Maintain LRU lists at the PG level.
> > > > The SharedLRU and SimpleLRU implementation in the current code
> > have a
> > > > max_size, which limits the max number of elements in the list.
> > This
> > > > mostly looks like a MRU, though its name implies they are LRUs.
> > Since
> > > > the object size may vary in a PG, it's not possible to caculate
> > the
> > > > total number of objects which the cache tier can hold ahead of
> > time.
> > > > We need a new LRU implementation with no limit on the size.
> > >
> > > This last sentence seems to me to be the crux of it.  Assuming we
> > have an
> > > OSD based by flash storing O(n) objects, we need a way to maintain
> > an LRU of
> > > O(n) objects in memory.  The current hitset-based approach was taken
> > based
> > > on the assumption that this wasn't feasible--or at least we didn't
> > know how to
> > > implmement such a thing.  If it is, or we simply want to stipulate
> > that cache
> > > tier OSDs get gobs of RAM to make it possible, then lots of better
> > options
> > > become possible...
> > >
> > > Let's say you have a 1TB SSD, with an average object size of 1MB --
> > that's
> > > 1 million objects.  At maybe ~100bytes per object of RAM for an LRU
> > entry
> > > that's 100MB... so not so unreasonable, perhaps!
> >
> > I was having the same question before proposing this. I did the
> > similar calculation and thought it would be ok to use this many memory
> > :-)
> >
> > >
> > > > - Two lists for each PG: active and inactive Objects are first
> > put
> > > > into the inactive list when they are accessed, and moved between
> > these two
> > > lists based on some criteria.
> > > > Object flag: active, referenced, unevictable, dirty.
> > > > - When an object is accessed:
> > > > 1) If it's not in both of the lists, it's put on the top of the
> > > > inactive list
> > > > 2) If it's in the inactive list, and the referenced flag is not
> > set, the referenced
> > > flag is set, and it's moved to the top of the inactive list.
> > > > 3) If it's in the inactive list, and the referenced flag is set,
> > the referenced flag
> > > is cleared, and it's removed from the inactive list, and put on top
> > of the active
> > > list.
> > > > 4) If it's in the active list, and the referenced flag is not set,
> > the referenced
> > > flag is set, and it's moved to the top of the active list.
> > > > 5) If it's in the active list, and the referenced flag is set,
> > it's moved to the top
> > > of the active list.
> > > > - When selecting objects to evict:
> > > > 1) Objects at the bottom of the inactive list are selected to
> > evict. They are
> > > removed from the inactive list.
> > > > 2) If the number of the objects in the inactive list becomes low,
> > some of the
> > > objects at the bottom of the active list are moved to the inactive
> > list. For those
> > > objects which have the referenced flag set, they are given one more
> > chance in
> > > the active list. They are moved to the top of the active list with
> > the referenced
> > > flag cleared. For those objects which don't have the referenced flag
> > set, they
> > > are moved to the inactive list, with the referenced flag set. So
> > that they can be
> > > quickly promoted to the active list when necessary.
> > > >
> > > > ## Combine flush with eviction
> > > > - When evicting an object, if it's dirty, it's flushed first.
> > After flushing, it's
> > > evicted. If not dirty, it's evicted directly.
> > > > - This means that we won't have separate activities and won't set
> > different
> > > ratios for flush and evict. Is there a need to do so?
> > > > - Number of objects to evict at a time. 'evict_effort' acts as the
> > priority, which
> > > is used to calculate the number of objects to evict.
> > >
> > > As someone else mentioned in a follow-up, the reason we let the
> > dirty level be
> > > set lower than the full level is that it provides headroom so that
> > objects can be
> > > quickly evicted (delete, no flush) to make room for new writes or
> > new
> > > promotions.
> > >
> > > That said, we probably can/should streamline the flush so that an
> > evict can
> > > immediately follow without waiting for the agent to come around
> > again.
> > > (I don't think we do that now?)
> >
> > I was afraid of having to maintain different lists for flush/evict. So
> > I proposed to combine flush with evict. But seems like we don't need
> > to. When reaching the dirty ratio and selecting objects to flush, we
> > look for dirty objects from the inactive list. The objects are marked
> > clean and kept in the same position in the LRU list after flushing.
> > When evicting objects, clean objects in the inactive list are
> > selected. Dirty objects are ignored. This will keep the flush/evict
> > the same as current implementation. Make sense?
> >
> > >
> > > sage
> > >
> > >
> > > > ## LRU lists Snapshotting
> > > > - The two lists are snapshotted persisted periodically.
> > > > - Only one copy needs to be saved. The old copy is removed when
> > persisting
> > > the lists. The saved lists are used to restore the LRU lists when
> > OSD reboots.
> > > >
> > > > Any comments/feedbacks are welcomed.
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > ceph-devel" in the body
> > > of a message to majordomo@vger.kernel.org More majordomo info at
> > > http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in
> > the body of a message to majordomo@vger.kernel.org More majordomo info
> > at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Matt Benjamin
> CohortFS, LLC.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
> 
> http://cohortfs.com
> 
> tel.  734-761-4689
> fax.  734-769-8938
> cel.  734-216-5309

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: The design of the eviction improvement
  2015-07-22  3:57   ` Wang, Zhiqiang
@ 2015-07-22  4:06     ` Matt W. Benjamin
  0 siblings, 0 replies; 21+ messages in thread
From: Matt W. Benjamin @ 2015-07-22  4:06 UTC (permalink / raw)
  To: Zhiqiang Wang; +Cc: sjust, ceph-devel, Sage Weil

Hi,

----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:

> Hi Matt,
> 
> > -----Original Message-----
> > From: Matt W. Benjamin [mailto:matt@cohortfs.com]
> > Sent: Tuesday, July 21, 2015 10:16 PM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org; Sage Weil
> > Subject: Re: The design of the eviction improvement
> > 
> > Hi,
> > 
> > Couple of points.
> > 
> > 1) a successor to 2Q is MQ (Li et al).  We have an intrusive MQ LRU
> > implementation with 2 levels currently, plus a pinned queue, that
> addresses
> > stuff like partitioning ("sharding"), scan resistance, and
> coordination w/lookup
> > tables.  We might extend/re-use it.
> 
> The MQ algorithm is more complex, and seems like it has more overhead
> than 2Q. The approximate 2Q algorithm here combines some benefits of
> the clock algorithm, and works well on the linux kernel. MQ could be
> another choice. There are some other candidates like LIRS, ARC, etc.,
> which have been deployed in some practical systems.

MQ has been deployed in practical systems, and is more general.

> 
> > 
> > 2) I'm a bit confused by active/inactive vocabulary, dimensioning of
> cache
> > segments (are you proposing to/do we now always cache whole
> objects?), and
> > cost of "looking for" dirty objects;  I suspect that it makes sense
> to amortize
> > the cost of locating segments eligible to be flushed, rather than
> minimize
> > bookkeeping.
> 
> Though currently the caching decisions are made in the unit of object
> as Greg pointed out in another mail, I think we still have something
> to improve here. I'll come back to this some time later.
> 
> > 
> > Matt
> > 
> > ----- "Zhiqiang Wang" <zhiqiang.wang@intel.com> wrote:
> > 
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage
> Weil
> > > > Sent: Tuesday, July 21, 2015 6:38 AM
> > > > To: Wang, Zhiqiang
> > > > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > > > Subject: Re: The design of the eviction improvement
> > > >
> > > > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > > > Hi all,
> > > > >
> > > > > This is a follow-up of one of the CDS session at
> > > >
> > >
> http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti
> > > eri
> > > > ng_eviction. We discussed the drawbacks of the current eviction
> > > algorithm and
> > > > several ways to improve it. Seems like the LRU variants is the
> right
> > > way to go. I
> > > > come up with some design points after the CDS, and want to
> discuss
> > > it with you.
> > > > It is an approximate 2Q algorithm, combining some benefits of
> the
> > > clock
> > > > algorithm, similar to what the linux kernel does for the page
> > > cache.
> > > >
> > > > Unfortunately I missed this last CDS so I'm behind on the
> > > discussion.  I have a
> > > > few questions though...
> > > >
> > > > > # Design points:
> > > > >
> > > > > ## LRU lists
> > > > > - Maintain LRU lists at the PG level.
> > > > > The SharedLRU and SimpleLRU implementation in the current
> code
> > > have a
> > > > > max_size, which limits the max number of elements in the
> list.
> > > This
> > > > > mostly looks like a MRU, though its name implies they are
> LRUs.
> > > Since
> > > > > the object size may vary in a PG, it's not possible to
> caculate
> > > the
> > > > > total number of objects which the cache tier can hold ahead
> of
> > > time.
> > > > > We need a new LRU implementation with no limit on the size.
> > > >
> > > > This last sentence seems to me to be the crux of it.  Assuming
> we
> > > have an
> > > > OSD based by flash storing O(n) objects, we need a way to
> maintain
> > > an LRU of
> > > > O(n) objects in memory.  The current hitset-based approach was
> taken
> > > based
> > > > on the assumption that this wasn't feasible--or at least we
> didn't
> > > know how to
> > > > implmement such a thing.  If it is, or we simply want to
> stipulate
> > > that cache
> > > > tier OSDs get gobs of RAM to make it possible, then lots of
> better
> > > options
> > > > become possible...
> > > >
> > > > Let's say you have a 1TB SSD, with an average object size of 1MB
> --
> > > that's
> > > > 1 million objects.  At maybe ~100bytes per object of RAM for an
> LRU
> > > entry
> > > > that's 100MB... so not so unreasonable, perhaps!
> > >
> > > I was having the same question before proposing this. I did the
> > > similar calculation and thought it would be ok to use this many
> memory
> > > :-)
> > >
> > > >
> > > > > - Two lists for each PG: active and inactive Objects are
> first
> > > put
> > > > > into the inactive list when they are accessed, and moved
> between
> > > these two
> > > > lists based on some criteria.
> > > > > Object flag: active, referenced, unevictable, dirty.
> > > > > - When an object is accessed:
> > > > > 1) If it's not in both of the lists, it's put on the top of
> the
> > > > > inactive list
> > > > > 2) If it's in the inactive list, and the referenced flag is
> not
> > > set, the referenced
> > > > flag is set, and it's moved to the top of the inactive list.
> > > > > 3) If it's in the inactive list, and the referenced flag is
> set,
> > > the referenced flag
> > > > is cleared, and it's removed from the inactive list, and put on
> top
> > > of the active
> > > > list.
> > > > > 4) If it's in the active list, and the referenced flag is not
> set,
> > > the referenced
> > > > flag is set, and it's moved to the top of the active list.
> > > > > 5) If it's in the active list, and the referenced flag is
> set,
> > > it's moved to the top
> > > > of the active list.
> > > > > - When selecting objects to evict:
> > > > > 1) Objects at the bottom of the inactive list are selected to
> > > evict. They are
> > > > removed from the inactive list.
> > > > > 2) If the number of the objects in the inactive list becomes
> low,
> > > some of the
> > > > objects at the bottom of the active list are moved to the
> inactive
> > > list. For those
> > > > objects which have the referenced flag set, they are given one
> more
> > > chance in
> > > > the active list. They are moved to the top of the active list
> with
> > > the referenced
> > > > flag cleared. For those objects which don't have the referenced
> flag
> > > set, they
> > > > are moved to the inactive list, with the referenced flag set.
> So
> > > that they can be
> > > > quickly promoted to the active list when necessary.
> > > > >
> > > > > ## Combine flush with eviction
> > > > > - When evicting an object, if it's dirty, it's flushed first.
> > > After flushing, it's
> > > > evicted. If not dirty, it's evicted directly.
> > > > > - This means that we won't have separate activities and won't
> set
> > > different
> > > > ratios for flush and evict. Is there a need to do so?
> > > > > - Number of objects to evict at a time. 'evict_effort' acts as
> the
> > > priority, which
> > > > is used to calculate the number of objects to evict.
> > > >
> > > > As someone else mentioned in a follow-up, the reason we let the
> > > dirty level be
> > > > set lower than the full level is that it provides headroom so
> that
> > > objects can be
> > > > quickly evicted (delete, no flush) to make room for new writes
> or
> > > new
> > > > promotions.
> > > >
> > > > That said, we probably can/should streamline the flush so that
> an
> > > evict can
> > > > immediately follow without waiting for the agent to come around
> > > again.
> > > > (I don't think we do that now?)
> > >
> > > I was afraid of having to maintain different lists for
> flush/evict. So
> > > I proposed to combine flush with evict. But seems like we don't
> need
> > > to. When reaching the dirty ratio and selecting objects to flush,
> we
> > > look for dirty objects from the inactive list. The objects are
> marked
> > > clean and kept in the same position in the LRU list after
> flushing.
> > > When evicting objects, clean objects in the inactive list are
> > > selected. Dirty objects are ignored. This will keep the
> flush/evict
> > > the same as current implementation. Make sense?
> > >
> > > >
> > > > sage
> > > >
> > > >
> > > > > ## LRU lists Snapshotting
> > > > > - The two lists are snapshotted persisted periodically.
> > > > > - Only one copy needs to be saved. The old copy is removed
> when
> > > persisting
> > > > the lists. The saved lists are used to restore the LRU lists
> when
> > > OSD reboots.
> > > > >
> > > > > Any comments/feedbacks are welcomed.
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel" in the body
> > > > of a message to majordomo@vger.kernel.org More majordomo info
> at
> > > > http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> ceph-devel"
> > > in
> > > the body of a message to majordomo@vger.kernel.org More majordomo
> info
> > > at  http://vger.kernel.org/majordomo-info.html
> > 
> > --
> > Matt Benjamin
> > CohortFS, LLC.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> > 
> > http://cohortfs.com
> > 
> > tel.  734-761-4689
> > fax.  734-769-8938
> > cel.  734-216-5309

Matt

-- 
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22  3:34       ` Wang, Zhiqiang
@ 2015-07-22 12:56         ` Sage Weil
  2015-07-22 18:40           ` Allen Samuels
  0 siblings, 1 reply; 21+ messages in thread
From: Sage Weil @ 2015-07-22 12:56 UTC (permalink / raw)
  To: Wang, Zhiqiang; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org

On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > The part that worries me now is the speed with which we can load and 
> > manage such a list.  Assuming it is several hundred MB, it'll take a 
> > while to load that into memory and set up all the pointers (assuming a 
> > conventional linked list structure).  Maybe tens of seconds...
> 
> I'm thinking of maintaining the lists at the PG level. That's to say, we 
> have an active/inactive list for every PG. We can load the lists in 
> parallel during rebooting. Also, the ~100 MB lists are split among 
> different OSD nodes. Perhaps it does not need such long time to load 
> them?
> 
> > 
> > I wonder if instead we should construct some sort of flat model where 
> > we load slabs of contiguous memory, 10's of MB each, and have the 
> > next/previous pointers be a (slab,position) pair.  That way we can 
> > load it into memory in big chunks, quickly, and be able to operate on 
> > it (adjust links) immediately.
> > 
> > Another thought: currently we use the hobject_t hash only instead of 
> > the full object name.  We could continue to do the same, or we could 
> > do a hash pair (hobject_t hash + a different hash of the rest of the 
> > object) to keep the representation compact.  With a model lke the 
> > above, that could get the object representation down to 2 u32's.  A 
> > link could be a slab + position (2 more u32's), and if we have prev + 
> > next that'd be just 6x4=24 bytes per object.
> 
> Looks like for an object, the head and the snapshot version have the 
> same hobject hash. Thus we have to use the hash pair instead of just the 
> hobject hash. But I still have two questions if we use the hash pair to 
> represent an object.
>
> 1) Does the hash pair uniquely identify an object? That's to say, is it 
> possible for two objects to have the same hash pair?

With two hashes collisions would be rare but could happen

> 2) We need a way to get the full object name from the hash pair, so that 
> we know what objects to evict. But seems like we don't have a good way 
> to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things.  I 
think we can either embed the full ghobject_t (which means we lose the 
fixed-size property, and the per-object overhead goes way up.. probably 
from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects 
starting at the (hobject_t) hash position to find the object.  That's 
somewhat inefficient for FileStore (it'll list a directory of a hundred or 
so objects, probably, and iterate over them to find the right one), but 
for NewStore it will be quite fast (NewStore has all objects sorted into 
keys in rocksdb, so we just start listing at the right offset).  Usually 
we'll get the object right off, unless there are hobject_t hash collisions 
(already reasonably rare since it's a 2^32 space for the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 
options)...

sage


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 12:56         ` Sage Weil
@ 2015-07-22 18:40           ` Allen Samuels
  2015-07-22 18:45             ` Matt W. Benjamin
                               ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Allen Samuels @ 2015-07-22 18:40 UTC (permalink / raw)
  To: Sage Weil, Wang, Zhiqiang; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org

I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems.

If I understand the proposals that have been discussed, each of them assumes in in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals).

Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs.

What's really needed is some kind of way to extend the lists into storage in way that's doesn't cause a zillion I/O operations.

I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates.

Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 5:57 AM
To: Wang, Zhiqiang
Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > The part that worries me now is the speed with which we can load and
> > manage such a list.  Assuming it is several hundred MB, it'll take a
> > while to load that into memory and set up all the pointers (assuming
> > a conventional linked list structure).  Maybe tens of seconds...
>
> I'm thinking of maintaining the lists at the PG level. That's to say,
> we have an active/inactive list for every PG. We can load the lists in
> parallel during rebooting. Also, the ~100 MB lists are split among
> different OSD nodes. Perhaps it does not need such long time to load
> them?
>
> >
> > I wonder if instead we should construct some sort of flat model
> > where we load slabs of contiguous memory, 10's of MB each, and have
> > the next/previous pointers be a (slab,position) pair.  That way we
> > can load it into memory in big chunks, quickly, and be able to
> > operate on it (adjust links) immediately.
> >
> > Another thought: currently we use the hobject_t hash only instead of
> > the full object name.  We could continue to do the same, or we could
> > do a hash pair (hobject_t hash + a different hash of the rest of the
> > object) to keep the representation compact.  With a model lke the
> > above, that could get the object representation down to 2 u32's.  A
> > link could be a slab + position (2 more u32's), and if we have prev
> > + next that'd be just 6x4=24 bytes per object.
>
> Looks like for an object, the head and the snapshot version have the
> same hobject hash. Thus we have to use the hash pair instead of just
> the hobject hash. But I still have two questions if we use the hash
> pair to represent an object.
>
> 1) Does the hash pair uniquely identify an object? That's to say, is
> it possible for two objects to have the same hash pair?

With two hashes collisions would be rare but could happen

> 2) We need a way to get the full object name from the hash pair, so
> that we know what objects to evict. But seems like we don't have a
> good way to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) hash position to find the object.  That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset).  Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

________________________________

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: The design of the eviction improvement
  2015-07-22 18:40           ` Allen Samuels
@ 2015-07-22 18:45             ` Matt W. Benjamin
  2015-07-22 18:51             ` Sage Weil
  2015-07-23  5:52             ` Wang, Zhiqiang
  2 siblings, 0 replies; 21+ messages in thread
From: Matt W. Benjamin @ 2015-07-22 18:45 UTC (permalink / raw)
  To: Allen Samuels; +Cc: sjust, ceph-devel, Sage Weil, Zhiqiang Wang

Hi,

----- "Allen Samuels" <Allen.Samuels@sandisk.com> wrote:

> I'm very concerned about designing around the assumption that objects
> are ~1MB in size. That's probably a good assumption for block and HDFS
> dominated systems, but likely a very poor assumption about many object
> and file dominated systems.

++

> 
> If I understand the proposals that have been discussed, each of them
> assumes in in-memory data structure with an entry per object (the
> exact size of the entry varies with the different proposals).
> 
> Under that assumption, I have another concern which is the lack of
> graceful degradation as the object counts grow and the in-memory data
> structures get larger. Everything seems fine until just a few objects
> get added then the system starts to page and performance drops
> dramatically (likely) to the point where Linux will start killing
> OSDs.

I'm not clear why that needs to be the case (but don't think it matters just now whether I do,
I was just letting folks know that we have MQ implementation(s)), but what you're describing seems consistent the model Sage and Greg, at least, are describing.

Matt

> 
> What's really needed is some kind of way to extend the lists into
> storage in way that's doesn't cause a zillion I/O operations.
> 
> I have some vague idea that some data structure like the LSM mechanism
> ought to be able to accomplish what we want. Some amount of the data
> structure (the most likely to be used) is held in DRAM [and backed to
> storage for restart] and the least likely to be used is flushed to
> storage with some mechanism that allows batched updates.
> 
> Allen Samuels
> Software Architect, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 5:57 AM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > The part that worries me now is the speed with which we can load
> and
> > > manage such a list.  Assuming it is several hundred MB, it'll take
> a
> > > while to load that into memory and set up all the pointers
> (assuming
> > > a conventional linked list structure).  Maybe tens of seconds...
> >
> > I'm thinking of maintaining the lists at the PG level. That's to
> say,
> > we have an active/inactive list for every PG. We can load the lists
> in
> > parallel during rebooting. Also, the ~100 MB lists are split among
> > different OSD nodes. Perhaps it does not need such long time to
> load
> > them?
> >
> > >
> > > I wonder if instead we should construct some sort of flat model
> > > where we load slabs of contiguous memory, 10's of MB each, and
> have
> > > the next/previous pointers be a (slab,position) pair.  That way
> we
> > > can load it into memory in big chunks, quickly, and be able to
> > > operate on it (adjust links) immediately.
> > >
> > > Another thought: currently we use the hobject_t hash only instead
> of
> > > the full object name.  We could continue to do the same, or we
> could
> > > do a hash pair (hobject_t hash + a different hash of the rest of
> the
> > > object) to keep the representation compact.  With a model lke the
> > > above, that could get the object representation down to 2 u32's. 
> A
> > > link could be a slab + position (2 more u32's), and if we have
> prev
> > > + next that'd be just 6x4=24 bytes per object.
> >
> > Looks like for an object, the head and the snapshot version have
> the
> > same hobject hash. Thus we have to use the hash pair instead of
> just
> > the hobject hash. But I still have two questions if we use the hash
> > pair to represent an object.
> >
> > 1) Does the hash pair uniquely identify an object? That's to say,
> is
> > it possible for two objects to have the same hash pair?
> 
> With two hashes collisions would be rare but could happen
> 
> > 2) We need a way to get the full object name from the hash pair, so
> > that we know what objects to evict. But seems like we don't have a
> > good way to do this?
> 
> Ah, yeah--I'm a little stuck in the current hitset view of things.  I
> think we can either embed the full ghobject_t (which means we lose the
> fixed-size property, and the per-object overhead goes way up..
> probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate
> objects starting at the (hobject_t) hash position to find the object. 
> That's somewhat inefficient for FileStore (it'll list a directory of a
> hundred or so objects, probably, and iterate over them to find the
> right one), but for NewStore it will be quite fast (NewStore has all
> objects sorted into keys in rocksdb, so we just start listing at the
> right offset).  Usually we'll get the object right off, unless there
> are hobject_t hash collisions (already reasonably rare since it's a
> 2^32 space for the pool).
> 
> Given that, I would lean toward the 2-hash fixed-sized records (of
> these 2 options)...
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message
> is intended only for the use of the designated recipient(s) named
> above. If the reader of this message is not the intended recipient,
> you are hereby notified that you have received this message in error
> and that any review, dissemination, distribution, or copying of this
> message is strictly prohibited. If you have received this
> communication in error, please notify the sender by telephone or
> e-mail (as shown above) immediately and destroy any and all copies of
> this message in your possession (whether hard copies or electronically
> stored copies).
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 18:40           ` Allen Samuels
  2015-07-22 18:45             ` Matt W. Benjamin
@ 2015-07-22 18:51             ` Sage Weil
  2015-07-22 19:14               ` Allen Samuels
  2015-07-23  3:46               ` Wang, Zhiqiang
  2015-07-23  5:52             ` Wang, Zhiqiang
  2 siblings, 2 replies; 21+ messages in thread
From: Sage Weil @ 2015-07-22 18:51 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Wang, Zhiqiang, sjust@redhat.com, ceph-devel@vger.kernel.org

On Wed, 22 Jul 2015, Allen Samuels wrote:
> I'm very concerned about designing around the assumption that objects 
> are ~1MB in size. That's probably a good assumption for block and HDFS 
> dominated systems, but likely a very poor assumption about many object 
> and file dominated systems.
> 
> If I understand the proposals that have been discussed, each of them 
> assumes in in-memory data structure with an entry per object (the exact 
> size of the entry varies with the different proposals).
> 
> Under that assumption, I have another concern which is the lack of 
> graceful degradation as the object counts grow and the in-memory data 
> structures get larger. Everything seems fine until just a few objects 
> get added then the system starts to page and performance drops 
> dramatically (likely) to the point where Linux will start killing OSDs.
> 
> What's really needed is some kind of way to extend the lists into 
> storage in way that's doesn't cause a zillion I/O operations.
> 
> I have some vague idea that some data structure like the LSM mechanism 
> ought to be able to accomplish what we want. Some amount of the data 
> structure (the most likely to be used) is held in DRAM [and backed to 
> storage for restart] and the least likely to be used is flushed to 
> storage with some mechanism that allows batched updates.

How about this:

The basic mapping we want is object -> atime.

We keep a simple LRU of the top N objects in memory with the object->atime 
values.  When an object is accessed, it is moved or added to the top 
of the list.

Periodically, or when the LRU size reaches N * (1.x), we flush:

 - write the top N items to a compact object that can be quickly loaded
 - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb 
in a simple object -> atime fashion

When the agent runs, we just walk across that key range of the db the same 
way we currently enumerate objects.  For each record we use either the 
stored atime or the value in the in-memory LRU (it'll need to be 
dual-indexed by both a list and a hash map), whichever is newer.  We can 
use the same histogram estimation approach we do now to determine if the 
object in question is below the flush/evict threshold.

The LSM does the work of sorting/compacting the atime info, while we avoid 
touching it at all for the hottest objects to keep the amount of work it 
has to do in check.

sage

> 
> Allen Samuels
> Software Architect, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 5:57 AM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > The part that worries me now is the speed with which we can load and
> > > manage such a list.  Assuming it is several hundred MB, it'll take a
> > > while to load that into memory and set up all the pointers (assuming
> > > a conventional linked list structure).  Maybe tens of seconds...
> >
> > I'm thinking of maintaining the lists at the PG level. That's to say,
> > we have an active/inactive list for every PG. We can load the lists in
> > parallel during rebooting. Also, the ~100 MB lists are split among
> > different OSD nodes. Perhaps it does not need such long time to load
> > them?
> >
> > >
> > > I wonder if instead we should construct some sort of flat model
> > > where we load slabs of contiguous memory, 10's of MB each, and have
> > > the next/previous pointers be a (slab,position) pair.  That way we
> > > can load it into memory in big chunks, quickly, and be able to
> > > operate on it (adjust links) immediately.
> > >
> > > Another thought: currently we use the hobject_t hash only instead of
> > > the full object name.  We could continue to do the same, or we could
> > > do a hash pair (hobject_t hash + a different hash of the rest of the
> > > object) to keep the representation compact.  With a model lke the
> > > above, that could get the object representation down to 2 u32's.  A
> > > link could be a slab + position (2 more u32's), and if we have prev
> > > + next that'd be just 6x4=24 bytes per object.
> >
> > Looks like for an object, the head and the snapshot version have the
> > same hobject hash. Thus we have to use the hash pair instead of just
> > the hobject hash. But I still have two questions if we use the hash
> > pair to represent an object.
> >
> > 1) Does the hash pair uniquely identify an object? That's to say, is
> > it possible for two objects to have the same hash pair?
> 
> With two hashes collisions would be rare but could happen
> 
> > 2) We need a way to get the full object name from the hash pair, so
> > that we know what objects to evict. But seems like we don't have a
> > good way to do this?
> 
> Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) hash position to find the object.  That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset).  Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool).
> 
> Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 18:51             ` Sage Weil
@ 2015-07-22 19:14               ` Allen Samuels
  2015-07-22 21:53                 ` Sage Weil
  2015-07-23  3:46               ` Wang, Zhiqiang
  1 sibling, 1 reply; 21+ messages in thread
From: Allen Samuels @ 2015-07-22 19:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wang, Zhiqiang, sjust@redhat.com, ceph-devel@vger.kernel.org

Don't we need to double-index the data structure?

We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage.




Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 11:51 AM
To: Allen Samuels
Cc: Wang, Zhiqiang; sjust@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Allen Samuels wrote:
> I'm very concerned about designing around the assumption that objects 
> are ~1MB in size. That's probably a good assumption for block and HDFS 
> dominated systems, but likely a very poor assumption about many object 
> and file dominated systems.
> 
> If I understand the proposals that have been discussed, each of them 
> assumes in in-memory data structure with an entry per object (the 
> exact size of the entry varies with the different proposals).
> 
> Under that assumption, I have another concern which is the lack of 
> graceful degradation as the object counts grow and the in-memory data 
> structures get larger. Everything seems fine until just a few objects 
> get added then the system starts to page and performance drops 
> dramatically (likely) to the point where Linux will start killing OSDs.
> 
> What's really needed is some kind of way to extend the lists into 
> storage in way that's doesn't cause a zillion I/O operations.
> 
> I have some vague idea that some data structure like the LSM mechanism 
> ought to be able to accomplish what we want. Some amount of the data 
> structure (the most likely to be used) is held in DRAM [and backed to 
> storage for restart] and the least likely to be used is flushed to 
> storage with some mechanism that allows batched updates.

How about this:

The basic mapping we want is object -> atime.

We keep a simple LRU of the top N objects in memory with the object->atime values.  When an object is accessed, it is moved or added to the top of the list.

Periodically, or when the LRU size reaches N * (1.x), we flush:

 - write the top N items to a compact object that can be quickly loaded
 - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object -> atime fashion

When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects.  For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer.  We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold.

The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check.

sage

> 
> Allen Samuels
> Software Architect, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 5:57 AM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > The part that worries me now is the speed with which we can load 
> > > and manage such a list.  Assuming it is several hundred MB, it'll 
> > > take a while to load that into memory and set up all the pointers 
> > > (assuming a conventional linked list structure).  Maybe tens of seconds...
> >
> > I'm thinking of maintaining the lists at the PG level. That's to 
> > say, we have an active/inactive list for every PG. We can load the 
> > lists in parallel during rebooting. Also, the ~100 MB lists are 
> > split among different OSD nodes. Perhaps it does not need such long 
> > time to load them?
> >
> > >
> > > I wonder if instead we should construct some sort of flat model 
> > > where we load slabs of contiguous memory, 10's of MB each, and 
> > > have the next/previous pointers be a (slab,position) pair.  That 
> > > way we can load it into memory in big chunks, quickly, and be able 
> > > to operate on it (adjust links) immediately.
> > >
> > > Another thought: currently we use the hobject_t hash only instead 
> > > of the full object name.  We could continue to do the same, or we 
> > > could do a hash pair (hobject_t hash + a different hash of the 
> > > rest of the
> > > object) to keep the representation compact.  With a model lke the 
> > > above, that could get the object representation down to 2 u32's.  
> > > A link could be a slab + position (2 more u32's), and if we have 
> > > prev
> > > + next that'd be just 6x4=24 bytes per object.
> >
> > Looks like for an object, the head and the snapshot version have the 
> > same hobject hash. Thus we have to use the hash pair instead of just 
> > the hobject hash. But I still have two questions if we use the hash 
> > pair to represent an object.
> >
> > 1) Does the hash pair uniquely identify an object? That's to say, is 
> > it possible for two objects to have the same hash pair?
> 
> With two hashes collisions would be rare but could happen
> 
> > 2) We need a way to get the full object name from the hash pair, so 
> > that we know what objects to evict. But seems like we don't have a 
> > good way to do this?
> 
> Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) hash position to find the object.  That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset).  Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool).
> 
> Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 19:14               ` Allen Samuels
@ 2015-07-22 21:53                 ` Sage Weil
  2015-07-22 22:45                   ` Allen Samuels
  0 siblings, 1 reply; 21+ messages in thread
From: Sage Weil @ 2015-07-22 21:53 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Wang, Zhiqiang, sjust@redhat.com, ceph-devel@vger.kernel.org

On Wed, 22 Jul 2015, Allen Samuels wrote:
> Don't we need to double-index the data structure?
> 
> We need it indexed by atime for the purposes of eviction, but we need it 
> indexed by object name for the purposes of updating the list upon a 
> usage.

If you use the same approach the agent uses now (iterate over items, 
evict/trim anything in bottom end of observed age distribution) you can 
get away without the double-index.  Iterating over the LSM should be quite 
cheap.  I'd be more worried about the cost of the insertions.

I'm also not sure the simplistic approach below can be generalized to 
something like 2Q (and certainly not something like MQ).  Maybe...

On the other hand, I'm not sure it is the end of the world if at the end 
of the day the memory requirements for a cache-tier OSD are higher and 
inversely proportional to the object size.  We can make the OSD 
flush/evict more aggressively if the memory utilization (due to a high 
object count) gets out of hand as a safety mechanism.  Paying a few extra 
$$ for RAM isn't the end of the world I'm guessing when the performance 
payoff is significant...

sage


 > 
> 
> 
> 
> Allen Samuels
> Software Architect, Systems and Software Solutions 
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 11:51 AM
> To: Allen Samuels
> Cc: Wang, Zhiqiang; sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Allen Samuels wrote:
> > I'm very concerned about designing around the assumption that objects 
> > are ~1MB in size. That's probably a good assumption for block and HDFS 
> > dominated systems, but likely a very poor assumption about many object 
> > and file dominated systems.
> > 
> > If I understand the proposals that have been discussed, each of them 
> > assumes in in-memory data structure with an entry per object (the 
> > exact size of the entry varies with the different proposals).
> > 
> > Under that assumption, I have another concern which is the lack of 
> > graceful degradation as the object counts grow and the in-memory data 
> > structures get larger. Everything seems fine until just a few objects 
> > get added then the system starts to page and performance drops 
> > dramatically (likely) to the point where Linux will start killing OSDs.
> > 
> > What's really needed is some kind of way to extend the lists into 
> > storage in way that's doesn't cause a zillion I/O operations.
> > 
> > I have some vague idea that some data structure like the LSM mechanism 
> > ought to be able to accomplish what we want. Some amount of the data 
> > structure (the most likely to be used) is held in DRAM [and backed to 
> > storage for restart] and the least likely to be used is flushed to 
> > storage with some mechanism that allows batched updates.
> 
> How about this:
> 
> The basic mapping we want is object -> atime.
> 
> We keep a simple LRU of the top N objects in memory with the object->atime values.  When an object is accessed, it is moved or added to the top of the list.
> 
> Periodically, or when the LRU size reaches N * (1.x), we flush:
> 
>  - write the top N items to a compact object that can be quickly loaded
>  - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object -> atime fashion
> 
> When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects.  For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer.  We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold.
> 
> The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check.
> 
> sage
> 
> > 
> > Allen Samuels
> > Software Architect, Systems and Software Solutions
> > 
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Wednesday, July 22, 2015 5:57 AM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > Subject: RE: The design of the eviction improvement
> > 
> > On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > > The part that worries me now is the speed with which we can load 
> > > > and manage such a list.  Assuming it is several hundred MB, it'll 
> > > > take a while to load that into memory and set up all the pointers 
> > > > (assuming a conventional linked list structure).  Maybe tens of seconds...
> > >
> > > I'm thinking of maintaining the lists at the PG level. That's to 
> > > say, we have an active/inactive list for every PG. We can load the 
> > > lists in parallel during rebooting. Also, the ~100 MB lists are 
> > > split among different OSD nodes. Perhaps it does not need such long 
> > > time to load them?
> > >
> > > >
> > > > I wonder if instead we should construct some sort of flat model 
> > > > where we load slabs of contiguous memory, 10's of MB each, and 
> > > > have the next/previous pointers be a (slab,position) pair.  That 
> > > > way we can load it into memory in big chunks, quickly, and be able 
> > > > to operate on it (adjust links) immediately.
> > > >
> > > > Another thought: currently we use the hobject_t hash only instead 
> > > > of the full object name.  We could continue to do the same, or we 
> > > > could do a hash pair (hobject_t hash + a different hash of the 
> > > > rest of the
> > > > object) to keep the representation compact.  With a model lke the 
> > > > above, that could get the object representation down to 2 u32's.  
> > > > A link could be a slab + position (2 more u32's), and if we have 
> > > > prev
> > > > + next that'd be just 6x4=24 bytes per object.
> > >
> > > Looks like for an object, the head and the snapshot version have the 
> > > same hobject hash. Thus we have to use the hash pair instead of just 
> > > the hobject hash. But I still have two questions if we use the hash 
> > > pair to represent an object.
> > >
> > > 1) Does the hash pair uniquely identify an object? That's to say, is 
> > > it possible for two objects to have the same hash pair?
> > 
> > With two hashes collisions would be rare but could happen
> > 
> > > 2) We need a way to get the full object name from the hash pair, so 
> > > that we know what objects to evict. But seems like we don't have a 
> > > good way to do this?
> > 
> > Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) hash position to find the object.  That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset).  Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool).
> > 
> > Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...
> > 
> > sage
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > ________________________________
> > 
> > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 21:53                 ` Sage Weil
@ 2015-07-22 22:45                   ` Allen Samuels
  0 siblings, 0 replies; 21+ messages in thread
From: Allen Samuels @ 2015-07-22 22:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wang, Zhiqiang, sjust@redhat.com, ceph-devel@vger.kernel.org

Yes the cost of the insertions with the current scheme is probably prohibitive. Wouldn't it approach the same amount of time as just having atime turned on in the file system? 

My concern about the memory is mostly that we ensure whatever algorithm is selected degrades gracefully when you get high counts of small objects. I agree that paying $ for RAM that translates into actual performance isn't really a problem. It really boils down to your workload and access pattern.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Wednesday, July 22, 2015 2:53 PM
To: Allen Samuels
Cc: Wang, Zhiqiang; sjust@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Allen Samuels wrote:
> Don't we need to double-index the data structure?
> 
> We need it indexed by atime for the purposes of eviction, but we need 
> it indexed by object name for the purposes of updating the list upon a 
> usage.

If you use the same approach the agent uses now (iterate over items, evict/trim anything in bottom end of observed age distribution) you can get away without the double-index.  Iterating over the LSM should be quite cheap.  I'd be more worried about the cost of the insertions.

I'm also not sure the simplistic approach below can be generalized to something like 2Q (and certainly not something like MQ).  Maybe...

On the other hand, I'm not sure it is the end of the world if at the end of the day the memory requirements for a cache-tier OSD are higher and inversely proportional to the object size.  We can make the OSD flush/evict more aggressively if the memory utilization (due to a high object count) gets out of hand as a safety mechanism.  Paying a few extra $$ for RAM isn't the end of the world I'm guessing when the performance payoff is significant...

sage


 > 
> 
> 
> 
> Allen Samuels
> Software Architect, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 11:51 AM
> To: Allen Samuels
> Cc: Wang, Zhiqiang; sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Allen Samuels wrote:
> > I'm very concerned about designing around the assumption that 
> > objects are ~1MB in size. That's probably a good assumption for 
> > block and HDFS dominated systems, but likely a very poor assumption 
> > about many object and file dominated systems.
> > 
> > If I understand the proposals that have been discussed, each of them 
> > assumes in in-memory data structure with an entry per object (the 
> > exact size of the entry varies with the different proposals).
> > 
> > Under that assumption, I have another concern which is the lack of 
> > graceful degradation as the object counts grow and the in-memory 
> > data structures get larger. Everything seems fine until just a few 
> > objects get added then the system starts to page and performance 
> > drops dramatically (likely) to the point where Linux will start killing OSDs.
> > 
> > What's really needed is some kind of way to extend the lists into 
> > storage in way that's doesn't cause a zillion I/O operations.
> > 
> > I have some vague idea that some data structure like the LSM 
> > mechanism ought to be able to accomplish what we want. Some amount 
> > of the data structure (the most likely to be used) is held in DRAM 
> > [and backed to storage for restart] and the least likely to be used 
> > is flushed to storage with some mechanism that allows batched updates.
> 
> How about this:
> 
> The basic mapping we want is object -> atime.
> 
> We keep a simple LRU of the top N objects in memory with the object->atime values.  When an object is accessed, it is moved or added to the top of the list.
> 
> Periodically, or when the LRU size reaches N * (1.x), we flush:
> 
>  - write the top N items to a compact object that can be quickly 
> loaded
>  - write our records for the oldest items (N .. N*1.x) to 
> leveldb/rocksdb in a simple object -> atime fashion
> 
> When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects.  For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer.  We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold.
> 
> The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check.
> 
> sage
> 
> > 
> > Allen Samuels
> > Software Architect, Systems and Software Solutions
> > 
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org 
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Wednesday, July 22, 2015 5:57 AM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > Subject: RE: The design of the eviction improvement
> > 
> > On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > > The part that worries me now is the speed with which we can load 
> > > > and manage such a list.  Assuming it is several hundred MB, 
> > > > it'll take a while to load that into memory and set up all the 
> > > > pointers (assuming a conventional linked list structure).  Maybe tens of seconds...
> > >
> > > I'm thinking of maintaining the lists at the PG level. That's to 
> > > say, we have an active/inactive list for every PG. We can load the 
> > > lists in parallel during rebooting. Also, the ~100 MB lists are 
> > > split among different OSD nodes. Perhaps it does not need such 
> > > long time to load them?
> > >
> > > >
> > > > I wonder if instead we should construct some sort of flat model 
> > > > where we load slabs of contiguous memory, 10's of MB each, and 
> > > > have the next/previous pointers be a (slab,position) pair.  That 
> > > > way we can load it into memory in big chunks, quickly, and be 
> > > > able to operate on it (adjust links) immediately.
> > > >
> > > > Another thought: currently we use the hobject_t hash only 
> > > > instead of the full object name.  We could continue to do the 
> > > > same, or we could do a hash pair (hobject_t hash + a different 
> > > > hash of the rest of the
> > > > object) to keep the representation compact.  With a model lke 
> > > > the above, that could get the object representation down to 2 u32's.
> > > > A link could be a slab + position (2 more u32's), and if we have 
> > > > prev
> > > > + next that'd be just 6x4=24 bytes per object.
> > >
> > > Looks like for an object, the head and the snapshot version have 
> > > the same hobject hash. Thus we have to use the hash pair instead 
> > > of just the hobject hash. But I still have two questions if we use 
> > > the hash pair to represent an object.
> > >
> > > 1) Does the hash pair uniquely identify an object? That's to say, 
> > > is it possible for two objects to have the same hash pair?
> > 
> > With two hashes collisions would be rare but could happen
> > 
> > > 2) We need a way to get the full object name from the hash pair, 
> > > so that we know what objects to evict. But seems like we don't 
> > > have a good way to do this?
> > 
> > Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) hash position to find the object.  That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset).  Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool).
> > 
> > Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)...
> > 
> > sage
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > ________________________________
> > 
> > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 18:51             ` Sage Weil
  2015-07-22 19:14               ` Allen Samuels
@ 2015-07-23  3:46               ` Wang, Zhiqiang
  1 sibling, 0 replies; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-23  3:46 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, July 23, 2015 2:51 AM
> To: Allen Samuels
> Cc: Wang, Zhiqiang; sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Allen Samuels wrote:
> > I'm very concerned about designing around the assumption that objects
> > are ~1MB in size. That's probably a good assumption for block and HDFS
> > dominated systems, but likely a very poor assumption about many object
> > and file dominated systems.
> >
> > If I understand the proposals that have been discussed, each of them
> > assumes in in-memory data structure with an entry per object (the
> > exact size of the entry varies with the different proposals).
> >
> > Under that assumption, I have another concern which is the lack of
> > graceful degradation as the object counts grow and the in-memory data
> > structures get larger. Everything seems fine until just a few objects
> > get added then the system starts to page and performance drops
> > dramatically (likely) to the point where Linux will start killing OSDs.
> >
> > What's really needed is some kind of way to extend the lists into
> > storage in way that's doesn't cause a zillion I/O operations.
> >
> > I have some vague idea that some data structure like the LSM mechanism
> > ought to be able to accomplish what we want. Some amount of the data
> > structure (the most likely to be used) is held in DRAM [and backed to
> > storage for restart] and the least likely to be used is flushed to
> > storage with some mechanism that allows batched updates.
> 
> How about this:
> 
> The basic mapping we want is object -> atime.
> 
> We keep a simple LRU of the top N objects in memory with the object->atime
> values.  When an object is accessed, it is moved or added to the top of the list.
> 
> Periodically, or when the LRU size reaches N * (1.x), we flush:
> 
>  - write the top N items to a compact object that can be quickly loaded
>  - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a
> simple object -> atime fashion
> 
> When the agent runs, we just walk across that key range of the db the same
> way we currently enumerate objects.  For each record we use either the
> stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by
> both a list and a hash map), whichever is newer.  We can use the same
> histogram estimation approach we do now to determine if the object in
> question is below the flush/evict threshold.

This looks similar to what we do now, except it keeps a LRU of the object->atime mapping in RAM, instead of using a hitset. The object age calculated using the atime would be more accurate than the current hitset and mtime approach.

One comment is that I think if we can find a record of an object in the in-memory LRU list, we don't need to query the DB since the atime in the LRU list is for sure newer than that in the db (if it has).

My concern on this approach is whether the evict decision made by the histogram estimation approach is good enough. It only measures 'recency'. And it made the decision based on some threshold, not starting from the oldest. In contrast, most of the practical algorithms made the decision based on both 'recency' and 'frequency' (accessed once recently vs. accessed twice or more recently).

If we believe the histogram estimation approach is good enough, I think we can easily integrate the idea above with 2Q.
1) The in-memory LRU lists are the same as 2Q. i.e., there are active/inactive lists, and the movements between the two lists are the same as what I stated in the original mail. But we have a limit of the size of the lists. Only the top N hottest objects are kept in the lists.
2) When the size of the lists exceed N*(1.x), evict the oldest items (N .. N*1.x) to db store.
3) N could be dynamically adjusted based on the average size of objects in the PG.
4) Evict decision are made by the histogram estimation approach. Plus I think we want to evict those objects which are not in the in-memory lists first.

> 
> The LSM does the work of sorting/compacting the atime info, while we avoid
> touching it at all for the hottest objects to keep the amount of work it has to do
> in check.
> 
> sage
> 
> >
> > Allen Samuels
> > Software Architect, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Wednesday, July 22, 2015 5:57 AM
> > To: Wang, Zhiqiang
> > Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> > Subject: RE: The design of the eviction improvement
> >
> > On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > > The part that worries me now is the speed with which we can load
> > > > and manage such a list.  Assuming it is several hundred MB, it'll
> > > > take a while to load that into memory and set up all the pointers
> > > > (assuming a conventional linked list structure).  Maybe tens of seconds...
> > >
> > > I'm thinking of maintaining the lists at the PG level. That's to
> > > say, we have an active/inactive list for every PG. We can load the
> > > lists in parallel during rebooting. Also, the ~100 MB lists are
> > > split among different OSD nodes. Perhaps it does not need such long
> > > time to load them?
> > >
> > > >
> > > > I wonder if instead we should construct some sort of flat model
> > > > where we load slabs of contiguous memory, 10's of MB each, and
> > > > have the next/previous pointers be a (slab,position) pair.  That
> > > > way we can load it into memory in big chunks, quickly, and be able
> > > > to operate on it (adjust links) immediately.
> > > >
> > > > Another thought: currently we use the hobject_t hash only instead
> > > > of the full object name.  We could continue to do the same, or we
> > > > could do a hash pair (hobject_t hash + a different hash of the
> > > > rest of the
> > > > object) to keep the representation compact.  With a model lke the
> > > > above, that could get the object representation down to 2 u32's.
> > > > A link could be a slab + position (2 more u32's), and if we have
> > > > prev
> > > > + next that'd be just 6x4=24 bytes per object.
> > >
> > > Looks like for an object, the head and the snapshot version have the
> > > same hobject hash. Thus we have to use the hash pair instead of just
> > > the hobject hash. But I still have two questions if we use the hash
> > > pair to represent an object.
> > >
> > > 1) Does the hash pair uniquely identify an object? That's to say, is
> > > it possible for two objects to have the same hash pair?
> >
> > With two hashes collisions would be rare but could happen
> >
> > > 2) We need a way to get the full object name from the hash pair, so
> > > that we know what objects to evict. But seems like we don't have a
> > > good way to do this?
> >
> > Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we
> can either embed the full ghobject_t (which means we lose the fixed-size
> property, and the per-object overhead goes way up.. probably from ~24 bytes
> to more like 80 or 100).  Or, we can enumerate objects starting at the
> (hobject_t) hash position to find the object.  That's somewhat inefficient for
> FileStore (it'll list a directory of a hundred or so objects, probably, and iterate
> over them to find the right one), but for NewStore it will be quite fast
> (NewStore has all objects sorted into keys in rocksdb, so we just start listing at
> the right offset).  Usually we'll get the object right off, unless there are
> hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the
> pool).
> >
> > Given that, I would lean toward the 2-hash fixed-sized records (of these 2
> options)...
> >
> > sage
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> > ________________________________
> >
> > PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: The design of the eviction improvement
  2015-07-22 18:40           ` Allen Samuels
  2015-07-22 18:45             ` Matt W. Benjamin
  2015-07-22 18:51             ` Sage Weil
@ 2015-07-23  5:52             ` Wang, Zhiqiang
  2 siblings, 0 replies; 21+ messages in thread
From: Wang, Zhiqiang @ 2015-07-23  5:52 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: sjust@redhat.com, ceph-devel@vger.kernel.org

Hi Allen,

> -----Original Message-----
> From: Allen Samuels [mailto:Allen.Samuels@sandisk.com]
> Sent: Thursday, July 23, 2015 2:41 AM
> To: Sage Weil; Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> I'm very concerned about designing around the assumption that objects are
> ~1MB in size. That's probably a good assumption for block and HDFS dominated
> systems, but likely a very poor assumption about many object and file
> dominated systems.

This is true. If we have lots of small objects/files, the memory used for LRU lists could be extremely large.

> 
> If I understand the proposals that have been discussed, each of them assumes
> in in-memory data structure with an entry per object (the exact size of the
> entry varies with the different proposals).
> 
> Under that assumption, I have another concern which is the lack of graceful
> degradation as the object counts grow and the in-memory data structures get
> larger. Everything seems fine until just a few objects get added then the system
> starts to page and performance drops dramatically (likely) to the point where
> Linux will start killing OSDs.
> 
> What's really needed is some kind of way to extend the lists into storage in way
> that's doesn't cause a zillion I/O operations.
> 
> I have some vague idea that some data structure like the LSM mechanism
> ought to be able to accomplish what we want. Some amount of the data
> structure (the most likely to be used) is held in DRAM [and backed to storage
> for restart] and the least likely to be used is flushed to storage with some
> mechanism that allows batched updates.

The LSM mechanism could solve the memory consumption problem. But I guess the process to choose which objects to evict is complex and inefficient. Also, after evicting some objects, we need to update the on-disk file to remove the entries of these objects. This is inefficient, too.

> 
> Allen Samuels
> Software Architect, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 5:57 AM
> To: Wang, Zhiqiang
> Cc: sjust@redhat.com; ceph-devel@vger.kernel.org
> Subject: RE: The design of the eviction improvement
> 
> On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > The part that worries me now is the speed with which we can load and
> > > manage such a list.  Assuming it is several hundred MB, it'll take a
> > > while to load that into memory and set up all the pointers (assuming
> > > a conventional linked list structure).  Maybe tens of seconds...
> >
> > I'm thinking of maintaining the lists at the PG level. That's to say,
> > we have an active/inactive list for every PG. We can load the lists in
> > parallel during rebooting. Also, the ~100 MB lists are split among
> > different OSD nodes. Perhaps it does not need such long time to load
> > them?
> >
> > >
> > > I wonder if instead we should construct some sort of flat model
> > > where we load slabs of contiguous memory, 10's of MB each, and have
> > > the next/previous pointers be a (slab,position) pair.  That way we
> > > can load it into memory in big chunks, quickly, and be able to
> > > operate on it (adjust links) immediately.
> > >
> > > Another thought: currently we use the hobject_t hash only instead of
> > > the full object name.  We could continue to do the same, or we could
> > > do a hash pair (hobject_t hash + a different hash of the rest of the
> > > object) to keep the representation compact.  With a model lke the
> > > above, that could get the object representation down to 2 u32's.  A
> > > link could be a slab + position (2 more u32's), and if we have prev
> > > + next that'd be just 6x4=24 bytes per object.
> >
> > Looks like for an object, the head and the snapshot version have the
> > same hobject hash. Thus we have to use the hash pair instead of just
> > the hobject hash. But I still have two questions if we use the hash
> > pair to represent an object.
> >
> > 1) Does the hash pair uniquely identify an object? That's to say, is
> > it possible for two objects to have the same hash pair?
> 
> With two hashes collisions would be rare but could happen
> 
> > 2) We need a way to get the full object name from the hash pair, so
> > that we know what objects to evict. But seems like we don't have a
> > good way to do this?
> 
> Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we can
> either embed the full ghobject_t (which means we lose the fixed-size property,
> and the per-object overhead goes way up.. probably from ~24 bytes to more
> like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) hash
> position to find the object.  That's somewhat inefficient for FileStore (it'll list a
> directory of a hundred or so objects, probably, and iterate over them to find the
> right one), but for NewStore it will be quite fast (NewStore has all objects
> sorted into keys in rocksdb, so we just start listing at the right offset).  Usually
> we'll get the object right off, unless there are hobject_t hash collisions (already
> reasonably rare since it's a 2^32 space for the pool).
> 
> Given that, I would lean toward the 2-hash fixed-sized records (of these 2
> options)...
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> 
> ________________________________
> 
> PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-07-23  5:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <639332441.73.1437495369873.JavaMail.root@thunderbeast.private.linuxbox.com>
2015-07-21 16:24 ` The design of the eviction improvement Matt W. Benjamin
     [not found] <1539008200.49.1437488109510.JavaMail.root@thunderbeast.private.linuxbox.com>
2015-07-21 14:15 ` Matt W. Benjamin
2015-07-21 14:25   ` Gregory Farnum
2015-07-22  3:57   ` Wang, Zhiqiang
2015-07-22  4:06     ` Matt W. Benjamin
2015-07-20  8:47 Wang, Zhiqiang
2015-07-20 20:53 ` Allen Samuels
2015-07-20 22:38 ` Sage Weil
2015-07-21  4:08   ` Wang, Zhiqiang
2015-07-21 13:28     ` Sage Weil
2015-07-22  3:34       ` Wang, Zhiqiang
2015-07-22 12:56         ` Sage Weil
2015-07-22 18:40           ` Allen Samuels
2015-07-22 18:45             ` Matt W. Benjamin
2015-07-22 18:51             ` Sage Weil
2015-07-22 19:14               ` Allen Samuels
2015-07-22 21:53                 ` Sage Weil
2015-07-22 22:45                   ` Allen Samuels
2015-07-23  3:46               ` Wang, Zhiqiang
2015-07-23  5:52             ` Wang, Zhiqiang
     [not found] ` <08fcbe8a.9Ro.9Gf.1U.hCy4iF@mailjet.com>
2015-07-21  3:55   ` Wang, Zhiqiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.