* zsmalloc concerns
@ 2012-06-05 3:25 Dan Magenheimer
2012-06-05 6:34 ` Minchan Kim
2012-06-06 0:28 ` Nitin Gupta
0 siblings, 2 replies; 7+ messages in thread
From: Dan Magenheimer @ 2012-06-05 3:25 UTC (permalink / raw)
To: Minchan Kim; +Cc: Seth Jennings, linux-mm, Nitin Gupta, Konrad Wilk
Hi Minchan (and all) --
I promised you that after the window closed, I would
write up my concerns about zsmalloc. My preference would
be to use zsmalloc, but there are definitely tradeoffs
and my objective is to make zcache and RAMster ready
for enterprise customers so I would use a different
or captive allocator if these zsmalloc issues can't
be overcome.
Thanks,
Dan
===
Zsmalloc is designed to maximize the density of items that vary
in size over 0 < size < PAGE_SIZE, especially when the mean
item size significantly exceeds PAGE_SIZE/2. It is primarily
useful when there is a large quantity of such items to be
stored with little or no space wasted; if the quantity
is small and/or some wasted space is acceptable, existing
kernel allocators (e.g. slab) may be sufficient. In the
case of zcache (and zram and ramster), where a large fraction
of RAM is used to store zpages (lzo1x-compressed pages),
zsmalloc seems to be a good match. It is unclear whether
zsmalloc will ever have another user -- unless that user is
also storing large quantities of compressed pages.
Zcache is currently a primary user of zsmalloc; however,
zcache only uses zsmalloc for anonymous/swap ("frontswap")
pages, not for file ("cleancache") pages. For file pages,
zcache uses the captive "zbud" allocator; this is because
zcache requires a shrinker for cleancache pages, by which
entire pageframes can be easily reclaimed. Zsmalloc doesn't
currently have shrinker capability and, because its
storage patterns in and across physical pageframes are
quite complex (to maximize density), an intelligent reclaim
implementation may be difficult to design race-free. And
implementing reclaim opaquely (i.e. while maintaining a clean
layering) may be impossible.
A good analogy might be linked-lists. Zsmalloc is like
a singly-linked list (space-efficient but not as flexible)
and zbud is like a doubly-linked list (not as space-efficient
but more flexible). One has to choose the best data
structure according to the functionality required.
Some believe that the next step in zcache evolution will
require shrinking of both frontswap and cleancache pages.
Andrea has also stated that he thinks frontswap shrinking
will be a must for any future KVM-tmem implementation.
But preliminary investigations indicate that pageframe reclaim
of frontswap pages may be even more difficult with zsmalloc.
Until this issue is resolved (either by an adequately working
implementation of reclaim with zsmalloc or via demonstration
that zcache reclaim is unnecessary), the future use of zsmalloc
by zcache is cloudy.
I'm currently rewriting zbud as a foundation to investigate
some reclaim policy ideas that I think will be useful both for
KVM and for making zcache "enterprise ready." When that is
done, we will see if zsmalloc can achieve the same flexibility.
A few related comments about these allocators and their users:
Zsmalloc relies on some clever underlying virtual-to-physical
mapping manipulations to ensure that its users can store and
retrieve items. These manipulations are necessary on HIGHMEM
processors, but the cost is unclear on non-HIGHMEM processors.
(Manipulating TLB entries is not inexpensive.) For zcache, the
overhead may be irrelevant as long as it is a small fraction
of the cost of compression/decompression, but it is worth
measuring (worst case) to verify.
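A rough way to get that worst-case number (just a sketch, assuming
zsmalloc's current zs_map_object()/zs_unmap_object() interface; the
exact signatures may differ, and zs_map_cost_total_ns() is a
hypothetical helper, not existing code):

#include <linux/ktime.h>

/*
 * Sketch only: total time for @iters map/touch/unmap round trips on
 * one zsmalloc object.  Run it against a handle whose object is
 * known to straddle two pageframes to see the worst case.
 */
static s64 zs_map_cost_total_ns(struct zs_pool *pool,
				unsigned long handle, int iters)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < iters; i++) {
		void *obj = zs_map_object(pool, handle);

		/* touch the mapping so it cannot be optimized away */
		*(volatile char *)obj;
		zs_unmap_object(pool, handle);
	}
	return ktime_to_ns(ktime_sub(ktime_get(), start));
}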
Zbud can implement efficient reclaim because no more than two
items ever reside in the same pageframe and items never
cross a pageframe boundary. While zbud storage is certainly
less dense than zsmalloc, the density is probably sufficient
if the size of items is bell-curve distributed with a mean
size of PAGE_SIZE/2 (or slightly less). This is true for
many workloads, but datasets where the vast majority of items
exceed PAGE_SIZE/2 render zbud useless. Note, however, that
zcache (due to its foundation on transcendent memory) currently
implements an admission policy that rejects pages when extreme
datasets are encountered. In other words, zbud would handle
these workloads simply by rejecting the pages, resulting
in performance no worse (approximately) than if zcache were
not present.
RAMster maintains data structures that point to both local
and remote zpages. Remote pages are identified
by a handle-like bit sequence while local pages are identified
by a true pointer. (Note that ramster currently will not
run on a HIGHMEM machine.) RAMster currently differentiates
between the two via a hack: examining the LSB. If the
LSB is set, it is a handle referring to a remote page.
This works with xvmalloc and zbud but not with zsmalloc's
opaque handle. A simple solution would require zsmalloc
to reserve the LSB of the opaque handle as must-be-zero.
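For illustration, the LSB convention amounts to something like this
(a sketch with made-up names, not RAMster's actual code); it only
works if the allocator guarantees the low bit of a local
pointer/handle is always zero:

/* Sketch only: tag remote references in the low bit of a pampd. */
#define PAMPD_IS_REMOTE(p)	((unsigned long)(p) & 1UL)
#define PAMPD_MAKE_REMOTE(h)	((void *)((unsigned long)(h) | 1UL))
#define PAMPD_TO_HANDLE(p)	((unsigned long)(p) & ~1UL)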
Zram is actually a good match for current zsmalloc because
its storage grows to a pre-set RAM maximum size and cannot
shrink again. Reclaim is not possible without a massive
redesign (and that redesign is essentially zcache). But as
a result of its grow-but-never-shrink design, zram may have
some significant performance implications on most workloads
and system configurations. It remains to be seen if its
niche usage will warrant promotion from the staging tree.
* Re: zsmalloc concerns
2012-06-05 3:25 zsmalloc concerns Dan Magenheimer
@ 2012-06-05 6:34 ` Minchan Kim
2012-06-06 17:34 ` Dan Magenheimer
2012-06-06 0:28 ` Nitin Gupta
1 sibling, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2012-06-05 6:34 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Seth Jennings, linux-mm, Nitin Gupta, Konrad Wilk
Hi Dan,
On 06/05/2012 12:25 PM, Dan Magenheimer wrote:
> Hi Minchan (and all) --
>
> I promised you that after the window closed, I would
> write up my concerns about zsmalloc. My preference would
> be to use zsmalloc, but there are definitely tradeoffs
> and my objective is to make zcache and RAMster ready
> for enterprise customers so I would use a different
> or captive allocator if these zsmalloc issues can't
> be overcome.
>
> Thanks,
> Dan
>
> ===
>
> Zsmalloc is designed to maximize density of items that vary in
> size between 0<size<PAGE_SIZE, but especially when the mean
> item size significantly exceeds PAGE_SIZE/2. It is primarily
> useful when there are a large quantity of such items to be
> stored with little or no space wasted; if the quantity
> is small and/or some wasted space is acceptable, existing
> kernel allocators (e.g. slab) may be sufficient. In the
> case of zcache (and zram and ramster), where a large fraction
> of RAM is used to store zpages (lzo1x-compressed pages),
> zsmalloc seems to be a good match. It is unclear whether
> zsmalloc will ever have another user -- unless that user is
> also storing large quantities of compressed pages.
>
> Zcache is currently one primary user of zsmalloc, however
> zcache only uses zsmalloc for anonymous/swap ("frontswap")
> pages, not for file ("cleancache") pages. For file pages,
> zcache uses the captive "zbud" allocator; this is because
> zcache requires a shrinker for cleancache pages, by which
> entire pageframes can be easily reclaimed. Zsmalloc doesn't
> currently have shrinker capability and, because its
> storage patterns in and across physical pageframes are
> quite complex (to maximize density), an intelligent reclaim
> implementation may be difficult to design race-free. And
> implementing reclaim opaquely (i.e. while maintaining a clean
> layering) may be impossible.
>
> A good analogy might be linked-lists. Zsmalloc is like
> a singly-linked list (space-efficient but not as flexible)
> and zbud is like a doubly-linked list (not as space-efficient
> but more flexible). One has to choose the best data
> structure according to the functionality required.
>
> Some believe that the next step in zcache evolution will
> require shrinking of both frontswap and cleancache pages.
> Andrea has also stated that he thinks frontswap shrinking
> will be a must for any future KVM-tmem implementation.
> But preliminary investigations indicate that pageframe reclaim
> of frontswap pages may be even more difficult with zsmalloc.
> Until this issue is resolved (either by an adequately working
> implementation of reclaim with zsmalloc or via demonstration
> that zcache reclaim is unnecessary), the future use of zsmalloc
> by zcache is cloudy.
What's the requirement for shrinking zsmalloc?
For example:

int shrink_zsmalloc_memory(int nr_pages)
{
	return zsmalloc_evict_pages(nr_pages);
}

Could you tell us your detailed requirement?
Let's see whether it's possible or not with the current zsmalloc.
>
> I'm currently rewriting zbud as a foundation to investigate
> some reclaim policy ideas that I think will be useful both for
> KVM and for making zcache "enterprise ready." When that is
> done, we will see if zsmalloc can achieve the same flexibility.
>
> A few related comments about these allocators and their users:
>
> Zsmalloc relies on some clever underlying virtual-to-physical
> mapping manipulations to ensure that its users can store and
> retrieve items. These manipulations are necessary on HIGHMEM
HIGHMEM processors?
I think we need it even if the system doesn't support HIGHMEM.
Maybe I am missing your point.
> processors, but the cost is unclear on non-HIGHMEM processors.
> (Manipulating TLB entries is not inexpensive.) For zcache, the
> overhead may be irrelevant as long as it is a small fraction
> of the cost of compression/decompression, but it is worth
> measuring (worst case) to verify.
>
> Zbud can implement efficient reclaim because no more than two
> items ever reside in the same pageframe and items never
> cross a pageframe boundary. While zbud storage is certainly
> less dense than zsmalloc, the density is probably sufficient
> if the size of items is bell-curve distributed with a mean
> size of PAGE_SIZE/2 (or slightly less). This is true for
> many workloads, but datasets where the vast majority of items
> exceed PAGE_SIZE/2 render zbud useless. Note, however, that
> zcache (due to its foundation on transcendent memory) currently
> implements an admission policy that rejects pages when extreme
> datasets are encountered. In other words, zbud would handle
> these workloads simply by rejecting the pages, resulting
> in performance no worse (approximately) than if zcache were
> not present.
>
> RAMster maintains data structures to both point to zpages
> that are local and remote. Remote pages are identified
> by a handle-like bit sequence while local pages are identified
> by a true pointer. (Note that ramster currently will not
> run on a HIGHMEM machine.) RAMster currently differentiates
> between the two via a hack: examining the LSB. If the
> LSB is set, it is a handle referring to a remote page.
> This works with xvmalloc and zbud but not with zsmalloc's
> opaque handle. A simple solution would require zsmalloc
> to reserve the LSB of the opaque handle as must-be-zero.
As you know, it's not difficult, but it breaks the opaque handle's concept.
I want to avoid that and instead let you put some identifier somewhere in zcache.
>
> Zram is actually a good match for current zsmalloc because
> its storage grows to a pre-set RAM maximum size and cannot
> shrink again. Reclaim is not possible without a massive
> redesign (and that redesign is essentially zcache). But as
> a result of its grow-but-never-shrink design, zram may have
> some significant performance implications on most workloads
> and system configurations. It remains to be seen if its
> niche usage will warrant promotion from the staging tree.
At least, many embedded devices have used zram since compcache was introduced.
But I am not sure whether zcache can replace it.
If zcache can replace it, you will be right.
Comparing the zcache and zram implementations is one item on my TODO list,
so I am happy to see them discussed.
But I can't do it shortly due to other urgent work.
In summary, I WANT TO KNOW your detailed requirement for shrinking zsmalloc.
--
Kind regards,
Minchan Kim
* RE: zsmalloc concerns
2012-06-05 6:34 ` Minchan Kim
@ 2012-06-06 17:34 ` Dan Magenheimer
2012-06-07 8:06 ` Minchan Kim
0 siblings, 1 reply; 7+ messages in thread
From: Dan Magenheimer @ 2012-06-06 17:34 UTC (permalink / raw)
To: Minchan Kim; +Cc: Seth Jennings, linux-mm, Nitin Gupta, Konrad Wilk
> From: Minchan Kim [mailto:minchan@kernel.org]
Hi Minchan --
Reordering the reply a bit...
> > On 06/05/2012 12:25 PM, Dan Magenheimer wrote:
> > Zsmalloc relies on some clever underlying virtual-to-physical
> > mapping manipulations to ensure that its users can store and
> > retrieve items. These manipulations are necessary on HIGHMEM
>
> HIGHMEM processors?
> I think we need it if the system doesn't support HIGHMEM.
> Maybe I am missing your point.
I didn't say it very clearly. What I meant is that, on
processors that require HIGHMEM, it is always necessary
to do a kmap/kunmap around accessing the contents of a
pageframe referred to by a struct page. On machines
with no HIGHMEM, the kernel is completely mapped so
kmap/kunmap to kernel space are very simple and fast.
However, whenever a compressed item crosses a page
boundary in zsmalloc, zsmalloc creates a special "pair"
mapping of the two pages, and kmap/kunmaps the pair for
every access. This is why special TLB tricks must
be used by zsmalloc. I think this can be expensive
so I consider this a disadvantage of zsmalloc, even
though it is very clever and very useful for storing
a large number of items with size larger than PAGE_SIZE/2.
> What's the requirement for shrinking zsmalloc?
> For example,
>
> int shrink_zsmalloc_memory(int nr_pages)
> {
> zsmalloc_evict_pages(nr_pages);
> }
>
> Could you tell us your detailed requirement?
> Let's see it's possible or not at current zsmalloc.
The objective of the shrinker is to reclaim full
pageframes. Due to the way zsmalloc works, when
it stores N items in M pages, worst case it
may take N-M zsmalloc "item evictions" before even
a single pageframe is reclaimed.
Next, remember that there may be several "pointers"
(stored as zsmalloc object handles) referencing that page
and there may also be a pointer to an item which
overlaps from an adjacent page.
In zcache, the pointers are stored in the tmem metadata.
This metadata must be purged from tmem before the
pageframe can be reclaimed. And this must be done
carefully, maybe atomically, because there are various
locks that must be held and released in the correct
order to avoid races and deadlock. (Holding one
big lock disallowing tmem from operating during reclaim
is an ugly alternative.)
Next, ideally you'd like to be able to reclaim pageframes
in roughly LRU order. What does LRU mean when many
items stored in the pageframe (and possibly adjacent
pageframes) are added/deleted completely independently?
Last, when that metadata is purged from tmem, for ephemeral
pages the actual stored data can be discarded. BUT when
the pages are persistent, the data cannot be discarded.
I have preliminary code that decompresses and pushes this
data back into the swapcache. This too must be atomic.
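In outline, the sequence I have in mind looks roughly like this
(purely illustrative pseudocode with hypothetical helper names; the
hard part is doing it race-free, which the comments gloss over):

/* Sketch only: reclaim one pageframe full of zpages. */
static int zcache_reclaim_pageframe(struct page *page)
{
	/* 1. find every zpage (item) stored in this pageframe,
	 *    including any item overlapping from an adjacent page */
	/* 2. for each item, look up and lock its tmem metadata in
	 *    the correct order to avoid races and deadlock */
	/* 3. purge the metadata so no new accesses can begin */
	/* 4. ephemeral (cleancache) data: simply discard it */
	/* 5. persistent (frontswap) data: decompress it and push it
	 *    back into the swapcache before freeing the item */
	/* 6. only then free the now-empty pageframe */
	return 0;
}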
> > RAMster maintains data structures to both point to zpages
> > that are local and remote. Remote pages are identified
> > by a handle-like bit sequence while local pages are identified
> > by a true pointer. (Note that ramster currently will not
> > run on a HIGHMEM machine.) RAMster currently differentiates
> > between the two via a hack: examining the LSB. If the
> > LSB is set, it is a handle referring to a remote page.
> > This works with xvmalloc and zbud but not with zsmalloc's
> > opaque handle. A simple solution would require zsmalloc
> > to reserve the LSB of the opaque handle as must-be-zero.
>
> As you know, it's not difficult but break opaque handle's concept.
> I want to avoid that and let you put some identifier into somewhere in zcache.
That would be OK with me if it can be done without a large
increase in memory use. We have so far avoided adding
additional data to each tmem "pampd". Adding another
unsigned long worth of data is possible but would require
some big internal API changes.
There are many data structures in the kernel that take
advantage of unused low bits in a pointer, like what
ramster is doing.
And the opaqueness of the handle could still be preserved
if there are one or more reserved bits and one adds functions
to zsmalloc_set_reserved_bits(&handle) and
zsmalloc_read_reserved_bits(handle).
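Roughly (a sketch of the idea only, not a proposed patch; it assumes
zsmalloc can guarantee the reserved bit of a handle is otherwise
unused):

/* Sketch only: reserve the LSB of the otherwise-opaque handle. */
static inline void zsmalloc_set_reserved_bits(unsigned long *handle)
{
	*handle |= 1UL;
}

static inline unsigned long zsmalloc_read_reserved_bits(unsigned long handle)
{
	return handle & 1UL;
}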
But this is a nit until we are sure that zsmalloc will meet
the reclaim requirements.
> At least, many embedded device have used zram since compcache was introduced.
> But not sure, zcache can replace it.
> If zcache can replace it, you will be right.
>
> Comparing zcache and zram implementation, it's one of my TODO list.
> So I am happy to see them.
> But I can't do it shorty due to other urgent works.
Zcache has differences, the largest being that zcache currently
works only when the system has a configured swap block device.
Current zcache has issues too, but (as Andrea has observed)
they can be reduced by allowing zcache to be backed, when
necessary, by the swapdisk when memory pressure is high.
> In summary, I WANT TO KNOW your detailed requirement for shrinking zsmalloc.
My core requirement is that an implementation exists that can
handle pageframe reclaim efficiently and race-free. AND for
persistent pages, ensure it is possible to return the data
to the swapcache when the containing pageframe is reclaimed.
I am not saying that zsmalloc *cannot* meet this requirement.
I just think it is already very difficult with a simple
non-opaque allocator such as zbud. That's why I am trying
to get it all working with zbud first.
Hope that helps!
Dan
* Re: zsmalloc concerns
2012-06-06 17:34 ` Dan Magenheimer
@ 2012-06-07 8:06 ` Minchan Kim
2012-06-07 15:40 ` Dan Magenheimer
0 siblings, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2012-06-07 8:06 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Seth Jennings, linux-mm, Nitin Gupta, Konrad Wilk
On 06/07/2012 02:34 AM, Dan Magenheimer wrote:
>> From: Minchan Kim [mailto:minchan@kernel.org]
>
> Hi Minchan --
>
> Reordering the reply a bit...
>
>>> On 06/05/2012 12:25 PM, Dan Magenheimer wrote:
>>> Zsmalloc relies on some clever underlying virtual-to-physical
>>> mapping manipulations to ensure that its users can store and
>>> retrieve items. These manipulations are necessary on HIGHMEM
>>
>> HIGHMEM processors?
>> I think we need it if the system doesn't support HIGHMEM.
>> Maybe I am missing your point.
>
> I didn't say it very clearly. What I meant is that, on
> processors that require HIGHMEM, it is always necessary
> to do a kmap/kunmap around accessing the contents of a
> pageframe referred to by a struct page. On machines
> with no HIGHMEM, the kernel is completely mapped so
> kmap/kunmap to kernel space are very simple and fast.
>
> However, whenever a compressed item crosses a page
> boundary in zsmalloc, zsmalloc creates a special "pair"
> mapping of the two pages, and kmap/kunmaps the pair for
> every access. This is why special TLB tricks must
> be used by zsmalloc. I think this can be expensive
> so I consider this a disadvantage of zsmalloc, even
> though it is very clever and very useful for storing
> a large number of items with size larger than PAGE_SIZE/2.
Fair.
>
>> What's the requirement for shrinking zsmalloc?
>> For example,
>>
>> int shrink_zsmalloc_memory(int nr_pages)
>> {
>> zsmalloc_evict_pages(nr_pages);
>> }
>>
>> Could you tell us your detailed requirement?
>> Let's see it's possible or not at current zsmalloc.
>
> The objective of the shrinker is to reclaim full
> pageframes. Due to the way zsmalloc works, when
> it stores N items in M pages, worst case it
> may take N-M zsmalloc "item evictions" before even
> a single pageframe is reclaimed.
Right.
>
> Next, remember that there may be several "pointers"
> (stored as zsmalloc object handles) referencing that page
> and there may also be a pointer to an item which
> overlaps from an adjacent page.
> In zcache, the pointers are stored in the tmem metadata.
> This metadata must be purged from tmem before the
> pageframe can be reclaimed. And this must be done
> carefully, maybe atomically, because there are various
> locks that must be held and released in the correct
> order to avoid races and deadlock. (Holding one
> big lock disallowing tmem from operating during reclaim
> is an ugly alternative.)
>
> Next, ideally you'd like to be able to reclaim pageframes
> in roughly LRU order. What does LRU mean when many
> items stored in the pageframe (and possibly adjacent
> pageframes) are added/deleted completely independently?
>
> Last, when that metadata is purged from tmem, for ephemeral
> pages the actual stored data can be discarded. BUT when
> the pages are persistent, the data cannot be discarded.
> I have preliminary code that decompresses and pushes this
> data back into the swapcache. This too must be atomic.
I agree zsmalloc isn't good for you.
Then, you can use your allocator "zbud". What's the problem?
Do you want to replace zsmalloc with zbud in zram, too?
>
>>> RAMster maintains data structures to both point to zpages
>>> that are local and remote. Remote pages are identified
>>> by a handle-like bit sequence while local pages are identified
>>> by a true pointer. (Note that ramster currently will not
>>> run on a HIGHMEM machine.) RAMster currently differentiates
>>> between the two via a hack: examining the LSB. If the
>>> LSB is set, it is a handle referring to a remote page.
>>> This works with xvmalloc and zbud but not with zsmalloc's
>>> opaque handle. A simple solution would require zsmalloc
>>> to reserve the LSB of the opaque handle as must-be-zero.
>>
>> As you know, it's not difficult but break opaque handle's concept.
>> I want to avoid that and let you put some identifier into somewhere in zcache.
>
> That would be OK with me if it can be done without a large
> increase in memory use. We have so far avoided adding
> additional data to each tmem "pampd". Adding another
> unsigned long worth of data is possible but would require
> some big internal API changes.
>
> There are many data structures in the kernel that take
> advantage of unused low bits in a pointer, like what
> ramster is doing.
But this case is different. It's a generic library, and it's even a HANDLE.
I don't want to add such a special feature to a generic library's handle.
>
> And the opaqueness of the handle could still be preserved
> if there are one or more reserved bits and one adds functions
> to zsmalloc_set_reserved_bits(&handle) and
> zsmalloc_read_reserved_bits(handle).
>
> But this is a nit until we are sure that zsmalloc will meet
> the reclaim requirements.
>
>> At least, many embedded device have used zram since compcache was introduced.
>> But not sure, zcache can replace it.
>> If zcache can replace it, you will be right.
>>
>> Comparing zcache and zram implementation, it's one of my TODO list.
>> So I am happy to see them.
>> But I can't do it shorty due to other urgent works.
>
> Zcache has differences, the largest being that zcache currently
> works only when the system has a configured swap block device.
> Current zcache has issues too, but (as Andrea has observed)
> they can be reduced by allowing zcache to be backed, when
> necessary, by the swapdisk when memory pressure is high.
>
>> In summary, I WANT TO KNOW your detailed requirement for shrinking zsmalloc.
>
> My core requirement is that an implementation exists that can
> handle pageframe reclaim efficiently and race-free. AND for
> persistent pages, ensure it is possible to return the data
> to the swapcache when the containing pageframe is reclaimed.
>
> I am not saying that zsmalloc *cannot* meet this requirement.
> I just think it is already very difficult with a simple
> non-opaque allocator such as zbud. That's why I am trying
> to get it all working with zbud first.
Agreed. Go ahead with zbud.
Again, I can't understand your concern. :)
Sorry if I missed your point.
>
> Hope that helps!
> Dan
>
--
Kind regards,
Minchan Kim
* RE: zsmalloc concerns
2012-06-07 8:06 ` Minchan Kim
@ 2012-06-07 15:40 ` Dan Magenheimer
2012-06-07 23:49 ` Minchan Kim
0 siblings, 1 reply; 7+ messages in thread
From: Dan Magenheimer @ 2012-06-07 15:40 UTC (permalink / raw)
To: Minchan Kim; +Cc: Seth Jennings, linux-mm, Nitin Gupta, Konrad Wilk
> From: Minchan Kim [mailto:minchan@kernel.org]
> Subject: Re: zsmalloc concerns
>
> On 06/07/2012 02:34 AM, Dan Magenheimer wrote:
>
> >> From: Minchan Kim [mailto:minchan@kernel.org]
> >
> >
> > However, whenever a compressed item crosses a page
> > boundary in zsmalloc, zsmalloc creates a special "pair"
> > mapping of the two pages, and kmap/kunmaps the pair for
> > every access. This is why special TLB tricks must
> > be used by zsmalloc. I think this can be expensive
> > so I consider this a disadvantage of zsmalloc, even
> > though it is very clever and very useful for storing
> > a large number of items with size larger than PAGE_SIZE/2.
>
> Fair.
By breaking down the opaqueness somewhat, I think
it is not hard to eliminate this requirement. The
caller needs to be aware that an item may cross
a page boundary and zsmalloc could provide
hooks such as "map/unmap_first/second_page".
(In fact, that gives me some ideas on how to improve
zbud to handle cross-page items.)
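Concretely, the caller-visible interface might look something like
this (only a sketch; these names and signatures are made up here and
are not an existing zsmalloc API):

/*
 * Sketch only: hypothetical hooks letting the caller handle an item
 * that straddles two pageframes, instead of zsmalloc hiding that
 * behind a combined virtual mapping.
 */
void *zs_map_first_page(struct zs_pool *pool, unsigned long handle,
			size_t *bytes_in_first);
void zs_unmap_first_page(struct zs_pool *pool, unsigned long handle);
void *zs_map_second_page(struct zs_pool *pool, unsigned long handle);
void zs_unmap_second_page(struct zs_pool *pool, unsigned long handle);

The caller would then copy (or compress/decompress) in two pieces,
using bytes_in_first to know where to split.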
> >> Could you tell us your detailed requirement?
> >> Let's see it's possible or not at current zsmalloc.
> >
> > The objective of the shrinker is to reclaim full
> > pageframes. Due to the way zsmalloc works, when
> > it stores N items in M pages, worst case it
> > may take N-M zsmalloc "item evictions" before even
> > a single pageframe is reclaimed.
>
> Right.
>
> > Last, when that metadata is purged from tmem, for ephemeral
> > pages the actual stored data can be discarded. BUT when
> > the pages are persistent, the data cannot be discarded.
> > I have preliminary code that decompresses and pushes this
> > data back into the swapcache. This too must be atomic.
>
> I agree zsmalloc isn't good for you.
> Then, you can use your allocator "zbud". What's the problem?
> Do you want to replace zsmalloc with zbud in zram, too?
No, see below.
> >>> RAMster maintains data structures to both point to zpages
> >>> that are local and remote. Remote pages are identified
> >>> by a handle-like bit sequence while local pages are identified
> >>> by a true pointer. (Note that ramster currently will not
> >>> run on a HIGHMEM machine.) RAMster currently differentiates
> >>> between the two via a hack: examining the LSB. If the
> >>> LSB is set, it is a handle referring to a remote page.
> >>> This works with xvmalloc and zbud but not with zsmalloc's
> >>> opaque handle. A simple solution would require zsmalloc
> >>> to reserve the LSB of the opaque handle as must-be-zero.
> >>
> >> As you know, it's not difficult but break opaque handle's concept.
> >> I want to avoid that and let you put some identifier into somewhere in zcache.
> >
> > That would be OK with me if it can be done without a large
> > increase in memory use. We have so far avoided adding
> > additional data to each tmem "pampd". Adding another
> > unsigned long worth of data is possible but would require
> > some big internal API changes.
> >
> > There are many data structures in the kernel that take
> > advantage of unused low bits in a pointer, like what
> > ramster is doing.
>
> But this case is different. It's a generic library and even it's a HANDLE.
> I don't want to add such special feature to generic library's handle.
Zsmalloc is not a generic library yet. It is currently used
in zram and for half of zcache. I think Seth and Nitin had
planned for it to be used for all of zcache. I was describing
the issues I see with using it for all of zcache and even
for continuing to use it with half of zcache.
> >> In summary, I WANT TO KNOW your detailed requirement for shrinking zsmalloc.
> >
> > My core requirement is that an implementation exists that can
> > handle pageframe reclaim efficiently and race-free. AND for
> > persistent pages, ensure it is possible to return the data
> > to the swapcache when the containing pageframe is reclaimed.
> >
> > I am not saying that zsmalloc *cannot* meet this requirement.
> > I just think it is already very difficult with a simple
> > non-opaque allocator such as zbud. That's why I am trying
> > to get it all working with zbud first.
>
> Agreed. Go ahead with zbud.
> Again, I can't understand your concern. :)
> Sorry if I miss your point.
You asked for my requirements for shrinking with zsmalloc.
I hoped that Nitin and Seth (or you) could resolve the issues
so that zsmalloc could be used for zcache. But making it
more opaque seems to be going in the wrong direction to me.
I think it is also the wrong direction for zram (see
comment above about the TLB issues) *especially* if
zcache never uses zsmalloc: Why have a generic allocator
that is opaque if it only has one user?
But if you are the person driving to promote zram and zsmalloc
out of staging, that is your choice.
Dan
* Re: zsmalloc concerns
2012-06-07 15:40 ` Dan Magenheimer
@ 2012-06-07 23:49 ` Minchan Kim
0 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2012-06-07 23:49 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Seth Jennings, linux-mm, Nitin Gupta, Konrad Wilk
On 06/08/2012 12:40 AM, Dan Magenheimer wrote:
>> From: Minchan Kim [mailto:minchan@kernel.org]
>> Subject: Re: zsmalloc concerns
>>
>> On 06/07/2012 02:34 AM, Dan Magenheimer wrote:
>>
>>>> From: Minchan Kim [mailto:minchan@kernel.org]
>>>
>>>
>>> However, whenever a compressed item crosses a page
>>> boundary in zsmalloc, zsmalloc creates a special "pair"
>>> mapping of the two pages, and kmap/kunmaps the pair for
>>> every access. This is why special TLB tricks must
>>> be used by zsmalloc. I think this can be expensive
>>> so I consider this a disadvantage of zsmalloc, even
>>> though it is very clever and very useful for storing
>>> a large number of items with size larger than PAGE_SIZE/2.
>>
>> Fair.
>
> By breaking down the opaqueness somewhat, I think
> it is not hard to eliminate this requirement. The
> caller needs to be aware that an item may cross
> a page boundary and zsmalloc could provide
> hooks such as "map/unmap_first/second_page".
>
> (In fact, that gives me some ideas on how to improve
> zbud to handle cross-page items.)
>
>>>> Could you tell us your detailed requirement?
>>>> Let's see it's possible or not at current zsmalloc.
>>>
>>> The objective of the shrinker is to reclaim full
>>> pageframes. Due to the way zsmalloc works, when
>>> it stores N items in M pages, worst case it
>>> may take N-M zsmalloc "item evictions" before even
>>> a single pageframe is reclaimed.
>>
>> Right.
>>
>>> Last, when that metadata is purged from tmem, for ephemeral
>>> pages the actual stored data can be discarded. BUT when
>>> the pages are persistent, the data cannot be discarded.
>>> I have preliminary code that decompresses and pushes this
>>> data back into the swapcache. This too must be atomic.
>>
>> I agree zsmalloc isn't good for you.
>> Then, you can use your allocator "zbud". What's the problem?
>> Do you want to replace zsmalloc with zbud in zram, too?
>
> No, see below.
>
>>>>> RAMster maintains data structures to both point to zpages
>>>>> that are local and remote. Remote pages are identified
>>>>> by a handle-like bit sequence while local pages are identified
>>>>> by a true pointer. (Note that ramster currently will not
>>>>> run on a HIGHMEM machine.) RAMster currently differentiates
>>>>> between the two via a hack: examining the LSB. If the
>>>>> LSB is set, it is a handle referring to a remote page.
>>>>> This works with xvmalloc and zbud but not with zsmalloc's
>>>>> opaque handle. A simple solution would require zsmalloc
>>>>> to reserve the LSB of the opaque handle as must-be-zero.
>>>>
>>>> As you know, it's not difficult but break opaque handle's concept.
>>>> I want to avoid that and let you put some identifier into somewhere in zcache.
>>>
>>> That would be OK with me if it can be done without a large
>>> increase in memory use. We have so far avoided adding
>>> additional data to each tmem "pampd". Adding another
>>> unsigned long worth of data is possible but would require
>>> some big internal API changes.
>>>
>>> There are many data structures in the kernel that take
>>> advantage of unused low bits in a pointer, like what
>>> ramster is doing.
>>
>> But this case is different. It's a generic library and even it's a HANDLE.
>> I don't want to add such special feature to generic library's handle.
>
> Zsmalloc is not a generic library yet. It is currently used
> in zram and for half of zcache. I think Seth and Nitin had
> planned for it to be used for all of zcache. I was describing
> the issues I see with using it for all of zcache and even
> for continuing to use it with half of zcache.
>
>>>> In summary, I WANT TO KNOW your detailed requirement for shrinking zsmalloc.
>>>
>>> My core requirement is that an implementation exists that can
>>> handle pageframe reclaim efficiently and race-free. AND for
>>> persistent pages, ensure it is possible to return the data
>>> to the swapcache when the containing pageframe is reclaimed.
>>>
>>> I am not saying that zsmalloc *cannot* meet this requirement.
>>> I just think it is already very difficult with a simple
>>> non-opaque allocator such as zbud. That's why I am trying
>>> to get it all working with zbud first.
>>
>> Agreed. Go ahead with zbud.
>> Again, I can't understand your concern. :)
>> Sorry if I miss your point.
>
> You asked for my requirements for shrinking with zsmalloc.
>
> I hoped that Nitin and Seth (or you) could resolve the issues
> so that zsmalloc could be used for zcache. But making it
> more opaque seems to be going in the wrong direction to me.
> I think it is also the wrong direction for zram (see
> comment above about the TLB issues) *especially* if
> zcache never uses zsmalloc: Why have a generic allocator
> that is opaque if it only has one user?
It's just in staging for now, and not for a long time.
Who can be sure it won't gain more users in the future?
If we decide to make it specific to zcache and create an ugly interface
and coupling, a potential zsmalloc user might have to reinvent the wheel.
>
> But if you are the person driving to promote zram and zsmalloc
> out of staging, that is your choice.
I see your concern exactly now.
I'm not strongly against breaking the opaqueness if Nitin drives that.
But I hope we can do it without breaking the generic allocator's concept.
If we need some more functionality, it would be better done in the caller's layer.
Otherwise, we could provide a new zsmalloc mode like "don't store this object
across a page boundary" and warn the user that it could lose space efficiency.
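Something like this hypothetical flag is what I mean (a sketch only,
not an existing zsmalloc interface):

/* Sketch only: hypothetical allocation flag, not in zsmalloc today. */
#define ZS_NO_CROSS_PAGE	0x1	/* never straddle a pageframe */

unsigned long zs_malloc_flags(struct zs_pool *pool, size_t size,
			      unsigned long flags);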
Anyhow, I understand your requirement and will try to understand zcache's requirement as I review the code.
I believe we can solve the issue.
Thanks for the input, Dan!
--
Kind regards,
Minchan Kim
* Re: zsmalloc concerns
2012-06-05 3:25 zsmalloc concerns Dan Magenheimer
2012-06-05 6:34 ` Minchan Kim
@ 2012-06-06 0:28 ` Nitin Gupta
1 sibling, 0 replies; 7+ messages in thread
From: Nitin Gupta @ 2012-06-06 0:28 UTC (permalink / raw)
To: Dan Magenheimer; +Cc: Minchan Kim, Seth Jennings, linux-mm, Konrad Wilk
On 06/04/2012 08:25 PM, Dan Magenheimer wrote:
> Hi Minchan (and all) --
>
> I promised you that after the window closed, I would
> write up my concerns about zsmalloc. My preference would
> be to use zsmalloc, but there are definitely tradeoffs
> and my objective is to make zcache and RAMster ready
> for enterprise customers so I would use a different
> or captive allocator if these zsmalloc issues can't
> be overcome.
>
> Thanks,
> Dan
>
> ===
>
> Zsmalloc is designed to maximize density of items that vary in
> size between 0<size<PAGE_SIZE, but especially when the mean
> item size significantly exceeds PAGE_SIZE/2. It is primarily
> useful when there are a large quantity of such items to be
> stored with little or no space wasted; if the quantity
> is small and/or some wasted space is acceptable, existing
> kernel allocators (e.g. slab) may be sufficient. In the
> case of zcache (and zram and ramster), where a large fraction
> of RAM is used to store zpages (lzo1x-compressed pages),
> zsmalloc seems to be a good match. It is unclear whether
> zsmalloc will ever have another user -- unless that user is
> also storing large quantities of compressed pages.
>
True. zsmalloc's use case is very specific: efficiently storing objects of
size up to PAGE_SIZE. I don't expect it ever to find any more users.
> Zcache is currently one primary user of zsmalloc, however
> zcache only uses zsmalloc for anonymous/swap ("frontswap")
> pages, not for file ("cleancache") pages. For file pages,
> zcache uses the captive "zbud" allocator; this is because
> zcache requires a shrinker for cleancache pages, by which
> entire pageframes can be easily reclaimed. Zsmalloc doesn't
> currently have shrinker capability and, because its
> storage patterns in and across physical pageframes are
> quite complex (to maximize density), an intelligent reclaim
> implementation may be difficult to design race-free. And
> implementing reclaim opaquely (i.e. while maintaining a clean
> layering) may be impossible.
>
I'm now trying to start working on compaction, but yes, it seems to be
really complicated.
> A good analogy might be linked-lists. Zsmalloc is like
> a singly-linked list (space-efficient but not as flexible)
> and zbud is like a doubly-linked list (not as space-efficient
> but more flexible). One has to choose the best data
> structure according to the functionality required.
>
> Some believe that the next step in zcache evolution will
> require shrinking of both frontswap and cleancache pages.
> Andrea has also stated that he thinks frontswap shrinking
> will be a must for any future KVM-tmem implementation.
> But preliminary investigations indicate that pageframe reclaim
> of frontswap pages may be even more difficult with zsmalloc.
> Until this issue is resolved (either by an adequately working
> implementation of reclaim with zsmalloc or via demonstration
> that zcache reclaim is unnecessary), the future use of zsmalloc
> by zcache is cloudy.
>
> I'm currently rewriting zbud as a foundation to investigate
> some reclaim policy ideas that I think will be useful both for
> KVM and for making zcache "enterprise ready." When that is
> done, we will see if zsmalloc can achieve the same flexibility.
>
> A few related comments about these allocators and their users:
>
> Zsmalloc relies on some clever underlying virtual-to-physical
> mapping manipulations to ensure that its users can store and
> retrieve items. These manipulations are necessary on HIGHMEM
> processors, but the cost is unclear on non-HIGHMEM processors.
> (Manipulating TLB entries is not inexpensive.) For zcache, the
> overhead may be irrelevant as long as it is a small fraction
> of the cost of compression/decompression, but it is worth
> measuring (worst case) to verify.
>
All this virtual-to-physical mapping business needs to be done even if
we ignore HIGHMEM and consider pure 64-bit systems where the entire
memory is direct-mapped. All these compression schemes come into the
picture under low-memory conditions, when the chance of allocating
higher-order pages is close to nil. So, to be able to take physically
discontiguous pages and treat them as a single higher-order page, we
need the mapping/unmapping tricks which zsmalloc does.
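The conceptual effect is what vmap() gives you (sketch below); zsmalloc
does it with lighter-weight mapping tricks rather than a full vmap(),
but the idea is the same:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Sketch only: present two physically discontiguous pages as one
 * contiguous virtual range so an object crossing the page boundary
 * can be accessed with ordinary loads/stores.
 */
static void *map_page_pair(struct page *first, struct page *second)
{
	struct page *pages[2] = { first, second };

	return vmap(pages, 2, VM_MAP, PAGE_KERNEL);
	/* ...access the object... then vunmap() the returned address */
}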
> Zbud can implement efficient reclaim because no more than two
> items ever reside in the same pageframe and items never
> cross a pageframe boundary. While zbud storage is certainly
> less dense than zsmalloc, the density is probably sufficient
> if the size of items is bell-curve distributed with a mean
> size of PAGE_SIZE/2 (or slightly less). This is true for
> many workloads, but datasets where the vast majority of items
> exceed PAGE_SIZE/2 render zbud useless. Note, however, that
> zcache (due to its foundation on transcendent memory) currently
> implements an admission policy that rejects pages when extreme
> datasets are encountered. In other words, zbud would handle
> these workloads simply by rejecting the pages, resulting
> in performance no worse (approximately) than if zcache were
> not present.
We really need memory dumps of various VM images running different
workloads to determine whether the compressed-size distribution indeed
centres around PAGE_SIZE/2. Some of this data was collected some time back:
http://code.google.com/p/compcache/wiki/CompressedLengthDistribution
(histograms would have been more useful)
At least, this sample data does not clearly suggest that this assumption
regarding the size distribution usually holds.
>
> RAMster maintains data structures to both point to zpages
> that are local and remote. Remote pages are identified
> by a handle-like bit sequence while local pages are identified
> by a true pointer. (Note that ramster currently will not
> run on a HIGHMEM machine.) RAMster currently differentiates
> between the two via a hack: examining the LSB. If the
> LSB is set, it is a handle referring to a remote page.
> This works with xvmalloc and zbud but not with zsmalloc's
> opaque handle. A simple solution would require zsmalloc
> to reserve the LSB of the opaque handle as must-be-zero.
>
I think it should be possible to spare the LSB in the zsmalloc handle.
> Zram is actually a good match for current zsmalloc because
> its storage grows to a pre-set RAM maximum size and cannot
> shrink again. Reclaim is not possible without a massive
> redesign (and that redesign is essentially zcache). But as
> a result of its grow-but-never-shrink design, zram may have
> some significant performance implications on most workloads
> and system configurations. It remains to be seen if its
> niche usage will warrant promotion from the staging tree.
zram can shrink back:
- When used as a swap device, it receives a "swap notify" callback
whenever a swap slot (page) is freed. See: swap_entry_free() -->
disk->fops->swap_slot_free_notify() (a sketch of the hookup follows below).
- When used as a generic disk, say, hosting an ext4 filesystem, it can
receive "discard" callbacks from filesystems which have discard support
(a mount option in the case of ext4).
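The hookup is via the block_device_operations callback; a sketch
(the handler body here is only a comment describing what zram's real
zram_slot_free_notify() does):

#include <linux/blkdev.h>
#include <linux/module.h>

static void zram_slot_free_notify(struct block_device *bdev,
				  unsigned long index)
{
	/* look up the compressed object stored for this swap slot and
	 * free it, returning the memory to the pool */
}

static const struct block_device_operations zram_devops = {
	.swap_slot_free_notify	= zram_slot_free_notify,
	.owner			= THIS_MODULE,
};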
Overall, I do agree with your concern that it seems difficult to
implement runtime compaction for zsmalloc and I also think that it might
be worth investing more in simpler zbud till we have it working.
Thanks,
Nitin