linux-mm.kvack.org archive mirror
* frontswap/zcache: xvmalloc discussion
@ 2011-06-22 19:15 Seth Jennings
  2011-06-22 19:23 ` [PATCH] Add zv_pool_pages_count to zcache sysfs Seth Jennings
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Seth Jennings @ 2011-06-22 19:15 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Magenheimer, Nitin Gupta, Robert Jennings, Brian King,
	Greg Kroah-Hartman

Dan, Nitin,

I have been experimenting with the frontswap v4 patches and the latest 
zcache in the mainline drivers/staging.  There is a particular issue I'm 
seeing when using pages of different compressibilities.

When the pages compress to less than PAGE_SIZE/2, I get good compression 
and little external fragmentation in the xvmalloc pool.  However, when 
the pages have a compressed size greater than PAGE_SIZE/2, it is a very 
different story.  Basically, because xvmalloc allocations can't span 
multiple pool pages, grow_pool() is called on each allocation, reducing 
the effective compression (total_pages_in_frontswap / 
total_pages_in_xvmalloc_pool) to 0 and drastically increasing external 
fragmentation to up to 50%.

The likelihood that the size of a compressed page is greater than 
PAGE_SIZE/2 is high, considering that lzo1x-1 sacrifices compressibility 
for speed.  In my experiments, pages of English text only compressed to 
75% of their original size with lzo1x-1.

In order to calculate the effective compression of frontswap, you need 
the number of pages stored by frontswap, provided by frontswap's 
curr_pages sysfs attribute, and the number of pages in the xvmalloc 
pool.  There isn't a sysfs attribute for this, so I made a patch that 
creates a new zv_pool_pages_count attribute for zcache that provides 
this value (patch is in a follow-up message).  I have also included my 
simple test program at the end of this email.  It just allocates and 
stores random pages from a text file (in my case, a text file of Moby 
Dick).

The real problem here is compressing pages of size x and storing them in 
a pool that has "chunks", if you will, also of size x, where allocations 
can't span multiple chunks.  Ideally, I'd like to address this issue by 
expanding the size of the xvmalloc pool chunks from one page to four 
pages (I can explain why four is a good number, just didn't want to make 
this note too long).

After a little playing around, I've found this isn't entirely trivial to 
do because of the memory mapping implications; more specifically the use 
of kmap/kunmap in the xvmalloc and zcache layers.  I've looked into 
using vmap to map multiple pages into a linear address space, but it 
seems like there is a lot of memory overhead in doing that.

Do you have any feedback on this issue or a suggested solution?

-- 
Seth Jennings
Linux on Power Virtualization
IBM Linux Technology Center

=================
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE_OF_PAGE_IN_BYTES 4096
#define NUM_OF_PAGES_PER_MB (1024*1024/SIZE_OF_PAGE_IN_BYTES)

int
main(int argc, char * argv[])
{
	int mbs, numpages, i;
	char *mypage;
	int rc, len, pos;
	FILE *textfile;

	if(argc < 3)
	{
		printf("usage: %s numMBs file\n",argv[0]);
		return 1;
	}

	mbs = atoi(argv[1]);
	numpages = NUM_OF_PAGES_PER_MB * mbs;

	printf("Allocating %d MB (%d pages) of data...\n",mbs,numpages);

	textfile = fopen(argv[2],"r");

	if(textfile == NULL)
	{
		perror("failed to open text file");
		exit(1);
	}

	/* get file length */
	fseek(textfile,0,SEEK_END);
	len = ftell(textfile);
	if(len <= SIZE_OF_PAGE_IN_BYTES)
	{
		fprintf(stderr,"text file must be larger than one page\n");
		return 1;
	}

	for (i=0; i < numpages; i++) {
		if(!(i%100))
		{
			if(i)
				printf("\033[F\033[J");
			printf("%d numpages allocated\n", i);
		}

		mypage = malloc(SIZE_OF_PAGE_IN_BYTES);
		if(!mypage)
		{
			perror("malloc()");
			return 1;
		}

		/* start at a (pseudo) random location in the file */
		pos = rand() % (len - SIZE_OF_PAGE_IN_BYTES);
		fseek(textfile,pos,SEEK_SET);
		rc = fread(mypage,SIZE_OF_PAGE_IN_BYTES,1,textfile);
		if(!rc)
		{
			perror("read()");
			return 1;
		}
	}

	fclose(textfile);

	/* pages are intentionally left allocated so they stay resident */
	printf("complete\n\npress enter to end program");
	getchar();

	return 0;
}
=================

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] Add zv_pool_pages_count to zcache sysfs
  2011-06-22 19:15 frontswap/zcache: xvmalloc discussion Seth Jennings
@ 2011-06-22 19:23 ` Seth Jennings
  2011-06-23 15:38   ` Dave Hansen
  2011-06-23 16:38 ` frontswap/zcache: xvmalloc discussion Dan Magenheimer
  2011-06-24  6:11 ` Nitin Gupta
  2 siblings, 1 reply; 12+ messages in thread
From: Seth Jennings @ 2011-06-22 19:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Magenheimer, Nitin Gupta, Robert Jennings, Brian King,
	Greg Kroah-Hartman

This attribute returns the number of pages currently
in use by the xvmalloc pool used to store persistent
pages compressed by zcache. This attribute, in combination
with the curr_pages attribute of frontswap, can be used
to calculate the effective compression of frontswap
(i.e. zv_pool_pages_count/curr_pages).

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
  drivers/staging/zcache/zcache.c |   20 ++++++++++++++++++++
  1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/zcache/zcache.c 
b/drivers/staging/zcache/zcache.c
index 77ac2d4..9821d88 100644
--- a/drivers/staging/zcache/zcache.c
+++ b/drivers/staging/zcache/zcache.c
@@ -684,6 +684,23 @@ static struct {
  	struct xv_pool *xvpool;
  } zcache_client;

+#ifdef CONFIG_SYSFS
+static int zv_show_pool_pages_count(char *buf)
+{
+	char *p = buf;
+	unsigned long numpages;
+
+	if (zcache_client.xvpool == NULL)
+		p += sprintf(p, "%d\n", 0);
+	else {
+		numpages = xv_get_total_size_bytes(zcache_client.xvpool);
+		p += sprintf(p, "%lu\n", numpages >> PAGE_SHIFT);
+	}
+
+	return p - buf;
+}
+#endif
+
  /*
   * Tmem operations assume the poolid implies the invoking client.
   * Zcache only has one client (the kernel itself), so translate
@@ -1130,6 +1147,8 @@ ZCACHE_SYSFS_RO_CUSTOM(zbud_unbuddied_list_counts,
  			zbud_show_unbuddied_list_counts);
  ZCACHE_SYSFS_RO_CUSTOM(zbud_cumul_chunk_counts,
  			zbud_show_cumul_chunk_counts);
+ZCACHE_SYSFS_RO_CUSTOM(zv_pool_pages_count,
+			zv_show_pool_pages_count);

  static struct attribute *zcache_attrs[] = {
  	&zcache_curr_obj_count_attr.attr,
@@ -1160,6 +1179,7 @@ static struct attribute *zcache_attrs[] = {
  	&zcache_aborted_shrink_attr.attr,
  	&zcache_zbud_unbuddied_list_counts_attr.attr,
  	&zcache_zbud_cumul_chunk_counts_attr.attr,
+	&zcache_zv_pool_pages_count_attr.attr,
  	NULL,
  };

-- 
1.7.0.4


* Re: [PATCH] Add zv_pool_pages_count to zcache sysfs
  2011-06-22 19:23 ` [PATCH] Add zv_pool_pages_count to zcache sysfs Seth Jennings
@ 2011-06-23 15:38   ` Dave Hansen
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Hansen @ 2011-06-23 15:38 UTC (permalink / raw)
  To: Seth Jennings
  Cc: linux-mm, Dan Magenheimer, Nitin Gupta, Robert Jennings,
	Brian King, Greg Kroah-Hartman

On Wed, 2011-06-22 at 14:23 -0500, Seth Jennings wrote:
> +#ifdef CONFIG_SYSFS

There are a couple of #ifdef CONFIG_SYSFS blocks in zcache.c already.
Could this go inside one of those instead of being off by itself?

> +static int zv_show_pool_pages_count(char *buf)
> +{
> +	char *p = buf;
> +	unsigned long numpages;
> +
> +	if (zcache_client.xvpool == NULL)
> +		p += sprintf(p, "%d\n", 0);
> +	else {

^^ That's probably a good spot to include brackets.  They don't take up
any more lines, and it keeps folks from introducing bugs doing things
like:

	if (zcache_client.xvpool == NULL)
		p += sprintf(p, "%d\n", 0);
		bar();
	else {

> +		numpages = xv_get_total_size_bytes(zcache_client.xvpool);
> +		p += sprintf(p, "%lu\n", numpages >> PAGE_SHIFT);
> +	}

In this case 'numpages' doesn't actually store a number of pages; it
stores a number of bytes.  I'd probably rename it.

Also 'numpages' is an 'unsigned long' while xv_get_total_size_bytes()
returns a u64.  'unsigned long' is only 32-bits on 32-bit architectures,
so it's possible that large buffer sizes could overflow.  The easiest
way to fix this is probably to just make 'numpages' a u64.


-- Dave


* RE: frontswap/zcache: xvmalloc discussion
  2011-06-22 19:15 frontswap/zcache: xvmalloc discussion Seth Jennings
  2011-06-22 19:23 ` [PATCH] Add zv_pool_pages_count to zcache sysfs Seth Jennings
@ 2011-06-23 16:38 ` Dan Magenheimer
  2011-06-23 21:59   ` Seth Jennings
  2011-06-24  6:11 ` Nitin Gupta
  2 siblings, 1 reply; 12+ messages in thread
From: Dan Magenheimer @ 2011-06-23 16:38 UTC (permalink / raw)
  To: Seth Jennings, linux-mm
  Cc: Nitin Gupta, Robert Jennings, Brian King, Greg Kroah-Hartman

> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> Cc: Dan Magenheimer; Nitin Gupta; Robert Jennings; Brian King; Greg Kroah-Hartman
> Subject: frontswap/zcache: xvmalloc discussion
> 
> Dan, Nitin,

Hi Seth --

Thanks for your interest in frontswap and zcache!

> I have been experimenting with the frontswap v4 patches and the latest
> zcache in the mainline drivers/staging.  There is a particular issue I'm
> seeing when using pages of different compressibilities.
> 
> When the pages compress to less than PAGE_SIZE/2, I get good compression
> and little external fragmentation in the xvmalloc pool.  However, when
> the pages have a compressed size greater than PAGE_SIZE/2, it is a very
> different story.  Basically, because xvmalloc allocations can't span
> multiple pool pages, grow_pool() is called on each allocation, reducing
> the effective compression (total_pages_in_frontswap /
> total_pages_in_xvmalloc_pool) to 0 and drastically increasing external
> fragmentation to up to 50%.
> 
> The likelihood that the size of a compressed page is greater than
> PAGE_SIZE/2 is high, considering that lzo1x-1 sacrifices compressibility
> for speed.  In my experiments, pages of English text only compressed to
> 75% of their original size with lzo1x-1.

Wow, I'm surprised to hear that.  I suppose it is very workload
dependent, but I agree that consistently poor compression can create
issues for frontswap.

> In order to calculate the effective compression of frontswap, you need
> the number of pages stored by frontswap, provided by frontswap's
> curr_pages sysfs attribute, and the number of pages in the xvmalloc
> pool.  There isn't a sysfs attribute for this, so I made a patch that
> creates a new zv_pool_pages_count attribute for zcache that provides
> this value (patch is in a follow-up message).  I have also included my
> simple test program at the end of this email.  It just allocates and
> stores random pages from a text file (in my case, a text file of Moby
> Dick).
> 
> The real problem here is compressing pages of size x and storing them in
> a pool that has "chunks", if you will, also of size x, where allocations
> can't span multiple chunks.  Ideally, I'd like to address this issue by
> expanding the size of the xvmalloc pool chunks from one page to four
> pages (I can explain why four is a good number, just didn't want to make
> this note too long).

Nitin is the expert on compression and xvmalloc... I mostly built on top
of his earlier work... so I will wait for him to comment on compression
and xvmalloc issues.

BUT... I'd be concerned with increasing the pool chunk, at least without
a fallback.  When memory is constrained, finding chunks in the kernel
of even two consecutive pages might be a challenge, let alone four.
Since frontswap only is invoked if swapping is occurring, memory
is definitely already constrained.

If it is possible to modify xvmalloc (or possibly the pool creation
calls from zcache) to juggle multiple pools, one with chunkorder==2,
one with chunkorder==1, and one with chunkorder=0, with a fallback
sequence if a higher chunkorder is not available, might that be
helpful?  Still I worry that the same problems might occur because
the higher chunkorders might never be available after some time
passes.

> After a little playing around, I've found this isn't entirely trivial to
> do because of the memory mapping implications; more specifically the use
> of kmap/kunmap in the xvmalloc and zcache layers.  I've looked into
> using vmap to map multiple pages into a linear address space, but it
> seems like there is a lot of memory overhead in doing that.
> 
> Do you have any feedback on this issue or a suggested solution?

One neat feature of frontswap (and the underlying Transcendent
Memory definition) is that ANY PUT may be rejected**.  So zcache
could keep track of the distribution of "zsize" and if the number
of pages with zsize>PAGE_SIZE/2 greatly exceeds the number of pages
with "complementary zsize", the frontswap code in zcache can reject
the larger pages until balance/sanity is restored.

Might that help?  If so, maybe your new sysfs value could be
replaced with the ratio (zv_pool_pages_count/frontswap_curr_pages)
and this could be _writeable_ to allow the above policy target to
be modified at runtime.   Even better, the fraction could be
represented by number-of-bytes ("target_zsize"), which could default
to something like (3*PAGE_SIZE)/4... if the ratio above
exceeds target_zsize and the zsize of the page-being-put exceeds
target_zsize, then the put is rejected.

Thanks,
Dan

** The "put" shouldn't actually be rejected outright... it should
be converted to a "flush" so that, if a previous put was
performed for the matching handle, the space can be reclaimed.
(Let me know if you need more explanation of this.)


* Re: frontswap/zcache: xvmalloc discussion
  2011-06-23 16:38 ` frontswap/zcache: xvmalloc discussion Dan Magenheimer
@ 2011-06-23 21:59   ` Seth Jennings
  2011-06-24 22:40     ` Dan Magenheimer
  0 siblings, 1 reply; 12+ messages in thread
From: Seth Jennings @ 2011-06-23 21:59 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: linux-mm, Nitin Gupta, Robert Jennings, Brian King,
	Greg Kroah-Hartman

On 06/23/2011 11:38 AM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
>> Cc: Dan Magenheimer; Nitin Gupta; Robert Jennings; Brian King; Greg Kroah-Hartman
>> Subject: frontswap/zcache: xvmalloc discussion
>>
>> Dan, Nitin,
> 
> Hi Seth --
> 
> Thanks for your interest in frontswap and zcache!

Thanks for your quick response!

> 
>> I have been experimenting with the frontswap v4 patches and the latest
>> zcache in the mainline drivers/staging.  There is a particular issue I'm
>> seeing when using pages of different compressibilities.
>>
>> When the pages compress to less than PAGE_SIZE/2, I get good compression
>> and little external fragmentation in the xvmalloc pool.  However, when
>> the pages have a compressed size greater than PAGE_SIZE/2, it is a very
>> different story.  Basically, because xvmalloc allocations can't span
>> multiple pool pages, grow_pool() is called on each allocation, reducing
>> the effective compression (total_pages_in_frontswap /
>> total_pages_in_xvmalloc_pool) to 0 and drastically increasing external
>> fragmentation to up to 50%.
>>
>> The likelihood that the size of a compressed page is greater than
>> PAGE_SIZE/2 is high, considering that lzo1x-1 sacrifices compressibility
>> for speed.  In my experiments, pages of English text only compressed to
>> 75% of their original size with lzo1x-1.
> 
> Wow, I'm surprised to hear that.  I suppose it is very workload
> dependent, but I agree that consistently poor compression can create
> issues for frontswap.
>
 
Yes, I was surprised as well with how little it compressed.  I guess I'm 
used to gzip-level compression, which was around 50% on the same data set.

>> In order to calculate the effective compression of frontswap, you need
>> the number of pages stored by frontswap, provided by frontswap's
>> curr_pages sysfs attribute, and the number of pages in the xvmalloc
>> pool.  There isn't a sysfs attribute for this, so I made a patch that
>> creates a new zv_pool_pages_count attribute for zcache that provides
>> this value (patch is in a follow-up message).  I have also included my
>> simple test program at the end of this email.  It just allocates and
>> stores random pages from a text file (in my case, a text file of Moby
>> Dick).
>>
>> The real problem here is compressing pages of size x and storing them in
>> a pool that has "chunks", if you will, also of size x, where allocations
>> can't span multiple chunks.  Ideally, I'd like to address this issue by
>> expanding the size of the xvmalloc pool chunks from one page to four
>> pages (I can explain why four is a good number, just didn't want to make
>> this note too long).
> 
> Nitin is the expert on compression and xvmalloc... I mostly built on top
> of his earlier work... so I will wait for him to comment on compression
> and xvmalloc issues.
>

Yes, I do need Nitin to weigh in on this since any changes to the xvmalloc
code would impact zcache and zram.
 
> BUT... I'd be concerned with increasing the pool chunk, at least without
> a fallback.  When memory is constrained, finding chunks in the kernel
> of even two consecutive pages might be a challenge, let alone four.
> Since frontswap only is invoked if swapping is occurring, memory
> is definitely already constrained.
> 
> If it is possible to modify xvmalloc (or possibly the pool creation
> calls from zcache) to juggle multiple pools, one with chunkorder==2,
> one with chunkorder==1, and one with chunkorder=0, with a fallback
> sequence if a higher chunkorder is not available, might that be
> helpful?  Still I worry that the same problems might occur because
> the higher chunkorders might never be available after some time
> passes.
>

To avoid the problem with getting one large set (up to 4 pages) of 
contiguous space, I'm looking into using vm_map_ram() to map
chunks that are multiple noncontiguous pages into a single contiguous 
address space.  I don't know what the overhead is yet.

I do like the idea of having a few pools with different chunk sizes.
 
>> After a little playing around, I've found this isn't entirely trivial to
>> do because of the memory mapping implications; more specifically the use
>> of kmap/kunmap in the xvmalloc and zcache layers.  I've looked into
>> using vmap to map multiple pages into a linear address space, but it
>> seems like there is a lot of memory overhead in doing that.
>>
>> Do you have any feedback on this issue or a suggested solution?
> 
> One neat feature of frontswap (and the underlying Transcendent
> Memory definition) is that ANY PUT may be rejected**.  So zcache
> could keep track of the distribution of "zsize" and if the number
> of pages with zsize>PAGE_SIZE/2 greatly exceeds the number of pages
> with "complementary zsize", the frontswap code in zcache can reject
> the larger pages until balance/sanity is restored.
> 
> Might that help?  

We could do that, but I imagine that would let a lot of pages through 
on most workloads.  Ideally, I'd like to find a solution that would
capture and (efficiently) store pages that compressed to up to 80% of 
their original size.

> If so, maybe your new sysfs value could be
> replaced with the ratio (zv_pool_pages_count/frontswap_curr_pages)
> and this could be _writeable_ to allow the above policy target to
> be modified at runtime.   Even better, the fraction could be
> represented by number-of-bytes ("target_zsize"), which could default
> to something like (3*PAGE_SIZE)/4... if the ratio above
> exceeds target_zsize and the zsize of the page-being-put exceeds
> target_zsize, then the put is rejected.
> 
> Thanks,
> Dan
> 
> ** The "put" shouldn't actually be rejected outright... it should
> be converted to a "flush" so that, if a previous put was
> performed for the matching handle, the space can be reclaimed.
> (Let me know if you need more explanation of this.)

Thanks again for your reply, Dan.  I'll explore this more next week.

--
Seth


* Re: frontswap/zcache: xvmalloc discussion
  2011-06-22 19:15 frontswap/zcache: xvmalloc discussion Seth Jennings
  2011-06-22 19:23 ` [PATCH] Add zv_pool_pages_count to zcache sysfs Seth Jennings
  2011-06-23 16:38 ` frontswap/zcache: xvmalloc discussion Dan Magenheimer
@ 2011-06-24  6:11 ` Nitin Gupta
  2011-06-24 15:52   ` Dave Hansen
  2011-08-05 16:22   ` Seth Jennings
  2 siblings, 2 replies; 12+ messages in thread
From: Nitin Gupta @ 2011-06-24  6:11 UTC (permalink / raw)
  To: Seth Jennings
  Cc: linux-mm, Dan Magenheimer, Robert Jennings, Brian King,
	Greg Kroah-Hartman

Hi Seth,

On 06/22/2011 12:15 PM, Seth Jennings wrote:

>
> The real problem here is compressing pages of size x and storing them in
> a pool that has "chunks", if you will, also of size x, where allocations
> can't span multiple chunks. Ideally, I'd like to address this issue by
> expanding the size of the xvmalloc pool chunks from one page to four
> pages (I can explain why four is a good number, just didn't want to make
> this note too long).
>
> After a little playing around, I've found this isn't entirely trivial to
> do because of the memory mapping implications; more specifically the use
> of kmap/kunmap in the xvmalloc and zcache layers. I've looked into using
> vmap to map multiple pages into a linear address space, but it seems
> like there is a lot of memory overhead in doing that.
>
> Do you have any feedback on this issue or a suggested solution?
>

The xvmalloc fragmentation issue has been reported by several zram users, 
and quite some time back I started working on a new allocator (xcfmalloc) 
which potentially solves many of these issues. However, all of the 
details are currently on paper, and I'm sure the actual implementation 
will bring a lot of surprises.

Currently, xvmalloc wastes memory due to:
  - No compaction support: Each page can store chunks of any size which 
makes compaction really hard to implement.
  - Use of 0-order pages only: This was enforced to avoid memory 
allocation failures. As Dan pointed out, any higher order allocation is 
almost guaranteed to fail under memory pressure.

To solve these issues, xcfmalloc:
  - Supports compaction: It's size-class based (like SLAB) which, among 
other things, simplifies compaction.
  - Supports higher order pages using little trickery:

For 64-bit systems, we can simply use vmalloc(16k or 64k) pages and 
never bother unmapping them. This is expensive (how much?) in terms of 
both CPU and memory but easy to implement.

But on 32-bit (almost all "embedded" devices), this of course cannot be 
done. For this case, the plan is to create a "vpage" abstraction which 
can be treated as usual higher-order page.

vpage abstraction:
  - Allocate 0-order pages and maintain them in an array
  - Allow a chunk to cross at most one 4K (or whatever is the native 
PAGE_SIZE) page boundary. This limits maximum allocation size to 4K but 
simplifies mapping logic.
  - A vpage is assigned a specific size class just like usual SLAB. This 
will simplify compaction.
  - xcfmalloc() will return a object handle instead of a direct pointer.
  - Provide xcfmalloc_{map,unmap}() which will handle the case where a 
chunk spans two pages. It will map the pages using kmap_atomic() and 
thus user will be expected to unmap them soon.
  - Allow a vpage to be "partially freed", i.e. its constituent 4K pages 
can be freed individually once completely empty.

Much of this vpage functionality seems to be already present in mainline 
as "flexible arrays" [1].

For scalability, we can simply go for per-cpu lists and use Hoard[2] 
like design to bound fragmentation associated with such per-cpu slabs.

Unfortunately, I'm currently too loaded to work on this, at least for 
the next 2 months (internship), but would be glad to contribute if 
someone is willing to work on this.

[1] http://lxr.linux.no/linux+v2.6.39/Documentation/flexible-arrays.txt
[2] Hoard allocator: 
http://www.cs.umass.edu/~emery/pubs/berger-asplos2000.pdf

Thanks,
Nitin


* Re: frontswap/zcache: xvmalloc discussion
  2011-06-24  6:11 ` Nitin Gupta
@ 2011-06-24 15:52   ` Dave Hansen
  2011-06-25  2:42     ` Nitin Gupta
  2011-08-05 16:22   ` Seth Jennings
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2011-06-24 15:52 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: Seth Jennings, linux-mm, Dan Magenheimer, Robert Jennings,
	Brian King, Greg Kroah-Hartman

On Thu, 2011-06-23 at 23:11 -0700, Nitin Gupta wrote:
> Much of this vpage functionality seems to be already present in mainline 
> as "flexible arrays"[1] 

That's a good observation.  I don't know who wrote that junk, but I bet
they never thought of using it for this purpose. :)

FWIW, for flex_arrays, the biggest limitation is that the objects
currently can not cross page boundaries.  The current API also doesn't
have any concept of a release function.  We'd need those to do the
unmapping after a get().  It certainly wouldn't be impossible to fix,
but it would probably make it quite a bit more complicated.

The other limitation is that each array can only hold a small number of
megabytes worth of data in each array.  We only have a single-level
table lookup, and that first-level table is limited to PAGE_SIZE (minus
a wee bit of metadata).

-- Dave


* RE: frontswap/zcache: xvmalloc discussion
  2011-06-23 21:59   ` Seth Jennings
@ 2011-06-24 22:40     ` Dan Magenheimer
  2011-06-30  2:31       ` Dan Magenheimer
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Magenheimer @ 2011-06-24 22:40 UTC (permalink / raw)
  To: Seth Jennings
  Cc: linux-mm, Nitin Gupta, Robert Jennings, Brian King,
	Greg Kroah-Hartman

> > One neat feature of frontswap (and the underlying Transcendent
> > Memory definition) is that ANY PUT may be rejected**.  So zcache
> > could keep track of the distribution of "zsize" and if the number
> > of pages with zsize>PAGE_SIZE/2 greatly exceeds the number of pages
> > with "complementary zsize", the frontswap code in zcache can reject
> > the larger pages until balance/sanity is restored.
> >
> > Might that help?
> 
> We could do that, but I imagine that would let a lot of pages through
> on most workloads.  Ideally, I'd like to find a solution that would
> capture and (efficiently) store pages that compressed to up to 80% of
> their original size.

After thinking about this a bit, I have to disagree.  For workloads
where the vast majority of pages have zsize>PAGE_SIZE/2, this would
let a lot of pages through.  So if you are correct that LZO
is poor at compression and a large majority of pages are in
this category, some page-crossing scheme is necessary.  However,
that isn't what I've seen... the zsize of many swap pages is
quite small.

So before commencing on a major compression rewrite, it might
be a good idea to measure distribution of zsize for swap pages
on a large variety of workloads.  This could probably be done
by adding a code snippet in the swap path of a normal (non-zcache)
kernel.  And if the distribution is bad, replacing LZO with a
higher-compression-but-slower algorithm might be the best answer,
since zcache is replacing VERY slow swap-device reads/writes with
reasonably fast compression/decompression.  I certainly think
that an algorithm approaching an average 50% compression ratio
should be the goal.

Dan


* Re: frontswap/zcache: xvmalloc discussion
  2011-06-24 15:52   ` Dave Hansen
@ 2011-06-25  2:42     ` Nitin Gupta
  0 siblings, 0 replies; 12+ messages in thread
From: Nitin Gupta @ 2011-06-25  2:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Seth Jennings, linux-mm, Dan Magenheimer, Robert Jennings,
	Brian King, Greg Kroah-Hartman

On 06/24/2011 08:52 AM, Dave Hansen wrote:
> On Thu, 2011-06-23 at 23:11 -0700, Nitin Gupta wrote:
>> Much of this vpage functionality seems to be already present in mainline
>> as "flexible arrays"[1]
>
> That's a good observation.  I don't know who wrote that junk, but I bet
> they never thought of using it for this purpose. :)
>
> FWIW, for flex_arrays, the biggest limitation is that the objects
> currently can not cross page boundaries.  The current API also doesn't
> have any concept of a release function.  We'd need those to do the
> unmapping after a get().  It certainly wouldn't be impossible to fix,
> but it would probably make it quite a bit more complicated.
>
> The other limitation is that each array can only hold a small number of
> megabytes worth of data in each array.  We only have a single-level
> table lookup, and that first-level table is limited to PAGE_SIZE (minus
> a wee bit of metadata).
>

These limitations really make them unsuitable for use in the new 
allocator, and I guess "fixing" them is also not a good idea -- if it's 
fundamentally designed to work on a small number of objects, it should 
probably be left as-is.  Not sure who really uses flex arrays?

So, if and when vpage stuff is introduced, existence of flex_arrays 
should not become a barrier.

Thanks,
Nitin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* RE: frontswap/zcache: xvmalloc discussion
  2011-06-24 22:40     ` Dan Magenheimer
@ 2011-06-30  2:31       ` Dan Magenheimer
  2011-06-30 16:09         ` Dan Magenheimer
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Magenheimer @ 2011-06-30  2:31 UTC (permalink / raw)
  To: Seth Jennings
  Cc: linux-mm, Nitin Gupta, Robert Jennings, Brian King,
	Greg Kroah-Hartman, Dave Hansen

> > > One neat feature of frontswap (and the underlying Transcendent
> > > Memory definition) is that ANY PUT may be rejected**.  So zcache
> > > could keep track of the distribution of "zsize" and if the number
> > > of pages with zsize>PAGE_SIZE/2 greatly exceeds the number of pages
> > > with "complementary zsize", the frontswap code in zcache can reject
> > > the larger pages until balance/sanity is restored.
> > >
> > > Might that help?
> >
> > We could do that, but I imagine that would let a lot of pages through
> > on most workloads.  Ideally, I'd like to find a solution that would
> > capture and (efficiently) store pages that compressed to up to 80% of
> > their original size.
> 
> After thinking about this a bit, I have to disagree.  For workloads
> where the vast majority of pages have zsize>PAGE_SIZE/2, this would
> let a lot of pages through.  So if you are correct that LZO
> is poor at compression and a large majority of pages are in
> this category, some page-crossing scheme is necessary.  However,
> that isn't what I've seen... the zsize of many swap pages is
> quite small.
> 
> So before commencing on a major compression rewrite, it might
> be a good idea to measure distribution of zsize for swap pages
> on a large variety of workloads.  This could probably be done
> by adding a code snippet in the swap path of a normal (non-zcache)
> kernel.  And if the distribution is bad, replacing LZO with a
> higher-compression-but-slower algorithm might be the best answer,
> since zcache is replacing VERY slow swap-device reads/writes with
> reasonably fast compression/decompression.  I certainly think
> that an algorithm approaching an average 50% compression ratio
> should be the goal.

FWIW, I've measured the distribution of zsize (pages compressed
with frontswap) on my favorite workload (kernel "make -j2" on
mem=512M to force lots of swapping) and the mean is small, close
to 1K (PAGE_SIZE/4).  I've added some sysfs shows for both
the current and cumulative distribution (0-63 bytes, 64-127
bytes, ..., 4032-4095 bytes) for the next update.
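The 64-byte binning described above (0-63, 64-127, ..., 4032-4095) can be
sketched in userspace C as follows; the function and variable names are
illustrative assumptions, not the actual zcache code:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u
#define ZSIZE_BUCKET_SHIFT 6                              /* 64-byte-wide buckets */
#define ZSIZE_NBUCKETS (PAGE_SIZE >> ZSIZE_BUCKET_SHIFT)  /* 64 buckets */

unsigned long zsize_hist[ZSIZE_NBUCKETS];

/* Record one compressed-page size in the cumulative distribution:
 * bucket 0 covers 0-63 bytes, bucket 1 covers 64-127, ...,
 * bucket 63 covers 4032-4095. */
void zsize_record(size_t zsize)
{
	size_t bucket = zsize >> ZSIZE_BUCKET_SHIFT;

	if (bucket >= ZSIZE_NBUCKETS)	/* clamp zsize >= PAGE_SIZE */
		bucket = ZSIZE_NBUCKETS - 1;
	zsize_hist[bucket]++;
}
```

A sysfs "show" function would then just print the 64 counters.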

I tried your program on the text of Moby Dick and the mean
was still under 1500 bytes ((3*PAGE_SIZE)/8) with a good
broad distribution for zsize.  I tried your program also on
gzip'ed Moby Dick and zcache correctly rejects most of the
pages as uncompressible and does fine on other swapped pages.

So I can't reproduce what you are seeing.  Somehow you
must create and swap a set of pages with a zsize distribution
almost entirely between PAGE_SIZE/2 and (PAGE_SIZE*7)/8.
How did you do that?

FYI, I also added a sysfs settable for zv_max_page_size...
if zsize exceeds it, the page is rejected.  It defaults to
(PAGE_SIZE*7)/8, which was the non-settable hardwired
value before.
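A minimal sketch of that threshold check, with hypothetical names (the
real knob is a settable sysfs attribute in zcache):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Tunable: a put whose compressed size exceeds this is rejected.
 * (7*PAGE_SIZE)/8 matches the previously hardwired value. */
size_t zv_max_page_size = (PAGE_SIZE * 7) / 8;

/* Return true if a page of compressed size zsize should be stored. */
bool zv_accept_zsize(size_t zsize)
{
	return zsize <= zv_max_page_size;
}
```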

Dan



* RE: frontswap/zcache: xvmalloc discussion
  2011-06-30  2:31       ` Dan Magenheimer
@ 2011-06-30 16:09         ` Dan Magenheimer
  0 siblings, 0 replies; 12+ messages in thread
From: Dan Magenheimer @ 2011-06-30 16:09 UTC (permalink / raw)
  To: Seth Jennings
  Cc: linux-mm, Nitin Gupta, Robert Jennings, Brian King,
	Greg Kroah-Hartman, Dave Hansen

> FWIW, I've measured the distribution of zsize (pages compressed
> with frontswap) on my favorite workload (kernel "make -j2" on
> mem=512M to force lots of swapping) and the mean is small, close
> to 1K (PAGE_SIZE/4).  I've added some sysfs shows for both
> the current and cumulative distribution (0-63 bytes, 64-127
> bytes, ..., 4032-4095 bytes) for the next update.
> 
> I tried your program on the text of Moby Dick and the mean
> was still under 1500 bytes ((3*PAGE_SIZE)/8) with a good
> broad distribution for zsize.

Oops, on retry this morning, I am now clearly seeing the poor
compression.  Not sure what is different from last night,
but I suspect I was "watch"ing the new sysfs output during
massive swapping during the run of the test program and
it wasn't updated until program completion (at which point
I was no longer perusing the sysfs output).

Sorry for the noise.  However, now that I have a test case
I am implementing another sysfs tunable to reject poorly-
compressible pages that would drive the mean zsize
above the tunable.  Zcache will reject these pages until
the mean falls below the threshold.  (Setting it to
PAGE_SIZE will continue current behavior, but I've set
the default to (5*PAGE_SIZE)/8 for now.)
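The mean-based rejection described above could be sketched as follows;
the names and the exact accounting are assumptions, not the actual
zcache implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Tunable: reject a put if accepting it would push the mean zsize
 * of stored pages above this threshold. */
size_t zv_max_mean_zsize = (5 * PAGE_SIZE) / 8;	/* 2560 for 4K pages */

unsigned long zv_curr_pages;	/* pages currently accounted */
unsigned long zv_curr_zbytes;	/* total compressed bytes accounted */

/* Accept the page and update the totals, or reject it untouched. */
bool zv_accept_for_mean(size_t zsize)
{
	unsigned long pages = zv_curr_pages + 1;
	unsigned long zbytes = zv_curr_zbytes + zsize;

	if (zbytes / pages > zv_max_mean_zsize)
		return false;		/* reject; totals unchanged */
	zv_curr_pages = pages;
	zv_curr_zbytes = zbytes;
	return true;
}
```

Setting zv_max_mean_zsize to PAGE_SIZE makes the check a no-op,
preserving the current behavior.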

Dan



* Re: frontswap/zcache: xvmalloc discussion
  2011-06-24  6:11 ` Nitin Gupta
  2011-06-24 15:52   ` Dave Hansen
@ 2011-08-05 16:22   ` Seth Jennings
  1 sibling, 0 replies; 12+ messages in thread
From: Seth Jennings @ 2011-08-05 16:22 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: linux-mm, Dan Magenheimer, Robert Jennings, Brian King,
	Greg Kroah-Hartman

Nitin,

I have been working on this xcfmalloc allocator using most of the design 
points you described (cross-page allocations, compaction).  I should 
have a patch soon.

On 06/24/2011 01:11 AM, Nitin Gupta wrote:
> Hi Seth,
> 
> On 06/22/2011 12:15 PM, Seth Jennings wrote:
> 
>>
>> The real problem here is compressing pages of size x and storing them in
>> a pool that has "chunks", if you will, also of size x, where allocations
>> can't span multiple chunks. Ideally, I'd like to address this issue by
>> expanding the size of the xvmalloc pool chunks from one page to four
>> pages (I can explain why four is a good number, just didn't want to make
>> this note too long).
>>
>> After a little playing around, I've found this isn't entirely trivial to
>> do because of the memory mapping implications; more specifically the use
>> of kmap/kunamp in the xvmalloc and zcache layers. I've looked into using
>> vmap to map multiple pages into a linear address space, but it seems
>> like there is a lot of memory overhead in doing that.
>>
>> Do you have any feedback on this issue or suggestion solution?
>>
> 
> The xvmalloc fragmentation issue has been reported by several zram users, 
> and quite some time back I started working on a new allocator (xcfmalloc) 
> which potentially solves many of these issues. However, all of the 
> details are currently on paper, and I'm sure the actual implementation 
> will bring a lot of surprises.
> 
> Currently, xvmalloc wastes memory due to:
>   - No compaction support: Each page can store chunks of any size, which 
> makes compaction really hard to implement.
>   - Use of 0-order pages only: This was enforced to avoid memory 
> allocation failures. As Dan pointed out, any higher order allocation is 
> almost guaranteed to fail under memory pressure.
> 
> To solve these issues, xcfmalloc:
>   - Supports compaction: It is size-class based (like SLAB), which, among 
> other things, simplifies compaction.
>   - Supports higher order pages using little trickery:
> 
> For 64-bit systems, we can simply use vmalloc(16k or 64k) pages and 
> never bother unmapping them. This is expensive (how much?) in terms of 
> both CPU and memory but easy to implement.
> 
> But on 32-bit (almost all "embedded" devices), this of course cannot be 
> done. For this case, the plan is to create a "vpage" abstraction which 
> can be treated as a usual higher-order page.
> 
> vpage abstraction:
>   - Allocate 0-order pages and maintain them in an array
>   - Allow a chunk to cross at most one 4K (or whatever is the native 
> PAGE_SIZE) page boundary. This limits maximum allocation size to 4K but 
> simplifies mapping logic.
>   - A vpage is assigned a specific size class just like usual SLAB. This 
> will simplify compaction.
>   - xcfmalloc() will return an object handle instead of a direct pointer.
>   - Provide xcfmalloc_{map,unmap}() which will handle the case where a 
> chunk spans two pages. It will map the pages using kmap_atomic(), and 
> thus the user will be expected to unmap them soon.
>   - Allow a vpage to be "partially freed", i.e., its constituent 4K pages 
> can be freed individually once completely empty.
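The boundary-crossing read path in the vpage scheme above might look
like this sketch (userspace C with plain pointers standing in for
kmap'd 0-order pages; all names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define VPAGE_PAGES 4			/* a "4-page" vpage */

/* A vpage: an array of 0-order pages; a chunk may cross at most
 * one page boundary, so a read touches at most two pages. */
struct vpage {
	unsigned char *pages[VPAGE_PAGES];	/* kernel would kmap_atomic() these */
};

/* Copy a chunk (len <= PAGE_SIZE) out of a vpage at byte offset off. */
void vpage_read(const struct vpage *vp, size_t off, void *dst, size_t len)
{
	size_t page = off / PAGE_SIZE;
	size_t pgoff = off % PAGE_SIZE;
	size_t first = PAGE_SIZE - pgoff;	/* bytes left in the first page */

	if (len <= first) {
		memcpy(dst, vp->pages[page] + pgoff, len);
	} else {				/* spans exactly one boundary */
		memcpy(dst, vp->pages[page] + pgoff, first);
		memcpy((unsigned char *)dst + first,
		       vp->pages[page + 1], len - first);
	}
}
```

Capping allocations at PAGE_SIZE guarantees the else branch never needs
a third page, which is what keeps the mapping logic simple.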
> 
> Much of this vpage functionality seems to be already present in mainline 
> as "flexible arrays"[1]
> 
> For scalability, we can simply go for per-cpu lists and use Hoard[2] 
> like design to bound fragmentation associated with such per-cpu slabs.
> 
> Unfortunately, I'm currently too loaded to work on this, at least for the 
> next 2 months (internship), but would be glad to contribute if someone is 
> willing to work on this.
> 
> [1] http://lxr.linux.no/linux+v2.6.39/Documentation/flexible-arrays.txt
> [2] Hoard allocator: 
> http://www.cs.umass.edu/~emery/pubs/berger-asplos2000.pdf
> 
> Thanks,
> Nitin
> 



end of thread, other threads:[~2011-08-05 16:22 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-22 19:15 frontswap/zcache: xvmalloc discussion Seth Jennings
2011-06-22 19:23 ` [PATCH] Add zv_pool_pages_count to zcache sysfs Seth Jennings
2011-06-23 15:38   ` Dave Hansen
2011-06-23 16:38 ` frontswap/zcache: xvmalloc discussion Dan Magenheimer
2011-06-23 21:59   ` Seth Jennings
2011-06-24 22:40     ` Dan Magenheimer
2011-06-30  2:31       ` Dan Magenheimer
2011-06-30 16:09         ` Dan Magenheimer
2011-06-24  6:11 ` Nitin Gupta
2011-06-24 15:52   ` Dave Hansen
2011-06-25  2:42     ` Nitin Gupta
2011-08-05 16:22   ` Seth Jennings
