* [PATCH 0/3]: compressed in-memory swapping
@ 2009-03-17 11:34 Nitin Gupta
  2009-03-17 11:36 ` [PATCH 1/3]: compressed RAM block device Nitin Gupta
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-17 11:34 UTC (permalink / raw)
  To: linux-kernel

Hi,

Project home: http://code.google.com/p/compcache/

It allows creating a RAM based block device which acts as a swap disk.
Pages swapped to this device are compressed and stored in memory itself.
This is a big win over swapping to slow hard disks, which are typically used
as swap disks. Flash storage suffers from wear-leveling issues when used as
a swap disk, so compcache is helpful there too. For swapless systems, it
allows more apps to run.

It is now part of Ubuntu, ALT Linux and some other distros.

Some use cases:
  - Embedded Devices: Memory is scarce and adding more memory increases
    device cost. Also, flash storage suffers from wear-leveling issues.
  - Virtualization: To the host/hypervisor, all VM/guest memory is anonymous
    memory, so compcache can compress any part of a guest. This allows hosting
    more VMs on the same amount of memory.
  - Thinclients.

Testing/Performance:
  - Testing for use on Thinclients (contrib: Nai Xia):
     Details: http://code.google.com/p/compcache/wiki/LTSPPerf
     Summary: http://code.google.com/p/compcache/wiki/LTSPPerfSummary

     Summary: after compcache was loaded:
         - The time to page down one PDF page was reduced to 1/4~1/100.
         - The time to switch from one Firefox tab to another was reduced to 1/6.
         - The capacity of kpdf was increased from 2 PDF files to 11 PDF files.
         - The capacity of Firefox was increased from 6 web pages to 15 web pages.

  - Testing on x86 and x64 VMs (256/512MB RAM, 1 VCPU, 512MB swap):
    Started Fedora 10 and some "generic desktop" apps (KDE4, Firefox,
    file manager, editors) - without compcache, the system swapped heavily
    and was very unresponsive. With compcache, there was a very noticeable
    improvement in responsiveness.

  - Some users reported running it on PS3 systems, though no performance
    stats are available for those.

Limitations:
  - Being a "swap disk", it can compress only anonymous memory. But as mentioned
    above, at the hypervisor/host level all guest memory is anonymous, so any
    part of guest memory can be compressed.
  - The memory allocator used is not scalable at all (work in progress).
  - No run-time defragmentation or swapping out of allocated memory - however,
    incompressible pages are forwarded to physical swap, and a limit can be set
    on the amount of memory used by compcache.

I do not claim a performance gain for arbitrary workloads, but as the testing
shows, it certainly helps in some cases :)

Also note that this is a low risk change - it does not modify any part
of the kernel. It is just an additional module. In case you find it hurting
performance, just unload the module and maybe use normal disk swap.

Any reviews/comments/suggestions are welcome.

Thanks,
Nitin


* [PATCH 1/3]: compressed RAM block device
  2009-03-17 11:34 [PATCH 0/3]: compressed in-memory swapping Nitin Gupta
@ 2009-03-17 11:36 ` Nitin Gupta
  2009-03-18 12:25   ` Nick Piggin
  2009-03-20 18:56   ` Pavel Machek
  2009-03-17 11:37 ` [PATCH 2/3]: xvmalloc memory allocator Nitin Gupta
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-17 11:36 UTC (permalink / raw)
  To: linux-kernel

  drivers/block/Kconfig     |   22 +
  drivers/block/Makefile    |    1 +
  drivers/block/compcache.c |  995 +++++++++++++++++++++++++++++++++++++++++++++
  drivers/block/compcache.h |  160 ++++++++
  4 files changed, 1178 insertions(+), 0 deletions(-)

Creates a RAM based block device (ramzswap0) which can be used as a swap device.
Pages swapped to it are compressed and stored in memory itself.

The module is called compcache.ko. It depends on:
  - xvmalloc.ko: memory allocator
  - lzo_compress.ko
  - lzo_decompress.ko

See Documentation/blockdev/compcache.txt for usage details.

Project home: http://code.google.com/p/compcache/

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 0344a8a..cccf76c 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -348,6 +348,28 @@ config BLK_DEV_RAM_SIZE
  	  The default value is 4096 kilobytes. Only change this if you know
  	  what you are doing.

+config BLK_DEV_COMPCACHE
+	tristate "Compressed RAM swap device"
+	depends on XVMALLOC
+	depends on LZO_COMPRESS
+	depends on LZO_DECOMPRESS
+	help
+	  Saying Y here will allow you to use in-memory compressed swapping.
+	  It creates a pseudo block device (named ramzswap) which acts as a
+	  swap device. Pages swapped to this device are compressed and stored
+	  in memory itself.
+
+	  Project home: http://code.google.com/p/compcache/
+	  For details, read <file:Documentation/blockdev/compcache.txt>
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called compcache.
+
+config BLK_DEV_COMPCACHE_STATS
+	bool "Collect statistics"
+	depends on BLK_DEV_COMPCACHE
+	default y
+
  config BLK_DEV_XIP
  	bool "Support XIP filesystems on RAM block device"
  	depends on BLK_DEV_RAM
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 204332b..6b2ee9e 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_PS3_DISK)		+= ps3disk.o
  obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
  obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
  obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_COMPCACHE)	+= compcache.o
  obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
  obj-$(CONFIG_BLK_DEV_XD)	+= xd.o
  obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
diff --git a/drivers/block/compcache.c b/drivers/block/compcache.c
new file mode 100644
index 0000000..42bbb52
--- /dev/null
+++ b/drivers/block/compcache.c
@@ -0,0 +1,995 @@
+/*
+ * Compressed RAM based swap device
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This RAM based block device acts as swap disk.
+ * Pages swapped to this device are compressed and
+ * stored in memory.
+ *
+ * Released under the terms of the GNU General Public
+ * License (version 2). See linux/COPYING for more information.
+ *
+ * Project home: http://code.google.com/p/compcache
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/lzo.h>
+#include <linux/mutex.h>
+#include <linux/proc_fs.h>
+#include <linux/string.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/vmalloc.h>
+#include <linux/xvmalloc.h>
+
+#include "compcache.h"
+
+/* Globals */
+static struct compcache compcache;
+static struct compcache_stats stats;
+
+/* Module params (documentation at end) */
+static unsigned long disksize_kb;
+static unsigned long memlimit_kb;
+static char *backing_dev;
+
+/*
+ * Pages that compress to larger than this size are
+ * forwarded to backing swap (if present) or stored
+ * uncompressed in memory otherwise.
+ */
+static unsigned int MAX_CPAGE_SIZE;
+
+static int __init compcache_init(void);
+static struct block_device_operations compcache_devops = {
+	.owner = THIS_MODULE,
+};
+
+static void set_page_zero(u32 index)
+{
+	compcache.table[index].flags |= (1 << CC_zero);
+}
+
+static void set_page_uncompressed(u32 index)
+{
+	compcache.table[index].flags |= (1 << CC_uncompressed);
+}
+
+static void clear_page_zero(u32 index)
+{
+	compcache.table[index].flags &= ~(1 << CC_zero);
+}
+
+static void clear_page_uncompressed(u32 index)
+{
+	compcache.table[index].flags &= ~(1 << CC_uncompressed);
+}
+
+static int is_page_zero(u32 index)
+{
+	return compcache.table[index].flags & (1 << CC_zero);
+}
+
+static int is_page_uncompressed(u32 index)
+{
+	return compcache.table[index].flags & (1 << CC_uncompressed);
+}
+
+static int page_zero_filled(void *ptr)
+{
+	u32 pos;
+	u64 *page;
+
+	page = (u64 *)ptr;
+
+	for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) {
+		if (page[pos])
+			return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * Given <pagenum, offset> pair, provide a dereferenceable pointer.
+ */
+static void *get_ptr_atomic(u32 pagenum, u16 offset, enum km_type type)
+{
+	unsigned char *page;
+
+	page = kmap_atomic(pfn_to_page(pagenum), type);
+	return page + offset;
+}
+
+static void put_ptr_atomic(void *ptr, enum km_type type)
+{
+	kunmap_atomic(ptr, type);
+}
+
+#if defined(STATS)
+static struct proc_dir_entry *proc;
+
+static int proc_compcache_read(char *page, char **start, off_t off,
+				int count, int *eof, void *data)
+{
+	int len;
+	size_t succ_writes, mem_used;
+	unsigned int good_compress_perc = 0, no_compress_perc = 0;
+
+	mem_used = xvGetTotalSizeBytes(compcache.mem_pool)
+			+ (stats.pages_expand << PAGE_SHIFT);
+
+	if (off > 0) {
+		*eof = 1;
+		return 0;
+	}
+
+#define K(x)	((x) >> 10)
+	/* Basic stats */
+	len = sprintf(page,
+		"DiskSize:	%8zu kB\n",
+		(size_t)(K(compcache.disksize)));
+
+	if (compcache.backing_dev) {
+		/* ComprDataSize must always stay below this limit */
+		len += sprintf(page + len,
+			"MemLimit:	%8zu kB\n",
+			K(compcache.memlimit));
+	}
+
+	succ_writes = stats.num_writes - stats.failed_writes;
+	if (succ_writes) {
+		good_compress_perc = stats.good_compress * 100
+					/ stats.pages_stored;
+		no_compress_perc = stats.pages_expand * 100
+					/ stats.pages_stored;
+	}
+
+	/* Extended stats */
+	len += sprintf(page + len,
+		"NumReads:	%8llu\n"
+		"NumWrites:	%8llu\n"
+		"FailedReads:	%8llu\n"
+		"FailedWrites:	%8llu\n"
+		"InvalidIO:	%8llu\n"
+		"PagesDiscard:	%8llu\n"
+		"ZeroPages:	%8u\n"
+		"GoodCompress:	%8u %%\n"
+		"NoCompress:	%8u %%\n"
+		"PagesStored:	%8u\n"
+		"PagesUsed:	%8zu\n"
+		"OrigDataSize:	%8zu kB\n"
+		"ComprDataSize:	%8zu kB\n"
+		"MemUsedTotal:	%8zu kB\n",
+		stats.num_reads,
+		stats.num_writes,
+		stats.failed_reads,
+		stats.failed_writes,
+		stats.invalid_io,
+		stats.pages_discard,
+		stats.pages_zero,
+		good_compress_perc,
+		no_compress_perc,
+		stats.pages_stored,
+		mem_used >> PAGE_SHIFT,
+		(size_t)(K(stats.pages_stored << PAGE_SHIFT)),
+		(size_t)(K(stats.compr_size)),
+		(size_t)(K(mem_used)));
+
+	if (compcache.backing_dev) {
+		/* Backing swap device I/O statistics */
+		len += sprintf(page + len,
+			"BDevNumReads:	%8llu\n"
+			"BDevNumWrites:	%8llu\n",
+			stats.bdev_num_reads,
+			stats.bdev_num_writes);
+	}
+
+	return len;
+}
+#endif	/* STATS */
+
+/*
+ * Check if value of backing_dev module param is sane.
+ * Claim this device and set compcache size equal to
+ * size of this block device.
+ */
+static int setup_backing_device(void)
+{
+	int error = 0;
+	struct inode *inode;
+	struct file *swap_file;
+	struct address_space *mapping;
+	struct block_device *bdev = NULL;
+
+	if (backing_dev == NULL) {
+		pr_debug(C "backing_dev param not given\n");
+		goto out;
+	}
+
+	pr_info(C "Using backing swap device: %s\n", backing_dev);
+
+	swap_file = filp_open(backing_dev, O_RDWR | O_LARGEFILE, 0);
+	if (IS_ERR(swap_file)) {
+		pr_err(C "Error opening backing device: %s\n", backing_dev);
+		error = -EINVAL;
+		goto out;
+	}
+
+	mapping = swap_file->f_mapping;
+	inode = mapping->host;
+
+	if (S_ISBLK(inode->i_mode)) {
+		bdev = I_BDEV(inode);
+		error = bd_claim(bdev, compcache_init);
+		if (error < 0) {
+			bdev = NULL;
+			goto bad_param;
+		}
+		compcache.old_block_size = block_size(bdev);
+		error = set_blocksize(bdev, PAGE_SIZE);
+		if (error < 0)
+			goto bad_param;
+	} else {
+		/* TODO: support for regular file as backing swap */
+		pr_info(C "%s is not a block device.\n", backing_dev);
+		error = -EINVAL;
+		goto out;
+	}
+
+	compcache.swap_file = swap_file;
+	compcache.backing_dev = bdev;
+	compcache.disksize = i_size_read(inode);
+	BUG_ON(!compcache.disksize);
+
+	return 0;
+
+bad_param:
+	if (bdev) {
+		set_blocksize(bdev, compcache.old_block_size);
+		bd_release(bdev);
+	}
+	filp_close(swap_file, NULL);
+
+out:
+	compcache.backing_dev = NULL;
+	return error;
+}
+
+/*
+ * Check if request is within bounds and page aligned.
+ */
+static inline int valid_swap_request(struct bio *bio)
+{
+	if (unlikely(
+		(bio->bi_sector >= (compcache.disksize >> SECTOR_SHIFT)) ||
+		(bio->bi_sector & (SECTORS_PER_PAGE - 1)) ||
+		(bio->bi_vcnt != 1) ||
+		(bio->bi_size != PAGE_SIZE) ||
+		(bio->bi_io_vec[0].bv_offset != 0))) {
+
+		return 0;
+	}
+
+	/* swap request is valid */
+	return 1;
+}
+
+static void compcache_free_page(size_t index)
+{
+	u32 clen;
+	void *obj;
+
+	if (unlikely(is_page_uncompressed(index))) {
+		clen = PAGE_SIZE;
+		__free_page(pfn_to_page(
+			compcache.table[index].pageNum));
+		clear_page_uncompressed(index);
+		stat_dec(stats.pages_expand);
+	} else {
+		obj = get_ptr_atomic(compcache.table[index].pageNum,
+			compcache.table[index].offset, KM_USER0);
+		clen = xvGetObjectSize(obj) - sizeof(struct zobj_header);
+		put_ptr_atomic(obj, KM_USER0);
+		xvFree(compcache.mem_pool,
+			compcache.table[index].pageNum,
+			compcache.table[index].offset);
+		stat_dec_if_less(stats.good_compress, clen, PAGE_SIZE/2 + 1);
+	}
+
+	stats.compr_size -= clen;
+	stat_dec(stats.pages_stored);
+
+	compcache.table[index].pageNum = 0;
+	compcache.table[index].offset = 0;
+}
+
+static int compcache_prepare_discard(struct request_queue *q,
+					struct request *req)
+{
+	return 0;
+}
+
+/*
+ * Called by the main I/O handler function. This helper
+ * function handles 'discard' I/O requests, which indicate
+ * that some swap pages are no longer required, so the
+ * swap device can take the needed action -- we free the memory
+ * allocated for these pages.
+ */
+static void compcache_discard(struct bio *bio)
+{
+	size_t index, start_page, num_pages;
+
+	start_page = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+	num_pages = bio->bi_size >> (SECTOR_SHIFT + SECTORS_PER_PAGE_SHIFT);
+
+	for (index = start_page; index < start_page + num_pages;
+							index++) {
+		if (compcache.table[index].pageNum) {
+			compcache_free_page(index);
+			stat_inc(stats.pages_discard);
+		}
+	}
+	set_bit(BIO_UPTODATE, &bio->bi_flags);
+	bio_endio(bio, 0);
+	return;
+}
+
+/*
+ * Handler function for all compcache I/O requests.
+ */
+static int compcache_make_request(struct request_queue *queue, struct bio *bio)
+{
+	int ret, fwd_write_request = 0;
+	u32 offset;
+	size_t clen, index;
+	struct zobj_header *zheader;
+	struct page *page, *page_store;
+	unsigned char *user_mem, *cmem, *src;
+
+	if (bio_discard(bio)) {
+		compcache_discard(bio);
+		return 0;
+	}
+
+	if (!valid_swap_request(bio)) {
+		stat_inc(stats.invalid_io);
+		goto out;
+	}
+
+	page = bio->bi_io_vec[0].bv_page;
+	index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+	switch (bio_data_dir(bio)) {
+	case READ:
+		stat_inc(stats.num_reads);
+
+		if (is_page_zero(index)) {
+			user_mem = get_ptr_atomic(page_to_pfn(page), 0,
+						KM_USER0);
+			memset(user_mem, 0, PAGE_SIZE);
+			put_ptr_atomic(user_mem, KM_USER0);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio_endio(bio, 0);
+			return 0;
+		}
+
+		/*
+		 * Requested page is not present in compressed area.
+		 * It is either in the backing swap device (if present), or
+		 * this is an attempt to read before any previous write
+		 * to this location - this happens due to readahead when
+		 * the swap device is read from user-space (e.g. during swapon).
+		 */
+		if (!compcache.table[index].pageNum) {
+			/*
+			 * Always forward such requests to backing swap
+			 * device (if present)
+			 */
+			if (compcache.backing_dev) {
+				stat_dec(stats.num_reads);
+				stat_inc(stats.bdev_num_reads);
+				bio->bi_bdev = compcache.backing_dev;
+				return 1;
+			}
+			/*
+			 * This is an unlikely event, possible only when
+			 * a backing dev is not present.
+			 */
+			pr_debug(C "Read before write on swap device: "
+				"sector=%lu, size=%u, offset=%u\n",
+				(ulong)(bio->bi_sector),
+				bio->bi_size,
+				bio->bi_io_vec[0].bv_offset);
+			user_mem = kmap(page);
+			memset(user_mem, 0, PAGE_SIZE);
+			kunmap(page);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio_endio(bio, 0);
+			return 0;
+		}
+
+		user_mem = get_ptr_atomic(page_to_pfn(page), 0, KM_USER0);
+
+		clen = PAGE_SIZE;
+		cmem = get_ptr_atomic(compcache.table[index].pageNum,
+				compcache.table[index].offset, KM_USER1);
+
+		/* Page is stored uncompressed since it is incompressible */
+		if (unlikely(is_page_uncompressed(index))) {
+			memcpy(user_mem, cmem, PAGE_SIZE);
+			put_ptr_atomic(user_mem, KM_USER0);
+			put_ptr_atomic(cmem, KM_USER1);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio_endio(bio, 0);
+			return 0;
+		}
+
+		ret = lzo1x_decompress_safe(
+			cmem + sizeof(*zheader),
+			xvGetObjectSize(cmem) - sizeof(*zheader),
+			user_mem, &clen);
+
+		put_ptr_atomic(user_mem, KM_USER0);
+		put_ptr_atomic(cmem, KM_USER1);
+
+		/* should NEVER happen */
+		if (unlikely(ret != LZO_E_OK)) {
+			pr_err(C "Decompression failed! "
+				"err=%d, page=%zu\n",
+				ret, index);
+			stat_inc(stats.failed_reads);
+			goto out;
+		}
+
+		set_bit(BIO_UPTODATE, &bio->bi_flags);
+		bio_endio(bio, 0);
+		return 0;
+
+	case WRITE:
+		src = compcache.compress_buffer;
+		stat_inc(stats.num_writes);
+
+		/*
+		 * The system swaps to the same sector again when the stored page
+		 * is no longer referenced by any process. So, it is now safe
+		 * to free the memory that was allocated for this page.
+		 */
+		if (compcache.table[index].pageNum)
+			compcache_free_page(index);
+
+		/*
+		 * No memory is allocated for zero-filled pages.
+		 * Simply clear the zero page flag.
+		 */
+		if (is_page_zero(index)) {
+			stat_dec(stats.pages_zero);
+			clear_page_zero(index);
+		}
+
+		mutex_lock(&compcache.lock);
+
+		user_mem = get_ptr_atomic(page_to_pfn(page), 0, KM_USER0);
+		if (page_zero_filled(user_mem)) {
+			put_ptr_atomic(user_mem, KM_USER0);
+			mutex_unlock(&compcache.lock);
+			stat_inc(stats.pages_zero);
+			set_page_zero(index);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio_endio(bio, 0);
+			return 0;
+		}
+
+		if (compcache.backing_dev &&
+			(stats.compr_size > compcache.memlimit - PAGE_SIZE)) {
+			put_ptr_atomic(user_mem, KM_USER0);
+			mutex_unlock(&compcache.lock);
+			fwd_write_request = 1;
+			goto out;
+		}
+
+		ret = lzo1x_1_compress(user_mem, PAGE_SIZE,
+			src, &clen, compcache.compress_workmem);
+
+		put_ptr_atomic(user_mem, KM_USER0);
+
+		if (unlikely(ret != LZO_E_OK)) {
+			mutex_unlock(&compcache.lock);
+			pr_err(C "Compression failed! err=%d\n", ret);
+			stat_inc(stats.failed_writes);
+			goto out;
+		}
+
+		/* Page is incompressible - store it as is */
+		if (unlikely(clen > MAX_CPAGE_SIZE)) {
+			if (compcache.backing_dev) {
+				mutex_unlock(&compcache.lock);
+				fwd_write_request = 1;
+				goto out;
+			}
+			clen = PAGE_SIZE;
+			page_store = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
+			if (unlikely(!page_store)) {
+				mutex_unlock(&compcache.lock);
+				stat_inc(stats.failed_writes);
+				goto out;
+			}
+			stat_inc(stats.pages_expand);
+			compcache.table[index].pageNum =
+						page_to_pfn(page_store);
+			set_page_uncompressed(index);
+			src = get_ptr_atomic(page_to_pfn(page), 0, KM_USER0);
+			offset = 0;
+		} else {
+			if (xvMalloc(compcache.mem_pool,
+				clen + sizeof(*zheader),
+				&compcache.table[index].pageNum,
+				&offset)) {
+				mutex_unlock(&compcache.lock);
+				pr_debug(C "Error allocating memory for "
+					"compressed page: %zu, size=%zu \n",
+					index, clen);
+				stat_inc(stats.failed_writes);
+				goto out;
+			}
+		}
+
+		compcache.table[index].offset = offset;
+
+		cmem = get_ptr_atomic(compcache.table[index].pageNum,
+				compcache.table[index].offset, KM_USER1);
+
+		if (!is_page_uncompressed(index)) {
+			zheader = (struct zobj_header *)cmem;
+			zheader->table_idx = index;
+			cmem += sizeof(*zheader);
+		}
+
+		memcpy(cmem, src, clen);
+
+		put_ptr_atomic(cmem, KM_USER1);
+		if (unlikely(is_page_uncompressed(index)))
+			put_ptr_atomic(src, KM_USER0);
+
+		/* Update stats */
+		stats.compr_size += clen;
+		stat_inc(stats.pages_stored);
+		stat_inc_if_less(stats.pages_expand, PAGE_SIZE - 1, clen);
+		stat_inc_if_less(stats.good_compress, clen, PAGE_SIZE/2 + 1);
+
+		mutex_unlock(&compcache.lock);
+
+		set_bit(BIO_UPTODATE, &bio->bi_flags);
+		bio_endio(bio, 0);
+		return 0;
+	}
+
+out:
+	if (fwd_write_request) {
+		stat_inc(stats.bdev_num_writes);
+		bio->bi_bdev = compcache.backing_dev;
+		return 1;
+	}
+
+	bio_io_error(bio);
+	return 0;
+}
+
+/*
+ * The swap header (1st page of the swap device) contains information
+ * to identify it as a swap partition. Prepare such a header
+ * for the compcache device (ramzswap) so that swapon can identify
+ * it as a swap partition. In case a backing swap device is provided,
+ * copy its swap header.
+ */
+static int setup_swap_header(union swap_header *s)
+{
+	int ret = 0;
+	struct page *page;
+	struct address_space *mapping;
+	union swap_header *backing_dev_header;
+
+	/*
+	 * There is no backing swap device. Create a swap header
+	 * that is acceptable by swapon.
+	 */
+	if (compcache.backing_dev == NULL) {
+		s->info.version = 1;
+		s->info.last_page = compcache.disksize >> PAGE_SHIFT;
+		s->info.nr_badpages = 0;
+		memcpy(s->magic.magic, "SWAPSPACE2", 10);
+		return 0;
+	}
+
+	/*
+	 * We have a backing swap device. Copy its swap header
+	 * to compcache swap header. If this header contains
+	 * invalid information (backing device not a swap
+	 * partition, etc.), swapon will fail for compcache
+	 * which is correct behavior - we don't want to
+	 * swap over filesystem partition!
+	 */
+
+	/*
+	 * Read the backing swap header.
+	 * (code from sys_swapon)
+	 */
+
+	mapping = compcache.swap_file->f_mapping;
+	if (!mapping->a_ops->readpage) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	page = read_mapping_page(mapping, 0, compcache.swap_file);
+	if (IS_ERR(page)) {
+		ret = PTR_ERR(page);
+		goto out;
+	}
+
+	backing_dev_header = kmap(page);
+	*s = *backing_dev_header;
+	kunmap(page);
+
+out:
+	return ret;
+}
+
+static void compcache_set_disksize(size_t totalram_bytes)
+{
+	compcache.disksize = disksize_kb << 10;
+
+	if (!disksize_kb) {
+		pr_info(C
+		"disk size not provided. You can use disksize_kb module "
+		"param to specify size.\nUsing default: (%u%% of RAM).\n",
+		DEFAULT_DISKSIZE_PERC_RAM
+		);
+		compcache.disksize = DEFAULT_DISKSIZE_PERC_RAM *
+					(totalram_bytes / 100);
+	}
+
+	if (disksize_kb > 2 * (totalram_bytes >> 10)) {
+		pr_info(C
+		"There is little point in creating a compcache greater than "
+		"twice the size of memory since we expect a 2:1 compression "
+		"ratio. Note that compcache uses about 0.1%% of the size of "
+		"the swap device when not in use, so a huge compcache is "
+		"wasteful.\n"
+		"\tMemory Size: %zu kB\n"
+		"\tSize you selected: %lu kB\n"
+		"Continuing anyway ...\n",
+		totalram_bytes >> 10, disksize_kb
+		);
+	}
+
+	compcache.disksize &= PAGE_MASK;
+
+	pr_info(C "disk size set to %zu kB\n", compcache.disksize >> 10);
+}
+
+/*
+ * memlimit cannot be greater than backing disk size.
+ */
+static void compcache_set_memlimit(size_t totalram_bytes)
+{
+	int memlimit_valid = 1;
+	compcache.memlimit = memlimit_kb << 10;
+
+	if (!compcache.memlimit) {
+		pr_info(C "memory limit not set. You can use "
+			"memlimit_kb module param to specify limit.");
+		memlimit_valid = 0;
+	}
+
+	if (compcache.memlimit > compcache.disksize) {
+		pr_info(C "memory limit cannot be greater than "
+			"disksize: limit=%zu, disksize=%zu",
+			compcache.memlimit,
+			compcache.disksize);
+		memlimit_valid = 0;
+	}
+
+	if (!memlimit_valid) {
+		size_t mempart, disksize;
+		pr_info(C "\nUsing default: MIN[(%u%% of RAM), "
+			"(backing disk size)].\n",
+			DEFAULT_MEMLIMIT_PERC_RAM);
+		mempart = DEFAULT_MEMLIMIT_PERC_RAM * (totalram_bytes / 100);
+		disksize = compcache.disksize;
+		compcache.memlimit = mempart > disksize ? disksize : mempart;
+	}
+
+	if (compcache.memlimit > totalram_bytes / 2) {
+		pr_info(C
+		"It is not advisable to set the limit to more than half the "
+		"size of memory since we expect a 2:1 compression ratio. "
+		"The limit represents the amount of *compressed* data we can "
+		"keep in memory!\n"
+		"\tMemory Size: %zu kB\n"
+		"\tLimit you selected: %lu kB\n"
+		"Continuing anyway ...\n",
+		totalram_bytes >> 10, memlimit_kb
+		);
+	}
+
+	compcache.memlimit &= PAGE_MASK;
+	BUG_ON(!compcache.memlimit);
+
+	pr_info(C "memory limit set to %zu kB\n", compcache.memlimit >> 10);
+}
+
+static int __init compcache_init(void)
+{
+	int ret;
+	size_t num_pages, totalram_bytes;
+	struct sysinfo i;
+	struct page *page;
+	void *swap_header;
+
+	mutex_init(&compcache.lock);
+
+	ret = setup_backing_device();
+	if (ret)
+		goto fail;
+
+	si_meminfo(&i);
+	/* Note: i.totalram is in pages, not bytes */
+	totalram_bytes = i.totalram << PAGE_SHIFT;
+
+	if (compcache.backing_dev)
+		compcache_set_memlimit(totalram_bytes);
+	else
+		compcache_set_disksize(totalram_bytes);
+
+	compcache.compress_workmem = kmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+	if (compcache.compress_workmem == NULL) {
+		pr_err(C "Error allocating compressor working memory\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	compcache.compress_buffer = kmalloc(2 * PAGE_SIZE, GFP_KERNEL);
+	if (compcache.compress_buffer == NULL) {
+		pr_err(C "Error allocating compressor buffer space\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	num_pages = compcache.disksize >> PAGE_SHIFT;
+	compcache.table = vmalloc(num_pages * sizeof(*compcache.table));
+	if (compcache.table == NULL) {
+		pr_err(C "Error allocating compcache address table\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+	memset(compcache.table, 0, num_pages * sizeof(*compcache.table));
+
+	page = alloc_page(__GFP_ZERO);
+	if (page == NULL) {
+		pr_err(C "Error allocating swap header page\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+	compcache.table[0].pageNum = page_to_pfn(page);
+	set_page_uncompressed(0);
+
+	swap_header = kmap(page);
+	ret = setup_swap_header((union swap_header *)(swap_header));
+	kunmap(page);
+	if (ret) {
+		pr_err(C "Error setting swap header\n");
+		goto fail;
+	}
+
+	compcache.disk = alloc_disk(1);
+	if (compcache.disk == NULL) {
+		pr_err(C "Error allocating disk structure\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	compcache.disk->first_minor = 0;
+	compcache.disk->fops = &compcache_devops;
+	/*
+	 * It is named like this to prevent distro installers
+	 * from offering compcache as installation target. They
+	 * seem to ignore all devices beginning with 'ram'
+	 */
+	strcpy(compcache.disk->disk_name, "ramzswap0");
+
+	compcache.disk->major = register_blkdev(0, compcache.disk->disk_name);
+	if (compcache.disk->major < 0) {
+		pr_err(C "Cannot register block device\n");
+		ret = -EFAULT;
+		goto fail;
+	}
+
+	compcache.disk->queue = blk_alloc_queue(GFP_KERNEL);
+	if (compcache.disk->queue == NULL) {
+		pr_err(C "Cannot register disk queue\n");
+		ret = -EFAULT;
+		goto fail;
+	}
+
+	set_capacity(compcache.disk, compcache.disksize >> SECTOR_SHIFT);
+	blk_queue_make_request(compcache.disk->queue, compcache_make_request);
+
+	/*
+	 * Assuming the backing device is of "rotational" type.
+	 * TODO: check if it is actually "non-rotational" (SSD).
+	 *
+	 * We have an identity mapping of sectors between compcache
+	 * and the backing swap device. So, this queue flag
+	 * should be set according to the backing dev.
+	 */
+	if (!compcache.backing_dev) {
+		queue_flag_set_unlocked(QUEUE_FLAG_NONROT,
+					compcache.disk->queue);
+	}
+	blk_queue_set_discard(compcache.disk->queue,
+				compcache_prepare_discard);
+	blk_queue_hardsect_size(compcache.disk->queue, PAGE_SIZE);
+	add_disk(compcache.disk);
+
+	compcache.mem_pool = xvCreateMemPool();
+	if (compcache.mem_pool == INVALID_POOL_ID) {
+		pr_err(C "Error creating memory pool\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+#if defined(STATS)
+	proc = create_proc_entry("compcache", S_IRUGO, NULL);
+	if (proc)
+		proc->read_proc = &proc_compcache_read;
+	else {
+		ret = -ENOMEM;
+		pr_warning(C "Error creating proc entry\n");
+		goto fail;
+	}
+#endif
+
+	/*
+	 * Pages that compress to size greater than this are forwarded
+	 * to physical swap disk (if backing dev is provided)
+	 */
+	if (compcache.backing_dev)
+		MAX_CPAGE_SIZE = MAX_CPAGE_SIZE_BDEV;
+	else
+		MAX_CPAGE_SIZE = MAX_CPAGE_SIZE_NOBDEV;
+
+	pr_debug(C "Max compressed page size: %u bytes\n", MAX_CPAGE_SIZE);
+
+	pr_debug(C "Initialization done!\n");
+	return 0;
+
+fail:
+	if (compcache.disk != NULL) {
+		if (compcache.disk->major > 0)
+			unregister_blkdev(compcache.disk->major,
+					compcache.disk->disk_name);
+		del_gendisk(compcache.disk);
+	}
+
+	if (compcache.table && compcache.table[0].pageNum)
+		__free_page(pfn_to_page(compcache.table[0].pageNum));
+	kfree(compcache.compress_workmem);
+	kfree(compcache.compress_buffer);
+	vfree(compcache.table);
+	xvDestroyMemPool(compcache.mem_pool);
+#if defined(STATS)
+	if (proc)
+		remove_proc_entry("compcache", proc->parent);
+#endif
+	pr_err(C "Initialization failed: err=%d\n", ret);
+	return ret;
+}
+
+static void __exit compcache_exit(void)
+{
+	size_t index, num_pages;
+	num_pages = compcache.disksize >> PAGE_SHIFT;
+
+	unregister_blkdev(compcache.disk->major, compcache.disk->disk_name);
+	del_gendisk(compcache.disk);
+
+	/* Close backing swap device (if present) */
+	if (compcache.backing_dev) {
+		set_blocksize(compcache.backing_dev, compcache.old_block_size);
+		bd_release(compcache.backing_dev);
+		filp_close(compcache.swap_file, NULL);
+	}
+
+	__free_page(pfn_to_page(compcache.table[0].pageNum));
+	kfree(compcache.compress_workmem);
+	kfree(compcache.compress_buffer);
+
+	/* Free all pages that are still in compcache */
+	for (index = 1; index < num_pages; index++) {
+		if (!compcache.table[index].pageNum)
+			continue;
+
+		if (unlikely(is_page_uncompressed(index))) {
+			__free_page(pfn_to_page(
+				compcache.table[index].pageNum));
+		} else {
+			xvFree(compcache.mem_pool,
+				compcache.table[index].pageNum,
+				compcache.table[index].offset);
+		}
+	}
+
+	vfree(compcache.table);
+	xvDestroyMemPool(compcache.mem_pool);
+
+#if defined(STATS)
+	remove_proc_entry("compcache", proc->parent);
+#endif
+	pr_debug(C "cleanup done!\n");
+}
+
+/*
+ * This param is applicable only when there is no backing swap device.
+ * We ignore this param in case a backing dev is provided, since then it is
+ * always equal to the size of the backing swap device.
+ *
+ * This size refers to the amount of (uncompressed) data it can hold.
+ * For example, disksize_kb=1024 means it can hold 1024 kB worth of
+ * uncompressed data even if this data compresses to just, say, 100 kB.
+ *
+ * The default value is used if this param is missing or 0 (when applicable).
+ * Default: [DEFAULT_DISKSIZE_PERC_RAM]% of RAM
+ */
+module_param(disksize_kb, ulong, 0);
+MODULE_PARM_DESC(disksize_kb, "compcache device size (kB)");
+
+/*
+ * This param is applicable only when a backing swap device is provided.
+ * It refers to the limit on the amount of (compressed) data it can hold in
+ * memory. Note that the total amount of memory used (MemUsedTotal) can
+ * exceed this memlimit, since that includes memory wastage due to
+ * fragmentation and metadata overhead.
+ *
+ * Any additional data beyond this limit is forwarded to the backing
+ * swap device. TODO: allow changing memlimit at runtime.
+ *
+ * The default value is used if this param is missing or 0 (when applicable).
+ * Default: MIN([DEFAULT_MEMLIMIT_PERC_RAM]% of RAM, Backing Device Size)
+ */
+module_param(memlimit_kb, ulong, 0);
+MODULE_PARM_DESC(memlimit_kb, "compcache memory limit (kB)");
+
+/*
+ * This is the block device to be used as backing store for compcache.
+ * When more than memlimit_kb worth of pages is swapped to compcache, we
+ * store any additional pages in this device. We may also move some pages
+ * from compcache to this device in case the system is really low on
+ * memory (TODO).
+ *
+ * This device is not directly visible to the kernel as a swap device
+ * (/proc/swaps will only show /dev/ramzswap0 and not this device).
+ * Managing this backing device is the job of the compcache module.
+ */
+module_param(backing_dev, charp, 0);
+MODULE_PARM_DESC(backing_dev, "Backing swap partition");
+
+module_init(compcache_init);
+module_exit(compcache_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nitin Gupta <ngupta@vflare.org>");
+MODULE_DESCRIPTION("Compressed RAM Based Swap Device");
diff --git a/drivers/block/compcache.h b/drivers/block/compcache.h
new file mode 100644
index 0000000..23a58e7
--- /dev/null
+++ b/drivers/block/compcache.h
@@ -0,0 +1,160 @@
+/*
+ * Compressed RAM based swap device
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This RAM based block device acts as swap disk.
+ * Pages swapped to this device are compressed and
+ * stored in memory.
+ *
+ * Released under the terms of the GNU General Public
+ * License (version 2). See linux/COPYING for more information.
+ *
+ * Project home: http://code.google.com/p/compcache
+ */
+
+#ifndef _COMPCACHE_H_
+#define _COMPCACHE_H_
+
+#include <linux/xvmalloc.h>
+
+/*
+ * Stored at the beginning of each compressed object.
+ *
+ * It stores a back-reference to the table entry which points
+ * to this object. This will be required when we implement
+ * memory defragmentation or migrating compressed pages
+ * to swap disk.
+ */
+struct zobj_header {
+	u32 table_idx;
+};
+
+/*-- Configurable parameters */
+
+/* Default compcache disk size: 25% of total RAM */
+#define DEFAULT_DISKSIZE_PERC_RAM	25
+#define DEFAULT_MEMLIMIT_PERC_RAM	15
+
+/*
+ * Max compressed page size when backing device is provided.
+ * Pages that compress to size greater than this are sent to
+ * physical swap disk.
+ */
+#define MAX_CPAGE_SIZE_BDEV	(PAGE_SIZE / 2)
+
+/*
+ * Max compressed page size when there is no backing dev.
+ * Pages that compress to size greater than this are stored
+ * uncompressed in memory.
+ */
+#define MAX_CPAGE_SIZE_NOBDEV	(PAGE_SIZE / 4 * 3)
+
+/*
+ * NOTE: MAX_CPAGE_SIZE_{BDEV,NOBDEV} sizes must be
+ * less than or equal to:
+ *   XV_MAX_ALLOC_SIZE - sizeof(struct zobj_header)
+ * since otherwise xvMalloc would always return failure.
+ */
+
+/*-- End of configurable params */
+
+#define SECTOR_SHIFT		9
+#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
+#define SECTORS_PER_PAGE_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define SECTORS_PER_PAGE	(1 << SECTORS_PER_PAGE_SHIFT)
+
+/* Message prefix */
+#define C "compcache: "
+
+/* Debugging and Stats */
+#define NOP	do { } while (0)
+
+#if defined(CONFIG_BLK_DEV_COMPCACHE_STATS)
+#define STATS
+#endif
+
+#if defined(STATS)
+#define stat_inc(stat)			((stat)++)
+#define stat_dec(stat)			((stat)--)
+#define stat_inc_if_less(stat, val1, val2) \
+				((stat) += ((val1) < (val2) ? 1 : 0))
+#define stat_dec_if_less(stat, val1, val2) \
+				((stat) -= ((val1) < (val2) ? 1 : 0))
+#else	/* STATS */
+#define stat_inc(x)			NOP
+#define stat_dec(x)			NOP
+#define stat_inc_if_less(x, v1, v2)	NOP
+#define stat_dec_if_less(x, v1, v2)	NOP
+#endif	/* STATS */
+
+/* Flags for compcache pages (table[page_no].flags) */
+enum cc_pageflags {
+	/* Page is stored uncompressed */
+	CC_uncompressed,
+
+	/* Page consists entirely of zeros */
+	CC_zero,
+
+	__NR_CC_PAGEFLAGS,
+};
+
+/*-- Data structures */
+
+/* Indexed by page no. */
+struct table {
+	u32 pageNum;
+	u16 offset;
+	u8 count;	/* object ref count (not yet used) */
+	u8 flags;
+};
+
+struct compcache {
+	struct Pool *mem_pool;
+	void *compress_workmem;
+	void *compress_buffer;
+	struct table *table;
+	struct mutex lock;
+	struct gendisk *disk;
+	/*
+	 * This is the limit on compressed data size (stats.compr_size).
+	 * It is applicable only when a backing swap device is present.
+	 */
+	size_t memlimit;	/* bytes */
+	/*
+	 * This is the limit on the amount of *uncompressed* data
+	 * we can hold. When a backing swap device is provided, it is
+	 * set equal to the device size.
+	 */
+	size_t disksize;	/* bytes */
+
+	/* backing swap device info */
+	struct block_device *backing_dev;
+	struct file *swap_file;
+	int old_block_size;
+};
+
+struct compcache_stats {
+	/* basic stats */
+	size_t compr_size;	/* compressed size of pages stored -
+				 * needed to enforce memlimit */
+	/* more stats */
+#if defined(STATS)
+	u64 num_reads;		/* failed + successful */
+	u64 num_writes;		/* --do-- */
+	u64 failed_reads;	/* can happen when memory is too low */
+	u64 failed_writes;	/* should NEVER! happen */
+	u64 invalid_io;		/* non-swap I/O requests */
+	u64 pages_discard;	/* no. of pages freed by discard callback */
+	u32 pages_zero;		/* no. of zero filled pages */
+	u32 pages_stored;	/* no. of pages currently stored */
+	u32 good_compress;	/* no. of pages with compression ratio<=50% */
+	u32 pages_expand;	/* no. of incompressible pages */
+	u64 bdev_num_reads;	/* no. of reads on backing dev */
+	u64 bdev_num_writes;	/* no. of writes on backing dev */
+#endif
+};
+/*-- */
+
+#endif



* [PATCH 2/3]: xvmalloc memory allocator
  2009-03-17 11:34 [PATCH 0/3]: compressed in-memory swapping Nitin Gupta
  2009-03-17 11:36 ` [PATCH 1/3]: compressed RAM block device Nitin Gupta
@ 2009-03-17 11:37 ` Nitin Gupta
  2009-03-17 16:41   ` Christoph Lameter
  2009-03-17 11:38 ` [PATCH 3/3]: documentation Nitin Gupta
  2009-03-17 23:34 ` [PATCH 0/3]: compressed in-memory swapping Russ Dill
  3 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-17 11:37 UTC (permalink / raw)
  To: linux-kernel

  include/linux/xvmalloc.h |   30 +++
  init/Kconfig             |   14 +
  mm/Makefile              |    1 +
  mm/xvmalloc.c            |  605 ++++++++++++++++++++++++++++++++++++++++++++++
  mm/xvmalloc_int.h        |   91 +++++++
  5 files changed, 741 insertions(+), 0 deletions(-)

xvMalloc is a memory allocator designed specifically for the compcache project.

* Features:
  - Low metadata overhead (just 4 bytes per object)
  - O(1) Alloc/Free - except when we have to call the system page allocator to
	get additional memory.
  - Very low fragmentation: in all tests, xvMalloc memory usage is within 12%
	of "Ideal".

One of the main highlights is that it maps pages only when required.
So, it does not hog the vmalloc area, which is very small on 32-bit systems.
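
As a quick illustration (this sketch is not part of the patch set; the
example_store() helper and its parameters are made up for this mail, while the
xvMalloc()/xvCreateMemPool() signatures are those declared in
include/linux/xvmalloc.h below and the kmap_atomic()/KM_USER0 idiom matches
what compcache.c uses), a caller gets back a <pagenum, offset> location rather
than a pointer and maps the page only while touching the object:

/* Illustrative only -- not part of this patch. */
#include <linux/errno.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/xvmalloc.h>

/*
 * Store 'len' bytes from 'src' in 'pool'; on success the object's
 * location is returned via the <pagenum, offset> pair, the same way
 * compcache remembers a compressed page.
 */
static int example_store(struct Pool *pool, const void *src, u32 len,
			 u32 *pagenum, u32 *offset)
{
	unsigned char *dst;

	/* Fails for len == 0 or len > XV_MAX_ALLOC_SIZE */
	if (xvMalloc(pool, len, pagenum, offset))
		return -ENOMEM;

	/* Map the page only for as long as we need to touch it */
	dst = kmap_atomic(pfn_to_page(*pagenum), KM_USER0);
	memcpy(dst + *offset, src, len);
	kunmap_atomic(dst, KM_USER0);

	return 0;
}

The pool itself would come from xvCreateMemPool(), and the object is later
released with xvFree(pool, *pagenum, *offset).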

The SLUB allocator could not be used due to fragmentation issues:
http://code.google.com/p/compcache/wiki/AllocatorsComparison
The data there shows kmalloc using ~43% more memory than TLSF, while xvMalloc
shows ~2% better space efficiency than TLSF (due to smaller metadata).

* Implementation:
It uses a two-level bitmap search to find a free list containing a block of
the correct size. This idea is taken from the TLSF (Two-Level Segregate Fit)
allocator and is well explained in its paper (see [Links] below); a rough
sketch of the lookup follows the highlights below.
Highlights:
  - Pool based allocator: each pool can grow/shrink.
  - Immediate coalescing of free blocks.
  - Maps/Unmaps memory pages only when required.
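
As a rough sketch of the two-level lookup mentioned above (the names
fl_bitmap/sl_bitmap/find_nonempty_list are invented for illustration and
__builtin_ctz stands in for the kernel's bit-search helpers; the real code is
FindBlock() in mm/xvmalloc.c, which also returns the block's location):

#include <stdint.h>

#define SL_BITS		32		/* one 32-bit bitmap per group */
#define SL_SHIFT	5
#define SL_MASK		(SL_BITS - 1)
#define NUM_GROUPS	8		/* arbitrary for this sketch */

struct two_level_map {
	uint32_t fl_bitmap;		/* bit i set => sl_bitmap[i] != 0 */
	uint32_t sl_bitmap[NUM_GROUPS];	/* one bit per free list */
};

/* Find the index of a non-empty free list at or above 'start'; -1 if none. */
static int find_nonempty_list(const struct two_level_map *map, unsigned int start)
{
	unsigned int fl_idx = start >> SL_SHIFT;
	uint32_t sl, fl;

	/* 1) Look in the group containing 'start', from 'start' upwards. */
	sl = map->sl_bitmap[fl_idx] >> (start & SL_MASK);
	if (sl)
		return start + __builtin_ctz(sl);

	/* 2) Use the first-level bitmap to jump to the next non-empty group. */
	fl = map->fl_bitmap >> (fl_idx + 1);
	if (!fl)
		return -1;
	fl_idx += 1 + __builtin_ctz(fl);

	/* That group is known non-empty, so its second-level bitmap has a bit set. */
	return (fl_idx << SL_SHIFT) + __builtin_ctz(map->sl_bitmap[fl_idx]);
}

Both steps cost only a few bit operations, which is why the alloc/free fast
path is O(1).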

* Limitations:
  - Poor scalability: No per-cpu data structures (work in progress).

[Links]
1. Details and Performance data:
http://code.google.com/p/compcache/wiki/xvMalloc
http://code.google.com/p/compcache/wiki/xvMallocPerformance

2. TLSF memory allocator:
home: http://rtportal.upv.es/rtmalloc/
paper: http://rtportal.upv.es/rtmalloc/files/MRBC_2008.pdf

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---

diff --git a/include/linux/xvmalloc.h b/include/linux/xvmalloc.h
new file mode 100755
index 0000000..92366bd
--- /dev/null
+++ b/include/linux/xvmalloc.h
@@ -0,0 +1,30 @@
+/*
+ * xvmalloc.h
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: GPL/LGPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of the GNU General Public License Version 2.0
+ * Released under the terms of the GNU Lesser General Public License Version 2.1
+ */
+
+#ifndef _XVMALLOC_H_
+#define _XVMALLOC_H_
+
+#define INVALID_POOL_ID	NULL
+#define INVALID_PGNUM	((u32)(-1))
+
+struct Pool;
+
+struct Pool *xvCreateMemPool(void);
+void xvDestroyMemPool(struct Pool *pool);
+
+int xvMalloc(struct Pool *pool, u32 size, u32 *pageNum, u32 *offset);
+void xvFree(struct Pool *pool, u32 pageNum, u32 offset);
+
+u32 xvGetObjectSize(void *obj);
+u64 xvGetTotalSizeBytes(struct Pool *pool);
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 6a5c5fe..5e722cc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -930,6 +930,20 @@ config SLOB

  endchoice

+config XVMALLOC
+	tristate "xvMalloc memory allocator"
+	help
+	   This is a simple, low fragmentation, O(1) allocator.
+	   Details: http://code.google.com/p/compcache/wiki/xvMalloc
+
+config XVMALLOC_STATS
+	depends on XVMALLOC
+	bool "Collect statistics"
+	default y
+	help
+	  Collect basic allocator statistics with minimal overhead.
+	  If unsure, say Y.
+
  config PROFILING
  	bool "Profiling support (EXPERIMENTAL)"
  	help
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..b6b705f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
  obj-$(CONFIG_SLAB) += slab.o
  obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_XVMALLOC) += xvmalloc.o
  obj-$(CONFIG_FAILSLAB) += failslab.o
  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
  obj-$(CONFIG_FS_XIP) += filemap_xip.o
diff --git a/mm/xvmalloc.c b/mm/xvmalloc.c
new file mode 100755
index 0000000..3fcc012
--- /dev/null
+++ b/mm/xvmalloc.c
@@ -0,0 +1,605 @@
+/*
+ * xvmalloc.c
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: GPL/LGPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of the GNU General Public License Version 2.0
+ * Released under the terms of the GNU Lesser General Public License Version 2.1
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/xvmalloc.h>
+
+#include "xvmalloc_int.h"
+
+#ifdef CONFIG_XVMALLOC_STATS
+static void statInc(u64 *value)
+{
+	*value = *value + 1;
+}
+
+static void statDec(u64 *value)
+{
+	*value = *value - 1;
+}
+#else
+#define statInc(x) do { } while (0)
+#define statDec(x) do { } while (0)
+#endif
+
+static void SetBit(u32 *value, u32 bitIdx)
+{
+	*value |= (u32)(1 << bitIdx);
+}
+
+static void ClearBit(u32 *value, u32 bitIdx)
+{
+	*value &= (u32)(~(1 << bitIdx));
+}
+
+static u32 IsBlockFree(struct BlockHeader *block)
+{
+	return block->prev & BLOCK_FREE;
+}
+
+static u32 IsPrevBlockFree(struct BlockHeader *block)
+{
+	return block->prev & PREV_FREE;
+}
+
+static void SetBlockFree(struct BlockHeader *block)
+{
+	block->prev |= BLOCK_FREE;
+}
+
+static void SetBlockPrevFree(struct BlockHeader *block)
+{
+	block->prev |= PREV_FREE;
+}
+
+static void SetBlockUsed(struct BlockHeader *block)
+{
+	block->prev &= ~BLOCK_FREE;
+}
+
+static void SetBlockPrevUsed(struct BlockHeader *block)
+{
+	block->prev &= ~PREV_FREE;
+}
+
+static u16 GetBlockPrev(struct BlockHeader *block)
+{
+	return block->prev & PREV_MASK;
+}
+
+static void SetBlockPrev(struct BlockHeader *block, u16 newOffset)
+{
+	block->prev = newOffset | (block->prev & FLAGS_MASK);
+}
+
+/*
+ * Get index of free list containing blocks of maximum size
+ * which is less than or equal to given size.
+ */
+static u32 GetIndexForInsert(u32 size)
+{
+	size = size > XV_MAX_ALLOC_SIZE ? XV_MAX_ALLOC_SIZE : size;
+	size &= ~FL_DELTA_MASK;
+	return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Get index of free list having blocks of size greater than
+ * or equal to requested size.
+ */
+static u32 GetIndex(u32 size)
+{
+	size = (size + FL_DELTA_MASK) & ~FL_DELTA_MASK;
+	return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Given <pageNum, offset> pair, provide a dereferenceable pointer.
+ * This is called from xvMalloc/xvFree path, so it needs to be fast.
+ */
+static void *GetPtrAtomic(u32 pageNum, u16 offset, enum km_type type)
+{
+	unsigned char *pagePtr;
+
+	pagePtr = kmap_atomic(pfn_to_page(pageNum), type);
+	return pagePtr + offset;
+}
+
+static void PutPtrAtomic(void *ptr, enum km_type type)
+{
+	kunmap_atomic(ptr, type);
+}
+
+/*
+ * Allocate a memory page. Called when a pool needs to grow.
+ */
+static u32 AllocPage(void)
+{
+	struct page *page;
+
+	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
+
+	if (unlikely(!page))
+		return INVALID_PGNUM;
+
+	return page_to_pfn(page);
+}
+
+/*
+ * Called when all objects in a page are freed.
+ */
+static void FreePage(u32 pageNum)
+{
+	__free_page(pfn_to_page(pageNum));
+}
+
+/**
+ * FindBlock - find block of at least given size
+ * @pool: memory pool to search from
+ * @size: size of block required
+ * @pageNum: page no. containing required block
+ * @offset: offset within the page where block is located.
+ *
+ * Searches the two-level bitmap to locate a block of at least
+ * the given size. If such a block is found, it provides
+ * <pageNum, offset> to identify this block and returns the index
+ * in FreeList where we found this block.
+ * Otherwise, returns 0 and the <pageNum, offset> params are not touched.
+ */
+static u32 FindBlock(struct Pool *pool, u32 size, u32 *pageNum, u32 *offset)
+{
+	u32 flBitmap, slBitmap;
+	u32 flIndex, slIndex, slBitStart;
+
+	/* There are no free blocks in this pool */
+	if (!pool->flBitmap)
+		return 0;
+
+	if (unlikely(size < XV_MIN_ALLOC_SIZE))
+		size = XV_MIN_ALLOC_SIZE;
+
+	/* Get the FreeList index corresponding to this size */
+	slIndex = GetIndex(size);
+	slBitmap = pool->slBitmap[slIndex >> BITMAP_SHIFT];
+	slBitStart = slIndex & BITMAP_MASK;
+
+	/*
+	 * If FreeList is not empty at this index, we found the
+	 * block - head of this list. This is approximate best-fit match.
+	 */
+	if (slBitmap & (1 << slBitStart)) {
+		*pageNum = pool->FreeList[slIndex].pageNum;
+		*offset = pool->FreeList[slIndex].offset;
+		return slIndex;
+	}
+
+	/*
+	 * No best-fit found. Search a bit further in bitmap for a free block.
+	 * Second level bitmap consists of series of 32-bit chunks. Search
+	 * further in the chunk where we expected a best-fit, starting from
+	 * index location found above.
+	 */
+	slBitStart++;
+	slBitmap >>= slBitStart;
+
+	/* Skip this search if we were already at end of this bitmap chunk */
+	if ((slBitStart != BITMAP_BITS) && slBitmap) {
+		slIndex += ffs(slBitmap);
+		*pageNum = pool->FreeList[slIndex].pageNum;
+		*offset = pool->FreeList[slIndex].offset;
+		return slIndex;
+	}
+
+	/* Now do a full two-level bitmap search to find next nearest fit */
+	flIndex = slIndex >> BITMAP_SHIFT;
+
+	flBitmap = (pool->flBitmap) >> (flIndex + 1);
+	if (!flBitmap)
+		return 0;
+
+	flIndex += ffs(flBitmap);
+	slBitmap = pool->slBitmap[flIndex];
+	slIndex = (flIndex << BITMAP_SHIFT) + ffs(slBitmap) - 1;
+	*pageNum = pool->FreeList[slIndex].pageNum;
+	*offset = pool->FreeList[slIndex].offset;
+
+	return slIndex;
+}
+
+/*
+ * Insert block at <pageNum, offset> in freelist of given pool.
+ * FreeList used depends on block size.
+ */
+static void InsertBlock(struct Pool *pool, u32 pageNum, u32 offset,
+			struct BlockHeader *block)
+{
+	u32 flIndex, slIndex;
+	struct BlockHeader *nextBlock;
+
+	slIndex = GetIndexForInsert(block->size);
+	flIndex = slIndex >> BITMAP_SHIFT;
+
+	block->link.prevPageNum = INVALID_PGNUM;
+	block->link.prevOffset = 0;
+	block->link.nextPageNum = pool->FreeList[slIndex].pageNum;
+	block->link.nextOffset = pool->FreeList[slIndex].offset;
+	pool->FreeList[slIndex].pageNum = pageNum;
+	pool->FreeList[slIndex].offset = offset;
+
+	if (block->link.nextPageNum != INVALID_PGNUM) {
+		nextBlock = (struct BlockHeader *)GetPtrAtomic(
+				block->link.nextPageNum,
+				block->link.nextOffset, KM_USER1
+			);
+		nextBlock->link.prevPageNum = pageNum;
+		nextBlock->link.prevOffset = offset;
+		PutPtrAtomic(nextBlock, KM_USER1);
+	}
+
+	SetBit(&pool->slBitmap[flIndex], slIndex & BITMAP_MASK);
+	SetBit(&pool->flBitmap, flIndex);
+}
+
+/*
+ * Remove block from head of freelist. Index 'slIndex' identifies the FreeList.
+ */
+static void RemoveBlockHead(struct Pool *pool, struct BlockHeader *block,
+			    u32 slIndex)
+{
+	struct BlockHeader *tmpBlock;
+	u32 flIndex = slIndex >> BITMAP_SHIFT;
+
+	pool->FreeList[slIndex].pageNum = block->link.nextPageNum;
+	pool->FreeList[slIndex].offset = block->link.nextOffset;
+	block->link.prevPageNum = INVALID_PGNUM;
+	block->link.prevOffset = 0;
+
+	if (pool->FreeList[slIndex].pageNum == INVALID_PGNUM) {
+		ClearBit(&pool->slBitmap[flIndex], slIndex & BITMAP_MASK);
+		if (!pool->slBitmap[flIndex])
+			ClearBit(&pool->flBitmap, flIndex);
+	} else {
+		/*
+		 * DEBUG ONLY: We need not reinitialize the freelist head's
+		 * previous pointer to INVALID_PGNUM - we never depend on its
+		 * value. But just for sanity, let's keep it.
+		 */
+		tmpBlock = (struct BlockHeader *)GetPtrAtomic(
+				pool->FreeList[slIndex].pageNum,
+				pool->FreeList[slIndex].offset, KM_USER1
+			);
+		tmpBlock->link.prevPageNum = INVALID_PGNUM;
+		tmpBlock->link.prevOffset = 0;
+		PutPtrAtomic(tmpBlock, KM_USER1);
+	}
+}
+
+/*
+ * Remove block from freelist. Index 'slIndex' identifies the FreeList.
+ */
+static void RemoveBlock(struct Pool *pool, u32 pageNum, u32 offset,
+			struct BlockHeader *block, u32 slIndex)
+{
+	u32 flIndex;
+	struct BlockHeader *tmpBlock;
+
+	if (pool->FreeList[slIndex].pageNum == pageNum
+	   && pool->FreeList[slIndex].offset == offset) {
+		RemoveBlockHead(pool, block, slIndex);
+		return;
+	}
+
+	flIndex = slIndex >> BITMAP_SHIFT;
+
+	if (block->link.prevPageNum != INVALID_PGNUM) {
+		tmpBlock = (struct BlockHeader *)GetPtrAtomic(
+				block->link.prevPageNum,
+				block->link.prevOffset, KM_USER1
+			);
+		tmpBlock->link.nextPageNum = block->link.nextPageNum;
+		tmpBlock->link.nextOffset = block->link.nextOffset;
+		PutPtrAtomic(tmpBlock, KM_USER1);
+	}
+
+	if (block->link.nextPageNum != INVALID_PGNUM) {
+		tmpBlock = (struct BlockHeader *)GetPtrAtomic(
+				block->link.nextPageNum,
+				block->link.nextOffset, KM_USER1
+			);
+		tmpBlock->link.prevPageNum = block->link.prevPageNum;
+		tmpBlock->link.prevOffset = block->link.prevOffset;
+		PutPtrAtomic(tmpBlock, KM_USER1);
+	}
+
+	return;
+}
+
+/*
+ * Allocate a page and add it to the FreeList of the given pool.
+ */
+static int GrowPool(struct Pool *pool)
+{
+	u32 pageNum;
+	struct BlockHeader *block;
+
+	pageNum = AllocPage();
+	if (unlikely(pageNum == INVALID_PGNUM))
+		return -ENOMEM;
+
+	statInc(&pool->totalPages);
+
+	spin_lock(&pool->lock);
+	block = GetPtrAtomic(pageNum, 0, KM_USER0);
+
+	block->size = PAGE_SIZE - XV_ALIGN;
+	SetBlockFree(block);
+	SetBlockPrevUsed(block);
+	SetBlockPrev(block, 0);
+
+	InsertBlock(pool, pageNum, 0, block);
+
+	PutPtrAtomic(block, KM_USER0);
+	spin_unlock(&pool->lock);
+
+	return 0;
+}
+
+/*
+ * Create a memory pool. Allocates FreeList, bitmaps and other
+ * per-pool metadata.
+ */
+struct Pool *xvCreateMemPool(void)
+{
+	int i;
+	void *ovhdMem;
+	struct Pool *pool;
+	u32 poolOvhd;
+
+	poolOvhd = ROUNDUP(sizeof(*pool), PAGE_SIZE);
+	ovhdMem = kmalloc(poolOvhd, GFP_KERNEL);
+	if (!ovhdMem)
+		return INVALID_POOL_ID;
+
+	pool = ovhdMem;
+	memset(pool, 0, poolOvhd);
+
+	for (i = 0; i < NUM_FREE_LISTS; i++)
+		pool->FreeList[i].pageNum = INVALID_PGNUM;
+
+	spin_lock_init(&pool->lock);
+
+	return pool;
+}
+EXPORT_SYMBOL_GPL(xvCreateMemPool);
+
+void xvDestroyMemPool(struct Pool *pool)
+{
+	kfree(pool);
+}
+EXPORT_SYMBOL_GPL(xvDestroyMemPool);
+
+/**
+ * xvMalloc - Allocate block of given size from pool.
+ * @pool: pool to allocate from
+ * @size: size of block to allocate
+ * @pageNum: page no. that holds the object
+ * @offset: location of object within pageNum
+ *
+ * On success, <pageNum, offset> identifies block allocated
+ * and 0 is returned. On failure, <pageNum, offset> is not touched
+ * and -ENOMEM is returned.
+ *
+ * Allocation requests with size > XV_MAX_ALLOC_SIZE will fail.
+ */
+int xvMalloc(struct Pool *pool, u32 size, u32 *pageNum, u32 *offset)
+{
+	int error;
+	u32 index, tmpSize, origSize, tmpOffset;
+	struct BlockHeader *block, *tmpBlock = NULL;
+
+	*pageNum = INVALID_PGNUM;
+	*offset = 0;
+	origSize = size;
+
+	if (unlikely(!size || size > XV_MAX_ALLOC_SIZE))
+		return -ENOMEM;
+
+	if (unlikely(size < XV_MIN_ALLOC_SIZE))
+		size = XV_MIN_ALLOC_SIZE;
+	else
+		size = ROUNDUP_ALIGN(size);
+
+	spin_lock(&pool->lock);
+
+	index = FindBlock(pool, size, pageNum, offset);
+
+	if (*pageNum == INVALID_PGNUM) {
+		spin_unlock(&pool->lock);
+		error = GrowPool(pool);
+		if (unlikely(error))
+			return -ENOMEM;
+
+		spin_lock(&pool->lock);
+		index = FindBlock(pool, size, pageNum, offset);
+	}
+
+	if (*pageNum == INVALID_PGNUM) {
+		spin_unlock(&pool->lock);
+		return -ENOMEM;
+	}
+
+	block = (struct BlockHeader *)GetPtrAtomic(
+				*pageNum, *offset, KM_USER0);
+
+	RemoveBlockHead(pool, block, index);
+
+	/* Split the block if required */
+	tmpOffset = *offset + size + XV_ALIGN;
+	tmpSize = block->size - size;
+	tmpBlock = (struct BlockHeader *)((char *)block + size + XV_ALIGN);
+	if (tmpSize) {
+		tmpBlock->size = tmpSize - XV_ALIGN;
+		SetBlockFree(tmpBlock);
+		SetBlockPrevUsed(tmpBlock);
+
+		SetBlockPrev(tmpBlock, *offset);
+		if (tmpBlock->size >= XV_MIN_ALLOC_SIZE)
+			InsertBlock(pool, *pageNum, tmpOffset, tmpBlock);
+
+		if (tmpOffset + XV_ALIGN + tmpBlock->size < PAGE_SIZE) {
+			tmpBlock = (struct BlockHeader *)((char *)tmpBlock
+						+ tmpBlock->size + XV_ALIGN);
+
+			SetBlockPrev(tmpBlock, tmpOffset);
+		}
+	} else {
+		/* This block is exact fit */
+		if (tmpOffset < PAGE_SIZE)
+			SetBlockPrevUsed(tmpBlock);
+	}
+
+	block->size = origSize;
+	SetBlockUsed(block);
+
+	PutPtrAtomic(block, KM_USER0);
+	spin_unlock(&pool->lock);
+
+	*offset += XV_ALIGN;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xvMalloc);
+
+/*
+ * Free block identified with <pageNum, offset>
+ */
+void xvFree(struct Pool *pool, u32 pageNum, u32 offset)
+{
+	void *page;
+	struct BlockHeader *block, *tmpBlock;
+
+	offset -= XV_ALIGN;
+
+	spin_lock(&pool->lock);
+
+	page = GetPtrAtomic(pageNum, 0, KM_USER0);
+	block = (struct BlockHeader *)((char *)page + offset);
+
+	if (unlikely(block->size < XV_MIN_ALLOC_SIZE))
+		block->size = XV_MIN_ALLOC_SIZE;
+	else
+		block->size = ROUNDUP_ALIGN(block->size);
+
+	tmpBlock = (struct BlockHeader *)((char *)block +
+					block->size + XV_ALIGN);
+	if (offset + block->size + XV_ALIGN == PAGE_SIZE)
+		tmpBlock = NULL;
+
+	/* Merge next block if it is free */
+	if (tmpBlock && IsBlockFree(tmpBlock)) {
+		/*
+		 * Blocks smaller than XV_MIN_ALLOC_SIZE
+		 * are not inserted in any free list.
+		 */
+		if (tmpBlock->size >= XV_MIN_ALLOC_SIZE) {
+			RemoveBlock(pool, pageNum,
+				    offset + block->size + XV_ALIGN, tmpBlock,
+				    GetIndexForInsert(tmpBlock->size));
+		}
+		block->size += tmpBlock->size + XV_ALIGN;
+	}
+
+	/* Merge previous block if it is free */
+	if (IsPrevBlockFree(block)) {
+		tmpBlock = (struct BlockHeader *)((char *)(page) +
+						GetBlockPrev(block));
+		offset = offset - tmpBlock->size - XV_ALIGN;
+
+		if (tmpBlock->size >= XV_MIN_ALLOC_SIZE)
+			RemoveBlock(pool, pageNum, offset, tmpBlock,
+				    GetIndexForInsert(tmpBlock->size));
+
+		tmpBlock->size += block->size + XV_ALIGN;
+		block = tmpBlock;
+	}
+
+	/* No used objects in this page. Free it. */
+	if (block->size == PAGE_SIZE - XV_ALIGN) {
+		PutPtrAtomic(page, KM_USER0);
+		spin_unlock(&pool->lock);
+
+		FreePage(pageNum);
+		statDec(&pool->totalPages);
+		return;
+	}
+
+	SetBlockFree(block);
+	InsertBlock(pool, pageNum, offset, block);
+
+	if (offset + block->size < PAGE_SIZE - XV_ALIGN) {
+		tmpBlock = (struct BlockHeader *)((char *)(block) +
+						block->size + XV_ALIGN);
+		SetBlockPrevFree(tmpBlock);
+		SetBlockPrev(tmpBlock, offset);
+	}
+
+	PutPtrAtomic(page, KM_USER0);
+	spin_unlock(&pool->lock);
+
+	return;
+}
+EXPORT_SYMBOL_GPL(xvFree);
+
+u32 xvGetObjectSize(void *obj)
+{
+	struct BlockHeader *blk;
+
+	blk = (struct BlockHeader *)((char *)(obj) - XV_ALIGN);
+	return blk->size;
+}
+EXPORT_SYMBOL_GPL(xvGetObjectSize);
+
+/*
+ * Returns total memory used by allocator (userdata + metadata)
+ */
+u64 xvGetTotalSizeBytes(struct Pool *pool)
+{
+#ifdef CONFIG_XVMALLOC_STATS
+	return pool->totalPages << PAGE_SHIFT;
+#else
+	return 0;
+#endif
+}
+EXPORT_SYMBOL_GPL(xvGetTotalSizeBytes);
+
+static int __init xvMallocInit(void)
+{
+	return 0;
+}
+
+static void __exit xvMallocExit(void)
+{
+	return;
+}
+
+module_init(xvMallocInit);
+module_exit(xvMallocExit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nitin Gupta <ngupta@vflare.org>");
+MODULE_DESCRIPTION("xvMalloc Memory Allocator");
diff --git a/mm/xvmalloc_int.h b/mm/xvmalloc_int.h
new file mode 100755
index 0000000..5c648ee
--- /dev/null
+++ b/mm/xvmalloc_int.h
@@ -0,0 +1,91 @@
+/*
+ * xvmalloc_int.h
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: GPL/LGPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of the GNU General Public License Version 2.0
+ * Released under the terms of the GNU Lesser General Public License Version 2.1
+ */
+
+#ifndef _XVMALLOC_INT_H_
+#define _XVMALLOC_INT_H_
+
+#include <linux/types.h>
+
+#define ROUNDUP(x, y)	(((x) + (y) - 1) / (y) * (y))
+/* Each individual bitmap is 32-bit */
+#define BITMAP_BITS	32
+#define BITMAP_SHIFT	5
+#define BITMAP_MASK	(BITMAP_BITS - 1)
+
+/* User configurable params */
+
+/* This must be greater than sizeof(LinkFree) */
+#define XV_MIN_ALLOC_SIZE       32
+#define XV_MAX_ALLOC_SIZE       (4096 - XV_MIN_ALLOC_SIZE)
+
+/* Must be power of two */
+#define XV_ALIGN_SHIFT	2
+#define XV_ALIGN	(1 << XV_ALIGN_SHIFT)
+#define XV_ALIGN_MASK	(XV_ALIGN - 1)
+
+/* Free lists are separated by FL_DELTA bytes */
+#define FL_DELTA_SHIFT	3
+#define FL_DELTA	(1 << FL_DELTA_SHIFT)
+#define FL_DELTA_MASK	(FL_DELTA - 1)
+#define NUM_FREE_LISTS	((XV_MAX_ALLOC_SIZE - XV_MIN_ALLOC_SIZE) \
+				/ FL_DELTA + 1)
+
+#define MAX_FLI		(ROUNDUP(NUM_FREE_LISTS, 32) / 32)
+
+/* End of user params */
+
+#define ROUNDUP_ALIGN(x)	(((x) + XV_ALIGN_MASK) & ~XV_ALIGN_MASK)
+
+#define BLOCK_FREE	0x1
+#define PREV_FREE	0x2
+#define FLAGS_MASK	XV_ALIGN_MASK
+#define PREV_MASK	(~FLAGS_MASK)
+
+struct FreeListEntry {
+	u32 pageNum;
+	u16 offset;
+	u16 pad;
+};
+
+struct LinkFree {
+	u32 prevPageNum;
+	u32 nextPageNum;
+	u16 prevOffset;
+	u16 nextOffset;
+};
+
+struct BlockHeader {
+	union {
+		/* This common header must be ALIGN bytes */
+		u8 common[XV_ALIGN];
+		struct {
+			u16 size;
+			u16 prev;
+		};
+	};
+	struct LinkFree link;
+};
+
+struct Pool {
+	u32 flBitmap;
+	u32 slBitmap[MAX_FLI];
+	spinlock_t lock;
+
+	struct FreeListEntry FreeList[NUM_FREE_LISTS];
+
+	/* stats */
+#ifdef CONFIG_XVMALLOC_STATS
+	u64 totalPages;
+#endif
+};
+
+#endif


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/3]: documentation
  2009-03-17 11:34 [PATCH 0/3]: compressed in-memory swapping Nitin Gupta
  2009-03-17 11:36 ` [PATCH 1/3]: compressed RAM block device Nitin Gupta
  2009-03-17 11:37 ` [PATCH 2/3]: xvmalloc memory allocator Nitin Gupta
@ 2009-03-17 11:38 ` Nitin Gupta
  2009-03-17 23:34 ` [PATCH 0/3]: compressed in-memory swapping Russ Dill
  3 siblings, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-17 11:38 UTC (permalink / raw)
  To: linux-kernel

  Documentation/blockdev/00-INDEX      |    2 +
  Documentation/blockdev/compcache.txt |   67 ++++++++++++++++++++++++++++++++++
  Documentation/kernel-parameters.txt  |    9 +++++
  3 files changed, 78 insertions(+), 0 deletions(-)

Documentation on how to use compcache module.

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index 86f054c..d53dec0 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -4,6 +4,8 @@ README.DAC960
  	- info on Mylex DAC960/DAC1100 PCI RAID Controller Driver for Linux.
  cciss.txt
  	- info, major/minor #'s for Compaq's SMART Array Controllers.
+compcache.txt
+	- short guide on how to setup compressed RAM swap device.
  cpqarray.txt
  	- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
  floppy.txt
diff --git a/Documentation/blockdev/compcache.txt b/Documentation/blockdev/compcache.txt
new file mode 100644
index 0000000..46c33a5
--- /dev/null
+++ b/Documentation/blockdev/compcache.txt
@@ -0,0 +1,67 @@
+compcache: Compressed RAM swap device
+-------------------------------------
+
+Project home: http://code.google.com/p/compcache/
+
+This module creates a RAM-based block device (named ramzswap0) which acts
+as a swap disk. Pages swapped to this disk are compressed and stored in
+memory itself.
+
+It uses these components:
+ - xvMalloc: memory allocator (xvmalloc.ko)
+ - LZO1X: de/compressor: (lzo_compress.ko, lzo_decompress.ko)
+
+Usage:
+ - modprobe compcache [memlimit_kb=<val>|disksize_kb=<val>] [backing_dev=<dev>]
+
+   memlimit_kb: This param is applicable only when backing_dev is given.
+	It is the limit on the amount of compressed data stored in memory. Any
+	additional data is forwarded to backing_dev. It cannot be greater than
+	the backing device size. If missing or 0, a default value is used:
+	15% of RAM or the backing device size, whichever is smaller.
+
+   disksize_kb: This param is applicable only when backing_dev is not given.
+	It is the limit on the amount of *uncompressed* data stored in memory.
+	For example, disksize_kb=1024 means it can hold 1024kb worth of
+	uncompressed data even if this data compresses to just, say, 100kb.
+	If missing or 0, a default value is used: 25% of RAM.
+
+   backing_dev: This is the block device to be used as backing store for compcache.
+	It must be a valid swap partition. We move data to this device when we
+	encounter an incompressible page or when memlimit is reached. TODO: we may
+	also move some pages from compcache to this device in case the system is
+	really low on memory.
+	This device is not directly visible to the kernel as a swap device
+	(/proc/swaps will only show /dev/ramzswap0 and not this device).
+	Managing this backing device is the job of the compcache module.
+
+Examples:
+	1) modprobe compcache memlimit_kb=10240 backing_dev=/dev/sda2
+	sets compcache limit as 10MB and /dev/sda2 as backing swap device.
+	NOTE: here /dev/sda2 is a valid swap partition.
+
+	2) modprobe compcache backing_dev=/dev/sda2
+	same as (1) but memlimit is set to default: 15% of RAM or size of
+	backing swap device, whichever is smaller.
+
+	3) modprobe compcache disksize_kb=10240
+	sets compcache disk size as 10MB.
+
+	4) modprobe compcache.ko
+	same as (3) but compcache disk size will be set to default:
+	25% of RAM size.
+
+	Once the module is loaded, activate this swap with the highest priority:
+	swapon /dev/ramzswap0 -p 100
+	(the -p param sets the swap priority)
+
+Notes:
+ - compcache stats are exported via /proc/compcache
+ - If you give a non-swap partition as backing_dev, nothing bad will happen -
+   swapon will simply fail to recognize /dev/ramzswap0 as a swap partition.
+   So, in this case, unload the module and reload with the correct backing_dev.
+
+Please report any problems to:
+
+Nitin Gupta
+ngupta@vflare.org
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 54f21a5..a23f0ae 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -414,6 +414,15 @@ and is between 256 and 4096 characters. It is defined in the file
  			possible to determine what the correct size should be.
  			This option provides an override for these situations.

+	compcache.memlimit_kb=
+			See Documentation/blockdev/compcache.txt.
+
+	compcache.disksize_kb=
+			See Documentation/blockdev/compcache.txt.
+
+	compcache.backing_dev=
+			See Documentation/blockdev/compcache.txt.
+
  	security=	[SECURITY] Choose a security module to enable at boot.
  			If this boot parameter is not specified, only the first
  			security module asking for security registration will be


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-17 11:37 ` [PATCH 2/3]: xvmalloc memory allocator Nitin Gupta
@ 2009-03-17 16:41   ` Christoph Lameter
  2009-03-17 17:55     ` Nitin Gupta
       [not found]     ` <d760cf2d0903171028o600dc94cn7a5238520d104455@mail.gmail.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Christoph Lameter @ 2009-03-17 16:41 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: linux-kernel

On Tue, 17 Mar 2009, Nitin Gupta wrote:

> Slub allocator could not be used due to fragmentation issues:
> http://code.google.com/p/compcache/wiki/AllocatorsComparison
> Data here shows kmalloc using ~43% more memory than TLSF and xvMalloc
> is showed ~2% more space efficiency than TLSF (due to smaller metadata).

Well this looks like the use of 2^n sizes of the kmalloc arrays causes a
lot of wastage here. Creating slab caches that fit your needs may help?
Have you had a look at the SLOB approach?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-17 16:41   ` Christoph Lameter
@ 2009-03-17 17:55     ` Nitin Gupta
       [not found]     ` <d760cf2d0903171028o600dc94cn7a5238520d104455@mail.gmail.com>
  1 sibling, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-17 17:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

(sending again since it was blocked by lkml last time. Sorry Christoph for duplicate email).

Christoph Lameter wrote:
> On Tue, 17 Mar 2009, Nitin Gupta wrote:
> 
>> Slub allocator could not be used due to fragmentation issues:
>> http://code.google.com/p/compcache/wiki/AllocatorsComparison
>> Data here shows kmalloc using ~43% more memory than TLSF and xvMalloc
>> is showed ~2% more space efficiency than TLSF (due to smaller metadata).
> 
> Well this looks like the use of 2^n sizes of the kmalloc arrays causes a
> lot of wastage here. Creating slab caches that fit your needs may help?

Creating slabs for sizes in the range, say, [32, 3/4*PAGE_SIZE] separated by 64 bytes
will require 48 slabs! Then the slab for each size class will have wastage due to
unused slab objects in each class.
A larger difference between slab sizes (and thus a smaller number of them) will
surely cause too much wastage due to internal fragmentation.

Another (more important) point to consider is that use of slabs will eat up vmalloc
area to keep slab memory backed by VA space. On 32-bit systems, the vmalloc area
is small and limits the amount of memory that can be allocated for compressed pages.
With xvmalloc we map/unmap pages on demand, thus removing the dependence on the
vmalloc VA area.

 > Have you had a look at the SLOB approach?
 >

Nope. I will see how this may help.

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
       [not found]     ` <d760cf2d0903171028o600dc94cn7a5238520d104455@mail.gmail.com>
@ 2009-03-17 17:58       ` Christoph Lameter
  2009-03-17 18:34         ` Pekka Enberg
  2009-03-18 15:17         ` Nitin Gupta
  0 siblings, 2 replies; 34+ messages in thread
From: Christoph Lameter @ 2009-03-17 17:58 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: linux-kernel

On Tue, 17 Mar 2009, Nitin Gupta wrote:

> Creating slabs for sizes in range, say, [32, 3/4*PAGE_SIZE] separated by
> 64bytes
> will require 48 slabs! Then for slab of each size class will have wastage
> due to
> unused slab objects in each class.
> Larger difference in slab sizes (and thus small no. of them), will surely
> cause too much
> wastage due to internal fragmentation.

The slabs that match existing other slabs of similar sizes will be aliased
and not created. Create the 48 slabs and you likely will only use 10 real
additional ones. The rest will just be pointing to existing ones.

> Another (more important) point to consider is that, use of slabs will
> eat-up vmalloc area to keep slab memory backed by VA space. On 32-bit
> systems, vmalloc area is small and limits amount of memory that can be
> allocated for compressed pages. With xvmalloc we map/unmap pages on
> demand thus removing dependence on vmalloc VA area.

Slab memory is not backed by vmalloc space.

> > Have you had a look at the SLOB approach?
> Nope. I will see how this may help.

Slob is another attempt to reduce wastage due to the rounding up of
object sizes to 2^N in SLAB/SLUB.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-17 17:58       ` Christoph Lameter
@ 2009-03-17 18:34         ` Pekka Enberg
  2009-03-18 16:07           ` Nitin Gupta
  2009-03-18 15:17         ` Nitin Gupta
  1 sibling, 1 reply; 34+ messages in thread
From: Pekka Enberg @ 2009-03-17 18:34 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nitin Gupta, linux-kernel

On Tue, 17 Mar 2009, Nitin Gupta wrote:
>> Creating slabs for sizes in range, say, [32, 3/4*PAGE_SIZE] separated by
>> 64bytes
>> will require 48 slabs! Then for slab of each size class will have wastage
>> due to
>> unused slab objects in each class.
>> Larger difference in slab sizes (and thus small no. of them), will surely
>> cause too much
>> wastage due to internal fragmentation.

On Tue, Mar 17, 2009 at 10:58 AM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> The slabs that match existing other slabs of similar sizes will be aliased
> and not created. Create the 48 slabs and you likely will only use 10 real
> additional ones. The rest will just be pointing to existing ones.

Yup. One thing I don't quite understand is why you need all the 48
caches in the first place. Allocation sizes tend to be clustered and I
would have imagined you'd see that when compressing page sized chunks
as well. Using kmemtrace to analyze the exact reason for the bad
fragmentation would probably be helpful.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3]: compressed in-memory swapping
  2009-03-17 11:34 [PATCH 0/3]: compressed in-memory swapping Nitin Gupta
                   ` (2 preceding siblings ...)
  2009-03-17 11:38 ` [PATCH 3/3]: documentation Nitin Gupta
@ 2009-03-17 23:34 ` Russ Dill
  3 siblings, 0 replies; 34+ messages in thread
From: Russ Dill @ 2009-03-17 23:34 UTC (permalink / raw)
  To: linux-kernel

Nitin Gupta <ngupta <at> vflare.org> writes:

> 
> Hi,
> 
> Project home: http://code.google.com/p/compcache/
> 
> It allows creating a RAM based block device which acts as swap disk.
> Pages swapped to this device are compressed and stored in memory itself.
> This is a big win over swapping to slow hard-disk which are typically used
> as swap disk. For flash, these suffer from wear-leveling issues when used
> as swap disk - so again its helpful. For swapless systems, it allows more
> apps to run.
> 

I would recommend making xvmalloc the 1/3 instead of 2/3 so that the commit is
bisectable.

I really like this project and I'm especially impressed with how the SSD ATAPI
discard command fell in nicely with the discard requirement for compcache.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3]: compressed RAM block device
  2009-03-17 11:36 ` [PATCH 1/3]: compressed RAM block device Nitin Gupta
@ 2009-03-18 12:25   ` Nick Piggin
  2009-03-18 12:38     ` Nitin Gupta
  2009-03-20 18:56   ` Pavel Machek
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2009-03-18 12:25 UTC (permalink / raw)
  To: ngupta; +Cc: linux-kernel

On Tuesday 17 March 2009 22:36:46 Nitin Gupta wrote:
>   drivers/block/Kconfig     |   22 +
>   drivers/block/Makefile    |    1 +
>   drivers/block/compcache.c |  995
> +++++++++++++++++++++++++++++++++++++++++++++ drivers/block/compcache.h | 
> 160 ++++++++
>   4 files changed, 1178 insertions(+), 0 deletions(-)
>
> Creates RAM based block device (ramzswap0) which can be used as swap
> device. Pages swapped to this are compressed and stored in memory itself.
>
> The module is called compcache.ko. It depends on:
>   - xvmalloc.ko: memory allocator
>   - lzo_compress.ko
>   - lzo_decompress.ko
>
> See Documentation/blockdev/compcache.txt for usage details.
>
> Project home: http://code.google.com/p/compcache/

I wonder how hard it would be to make the compression code
use an arbitrary file or device for the storage backend rather
than make a new block device? Then you could make a new ram
block device that can swap its pages out (or even extend brd.c
with that functionality, or use loop on tmpfs etc).




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3]: compressed RAM block device
  2009-03-18 12:25   ` Nick Piggin
@ 2009-03-18 12:38     ` Nitin Gupta
  2009-03-18 12:49       ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-18 12:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

Nick Piggin wrote:
> On Tuesday 17 March 2009 22:36:46 Nitin Gupta wrote:
>>   drivers/block/Kconfig     |   22 +
>>   drivers/block/Makefile    |    1 +
>>   drivers/block/compcache.c |  995
>> +++++++++++++++++++++++++++++++++++++++++++++ drivers/block/compcache.h | 
>> 160 ++++++++
>>   4 files changed, 1178 insertions(+), 0 deletions(-)
>>
>> Creates RAM based block device (ramzswap0) which can be used as swap
>> device. Pages swapped to this are compressed and stored in memory itself.
>>
>> The module is called compcache.ko. It depends on:
>>   - xvmalloc.ko: memory allocator
>>   - lzo_compress.ko
>>   - lzo_decompress.ko
>>
>> See Documentation/blockdev/compcache.txt for usage details.
>>
>> Project home: http://code.google.com/p/compcache/
> 

> I wonder how hard it would be to make the compression code
> use an arbitrary file or device for the storage backend rather
> than make a new block device? Then you could make a new ram
> block device that can swap its pages out (or even extend brd.c
> with that functionality, or use loop on tmpfs etc).
> 

It's already done, as mentioned in Documentation/blockdev/compcache.txt
(patch 3/3), but I should have added it here also: compcache accepts a
"backing_dev" parameter for the backing swap partition - this allows us to
forward any R/W request to backing swap device. Currently only poorly
compressible pages (compress length > PAGE_SIZE/2) are forwarded to
backing_dev.
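
To make that decision concrete, here is a rough sketch of the write-path
logic - not the actual compcache code; only the PAGE_SIZE/2 threshold comes
from the driver, and the three helpers below are hypothetical placeholders:

#include <linux/errno.h>
#include <linux/lzo.h>
#include <linux/mm.h>

/* hypothetical helpers, named only for this sketch */
extern bool memlimit_reached(void);
extern int forward_to_backing_dev(const void *uncmem);
extern int store_compressed(const void *cbuf, size_t clen);

/* Sketch: compress one page; forward it if it compresses poorly */
static int handle_swap_write(const void *uncmem, void *cbuf, void *workmem)
{
	size_t clen = PAGE_SIZE;

	if (lzo1x_1_compress(uncmem, PAGE_SIZE, cbuf, &clen, workmem) != LZO_E_OK)
		return -EIO;

	/* poorly compressible page, or memlimit hit: send to backing_dev */
	if (clen > PAGE_SIZE / 2 || memlimit_reached())
		return forward_to_backing_dev(uncmem);

	return store_compressed(cbuf, clen);
}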

Thanks,
Nitin





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3]: compressed RAM block device
  2009-03-18 12:38     ` Nitin Gupta
@ 2009-03-18 12:49       ` Nick Piggin
  0 siblings, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2009-03-18 12:49 UTC (permalink / raw)
  To: ngupta; +Cc: linux-kernel

On Wednesday 18 March 2009 23:38:30 Nitin Gupta wrote:
> Nick Piggin wrote:
> > On Tuesday 17 March 2009 22:36:46 Nitin Gupta wrote:
> >>   drivers/block/Kconfig     |   22 +
> >>   drivers/block/Makefile    |    1 +
> >>   drivers/block/compcache.c |  995
> >> +++++++++++++++++++++++++++++++++++++++++++++ drivers/block/compcache.h
> >> | 160 ++++++++
> >>   4 files changed, 1178 insertions(+), 0 deletions(-)
> >>
> >> Creates RAM based block device (ramzswap0) which can be used as swap
> >> device. Pages swapped to this are compressed and stored in memory
> >> itself.
> >>
> >> The module is called compcache.ko. It depends on:
> >>   - xvmalloc.ko: memory allocator
> >>   - lzo_compress.ko
> >>   - lzo_decompress.ko
> >>
> >> See Documentation/blockdev/compcache.txt for usage details.
> >>
> >> Project home: http://code.google.com/p/compcache/
> >
> > I wonder how hard it would be to make the compression code
> > use an arbitrary file or device for the storage backend rather
> > than make a new block device? Then you could make a new ram
> > block device that can swap its pages out (or even extend brd.c
> > with that functionality, or use loop on tmpfs etc).
>
> It's already done, as mentioned in Documentation/blockdev/compcache.txt
> (patch 3/3), but I should have added it here also: compcache accepts a
> "backing_dev" parameter for the backing swap partition - this allows us to

I meant more like compressing the data then passing it down to
another device. I guess you still need to make a block device to
operate on the requests, however, so it would get more complicated
because you'd need to pass your own down.... yeah it would probably
end up being just as complex and less efficient. Ignore me then :)


> forward any R/W request to backing swap device. Currently only poorly
> compressible pages (compress length > PAGE_SIZE/2) are forwarded to
> backing_dev.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-17 17:58       ` Christoph Lameter
  2009-03-17 18:34         ` Pekka Enberg
@ 2009-03-18 15:17         ` Nitin Gupta
  2009-03-18 16:58           ` Christoph Lameter
  1 sibling, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-18 15:17 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter wrote:
> On Tue, 17 Mar 2009, Nitin Gupta wrote:
> 
>> Creating slabs for sizes in range, say, [32, 3/4*PAGE_SIZE] separated by
>> 64bytes
>> will require 48 slabs! Then for slab of each size class will have wastage
>> due to
>> unused slab objects in each class.
>> Larger difference in slab sizes (and thus small no. of them), will surely
>> cause too much
>> wastage due to internal fragmentation.
> 
> The slabs that match existing other slabs of similar sizes will be aliased
> and not created. Create the 48 slabs and you likely will only use 10 real
> additional ones. The rest will just be pointing to existing ones.
> 
>> Another (more important) point to consider is that, use of slabs will
>> eat-up vmalloc area to keep slab memory backed by VA space. On 32-bit
>> systems, vmalloc area is small and limits amount of memory that can be
>> allocated for compressed pages. With xvmalloc we map/unmap pages on
>> demand thus removing dependence on vmalloc VA area.
> 
> Slab memory is not backed by vmalloc space.
> 

Oh, it uses "low memory". Still not good for compcache :)

>>> Have you had a look at the SLOB approach?
>> Nope. I will see how this may help.
> 
> Slob is another attempt to reduce wastage due to the rounding up of
> object sizes to 2^N in SLAB/SLUB.
> 
> 

I had a detailed look at the SLOB allocator and found it unacceptable to be
used for compcache.

To begin with, SLOB maintains just 3 freelists:
  - for size < 256     - free_slob_small
  - [256, 1024)        - free_slob_medium
  - [1024, PAGE_SIZE)  - free_slob_large

and allocates from one of these lists depending on size requested. No need to
create 50+ caches, we only get to use these 3 lists.
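
(Roughly, that dispatch looks like the sketch below - this is not the slob.c
code itself, just the three lists and thresholds described above:)

#include <linux/list.h>
#include <linux/types.h>

static LIST_HEAD(free_slob_small);	/* size < 256 */
static LIST_HEAD(free_slob_medium);	/* [256, 1024) */
static LIST_HEAD(free_slob_large);	/* [1024, PAGE_SIZE) */

/* Sketch: SLOB picks one of only three freelists based on request size */
static struct list_head *slob_list_for(size_t size)
{
	if (size < 256)
		return &free_slob_small;
	if (size < 1024)
		return &free_slob_medium;
	return &free_slob_large;
}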

Why SLOB is bad:

1) O(n) allocation:
To find a block of a given size, it _linearly_ scans the corresponding free list to
find a page with _total_ free space >= requested size. This free space might not
be contiguous. So it runs through the free blocks within each such candidate
page until it finally finds some page with a contiguous free area >= requested
size.

2) When growing the SLOB cache, a page is added to one of the 3 freelists (depending
on what size we are currently allocating). After this, the page can never move to
any other list - even if its free space drops down to fall in the next range below,
or vice versa. This has two problems:
   - Potentially large wastage due to "page internal fragmentation": e.g.
alloc(3096) is satisfied from a page in the 'large' free list. Now it has
1000b free (assuming a 4k page) which will now never be used.
   - It can trigger unnecessary cache growth: e.g. even though we have such
unfilled pages in the 'large' list, an allocation in the 'small' range can still
cause a cache grow if the 'small' free list is empty.

3) It only allocates from "low memory". This is clearly not acceptable for
compcache.

In contrast, xvmalloc is O(1): a simple search in a two-level bitmap finds the
freelist containing a block of the required size. Obviously it is not O(1) when it
has to go to the system page allocator to grow the pool.
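
(A condensed sketch of that two-level lookup - not the patch code itself; the
real find_block() also returns the <page, offset> of the list head, and the
helper name below is made up:)

#include <linux/bitops.h>
#include <linux/types.h>

/*
 * Sketch: flbitmap has one bit per group of 32 freelists, slbitmap[] has one
 * bit per freelist. Two ffs() calls find the first non-empty freelist whose
 * index is >= 'start' (the index computed from the requested size).
 */
static int first_usable_list(u32 flbitmap, const u32 *slbitmap, u32 start)
{
	u32 group = start >> 5;
	u32 sl = slbitmap[group] & ~((1U << (start & 31)) - 1);

	if (!sl) {
		u32 fl = flbitmap >> (group + 1);

		if (!fl)
			return -1;	/* nothing big enough; pool must grow */
		group += ffs(fl);
		sl = slbitmap[group];
	}
	return (group << 5) + ffs(sl) - 1;
}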

Also, xvmalloc doesn't dedicate a page to any single size class - so it doesn't
suffer from the above problems. Note that this might not be good in general - say,
in cases where the majority of alloc requests are for a few select sizes only. But
for compcache, this is not true. Also, in our case, there is no correlation
between object sizes and object lifetimes - so there is no benefit in keeping
similar-sized objects together. Considering these, there's no point dedicating
pages to size classes. Instead, it is better to freely mix these objects to get
maximum packing.

...and xvmalloc is not restricted to "low memory".

Thanks,
Nitin



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-17 18:34         ` Pekka Enberg
@ 2009-03-18 16:07           ` Nitin Gupta
  0 siblings, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-18 16:07 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

Pekka Enberg wrote:
> On Tue, 17 Mar 2009, Nitin Gupta wrote:
>>> Creating slabs for sizes in range, say, [32, 3/4*PAGE_SIZE] separated by
>>> 64bytes
>>> will require 48 slabs! Then for slab of each size class will have wastage
>>> due to
>>> unused slab objects in each class.
>>> Larger difference in slab sizes (and thus small no. of them), will surely
>>> cause too much
>>> wastage due to internal fragmentation.
> 
> On Tue, Mar 17, 2009 at 10:58 AM, Christoph Lameter
> <cl@linux-foundation.org> wrote:
>> The slabs that match existing other slabs of similar sizes will be aliased
>> and not created. Create the 48 slabs and you likely will only use 10 real
>> additional ones. The rest will just be pointing to existing ones.
> 
> Yup. One thing I don't quite understand is why you need all the 48
> caches in the first place. Allocation sizes tend to be clustered and I
> would have imagined you'd see that when compressing page sized chunks
> as well.

Compressed page lengths sometimes do tend to cluster within a somewhat
small range. However, the range where the majority of pages will lie depends
highly on the workload - sometimes the range is not clear and sometimes there is
no preferred range at all. Please refer to this data:
http://code.google.com/p/compcache/wiki/CompressedLengthDistribution

It shows compressed page size distribution for various workloads.

> Using kmemtrace to analyze the exact reason for the bad
> fragmentation would probably be helpful.
> 

That was purely internal fragmentation.
Wastage per obj = ksize(obj) - actual_size.
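
(For reference, the per-object measurement itself is trivial - a sketch only,
not the replay client linked below:)

#include <linux/slab.h>

/* Sketch: bytes lost to kmalloc size rounding for a single allocation */
static size_t kmalloc_wastage(size_t actual_size)
{
	void *obj = kmalloc(actual_size, GFP_KERNEL);
	size_t waste;

	if (!obj)
		return 0;
	waste = ksize(obj) - actual_size;
	kfree(obj);
	return waste;
}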

Code used for testing:
http://code.google.com/p/compcache/source/browse/trunk/sub-projects/testing/kmalloc_test/kmalloc_test.c
This is a "SwapReplay client". Please see:
http://code.google.com/p/compcache/wiki/SwapReplayDesign

Thanks,
Nitin





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-18 15:17         ` Nitin Gupta
@ 2009-03-18 16:58           ` Christoph Lameter
  2009-03-18 17:29             ` Nitin Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2009-03-18 16:58 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: linux-kernel

On Wed, 18 Mar 2009, Nitin Gupta wrote:

> > Slab memory is not backed by vmalloc space.
> Oh, it uses "low memory". Still not good for compcache :)

Only if you use 32 bit.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-18 16:58           ` Christoph Lameter
@ 2009-03-18 17:29             ` Nitin Gupta
  2009-03-18 19:21               ` Pekka Enberg
  0 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-18 17:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

On Wed, Mar 18, 2009 at 10:28 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> On Wed, 18 Mar 2009, Nitin Gupta wrote:
>
>> > Slab memory is not backed by vmalloc space.
>> Oh, it uses "low memory". Still not good for compcache :)
>
> Only if you use 32 bit.
>

Yes, this point is for 32-bit only - many users were unhappy with
compcache size limit due to small vmalloc area on x86-32.

Nitin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-18 17:29             ` Nitin Gupta
@ 2009-03-18 19:21               ` Pekka Enberg
  2009-03-18 19:36                 ` Nitin Gupta
  2009-03-19  2:30                 ` Nitin Gupta
  0 siblings, 2 replies; 34+ messages in thread
From: Pekka Enberg @ 2009-03-18 19:21 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: Christoph Lameter, linux-kernel

On Wed, 18 Mar 2009, Nitin Gupta wrote:
>>> > Slab memory is not backed by vmalloc space.
>>> Oh, it uses "low memory". Still not good for compcache :)

On Wed, Mar 18, 2009 at 10:28 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
>> Only if you use 32 bit.

On Wed, Mar 18, 2009 at 10:29 AM, Nitin Gupta <ngupta@vflare.org> wrote:
> Yes, this point is for 32-bit only - many users were unhappy with
> compcache size limit due to small vmalloc area on x86-32.

Yeah, I can see the point in having a custom allocator for this. But
quite frankly the xvmalloc code is just too ugly to live with. You
might want to make it look like kernel code as per CodingStyle and
change the name to something less generic.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-18 19:21               ` Pekka Enberg
@ 2009-03-18 19:36                 ` Nitin Gupta
  2009-03-19  2:30                 ` Nitin Gupta
  1 sibling, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-18 19:36 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

Pekka Enberg wrote:
> Yeah, I can see the point in having a custom allocator for this. But
> quite frankly the xvmalloc code is just too ugly to live with. You
> might want to make it look like kernel code as per CodingStyle and
> change the name to something less generic.
> 

In the beginning I also thought CamelCase coding would be a problem.
But quite frankly, this is the best I could do to beautify the xvmalloc code.

I will change it to kernel_style. Maybe that helps.

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-18 19:21               ` Pekka Enberg
  2009-03-18 19:36                 ` Nitin Gupta
@ 2009-03-19  2:30                 ` Nitin Gupta
  2009-03-19  6:08                   ` Pekka Enberg
  1 sibling, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-19  2:30 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

On Thu, Mar 19, 2009 at 12:51 AM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> On Wed, 18 Mar 2009, Nitin Gupta wrote:
>>>> > Slab memory is not backed by vmalloc space.
>>>> Oh, it uses "low memory". Still not good for compcache :)
>
> On Wed, Mar 18, 2009 at 10:28 PM, Christoph Lameter
> <cl@linux-foundation.org> wrote:
>>> Only if you use 32 bit.
>
> On Wed, Mar 18, 2009 at 10:29 AM, Nitin Gupta <ngupta@vflare.org> wrote:
>> Yes, this point is for 32-bit only - many users were unhappy with
>> compcache size limit due to small vmalloc area on x86-32.
>
> Yeah, I can see the point in having a custom allocator for this. But
> quite frankly the xvmalloc code is just too ugly to live with. You
> might want to make it look like kernel code as per CodingStyle and
> change the name to something less generic.
>

I can see quite a few places for cleanups (like macros for
reaching the next free block, etc.). Do you think CamelCase will be acceptable
for now, and also keeping the xvmalloc name? Other than these, I will
check for potential cleanups and send updated diffs.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3]: xvmalloc memory allocator
  2009-03-19  2:30                 ` Nitin Gupta
@ 2009-03-19  6:08                   ` Pekka Enberg
  0 siblings, 0 replies; 34+ messages in thread
From: Pekka Enberg @ 2009-03-19  6:08 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: Christoph Lameter, linux-kernel

On 3/18/09, Nitin Gupta <ngupta@vflare.org> wrote:
> I can see quite a few places for cleanups (like macros for
> reaching the next free block, etc.). Do you think CamelCase will be acceptable
> for now, and also keeping the xvmalloc name? Other than these, I will
> check for potential cleanups and send updated diffs.

No, I do not think it is acceptable for core kernel code.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 14:07 [PATCH 0/3] compressed in-memory swapping take2 Nitin Gupta
@ 2009-03-20 14:12 ` Nitin Gupta
  2009-03-20 14:57   ` Christoph Lameter
  0 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-20 14:12 UTC (permalink / raw)
  To: ngupta; +Cc: Pekka Enberg, Christoph Lameter, linux-kernel

  init/Kconfig      |    6 +
  mm/Makefile       |    1 +
  mm/xvmalloc.c     |  572 +++++++++++++++++++++++++++++++++++++++++++++++++++++
  mm/xvmalloc_int.h |   95 +++++++++
  4 files changed, 674 insertions(+), 0 deletions(-)

xvmalloc is a memory allocator designed specifically for compcache project.

* Features:
  - Low metadata overhead (just 4 bytes per object)
  - O(1) Alloc/Free - except when we have to call system page allocator to
     get additional memory.
  - Very low fragmentation: In all tests, xvMalloc memory usage is within 12%
     of "Ideal".

One of the main highlights is that it maps pages only when required.
So, it does not hog vmalloc area which is very small on 32-bit systems.

Slub allocator could not be used due to fragmentation issues:
http://code.google.com/p/compcache/wiki/AllocatorsComparison
Data here shows kmalloc using ~43% more memory than TLSF, and xvMalloc
showed ~2% better space efficiency than TLSF (due to smaller metadata).

* Implementation:
It uses a two-level bitmap search to find the free list containing a block of the
correct size. This idea is taken from TLSF (Two-Level Segregate Fit)
allocator and is well explained in its paper (see [Links] below).
Highlights:
  - Pool based allocator: each pool can grow/shrink.
  - Immediate coalescing of free blocks.
  - Maps/Unmaps memory pages only when required.

* Limitations:
  - Poor scalability: No per-cpu data structures (work in progress).

[Links]
1. Details and Performance data:
http://code.google.com/p/compcache/wiki/xvMalloc
http://code.google.com/p/compcache/wiki/xvMallocPerformance

2. TLSF memory allocator:
home: http://rtportal.upv.es/rtmalloc/
paper: http://rtportal.upv.es/rtmalloc/files/MRBC_2008.pdf

Signed-off-by: Nitin Gupta <ngupta@vflare.org>
---

diff --git a/init/Kconfig b/init/Kconfig
index 6a5c5fe..fa41598 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -930,6 +930,12 @@ config SLOB

  endchoice

+config XVMALLOC
+	tristate "xvMalloc memory allocator"
+	help
+	   This is a simple, low fragmentation, O(1) allocator.
+	   Details: http://code.google.com/p/compcache/wiki/xvMalloc
+
  config PROFILING
  	bool "Profiling support (EXPERIMENTAL)"
  	help
diff --git a/mm/Makefile b/mm/Makefile
index 72255be..b6b705f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
  obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
  obj-$(CONFIG_SLAB) += slab.o
  obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_XVMALLOC) += xvmalloc.o
  obj-$(CONFIG_FAILSLAB) += failslab.o
  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
  obj-$(CONFIG_FS_XIP) += filemap_xip.o
diff --git a/mm/xvmalloc.c b/mm/xvmalloc.c
new file mode 100755
index 0000000..e272487
--- /dev/null
+++ b/mm/xvmalloc.c
@@ -0,0 +1,572 @@
+/*
+ * xvmalloc.c
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: GPL/LGPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of the GNU General Public License Version 2.0
+ * Released under the terms of the GNU Lesser General Public License Version 2.1
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/xvmalloc.h>
+
+#include "xvmalloc_int.h"
+
+static void stat_inc(u64 *value)
+{
+	*value = *value + 1;
+}
+
+static void stat_dec(u64 *value)
+{
+	*value = *value - 1;
+}
+
+static void bitmap_set(u32 *map, u32 idx)
+{
+	*map |= (u32)(1 << idx);
+}
+
+static void bitmap_clear(u32 *map, u32 idx)
+{
+	*map &= (u32)(~(1 << idx));
+}
+
+static u32 test_flag(struct block_header *block, enum blockflags flag)
+{
+	return block->prev & (1 << flag);
+}
+
+static void set_flag(struct block_header *block, enum blockflags flag)
+{
+	block->prev |= (1 << flag);
+}
+
+static void clear_flag(struct block_header *block, enum blockflags flag)
+{
+	block->prev &= ~(1 << flag);
+}
+
+static u32 get_blockprev(struct block_header *block)
+{
+	return block->prev & PREV_MASK;
+}
+
+static void set_blockprev(struct block_header *block, u16 new_offset)
+{
+	block->prev = new_offset | (block->prev & FLAGS_MASK);
+}
+
+static struct block_header *BLOCK_NEXT(struct block_header *block)
+{
+	return (struct block_header *)((char *)block + block->size + XV_ALIGN);
+}
+
+/*
+ * Get index of free list containing blocks of maximum size
+ * which is less than or equal to given size.
+ */
+static u32 get_index_for_insert(u32 size)
+{
+	size = size > XV_MAX_ALLOC_SIZE ? XV_MAX_ALLOC_SIZE : size;
+	size &= ~FL_DELTA_MASK;
+	return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Get index of free list having blocks of size greater than
+ * or equal to requested size.
+ */
+static u32 get_index(u32 size)
+{
+	size = (size + FL_DELTA_MASK) & ~FL_DELTA_MASK;
+	return (size - XV_MIN_ALLOC_SIZE) >> FL_DELTA_SHIFT;
+}
+
+/*
+ * Given <pagenum, offset> pair, provide a dereferenceable pointer.
+ * This is called from xv_malloc/xv_free path, so it needs to be fast.
+ */
+static void *get_ptr_atomic(u32 pagenum, u16 offset, enum km_type type)
+{
+	unsigned char *base;
+
+	base = kmap_atomic(pfn_to_page(pagenum), type);
+	return base + offset;
+}
+
+static void put_ptr_atomic(void *ptr, enum km_type type)
+{
+	kunmap_atomic(ptr, type);
+}
+
+/*
+ * Allocate a memory page. Called when a pool needs to grow.
+ */
+static u32 xv_alloc_page(void)
+{
+	struct page *page;
+
+	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
+
+	if (unlikely(!page))
+		return INVALID_PGNUM;
+
+	return page_to_pfn(page);
+}
+
+/*
+ * Called when all objects in a page are freed.
+ */
+static void xv_free_page(u32 pagenum)
+{
+	__free_page(pfn_to_page(pagenum));
+}
+
+/**
+ * find_block - find block of at least given size
+ * @pool: memory pool to search from
+ * @size: size of block required
+ * @pagenum: page no. containing required block
+ * @offset: offset within the page where block is located.
+ *
+ * Searches two level bitmap to locate block of at least
+ * the given size. If such a block is found, it provides
+ * <pagenum, offset> to identify this block and returns index
+ * in freelist where we found this block.
+ * Otherwise, returns 0 and <pagenum, offset> params are not touched.
+ */
+static u32 find_block(struct xv_pool *pool, u32 size,
+			u32 *pagenum, u32 *offset)
+{
+	u32 flbitmap, slbitmap;
+	u32 flindex, slindex, slbitstart;
+
+	/* There are no free blocks in this pool */
+	if (!pool->flbitmap)
+		return 0;
+
+	if (unlikely(size < XV_MIN_ALLOC_SIZE))
+		size = XV_MIN_ALLOC_SIZE;
+
+	/* Get freelist index corresponding to this size */
+	slindex = get_index(size);
+	slbitmap = pool->slbitmap[slindex >> BITMAP_SHIFT];
+	slbitstart = slindex & BITMAP_MASK;
+
+	/*
+	 * If freelist is not empty at this index, we found the
+	 * block - head of this list. This is approximate best-fit match.
+	 */
+	if (slbitmap & (1 << slbitstart)) {
+		*pagenum = pool->freelist[slindex].pagenum;
+		*offset = pool->freelist[slindex].offset;
+		return slindex;
+	}
+
+	/*
+	 * No best-fit found. Search a bit further in bitmap for a free block.
+	 * Second level bitmap consists of series of 32-bit chunks. Search
+	 * further in the chunk where we expected a best-fit, starting from
+	 * index location found above.
+	 */
+	slbitstart++;
+	slbitmap >>= slbitstart;
+
+	/* Skip this search if we were already at end of this bitmap chunk */
+	if ((slbitstart != BITMAP_BITS) && slbitmap) {
+		slindex += ffs(slbitmap);
+		*pagenum = pool->freelist[slindex].pagenum;
+		*offset = pool->freelist[slindex].offset;
+		return slindex;
+	}
+
+	/* Now do a full two-level bitmap search to find next nearest fit */
+	flindex = slindex >> BITMAP_SHIFT;
+
+	flbitmap = (pool->flbitmap) >> (flindex + 1);
+	if (!flbitmap)
+		return 0;
+
+	flindex += ffs(flbitmap);
+	slbitmap = pool->slbitmap[flindex];
+	slindex = (flindex << BITMAP_SHIFT) + ffs(slbitmap) - 1;
+	*pagenum = pool->freelist[slindex].pagenum;
+	*offset = pool->freelist[slindex].offset;
+
+	return slindex;
+}
+
+/*
+ * Insert block at <pagenum, offset> in freelist of given pool.
+ * freelist used depends on block size.
+ */
+static void insert_block(struct xv_pool *pool, u32 pagenum, u32 offset,
+			struct block_header *block)
+{
+	u32 flindex, slindex;
+	struct block_header *nextblock;
+
+	slindex = get_index_for_insert(block->size);
+	flindex = slindex >> BITMAP_SHIFT;
+
+	block->link.prev_pagenum = INVALID_PGNUM;
+	block->link.prev_offset = 0;
+	block->link.next_pagenum = pool->freelist[slindex].pagenum;
+	block->link.next_offset = pool->freelist[slindex].offset;
+	pool->freelist[slindex].pagenum = pagenum;
+	pool->freelist[slindex].offset = offset;
+
+	if (block->link.next_pagenum != INVALID_PGNUM) {
+		nextblock = get_ptr_atomic(block->link.next_pagenum,
+					block->link.next_offset, KM_USER1);
+		nextblock->link.prev_pagenum = pagenum;
+		nextblock->link.prev_offset = offset;
+		put_ptr_atomic(nextblock, KM_USER1);
+	}
+
+	bitmap_set(&pool->slbitmap[flindex], slindex & BITMAP_MASK);
+	bitmap_set(&pool->flbitmap, flindex);
+}
+
+/*
+ * Remove block from head of freelist. Index 'slindex' identifies the freelist.
+ */
+static void remove_block_head(struct xv_pool *pool,
+			struct block_header *block, u32 slindex)
+{
+	struct block_header *tmpblock;
+	u32 flindex = slindex >> BITMAP_SHIFT;
+
+	pool->freelist[slindex].pagenum = block->link.next_pagenum;
+	pool->freelist[slindex].offset = block->link.next_offset;
+	block->link.prev_pagenum = INVALID_PGNUM;
+	block->link.prev_offset = 0;
+
+	if (pool->freelist[slindex].pagenum == INVALID_PGNUM) {
+		bitmap_clear(&pool->slbitmap[flindex], slindex & BITMAP_MASK);
+		if (!pool->slbitmap[flindex])
+			bitmap_clear(&pool->flbitmap, flindex);
+	} else {
+		/*
+		 * DEBUG ONLY: We need not reinitialize freelist head previous
+		 * pointer to INVALID_PGNUM - we never depend on its value.
+		 * But just for sanity, lets keep it.
+		 */
+		tmpblock = get_ptr_atomic(pool->freelist[slindex].pagenum,
+				pool->freelist[slindex].offset, KM_USER1);
+		tmpblock->link.prev_pagenum = INVALID_PGNUM;
+		tmpblock->link.prev_offset = 0;
+		put_ptr_atomic(tmpblock, KM_USER1);
+	}
+}
+
+/*
+ * Remove block from freelist. Index 'slindex' identifies the freelist.
+ */
+static void remove_block(struct xv_pool *pool, u32 pagenum, u32 offset,
+			struct block_header *block, u32 slindex)
+{
+	u32 flindex;
+	struct block_header *tmpblock;
+
+	if (pool->freelist[slindex].pagenum == pagenum
+	   && pool->freelist[slindex].offset == offset) {
+		remove_block_head(pool, block, slindex);
+		return;
+	}
+
+	flindex = slindex >> BITMAP_SHIFT;
+
+	if (block->link.prev_pagenum != INVALID_PGNUM) {
+		tmpblock = get_ptr_atomic(block->link.prev_pagenum,
+				block->link.prev_offset, KM_USER1);
+		tmpblock->link.next_pagenum = block->link.next_pagenum;
+		tmpblock->link.next_offset = block->link.next_offset;
+		put_ptr_atomic(tmpblock, KM_USER1);
+	}
+
+	if (block->link.next_pagenum != INVALID_PGNUM) {
+		tmpblock = get_ptr_atomic(block->link.next_pagenum,
+				block->link.next_offset, KM_USER1);
+		tmpblock->link.prev_pagenum = block->link.prev_pagenum;
+		tmpblock->link.prev_offset = block->link.prev_offset;
+		put_ptr_atomic(tmpblock, KM_USER1);
+	}
+
+	return;
+}
+
+/*
+ * Allocate a page and add it to the freelist of given pool.
+ */
+static int grow_pool(struct xv_pool *pool)
+{
+	u32 pagenum;
+	struct block_header *block;
+
+	pagenum = xv_alloc_page();
+	if (unlikely(pagenum == INVALID_PGNUM))
+		return -ENOMEM;
+
+	stat_inc(&pool->total_pages);
+
+	spin_lock(&pool->lock);
+	block = get_ptr_atomic(pagenum, 0, KM_USER0);
+
+	block->size = PAGE_SIZE - XV_ALIGN;
+	set_flag(block, BLOCK_FREE);
+	clear_flag(block, PREV_FREE);
+	set_blockprev(block, 0);
+
+	insert_block(pool, pagenum, 0, block);
+
+	put_ptr_atomic(block, KM_USER0);
+	spin_unlock(&pool->lock);
+
+	return 0;
+}
+
+/*
+ * Create a memory pool. Allocates freelist, bitmaps and other
+ * per-pool metadata.
+ */
+struct xv_pool *xv_create_pool(void)
+{
+	int i;
+	u32 ovhd_size;
+	struct xv_pool *pool;
+
+	ovhd_size = ROUNDUP(sizeof(*pool), PAGE_SIZE);
+	pool = kmalloc(ovhd_size, GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	memset(pool, 0, ovhd_size);
+
+	for (i = 0; i < NUM_FREE_LISTS; i++)
+		pool->freelist[i].pagenum = INVALID_PGNUM;
+
+	spin_lock_init(&pool->lock);
+
+	return pool;
+}
+EXPORT_SYMBOL_GPL(xv_create_pool);
+
+void xv_destroy_pool(struct xv_pool *pool)
+{
+	kfree(pool);
+}
+EXPORT_SYMBOL_GPL(xv_destroy_pool);
+
+/**
+ * xv_malloc - Allocate block of given size from pool.
+ * @pool: pool to allocate from
+ * @size: size of block to allocate
+ * @pagenum: page no. that holds the object
+ * @offset: location of object within pagenum
+ *
+ * On success, <pagenum, offset> identifies block allocated
+ * and 0 is returned. On failure, <pagenum, offset> is not touched
+ * and -ENOMEM is returned.
+ *
+ * Allocation requests with size > XV_MAX_ALLOC_SIZE will fail.
+ */
+int xv_malloc(struct xv_pool *pool, u32 size, u32 *pagenum, u32 *offset)
+{
+	int error;
+	u32 index, tmpsize, origsize, tmpoffset;
+	struct block_header *block, *tmpblock = NULL;
+
+	*pagenum = INVALID_PGNUM;
+	*offset = 0;
+	origsize = size;
+
+	if (unlikely(!size || size > XV_MAX_ALLOC_SIZE))
+		return -ENOMEM;
+
+	if (unlikely(size < XV_MIN_ALLOC_SIZE))
+		size = XV_MIN_ALLOC_SIZE;
+	else
+		size = ROUNDUP_ALIGN(size);
+
+	spin_lock(&pool->lock);
+
+	index = find_block(pool, size, pagenum, offset);
+
+	if (*pagenum == INVALID_PGNUM) {
+		spin_unlock(&pool->lock);
+		error = grow_pool(pool);
+		if (unlikely(error))
+			return -ENOMEM;
+
+		spin_lock(&pool->lock);
+		index = find_block(pool, size, pagenum, offset);
+	}
+
+	if (*pagenum == INVALID_PGNUM) {
+		spin_unlock(&pool->lock);
+		return -ENOMEM;
+	}
+
+	block = get_ptr_atomic(*pagenum, *offset, KM_USER0);
+
+	remove_block_head(pool, block, index);
+
+	/* Split the block if required */
+	tmpoffset = *offset + size + XV_ALIGN;
+	tmpsize = block->size - size;
+	tmpblock = (struct block_header *)((char *)block + size + XV_ALIGN);
+	if (tmpsize) {
+		tmpblock->size = tmpsize - XV_ALIGN;
+		set_flag(tmpblock, BLOCK_FREE);
+		clear_flag(tmpblock, PREV_FREE);
+
+		set_blockprev(tmpblock, *offset);
+		if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
+			insert_block(pool, *pagenum, tmpoffset, tmpblock);
+
+		if (tmpoffset + XV_ALIGN + tmpblock->size < PAGE_SIZE) {
+			tmpblock = BLOCK_NEXT(tmpblock);
+			set_blockprev(tmpblock, tmpoffset);
+		}
+	} else {
+		/* This block is exact fit */
+		if (tmpoffset < PAGE_SIZE)
+			clear_flag(tmpblock, PREV_FREE);
+	}
+
+	block->size = origsize;
+	clear_flag(block, BLOCK_FREE);
+
+	put_ptr_atomic(block, KM_USER0);
+	spin_unlock(&pool->lock);
+
+	*offset += XV_ALIGN;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xv_malloc);
+
+/*
+ * Free block identified with <pagenum, offset>
+ */
+void xv_free(struct xv_pool *pool, u32 pagenum, u32 offset)
+{
+	void *page;
+	struct block_header *block, *tmpblock;
+
+	offset -= XV_ALIGN;
+
+	spin_lock(&pool->lock);
+
+	page = get_ptr_atomic(pagenum, 0, KM_USER0);
+	block = (struct block_header *)((char *)page + offset);
+
+	if (unlikely(block->size < XV_MIN_ALLOC_SIZE))
+		block->size = XV_MIN_ALLOC_SIZE;
+	else
+		block->size = ROUNDUP_ALIGN(block->size);
+
+	tmpblock = BLOCK_NEXT(block);
+	if (offset + block->size + XV_ALIGN == PAGE_SIZE)
+		tmpblock = NULL;
+
+	/* Merge next block if it's free */
+	if (tmpblock && test_flag(tmpblock, BLOCK_FREE)) {
+		/*
+		 * Blocks smaller than XV_MIN_ALLOC_SIZE
+		 * are not inserted in any free list.
+		 */
+		if (tmpblock->size >= XV_MIN_ALLOC_SIZE) {
+			remove_block(pool, pagenum,
+				    offset + block->size + XV_ALIGN, tmpblock,
+				    get_index_for_insert(tmpblock->size));
+		}
+		block->size += tmpblock->size + XV_ALIGN;
+	}
+
+	/* Merge previous block if it's free */
+	if (test_flag(block, PREV_FREE)) {
+		tmpblock = (struct block_header *)((char *)(page) +
+						get_blockprev(block));
+		offset = offset - tmpblock->size - XV_ALIGN;
+
+		if (tmpblock->size >= XV_MIN_ALLOC_SIZE)
+			remove_block(pool, pagenum, offset, tmpblock,
+				    get_index_for_insert(tmpblock->size));
+
+		tmpblock->size += block->size + XV_ALIGN;
+		block = tmpblock;
+	}
+
+	/* No used objects in this page. Free it. */
+	if (block->size == PAGE_SIZE - XV_ALIGN) {
+		put_ptr_atomic(page, KM_USER0);
+		spin_unlock(&pool->lock);
+
+		xv_free_page(pagenum);
+		stat_dec(&pool->total_pages);
+		return;
+	}
+
+	set_flag(block, BLOCK_FREE);
+	insert_block(pool, pagenum, offset, block);
+
+	if (offset + block->size < PAGE_SIZE - XV_ALIGN) {
+		tmpblock = BLOCK_NEXT(block);
+		set_flag(tmpblock, PREV_FREE);
+		set_blockprev(tmpblock, offset);
+	}
+
+	put_ptr_atomic(page, KM_USER0);
+	spin_unlock(&pool->lock);
+
+	return;
+}
+EXPORT_SYMBOL_GPL(xv_free);
+
+u32 xv_get_object_size(void *obj)
+{
+	struct block_header *blk;
+
+	blk = (struct block_header *)((char *)(obj) - XV_ALIGN);
+	return blk->size;
+}
+EXPORT_SYMBOL_GPL(xv_get_object_size);
+
+/*
+ * Returns total memory used by allocator (userdata + metadata)
+ */
+u64 xv_get_total_size_bytes(struct xv_pool *pool)
+{
+	return pool->total_pages << PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(xv_get_total_size_bytes);
+
+static int __init xv_malloc_init(void)
+{
+	return 0;
+}
+
+static void __exit xv_malloc_exit(void)
+{
+	return;
+}
+
+module_init(xv_malloc_init);
+module_exit(xv_malloc_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Nitin Gupta <ngupta@vflare.org>");
+MODULE_DESCRIPTION("xvmalloc memory allocator");
diff --git a/mm/xvmalloc_int.h b/mm/xvmalloc_int.h
new file mode 100755
index 0000000..b7ece98
--- /dev/null
+++ b/mm/xvmalloc_int.h
@@ -0,0 +1,95 @@
+/*
+ * xvmalloc_int.h
+ *
+ * Copyright (C) 2008, 2009  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: GPL/LGPL
+ * You can choose the licence that better fits your requirements.
+ *
+ * Released under the terms of the GNU General Public License Version 2.0
+ * Released under the terms of the GNU Lesser General Public License Version 2.1
+ */
+
+#ifndef _XVMALLOC_INT_H_
+#define _XVMALLOC_INT_H_
+
+#include <linux/types.h>
+
+#define INVALID_PGNUM	((u32)(-1))
+
+#define ROUNDUP(x, y)	(((x) + (y) - 1) / (y) * (y))
+/* Each individual bitmap is 32-bit */
+#define BITMAP_BITS	32
+#define BITMAP_SHIFT	5
+#define BITMAP_MASK	(BITMAP_BITS - 1)
+
+/* User configurable params */
+
+/* This must be greater than sizeof(struct link_free) */
+#define XV_MIN_ALLOC_SIZE       32
+#define XV_MAX_ALLOC_SIZE       (PAGE_SIZE - XV_ALIGN)
+
+/* Must be power of two */
+#define XV_ALIGN_SHIFT	2
+#define XV_ALIGN	(1 << XV_ALIGN_SHIFT)
+#define XV_ALIGN_MASK	(XV_ALIGN - 1)
+
+/* Free lists are separated by FL_DELTA bytes */
+#define FL_DELTA_SHIFT	3
+#define FL_DELTA	(1 << FL_DELTA_SHIFT)
+#define FL_DELTA_MASK	(FL_DELTA - 1)
+#define NUM_FREE_LISTS	((XV_MAX_ALLOC_SIZE - XV_MIN_ALLOC_SIZE) \
+				/ FL_DELTA + 1)
+
+#define MAX_FLI		(ROUNDUP(NUM_FREE_LISTS, 32) / 32)
+
+/* End of user params */
+
+#define ROUNDUP_ALIGN(x)	(((x) + XV_ALIGN_MASK) & ~XV_ALIGN_MASK)
+
+enum blockflags {
+	BLOCK_FREE,
+	PREV_FREE,
+	__NR_BLOCKFLAGS,
+};
+
+#define FLAGS_MASK	XV_ALIGN_MASK
+#define PREV_MASK	(~FLAGS_MASK)
+
+struct freelist_entry {
+	u32 pagenum;
+	u16 offset;
+	u16 pad;
+};
+
+struct link_free {
+	u32 prev_pagenum;
+	u32 next_pagenum;
+	u16 prev_offset;
+	u16 next_offset;
+};
+
+struct block_header {
+	union {
+		/* This common header must be ALIGN bytes */
+		u8 common[XV_ALIGN];
+		struct {
+			u16 size;
+			u16 prev;
+		};
+	};
+	struct link_free link;
+};
+
+struct xv_pool {
+	u32 flbitmap;
+	u32 slbitmap[MAX_FLI];
+	spinlock_t lock;
+
+	struct freelist_entry freelist[NUM_FREE_LISTS];
+
+	/* stats */
+	u64 total_pages;
+};
+
+#endif


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 14:12 ` [PATCH 2/3] xvmalloc memory allocator Nitin Gupta
@ 2009-03-20 14:57   ` Christoph Lameter
  2009-03-20 16:24     ` Nitin Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2009-03-20 14:57 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: Pekka Enberg, linux-kernel

On Fri, 20 Mar 2009, Nitin Gupta wrote:

> xvmalloc is a memory allocator designed specifically for compcache project.

It's an allocator that is highmem capable? Looks like an entirely new
animal to me.

> * Features:
>  - Low metadata overhead (just 4 bytes per object)

SLUB has 0 byte overhead. SLOB has 2 bytes.

>  - O(1) Alloc/Free - except when we have to call system page allocator to
>     get additional memory.

SLOB is not O(1) okay but the others are.

>  - Very low fragmentation: In all tests, xvMalloc memory usage is within 12%
>     of "Ideal".

Maybe try a fair test instead of relying on kmalloc rounding up to
the next power of 2 size?

> One of the main highlights is that it maps pages only when required.
> So, it does not hog vmalloc area which is very small on 32-bit systems.

Got some difficulty understanding what is going on here. So this allocator
is highmem capable? Doesn't that mean that you must make function calls to
ensure that an object is mapped before accessing it.

> +#include "xvmalloc_int.h"
> +
> +static void stat_inc(u64 *value)
> +{
> +	*value = *value + 1;
> +}

(*value) += 1?

atomic_inc?

local_inc?

> +static void bitmap_set(u32 *map, u32 idx)
> +{
> +	*map |= (u32)(1 << idx);
> +}

We have bitops for that purpose. Please use those.

> +/*
> + * Get index of free list having blocks of size greater than
> + * or equal to requested size.
> + */
> +static u32 get_index(u32 size)
> +{
> +	size = (size + FL_DELTA_MASK) & ~FL_DELTA_MASK;

See the ALIGN macro.

> +/*
> + * Allocate a memory page. Called when a pool needs to grow.
> + */
> +static u32 xv_alloc_page(void)
> +{
> +	struct page *page;
> +
> +	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM);

Yes a highmem based allocator!!!!

> +
> +	if (unlikely(!page))
> +		return INVALID_PGNUM;

Return NULL?

> +#define INVALID_PGNUM	((u32)(-1))

NULL

> +#define ROUNDUP(x, y)	(((x) + (y) - 1) / (y) * (y))

There is a global macro available for that purpose

> +/* Each individual bitmap is 32-bit */
> +#define BITMAP_BITS	32

Use kernel constants please BITS_PER_LONG here.

> +#define ROUNDUP_ALIGN(x)	(((x) + XV_ALIGN_MASK) & ~XV_ALIGN_MASK)

== ALIGN?

Well I think this allocator is pretty useful for systems that depend to a
large degree on highmem. This is basically x86 32 bit configuration swith
more than 1G memmory.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 14:57   ` Christoph Lameter
@ 2009-03-20 16:24     ` Nitin Gupta
  2009-03-20 17:40       ` Christoph Lameter
  0 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-20 16:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-kernel

Christoph Lameter wrote:
> On Fri, 20 Mar 2009, Nitin Gupta wrote:
> 
>> xvmalloc is a memory allocator designed specifically for compcache project.
> 
> It's an allocator that is highmem capable? Looks like an entirely new
> animal to me.
> 
>> * Features:
>>  - Low metadata overhead (just 4 bytes per object)
> 
> SLUB has 0 byte overhead. SLOB has 2 bytes.

SLUB: 0 metadata but a lot of wastage due to rounding-off.
SLOB: 2-byte header, but I think we should really return objects aligned to
at least 4 bytes on most archs (including x86 and x64). Apart from the smaller
header, it has too many other problems, as I detailed in a previous mail.

> 
>>  - O(1) Alloc/Free - except when we have to call system page allocator to
>>     get additional memory.
> 
> SLOB is not O(1) okay but the others are.

Only SLOB is currently there for good packing, so it is the only one worth
comparing against. We lose this mem-compression game if we waste too much :)

> 
>>  - Very low fragmentation: In all tests, xvMalloc memory usage is within 12%
>>     of "Ideal".
> 
> Maybe try a fair test instead of relying on kmalloc rounding up to
> the next power of 2 size?
> 

Okay, for testing, I will make some wrappers around SLOB that directly use
slob_alloc() to avoid any of this rounding-off. I hope to show some data on
this soon. But considering other SLOB issues, this should not, hopefully,
be a blocker for compcache.


>> One of the main highlights is that it maps pages only when required.
>> So, it does not hog vmalloc area which is very small on 32-bit systems.
> 
> Got some difficulty understanding what is going on here. So this allocator
> is highmem capable? Doesn't that mean that you must make function calls to
> ensure that an object is mapped before accessing it.
> 

Yes. The xvmalloc caller gets a <pagenum, offset> pair. The caller has to
map it separately to get a dereferenceable pointer.
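
A minimal usage sketch of that (xv_malloc() and the <pagenum, offset> pair are
from the patch; the wrapper below and its buffer names are made up here):

#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/xvmalloc.h>

/* Sketch: store 'clen' bytes of compressed data into an xvmalloc pool */
static int store_object(struct xv_pool *pool, const void *cbuf, size_t clen,
			u32 *pagenum, u32 *offset)
{
	void *obj;

	if (xv_malloc(pool, clen, pagenum, offset))
		return -ENOMEM;		/* e.g. fall back to the backing swap */

	obj = kmap_atomic(pfn_to_page(*pagenum), KM_USER0) + *offset;
	memcpy(obj, cbuf, clen);	/* cbuf holds the compressed page */
	kunmap_atomic(obj, KM_USER0);

	return 0;
}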

>> +#include "xvmalloc_int.h"
>> +
>> +static void stat_inc(u64 *value)
>> +{
>> +	*value = *value + 1;
>> +}
> 
> (*value) += 1?
> 
Looks better.

> atomic_inc?
> 
There is really no need to make these stat variables atomic.

> local_inc?
> 

This one looks useful when we work on making compcache code scalable.
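
(A rough sketch of that direction, purely illustrative and not code from the
patch -- a per-cpu local_t counter that can be bumped without atomics:)

	#include <linux/percpu.h>
	#include <asm/local.h>

	/* Hypothetical stat counter, name made up for the example. */
	static DEFINE_PER_CPU(local_t, xv_total_allocs);

	static void stat_inc_total_allocs(void)
	{
		/* get_cpu_var() disables preemption around the update */
		local_inc(&get_cpu_var(xv_total_allocs));
		put_cpu_var(xv_total_allocs);
	}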

>> +static void bitmap_set(u32 *map, u32 idx)
>> +{
>> +	*map |= (u32)(1 << idx);
>> +}
> 
> We have bitops for that purpose. Please use those.
> 

Ok.
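
(Something along these lines, assuming the free-list bitmap becomes an
unsigned long and is already protected by the pool lock, so the non-atomic
variant is enough:)

	#include <linux/bitops.h>

	static void freelist_bitmap_set(unsigned long *map, unsigned int idx)
	{
		__set_bit(idx, map);	/* replaces: *map |= 1 << idx */
	}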

>> +/*
>> + * Get index of free list having blocks of size greater than
>> + * or equal to requested size.
>> + */
>> +static u32 get_index(u32 size)
>> +{
>> +	size = (size + FL_DELTA_MASK) & ~FL_DELTA_MASK;
> 
> See the ALIGN macro.
> 

ALIGN does the same thing - I will use it instead.
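
(i.e. something like the line below, assuming FL_DELTA == FL_DELTA_MASK + 1
and that it stays a power of two:)

	size = ALIGN(size, FL_DELTA);	/* == (size + FL_DELTA_MASK) & ~FL_DELTA_MASK */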

>> +/*
>> + * Allocate a memory page. Called when a pool needs to grow.
>> + */
>> +static u32 xv_alloc_page(void)
>> +{
>> +	struct page *page;
>> +
>> +	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
> 
> Yes a highmem based allocator!!!!
> 
>> +
>> +	if (unlikely(!page))
>> +		return INVALID_PGNUM;
> 
> Return NULL?
> 
>> +#define INVALID_PGNUM	((u32)(-1))
> 
> NULL
> 

okay, INVALID_PGNUM -> NULL

>> +#define ROUNDUP(x, y)	(((x) + (y) - 1) / (y) * (y))
> 
> There is a global macro available for that purpose
> 
Ok, will use that.
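
(Presumably roundup() from include/linux/kernel.h, which expands to the same
arithmetic -- the arguments here are just placeholders:)

	mem = roundup(mem, PAGE_SIZE);	/* was: ROUNDUP(mem, PAGE_SIZE) */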

>> +/* Each individual bitmap is 32-bit */
>> +#define BITMAP_BITS	32
> 
> Use kernel constants please BITS_PER_LONG here.
> 
Ok.

>> +#define ROUNDUP_ALIGN(x)	(((x) + XV_ALIGN_MASK) & ~XV_ALIGN_MASK)
> 
> == ALIGN?

Yup.

> 
> Well, I think this allocator is pretty useful for systems that depend to a
> large degree on highmem. This is basically an x86 32-bit configuration with
> more than 1G of memory.
> 

Not just highmem.
1) It's O(1) and, as a side effect, causes less cache pollution than SLOB,
which does this absolutely crazy freelist scanning.
2) Better packing (I will post comparison data soon).

I think, with a bit playing around with interfaces, it can be turned into
general purpose allocator (this will most probably lack highmem support).


Thanks for your feedback.
Nitin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 16:24     ` Nitin Gupta
@ 2009-03-20 17:40       ` Christoph Lameter
  2009-03-20 19:01         ` Pekka Enberg
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2009-03-20 17:40 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: Pekka Enberg, linux-kernel

On Fri, 20 Mar 2009, Nitin Gupta wrote:

> > Maybe try a fair test instead of relying on kmalloc rounding up to
> > the next power of 2 size?
> >
>
> Okay, for testing, I will make some wrappers around SLOB that directly use
> slob_alloc() to avoid any of this rounding-off. I hope to show some data on
> this soon. But considering other SLOB issues, this should not, hopefully,
> be a blocker for compcache.

SLOB is not rounding off. SLAB/SLUB are. You need to create custom
kmem_caches for this.
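
(For instance, roughly -- the cache name and object size below are arbitrary
examples, not taken from the patch:)

	#include <linux/slab.h>
	#include <linux/errno.h>

	/* One cache per size class, so objects are not rounded up to a
	 * power of two the way kmalloc() sizes are. */
	static struct kmem_cache *cc_cache_320;

	static int cc_caches_init(void)
	{
		cc_cache_320 = kmem_cache_create("cc-320", 320, 0, 0, NULL);
		return cc_cache_320 ? 0 : -ENOMEM;
	}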

> I think, with a bit playing around with interfaces, it can be turned into
> general purpose allocator (this will most probably lack highmem support).

Then it would need to implement the SLAB api (see include/linux/slab.h).
Thus we are getting slab allocator #5.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3]: compressed RAM block device
  2009-03-17 11:36 ` [PATCH 1/3]: compressed RAM block device Nitin Gupta
  2009-03-18 12:25   ` Nick Piggin
@ 2009-03-20 18:56   ` Pavel Machek
  2009-03-20 19:53     ` Nitin Gupta
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Machek @ 2009-03-20 18:56 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: linux-kernel

On Tue 2009-03-17 17:06:46, Nitin Gupta wrote:
>  drivers/block/Kconfig     |   22 +
>  drivers/block/Makefile    |    1 +
>  drivers/block/compcache.c |  995 +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/block/compcache.h |  160 ++++++++
>  4 files changed, 1178 insertions(+), 0 deletions(-)
>
> Creates RAM based block device (ramzswap0) which can be used as swap device.
> Pages swapped to this are compressed and stored in memory itself.
>
> The module is called compcache.ko. It depends on:
>  - xvmalloc.ko: memory allocator
>  - lzo_compress.ko
>  - lzo_decompress.ko
>
> See Documentation/blockdev/compcache.txt for usage details.
>
> Project home: http://code.google.com/p/compcache/


Compcache is really bad name for this. zramdisk? gzrd?

Is the block device useful for general filesystem storage?

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 17:40       ` Christoph Lameter
@ 2009-03-20 19:01         ` Pekka Enberg
  2009-03-20 19:43           ` Nitin Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Pekka Enberg @ 2009-03-20 19:01 UTC (permalink / raw)
  To: cl, nitingupta910; +Cc: Pekka Enberg, linux-kernel@vger.kernel.org


On 3/20/2009, "Christoph Lameter" <cl@linux-foundation.org> wrote:
> > I think, with a bit playing around with interfaces, it can be turned into
> > general purpose allocator (this will most probably lack highmem support).
> 
> Then it would need to implement the SLAB api (see include/linux/slab.h).
> Thus we are getting slab allocator #5.

I do not see the point in that. As I suggested earlier, you should
probably just move this into drivers/block/ and make it a private
compcache allocator.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 19:01         ` Pekka Enberg
@ 2009-03-20 19:43           ` Nitin Gupta
  2009-03-21 10:21             ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-20 19:43 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: cl, linux-kernel@vger.kernel.org

Pekka Enberg wrote:
> On 3/20/2009, "Christoph Lameter" <cl@linux-foundation.org> wrote:
>>> I think, with a bit playing around with interfaces, it can be turned into
>>> general purpose allocator (this will most probably lack highmem support).
>> Then it would need to implement the SLAB api (see include/linux/slab.h).
>> Thus we are getting slab allocator #5.
> 
> I do not see the point in that. As I suggested earlier, you should
> probably just move this into drivers/block/ and make it a private
> compcache allocator.
> 

Your wish. But, really, we should not dismiss an O(1) allocator with great
space-efficiency so easily. I think it will be great at least for embedded
devices (its counterpart, SLOB is simply funny).

Just to add to this, Xen recently included a variant of TLSF allocator
which was used in earlier versions of compcache. TLSF is allocator on
which xvmalloc is based.
http://xenbits.xensource.com/xen-unstable.hg?file/0477f9061c8a/xen/common/xmalloc_tlsf.c
(Though I do not know which parts of xen depend on this allocator).

and xvmalloc vs tlsf arguments are here:
http://code.google.com/p/compcache/wiki/xvMalloc

tlsf:
http://rtportal.upv.es/rtmalloc/

Anyways, I will move it to drivers/block.

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3]: compressed RAM block device
  2009-03-20 18:56   ` Pavel Machek
@ 2009-03-20 19:53     ` Nitin Gupta
  0 siblings, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-20 19:53 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

Pavel Machek wrote:
> On Tue 2009-03-17 17:06:46, Nitin Gupta wrote:
>>  drivers/block/Kconfig     |   22 +
>>  drivers/block/Makefile    |    1 +
>>  drivers/block/compcache.c |  995 +++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/block/compcache.h |  160 ++++++++
>>  4 files changed, 1178 insertions(+), 0 deletions(-)
>>
>> Creates RAM based block device (ramzswap0) which can be used as swap device.
>> Pages swapped to this are compressed and stored in memory itself.
>>
>> The module is called compcache.ko. It depends on:
>>  - xvmalloc.ko: memory allocator
>>  - lzo_compress.ko
>>  - lzo_decompress.ko
>>
>> See Documentation/blockdev/compcache.txt for usage details.
>>
>> Project home: http://code.google.com/p/compcache/
> 
> 
> Compcache is really bad name for this. zramdisk? gzrd?
> 
It was named compcache according to its original goal: compressed caching
for anonymous _and_ filesystem caches. This RAM block device handles
anonymous memory only. Anyway, the actual block device is called ramzswap,
which you might like more :)

> Is the block device useful for general filesystem storage?
> 
No. It can only handle page-aligned I/O. It shouldn't be too hard to make it
a generic compressed ram disk with physical backing device support, but
unfortunately I do not have the bandwidth to do this myself.

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-20 19:43           ` Nitin Gupta
@ 2009-03-21 10:21             ` Andrew Morton
  2009-03-21 12:12               ` Nitin Gupta
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2009-03-21 10:21 UTC (permalink / raw)
  To: ngupta; +Cc: Pekka Enberg, cl, linux-kernel@vger.kernel.org

On Sat, 21 Mar 2009 01:13:42 +0530 Nitin Gupta <ngupta@vflare.org> wrote:

> Pekka Enberg wrote:
> > On 3/20/2009, "Christoph Lameter" <cl@linux-foundation.org> wrote:
> >>> I think, with a bit playing around with interfaces, it can be turned into
> >>> general purpose allocator (this will most probably lack highmem support).
> >> Then it would need to implement the SLAB api (see include/linux/slab.h).
> >> Thus we are getting slab allocator #5.
> > 
> > I do not see the point in that. As I suggested earlier, you should
> > probably just move this into drivers/block/ and make it a private
> > compcache allocator.
> > 
> 
> Your wish. But, really, we should not dismiss an O(1) allocator with great
> space-efficiency so easily. I think it will be great at least for embedded
> devices (its counterpart, SLOB is simply funny).
> 
> Just to add to this, Xen recently included a variant of TLSF allocator
> which was used in earlier versions of compcache. TLSF is allocator on
> which xvmalloc is based.
> http://xenbits.xensource.com/xen-unstable.hg?file/0477f9061c8a/xen/common/xmalloc_tlsf.c
> (Though I do not know which parts of xen depend on this allocator).
> 
> and xvmalloc vs tlsf arguments are here:
> http://code.google.com/p/compcache/wiki/xvMalloc
> 
> tlsf:
> http://rtportal.upv.es/rtmalloc/

Well, xvmalloc may or may not be a good thing and we can discuss that
separately.

But what is regrettable is that xvmalloc appears to be tied to
compressed-swap in some manner.  Is it not possible to split these two
initiatives apart so that neither is dependent upon the other?  Or is
compressed-swap hopelessly crippled without xvmalloc?

(compcache is a terrible name, btw - it isn't a "compressed cache" at all!)

> Anyways, I will move it to drivers/block.

This sounds like it might be a backward step.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-21 10:21             ` Andrew Morton
@ 2009-03-21 12:12               ` Nitin Gupta
  2009-03-21 12:24                 ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Nitin Gupta @ 2009-03-21 12:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Pekka Enberg, cl, linux-kernel@vger.kernel.org

Andrew Morton wrote:
> On Sat, 21 Mar 2009 01:13:42 +0530 Nitin Gupta <ngupta@vflare.org> wrote:

> But what is regrettable is that xvmalloc appears to be tied to
> compressed-swap in some manner.  Is it not possible to split these two
> initiatives apart so that neither is dependent upon the other?  Or is
> compressed-swap hopelessly crippled without xvmalloc?

xvmalloc itself is completely independent of compressed-swap. In fact, it is
loaded as a separate kernel module (xvmalloc.ko).
However, this compression project is almost useless without this specialized
allocator.

> 
> (compcache is a terrible name, btw - it isn't a "compressed cache" at all!)
> 

I have now heard this many times and my conscience is beginning to hurt now :)
I will change it to match the name of its block device: ramzswap - sounds better?

>> Anyways, I will move it to drivers/block.
> 
> This sounds like it might be a backward step.


I'm a bit confused here. The last thing I want to do is block the mainline
merge over such issues. It's a real pain to maintain these things separately.

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-21 12:12               ` Nitin Gupta
@ 2009-03-21 12:24                 ` Andrew Morton
  2009-03-21 13:14                   ` Nitin Gupta
  2009-03-21 16:21                   ` Pekka Enberg
  0 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2009-03-21 12:24 UTC (permalink / raw)
  To: Nitin Gupta; +Cc: Pekka Enberg, cl, linux-kernel@vger.kernel.org

On Sat, 21 Mar 2009 17:42:52 +0530 Nitin Gupta <nitingupta910@gmail.com> wrote:

> Andrew Morton wrote:
> > On Sat, 21 Mar 2009 01:13:42 +0530 Nitin Gupta <ngupta@vflare.org> wrote:
> 
> > But what is regrettable is that xvmalloc appears to be tied to
> > compressed-swap in some manner.  Is it not possible to split these two
> > initiatives apart so that neither is dependent upon the other?  Or is
> > compressed-swap hopelessly crippled without xvmalloc?
> 
> xvmalloc itself is completely independent of compressed-swap. In fact, it is
> loaded as a separate kernel module (xvmalloc.ko).

That sounds good.

> However, this compression project is almost useless without this specialized
> allocator.

Why?  Important information!!

See, being told all this helps us understand why xvmalloc exists.  Plus
once we have a good description of _why_ xvmalloc is needed, perhaps we can
come up with alternatives which are more palatable than merging a whole new
allocator.  Such as enhancing an existing one.

> > 
> > (compcache is a terrible name, btw - it isn't a "compressed cache" at all!)
> > 
> 
> I have now heard this many times and my conscience is beginning to hurt now :)
> I will change it to match the name of its block device: ramzswap - sounds better?

Is there anything swap-specific about it?  It's a block device, yes?  I
should be able to run mkfs.ext2 on it and mount the thing?

> >> Anyways, I will move it to drivers/block.
> > 
> > This sounds like it might be a backward step.
> 
> 
> I'm a bit confused here. The last thing I want to do is block the mainline
> merge over such issues. It's a real pain to maintain these things separately.

This is why I tell myself to never use the word "it" in an email message.

I assumed that you were referring to moving xvmalloc() down into
drivers/block.  That would be bad, because then xvmalloc() will _never_ be
usable by anything other than ramzblock <new name!>?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-21 12:24                 ` Andrew Morton
@ 2009-03-21 13:14                   ` Nitin Gupta
  2009-03-21 16:21                   ` Pekka Enberg
  1 sibling, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-21 13:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Pekka Enberg, cl, linux-kernel@vger.kernel.org

Andrew Morton wrote:
> On Sat, 21 Mar 2009 17:42:52 +0530 Nitin Gupta <nitingupta910@gmail.com> wrote:
> 
>> Andrew Morton wrote:
>>> On Sat, 21 Mar 2009 01:13:42 +0530 Nitin Gupta <ngupta@vflare.org> wrote:
>>> But what is regrettable is that xvmalloc appears to be tied to
>>> compressed-swap in some manner.  Is it not possible to split these two
>>> initiatives apart so that neither is dependent upon the other?  Or is
>>> compressed-swap hopelessly crippled without xvmalloc?
>> xvmalloc itself is completely independent of compressed-swap. In fact, it is
>> loaded as a separate kernel module (xvmalloc.ko).
> 
> That sounds good.
> 
>> However, this compression project is almost useless without this specialized
>> allocator.
> 
> Why?  Important information!!
> 
> See, being told all this helps us understand why xvmalloc exists.  Plus
> once we have a good description of _why_ xvmalloc is needed, perhaps we can
> come up with alternatives which are more palatable than merging a whole new
> allocator.  Such as enhancing an existing one.
> 

xvmalloc is needed by compressed swap because:
  - It is O(1)
  - It is very memory efficient
  - It can use "high memory" for allocation

* space efficiency:

  - comparison with SLUB:
http://code.google.com/p/compcache/wiki/AllocatorsComparison
shows that tlsf (the allocator on which xvmalloc is based) uses ~40% less
memory than kmalloc() backed by SLUB. Christoph suggested creating multiple
slabs of different sizes for this test -- which will be a fairer comparison,
as kmalloc just uses some predefined slabs. I hope to present this data soon.
Also, SLUB is limited to using "low memory" - a blocker issue for the
compressed swap project (on 32-bit systems with >1G RAM). xvmalloc can use
high memory.

  - comparison with SLOB:
In a previous mail in this thread, I explained the issues with SLOB that
make it unacceptable for use in this project.


>>> (compcache is a terrible name, btw - it isn't a "compressed cache" at all!)
>>>
>> I have now heard this many times and my conscience is beginning to hurt now :)
>> I will change it to match the name of its block device: ramzswap - sounds better?
> 
> Is there anything swap-specific about it?  It's a block device, yes?  I
> should be able to run mkfs.ext2 on it and mount the thing?
> 
No. It can handle page-aligned I/O only. Maybe it's not too difficult to
extend it to handle arbitrary I/O, but as a swap device, handling just
page-aligned I/O is good enough.

>>>> Anyways, I will move it to drivers/block.
>>> This sounds like it might be a backward step.
>>
>> I'm a bit confused here. The last thing I want to do is block the mainline
>> merge over such issues. It's a real pain to maintain these things separately.
> 
> This is why I tell myself to never use the word "it" in an email message.
> 
> I assumed that you were referring to moving xvmalloc() down into
> drivers/block.  That would be bad, because then xvmalloc() will _never_ be
> usable by anything other than ramzblock <new name!>?
> 

I was also referring to moving xvmalloc to drivers/block. I meant that, for
now, we could move it to drivers/block if that helps speed up the merge.
Maybe later, if someone else finds it useful too, we can work to move it
back to its real place: mm/ :)

Thanks,
Nitin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-21 12:24                 ` Andrew Morton
  2009-03-21 13:14                   ` Nitin Gupta
@ 2009-03-21 16:21                   ` Pekka Enberg
  2009-03-21 17:36                     ` Nitin Gupta
  1 sibling, 1 reply; 34+ messages in thread
From: Pekka Enberg @ 2009-03-21 16:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nitin Gupta, cl, linux-kernel@vger.kernel.org

On Sat, Mar 21, 2009 at 2:24 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> I assumed that you were referring to moving xvmalloc() down into
> drivers/block.  That would be bad, because then xvmalloc() will _never_ be
> usable by anything other than ramzblock <new name!>?

Who is going to use it? The only reason compcache needs something
special is because it wants to take advantage of GFP_HIGHMEM pages.
Are there other subsystems that need this capability as well?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] xvmalloc memory allocator
  2009-03-21 16:21                   ` Pekka Enberg
@ 2009-03-21 17:36                     ` Nitin Gupta
  0 siblings, 0 replies; 34+ messages in thread
From: Nitin Gupta @ 2009-03-21 17:36 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Andrew Morton, cl, linux-kernel@vger.kernel.org

Pekka Enberg wrote:
> On Sat, Mar 21, 2009 at 2:24 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> I assumed that you were referring to moving xvmalloc() down into
>> drivers/block.  That would be bad, because then xvmalloc() will _never_ be
>> usable by anything other than ramzblock <new name!>?
> 
> Who is going to use it? The only reason compcache needs something
> special is because it wants to take advantage of GFP_HIGHMEM pages.
> Are there other subsystems that need this capability as well?
> 

As I mentioned earlier, highmem is not the only advantage. Don't forget
O(1) alloc/free and low fragmentation. Sometime next week, I will post
additional numbers comparing SLUB and xvmalloc.

One point I noted about SLUB is that it needs to allocate higher-order pages
to minimize the space wasted at the end of every page. For in-memory swap
compression, we simply cannot allocate higher-order pages, since the allocator
is going to be used under a memory crunch (it's a swap device!) and we cannot
hope to find many higher-order pages under such conditions. If we force it to
use 0-order pages, then we cannot allocate objects > 2048b, since all such
allocations end up using an entire page!
Also, if we decide to use SLUB only for objects of size < 2048b, then how do
we store the bigger objects, given that we can only use 0-order pages?
(We need storage for a range of sizes, say [32, 3/4*PAGE_SIZE].)
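
A rough illustration of the numbers (assuming 4K pages and the usual
power-of-two kmalloc size classes, so this is only indicative): a page that
compresses to ~2500 bytes has to come from the 4096-byte cache, so it still
occupies a full page and the compression gain is lost entirely; even a
2100-byte object wastes almost half a page.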

Nitin

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2009-03-21 17:37 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-17 11:34 [PATCH 0/3]: compressed in-memory swapping Nitin Gupta
2009-03-17 11:36 ` [PATCH 1/3]: compressed RAM block device Nitin Gupta
2009-03-18 12:25   ` Nick Piggin
2009-03-18 12:38     ` Nitin Gupta
2009-03-18 12:49       ` Nick Piggin
2009-03-20 18:56   ` Pavel Machek
2009-03-20 19:53     ` Nitin Gupta
2009-03-17 11:37 ` [PATCH 2/3]: xvmalloc memory allocator Nitin Gupta
2009-03-17 16:41   ` Christoph Lameter
2009-03-17 17:55     ` Nitin Gupta
     [not found]     ` <d760cf2d0903171028o600dc94cn7a5238520d104455@mail.gmail.com>
2009-03-17 17:58       ` Christoph Lameter
2009-03-17 18:34         ` Pekka Enberg
2009-03-18 16:07           ` Nitin Gupta
2009-03-18 15:17         ` Nitin Gupta
2009-03-18 16:58           ` Christoph Lameter
2009-03-18 17:29             ` Nitin Gupta
2009-03-18 19:21               ` Pekka Enberg
2009-03-18 19:36                 ` Nitin Gupta
2009-03-19  2:30                 ` Nitin Gupta
2009-03-19  6:08                   ` Pekka Enberg
2009-03-17 11:38 ` [PATCH 3/3]: documentation Nitin Gupta
2009-03-17 23:34 ` [PATCH 0/3]: compressed in-memory swapping Russ Dill
  -- strict thread matches above, loose matches on Subject: below --
2009-03-20 14:07 [PATCH 0/3] compressed in-memory swapping take2 Nitin Gupta
2009-03-20 14:12 ` [PATCH 2/3] xvmalloc memory allocator Nitin Gupta
2009-03-20 14:57   ` Christoph Lameter
2009-03-20 16:24     ` Nitin Gupta
2009-03-20 17:40       ` Christoph Lameter
2009-03-20 19:01         ` Pekka Enberg
2009-03-20 19:43           ` Nitin Gupta
2009-03-21 10:21             ` Andrew Morton
2009-03-21 12:12               ` Nitin Gupta
2009-03-21 12:24                 ` Andrew Morton
2009-03-21 13:14                   ` Nitin Gupta
2009-03-21 16:21                   ` Pekka Enberg
2009-03-21 17:36                     ` Nitin Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).