* [PATCH v2 0/3] staging: zcache: xcfmalloc support
@ 2011-09-07 14:09 Seth Jennings
2011-09-07 14:09 ` [PATCH v2 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-07 14:09 UTC (permalink / raw)
To: gregkh
Cc: dan.magenheimer, ngupta, cascardo, devel, linux-kernel, rdunlap,
linux-mm, rcj, dave, brking, Seth Jennings
Changelog:
v2: fix bug in find_remove_block()
fix whitespace warning at EOF
This patchset introduces a new memory allocator for persistent
pages for zcache. The current allocator is xvmalloc. xvmalloc
has two notable limitations:
* High (up to 50%) external fragmentation on allocation sizes > PAGE_SIZE/2
* No compaction support, which reduces page reclamation
xcfmalloc seeks to fix these issues by using a scatter-gather model that
allows for cross-page allocations and relocatable data blocks.
In tests with pages that compress to only 75% of their original size,
xvmalloc had an effective compression (pages used by the compressed memory
pool / pages stored) of ~95% (~20% lost to fragmentation); almost nothing
was gained by the compression in this case. xcfmalloc had an effective
compression of ~77% (~2% lost to fragmentation and metadata overhead).
xcfmalloc uses the same locking scheme as xvmalloc: a single pool-level
spinlock. This can lead to some contention. However, in my tests on a
4-way SMP system, the contention was minimal (200 contentions out of 600k
acquisitions). The locking scheme may be improved in the future.
In these tests, xcfmalloc and xvmalloc had nearly identical throughput.
While the xcfmalloc design lends itself to compaction, this is not yet
implemented. Support will be added in a follow-on patch.
Based on 3.1-rc4.
=== xvmalloc vs xcfmalloc ===
Here are some comparison metrics vs xvmalloc. The tests were done on
a 32-bit system in a cgroup with a memory.limit_in_bytes of 256MB.
I ran a program that allocates 512MB, one 4k page at a time. The pages
can be filled with zeros or text depending on the command arguments.
The text is English text with an average lzo1x compression ratio of 75%
and little deviation. zv_max_mean_zsize and zv_max_zsize are
both set to 3584 (7 * PAGE_SIZE / 8).
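For reference, here is a minimal sketch of the kind of test program used
(the real program is not included here; the file name and text source are
placeholders):

/*
 * sketch: allocate 512MB one 4k page at a time, filling each page with
 * zeros or with compressible text, then hold the memory so it can be
 * swapped out through frontswap/zcache under the cgroup limit
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096UL
#define TOTAL (512UL << 20)

int main(int argc, char **argv)
{
	int use_text = (argc > 1 && strcmp(argv[1], "text") == 0);
	char text[PAGE_SZ];
	char *buf;
	unsigned long off;

	if (use_text) {
		/* "corpus.txt" is a placeholder for the English text source */
		FILE *f = fopen("corpus.txt", "r");
		if (!f || fread(text, 1, PAGE_SZ, f) != PAGE_SZ) {
			fprintf(stderr, "could not read text source\n");
			return 1;
		}
		fclose(f);
	}

	buf = malloc(TOTAL);
	if (!buf)
		return 1;

	/* touch one page at a time */
	for (off = 0; off < TOTAL; off += PAGE_SZ) {
		if (use_text)
			memcpy(buf + off, text, PAGE_SZ);
		else
			memset(buf + off, 0, PAGE_SZ);
	}

	pause();	/* keep the pages resident in swap/zcache */
	return 0;
}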
xvmalloc
              curr_pages   zv_page_count   effective compression
zero filled        65859            1269         1.93%
text (75%)         65925           65892        99.95%
xcfmalloc (descriptors are 24 bytes, 170 per 4k page)
              curr_pages   zv_page_count   zv_desc_count   effective compression
zero filled        65845            2068           65858         3.72% (+1.79)
text (75%)         65965           50848          114980        78.11% (-21.84)
This shows that xvmalloc is 1.79 points better on zero-filled pages.
This is because xcfmalloc has higher internal fragmentation; its block
sizes aren't as granular as xvmalloc's, which accounts for 1.21 points
of the delta. xcfmalloc also has block descriptors, which account for
the remaining 0.58 points.
It also shows that xcfmalloc is 21.84 points better on text-filled
pages. This is because xcfmalloc allocations can span pages, which
greatly reduces external fragmentation compared to xvmalloc.
I did some quick tests with "time" using the same program, and the
timings are very close (3-run average, little deviation):
xvmalloc:
zero filled 0m0.852s
text (75%) 0m14.415s
xcfmalloc:
zero filled 0m0.870s
text (75%) 0m15.089s
I suspect that the small decrease in throughput is due to the
extra memcpy in xcfmalloc. However, these timings, more than
anything, demonstrate that throughput is greatly affected
by the compressibility of the data.
In all cases, all swapped pages were captured by frontswap with
no put failures.
Seth Jennings (3):
staging: zcache: xcfmalloc memory allocator for zcache
staging: zcache: replace xvmalloc with xcfmalloc
staging: zcache: add zv_page_count and zv_desc_count
drivers/staging/zcache/Makefile | 2 +-
drivers/staging/zcache/xcfmalloc.c | 652 ++++++++++++++++++++++++++++++++++
drivers/staging/zcache/xcfmalloc.h | 28 ++
drivers/staging/zcache/zcache-main.c | 154 ++++++---
4 files changed, 789 insertions(+), 47 deletions(-)
create mode 100644 drivers/staging/zcache/xcfmalloc.c
create mode 100644 drivers/staging/zcache/xcfmalloc.h
--
1.7.4.1
* [PATCH v2 1/3] staging: zcache: xcfmalloc memory allocator for zcache
2011-09-07 14:09 [PATCH v2 0/3] staging: zcache: xcfmalloc support Seth Jennings
@ 2011-09-07 14:09 ` Seth Jennings
2011-09-07 14:09 ` [PATCH v2 2/3] staging: zcache: replace xvmalloc with xcfmalloc Seth Jennings
` (3 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-07 14:09 UTC (permalink / raw)
To: gregkh
Cc: dan.magenheimer, ngupta, cascardo, devel, linux-kernel, rdunlap,
linux-mm, rcj, dave, brking, Seth Jennings
xcfmalloc is a memory allocator based on a scatter-gather model that
allows for cross-page storage and relocatable data blocks. This is
achieved through a simple, though non-standard, memory allocation API.
xcfmalloc is a replacement for the xvmalloc allocator used by zcache
for persistent compressed page storage.
xcfmalloc uses a scatter-gather block model in order to reduce external
fragmentation (at the expense of an extra memcpy). Low fragmentation
is especially important for a zcache allocator since memory will already
be under pressure.
It also uses relocatable data blocks in order to allow compaction to take
place. Compacting seeks to move blocks to other, less sparsely used,
pages such that the maximum number of pages can be returned to the kernel
buddy allocator. Compaction is not yet implemented in this patch.
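For illustration, here is a rough sketch of the intended calling pattern,
based on the declarations in xcfmalloc.h below (error paths are abbreviated
and locking/irq context requirements are omitted; this sketch is not part
of the patch):

#include <linux/errno.h>
#include <linux/slab.h>
#include "xcfmalloc.h"

/* store clen bytes of compressed data and read them back */
static int xcf_usage_sketch(struct xcf_pool *pool, void *cdata, int clen)
{
	struct xcf_handle *handle;
	char *buf;

	/* the allocation may be spread over blocks in different pages,
	 * so data moves in and out through contiguous staging buffers */
	handle = xcf_malloc(pool, clen, GFP_NOWAIT);
	if (!handle)
		return -ENOMEM;
	xcf_write(handle, cdata);

	buf = kmalloc(xcf_get_alloc_size(handle), GFP_ATOMIC);
	if (buf) {
		xcf_read(handle, buf);
		/* ... use the data ... */
		kfree(buf);
	}

	xcf_free(pool, handle);
	return 0;
}

The pool itself would come from xcf_create_pool() and be torn down with
xcf_destroy_pool().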
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
drivers/staging/zcache/Makefile | 2 +-
drivers/staging/zcache/xcfmalloc.c | 652 ++++++++++++++++++++++++++++++++++++
drivers/staging/zcache/xcfmalloc.h | 28 ++
3 files changed, 681 insertions(+), 1 deletions(-)
create mode 100644 drivers/staging/zcache/xcfmalloc.c
create mode 100644 drivers/staging/zcache/xcfmalloc.h
diff --git a/drivers/staging/zcache/Makefile b/drivers/staging/zcache/Makefile
index 60daa27..c7aa772 100644
--- a/drivers/staging/zcache/Makefile
+++ b/drivers/staging/zcache/Makefile
@@ -1,3 +1,3 @@
-zcache-y := zcache-main.o tmem.o
+zcache-y := zcache-main.o tmem.o xcfmalloc.o
obj-$(CONFIG_ZCACHE) += zcache.o
diff --git a/drivers/staging/zcache/xcfmalloc.c b/drivers/staging/zcache/xcfmalloc.c
new file mode 100644
index 0000000..60aceab
--- /dev/null
+++ b/drivers/staging/zcache/xcfmalloc.c
@@ -0,0 +1,652 @@
+/* xcfmalloc.c */
+
+/*
+ * xcfmalloc is a memory allocator based on a scatter-gather model that
+ * allows for cross-page storage and relocatable data blocks. This is
+ * achieved through a simple, though non-standard, memory allocation API.
+ *
+ * xcfmalloc is a replacement for the xvmalloc allocator used by zcache
+ * for persistent compressed page storage.
+ *
+ * xcfmalloc uses a scatter-gather block model in order to reduce external
+ * fragmentation (at the expense of an extra memcpy). Low fragmentation
+ * is especially important for a zcache allocator since memory will already
+ * be under pressure.
+ *
+ * It also uses relocatable data blocks in order to allow compaction to take
+ * place. Compacting seeks to move blocks to other, less sparsely used,
+ * pages such that the maximum number of pages can be returned to the kernel
+ * buddy allocator. NOTE: Compaction is not yet implemented.
+*/
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+
+#include "xcfmalloc.h"
+
+/* turn debugging on */
+#define XCFM_DEBUG
+
+/* tunables */
+#define XCF_BLOCK_SHIFT 6
+#define XCF_MAX_RESERVE_PAGES 32
+#define XCF_MAX_BLOCKS_PER_ALLOC 2
+
+#define XCF_BLOCK_ALIGN (1 << XCF_BLOCK_SHIFT)
+#define XCF_NUM_FREELISTS (PAGE_SIZE >> XCF_BLOCK_SHIFT)
+#define XCF_RESERVED_INDEX (XCF_NUM_FREELISTS - 1)
+
+static inline int xcf_align_to_block(int size)
+{
+ return ALIGN(size, XCF_BLOCK_ALIGN);
+}
+
+static inline int xcf_size_to_flindex(int size)
+{
+
+ return (xcf_align_to_block(size) >> XCF_BLOCK_SHIFT) - 1;
+}
+
+/* pool structure */
+struct xcf_pool {
+ struct xcf_blkdesc *freelists[XCF_NUM_FREELISTS];
+ spinlock_t lock;
+ gfp_t flags; /* flags used for page allocations */
+ int reserve_count; /* number of reserved (completely unused) pages */
+ int page_count; /* number of pages used and reserved */
+ int desc_count; /* number of block descriptors */
+};
+
+/* block descriptor structure and helper functions */
+struct xcf_blkdesc {
+ struct page *page;
+ struct xcf_blkdesc *next;
+ struct xcf_blkdesc *prev;
+ int offset_flags;
+#define XCF_FLAG_MASK (XCF_BLOCK_ALIGN - 1)
+#define XCF_OFFSET_MASK (~XCF_FLAG_MASK)
+#define XCF_BLOCK_FREE (1<<0)
+#define XCF_BLOCK_USED (1<<1)
+ /* size means different things depending on whether the block is
+ * in use or not. If the block is free, the size is the aligned size
+ * of the block and the block can contain an allocation of up to
+ * (size - sizeof(struct xcf_blkhdr)). If the block is in use,
+ * size is the actual number of bytes used by the allocation plus
+ * sizeof(struct xcf_blkhdr) */
+ int size;
+};
+
+static inline void xcf_set_offset(struct xcf_blkdesc *desc, int offset)
+{
+ BUG_ON(offset != (offset & XCF_OFFSET_MASK));
+ desc->offset_flags = offset |
+ (desc->offset_flags & XCF_FLAG_MASK);
+}
+
+static inline int xcf_get_offset(struct xcf_blkdesc *desc)
+{
+ return desc->offset_flags & XCF_OFFSET_MASK;
+}
+
+static inline void xcf_set_flags(struct xcf_blkdesc *desc, int flags)
+{
+ BUG_ON(flags != (flags & XCF_FLAG_MASK));
+ desc->offset_flags = flags |
+ (desc->offset_flags & XCF_OFFSET_MASK);
+}
+
+static inline bool xcf_is_free(struct xcf_blkdesc *desc)
+{
+ return desc->offset_flags & XCF_BLOCK_FREE;
+}
+
+static inline bool xcf_is_used(struct xcf_blkdesc *desc)
+{
+ return desc->offset_flags & XCF_BLOCK_USED;
+}
+
+/* block header structure and helper functions */
+#ifdef XCFM_DEBUG
+#define DECL_SENTINEL int sentinel
+#define SENTINEL 0xdeadbeef
+#define SET_SENTINEL(_obj) ((_obj)->sentinel = SENTINEL)
+#define ASSERT_SENTINEL(_obj) BUG_ON((_obj)->sentinel != SENTINEL)
+#define CLEAR_SENTINEL(_obj) ((_obj)->sentinel = 0)
+#else
+#define DECL_SENTINEL
+#define SET_SENTINEL(_obj) do { } while (0)
+#define ASSERT_SENTINEL(_obj) do { } while (0)
+#define CLEAR_SENTINEL(_obj) do { } while (0)
+#endif
+
+struct xcf_blkhdr {
+ struct xcf_blkdesc *desc;
+ int prevoffset;
+ DECL_SENTINEL;
+};
+
+#define MAX_ALLOC_SIZE (PAGE_SIZE - sizeof(struct xcf_blkhdr))
+
+static inline void *xcf_page_start(struct xcf_blkhdr *block)
+{
+ return (void *)((unsigned long)block & PAGE_MASK);
+}
+
+static inline struct xcf_blkhdr *xcf_block_offset(struct xcf_blkhdr *block,
+ int offset)
+{
+ return (struct xcf_blkhdr *)((char *)block + offset);
+}
+
+static inline bool xcf_same_page(struct xcf_blkhdr *block1,
+ struct xcf_blkhdr *block2)
+{
+ return (xcf_page_start(block1) == xcf_page_start(block2));
+}
+
+static inline void *__xcf_map_block(struct xcf_blkdesc *desc,
+ enum km_type type)
+{
+ return kmap_atomic(desc->page, type) + xcf_get_offset(desc);
+}
+
+static inline void *xcf_map_block(struct xcf_blkdesc *desc, enum km_type type)
+{
+ struct xcf_blkhdr *block;
+ block = __xcf_map_block(desc, type);
+ ASSERT_SENTINEL(block);
+ return block;
+}
+
+static inline void xcf_unmap_block(struct xcf_blkhdr *block, enum km_type type)
+{
+ kunmap_atomic(xcf_page_start(block), type);
+}
+
+/* descriptor memory cache */
+static struct kmem_cache *xcf_desc_cache;
+
+/* descriptor management */
+
+struct xcf_blkdesc *xcf_alloc_desc(struct xcf_pool *pool, gfp_t flags)
+{
+ struct xcf_blkdesc *desc;
+ desc = kmem_cache_zalloc(xcf_desc_cache, flags);
+ if (desc)
+ pool->desc_count++;
+ return desc;
+}
+
+void xcf_free_desc(struct xcf_pool *pool, struct xcf_blkdesc *desc)
+{
+ kmem_cache_free(xcf_desc_cache, desc);
+ pool->desc_count--;
+}
+
+/* block management */
+
+static void xcf_remove_block(struct xcf_pool *pool, struct xcf_blkdesc *desc)
+{
+ u32 flindex;
+
+ BUG_ON(!xcf_is_free(desc));
+ flindex = xcf_size_to_flindex(desc->size);
+ if (flindex == XCF_RESERVED_INDEX)
+ pool->reserve_count--;
+ if (pool->freelists[flindex] == desc)
+ pool->freelists[flindex] = desc->next;
+ if (desc->prev)
+ desc->prev->next = desc->next;
+ if (desc->next)
+ desc->next->prev = desc->prev;
+ desc->next = NULL;
+ desc->prev = NULL;
+}
+
+static void xcf_insert_block(struct xcf_pool *pool, struct xcf_blkdesc *desc)
+{
+ u32 flindex;
+
+ flindex = xcf_size_to_flindex(desc->size);
+ if (flindex == XCF_RESERVED_INDEX)
+ pool->reserve_count++;
+ /* the block size needs to be realigned since it was previously
+ * set to the actual allocation size */
+ desc->size = xcf_align_to_block(desc->size);
+ desc->next = pool->freelists[flindex];
+ desc->prev = NULL;
+ xcf_set_flags(desc, XCF_BLOCK_FREE);
+ pool->freelists[flindex] = desc;
+ if (desc->next)
+ desc->next->prev = desc;
+}
+
+/*
+ * The xcf_find_remove_block function contains the block selection logic for
+ * this allocator.
+ *
+ * This selection pattern has the advantage of utilizing blocks in partially
+ * used pages before considering a completely unused page. This results
+ * in high utilization of the pool pages at the expense of higher allocation
+ * fragmentation (i.e. more blocks per allocation). This is acceptable
+ * though since, at this point, memory is the resource under pressure.
+*/
+static struct xcf_blkdesc *xcf_find_remove_block(struct xcf_pool *pool,
+ int size, int blocknum)
+{
+ int flindex, i;
+ struct xcf_blkdesc *desc = NULL;
+
+ flindex = xcf_size_to_flindex(size + sizeof(struct xcf_blkhdr));
+
+ /* look for best fit */
+ if (pool->freelists[flindex])
+ goto remove;
+
+ /* if this is the last block allowed in the allocation, we shouldn't
+ * consider smaller blocks. it's all or nothing now */
+ if (blocknum != XCF_MAX_BLOCKS_PER_ALLOC) {
+ /* look for largest smaller block */
+ for (i = flindex; i > 0; i--) {
+ if (pool->freelists[i]) {
+ flindex = i;
+ goto remove;
+ }
+ }
+ }
+
+ /* look for smallest larger block */
+ for (i = flindex + 1; i < XCF_NUM_FREELISTS; i++) {
+ if (pool->freelists[i]) {
+ flindex = i;
+ goto remove;
+ }
+ }
+
+ /* if we get here, there are no blocks that satisfy the request */
+ return NULL;
+
+remove:
+ desc = pool->freelists[flindex];
+ xcf_remove_block(pool, desc);
+ return desc;
+}
+
+struct xcf_blkdesc *xcf_merge_free_block(struct xcf_pool *pool,
+ struct xcf_blkdesc *desc)
+{
+ struct xcf_blkdesc *next, *prev;
+ struct xcf_blkhdr *block, *nextblock, *prevblock;
+
+ block = xcf_map_block(desc, KM_USER0);
+
+ prevblock = xcf_block_offset(xcf_page_start(block), block->prevoffset);
+ if (xcf_get_offset(desc) == 0) {
+ prev = NULL;
+ } else {
+ ASSERT_SENTINEL(prevblock);
+ prev = prevblock->desc;
+ }
+
+ nextblock = xcf_block_offset(block, desc->size);
+ if (!xcf_same_page(block, nextblock)) {
+ next = NULL;
+ } else {
+ ASSERT_SENTINEL(nextblock);
+ next = nextblock->desc;
+ }
+
+ /* merge adjacent free blocks in page */
+ if (prev && xcf_is_free(prev)) {
+ /* merge with previous block */
+ xcf_remove_block(pool, prev);
+ prev->size += desc->size;
+ xcf_free_desc(pool, desc);
+ CLEAR_SENTINEL(block);
+ desc = prev;
+ block = prevblock;
+ }
+
+ if (next && xcf_is_free(next)) {
+ /* merge with next block */
+ xcf_remove_block(pool, next);
+ desc->size += next->size;
+ CLEAR_SENTINEL(nextblock);
+ nextblock = xcf_block_offset(nextblock, next->size);
+ xcf_free_desc(pool, next);
+ if (!xcf_same_page(nextblock, block)) {
+ next = NULL;
+ } else {
+ ASSERT_SENTINEL(nextblock);
+ next = nextblock->desc;
+ }
+ }
+
+ if (next)
+ nextblock->prevoffset = xcf_get_offset(desc);
+
+ xcf_unmap_block(block, KM_USER0);
+
+ return desc;
+}
+
+/* pool management */
+
+/* try to get pool->reserve_count up to XCF_MAX_RESERVE_PAGES */
+static void xcf_increase_pool(struct xcf_pool *pool, gfp_t flags)
+{
+ struct xcf_blkdesc *desc;
+ int deficit_pages;
+ struct xcf_blkhdr *block;
+ struct page *page;
+
+ /* if we are at or above our desired number of
+ * reserved pages, nothing to do */
+ if (pool->reserve_count >= XCF_MAX_RESERVE_PAGES)
+ return;
+
+ /* alloc some pages (if we can) */
+ deficit_pages = XCF_MAX_RESERVE_PAGES - pool->reserve_count;
+ while (deficit_pages--) {
+ desc = xcf_alloc_desc(pool, GFP_NOWAIT);
+ if (!desc)
+ break;
+ /* we use our own scatter-gather mechanism that maps high
+ * memory pages under the covers, so we add HIGHMEM
+ * even if the caller did not explicitly request it */
+ page = alloc_page(flags | __GFP_HIGHMEM);
+ if (!page) {
+ xcf_free_desc(pool, desc);
+ break;
+ }
+ pool->page_count++;
+ desc->page = page;
+ xcf_set_offset(desc, 0);
+ desc->size = PAGE_SIZE;
+ xcf_insert_block(pool, desc);
+ block = __xcf_map_block(desc, KM_USER0);
+ block->desc = desc;
+ block->prevoffset = 0;
+ SET_SENTINEL(block);
+ xcf_unmap_block(block, KM_USER0);
+ }
+}
+
+/* tries to get pool->reserve_count down to XCF_MAX_RESERVE_PAGES */
+static void xcf_decrease_pool(struct xcf_pool *pool)
+{
+ struct xcf_blkdesc *desc;
+ int surplus_pages;
+
+ /* if we are at or below our desired number of
+ * reserved pages, nothing to do */
+ if (pool->reserve_count <= XCF_MAX_RESERVE_PAGES)
+ return;
+
+ /* free some pages */
+ surplus_pages = pool->reserve_count - XCF_MAX_RESERVE_PAGES;
+ while (surplus_pages--) {
+ desc = pool->freelists[XCF_RESERVED_INDEX];
+ BUG_ON(!desc);
+ xcf_remove_block(pool, desc);
+ __free_page(desc->page);
+ pool->page_count--;
+ xcf_free_desc(pool, desc);
+ }
+}
+
+/* public functions */
+
+/* return handle to a new allocation of the specified size */
+void *xcf_malloc(struct xcf_pool *pool, int size, gfp_t flags)
+{
+ struct xcf_blkdesc *first, *prev, *next, *desc, *splitdesc;
+ int i, sizeleft, alignedleft;
+ struct xcf_blkhdr *block, *nextblock, *splitblock;
+
+ if (size > MAX_ALLOC_SIZE)
+ return NULL;
+
+ spin_lock(&pool->lock);
+
+ /* check and possibly increase (alloc) pool's page reserves */
+ xcf_increase_pool(pool, flags);
+
+ sizeleft = size;
+ first = NULL;
+ prev = NULL;
+
+ /* find block(s) to contain the allocation */
+ for (i = 1; i <= XCF_MAX_BLOCKS_PER_ALLOC; i++) {
+ desc = xcf_find_remove_block(pool, sizeleft, i);
+ if (!desc)
+ goto return_blocks;
+ if (!first)
+ first = desc;
+ desc->prev = prev;
+ if (prev)
+ prev->next = desc;
+ xcf_set_flags(desc, XCF_BLOCK_USED);
+ if (desc->size >= sizeleft + sizeof(struct xcf_blkhdr))
+ break;
+ sizeleft -= desc->size - sizeof(struct xcf_blkhdr);
+ prev = desc;
+ }
+
+ /* split is only possible on the last block */
+ alignedleft = xcf_align_to_block(sizeleft + sizeof(struct xcf_blkhdr));
+ if (desc->size > alignedleft) {
+ splitdesc = xcf_alloc_desc(pool, GFP_NOWAIT);
+ if (!splitdesc)
+ goto return_blocks;
+ splitdesc->size = desc->size - alignedleft;
+ splitdesc->page = desc->page;
+ xcf_set_offset(splitdesc, xcf_get_offset(desc) + alignedleft);
+
+ block = xcf_map_block(desc, KM_USER0);
+ ASSERT_SENTINEL(block);
+ nextblock = xcf_block_offset(block, desc->size);
+ if (xcf_same_page(nextblock, block))
+ nextblock->prevoffset = xcf_get_offset(splitdesc);
+ splitblock = xcf_block_offset(block, alignedleft);
+ splitblock->desc = splitdesc;
+ splitblock->prevoffset = xcf_get_offset(desc);
+ SET_SENTINEL(splitblock);
+ xcf_unmap_block(block, KM_USER0);
+
+ xcf_insert_block(pool, splitdesc);
+ }
+
+ desc->size = sizeleft + sizeof(struct xcf_blkhdr);
+
+ /* ensure the changes to desc are written before a potential
+ read in xcf_get_alloc_size() since it does not acquire a
+ lock */
+ wmb();
+
+ spin_unlock(&pool->lock);
+ return first;
+
+return_blocks:
+ desc = first;
+ while (desc) {
+ next = desc->next;
+ xcf_insert_block(pool, desc);
+ desc = next;
+ }
+ spin_unlock(&pool->lock);
+ return NULL;
+}
+
+enum xcf_op {
+ XCFMOP_READ,
+ XCFMOP_WRITE
+};
+
+static int access_allocation(void *handle, char *buf, enum xcf_op op)
+{
+ struct xcf_blkdesc *desc;
+ int count, offset;
+ char *block;
+
+ desc = handle;
+ count = XCF_MAX_BLOCKS_PER_ALLOC;
+ offset = 0;
+ while (desc && count--) {
+ BUG_ON(!xcf_is_used(desc));
+ BUG_ON(offset > MAX_ALLOC_SIZE);
+
+ block = xcf_map_block(desc, KM_USER0);
+ if (op == XCFMOP_WRITE)
+ memcpy(block + sizeof(struct xcf_blkhdr),
+ buf + offset,
+ desc->size - sizeof(struct xcf_blkhdr));
+ else /* XCFMOP_READ */
+ memcpy(buf + offset,
+ block + sizeof(struct xcf_blkhdr),
+ desc->size - sizeof(struct xcf_blkhdr));
+ xcf_unmap_block((struct xcf_blkhdr *)block, KM_USER0);
+
+ offset += desc->size - sizeof(struct xcf_blkhdr);
+ desc = desc->next;
+ }
+
+ BUG_ON(desc);
+
+ return 0;
+}
+
+/* write data from buf into allocation */
+int xcf_write(struct xcf_handle *handle, void *buf)
+{
+ return access_allocation(handle, buf, XCFMOP_WRITE);
+}
+
+/* read data from allocation into buf */
+int xcf_read(struct xcf_handle *handle, void *buf)
+{
+ return access_allocation(handle, buf, XCFMOP_READ);
+}
+
+/* free an allocation */
+void xcf_free(struct xcf_pool *pool, struct xcf_handle *handle)
+{
+ struct xcf_blkdesc *desc, *nextdesc;
+ int count;
+
+ spin_lock(&pool->lock);
+
+ desc = (struct xcf_blkdesc *)handle;
+ count = XCF_MAX_BLOCKS_PER_ALLOC;
+ while (desc && count--) {
+ nextdesc = desc->next;
+ BUG_ON(xcf_is_free(desc));
+ xcf_set_flags(desc, XCF_BLOCK_FREE);
+ desc->size = xcf_align_to_block(desc->size);
+ /* xcf_merge_free_block may merge with a previous block which
+ * will cause desc to change */
+ desc = xcf_merge_free_block(pool, desc);
+ xcf_insert_block(pool, desc);
+ desc = nextdesc;
+ }
+
+ BUG_ON(desc);
+
+ /* check and possibly decrease (free) pool's page reserves */
+ xcf_decrease_pool(pool);
+
+ spin_unlock(&pool->lock);
+}
+
+/* create an xcfmalloc memory pool */
+struct xcf_pool *xcf_create_pool(gfp_t flags)
+{
+ struct xcf_pool *pool = NULL;
+
+ if (!xcf_desc_cache)
+ xcf_desc_cache = kmem_cache_create("xcf_desc_cache",
+ sizeof(struct xcf_blkdesc), 0, 0, NULL);
+
+ if (!xcf_desc_cache)
+ goto out;
+
+ pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+ if (!pool)
+ goto out;
+
+ spin_lock_init(&pool->lock);
+
+ xcf_increase_pool(pool, flags);
+
+out:
+ return pool;
+}
+
+/* destroy an xcfmalloc memory pool */
+void xcf_destroy_pool(struct xcf_pool *pool)
+{
+ struct xcf_blkdesc *desc, *nextdesc;
+
+ /* free reserved pages */
+ desc = pool->freelists[XCF_RESERVED_INDEX];
+ while (desc) {
+ nextdesc = desc->next;
+ __free_page(desc->page);
+ xcf_free_desc(pool, desc);
+ desc = nextdesc;
+ }
+
+ kfree(pool);
+}
+
+/* get size of allocation associated with handle */
+int xcf_get_alloc_size(struct xcf_handle *handle)
+{
+ struct xcf_blkdesc *desc;
+ int count, size = 0;
+
+ if (!handle)
+ goto out;
+
+ /* ensure the changes to desc by xcf_malloc() are written before
+ we access here since we don't acquire a lock */
+ rmb();
+
+ desc = (struct xcf_blkdesc *)handle;
+ count = XCF_MAX_BLOCKS_PER_ALLOC;
+ while (desc && count--) {
+ BUG_ON(!xcf_is_used(desc));
+ size += desc->size - sizeof(struct xcf_blkhdr);
+ desc = desc->next;
+ }
+
+out:
+ return size;
+}
+
+/* get total number of pages in use by pool not including the reserved pages */
+u64 xcf_get_total_size_bytes(struct xcf_pool *pool)
+{
+ u64 ret;
+ spin_lock(&pool->lock);
+ ret = pool->page_count - pool->reserve_count;
+ spin_unlock(&pool->lock);
+ return ret << PAGE_SHIFT;
+}
+
+/* get number of descriptors in use by pool not including the descriptors for
+ * reserved pages */
+int xcf_get_desc_count(struct xcf_pool *pool)
+{
+ int ret;
+ spin_lock(&pool->lock);
+ ret = pool->desc_count - pool->reserve_count;
+ spin_unlock(&pool->lock);
+ return ret;
+}
diff --git a/drivers/staging/zcache/xcfmalloc.h b/drivers/staging/zcache/xcfmalloc.h
new file mode 100644
index 0000000..1a8091a
--- /dev/null
+++ b/drivers/staging/zcache/xcfmalloc.h
@@ -0,0 +1,28 @@
+/* xcfmalloc.h */
+
+#ifndef _XCFMALLOC_H_
+#define _XCFMALLOC_H_
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+struct xcf_pool;
+struct xcf_handle {
+ void *desc;
+};
+
+void *xcf_malloc(struct xcf_pool *pool, int size, gfp_t flags);
+void xcf_free(struct xcf_pool *pool, struct xcf_handle *handle);
+
+int xcf_write(struct xcf_handle *handle, void *buf);
+int xcf_read(struct xcf_handle *handle, void *buf);
+
+struct xcf_pool *xcf_create_pool(gfp_t flags);
+void xcf_destroy_pool(struct xcf_pool *pool);
+
+int xcf_get_alloc_size(struct xcf_handle *handle);
+
+u64 xcf_get_total_size_bytes(struct xcf_pool *pool);
+int xcf_get_desc_count(struct xcf_pool *pool);
+
+#endif /* _XCFMALLOC_H_ */
--
1.7.4.1
* [PATCH v2 2/3] staging: zcache: replace xvmalloc with xcfmalloc
2011-09-07 14:09 [PATCH v2 0/3] staging: zcache: xcfmalloc support Seth Jennings
2011-09-07 14:09 ` [PATCH v2 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
@ 2011-09-07 14:09 ` Seth Jennings
2011-09-07 14:09 ` [PATCH v2 3/3] staging: zcache: add zv_page_count and zv_desc_count Seth Jennings
` (2 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-07 14:09 UTC (permalink / raw)
To: gregkh
Cc: dan.magenheimer, ngupta, cascardo, devel, linux-kernel, rdunlap,
linux-mm, rcj, dave, brking, Seth Jennings
This patch replaces xvmalloc with xcfmalloc as the persistent page
allocator for zcache.
Because the API is not the same between xvmalloc and xcfmalloc, the
changes are not a simple find/replace on the function names.
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
drivers/staging/zcache/zcache-main.c | 130 ++++++++++++++++++++++------------
1 files changed, 84 insertions(+), 46 deletions(-)
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index a3f5162..b07377b 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -8,7 +8,7 @@
* and, thus indirectly, for cleancache and frontswap. Zcache includes two
* page-accessible memory [1] interfaces, both utilizing lzo1x compression:
* 1) "compression buddies" ("zbud") is used for ephemeral pages
- * 2) xvmalloc is used for persistent pages.
+ * 2) xcfmalloc is used for persistent pages.
* Xvmalloc (based on the TLSF allocator) has very low fragmentation
* so maximizes space efficiency, while zbud allows pairs (and potentially,
* in the future, more than a pair of) compressed pages to be closely linked
@@ -31,7 +31,7 @@
#include <linux/math64.h>
#include "tmem.h"
-#include "../zram/xvmalloc.h" /* if built in drivers/staging */
+#include "xcfmalloc.h"
#if (!defined(CONFIG_CLEANCACHE) && !defined(CONFIG_FRONTSWAP))
#error "zcache is useless without CONFIG_CLEANCACHE or CONFIG_FRONTSWAP"
@@ -60,7 +60,7 @@ MODULE_LICENSE("GPL");
struct zcache_client {
struct tmem_pool *tmem_pools[MAX_POOLS_PER_CLIENT];
- struct xv_pool *xvpool;
+ struct xcf_pool *xcfmpool;
bool allocated;
atomic_t refcount;
};
@@ -623,9 +623,8 @@ static int zbud_show_cumul_chunk_counts(char *buf)
#endif
/**********
- * This "zv" PAM implementation combines the TLSF-based xvMalloc
- * with lzo1x compression to maximize the amount of data that can
- * be packed into a physical page.
+ * This "zv" PAM implementation combines xcfmalloc with lzo1x compression
+ * to maximize the amount of data that can be packed into a physical page.
*
* Zv represents a PAM page with the index and object (plus a "size" value
* necessary for decompression) immediately preceding the compressed data.
@@ -658,71 +657,97 @@ static unsigned int zv_max_mean_zsize = (PAGE_SIZE / 8) * 5;
static unsigned long zv_curr_dist_counts[NCHUNKS];
static unsigned long zv_cumul_dist_counts[NCHUNKS];
-static struct zv_hdr *zv_create(struct xv_pool *xvpool, uint32_t pool_id,
+static DEFINE_PER_CPU(unsigned char *, zv_cbuf); /* zv create buffer */
+static DEFINE_PER_CPU(unsigned char *, zv_dbuf); /* zv decompress buffer */
+
+static int zv_cpu_notifier(struct notifier_block *nb,
+ unsigned long action, void *pcpu)
+{
+ int cpu = (long)pcpu;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ per_cpu(zv_cbuf, cpu) = (void *)__get_free_page(
+ GFP_KERNEL | __GFP_REPEAT);
+ per_cpu(zv_dbuf, cpu) = (void *)__get_free_page(
+ GFP_KERNEL | __GFP_REPEAT);
+ break;
+ case CPU_DEAD:
+ case CPU_UP_CANCELED:
+ free_page((unsigned long)per_cpu(zv_cbuf, cpu));
+ per_cpu(zv_cbuf, cpu) = NULL;
+ free_page((unsigned long)per_cpu(zv_dbuf, cpu));
+ per_cpu(zv_dbuf, cpu) = NULL;
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+
+static void **zv_create(struct xcf_pool *xcfmpool, uint32_t pool_id,
struct tmem_oid *oid, uint32_t index,
void *cdata, unsigned clen)
{
- struct page *page;
- struct zv_hdr *zv = NULL;
- uint32_t offset;
- int alloc_size = clen + sizeof(struct zv_hdr);
- int chunks = (alloc_size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
- int ret;
+ struct zv_hdr *zv;
+ u32 size = clen + sizeof(struct zv_hdr);
+ int chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
+ void *handle;
BUG_ON(!irqs_disabled());
BUG_ON(chunks >= NCHUNKS);
- ret = xv_malloc(xvpool, alloc_size,
- &page, &offset, ZCACHE_GFP_MASK);
- if (unlikely(ret))
- goto out;
+
+ handle = xcf_malloc(xcfmpool, size, ZCACHE_GFP_MASK);
+ if (!handle)
+ return NULL;
+
zv_curr_dist_counts[chunks]++;
zv_cumul_dist_counts[chunks]++;
- zv = kmap_atomic(page, KM_USER0) + offset;
+
+ zv = (struct zv_hdr *)((char *)cdata - sizeof(*zv));
zv->index = index;
zv->oid = *oid;
zv->pool_id = pool_id;
SET_SENTINEL(zv, ZVH);
- memcpy((char *)zv + sizeof(struct zv_hdr), cdata, clen);
- kunmap_atomic(zv, KM_USER0);
-out:
- return zv;
+ xcf_write(handle, zv);
+
+ return handle;
}
-static void zv_free(struct xv_pool *xvpool, struct zv_hdr *zv)
+static void zv_free(struct xcf_pool *xcfmpool, void *handle)
{
unsigned long flags;
- struct page *page;
- uint32_t offset;
- uint16_t size = xv_get_object_size(zv);
+ u32 size = xcf_get_alloc_size(handle);
int chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
- ASSERT_SENTINEL(zv, ZVH);
BUG_ON(chunks >= NCHUNKS);
zv_curr_dist_counts[chunks]--;
- size -= sizeof(*zv);
- BUG_ON(size == 0);
- INVERT_SENTINEL(zv, ZVH);
- page = virt_to_page(zv);
- offset = (unsigned long)zv & ~PAGE_MASK;
+
local_irq_save(flags);
- xv_free(xvpool, page, offset);
+ xcf_free(xcfmpool, handle);
local_irq_restore(flags);
}
-static void zv_decompress(struct page *page, struct zv_hdr *zv)
+static void zv_decompress(struct page *page, void *handle)
{
size_t clen = PAGE_SIZE;
char *to_va;
unsigned size;
int ret;
+ struct zv_hdr *zv;
- ASSERT_SENTINEL(zv, ZVH);
- size = xv_get_object_size(zv) - sizeof(*zv);
+ size = xcf_get_alloc_size(handle) - sizeof(*zv);
BUG_ON(size == 0);
+ zv = (struct zv_hdr *)(get_cpu_var(zv_dbuf));
+ xcf_read(handle, zv);
+ ASSERT_SENTINEL(zv, ZVH);
to_va = kmap_atomic(page, KM_USER0);
ret = lzo1x_decompress_safe((char *)zv + sizeof(*zv),
size, to_va, &clen);
kunmap_atomic(to_va, KM_USER0);
+ put_cpu_var(zv_dbuf);
BUG_ON(ret != LZO_E_OK);
BUG_ON(clen != PAGE_SIZE);
}
@@ -949,8 +974,9 @@ int zcache_new_client(uint16_t cli_id)
goto out;
cli->allocated = 1;
#ifdef CONFIG_FRONTSWAP
- cli->xvpool = xv_create_pool();
- if (cli->xvpool == NULL)
+ cli->xcfmpool =
+ xcf_create_pool(ZCACHE_GFP_MASK);
+ if (cli->xcfmpool == NULL)
goto out;
#endif
ret = 0;
@@ -1154,7 +1180,7 @@ static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
struct tmem_pool *pool, struct tmem_oid *oid,
uint32_t index)
{
- void *pampd = NULL, *cdata;
+ void *pampd = NULL, *cdata = NULL;
size_t clen;
int ret;
unsigned long count;
@@ -1186,26 +1212,30 @@ static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
if (curr_pers_pampd_count >
(zv_page_count_policy_percent * totalram_pages) / 100)
goto out;
+ cdata = get_cpu_var(zv_cbuf) + sizeof(struct zv_hdr);
ret = zcache_compress(page, &cdata, &clen);
if (ret == 0)
goto out;
/* reject if compression is too poor */
if (clen > zv_max_zsize) {
zcache_compress_poor++;
+ put_cpu_var(zv_cbuf);
goto out;
}
/* reject if mean compression is too poor */
if ((clen > zv_max_mean_zsize) && (curr_pers_pampd_count > 0)) {
- total_zsize = xv_get_total_size_bytes(cli->xvpool);
+ total_zsize = xcf_get_total_size_bytes(cli->xcfmpool);
zv_mean_zsize = div_u64(total_zsize,
curr_pers_pampd_count);
if (zv_mean_zsize > zv_max_mean_zsize) {
zcache_mean_compress_poor++;
+ put_cpu_var(zv_cbuf);
goto out;
}
}
- pampd = (void *)zv_create(cli->xvpool, pool->pool_id,
+ pampd = (void *)zv_create(cli->xcfmpool, pool->pool_id,
oid, index, cdata, clen);
+ put_cpu_var(zv_cbuf);
if (pampd == NULL)
goto out;
count = atomic_inc_return(&zcache_curr_pers_pampd_count);
@@ -1262,7 +1292,7 @@ static void zcache_pampd_free(void *pampd, struct tmem_pool *pool,
atomic_dec(&zcache_curr_eph_pampd_count);
BUG_ON(atomic_read(&zcache_curr_eph_pampd_count) < 0);
} else {
- zv_free(cli->xvpool, (struct zv_hdr *)pampd);
+ zv_free(cli->xcfmpool, pampd);
atomic_dec(&zcache_curr_pers_pampd_count);
BUG_ON(atomic_read(&zcache_curr_pers_pampd_count) < 0);
}
@@ -1309,11 +1339,16 @@ static DEFINE_PER_CPU(unsigned char *, zcache_dstmem);
static int zcache_compress(struct page *from, void **out_va, size_t *out_len)
{
int ret = 0;
- unsigned char *dmem = __get_cpu_var(zcache_dstmem);
+ unsigned char *dmem;
unsigned char *wmem = __get_cpu_var(zcache_workmem);
char *from_va;
BUG_ON(!irqs_disabled());
+ if (out_va && *out_va)
+ dmem = *out_va;
+ else
+ dmem = __get_cpu_var(zcache_dstmem);
+
if (unlikely(dmem == NULL || wmem == NULL))
goto out; /* no buffer, so can't compress */
from_va = kmap_atomic(from, KM_USER0);
@@ -1331,7 +1366,7 @@ out:
static int zcache_cpu_notifier(struct notifier_block *nb,
unsigned long action, void *pcpu)
{
- int cpu = (long)pcpu;
+ int ret, cpu = (long)pcpu;
struct zcache_preload *kp;
switch (action) {
@@ -1363,7 +1398,10 @@ static int zcache_cpu_notifier(struct notifier_block *nb,
default:
break;
}
- return NOTIFY_OK;
+
+ ret = zv_cpu_notifier(nb, action, pcpu);
+
+ return ret;
}
static struct notifier_block zcache_cpu_notifier_block = {
@@ -1991,7 +2029,7 @@ static int __init zcache_init(void)
old_ops = zcache_frontswap_register_ops();
pr_info("zcache: frontswap enabled using kernel "
- "transcendent memory and xvmalloc\n");
+ "transcendent memory and xcfmalloc\n");
if (old_ops.init != NULL)
pr_warning("ktmem: frontswap_ops overridden");
}
--
1.7.4.1
* [PATCH v2 3/3] staging: zcache: add zv_page_count and zv_desc_count
2011-09-07 14:09 [PATCH v2 0/3] staging: zcache: xcfmalloc support Seth Jennings
2011-09-07 14:09 ` [PATCH v2 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
2011-09-07 14:09 ` [PATCH v2 2/3] staging: zcache: replace xvmalloc with xcfmalloc Seth Jennings
@ 2011-09-07 14:09 ` Seth Jennings
2011-09-09 20:34 ` [PATCH v2 0/3] staging: zcache: xcfmalloc support Greg KH
2011-09-29 17:47 ` Seth Jennings
4 siblings, 0 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-07 14:09 UTC (permalink / raw)
To: gregkh
Cc: dan.magenheimer, ngupta, cascardo, devel, linux-kernel, rdunlap,
linux-mm, rcj, dave, brking, Seth Jennings
This patch adds the zv_page_count and zv_desc_count attributes
to the zcache sysfs. They are read-only attributes and return
the number of pages and the number of block descriptors in use
by the pool respectively.
These statistics can be used to calculate effective compression
and block descriptor overhead for the xcfmalloc allocator.
Using the frontswap curr_pages attribute, effective compression
is: zv_page_count / curr_pages
Using /proc/slabinfo to get the objsize for an xcf_desc_cache
object, descriptor overhead is: zv_desc_count * objsize
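As an example, here is a small userspace sketch of the calculation (the
sysfs paths and the hard-coded objsize are assumptions; check the actual
sysfs layout and /proc/slabinfo on the running kernel):

#include <stdio.h>

static unsigned long read_ul(const char *path)
{
	unsigned long val = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	/* assumed locations of the counters */
	unsigned long zv_pages =
		read_ul("/sys/kernel/mm/zcache/zv_page_count");
	unsigned long zv_descs =
		read_ul("/sys/kernel/mm/zcache/zv_desc_count");
	unsigned long curr =
		read_ul("/sys/kernel/mm/frontswap/curr_pages");
	unsigned long objsize = 24;	/* xcf_desc_cache objsize from slabinfo */

	if (curr)
		printf("effective compression: %.2f%%\n",
		       100.0 * zv_pages / curr);
	printf("descriptor overhead: %lu bytes\n", zv_descs * objsize);
	return 0;
}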
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
drivers/staging/zcache/zcache-main.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index b07377b..6adbbbe 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -789,6 +789,24 @@ static int zv_cumul_dist_counts_show(char *buf)
return p - buf;
}
+static int zv_page_count_show(char *buf)
+{
+ char *p = buf;
+ unsigned long count;
+ count = xcf_get_total_size_bytes(zcache_host.xcfmpool) >> PAGE_SHIFT;
+ p += sprintf(p, "%lu\n", count);
+ return p - buf;
+}
+
+static int zv_desc_count_show(char *buf)
+{
+ char *p = buf;
+ unsigned long count;
+ count = xcf_get_desc_count(zcache_host.xcfmpool);
+ p += sprintf(p, "%lu\n", count);
+ return p - buf;
+}
+
/*
* setting zv_max_zsize via sysfs causes all persistent (e.g. swap)
* pages that don't compress to less than this value (including metadata
@@ -1477,6 +1495,10 @@ ZCACHE_SYSFS_RO_CUSTOM(zv_curr_dist_counts,
zv_curr_dist_counts_show);
ZCACHE_SYSFS_RO_CUSTOM(zv_cumul_dist_counts,
zv_cumul_dist_counts_show);
+ZCACHE_SYSFS_RO_CUSTOM(zv_page_count,
+ zv_page_count_show);
+ZCACHE_SYSFS_RO_CUSTOM(zv_desc_count,
+ zv_desc_count_show);
static struct attribute *zcache_attrs[] = {
&zcache_curr_obj_count_attr.attr,
@@ -1513,6 +1535,8 @@ static struct attribute *zcache_attrs[] = {
&zcache_zv_max_zsize_attr.attr,
&zcache_zv_max_mean_zsize_attr.attr,
&zcache_zv_page_count_policy_percent_attr.attr,
+ &zcache_zv_page_count_attr.attr,
+ &zcache_zv_desc_count_attr.attr,
NULL,
};
--
1.7.4.1
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-07 14:09 [PATCH v2 0/3] staging: zcache: xcfmalloc support Seth Jennings
` (2 preceding siblings ...)
2011-09-07 14:09 ` [PATCH v2 3/3] staging: zcache: add zv_page_count and zv_desc_count Seth Jennings
@ 2011-09-09 20:34 ` Greg KH
2011-09-10 2:41 ` Nitin Gupta
2011-09-29 17:47 ` Seth Jennings
4 siblings, 1 reply; 28+ messages in thread
From: Greg KH @ 2011-09-09 20:34 UTC (permalink / raw)
To: Seth Jennings
Cc: gregkh, devel, dan.magenheimer, cascardo, linux-kernel, dave,
linux-mm, brking, rcj, ngupta
On Wed, Sep 07, 2011 at 09:09:04AM -0500, Seth Jennings wrote:
> Changelog:
> v2: fix bug in find_remove_block()
> fix whitespace warning at EOF
>
> This patchset introduces a new memory allocator for persistent
> pages for zcache. The current allocator is xvmalloc. xvmalloc
> has two notable limitations:
> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
> * No compaction support which reduces page reclaimation
I need some acks from other zcache developers before I can accept this.
{hint...}
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-09 20:34 ` [PATCH v2 0/3] staging: zcache: xcfmalloc support Greg KH
@ 2011-09-10 2:41 ` Nitin Gupta
2011-09-12 14:35 ` Seth Jennings
0 siblings, 1 reply; 28+ messages in thread
From: Nitin Gupta @ 2011-09-10 2:41 UTC (permalink / raw)
To: Greg KH
Cc: Seth Jennings, gregkh, devel, dan.magenheimer, cascardo,
linux-kernel, dave, linux-mm, brking, rcj
On 09/09/2011 04:34 PM, Greg KH wrote:
> On Wed, Sep 07, 2011 at 09:09:04AM -0500, Seth Jennings wrote:
>> Changelog:
>> v2: fix bug in find_remove_block()
>> fix whitespace warning at EOF
>>
>> This patchset introduces a new memory allocator for persistent
>> pages for zcache. The current allocator is xvmalloc. xvmalloc
>> has two notable limitations:
>> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
>> * No compaction support which reduces page reclaimation
>
> I need some acks from other zcache developers before I can accept this.
>
First, thanks for this new allocator; xvmalloc badly needed a replacement :)
I went through xcfmalloc in detail and would be posting detailed
comments tomorrow. In general, it seems to be quite similar to the
"chunk based" allocator used in initial implementation of "compcache" --
please see section 2.3.1 in this paper:
http://www.linuxsymposium.org/archives/OLS/Reprints-2007/briglia-Reprint.pdf
I'm really looking forward to a slab based allocator as I mentioned in
the initial mail:
http://permalink.gmane.org/gmane.linux.kernel.mm/65467
With the current design xcfmalloc suffers from issues similar to the
allocator described in the paper:
- High metadata overhead
- Difficult implementation of compaction
- Need for extra memcpy()s etc.
With slab based approach, we can almost eliminate any metadata overhead,
remove any free chunk merging logic, simplify compaction and so on.
Thanks,
Nitin
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-10 2:41 ` Nitin Gupta
@ 2011-09-12 14:35 ` Seth Jennings
2011-09-13 1:55 ` Nitin Gupta
0 siblings, 1 reply; 28+ messages in thread
From: Seth Jennings @ 2011-09-12 14:35 UTC (permalink / raw)
To: Nitin Gupta
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
On 09/09/2011 09:41 PM, Nitin Gupta wrote:
> On 09/09/2011 04:34 PM, Greg KH wrote:
>
>> On Wed, Sep 07, 2011 at 09:09:04AM -0500, Seth Jennings wrote:
>>> Changelog:
>>> v2: fix bug in find_remove_block()
>>> fix whitespace warning at EOF
>>>
>>> This patchset introduces a new memory allocator for persistent
>>> pages for zcache. The current allocator is xvmalloc. xvmalloc
>>> has two notable limitations:
>>> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
>>> * No compaction support which reduces page reclaimation
>>
>> I need some acks from other zcache developers before I can accept this.
>>
>
> First, thanks for this new allocator; xvmalloc badly needed a replacement :)
>
Hey Nitin, I hope your internship went well :) It's good to hear from you.
> I went through xcfmalloc in detail and would be posting detailed
> comments tomorrow. In general, it seems to be quite similar to the
> "chunk based" allocator used in initial implementation of "compcache" --
> please see section 2.3.1 in this paper:
> http://www.linuxsymposium.org/archives/OLS/Reprints-2007/briglia-Reprint.pdf
>
Ah, indeed they look similar. I didn't know that this approach
had already been done before in the history of this project.
> I'm really looking forward to a slab based allocator as I mentioned in
> the initial mail:
> http://permalink.gmane.org/gmane.linux.kernel.mm/65467
>
> With the current design xcfmalloc suffers from issues similar to the
> allocator described in the paper:
> - High metadata overhead
> - Difficult implementation of compaction
> - Need for extra memcpy()s etc.
>
> With slab based approach, we can almost eliminate any metadata overhead,
> remove any free chunk merging logic, simplify compaction and so on.
>
Just to align my understanding with yours, when I hear slab-based,
I'm thinking each page in the compressed memory pool will contain
1 or more blocks that are all the same size. Is this what you mean?
If so, I'm not sure how changing to a slab-based system would eliminate
metadata overhead or do away with memcpy()s.
The memcpy()s are a side effect of having an allocation spread over
blocks in different pages. I'm not seeing a way around this.
It also follows that the blocks that make up an allocation must be in
a list of some kind, leading to some amount of metadata overhead.
If you want to do compaction, it follows that you can't give the user
a direct pointer to the data, since the location of that data may change.
In this case, an indirection layer is required (i.e. xcf_blkdesc and
xcf_read()/xcf_write()).
The only part of the metadata that could be done away with in a slab-
based approach, as far as I can see, is the prevoffset field in xcf_blkhdr,
since the size of the previous block in the page (or the previous object
in the slab) can be inferred from the size of the current block/object.
I do agree that we don't have to worry about free block merging in a
slab-based system.
I didn't implement compaction so a slab-based system could very well
make it easier. I guess it depends on how one ends up doing it.
Anyway, I look forward to your detailed comments :)
--
Seth
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-12 14:35 ` Seth Jennings
@ 2011-09-13 1:55 ` Nitin Gupta
2011-09-13 15:58 ` Seth Jennings
0 siblings, 1 reply; 28+ messages in thread
From: Nitin Gupta @ 2011-09-13 1:55 UTC (permalink / raw)
To: Seth Jennings
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
Hi Seth,
I revised some of the original plans for xcfmalloc and below are some
details. I had a few nits regarding the current implementation, but I'm
avoiding them here since we may have to change the design itself
significantly.
On 09/12/2011 10:35 AM, Seth Jennings wrote:
> On 09/09/2011 09:41 PM, Nitin Gupta wrote:
>> On 09/09/2011 04:34 PM, Greg KH wrote:
>>
>>> On Wed, Sep 07, 2011 at 09:09:04AM -0500, Seth Jennings wrote:
>>>> Changelog:
>>>> v2: fix bug in find_remove_block()
>>>> fix whitespace warning at EOF
>>>>
>>>> This patchset introduces a new memory allocator for persistent
>>>> pages for zcache. The current allocator is xvmalloc. xvmalloc
>>>> has two notable limitations:
>>>> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
>>>> * No compaction support which reduces page reclaimation
>>>
>>> I need some acks from other zcache developers before I can accept this.
>>>
>>
>> First, thanks for this new allocator; xvmalloc badly needed a replacement :)
>>
>
> Hey Nitin, I hope your internship went well :) It's good to hear from you.
>
Yes, it went well and now I can spend more time on this project :)
>> I went through xcfmalloc in detail and would be posting detailed
>> comments tomorrow. In general, it seems to be quite similar to the
>> "chunk based" allocator used in initial implementation of "compcache" --
>> please see section 2.3.1 in this paper:
>> http://www.linuxsymposium.org/archives/OLS/Reprints-2007/briglia-Reprint.pdf
>>
>
> Ah, indeed they look similar. I didn't know that this approach
> had already been done before in the history of this project.
>
>> I'm really looking forward to a slab based allocator as I mentioned in
>> the initial mail:
>> http://permalink.gmane.org/gmane.linux.kernel.mm/65467
>>
>> With the current design xcfmalloc suffers from issues similar to the
>> allocator described in the paper:
>> - High metadata overhead
>> - Difficult implementation of compaction
>> - Need for extra memcpy()s etc.
>>
>> With slab based approach, we can almost eliminate any metadata overhead,
>> remove any free chunk merging logic, simplify compaction and so on.
>>
>
> Just to align my understanding with yours, when I hear slab-based,
> I'm thinking each page in the compressed memory pool will contain
> 1 or more blocks that are all the same size. Is this what you mean?
>
Yes, exactly. The memory pool will consist of "superblocks" (typically
16K or 64K). Each of these superblocks will contain objects of only one
particular size (which is its size class). This is the general
structure of all slab allocators. In particular, I'm planning to use
many of the ideas discussed in this paper:
http://www.cs.umass.edu/~emery/hoard/asplos2000.pdf
One major point to consider would be that these superblocks cannot be
physically contiguous in our case, so we will have to do some map/unmap
trickery. The basic idea is to link together individual pages
(typically 4k) using the underlying struct page->lru to form superblocks and
map/unmap objects on demand.
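To illustrate, a minimal sketch of the page-linking idea (names, the
superblock size and error handling are just placeholders, not a proposed
implementation):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

#define SB_PAGES 4	/* e.g. a 16K superblock built from 4K pages */

struct sb_sketch {
	struct list_head pages;		/* 0-order pages linked via page->lru */
};

static int sb_sketch_fill(struct sb_sketch *sb, gfp_t flags)
{
	int i;

	INIT_LIST_HEAD(&sb->pages);
	for (i = 0; i < SB_PAGES; i++) {
		struct page *page = alloc_page(flags);

		if (!page)
			return -ENOMEM;	/* unwind omitted */
		list_add_tail(&page->lru, &sb->pages);
	}
	return 0;
}

/* map the whole superblock virtually contiguous for an object that
 * crosses a page boundary; single-page objects could use kmap_atomic() */
static void *sb_sketch_map(struct sb_sketch *sb)
{
	struct page *pages[SB_PAGES], *page;
	int i = 0;

	list_for_each_entry(page, &sb->pages, lru)
		pages[i++] = page;
	return vm_map_ram(pages, SB_PAGES, -1, PAGE_KERNEL);
}

static void sb_sketch_unmap(void *va)
{
	vm_unmap_ram(va, SB_PAGES);
}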
> If so, I'm not sure how changing to a slab-based system would eliminate
> metadata overhead or do away with memcpy()s.
>
With slab based approach, the allocator itself need not store any
metadata with allocated objects. However, considering zcache and zram
use-cases, the caller will still have to request additional space for
per-object header: actual object size and back-reference (which
inode/page-idx this object belongs to) needed for compaction.
For free-list management, the underlying struct page and the free object
space itself can be used. Some field in the struct page can point to the
first free object in a page and free slab objects themselves will
contain links to next/previous free objects in the page.
> The memcpy()s are a side effect of having an allocation spread over
> blocks in different pages. I'm not seeing a way around this.
>
For slab objects that span 2 pages, we can use vm_map_ram() to
temporarily map pages involved and read/write to objects directly. For
objects lying entirely within a page, we can use much faster
kmap_atomic() for access.
> It also follows that the blocks that make up an allocation must be in
> a list of some kind, leading to some amount of metadata overhead.
>
Used objects need not be placed in any list. For free objects we can use
underlying struct page and free object space itself to manage free list,
as described above.
> If you want to do compaction, it follows that you can't give the user
> a direct pointer to the data, since the location of that data may change.
> In this case, an indirection layer is required (i.e. xcf_blkdesc and
> xcf_read()/xcf_write()).
>
Yes, we can't give a direct pointer anyway since pages used by the
allocator are not permanently mapped (to save precious VA space on
32-bit). Still, we can save on much of the metadata overhead and extra
memcpy() as described above.
> The only part of the metadata that could be done away with in a slab-
> based approach, as far as I can see, is the prevoffset field in xcf_blkhdr,
> since the size of the previous block in the page (or the previous object
> in the slab) can be inferred from the size of the current block/object.
>
> I do agree that we don't have to worry about free block merging in a
> slab-based system.
>
> I didn't implement compaction so a slab-based system could very well
> make it easier. I guess it depends on how one ends up doing it.
>
I expect compaction to be much easier with a slab-like design since
finding a target for relocating objects is so simple. You don't have to
deal with all those little unusable holes created throughout the heap
when a mix-all-object-sizes-together approach is used.
For compaction, I plan to have a scheme where the user (zcache/zram)
would "cooperate" with the process. Here is a rough outline for
relocating an object (a sketch of the callback interface follows this outline):
- For any size class, when the emptiness ratio (total space allocated
/ actual space used) exceeds some threshold, the allocator will copy objects
from the least used slab superblocks to the most used ones.
- It will then issue a callback, registered by the user during pool
creation, informing the user of the new object location.
- For various reasons, this callback may return failure, in which case
the relocation is considered failed and we discard the newly created copy.
- If the callback succeeds, it means that the user successfully updated its
object reference and we can now mark the original copy as free.
We can also batch this process (copy all objects in the victim superblock at
once to other superblocks and issue a single callback) for better
efficiency.
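As a rough sketch, the callback interface could look something like this
(all names are placeholders):

/*
 * registered at pool creation; must return 0 if the user updated its
 * reference from old_handle to new_handle, non-zero to reject the move
 */
typedef int (*xcf_relocate_fn)(void *user_data, void *old_handle,
			       void *new_handle);

struct xcf_pool_ops_sketch {
	xcf_relocate_fn	relocate;
	void		*user_data;
};

A batched variant would hand the callback an array of (old, new) handle
pairs for a whole victim superblock.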
Please let me know if you have any comments. I plan to start with its
implementation sometime this week.
Thanks,
Nitin
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-13 1:55 ` Nitin Gupta
@ 2011-09-13 15:58 ` Seth Jennings
2011-09-13 21:18 ` Nitin Gupta
0 siblings, 1 reply; 28+ messages in thread
From: Seth Jennings @ 2011-09-13 15:58 UTC (permalink / raw)
To: Nitin Gupta
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
On 09/12/2011 08:55 PM, Nitin Gupta wrote:
> Hi Seth,
>
> I revised some of the original plans for xcfmalloc and below are some
> details. I had few nits regarding the current implementation but I'm
> avoiding them here since we may have to change the design itself
> significantly.
>
> On 09/12/2011 10:35 AM, Seth Jennings wrote:
>
>> On 09/09/2011 09:41 PM, Nitin Gupta wrote:
>>> On 09/09/2011 04:34 PM, Greg KH wrote:
>>>
>>>> On Wed, Sep 07, 2011 at 09:09:04AM -0500, Seth Jennings wrote:
>>>>> Changelog:
>>>>> v2: fix bug in find_remove_block()
>>>>> fix whitespace warning at EOF
>>>>>
>>>>> This patchset introduces a new memory allocator for persistent
>>>>> pages for zcache. The current allocator is xvmalloc. xvmalloc
>>>>> has two notable limitations:
>>>>> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
>>>>> * No compaction support which reduces page reclaimation
>>>>
>>>> I need some acks from other zcache developers before I can accept this.
>>>>
>>>
>>> First, thanks for this new allocator; xvmalloc badly needed a replacement :)
>>>
>>
>> Hey Nitin, I hope your internship went well :) It's good to hear from you.
>>
>
>
> Yes, it went well and now I can spend more time on this project :)
>
>
>>> I went through xcfmalloc in detail and would be posting detailed
>>> comments tomorrow. In general, it seems to be quite similar to the
>>> "chunk based" allocator used in initial implementation of "compcache" --
>>> please see section 2.3.1 in this paper:
>>> http://www.linuxsymposium.org/archives/OLS/Reprints-2007/briglia-Reprint.pdf
>>>
>>
>> Ah, indeed they look similar. I didn't know that this approach
>> had already been done before in the history of this project.
>>
>>> I'm really looking forward to a slab based allocator as I mentioned in
>>> the initial mail:
>>> http://permalink.gmane.org/gmane.linux.kernel.mm/65467
>>>
>>> With the current design xcfmalloc suffers from issues similar to the
>>> allocator described in the paper:
>>> - High metadata overhead
>>> - Difficult implementation of compaction
>>> - Need for extra memcpy()s etc.
>>>
>>> With slab based approach, we can almost eliminate any metadata overhead,
>>> remove any free chunk merging logic, simplify compaction and so on.
>>>
>>
>> Just to align my understanding with yours, when I hear slab-based,
>> I'm thinking each page in the compressed memory pool will contain
>> 1 or more blocks that are all the same size. Is this what you mean?
>>
>
>
> Yes, exactly. The memory pool will consist of "superblocks" (typically
> 16K or 64K).
A fixed superblock size would be an issue. If you have a fixed superblock
size of 64k on a system that has 64k hardware pages, then we'll find
ourselves in the same fragmentation mess as xvmalloc. It needs to be
relative to the hardware page size; say 4*PAGE_SIZE or 8*PAGE_SIZE.
> Each of these superblocks will contain objects of only one
> particular size (which is its size class). This is the general
> structure of all slab allocators. In particular, I'm planning to use
> many of the ideas discussed in this paper:
> http://www.cs.umass.edu/~emery/hoard/asplos2000.pdf
>
> One major point to consider would be that these superblocks cannot be
> physically contiguous in our case, so we will have to do some map/unmap
> trickery.
I assume you mean vm_map_ram() by "trickery".
In my own experiments, I can tell you that vm_map_ram() is VERY expensive
wrt kmap_atomic(). I made a version of xvmalloc (affectionately called
"chunky xvmalloc") where I modified grow_pool() to allocate 4 0-order pages,
if it could, and group them into a "chunk" (what you are calling a
"superblock"). The first page in the chunk had a chunk header that contained
the array of page pointers that vm_map_ram() expects. A block was
identified by a page pointer to the first page in the chunk, and an offset
within the mapped area of the chunk.
It... was... slow...
Additionally, vm_map_ram() failed a lot because it actually allocates
memory of its own for the vmap_area. In my tests, many pages slipped
through zcache because of this failure.
Even worse, it uses GFP_KERNEL when alloc'ing the vmap_area which
isn't safe under spinlock.
It seems that the use of vm_map_ram() is a linchpin in this design.
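A minimal sketch of that chunk scheme, assuming the 3.x-era vm_map_ram()
signature; struct chunk, chunk_map() and chunk_unmap() are illustrative
names, not the actual "chunky xvmalloc" code:

#include <linux/mm.h>
#include <linux/vmalloc.h>

#define CHUNK_PAGES 4

struct chunk {
	/* the array of page pointers that vm_map_ram() expects */
	struct page *pages[CHUNK_PAGES];
};

static void *chunk_map(struct chunk *c)
{
	/*
	 * vm_map_ram() allocates a vmap_area internally with GFP_KERNEL,
	 * so it can fail and is not safe under a spinlock -- the two
	 * problems described above.
	 */
	return vm_map_ram(c->pages, CHUNK_PAGES, -1, PAGE_KERNEL);
}

static void chunk_unmap(void *addr)
{
	vm_unmap_ram(addr, CHUNK_PAGES);
}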
> The basic idea is to link together individual pages
> (typically 4k) using underlying struct_page->lru to form superblocks and
> map/unmap objects on demand.
>
>> If so, I'm not sure how changing to a slab-based system would eliminate
>> metadata overhead or do away with memcpy()s.
>>
>
>
> With slab based approach, the allocator itself need not store any
> metadata with allocated objects. However, considering zcache and zram
> use-cases, the caller will still have to request additional space for
> per-object header: actual object size and back-reference (which
> inode/page-idx this object belongs to) needed for compaction.
>
So the metadata still exists. It's just zcache's metadata vs the
allocator's metadata.
> For free-list management, the underlying struct page and the free object
> space itself can be used. Some field in the struct page can point to the
> first free object in a page and free slab objects themselves will
> contain links to next/previous free objects in the page.
>
>> The memcpy()s are a side effect of having an allocation spread over
>> blocks in different pages. I'm not seeing a way around this.
>>
>
>
> For slab objects that span 2 pages, we can use vm_map_ram() to
> temporarily map pages involved and read/write to objects directly. For
> objects lying entirely within a page, we can use much faster
> kmap_atomic() for access.
>
>
See previous comment about vm_map_ram(). There is also an issue
here with a slab object being unreadable due to vm_map_ram() failing.
>> It also follows that the blocks that make up an allocation must be in
>> a list of some kind, leading to some amount of metadata overhead.
>>
>
>
> Used objects need not be placed in any list. For free objects we can use
> underlying struct page and free object space itself to manage free list,
> as described above.
>
I assume this means that allocations consist of only one slab object as
opposed to multiple objects in slabs with different class sizes.
In other words, if you have a 3k allocation and your slab
class sizes are 1k, 2k, and 4k, the allocation would have to
be in a single 4k object (with 25% fragmentation), not spanning
a 1k and 2k object, correct?
This is something that I'm inferring from your description, but is key
to the design. Please confirm that I'm understanding correctly.
>> If you want to do compaction, it follows that you can't give the user
>> a direct pointer to the data, since the location of that data may change.
>> In this case, an indirection layer is required (i.e. xcf_blkdesc and
>> xcf_read()/xcf_write()).
>>
>
>
> Yes, we can't give a direct pointer anyways since pages used by the
> allocator are not permanently mapped (to save precious VA space on
> 32-bit). Still, we can save on much of metadata overhead and extra
> memcpy() as described above.
>
>
This isn't clear to me. What would be returned by the malloc() call
in this design? In other words, how does the caller access the
allocation? You just said we can't give a direct access to the data
via a page pointer and an offset. What else is there to return other
than a pointer to metadata?
>> The only part of the metadata that could be done away with in a slab-
>> based approach, as far as I can see, is the prevoffset field in xcf_blkhdr,
>> since the size of the previous block in the page (or the previous object
>> in the slab) can be inferred from the size of the current block/object.
>>
>> I do agree that we don't have to worry about free block merging in a
>> slab-based system.
>>
>> I didn't implement compaction so a slab-based system could very well
>> make it easier. I guess it depends on how one ends up doing it.
>>
>
>
> I expect compaction to be much easier with slab like design since
> finding target for relocating objects is so simple. You don't have to
> deal with all those little unusable holes created throughout the heap
> when mix-all-object-sizes-together approach is used.
>
>
> For compaction, I plan to have a scheme where the user (zcache/zram)
> would "cooperate" with the process. Here is a rough outline for
> relocating an object:
> - For any size class, when emptiness threshold (=total space allocated
> / actual used) exceeds some threshold, the allocator will copy objects
> from least used slab superblocks to the most used ones.
> - It will then issue a callback, registered by the user during pool
> creation, informing about the new object location,
> - Due to various cases, this callback may return failure in which case
> the relocation is considered failed and we will discard the newly created copy.
> - If callback succeeds, it means that the user successfully updated its
> object reference and we can now mark the original copy as free.
>
>
> We can also batch this process (copy all objects in victim superblock at
> once to other superblocks and issue a single callback) for better
> efficiency.
>
>
> Please let me know if you have any comments. I plan to start with its
> implementation sometime this week.
>
> Thanks,
> Nitin
--
Seth
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-13 15:58 ` Seth Jennings
@ 2011-09-13 21:18 ` Nitin Gupta
2011-09-15 16:31 ` Seth Jennings
0 siblings, 1 reply; 28+ messages in thread
From: Nitin Gupta @ 2011-09-13 21:18 UTC (permalink / raw)
To: Seth Jennings
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
On 09/13/2011 11:58 AM, Seth Jennings wrote:
> On 09/12/2011 08:55 PM, Nitin Gupta wrote:
>>>>
>>>> With slab based approach, we can almost eliminate any metadata overhead,
>>>> remove any free chunk merging logic, simplify compaction and so on.
>>>>
>>>
>>> Just to align my understanding with yours, when I hear slab-based,
>>> I'm thinking each page in the compressed memory pool will contain
>>> 1 or more blocks that are all the same size. Is this what you mean?
>>>
>>
>>
>> Yes, exactly. The memory pool will consist of "superblocks" (typically
>> 16K or 64K).
>
> A fixed superblock size would be an issue. If you have a fixed superblock
> size of 64k on a system that has 64k hardware pages, then we'll find
> ourselves in the same fragmentation mess as xvmalloc. It needs to be
> relative to the hardware page size; say 4*PAGE_SIZE or 8*PAGE_SIZE.
>
Yes, I meant some multiple of actual PAGE_SIZE.
>> Each of these superblocks will contain objects of only one
>> particular size (which is its size class). This is the general
>> structure of all slab allocators. In particular, I'm planning to use
>> many of the ideas discussed in this paper:
>> http://www.cs.umass.edu/~emery/hoard/asplos2000.pdf
>>
>> One major point to consider would be that these superblocks cannot be
>> physically contiguous in our case, so we will have to do some map/unmap
>> trickery.
>
> I assume you mean vm_map_ram() by "trickery".
>
> In my own experiments, I can tell you that vm_map_ram() is VERY expensive
> wrt kmap_atomic(). I made a version of xvmalloc (affectionately called
> "chunky xvmalloc") where I modified grow_pool() to allocate 4 0-order pages,
> if it could, and group them into a "chunk" (what you are calling a
> "superblock"). The first page in the chunk had a chunk header that contained
> the array of page pointers that vm_map_ram() expects. A block was
> identified by a page pointer to the first page in the chunk, and an offset
> within the mapped area of the chunk.
>
> It... was... slow...
>
> Additionally, vm_map_ram() failed a lot because it actually allocates
> memory of its own for the vmap_area. In my tests, many pages slipped
> through zcache because of this failure.
>
> Even worse, it uses GFP_KERNEL when alloc'ing the vmap_area which
> isn't safe under spinlock.
>
> It seems that the use of vm_map_ram() is a linchpin in this design.
>
Real bad. So, we simply cannot use vm_map_ram(). Maybe we can come up with
something that can map just two pages contiguously and provide kmap_atomic()-like
characteristics?
>> The basic idea is to link together individual pages
>> (typically 4k) using underlying struct_page->lru to form superblocks and
>> map/unmap objects on demand.
>>
>>> If so, I'm not sure how changing to a slab-based system would eliminate
>>> metadata overhead or do away with memcpy()s.
>>>
>>
>>
>> With slab based approach, the allocator itself need not store any
>> metadata with allocated objects. However, considering zcache and zram
>> use-cases, the caller will still have to request additional space for
>> per-object header: actual object size and back-reference (which
>> inode/page-idx this object belongs to) needed for compaction.
>>
>
> So the metadata still exists. It's just zcache's metadata vs the
> allocator's metadata.
>
Yes, a little metadata still exists, but we save storing previous/next
pointers in used objects.
>> For free-list management, the underlying struct page and the free object
>> space itself can be used. Some field in the struct page can point to the
>> first free object in a page and free slab objects themselves will
>> contain links to next/previous free objects in the page.
>>
>>> The memcpy()s are a side effect of having an allocation spread over
>>> blocks in different pages. I'm not seeing a way around this.
>>>
>>
>>
>> For slab objects that span 2 pages, we can use vm_map_ram() to
>> temporarily map pages involved and read/write to objects directly. For
>> objects lying entirely within a page, we can use much faster
>> kmap_atomic() for access.
>>
>>
>
> See previous comment about vm_map_ram(). There is also an issue
> here with a slab object being unreadable due to vm_map_ram() failing.
>
I will try coming up with some variant of kmap_atomic() that allows mapping
just two pages contiguously, which is sufficient for our purpose.
>>> It also follows that the blocks that make up an allocation must be in
>>> a list of some kind, leading to some amount of metadata overhead.
>>>
>>
>>
>> Used objects need not be placed in any list. For free objects we can use
>> underlying struct page and free object space itself to manage free list,
>> as described above.
>>
>
> I assume this means that allocations consist of only one slab object as
> opposed to multiple objects in slabs with different class sizes.
>
> In other words, if you have a 3k allocation and your slab
> class sizes are 1k, 2k, and 4k, the allocation would have to
> be in a single 4k object (with 25% fragmentation), not spanning
> a 1k and 2k object, correct?
>
> This is something that I'm inferring from your description, but is key
> to the design. Please confirm that I'm understanding correctly.
>
To rephrase the slab design: the memory pool is grown in units of superblocks,
each of which is a (virtually) contiguous region of, say, 4*PAGE_SIZE or 8*PAGE_SIZE.
Each of these superblocks is assigned to exactly one size class which is the
size of objects this superblock will store. For example, a superblock of size
64K assigned to size-class 512B can store 64K/512B = 128 objects of size 512B;
it cannot store objects of any other size.
Considering your example, if we have size classes of say 1K, 2K and 4K and
superblock size is say 16K and we get allocation request for 3K, we will use
superblocks assigned to 4K size class. If we selected, say, an empty superblock,
the object will take 4K (1K wasted) leaving 16K-4K = 12K of free space in that
superblock.
To reduce such internal fragmentation, we can have closely spaced size-classes
(say, separated by 32 bytes) but there is a tradeoff here: if there are too many
size classes, we can waste a lot of memory due to partially filled superblocks
in each size class, and if they are too widely spaced, we get a lot of
internal fragmentation.
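As a rough illustration of that spacing tradeoff (the constants and the
size_to_class() helper below are assumptions, not an actual implementation):

#define CLASS_STEP	32		/* spacing between size classes */
#define MIN_CLASS	32
#define MAX_CLASS	(4 * 1024)

/* round an allocation request up to its size class */
static unsigned int size_to_class(unsigned int size)
{
	if (size < MIN_CLASS)
		size = MIN_CLASS;
	return ((size + CLASS_STEP - 1) / CLASS_STEP) * CLASS_STEP;
}

With 32-byte spacing the per-object waste is at most 31 bytes, but there are
MAX_CLASS / CLASS_STEP = 128 classes, each of which can pin a partially
filled superblock.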
>>> If you want to do compaction, it follows that you can't give the user
>>> a direct pointer to the data, since the location of that data may change.
>>> In this case, an indirection layer is required (i.e. xcf_blkdesc and
>>> xcf_read()/xcf_write()).
>>>
>>
>>
>> Yes, we can't give a direct pointer anyways since pages used by the
>> allocator are not permanently mapped (to save precious VA space on
>> 32-bit). Still, we can save on much of metadata overhead and extra
>> memcpy() as described above.
>>
>>
>
> This isn't clear to me. What would be returned by the malloc() call
> in this design? In other words, how does the caller access the
> allocation? You just said we can't give a direct access to the data
> via a page pointer and an offset. What else is there to return other
> than a pointer to metadata?
>
I meant that, since we cannot have allocator pages always mapped, we cannot
return a direct pointer (VA) to the allocated object. Instead, we can combine
page frame number (PFN) and offset within the page as a single 64-bit unsigned
number and return this as "object handle".
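A minimal sketch of that handle encoding (the helper names are made up; only
PAGE_SHIFT, PAGE_SIZE and pfn_to_page() are real kernel symbols):

#include <linux/mm.h>

/* pack PFN and in-page offset into one handle; offset is always < PAGE_SIZE */
static u64 encode_handle(unsigned long pfn, unsigned int offset)
{
	return ((u64)pfn << PAGE_SHIFT) | offset;
}

static struct page *handle_to_page(u64 handle)
{
	return pfn_to_page(handle >> PAGE_SHIFT);
}

static unsigned int handle_to_offset(u64 handle)
{
	return handle & (PAGE_SIZE - 1);
}

The caller would map the page (kmap_atomic() or the two-page mapping variant
discussed above) around each access instead of holding a long-lived pointer.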
Thanks,
Nitin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-13 21:18 ` Nitin Gupta
@ 2011-09-15 16:31 ` Seth Jennings
2011-09-15 17:29 ` Dan Magenheimer
2011-09-16 17:46 ` Nitin Gupta
0 siblings, 2 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-15 16:31 UTC (permalink / raw)
To: Nitin Gupta
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
Hey Nitin,
So this is how I see things...
Right now xvmalloc is broken for zcache's application because
of its huge fragmentation for half the valid allocation sizes
(> PAGE_SIZE/2).
My xcfmalloc patches are _a_ solution that is ready now. Sure,
it doesn't do compaction yet, and it has some metadata overhead.
So it's not "ideal" (if there is such a thing). But it does fix
the brokenness of xvmalloc for zcache's application.
So I see two ways going forward:
1) We review and integrate xcfmalloc now. Then, when you are
done with your allocator, we can run them side by side and see
which is better by numbers. If yours is better, you'll get no
argument from me and we can replace xcfmalloc with yours.
2) We can agree on a date (sooner rather than later) by which your
allocator will be completed. At that time we can compare them and
integrate the best one by the numbers.
Which would you like to do?
--
Seth
On 09/13/2011 04:18 PM, Nitin Gupta wrote:
> On 09/13/2011 11:58 AM, Seth Jennings wrote:
>> On 09/12/2011 08:55 PM, Nitin Gupta wrote:
>
>>>>>
>>>>> With slab based approach, we can almost eliminate any metadata overhead,
>>>>> remove any free chunk merging logic, simplify compaction and so on.
>>>>>
>>>>
>>>> Just to align my understanding with yours, when I hear slab-based,
>>>> I'm thinking each page in the compressed memory pool will contain
>>>> 1 or more blocks that are all the same size. Is this what you mean?
>>>>
>>>
>>>
>>> Yes, exactly. The memory pool will consist of "superblocks" (typically
>>> 16K or 64K).
>>
>> A fixed superblock size would be an issue. If you have a fixed superblock
>> size of 64k on a system that has 64k hardware pages, then we'll find
>> ourselves in the same fragmentation mess as xvmalloc. It needs to be
>> relative to the hardware page size; say 4*PAGE_SIZE or 8*PAGE_SIZE.
>>
>
> Yes, I meant some multiple of actual PAGE_SIZE.
>
>>> Each of these superblocks will contain objects of only one
>>> particular size (which is its size class). This is the general
>>> structure of all slab allocators. In particular, I'm planning to use
>>> many of the ideas discussed in this paper:
>>> http://www.cs.umass.edu/~emery/hoard/asplos2000.pdf
>>>
>>> One major point to consider would be that these superblocks cannot be
>>> physically contiguous in our case, so we will have to do some map/unmap
>>> trickery.
>>
>> I assume you mean vm_map_ram() by "trickery".
>>
>> In my own experiments, I can tell you that vm_map_ram() is VERY expensive
>> wrt kmap_atomic(). I made a version of xvmalloc (affectionately called
>> "chunky xvmalloc") where I modified grow_pool() to allocate 4 0-order pages,
>> if it could, and group them into a "chunk" (what you are calling a
>> "superblock"). The first page in the chunk had a chunk header that contained
>> the array of page pointers that vm_map_ram() expects. A block was
>> identified by a page pointer to the first page in the chunk, and an offset
>> within the mapped area of the chunk.
>>
>> It... was... slow...
>>
>> Additionally, vm_map_ram() failed a lot because it actually allocates
>> memory of its own for the vmap_area. In my tests, many pages slipped
>> through zcache because of this failure.
>>
>> Even worse, it uses GFP_KERNEL when alloc'ing the vmap_area which
>> isn't safe under spinlock.
>>
>> It seems that the use of vm_map_ram() is a linchpin in this design.
>>
>
> Real bad. So, we simply cannot use vm_map_ram(). Maybe we can come up with
> something that can map just two pages contiguously and provide kmap_atomic()-like
> characteristics?
>
>>> The basic idea is to link together individual pages
>>> (typically 4k) using underlying struct_page->lru to form superblocks and
>>> map/unmap objects on demand.
>>>
>>>> If so, I'm not sure how changing to a slab-based system would eliminate
>>>> metadata overhead or do away with memcpy()s.
>>>>
>>>
>>>
>>> With slab based approach, the allocator itself need not store any
>>> metadata with allocated objects. However, considering zcache and zram
>>> use-cases, the caller will still have to request additional space for
>>> per-object header: actual object size and back-reference (which
>>> inode/page-idx this object belongs to) needed for compaction.
>>>
>>
>> So the metadata still exists. It's just zcache's metadata vs the
>> allocator's metadata.
>>
>
> Yes, a little metadata still exists, but we save storing previous/next
> pointers in used objects.
>
>>> For free-list management, the underlying struct page and the free object
>>> space itself can be used. Some field in the struct page can point to the
>>> first free object in a page and free slab objects themselves will
>>> contain links to next/previous free objects in the page.
>>>
>>>> The memcpy()s are a side effect of having an allocation spread over
>>>> blocks in different pages. I'm not seeing a way around this.
>>>>
>>>
>>>
>>> For slab objects that span 2 pages, we can use vm_map_ram() to
>>> temporarily map pages involved and read/write to objects directly. For
>>> objects lying entirely within a page, we can use much faster
>>> kmap_atomic() for access.
>>>
>>>
>>
>> See previous comment about vm_map_ram(). There is also an issue
>> here with a slab object being unreadable due to vm_map_ram() failing.
>>
>
> I will try coming up with some variant of kmap_atomic() that allows mapping
> just two pages contiguously, which is sufficient for our purpose.
>
>>>> It also follows that the blocks that make up an allocation must be in
>>>> a list of some kind, leading to some amount of metadata overhead.
>>>>
>>>
>>>
>>> Used objects need not be placed in any list. For free objects we can use
>>> underlying struct page and free object space itself to manage free list,
>>> as described above.
>>>
>>
>> I assume this means that allocations consist of only one slab object as
>> opposed to multiple objects in slabs with different class sizes.
>>
>> In other words, if you have a 3k allocation and your slab
>> class sizes are 1k, 2k, and 4k, the allocation would have to
>> be in a single 4k object (with 25% fragmentation), not spanning
>> a 1k and 2k object, correct?
>>
>> This is something that I'm inferring from your description, but is key
>> to the design. Please confirm that I'm understanding correctly.
>>
>
> To rephrase the slab design: the memory pool is grown in units of superblocks,
> each of which is a (virtually) contiguous region of, say, 4*PAGE_SIZE or 8*PAGE_SIZE.
> Each of these superblocks is assigned to exactly one size class which is the
> size of objects this superblock will store. For example, a superblock of size
> 64K assigned to size-class 512B can store 64K/512B = 128 objects of size 512B;
> it cannot store objects of any other size.
>
> Considering your example, if we have size classes of say 1K, 2K and 4K and
> superblock size is say 16K and we get allocation request for 3K, we will use
> superblocks assigned to 4K size class. If we selected, say, an empty superblock,
> the object will take 4K (1K wasted) leaving 16K-4K = 12K of free space in that
> superblock.
>
> To reduce such internal fragmentation, we can have closely spaced size-classes
> (say, separated by 32 bytes) but there is a tradeoff here: if there are too many
> size classes, we can waste a lot of memory due to partially filled superblocks
> in each size class, and if they are too widely spaced, we get a lot of
> internal fragmentation.
>
>>>> If you want to do compaction, it follows that you can't give the user
>>>> a direct pointer to the data, since the location of that data may change.
>>>> In this case, an indirection layer is required (i.e. xcf_blkdesc and
>>>> xcf_read()/xcf_write()).
>>>>
>>>
>>>
>>> Yes, we can't give a direct pointer anyways since pages used by the
>>> allocator are not permanently mapped (to save precious VA space on
>>> 32-bit). Still, we can save on much of metadata overhead and extra
>>> memcpy() as described above.
>>>
>>>
>>
>> This isn't clear to me. What would be returned by the malloc() call
>> in this design? In other words, how does the caller access the
>> allocation? You just said we can't give a direct access to the data
>> via a page pointer and an offset. What else is there to return other
>> than a pointer to metadata?
>>
>
> I meant that, since we cannot have allocator pages always mapped, we cannot
> return a direct pointer (VA) to the allocated object. Instead, we can combine
> page frame number (PFN) and offset within the page as a single 64-bit unsigned
> number and return this as "object handle".
>
> Thanks,
> Nitin
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 16:31 ` Seth Jennings
@ 2011-09-15 17:29 ` Dan Magenheimer
2011-09-15 19:24 ` Seth Jennings
2011-09-16 17:46 ` Nitin Gupta
1 sibling, 1 reply; 28+ messages in thread
From: Dan Magenheimer @ 2011-09-15 17:29 UTC (permalink / raw)
To: Seth Jennings, Nitin Gupta
Cc: Greg KH, gregkh, devel, cascardo, linux-kernel, dave, linux-mm,
brking, rcj
> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>
> Hey Nitin,
>
> So this is how I see things...
>
> Right now xvmalloc is broken for zcache's application because
> of its huge fragmentation for half the valid allocation sizes
> (> PAGE_SIZE/2).
Um, I have to disagree here. It is broken for zcache for
SOME set of workloads/data, where the AVERAGE compression
is poor (> PAGE_SIZE/2).
> My xcfmalloc patches are _a_ solution that is ready now. Sure,
> it doesn't do compaction yet, and it has some metadata overhead.
> So it's not "ideal" (if there is such a thing). But it does fix
> the brokenness of xvmalloc for zcache's application.
But at what cost? As Dave Hansen pointed out, we still do
not have a comprehensive worst-case performance analysis for
xcfmalloc. Without that (and without an analysis over a very
large set of workloads), it is difficult to characterize
one as "better" than the other.
> So I see two ways going forward:
>
> 1) We review and integrate xcfmalloc now. Then, when you are
> done with your allocator, we can run them side by side and see
> which is better by numbers. If yours is better, you'll get no
> argument from me and we can replace xcfmalloc with yours.
>
> 2) We can agree on a date (sooner rather than later) by which your
> allocator will be completed. At that time we can compare them and
> integrate the best one by the numbers.
>
> Which would you like to do?
Seth, I am still not clear why it is not possible to support
either allocation algorithm, selectable at runtime. Or even
dynamically... use xvmalloc to store well-compressible pages
and xcfmalloc for poorly-compressible pages. I understand
it might require some additional coding, perhaps even an
ugly hack or two, but it seems possible.
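A rough sketch of the runtime selection being suggested; store_compressed(),
xv_store(), xcf_store() and struct zv_pool are hypothetical wrappers, not the
real xvmalloc/xcfmalloc APIs:

/* route each compressed page to an allocator based on its compressed size */
static int store_compressed(struct zv_pool *pool, void *cdata, size_t clen)
{
	if (clen <= PAGE_SIZE / 2)
		return xv_store(pool, cdata, clen);	/* well-compressible */
	return xcf_store(pool, cdata, clen);		/* poorly-compressible */
}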
Dan
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 17:29 ` Dan Magenheimer
@ 2011-09-15 19:24 ` Seth Jennings
2011-09-15 20:07 ` Dan Magenheimer
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-15 19:24 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Nitin Gupta, Greg KH, gregkh, devel, cascardo, linux-kernel, dave,
linux-mm, brking, rcj
On 09/15/2011 12:29 PM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
>> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>>
>> Hey Nitin,
>>
>> So this is how I see things...
>>
>> Right now xvmalloc is broken for zcache's application because
>> of its huge fragmentation for half the valid allocation sizes
>> (> PAGE_SIZE/2).
>
> Um, I have to disagree here. It is broken for zcache for
> SOME set of workloads/data, where the AVERAGE compression
> is poor (> PAGE_SIZE/2).
>
True.
But are we not in agreement that xvmalloc needs to be replaced
with an allocator that doesn't have this issue? I thought we all
agreed on that...
>> My xcfmalloc patches are _a_ solution that is ready now. Sure,
>> it doesn't do compaction yet, and it has some metadata overhead.
>> So it's not "ideal" (if there is such a thing). But it does fix
>> the brokenness of xvmalloc for zcache's application.
>
> But at what cost? As Dave Hansen pointed out, we still do
> not have a comprehensive worst-case performance analysis for
> xcfmalloc. Without that (and without an analysis over a very
> large set of workloads), it is difficult to characterize
> one as "better" than the other.
>
I'm not sure what you mean by "comprehensive worst-case performance
analysis". If you're talking about theoretical worst-case runtimes
(i.e. O(whatever)) then apparently we are going to have to
talk to an authority on algorithm analysis because we can't agree
how to determine that. However, it isn't difficult to look at the
code and (within your own understanding) see what it is.
I'd be interested to see what Nitin thinks is the worst-case runtime
bound.
How would you suggest that I measure xcfmalloc performance on a "very
large set of workloads". I guess another form of that question is: How
did xvmalloc do this?
>> So I see two ways going forward:
>>
>> 1) We review and integrate xcfmalloc now. Then, when you are
>> done with your allocator, we can run them side by side and see
>> which is better by numbers. If yours is better, you'll get no
>> argument from me and we can replace xcfmalloc with yours.
>>
>> 2) We can agree on a date (sooner rather than later) by which your
>> allocator will be completed. At that time we can compare them and
>> integrate the best one by the numbers.
>>
>> Which would you like to do?
>
> Seth, I am still not clear why it is not possible to support
> either allocation algorithm, selectable at runtime. Or even
> dynamically... use xvmalloc to store well-compressible pages
> and xcfmalloc for poorly-compressible pages. I understand
> it might require some additional coding, perhaps even an
> ugly hack or two, but it seems possible.
But why do an ugly hack if we can just use a single allocator
that has the best overall performance for the allocation range
that zcache requires? Why make it more complicated than it
needs to be?
>
> Dan
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 19:24 ` Seth Jennings
@ 2011-09-15 20:07 ` Dan Magenheimer
2011-10-03 15:59 ` Dave Hansen
2011-09-15 22:17 ` Dave Hansen
2011-09-16 17:52 ` Nitin Gupta
2 siblings, 1 reply; 28+ messages in thread
From: Dan Magenheimer @ 2011-09-15 20:07 UTC (permalink / raw)
To: Seth Jennings
Cc: Nitin Gupta, Greg KH, gregkh, devel, cascardo, linux-kernel, dave,
linux-mm, brking, rcj
> On 09/15/2011 12:29 PM, Dan Magenheimer wrote:
> >> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> >> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
> >>
> >> Right now xvmalloc is broken for zcache's application because
> >> of its huge fragmentation for half the valid allocation sizes
> >> (> PAGE_SIZE/2).
> >
> > Um, I have to disagree here. It is broken for zcache for
> > SOME set of workloads/data, where the AVERAGE compression
> > is poor (> PAGE_SIZE/2).
>
> True.
>
> But are we not in agreement that xvmalloc needs to be replaced
> with an allocator that doesn't have this issue? I thought we all
> agreed on that...
First, let me make it clear that I very much do appreciate
your innovation and effort here. I'm not trying to block
your work from getting upstream or create hoops for you to
jump through. Heaven knows, I can personally attest to
how frustrating that can be!
I am in agreement that xvmalloc has a significant problem with
some workloads and that it would be good to fix that. What
I'm not clear on is if we are replacing an algorithm with
Problem X with another algorithm that has Problem Y... or
at least, if we are, that we agree that Problem Y is not
worse across a broad set of real world workloads than Problem X.
> >> My xcfmalloc patches are _a_ solution that is ready now. Sure,
> >> it doesn't do compaction yet, and it has some metadata overhead.
> >> So it's not "ideal" (if there is such a thing). But it does fix
> >> the brokenness of xvmalloc for zcache's application.
> >
> > But at what cost? As Dave Hansen pointed out, we still do
> > not have a comprehensive worst-case performance analysis for
> > xcfmalloc. Without that (and without an analysis over a very
> > large set of workloads), it is difficult to characterize
> > one as "better" than the other.
>
> I'm not sure what you mean by "comprehensive worst-case performance
> analysis". If you're talking about theoretical worst-case runtimes
> (i.e. O(whatever)) then apparently we are going to have to
> talk to an authority on algorithm analysis because we can't agree
> how to determine that. However, it isn't difficult to look at the
> code and (within your own understanding) see what it is.
>
> I'd be interested to see what Nitin thinks is the worst-case runtime
> bound.
>
> How would you suggest that I measure xcfmalloc performance on a "very
> large set of workloads". I guess another form of that question is: How
> did xvmalloc do this?
I'm far from an expert in the allocation algorithms you and
Nitin are discussing, so let me use an analogy: ordered linked
lists. If you insert a sequence of N numbers from largest to
smallest and then search/retrieve them in order from smallest
to largest, the data structure appears very very fast. If you
insert them in the opposite order and then search/retrieve
them in the opposite order, the data structure appears
very very slow.
For your algorithm, are there sequences of allocations/deallocations
which will perform very poorly? If so, how poorly? If
"perform very poorly" for allocation/deallocation is
a fraction of the time to compress/decompress, I don't
care, let's switch to xcfmalloc. However, if one could
manufacture a sequence of allocations/searches where the
overhead is much larger than the compress/decompress
time (and especially if it grows worse as N grows), that's
an issue we need to understand better.
I think Dave Hansen was saying the same thing in an earlier thread:
> From: Dave Hansen [mailto:dave@linux.vnet.ibm.com]
> It took the largest (most valuable) block, and split a 500 block when it
> didn't have to. The reason it doesn't do this is that it doesn't
> _search_. It just indexes and guesses. That's *fast*, but it errs on
> the side of speed rather than being optimal. That's OK, we do it all
> the time, but it *is* a compromise. We should at least be thinking of
> the cases when this doesn't perform well.
In other words, what happens if on some workload, strictly
by chance, xcfmalloc always guesses wrong? Will search time
grow linearly, or exponentially? (This is especially an
issue if interrupts are disabled during the search, which
they currently are, correct?)
> >> So I see two ways going forward:
> >>
> >> 1) We review and integrate xcfmalloc now. Then, when you are
> >> done with your allocator, we can run them side by side and see
> >> which is better by numbers. If yours is better, you'll get no
> >> argument from me and we can replace xcfmalloc with yours.
> >>
> >> 2) We can agree on a date (sooner rather than later) by which your
> >> allocator will be completed. At that time we can compare them and
> >> integrate the best one by the numbers.
> >>
> >> Which would you like to do?
> >
> > Seth, I am still not clear why it is not possible to support
> > either allocation algorithm, selectable at runtime. Or even
> > dynamically... use xvmalloc to store well-compressible pages
> > and xcfmalloc for poorly-compressible pages. I understand
> > it might require some additional coding, perhaps even an
> > ugly hack or two, but it seems possible.
>
> But why do an ugly hack if we can just use a single allocator
> that has the best overall performance for the allocation range
> that zcache requires? Why make it more complicated than it
> needs to be?
I agree, if we are certain that your statement of "best overall
performance" is true.
If you and Nitin can agree that xcfmalloc is better than xvmalloc,
even if future-slab-based-allocator is predicted to be better
than xcfmalloc, I am OK with (1) above. I just want to feel
confident we aren't exchanging problem X for problem Y (in
which case some runtime or dynamic selection hack might be better).
With all that said, I guess my bottom line is: If Nitin provides
an Acked-by on your patchset, I will too.
Thanks again for your work on this!
Dan
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 19:24 ` Seth Jennings
2011-09-15 20:07 ` Dan Magenheimer
@ 2011-09-15 22:17 ` Dave Hansen
2011-09-15 22:27 ` Dan Magenheimer
2011-09-16 17:36 ` Nitin Gupta
2011-09-16 17:52 ` Nitin Gupta
2 siblings, 2 replies; 28+ messages in thread
From: Dave Hansen @ 2011-09-15 22:17 UTC (permalink / raw)
To: Seth Jennings
Cc: Dan Magenheimer, Nitin Gupta, Greg KH, gregkh, devel, cascardo,
linux-kernel, linux-mm, brking, rcj
On Thu, 2011-09-15 at 14:24 -0500, Seth Jennings wrote:
> How would you suggest that I measure xcfmalloc performance on a "very
> large set of workloads". I guess another form of that question is: How
> did xvmalloc do this?
Well, it didn't have a competitor, so this probably wasn't done. :)
I'd like to see a microbenchmarky sort of thing. Do a million (or 100
million, whatever) allocations, and time it for both allocators doing
the same thing. You just need to do the *same* allocations for both.
It'd be interesting to see the shape of a graph if you did:
for (i = 0; i < BIG_NUMBER; i++)
	for (j = MIN_ALLOC; j < MAX_ALLOC; j += BLOCK_SIZE)
		alloc(j);
free();
... basically for both allocators. Let's see how the graphs look. You
could do it a lot of different ways: alloc all, then free all, or alloc
one free one, etc... Maybe it will surprise us. Maybe the page
allocator overhead will dominate _everything_, and we won't even see the
x*malloc() functions show up.
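For example, a userspace harness with roughly that shape might look like the
following, with malloc()/free() standing in for the allocator under test (in
practice this would be a kernel module driving xvmalloc or xcfmalloc, and the
constants are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BIG_NUMBER	1000
#define MIN_ALLOC	32
#define MAX_ALLOC	3584
#define BLOCK_SIZE	32
#define PER_PASS	((MAX_ALLOC - MIN_ALLOC) / BLOCK_SIZE + 1)

int main(void)
{
	static void *objs[PER_PASS];
	struct timespec t0, t1;
	int i, j, n;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < BIG_NUMBER; i++) {
		/* alloc(j) in the pseudocode above */
		for (n = 0, j = MIN_ALLOC; j < MAX_ALLOC; j += BLOCK_SIZE)
			objs[n++] = malloc(j);
		/* free() in the pseudocode above: alloc all, then free all */
		while (n)
			free(objs[--n]);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.3f s\n", (t1.tv_sec - t0.tv_sec) +
			   (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}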
The other thing that's important is to think of cases like I described
that would cause either allocator to do extra splits/joins or be slow in
other ways. I expect xcfmalloc() to be slowest when it is allocating
and has to break down a reserve page. Let's say it does a bunch of ~3kb
allocations and has no pages on the freelists, it will:
1. scan each of the 64 freelists heads (512 bytes of cache)
2. split a 4k page
3. reinsert the 1k remainder
Next time, it will:
1. scan, and find the 1k bit
2. continue scanning, eventually touching each freelist...
3. split a 4k page
4. reinsert the 2k remainder
It'll end up doing a scan/split/reinsert in 3/4 of the cases, I think.
The case of the freelists being quite empty will also be quite common
during times the pool is expanding. I think xvmalloc() will have some
of the same problems, but let's see if it does in practice.
-- Dave
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 22:17 ` Dave Hansen
@ 2011-09-15 22:27 ` Dan Magenheimer
2011-09-16 17:36 ` Nitin Gupta
1 sibling, 0 replies; 28+ messages in thread
From: Dan Magenheimer @ 2011-09-15 22:27 UTC (permalink / raw)
To: Dave Hansen, Seth Jennings
Cc: Nitin Gupta, Greg KH, gregkh, devel, cascardo, linux-kernel,
linux-mm, brking, rcj
> From: Dave Hansen [mailto:dave@linux.vnet.ibm.com]
> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>
> On Thu, 2011-09-15 at 14:24 -0500, Seth Jennings wrote:
> > How would you suggest that I measure xcfmalloc performance on a "very
> > large set of workloads". I guess another form of that question is: How
> > did xvmalloc do this?
>
> Well, it didn't have a competitor, so this probably wasn't done. :)
>
> I'd like to see a microbenchmarky sort of thing. Do a million (or 100
> million, whatever) allocations, and time it for both allocators doing
> the same thing. You just need to do the *same* allocations for both.
One suggestion: We already know xvmalloc sucks IF the workload has
poor compression for most pages. We are looking to understand if xcfmalloc
is [very**N] bad when xvmalloc is good. So please measure BIG-NUMBER
allocations where compression is known to be OK on average (which is,
I think, a large fraction of workloads), rather than workloads where
xvmalloc already sucks.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 22:17 ` Dave Hansen
2011-09-15 22:27 ` Dan Magenheimer
@ 2011-09-16 17:36 ` Nitin Gupta
1 sibling, 0 replies; 28+ messages in thread
From: Nitin Gupta @ 2011-09-16 17:36 UTC (permalink / raw)
To: Dave Hansen
Cc: Seth Jennings, Dan Magenheimer, Greg KH, gregkh, devel, cascardo,
linux-kernel, linux-mm, brking, rcj
On 09/15/2011 06:17 PM, Dave Hansen wrote:
> On Thu, 2011-09-15 at 14:24 -0500, Seth Jennings wrote:
>> How would you suggest that I measure xcfmalloc performance on a "very
>> large set of workloads". I guess another form of that question is: How
>> did xvmalloc do this?
>
> Well, it didn't have a competitor, so this probably wasn't done. :)
>
A lot of testing was done for xvmalloc (and its predecessor, tlsf)
before it was integrated into zram:
http://code.google.com/p/compcache/wiki/AllocatorsComparison
http://code.google.com/p/compcache/wiki/xvMalloc
http://code.google.com/p/compcache/wiki/xvMallocPerformance
I think we can use the same set of testing tools. See:
http://code.google.com/p/compcache/source/browse/#hg%2Fsub-projects%2Ftesting
These tools issue a mix of allocs and frees, each with some probability
that can be adjusted in code.
There is also a tool called "swap replay" which collects swap-out traces
and simulates the same behavior in userspace, allowing allocator testing
with "real world" traces. See:
http://code.google.com/p/compcache/wiki/SwapReplay
> I'd like to see a microbenchmarky sort of thing. Do a million (or 100
> million, whatever) allocations, and time it for both allocators doing
> the same thing. You just need to do the *same* allocations for both.
>
> It'd be interesting to see the shape of a graph if you did:
>
> for (i = 0; i < BIG_NUMBER; i++)
> 	for (j = MIN_ALLOC; j < MAX_ALLOC; j += BLOCK_SIZE)
> 		alloc(j);
> free();
>
> ... basically for both allocators. Let's see how the graphs look. You
> could do it a lot of different ways: alloc all, then free all, or alloc
> one free one, etc... Maybe it will surprise us. Maybe the page
> allocator overhead will dominate _everything_, and we won't even see the
> x*malloc() functions show up.
>
> The other thing that's important is to think of cases like I described
> that would cause either allocator to do extra splits/joins or be slow in
> other ways. I expect xcfmalloc() to be slowest when it is allocating
> and has to break down a reserve page. Let's say it does a bunch of ~3kb
> allocations and has no pages on the freelists, it will:
>
> 1. scan each of the 64 freelists heads (512 bytes of cache)
> 2. split a 4k page
> 3. reinsert the 1k remainder
>
> Next time, it will:
>
> 1. scan, and find the 1k bit
> 2. continue scanning, eventually touching each freelist...
> 3. split a 4k page
> 4. reinsert the 2k remainder
>
> It'll end up doing a scan/split/reinsert in 3/4 of the cases, I think.
> The case of the freelists being quite empty will also be quite common
> during times the pool is expanding. I think xvmalloc() will have some
> of the same problems, but let's see if it does in practice.
>
Thanks,
Nitin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 16:31 ` Seth Jennings
2011-09-15 17:29 ` Dan Magenheimer
@ 2011-09-16 17:46 ` Nitin Gupta
2011-09-16 18:33 ` Seth Jennings
2011-11-01 17:30 ` Dave Hansen
1 sibling, 2 replies; 28+ messages in thread
From: Nitin Gupta @ 2011-09-16 17:46 UTC (permalink / raw)
To: Seth Jennings
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
Hi Seth,
On 09/15/2011 12:31 PM, Seth Jennings wrote:
>
> So this is how I see things...
>
> Right now xvmalloc is broken for zcache's application because
> of its huge fragmentation for half the valid allocation sizes
> (> PAGE_SIZE/2).
>
> My xcfmalloc patches are _a_ solution that is ready now. Sure,
> it doesn't do compaction yet, and it has some metadata overhead.
> So it's not "ideal" (if there is such a thing). But it does fix
> the brokenness of xvmalloc for zcache's application.
>
> So I see two ways going forward:
>
> 1) We review and integrate xcfmalloc now. Then, when you are
> done with your allocator, we can run them side by side and see
> which is better by numbers. If yours is better, you'll get no
> argument from me and we can replace xcfmalloc with yours.
>
> 2) We can agree on a date (sooner rather than later) by which your
> allocator will be completed. At that time we can compare them and
> integrate the best one by the numbers.
>
I think replacing the allocator every few weeks isn't a good idea. So, I
guess it would be better to let me work for about 2 weeks and try the
slab-based approach. If nothing works out in that time, then maybe xcfmalloc
can be integrated after further testing.
Thanks,
Nitin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 19:24 ` Seth Jennings
2011-09-15 20:07 ` Dan Magenheimer
2011-09-15 22:17 ` Dave Hansen
@ 2011-09-16 17:52 ` Nitin Gupta
2 siblings, 0 replies; 28+ messages in thread
From: Nitin Gupta @ 2011-09-16 17:52 UTC (permalink / raw)
To: Seth Jennings
Cc: Dan Magenheimer, Greg KH, gregkh, devel, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
On 09/15/2011 03:24 PM, Seth Jennings wrote:
> On 09/15/2011 12:29 PM, Dan Magenheimer wrote:
>>> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
>>> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>>>
>>
>> Seth, I am still not clear why it is not possible to support
>> either allocation algorithm, selectable at runtime. Or even
>> dynamically... use xvmalloc to store well-compressible pages
>> and xcfmalloc for poorly-compressible pages. I understand
>> it might require some additional coding, perhaps even an
>> ugly hack or two, but it seems possible.
>
> But why do an ugly hack if we can just use a single allocator
> that has the best overall performance for the allocation range
> that zcache requires? Why make it more complicated than it
> needs to be?
>
>>
I agree with Seth here: a mix of different allocators for the (small)
range of sizes which zcache requires looks like a bad idea to me.
Maintaining two allocators is a pain, and it will also complicate
future plans like compaction.
Thanks,
Nitin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-16 17:46 ` Nitin Gupta
@ 2011-09-16 18:33 ` Seth Jennings
2011-11-01 17:30 ` Dave Hansen
1 sibling, 0 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-16 18:33 UTC (permalink / raw)
To: Nitin Gupta
Cc: Greg KH, gregkh, devel, dan.magenheimer, cascardo, linux-kernel,
dave, linux-mm, brking, rcj
On 09/16/2011 12:46 PM, Nitin Gupta wrote:
> I think replacing the allocator every few weeks isn't a good idea. So, I
> guess it would be better to let me work for about 2 weeks and try the
> slab-based approach. If nothing works out in that time, then maybe xcfmalloc
> can be integrated after further testing.
Sounds good to me.
>
> Thanks,
> Nitin
>
Thanks
--
Seth
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-07 14:09 [PATCH v2 0/3] staging: zcache: xcfmalloc support Seth Jennings
` (3 preceding siblings ...)
2011-09-09 20:34 ` [PATCH v2 0/3] staging: zcache: xcfmalloc support Greg KH
@ 2011-09-29 17:47 ` Seth Jennings
4 siblings, 0 replies; 28+ messages in thread
From: Seth Jennings @ 2011-09-29 17:47 UTC (permalink / raw)
To: Seth Jennings
Cc: gregkh, dan.magenheimer, ngupta, cascardo, devel, linux-kernel,
rdunlap, linux-mm, rcj, dave, brking
On 09/07/2011 09:09 AM, Seth Jennings wrote:
>
> I did some quick tests with "time" using the same program and the
> timings are very close (3 run average, little deviation):
>
> xvmalloc:
> zero filled 0m0.852s
> text (75%) 0m14.415s
>
> xcfmalloc:
> zero filled 0m0.870s
> text (75%) 0m15.089s
>
> I suspect that the small decrease in throughput is due to the
> extra memcpy in xcfmalloc. However, these timing, more than
> anything, demonstrate that the throughput is GREATLY effected
> by the compressibility of the data.
This is not correct. I found out today that the reason text
compressed so much more slowly is that my test program
was filling the text pages inefficiently.
With my corrected test program:
xvmalloc:
zero filled 0m0.751s
text (75%) 0m2.273s
It is still slower on less compressible data but not to the
degree previously stated.
I don't have the xcfmalloc numbers yet, but I expect they are
almost the same.
--
Seth
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-15 20:07 ` Dan Magenheimer
@ 2011-10-03 15:59 ` Dave Hansen
2011-10-03 17:54 ` Nitin Gupta
0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2011-10-03 15:59 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Seth Jennings, Nitin Gupta, Greg KH, gregkh, devel, cascardo,
linux-kernel, linux-mm, brking, rcj
Hi Dan/Nitin,
I've been reading through Seth's patches a bit and looking over the
locking in general. I'm wondering why preempt_disable() is used so
heavily. Preempt seems to be disabled for virtually all of zcache's
operations. It seems a bit unorthodox, and I guess I'm anticipating the
future screams of the low-latency folks. :)
I think long-term it will hurt zcache's ability to move into other
code. Right now, it's pretty limited to being used in conjunction with
memory reclaim called from kswapd. Seems like something we ought to add
to the TODO list before it escapes from staging/.
-- Dave
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-10-03 15:59 ` Dave Hansen
@ 2011-10-03 17:54 ` Nitin Gupta
2011-10-03 18:22 ` Dave Hansen
0 siblings, 1 reply; 28+ messages in thread
From: Nitin Gupta @ 2011-10-03 17:54 UTC (permalink / raw)
To: Dave Hansen
Cc: Dan Magenheimer, Seth Jennings, Greg KH, gregkh, devel, cascardo,
linux-kernel, linux-mm, brking, rcj
Hi Dave,
On 10/03/2011 11:59 AM, Dave Hansen wrote:
>
> I've been reading through Seth's patches a bit and looking over the
> locking in general. I'm wondering why preempt_disable() is used so
> heavily. Preempt seems to be disabled for virtually all of zcache's
> operations. It seems a bit unorthodox, and I guess I'm anticipating the
> future screams of the low-latency folks. :)
>
> I think long-term it will hurt zcache's ability to move into other
> code. Right now, it's pretty limited to being used in conjunction with
> memory reclaim called from kswapd. Seems like something we ought to add
> to the TODO list before it escapes from staging/.
>
I think disabling preemption on the local CPU is the cheapest protection
we can get for the per-CPU buffers. We may experiment with, say, multiple
buffers per CPU, so we end up disabling preemption only in the highly
improbable case of getting preempted too many times within the critical
section. But before we do all that, we really need to come up with cases
where zcache-induced latency is, or can be, a problem.
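For reference, the pattern under discussion is roughly the following (struct
pcpu_buf and use_pcpu_buffer() are illustrative, not zcache's actual code):

#include <linux/percpu.h>

struct pcpu_buf {
	unsigned char data[2 * PAGE_SIZE];	/* e.g. a compression scratch buffer */
};
static DEFINE_PER_CPU(struct pcpu_buf, compress_buf);

static void use_pcpu_buffer(void)
{
	/* get_cpu_var() disables preemption so the buffer stays ours */
	struct pcpu_buf *buf = &get_cpu_var(compress_buf);

	/* ... compress into buf->data ... */

	put_cpu_var(compress_buf);		/* re-enables preemption */
}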
Thanks,
Nitin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-10-03 17:54 ` Nitin Gupta
@ 2011-10-03 18:22 ` Dave Hansen
2011-10-05 1:03 ` Dan Magenheimer
0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2011-10-03 18:22 UTC (permalink / raw)
To: Nitin Gupta
Cc: Dan Magenheimer, Seth Jennings, Greg KH, gregkh, devel, cascardo,
linux-kernel, linux-mm, brking, rcj
On Mon, 2011-10-03 at 13:54 -0400, Nitin Gupta wrote:
> I think disabling preemption on the local CPU is the cheapest protection
> we can get for the per-CPU buffers. We may experiment with, say, multiple
> buffers per CPU, so we end up disabling preemption only in the highly
> improbable case of getting preempted too many times within the critical
> section.
I guess the problem is two-fold: preempt_disable() and
local_irq_save().
> static int zcache_put_page(int cli_id, int pool_id, struct tmem_oid *oidp,
> 			uint32_t index, struct page *page)
> {
> 	struct tmem_pool *pool;
> 	int ret = -1;
>
> 	BUG_ON(!irqs_disabled());
That tells me "zcache" doesn't work with interrupts on. It seems like
awfully high-level code to have interrupts disabled. The core page
allocator has some irq-disabling spinlock calls, but that's only really
because it has to be able to service page allocations from interrupts.
What's the high-level reason for zcache?
I'll save the discussion about preempt for when Seth posts his patch.
-- Dave
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-10-03 18:22 ` Dave Hansen
@ 2011-10-05 1:03 ` Dan Magenheimer
0 siblings, 0 replies; 28+ messages in thread
From: Dan Magenheimer @ 2011-10-05 1:03 UTC (permalink / raw)
To: Dave Hansen, Nitin Gupta
Cc: Seth Jennings, Greg KH, gregkh, devel, cascardo, linux-kernel,
linux-mm, brking, rcj
> From: Dave Hansen [mailto:dave@linux.vnet.ibm.com]
> Sent: Monday, October 03, 2011 12:23 PM
> To: Nitin Gupta
> Cc: Dan Magenheimer; Seth Jennings; Greg KH; gregkh@suse.de; devel@driverdev.osuosl.org;
> cascardo@holoscopio.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org; brking@linux.vnet.ibm.com;
> rcj@linux.vnet.ibm.com
> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>
> On Mon, 2011-10-03 at 13:54 -0400, Nitin Gupta wrote:
> > I think disabling preemption on the local CPU is the cheapest way we have
> > to protect the per-CPU buffers. We could experiment with, say, multiple
> > buffers per CPU, so that we end up disabling preemption only in the highly
> > improbable case of being preempted too many times within the critical
> > section.
>
> I guess the problem is two-fold: preempt_disable() and
> local_irq_save().
>
> > static int zcache_put_page(int cli_id, int pool_id, struct tmem_oid *oidp,
> > uint32_t index, struct page *page)
> > {
> > struct tmem_pool *pool;
> > int ret = -1;
> >
> > BUG_ON(!irqs_disabled());
>
> That tells me "zcache" doesn't work with interrupts on. It seems like
> awfully high-level code to have interrupts disabled. The core page
> allocator has some irq-disabling spinlock calls, but that's only really
> because it has to be able to service page allocations from interrupts.
> What's the high-level reason for zcache?
>
> I'll save the discussion about preempt for when Seth posts his patch.
I completely agree that the irq/softirq/preempt states should be
re-examined and, where possible, improved before zcache moves
out of staging.
Actually, I think cleancache_put is called from a point in the kernel
where irqs are disabled, and I believe it is unsafe to call a routine
sometimes with irqs disabled and sometimes with irqs enabled.
I think some call sites of cleancache_flush may also have
irqs disabled.
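As a hedged sketch of that point (hypothetical helper name, not the
actual cleancache/zcache call path): a routine that saves and restores
the interrupt state itself can be called with irqs either on or off,
instead of asserting a particular state at entry:

#include <linux/irqflags.h>

/* Hypothetical wrapper; op() stands in for the underlying tmem operation. */
static void tmem_call_any_irq_state(void (*op)(void))
{
	unsigned long flags;

	local_irq_save(flags);		/* works with irqs on or off */
	op();				/* runs with irqs disabled */
	local_irq_restore(flags);	/* restores the caller's irq state */
}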
IIRC, much of the zcache code has preemption disabled because
it is unsafe for a page fault to occur when zcache is running,
since the page fault may cause a (recursive) call into zcache
and possibly recursively take a lock.
Anyway, some of the atomicity constraints in the code are
definitely required, but there are very likely some constraints
that are overzealous and can be removed. For now, I'd rather
have the longer interrupt latency with code that works than
have developers experimenting with zcache and seeing lockups. :-}
Dan
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-09-16 17:46 ` Nitin Gupta
2011-09-16 18:33 ` Seth Jennings
@ 2011-11-01 17:30 ` Dave Hansen
2011-11-01 18:35 ` Dan Magenheimer
1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2011-11-01 17:30 UTC (permalink / raw)
To: Nitin Gupta
Cc: Seth Jennings, Greg KH, gregkh, devel, dan.magenheimer, cascardo,
linux-kernel, linux-mm, brking, rcj
On Fri, 2011-09-16 at 13:46 -0400, Nitin Gupta wrote:
> I think replacing the allocator every few weeks isn't a good idea. So I
> guess it would be better to let me work for about 2 weeks and try the
> slab-based approach. If nothing works out in that time, then maybe
> xcfmalloc can be integrated after further testing.
Hi Nitin,
It's been about six weeks. :)
Can we talk about putting xcfmalloc() in staging now?
-- Dave
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-11-01 17:30 ` Dave Hansen
@ 2011-11-01 18:35 ` Dan Magenheimer
2011-11-02 2:42 ` Nitin Gupta
0 siblings, 1 reply; 28+ messages in thread
From: Dan Magenheimer @ 2011-11-01 18:35 UTC (permalink / raw)
To: Dave Hansen, Nitin Gupta
Cc: Seth Jennings, Greg KH, gregkh, devel, cascardo, linux-kernel,
linux-mm, brking, rcj
> From: Dave Hansen [mailto:dave@linux.vnet.ibm.com]
> Sent: Tuesday, November 01, 2011 11:30 AM
> To: Nitin Gupta
> Cc: Seth Jennings; Greg KH; gregkh@suse.de; devel@driverdev.osuosl.org; Dan Magenheimer;
> cascardo@holoscopio.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org; brking@linux.vnet.ibm.com;
> rcj@linux.vnet.ibm.com
> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>
> On Fri, 2011-09-16 at 13:46 -0400, Nitin Gupta wrote:
> > I think replacing the allocator every few weeks isn't a good idea. So I
> > guess it would be better to let me work for about 2 weeks and try the
> > slab-based approach. If nothing works out in that time, then maybe
> > xcfmalloc can be integrated after further testing.
>
> Hi Nitin,
>
> It's been about six weeks. :)
>
> Can we talk about putting xcfmalloc() in staging now?
FWIW, given that I am quoting "code rules!" to the gods of Linux
on another lkml thread, I can hardly disagree here.
If Nitin continues to develop his allocator and it proves
better than xcfmalloc (and especially if it can replace
zbud as well), we can consider replacing xcfmalloc later.
Until zcache is promoted from staging, I think we have
that flexibility.
(Shameless advertisement, though: the xcfmalloc allocator
only applies to pages passed via frontswap, and on
that other lkml thread lurk many people intent on shooting
frontswap down. So, frankly, I'd prefer time to be spent
on benchmarking zcache rather than on arguing about
allocators, which, as things currently feel to me on that
other lkml thread, is not unlike rearranging deck chairs
on the Titanic. Half-:-).)
Dan
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
2011-11-01 18:35 ` Dan Magenheimer
@ 2011-11-02 2:42 ` Nitin Gupta
0 siblings, 0 replies; 28+ messages in thread
From: Nitin Gupta @ 2011-11-02 2:42 UTC (permalink / raw)
To: Dan Magenheimer
Cc: Dave Hansen, Seth Jennings, Greg KH, gregkh, devel, cascardo,
linux-kernel, linux-mm, brking, rcj
On 11/01/2011 02:35 PM, Dan Magenheimer wrote:
>> From: Dave Hansen [mailto:dave@linux.vnet.ibm.com]
>> Sent: Tuesday, November 01, 2011 11:30 AM
>> To: Nitin Gupta
>> Cc: Seth Jennings; Greg KH; gregkh@suse.de; devel@driverdev.osuosl.org; Dan Magenheimer;
>> cascardo@holoscopio.com; linux-kernel@vger.kernel.org; linux-mm@kvack.org; brking@linux.vnet.ibm.com;
>> rcj@linux.vnet.ibm.com
>> Subject: Re: [PATCH v2 0/3] staging: zcache: xcfmalloc support
>>
>> On Fri, 2011-09-16 at 13:46 -0400, Nitin Gupta wrote:
>>> I think replacing the allocator every few weeks isn't a good idea. So I
>>> guess it would be better to let me work for about 2 weeks and try the
>>> slab-based approach. If nothing works out in that time, then maybe
>>> xcfmalloc can be integrated after further testing.
>>
>> Hi Nitin,
>>
>> It's been about six weeks. :)
>>
>> Can we talk about putting xcfmalloc() in staging now?
>
> FWIW, given that I am quoting "code rules!" to the gods of Linux
> on another lkml thread, I can hardly disagree here.
>
I agree with you, Dan. It took me a really long time to bring the new
allocator into some kind of shape, and I'm still not very confident that
it's ready to be integrated with zcache.
> If Nitin continues to develop his allocator and it proves
> better than xcfmalloc (and especially if it can replace
> zbud as well), we can consider replacing xcfmalloc later.
> Until zcache is promoted from staging, I think we have
> that flexibility.
>
Agreed. Though I still consider the slab-based design much better, having
already tried an xcfmalloc-like design much earlier in the project's
history, I would still favor xcfmalloc integration, since xvmalloc's
weakness with objects larger than PAGE_SIZE/2 is probably too much to bear.
> (Shameless advertisement, though: the xcfmalloc allocator
> only applies to pages passed via frontswap, and on
> that other lkml thread lurk many people intent on shooting
> frontswap down. So, frankly, I'd prefer time to be spent
> on benchmarking zcache rather than on arguing about
> allocators, which, as things currently feel to me on that
> other lkml thread, is not unlike rearranging deck chairs
> on the Titanic. Half-:-).)
>
>
Thanks,
Nitin
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2011-11-02 2:42 UTC | newest]
Thread overview: 28+ messages
2011-09-07 14:09 [PATCH v2 0/3] staging: zcache: xcfmalloc support Seth Jennings
2011-09-07 14:09 ` [PATCH v2 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
2011-09-07 14:09 ` [PATCH v2 2/3] staging: zcache: replace xvmalloc with xcfmalloc Seth Jennings
2011-09-07 14:09 ` [PATCH v2 3/3] staging: zcache: add zv_page_count and zv_desc_count Seth Jennings
2011-09-09 20:34 ` [PATCH v2 0/3] staging: zcache: xcfmalloc support Greg KH
2011-09-10 2:41 ` Nitin Gupta
2011-09-12 14:35 ` Seth Jennings
2011-09-13 1:55 ` Nitin Gupta
2011-09-13 15:58 ` Seth Jennings
2011-09-13 21:18 ` Nitin Gupta
2011-09-15 16:31 ` Seth Jennings
2011-09-15 17:29 ` Dan Magenheimer
2011-09-15 19:24 ` Seth Jennings
2011-09-15 20:07 ` Dan Magenheimer
2011-10-03 15:59 ` Dave Hansen
2011-10-03 17:54 ` Nitin Gupta
2011-10-03 18:22 ` Dave Hansen
2011-10-05 1:03 ` Dan Magenheimer
2011-09-15 22:17 ` Dave Hansen
2011-09-15 22:27 ` Dan Magenheimer
2011-09-16 17:36 ` Nitin Gupta
2011-09-16 17:52 ` Nitin Gupta
2011-09-16 17:46 ` Nitin Gupta
2011-09-16 18:33 ` Seth Jennings
2011-11-01 17:30 ` Dave Hansen
2011-11-01 18:35 ` Dan Magenheimer
2011-11-02 2:42 ` Nitin Gupta
2011-09-29 17:47 ` Seth Jennings