From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id 0307C6B0032 for ; Sun, 18 Aug 2013 04:41:15 -0400 (EDT)
Received: by mail-pa0-f46.google.com with SMTP id fa1so3521081pad.19 for ; Sun, 18 Aug 2013 01:41:15 -0700 (PDT)
From: Bob Liu
Subject: [PATCH 0/4] mm: merge zram into zswap
Date: Sun, 18 Aug 2013 16:40:45 +0800
Message-Id: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu

Both zswap and zram are used to compress anon pages in memory so as to
reduce swap I/O operations. The main difference is that zswap uses zbud as
its allocator while zram uses zsmalloc. The other difference is that zram
creates a block device, which the user needs to mkswap and swapon.

Minchan has already tried to promote zram/zsmalloc into drivers/block/, but
that may increase maintenance headaches. Since the purpose of zswap and zram
is the same, this patch series tries to merge them, as Mel suggested.
It drops zram from staging and extends zswap with the same features as zram.

zswap todo:
Improve the writeback of zswap pool pages!

Bob Liu (4):
  drivers: staging: drop zram and zsmalloc
  mm: promote zsmalloc to mm/
  mm: zswap: add supporting for zsmalloc
  mm: zswap: create a pseudo device /dev/zram0

 drivers/staging/Kconfig                  |    4 -
 drivers/staging/Makefile                 |    2 -
 drivers/staging/zram/Kconfig             |   25 -
 drivers/staging/zram/Makefile            |    3 -
 drivers/staging/zram/zram.txt            |   77 ---
 drivers/staging/zram/zram_drv.c          |  925 --------------------------
 drivers/staging/zram/zram_drv.h          |  115 ----
 drivers/staging/zsmalloc/Kconfig         |   10 -
 drivers/staging/zsmalloc/Makefile        |    3 -
 drivers/staging/zsmalloc/zsmalloc-main.c | 1063 -----------------------------
 drivers/staging/zsmalloc/zsmalloc.h      |   43 --
 include/linux/zsmalloc.h                 |   44 ++
 mm/Kconfig                               |   51 +-
 mm/Makefile                              |    1 +
 mm/zsmalloc.c                            | 1068 ++++++++++++++++++++++++++++++
 mm/zswap.c                               |  269 +++++++-
 16 files changed, 1418 insertions(+), 2285 deletions(-)
 delete mode 100644 drivers/staging/zram/Kconfig
 delete mode 100644 drivers/staging/zram/Makefile
 delete mode 100644 drivers/staging/zram/zram.txt
 delete mode 100644 drivers/staging/zram/zram_drv.c
 delete mode 100644 drivers/staging/zram/zram_drv.h
 delete mode 100644 drivers/staging/zsmalloc/Kconfig
 delete mode 100644 drivers/staging/zsmalloc/Makefile
 delete mode 100644 drivers/staging/zsmalloc/zsmalloc-main.c
 delete mode 100644 drivers/staging/zsmalloc/zsmalloc.h
 create mode 100644 include/linux/zsmalloc.h
 create mode 100644 mm/zsmalloc.c

--
1.7.10.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id ADE846B0033 for ; Sun, 18 Aug 2013 04:41:28 -0400 (EDT)
Received: by mail-pd0-f175.google.com with SMTP id q10so3816207pdj.34 for ; Sun, 18 Aug 2013 01:41:27 -0700 (PDT)
From: Bob Liu
Subject: [PATCH 1/4] drivers: staging: drop zram and zsmalloc
Date: Sun, 18 Aug 2013 16:40:46 +0800
Message-Id: <1376815249-6611-2-git-send-email-bob.liu@oracle.com>
In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu

Zswap will be used to replace zram.

Signed-off-by: Bob Liu --- drivers/staging/Kconfig | 4 - drivers/staging/Makefile | 2 - drivers/staging/zram/Kconfig | 25 - drivers/staging/zram/Makefile | 3 - drivers/staging/zram/zram.txt | 77 --- drivers/staging/zram/zram_drv.c | 925 -------------------------- drivers/staging/zram/zram_drv.h | 115 ---- drivers/staging/zsmalloc/Kconfig | 10 - drivers/staging/zsmalloc/Makefile | 3 - drivers/staging/zsmalloc/zsmalloc-main.c | 1063 ------------------------------ drivers/staging/zsmalloc/zsmalloc.h | 43 -- 11 files changed, 2270 deletions(-) delete mode 100644 drivers/staging/zram/Kconfig delete mode 100644 drivers/staging/zram/Makefile delete mode 100644 drivers/staging/zram/zram.txt delete mode 100644 drivers/staging/zram/zram_drv.c delete mode 100644 drivers/staging/zram/zram_drv.h delete mode 100644 drivers/staging/zsmalloc/Kconfig delete mode 100644 drivers/staging/zsmalloc/Makefile delete mode 100644 drivers/staging/zsmalloc/zsmalloc-main.c delete mode 100644 drivers/staging/zsmalloc/zsmalloc.h diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig index 57d8b34..d5355f4 100644 --- a/drivers/staging/Kconfig +++ b/drivers/staging/Kconfig @@ -74,10 +74,6 @@ source "drivers/staging/sep/Kconfig" source "drivers/staging/iio/Kconfig" -source "drivers/staging/zsmalloc/Kconfig" - -source "drivers/staging/zram/Kconfig" - source "drivers/staging/wlags49_h2/Kconfig" source "drivers/staging/wlags49_h25/Kconfig" diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index 429321f..17a828f 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -31,8 +31,6 @@ obj-$(CONFIG_VT6656) += vt6656/ obj-$(CONFIG_VME_BUS) += vme/ obj-$(CONFIG_DX_SEP) += sep/ obj-$(CONFIG_IIO) += iio/ -obj-$(CONFIG_ZRAM) += zram/ -obj-$(CONFIG_ZSMALLOC) += zsmalloc/ obj-$(CONFIG_WLAGS49_H2) += wlags49_h2/ obj-$(CONFIG_WLAGS49_H25) += wlags49_h25/ obj-$(CONFIG_FB_SM7XX) += sm7xxfb/ diff --git a/drivers/staging/zram/Kconfig b/drivers/staging/zram/Kconfig deleted file 
mode 100644 index 983314c..0000000 --- a/drivers/staging/zram/Kconfig +++ /dev/null @@ -1,25 +0,0 @@ -config ZRAM - tristate "Compressed RAM block device support" - depends on BLOCK && SYSFS && ZSMALLOC - select LZO_COMPRESS - select LZO_DECOMPRESS - default n - help - Creates virtual block devices called /dev/zramX (X = 0, 1, ...). - Pages written to these disks are compressed and stored in memory - itself. These disks allow very fast I/O and compression provides - good amounts of memory savings. - - It has several use cases, for example: /tmp storage, use as swap - disks and maybe many more. - - See zram.txt for more information. - Project home: - -config ZRAM_DEBUG - bool "Compressed RAM block device debug support" - depends on ZRAM - default n - help - This option adds additional debugging code to the compressed - RAM block device driver. diff --git a/drivers/staging/zram/Makefile b/drivers/staging/zram/Makefile deleted file mode 100644 index cb0f9ce..0000000 --- a/drivers/staging/zram/Makefile +++ /dev/null @@ -1,3 +0,0 @@ -zram-y := zram_drv.o - -obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/staging/zram/zram.txt b/drivers/staging/zram/zram.txt deleted file mode 100644 index 765d790..0000000 --- a/drivers/staging/zram/zram.txt +++ /dev/null @@ -1,77 +0,0 @@ -zram: Compressed RAM based block devices ----------------------------------------- - -Project home: http://compcache.googlecode.com/ - -* Introduction - -The zram module creates RAM based block devices named /dev/zram -( = 0, 1, ...). Pages written to these disks are compressed and stored -in memory itself. These disks allow very fast I/O and compression provides -good amounts of memory savings. Some of the usecases include /tmp storage, -use as swap disks, various caches under /var and maybe many more :) - -Statistics for individual zram devices are exported through sysfs nodes at -/sys/block/zram/ - -* Usage - -Following shows a typical sequence of steps for using zram. 
- -1) Load Module: - modprobe zram num_devices=4 - This creates 4 devices: /dev/zram{0,1,2,3} - (num_devices parameter is optional. Default: 1) - -2) Set Disksize - Set disk size by writing the value to sysfs node 'disksize'. - The value can be either in bytes or you can use mem suffixes. - Examples: - # Initialize /dev/zram0 with 50MB disksize - echo $((50*1024*1024)) > /sys/block/zram0/disksize - - # Using mem suffixes - echo 256K > /sys/block/zram0/disksize - echo 512M > /sys/block/zram0/disksize - echo 1G > /sys/block/zram0/disksize - -3) Activate: - mkswap /dev/zram0 - swapon /dev/zram0 - - mkfs.ext4 /dev/zram1 - mount /dev/zram1 /tmp - -4) Stats: - Per-device statistics are exported as various nodes under - /sys/block/zram/ - disksize - num_reads - num_writes - invalid_io - notify_free - discard - zero_pages - orig_data_size - compr_data_size - mem_used_total - -5) Deactivate: - swapoff /dev/zram0 - umount /dev/zram1 - -6) Reset: - Write any positive value to 'reset' sysfs node - echo 1 > /sys/block/zram0/reset - echo 1 > /sys/block/zram1/reset - - This frees all the memory allocated for the given device and - resets the disksize to zero. You must set the disksize again - before reusing the device. - -Please report any problems at: - - Mailing list: linux-mm-cc at laptop dot org - - Issue tracker: http://code.google.com/p/compcache/issues/list - -Nitin Gupta -ngupta@vflare.org diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c deleted file mode 100644 index e77fb6e..0000000 --- a/drivers/staging/zram/zram_drv.c +++ /dev/null @@ -1,925 +0,0 @@ -/* - * Compressed RAM block device - * - * Copyright (C) 2008, 2009, 2010 Nitin Gupta - * - * This code is released using a dual license strategy: BSD/GPL - * You can choose the licence that better fits your requirements. 
- * - * Released under the terms of 3-clause BSD License - * Released under the terms of GNU General Public License Version 2.0 - * - * Project home: http://compcache.googlecode.com - */ - -#define KMSG_COMPONENT "zram" -#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt - -#ifdef CONFIG_ZRAM_DEBUG -#define DEBUG -#endif - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "zram_drv.h" - -/* Globals */ -static int zram_major; -static struct zram *zram_devices; - -/* Module params (documentation at end) */ -static unsigned int num_devices = 1; - -static inline struct zram *dev_to_zram(struct device *dev) -{ - return (struct zram *)dev_to_disk(dev)->private_data; -} - -static ssize_t disksize_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", zram->disksize); -} - -static ssize_t initstate_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%u\n", zram->init_done); -} - -static ssize_t num_reads_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.num_reads)); -} - -static ssize_t num_writes_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.num_writes)); -} - -static ssize_t invalid_io_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.invalid_io)); -} - -static ssize_t notify_free_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - 
(u64)atomic64_read(&zram->stats.notify_free)); -} - -static ssize_t zero_pages_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%u\n", zram->stats.pages_zero); -} - -static ssize_t orig_data_size_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)(zram->stats.pages_stored) << PAGE_SHIFT); -} - -static ssize_t compr_data_size_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.compr_size)); -} - -static ssize_t mem_used_total_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - u64 val = 0; - struct zram *zram = dev_to_zram(dev); - struct zram_meta *meta = zram->meta; - - down_read(&zram->init_lock); - if (zram->init_done) - val = zs_get_total_size_bytes(meta->mem_pool); - up_read(&zram->init_lock); - - return sprintf(buf, "%llu\n", val); -} - -static int zram_test_flag(struct zram_meta *meta, u32 index, - enum zram_pageflags flag) -{ - return meta->table[index].flags & BIT(flag); -} - -static void zram_set_flag(struct zram_meta *meta, u32 index, - enum zram_pageflags flag) -{ - meta->table[index].flags |= BIT(flag); -} - -static void zram_clear_flag(struct zram_meta *meta, u32 index, - enum zram_pageflags flag) -{ - meta->table[index].flags &= ~BIT(flag); -} - -static inline int is_partial_io(struct bio_vec *bvec) -{ - return bvec->bv_len != PAGE_SIZE; -} - -/* - * Check if request is within bounds and aligned on zram logical blocks. 
- */ -static inline int valid_io_request(struct zram *zram, struct bio *bio) -{ - u64 start, end, bound; - - /* unaligned request */ - if (unlikely(bio->bi_sector & (ZRAM_SECTOR_PER_LOGICAL_BLOCK - 1))) - return 0; - if (unlikely(bio->bi_size & (ZRAM_LOGICAL_BLOCK_SIZE - 1))) - return 0; - - start = bio->bi_sector; - end = start + (bio->bi_size >> SECTOR_SHIFT); - bound = zram->disksize >> SECTOR_SHIFT; - /* out of range range */ - if (unlikely(start >= bound || end > bound || start > end)) - return 0; - - /* I/O request is valid */ - return 1; -} - -static void zram_meta_free(struct zram_meta *meta) -{ - zs_destroy_pool(meta->mem_pool); - kfree(meta->compress_workmem); - free_pages((unsigned long)meta->compress_buffer, 1); - vfree(meta->table); - kfree(meta); -} - -static struct zram_meta *zram_meta_alloc(u64 disksize) -{ - size_t num_pages; - struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL); - if (!meta) - goto out; - - meta->compress_workmem = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); - if (!meta->compress_workmem) - goto free_meta; - - meta->compress_buffer = - (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1); - if (!meta->compress_buffer) { - pr_err("Error allocating compressor buffer space\n"); - goto free_workmem; - } - - num_pages = disksize >> PAGE_SHIFT; - meta->table = vzalloc(num_pages * sizeof(*meta->table)); - if (!meta->table) { - pr_err("Error allocating zram address table\n"); - goto free_buffer; - } - - meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM); - if (!meta->mem_pool) { - pr_err("Error creating memory pool\n"); - goto free_table; - } - - return meta; - -free_table: - vfree(meta->table); -free_buffer: - free_pages((unsigned long)meta->compress_buffer, 1); -free_workmem: - kfree(meta->compress_workmem); -free_meta: - kfree(meta); - meta = NULL; -out: - return meta; -} - -static void update_position(u32 *index, int *offset, struct bio_vec *bvec) -{ - if (*offset + bvec->bv_len >= PAGE_SIZE) - (*index)++; - *offset = 
(*offset + bvec->bv_len) % PAGE_SIZE; -} - -static int page_zero_filled(void *ptr) -{ - unsigned int pos; - unsigned long *page; - - page = (unsigned long *)ptr; - - for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) { - if (page[pos]) - return 0; - } - - return 1; -} - -static void handle_zero_page(struct bio_vec *bvec) -{ - struct page *page = bvec->bv_page; - void *user_mem; - - user_mem = kmap_atomic(page); - if (is_partial_io(bvec)) - memset(user_mem + bvec->bv_offset, 0, bvec->bv_len); - else - clear_page(user_mem); - kunmap_atomic(user_mem); - - flush_dcache_page(page); -} - -static void zram_free_page(struct zram *zram, size_t index) -{ - struct zram_meta *meta = zram->meta; - unsigned long handle = meta->table[index].handle; - u16 size = meta->table[index].size; - - if (unlikely(!handle)) { - /* - * No memory is allocated for zero filled pages. - * Simply clear zero page flag. - */ - if (zram_test_flag(meta, index, ZRAM_ZERO)) { - zram_clear_flag(meta, index, ZRAM_ZERO); - zram->stats.pages_zero--; - } - return; - } - - if (unlikely(size > max_zpage_size)) - zram->stats.bad_compress--; - - zs_free(meta->mem_pool, handle); - - if (size <= PAGE_SIZE / 2) - zram->stats.good_compress--; - - atomic64_sub(meta->table[index].size, &zram->stats.compr_size); - zram->stats.pages_stored--; - - meta->table[index].handle = 0; - meta->table[index].size = 0; -} - -static int zram_decompress_page(struct zram *zram, char *mem, u32 index) -{ - int ret = LZO_E_OK; - size_t clen = PAGE_SIZE; - unsigned char *cmem; - struct zram_meta *meta = zram->meta; - unsigned long handle = meta->table[index].handle; - - if (!handle || zram_test_flag(meta, index, ZRAM_ZERO)) { - clear_page(mem); - return 0; - } - - cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_RO); - if (meta->table[index].size == PAGE_SIZE) - copy_page(mem, cmem); - else - ret = lzo1x_decompress_safe(cmem, meta->table[index].size, - mem, &clen); - zs_unmap_object(meta->mem_pool, handle); - - /* Should NEVER 
happen. Return bio error if it does. */ - if (unlikely(ret != LZO_E_OK)) { - pr_err("Decompression failed! err=%d, page=%u\n", ret, index); - atomic64_inc(&zram->stats.failed_reads); - return ret; - } - - return 0; -} - -static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec, - u32 index, int offset, struct bio *bio) -{ - int ret; - struct page *page; - unsigned char *user_mem, *uncmem = NULL; - struct zram_meta *meta = zram->meta; - page = bvec->bv_page; - - if (unlikely(!meta->table[index].handle) || - zram_test_flag(meta, index, ZRAM_ZERO)) { - handle_zero_page(bvec); - return 0; - } - - if (is_partial_io(bvec)) - /* Use a temporary buffer to decompress the page */ - uncmem = kmalloc(PAGE_SIZE, GFP_NOIO); - - user_mem = kmap_atomic(page); - if (!is_partial_io(bvec)) - uncmem = user_mem; - - if (!uncmem) { - pr_info("Unable to allocate temp memory\n"); - ret = -ENOMEM; - goto out_cleanup; - } - - ret = zram_decompress_page(zram, uncmem, index); - /* Should NEVER happen. Return bio error if it does. */ - if (unlikely(ret != LZO_E_OK)) - goto out_cleanup; - - if (is_partial_io(bvec)) - memcpy(user_mem + bvec->bv_offset, uncmem + offset, - bvec->bv_len); - - flush_dcache_page(page); - ret = 0; -out_cleanup: - kunmap_atomic(user_mem); - if (is_partial_io(bvec)) - kfree(uncmem); - return ret; -} - -static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, - int offset) -{ - int ret = 0; - size_t clen; - unsigned long handle; - struct page *page; - unsigned char *user_mem, *cmem, *src, *uncmem = NULL; - struct zram_meta *meta = zram->meta; - - page = bvec->bv_page; - src = meta->compress_buffer; - - if (is_partial_io(bvec)) { - /* - * This is a partial IO. We need to read the full page - * before to write the changes. - */ - uncmem = kmalloc(PAGE_SIZE, GFP_NOIO); - if (!uncmem) { - ret = -ENOMEM; - goto out; - } - ret = zram_decompress_page(zram, uncmem, index); - if (ret) - goto out; - } - - /* - * System overwrites unused sectors. 
Free memory associated - * with this sector now. - */ - if (meta->table[index].handle || - zram_test_flag(meta, index, ZRAM_ZERO)) - zram_free_page(zram, index); - - user_mem = kmap_atomic(page); - - if (is_partial_io(bvec)) { - memcpy(uncmem + offset, user_mem + bvec->bv_offset, - bvec->bv_len); - kunmap_atomic(user_mem); - user_mem = NULL; - } else { - uncmem = user_mem; - } - - if (page_zero_filled(uncmem)) { - kunmap_atomic(user_mem); - zram->stats.pages_zero++; - zram_set_flag(meta, index, ZRAM_ZERO); - ret = 0; - goto out; - } - - ret = lzo1x_1_compress(uncmem, PAGE_SIZE, src, &clen, - meta->compress_workmem); - - if (!is_partial_io(bvec)) { - kunmap_atomic(user_mem); - user_mem = NULL; - uncmem = NULL; - } - - if (unlikely(ret != LZO_E_OK)) { - pr_err("Compression failed! err=%d\n", ret); - goto out; - } - - if (unlikely(clen > max_zpage_size)) { - zram->stats.bad_compress++; - clen = PAGE_SIZE; - src = NULL; - if (is_partial_io(bvec)) - src = uncmem; - } - - handle = zs_malloc(meta->mem_pool, clen); - if (!handle) { - pr_info("Error allocating memory for compressed page: %u, size=%zu\n", - index, clen); - ret = -ENOMEM; - goto out; - } - cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO); - - if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) { - src = kmap_atomic(page); - copy_page(cmem, src); - kunmap_atomic(src); - } else { - memcpy(cmem, src, clen); - } - - zs_unmap_object(meta->mem_pool, handle); - - meta->table[index].handle = handle; - meta->table[index].size = clen; - - /* Update stats */ - atomic64_add(clen, &zram->stats.compr_size); - zram->stats.pages_stored++; - if (clen <= PAGE_SIZE / 2) - zram->stats.good_compress++; - -out: - if (is_partial_io(bvec)) - kfree(uncmem); - - if (ret) - atomic64_inc(&zram->stats.failed_writes); - return ret; -} - -static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, - int offset, struct bio *bio, int rw) -{ - int ret; - - if (rw == READ) { - down_read(&zram->lock); - ret = 
zram_bvec_read(zram, bvec, index, offset, bio); - up_read(&zram->lock); - } else { - down_write(&zram->lock); - ret = zram_bvec_write(zram, bvec, index, offset); - up_write(&zram->lock); - } - - return ret; -} - -static void zram_reset_device(struct zram *zram) -{ - size_t index; - struct zram_meta *meta; - - down_write(&zram->init_lock); - if (!zram->init_done) { - up_write(&zram->init_lock); - return; - } - - meta = zram->meta; - zram->init_done = 0; - - /* Free all pages that are still in this zram device */ - for (index = 0; index < zram->disksize >> PAGE_SHIFT; index++) { - unsigned long handle = meta->table[index].handle; - if (!handle) - continue; - - zs_free(meta->mem_pool, handle); - } - - zram_meta_free(zram->meta); - zram->meta = NULL; - /* Reset stats */ - memset(&zram->stats, 0, sizeof(zram->stats)); - - zram->disksize = 0; - set_capacity(zram->disk, 0); - up_write(&zram->init_lock); -} - -static void zram_init_device(struct zram *zram, struct zram_meta *meta) -{ - if (zram->disksize > 2 * (totalram_pages << PAGE_SHIFT)) { - pr_info( - "There is little point creating a zram of greater than " - "twice the size of memory since we expect a 2:1 compression " - "ratio. 
Note that zram uses about 0.1%% of the size of " - "the disk when not in use so a huge zram is " - "wasteful.\n" - "\tMemory Size: %lu kB\n" - "\tSize you selected: %llu kB\n" - "Continuing anyway ...\n", - (totalram_pages << PAGE_SHIFT) >> 10, zram->disksize >> 10 - ); - } - - /* zram devices sort of resembles non-rotational disks */ - queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue); - - zram->meta = meta; - zram->init_done = 1; - - pr_debug("Initialization done!\n"); -} - -static ssize_t disksize_store(struct device *dev, - struct device_attribute *attr, const char *buf, size_t len) -{ - u64 disksize; - struct zram_meta *meta; - struct zram *zram = dev_to_zram(dev); - - disksize = memparse(buf, NULL); - if (!disksize) - return -EINVAL; - - disksize = PAGE_ALIGN(disksize); - meta = zram_meta_alloc(disksize); - down_write(&zram->init_lock); - if (zram->init_done) { - up_write(&zram->init_lock); - zram_meta_free(meta); - pr_info("Cannot change disksize for initialized device\n"); - return -EBUSY; - } - - zram->disksize = disksize; - set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT); - zram_init_device(zram, meta); - up_write(&zram->init_lock); - - return len; -} - -static ssize_t reset_store(struct device *dev, - struct device_attribute *attr, const char *buf, size_t len) -{ - int ret; - unsigned short do_reset; - struct zram *zram; - struct block_device *bdev; - - zram = dev_to_zram(dev); - bdev = bdget_disk(zram->disk, 0); - - /* Do not reset an active device! 
*/ - if (bdev->bd_holders) - return -EBUSY; - - ret = kstrtou16(buf, 10, &do_reset); - if (ret) - return ret; - - if (!do_reset) - return -EINVAL; - - /* Make sure all pending I/O is finished */ - if (bdev) - fsync_bdev(bdev); - - zram_reset_device(zram); - return len; -} - -static void __zram_make_request(struct zram *zram, struct bio *bio, int rw) -{ - int i, offset; - u32 index; - struct bio_vec *bvec; - - switch (rw) { - case READ: - atomic64_inc(&zram->stats.num_reads); - break; - case WRITE: - atomic64_inc(&zram->stats.num_writes); - break; - } - - index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT; - offset = (bio->bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; - - bio_for_each_segment(bvec, bio, i) { - int max_transfer_size = PAGE_SIZE - offset; - - if (bvec->bv_len > max_transfer_size) { - /* - * zram_bvec_rw() can only make operation on a single - * zram page. Split the bio vector. - */ - struct bio_vec bv; - - bv.bv_page = bvec->bv_page; - bv.bv_len = max_transfer_size; - bv.bv_offset = bvec->bv_offset; - - if (zram_bvec_rw(zram, &bv, index, offset, bio, rw) < 0) - goto out; - - bv.bv_len = bvec->bv_len - max_transfer_size; - bv.bv_offset += max_transfer_size; - if (zram_bvec_rw(zram, &bv, index+1, 0, bio, rw) < 0) - goto out; - } else - if (zram_bvec_rw(zram, bvec, index, offset, bio, rw) - < 0) - goto out; - - update_position(&index, &offset, bvec); - } - - set_bit(BIO_UPTODATE, &bio->bi_flags); - bio_endio(bio, 0); - return; - -out: - bio_io_error(bio); -} - -/* - * Handler function for all zram I/O requests. 
- */ -static void zram_make_request(struct request_queue *queue, struct bio *bio) -{ - struct zram *zram = queue->queuedata; - - down_read(&zram->init_lock); - if (unlikely(!zram->init_done)) - goto error; - - if (!valid_io_request(zram, bio)) { - atomic64_inc(&zram->stats.invalid_io); - goto error; - } - - __zram_make_request(zram, bio, bio_data_dir(bio)); - up_read(&zram->init_lock); - - return; - -error: - up_read(&zram->init_lock); - bio_io_error(bio); -} - -static void zram_slot_free_notify(struct block_device *bdev, - unsigned long index) -{ - struct zram *zram; - - zram = bdev->bd_disk->private_data; - down_write(&zram->lock); - zram_free_page(zram, index); - up_write(&zram->lock); - atomic64_inc(&zram->stats.notify_free); -} - -static const struct block_device_operations zram_devops = { - .swap_slot_free_notify = zram_slot_free_notify, - .owner = THIS_MODULE -}; - -static DEVICE_ATTR(disksize, S_IRUGO | S_IWUSR, - disksize_show, disksize_store); -static DEVICE_ATTR(initstate, S_IRUGO, initstate_show, NULL); -static DEVICE_ATTR(reset, S_IWUSR, NULL, reset_store); -static DEVICE_ATTR(num_reads, S_IRUGO, num_reads_show, NULL); -static DEVICE_ATTR(num_writes, S_IRUGO, num_writes_show, NULL); -static DEVICE_ATTR(invalid_io, S_IRUGO, invalid_io_show, NULL); -static DEVICE_ATTR(notify_free, S_IRUGO, notify_free_show, NULL); -static DEVICE_ATTR(zero_pages, S_IRUGO, zero_pages_show, NULL); -static DEVICE_ATTR(orig_data_size, S_IRUGO, orig_data_size_show, NULL); -static DEVICE_ATTR(compr_data_size, S_IRUGO, compr_data_size_show, NULL); -static DEVICE_ATTR(mem_used_total, S_IRUGO, mem_used_total_show, NULL); - -static struct attribute *zram_disk_attrs[] = { - &dev_attr_disksize.attr, - &dev_attr_initstate.attr, - &dev_attr_reset.attr, - &dev_attr_num_reads.attr, - &dev_attr_num_writes.attr, - &dev_attr_invalid_io.attr, - &dev_attr_notify_free.attr, - &dev_attr_zero_pages.attr, - &dev_attr_orig_data_size.attr, - &dev_attr_compr_data_size.attr, - 
&dev_attr_mem_used_total.attr, - NULL, -}; - -static struct attribute_group zram_disk_attr_group = { - .attrs = zram_disk_attrs, -}; - -static int create_device(struct zram *zram, int device_id) -{ - int ret = -ENOMEM; - - init_rwsem(&zram->lock); - init_rwsem(&zram->init_lock); - - zram->queue = blk_alloc_queue(GFP_KERNEL); - if (!zram->queue) { - pr_err("Error allocating disk queue for device %d\n", - device_id); - goto out; - } - - blk_queue_make_request(zram->queue, zram_make_request); - zram->queue->queuedata = zram; - - /* gendisk structure */ - zram->disk = alloc_disk(1); - if (!zram->disk) { - pr_warn("Error allocating disk structure for device %d\n", - device_id); - goto out_free_queue; - } - - zram->disk->major = zram_major; - zram->disk->first_minor = device_id; - zram->disk->fops = &zram_devops; - zram->disk->queue = zram->queue; - zram->disk->private_data = zram; - snprintf(zram->disk->disk_name, 16, "zram%d", device_id); - - /* Actual capacity set using syfs (/sys/block/zram/disksize */ - set_capacity(zram->disk, 0); - - /* - * To ensure that we always get PAGE_SIZE aligned - * and n*PAGE_SIZED sized I/O requests. 
- */ - blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE); - blk_queue_logical_block_size(zram->disk->queue, - ZRAM_LOGICAL_BLOCK_SIZE); - blk_queue_io_min(zram->disk->queue, PAGE_SIZE); - blk_queue_io_opt(zram->disk->queue, PAGE_SIZE); - - add_disk(zram->disk); - - ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj, - &zram_disk_attr_group); - if (ret < 0) { - pr_warn("Error creating sysfs group"); - goto out_free_disk; - } - - zram->init_done = 0; - return 0; - -out_free_disk: - del_gendisk(zram->disk); - put_disk(zram->disk); -out_free_queue: - blk_cleanup_queue(zram->queue); -out: - return ret; -} - -static void destroy_device(struct zram *zram) -{ - sysfs_remove_group(&disk_to_dev(zram->disk)->kobj, - &zram_disk_attr_group); - - if (zram->disk) { - del_gendisk(zram->disk); - put_disk(zram->disk); - } - - if (zram->queue) - blk_cleanup_queue(zram->queue); -} - -static int __init zram_init(void) -{ - int ret, dev_id; - - if (num_devices > max_num_devices) { - pr_warn("Invalid value for num_devices: %u\n", - num_devices); - ret = -EINVAL; - goto out; - } - - zram_major = register_blkdev(0, "zram"); - if (zram_major <= 0) { - pr_warn("Unable to get major number\n"); - ret = -EBUSY; - goto out; - } - - /* Allocate the device array and initialize each one */ - zram_devices = kzalloc(num_devices * sizeof(struct zram), GFP_KERNEL); - if (!zram_devices) { - ret = -ENOMEM; - goto unregister; - } - - for (dev_id = 0; dev_id < num_devices; dev_id++) { - ret = create_device(&zram_devices[dev_id], dev_id); - if (ret) - goto free_devices; - } - - pr_info("Created %u device(s) ...\n", num_devices); - - return 0; - -free_devices: - while (dev_id) - destroy_device(&zram_devices[--dev_id]); - kfree(zram_devices); -unregister: - unregister_blkdev(zram_major, "zram"); -out: - return ret; -} - -static void __exit zram_exit(void) -{ - int i; - struct zram *zram; - - for (i = 0; i < num_devices; i++) { - zram = &zram_devices[i]; - - get_disk(zram->disk); - 
destroy_device(zram); - zram_reset_device(zram); - put_disk(zram->disk); - } - - unregister_blkdev(zram_major, "zram"); - - kfree(zram_devices); - pr_debug("Cleanup done!\n"); -} - -module_init(zram_init); -module_exit(zram_exit); - -module_param(num_devices, uint, 0); -MODULE_PARM_DESC(num_devices, "Number of zram devices"); - -MODULE_LICENSE("Dual BSD/GPL"); -MODULE_AUTHOR("Nitin Gupta "); -MODULE_DESCRIPTION("Compressed RAM Block Device"); diff --git a/drivers/staging/zram/zram_drv.h b/drivers/staging/zram/zram_drv.h deleted file mode 100644 index 9e57bfb..0000000 --- a/drivers/staging/zram/zram_drv.h +++ /dev/null @@ -1,115 +0,0 @@ -/* - * Compressed RAM block device - * - * Copyright (C) 2008, 2009, 2010 Nitin Gupta - * - * This code is released using a dual license strategy: BSD/GPL - * You can choose the licence that better fits your requirements. - * - * Released under the terms of 3-clause BSD License - * Released under the terms of GNU General Public License Version 2.0 - * - * Project home: http://compcache.googlecode.com - */ - -#ifndef _ZRAM_DRV_H_ -#define _ZRAM_DRV_H_ - -#include -#include - -#include "../zsmalloc/zsmalloc.h" - -/* - * Some arbitrary value. This is just to catch - * invalid value for num_devices module parameter. - */ -static const unsigned max_num_devices = 32; - -/*-- Configurable parameters */ - -/* - * Pages that compress to size greater than this are stored - * uncompressed in memory. - */ -static const size_t max_zpage_size = PAGE_SIZE / 4 * 3; - -/* - * NOTE: max_zpage_size must be less than or equal to: - * ZS_MAX_ALLOC_SIZE. Otherwise, zs_malloc() would - * always return failure. 
- */ - -/*-- End of configurable params */ - -#define SECTOR_SHIFT 9 -#define SECTOR_SIZE (1 << SECTOR_SHIFT) -#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) -#define SECTORS_PER_PAGE (1 << SECTORS_PER_PAGE_SHIFT) -#define ZRAM_LOGICAL_BLOCK_SHIFT 12 -#define ZRAM_LOGICAL_BLOCK_SIZE (1 << ZRAM_LOGICAL_BLOCK_SHIFT) -#define ZRAM_SECTOR_PER_LOGICAL_BLOCK \ - (1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT)) - -/* Flags for zram pages (table[page_no].flags) */ -enum zram_pageflags { - /* Page consists entirely of zeros */ - ZRAM_ZERO, - - __NR_ZRAM_PAGEFLAGS, -}; - -/*-- Data structures */ - -/* Allocated for each disk page */ -struct table { - unsigned long handle; - u16 size; /* object size (excluding header) */ - u8 count; /* object ref count (not yet used) */ - u8 flags; -} __aligned(4); - -/* - * All 64bit fields should only be manipulated by 64bit atomic accessors. - * All modifications to 32bit counter should be protected by zram->lock. - */ -struct zram_stats { - atomic64_t compr_size; /* compressed size of pages stored */ - atomic64_t num_reads; /* failed + successful */ - atomic64_t num_writes; /* --do-- */ - atomic64_t failed_reads; /* should NEVER! happen */ - atomic64_t failed_writes; /* can happen when memory is too low */ - atomic64_t invalid_io; /* non-page-aligned I/O requests */ - atomic64_t notify_free; /* no. of swap slot free notifications */ - u32 pages_zero; /* no. of zero filled pages */ - u32 pages_stored; /* no. 
of pages currently stored */ - u32 good_compress; /* % of pages with compression ratio<=50% */ - u32 bad_compress; /* % of pages with compression ratio>=75% */ -}; - -struct zram_meta { - void *compress_workmem; - void *compress_buffer; - struct table *table; - struct zs_pool *mem_pool; -}; - -struct zram { - struct zram_meta *meta; - struct rw_semaphore lock; /* protect compression buffers, table, - * 32bit stat counters against concurrent - * notifications, reads and writes */ - struct request_queue *queue; - struct gendisk *disk; - int init_done; - /* Prevent concurrent execution of device init, reset and R/W request */ - struct rw_semaphore init_lock; - /* - * This is the limit on amount of *uncompressed* worth of data - * we can store in a disk. - */ - u64 disksize; /* bytes */ - - struct zram_stats stats; -}; -#endif diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig deleted file mode 100644 index 7fab032..0000000 --- a/drivers/staging/zsmalloc/Kconfig +++ /dev/null @@ -1,10 +0,0 @@ -config ZSMALLOC - bool "Memory allocator for compressed pages" - default n - help - zsmalloc is a slab-based memory allocator designed to store - compressed RAM pages. zsmalloc uses virtual memory mapping - in order to reduce fragmentation. However, this results in a - non-standard allocator interface where a handle, not a pointer, is - returned by an alloc(). This handle must be mapped in order to - access the allocated space. 
diff --git a/drivers/staging/zsmalloc/Makefile b/drivers/staging/zsmalloc/Makefile deleted file mode 100644 index b134848..0000000 --- a/drivers/staging/zsmalloc/Makefile +++ /dev/null @@ -1,3 +0,0 @@ -zsmalloc-y := zsmalloc-main.o - -obj-$(CONFIG_ZSMALLOC) += zsmalloc.o diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c deleted file mode 100644 index 4bb275b..0000000 --- a/drivers/staging/zsmalloc/zsmalloc-main.c +++ /dev/null @@ -1,1063 +0,0 @@ -/* - * zsmalloc memory allocator - * - * Copyright (C) 2011 Nitin Gupta - * - * This code is released using a dual license strategy: BSD/GPL - * You can choose the license that better fits your requirements. - * - * Released under the terms of 3-clause BSD License - * Released under the terms of GNU General Public License Version 2.0 - */ - - -/* - * This allocator is designed for use with zcache and zram. Thus, the - * allocator is supposed to work well under low memory conditions. In - * particular, it never attempts higher order page allocation which is - * very likely to fail under memory pressure. On the other hand, if we - * just use single (0-order) pages, it would suffer from very high - * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy - * an entire page. This was one of the major issues with its predecessor - * (xvmalloc). - * - * To overcome these issues, zsmalloc allocates a bunch of 0-order pages - * and links them together using various 'struct page' fields. These linked - * pages act as a single higher-order page i.e. an object can span 0-order - * page boundaries. The code refers to these linked pages as a single entity - * called zspage. - * - * Following is how we use various fields and flags of underlying - * struct page(s) to form a zspage. 
- * - * Usage of struct page fields: - * page->first_page: points to the first component (0-order) page - * page->index (union with page->freelist): offset of the first object - * starting in this page. For the first page, this is - * always 0, so we use this field (aka freelist) to point - * to the first free object in zspage. - * page->lru: links together all component pages (except the first page) - * of a zspage - * - * For _first_ page only: - * - * page->private (union with page->first_page): refers to the - * component page after the first page - * page->freelist: points to the first free object in zspage. - * Free objects are linked together using in-place - * metadata. - * page->objects: maximum number of objects we can store in this - * zspage (class->zspage_order * PAGE_SIZE / class->size) - * page->lru: links together first pages of various zspages. - * Basically forming list of zspages in a fullness group. - * page->mapping: class index and fullness group of the zspage - * - * Usage of struct page flags: - * PG_private: identifies the first component page - * PG_private2: identifies the last component page - * - */ - -#ifdef CONFIG_ZSMALLOC_DEBUG -#define DEBUG -#endif - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "zsmalloc.h" - -/* - * This must be power of 2 and greater than of equal to sizeof(link_free). - * These two conditions ensure that any 'struct link_free' itself doesn't - * span more than 1 page which avoids complex case of mapping 2 pages simply - * to restore link_free pointer values. - */ -#define ZS_ALIGN 8 - -/* - * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single) - * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N. 
- */ -#define ZS_MAX_ZSPAGE_ORDER 2 -#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER) - -/* - * Object location (, ) is encoded as - * as single (void *) handle value. - * - * Note that object index is relative to system - * page it is stored in, so for each sub-page belonging - * to a zspage, obj_idx starts with 0. - * - * This is made more complicated by various memory models and PAE. - */ - -#ifndef MAX_PHYSMEM_BITS -#ifdef CONFIG_HIGHMEM64G -#define MAX_PHYSMEM_BITS 36 -#else /* !CONFIG_HIGHMEM64G */ -/* - * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just - * be PAGE_SHIFT - */ -#define MAX_PHYSMEM_BITS BITS_PER_LONG -#endif -#endif -#define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) -#define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS) -#define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1) - -#define MAX(a, b) ((a) >= (b) ? (a) : (b)) -/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */ -#define ZS_MIN_ALLOC_SIZE \ - MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) -#define ZS_MAX_ALLOC_SIZE PAGE_SIZE - -/* - * On systems with 4K page size, this gives 254 size classes! There is a - * trader-off here: - * - Large number of size classes is potentially wasteful as free page are - * spread across these classes - * - Small number of size classes causes large internal fragmentation - * - Probably its better to use specific size classes (empirically - * determined). NOTE: all those class sizes must be set as multiple of - * ZS_ALIGN to make sure link_free itself never has to span 2 pages. 
- * - * ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN - * (reason above) - */ -#define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> 8) -#define ZS_SIZE_CLASSES ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \ - ZS_SIZE_CLASS_DELTA + 1) - -/* - * We do not maintain any list for completely empty or full pages - */ -enum fullness_group { - ZS_ALMOST_FULL, - ZS_ALMOST_EMPTY, - _ZS_NR_FULLNESS_GROUPS, - - ZS_EMPTY, - ZS_FULL -}; - -/* - * We assign a page to ZS_ALMOST_EMPTY fullness group when: - * n <= N / f, where - * n = number of allocated objects - * N = total number of objects zspage can store - * f = 1/fullness_threshold_frac - * - * Similarly, we assign zspage to: - * ZS_ALMOST_FULL when n > N / f - * ZS_EMPTY when n == 0 - * ZS_FULL when n == N - * - * (see: fix_fullness_group()) - */ -static const int fullness_threshold_frac = 4; - -struct size_class { - /* - * Size of objects stored in this class. Must be multiple - * of ZS_ALIGN. - */ - int size; - unsigned int index; - - /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ - int pages_per_zspage; - - spinlock_t lock; - - /* stats */ - u64 pages_allocated; - - struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; -}; - -/* - * Placed within free objects to form a singly linked list. - * For every zspage, first_page->freelist gives head of this list. 
- * - * This must be power of 2 and less than or equal to ZS_ALIGN - */ -struct link_free { - /* Handle of next free chunk (encodes ) */ - void *next; -}; - -struct zs_pool { - struct size_class size_class[ZS_SIZE_CLASSES]; - - gfp_t flags; /* allocation flags used when growing pool */ -}; - -/* - * A zspage's class index and fullness group - * are encoded in its (first)page->mapping - */ -#define CLASS_IDX_BITS 28 -#define FULLNESS_BITS 4 -#define CLASS_IDX_MASK ((1 << CLASS_IDX_BITS) - 1) -#define FULLNESS_MASK ((1 << FULLNESS_BITS) - 1) - -/* - * By default, zsmalloc uses a copy-based object mapping method to access - * allocations that span two pages. However, if a particular architecture - * performs VM mapping faster than copying, then it should be added here - * so that USE_PGTABLE_MAPPING is defined. This causes zsmalloc to use - * page table mapping rather than copying for object mapping. - */ -#if defined(CONFIG_ARM) && !defined(MODULE) -#define USE_PGTABLE_MAPPING -#endif - -struct mapping_area { -#ifdef USE_PGTABLE_MAPPING - struct vm_struct *vm; /* vm area for mapping object that span pages */ -#else - char *vm_buf; /* copy buffer for objects that span pages */ -#endif - char *vm_addr; /* address of kmap_atomic()'ed pages */ - enum zs_mapmode vm_mm; /* mapping mode */ -}; - - -/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */ -static DEFINE_PER_CPU(struct mapping_area, zs_map_area); - -static int is_first_page(struct page *page) -{ - return PagePrivate(page); -} - -static int is_last_page(struct page *page) -{ - return PagePrivate2(page); -} - -static void get_zspage_mapping(struct page *page, unsigned int *class_idx, - enum fullness_group *fullness) -{ - unsigned long m; - BUG_ON(!is_first_page(page)); - - m = (unsigned long)page->mapping; - *fullness = m & FULLNESS_MASK; - *class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK; -} - -static void set_zspage_mapping(struct page *page, unsigned int class_idx, - enum 
fullness_group fullness) -{ - unsigned long m; - BUG_ON(!is_first_page(page)); - - m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) | - (fullness & FULLNESS_MASK); - page->mapping = (struct address_space *)m; -} - -static int get_size_class_index(int size) -{ - int idx = 0; - - if (likely(size > ZS_MIN_ALLOC_SIZE)) - idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE, - ZS_SIZE_CLASS_DELTA); - - return idx; -} - -static enum fullness_group get_fullness_group(struct page *page) -{ - int inuse, max_objects; - enum fullness_group fg; - BUG_ON(!is_first_page(page)); - - inuse = page->inuse; - max_objects = page->objects; - - if (inuse == 0) - fg = ZS_EMPTY; - else if (inuse == max_objects) - fg = ZS_FULL; - else if (inuse <= max_objects / fullness_threshold_frac) - fg = ZS_ALMOST_EMPTY; - else - fg = ZS_ALMOST_FULL; - - return fg; -} - -static void insert_zspage(struct page *page, struct size_class *class, - enum fullness_group fullness) -{ - struct page **head; - - BUG_ON(!is_first_page(page)); - - if (fullness >= _ZS_NR_FULLNESS_GROUPS) - return; - - head = &class->fullness_list[fullness]; - if (*head) - list_add_tail(&page->lru, &(*head)->lru); - - *head = page; -} - -static void remove_zspage(struct page *page, struct size_class *class, - enum fullness_group fullness) -{ - struct page **head; - - BUG_ON(!is_first_page(page)); - - if (fullness >= _ZS_NR_FULLNESS_GROUPS) - return; - - head = &class->fullness_list[fullness]; - BUG_ON(!*head); - if (list_empty(&(*head)->lru)) - *head = NULL; - else if (*head == page) - *head = (struct page *)list_entry((*head)->lru.next, - struct page, lru); - - list_del_init(&page->lru); -} - -static enum fullness_group fix_fullness_group(struct zs_pool *pool, - struct page *page) -{ - int class_idx; - struct size_class *class; - enum fullness_group currfg, newfg; - - BUG_ON(!is_first_page(page)); - - get_zspage_mapping(page, &class_idx, &currfg); - newfg = get_fullness_group(page); - if (newfg == currfg) - goto out; - - class = 
&pool->size_class[class_idx]; - remove_zspage(page, class, currfg); - insert_zspage(page, class, newfg); - set_zspage_mapping(page, class_idx, newfg); - -out: - return newfg; -} - -/* - * We have to decide on how many pages to link together - * to form a zspage for each size class. This is important - * to reduce wastage due to unusable space left at end of - * each zspage which is given as: - * wastage = Zp - Zp % size_class - * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ... - * - * For example, for size class of 3/8 * PAGE_SIZE, we should - * link together 3 PAGE_SIZE sized pages to form a zspage - * since then we can perfectly fit in 8 such objects. - */ -static int get_pages_per_zspage(int class_size) -{ - int i, max_usedpc = 0; - /* zspage order which gives maximum used size per KB */ - int max_usedpc_order = 1; - - for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) { - int zspage_size; - int waste, usedpc; - - zspage_size = i * PAGE_SIZE; - waste = zspage_size % class_size; - usedpc = (zspage_size - waste) * 100 / zspage_size; - - if (usedpc > max_usedpc) { - max_usedpc = usedpc; - max_usedpc_order = i; - } - } - - return max_usedpc_order; -} - -/* - * A single 'zspage' is composed of many system pages which are - * linked together using fields in struct page. This function finds - * the first/head page, given any component page of a zspage. 
- */ -static struct page *get_first_page(struct page *page) -{ - if (is_first_page(page)) - return page; - else - return page->first_page; -} - -static struct page *get_next_page(struct page *page) -{ - struct page *next; - - if (is_last_page(page)) - next = NULL; - else if (is_first_page(page)) - next = (struct page *)page->private; - else - next = list_entry(page->lru.next, struct page, lru); - - return next; -} - -/* Encode as a single handle value */ -static void *obj_location_to_handle(struct page *page, unsigned long obj_idx) -{ - unsigned long handle; - - if (!page) { - BUG_ON(obj_idx); - return NULL; - } - - handle = page_to_pfn(page) << OBJ_INDEX_BITS; - handle |= (obj_idx & OBJ_INDEX_MASK); - - return (void *)handle; -} - -/* Decode pair from the given object handle */ -static void obj_handle_to_location(unsigned long handle, struct page **page, - unsigned long *obj_idx) -{ - *page = pfn_to_page(handle >> OBJ_INDEX_BITS); - *obj_idx = handle & OBJ_INDEX_MASK; -} - -static unsigned long obj_idx_to_offset(struct page *page, - unsigned long obj_idx, int class_size) -{ - unsigned long off = 0; - - if (!is_first_page(page)) - off = page->index; - - return off + obj_idx * class_size; -} - -static void reset_page(struct page *page) -{ - clear_bit(PG_private, &page->flags); - clear_bit(PG_private_2, &page->flags); - set_page_private(page, 0); - page->mapping = NULL; - page->freelist = NULL; - page_mapcount_reset(page); -} - -static void free_zspage(struct page *first_page) -{ - struct page *nextp, *tmp, *head_extra; - - BUG_ON(!is_first_page(first_page)); - BUG_ON(first_page->inuse); - - head_extra = (struct page *)page_private(first_page); - - reset_page(first_page); - __free_page(first_page); - - /* zspage with only 1 system page */ - if (!head_extra) - return; - - list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) { - list_del(&nextp->lru); - reset_page(nextp); - __free_page(nextp); - } - reset_page(head_extra); - __free_page(head_extra); -} - -/* 
Initialize a newly allocated zspage */ -static void init_zspage(struct page *first_page, struct size_class *class) -{ - unsigned long off = 0; - struct page *page = first_page; - - BUG_ON(!is_first_page(first_page)); - while (page) { - struct page *next_page; - struct link_free *link; - unsigned int i, objs_on_page; - - /* - * page->index stores offset of first object starting - * in the page. For the first page, this is always 0, - * so we use first_page->index (aka ->freelist) to store - * head of corresponding zspage's freelist. - */ - if (page != first_page) - page->index = off; - - link = (struct link_free *)kmap_atomic(page) + - off / sizeof(*link); - objs_on_page = (PAGE_SIZE - off) / class->size; - - for (i = 1; i <= objs_on_page; i++) { - off += class->size; - if (off < PAGE_SIZE) { - link->next = obj_location_to_handle(page, i); - link += class->size / sizeof(*link); - } - } - - /* - * We now come to the last (full or partial) object on this - * page, which must point to the first object on the next - * page (if present) - */ - next_page = get_next_page(page); - link->next = obj_location_to_handle(next_page, 0); - kunmap_atomic(link); - page = next_page; - off = (off + class->size) % PAGE_SIZE; - } -} - -/* - * Allocate a zspage for the given size class - */ -static struct page *alloc_zspage(struct size_class *class, gfp_t flags) -{ - int i, error; - struct page *first_page = NULL, *uninitialized_var(prev_page); - - /* - * Allocate individual pages and link them together as: - * 1. first page->private = first sub-page - * 2. all sub-pages are linked together using page->lru - * 3. each sub-page is linked to the first page using page->first_page - * - * For each size class, First/Head pages are linked together using - * page->lru. Also, we set PG_private to identify the first page - * (i.e. no other sub-page has this flag set) and PG_private_2 to - * identify the last page. 
- */ - error = -ENOMEM; - for (i = 0; i < class->pages_per_zspage; i++) { - struct page *page; - - page = alloc_page(flags); - if (!page) - goto cleanup; - - INIT_LIST_HEAD(&page->lru); - if (i == 0) { /* first page */ - SetPagePrivate(page); - set_page_private(page, 0); - first_page = page; - first_page->inuse = 0; - } - if (i == 1) - first_page->private = (unsigned long)page; - if (i >= 1) - page->first_page = first_page; - if (i >= 2) - list_add(&page->lru, &prev_page->lru); - if (i == class->pages_per_zspage - 1) /* last page */ - SetPagePrivate2(page); - prev_page = page; - } - - init_zspage(first_page, class); - - first_page->freelist = obj_location_to_handle(first_page, 0); - /* Maximum number of objects we can store in this zspage */ - first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size; - - error = 0; /* Success */ - -cleanup: - if (unlikely(error) && first_page) { - free_zspage(first_page); - first_page = NULL; - } - - return first_page; -} - -static struct page *find_get_zspage(struct size_class *class) -{ - int i; - struct page *page; - - for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) { - page = class->fullness_list[i]; - if (page) - break; - } - - return page; -} - -#ifdef USE_PGTABLE_MAPPING -static inline int __zs_cpu_up(struct mapping_area *area) -{ - /* - * Make sure we don't leak memory if a cpu UP notification - * and zs_init() race and both call zs_cpu_up() on the same cpu - */ - if (area->vm) - return 0; - area->vm = alloc_vm_area(PAGE_SIZE * 2, NULL); - if (!area->vm) - return -ENOMEM; - return 0; -} - -static inline void __zs_cpu_down(struct mapping_area *area) -{ - if (area->vm) - free_vm_area(area->vm); - area->vm = NULL; -} - -static inline void *__zs_map_object(struct mapping_area *area, - struct page *pages[2], int off, int size) -{ - BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, &pages)); - area->vm_addr = area->vm->addr; - return area->vm_addr + off; -} - -static inline void __zs_unmap_object(struct mapping_area *area, - 
struct page *pages[2], int off, int size) -{ - unsigned long addr = (unsigned long)area->vm_addr; - - unmap_kernel_range(addr, PAGE_SIZE * 2); -} - -#else /* USE_PGTABLE_MAPPING */ - -static inline int __zs_cpu_up(struct mapping_area *area) -{ - /* - * Make sure we don't leak memory if a cpu UP notification - * and zs_init() race and both call zs_cpu_up() on the same cpu - */ - if (area->vm_buf) - return 0; - area->vm_buf = (char *)__get_free_page(GFP_KERNEL); - if (!area->vm_buf) - return -ENOMEM; - return 0; -} - -static inline void __zs_cpu_down(struct mapping_area *area) -{ - if (area->vm_buf) - free_page((unsigned long)area->vm_buf); - area->vm_buf = NULL; -} - -static void *__zs_map_object(struct mapping_area *area, - struct page *pages[2], int off, int size) -{ - int sizes[2]; - void *addr; - char *buf = area->vm_buf; - - /* disable page faults to match kmap_atomic() return conditions */ - pagefault_disable(); - - /* no read fastpath */ - if (area->vm_mm == ZS_MM_WO) - goto out; - - sizes[0] = PAGE_SIZE - off; - sizes[1] = size - sizes[0]; - - /* copy object to per-cpu buffer */ - addr = kmap_atomic(pages[0]); - memcpy(buf, addr + off, sizes[0]); - kunmap_atomic(addr); - addr = kmap_atomic(pages[1]); - memcpy(buf + sizes[0], addr, sizes[1]); - kunmap_atomic(addr); -out: - return area->vm_buf; -} - -static void __zs_unmap_object(struct mapping_area *area, - struct page *pages[2], int off, int size) -{ - int sizes[2]; - void *addr; - char *buf = area->vm_buf; - - /* no write fastpath */ - if (area->vm_mm == ZS_MM_RO) - goto out; - - sizes[0] = PAGE_SIZE - off; - sizes[1] = size - sizes[0]; - - /* copy per-cpu buffer to object */ - addr = kmap_atomic(pages[0]); - memcpy(addr + off, buf, sizes[0]); - kunmap_atomic(addr); - addr = kmap_atomic(pages[1]); - memcpy(addr, buf + sizes[0], sizes[1]); - kunmap_atomic(addr); - -out: - /* enable page faults to match kunmap_atomic() return conditions */ - pagefault_enable(); -} - -#endif /* USE_PGTABLE_MAPPING */ - -static 
int zs_cpu_notifier(struct notifier_block *nb, unsigned long action, - void *pcpu) -{ - int ret, cpu = (long)pcpu; - struct mapping_area *area; - - switch (action) { - case CPU_UP_PREPARE: - area = &per_cpu(zs_map_area, cpu); - ret = __zs_cpu_up(area); - if (ret) - return notifier_from_errno(ret); - break; - case CPU_DEAD: - case CPU_UP_CANCELED: - area = &per_cpu(zs_map_area, cpu); - __zs_cpu_down(area); - break; - } - - return NOTIFY_OK; -} - -static struct notifier_block zs_cpu_nb = { - .notifier_call = zs_cpu_notifier -}; - -static void zs_exit(void) -{ - int cpu; - - for_each_online_cpu(cpu) - zs_cpu_notifier(NULL, CPU_DEAD, (void *)(long)cpu); - unregister_cpu_notifier(&zs_cpu_nb); -} - -static int zs_init(void) -{ - int cpu, ret; - - register_cpu_notifier(&zs_cpu_nb); - for_each_online_cpu(cpu) { - ret = zs_cpu_notifier(NULL, CPU_UP_PREPARE, (void *)(long)cpu); - if (notifier_to_errno(ret)) - goto fail; - } - return 0; -fail: - zs_exit(); - return notifier_to_errno(ret); -} - -/** - * zs_create_pool - Creates an allocation pool to work from. - * @flags: allocation flags used to allocate pool metadata - * - * This function must be called before anything when using - * the zsmalloc allocator. - * - * On success, a pointer to the newly created pool is returned, - * otherwise NULL. 
- */ -struct zs_pool *zs_create_pool(gfp_t flags) -{ - int i, ovhd_size; - struct zs_pool *pool; - - ovhd_size = roundup(sizeof(*pool), PAGE_SIZE); - pool = kzalloc(ovhd_size, GFP_KERNEL); - if (!pool) - return NULL; - - for (i = 0; i < ZS_SIZE_CLASSES; i++) { - int size; - struct size_class *class; - - size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA; - if (size > ZS_MAX_ALLOC_SIZE) - size = ZS_MAX_ALLOC_SIZE; - - class = &pool->size_class[i]; - class->size = size; - class->index = i; - spin_lock_init(&class->lock); - class->pages_per_zspage = get_pages_per_zspage(size); - - } - - pool->flags = flags; - - return pool; -} -EXPORT_SYMBOL_GPL(zs_create_pool); - -void zs_destroy_pool(struct zs_pool *pool) -{ - int i; - - for (i = 0; i < ZS_SIZE_CLASSES; i++) { - int fg; - struct size_class *class = &pool->size_class[i]; - - for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) { - if (class->fullness_list[fg]) { - pr_info("Freeing non-empty class with size %db, fullness group %d\n", - class->size, fg); - } - } - } - kfree(pool); -} -EXPORT_SYMBOL_GPL(zs_destroy_pool); - -/** - * zs_malloc - Allocate block of given size from pool. - * @pool: pool to allocate from - * @size: size of block to allocate - * - * On success, handle to the allocated object is returned, - * otherwise 0. - * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail. 
- */ -unsigned long zs_malloc(struct zs_pool *pool, size_t size) -{ - unsigned long obj; - struct link_free *link; - int class_idx; - struct size_class *class; - - struct page *first_page, *m_page; - unsigned long m_objidx, m_offset; - - if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE)) - return 0; - - class_idx = get_size_class_index(size); - class = &pool->size_class[class_idx]; - BUG_ON(class_idx != class->index); - - spin_lock(&class->lock); - first_page = find_get_zspage(class); - - if (!first_page) { - spin_unlock(&class->lock); - first_page = alloc_zspage(class, pool->flags); - if (unlikely(!first_page)) - return 0; - - set_zspage_mapping(first_page, class->index, ZS_EMPTY); - spin_lock(&class->lock); - class->pages_allocated += class->pages_per_zspage; - } - - obj = (unsigned long)first_page->freelist; - obj_handle_to_location(obj, &m_page, &m_objidx); - m_offset = obj_idx_to_offset(m_page, m_objidx, class->size); - - link = (struct link_free *)kmap_atomic(m_page) + - m_offset / sizeof(*link); - first_page->freelist = link->next; - memset(link, POISON_INUSE, sizeof(*link)); - kunmap_atomic(link); - - first_page->inuse++; - /* Now move the zspage to another fullness group, if required */ - fix_fullness_group(pool, first_page); - spin_unlock(&class->lock); - - return obj; -} -EXPORT_SYMBOL_GPL(zs_malloc); - -void zs_free(struct zs_pool *pool, unsigned long obj) -{ - struct link_free *link; - struct page *first_page, *f_page; - unsigned long f_objidx, f_offset; - - int class_idx; - struct size_class *class; - enum fullness_group fullness; - - if (unlikely(!obj)) - return; - - obj_handle_to_location(obj, &f_page, &f_objidx); - first_page = get_first_page(f_page); - - get_zspage_mapping(first_page, &class_idx, &fullness); - class = &pool->size_class[class_idx]; - f_offset = obj_idx_to_offset(f_page, f_objidx, class->size); - - spin_lock(&class->lock); - - /* Insert this object in containing zspage's freelist */ - link = (struct link_free *)((unsigned char 
*)kmap_atomic(f_page) - + f_offset); - link->next = first_page->freelist; - kunmap_atomic(link); - first_page->freelist = (void *)obj; - - first_page->inuse--; - fullness = fix_fullness_group(pool, first_page); - - if (fullness == ZS_EMPTY) - class->pages_allocated -= class->pages_per_zspage; - - spin_unlock(&class->lock); - - if (fullness == ZS_EMPTY) - free_zspage(first_page); -} -EXPORT_SYMBOL_GPL(zs_free); - -/** - * zs_map_object - get address of allocated object from handle. - * @pool: pool from which the object was allocated - * @handle: handle returned from zs_malloc - * - * Before using an object allocated from zs_malloc, it must be mapped using - * this function. When done with the object, it must be unmapped using - * zs_unmap_object. - * - * Only one object can be mapped per cpu at a time. There is no protection - * against nested mappings. - * - * This function returns with preemption and page faults disabled. - */ -void *zs_map_object(struct zs_pool *pool, unsigned long handle, - enum zs_mapmode mm) -{ - struct page *page; - unsigned long obj_idx, off; - - unsigned int class_idx; - enum fullness_group fg; - struct size_class *class; - struct mapping_area *area; - struct page *pages[2]; - - BUG_ON(!handle); - - /* - * Because we use per-cpu mapping areas shared among the - * pools/users, we can't allow mapping in interrupt context - * because it can corrupt another users mappings. 
- */ - BUG_ON(in_interrupt()); - - obj_handle_to_location(handle, &page, &obj_idx); - get_zspage_mapping(get_first_page(page), &class_idx, &fg); - class = &pool->size_class[class_idx]; - off = obj_idx_to_offset(page, obj_idx, class->size); - - area = &get_cpu_var(zs_map_area); - area->vm_mm = mm; - if (off + class->size <= PAGE_SIZE) { - /* this object is contained entirely within a page */ - area->vm_addr = kmap_atomic(page); - return area->vm_addr + off; - } - - /* this object spans two pages */ - pages[0] = page; - pages[1] = get_next_page(page); - BUG_ON(!pages[1]); - - return __zs_map_object(area, pages, off, class->size); -} -EXPORT_SYMBOL_GPL(zs_map_object); - -void zs_unmap_object(struct zs_pool *pool, unsigned long handle) -{ - struct page *page; - unsigned long obj_idx, off; - - unsigned int class_idx; - enum fullness_group fg; - struct size_class *class; - struct mapping_area *area; - - BUG_ON(!handle); - - obj_handle_to_location(handle, &page, &obj_idx); - get_zspage_mapping(get_first_page(page), &class_idx, &fg); - class = &pool->size_class[class_idx]; - off = obj_idx_to_offset(page, obj_idx, class->size); - - area = &__get_cpu_var(zs_map_area); - if (off + class->size <= PAGE_SIZE) - kunmap_atomic(area->vm_addr); - else { - struct page *pages[2]; - - pages[0] = page; - pages[1] = get_next_page(page); - BUG_ON(!pages[1]); - - __zs_unmap_object(area, pages, off, class->size); - } - put_cpu_var(zs_map_area); -} -EXPORT_SYMBOL_GPL(zs_unmap_object); - -u64 zs_get_total_size_bytes(struct zs_pool *pool) -{ - int i; - u64 npages = 0; - - for (i = 0; i < ZS_SIZE_CLASSES; i++) - npages += pool->size_class[i].pages_allocated; - - return npages << PAGE_SHIFT; -} -EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); - -module_init(zs_init); -module_exit(zs_exit); - -MODULE_LICENSE("Dual BSD/GPL"); -MODULE_AUTHOR("Nitin Gupta "); diff --git a/drivers/staging/zsmalloc/zsmalloc.h b/drivers/staging/zsmalloc/zsmalloc.h deleted file mode 100644 index fbe6bec..0000000 --- 
a/drivers/staging/zsmalloc/zsmalloc.h +++ /dev/null @@ -1,43 +0,0 @@ -/* - * zsmalloc memory allocator - * - * Copyright (C) 2011 Nitin Gupta - * - * This code is released using a dual license strategy: BSD/GPL - * You can choose the license that better fits your requirements. - * - * Released under the terms of 3-clause BSD License - * Released under the terms of GNU General Public License Version 2.0 - */ - -#ifndef _ZS_MALLOC_H_ -#define _ZS_MALLOC_H_ - -#include - -/* - * zsmalloc mapping modes - * - * NOTE: These only make a difference when a mapped object spans pages - */ -enum zs_mapmode { - ZS_MM_RW, /* normal read-write mapping */ - ZS_MM_RO, /* read-only (no copy-out at unmap time) */ - ZS_MM_WO /* write-only (no copy-in at map time) */ -}; - -struct zs_pool; - -struct zs_pool *zs_create_pool(gfp_t flags); -void zs_destroy_pool(struct zs_pool *pool); - -unsigned long zs_malloc(struct zs_pool *pool, size_t size); -void zs_free(struct zs_pool *pool, unsigned long obj); - -void *zs_map_object(struct zs_pool *pool, unsigned long handle, - enum zs_mapmode mm); -void zs_unmap_object(struct zs_pool *pool, unsigned long handle); - -u64 zs_get_total_size_bytes(struct zs_pool *pool); - -#endif -- 1.7.10.4
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id 80E376B0034 for ; Sun, 18 Aug 2013 04:41:47 -0400 (EDT)
Received: by mail-pb0-f43.google.com with SMTP id md4so3699905pbc.16 for ; Sun, 18 Aug 2013 01:41:46 -0700 (PDT)
From: Bob Liu
Subject: [PATCH 2/4] mm: promote zsmalloc to mm/
Date: Sun, 18 Aug 2013 16:40:47 +0800
Message-Id: <1376815249-6611-3-git-send-email-bob.liu@oracle.com>
In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu

zsmalloc is a new slab-based memory allocator for storing compressed pages. It is designed for low fragmentation and a high allocation success rate for large allocations that are still <= PAGE_SIZE.

zsmalloc differs from the kernel slab allocator in two primary ways to achieve these design goals.

zsmalloc never requires high-order page allocations to back slabs, or "size classes" in zsmalloc terms. Instead it allows multiple 0-order (single) pages to be stitched together into a "zspage" which backs the slab. This allows for a higher allocation success rate under memory pressure.

Also, zsmalloc allows objects to span page boundaries within the zspage. This allows for lower fragmentation than could be had with the kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.
With the kernel slab allocator, if a page compresses to 60% of its original size, the memory savings gained through compression are lost to fragmentation because another object of the same size can't be stored in the leftover space.

This ability to span pages means that zsmalloc allocations are not directly addressable by the user. The user is given a non-dereferenceable handle in response to an allocation request. That handle must be mapped using zs_map_object(), which returns a pointer to the mapped region that can be used. The mapping is necessary because the object data may reside in two different noncontiguous pages.

Signed-off-by: Bob Liu --- include/linux/zsmalloc.h | 43 ++ mm/Kconfig | 35 +- mm/Makefile | 1 + mm/zsmalloc.c | 1063 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 1131 insertions(+), 11 deletions(-) create mode 100644 include/linux/zsmalloc.h create mode 100644 mm/zsmalloc.c diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h new file mode 100644 index 0000000..fbe6bec --- /dev/null +++ b/include/linux/zsmalloc.h @@ -0,0 +1,43 @@ +/* + * zsmalloc memory allocator + * + * Copyright (C) 2011 Nitin Gupta + * + * This code is released using a dual license strategy: BSD/GPL + * You can choose the license that better fits your requirements. 
+ * + * Released under the terms of 3-clause BSD License + * Released under the terms of GNU General Public License Version 2.0 + */ + +#ifndef _ZS_MALLOC_H_ +#define _ZS_MALLOC_H_ + +#include + +/* + * zsmalloc mapping modes + * + * NOTE: These only make a difference when a mapped object spans pages + */ +enum zs_mapmode { + ZS_MM_RW, /* normal read-write mapping */ + ZS_MM_RO, /* read-only (no copy-out at unmap time) */ + ZS_MM_WO /* write-only (no copy-in at map time) */ +}; + +struct zs_pool; + +struct zs_pool *zs_create_pool(gfp_t flags); +void zs_destroy_pool(struct zs_pool *pool); + +unsigned long zs_malloc(struct zs_pool *pool, size_t size); +void zs_free(struct zs_pool *pool, unsigned long obj); + +void *zs_map_object(struct zs_pool *pool, unsigned long handle, + enum zs_mapmode mm); +void zs_unmap_object(struct zs_pool *pool, unsigned long handle); + +u64 zs_get_total_size_bytes(struct zs_pool *pool); + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index 8028dcc..48d1786 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -478,21 +478,10 @@ config FRONTSWAP If unsure, say Y to enable frontswap. -config ZBUD - tristate - default n - help - A special purpose allocator for storing compressed pages. - It is designed to store up to two compressed pages per physical - page. While this design limits storage density, it has simple and - deterministic reclaim properties that make it preferable to a higher - density approach when reclaim will be used. - config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" depends on FRONTSWAP && CRYPTO=y select CRYPTO_LZO - select ZBUD default n help A lightweight compressed cache for swap pages. It takes @@ -508,6 +497,30 @@ config ZSWAP they have not be fully explored on the large set of potential configurations and workloads that exist. 
+choice + prompt "Select memory allocator for compressed pages" + depends on ZSWAP + default ZBUD + + config ZBUD + bool "zbud" + help + A special purpose allocator for storing compressed pages. + It is designed to store up to two compressed pages per physical + page. While this design limits storage density, it has simple and + deterministic reclaim properties that make it preferable to a higher + density approach when reclaim will be used. + + config ZSMALLOC + bool "zsmalloc" + help + zsmalloc is a slab-based memory allocator designed to store + compressed RAM pages. zsmalloc uses virtual memory mapping + in order to reduce fragmentation and has high compression density. + However, this results in a unpredictable performance characteristics + when reclaiming a single page. +endchoice + config MEM_SOFT_DIRTY bool "Track memory changes" depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY diff --git a/mm/Makefile b/mm/Makefile index f008033..7d11958 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -60,3 +60,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o obj-$(CONFIG_ZBUD) += zbud.o +obj-$(CONFIG_ZSMALLOC) += zsmalloc.o diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c new file mode 100644 index 0000000..4bb275b --- /dev/null +++ b/mm/zsmalloc.c @@ -0,0 +1,1063 @@ +/* + * zsmalloc memory allocator + * + * Copyright (C) 2011 Nitin Gupta + * + * This code is released using a dual license strategy: BSD/GPL + * You can choose the license that better fits your requirements. + * + * Released under the terms of 3-clause BSD License + * Released under the terms of GNU General Public License Version 2.0 + */ + + +/* + * This allocator is designed for use with zcache and zram. Thus, the + * allocator is supposed to work well under low memory conditions. In + * particular, it never attempts higher order page allocation which is + * very likely to fail under memory pressure. 
On the other hand, if we + * just use single (0-order) pages, it would suffer from very high + * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy + * an entire page. This was one of the major issues with its predecessor + * (xvmalloc). + * + * To overcome these issues, zsmalloc allocates a bunch of 0-order pages + * and links them together using various 'struct page' fields. These linked + * pages act as a single higher-order page i.e. an object can span 0-order + * page boundaries. The code refers to these linked pages as a single entity + * called zspage. + * + * Following is how we use various fields and flags of underlying + * struct page(s) to form a zspage. + * + * Usage of struct page fields: + * page->first_page: points to the first component (0-order) page + * page->index (union with page->freelist): offset of the first object + * starting in this page. For the first page, this is + * always 0, so we use this field (aka freelist) to point + * to the first free object in zspage. + * page->lru: links together all component pages (except the first page) + * of a zspage + * + * For _first_ page only: + * + * page->private (union with page->first_page): refers to the + * component page after the first page + * page->freelist: points to the first free object in zspage. + * Free objects are linked together using in-place + * metadata. + * page->objects: maximum number of objects we can store in this + * zspage (class->zspage_order * PAGE_SIZE / class->size) + * page->lru: links together first pages of various zspages. + * Basically forming list of zspages in a fullness group. 
+ * page->mapping: class index and fullness group of the zspage + * + * Usage of struct page flags: + * PG_private: identifies the first component page + * PG_private2: identifies the last component page + * + */ + +#ifdef CONFIG_ZSMALLOC_DEBUG +#define DEBUG +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "zsmalloc.h" + +/* + * This must be power of 2 and greater than of equal to sizeof(link_free). + * These two conditions ensure that any 'struct link_free' itself doesn't + * span more than 1 page which avoids complex case of mapping 2 pages simply + * to restore link_free pointer values. + */ +#define ZS_ALIGN 8 + +/* + * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single) + * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N. + */ +#define ZS_MAX_ZSPAGE_ORDER 2 +#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER) + +/* + * Object location (, ) is encoded as + * as single (void *) handle value. + * + * Note that object index is relative to system + * page it is stored in, so for each sub-page belonging + * to a zspage, obj_idx starts with 0. + * + * This is made more complicated by various memory models and PAE. + */ + +#ifndef MAX_PHYSMEM_BITS +#ifdef CONFIG_HIGHMEM64G +#define MAX_PHYSMEM_BITS 36 +#else /* !CONFIG_HIGHMEM64G */ +/* + * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just + * be PAGE_SHIFT + */ +#define MAX_PHYSMEM_BITS BITS_PER_LONG +#endif +#endif +#define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) +#define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS) +#define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1) + +#define MAX(a, b) ((a) >= (b) ? 
(a) : (b)) +/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */ +#define ZS_MIN_ALLOC_SIZE \ + MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) +#define ZS_MAX_ALLOC_SIZE PAGE_SIZE + +/* + * On systems with 4K page size, this gives 255 size classes! There is a + * trade-off here: + * - Large number of size classes is potentially wasteful as free pages are + * spread across these classes + * - Small number of size classes causes large internal fragmentation + * - Probably it's better to use specific size classes (empirically + * determined). NOTE: all those class sizes must be set as multiple of + * ZS_ALIGN to make sure link_free itself never has to span 2 pages. + * + * ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN + * (reason above) + */ +#define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> 8) +#define ZS_SIZE_CLASSES ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \ + ZS_SIZE_CLASS_DELTA + 1) + +/* + * We do not maintain any list for completely empty or full pages + */ +enum fullness_group { + ZS_ALMOST_FULL, + ZS_ALMOST_EMPTY, + _ZS_NR_FULLNESS_GROUPS, + + ZS_EMPTY, + ZS_FULL +}; + +/* + * We assign a page to ZS_ALMOST_EMPTY fullness group when: + * n <= N / f, where + * n = number of allocated objects + * N = total number of objects zspage can store + * f = fullness_threshold_frac + * + * Similarly, we assign zspage to: + * ZS_ALMOST_FULL when n > N / f + * ZS_EMPTY when n == 0 + * ZS_FULL when n == N + * + * (see: fix_fullness_group()) + */ +static const int fullness_threshold_frac = 4; + +struct size_class { + /* + * Size of objects stored in this class. Must be multiple + * of ZS_ALIGN. + */ + int size; + unsigned int index; + + /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ + int pages_per_zspage; + + spinlock_t lock; + + /* stats */ + u64 pages_allocated; + + struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; +}; + +/* + * Placed within free objects to form a singly linked list. 
+ * For every zspage, first_page->freelist gives head of this list. + * + * This must be power of 2 and less than or equal to ZS_ALIGN + */ +struct link_free { + /* Handle of next free chunk (encodes ) */ + void *next; +}; + +struct zs_pool { + struct size_class size_class[ZS_SIZE_CLASSES]; + + gfp_t flags; /* allocation flags used when growing pool */ +}; + +/* + * A zspage's class index and fullness group + * are encoded in its (first)page->mapping + */ +#define CLASS_IDX_BITS 28 +#define FULLNESS_BITS 4 +#define CLASS_IDX_MASK ((1 << CLASS_IDX_BITS) - 1) +#define FULLNESS_MASK ((1 << FULLNESS_BITS) - 1) + +/* + * By default, zsmalloc uses a copy-based object mapping method to access + * allocations that span two pages. However, if a particular architecture + * performs VM mapping faster than copying, then it should be added here + * so that USE_PGTABLE_MAPPING is defined. This causes zsmalloc to use + * page table mapping rather than copying for object mapping. + */ +#if defined(CONFIG_ARM) && !defined(MODULE) +#define USE_PGTABLE_MAPPING +#endif + +struct mapping_area { +#ifdef USE_PGTABLE_MAPPING + struct vm_struct *vm; /* vm area for mapping object that span pages */ +#else + char *vm_buf; /* copy buffer for objects that span pages */ +#endif + char *vm_addr; /* address of kmap_atomic()'ed pages */ + enum zs_mapmode vm_mm; /* mapping mode */ +}; + + +/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */ +static DEFINE_PER_CPU(struct mapping_area, zs_map_area); + +static int is_first_page(struct page *page) +{ + return PagePrivate(page); +} + +static int is_last_page(struct page *page) +{ + return PagePrivate2(page); +} + +static void get_zspage_mapping(struct page *page, unsigned int *class_idx, + enum fullness_group *fullness) +{ + unsigned long m; + BUG_ON(!is_first_page(page)); + + m = (unsigned long)page->mapping; + *fullness = m & FULLNESS_MASK; + *class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK; +} + +static void 
set_zspage_mapping(struct page *page, unsigned int class_idx, + enum fullness_group fullness) +{ + unsigned long m; + BUG_ON(!is_first_page(page)); + + m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) | + (fullness & FULLNESS_MASK); + page->mapping = (struct address_space *)m; +} + +static int get_size_class_index(int size) +{ + int idx = 0; + + if (likely(size > ZS_MIN_ALLOC_SIZE)) + idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE, + ZS_SIZE_CLASS_DELTA); + + return idx; +} + +static enum fullness_group get_fullness_group(struct page *page) +{ + int inuse, max_objects; + enum fullness_group fg; + BUG_ON(!is_first_page(page)); + + inuse = page->inuse; + max_objects = page->objects; + + if (inuse == 0) + fg = ZS_EMPTY; + else if (inuse == max_objects) + fg = ZS_FULL; + else if (inuse <= max_objects / fullness_threshold_frac) + fg = ZS_ALMOST_EMPTY; + else + fg = ZS_ALMOST_FULL; + + return fg; +} + +static void insert_zspage(struct page *page, struct size_class *class, + enum fullness_group fullness) +{ + struct page **head; + + BUG_ON(!is_first_page(page)); + + if (fullness >= _ZS_NR_FULLNESS_GROUPS) + return; + + head = &class->fullness_list[fullness]; + if (*head) + list_add_tail(&page->lru, &(*head)->lru); + + *head = page; +} + +static void remove_zspage(struct page *page, struct size_class *class, + enum fullness_group fullness) +{ + struct page **head; + + BUG_ON(!is_first_page(page)); + + if (fullness >= _ZS_NR_FULLNESS_GROUPS) + return; + + head = &class->fullness_list[fullness]; + BUG_ON(!*head); + if (list_empty(&(*head)->lru)) + *head = NULL; + else if (*head == page) + *head = (struct page *)list_entry((*head)->lru.next, + struct page, lru); + + list_del_init(&page->lru); +} + +static enum fullness_group fix_fullness_group(struct zs_pool *pool, + struct page *page) +{ + int class_idx; + struct size_class *class; + enum fullness_group currfg, newfg; + + BUG_ON(!is_first_page(page)); + + get_zspage_mapping(page, &class_idx, &currfg); + newfg = 
get_fullness_group(page); + if (newfg == currfg) + goto out; + + class = &pool->size_class[class_idx]; + remove_zspage(page, class, currfg); + insert_zspage(page, class, newfg); + set_zspage_mapping(page, class_idx, newfg); + +out: + return newfg; +} + +/* + * We have to decide on how many pages to link together + * to form a zspage for each size class. This is important + * to reduce wastage due to unusable space left at end of + * each zspage which is given as: + * wastage = Zp % class_size + * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ... + * + * For example, for size class of 3/8 * PAGE_SIZE, we should + * link together 3 PAGE_SIZE sized pages to form a zspage + * since then we can perfectly fit in 8 such objects. + */ +static int get_pages_per_zspage(int class_size) +{ + int i, max_usedpc = 0; + /* number of pages per zspage which gives maximum used percentage */ + int max_usedpc_order = 1; + + for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) { + int zspage_size; + int waste, usedpc; + + zspage_size = i * PAGE_SIZE; + waste = zspage_size % class_size; + usedpc = (zspage_size - waste) * 100 / zspage_size; + + if (usedpc > max_usedpc) { + max_usedpc = usedpc; + max_usedpc_order = i; + } + } + + return max_usedpc_order; +} + +/* + * A single 'zspage' is composed of many system pages which are + * linked together using fields in struct page. This function finds + * the first/head page, given any component page of a zspage. 
+ */ +static struct page *get_first_page(struct page *page) +{ + if (is_first_page(page)) + return page; + else + return page->first_page; +} + +static struct page *get_next_page(struct page *page) +{ + struct page *next; + + if (is_last_page(page)) + next = NULL; + else if (is_first_page(page)) + next = (struct page *)page->private; + else + next = list_entry(page->lru.next, struct page, lru); + + return next; +} + +/* Encode as a single handle value */ +static void *obj_location_to_handle(struct page *page, unsigned long obj_idx) +{ + unsigned long handle; + + if (!page) { + BUG_ON(obj_idx); + return NULL; + } + + handle = page_to_pfn(page) << OBJ_INDEX_BITS; + handle |= (obj_idx & OBJ_INDEX_MASK); + + return (void *)handle; +} + +/* Decode pair from the given object handle */ +static void obj_handle_to_location(unsigned long handle, struct page **page, + unsigned long *obj_idx) +{ + *page = pfn_to_page(handle >> OBJ_INDEX_BITS); + *obj_idx = handle & OBJ_INDEX_MASK; +} + +static unsigned long obj_idx_to_offset(struct page *page, + unsigned long obj_idx, int class_size) +{ + unsigned long off = 0; + + if (!is_first_page(page)) + off = page->index; + + return off + obj_idx * class_size; +} + +static void reset_page(struct page *page) +{ + clear_bit(PG_private, &page->flags); + clear_bit(PG_private_2, &page->flags); + set_page_private(page, 0); + page->mapping = NULL; + page->freelist = NULL; + page_mapcount_reset(page); +} + +static void free_zspage(struct page *first_page) +{ + struct page *nextp, *tmp, *head_extra; + + BUG_ON(!is_first_page(first_page)); + BUG_ON(first_page->inuse); + + head_extra = (struct page *)page_private(first_page); + + reset_page(first_page); + __free_page(first_page); + + /* zspage with only 1 system page */ + if (!head_extra) + return; + + list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) { + list_del(&nextp->lru); + reset_page(nextp); + __free_page(nextp); + } + reset_page(head_extra); + __free_page(head_extra); +} + +/* 
Initialize a newly allocated zspage */ +static void init_zspage(struct page *first_page, struct size_class *class) +{ + unsigned long off = 0; + struct page *page = first_page; + + BUG_ON(!is_first_page(first_page)); + while (page) { + struct page *next_page; + struct link_free *link; + unsigned int i, objs_on_page; + + /* + * page->index stores offset of first object starting + * in the page. For the first page, this is always 0, + * so we use first_page->index (aka ->freelist) to store + * head of corresponding zspage's freelist. + */ + if (page != first_page) + page->index = off; + + link = (struct link_free *)kmap_atomic(page) + + off / sizeof(*link); + objs_on_page = (PAGE_SIZE - off) / class->size; + + for (i = 1; i <= objs_on_page; i++) { + off += class->size; + if (off < PAGE_SIZE) { + link->next = obj_location_to_handle(page, i); + link += class->size / sizeof(*link); + } + } + + /* + * We now come to the last (full or partial) object on this + * page, which must point to the first object on the next + * page (if present) + */ + next_page = get_next_page(page); + link->next = obj_location_to_handle(next_page, 0); + kunmap_atomic(link); + page = next_page; + off = (off + class->size) % PAGE_SIZE; + } +} + +/* + * Allocate a zspage for the given size class + */ +static struct page *alloc_zspage(struct size_class *class, gfp_t flags) +{ + int i, error; + struct page *first_page = NULL, *uninitialized_var(prev_page); + + /* + * Allocate individual pages and link them together as: + * 1. first page->private = first sub-page + * 2. all sub-pages are linked together using page->lru + * 3. each sub-page is linked to the first page using page->first_page + * + * For each size class, First/Head pages are linked together using + * page->lru. Also, we set PG_private to identify the first page + * (i.e. no other sub-page has this flag set) and PG_private_2 to + * identify the last page. 
+ */ + error = -ENOMEM; + for (i = 0; i < class->pages_per_zspage; i++) { + struct page *page; + + page = alloc_page(flags); + if (!page) + goto cleanup; + + INIT_LIST_HEAD(&page->lru); + if (i == 0) { /* first page */ + SetPagePrivate(page); + set_page_private(page, 0); + first_page = page; + first_page->inuse = 0; + } + if (i == 1) + first_page->private = (unsigned long)page; + if (i >= 1) + page->first_page = first_page; + if (i >= 2) + list_add(&page->lru, &prev_page->lru); + if (i == class->pages_per_zspage - 1) /* last page */ + SetPagePrivate2(page); + prev_page = page; + } + + init_zspage(first_page, class); + + first_page->freelist = obj_location_to_handle(first_page, 0); + /* Maximum number of objects we can store in this zspage */ + first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size; + + error = 0; /* Success */ + +cleanup: + if (unlikely(error) && first_page) { + free_zspage(first_page); + first_page = NULL; + } + + return first_page; +} + +static struct page *find_get_zspage(struct size_class *class) +{ + int i; + struct page *page; + + for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) { + page = class->fullness_list[i]; + if (page) + break; + } + + return page; +} + +#ifdef USE_PGTABLE_MAPPING +static inline int __zs_cpu_up(struct mapping_area *area) +{ + /* + * Make sure we don't leak memory if a cpu UP notification + * and zs_init() race and both call zs_cpu_up() on the same cpu + */ + if (area->vm) + return 0; + area->vm = alloc_vm_area(PAGE_SIZE * 2, NULL); + if (!area->vm) + return -ENOMEM; + return 0; +} + +static inline void __zs_cpu_down(struct mapping_area *area) +{ + if (area->vm) + free_vm_area(area->vm); + area->vm = NULL; +} + +static inline void *__zs_map_object(struct mapping_area *area, + struct page *pages[2], int off, int size) +{ + BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, &pages)); + area->vm_addr = area->vm->addr; + return area->vm_addr + off; +} + +static inline void __zs_unmap_object(struct mapping_area *area, + 
struct page *pages[2], int off, int size) +{ + unsigned long addr = (unsigned long)area->vm_addr; + + unmap_kernel_range(addr, PAGE_SIZE * 2); +} + +#else /* USE_PGTABLE_MAPPING */ + +static inline int __zs_cpu_up(struct mapping_area *area) +{ + /* + * Make sure we don't leak memory if a cpu UP notification + * and zs_init() race and both call zs_cpu_up() on the same cpu + */ + if (area->vm_buf) + return 0; + area->vm_buf = (char *)__get_free_page(GFP_KERNEL); + if (!area->vm_buf) + return -ENOMEM; + return 0; +} + +static inline void __zs_cpu_down(struct mapping_area *area) +{ + if (area->vm_buf) + free_page((unsigned long)area->vm_buf); + area->vm_buf = NULL; +} + +static void *__zs_map_object(struct mapping_area *area, + struct page *pages[2], int off, int size) +{ + int sizes[2]; + void *addr; + char *buf = area->vm_buf; + + /* disable page faults to match kmap_atomic() return conditions */ + pagefault_disable(); + + /* no read fastpath */ + if (area->vm_mm == ZS_MM_WO) + goto out; + + sizes[0] = PAGE_SIZE - off; + sizes[1] = size - sizes[0]; + + /* copy object to per-cpu buffer */ + addr = kmap_atomic(pages[0]); + memcpy(buf, addr + off, sizes[0]); + kunmap_atomic(addr); + addr = kmap_atomic(pages[1]); + memcpy(buf + sizes[0], addr, sizes[1]); + kunmap_atomic(addr); +out: + return area->vm_buf; +} + +static void __zs_unmap_object(struct mapping_area *area, + struct page *pages[2], int off, int size) +{ + int sizes[2]; + void *addr; + char *buf = area->vm_buf; + + /* no write fastpath */ + if (area->vm_mm == ZS_MM_RO) + goto out; + + sizes[0] = PAGE_SIZE - off; + sizes[1] = size - sizes[0]; + + /* copy per-cpu buffer to object */ + addr = kmap_atomic(pages[0]); + memcpy(addr + off, buf, sizes[0]); + kunmap_atomic(addr); + addr = kmap_atomic(pages[1]); + memcpy(addr, buf + sizes[0], sizes[1]); + kunmap_atomic(addr); + +out: + /* enable page faults to match kunmap_atomic() return conditions */ + pagefault_enable(); +} + +#endif /* USE_PGTABLE_MAPPING */ + +static 
int zs_cpu_notifier(struct notifier_block *nb, unsigned long action, + void *pcpu) +{ + int ret, cpu = (long)pcpu; + struct mapping_area *area; + + switch (action) { + case CPU_UP_PREPARE: + area = &per_cpu(zs_map_area, cpu); + ret = __zs_cpu_up(area); + if (ret) + return notifier_from_errno(ret); + break; + case CPU_DEAD: + case CPU_UP_CANCELED: + area = &per_cpu(zs_map_area, cpu); + __zs_cpu_down(area); + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block zs_cpu_nb = { + .notifier_call = zs_cpu_notifier +}; + +static void zs_exit(void) +{ + int cpu; + + for_each_online_cpu(cpu) + zs_cpu_notifier(NULL, CPU_DEAD, (void *)(long)cpu); + unregister_cpu_notifier(&zs_cpu_nb); +} + +static int zs_init(void) +{ + int cpu, ret; + + register_cpu_notifier(&zs_cpu_nb); + for_each_online_cpu(cpu) { + ret = zs_cpu_notifier(NULL, CPU_UP_PREPARE, (void *)(long)cpu); + if (notifier_to_errno(ret)) + goto fail; + } + return 0; +fail: + zs_exit(); + return notifier_to_errno(ret); +} + +/** + * zs_create_pool - Creates an allocation pool to work from. + * @flags: allocation flags used to allocate pool metadata + * + * This function must be called before anything when using + * the zsmalloc allocator. + * + * On success, a pointer to the newly created pool is returned, + * otherwise NULL. 
+ */ +struct zs_pool *zs_create_pool(gfp_t flags) +{ + int i, ovhd_size; + struct zs_pool *pool; + + ovhd_size = roundup(sizeof(*pool), PAGE_SIZE); + pool = kzalloc(ovhd_size, GFP_KERNEL); + if (!pool) + return NULL; + + for (i = 0; i < ZS_SIZE_CLASSES; i++) { + int size; + struct size_class *class; + + size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA; + if (size > ZS_MAX_ALLOC_SIZE) + size = ZS_MAX_ALLOC_SIZE; + + class = &pool->size_class[i]; + class->size = size; + class->index = i; + spin_lock_init(&class->lock); + class->pages_per_zspage = get_pages_per_zspage(size); + + } + + pool->flags = flags; + + return pool; +} +EXPORT_SYMBOL_GPL(zs_create_pool); + +void zs_destroy_pool(struct zs_pool *pool) +{ + int i; + + for (i = 0; i < ZS_SIZE_CLASSES; i++) { + int fg; + struct size_class *class = &pool->size_class[i]; + + for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) { + if (class->fullness_list[fg]) { + pr_info("Freeing non-empty class with size %db, fullness group %d\n", + class->size, fg); + } + } + } + kfree(pool); +} +EXPORT_SYMBOL_GPL(zs_destroy_pool); + +/** + * zs_malloc - Allocate block of given size from pool. + * @pool: pool to allocate from + * @size: size of block to allocate + * + * On success, handle to the allocated object is returned, + * otherwise 0. + * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail. 
+ */ +unsigned long zs_malloc(struct zs_pool *pool, size_t size) +{ + unsigned long obj; + struct link_free *link; + int class_idx; + struct size_class *class; + + struct page *first_page, *m_page; + unsigned long m_objidx, m_offset; + + if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE)) + return 0; + + class_idx = get_size_class_index(size); + class = &pool->size_class[class_idx]; + BUG_ON(class_idx != class->index); + + spin_lock(&class->lock); + first_page = find_get_zspage(class); + + if (!first_page) { + spin_unlock(&class->lock); + first_page = alloc_zspage(class, pool->flags); + if (unlikely(!first_page)) + return 0; + + set_zspage_mapping(first_page, class->index, ZS_EMPTY); + spin_lock(&class->lock); + class->pages_allocated += class->pages_per_zspage; + } + + obj = (unsigned long)first_page->freelist; + obj_handle_to_location(obj, &m_page, &m_objidx); + m_offset = obj_idx_to_offset(m_page, m_objidx, class->size); + + link = (struct link_free *)kmap_atomic(m_page) + + m_offset / sizeof(*link); + first_page->freelist = link->next; + memset(link, POISON_INUSE, sizeof(*link)); + kunmap_atomic(link); + + first_page->inuse++; + /* Now move the zspage to another fullness group, if required */ + fix_fullness_group(pool, first_page); + spin_unlock(&class->lock); + + return obj; +} +EXPORT_SYMBOL_GPL(zs_malloc); + +void zs_free(struct zs_pool *pool, unsigned long obj) +{ + struct link_free *link; + struct page *first_page, *f_page; + unsigned long f_objidx, f_offset; + + int class_idx; + struct size_class *class; + enum fullness_group fullness; + + if (unlikely(!obj)) + return; + + obj_handle_to_location(obj, &f_page, &f_objidx); + first_page = get_first_page(f_page); + + get_zspage_mapping(first_page, &class_idx, &fullness); + class = &pool->size_class[class_idx]; + f_offset = obj_idx_to_offset(f_page, f_objidx, class->size); + + spin_lock(&class->lock); + + /* Insert this object in containing zspage's freelist */ + link = (struct link_free *)((unsigned char 
*)kmap_atomic(f_page) + + f_offset); + link->next = first_page->freelist; + kunmap_atomic(link); + first_page->freelist = (void *)obj; + + first_page->inuse--; + fullness = fix_fullness_group(pool, first_page); + + if (fullness == ZS_EMPTY) + class->pages_allocated -= class->pages_per_zspage; + + spin_unlock(&class->lock); + + if (fullness == ZS_EMPTY) + free_zspage(first_page); +} +EXPORT_SYMBOL_GPL(zs_free); + +/** + * zs_map_object - get address of allocated object from handle. + * @pool: pool from which the object was allocated + * @handle: handle returned from zs_malloc + * + * Before using an object allocated from zs_malloc, it must be mapped using + * this function. When done with the object, it must be unmapped using + * zs_unmap_object. + * + * Only one object can be mapped per cpu at a time. There is no protection + * against nested mappings. + * + * This function returns with preemption and page faults disabled. + */ +void *zs_map_object(struct zs_pool *pool, unsigned long handle, + enum zs_mapmode mm) +{ + struct page *page; + unsigned long obj_idx, off; + + unsigned int class_idx; + enum fullness_group fg; + struct size_class *class; + struct mapping_area *area; + struct page *pages[2]; + + BUG_ON(!handle); + + /* + * Because we use per-cpu mapping areas shared among the + * pools/users, we can't allow mapping in interrupt context + * because it can corrupt another users mappings. 
+ */ + BUG_ON(in_interrupt()); + + obj_handle_to_location(handle, &page, &obj_idx); + get_zspage_mapping(get_first_page(page), &class_idx, &fg); + class = &pool->size_class[class_idx]; + off = obj_idx_to_offset(page, obj_idx, class->size); + + area = &get_cpu_var(zs_map_area); + area->vm_mm = mm; + if (off + class->size <= PAGE_SIZE) { + /* this object is contained entirely within a page */ + area->vm_addr = kmap_atomic(page); + return area->vm_addr + off; + } + + /* this object spans two pages */ + pages[0] = page; + pages[1] = get_next_page(page); + BUG_ON(!pages[1]); + + return __zs_map_object(area, pages, off, class->size); +} +EXPORT_SYMBOL_GPL(zs_map_object); + +void zs_unmap_object(struct zs_pool *pool, unsigned long handle) +{ + struct page *page; + unsigned long obj_idx, off; + + unsigned int class_idx; + enum fullness_group fg; + struct size_class *class; + struct mapping_area *area; + + BUG_ON(!handle); + + obj_handle_to_location(handle, &page, &obj_idx); + get_zspage_mapping(get_first_page(page), &class_idx, &fg); + class = &pool->size_class[class_idx]; + off = obj_idx_to_offset(page, obj_idx, class->size); + + area = &__get_cpu_var(zs_map_area); + if (off + class->size <= PAGE_SIZE) + kunmap_atomic(area->vm_addr); + else { + struct page *pages[2]; + + pages[0] = page; + pages[1] = get_next_page(page); + BUG_ON(!pages[1]); + + __zs_unmap_object(area, pages, off, class->size); + } + put_cpu_var(zs_map_area); +} +EXPORT_SYMBOL_GPL(zs_unmap_object); + +u64 zs_get_total_size_bytes(struct zs_pool *pool) +{ + int i; + u64 npages = 0; + + for (i = 0; i < ZS_SIZE_CLASSES; i++) + npages += pool->size_class[i].pages_allocated; + + return npages << PAGE_SHIFT; +} +EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); + +module_init(zs_init); +module_exit(zs_exit); + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Nitin Gupta "); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
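The size-class arithmetic in mm/zsmalloc.c above can be checked in userspace. The sketch below mirrors get_size_class_index() and get_pages_per_zspage() under stated assumptions: 4K pages, ZS_MAX_ZSPAGE_ORDER of 2, and a ZS_MIN_ALLOC_SIZE that evaluates to 32 (it can be larger on some configurations). The sketch_* and SKETCH_* names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Assumed values; the kernel derives these from PAGE_SHIFT and friends. */
#define SKETCH_PAGE_SIZE	4096
#define SKETCH_MAX_PAGES	4	/* 1 << ZS_MAX_ZSPAGE_ORDER, with order 2 */
#define SKETCH_MIN_ALLOC	32	/* assumes MAX(32, ...) evaluates to 32 */
#define SKETCH_CLASS_DELTA	(SKETCH_PAGE_SIZE >> 8)	/* 16 bytes */

/* Mirrors get_size_class_index(): round the request up to the next class. */
static int sketch_class_index(int size)
{
	if (size <= SKETCH_MIN_ALLOC)
		return 0;
	return (size - SKETCH_MIN_ALLOC + SKETCH_CLASS_DELTA - 1) /
	       SKETCH_CLASS_DELTA;
}

/* Object size served by a given class index, as set up in zs_create_pool(). */
static int sketch_class_size(int idx)
{
	int size = SKETCH_MIN_ALLOC + idx * SKETCH_CLASS_DELTA;

	return size > SKETCH_PAGE_SIZE ? SKETCH_PAGE_SIZE : size;
}

/*
 * Mirrors get_pages_per_zspage(): try 1..SKETCH_MAX_PAGES pages and keep
 * the first count with the highest used percentage, i.e. the smallest
 * share of the zspage wasted at its tail.
 */
static int sketch_pages_per_zspage(int class_size)
{
	int i, best = 1, best_usedpc = 0;

	assert(class_size > 0 && class_size <= SKETCH_PAGE_SIZE);

	for (i = 1; i <= SKETCH_MAX_PAGES; i++) {
		int zspage_size = i * SKETCH_PAGE_SIZE;
		int waste = zspage_size % class_size;	/* unusable tail bytes */
		int usedpc = (zspage_size - waste) * 100 / zspage_size;

		if (usedpc > best_usedpc) {
			best_usedpc = usedpc;
			best = i;
		}
	}
	return best;
}
```

Under these assumptions a 1530-byte request lands in the 1536-byte class (3/8 of a page), and sketch_pages_per_zspage(1536) returns 3, matching the worked example in the comment above get_pages_per_zspage(): three pages hold exactly eight such objects with no tail waste.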
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx103.postini.com [74.125.245.103]) by kanga.kvack.org (Postfix) with SMTP id 7DAF56B0036 for ; Sun, 18 Aug 2013 04:42:01 -0400 (EDT)
Received: by mail-pd0-f171.google.com with SMTP id g10so3802669pdj.2 for ; Sun, 18 Aug 2013 01:42:00 -0700 (PDT)
From: Bob Liu
Subject: [PATCH 3/4] mm: zswap: add supporting for zsmalloc
Date: Sun, 18 Aug 2013 16:40:48 +0800
Message-Id: <1376815249-6611-4-git-send-email-bob.liu@oracle.com>
In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu

Allow zswap to use zsmalloc as its allocator.
Note that with zsmalloc, zswap does not forcibly reclaim any pool pages: if the
zswap pool gets full, frontswap_store will be refused until a frontswap_get has
happened and freed some space.

Reclaiming zsmalloc pages from the zswap pool is not implemented because there
is no requirement for it currently.
To do mandatory reclaim, we would have to write those pages to a real backend
swap device. But most current users of zsmalloc are from the embedded world,
where there is not even a real backend swap device.
This behaviour is also the same as the previous zram!

For use cases where zsmalloc's unpredictable performance characteristics when
reclaiming a single page are a problem, CONFIG_ZBUD is suggested.

Signed-off-by: Bob Liu --- include/linux/zsmalloc.h | 1 + mm/Kconfig | 4 +++ mm/zsmalloc.c | 9 ++++-- mm/zswap.c | 73 +++++++++++++++++++++++++++++++++++++++++++--- 4 files changed, 81 insertions(+), 6 deletions(-) diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index fbe6bec..72fc126 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -39,5 +39,6 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, void zs_unmap_object(struct zs_pool *pool, unsigned long handle); u64 zs_get_total_size_bytes(struct zs_pool *pool); +u64 zs_get_pool_size(struct zs_pool *pool); #endif diff --git a/mm/Kconfig b/mm/Kconfig index 48d1786..d80a575 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -519,6 +519,10 @@ choice in order to reduce fragmentation and has high compression density. However, this results in a unpredictable performance characteristics when reclaiming a single page. + + Note: By using zsmalloc, no supporting for mandatory reclaiming from + compressed memory pool. If the pool gets full, frontswap_store will + be refused unless frontswap_get happened and freed some space. endchoice config MEM_SOFT_DIRTY diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 4bb275b..9df8d25 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -78,8 +78,7 @@ #include #include #include - -#include "zsmalloc.h" +#include /* * This must be power of 2 and greater than of equal to sizeof(link_free). 
@@ -1056,6 +1055,12 @@ u64 zs_get_total_size_bytes(struct zs_pool *pool) } EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); +u64 zs_get_pool_size(struct zs_pool *pool) +{ + return zs_get_total_size_bytes(pool) >> PAGE_SHIFT; +} +EXPORT_SYMBOL_GPL(zs_get_pool_size); + module_init(zs_init); module_exit(zs_exit); diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..8e8dc99 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -34,8 +34,11 @@ #include #include #include +#ifdef CONFIG_ZBUD #include - +#else +#include +#endif #include #include #include @@ -189,7 +192,11 @@ struct zswap_header { struct zswap_tree { struct rb_root rbroot; spinlock_t lock; +#ifdef CONFIG_ZBUD struct zbud_pool *pool; +#else + struct zs_pool *pool; +#endif }; static struct zswap_tree *zswap_trees[MAX_SWAPFILES]; @@ -374,12 +381,21 @@ static bool zswap_is_full(void) */ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry) { +#ifdef CONFIG_ZBUD zbud_free(tree->pool, entry->handle); +#else + zs_free(tree->pool, entry->handle); +#endif zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); +#ifdef CONFIG_ZBUD zswap_pool_pages = zbud_get_pool_size(tree->pool); +#else + zswap_pool_pages = zs_get_pool_size(tree->pool); +#endif } +#ifdef CONFIG_ZBUD /********************************* * writeback code **********************************/ @@ -595,6 +611,7 @@ fail: spin_unlock(&tree->lock); return ret; } +#endif /********************************* * frontswap hooks @@ -620,11 +637,22 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* reclaim space if needed */ if (zswap_is_full()) { zswap_pool_limit_hit++; +#ifdef CONFIG_ZBUD if (zbud_reclaim_page(tree->pool, 8)) { zswap_reject_reclaim_fail++; ret = -ENOMEM; goto reject; } +#else + /* + * zsmalloc has unpredictable performance + * characteristics when reclaiming, so don't support + * mandatory reclaiming from zsmalloc + */ + zswap_reject_reclaim_fail++; + ret = -ENOMEM; + goto reject; +#endif } /* allocate entry */ 
@@ -647,8 +675,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* store */ len = dlen + sizeof(struct zswap_header); +#ifdef CONFIG_ZBUD ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN, - &handle); + &handle); if (ret == -ENOSPC) { zswap_reject_compress_poor++; goto freepage; @@ -658,10 +687,23 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, goto freepage; } zhdr = zbud_map(tree->pool, handle); +#else + handle = zs_malloc(tree->pool, len); + if (!handle) { + ret = -ENOMEM; + zswap_reject_alloc_fail++; + goto freepage; + } + zhdr = zs_map_object(tree->pool, handle, ZS_MM_WO); +#endif zhdr->swpentry = swp_entry(type, offset); buf = (u8 *)(zhdr + 1); memcpy(buf, dst, dlen); +#ifdef CONFIG_ZBUD zbud_unmap(tree->pool, handle); +#else + zs_unmap_object(tree->pool, handle); +#endif put_cpu_var(zswap_dstmem); /* populate entry */ @@ -687,8 +729,11 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* update stats */ atomic_inc(&zswap_stored_pages); +#ifdef CONFIG_ZBUD zswap_pool_pages = zbud_get_pool_size(tree->pool); - +#else + zswap_pool_pages = zs_get_pool_size(tree->pool); +#endif return 0; freepage: @@ -724,13 +769,22 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, /* decompress */ dlen = PAGE_SIZE; +#ifdef CONFIG_ZBUD src = (u8 *)zbud_map(tree->pool, entry->handle) + - sizeof(struct zswap_header); + sizeof(struct zswap_header); +#else + src = zs_map_object(tree->pool, entry->handle, ZS_MM_RO); + src += sizeof(struct zswap_header); +#endif dst = kmap_atomic(page); ret = zswap_comp_op(ZSWAP_COMPOP_DECOMPRESS, src, entry->length, dst, &dlen); kunmap_atomic(dst); +#ifdef CONFIG_ZBUD zbud_unmap(tree->pool, entry->handle); +#else + zs_unmap_object(tree->pool, entry->handle); +#endif BUG_ON(ret); spin_lock(&tree->lock); @@ -810,7 +864,11 @@ static void zswap_frontswap_invalidate_area(unsigned type) while ((node = rb_first(&tree->rbroot))) { entry = rb_entry(node, struct zswap_entry, 
rbnode); rb_erase(&entry->rbnode, &tree->rbroot); +#ifdef CONFIG_ZBUD zbud_free(tree->pool, entry->handle); +#else + zs_free(tree->pool, entry->handle); +#endif zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); } @@ -818,9 +876,11 @@ static void zswap_frontswap_invalidate_area(unsigned type) spin_unlock(&tree->lock); } +#ifdef CONFIG_ZBUD static struct zbud_ops zswap_zbud_ops = { .evict = zswap_writeback_entry }; +#endif static void zswap_frontswap_init(unsigned type) { @@ -829,7 +889,12 @@ static void zswap_frontswap_init(unsigned type) tree = kzalloc(sizeof(struct zswap_tree), GFP_KERNEL); if (!tree) goto err; + +#ifdef CONFIG_ZBUD tree->pool = zbud_create_pool(GFP_KERNEL, &zswap_zbud_ops); +#else + tree->pool = zs_create_pool(GFP_NOWAIT); +#endif if (!tree->pool) goto freetree; tree->rbroot = RB_ROOT; -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx160.postini.com [74.125.245.160]) by kanga.kvack.org (Postfix) with SMTP id 6C98D6B0037 for ; Sun, 18 Aug 2013 04:42:17 -0400 (EDT) Received: by mail-pa0-f45.google.com with SMTP id bg4so3513959pad.4 for ; Sun, 18 Aug 2013 01:42:16 -0700 (PDT) From: Bob Liu Subject: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0 Date: Sun, 18 Aug 2013 16:40:49 +0800 Message-Id: <1376815249-6611-5-git-send-email-bob.liu@oracle.com> In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, 
sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu

This replaces the previous zram.
zram users can enable this feature, and a pseudo device will then be created
automatically after kernel boot.
Run "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk.

The size of this pseudo device is controlled by the zswap boot parameter
zswap.max_pool_percent.
disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE.

Signed-off-by: Bob Liu
---
 mm/Kconfig |  12 ++++
 mm/zswap.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 208 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index d80a575..3778026 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -525,6 +525,18 @@ choice
 	  be refused unless frontswap_get happened and freed some space.
 endchoice
 
+config ZSWAP_PSEUDO_BLKDEV
+	bool "Emulate a pseudo blk-dev based on zswap(previous zram)"
+	depends on ZSWAP && ZSMALLOC
+	default n
+
+	help
+	  Enabling this option emulates a pseudo block swap device /dev/zram0
+	  whose size is zswap.max_pool_percent of total RAM. All writes to this
+	  block device are compressed and cached by zswap, so that no
+	  real disk I/O operations happen.
+	  This feature can be used to replace drivers/staging/zram.
+
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
 	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY
diff --git a/mm/zswap.c b/mm/zswap.c
index 8e8dc99..ae73c9d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -38,6 +38,11 @@
 #include
 #else
 #include
+#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV
+#include
+#include
+#include
+#endif
 #endif
 #include
 #include
@@ -968,6 +973,189 @@ static int __init zswap_debugfs_init(void)
 static void __exit zswap_debugfs_exit(void) { }
 #endif
 
+#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV
+#define SECTOR_SHIFT 9
+#define SECTOR_SIZE (1 << SECTOR_SHIFT)
+#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
+#define SECTORS_PER_PAGE (1 << SECTORS_PER_PAGE_SHIFT)
+
+struct zram {
+	struct rw_semaphore lock; /* protect concurrent reads and writes */
+	struct request_queue *queue;
+	struct gendisk *disk;
+
+	/*
+	 * This is the disk size for userland. The size is controlled by
+	 * boot parameter zswap.max_pool_percent.
+	 * disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE
+	 */
+	u64 disksize; /* bytes */
+
+	/*
+	 * This page is used to store real data for /dev/zram.
+	 * The only meaningful operations on /dev/zramX are mkswap and
+	 * swapon/swapoff, so one page stores the real data (written by mkswap).
+	 */
+	struct page *metapage;
+};
+
+/*
+ * Only create /dev/zram0; this can be extended in future if real use cases
+ * need multiple zram devices.
+ */ +static struct zram zram_device; +static const struct block_device_operations zram_devops = { + .owner = THIS_MODULE +}; + +static void update_position(u32 *index, int *offset, struct bio_vec *bvec) +{ + if (*offset + bvec->bv_len >= PAGE_SIZE) + (*index)++; + *offset = (*offset + bvec->bv_len) % PAGE_SIZE; +} + +static void zram_make_request(struct request_queue *queue, struct bio *bio) +{ + u32 index; + struct bio_vec *bvec; + unsigned char *src, *dst; + int offset, i, rw = bio_data_dir(bio); + struct zram *zram = queue->queuedata; + + index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT; + offset = (bio->bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; + + bio_for_each_segment(bvec, bio, i) { + /* + * The only operation to pseudo /dev/zramx is mkswp and + * swapon/swapoff, so we only need one extra page to store the + * real meta data! + */ + BUG_ON(bvec->bv_len != PAGE_SIZE); + BUG_ON(offset); + + if (!index) { + if (rw == READ) { + down_read(&zram->lock); + dst = kmap_atomic(bvec->bv_page); + src = kmap_atomic(zram->metapage); + memcpy(dst, src, bvec->bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + flush_dcache_page(bvec->bv_page); + up_read(&zram->lock); + } else { + down_write(&zram->lock); + src = kmap_atomic(bvec->bv_page); + dst = kmap_atomic(zram->metapage); + memcpy(dst, src, bvec->bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + up_write(&zram->lock); + } + } + update_position(&index, &offset, bvec); + } + set_bit(BIO_UPTODATE, &bio->bi_flags); + bio_endio(bio, 0); + return; +} + +static int create_zram_device(struct zram *zram, int major, int device_id) +{ + int ret = -ENOMEM; + u64 disksize; + + zram->queue = blk_alloc_queue(GFP_KERNEL); + if (!zram->queue) { + pr_err("Error allocating disk queue for device%d\n", device_id); + goto out; + } + + blk_queue_make_request(zram->queue, zram_make_request); + zram->queue->queuedata = zram; + + /* gendisk structure */ + zram->disk = alloc_disk(1); + if (!zram->disk) { + pr_warn("Error 
allocating disk structure for device %d\n", + device_id); + goto out_free_queue; + } + + zram->disk->major = major; + zram->disk->first_minor = device_id; + zram->disk->fops = &zram_devops; + zram->disk->queue = zram->queue; + snprintf(zram->disk->disk_name, 16, "zram%d", device_id); + + /* + * To ensure that we always get PAGE_SIZE aligned + * and n*PAGE_SIZED sized I/O requests. + */ + blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE); + blk_queue_logical_block_size(zram->disk->queue, 1<<12); + blk_queue_io_min(zram->disk->queue, PAGE_SIZE); + blk_queue_io_opt(zram->disk->queue, PAGE_SIZE); + + add_disk(zram->disk); + + /* Init blk-dev */ + disksize = totalram_pages * zswap_max_pool_percent / 100; + disksize *= PAGE_SIZE; + disksize = PAGE_ALIGN(disksize); + zram->disksize = disksize; + set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT); + + /* zram devices sort of resembles non-rotational disks */ + queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue); + + zram->metapage = alloc_page(GFP_KERNEL); + if (!zram->metapage) + goto out_free_disk; + + pr_debug("Initialization done!\n"); + return 0; + +out_free_disk: + pr_debug("Init zram meta pages fail!\n"); + del_gendisk(zram->disk); + put_disk(zram->disk); +out_free_queue: + blk_cleanup_queue(zram->queue); +out: + return ret; +} + +static int zswap_blkdev_init(void) +{ + int major, ret = 0; + + major = register_blkdev(0, "zram"); + if (major <= 0) { + pr_warn("Unable to get major number\n"); + ret = -EBUSY; + goto out; + } + + ret = create_zram_device(&zram_device, major, 0); + if (ret) { + unregister_blkdev(major, "zram"); + goto out; + } + + pr_info("Created zram device(%d, %d).\n", major, 0); +out: + return ret; +} +#else +static int zswap_blkdev_init(void) +{ + return 0; +} +#endif + /********************************* * module init and exit **********************************/ @@ -989,9 +1177,17 @@ static int __init init_zswap(void) pr_err("per-cpu initialization failed\n"); goto 
pcpufail; } + + if (IS_ENABLED(CONFIG_ZSWAP_PSEUDO_BLKDEV)) + if (zswap_blkdev_init()) { + pr_err("emulate blk device failed\n"); + goto pcpufail; + } + frontswap_register_ops(&zswap_frontswap_ops); if (zswap_debugfs_init()) pr_warn("debugfs initialization failed\n"); + return 0; pcpufail: zswap_comp_exit(); -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id 1F0726B0032 for ; Mon, 19 Aug 2013 00:10:22 -0400 (EDT) Date: Mon, 19 Aug 2013 13:10:44 +0900 From: Minchan Kim Subject: Re: [PATCH 0/4] mm: merge zram into zswap Message-ID: <20130819041044.GB26832@bbox> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Bob Liu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu On Sun, Aug 18, 2013 at 04:40:45PM +0800, Bob Liu wrote: > Both zswap and zram are used to compress anon pages in memory so as to reduce > swap io operation. The main different is that zswap uses zbud as its allocator > while zram uses zsmalloc. The other different is zram will create a block > device, the user need to mkswp and swapon it. > > Minchan has areadly try to promote zram/zsmalloc into drivers/block/, but it may > cause increase maintenance headaches. 
> Since the purpose of zswap and zram are
> the same, this patch series try to merge them together as Mel suggested.
> Dropped zram from staging and extended zswap with the same feature as zram.
>
> zswap todo:
> Improve the writeback of zswap pool pages!
>
> Bob Liu (4):
> drivers: staging: drop zram and zsmalloc

Bob, I feel you're being very rude and I'm really upset.

You're just dropping a subsystem you didn't do any work on, without any
consensus from those who have been contributing lots of patches to make it
work well for a long time.

I understand you want to merge zram/zswap to address the concern Mel raised,
so your intention might help the community. But the approach was totally wrong.

You only mentioned this a few days ago in my thread, and I was on holiday, so I
didn't have time to reply to all of the mail sent to me. Should I break my
holiday just to reply to you? Would you be okay with someone else removing or
moving your work without any consensus with you while you're spending good time
with your family?

Please be careful, Bob.

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id B37D06B0032 for ; Mon, 19 Aug 2013 00:33:04 -0400 (EDT) Message-ID: <52119FC7.5070406@oracle.com> Date: Mon, 19 Aug 2013 12:32:07 +0800 From: Bob Liu MIME-Version: 1.0 Subject: Re: [PATCH 0/4] mm: merge zram into zswap References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <20130819041044.GB26832@bbox> In-Reply-To: <20130819041044.GB26832@bbox> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Bob Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org Hi Minchan, On 08/19/2013 12:10 PM, Minchan Kim wrote: > On Sun, Aug 18, 2013 at 04:40:45PM +0800, Bob Liu wrote: >> Both zswap and zram are used to compress anon pages in memory so as to reduce >> swap io operation. The main different is that zswap uses zbud as its allocator >> while zram uses zsmalloc. The other different is zram will create a block >> device, the user need to mkswp and swapon it. >> >> Minchan has areadly try to promote zram/zsmalloc into drivers/block/, but it may >> cause increase maintenance headaches. Since the purpose of zswap and zram are >> the same, this patch series try to merge them together as Mel suggested. >> Dropped zram from staging and extended zswap with the same feature as zram. >> >> zswap todo: >> Improve the writeback of zswap pool pages! >> >> Bob Liu (4): >> drivers: staging: drop zram and zsmalloc > > Bob, I feel you're very rude and I'm really upset. 
>
> You're just dropping the subsystem you didn't do anything without any consensus
> from who are contributing lots of patches to make it works well for a long time.

I apologize for that; I should at least have added [RFC] to the patch title!

--
Regards,
-Bob

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135]) by kanga.kvack.org (Postfix) with SMTP id 7A6706B0032 for ; Mon, 19 Aug 2013 13:00:00 -0400 (EDT)
Received: from /spool/local by e8.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 19 Aug 2013 17:59:59 +0100
Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp02.pok.ibm.com (Postfix) with ESMTP id 264FB6E8048 for ; Mon, 19 Aug 2013 12:59:50 -0400 (EDT)
Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7JGxtLc208638 for ; Mon, 19 Aug 2013 12:59:55 -0400
Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r7JGxrIa022804 for ; Mon, 19 Aug 2013 12:59:55 -0400
Date: Mon, 19 Aug 2013 11:59:48 -0500
From: Seth Jennings
Subject: Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc
Message-ID: <20130819165948.GA5703@variantweb.net>
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-4-git-send-email-bob.liu@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1376815249-6611-4-git-send-email-bob.liu@oracle.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Bob Liu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de,
gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote: > Make zswap can use zsmalloc as its allocater. > But note that zsmalloc don't reclaim any zswap pool pages mandatory, if zswap > pool gets full, frontswap_store will be refused unless frontswap_get happened > and freed some space. > > The reason of don't implement reclaiming zsmalloc pages from zswap pool is there > is no requiremnet currently. > If we want to do mandatory reclaim, we have to write those pages to real backend > swap devices. But most of current users of zsmalloc are from embeded world, > there is even no real backend swap device. > This action is also the same as privous zram! > > For several area, zsmalloc has unpredictable performance characteristics when > reclaiming a single page, then CONFIG_ZBUD are suggested. Looking at this patch on its own, it does show how simple it could be for zswap to support zsmalloc. So thanks! However, I don't like all the ifdefs scattered everywhere. I'd like to have a ops structure (e.g. struct zswap_alloc_ops) instead and just switch ops based on the CONFIG flag. Or better yet, have it boot-time selectable instead of build-time. Seth -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id 217A16B0032 for ; Mon, 19 Aug 2013 13:46:45 -0400 (EDT) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Mon, 19 Aug 2013 13:46:44 -0400 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id 2097638C8059 for ; Mon, 19 Aug 2013 13:46:39 -0400 (EDT) Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7JHkekM203456 for ; Mon, 19 Aug 2013 13:46:40 -0400 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r7JHkcXZ005431 for ; Mon, 19 Aug 2013 14:46:40 -0300 Date: Mon, 19 Aug 2013 12:46:34 -0500 From: Seth Jennings Subject: Re: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0 Message-ID: <20130819174634.GB5703@variantweb.net> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-5-git-send-email-bob.liu@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1376815249-6611-5-git-send-email-bob.liu@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Bob Liu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote: > This is used to replace previous zram. > zram users can enable this feature, then a pseudo device will be created > automaticlly after kernel boot. > Just using "mkswp /dev/zram0; swapon /dev/zram0" to use it as a swap disk. > > The size of this pseudeo is controlled by zswap boot parameter > zswap.max_pool_percent. > disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE. 
This /dev/zram0 will behave nothing like the block device that zram
creates. It only allows reads/writes to the first PAGE_SIZE area of the
device, for mkswap to work, and then doesn't do anything for all other
accesses.

I guess if you disabled zswap writeback, then... it would somewhat be
the same thing. We do need to disable zswap writeback in this case so
that zswap doesn't decompress a ton of pages into the swapcache for
writebacks that will just fail. Since zsmalloc does not yet support the
reclaim functionality, zswap writeback is implicitly disabled.

But this is really weird conceptually since zswap is a caching layer
that uses frontswap. If a frontswap store fails, it will try to send
the page to the zram0 device, which will fail the write. Then the page
will be... put back on the active or inactive list?

Also, using max_pool_percent to calculate the pseudo-device size
isn't right. Right now, the code makes the device the max size of the
_compressed_ pool, but the underlying swap device size is in
_uncompressed_ pages. So you'll never be able to fill zswap when sizing
the device like this, unless every page is so incompressible that each
compressed page effectively uses a whole memory pool page, in
which case the user shouldn't be using memory compression.

This also means that this hasn't been tested in the zswap pool-is-full
case, since there is no way, in this code, to hit that case.

In the zbud case the expected compression is 2:1, so you could just
multiply the compressed pool size by 2 and get a good pseudo-device
size. With zsmalloc the expected compression is harder to determine,
since it can achieve very high effective compression ratios on highly
compressible pages.

Seth

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
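A minimal sketch of the sizing argument above, assuming a fixed expected compression ratio. The helper names pool_limit_pages() and pseudo_dev_bytes() are hypothetical and do not appear in the patch; they only illustrate that the swap device is measured in uncompressed pages, so its size would be the compressed pool limit scaled by the expected ratio (2:1 in the zbud case).

```c
#include <stdint.h>

/*
 * Hypothetical helper: upper bound of the compressed pool, in pages.
 * Mirrors the totalram_pages * max_pool_percent / 100 term in the patch.
 */
static uint64_t pool_limit_pages(uint64_t total_ram_pages,
				 unsigned int max_pool_percent)
{
	return total_ram_pages * max_pool_percent / 100;
}

/*
 * Hypothetical helper: size in bytes of a swap device whose uncompressed
 * contents would just fill the pool at an expected compression ratio of
 * ratio_num:ratio_den (2:1 for zbud; 1:1 reproduces the posted sizing).
 */
static uint64_t pseudo_dev_bytes(uint64_t pool_pages, unsigned int ratio_num,
				 unsigned int ratio_den, uint64_t page_size)
{
	return pool_pages * ratio_num / ratio_den * page_size;
}
```

For example, with 262144 total pages (1 GiB of 4 KiB pages) and max_pool_percent=20, the pool limit is 52428 pages. Sizing at 2:1 would give a device of 104856 pages, whereas sizing at 1:1, as the patch does, gives 52428 pages.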
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx171.postini.com [74.125.245.171]) by kanga.kvack.org (Postfix) with SMTP id A9DDD6B0032 for ; Mon, 19 Aug 2013 21:12:35 -0400 (EDT) Message-ID: <5212C24F.9050702@oracle.com> Date: Tue, 20 Aug 2013 09:11:43 +0800 From: Bob Liu MIME-Version: 1.0 Subject: Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-4-git-send-email-bob.liu@oracle.com> <20130819165948.GA5703@variantweb.net> In-Reply-To: <20130819165948.GA5703@variantweb.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Seth Jennings Cc: Bob Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org On 08/20/2013 12:59 AM, Seth Jennings wrote: > On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote: >> Make zswap can use zsmalloc as its allocater. >> But note that zsmalloc don't reclaim any zswap pool pages mandatory, if zswap >> pool gets full, frontswap_store will be refused unless frontswap_get happened >> and freed some space. >> >> The reason of don't implement reclaiming zsmalloc pages from zswap pool is there >> is no requiremnet currently. >> If we want to do mandatory reclaim, we have to write those pages to real backend >> swap devices. But most of current users of zsmalloc are from embeded world, >> there is even no real backend swap device. >> This action is also the same as privous zram! >> >> For several area, zsmalloc has unpredictable performance characteristics when >> reclaiming a single page, then CONFIG_ZBUD are suggested. 
>
> Looking at this patch on its own, it does show how simple it could be
> for zswap to support zsmalloc. So thanks!
>
> However, I don't like all the ifdefs scattered everywhere. I'd like to
> have an ops structure (e.g. struct zswap_alloc_ops) instead and just
> switch ops based on the CONFIG flag. Or better yet, have it boot-time
> selectable instead of build-time.
>

I don't like the ifdefs either. But I didn't find a better way to replace
them, since the data structures and APIs of zbud and zsmalloc are different.
I can try using zswap_alloc_ops.

--
Regards,
-Bob

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147]) by kanga.kvack.org (Postfix) with SMTP id 2B4A46B0032 for ; Mon, 19 Aug 2013 22:04:12 -0400 (EDT)
Message-ID: <5212CE61.2090600@oracle.com>
Date: Tue, 20 Aug 2013 10:03:13 +0800
From: Bob Liu
MIME-Version: 1.0
Subject: Re: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-5-git-send-email-bob.liu@oracle.com> <20130819174634.GB5703@variantweb.net>
In-Reply-To: <20130819174634.GB5703@variantweb.net>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Seth Jennings
Cc: Bob Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org

On 08/20/2013 01:46 AM, Seth Jennings wrote:
> On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote:
>> This is used to
replace the previous zram.
>> zram users can enable this feature, and a pseudo device will then be created
>> automatically after kernel boot.
>> Just run "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk.
>>
>> The size of this pseudo device is controlled by the zswap boot parameter
>> zswap.max_pool_percent.
>> disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE.
>
> This /dev/zram0 will behave nothing like the block device that zram
> creates. It only allows reads/writes to the first PAGE_SIZE area of the
> device, for mkswap to work, and then doesn't do anything for all other
> accesses.

Yes, all the other data should be stored in the zswap pool and doesn't need
to go through the block layer.

>
> I guess if you disabled zswap writeback, then... it would somewhat be
> the same thing. We do need to disable zswap writeback in this case so
> that zswap doesn't decompress a ton of pages into the swapcache for
> writebacks that will just fail. Since zsmalloc does not yet support the
> reclaim functionality, zswap writeback is implicitly disabled.
>

Yes, ZSWAP_PSEUDO_BLKDEV depends on zsmalloc, and when zsmalloc is the
allocator, writeback is disabled (not implemented and no requirement).

> But this is really weird conceptually since zswap is a caching layer
> that uses frontswap. If a frontswap store fails, it will try to send
> the page to the zram0 device which will fail the write. Then the page

That's a problem. We should not send the page to zram0 if the frontswap
store fails; return failure, just as if the swap device were full.

> will be... put back on the active or inactive list?
>
> Also, using the max_pool_percent in calculating the pseudo-device size
> isn't right. Right now, the code makes the device the max size of the
> _compressed_ pool, but the underlying swap device size is in
> _uncompressed_ pages.
So you'll never be able to fill zswap sizing the
> device like this, unless every page is highly incompressible to the
> point that each compressed page effectively uses a memory pool page, in
> which case, the user shouldn't be using memory compression.
>
> This also means that this hasn't been tested in the zswap pool-is-full
> case since there is no way, in this code, to hit that case.

Yes, but as I understand it there is no need to trigger this path.
It's the same with zram: e.g. create /dev/zram0 with a disksize of 100M, and
mm-core will store at most ~100M of uncompressed pages to /dev/zram0. The real
memory spent storing those pages depends on the compression ratio; it's rare
that zram will need 100M of real memory.

>
> In the zbud case the expected compression is 2:1 so you could just
> multiply the compressed pool size by 2 and get a good pseudo-device
> size. With zsmalloc the expected compression is harder to determine
> since it can achieve very high effective compression ratios on highly
> compressible pages.
>

Some users do know the compression ratio of their workloads, even when
using zsmalloc.

--
Regards,
-Bob
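
The sizing disagreement in this exchange comes down to simple arithmetic, so a small user-space sketch may help. It is not code from the patch: the names zswap_pool_bytes() and pseudo_dev_bytes() and the ratio parameters are made up for illustration, and the 2:1 figure is only the zbud expectation Seth mentions. The first helper follows the disksize formula from the patch description; the second scales that compressed pool size by an assumed compression ratio to get a device size in uncompressed bytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compressed pool limit in bytes, following the formula quoted in the
 * patch description:
 *   (totalram_pages * max_pool_percent / 100) * PAGE_SIZE
 * Hypothetical helper for illustration, not a kernel function.
 */
static uint64_t zswap_pool_bytes(uint64_t totalram_pages,
				 unsigned int max_pool_percent,
				 uint64_t page_size)
{
	return totalram_pages * max_pool_percent / 100 * page_size;
}

/*
 * Device size in uncompressed bytes that could fill a pool of
 * pool_bytes, assuming an average compression ratio of
 * ratio_num:ratio_den. 2:1 is the zbud expectation; for zsmalloc the
 * ratio is workload dependent and would have to come from the user.
 */
static uint64_t pseudo_dev_bytes(uint64_t pool_bytes,
				 unsigned int ratio_num,
				 unsigned int ratio_den)
{
	assert(ratio_den != 0);
	return pool_bytes * ratio_num / ratio_den;
}
```

For example, with 262144 pages of 4096 bytes (1 GiB) and max_pool_percent = 20, zswap_pool_bytes() gives 214745088 bytes. The patch as posted would use that value directly as the device size, while pseudo_dev_bytes(214745088, 2, 1) gives 429490176 bytes, the size at which a 2:1 workload could actually fill the pool.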
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755450Ab3HRIle (ORCPT ); Sun, 18 Aug 2013 04:41:34 -0400 Received: from mail-pb0-f44.google.com ([209.85.160.44]:34996 "EHLO mail-pb0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754294Ab3HRIl2 (ORCPT ); Sun, 18 Aug 2013 04:41:28 -0400 From: Bob Liu To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, 
axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu Subject: [PATCH 1/4] drivers: staging: drop zram and zsmalloc Date: Sun, 18 Aug 2013 16:40:46 +0800 Message-Id: <1376815249-6611-2-git-send-email-bob.liu@oracle.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Zswap will be used to replace zram. Signed-off-by: Bob Liu --- drivers/staging/Kconfig | 4 - drivers/staging/Makefile | 2 - drivers/staging/zram/Kconfig | 25 - drivers/staging/zram/Makefile | 3 - drivers/staging/zram/zram.txt | 77 --- drivers/staging/zram/zram_drv.c | 925 -------------------------- drivers/staging/zram/zram_drv.h | 115 ---- drivers/staging/zsmalloc/Kconfig | 10 - drivers/staging/zsmalloc/Makefile | 3 - drivers/staging/zsmalloc/zsmalloc-main.c | 1063 ------------------------------ drivers/staging/zsmalloc/zsmalloc.h | 43 -- 11 files changed, 2270 deletions(-) delete mode 100644 drivers/staging/zram/Kconfig delete mode 100644 drivers/staging/zram/Makefile delete mode 100644 drivers/staging/zram/zram.txt delete mode 100644 drivers/staging/zram/zram_drv.c delete mode 100644 drivers/staging/zram/zram_drv.h delete mode 100644 drivers/staging/zsmalloc/Kconfig delete mode 100644 drivers/staging/zsmalloc/Makefile delete mode 100644 drivers/staging/zsmalloc/zsmalloc-main.c delete mode 100644 drivers/staging/zsmalloc/zsmalloc.h diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig index 57d8b34..d5355f4 100644 --- a/drivers/staging/Kconfig +++ b/drivers/staging/Kconfig @@ -74,10 +74,6 @@ source "drivers/staging/sep/Kconfig" source "drivers/staging/iio/Kconfig" -source "drivers/staging/zsmalloc/Kconfig" - -source 
"drivers/staging/zram/Kconfig" - source "drivers/staging/wlags49_h2/Kconfig" source "drivers/staging/wlags49_h25/Kconfig" diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile index 429321f..17a828f 100644 --- a/drivers/staging/Makefile +++ b/drivers/staging/Makefile @@ -31,8 +31,6 @@ obj-$(CONFIG_VT6656) += vt6656/ obj-$(CONFIG_VME_BUS) += vme/ obj-$(CONFIG_DX_SEP) += sep/ obj-$(CONFIG_IIO) += iio/ -obj-$(CONFIG_ZRAM) += zram/ -obj-$(CONFIG_ZSMALLOC) += zsmalloc/ obj-$(CONFIG_WLAGS49_H2) += wlags49_h2/ obj-$(CONFIG_WLAGS49_H25) += wlags49_h25/ obj-$(CONFIG_FB_SM7XX) += sm7xxfb/ diff --git a/drivers/staging/zram/Kconfig b/drivers/staging/zram/Kconfig deleted file mode 100644 index 983314c..0000000 --- a/drivers/staging/zram/Kconfig +++ /dev/null @@ -1,25 +0,0 @@ -config ZRAM - tristate "Compressed RAM block device support" - depends on BLOCK && SYSFS && ZSMALLOC - select LZO_COMPRESS - select LZO_DECOMPRESS - default n - help - Creates virtual block devices called /dev/zramX (X = 0, 1, ...). - Pages written to these disks are compressed and stored in memory - itself. These disks allow very fast I/O and compression provides - good amounts of memory savings. - - It has several use cases, for example: /tmp storage, use as swap - disks and maybe many more. - - See zram.txt for more information. - Project home: - -config ZRAM_DEBUG - bool "Compressed RAM block device debug support" - depends on ZRAM - default n - help - This option adds additional debugging code to the compressed - RAM block device driver. 
diff --git a/drivers/staging/zram/Makefile b/drivers/staging/zram/Makefile deleted file mode 100644 index cb0f9ce..0000000 --- a/drivers/staging/zram/Makefile +++ /dev/null @@ -1,3 +0,0 @@ -zram-y := zram_drv.o - -obj-$(CONFIG_ZRAM) += zram.o diff --git a/drivers/staging/zram/zram.txt b/drivers/staging/zram/zram.txt deleted file mode 100644 index 765d790..0000000 --- a/drivers/staging/zram/zram.txt +++ /dev/null @@ -1,77 +0,0 @@ -zram: Compressed RAM based block devices ----------------------------------------- - -Project home: http://compcache.googlecode.com/ - -* Introduction - -The zram module creates RAM based block devices named /dev/zram -( = 0, 1, ...). Pages written to these disks are compressed and stored -in memory itself. These disks allow very fast I/O and compression provides -good amounts of memory savings. Some of the usecases include /tmp storage, -use as swap disks, various caches under /var and maybe many more :) - -Statistics for individual zram devices are exported through sysfs nodes at -/sys/block/zram/ - -* Usage - -Following shows a typical sequence of steps for using zram. - -1) Load Module: - modprobe zram num_devices=4 - This creates 4 devices: /dev/zram{0,1,2,3} - (num_devices parameter is optional. Default: 1) - -2) Set Disksize - Set disk size by writing the value to sysfs node 'disksize'. - The value can be either in bytes or you can use mem suffixes. 
- Examples: - # Initialize /dev/zram0 with 50MB disksize - echo $((50*1024*1024)) > /sys/block/zram0/disksize - - # Using mem suffixes - echo 256K > /sys/block/zram0/disksize - echo 512M > /sys/block/zram0/disksize - echo 1G > /sys/block/zram0/disksize - -3) Activate: - mkswap /dev/zram0 - swapon /dev/zram0 - - mkfs.ext4 /dev/zram1 - mount /dev/zram1 /tmp - -4) Stats: - Per-device statistics are exported as various nodes under - /sys/block/zram/ - disksize - num_reads - num_writes - invalid_io - notify_free - discard - zero_pages - orig_data_size - compr_data_size - mem_used_total - -5) Deactivate: - swapoff /dev/zram0 - umount /dev/zram1 - -6) Reset: - Write any positive value to 'reset' sysfs node - echo 1 > /sys/block/zram0/reset - echo 1 > /sys/block/zram1/reset - - This frees all the memory allocated for the given device and - resets the disksize to zero. You must set the disksize again - before reusing the device. - -Please report any problems at: - - Mailing list: linux-mm-cc at laptop dot org - - Issue tracker: http://code.google.com/p/compcache/issues/list - -Nitin Gupta -ngupta@vflare.org diff --git a/drivers/staging/zram/zram_drv.c b/drivers/staging/zram/zram_drv.c deleted file mode 100644 index e77fb6e..0000000 --- a/drivers/staging/zram/zram_drv.c +++ /dev/null @@ -1,925 +0,0 @@ -/* - * Compressed RAM block device - * - * Copyright (C) 2008, 2009, 2010 Nitin Gupta - * - * This code is released using a dual license strategy: BSD/GPL - * You can choose the licence that better fits your requirements. 
- * - * Released under the terms of 3-clause BSD License - * Released under the terms of GNU General Public License Version 2.0 - * - * Project home: http://compcache.googlecode.com - */ - -#define KMSG_COMPONENT "zram" -#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt - -#ifdef CONFIG_ZRAM_DEBUG -#define DEBUG -#endif - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "zram_drv.h" - -/* Globals */ -static int zram_major; -static struct zram *zram_devices; - -/* Module params (documentation at end) */ -static unsigned int num_devices = 1; - -static inline struct zram *dev_to_zram(struct device *dev) -{ - return (struct zram *)dev_to_disk(dev)->private_data; -} - -static ssize_t disksize_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", zram->disksize); -} - -static ssize_t initstate_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%u\n", zram->init_done); -} - -static ssize_t num_reads_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.num_reads)); -} - -static ssize_t num_writes_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.num_writes)); -} - -static ssize_t invalid_io_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.invalid_io)); -} - -static ssize_t notify_free_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - 
(u64)atomic64_read(&zram->stats.notify_free)); -} - -static ssize_t zero_pages_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%u\n", zram->stats.pages_zero); -} - -static ssize_t orig_data_size_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)(zram->stats.pages_stored) << PAGE_SHIFT); -} - -static ssize_t compr_data_size_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - struct zram *zram = dev_to_zram(dev); - - return sprintf(buf, "%llu\n", - (u64)atomic64_read(&zram->stats.compr_size)); -} - -static ssize_t mem_used_total_show(struct device *dev, - struct device_attribute *attr, char *buf) -{ - u64 val = 0; - struct zram *zram = dev_to_zram(dev); - struct zram_meta *meta = zram->meta; - - down_read(&zram->init_lock); - if (zram->init_done) - val = zs_get_total_size_bytes(meta->mem_pool); - up_read(&zram->init_lock); - - return sprintf(buf, "%llu\n", val); -} - -static int zram_test_flag(struct zram_meta *meta, u32 index, - enum zram_pageflags flag) -{ - return meta->table[index].flags & BIT(flag); -} - -static void zram_set_flag(struct zram_meta *meta, u32 index, - enum zram_pageflags flag) -{ - meta->table[index].flags |= BIT(flag); -} - -static void zram_clear_flag(struct zram_meta *meta, u32 index, - enum zram_pageflags flag) -{ - meta->table[index].flags &= ~BIT(flag); -} - -static inline int is_partial_io(struct bio_vec *bvec) -{ - return bvec->bv_len != PAGE_SIZE; -} - -/* - * Check if request is within bounds and aligned on zram logical blocks. 
- */ -static inline int valid_io_request(struct zram *zram, struct bio *bio) -{ - u64 start, end, bound; - - /* unaligned request */ - if (unlikely(bio->bi_sector & (ZRAM_SECTOR_PER_LOGICAL_BLOCK - 1))) - return 0; - if (unlikely(bio->bi_size & (ZRAM_LOGICAL_BLOCK_SIZE - 1))) - return 0; - - start = bio->bi_sector; - end = start + (bio->bi_size >> SECTOR_SHIFT); - bound = zram->disksize >> SECTOR_SHIFT; - /* out of range range */ - if (unlikely(start >= bound || end > bound || start > end)) - return 0; - - /* I/O request is valid */ - return 1; -} - -static void zram_meta_free(struct zram_meta *meta) -{ - zs_destroy_pool(meta->mem_pool); - kfree(meta->compress_workmem); - free_pages((unsigned long)meta->compress_buffer, 1); - vfree(meta->table); - kfree(meta); -} - -static struct zram_meta *zram_meta_alloc(u64 disksize) -{ - size_t num_pages; - struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL); - if (!meta) - goto out; - - meta->compress_workmem = kzalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); - if (!meta->compress_workmem) - goto free_meta; - - meta->compress_buffer = - (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1); - if (!meta->compress_buffer) { - pr_err("Error allocating compressor buffer space\n"); - goto free_workmem; - } - - num_pages = disksize >> PAGE_SHIFT; - meta->table = vzalloc(num_pages * sizeof(*meta->table)); - if (!meta->table) { - pr_err("Error allocating zram address table\n"); - goto free_buffer; - } - - meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM); - if (!meta->mem_pool) { - pr_err("Error creating memory pool\n"); - goto free_table; - } - - return meta; - -free_table: - vfree(meta->table); -free_buffer: - free_pages((unsigned long)meta->compress_buffer, 1); -free_workmem: - kfree(meta->compress_workmem); -free_meta: - kfree(meta); - meta = NULL; -out: - return meta; -} - -static void update_position(u32 *index, int *offset, struct bio_vec *bvec) -{ - if (*offset + bvec->bv_len >= PAGE_SIZE) - (*index)++; - *offset = 
(*offset + bvec->bv_len) % PAGE_SIZE; -} - -static int page_zero_filled(void *ptr) -{ - unsigned int pos; - unsigned long *page; - - page = (unsigned long *)ptr; - - for (pos = 0; pos != PAGE_SIZE / sizeof(*page); pos++) { - if (page[pos]) - return 0; - } - - return 1; -} - -static void handle_zero_page(struct bio_vec *bvec) -{ - struct page *page = bvec->bv_page; - void *user_mem; - - user_mem = kmap_atomic(page); - if (is_partial_io(bvec)) - memset(user_mem + bvec->bv_offset, 0, bvec->bv_len); - else - clear_page(user_mem); - kunmap_atomic(user_mem); - - flush_dcache_page(page); -} - -static void zram_free_page(struct zram *zram, size_t index) -{ - struct zram_meta *meta = zram->meta; - unsigned long handle = meta->table[index].handle; - u16 size = meta->table[index].size; - - if (unlikely(!handle)) { - /* - * No memory is allocated for zero filled pages. - * Simply clear zero page flag. - */ - if (zram_test_flag(meta, index, ZRAM_ZERO)) { - zram_clear_flag(meta, index, ZRAM_ZERO); - zram->stats.pages_zero--; - } - return; - } - - if (unlikely(size > max_zpage_size)) - zram->stats.bad_compress--; - - zs_free(meta->mem_pool, handle); - - if (size <= PAGE_SIZE / 2) - zram->stats.good_compress--; - - atomic64_sub(meta->table[index].size, &zram->stats.compr_size); - zram->stats.pages_stored--; - - meta->table[index].handle = 0; - meta->table[index].size = 0; -} - -static int zram_decompress_page(struct zram *zram, char *mem, u32 index) -{ - int ret = LZO_E_OK; - size_t clen = PAGE_SIZE; - unsigned char *cmem; - struct zram_meta *meta = zram->meta; - unsigned long handle = meta->table[index].handle; - - if (!handle || zram_test_flag(meta, index, ZRAM_ZERO)) { - clear_page(mem); - return 0; - } - - cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_RO); - if (meta->table[index].size == PAGE_SIZE) - copy_page(mem, cmem); - else - ret = lzo1x_decompress_safe(cmem, meta->table[index].size, - mem, &clen); - zs_unmap_object(meta->mem_pool, handle); - - /* Should NEVER 
happen. Return bio error if it does. */ - if (unlikely(ret != LZO_E_OK)) { - pr_err("Decompression failed! err=%d, page=%u\n", ret, index); - atomic64_inc(&zram->stats.failed_reads); - return ret; - } - - return 0; -} - -static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec, - u32 index, int offset, struct bio *bio) -{ - int ret; - struct page *page; - unsigned char *user_mem, *uncmem = NULL; - struct zram_meta *meta = zram->meta; - page = bvec->bv_page; - - if (unlikely(!meta->table[index].handle) || - zram_test_flag(meta, index, ZRAM_ZERO)) { - handle_zero_page(bvec); - return 0; - } - - if (is_partial_io(bvec)) - /* Use a temporary buffer to decompress the page */ - uncmem = kmalloc(PAGE_SIZE, GFP_NOIO); - - user_mem = kmap_atomic(page); - if (!is_partial_io(bvec)) - uncmem = user_mem; - - if (!uncmem) { - pr_info("Unable to allocate temp memory\n"); - ret = -ENOMEM; - goto out_cleanup; - } - - ret = zram_decompress_page(zram, uncmem, index); - /* Should NEVER happen. Return bio error if it does. */ - if (unlikely(ret != LZO_E_OK)) - goto out_cleanup; - - if (is_partial_io(bvec)) - memcpy(user_mem + bvec->bv_offset, uncmem + offset, - bvec->bv_len); - - flush_dcache_page(page); - ret = 0; -out_cleanup: - kunmap_atomic(user_mem); - if (is_partial_io(bvec)) - kfree(uncmem); - return ret; -} - -static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, - int offset) -{ - int ret = 0; - size_t clen; - unsigned long handle; - struct page *page; - unsigned char *user_mem, *cmem, *src, *uncmem = NULL; - struct zram_meta *meta = zram->meta; - - page = bvec->bv_page; - src = meta->compress_buffer; - - if (is_partial_io(bvec)) { - /* - * This is a partial IO. We need to read the full page - * before to write the changes. - */ - uncmem = kmalloc(PAGE_SIZE, GFP_NOIO); - if (!uncmem) { - ret = -ENOMEM; - goto out; - } - ret = zram_decompress_page(zram, uncmem, index); - if (ret) - goto out; - } - - /* - * System overwrites unused sectors. 
Free memory associated - * with this sector now. - */ - if (meta->table[index].handle || - zram_test_flag(meta, index, ZRAM_ZERO)) - zram_free_page(zram, index); - - user_mem = kmap_atomic(page); - - if (is_partial_io(bvec)) { - memcpy(uncmem + offset, user_mem + bvec->bv_offset, - bvec->bv_len); - kunmap_atomic(user_mem); - user_mem = NULL; - } else { - uncmem = user_mem; - } - - if (page_zero_filled(uncmem)) { - kunmap_atomic(user_mem); - zram->stats.pages_zero++; - zram_set_flag(meta, index, ZRAM_ZERO); - ret = 0; - goto out; - } - - ret = lzo1x_1_compress(uncmem, PAGE_SIZE, src, &clen, - meta->compress_workmem); - - if (!is_partial_io(bvec)) { - kunmap_atomic(user_mem); - user_mem = NULL; - uncmem = NULL; - } - - if (unlikely(ret != LZO_E_OK)) { - pr_err("Compression failed! err=%d\n", ret); - goto out; - } - - if (unlikely(clen > max_zpage_size)) { - zram->stats.bad_compress++; - clen = PAGE_SIZE; - src = NULL; - if (is_partial_io(bvec)) - src = uncmem; - } - - handle = zs_malloc(meta->mem_pool, clen); - if (!handle) { - pr_info("Error allocating memory for compressed page: %u, size=%zu\n", - index, clen); - ret = -ENOMEM; - goto out; - } - cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO); - - if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) { - src = kmap_atomic(page); - copy_page(cmem, src); - kunmap_atomic(src); - } else { - memcpy(cmem, src, clen); - } - - zs_unmap_object(meta->mem_pool, handle); - - meta->table[index].handle = handle; - meta->table[index].size = clen; - - /* Update stats */ - atomic64_add(clen, &zram->stats.compr_size); - zram->stats.pages_stored++; - if (clen <= PAGE_SIZE / 2) - zram->stats.good_compress++; - -out: - if (is_partial_io(bvec)) - kfree(uncmem); - - if (ret) - atomic64_inc(&zram->stats.failed_writes); - return ret; -} - -static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, - int offset, struct bio *bio, int rw) -{ - int ret; - - if (rw == READ) { - down_read(&zram->lock); - ret = 
zram_bvec_read(zram, bvec, index, offset, bio); - up_read(&zram->lock); - } else { - down_write(&zram->lock); - ret = zram_bvec_write(zram, bvec, index, offset); - up_write(&zram->lock); - } - - return ret; -} - -static void zram_reset_device(struct zram *zram) -{ - size_t index; - struct zram_meta *meta; - - down_write(&zram->init_lock); - if (!zram->init_done) { - up_write(&zram->init_lock); - return; - } - - meta = zram->meta; - zram->init_done = 0; - - /* Free all pages that are still in this zram device */ - for (index = 0; index < zram->disksize >> PAGE_SHIFT; index++) { - unsigned long handle = meta->table[index].handle; - if (!handle) - continue; - - zs_free(meta->mem_pool, handle); - } - - zram_meta_free(zram->meta); - zram->meta = NULL; - /* Reset stats */ - memset(&zram->stats, 0, sizeof(zram->stats)); - - zram->disksize = 0; - set_capacity(zram->disk, 0); - up_write(&zram->init_lock); -} - -static void zram_init_device(struct zram *zram, struct zram_meta *meta) -{ - if (zram->disksize > 2 * (totalram_pages << PAGE_SHIFT)) { - pr_info( - "There is little point creating a zram of greater than " - "twice the size of memory since we expect a 2:1 compression " - "ratio. 
Note that zram uses about 0.1%% of the size of " - "the disk when not in use so a huge zram is " - "wasteful.\n" - "\tMemory Size: %lu kB\n" - "\tSize you selected: %llu kB\n" - "Continuing anyway ...\n", - (totalram_pages << PAGE_SHIFT) >> 10, zram->disksize >> 10 - ); - } - - /* zram devices sort of resembles non-rotational disks */ - queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue); - - zram->meta = meta; - zram->init_done = 1; - - pr_debug("Initialization done!\n"); -} - -static ssize_t disksize_store(struct device *dev, - struct device_attribute *attr, const char *buf, size_t len) -{ - u64 disksize; - struct zram_meta *meta; - struct zram *zram = dev_to_zram(dev); - - disksize = memparse(buf, NULL); - if (!disksize) - return -EINVAL; - - disksize = PAGE_ALIGN(disksize); - meta = zram_meta_alloc(disksize); - down_write(&zram->init_lock); - if (zram->init_done) { - up_write(&zram->init_lock); - zram_meta_free(meta); - pr_info("Cannot change disksize for initialized device\n"); - return -EBUSY; - } - - zram->disksize = disksize; - set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT); - zram_init_device(zram, meta); - up_write(&zram->init_lock); - - return len; -} - -static ssize_t reset_store(struct device *dev, - struct device_attribute *attr, const char *buf, size_t len) -{ - int ret; - unsigned short do_reset; - struct zram *zram; - struct block_device *bdev; - - zram = dev_to_zram(dev); - bdev = bdget_disk(zram->disk, 0); - - /* Do not reset an active device! 
*/ - if (bdev->bd_holders) - return -EBUSY; - - ret = kstrtou16(buf, 10, &do_reset); - if (ret) - return ret; - - if (!do_reset) - return -EINVAL; - - /* Make sure all pending I/O is finished */ - if (bdev) - fsync_bdev(bdev); - - zram_reset_device(zram); - return len; -} - -static void __zram_make_request(struct zram *zram, struct bio *bio, int rw) -{ - int i, offset; - u32 index; - struct bio_vec *bvec; - - switch (rw) { - case READ: - atomic64_inc(&zram->stats.num_reads); - break; - case WRITE: - atomic64_inc(&zram->stats.num_writes); - break; - } - - index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT; - offset = (bio->bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; - - bio_for_each_segment(bvec, bio, i) { - int max_transfer_size = PAGE_SIZE - offset; - - if (bvec->bv_len > max_transfer_size) { - /* - * zram_bvec_rw() can only make operation on a single - * zram page. Split the bio vector. - */ - struct bio_vec bv; - - bv.bv_page = bvec->bv_page; - bv.bv_len = max_transfer_size; - bv.bv_offset = bvec->bv_offset; - - if (zram_bvec_rw(zram, &bv, index, offset, bio, rw) < 0) - goto out; - - bv.bv_len = bvec->bv_len - max_transfer_size; - bv.bv_offset += max_transfer_size; - if (zram_bvec_rw(zram, &bv, index+1, 0, bio, rw) < 0) - goto out; - } else - if (zram_bvec_rw(zram, bvec, index, offset, bio, rw) - < 0) - goto out; - - update_position(&index, &offset, bvec); - } - - set_bit(BIO_UPTODATE, &bio->bi_flags); - bio_endio(bio, 0); - return; - -out: - bio_io_error(bio); -} - -/* - * Handler function for all zram I/O requests. 
- */
-static void zram_make_request(struct request_queue *queue, struct bio *bio)
-{
-	struct zram *zram = queue->queuedata;
-
-	down_read(&zram->init_lock);
-	if (unlikely(!zram->init_done))
-		goto error;
-
-	if (!valid_io_request(zram, bio)) {
-		atomic64_inc(&zram->stats.invalid_io);
-		goto error;
-	}
-
-	__zram_make_request(zram, bio, bio_data_dir(bio));
-	up_read(&zram->init_lock);
-
-	return;
-
-error:
-	up_read(&zram->init_lock);
-	bio_io_error(bio);
-}
-
-static void zram_slot_free_notify(struct block_device *bdev,
-				unsigned long index)
-{
-	struct zram *zram;
-
-	zram = bdev->bd_disk->private_data;
-	down_write(&zram->lock);
-	zram_free_page(zram, index);
-	up_write(&zram->lock);
-	atomic64_inc(&zram->stats.notify_free);
-}
-
-static const struct block_device_operations zram_devops = {
-	.swap_slot_free_notify = zram_slot_free_notify,
-	.owner = THIS_MODULE
-};
-
-static DEVICE_ATTR(disksize, S_IRUGO | S_IWUSR,
-		disksize_show, disksize_store);
-static DEVICE_ATTR(initstate, S_IRUGO, initstate_show, NULL);
-static DEVICE_ATTR(reset, S_IWUSR, NULL, reset_store);
-static DEVICE_ATTR(num_reads, S_IRUGO, num_reads_show, NULL);
-static DEVICE_ATTR(num_writes, S_IRUGO, num_writes_show, NULL);
-static DEVICE_ATTR(invalid_io, S_IRUGO, invalid_io_show, NULL);
-static DEVICE_ATTR(notify_free, S_IRUGO, notify_free_show, NULL);
-static DEVICE_ATTR(zero_pages, S_IRUGO, zero_pages_show, NULL);
-static DEVICE_ATTR(orig_data_size, S_IRUGO, orig_data_size_show, NULL);
-static DEVICE_ATTR(compr_data_size, S_IRUGO, compr_data_size_show, NULL);
-static DEVICE_ATTR(mem_used_total, S_IRUGO, mem_used_total_show, NULL);
-
-static struct attribute *zram_disk_attrs[] = {
-	&dev_attr_disksize.attr,
-	&dev_attr_initstate.attr,
-	&dev_attr_reset.attr,
-	&dev_attr_num_reads.attr,
-	&dev_attr_num_writes.attr,
-	&dev_attr_invalid_io.attr,
-	&dev_attr_notify_free.attr,
-	&dev_attr_zero_pages.attr,
-	&dev_attr_orig_data_size.attr,
-	&dev_attr_compr_data_size.attr,
-	&dev_attr_mem_used_total.attr,
-	NULL,
-};
-
-static struct attribute_group zram_disk_attr_group = {
-	.attrs = zram_disk_attrs,
-};
-
-static int create_device(struct zram *zram, int device_id)
-{
-	int ret = -ENOMEM;
-
-	init_rwsem(&zram->lock);
-	init_rwsem(&zram->init_lock);
-
-	zram->queue = blk_alloc_queue(GFP_KERNEL);
-	if (!zram->queue) {
-		pr_err("Error allocating disk queue for device %d\n",
-			device_id);
-		goto out;
-	}
-
-	blk_queue_make_request(zram->queue, zram_make_request);
-	zram->queue->queuedata = zram;
-
-	/* gendisk structure */
-	zram->disk = alloc_disk(1);
-	if (!zram->disk) {
-		pr_warn("Error allocating disk structure for device %d\n",
-			device_id);
-		goto out_free_queue;
-	}
-
-	zram->disk->major = zram_major;
-	zram->disk->first_minor = device_id;
-	zram->disk->fops = &zram_devops;
-	zram->disk->queue = zram->queue;
-	zram->disk->private_data = zram;
-	snprintf(zram->disk->disk_name, 16, "zram%d", device_id);
-
-	/* Actual capacity set using syfs (/sys/block/zram/disksize */
-	set_capacity(zram->disk, 0);
-
-	/*
-	 * To ensure that we always get PAGE_SIZE aligned
-	 * and n*PAGE_SIZED sized I/O requests.
-	 */
-	blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE);
-	blk_queue_logical_block_size(zram->disk->queue,
-					ZRAM_LOGICAL_BLOCK_SIZE);
-	blk_queue_io_min(zram->disk->queue, PAGE_SIZE);
-	blk_queue_io_opt(zram->disk->queue, PAGE_SIZE);
-
-	add_disk(zram->disk);
-
-	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-				&zram_disk_attr_group);
-	if (ret < 0) {
-		pr_warn("Error creating sysfs group");
-		goto out_free_disk;
-	}
-
-	zram->init_done = 0;
-	return 0;
-
-out_free_disk:
-	del_gendisk(zram->disk);
-	put_disk(zram->disk);
-out_free_queue:
-	blk_cleanup_queue(zram->queue);
-out:
-	return ret;
-}
-
-static void destroy_device(struct zram *zram)
-{
-	sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-			&zram_disk_attr_group);
-
-	if (zram->disk) {
-		del_gendisk(zram->disk);
-		put_disk(zram->disk);
-	}
-
-	if (zram->queue)
-		blk_cleanup_queue(zram->queue);
-}
-
-static int __init zram_init(void)
-{
-	int ret, dev_id;
-
-	if (num_devices > max_num_devices) {
-		pr_warn("Invalid value for num_devices: %u\n",
-				num_devices);
-		ret = -EINVAL;
-		goto out;
-	}
-
-	zram_major = register_blkdev(0, "zram");
-	if (zram_major <= 0) {
-		pr_warn("Unable to get major number\n");
-		ret = -EBUSY;
-		goto out;
-	}
-
-	/* Allocate the device array and initialize each one */
-	zram_devices = kzalloc(num_devices * sizeof(struct zram), GFP_KERNEL);
-	if (!zram_devices) {
-		ret = -ENOMEM;
-		goto unregister;
-	}
-
-	for (dev_id = 0; dev_id < num_devices; dev_id++) {
-		ret = create_device(&zram_devices[dev_id], dev_id);
-		if (ret)
-			goto free_devices;
-	}
-
-	pr_info("Created %u device(s) ...\n", num_devices);
-
-	return 0;
-
-free_devices:
-	while (dev_id)
-		destroy_device(&zram_devices[--dev_id]);
-	kfree(zram_devices);
-unregister:
-	unregister_blkdev(zram_major, "zram");
-out:
-	return ret;
-}
-
-static void __exit zram_exit(void)
-{
-	int i;
-	struct zram *zram;
-
-	for (i = 0; i < num_devices; i++) {
-		zram = &zram_devices[i];
-
-		get_disk(zram->disk);
-		destroy_device(zram);
-		zram_reset_device(zram);
-		put_disk(zram->disk);
-	}
-
-	unregister_blkdev(zram_major, "zram");
-
-	kfree(zram_devices);
-	pr_debug("Cleanup done!\n");
-}
-
-module_init(zram_init);
-module_exit(zram_exit);
-
-module_param(num_devices, uint, 0);
-MODULE_PARM_DESC(num_devices, "Number of zram devices");
-
-MODULE_LICENSE("Dual BSD/GPL");
-MODULE_AUTHOR("Nitin Gupta ");
-MODULE_DESCRIPTION("Compressed RAM Block Device");
diff --git a/drivers/staging/zram/zram_drv.h b/drivers/staging/zram/zram_drv.h
deleted file mode 100644
index 9e57bfb..0000000
--- a/drivers/staging/zram/zram_drv.h
+++ /dev/null
@@ -1,115 +0,0 @@
-/*
- * Compressed RAM block device
- *
- * Copyright (C) 2008, 2009, 2010 Nitin Gupta
- *
- * This code is released using a dual license strategy: BSD/GPL
- * You can choose the licence that better fits your requirements.
- *
- * Released under the terms of 3-clause BSD License
- * Released under the terms of GNU General Public License Version 2.0
- *
- * Project home: http://compcache.googlecode.com
- */
-
-#ifndef _ZRAM_DRV_H_
-#define _ZRAM_DRV_H_
-
-#include
-#include
-
-#include "../zsmalloc/zsmalloc.h"
-
-/*
- * Some arbitrary value. This is just to catch
- * invalid value for num_devices module parameter.
- */
-static const unsigned max_num_devices = 32;
-
-/*-- Configurable parameters */
-
-/*
- * Pages that compress to size greater than this are stored
- * uncompressed in memory.
- */
-static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
-
-/*
- * NOTE: max_zpage_size must be less than or equal to:
- *   ZS_MAX_ALLOC_SIZE. Otherwise, zs_malloc() would
- * always return failure.
- */
-
-/*-- End of configurable params */
-
-#define SECTOR_SHIFT		9
-#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
-#define SECTORS_PER_PAGE_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
-#define SECTORS_PER_PAGE	(1 << SECTORS_PER_PAGE_SHIFT)
-#define ZRAM_LOGICAL_BLOCK_SHIFT 12
-#define ZRAM_LOGICAL_BLOCK_SIZE	(1 << ZRAM_LOGICAL_BLOCK_SHIFT)
-#define ZRAM_SECTOR_PER_LOGICAL_BLOCK	\
-	(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
-
-/* Flags for zram pages (table[page_no].flags) */
-enum zram_pageflags {
-	/* Page consists entirely of zeros */
-	ZRAM_ZERO,
-
-	__NR_ZRAM_PAGEFLAGS,
-};
-
-/*-- Data structures */
-
-/* Allocated for each disk page */
-struct table {
-	unsigned long handle;
-	u16 size;	/* object size (excluding header) */
-	u8 count;	/* object ref count (not yet used) */
-	u8 flags;
-} __aligned(4);
-
-/*
- * All 64bit fields should only be manipulated by 64bit atomic accessors.
- * All modifications to 32bit counter should be protected by zram->lock.
- */
-struct zram_stats {
-	atomic64_t compr_size;	/* compressed size of pages stored */
-	atomic64_t num_reads;	/* failed + successful */
-	atomic64_t num_writes;	/* --do-- */
-	atomic64_t failed_reads;	/* should NEVER! happen */
-	atomic64_t failed_writes;	/* can happen when memory is too low */
-	atomic64_t invalid_io;	/* non-page-aligned I/O requests */
-	atomic64_t notify_free;	/* no. of swap slot free notifications */
-	u32 pages_zero;		/* no. of zero filled pages */
-	u32 pages_stored;	/* no. of pages currently stored */
-	u32 good_compress;	/* % of pages with compression ratio<=50% */
-	u32 bad_compress;	/* % of pages with compression ratio>=75% */
-};
-
-struct zram_meta {
-	void *compress_workmem;
-	void *compress_buffer;
-	struct table *table;
-	struct zs_pool *mem_pool;
-};
-
-struct zram {
-	struct zram_meta *meta;
-	struct rw_semaphore lock; /* protect compression buffers, table,
-				   * 32bit stat counters against concurrent
-				   * notifications, reads and writes */
-	struct request_queue *queue;
-	struct gendisk *disk;
-	int init_done;
-	/* Prevent concurrent execution of device init, reset and R/W request */
-	struct rw_semaphore init_lock;
-	/*
-	 * This is the limit on amount of *uncompressed* worth of data
-	 * we can store in a disk.
-	 */
-	u64 disksize;	/* bytes */
-
-	struct zram_stats stats;
-};
-#endif
diff --git a/drivers/staging/zsmalloc/Kconfig b/drivers/staging/zsmalloc/Kconfig
deleted file mode 100644
index 7fab032..0000000
--- a/drivers/staging/zsmalloc/Kconfig
+++ /dev/null
@@ -1,10 +0,0 @@
-config ZSMALLOC
-	bool "Memory allocator for compressed pages"
-	default n
-	help
-	  zsmalloc is a slab-based memory allocator designed to store
-	  compressed RAM pages.  zsmalloc uses virtual memory mapping
-	  in order to reduce fragmentation.  However, this results in a
-	  non-standard allocator interface where a handle, not a pointer, is
-	  returned by an alloc().  This handle must be mapped in order to
-	  access the allocated space.
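
Note on the Kconfig help text above: it says an allocation returns a handle
rather than a pointer. The following is only a minimal userspace sketch of
that idea, mirroring the shift-and-mask scheme of obj_location_to_handle()
and obj_handle_to_location() in the removed zsmalloc-main.c. The sketch_*
names and the fixed 12-bit index width are assumptions made for illustration;
the kernel derives the width from MAX_PHYSMEM_BITS and PAGE_SHIFT and works
on struct page, not on raw integers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical index width, chosen for illustration only. The removed
 * kernel code computes OBJ_INDEX_BITS from MAX_PHYSMEM_BITS and PAGE_SHIFT.
 */
#define SKETCH_OBJ_INDEX_BITS 12
#define SKETCH_OBJ_INDEX_MASK ((UINT64_C(1) << SKETCH_OBJ_INDEX_BITS) - 1)

/* Pack a page frame number and an object index into one opaque handle. */
static uint64_t sketch_encode_handle(uint64_t pfn, uint64_t obj_idx)
{
	/* The index must fit in the low bits or it would corrupt the pfn. */
	assert(obj_idx <= SKETCH_OBJ_INDEX_MASK);
	return (pfn << SKETCH_OBJ_INDEX_BITS) | (obj_idx & SKETCH_OBJ_INDEX_MASK);
}

/* Recover the (pfn, obj_idx) pair from a handle. */
static void sketch_decode_handle(uint64_t handle, uint64_t *pfn,
				 uint64_t *obj_idx)
{
	*pfn = handle >> SKETCH_OBJ_INDEX_BITS;
	*obj_idx = handle & SKETCH_OBJ_INDEX_MASK;
}
```

Because the handle only names a location, a caller cannot dereference it
directly; in the removed code that is why zs_map_object() has to run before
the object is touched and zs_unmap_object() afterwards.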
diff --git a/drivers/staging/zsmalloc/Makefile b/drivers/staging/zsmalloc/Makefile
deleted file mode 100644
index b134848..0000000
--- a/drivers/staging/zsmalloc/Makefile
+++ /dev/null
@@ -1,3 +0,0 @@
-zsmalloc-y := zsmalloc-main.o
-
-obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
deleted file mode 100644
index 4bb275b..0000000
--- a/drivers/staging/zsmalloc/zsmalloc-main.c
+++ /dev/null
@@ -1,1063 +0,0 @@
-/*
- * zsmalloc memory allocator
- *
- * Copyright (C) 2011 Nitin Gupta
- *
- * This code is released using a dual license strategy: BSD/GPL
- * You can choose the license that better fits your requirements.
- *
- * Released under the terms of 3-clause BSD License
- * Released under the terms of GNU General Public License Version 2.0
- */
-
-
-/*
- * This allocator is designed for use with zcache and zram. Thus, the
- * allocator is supposed to work well under low memory conditions. In
- * particular, it never attempts higher order page allocation which is
- * very likely to fail under memory pressure. On the other hand, if we
- * just use single (0-order) pages, it would suffer from very high
- * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy
- * an entire page. This was one of the major issues with its predecessor
- * (xvmalloc).
- *
- * To overcome these issues, zsmalloc allocates a bunch of 0-order pages
- * and links them together using various 'struct page' fields. These linked
- * pages act as a single higher-order page i.e. an object can span 0-order
- * page boundaries. The code refers to these linked pages as a single entity
- * called zspage.
- *
- * Following is how we use various fields and flags of underlying
- * struct page(s) to form a zspage.
- *
- * Usage of struct page fields:
- *	page->first_page: points to the first component (0-order) page
- *	page->index (union with page->freelist): offset of the first object
- *		starting in this page. For the first page, this is
- *		always 0, so we use this field (aka freelist) to point
- *		to the first free object in zspage.
- *	page->lru: links together all component pages (except the first page)
- *		of a zspage
- *
- *	For _first_ page only:
- *
- *	page->private (union with page->first_page): refers to the
- *		component page after the first page
- *	page->freelist: points to the first free object in zspage.
- *		Free objects are linked together using in-place
- *		metadata.
- *	page->objects: maximum number of objects we can store in this
- *		zspage (class->zspage_order * PAGE_SIZE / class->size)
- *	page->lru: links together first pages of various zspages.
- *		Basically forming list of zspages in a fullness group.
- *	page->mapping: class index and fullness group of the zspage
- *
- * Usage of struct page flags:
- *	PG_private: identifies the first component page
- *	PG_private2: identifies the last component page
- *
- */
-
-#ifdef CONFIG_ZSMALLOC_DEBUG
-#define DEBUG
-#endif
-
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-
-#include "zsmalloc.h"
-
-/*
- * This must be power of 2 and greater than of equal to sizeof(link_free).
- * These two conditions ensure that any 'struct link_free' itself doesn't
- * span more than 1 page which avoids complex case of mapping 2 pages simply
- * to restore link_free pointer values.
- */
-#define ZS_ALIGN		8
-
-/*
- * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
- * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
- */
-#define ZS_MAX_ZSPAGE_ORDER 2
-#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
-
-/*
- * Object location (<PFN>, <obj_idx>) is encoded as
- * as single (void *) handle value.
- *
- * Note that object index <obj_idx> is relative to system
- * page <PFN> it is stored in, so for each sub-page belonging
- * to a zspage, obj_idx starts with 0.
- *
- * This is made more complicated by various memory models and PAE.
- */
-
-#ifndef MAX_PHYSMEM_BITS
-#ifdef CONFIG_HIGHMEM64G
-#define MAX_PHYSMEM_BITS 36
-#else /* !CONFIG_HIGHMEM64G */
-/*
- * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
- * be PAGE_SHIFT
- */
-#define MAX_PHYSMEM_BITS BITS_PER_LONG
-#endif
-#endif
-#define _PFN_BITS		(MAX_PHYSMEM_BITS - PAGE_SHIFT)
-#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS)
-#define OBJ_INDEX_MASK	((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
-
-#define MAX(a, b) ((a) >= (b) ? (a) : (b))
-/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
-#define ZS_MIN_ALLOC_SIZE \
-	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
-#define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
-
-/*
- * On systems with 4K page size, this gives 254 size classes! There is a
- * trader-off here:
- *  - Large number of size classes is potentially wasteful as free page are
- *    spread across these classes
- *  - Small number of size classes causes large internal fragmentation
- *  - Probably its better to use specific size classes (empirically
- *    determined). NOTE: all those class sizes must be set as multiple of
- *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
- *
- *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
- *  (reason above)
- */
-#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> 8)
-#define ZS_SIZE_CLASSES		((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \
-					ZS_SIZE_CLASS_DELTA + 1)
-
-/*
- * We do not maintain any list for completely empty or full pages
- */
-enum fullness_group {
-	ZS_ALMOST_FULL,
-	ZS_ALMOST_EMPTY,
-	_ZS_NR_FULLNESS_GROUPS,
-
-	ZS_EMPTY,
-	ZS_FULL
-};
-
-/*
- * We assign a page to ZS_ALMOST_EMPTY fullness group when:
- *	n <= N / f, where
- * n = number of allocated objects
- * N = total number of objects zspage can store
- * f = 1/fullness_threshold_frac
- *
- * Similarly, we assign zspage to:
- *	ZS_ALMOST_FULL	when n > N / f
- *	ZS_EMPTY	when n == 0
- *	ZS_FULL		when n == N
- *
- * (see: fix_fullness_group())
- */
-static const int fullness_threshold_frac = 4;
-
-struct size_class {
-	/*
-	 * Size of objects stored in this class. Must be multiple
-	 * of ZS_ALIGN.
-	 */
-	int size;
-	unsigned int index;
-
-	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
-	int pages_per_zspage;
-
-	spinlock_t lock;
-
-	/* stats */
-	u64 pages_allocated;
-
-	struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
-};
-
-/*
- * Placed within free objects to form a singly linked list.
- * For every zspage, first_page->freelist gives head of this list.
- *
- * This must be power of 2 and less than or equal to ZS_ALIGN
- */
-struct link_free {
-	/* Handle of next free chunk (encodes <PFN, obj_idx>) */
-	void *next;
-};
-
-struct zs_pool {
-	struct size_class size_class[ZS_SIZE_CLASSES];
-
-	gfp_t flags;	/* allocation flags used when growing pool */
-};
-
-/*
- * A zspage's class index and fullness group
- * are encoded in its (first)page->mapping
- */
-#define CLASS_IDX_BITS	28
-#define FULLNESS_BITS	4
-#define CLASS_IDX_MASK	((1 << CLASS_IDX_BITS) - 1)
-#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
-
-/*
- * By default, zsmalloc uses a copy-based object mapping method to access
- * allocations that span two pages. However, if a particular architecture
- * performs VM mapping faster than copying, then it should be added here
- * so that USE_PGTABLE_MAPPING is defined. This causes zsmalloc to use
- * page table mapping rather than copying for object mapping.
- */
-#if defined(CONFIG_ARM) && !defined(MODULE)
-#define USE_PGTABLE_MAPPING
-#endif
-
-struct mapping_area {
-#ifdef USE_PGTABLE_MAPPING
-	struct vm_struct *vm; /* vm area for mapping object that span pages */
-#else
-	char *vm_buf; /* copy buffer for objects that span pages */
-#endif
-	char *vm_addr; /* address of kmap_atomic()'ed pages */
-	enum zs_mapmode vm_mm; /* mapping mode */
-};
-
-
-/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
-static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
-
-static int is_first_page(struct page *page)
-{
-	return PagePrivate(page);
-}
-
-static int is_last_page(struct page *page)
-{
-	return PagePrivate2(page);
-}
-
-static void get_zspage_mapping(struct page *page, unsigned int *class_idx,
-				enum fullness_group *fullness)
-{
-	unsigned long m;
-	BUG_ON(!is_first_page(page));
-
-	m = (unsigned long)page->mapping;
-	*fullness = m & FULLNESS_MASK;
-	*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
-}
-
-static void set_zspage_mapping(struct page *page, unsigned int class_idx,
-				enum fullness_group fullness)
-{
-	unsigned long m;
-	BUG_ON(!is_first_page(page));
-
-	m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
-			(fullness & FULLNESS_MASK);
-	page->mapping = (struct address_space *)m;
-}
-
-static int get_size_class_index(int size)
-{
-	int idx = 0;
-
-	if (likely(size > ZS_MIN_ALLOC_SIZE))
-		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
-				ZS_SIZE_CLASS_DELTA);
-
-	return idx;
-}
-
-static enum fullness_group get_fullness_group(struct page *page)
-{
-	int inuse, max_objects;
-	enum fullness_group fg;
-	BUG_ON(!is_first_page(page));
-
-	inuse = page->inuse;
-	max_objects = page->objects;
-
-	if (inuse == 0)
-		fg = ZS_EMPTY;
-	else if (inuse == max_objects)
-		fg = ZS_FULL;
-	else if (inuse <= max_objects / fullness_threshold_frac)
-		fg = ZS_ALMOST_EMPTY;
-	else
-		fg = ZS_ALMOST_FULL;
-
-	return fg;
-}
-
-static void insert_zspage(struct page *page, struct size_class *class,
-				enum fullness_group fullness)
-{
-	struct page **head;
-
-	BUG_ON(!is_first_page(page));
-
-	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
-		return;
-
-	head = &class->fullness_list[fullness];
-	if (*head)
-		list_add_tail(&page->lru, &(*head)->lru);
-
-	*head = page;
-}
-
-static void remove_zspage(struct page *page, struct size_class *class,
-				enum fullness_group fullness)
-{
-	struct page **head;
-
-	BUG_ON(!is_first_page(page));
-
-	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
-		return;
-
-	head = &class->fullness_list[fullness];
-	BUG_ON(!*head);
-	if (list_empty(&(*head)->lru))
-		*head = NULL;
-	else if (*head == page)
-		*head = (struct page *)list_entry((*head)->lru.next,
-					struct page, lru);
-
-	list_del_init(&page->lru);
-}
-
-static enum fullness_group fix_fullness_group(struct zs_pool *pool,
-						struct page *page)
-{
-	int class_idx;
-	struct size_class *class;
-	enum fullness_group currfg, newfg;
-
-	BUG_ON(!is_first_page(page));
-
-	get_zspage_mapping(page, &class_idx, &currfg);
-	newfg = get_fullness_group(page);
-	if (newfg == currfg)
-		goto out;
-
-	class = &pool->size_class[class_idx];
-	remove_zspage(page, class, currfg);
-	insert_zspage(page, class, newfg);
-	set_zspage_mapping(page, class_idx, newfg);
-
-out:
-	return newfg;
-}
-
-/*
- * We have to decide on how many pages to link together
- * to form a zspage for each size class. This is important
- * to reduce wastage due to unusable space left at end of
- * each zspage which is given as:
- *	wastage = Zp - Zp % size_class
- * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ...
- *
- * For example, for size class of 3/8 * PAGE_SIZE, we should
- * link together 3 PAGE_SIZE sized pages to form a zspage
- * since then we can perfectly fit in 8 such objects.
- */
-static int get_pages_per_zspage(int class_size)
-{
-	int i, max_usedpc = 0;
-	/* zspage order which gives maximum used size per KB */
-	int max_usedpc_order = 1;
-
-	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
-		int zspage_size;
-		int waste, usedpc;
-
-		zspage_size = i * PAGE_SIZE;
-		waste = zspage_size % class_size;
-		usedpc = (zspage_size - waste) * 100 / zspage_size;
-
-		if (usedpc > max_usedpc) {
-			max_usedpc = usedpc;
-			max_usedpc_order = i;
-		}
-	}
-
-	return max_usedpc_order;
-}
-
-/*
- * A single 'zspage' is composed of many system pages which are
- * linked together using fields in struct page. This function finds
- * the first/head page, given any component page of a zspage.
- */
-static struct page *get_first_page(struct page *page)
-{
-	if (is_first_page(page))
-		return page;
-	else
-		return page->first_page;
-}
-
-static struct page *get_next_page(struct page *page)
-{
-	struct page *next;
-
-	if (is_last_page(page))
-		next = NULL;
-	else if (is_first_page(page))
-		next = (struct page *)page->private;
-	else
-		next = list_entry(page->lru.next, struct page, lru);
-
-	return next;
-}
-
-/* Encode <page, obj_idx> as a single handle value */
-static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
-{
-	unsigned long handle;
-
-	if (!page) {
-		BUG_ON(obj_idx);
-		return NULL;
-	}
-
-	handle = page_to_pfn(page) << OBJ_INDEX_BITS;
-	handle |= (obj_idx & OBJ_INDEX_MASK);
-
-	return (void *)handle;
-}
-
-/* Decode <page, obj_idx> pair from the given object handle */
-static void obj_handle_to_location(unsigned long handle, struct page **page,
-				unsigned long *obj_idx)
-{
-	*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
-	*obj_idx = handle & OBJ_INDEX_MASK;
-}
-
-static unsigned long obj_idx_to_offset(struct page *page,
-				unsigned long obj_idx, int class_size)
-{
-	unsigned long off = 0;
-
-	if (!is_first_page(page))
-		off = page->index;
-
-	return off + obj_idx * class_size;
-}
-
-static void reset_page(struct page *page)
-{
-	clear_bit(PG_private, &page->flags);
-	clear_bit(PG_private_2, &page->flags);
-	set_page_private(page, 0);
-	page->mapping = NULL;
-	page->freelist = NULL;
-	page_mapcount_reset(page);
-}
-
-static void free_zspage(struct page *first_page)
-{
-	struct page *nextp, *tmp, *head_extra;
-
-	BUG_ON(!is_first_page(first_page));
-	BUG_ON(first_page->inuse);
-
-	head_extra = (struct page *)page_private(first_page);
-
-	reset_page(first_page);
-	__free_page(first_page);
-
-	/* zspage with only 1 system page */
-	if (!head_extra)
-		return;
-
-	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
-		list_del(&nextp->lru);
-		reset_page(nextp);
-		__free_page(nextp);
-	}
-	reset_page(head_extra);
-	__free_page(head_extra);
-}
-
-/* Initialize a newly allocated zspage */
-static void init_zspage(struct page *first_page, struct size_class *class)
-{
-	unsigned long off = 0;
-	struct page *page = first_page;
-
-	BUG_ON(!is_first_page(first_page));
-	while (page) {
-		struct page *next_page;
-		struct link_free *link;
-		unsigned int i, objs_on_page;
-
-		/*
-		 * page->index stores offset of first object starting
-		 * in the page. For the first page, this is always 0,
-		 * so we use first_page->index (aka ->freelist) to store
-		 * head of corresponding zspage's freelist.
-		 */
-		if (page != first_page)
-			page->index = off;
-
-		link = (struct link_free *)kmap_atomic(page) +
-						off / sizeof(*link);
-		objs_on_page = (PAGE_SIZE - off) / class->size;
-
-		for (i = 1; i <= objs_on_page; i++) {
-			off += class->size;
-			if (off < PAGE_SIZE) {
-				link->next = obj_location_to_handle(page, i);
-				link += class->size / sizeof(*link);
-			}
-		}
-
-		/*
-		 * We now come to the last (full or partial) object on this
-		 * page, which must point to the first object on the next
-		 * page (if present)
-		 */
-		next_page = get_next_page(page);
-		link->next = obj_location_to_handle(next_page, 0);
-		kunmap_atomic(link);
-		page = next_page;
-		off = (off + class->size) % PAGE_SIZE;
-	}
-}
-
-/*
- * Allocate a zspage for the given size class
- */
-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
-{
-	int i, error;
-	struct page *first_page = NULL, *uninitialized_var(prev_page);
-
-	/*
-	 * Allocate individual pages and link them together as:
-	 * 1. first page->private = first sub-page
-	 * 2. all sub-pages are linked together using page->lru
-	 * 3. each sub-page is linked to the first page using page->first_page
-	 *
-	 * For each size class, First/Head pages are linked together using
-	 * page->lru. Also, we set PG_private to identify the first page
-	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
-	 * identify the last page.
-	 */
-	error = -ENOMEM;
-	for (i = 0; i < class->pages_per_zspage; i++) {
-		struct page *page;
-
-		page = alloc_page(flags);
-		if (!page)
-			goto cleanup;
-
-		INIT_LIST_HEAD(&page->lru);
-		if (i == 0) {	/* first page */
-			SetPagePrivate(page);
-			set_page_private(page, 0);
-			first_page = page;
-			first_page->inuse = 0;
-		}
-		if (i == 1)
-			first_page->private = (unsigned long)page;
-		if (i >= 1)
-			page->first_page = first_page;
-		if (i >= 2)
-			list_add(&page->lru, &prev_page->lru);
-		if (i == class->pages_per_zspage - 1)	/* last page */
-			SetPagePrivate2(page);
-		prev_page = page;
-	}
-
-	init_zspage(first_page, class);
-
-	first_page->freelist = obj_location_to_handle(first_page, 0);
-	/* Maximum number of objects we can store in this zspage */
-	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
-
-	error = 0; /* Success */
-
-cleanup:
-	if (unlikely(error) && first_page) {
-		free_zspage(first_page);
-		first_page = NULL;
-	}
-
-	return first_page;
-}
-
-static struct page *find_get_zspage(struct size_class *class)
-{
-	int i;
-	struct page *page;
-
-	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
-		page = class->fullness_list[i];
-		if (page)
-			break;
-	}
-
-	return page;
-}
-
-#ifdef USE_PGTABLE_MAPPING
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
-	/*
-	 * Make sure we don't leak memory if a cpu UP notification
-	 * and zs_init() race and both call zs_cpu_up() on the same cpu
-	 */
-	if (area->vm)
-		return 0;
-	area->vm = alloc_vm_area(PAGE_SIZE * 2, NULL);
-	if (!area->vm)
-		return -ENOMEM;
-	return 0;
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
-	if (area->vm)
-		free_vm_area(area->vm);
-	area->vm = NULL;
-}
-
-static inline void *__zs_map_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, &pages));
-	area->vm_addr = area->vm->addr;
-	return area->vm_addr + off;
-}
-
-static inline void __zs_unmap_object(struct mapping_area *area,
-				struct page *pages[2], int off, int size)
-{
-	unsigned long addr = (unsigned long)area->vm_addr;
-
-	unmap_kernel_range(addr, PAGE_SIZE * 2);
-}
-
-#else /* USE_PGTABLE_MAPPING */
-
-static inline int __zs_cpu_up(struct mapping_area *area)
-{
-	/*
-	 * Make sure we don't leak memory if a cpu UP notification
-	 * and zs_init() race and both call zs_cpu_up() on the same cpu
-	 */
-	if (area->vm_buf)
-		return 0;
-	area->vm_buf = (char *)__get_free_page(GFP_KERNEL);
-	if (!area->vm_buf)
-		return -ENOMEM;
-	return 0;
-}
-
-static inline void __zs_cpu_down(struct mapping_area *area)
-{
-	if (area->vm_buf)
-		free_page((unsigned long)area->vm_buf);
-	area->vm_buf = NULL;
-}
-
-static void *__zs_map_object(struct mapping_area *area,
-			struct page *pages[2], int off, int size)
-{
-	int sizes[2];
-	void *addr;
-	char *buf = area->vm_buf;
-
-	/* disable page faults to match kmap_atomic() return conditions */
-	pagefault_disable();
-
-	/* no read fastpath */
-	if (area->vm_mm == ZS_MM_WO)
-		goto out;
-
-	sizes[0] = PAGE_SIZE - off;
-	sizes[1] = size - sizes[0];
-
-	/* copy object to per-cpu buffer */
-	addr = kmap_atomic(pages[0]);
-	memcpy(buf, addr + off, sizes[0]);
-	kunmap_atomic(addr);
-	addr = kmap_atomic(pages[1]);
-	memcpy(buf + sizes[0], addr, sizes[1]);
-	kunmap_atomic(addr);
-out:
-	return area->vm_buf;
-}
-
-static void __zs_unmap_object(struct mapping_area *area,
-			struct page *pages[2], int off, int size)
-{
-	int sizes[2];
-	void *addr;
-	char *buf = area->vm_buf;
-
-	/* no write fastpath */
-	if (area->vm_mm == ZS_MM_RO)
-		goto out;
-
-	sizes[0] = PAGE_SIZE - off;
-	sizes[1] = size - sizes[0];
-
-	/* copy per-cpu buffer to object */
-	addr = kmap_atomic(pages[0]);
-	memcpy(addr + off, buf, sizes[0]);
-	kunmap_atomic(addr);
-	addr = kmap_atomic(pages[1]);
-	memcpy(addr, buf + sizes[0], sizes[1]);
-	kunmap_atomic(addr);
-
-out:
-	/* enable page faults to match kunmap_atomic() return conditions */
-	pagefault_enable();
-}
-
-#endif /* USE_PGTABLE_MAPPING */
-
-static int zs_cpu_notifier(struct notifier_block *nb, unsigned long action,
-				void *pcpu)
-{
-	int ret, cpu = (long)pcpu;
-	struct mapping_area *area;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-		area = &per_cpu(zs_map_area, cpu);
-		ret = __zs_cpu_up(area);
-		if (ret)
-			return notifier_from_errno(ret);
-		break;
-	case CPU_DEAD:
-	case CPU_UP_CANCELED:
-		area = &per_cpu(zs_map_area, cpu);
-		__zs_cpu_down(area);
-		break;
-	}
-
-	return NOTIFY_OK;
-}
-
-static struct notifier_block zs_cpu_nb = {
-	.notifier_call = zs_cpu_notifier
-};
-
-static void zs_exit(void)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		zs_cpu_notifier(NULL, CPU_DEAD, (void *)(long)cpu);
-	unregister_cpu_notifier(&zs_cpu_nb);
-}
-
-static int zs_init(void)
-{
-	int cpu, ret;
-
-	register_cpu_notifier(&zs_cpu_nb);
-	for_each_online_cpu(cpu) {
-		ret = zs_cpu_notifier(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
-		if (notifier_to_errno(ret))
-			goto fail;
-	}
-	return 0;
-fail:
-	zs_exit();
-	return notifier_to_errno(ret);
-}
-
-/**
- * zs_create_pool - Creates an allocation pool to work from.
- * @flags: allocation flags used to allocate pool metadata
- *
- * This function must be called before anything when using
- * the zsmalloc allocator.
- *
- * On success, a pointer to the newly created pool is returned,
- * otherwise NULL.
- */
-struct zs_pool *zs_create_pool(gfp_t flags)
-{
-	int i, ovhd_size;
-	struct zs_pool *pool;
-
-	ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
-	pool = kzalloc(ovhd_size, GFP_KERNEL);
-	if (!pool)
-		return NULL;
-
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
-		int size;
-		struct size_class *class;
-
-		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
-		if (size > ZS_MAX_ALLOC_SIZE)
-			size = ZS_MAX_ALLOC_SIZE;
-
-		class = &pool->size_class[i];
-		class->size = size;
-		class->index = i;
-		spin_lock_init(&class->lock);
-		class->pages_per_zspage = get_pages_per_zspage(size);
-
-	}
-
-	pool->flags = flags;
-
-	return pool;
-}
-EXPORT_SYMBOL_GPL(zs_create_pool);
-
-void zs_destroy_pool(struct zs_pool *pool)
-{
-	int i;
-
-	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
-		int fg;
-		struct size_class *class = &pool->size_class[i];
-
-		for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) {
-			if (class->fullness_list[fg]) {
-				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
-					class->size, fg);
-			}
-		}
-	}
-	kfree(pool);
-}
-EXPORT_SYMBOL_GPL(zs_destroy_pool);
-
-/**
- * zs_malloc - Allocate block of given size from pool.
- * @pool: pool to allocate from
- * @size: size of block to allocate
- *
- * On success, handle to the allocated object is returned,
- * otherwise 0.
- * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
- */
-unsigned long zs_malloc(struct zs_pool *pool, size_t size)
-{
-	unsigned long obj;
-	struct link_free *link;
-	int class_idx;
-	struct size_class *class;
-
-	struct page *first_page, *m_page;
-	unsigned long m_objidx, m_offset;
-
-	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
-		return 0;
-
-	class_idx = get_size_class_index(size);
-	class = &pool->size_class[class_idx];
-	BUG_ON(class_idx != class->index);
-
-	spin_lock(&class->lock);
-	first_page = find_get_zspage(class);
-
-	if (!first_page) {
-		spin_unlock(&class->lock);
-		first_page = alloc_zspage(class, pool->flags);
-		if (unlikely(!first_page))
-			return 0;
-
-		set_zspage_mapping(first_page, class->index, ZS_EMPTY);
-		spin_lock(&class->lock);
-		class->pages_allocated += class->pages_per_zspage;
-	}
-
-	obj = (unsigned long)first_page->freelist;
-	obj_handle_to_location(obj, &m_page, &m_objidx);
-	m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
-
-	link = (struct link_free *)kmap_atomic(m_page) +
-					m_offset / sizeof(*link);
-	first_page->freelist = link->next;
-	memset(link, POISON_INUSE, sizeof(*link));
-	kunmap_atomic(link);
-
-	first_page->inuse++;
-	/* Now move the zspage to another fullness group, if required */
-	fix_fullness_group(pool, first_page);
-	spin_unlock(&class->lock);
-
-	return obj;
-}
-EXPORT_SYMBOL_GPL(zs_malloc);
-
-void zs_free(struct zs_pool *pool, unsigned long obj)
-{
-	struct link_free *link;
-	struct page *first_page, *f_page;
-	unsigned long f_objidx, f_offset;
-
-	int class_idx;
-	struct size_class *class;
-	enum fullness_group fullness;
-
-	if (unlikely(!obj))
-		return;
-
-	obj_handle_to_location(obj, &f_page, &f_objidx);
-	first_page = get_first_page(f_page);
-
-	get_zspage_mapping(first_page, &class_idx, &fullness);
-	class = &pool->size_class[class_idx];
-	f_offset = obj_idx_to_offset(f_page, f_objidx, class->size);
-
-	spin_lock(&class->lock);
-
-	/* Insert this object in containing zspage's freelist */
-	link = (struct link_free *)((unsigned char *)kmap_atomic(f_page)
-							+ f_offset);
-	link->next = first_page->freelist;
-	kunmap_atomic(link);
-	first_page->freelist = (void *)obj;
-
-	first_page->inuse--;
-	fullness = fix_fullness_group(pool, first_page);
-
-	if (fullness == ZS_EMPTY)
-		class->pages_allocated -= class->pages_per_zspage;
-
-	spin_unlock(&class->lock);
-
-	if (fullness == ZS_EMPTY)
-		free_zspage(first_page);
-}
-EXPORT_SYMBOL_GPL(zs_free);
-
-/**
- * zs_map_object - get address of allocated object from handle.
- * @pool: pool from which the object was allocated
- * @handle: handle returned from zs_malloc
- *
- * Before using an object allocated from zs_malloc, it must be mapped using
- * this function. When done with the object, it must be unmapped using
- * zs_unmap_object.
- *
- * Only one object can be mapped per cpu at a time. There is no protection
- * against nested mappings.
- *
- * This function returns with preemption and page faults disabled.
- */
-void *zs_map_object(struct zs_pool *pool, unsigned long handle,
-			enum zs_mapmode mm)
-{
-	struct page *page;
-	unsigned long obj_idx, off;
-
-	unsigned int class_idx;
-	enum fullness_group fg;
-	struct size_class *class;
-	struct mapping_area *area;
-	struct page *pages[2];
-
-	BUG_ON(!handle);
-
-	/*
-	 * Because we use per-cpu mapping areas shared among the
-	 * pools/users, we can't allow mapping in interrupt context
-	 * because it can corrupt another users mappings.
-	 */
-	BUG_ON(in_interrupt());
-
-	obj_handle_to_location(handle, &page, &obj_idx);
-	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
-	class = &pool->size_class[class_idx];
-	off = obj_idx_to_offset(page, obj_idx, class->size);
-
-	area = &get_cpu_var(zs_map_area);
-	area->vm_mm = mm;
-	if (off + class->size <= PAGE_SIZE) {
-		/* this object is contained entirely within a page */
-		area->vm_addr = kmap_atomic(page);
-		return area->vm_addr + off;
-	}
-
-	/* this object spans two pages */
-	pages[0] = page;
-	pages[1] = get_next_page(page);
-	BUG_ON(!pages[1]);
-
-	return __zs_map_object(area, pages, off, class->size);
-}
-EXPORT_SYMBOL_GPL(zs_map_object);
-
-void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
-{
-	struct page *page;
-	unsigned long obj_idx, off;
-
-	unsigned int class_idx;
-	enum fullness_group fg;
-	struct size_class *class;
-	struct mapping_area *area;
-
-	BUG_ON(!handle);
-
-	obj_handle_to_location(handle, &page, &obj_idx);
-	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
-	class = &pool->size_class[class_idx];
-	off = obj_idx_to_offset(page, obj_idx, class->size);
-
-	area = &__get_cpu_var(zs_map_area);
-	if (off + class->size <= PAGE_SIZE)
-		kunmap_atomic(area->vm_addr);
-	else {
-		struct page *pages[2];
-
-		pages[0] = page;
-		pages[1] = get_next_page(page);
-		BUG_ON(!pages[1]);
-
-		__zs_unmap_object(area, pages, off, class->size);
-	}
-	put_cpu_var(zs_map_area);
-}
-EXPORT_SYMBOL_GPL(zs_unmap_object);
-
-u64 zs_get_total_size_bytes(struct zs_pool *pool)
-{
-	int i;
-	u64 npages = 0;
-
-	for (i = 0; i < ZS_SIZE_CLASSES; i++)
-		npages += pool->size_class[i].pages_allocated;
-
-	return npages << PAGE_SHIFT;
-}
-EXPORT_SYMBOL_GPL(zs_get_total_size_bytes);
-
-module_init(zs_init);
-module_exit(zs_exit);
-
-MODULE_LICENSE("Dual BSD/GPL");
-MODULE_AUTHOR("Nitin Gupta ");
diff --git a/drivers/staging/zsmalloc/zsmalloc.h b/drivers/staging/zsmalloc/zsmalloc.h
deleted file mode 100644
index fbe6bec..0000000
--- a/drivers/staging/zsmalloc/zsmalloc.h
+++ /dev/null
@@ -1,43 +0,0 @@
-/*
- * zsmalloc memory allocator
- *
- * Copyright (C) 2011 Nitin Gupta
- *
- * This code is released using a dual license strategy: BSD/GPL
- * You can choose the license that better fits your requirements.
- *
- * Released under the terms of 3-clause BSD License
- * Released under the terms of GNU General Public License Version 2.0
- */
-
-#ifndef _ZS_MALLOC_H_
-#define _ZS_MALLOC_H_
-
-#include
-
-/*
- * zsmalloc mapping modes
- *
- * NOTE: These only make a difference when a mapped object spans pages
- */
-enum zs_mapmode {
-	ZS_MM_RW, /* normal read-write mapping */
-	ZS_MM_RO, /* read-only (no copy-out at unmap time) */
-	ZS_MM_WO /* write-only (no copy-in at map time) */
-};
-
-struct zs_pool;
-
-struct zs_pool *zs_create_pool(gfp_t flags);
-void zs_destroy_pool(struct zs_pool *pool);
-
-unsigned long zs_malloc(struct zs_pool *pool, size_t size);
-void zs_free(struct zs_pool *pool, unsigned long obj);
-
-void *zs_map_object(struct zs_pool *pool, unsigned long handle,
-			enum zs_mapmode mm);
-void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
-
-u64 zs_get_total_size_bytes(struct zs_pool *pool);
-
-#endif
-- 
1.7.10.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755471Ab3HRIlt (ORCPT ); Sun, 18 Aug 2013 04:41:49 -0400
Received: from mail-pb0-f42.google.com ([209.85.160.42]:37995 "EHLO mail-pb0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754294Ab3HRIlr (ORCPT ); Sun, 18 Aug 2013 04:41:47 -0400
From: Bob Liu
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com,
kmpark@infradead.org, Bob Liu
Subject: [PATCH 2/4] mm: promote zsmalloc to mm/
Date: Sun, 18 Aug 2013 16:40:47 +0800
Message-Id: <1376815249-6611-3-git-send-email-bob.liu@oracle.com>
X-Mailer: git-send-email 1.7.10.4
In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

zsmalloc is a new slab-based memory allocator for storing compressed
pages. It is designed for low fragmentation and a high allocation
success rate on large, but <= PAGE_SIZE, allocations.

zsmalloc differs from the kernel slab allocator in two primary ways to
achieve these design goals.

zsmalloc never requires high order page allocations to back slabs, or
"size classes" in zsmalloc terms. Instead it allows multiple 0-order
(single) pages to be stitched together into a "zspage" which backs the
slab. This allows for a higher allocation success rate under memory
pressure.

Also, zsmalloc allows objects to span page boundaries within the
zspage. This allows for lower fragmentation than could be had with the
kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.
With the kernel slab allocator, if a page compresses to 60% of its
original size, the memory savings gained through compression are lost
to fragmentation because another object of the same size can't be
stored in the leftover space.

This ability to span pages results in zsmalloc allocations not being
directly addressable by the user. The user is given a
non-dereferenceable handle in response to an allocation request. That
handle must be mapped, using zs_map_object(), which returns a pointer
to the mapped region that can be used. The mapping is necessary since
the object data may reside in two different noncontiguous pages.
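The fragmentation argument above can be checked with a small user-space
sketch. It mirrors the arithmetic of get_size_class_index() and
get_pages_per_zspage() from mm/zsmalloc.c in this patch, but it is not the
kernel code itself: the constants assume 4 KiB pages, ZS_MAX_ZSPAGE_ORDER = 2
and a ZS_MIN_ALLOC_SIZE of 32, and the names size_class_for(),
pages_per_zspage() and objects_per_zspage() are hypothetical helpers
introduced only for this illustration.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages and ZS_MAX_ZSPAGE_ORDER = 2. */
#define PAGE_SIZE_BYTES      4096
#define MAX_PAGES_PER_ZSPAGE 4
/* Assumed: ZS_MIN_ALLOC_SIZE evaluates to 32; the delta is PAGE_SIZE >> 8. */
#define MIN_ALLOC_SIZE       32
#define SIZE_CLASS_DELTA     (PAGE_SIZE_BYTES >> 8)

/* Round a request up to its size class, following get_size_class_index(). */
static int size_class_for(int size)
{
	int idx = 0;

	assert(size > 0 && size <= PAGE_SIZE_BYTES);
	if (size > MIN_ALLOC_SIZE)
		idx = (size - MIN_ALLOC_SIZE + SIZE_CLASS_DELTA - 1) /
		      SIZE_CLASS_DELTA;
	return MIN_ALLOC_SIZE + idx * SIZE_CLASS_DELTA;
}

/*
 * Same search as get_pages_per_zspage(): try 1..MAX pages and keep the
 * first page count that yields the highest used percentage.
 */
static int pages_per_zspage(int class_size)
{
	int i, best = 1, best_usedpc = 0;

	for (i = 1; i <= MAX_PAGES_PER_ZSPAGE; i++) {
		int zspage_size = i * PAGE_SIZE_BYTES;
		int waste = zspage_size % class_size;
		int usedpc = (zspage_size - waste) * 100 / zspage_size;

		if (usedpc > best_usedpc) {
			best_usedpc = usedpc;
			best = i;
		}
	}
	return best;
}

/* Number of objects that fit in one zspage built for this class. */
static int objects_per_zspage(int class_size)
{
	return pages_per_zspage(class_size) * PAGE_SIZE_BYTES / class_size;
}
```

Under these assumptions, a 2458-byte request (about 60% of a 4096-byte
page) lands in the 2464-byte class, which is backed by 2 pages holding
3 objects. An allocator that cannot let objects span pages would store
only one such object per page, so the same 3 objects would need 3 pages.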
Signed-off-by: Bob Liu --- include/linux/zsmalloc.h | 43 ++ mm/Kconfig | 35 +- mm/Makefile | 1 + mm/zsmalloc.c | 1063 ++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 1131 insertions(+), 11 deletions(-) create mode 100644 include/linux/zsmalloc.h create mode 100644 mm/zsmalloc.c diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h new file mode 100644 index 0000000..fbe6bec --- /dev/null +++ b/include/linux/zsmalloc.h @@ -0,0 +1,43 @@ +/* + * zsmalloc memory allocator + * + * Copyright (C) 2011 Nitin Gupta + * + * This code is released using a dual license strategy: BSD/GPL + * You can choose the license that better fits your requirements. + * + * Released under the terms of 3-clause BSD License + * Released under the terms of GNU General Public License Version 2.0 + */ + +#ifndef _ZS_MALLOC_H_ +#define _ZS_MALLOC_H_ + +#include + +/* + * zsmalloc mapping modes + * + * NOTE: These only make a difference when a mapped object spans pages + */ +enum zs_mapmode { + ZS_MM_RW, /* normal read-write mapping */ + ZS_MM_RO, /* read-only (no copy-out at unmap time) */ + ZS_MM_WO /* write-only (no copy-in at map time) */ +}; + +struct zs_pool; + +struct zs_pool *zs_create_pool(gfp_t flags); +void zs_destroy_pool(struct zs_pool *pool); + +unsigned long zs_malloc(struct zs_pool *pool, size_t size); +void zs_free(struct zs_pool *pool, unsigned long obj); + +void *zs_map_object(struct zs_pool *pool, unsigned long handle, + enum zs_mapmode mm); +void zs_unmap_object(struct zs_pool *pool, unsigned long handle); + +u64 zs_get_total_size_bytes(struct zs_pool *pool); + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index 8028dcc..48d1786 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -478,21 +478,10 @@ config FRONTSWAP If unsure, say Y to enable frontswap. -config ZBUD - tristate - default n - help - A special purpose allocator for storing compressed pages. - It is designed to store up to two compressed pages per physical - page. 
While this design limits storage density, it has simple and - deterministic reclaim properties that make it preferable to a higher - density approach when reclaim will be used. - config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" depends on FRONTSWAP && CRYPTO=y select CRYPTO_LZO - select ZBUD default n help A lightweight compressed cache for swap pages. It takes @@ -508,6 +497,30 @@ config ZSWAP they have not be fully explored on the large set of potential configurations and workloads that exist. +choice + prompt "Select memory allocator for compressed pages" + depends on ZSWAP + default ZBUD + + config ZBUD + bool "zbud" + help + A special purpose allocator for storing compressed pages. + It is designed to store up to two compressed pages per physical + page. While this design limits storage density, it has simple and + deterministic reclaim properties that make it preferable to a higher + density approach when reclaim will be used. + + config ZSMALLOC + bool "zsmalloc" + help + zsmalloc is a slab-based memory allocator designed to store + compressed RAM pages. zsmalloc uses virtual memory mapping + in order to reduce fragmentation and has high compression density. + However, this results in unpredictable performance characteristics + when reclaiming a single page. 
+endchoice + config MEM_SOFT_DIRTY bool "Track memory changes" depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY diff --git a/mm/Makefile b/mm/Makefile index f008033..7d11958 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -60,3 +60,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o obj-$(CONFIG_ZBUD) += zbud.o +obj-$(CONFIG_ZSMALLOC) += zsmalloc.o diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c new file mode 100644 index 0000000..4bb275b --- /dev/null +++ b/mm/zsmalloc.c @@ -0,0 +1,1063 @@ +/* + * zsmalloc memory allocator + * + * Copyright (C) 2011 Nitin Gupta + * + * This code is released using a dual license strategy: BSD/GPL + * You can choose the license that better fits your requirements. + * + * Released under the terms of 3-clause BSD License + * Released under the terms of GNU General Public License Version 2.0 + */ + + +/* + * This allocator is designed for use with zcache and zram. Thus, the + * allocator is supposed to work well under low memory conditions. In + * particular, it never attempts higher order page allocation which is + * very likely to fail under memory pressure. On the other hand, if we + * just use single (0-order) pages, it would suffer from very high + * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy + * an entire page. This was one of the major issues with its predecessor + * (xvmalloc). + * + * To overcome these issues, zsmalloc allocates a bunch of 0-order pages + * and links them together using various 'struct page' fields. These linked + * pages act as a single higher-order page i.e. an object can span 0-order + * page boundaries. The code refers to these linked pages as a single entity + * called zspage. + * + * Following is how we use various fields and flags of underlying + * struct page(s) to form a zspage. 
+ * + * Usage of struct page fields: + * page->first_page: points to the first component (0-order) page + * page->index (union with page->freelist): offset of the first object + * starting in this page. For the first page, this is + * always 0, so we use this field (aka freelist) to point + * to the first free object in zspage. + * page->lru: links together all component pages (except the first page) + * of a zspage + * + * For _first_ page only: + * + * page->private (union with page->first_page): refers to the + * component page after the first page + * page->freelist: points to the first free object in zspage. + * Free objects are linked together using in-place + * metadata. + * page->objects: maximum number of objects we can store in this + * zspage (class->pages_per_zspage * PAGE_SIZE / class->size) + * page->lru: links together first pages of various zspages. + * Basically forming list of zspages in a fullness group. + * page->mapping: class index and fullness group of the zspage + * + * Usage of struct page flags: + * PG_private: identifies the first component page + * PG_private_2: identifies the last component page + * + */ + +#ifdef CONFIG_ZSMALLOC_DEBUG +#define DEBUG +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include <linux/zsmalloc.h> + +/* + * This must be power of 2 and greater than or equal to sizeof(link_free). + * These two conditions ensure that any 'struct link_free' itself doesn't + * span more than 1 page which avoids complex case of mapping 2 pages simply + * to restore link_free pointer values. + */ +#define ZS_ALIGN 8 + +/* + * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single) + * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N. 
+ */ +#define ZS_MAX_ZSPAGE_ORDER 2 +#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER) + +/* + * Object location (<PFN>, <obj_idx>) is encoded as + * a single (void *) handle value. + * + * Note that object index is relative to system + * page it is stored in, so for each sub-page belonging + * to a zspage, obj_idx starts with 0. + * + * This is made more complicated by various memory models and PAE. + */ + +#ifndef MAX_PHYSMEM_BITS +#ifdef CONFIG_HIGHMEM64G +#define MAX_PHYSMEM_BITS 36 +#else /* !CONFIG_HIGHMEM64G */ +/* + * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just + * be PAGE_SHIFT + */ +#define MAX_PHYSMEM_BITS BITS_PER_LONG +#endif +#endif +#define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) +#define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS) +#define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1) + +#define MAX(a, b) ((a) >= (b) ? (a) : (b)) +/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */ +#define ZS_MIN_ALLOC_SIZE \ + MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS)) +#define ZS_MAX_ALLOC_SIZE PAGE_SIZE + +/* + * On systems with 4K page size, this gives 254 size classes! There is a + * trade-off here: + * - Large number of size classes is potentially wasteful as free pages are + * spread across these classes + * - Small number of size classes causes large internal fragmentation + * - Probably it's better to use specific size classes (empirically + * determined). NOTE: all those class sizes must be set as multiple of + * ZS_ALIGN to make sure link_free itself never has to span 2 pages. 
+ * + * ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN + * (reason above) + */ +#define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> 8) +#define ZS_SIZE_CLASSES ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \ + ZS_SIZE_CLASS_DELTA + 1) + +/* + * We do not maintain any list for completely empty or full pages + */ +enum fullness_group { + ZS_ALMOST_FULL, + ZS_ALMOST_EMPTY, + _ZS_NR_FULLNESS_GROUPS, + + ZS_EMPTY, + ZS_FULL +}; + +/* + * We assign a page to ZS_ALMOST_EMPTY fullness group when: + * n <= N / f, where + * n = number of allocated objects + * N = total number of objects zspage can store + * f = 1/fullness_threshold_frac + * + * Similarly, we assign zspage to: + * ZS_ALMOST_FULL when n > N / f + * ZS_EMPTY when n == 0 + * ZS_FULL when n == N + * + * (see: fix_fullness_group()) + */ +static const int fullness_threshold_frac = 4; + +struct size_class { + /* + * Size of objects stored in this class. Must be multiple + * of ZS_ALIGN. + */ + int size; + unsigned int index; + + /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ + int pages_per_zspage; + + spinlock_t lock; + + /* stats */ + u64 pages_allocated; + + struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; +}; + +/* + * Placed within free objects to form a singly linked list. + * For every zspage, first_page->freelist gives head of this list. 
+ * + * This must be power of 2 and less than or equal to ZS_ALIGN + */ +struct link_free { + /* Handle of next free chunk (encodes ) */ + void *next; +}; + +struct zs_pool { + struct size_class size_class[ZS_SIZE_CLASSES]; + + gfp_t flags; /* allocation flags used when growing pool */ +}; + +/* + * A zspage's class index and fullness group + * are encoded in its (first)page->mapping + */ +#define CLASS_IDX_BITS 28 +#define FULLNESS_BITS 4 +#define CLASS_IDX_MASK ((1 << CLASS_IDX_BITS) - 1) +#define FULLNESS_MASK ((1 << FULLNESS_BITS) - 1) + +/* + * By default, zsmalloc uses a copy-based object mapping method to access + * allocations that span two pages. However, if a particular architecture + * performs VM mapping faster than copying, then it should be added here + * so that USE_PGTABLE_MAPPING is defined. This causes zsmalloc to use + * page table mapping rather than copying for object mapping. + */ +#if defined(CONFIG_ARM) && !defined(MODULE) +#define USE_PGTABLE_MAPPING +#endif + +struct mapping_area { +#ifdef USE_PGTABLE_MAPPING + struct vm_struct *vm; /* vm area for mapping object that span pages */ +#else + char *vm_buf; /* copy buffer for objects that span pages */ +#endif + char *vm_addr; /* address of kmap_atomic()'ed pages */ + enum zs_mapmode vm_mm; /* mapping mode */ +}; + + +/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */ +static DEFINE_PER_CPU(struct mapping_area, zs_map_area); + +static int is_first_page(struct page *page) +{ + return PagePrivate(page); +} + +static int is_last_page(struct page *page) +{ + return PagePrivate2(page); +} + +static void get_zspage_mapping(struct page *page, unsigned int *class_idx, + enum fullness_group *fullness) +{ + unsigned long m; + BUG_ON(!is_first_page(page)); + + m = (unsigned long)page->mapping; + *fullness = m & FULLNESS_MASK; + *class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK; +} + +static void set_zspage_mapping(struct page *page, unsigned int class_idx, + enum 
fullness_group fullness) +{ + unsigned long m; + BUG_ON(!is_first_page(page)); + + m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) | + (fullness & FULLNESS_MASK); + page->mapping = (struct address_space *)m; +} + +static int get_size_class_index(int size) +{ + int idx = 0; + + if (likely(size > ZS_MIN_ALLOC_SIZE)) + idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE, + ZS_SIZE_CLASS_DELTA); + + return idx; +} + +static enum fullness_group get_fullness_group(struct page *page) +{ + int inuse, max_objects; + enum fullness_group fg; + BUG_ON(!is_first_page(page)); + + inuse = page->inuse; + max_objects = page->objects; + + if (inuse == 0) + fg = ZS_EMPTY; + else if (inuse == max_objects) + fg = ZS_FULL; + else if (inuse <= max_objects / fullness_threshold_frac) + fg = ZS_ALMOST_EMPTY; + else + fg = ZS_ALMOST_FULL; + + return fg; +} + +static void insert_zspage(struct page *page, struct size_class *class, + enum fullness_group fullness) +{ + struct page **head; + + BUG_ON(!is_first_page(page)); + + if (fullness >= _ZS_NR_FULLNESS_GROUPS) + return; + + head = &class->fullness_list[fullness]; + if (*head) + list_add_tail(&page->lru, &(*head)->lru); + + *head = page; +} + +static void remove_zspage(struct page *page, struct size_class *class, + enum fullness_group fullness) +{ + struct page **head; + + BUG_ON(!is_first_page(page)); + + if (fullness >= _ZS_NR_FULLNESS_GROUPS) + return; + + head = &class->fullness_list[fullness]; + BUG_ON(!*head); + if (list_empty(&(*head)->lru)) + *head = NULL; + else if (*head == page) + *head = (struct page *)list_entry((*head)->lru.next, + struct page, lru); + + list_del_init(&page->lru); +} + +static enum fullness_group fix_fullness_group(struct zs_pool *pool, + struct page *page) +{ + int class_idx; + struct size_class *class; + enum fullness_group currfg, newfg; + + BUG_ON(!is_first_page(page)); + + get_zspage_mapping(page, &class_idx, &currfg); + newfg = get_fullness_group(page); + if (newfg == currfg) + goto out; + + class = 
&pool->size_class[class_idx]; + remove_zspage(page, class, currfg); + insert_zspage(page, class, newfg); + set_zspage_mapping(page, class_idx, newfg); + +out: + return newfg; +} + +/* + * We have to decide on how many pages to link together + * to form a zspage for each size class. This is important + * to reduce wastage due to unusable space left at end of + * each zspage which is given as: + * wastage = Zp % class_size + * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ... + * + * For example, for size class of 3/8 * PAGE_SIZE, we should + * link together 3 PAGE_SIZE sized pages to form a zspage + * since then we can perfectly fit in 8 such objects. + */ +static int get_pages_per_zspage(int class_size) +{ + int i, max_usedpc = 0; + /* number of pages which gives maximum used percentage */ + int max_usedpc_order = 1; + + for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) { + int zspage_size; + int waste, usedpc; + + zspage_size = i * PAGE_SIZE; + waste = zspage_size % class_size; + usedpc = (zspage_size - waste) * 100 / zspage_size; + + if (usedpc > max_usedpc) { + max_usedpc = usedpc; + max_usedpc_order = i; + } + } + + return max_usedpc_order; +} + +/* + * A single 'zspage' is composed of many system pages which are + * linked together using fields in struct page. This function finds + * the first/head page, given any component page of a zspage. 
+ */ +static struct page *get_first_page(struct page *page) +{ + if (is_first_page(page)) + return page; + else + return page->first_page; +} + +static struct page *get_next_page(struct page *page) +{ + struct page *next; + + if (is_last_page(page)) + next = NULL; + else if (is_first_page(page)) + next = (struct page *)page->private; + else + next = list_entry(page->lru.next, struct page, lru); + + return next; +} + +/* Encode as a single handle value */ +static void *obj_location_to_handle(struct page *page, unsigned long obj_idx) +{ + unsigned long handle; + + if (!page) { + BUG_ON(obj_idx); + return NULL; + } + + handle = page_to_pfn(page) << OBJ_INDEX_BITS; + handle |= (obj_idx & OBJ_INDEX_MASK); + + return (void *)handle; +} + +/* Decode pair from the given object handle */ +static void obj_handle_to_location(unsigned long handle, struct page **page, + unsigned long *obj_idx) +{ + *page = pfn_to_page(handle >> OBJ_INDEX_BITS); + *obj_idx = handle & OBJ_INDEX_MASK; +} + +static unsigned long obj_idx_to_offset(struct page *page, + unsigned long obj_idx, int class_size) +{ + unsigned long off = 0; + + if (!is_first_page(page)) + off = page->index; + + return off + obj_idx * class_size; +} + +static void reset_page(struct page *page) +{ + clear_bit(PG_private, &page->flags); + clear_bit(PG_private_2, &page->flags); + set_page_private(page, 0); + page->mapping = NULL; + page->freelist = NULL; + page_mapcount_reset(page); +} + +static void free_zspage(struct page *first_page) +{ + struct page *nextp, *tmp, *head_extra; + + BUG_ON(!is_first_page(first_page)); + BUG_ON(first_page->inuse); + + head_extra = (struct page *)page_private(first_page); + + reset_page(first_page); + __free_page(first_page); + + /* zspage with only 1 system page */ + if (!head_extra) + return; + + list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) { + list_del(&nextp->lru); + reset_page(nextp); + __free_page(nextp); + } + reset_page(head_extra); + __free_page(head_extra); +} + +/* 
Initialize a newly allocated zspage */ +static void init_zspage(struct page *first_page, struct size_class *class) +{ + unsigned long off = 0; + struct page *page = first_page; + + BUG_ON(!is_first_page(first_page)); + while (page) { + struct page *next_page; + struct link_free *link; + unsigned int i, objs_on_page; + + /* + * page->index stores offset of first object starting + * in the page. For the first page, this is always 0, + * so we use first_page->index (aka ->freelist) to store + * head of corresponding zspage's freelist. + */ + if (page != first_page) + page->index = off; + + link = (struct link_free *)kmap_atomic(page) + + off / sizeof(*link); + objs_on_page = (PAGE_SIZE - off) / class->size; + + for (i = 1; i <= objs_on_page; i++) { + off += class->size; + if (off < PAGE_SIZE) { + link->next = obj_location_to_handle(page, i); + link += class->size / sizeof(*link); + } + } + + /* + * We now come to the last (full or partial) object on this + * page, which must point to the first object on the next + * page (if present) + */ + next_page = get_next_page(page); + link->next = obj_location_to_handle(next_page, 0); + kunmap_atomic(link); + page = next_page; + off = (off + class->size) % PAGE_SIZE; + } +} + +/* + * Allocate a zspage for the given size class + */ +static struct page *alloc_zspage(struct size_class *class, gfp_t flags) +{ + int i, error; + struct page *first_page = NULL, *uninitialized_var(prev_page); + + /* + * Allocate individual pages and link them together as: + * 1. first page->private = first sub-page + * 2. all sub-pages are linked together using page->lru + * 3. each sub-page is linked to the first page using page->first_page + * + * For each size class, First/Head pages are linked together using + * page->lru. Also, we set PG_private to identify the first page + * (i.e. no other sub-page has this flag set) and PG_private_2 to + * identify the last page. 
+ */ + error = -ENOMEM; + for (i = 0; i < class->pages_per_zspage; i++) { + struct page *page; + + page = alloc_page(flags); + if (!page) + goto cleanup; + + INIT_LIST_HEAD(&page->lru); + if (i == 0) { /* first page */ + SetPagePrivate(page); + set_page_private(page, 0); + first_page = page; + first_page->inuse = 0; + } + if (i == 1) + first_page->private = (unsigned long)page; + if (i >= 1) + page->first_page = first_page; + if (i >= 2) + list_add(&page->lru, &prev_page->lru); + if (i == class->pages_per_zspage - 1) /* last page */ + SetPagePrivate2(page); + prev_page = page; + } + + init_zspage(first_page, class); + + first_page->freelist = obj_location_to_handle(first_page, 0); + /* Maximum number of objects we can store in this zspage */ + first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size; + + error = 0; /* Success */ + +cleanup: + if (unlikely(error) && first_page) { + free_zspage(first_page); + first_page = NULL; + } + + return first_page; +} + +static struct page *find_get_zspage(struct size_class *class) +{ + int i; + struct page *page; + + for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) { + page = class->fullness_list[i]; + if (page) + break; + } + + return page; +} + +#ifdef USE_PGTABLE_MAPPING +static inline int __zs_cpu_up(struct mapping_area *area) +{ + /* + * Make sure we don't leak memory if a cpu UP notification + * and zs_init() race and both call zs_cpu_up() on the same cpu + */ + if (area->vm) + return 0; + area->vm = alloc_vm_area(PAGE_SIZE * 2, NULL); + if (!area->vm) + return -ENOMEM; + return 0; +} + +static inline void __zs_cpu_down(struct mapping_area *area) +{ + if (area->vm) + free_vm_area(area->vm); + area->vm = NULL; +} + +static inline void *__zs_map_object(struct mapping_area *area, + struct page *pages[2], int off, int size) +{ + BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, &pages)); + area->vm_addr = area->vm->addr; + return area->vm_addr + off; +} + +static inline void __zs_unmap_object(struct mapping_area *area, + 
struct page *pages[2], int off, int size) +{ + unsigned long addr = (unsigned long)area->vm_addr; + + unmap_kernel_range(addr, PAGE_SIZE * 2); +} + +#else /* USE_PGTABLE_MAPPING */ + +static inline int __zs_cpu_up(struct mapping_area *area) +{ + /* + * Make sure we don't leak memory if a cpu UP notification + * and zs_init() race and both call zs_cpu_up() on the same cpu + */ + if (area->vm_buf) + return 0; + area->vm_buf = (char *)__get_free_page(GFP_KERNEL); + if (!area->vm_buf) + return -ENOMEM; + return 0; +} + +static inline void __zs_cpu_down(struct mapping_area *area) +{ + if (area->vm_buf) + free_page((unsigned long)area->vm_buf); + area->vm_buf = NULL; +} + +static void *__zs_map_object(struct mapping_area *area, + struct page *pages[2], int off, int size) +{ + int sizes[2]; + void *addr; + char *buf = area->vm_buf; + + /* disable page faults to match kmap_atomic() return conditions */ + pagefault_disable(); + + /* no read fastpath */ + if (area->vm_mm == ZS_MM_WO) + goto out; + + sizes[0] = PAGE_SIZE - off; + sizes[1] = size - sizes[0]; + + /* copy object to per-cpu buffer */ + addr = kmap_atomic(pages[0]); + memcpy(buf, addr + off, sizes[0]); + kunmap_atomic(addr); + addr = kmap_atomic(pages[1]); + memcpy(buf + sizes[0], addr, sizes[1]); + kunmap_atomic(addr); +out: + return area->vm_buf; +} + +static void __zs_unmap_object(struct mapping_area *area, + struct page *pages[2], int off, int size) +{ + int sizes[2]; + void *addr; + char *buf = area->vm_buf; + + /* no write fastpath */ + if (area->vm_mm == ZS_MM_RO) + goto out; + + sizes[0] = PAGE_SIZE - off; + sizes[1] = size - sizes[0]; + + /* copy per-cpu buffer to object */ + addr = kmap_atomic(pages[0]); + memcpy(addr + off, buf, sizes[0]); + kunmap_atomic(addr); + addr = kmap_atomic(pages[1]); + memcpy(addr, buf + sizes[0], sizes[1]); + kunmap_atomic(addr); + +out: + /* enable page faults to match kunmap_atomic() return conditions */ + pagefault_enable(); +} + +#endif /* USE_PGTABLE_MAPPING */ + +static 
int zs_cpu_notifier(struct notifier_block *nb, unsigned long action, + void *pcpu) +{ + int ret, cpu = (long)pcpu; + struct mapping_area *area; + + switch (action) { + case CPU_UP_PREPARE: + area = &per_cpu(zs_map_area, cpu); + ret = __zs_cpu_up(area); + if (ret) + return notifier_from_errno(ret); + break; + case CPU_DEAD: + case CPU_UP_CANCELED: + area = &per_cpu(zs_map_area, cpu); + __zs_cpu_down(area); + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block zs_cpu_nb = { + .notifier_call = zs_cpu_notifier +}; + +static void zs_exit(void) +{ + int cpu; + + for_each_online_cpu(cpu) + zs_cpu_notifier(NULL, CPU_DEAD, (void *)(long)cpu); + unregister_cpu_notifier(&zs_cpu_nb); +} + +static int zs_init(void) +{ + int cpu, ret; + + register_cpu_notifier(&zs_cpu_nb); + for_each_online_cpu(cpu) { + ret = zs_cpu_notifier(NULL, CPU_UP_PREPARE, (void *)(long)cpu); + if (notifier_to_errno(ret)) + goto fail; + } + return 0; +fail: + zs_exit(); + return notifier_to_errno(ret); +} + +/** + * zs_create_pool - Creates an allocation pool to work from. + * @flags: allocation flags used to allocate pool metadata + * + * This function must be called before anything when using + * the zsmalloc allocator. + * + * On success, a pointer to the newly created pool is returned, + * otherwise NULL. 
+ */ +struct zs_pool *zs_create_pool(gfp_t flags) +{ + int i, ovhd_size; + struct zs_pool *pool; + + ovhd_size = roundup(sizeof(*pool), PAGE_SIZE); + pool = kzalloc(ovhd_size, GFP_KERNEL); + if (!pool) + return NULL; + + for (i = 0; i < ZS_SIZE_CLASSES; i++) { + int size; + struct size_class *class; + + size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA; + if (size > ZS_MAX_ALLOC_SIZE) + size = ZS_MAX_ALLOC_SIZE; + + class = &pool->size_class[i]; + class->size = size; + class->index = i; + spin_lock_init(&class->lock); + class->pages_per_zspage = get_pages_per_zspage(size); + + } + + pool->flags = flags; + + return pool; +} +EXPORT_SYMBOL_GPL(zs_create_pool); + +void zs_destroy_pool(struct zs_pool *pool) +{ + int i; + + for (i = 0; i < ZS_SIZE_CLASSES; i++) { + int fg; + struct size_class *class = &pool->size_class[i]; + + for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) { + if (class->fullness_list[fg]) { + pr_info("Freeing non-empty class with size %db, fullness group %d\n", + class->size, fg); + } + } + } + kfree(pool); +} +EXPORT_SYMBOL_GPL(zs_destroy_pool); + +/** + * zs_malloc - Allocate block of given size from pool. + * @pool: pool to allocate from + * @size: size of block to allocate + * + * On success, handle to the allocated object is returned, + * otherwise 0. + * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail. 
+ */ +unsigned long zs_malloc(struct zs_pool *pool, size_t size) +{ + unsigned long obj; + struct link_free *link; + int class_idx; + struct size_class *class; + + struct page *first_page, *m_page; + unsigned long m_objidx, m_offset; + + if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE)) + return 0; + + class_idx = get_size_class_index(size); + class = &pool->size_class[class_idx]; + BUG_ON(class_idx != class->index); + + spin_lock(&class->lock); + first_page = find_get_zspage(class); + + if (!first_page) { + spin_unlock(&class->lock); + first_page = alloc_zspage(class, pool->flags); + if (unlikely(!first_page)) + return 0; + + set_zspage_mapping(first_page, class->index, ZS_EMPTY); + spin_lock(&class->lock); + class->pages_allocated += class->pages_per_zspage; + } + + obj = (unsigned long)first_page->freelist; + obj_handle_to_location(obj, &m_page, &m_objidx); + m_offset = obj_idx_to_offset(m_page, m_objidx, class->size); + + link = (struct link_free *)kmap_atomic(m_page) + + m_offset / sizeof(*link); + first_page->freelist = link->next; + memset(link, POISON_INUSE, sizeof(*link)); + kunmap_atomic(link); + + first_page->inuse++; + /* Now move the zspage to another fullness group, if required */ + fix_fullness_group(pool, first_page); + spin_unlock(&class->lock); + + return obj; +} +EXPORT_SYMBOL_GPL(zs_malloc); + +void zs_free(struct zs_pool *pool, unsigned long obj) +{ + struct link_free *link; + struct page *first_page, *f_page; + unsigned long f_objidx, f_offset; + + int class_idx; + struct size_class *class; + enum fullness_group fullness; + + if (unlikely(!obj)) + return; + + obj_handle_to_location(obj, &f_page, &f_objidx); + first_page = get_first_page(f_page); + + get_zspage_mapping(first_page, &class_idx, &fullness); + class = &pool->size_class[class_idx]; + f_offset = obj_idx_to_offset(f_page, f_objidx, class->size); + + spin_lock(&class->lock); + + /* Insert this object in containing zspage's freelist */ + link = (struct link_free *)((unsigned char 
*)kmap_atomic(f_page) + + f_offset); + link->next = first_page->freelist; + kunmap_atomic(link); + first_page->freelist = (void *)obj; + + first_page->inuse--; + fullness = fix_fullness_group(pool, first_page); + + if (fullness == ZS_EMPTY) + class->pages_allocated -= class->pages_per_zspage; + + spin_unlock(&class->lock); + + if (fullness == ZS_EMPTY) + free_zspage(first_page); +} +EXPORT_SYMBOL_GPL(zs_free); + +/** + * zs_map_object - get address of allocated object from handle. + * @pool: pool from which the object was allocated + * @handle: handle returned from zs_malloc + * + * Before using an object allocated from zs_malloc, it must be mapped using + * this function. When done with the object, it must be unmapped using + * zs_unmap_object. + * + * Only one object can be mapped per cpu at a time. There is no protection + * against nested mappings. + * + * This function returns with preemption and page faults disabled. + */ +void *zs_map_object(struct zs_pool *pool, unsigned long handle, + enum zs_mapmode mm) +{ + struct page *page; + unsigned long obj_idx, off; + + unsigned int class_idx; + enum fullness_group fg; + struct size_class *class; + struct mapping_area *area; + struct page *pages[2]; + + BUG_ON(!handle); + + /* + * Because we use per-cpu mapping areas shared among the + * pools/users, we can't allow mapping in interrupt context + * because it can corrupt another users mappings. 
+ */ + BUG_ON(in_interrupt()); + + obj_handle_to_location(handle, &page, &obj_idx); + get_zspage_mapping(get_first_page(page), &class_idx, &fg); + class = &pool->size_class[class_idx]; + off = obj_idx_to_offset(page, obj_idx, class->size); + + area = &get_cpu_var(zs_map_area); + area->vm_mm = mm; + if (off + class->size <= PAGE_SIZE) { + /* this object is contained entirely within a page */ + area->vm_addr = kmap_atomic(page); + return area->vm_addr + off; + } + + /* this object spans two pages */ + pages[0] = page; + pages[1] = get_next_page(page); + BUG_ON(!pages[1]); + + return __zs_map_object(area, pages, off, class->size); +} +EXPORT_SYMBOL_GPL(zs_map_object); + +void zs_unmap_object(struct zs_pool *pool, unsigned long handle) +{ + struct page *page; + unsigned long obj_idx, off; + + unsigned int class_idx; + enum fullness_group fg; + struct size_class *class; + struct mapping_area *area; + + BUG_ON(!handle); + + obj_handle_to_location(handle, &page, &obj_idx); + get_zspage_mapping(get_first_page(page), &class_idx, &fg); + class = &pool->size_class[class_idx]; + off = obj_idx_to_offset(page, obj_idx, class->size); + + area = &__get_cpu_var(zs_map_area); + if (off + class->size <= PAGE_SIZE) + kunmap_atomic(area->vm_addr); + else { + struct page *pages[2]; + + pages[0] = page; + pages[1] = get_next_page(page); + BUG_ON(!pages[1]); + + __zs_unmap_object(area, pages, off, class->size); + } + put_cpu_var(zs_map_area); +} +EXPORT_SYMBOL_GPL(zs_unmap_object); + +u64 zs_get_total_size_bytes(struct zs_pool *pool) +{ + int i; + u64 npages = 0; + + for (i = 0; i < ZS_SIZE_CLASSES; i++) + npages += pool->size_class[i].pages_allocated; + + return npages << PAGE_SHIFT; +} +EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); + +module_init(zs_init); +module_exit(zs_exit); + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_AUTHOR("Nitin Gupta "); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id 
S1755500Ab3HRImD (ORCPT ); Sun, 18 Aug 2013 04:42:03 -0400 Received: from mail-pb0-f48.google.com ([209.85.160.48]:62879 "EHLO mail-pb0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754294Ab3HRImB (ORCPT ); Sun, 18 Aug 2013 04:42:01 -0400 From: Bob Liu To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu Subject: [PATCH 3/4] mm: zswap: add supporting for zsmalloc Date: Sun, 18 Aug 2013 16:40:48 +0800 Message-Id: <1376815249-6611-4-git-send-email-bob.liu@oracle.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Allow zswap to use zsmalloc as its allocator. Note that with zsmalloc there is no mandatory reclaim of zswap pool pages: if the zswap pool gets full, frontswap_store will be refused until a frontswap_get happens and frees some space. Reclaiming zsmalloc pages from the zswap pool is not implemented because there is currently no requirement for it. To do mandatory reclaim, we would have to write those pages to a real backend swap device. But most current users of zsmalloc are from the embedded world, where there is often no real backend swap device at all. This behaviour is also the same as the previous zram! For workloads where it matters that zsmalloc has unpredictable performance characteristics when reclaiming a single page, CONFIG_ZBUD is suggested instead. 
Signed-off-by: Bob Liu --- include/linux/zsmalloc.h | 1 + mm/Kconfig | 4 +++ mm/zsmalloc.c | 9 ++++-- mm/zswap.c | 73 +++++++++++++++++++++++++++++++++++++++++++--- 4 files changed, 81 insertions(+), 6 deletions(-) diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h index fbe6bec..72fc126 100644 --- a/include/linux/zsmalloc.h +++ b/include/linux/zsmalloc.h @@ -39,5 +39,6 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle, void zs_unmap_object(struct zs_pool *pool, unsigned long handle); u64 zs_get_total_size_bytes(struct zs_pool *pool); +u64 zs_get_pool_size(struct zs_pool *pool); #endif diff --git a/mm/Kconfig b/mm/Kconfig index 48d1786..d80a575 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -519,6 +519,10 @@ choice in order to reduce fragmentation and has high compression density. However, this results in a unpredictable performance characteristics when reclaiming a single page. + + Note: By using zsmalloc, no supporting for mandatory reclaiming from + compressed memory pool. If the pool gets full, frontswap_store will + be refused unless frontswap_get happened and freed some space. endchoice config MEM_SOFT_DIRTY diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 4bb275b..9df8d25 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -78,8 +78,7 @@ #include #include #include - -#include "zsmalloc.h" +#include /* * This must be power of 2 and greater than of equal to sizeof(link_free). 
@@ -1056,6 +1055,12 @@ u64 zs_get_total_size_bytes(struct zs_pool *pool) } EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); +u64 zs_get_pool_size(struct zs_pool *pool) +{ + return zs_get_total_size_bytes(pool) >> PAGE_SHIFT; +} +EXPORT_SYMBOL_GPL(zs_get_pool_size); + module_init(zs_init); module_exit(zs_exit); diff --git a/mm/zswap.c b/mm/zswap.c index deda2b6..8e8dc99 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -34,8 +34,11 @@ #include #include #include +#ifdef CONFIG_ZBUD #include - +#else +#include +#endif #include #include #include @@ -189,7 +192,11 @@ struct zswap_header { struct zswap_tree { struct rb_root rbroot; spinlock_t lock; +#ifdef CONFIG_ZBUD struct zbud_pool *pool; +#else + struct zs_pool *pool; +#endif }; static struct zswap_tree *zswap_trees[MAX_SWAPFILES]; @@ -374,12 +381,21 @@ static bool zswap_is_full(void) */ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry) { +#ifdef CONFIG_ZBUD zbud_free(tree->pool, entry->handle); +#else + zs_free(tree->pool, entry->handle); +#endif zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); +#ifdef CONFIG_ZBUD zswap_pool_pages = zbud_get_pool_size(tree->pool); +#else + zswap_pool_pages = zs_get_pool_size(tree->pool); +#endif } +#ifdef CONFIG_ZBUD /********************************* * writeback code **********************************/ @@ -595,6 +611,7 @@ fail: spin_unlock(&tree->lock); return ret; } +#endif /********************************* * frontswap hooks @@ -620,11 +637,22 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* reclaim space if needed */ if (zswap_is_full()) { zswap_pool_limit_hit++; +#ifdef CONFIG_ZBUD if (zbud_reclaim_page(tree->pool, 8)) { zswap_reject_reclaim_fail++; ret = -ENOMEM; goto reject; } +#else + /* + * zsmalloc has unpredictable performance + * characteristics when reclaiming, so don't support + * mandatory reclaiming from zsmalloc + */ + zswap_reject_reclaim_fail++; + ret = -ENOMEM; + goto reject; +#endif } /* allocate entry */ 
@@ -647,8 +675,9 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* store */ len = dlen + sizeof(struct zswap_header); +#ifdef CONFIG_ZBUD ret = zbud_alloc(tree->pool, len, __GFP_NORETRY | __GFP_NOWARN, - &handle); + &handle); if (ret == -ENOSPC) { zswap_reject_compress_poor++; goto freepage; @@ -658,10 +687,23 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, goto freepage; } zhdr = zbud_map(tree->pool, handle); +#else + handle = zs_malloc(tree->pool, len); + if (!handle) { + ret = -ENOMEM; + zswap_reject_alloc_fail++; + goto freepage; + } + zhdr = zs_map_object(tree->pool, handle, ZS_MM_WO); +#endif zhdr->swpentry = swp_entry(type, offset); buf = (u8 *)(zhdr + 1); memcpy(buf, dst, dlen); +#ifdef CONFIG_ZBUD zbud_unmap(tree->pool, handle); +#else + zs_unmap_object(tree->pool, handle); +#endif put_cpu_var(zswap_dstmem); /* populate entry */ @@ -687,8 +729,11 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, /* update stats */ atomic_inc(&zswap_stored_pages); +#ifdef CONFIG_ZBUD zswap_pool_pages = zbud_get_pool_size(tree->pool); - +#else + zswap_pool_pages = zs_get_pool_size(tree->pool); +#endif return 0; freepage: @@ -724,13 +769,22 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset, /* decompress */ dlen = PAGE_SIZE; +#ifdef CONFIG_ZBUD src = (u8 *)zbud_map(tree->pool, entry->handle) + - sizeof(struct zswap_header); + sizeof(struct zswap_header); +#else + src = zs_map_object(tree->pool, entry->handle, ZS_MM_RO); + src += sizeof(struct zswap_header); +#endif dst = kmap_atomic(page); ret = zswap_comp_op(ZSWAP_COMPOP_DECOMPRESS, src, entry->length, dst, &dlen); kunmap_atomic(dst); +#ifdef CONFIG_ZBUD zbud_unmap(tree->pool, entry->handle); +#else + zs_unmap_object(tree->pool, entry->handle); +#endif BUG_ON(ret); spin_lock(&tree->lock); @@ -810,7 +864,11 @@ static void zswap_frontswap_invalidate_area(unsigned type) while ((node = rb_first(&tree->rbroot))) { entry = rb_entry(node, struct zswap_entry, 
rbnode); rb_erase(&entry->rbnode, &tree->rbroot); +#ifdef CONFIG_ZBUD zbud_free(tree->pool, entry->handle); +#else + zs_free(tree->pool, entry->handle); +#endif zswap_entry_cache_free(entry); atomic_dec(&zswap_stored_pages); } @@ -818,9 +876,11 @@ static void zswap_frontswap_invalidate_area(unsigned type) spin_unlock(&tree->lock); } +#ifdef CONFIG_ZBUD static struct zbud_ops zswap_zbud_ops = { .evict = zswap_writeback_entry }; +#endif static void zswap_frontswap_init(unsigned type) { @@ -829,7 +889,12 @@ static void zswap_frontswap_init(unsigned type) tree = kzalloc(sizeof(struct zswap_tree), GFP_KERNEL); if (!tree) goto err; + +#ifdef CONFIG_ZBUD tree->pool = zbud_create_pool(GFP_KERNEL, &zswap_zbud_ops); +#else + tree->pool = zs_create_pool(GFP_NOWAIT); +#endif if (!tree->pool) goto freetree; tree->rbroot = RB_ROOT; -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755527Ab3HRImT (ORCPT ); Sun, 18 Aug 2013 04:42:19 -0400 Received: from mail-pa0-f51.google.com ([209.85.220.51]:51666 "EHLO mail-pa0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753240Ab3HRImR (ORCPT ); Sun, 18 Aug 2013 04:42:17 -0400 From: Bob Liu To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu Subject: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0 Date: Sun, 18 Aug 2013 16:40:49 +0800 Message-Id: <1376815249-6611-5-git-send-email-bob.liu@oracle.com> X-Mailer: git-send-email 1.7.10.4 In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is used to replace the previous zram. zram users can enable this feature, and a pseudo device will then be created automatically after kernel boot. Just run "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk. The size of this pseudo device is controlled by the zswap boot parameter zswap.max_pool_percent. disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE. Signed-off-by: Bob Liu --- mm/Kconfig | 12 ++++ mm/zswap.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 208 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index d80a575..3778026 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -525,6 +525,18 @@ choice be refused unless frontswap_get happened and freed some space. endchoice +config ZSWAP_PSEUDO_BLKDEV + bool "Emulate a pseudo blk-dev based on zswap(previous zram)" + depends on ZSWAP && ZSMALLOC + default n + + help + Enable this option will emulate a pseudo block swapdev /dev/zram0 + with size zswap.max_pool_percent of total ram size. All writes to this + block device will be compressed and cached by zswap as a result no + real IO disk operations will happen. + This feature can be used to replace drivers/staging/zram. 
+ config MEM_SOFT_DIRTY bool "Track memory changes" depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY diff --git a/mm/zswap.c b/mm/zswap.c index 8e8dc99..ae73c9d 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -38,6 +38,11 @@ #include #else #include +#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV +#include +#include +#include +#endif #endif #include #include @@ -968,6 +973,189 @@ static int __init zswap_debugfs_init(void) static void __exit zswap_debugfs_exit(void) { } #endif +#ifdef CONFIG_ZSWAP_PSEUDO_BLKDEV +#define SECTOR_SHIFT 9 +#define SECTOR_SIZE (1 << SECTOR_SHIFT) +#define SECTORS_PER_PAGE_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) +#define SECTORS_PER_PAGE (1 << SECTORS_PER_PAGE_SHIFT) + +struct zram { + struct rw_semaphore lock; /* protect concurent reads and writes */ + struct request_queue *queue; + struct gendisk *disk; + + /* + * This is the disk size for userland. The size is controlled by + * boot parameter zswap.max_pool_percent. + * disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE + */ + u64 disksize; /* bytes */ + + /* + * This page is used to store real data for /dev/zram. + * Meanful operation to /dev/zramx is only mkswp and swapon/swapoff. + * So use one page to store the real data(written by mkswp). + */ + struct page *metapage; +}; + +/* + * Only create /dev/zram0, can be extened in future if there is real uercases + * need multiple zram devices. 
+ */ +static struct zram zram_device; +static const struct block_device_operations zram_devops = { + .owner = THIS_MODULE +}; + +static void update_position(u32 *index, int *offset, struct bio_vec *bvec) +{ + if (*offset + bvec->bv_len >= PAGE_SIZE) + (*index)++; + *offset = (*offset + bvec->bv_len) % PAGE_SIZE; +} + +static void zram_make_request(struct request_queue *queue, struct bio *bio) +{ + u32 index; + struct bio_vec *bvec; + unsigned char *src, *dst; + int offset, i, rw = bio_data_dir(bio); + struct zram *zram = queue->queuedata; + + index = bio->bi_sector >> SECTORS_PER_PAGE_SHIFT; + offset = (bio->bi_sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT; + + bio_for_each_segment(bvec, bio, i) { + /* + * The only operation to pseudo /dev/zramx is mkswp and + * swapon/swapoff, so we only need one extra page to store the + * real meta data! + */ + BUG_ON(bvec->bv_len != PAGE_SIZE); + BUG_ON(offset); + + if (!index) { + if (rw == READ) { + down_read(&zram->lock); + dst = kmap_atomic(bvec->bv_page); + src = kmap_atomic(zram->metapage); + memcpy(dst, src, bvec->bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + flush_dcache_page(bvec->bv_page); + up_read(&zram->lock); + } else { + down_write(&zram->lock); + src = kmap_atomic(bvec->bv_page); + dst = kmap_atomic(zram->metapage); + memcpy(dst, src, bvec->bv_len); + kunmap_atomic(dst); + kunmap_atomic(src); + up_write(&zram->lock); + } + } + update_position(&index, &offset, bvec); + } + set_bit(BIO_UPTODATE, &bio->bi_flags); + bio_endio(bio, 0); + return; +} + +static int create_zram_device(struct zram *zram, int major, int device_id) +{ + int ret = -ENOMEM; + u64 disksize; + + zram->queue = blk_alloc_queue(GFP_KERNEL); + if (!zram->queue) { + pr_err("Error allocating disk queue for device%d\n", device_id); + goto out; + } + + blk_queue_make_request(zram->queue, zram_make_request); + zram->queue->queuedata = zram; + + /* gendisk structure */ + zram->disk = alloc_disk(1); + if (!zram->disk) { + pr_warn("Error 
allocating disk structure for device %d\n", + device_id); + goto out_free_queue; + } + + zram->disk->major = major; + zram->disk->first_minor = device_id; + zram->disk->fops = &zram_devops; + zram->disk->queue = zram->queue; + snprintf(zram->disk->disk_name, 16, "zram%d", device_id); + + /* + * To ensure that we always get PAGE_SIZE aligned + * and n*PAGE_SIZED sized I/O requests. + */ + blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE); + blk_queue_logical_block_size(zram->disk->queue, 1<<12); + blk_queue_io_min(zram->disk->queue, PAGE_SIZE); + blk_queue_io_opt(zram->disk->queue, PAGE_SIZE); + + add_disk(zram->disk); + + /* Init blk-dev */ + disksize = totalram_pages * zswap_max_pool_percent / 100; + disksize *= PAGE_SIZE; + disksize = PAGE_ALIGN(disksize); + zram->disksize = disksize; + set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT); + + /* zram devices sort of resembles non-rotational disks */ + queue_flag_set_unlocked(QUEUE_FLAG_NONROT, zram->disk->queue); + + zram->metapage = alloc_page(GFP_KERNEL); + if (!zram->metapage) + goto out_free_disk; + + pr_debug("Initialization done!\n"); + return 0; + +out_free_disk: + pr_debug("Init zram meta pages fail!\n"); + del_gendisk(zram->disk); + put_disk(zram->disk); +out_free_queue: + blk_cleanup_queue(zram->queue); +out: + return ret; +} + +static int zswap_blkdev_init(void) +{ + int major, ret = 0; + + major = register_blkdev(0, "zram"); + if (major <= 0) { + pr_warn("Unable to get major number\n"); + ret = -EBUSY; + goto out; + } + + ret = create_zram_device(&zram_device, major, 0); + if (ret) { + unregister_blkdev(major, "zram"); + goto out; + } + + pr_info("Created zram device(%d, %d).\n", major, 0); +out: + return ret; +} +#else +static int zswap_blkdev_init(void) +{ + return 0; +} +#endif + /********************************* * module init and exit **********************************/ @@ -989,9 +1177,17 @@ static int __init init_zswap(void) pr_err("per-cpu initialization failed\n"); goto 
pcpufail; } + + if (IS_ENABLED(CONFIG_ZSWAP_PSEUDO_BLKDEV)) + if (zswap_blkdev_init()) { + pr_err("emulate blk device failed\n"); + goto pcpufail; + } + frontswap_register_ops(&zswap_frontswap_ops); if (zswap_debugfs_init()) pr_warn("debugfs initialization failed\n"); + return 0; pcpufail: zswap_comp_exit(); -- 1.7.10.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750814Ab3HSEKX (ORCPT ); Mon, 19 Aug 2013 00:10:23 -0400 Received: from LGEMRELSE7Q.lge.com ([156.147.1.151]:61998 "EHLO LGEMRELSE7Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750730Ab3HSEKW (ORCPT ); Mon, 19 Aug 2013 00:10:22 -0400 X-AuditID: 9c930197-b7b44ae00000347f-fe-52119aaccbca Date: Mon, 19 Aug 2013 13:10:44 +0900 From: Minchan Kim To: Bob Liu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu Subject: Re: [PATCH 0/4] mm: merge zram into zswap Message-ID: <20130819041044.GB26832@bbox> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Aug 18, 2013 at 04:40:45PM +0800, Bob Liu wrote: > Both zswap and zram are used to compress anon pages in memory so as to reduce > swap io operation. The main different is that zswap uses zbud as its allocator > while zram uses zsmalloc. 
The other different is zram will create a block > device, the user need to mkswp and swapon it. > > Minchan has areadly try to promote zram/zsmalloc into drivers/block/, but it may > cause increase maintenance headaches. Since the purpose of zswap and zram are > the same, this patch series try to merge them together as Mel suggested. > Dropped zram from staging and extended zswap with the same feature as zram. > > zswap todo: > Improve the writeback of zswap pool pages! > > Bob Liu (4): > drivers: staging: drop zram and zsmalloc Bob, I feel you're being very rude and I'm really upset. You're just dropping a subsystem you didn't contribute anything to, without any consensus from the people who have been contributing lots of patches to make it work well for a long time. I understand you want to merge zram and zswap to address the concern Mel raised, so your intention might help the community. But the approach was totally wrong. You only brought this up a few days ago in my thread, and I was on holiday, so I didn't have time to reply to all of the mail sent to me. Should I break my holiday just to reply to you? Would you be okay with someone else removing or moving your work without any consensus with you while you're spending good time with your family? Please be careful, Bob. 
-- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750803Ab3HSEeV (ORCPT ); Mon, 19 Aug 2013 00:34:21 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:16871 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750765Ab3HSEeT (ORCPT ); Mon, 19 Aug 2013 00:34:19 -0400 Message-ID: <52119FC7.5070406@oracle.com> Date: Mon, 19 Aug 2013 12:32:07 +0800 From: Bob Liu User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: Minchan Kim CC: Bob Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, sjenning@linux.vnet.ibm.com, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org Subject: Re: [PATCH 0/4] mm: merge zram into zswap References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <20130819041044.GB26832@bbox> In-Reply-To: <20130819041044.GB26832@bbox> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Minchan, On 08/19/2013 12:10 PM, Minchan Kim wrote: > On Sun, Aug 18, 2013 at 04:40:45PM +0800, Bob Liu wrote: >> Both zswap and zram are used to compress anon pages in memory so as to reduce >> swap io operation. The main different is that zswap uses zbud as its allocator >> while zram uses zsmalloc. The other different is zram will create a block >> device, the user need to mkswp and swapon it. >> >> Minchan has areadly try to promote zram/zsmalloc into drivers/block/, but it may >> cause increase maintenance headaches. 
Since the purpose of zswap and zram are >> the same, this patch series try to merge them together as Mel suggested. >> Dropped zram from staging and extended zswap with the same feature as zram. >> >> zswap todo: >> Improve the writeback of zswap pool pages! >> >> Bob Liu (4): >> drivers: staging: drop zram and zsmalloc > > Bob, I feel you're very rude and I'm really upset. > > You're just dropping the subsystem you didn't do anything without any consensus > from who are contriubting lots of patches to make it works well for a long time. I apologize for that, at least I should add [RFC] in the patch title! -- Regards, -Bob From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751146Ab3HSRAJ (ORCPT ); Mon, 19 Aug 2013 13:00:09 -0400 Received: from e7.ny.us.ibm.com ([32.97.182.137]:54899 "EHLO e7.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751122Ab3HSRAC (ORCPT ); Mon, 19 Aug 2013 13:00:02 -0400 Date: Mon, 19 Aug 2013 11:59:48 -0500 From: Seth Jennings To: Bob Liu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu Subject: Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc Message-ID: <20130819165948.GA5703@variantweb.net> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-4-git-send-email-bob.liu@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1376815249-6611-4-git-send-email-bob.liu@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: No X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13081916-5806-0000-0000-000022794B05 Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote: > Make zswap can use zsmalloc as its allocater. > But note that zsmalloc don't reclaim any zswap pool pages mandatory, if zswap > pool gets full, frontswap_store will be refused unless frontswap_get happened > and freed some space. > > The reason of don't implement reclaiming zsmalloc pages from zswap pool is there > is no requiremnet currently. > If we want to do mandatory reclaim, we have to write those pages to real backend > swap devices. But most of current users of zsmalloc are from embeded world, > there is even no real backend swap device. > This action is also the same as privous zram! > > For several area, zsmalloc has unpredictable performance characteristics when > reclaiming a single page, then CONFIG_ZBUD are suggested. Looking at this patch on its own, it does show how simple it could be for zswap to support zsmalloc. So thanks! However, I don't like all the ifdefs scattered everywhere. I'd like to have an ops structure (e.g. struct zswap_alloc_ops) instead and just switch ops based on the CONFIG flag. Or better yet, have it boot-time selectable instead of build-time. 
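To make the ops-structure idea concrete, here is a minimal userspace sketch, not kernel code. Everything in it is an assumption for illustration: the fields of struct zswap_alloc_ops, the toy_* backends (which just wrap malloc), and the helpers zswap_select_ops() and zswap_make_room() do not exist in the patch or in zswap. It only shows how one function-pointer table per allocator could replace the #ifdef CONFIG_ZBUD pairs, and how a missing reclaim hook could stand in for the zsmalloc "reject the store" branch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator interface; every name here is illustrative. */
struct zswap_alloc_ops {
	const char *name;
	uintptr_t (*alloc)(size_t len);	/* returns 0 on failure */
	void (*free)(uintptr_t handle);
	void *(*map)(uintptr_t handle);
	void (*unmap)(uintptr_t handle);
	int (*reclaim)(void);		/* NULL: backend cannot reclaim */
};

/* Toy backend shared by both tables: a handle is just a malloc'd pointer. */
static uintptr_t toy_alloc(size_t len) { return (uintptr_t)malloc(len); }
static void toy_free(uintptr_t handle) { free((void *)handle); }
static void *toy_map(uintptr_t handle) { return (void *)handle; }
static void toy_unmap(uintptr_t handle) { (void)handle; }
static int toy_reclaim(void) { return 0; }	/* pretend a page was evicted */

static const struct zswap_alloc_ops toy_zbud_ops = {
	.name = "zbud", .alloc = toy_alloc, .free = toy_free,
	.map = toy_map, .unmap = toy_unmap, .reclaim = toy_reclaim,
};

static const struct zswap_alloc_ops toy_zsmalloc_ops = {
	.name = "zsmalloc", .alloc = toy_alloc, .free = toy_free,
	.map = toy_map, .unmap = toy_unmap, .reclaim = NULL,
};

/* Pick a backend by name, as a boot parameter might; default to zbud. */
static const struct zswap_alloc_ops *zswap_select_ops(const char *name)
{
	if (name && strcmp(name, "zsmalloc") == 0)
		return &toy_zsmalloc_ops;
	return &toy_zbud_ops;
}

/* One call site instead of an #ifdef pair in the pool-full path. */
static int zswap_make_room(const struct zswap_alloc_ops *ops)
{
	if (!ops->reclaim)
		return -ENOMEM;	/* no reclaim support: reject the store */
	return ops->reclaim() ? -ENOMEM : 0;
}
```

With a table like this, the pool-full path in the store hook would call zswap_make_room(ops) once, and choosing the table from a string is what would make the allocator selectable at boot time rather than build time.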
Seth From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751194Ab3HSRrR (ORCPT ); Mon, 19 Aug 2013 13:47:17 -0400 Received: from e7.ny.us.ibm.com ([32.97.182.137]:36072 "EHLO e7.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750906Ab3HSRrQ (ORCPT ); Mon, 19 Aug 2013 13:47:16 -0400 Date: Mon, 19 Aug 2013 12:46:34 -0500 From: Seth Jennings To: Bob Liu Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org, Bob Liu Subject: Re: [PATCH 4/4] mm: zswap: create a pseudo device /dev/zram0 Message-ID: <20130819174634.GB5703@variantweb.net> References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-5-git-send-email-bob.liu@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1376815249-6611-5-git-send-email-bob.liu@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: No X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13081917-5806-0000-0000-000022797274 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote: > This is used to replace previous zram. > zram users can enable this feature, then a pseudo device will be created > automaticlly after kernel boot. > Just using "mkswp /dev/zram0; swapon /dev/zram0" to use it as a swap disk. > > The size of this pseudeo is controlled by zswap boot parameter > zswap.max_pool_percent. > disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE. This /dev/zram0 will behave nothing like the block device that zram creates. 
It only allows reads/writes to the first PAGE_SIZE area of the device, for mkswap to work, and then doesn't do anything for all other accesses. I guess if you disabled zswap writeback, then... it would somewhat be the same thing. We do need to disable zswap writeback in this case so that zswap doesn't decompress a ton of pages into the swapcache for writebacks that will just fail. Since zsmalloc does not yet support the reclaim functionality, zswap writeback is implicitly disabled. But this is really weird conceptually since zswap is a caching layer that uses frontswap. If a frontswap store fails, it will try to send the page to the zram0 device which will fail the write. Then the page will be... put back on the active or inactive list? Also, using the max_pool_percent in calculating the pseudo-device size isn't right. Right now, the code makes the device the max size of the _compressed_ pool, but the underlying swap device size is in _uncompressed_ pages. So you'll never be able to fill zswap when sizing the device like this, unless every page is highly incompressible to the point that each compressed page effectively uses a memory pool page, in which case, the user shouldn't be using memory compression. This also means that this hasn't been tested in the zswap pool-is-full case since there is no way, in this code, to hit that case. In the zbud case the expected compression is 2:1 so you could just multiply the compressed pool size by 2 and get a good pseudo-device size. With zsmalloc the expected compression is harder to determine since it can achieve very high effective compression ratios on highly compressible pages. 
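The sizing point can be checked with a little arithmetic. The sketch below is plain userspace C under stated assumptions: pool_limit_bytes() mirrors the disksize formula from the patch description, while device_bytes() and its ratio parameter are hypothetical and only model the "multiply by the expected compression ratio" idea for zbud. The page count and percentage in the note after it are made-up example values.

```c
#include <stdint.h>

/* Compressed pool limit in bytes, as in the patch's disksize formula. */
static uint64_t pool_limit_bytes(uint64_t totalram_pages,
				 unsigned int max_pool_percent,
				 uint64_t page_size)
{
	return totalram_pages * max_pool_percent / 100 * page_size;
}

/*
 * Hypothetical: size the swap device in uncompressed bytes by scaling the
 * compressed limit with an assumed compression ratio (2 for zbud's 2:1).
 */
static uint64_t device_bytes(uint64_t totalram_pages,
			     unsigned int max_pool_percent,
			     uint64_t page_size, unsigned int ratio)
{
	return pool_limit_bytes(totalram_pages, max_pool_percent, page_size) * ratio;
}
```

For example, with 1000 pages of 4096 bytes and max_pool_percent = 20, pool_limit_bytes() gives 819200 bytes, while device_bytes() with a ratio of 2 gives 1638400 bytes, so the device could accept enough uncompressed pages to actually fill the pool at 2:1.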
Seth From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751542Ab3HTBNs (ORCPT ); Mon, 19 Aug 2013 21:13:48 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:48278 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751297Ab3HTBNq (ORCPT ); Mon, 19 Aug 2013 21:13:46 -0400 Message-ID: <5212C24F.9050702@oracle.com> Date: Tue, 20 Aug 2013 09:11:43 +0800 From: Bob Liu User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: Seth Jennings CC: Bob Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org Subject: Re: [PATCH 3/4] mm: zswap: add supporting for zsmalloc References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-4-git-send-email-bob.liu@oracle.com> <20130819165948.GA5703@variantweb.net> In-Reply-To: <20130819165948.GA5703@variantweb.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet22.oracle.com [156.151.31.94] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/20/2013 12:59 AM, Seth Jennings wrote: > On Sun, Aug 18, 2013 at 04:40:48PM +0800, Bob Liu wrote: >> Make zswap can use zsmalloc as its allocater. >> But note that zsmalloc don't reclaim any zswap pool pages mandatory, if zswap >> pool gets full, frontswap_store will be refused unless frontswap_get happened >> and freed some space. >> >> The reason of don't implement reclaiming zsmalloc pages from zswap pool is there >> is no requiremnet currently. 
>> If we want to do mandatory reclaim, we have to write those pages to real backend
>> swap devices. But most current users of zsmalloc are from the embedded world,
>> where there is not even a real backend swap device.
>> This behavior is also the same as the previous zram!
>>
>> For several areas, zsmalloc has unpredictable performance characteristics when
>> reclaiming a single page; CONFIG_ZBUD is suggested for those.
>
> Looking at this patch on its own, it does show how simple it could be
> for zswap to support zsmalloc. So thanks!
>
> However, I don't like all the ifdefs scattered everywhere. I'd like to
> have an ops structure (e.g. struct zswap_alloc_ops) instead and just
> switch ops based on the CONFIG flag. Or better yet, have it boot-time
> selectable instead of build-time.
>

I don't like the ifdefs either. But I didn't find a better way to replace them, since the data structures and APIs of zbud and zsmalloc are different.
I can try using zswap_alloc_ops.

--
Regards,
-Bob

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751723Ab3HTCE0 (ORCPT ); Mon, 19 Aug 2013 22:04:26 -0400
Received: from aserp1040.oracle.com ([141.146.126.69]:41213 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751254Ab3HTCEY (ORCPT ); Mon, 19 Aug 2013 22:04:24 -0400
Message-ID: <5212CE61.2090600@oracle.com>
Date: Tue, 20 Aug 2013 10:03:13 +0800
From: Bob Liu
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4
MIME-Version: 1.0
To: Seth Jennings
CC: Bob Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org, eternaleye@gmail.com, minchan@kernel.org, mgorman@suse.de, gregkh@linuxfoundation.org, akpm@linux-foundation.org, axboe@kernel.dk, ngupta@vflare.org, semenzato@google.com, penberg@iki.fi, sonnyrao@google.com, smbarber@google.com, konrad.wilk@oracle.com, riel@redhat.com, kmpark@infradead.org
Subject: Re: [PATCH 4/4] mm: zswap: create a pseudo
 device /dev/zram0
References: <1376815249-6611-1-git-send-email-bob.liu@oracle.com> <1376815249-6611-5-git-send-email-bob.liu@oracle.com> <20130819174634.GB5703@variantweb.net>
In-Reply-To: <20130819174634.GB5703@variantweb.net>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Source-IP: ucsinet21.oracle.com [156.151.31.93]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 08/20/2013 01:46 AM, Seth Jennings wrote:
> On Sun, Aug 18, 2013 at 04:40:49PM +0800, Bob Liu wrote:
>> This is used to replace the previous zram.
>> zram users can enable this feature, then a pseudo device will be created
>> automatically after kernel boot.
>> Just use "mkswap /dev/zram0; swapon /dev/zram0" to use it as a swap disk.
>>
>> The size of this pseudo device is controlled by the zswap boot parameter
>> zswap.max_pool_percent.
>> disksize = (totalram_pages * zswap.max_pool_percent/100)*PAGE_SIZE.
>
> This /dev/zram0 will behave nothing like the block device that zram
> creates. It only allows reads/writes to the first PAGE_SIZE area of the
> device, for mkswap to work, and then doesn't do anything for all other
> accesses.

Yes, all the other data should be stored in the zswap pool and doesn't need to go through the block layer.

>
> I guess if you disabled zswap writeback, then... it would somewhat be
> the same thing. We do need to disable zswap writeback in this case so
> that zswap doesn't decompress a ton of pages into the swapcache for
> writebacks that will just fail. Since zsmalloc does not yet support the
> reclaim functionality, zswap writeback is implicitly disabled.
>

Yes, ZSWAP_PSEUDO_BLKDEV depends on zsmalloc, and when zsmalloc is the allocator, writeback is disabled (not implemented and no requirement).

> But this is really weird conceptually since zswap is a caching layer
> that uses frontswap. If a frontswap store fails, it will try to send
> the page to the zram0 device which will fail the write.
> Then the page

That's a problem. We should disable sending the page to zram0 if the frontswap store fails, and return failure just as if the swap device were full.

> will be... put back on the active or inactive list?
>
> Also, using the max_pool_percent in calculating the pseudo-device size
> isn't right. Right now, the code makes the device the max size of the
> _compressed_ pool, but the underlying swap device size is in
> _uncompressed_ pages. So you'll never be able to fill zswap sizing the
> device like this, unless every page is highly incompressible to the
> point that each compressed page effectively uses a memory pool page, in
> which case, the user shouldn't be using memory compression.
>
> This also means that this hasn't been tested in the zswap pool-is-full
> case since there is no way, in this code, to hit that case.

Yes, but in my understanding there is no need to trigger this path. It's the same with zram. E.g. create /dev/zram0 with a disksize of 100M; mm-core will then store at most ~100M of uncompressed pages to /dev/zram0. But the real memory spent storing those pages depends on the compression ratio. It's rare that zram will need 100M of real memory.

>
> In the zbud case the expected compression is 2:1 so you could just
> multiply the compressed pool size by 2 and get a good pseudo-device
> size. With zsmalloc the expected compression is harder to determine
> since it can achieve very high effective compression ratios on highly
> compressible pages.
>

Some users can know the compression ratio of their workloads even when using zsmalloc.

--
Regards,
-Bob
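
Two points from this thread fit into one small sketch: the ops structure suggested in the 3/4 review in place of scattered ifdefs, and the observation that writeback is implicitly disabled when the allocator cannot reclaim. The following is a user-space model under stated assumptions, not kernel code. The field reclaim_page, the stub function, and the helpers zswap_select_ops() and zswap_writeback_enabled() are hypothetical names that appear neither in the patch nor in zbud or zsmalloc. Only the name struct zswap_alloc_ops comes from the review.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical ops table. Only the struct name comes from the review;
 * the fields are assumptions and do not mirror the zbud/zsmalloc APIs.
 */
struct zswap_alloc_ops {
	const char *name;
	/* Evict one stored page to the backing swap device.
	 * NULL means the allocator has no reclaim support. */
	int (*reclaim_page)(void *pool);
};

/* Stand-in for an allocator that can reclaim; does nothing here. */
static int zbud_reclaim_stub(void *pool)
{
	(void)pool;
	return 0;
}

static const struct zswap_alloc_ops zbud_ops = {
	.name = "zbud",
	.reclaim_page = zbud_reclaim_stub,
};

static const struct zswap_alloc_ops zsmalloc_ops = {
	.name = "zsmalloc",
	.reclaim_page = NULL,	/* no reclaim, as described in this thread */
};

/* Select ops by name, as a boot-time parameter might; default to zbud. */
static const struct zswap_alloc_ops *zswap_select_ops(const char *name)
{
	if (name && strcmp(name, "zsmalloc") == 0)
		return &zsmalloc_ops;
	return &zbud_ops;
}

/* Writeback is only possible when the allocator can reclaim pages. */
static bool zswap_writeback_enabled(const struct zswap_alloc_ops *ops)
{
	return ops->reclaim_page != NULL;
}
```

Selecting by name rather than by CONFIG flag is one way to get the boot-time choice Seth asked for. Callers would then test zswap_writeback_enabled() once instead of wrapping each writeback path in an ifdef.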