* [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
@ 2015-09-27 17:17 Petros Koutoupis
2015-09-28 6:49 ` Christoph Hellwig
0 siblings, 1 reply; 8+ messages in thread
From: Petros Koutoupis @ 2015-09-27 17:17 UTC (permalink / raw)
To: linux-kernel; +Cc: devel@rapiddisk.org
Attached is a patch for two modules: RapidDisk & RapidCache. RapidDisk is a
Linux RAM drive module which allows the user to dynamically create, remove,
and resize RAM-based block devices. RapidDisk is designed to work with both
volatile and non-volatile memory. In the case of volatile memory, memory is
allocated only when needed. The RapidCache module in turn utilizes a RapidDisk
volume as a FIFO Write-Through caching node to a slower block device.
Signed-off-by: Petros Koutoupis <petros@petroskoutoupis.com>
---
Documentation/rapiddisk/rxdsk.txt | 67 ++++
MAINTAINERS | 8 +
drivers/block/Kconfig | 2 +
drivers/block/Makefile | 1 +
drivers/block/rapiddisk/Kconfig | 10 +
drivers/block/rapiddisk/Makefile | 2 +
drivers/block/rapiddisk/rxcache.c | 1045 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/block/rapiddisk/rxdsk.c | 777 ++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 1912 insertions(+)
diff -uNpr linux-next.orig/Documentation/rapiddisk/rxdsk.txt linux-next/Documentation/rapiddisk/rxdsk.txt
--- linux-next.orig/Documentation/rapiddisk/rxdsk.txt 1969-12-31 18:00:00.000000000 -0600
+++ linux-next/Documentation/rapiddisk/rxdsk.txt 2015-09-20 15:31:01.412603554 -0500
@@ -0,0 +1,67 @@
+RapidDisk (rxdsk) RAM disk and RapidCache (rxcache) Linux modules
+
+== Description ==
+
+RapidDisk or rxdsk was designed to be used in high performing environments and has been designed with simplicity in mind. Utilizing a user land binary, the system administrator is capable of dynamically adding new RAM based block devices of varying sizes, removing existing ones to even listing all existing RAM block devices. The rxdsk module has been designed to allocate from the system's memory pages and is capable of addressing memory to support Gigabytes if not a Terabyte of available Random Access Memory.
+
+RapidCache or rxcache is designed to leverage the high speed performing technologies of the RapidDisk RAM disk and utilizing the Device Mapper framework, map an rxdsk volume to act as a block device's Write/Read-through cache. This can significantly boost the performance of a local or remote disk device.
+
+
+== Module Parameters ==
+
+RapidDisk
+---------
+max_rxcnt: Total RAM Disk devices available for use. (Default = 128 = MAX) (int)
+max_sectors: Maximum sectors (in KB) for the request queue. (Default = 127) (int)
+nr_requests: Number of requests at a given time for the request queue. (Default = 128) (int)
+
+RapidCache
+---------
+None.
+
+
+== Usage ==
+
+RapidDisk
+---------
+It is advised to utilize the userland utility, rxadm, but this is essentially what is written to the /proc/rxctl proc file to manage rxdsk volumes:
+
+Attach a new rxdsk volume by typing (size in sectors):
+ # echo "rxdsk attach 0 8192" > /proc/rxctl
+
+Attach a non-volatile rxdsk volume by typing (starting / ending addresses in decimal format):
+ # echo "rxdsk attach-nv 0 1234 56789" > /proc/rxctl
+
+Detach an existing rxdsk volume by typing:
+ # echo "rxdsk detach 0" > /proc/rxctl
+
+Resize an existing rxdsk volume by typing (size in sectors):
+ # echo "rxdsk resize 0 65536" > /proc/rxctl
+
+
+Note - the integer after the "rxdsk <command>" is the RapidDisk volume number. "0" would signify "rxd0."
+
+RapidCache
+---------
+It is advised to utilize the userland utility, rxadm, but this is essentially what is sent to the dmsetup command:
+
+Map an existing rxdsk volume to a block device by typing:
+ # echo 0 4194303 rxcache /dev/sdb /dev/rxd0 196608|dmsetup create rxc0
+
+Parameter 1: Start block of source volume (in sectors).
+Parameter 2: End block of source volume (in sectors).
+Parameter 4: Source volume.
+Parameter 5: Cache volume.
+Parameter 6: Cache size (in sectors).
+
+
+Unmap an rxdsk volume:
+
+ # dmsetup remove rxc0
+
+
+== Design ==
+
+The memory management of the rxdsk module was inspired by the brd Linux RAM disk module.
+
+The functionality of the rxcache module was inspired by Flashcache-wt and dm-cache.
diff -uNpr linux-next.orig/drivers/block/Kconfig linux-next/drivers/block/Kconfig
--- linux-next.orig/drivers/block/Kconfig 2015-09-20 14:50:16.936673072 -0500
+++ linux-next/drivers/block/Kconfig 2015-09-20 15:31:01.720603545 -0500
@@ -558,4 +558,6 @@ config BLK_DEV_RSXX
To compile this driver as a module, choose M here: the
module will be called rsxx.
+source "drivers/block/rapiddisk/Kconfig"
+
endif # BLK_DEV
diff -uNpr linux-next.orig/drivers/block/Makefile linux-next/drivers/block/Makefile
--- linux-next.orig/drivers/block/Makefile 2015-09-20 14:50:16.928673073 -0500
+++ linux-next/drivers/block/Makefile 2015-09-20 15:31:01.720603545 -0500
@@ -43,6 +43,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) +
obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_RXDSK) += rapiddisk/
nvme-y := nvme-core.o nvme-scsi.o
skd-y := skd_main.o
diff -uNpr linux-next.orig/drivers/block/rapiddisk/Kconfig linux-next/drivers/block/rapiddisk/Kconfig
--- linux-next.orig/drivers/block/rapiddisk/Kconfig 1969-12-31 18:00:00.000000000 -0600
+++ linux-next/drivers/block/rapiddisk/Kconfig 2015-09-20 15:31:01.720603545 -0500
@@ -0,0 +1,10 @@
+config RXDSK
+ tristate "RapidDisk Enhanced Linux RAM disk and caching solution"
+ depends on BLOCK && BLK_DEV_DM
+ default n
+ help
+ Creates virtual block devices called /dev/rxd(X = 0, 1, ...).
+ Supports both volatile and non-volatile memory. And can map
+ rxdsk volumes as caching nodes to slower drives.
+
+ Project home: http://www.rapiddisk.org/
diff -uNpr linux-next.orig/drivers/block/rapiddisk/Makefile linux-next/drivers/block/rapiddisk/Makefile
--- linux-next.orig/drivers/block/rapiddisk/Makefile 1969-12-31 18:00:00.000000000 -0600
+++ linux-next/drivers/block/rapiddisk/Makefile 2015-09-20 15:31:01.720603545 -0500
@@ -0,0 +1,2 @@
+obj-$(CONFIG_RXDSK) += rxdsk.o
+obj-$(CONFIG_RXDSK) += rxcache.o
diff -uNpr linux-next.orig/drivers/block/rapiddisk/rxcache.c linux-next/drivers/block/rapiddisk/rxcache.c
--- linux-next.orig/drivers/block/rapiddisk/rxcache.c 1969-12-31 18:00:00.000000000 -0600
+++ linux-next/drivers/block/rapiddisk/rxcache.c 2015-09-20 15:31:01.720603545 -0500
@@ -0,0 +1,1045 @@
+/*******************************************************************************
+ ** Copyright (c) 2011-2015 Petros Koutoupis
+ **
+ ** description: Device mapper target for block-level disk write-through and
+ ** read-ahead caching. This module is inspired by Facebook's
+ ** Flashcache-wt.
+ **
+ ** This file is licensed under GPLv2.
+ **
+ ******************************************************************************/
+
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/hash.h>
+#include <linux/spinlock.h>
+#include <linux/workqueue.h>
+#include <linux/pagemap.h>
+#include <linux/random.h>
+#include <linux/version.h>
+#include <linux/seq_file.h>
+#include <linux/hardirq.h>
+#include <asm/kmap_types.h>
+#include <linux/dm-io.h>
+#include <linux/device-mapper.h>
+#include <linux/bio.h>
+
+#define ASSERT(x) do { \
+ if (unlikely(!(x))) { \
+ dump_stack(); \
+ panic("ASSERT: assertion (%s) failed at %s (%d)\n", \
+ #x, __FILE__, __LINE__); \
+ } \
+} while (0)
+
+#define VERSION_STR "3.4"
+#define DM_MSG_PREFIX "rxc"
+
+#define READCACHE 1
+#define WRITECACHE 2
+#define READSOURCE 3
+#define WRITESOURCE 4
+#define READCACHE_DONE 5
+
+#define GENERIC_ERROR -1
+#define BYTES_PER_BLOCK 512
+/* Default cache parameters */
+#define DEFAULT_CACHE_ASSOC 512
+#define CACHE_BLOCK_SIZE (PAGE_SIZE / BYTES_PER_BLOCK)
+#define CONSECUTIVE_BLOCKS 512
+
+/* States of a cache block */
+#define INVALID 0
+#define VALID 1
+#define INPROG 2 /* IO (cache fill) is in progress */
+#define CACHEREADINPROG 3
+#define INPROG_INVALID 4 /* Write invalidated during a refill */
+
+#define DEV_PATHLEN 128
+
+#ifndef DM_MAPIO_SUBMITTED
+#define DM_MAPIO_SUBMITTED 0
+#endif
+
+#define WT_MIN_JOBS 1024
+#define bio_barrier(bio) ((bio)->bi_rw & REQ_FLUSH)
+
+/* Cache context */
+struct cache_context {
+ struct dm_target *tgt;
+ struct dm_dev *disk_dev; /* Source device */
+ struct dm_dev *cache_dev; /* Cache device */
+
+ spinlock_t cache_spin_lock; /* For Cache updates / reads */
+ struct cache_block *cache;
+ u8 *cache_state;
+ u32 *set_lru_next;
+
+ struct dm_io_client *io_client;
+ sector_t size;
+ unsigned int assoc;
+ unsigned int block_size;
+ unsigned int block_shift;
+ unsigned int block_mask;
+ unsigned int consecutive_shift;
+
+ wait_queue_head_t destroyq; /* Wait queue for I/O completion */
+ atomic_t nr_jobs; /* Number of I/O jobs */
+
+ /* Stats */
+ unsigned long reads;
+ unsigned long writes;
+ unsigned long cache_hits;
+ unsigned long replace;
+ unsigned long wr_invalidates;
+ unsigned long rd_invalidates;
+ unsigned long cached_blocks;
+ unsigned long cache_wr_replace;
+ unsigned long uncached_reads;
+ unsigned long uncached_writes;
+ unsigned long cache_reads, cache_writes;
+ unsigned long disk_reads, disk_writes;
+
+ char cache_devname[DEV_PATHLEN];
+ char disk_devname[DEV_PATHLEN];
+};
+
+/* Cache block metadata structure */
+struct cache_block {
+ sector_t dbn; /* Sector number of the cached block */
+};
+
+/* Structure for a kcached job */
+struct kcached_job {
+ struct list_head list;
+ struct cache_context *dmc;
+ struct bio *bio; /* Original bio */
+ struct dm_io_region disk;
+ struct dm_io_region cache;
+ int index;
+ int rw;
+ int error;
+};
+
+static struct workqueue_struct *kcached_wq;
+static struct work_struct kcached_work;
+static struct kmem_cache *job_cache;
+static mempool_t *job_pool;
+static DEFINE_SPINLOCK(job_lock);
+static LIST_HEAD(complete_jobs);
+static LIST_HEAD(io_jobs);
+static void cache_read_miss(struct cache_context *, struct bio *, int);
+static void cache_write(struct cache_context *, struct bio *);
+static int cache_invalidate_blocks(struct cache_context *, struct bio *);
+static void rxc_uncached_io_callback(unsigned long, void *context);
+static void rxc_start_uncached_io(struct cache_context *, struct bio *);
+
+int dm_io_async_bvec(unsigned int num_regions, struct dm_io_region *where,
+ int rw, struct bio *bio, io_notify_fn fn, void *context)
+{
+ struct kcached_job *job = (struct kcached_job *)context;
+ struct cache_context *dmc = job->dmc;
+ struct dm_io_request iorq;
+
+ iorq.bi_rw = rw;
+ iorq.mem.type = DM_IO_BIO;
+ iorq.mem.ptr.bio = bio;
+ iorq.notify.fn = fn;
+ iorq.notify.context = context;
+ iorq.client = dmc->io_client;
+
+ return dm_io(&iorq, num_regions, where, NULL);
+}
+
+static int jobs_init(void)
+{
+ job_cache = kmem_cache_create("kcached-jobs-wt",
+ sizeof(struct kcached_job),
+ __alignof__(struct kcached_job),
+ 0, NULL);
+ if (!job_cache)
+ return -ENOMEM;
+
+ job_pool = mempool_create(WT_MIN_JOBS, mempool_alloc_slab,
+ mempool_free_slab, job_cache);
+ if (!job_pool) {
+ kmem_cache_destroy(job_cache);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void jobs_exit(void)
+{
+ WARN_ON_ONCE(!list_empty(&complete_jobs));
+ WARN_ON_ONCE(!list_empty(&io_jobs));
+
+ mempool_destroy(job_pool);
+ kmem_cache_destroy(job_cache);
+ job_pool = NULL;
+ job_cache = NULL;
+}
+
+static inline struct kcached_job *pop(struct list_head *jobs)
+{
+ struct kcached_job *job = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&job_lock, flags);
+ if (!list_empty(jobs)) {
+ job = list_entry(jobs->next, struct kcached_job, list);
+ list_del(&job->list);
+ }
+ spin_unlock_irqrestore(&job_lock, flags);
+ return job;
+}
+
+static inline void push(struct list_head *jobs, struct kcached_job *job)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&job_lock, flags);
+ list_add_tail(&job->list, jobs);
+ spin_unlock_irqrestore(&job_lock, flags);
+}
+
+void rxc_io_callback(unsigned long error, void *context)
+{
+ struct kcached_job *job = (struct kcached_job *)context;
+ struct cache_context *dmc = job->dmc;
+ struct bio *bio;
+ int invalid = 0;
+
+ ASSERT(job);
+ bio = job->bio;
+ ASSERT(bio);
+ if (error)
+ DMERR("%s: io error %ld", __func__, error);
+ if (job->rw == READSOURCE || job->rw == WRITESOURCE) {
+ spin_lock_bh(&dmc->cache_spin_lock);
+ if (dmc->cache_state[job->index] != INPROG) {
+ ASSERT(dmc->cache_state[job->index] == INPROG_INVALID);
+ invalid++;
+ }
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ if (error || invalid) {
+ if (invalid)
+ DMERR("%s: cache fill invalidation, sector %lu, size %u",
+ __func__,
+ (unsigned long)bio->bi_iter.bi_sector,
+ bio->bi_iter.bi_size);
+ bio->bi_error = error;
+ bio_io_error(bio);
+ spin_lock_bh(&dmc->cache_spin_lock);
+ dmc->cache_state[job->index] = INVALID;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ goto out;
+ } else {
+ job->rw = WRITECACHE;
+ push(&io_jobs, job);
+ queue_work(kcached_wq, &kcached_work);
+ return;
+ }
+ } else if (job->rw == READCACHE) {
+ spin_lock_bh(&dmc->cache_spin_lock);
+ ASSERT(dmc->cache_state[job->index] == INPROG_INVALID ||
+ dmc->cache_state[job->index] == CACHEREADINPROG);
+ if (dmc->cache_state[job->index] == INPROG_INVALID)
+ invalid++;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ if (!invalid && !error) {
+ bio_endio(bio);
+ spin_lock_bh(&dmc->cache_spin_lock);
+ dmc->cache_state[job->index] = VALID;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ goto out;
+ }
+ /* error || invalid || bounce back to source device */
+ job->rw = READCACHE_DONE;
+ push(&complete_jobs, job);
+ queue_work(kcached_wq, &kcached_work);
+ return;
+ } else {
+ ASSERT(job->rw == WRITECACHE);
+ bio_endio(bio);
+ spin_lock_bh(&dmc->cache_spin_lock);
+ ASSERT((dmc->cache_state[job->index] == INPROG) ||
+ (dmc->cache_state[job->index] == INPROG_INVALID));
+ if (error || dmc->cache_state[job->index] == INPROG_INVALID) {
+ dmc->cache_state[job->index] = INVALID;
+ } else {
+ dmc->cache_state[job->index] = VALID;
+ dmc->cached_blocks++;
+ }
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ }
+out:
+ mempool_free(job, job_pool);
+ if (atomic_dec_and_test(&dmc->nr_jobs))
+ wake_up(&dmc->destroyq);
+}
+EXPORT_SYMBOL(rxc_io_callback);
+
+static int do_io(struct kcached_job *job)
+{
+ int r = 0;
+ struct cache_context *dmc = job->dmc;
+ struct bio *bio = job->bio;
+
+ ASSERT(job->rw == WRITECACHE);
+ dmc->cache_writes++;
+ r = dm_io_async_bvec(1, &job->cache, WRITE, bio, rxc_io_callback, job);
+ ASSERT(r == 0); /* dm_io_async_bvec() must always return 0 */
+ return r;
+}
+
+int rxc_do_complete(struct kcached_job *job)
+{
+ struct bio *bio = job->bio;
+ struct cache_context *dmc = job->dmc;
+
+ ASSERT(job->rw == READCACHE_DONE);
+ /* error || block invalidated while reading from cache */
+ spin_lock_bh(&dmc->cache_spin_lock);
+ dmc->cache_state[job->index] = INVALID;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ mempool_free(job, job_pool);
+ if (atomic_dec_and_test(&dmc->nr_jobs))
+ wake_up(&dmc->destroyq);
+ /* Kick this IO back to the source bdev */
+ rxc_start_uncached_io(dmc, bio);
+ return 0;
+}
+EXPORT_SYMBOL(rxc_do_complete);
+
+static void process_jobs(struct list_head *jobs,
+ int (*fn)(struct kcached_job *))
+{
+ struct kcached_job *job;
+
+ while ((job = pop(jobs)))
+ (void)fn(job);
+}
+
+static void do_work(struct work_struct *work)
+{
+ process_jobs(&complete_jobs, rxc_do_complete);
+ process_jobs(&io_jobs, do_io);
+}
+
+static int kcached_init(struct cache_context *dmc)
+{
+ init_waitqueue_head(&dmc->destroyq);
+ atomic_set(&dmc->nr_jobs, 0);
+ return 0;
+}
+
+void kcached_client_destroy(struct cache_context *dmc)
+{
+ wait_event(dmc->destroyq, !atomic_read(&dmc->nr_jobs));
+}
+
+static unsigned long hash_block(struct cache_context *dmc, sector_t dbn)
+{
+ unsigned long set_number, value;
+
+ value = (dbn >> (dmc->block_shift + dmc->consecutive_shift));
+ set_number = do_div(value, (dmc->size >> dmc->consecutive_shift));
+ return set_number;
+}
+
+static int find_valid_dbn(struct cache_context *dmc, sector_t dbn,
+ int start_index, int *index)
+{
+ int i;
+ int end_index = start_index + dmc->assoc;
+
+ for (i = start_index ; i < end_index ; i++) {
+ if (dbn == dmc->cache[i].dbn &&
+ (dmc->cache_state[i] == VALID ||
+ dmc->cache_state[i] == CACHEREADINPROG ||
+ dmc->cache_state[i] == INPROG)) {
+ *index = i;
+ return dmc->cache_state[i];
+ }
+ }
+ return GENERIC_ERROR;
+}
+
+static void find_invalid_dbn(struct cache_context *dmc,
+ int start_index, int *index)
+{
+ int i;
+ int end_index = start_index + dmc->assoc;
+
+ /* Find INVALID slot that we can reuse */
+ for (i = start_index ; i < end_index ; i++) {
+ if (dmc->cache_state[i] == INVALID) {
+ *index = i;
+ return;
+ }
+ }
+}
+
+static void find_reclaim_dbn(struct cache_context *dmc,
+ int start_index, int *index)
+{
+ int i;
+ int end_index = start_index + dmc->assoc;
+ int set = start_index / dmc->assoc;
+ int slots_searched = 0;
+
+ /* Find the "oldest" VALID slot to recycle. For each set, we keep
+ * track of the next "lru" slot to pick off. Each time we pick off
+ * a VALID entry to recycle we advance this pointer. So we sweep
+ * through the set looking for next blocks to recycle. This
+ * approximates to FIFO (modulo for blocks written through).
+ */
+ i = dmc->set_lru_next[set];
+ while (slots_searched < dmc->assoc) {
+ ASSERT(i >= start_index);
+ ASSERT(i < end_index);
+ if (dmc->cache_state[i] == VALID) {
+ *index = i;
+ break;
+ }
+ slots_searched++;
+ i++;
+ if (i == end_index)
+ i = start_index;
+ }
+ i++;
+ if (i == end_index)
+ i = start_index;
+ dmc->set_lru_next[set] = i;
+}
+
+/* dbn is the starting sector, io_size is the number of sectors. */
+static int cache_lookup(struct cache_context *dmc, struct bio *bio, int *index)
+{
+ sector_t dbn = bio->bi_iter.bi_sector;
+ unsigned long set_number = hash_block(dmc, dbn);
+ int invalid = -1, oldest_clean = -1;
+ int start_index;
+ int ret;
+
+ start_index = dmc->assoc * set_number;
+ ret = find_valid_dbn(dmc, dbn, start_index, index);
+ if (ret == VALID || ret == INPROG || ret == CACHEREADINPROG) {
+ /* We found the exact range of blocks we are looking for */
+ return ret;
+ }
+ ASSERT(ret == -1);
+ find_invalid_dbn(dmc, start_index, &invalid);
+ if (invalid == -1) {
+ /* Search for oldest valid entry */
+ find_reclaim_dbn(dmc, start_index, &oldest_clean);
+ }
+ /* Cache miss : We can't choose an entry marked INPROG,
+ * but choose the oldest INVALID or the oldest VALID entry.
+ */
+ *index = start_index + dmc->assoc;
+ if (invalid != -1)
+ *index = invalid;
+ else if (oldest_clean != -1)
+ *index = oldest_clean;
+ if (*index < (start_index + dmc->assoc))
+ return INVALID;
+ else
+ return GENERIC_ERROR;
+}
+
+static struct kcached_job *new_kcached_job(struct cache_context *dmc,
+ struct bio *bio, int index)
+{
+ struct kcached_job *job;
+
+ job = mempool_alloc(job_pool, GFP_NOIO);
+ if (!job)
+ return NULL;
+ job->disk.bdev = dmc->disk_dev->bdev;
+ job->disk.sector = bio->bi_iter.bi_sector;
+ if (index != -1)
+ job->disk.count = dmc->block_size;
+ else
+ job->disk.count = to_sector(bio->bi_iter.bi_size);
+ job->cache.bdev = dmc->cache_dev->bdev;
+ if (index != -1) {
+ job->cache.sector = index << dmc->block_shift;
+ job->cache.count = dmc->block_size;
+ }
+ job->dmc = dmc;
+ job->bio = bio;
+ job->index = index;
+ job->error = 0;
+ return job;
+}
+
+static void cache_read_miss(struct cache_context *dmc,
+ struct bio *bio, int index)
+{
+ struct kcached_job *job;
+
+ job = new_kcached_job(dmc, bio, index);
+ if (unlikely(!job)) {
+ DMERR("%s: Cannot allocate job\n", __func__);
+ spin_lock_bh(&dmc->cache_spin_lock);
+ dmc->cache_state[index] = INVALID;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ bio->bi_error = -EIO;
+ bio_io_error(bio);
+ } else {
+ job->rw = READSOURCE;
+ atomic_inc(&dmc->nr_jobs);
+ dmc->disk_reads++;
+ dm_io_async_bvec(1, &job->disk, READ,
+ bio, rxc_io_callback, job);
+ }
+}
+
+static void cache_read(struct cache_context *dmc, struct bio *bio)
+{
+ int index;
+ int res;
+
+ spin_lock_bh(&dmc->cache_spin_lock);
+ res = cache_lookup(dmc, bio, &index);
+ if ((res == VALID) &&
+ (dmc->cache[index].dbn == bio->bi_iter.bi_sector)) {
+ struct kcached_job *job;
+
+ dmc->cache_state[index] = CACHEREADINPROG;
+ dmc->cache_hits++;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ job = new_kcached_job(dmc, bio, index);
+ if (unlikely(!job)) {
+ DMERR("cache_read(_hit): Cannot allocate job\n");
+ spin_lock_bh(&dmc->cache_spin_lock);
+ dmc->cache_state[index] = VALID;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ bio->bi_error = -EIO;
+ bio_io_error(bio);
+ } else {
+ job->rw = READCACHE;
+ atomic_inc(&dmc->nr_jobs);
+ dmc->cache_reads++;
+ dm_io_async_bvec(1, &job->cache, READ, bio,
+ rxc_io_callback, job);
+ }
+ return;
+ }
+ if (cache_invalidate_blocks(dmc, bio) > 0) {
+ /* A non zero return indicates an inprog invalidation */
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ rxc_start_uncached_io(dmc, bio);
+ return;
+ }
+ if (res == -1 || res >= INPROG) {
+ /* We either didn't find a cache slot in the set we were
+ * looking at or the block we are trying to read is being
+ * refilled into cache.
+ */
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ rxc_start_uncached_io(dmc, bio);
+ return;
+ }
+ /* (res == INVALID) Cache Miss And we found cache blocks to replace
+ * Claim the cache blocks before giving up the spinlock
+ */
+ if (dmc->cache_state[index] == VALID) {
+ dmc->cached_blocks--;
+ dmc->replace++;
+ }
+ dmc->cache_state[index] = INPROG;
+ dmc->cache[index].dbn = bio->bi_iter.bi_sector;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ cache_read_miss(dmc, bio, index);
+}
+
+static int cache_invalidate_block_set(struct cache_context *dmc, int set,
+ sector_t io_start, sector_t io_end,
+ int rw, int *inprog_inval)
+{
+ int start_index, end_index, i;
+ int invalidations = 0;
+
+ start_index = dmc->assoc * set;
+ end_index = start_index + dmc->assoc;
+ for (i = start_index ; i < end_index ; i++) {
+ sector_t start_dbn = dmc->cache[i].dbn;
+ sector_t end_dbn = start_dbn + dmc->block_size;
+
+ if (dmc->cache_state[i] == INVALID ||
+ dmc->cache_state[i] == INPROG_INVALID)
+ continue;
+ if ((io_start >= start_dbn && io_start < end_dbn) ||
+ (io_end >= start_dbn && io_end < end_dbn)) {
+ if (rw == WRITE)
+ dmc->wr_invalidates++;
+ else
+ dmc->rd_invalidates++;
+ invalidations++;
+ if (dmc->cache_state[i] == VALID) {
+ dmc->cached_blocks--;
+ dmc->cache_state[i] = INVALID;
+ } else if (dmc->cache_state[i] >= INPROG) {
+ (*inprog_inval)++;
+ dmc->cache_state[i] = INPROG_INVALID;
+ DMERR("%s: sector %lu, size %lu, rw %d",
+ __func__, (unsigned long)io_start,
+ (unsigned long)io_end - (unsigned long)io_start, rw);
+ }
+ }
+ }
+ return invalidations;
+}
+
+static int cache_invalidate_blocks(struct cache_context *dmc, struct bio *bio)
+{
+ sector_t io_start = bio->bi_iter.bi_sector;
+ sector_t io_end = bio->bi_iter.bi_sector + (to_sector(bio->bi_iter.bi_size) - 1);
+ int start_set, end_set;
+ int inprog_inval_start = 0, inprog_inval_end = 0;
+
+ start_set = hash_block(dmc, io_start);
+ end_set = hash_block(dmc, io_end);
+ cache_invalidate_block_set(dmc, start_set, io_start, io_end,
+ bio_data_dir(bio), &inprog_inval_start);
+ if (start_set != end_set)
+ cache_invalidate_block_set(dmc, end_set, io_start,
+ io_end, bio_data_dir(bio),
+ &inprog_inval_end);
+ return (inprog_inval_start + inprog_inval_end);
+}
+
+static void cache_write(struct cache_context *dmc, struct bio *bio)
+{
+ int index;
+ int res;
+ struct kcached_job *job;
+
+ spin_lock_bh(&dmc->cache_spin_lock);
+ if (cache_invalidate_blocks(dmc, bio) > 0) {
+ /* A non zero return indicates an inprog invalidation */
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ rxc_start_uncached_io(dmc, bio);
+ return;
+ }
+ res = cache_lookup(dmc, bio, &index);
+ ASSERT(res == -1 || res == INVALID);
+ if (res == -1) {
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ rxc_start_uncached_io(dmc, bio);
+ return;
+ }
+ if (dmc->cache_state[index] == VALID) {
+ dmc->cached_blocks--;
+ dmc->cache_wr_replace++;
+ }
+ dmc->cache_state[index] = INPROG;
+ dmc->cache[index].dbn = bio->bi_iter.bi_sector;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ job = new_kcached_job(dmc, bio, index);
+ if (unlikely(!job)) {
+ DMERR("%s: Cannot allocate job\n", __func__);
+ spin_lock_bh(&dmc->cache_spin_lock);
+ dmc->cache_state[index] = INVALID;
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ bio->bi_error = -EIO;
+ bio_io_error(bio);
+ return;
+ }
+ job->rw = WRITESOURCE;
+ atomic_inc(&job->dmc->nr_jobs);
+ dmc->disk_writes++;
+ dm_io_async_bvec(1, &job->disk, WRITE, bio, rxc_io_callback, job);
+}
+
+int rxc_map(struct dm_target *ti, struct bio *bio)
+{
+ struct cache_context *dmc = (struct cache_context *)ti->private;
+
+ if (bio_barrier(bio))
+ return -EOPNOTSUPP;
+
+ ASSERT(to_sector(bio->bi_iter.bi_size) <= dmc->block_size);
+ if (bio_data_dir(bio) == READ)
+ dmc->reads++;
+ else
+ dmc->writes++;
+
+ if (to_sector(bio->bi_iter.bi_size) != dmc->block_size) {
+ spin_lock_bh(&dmc->cache_spin_lock);
+ (void)cache_invalidate_blocks(dmc, bio);
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ rxc_start_uncached_io(dmc, bio);
+ } else {
+ if (bio_data_dir(bio) == READ)
+ cache_read(dmc, bio);
+ else
+ cache_write(dmc, bio);
+ }
+ return DM_MAPIO_SUBMITTED;
+}
+EXPORT_SYMBOL(rxc_map);
+
+static void rxc_uncached_io_callback(unsigned long error, void *context)
+{
+ struct kcached_job *job = (struct kcached_job *)context;
+ struct cache_context *dmc = job->dmc;
+
+ spin_lock_bh(&dmc->cache_spin_lock);
+ if (bio_data_dir(job->bio) == READ)
+ dmc->uncached_reads++;
+ else
+ dmc->uncached_writes++;
+ (void)cache_invalidate_blocks(dmc, job->bio);
+ spin_unlock_bh(&dmc->cache_spin_lock);
+ if (error) {
+ job->bio->bi_error = error;
+ bio_io_error(job->bio);
+ } else {
+ bio_endio(job->bio);
+ }
+ mempool_free(job, job_pool);
+ if (atomic_dec_and_test(&dmc->nr_jobs))
+ wake_up(&dmc->destroyq);
+}
+
+static void rxc_start_uncached_io(struct cache_context *dmc, struct bio *bio)
+{
+ int is_write = (bio_data_dir(bio) == WRITE);
+ struct kcached_job *job;
+
+ job = new_kcached_job(dmc, bio, -1);
+ if (unlikely(!job)) {
+ bio->bi_error = -EIO;
+ bio_io_error(bio);
+ return;
+ }
+ atomic_inc(&dmc->nr_jobs);
+ if (bio_data_dir(job->bio) == READ)
+ dmc->disk_reads++;
+ else
+ dmc->disk_writes++;
+ dm_io_async_bvec(1, &job->disk, ((is_write) ? WRITE : READ),
+ bio, rxc_uncached_io_callback, job);
+}
+
+static inline int rxc_get_dev(struct dm_target *ti, char *pth,
+ struct dm_dev **dmd, char *dmc_dname,
+ sector_t tilen)
+{
+ int rc;
+
+ rc = dm_get_device(ti, pth, dm_table_get_mode(ti->table), dmd);
+ if (!rc)
+ strncpy(dmc_dname, pth, DEV_PATHLEN);
+ return rc;
+}
+
+/* Construct a cache mapping.
+ * arg[0]: path to source device
+ * arg[1]: path to cache device
+ * arg[2]: cache size (in blocks)
+ * arg[3]: cache associativity
+ */
+static int cache_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct cache_context *dmc;
+ unsigned int consecutive_blocks;
+ sector_t i, order, tmpsize;
+ sector_t data_size, dev_size;
+ int r = -EINVAL;
+
+ if (argc < 2) {
+ ti->error = "rxc: Need at least 2 arguments";
+ goto construct_fail;
+ }
+
+ dmc = kzalloc(sizeof(*dmc), GFP_KERNEL);
+ if (!dmc) {
+ ti->error = "rxc: Failed to allocate cache context";
+ r = -ENOMEM;
+ goto construct_fail;
+ }
+
+ dmc->tgt = ti;
+
+ if (rxc_get_dev(ti, argv[0], &dmc->disk_dev,
+ dmc->disk_devname, ti->len)) {
+ ti->error = "rxc: Disk device lookup failed";
+ goto construct_fail1;
+ }
+ if (strncmp(argv[1], "/dev/rxd", 8) != 0) {
+ pr_err("%s: %s is not a valid cache device for rxcache.",
+ DM_MSG_PREFIX, argv[1]);
+ ti->error = "rxc: Invalid cache device. Not an rxdsk volume.";
+ goto construct_fail2;
+ }
+ if (rxc_get_dev(ti, argv[1], &dmc->cache_dev, dmc->cache_devname, 0)) {
+ ti->error = "rxc: Cache device lookup failed";
+ goto construct_fail2;
+ }
+
+ dmc->io_client = dm_io_client_create();
+ if (IS_ERR(dmc->io_client)) {
+ r = PTR_ERR(dmc->io_client);
+ ti->error = "Failed to create io client\n";
+ goto construct_fail3;
+ }
+
+ r = kcached_init(dmc);
+ if (r) {
+ ti->error = "Failed to initialize kcached";
+ goto construct_fail4;
+ }
+ dmc->block_size = CACHE_BLOCK_SIZE;
+ dmc->block_shift = ffs(dmc->block_size) - 1;
+ dmc->block_mask = dmc->block_size - 1;
+
+ if (argc >= 3) {
+ if (kstrtoul(argv[2], 10, (unsigned long *)&dmc->size)) {
+ ti->error = "rxc: Invalid cache size";
+ r = -EINVAL;
+ goto construct_fail5;
+ }
+ } else {
+ dmc->size = to_sector(dmc->cache_dev->bdev->bd_inode->i_size);
+ }
+
+ if (argc >= 4) {
+ if (kstrtoint(argv[3], 10, &dmc->assoc)) {
+ ti->error = "rxc: Invalid cache associativity";
+ r = -EINVAL;
+ goto construct_fail5;
+ }
+ if (!dmc->assoc || (dmc->assoc & (dmc->assoc - 1)) ||
+ dmc->size < dmc->assoc) {
+ ti->error = "rxc: Invalid cache associativity";
+ r = -EINVAL;
+ goto construct_fail5;
+ }
+ } else {
+ dmc->assoc = DEFAULT_CACHE_ASSOC;
+ }
+
+ /* Convert size (in sectors) to blocks. Then round size
+ * (in blocks now) down to a multiple of associativity
+ */
+ do_div(dmc->size, dmc->block_size);
+ tmpsize = dmc->size;
+ do_div(tmpsize, dmc->assoc);
+ dmc->size = tmpsize * dmc->assoc;
+
+ dev_size = to_sector(dmc->cache_dev->bdev->bd_inode->i_size);
+ data_size = dmc->size * dmc->block_size;
+ if (data_size > dev_size) {
+ DMERR("Requested cache size exceeds the cache device's capacity (%lu>%lu)",
+ (unsigned long)data_size, (unsigned long)dev_size);
+ ti->error = "rxc: Invalid cache size";
+ r = -EINVAL;
+ goto construct_fail5;
+ }
+
+ consecutive_blocks = dmc->assoc;
+ dmc->consecutive_shift = ffs(consecutive_blocks) - 1;
+
+ order = dmc->size * sizeof(struct cache_block);
+ DMINFO("Allocate %luKB (%luB per) mem for %lu-entry cache"
+ "(capacity:%luMB, associativity:%u, block size:%u sectors(%uKB))",
+ (unsigned long)order >> 10,
+ (unsigned long)sizeof(struct cache_block),
+ (unsigned long)dmc->size,
+ (unsigned long)data_size >> (20 - SECTOR_SHIFT),
+ dmc->assoc, dmc->block_size,
+ dmc->block_size >> (10 - SECTOR_SHIFT));
+
+ dmc->cache = vmalloc(order);
+ if (!dmc->cache)
+ goto construct_fail6;
+ dmc->cache_state = vmalloc(dmc->size);
+ if (!dmc->cache_state)
+ goto construct_fail7;
+
+ order = (dmc->size >> dmc->consecutive_shift) * sizeof(u32);
+ dmc->set_lru_next = vmalloc(order);
+ if (!dmc->set_lru_next)
+ goto construct_fail8;
+
+ for (i = 0; i < dmc->size ; i++) {
+ dmc->cache[i].dbn = 0;
+ dmc->cache_state[i] = INVALID;
+ }
+
+ /* Initialize the point where LRU sweeps begin for each set */
+ for (i = 0 ; i < (dmc->size >> dmc->consecutive_shift) ; i++)
+ dmc->set_lru_next[i] = i * dmc->assoc;
+
+ spin_lock_init(&dmc->cache_spin_lock);
+
+ dmc->reads = 0;
+ dmc->writes = 0;
+ dmc->cache_hits = 0;
+ dmc->replace = 0;
+ dmc->wr_invalidates = 0;
+ dmc->rd_invalidates = 0;
+ dmc->cached_blocks = 0;
+ dmc->cache_wr_replace = 0;
+
+ r = dm_set_target_max_io_len(ti, dmc->block_size);
+ if (r)
+ goto construct_fail8;
+ ti->private = dmc;
+
+ return 0;
+
+construct_fail8:
+ vfree(dmc->cache_state);
+construct_fail7:
+ vfree(dmc->cache);
+construct_fail6:
+ r = -ENOMEM;
+ ti->error = "Unable to allocate memory";
+construct_fail5:
+ kcached_client_destroy(dmc);
+construct_fail4:
+ dm_io_client_destroy(dmc->io_client);
+construct_fail3:
+ dm_put_device(ti, dmc->cache_dev);
+construct_fail2:
+ dm_put_device(ti, dmc->disk_dev);
+construct_fail1:
+ kfree(dmc);
+construct_fail:
+ return r;
+}
+
+static void cache_dtr(struct dm_target *ti)
+{
+ struct cache_context *dmc = (struct cache_context *)ti->private;
+
+ kcached_client_destroy(dmc);
+
+ if (dmc->reads + dmc->writes > 0) {
+ DMINFO("stats:\n\treads(%lu), writes(%lu)\n",
+ dmc->reads, dmc->writes);
+ DMINFO("\tcache hits(%lu),replacement(%lu), write replacement(%lu)\n"
+ "\tread invalidates(%lu), write invalidates(%lu)\n",
+ dmc->cache_hits, dmc->replace, dmc->cache_wr_replace,
+ dmc->rd_invalidates, dmc->wr_invalidates);
+ DMINFO("conf:\n\tcapacity(%luM), associativity(%u), block size(%uK)\n"
+ "\ttotal blocks(%lu), cached blocks(%lu)\n",
+ (unsigned long)dmc->size * dmc->block_size >> 11,
+ dmc->assoc, dmc->block_size >> (10 - SECTOR_SHIFT),
+ (unsigned long)dmc->size, dmc->cached_blocks);
+ }
+
+ dm_io_client_destroy(dmc->io_client);
+ vfree(dmc->cache);
+ vfree(dmc->cache_state);
+ vfree(dmc->set_lru_next);
+
+ dm_put_device(ti, dmc->disk_dev);
+ dm_put_device(ti, dmc->cache_dev);
+ kfree(dmc);
+}
+
+static void rxc_status_info(struct cache_context *dmc, status_type_t type,
+ char *result, unsigned int maxlen)
+{
+ int sz = 0;
+
+ DMEMIT("stats:\n\treads(%lu), writes(%lu)\n", dmc->reads, dmc->writes);
+ DMEMIT("\tcache hits(%lu) replacement(%lu), write replacement(%lu)\n"
+ "\tread invalidates(%lu), write invalidates(%lu)\n"
+ "\tuncached reads(%lu), uncached writes(%lu)\n"
+ "\tdisk reads(%lu), disk writes(%lu)\n"
+ "\tcache reads(%lu), cache writes(%lu)\n",
+ dmc->cache_hits, dmc->replace, dmc->cache_wr_replace,
+ dmc->rd_invalidates, dmc->wr_invalidates,
+ dmc->uncached_reads, dmc->uncached_writes,
+ dmc->disk_reads, dmc->disk_writes,
+ dmc->cache_reads, dmc->cache_writes);
+}
+
+static void rxc_status_table(struct cache_context *dmc, status_type_t type,
+ char *result, unsigned int maxlen)
+{
+ int sz = 0;
+
+ DMEMIT("conf:\n\trxd dev (%s), disk dev (%s) mode (%s)\n"
+ "\tcapacity(%luM), associativity(%u), block size(%uK)\n"
+ "\ttotal blocks(%lu), cached blocks(%lu)\n",
+ dmc->cache_devname, dmc->disk_devname, "WRITETHROUGH",
+ (unsigned long)dmc->size * dmc->block_size >> 11, dmc->assoc,
+ dmc->block_size >> (10 - SECTOR_SHIFT),
+ (unsigned long)dmc->size, dmc->cached_blocks);
+}
+
+static void cache_status(struct dm_target *ti, status_type_t type,
+ unsigned status_flags, char *result,
+ unsigned int maxlen)
+{
+ struct cache_context *dmc = (struct cache_context *)ti->private;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ rxc_status_info(dmc, type, result, maxlen);
+ break;
+ case STATUSTYPE_TABLE:
+ rxc_status_table(dmc, type, result, maxlen);
+ break;
+ }
+}
+
+static struct target_type cache_target = {
+ .name = "rxcache",
+ .version = {3, 4, 0},
+ .module = THIS_MODULE,
+ .ctr = cache_ctr,
+ .dtr = cache_dtr,
+ .map = rxc_map,
+ .status = cache_status,
+};
+
+int __init rxc_init(void)
+{
+ int ret;
+
+ ret = jobs_init();
+ if (ret)
+ return ret;
+
+ kcached_wq = create_singlethread_workqueue("kcached");
+ if (!kcached_wq)
+ return -ENOMEM;
+
+ INIT_WORK(&kcached_work, do_work);
+
+ ret = dm_register_target(&cache_target);
+ if (ret < 0)
+ return ret;
+ return 0;
+}
+
+void rxc_exit(void)
+{
+ dm_unregister_target(&cache_target);
+ jobs_exit();
+ destroy_workqueue(kcached_wq);
+}
+
+module_init(rxc_init);
+module_exit(rxc_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Petros Koutoupis <petros@petroskoutoupis.com>");
+MODULE_DESCRIPTION("RapidCache (rxc) DM target is a write-through caching target with RapidDisk volumes.");
+MODULE_VERSION(VERSION_STR);
diff -uNpr linux-next.orig/drivers/block/rapiddisk/rxdsk.c linux-next/drivers/block/rapiddisk/rxdsk.c
--- linux-next.orig/drivers/block/rapiddisk/rxdsk.c 1969-12-31 18:00:00.000000000 -0600
+++ linux-next/drivers/block/rapiddisk/rxdsk.c 2015-09-20 15:31:01.720603545 -0500
@@ -0,0 +1,777 @@
+/*******************************************************************************
+ ** Copyright (c) 2011-2015 Petros Koutoupis
+ **
+ ** description: RapidDisk is an enhanced Linux RAM disk module to dynamically
+ ** create, remove, and resize RAM drives. RapidDisk supports both volatile
+ ** and non-volatile memory.
+ **
+ ** This file is licensed under GPLv2.
+ **
+ ******************************************************************************/
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/version.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/fs.h>
+#include <linux/hdreg.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/errno.h>
+#include <linux/radix-tree.h>
+#include <linux/io.h>
+
+#define VERSION_STR "3.4"
+#define RDPREFIX "rxd"
+#define BYTES_PER_SECTOR 512
+#define MAX_RDSKS 128
+#define DEFAULT_MAX_SECTS 127
+#define DEFAULT_REQUESTS 128
+#define GENERIC_ERROR -1
+
+#define FREE_BATCH 16
+#define SECTOR_SHIFT 9
+#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS BIT(PAGE_SECTORS_SHIFT)
+
+#define NON_VOLATILE_MEMORY 0
+#define VOLATILE_MEMORY 1
+
+/* ioctls */
+#define INVALID_CDQUERY_IOCTL 0x5331
+#define RXD_GET_STATS 0x0529
+
+static DEFINE_MUTEX(sysfs_mutex);
+static DEFINE_MUTEX(rxioctl_mutex);
+
+struct rdsk_device {
+ int num;
+ bool volatile_memory;
+ struct request_queue *rdsk_queue;
+ struct gendisk *rdsk_disk;
+ struct list_head rdsk_list;
+ unsigned long long max_blk_alloc; /* rdsk: highest sector write */
+ unsigned long long start_addr; /* rdsk-nv: start address */
+ unsigned long long end_addr; /* rdsk-nv: end address */
+ unsigned long long size;
+ unsigned long error_cnt;
+ spinlock_t rdsk_lock; /* spinlock page insertions */
+ struct radix_tree_root rdsk_pages;
+};
+
+static int rdsk_ma_no, rxcnt; /* no. of attached devices */
+static int max_rxcnt = MAX_RDSKS;
+static int max_sectors = DEFAULT_MAX_SECTS, nr_requests = DEFAULT_REQUESTS;
+static LIST_HEAD(rdsk_devices);
+static struct kobject *rdsk_kobj;
+
+module_param(max_rxcnt, int, S_IRUGO);
+MODULE_PARM_DESC(max_rxcnt, " Maximum number of RAM Disks. (Default = 128)");
+module_param(max_sectors, int, S_IRUGO);
+MODULE_PARM_DESC(max_sectors, " Maximum sectors (in KB) for the request queue. (Default = 127)");
+module_param(nr_requests, int, S_IRUGO);
+MODULE_PARM_DESC(nr_requests, " Number of requests at a given time for the request queue. (Default = 128)");
+
+static int rdsk_do_bvec(struct rdsk_device *, struct page *,
+ unsigned int, unsigned int, int, sector_t);
+static int rdsk_ioctl(struct block_device *, fmode_t,
+ unsigned int, unsigned long);
+static void rdsk_make_request(struct request_queue *, struct bio *);
+static int attach_device(int, int);
+static int attach_nv_device(int, unsigned long long, unsigned long long);
+static int detach_device(int);
+static int resize_device(int, int);
+static ssize_t mgmt_show(struct kobject *, struct kobj_attribute *, char *);
+static ssize_t mgmt_store(struct kobject *, struct kobj_attribute *,
+ const char *, size_t);
+
+static ssize_t mgmt_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ int len;
+ struct rdsk_device *rdsk;
+
+ len = sprintf(buf, "RapidDisk (rxdsk) %s\n\nMaximum Number of Attachable Devices: %d\nNumber of Attached Devices: %d\nMax Sectors (KB): %d\nNumber of Requests: %d\n\n",
+ VERSION_STR, max_rxcnt, rxcnt, max_sectors, nr_requests);
+ list_for_each_entry(rdsk, &rdsk_devices, rdsk_list) {
+ if (rdsk->volatile_memory == VOLATILE_MEMORY) {
+ len += sprintf(buf + len, "rxd%d\tSize: %llu MBs\tErrors: %lu\n",
+ rdsk->num, (rdsk->size / 1024 / 1024),
+ rdsk->error_cnt);
+ } else {
+ len += sprintf(buf + len, "rxd%d\tSize: %llu MBs\tErrors: %lu\tStart Address: %llu\tEnd Address: %llu\n",
+ rdsk->num, (rdsk->size / 1024 / 1024),
+ rdsk->error_cnt, rdsk->start_addr,
+ rdsk->end_addr);
+ }
+ }
+ return len;
+}
+
+static ssize_t mgmt_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buffer, size_t count)
+{
+ int num, size, err = (int)count;
+ char *ptr, *buf;
+ unsigned long long start_addr, end_addr;
+
+ if (!buffer || count > PAGE_SIZE)
+ return -EINVAL;
+
+ mutex_lock(&sysfs_mutex);
+ buf = (char *)__get_free_page(GFP_KERNEL);
+ if (!buf) {
+ err = -ENOMEM;
+ goto write_sysfs_error;
+ }
+ strcpy(buf, buffer);
+
+ if (!strncmp("rxdsk attach ", buffer, 13)) {
+ ptr = buf + 13;
+ num = simple_strtoul(ptr, &ptr, 0);
+ size = simple_strtoul(ptr + 1, &ptr, 0);
+
+ if (attach_device(num, size) != 0) {
+ pr_err("%s: Unable to attach rxd%d\n", RDPREFIX, num);
+ err = -EINVAL;
+ }
+ } else if (!strncmp("rxdsk attach-nv ", buffer, 16)) {
+ /* "rdsk attach-nv num start end" */
+ ptr = buf + 16;
+ num = simple_strtoul(ptr, &ptr, 0);
+ start_addr = simple_strtoull(ptr + 1, &ptr, 0);
+ end_addr = simple_strtoull(ptr + 1, &ptr, 0);
+
+ if (attach_nv_device(num, start_addr, end_addr) != 0) {
+ pr_err("%s: Unable to attach rxd%d\n", RDPREFIX, num);
+ err = -EINVAL;
+ }
+ } else if (!strncmp("rxdsk detach ", buffer, 13)) {
+ ptr = buf + 13;
+ num = simple_strtoul(ptr, &ptr, 0);
+ if (detach_device(num) != 0) {
+ pr_err("%s: Unable to detach rxd%d\n", RDPREFIX, num);
+ err = -EINVAL;
+ }
+ } else if (!strncmp("rxdsk resize ", buffer, 13)) {
+ ptr = buf + 13;
+ num = simple_strtoul(ptr, &ptr, 0);
+ size = simple_strtoul(ptr + 1, &ptr, 0);
+
+ if (resize_device(num, size) != 0) {
+ pr_err("%s: Unable to resize rxd%d\n", RDPREFIX, num);
+ err = -EINVAL;
+ }
+ } else {
+ pr_err("%s: Unsupported command: %s\n", RDPREFIX, buffer);
+ err = -EINVAL;
+ }
+
+ free_page((unsigned long)buf);
+write_sysfs_error:
+ mutex_unlock(&sysfs_mutex);
+ return err;
+}
+
+static struct kobj_attribute mgmt_attribute =
+ __ATTR(mgmt, 0664, mgmt_show, mgmt_store);
+
+static struct attribute *attrs[] = {
+ &mgmt_attribute.attr,
+ NULL,
+};
+
+static struct attribute_group attr_group = {
+ .attrs = attrs,
+};
+
+static struct page *rdsk_lookup_page(struct rdsk_device *rdsk, sector_t sector)
+{
+ pgoff_t idx;
+ struct page *page;
+
+ rcu_read_lock();
+ idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
+ page = radix_tree_lookup(&rdsk->rdsk_pages, idx);
+ rcu_read_unlock();
+
+ WARN_ON_ONCE(page && page->index != idx);
+
+ return page;
+}
+
+static struct page *rdsk_insert_page(struct rdsk_device *rdsk, sector_t sector)
+{
+ pgoff_t idx;
+ struct page *page;
+ gfp_t gfp_flags;
+
+ page = rdsk_lookup_page(rdsk, sector);
+ if (page)
+ return page;
+
+ /*
+ * Must use NOIO because we don't want to recurse back into the
+ * block or filesystem layers from page reclaim.
+ */
+ gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+ gfp_flags |= __GFP_HIGHMEM;
+#endif
+ page = alloc_page(gfp_flags);
+ if (!page)
+ return NULL;
+
+ if (radix_tree_preload(GFP_NOIO)) {
+ __free_page(page);
+ return NULL;
+ }
+
+ spin_lock(&rdsk->rdsk_lock);
+ idx = sector >> PAGE_SECTORS_SHIFT;
+ if (radix_tree_insert(&rdsk->rdsk_pages, idx, page)) {
+ __free_page(page);
+ page = radix_tree_lookup(&rdsk->rdsk_pages, idx);
+ WARN_ON_ONCE(!page);
+ WARN_ON_ONCE(page->index != idx);
+ } else {
+ page->index = idx;
+ }
+ spin_unlock(&rdsk->rdsk_lock);
+
+ radix_tree_preload_end();
+
+ return page;
+}
+
+static void rdsk_zero_page(struct rdsk_device *rdsk, sector_t sector)
+{
+ struct page *page;
+
+ page = rdsk_lookup_page(rdsk, sector);
+ if (page)
+ clear_highpage(page);
+}
+
+static void rdsk_free_pages(struct rdsk_device *rdsk)
+{
+ unsigned long pos = 0;
+ struct page *pages[FREE_BATCH];
+ int nr_pages;
+
+ do {
+ int i;
+
+ nr_pages = radix_tree_gang_lookup(&rdsk->rdsk_pages,
+ (void **)pages, pos,
+ FREE_BATCH);
+
+ for (i = 0; i < nr_pages; i++) {
+ void *ret;
+
+ WARN_ON_ONCE(pages[i]->index < pos);
+ pos = pages[i]->index;
+ ret = radix_tree_delete(&rdsk->rdsk_pages, pos);
+ WARN_ON_ONCE(!ret || ret != pages[i]);
+ __free_page(pages[i]);
+ }
+ pos++;
+ } while (nr_pages == FREE_BATCH);
+}
+
+static int copy_to_rdsk_setup(struct rdsk_device *rdsk,
+ sector_t sector, size_t n)
+{
+ unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+ size_t copy;
+
+ copy = min_t(size_t, n, PAGE_SIZE - offset);
+ if (!rdsk_insert_page(rdsk, sector))
+ return -ENOMEM;
+ if (copy < n) {
+ sector += copy >> SECTOR_SHIFT;
+ if (!rdsk_insert_page(rdsk, sector))
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void discard_from_rdsk(struct rdsk_device *rdsk,
+ sector_t sector, size_t n)
+{
+ while (n >= PAGE_SIZE) {
+ rdsk_zero_page(rdsk, sector);
+ sector += PAGE_SIZE >> SECTOR_SHIFT;
+ n -= PAGE_SIZE;
+ }
+}
+
+static void copy_to_rdsk(struct rdsk_device *rdsk, const void *src,
+ sector_t sector, size_t n)
+{
+ struct page *page;
+ void *dst;
+ unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+ size_t copy;
+
+ copy = min_t(size_t, n, PAGE_SIZE - offset);
+ page = rdsk_lookup_page(rdsk, sector);
+ WARN_ON_ONCE(!page);
+
+ dst = kmap_atomic(page);
+ memcpy(dst + offset, src, copy);
+ kunmap_atomic(dst);
+
+ if (copy < n) {
+ src += copy;
+ sector += copy >> SECTOR_SHIFT;
+ copy = n - copy;
+ page = rdsk_lookup_page(rdsk, sector);
+ WARN_ON_ONCE(!page);
+ dst = kmap_atomic(page);
+ memcpy(dst, src, copy);
+ kunmap_atomic(dst);
+ }
+
+ if ((sector + (n / BYTES_PER_SECTOR)) > rdsk->max_blk_alloc)
+ rdsk->max_blk_alloc = (sector + (n / BYTES_PER_SECTOR));
+}
+
+static void copy_from_rdsk(void *dst, struct rdsk_device *rdsk,
+ sector_t sector, size_t n)
+{
+ struct page *page;
+ void *src;
+ unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+ size_t copy;
+
+ copy = min_t(size_t, n, PAGE_SIZE - offset);
+ page = rdsk_lookup_page(rdsk, sector);
+
+ if (page) {
+ src = kmap_atomic(page);
+ memcpy(dst, src + offset, copy);
+ kunmap_atomic(src);
+ } else {
+ memset(dst, 0, copy);
+ }
+
+ if (copy < n) {
+ dst += copy;
+ sector += copy >> SECTOR_SHIFT;
+ copy = n - copy;
+ page = rdsk_lookup_page(rdsk, sector);
+ if (page) {
+ src = kmap_atomic(page);
+ memcpy(dst, src, copy);
+ kunmap_atomic(src);
+ } else {
+ memset(dst, 0, copy);
+ }
+ }
+}
+
+static int rdsk_do_bvec(struct rdsk_device *rdsk, struct page *page,
+ unsigned int len, unsigned int off, int rw,
+ sector_t sector){
+ void *mem;
+ int err = 0;
+ void __iomem *vaddr = NULL;
+ resource_size_t phys_addr = (rdsk->start_addr +
+ (sector * BYTES_PER_SECTOR));
+
+ if (rdsk->volatile_memory == VOLATILE_MEMORY) {
+ if (rw != READ) {
+ err = copy_to_rdsk_setup(rdsk, sector, len);
+ if (err)
+ goto out;
+ }
+ } else {
+ if (((sector * BYTES_PER_SECTOR) + len) > rdsk->size) {
+ pr_err("%s: Beyond rxd%d boundary (offset: %llu len: %u).\n",
+ RDPREFIX, rdsk->num,
+ (unsigned long long)phys_addr, len);
+ return -EFAULT;
+ }
+
+ vaddr = ioremap_nocache(phys_addr, len);
+ if (!vaddr) {
+ pr_err("%s: Unable to map memory at address %llu of size %lu\n",
+ RDPREFIX, (unsigned long long)phys_addr,
+ (unsigned long)len);
+ return -EFAULT;
+ }
+ }
+ mem = kmap_atomic(page);
+ if (rw == READ) {
+ if (rdsk->volatile_memory == VOLATILE_MEMORY) {
+ copy_from_rdsk(mem + off, rdsk, sector, len);
+ flush_dcache_page(page);
+ } else {
+ memcpy(mem, vaddr, len);
+ }
+ } else {
+ if (rdsk->volatile_memory == VOLATILE_MEMORY) {
+ flush_dcache_page(page);
+ copy_to_rdsk(rdsk, mem + off, sector, len);
+ } else {
+ memcpy(vaddr, mem, len);
+ }
+ }
+ kunmap_atomic(mem);
+ if (rdsk->volatile_memory == NON_VOLATILE_MEMORY)
+ iounmap(vaddr);
+out:
+ return err;
+}
+
+static void
+rdsk_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct block_device *bdev = bio->bi_bdev;
+ struct rdsk_device *rdsk = bdev->bd_disk->private_data;
+ int rw;
+ sector_t sector;
+ struct bio_vec bvec;
+ struct bvec_iter iter;
+ int err = -EIO;
+
+ sector = bio->bi_iter.bi_sector;
+ if ((sector + bio_sectors(bio)) > get_capacity(bdev->bd_disk))
+ goto io_error;
+
+ err = 0;
+ if (unlikely(bio->bi_rw & REQ_DISCARD)) {
+ discard_from_rdsk(rdsk, sector, bio->bi_iter.bi_size);
+ goto out;
+ }
+ rw = bio_rw(bio);
+ if (rw == READA)
+ rw = READ;
+
+ bio_for_each_segment(bvec, bio, iter) {
+ unsigned int len = bvec.bv_len;
+
+ err = rdsk_do_bvec(rdsk, bvec.bv_page, len,
+ bvec.bv_offset, rw, sector);
+ if (err) {
+ rdsk->error_cnt++;
+ goto io_error;
+ }
+ sector += len >> SECTOR_SHIFT;
+ }
+
+out:
+ bio_endio(bio);
+ return;
+io_error:
+ bio->bi_error = err;
+ bio_io_error(bio);
+}
+
+static int rdsk_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+ loff_t size;
+ int error = 0;
+ struct rdsk_device *rdsk = bdev->bd_disk->private_data;
+
+ switch (cmd) {
+ case BLKGETSIZE:
+ size = bdev->bd_inode->i_size;
+ if ((size >> 9) > ~0UL)
+ return -EFBIG;
+ return copy_to_user((void __user *)arg, &size,
+ sizeof(size)) ? -EFAULT : 0;
+ case BLKGETSIZE64:
+ return copy_to_user((void __user *)arg,
+ &bdev->bd_inode->i_size,
+ sizeof(bdev->bd_inode->i_size)) ? -EFAULT : 0;
+ case BLKFLSBUF:
+ if (rdsk->volatile_memory == NON_VOLATILE_MEMORY)
+ return 0;
+ /* We are killing the RAM disk data. */
+ mutex_lock(&rxioctl_mutex);
+ mutex_lock(&bdev->bd_mutex);
+ error = -EBUSY;
+ if (bdev->bd_openers <= 1) {
+ kill_bdev(bdev);
+ rdsk_free_pages(rdsk);
+ error = 0;
+ }
+ mutex_unlock(&bdev->bd_mutex);
+ mutex_unlock(&rxioctl_mutex);
+ return error;
+ case INVALID_CDQUERY_IOCTL:
+ return -EINVAL;
+ case RXD_GET_STATS:
+ return copy_to_user((void __user *)arg,
+ &rdsk->max_blk_alloc,
+ sizeof(rdsk->max_blk_alloc)) ? -EFAULT : 0;
+ case BLKPBSZGET:
+ case BLKBSZGET:
+ case BLKSSZGET:
+ size = BYTES_PER_SECTOR;
+ return copy_to_user((void __user *)arg, &size,
+ sizeof(size)) ? -EFAULT : 0;
+ }
+
+ pr_warn("%s: 0x%x invalid ioctl.\n", RDPREFIX, cmd);
+ return -ENOTTY; /* unknown command */
+}
+
+static const struct block_device_operations rdsk_fops = {
+ .owner = THIS_MODULE,
+ .ioctl = rdsk_ioctl,
+};
+
+static int attach_device(int num, int size)
+{
+ struct rdsk_device *rdsk, *rxtmp;
+ struct gendisk *disk;
+
+ if (rxcnt > max_rxcnt) {
+ pr_warn("%s: Reached maximum number of attached disks.\n",
+ RDPREFIX);
+ goto out;
+ }
+
+ list_for_each_entry(rxtmp, &rdsk_devices, rdsk_list) {
+ if (rxtmp->num == num) {
+ pr_warn("%s: rdsk device %d already exists.\n",
+ RDPREFIX, num);
+ goto out;
+ }
+ }
+
+ rdsk = kzalloc(sizeof(*rdsk), GFP_KERNEL);
+ if (!rdsk)
+ goto out;
+ rdsk->num = num;
+ rdsk->error_cnt = 0;
+ rdsk->volatile_memory = VOLATILE_MEMORY;
+ rdsk->max_blk_alloc = 0;
+ rdsk->end_addr = rdsk->size = (size * BYTES_PER_SECTOR);
+ rdsk->start_addr = 0;
+ spin_lock_init(&rdsk->rdsk_lock);
+ INIT_RADIX_TREE(&rdsk->rdsk_pages, GFP_ATOMIC);
+
+ rdsk->rdsk_queue = blk_alloc_queue(GFP_KERNEL);
+ if (!rdsk->rdsk_queue)
+ goto out_free_dev;
+ blk_queue_make_request(rdsk->rdsk_queue, rdsk_make_request);
+ blk_queue_logical_block_size(rdsk->rdsk_queue, BYTES_PER_SECTOR);
+ blk_queue_flush(rdsk->rdsk_queue, REQ_FLUSH);
+
+ rdsk->rdsk_queue->limits.max_sectors = (max_sectors * 2);
+ rdsk->rdsk_queue->nr_requests = nr_requests;
+ rdsk->rdsk_queue->limits.discard_granularity = PAGE_SIZE;
+ rdsk->rdsk_queue->limits.discard_zeroes_data = 1;
+ rdsk->rdsk_queue->limits.max_discard_sectors = UINT_MAX;
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rdsk->rdsk_queue);
+
+ disk = rdsk->rdsk_disk = alloc_disk(1);
+ if (!disk)
+ goto out_free_queue;
+ disk->major = rdsk_ma_no;
+ disk->first_minor = num;
+ disk->fops = &rdsk_fops;
+ disk->private_data = rdsk;
+ disk->queue = rdsk->rdsk_queue;
+ disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+ sprintf(disk->disk_name, "%s%d", RDPREFIX, num);
+ set_capacity(disk, size);
+
+ add_disk(rdsk->rdsk_disk);
+ list_add_tail(&rdsk->rdsk_list, &rdsk_devices);
+ rxcnt++;
+ pr_info("%s: Attached rxd%d of %llu bytes in size.\n", RDPREFIX, num,
+ (unsigned long long)(size * BYTES_PER_SECTOR));
+ return 0;
+
+out_free_queue:
+ blk_cleanup_queue(rdsk->rdsk_queue);
+out_free_dev:
+ kfree(rdsk);
+out:
+ return GENERIC_ERROR;
+}
+
+static int attach_nv_device(int num, unsigned long long start_addr,
+ unsigned long long end_addr)
+{
+ struct rdsk_device *rdsk, *rxtmp;
+ struct gendisk *disk;
+ unsigned long size = ((end_addr - start_addr) / BYTES_PER_SECTOR);
+
+ if (rxcnt > max_rxcnt) {
+ pr_warn("%s: Reached maximum number of attached disks.\n",
+ RDPREFIX);
+ goto out_nv;
+ }
+
+ list_for_each_entry(rxtmp, &rdsk_devices, rdsk_list) {
+ if (rxtmp->num == num) {
+ pr_warn("%s: rdsk device %d already exists.\n",
+ RDPREFIX, num);
+ goto out_nv;
+ }
+ }
+ if (!request_mem_region(start_addr,
+ (end_addr - start_addr), "RapidDisk-NV")) {
+ pr_err("%s: Failed to request memory region (address: %llu size: %llu).\n",
+ RDPREFIX, start_addr, (end_addr - start_addr));
+ goto out_nv;
+ }
+
+ rdsk = kzalloc(sizeof(*rdsk), GFP_KERNEL);
+ if (!rdsk)
+ goto out_nv;
+ rdsk->num = num;
+ rdsk->error_cnt = 0;
+ rdsk->start_addr = start_addr;
+ rdsk->end_addr = end_addr;
+ rdsk->size = end_addr - start_addr;
+ rdsk->volatile_memory = NON_VOLATILE_MEMORY;
+ rdsk->max_blk_alloc = (rdsk->size / BYTES_PER_SECTOR);
+
+ spin_lock_init(&rdsk->rdsk_lock);
+
+ rdsk->rdsk_queue = blk_alloc_queue(GFP_KERNEL);
+ if (!rdsk->rdsk_queue)
+ goto out_free_dev_nv;
+ blk_queue_make_request(rdsk->rdsk_queue, rdsk_make_request);
+ blk_queue_logical_block_size(rdsk->rdsk_queue, BYTES_PER_SECTOR);
+ blk_queue_flush(rdsk->rdsk_queue, REQ_FLUSH);
+
+ rdsk->rdsk_queue->limits.max_sectors = (max_sectors * 2);
+ rdsk->rdsk_queue->nr_requests = nr_requests;
+ rdsk->rdsk_queue->limits.discard_granularity = PAGE_SIZE;
+ rdsk->rdsk_queue->limits.discard_zeroes_data = 1;
+ rdsk->rdsk_queue->limits.max_discard_sectors = UINT_MAX;
+ queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, rdsk->rdsk_queue);
+
+ disk = rdsk->rdsk_disk = alloc_disk(1);
+ if (!disk)
+ goto out_free_queue_nv;
+ disk->major = rdsk_ma_no;
+ disk->first_minor = num;
+ disk->fops = &rdsk_fops;
+ disk->private_data = rdsk;
+ disk->queue = rdsk->rdsk_queue;
+ disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
+ sprintf(disk->disk_name, "%s%d", RDPREFIX, num);
+ set_capacity(disk, size);
+
+ add_disk(rdsk->rdsk_disk);
+ list_add_tail(&rdsk->rdsk_list, &rdsk_devices);
+ rxcnt++;
+ pr_info("%s: Attached rxd%d of %llu bytes in size.\n", RDPREFIX, num,
+ (unsigned long long)(size * BYTES_PER_SECTOR));
+
+ return 0;
+
+out_free_queue_nv:
+ blk_cleanup_queue(rdsk->rdsk_queue);
+
+out_free_dev_nv:
+ release_mem_region(rdsk->start_addr, rdsk->size);
+ kfree(rdsk);
+out_nv:
+ return GENERIC_ERROR;
+}
+
+static int detach_device(int num)
+{
+ struct rdsk_device *rdsk;
+
+ list_for_each_entry(rdsk, &rdsk_devices, rdsk_list)
+ if (rdsk->num == num)
+ break;
+
+ list_del(&rdsk->rdsk_list);
+ del_gendisk(rdsk->rdsk_disk);
+ put_disk(rdsk->rdsk_disk);
+ blk_cleanup_queue(rdsk->rdsk_queue);
+ if (rdsk->volatile_memory == VOLATILE_MEMORY)
+ rdsk_free_pages(rdsk);
+ else
+ release_mem_region(rdsk->start_addr, rdsk->size);
+ kfree(rdsk);
+ rxcnt--;
+ pr_info("%s: Detached rxd%d.\n", RDPREFIX, num);
+
+ return 0;
+}
+
+static int resize_device(int num, int size)
+{
+ struct rdsk_device *rdsk;
+
+ list_for_each_entry(rdsk, &rdsk_devices, rdsk_list)
+ if (rdsk->num == num)
+ break;
+
+ if (rdsk->volatile_memory == NON_VOLATILE_MEMORY) {
+ pr_warn("%s: Resizing unsupported for non-volatile mappings.\n",
+ RDPREFIX);
+ return GENERIC_ERROR;
+ }
+ if (size <= get_capacity(rdsk->rdsk_disk)) {
+ pr_warn("%s: Please specify a larger size for resizing.\n",
+ RDPREFIX);
+ return GENERIC_ERROR;
+ }
+ set_capacity(rdsk->rdsk_disk, size);
+ rdsk->size = (size * BYTES_PER_SECTOR);
+ pr_info("%s: Resized rxd%d of %lu bytes in size.\n", RDPREFIX, num,
+ (unsigned long)(size * BYTES_PER_SECTOR));
+ return 0;
+}
+
+static int __init init_rxd(void)
+{
+ int retval;
+
+ rxcnt = 0;
+ rdsk_ma_no = register_blkdev(rdsk_ma_no, RDPREFIX);
+ if (rdsk_ma_no < 0) {
+ pr_err("%s: Failed registering rdsk, returned %d\n",
+ RDPREFIX, rdsk_ma_no);
+ return rdsk_ma_no;
+ }
+
+ rdsk_kobj = kobject_create_and_add("rapiddisk", kernel_kobj);
+ if (!rdsk_kobj)
+ goto init_failure;
+ retval = sysfs_create_group(rdsk_kobj, &attr_group);
+ if (retval) {
+ kobject_put(rdsk_kobj);
+ goto init_failure;
+ }
+ return 0;
+
+init_failure:
+ unregister_blkdev(rdsk_ma_no, RDPREFIX);
+ return -ENOMEM;
+}
+
+static void __exit exit_rxd(void)
+{
+ struct rdsk_device *rdsk, *next;
+
+ kobject_put(rdsk_kobj);
+ list_for_each_entry_safe(rdsk, next, &rdsk_devices, rdsk_list)
+ detach_device(rdsk->num);
+ unregister_blkdev(rdsk_ma_no, RDPREFIX);
+}
+
+module_init(init_rxd);
+module_exit(exit_rxd);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Petros Koutoupis <petros@petroskoutoupis.com>");
+MODULE_DESCRIPTION("RapidDisk (rxdsk) is an enhanced RAM disk block device driver.");
+MODULE_VERSION(VERSION_STR);
diff -uNpr linux-next.orig/MAINTAINERS linux-next/MAINTAINERS
--- linux-next.orig/MAINTAINERS 2015-09-20 14:50:13.628673167 -0500
+++ linux-next/MAINTAINERS 2015-09-20 15:31:01.400603554 -0500
@@ -8642,6 +8642,14 @@ M: "Theodore Ts'o" <tytso@mit.edu>
S: Maintained
F: drivers/char/random.c
+RAPIDDISK & RAPIDCACHE DRIVERS
+M: Petros Koutoupis <petros@petroskoutoupis.com>
+L: devel@rapiddisk.org
+W: http://www.rapiddisk.org
+S: Maintained
+F: drivers/block/rapiddisk/
+F: Documentation/rapiddisk/
+
RAPIDIO SUBSYSTEM
M: Matt Porter <mporter@kernel.crashing.org>
M: Alexandre Bounine <alexandre.bounine@idt.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
2015-09-27 17:17 [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel Petros Koutoupis
@ 2015-09-28 6:49 ` Christoph Hellwig
2015-09-28 14:50 ` Petros Koutoupis
[not found] ` <CALMxJTyS5ARHw5NWhiPkJOh_0ys2x7cGVNdn60O6ecaUTFkq_Q@mail.gmail.com>
0 siblings, 2 replies; 8+ messages in thread
From: Christoph Hellwig @ 2015-09-28 6:49 UTC (permalink / raw)
To: Petros Koutoupis; +Cc: linux-kernel, devel@rapiddisk.org
On Sun, Sep 27, 2015 at 12:17:24PM -0500, Petros Koutoupis wrote:
> Attached is a patch for two modules: RapidDisk & RapidCache. RapidDisk is a
> Linux RAM drive module which allows the user to dynamically create, remove,
> and resize RAM-based block devices. RapidDisk is designed to work with both
> volatile and non-volatile memory. In the case of volatile memory, memory is
> allocated only when needed. The RapidCache module in turn utilizes a RapidDisk
> volume as a FIFO Write-Through caching node to a slower block device.
Hi Petros,
this is three things at the same time! We already have a ramdisk
driver, a pmem drive, bcache and dm-cache, so for each of them please
explain why we'd want to duplicate them instead of adding whatever
features you need to them. First step is to identify those features.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
2015-09-28 6:49 ` Christoph Hellwig
@ 2015-09-28 14:50 ` Petros Koutoupis
[not found] ` <CALMxJTyS5ARHw5NWhiPkJOh_0ys2x7cGVNdn60O6ecaUTFkq_Q@mail.gmail.com>
1 sibling, 0 replies; 8+ messages in thread
From: Petros Koutoupis @ 2015-09-28 14:50 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-kernel, devel@rapiddisk.org
Christoph (and all),
I hope this message finds you well. The unfortunate truth is, had I the
courage and confidence to submit this years ago, we wouldn't be having
this conversation. Anyway, to address your questions:
1. Unlike the already mainline ramdisk driver, RapidDisk is designed to
be managed dynamically. That is, instead of configuring a fixed number
of volumes and volume sizes as compile/boot time variables, RapidDisk
will allow you to add, remove, and resize your RAM drive(s) at runtime.
Besides, the built in module is designed to work with smaller sizes in
mind while RapidDisk focuses on larger sizes that can reach to the
multiple Gigabytes or even Terabytes. Much like the built in module, it
will allocate pages as they are needed which allows for over
provisioning (not that it is advised) of volume sizes.
2. The majority of RapidDisk code focuses on the use of Volatile memory.
The support for Non-Volatile memory is a bit newer and there may be some
overlap here with the recently integrated pmem code. The only advantage
to having this code within RapidDisk is to provide the user with the
ability to manage both technologies simultaneously, through a single
interface.
3. The RapidCache component is designed around the Non-Volatile
functionality of RapidDisk (hence the block-level Write-Through
caching). It is also coded and optimized around the RapidDisk
sizes/variables, out-of-box. It is worth noting that I am in the process
of expanding this module to add deduplication support. This will
leverage RapidDisk's ability to allocate pages only when needed and
reduce the cache's memory footprint; making more out of less.
Thoughts, suggestions, and concerns are always welcome.
On 9/28/15 1:49 AM, Christoph Hellwig wrote:
> On Sun, Sep 27, 2015 at 12:17:24PM -0500, Petros Koutoupis wrote:
>> Attached is a patch for two modules: RapidDisk & RapidCache. RapidDisk is a
>> Linux RAM drive module which allows the user to dynamically create, remove,
>> and resize RAM-based block devices. RapidDisk is designed to work with both
>> volatile and non-volatile memory. In the case of volatile memory, memory is
>> allocated only when needed. The RapidCache module in turn utilizes a RapidDisk
>> volume as a FIFO Write-Through caching node to a slower block device.
> Hi Petros,
>
> this is three things at the same time! We already have a ramdisk
> driver, a pmem drive, bcache and dm-cache, so for each of them please
> explain why we'd want to duplicate them instead of adding whatever
> features you need to them. First step is to identify those features.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
[not found] ` <CALMxJTyS5ARHw5NWhiPkJOh_0ys2x7cGVNdn60O6ecaUTFkq_Q@mail.gmail.com>
@ 2015-09-28 16:29 ` Christoph Hellwig
2015-09-28 16:45 ` Petros Koutoupis
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2015-09-28 16:29 UTC (permalink / raw)
To: Petros Koutoupis; +Cc: linux-kernel, devel@rapiddisk.org
Hi Petros,
On Mon, Sep 28, 2015 at 09:12:13AM -0500, Petros Koutoupis wrote:
> 1. Unlike the already mainline ramdisk driver, RapidDisk is designed to be
> managed dynamically. That is, instead of configuring a fixed number of
> volumes and volume sizes as compile/boot time variables, RapidDisk will
> allow you to add, remove, and resize your RAM drive(s) at runtime. Besides,
> the built in module is designed to work with smaller sizes in mind while
> RapidDisk focuses on larger sizes that can reach to the multiple Gigabytes
> or even Terabytes. Much like the built in module, it will allocate pages as
> they are needed which allows for over provisioning (not that it is advised)
> of volume sizes.
The ramdisk driver allows to selects sizes and count at module load
load. I agree that having runtime control would be even better, but
that's best done by adding a runtime interface to the existing driver
instead of duplicating it.
> 2. The majority of RapidDisk code focuses on the use of Volatile memory.
> The support for Non-Volatile memory is a bit newer and there may be some
> overlap here with the recently integrated pmem code. The only advantage to
> having this code within RapidDisk is to provide the user with the ability
> to manage both technologies simultaneously, through a single interface.
Which really doesn't sound like a good enough reason to duplicate it.
> 3. The RapidCache component is designed around the Non-Volatile
> functionality of RapidDisk (hence the block-level Write-Through caching).
> It is also coded and optimized around the RapidDisk sizes/variables,
> out-of-box. It is worth noting that I am in the process of expanding this
> module to add deduplication support. This will leverage RapidDisk's ability
> to allocate pages only when needed and reduce the cache's memory footprint;
> making more out of less.
Still needs some code comparism to our existing two caching solutions.
I'd love to see you go ahead with the dynamic ramdisk configuration as
this is clearly a very useful feature. A caching solution that is
optimized for non-volatile memory does sound useful, but we'll still
need a patch better explaining how it actually is as useful as it might
sound.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
2015-09-28 16:29 ` Christoph Hellwig
@ 2015-09-28 16:45 ` Petros Koutoupis
2015-09-29 14:32 ` Austin S Hemmelgarn
0 siblings, 1 reply; 8+ messages in thread
From: Petros Koutoupis @ 2015-09-28 16:45 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-kernel, devel@rapiddisk.org
Christoph,
See my replies below....
On 9/28/15 11:29 AM, Christoph Hellwig wrote:
> Hi Petros,
>
> On Mon, Sep 28, 2015 at 09:12:13AM -0500, Petros Koutoupis wrote:
>> 1. Unlike the already mainline ramdisk driver, RapidDisk is designed to be
>> managed dynamically. That is, instead of configuring a fixed number of
>> volumes and volume sizes as compile/boot time variables, RapidDisk will
>> allow you to add, remove, and resize your RAM drive(s) at runtime. Besides,
>> the built in module is designed to work with smaller sizes in mind while
>> RapidDisk focuses on larger sizes that can reach to the multiple Gigabytes
>> or even Terabytes. Much like the built in module, it will allocate pages as
>> they are needed which allows for over provisioning (not that it is advised)
>> of volume sizes.
> The ramdisk driver allows to selects sizes and count at module load
> load. I agree that having runtime control would be even better, but
> that's best done by adding a runtime interface to the existing driver
> instead of duplicating it.
I understand the concern and I will definitely scope out this approach,
although at the moment, I am not sure how both approaches will play nice
together. As mentioned above, the current implementation requires the
predefined number of ram drives with the specified size to be configured
at boot time (or compiled into the kernel). The only wiggle room I see
for runtime control is resizing individual volumes.
>> 2. The majority of RapidDisk code focuses on the use of Volatile memory.
>> The support for Non-Volatile memory is a bit newer and there may be some
>> overlap here with the recently integrated pmem code. The only advantage to
>> having this code within RapidDisk is to provide the user with the ability
>> to manage both technologies simultaneously, through a single interface.
> Which really doesn't sound like a good enough reason to duplicate it.
I do not disagree with your comment here. This component does not have
to be patched into the mainline.
>> 3. The RapidCache component is designed around the Non-Volatile
>> functionality of RapidDisk (hence the block-level Write-Through caching).
>> It is also coded and optimized around the RapidDisk sizes/variables,
>> out-of-box. It is worth noting that I am in the process of expanding this
>> module to add deduplication support. This will leverage RapidDisk's ability
>> to allocate pages only when needed and reduce the cache's memory footprint;
>> making more out of less.
> Still needs some code comparism to our existing two caching solutions.
>
> I'd love to see you go ahead with the dynamic ramdisk configuration as
> this is clearly a very useful feature. A caching solution that is
> optimized for non-volatile memory does sound useful, but we'll still
> need a patch better explaining how it actually is as useful as it might
> sound.
CORRECTION: I meant to say Volatile and NOT Non-Volatile. RapidCache is
designed around Volatile memory. I guess I was a little to excited in my
response and I do apologize for that. I will provide a code comparison
in my next e-mail, after I go through the existing RAM drive code.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
2015-09-28 16:45 ` Petros Koutoupis
@ 2015-09-29 14:32 ` Austin S Hemmelgarn
2015-09-30 14:29 ` Petros Koutoupis
0 siblings, 1 reply; 8+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-29 14:32 UTC (permalink / raw)
To: Petros Koutoupis, Christoph Hellwig; +Cc: linux-kernel, devel@rapiddisk.org
[-- Attachment #1: Type: text/plain, Size: 4509 bytes --]
On 2015-09-28 12:45, Petros Koutoupis wrote:
> Christoph,
>
> See my replies below....
>
> On 9/28/15 11:29 AM, Christoph Hellwig wrote:
>> Hi Petros,
>>
>> On Mon, Sep 28, 2015 at 09:12:13AM -0500, Petros Koutoupis wrote:
>>> 1. Unlike the already mainline ramdisk driver, RapidDisk is designed
>>> to be
>>> managed dynamically. That is, instead of configuring a fixed number of
>>> volumes and volume sizes as compile/boot time variables, RapidDisk will
>>> allow you to add, remove, and resize your RAM drive(s) at runtime.
>>> Besides,
>>> the built in module is designed to work with smaller sizes in mind while
>>> RapidDisk focuses on larger sizes that can reach to the multiple
>>> Gigabytes
>>> or even Terabytes. Much like the built in module, it will allocate
>>> pages as
>>> they are needed which allows for over provisioning (not that it is
>>> advised)
>>> of volume sizes.
>> The ramdisk driver allows to selects sizes and count at module load
>> load. I agree that having runtime control would be even better, but
>> that's best done by adding a runtime interface to the existing driver
>> instead of duplicating it.
> I understand the concern and I will definitely scope out this approach,
> although at the moment, I am not sure how both approaches will play nice
> together. As mentioned above, the current implementation requires the
> predefined number of ram drives with the specified size to be configured
> at boot time (or compiled into the kernel). The only wiggle room I see
> for runtime control is resizing individual volumes.
Just because there is not code currently to do dynamic
allocation/freeing of ramdisks in the current driver doesn't mean that
it isn't possible, it just means that nobody has written code to do it
yet. This functionality would be extremely useful (I often use ramdisks
on a VM host as a small amount of very fast swap space for the virtual
machines). On top of that, the deduplication would be a wonderful
feature, although it may already be indirectly implemented through KSM
(that is, when KSM is on and configured to scan everything, I'm not sure
if it scans memory used by the ramdisks or not).
>>> 2. The majority of RapidDisk code focuses on the use of Volatile memory.
>>> The support for Non-Volatile memory is a bit newer and there may be some
>>> overlap here with the recently integrated pmem code. The only
>>> advantage to
>>> having this code within RapidDisk is to provide the user with the
>>> ability
>>> to manage both technologies simultaneously, through a single interface.
>> Which really doesn't sound like a good enough reason to duplicate it.
> I do not disagree with your comment here. This component does not have
> to be patched into the mainline.
>
>>> 3. The RapidCache component is designed around the Non-Volatile
>>> functionality of RapidDisk (hence the block-level Write-Through
>>> caching).
>>> It is also coded and optimized around the RapidDisk sizes/variables,
>>> out-of-box. It is worth noting that I am in the process of expanding
>>> this
>>> module to add deduplication support. This will leverage RapidDisk's
>>> ability
>>> to allocate pages only when needed and reduce the cache's memory
>>> footprint;
>>> making more out of less.
>> Still needs some code comparism to our existing two caching solutions.
>>
>> I'd love to see you go ahead with the dynamic ramdisk configuration as
>> this is clearly a very useful feature. A caching solution that is
>> optimized for non-volatile memory does sound useful, but we'll still
>> need a patch better explaining how it actually is as useful as it might
>> sound.
> CORRECTION: I meant to say Volatile and NOT Non-Volatile. RapidCache is
> designed around Volatile memory. I guess I was a little to excited in my
> response and I do apologize for that. I will provide a code comparison
> in my next e-mail, after I go through the existing RAM drive code.
To a certain extent, I see that as potentially less useful than
optimized for non-volatile memory. While the current incarnation of the
pagecache in Linux could stand to have some serious performance
improvements (just think how fast things would be if we used ARC instead
of plain LRU), it does still do it's job well for most workloads
(although being able to tell the kernel to reserve some portion of
memory _just_ for the pagecache would be an interesting and probably
very useful feature).
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
2015-09-29 14:32 ` Austin S Hemmelgarn
@ 2015-09-30 14:29 ` Petros Koutoupis
2015-09-30 15:17 ` Austin S Hemmelgarn
0 siblings, 1 reply; 8+ messages in thread
From: Petros Koutoupis @ 2015-09-30 14:29 UTC (permalink / raw)
To: Austin S Hemmelgarn, Christoph Hellwig; +Cc: linux-kernel, devel@rapiddisk.org
Christoph and Austin,
You both have provided me with some valuable feedback. I will do what I
can to clean this patch up and in turn apply the same dynamic
functionality to the already in-kernel module. Also please see my
replies below.
On 9/29/15 9:32 AM, Austin S Hemmelgarn wrote:
> On 2015-09-28 12:45, Petros Koutoupis wrote:
>> Christoph,
>>
>> See my replies below....
>>
>> On 9/28/15 11:29 AM, Christoph Hellwig wrote:
>>> Hi Petros,
>>>
>>> On Mon, Sep 28, 2015 at 09:12:13AM -0500, Petros Koutoupis wrote:
>>>> 1. Unlike the already mainline ramdisk driver, RapidDisk is designed
>>>> to be
>>>> managed dynamically. That is, instead of configuring a fixed number of
>>>> volumes and volume sizes as compile/boot time variables, RapidDisk
>>>> will
>>>> allow you to add, remove, and resize your RAM drive(s) at runtime.
>>>> Besides,
>>>> the built in module is designed to work with smaller sizes in mind
>>>> while
>>>> RapidDisk focuses on larger sizes that can reach to the multiple
>>>> Gigabytes
>>>> or even Terabytes. Much like the built in module, it will allocate
>>>> pages as
>>>> they are needed which allows for over provisioning (not that it is
>>>> advised)
>>>> of volume sizes.
>>> The ramdisk driver allows to selects sizes and count at module load
>>> load. I agree that having runtime control would be even better, but
>>> that's best done by adding a runtime interface to the existing driver
>>> instead of duplicating it.
>> I understand the concern and I will definitely scope out this approach,
>> although at the moment, I am not sure how both approaches will play nice
>> together. As mentioned above, the current implementation requires the
>> predefined number of ram drives with the specified size to be configured
>> at boot time (or compiled into the kernel). The only wiggle room I see
>> for runtime control is resizing individual volumes.
> Just because there is not code currently to do dynamic
> allocation/freeing of ramdisks in the current driver doesn't mean that
> it isn't possible, it just means that nobody has written code to do it
> yet. This functionality would be extremely useful (I often use
> ramdisks on a VM host as a small amount of very fast swap space for
> the virtual machines). On top of that, the deduplication would be a
> wonderful feature, although it may already be indirectly implemented
> through KSM (that is, when KSM is on and configured to scan
> everything, I'm not sure if it scans memory used by the ramdisks or not).
>
To my understanding KSM is only applied to KVM deployments. One way I
have seen my caching module work is users/vendors have a block device,
map it to a RapidDisk RAM drive as a RAM based Write-Through caching
node and in turn export it via a traditional SAN. The idea behind adding
deduplication to this module is to minimize the RAM drive footprint when
used as a block level cache.
>>>> 2. The majority of RapidDisk code focuses on the use of Volatile
>>>> memory.
>>>> The support for Non-Volatile memory is a bit newer and there may be
>>>> some
>>>> overlap here with the recently integrated pmem code. The only
>>>> advantage to
>>>> having this code within RapidDisk is to provide the user with the
>>>> ability
>>>> to manage both technologies simultaneously, through a single
>>>> interface.
>>> Which really doesn't sound like a good enough reason to duplicate it.
>> I do not disagree with your comment here. This component does not have
>> to be patched into the mainline.
>>
>>>> 3. The RapidCache component is designed around the Non-Volatile
>>>> functionality of RapidDisk (hence the block-level Write-Through
>>>> caching).
>>>> It is also coded and optimized around the RapidDisk sizes/variables,
>>>> out-of-box. It is worth noting that I am in the process of expanding
>>>> this
>>>> module to add deduplication support. This will leverage RapidDisk's
>>>> ability
>>>> to allocate pages only when needed and reduce the cache's memory
>>>> footprint;
>>>> making more out of less.
>>> Still needs some code comparism to our existing two caching solutions.
>>>
>>> I'd love to see you go ahead with the dynamic ramdisk configuration as
>>> this is clearly a very useful feature. A caching solution that is
>>> optimized for non-volatile memory does sound useful, but we'll still
>>> need a patch better explaining how it actually is as useful as it might
>>> sound.
>> CORRECTION: I meant to say Volatile and NOT Non-Volatile. RapidCache is
>> designed around Volatile memory. I guess I was a little to excited in my
>> response and I do apologize for that. I will provide a code comparison
>> in my next e-mail, after I go through the existing RAM drive code.
> To a certain extent, I see that as potentially less useful than
> optimized for non-volatile memory. While the current incarnation of
> the pagecache in Linux could stand to have some serious performance
> improvements (just think how fast things would be if we used ARC
> instead of plain LRU), it does still do it's job well for most
> workloads (although being able to tell the kernel to reserve some
> portion of memory _just_ for the pagecache would be an interesting and
> probably very useful feature).
>
My only concern with an ARC is CPU utilization. A lot more is required
to manage two lists.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel
2015-09-30 14:29 ` Petros Koutoupis
@ 2015-09-30 15:17 ` Austin S Hemmelgarn
0 siblings, 0 replies; 8+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-30 15:17 UTC (permalink / raw)
To: Petros Koutoupis, Christoph Hellwig; +Cc: linux-kernel, devel@rapiddisk.org
[-- Attachment #1: Type: text/plain, Size: 7296 bytes --]
On 2015-09-30 10:29, Petros Koutoupis wrote:
> Christoph and Austin,
>
> You both have provided me with some valuable feedback. I will do what I
> can to clean this patch up and in turn apply the same dynamic
> functionality to the already in-kernel module. Also please see my
> replies below.
>
> On 9/29/15 9:32 AM, Austin S Hemmelgarn wrote:
>> On 2015-09-28 12:45, Petros Koutoupis wrote:
>>> Christoph,
>>>
>>> See my replies below....
>>>
>>> On 9/28/15 11:29 AM, Christoph Hellwig wrote:
>>>> Hi Petros,
>>>>
>>>> On Mon, Sep 28, 2015 at 09:12:13AM -0500, Petros Koutoupis wrote:
>>>>> 1. Unlike the already mainline ramdisk driver, RapidDisk is designed
>>>>> to be
>>>>> managed dynamically. That is, instead of configuring a fixed number of
>>>>> volumes and volume sizes as compile/boot time variables, RapidDisk
>>>>> will
>>>>> allow you to add, remove, and resize your RAM drive(s) at runtime.
>>>>> Besides,
>>>>> the built in module is designed to work with smaller sizes in mind
>>>>> while
>>>>> RapidDisk focuses on larger sizes that can reach to the multiple
>>>>> Gigabytes
>>>>> or even Terabytes. Much like the built in module, it will allocate
>>>>> pages as
>>>>> they are needed which allows for over provisioning (not that it is
>>>>> advised)
>>>>> of volume sizes.
>>>> The ramdisk driver allows to selects sizes and count at module load
>>>> load. I agree that having runtime control would be even better, but
>>>> that's best done by adding a runtime interface to the existing driver
>>>> instead of duplicating it.
>>> I understand the concern and I will definitely scope out this approach,
>>> although at the moment, I am not sure how both approaches will play nice
>>> together. As mentioned above, the current implementation requires the
>>> predefined number of ram drives with the specified size to be configured
>>> at boot time (or compiled into the kernel). The only wiggle room I see
>>> for runtime control is resizing individual volumes.
>> Just because there is not code currently to do dynamic
>> allocation/freeing of ramdisks in the current driver doesn't mean that
>> it isn't possible, it just means that nobody has written code to do it
>> yet. This functionality would be extremely useful (I often use
>> ramdisks on a VM host as a small amount of very fast swap space for
>> the virtual machines). On top of that, the deduplication would be a
>> wonderful feature, although it may already be indirectly implemented
>> through KSM (that is, when KSM is on and configured to scan
>> everything, I'm not sure if it scans memory used by the ramdisks or not).
>>
> To my understanding KSM is only applied to KVM deployments. One way I
> have seen my caching module work is users/vendors have a block device,
> map it to a RapidDisk RAM drive as a RAM based Write-Through caching
> node and in turn export it via a traditional SAN. The idea behind adding
> deduplication to this module is to minimize the RAM drive footprint when
> used as a block level cache.
KSM is usually used in KVM or other userspace VM deployments, but that
is by no means the only use-case. I actually use it regularly on most
of my systems, and it does help in some cases (for example, I run a lot
of distributed computing apps, often using multiple instances of the
same app, and those don't always share memory to the degree they should,
KSM helps with this).
The write-through caching may be worth looking into, although I think
(not certain about this) that you can force the page cache to do
write-through caching only, except that can only be done globally.
It would probably be better to improve upon the existing pagecache
implementation anyway, ideally, I would love to see:
1. The ability to tell the page cache to claim some minimum amount of
memory that only it can use.
2. The ability to easily tune cache parameters on a per-device (or even
better, per-filesystem) basis.
3. Conversion to a framework that would allow for easy development and
testing of different caching algorithms (although this is probably never
going to happen).
>>>>> 2. The majority of RapidDisk code focuses on the use of Volatile
>>>>> memory.
>>>>> The support for Non-Volatile memory is a bit newer and there may be
>>>>> some
>>>>> overlap here with the recently integrated pmem code. The only
>>>>> advantage to
>>>>> having this code within RapidDisk is to provide the user with the
>>>>> ability
>>>>> to manage both technologies simultaneously, through a single
>>>>> interface.
>>>> Which really doesn't sound like a good enough reason to duplicate it.
>>> I do not disagree with your comment here. This component does not have
>>> to be patched into the mainline.
>>>
>>>>> 3. The RapidCache component is designed around the Non-Volatile
>>>>> functionality of RapidDisk (hence the block-level Write-Through
>>>>> caching).
>>>>> It is also coded and optimized around the RapidDisk sizes/variables,
>>>>> out-of-box. It is worth noting that I am in the process of expanding
>>>>> this
>>>>> module to add deduplication support. This will leverage RapidDisk's
>>>>> ability
>>>>> to allocate pages only when needed and reduce the cache's memory
>>>>> footprint;
>>>>> making more out of less.
>>>> Still needs some code comparism to our existing two caching solutions.
>>>>
>>>> I'd love to see you go ahead with the dynamic ramdisk configuration as
>>>> this is clearly a very useful feature. A caching solution that is
>>>> optimized for non-volatile memory does sound useful, but we'll still
>>>> need a patch better explaining how it actually is as useful as it might
>>>> sound.
>>> CORRECTION: I meant to say Volatile and NOT Non-Volatile. RapidCache is
>>> designed around Volatile memory. I guess I was a little to excited in my
>>> response and I do apologize for that. I will provide a code comparison
>>> in my next e-mail, after I go through the existing RAM drive code.
>> To a certain extent, I see that as potentially less useful than
>> optimized for non-volatile memory. While the current incarnation of
>> the pagecache in Linux could stand to have some serious performance
>> improvements (just think how fast things would be if we used ARC
>> instead of plain LRU), it does still do it's job well for most
>> workloads (although being able to tell the kernel to reserve some
>> portion of memory _just_ for the pagecache would be an interesting and
>> probably very useful feature).
>>
> My only concern with an ARC is CPU utilization. A lot more is required
> to manage two lists.
Actually, most of the CPU time spent in an ARC cache is in the
auto-tuning (the 'adaptive' bit), I've done testing just in userspace
and SLRU (ARC without the adaptive sizing of the lists) uses only a
little more CPU time than traditional LRU, somewhat less than ARC, and
does a much better job of handling COW based workloads. COW is a tough
workload for LRU caching (which is why ZFS uses ARC and not traditional
LRU), as a read-modify-write cycle ends up with the read data not being
needed ever again, which in turn means that MRU caching can be better in
may cases for heavy read-write COW workloads.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2015-09-30 15:17 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-27 17:17 [PATCH] Patch to integrate RapidDisk and RapidCache RAM Drive / Caching modules into the kernel Petros Koutoupis
2015-09-28 6:49 ` Christoph Hellwig
2015-09-28 14:50 ` Petros Koutoupis
[not found] ` <CALMxJTyS5ARHw5NWhiPkJOh_0ys2x7cGVNdn60O6ecaUTFkq_Q@mail.gmail.com>
2015-09-28 16:29 ` Christoph Hellwig
2015-09-28 16:45 ` Petros Koutoupis
2015-09-29 14:32 ` Austin S Hemmelgarn
2015-09-30 14:29 ` Petros Koutoupis
2015-09-30 15:17 ` Austin S Hemmelgarn
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox