Linux RAID subsystem development
* Re: [PATCH] mdadm: add man page for --add-journal
From: Song Liu @ 2016-08-15 17:16 UTC
  To: Jes Sorensen, Adam Goryachev
  Cc: linux-raid@vger.kernel.org, yizhan@redhat.com, Shaohua Li
In-Reply-To: <wrfj60r25c15.fsf@redhat.com>

Thanks Adam and Jes. 

These look good to me.

PS: we will make “add-journal” more flexible, and revise the man page accordingly. 

Song

>> On 8/15/16, 7:42 AM, "Jes Sorensen" <Jes.Sorensen@redhat.com> wrote:

    Adam Goryachev <mailinglists@websitemanagers.com.au> writes:
    > On 13/08/2016 00:58, Jes Sorensen wrote:
    >> Song Liu <songliubraving@fb.com> writes:
    >>> Add the following to man page:
    >>>
    >>> --add-journal
    >>>        Recreate journal for RAID-4/5/6 array that losts journal
    >>>        devices. In current implementation, this command cannot
    >>>        add journal to an array that had failed journal.  To
    >>>        avoid  interrupting  on-going  write  opertions,
    >>>        --add-journal only works for array in Read-Only state.
    >>>
    >>> Reported-by: Yi Zhang <yizhan@redhat.com>
    >>> Signed-off-by: Song Liu <songliubraving@fb.com>
    >>> Signed-off-by: Shaohua Li <shli@fb.com>
    >>> ---
    >>>   mdadm.8.in | 8 ++++++++
    >>>   1 file changed, 8 insertions(+)
    >> Applied, with a few minor mods.
    >>
    >> I changed it to say this, I hope you are fine with that:
    >>
    >> "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
    >> current implementation, this command cannot add a journal to an array
    >> that had a failed journal. To avoid interrupting on-going write
    >> opertions, "
    > I think this might be more correct:
    >
    > "Recreate journal for RAID-4/5/6 array that lost a journal device. In the
    > current implementation, this command cannot add a journal to an array
    > that *has* a failed journal. To avoid interrupting on-going write
    > *operations*, "
    >
    >
    > Note that the two modified words are marked with *...*.
    > "has" means currently: if it had (past) a failed journal, but that has
    > already been fixed, then it currently has a working journal, and so I
    > assume this patch is not relevant. It's only relevant if the array
    > is currently missing a journal...
    > The second one, "operations", is just a typo fix...
    >
    > Hope you don't mind my jumping in here, I can't help much with code,
    > but hopefully contribution is still helpful.
    
    If Song is happy with this and you send me a patch, I'll be happy to
    apply it.
    
    Cheers,
    Jes
    



* [RFC PATCH 00/16] dm-inplace-compression block device
From: Ram Pai @ 2016-08-15 17:36 UTC
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai

This patch series provides a generic device-mapper inplace compression device.
Originally written by Shaohua Li.
https://www.redhat.com/archives/dm-devel/2013-December/msg00143.html

I have optimized the code and used it as a compressed swap device supporting
extreme levels of swap traffic using an NVMe device as a backend.

Comments from Alasdair have been incorporated.
https://www.redhat.com/archives/dm-devel/2013-December/msg00144.html

Testing:

This patch has been tested thoroughly as a swap device, on a Power machine only.
More testing is needed before it can be used as a generic compression device.

Your comments to improve the code are very much appreciated.

Ram Pai (15):
  DM: Ability to choose the compressor.
  DM: Error if enough space is not available.
  DM: Ensure that the read request is within the device range.
  DM: allocation/free helper routines.
  DM: separate out compression and decompression routines.
  DM: Optimize memory allocated to hold compressed buffer.
  DM: Tag a magicmarker at the end of each compressed segment.
  DM: Delay allocation of decompression buffer during read.
  DM: Try to use the bio buffer for decompression instead of allocating
    one.
  DM: Try to avoid temporary buffer allocation to hold compressed data.
  DM: release unneeded buffer as soon as possible.
  DM: macros to set and get the state of the request.
  DM: Wasted bio copy.
  DM: Add sysfs parameters to track total memory saved and allocated.
  DM: add documentation for dm-inplace-compress.

Shaohua Li (1):
  DM: dm-inplace-compress: a compressed DM target for SSD

 .../device-mapper/dm-inplace-compress.text         |  138 ++
 drivers/md/Kconfig                                 |    6 +
 drivers/md/Makefile                                |    1 +
 drivers/md/dm-inplace-compress.c                   | 1792 ++++++++++++++++++++
 drivers/md/dm-inplace-compress.h                   |  162 ++
 5 files changed, 2099 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-inplace-compress.text
 create mode 100644 drivers/md/dm-inplace-compress.c
 create mode 100644 drivers/md/dm-inplace-compress.h

-- 
1.8.3.1



* [RFC PATCH 01/16] DM: dm-inplace-compress: an inplace compressed DM target
From: Ram Pai @ 2016-08-15 17:36 UTC
  To: LKML, linux-raid, dm-devel, linux-doc
  Cc: snitzer, corbet, Shaohua Li, Ram Pai, shli, agk
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

From: Shaohua Li <shli@kernel.org>

This is a simple DM target supporting inplace compression. It is best suited
for SSDs. The underlying disk must support a 512B sector size; the target
itself only supports a 4k sector size.

Disk layout:
|super|...meta...|..data...|

The store unit is 4k (a block). The super block is 1 block and stores the meta
size, the data size and the compression algorithm. Meta is a bitmap. For each
data block, there are 5 bits of meta.
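
To make the layout concrete, here is a minimal userspace sketch of how a device could be split into super, meta and data blocks under the rules above (1 super block, 5 meta bits per 4k data block). The helper names `meta_blocks_for` and `max_data_blocks` are hypothetical, and this is not the arithmetic the patch itself uses in dm_icomp_read_or_create_super(); it only illustrates the constraint super + meta + data <= total.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096u               /* store unit from the description */
#define BLOCK_BITS (BLOCK_SIZE * 8u)   /* bits one meta block can hold */
#define META_BITS  5u                  /* meta bits per data block */

/* Meta blocks needed to hold META_BITS for each of data_blocks blocks. */
static uint64_t meta_blocks_for(uint64_t data_blocks)
{
	return (data_blocks * META_BITS + BLOCK_BITS - 1) / BLOCK_BITS;
}

/*
 * Largest data block count d such that
 *   1 (super) + meta_blocks_for(d) + d <= total_blocks.
 * Hypothetical helper for illustration only.
 */
static uint64_t max_data_blocks(uint64_t total_blocks)
{
	uint64_t avail, d;

	if (total_blocks < 3)
		return 0;	/* need super + 1 meta + 1 data at minimum */

	avail = total_blocks - 1;	/* everything after the super block */

	/* Reserving meta for 'avail' blocks is always enough, maybe too much. */
	d = avail - meta_blocks_for(avail);

	/* Reclaim blocks that the conservative estimate left unused. */
	while (d + 1 + meta_blocks_for(d + 1) <= avail)
		d++;
	return d;
}
```

One 4k meta block covers 6553 data blocks (6553 * 5 = 32765 bits), so a device of 6555 blocks holds 1 super, 1 meta and 6553 data blocks in this sketch.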

Data:

The data of a block is compressed. The compressed data is rounded up to a
multiple of 512B, which is the payload. On disk, the payload is stored at the
beginning of the logical sectors of the block. Let's look at an example. Say we
store data to block A, which starts at sector B (A*8); its original size is 4k
and its compressed size is 1500. The compressed data (CD) will use 3 sectors
(512B each). These 3 sectors are the payload, which is stored at sector B.

---------------------------------------------------
... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
---------------------------------------------------
    ^B    ^B+1  ^B+2                  ^B+7 ^B+8

For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta
bits to represent the payload size. The compressed size (1500) isn't stored in
meta directly. Instead, we store it in the last 32 bits of the payload. In this
example, we store it at the end of sector B+2. If compressed size +
sizeof(32bits) crosses a sector boundary, the payload size increases by one
sector. If the payload would use 8 sectors, we store the uncompressed data
directly.
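
The payload sizing rule for a single 4k block can be sketched in userspace as follows. The function names are hypothetical and the constants are taken from the description (512B sectors, 4k blocks, a 32-bit length stored at the end of the payload); treat this as an illustration of the rule, not as the patch's code.

```c
#include <stdint.h>

#define SECTOR_SIZE       512u
#define BLOCK_SIZE        4096u
#define SECTORS_PER_BLOCK (BLOCK_SIZE / SECTOR_SIZE)	/* 8 */
#define LEN_TRAILER       4u	/* 32-bit compressed length at payload end */

/* First 512B sector of a block: B = A * 8 in the example above. */
static uint64_t block_first_sector(uint64_t block)
{
	return block * SECTORS_PER_BLOCK;
}

/*
 * Payload size, in sectors, for one 4k block whose compressed size is
 * comp_len bytes. Returns SECTORS_PER_BLOCK when the block would be
 * stored uncompressed.
 */
static uint32_t payload_sectors(uint32_t comp_len)
{
	uint32_t padded;

	if (comp_len >= BLOCK_SIZE)
		return SECTORS_PER_BLOCK;

	/* Round the compressed data up to a whole number of sectors. */
	padded = (comp_len + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;

	/* No room left for the 32-bit length: take one more sector. */
	if (padded - comp_len < LEN_TRAILER)
		padded += SECTOR_SIZE;

	/* A full-block payload saves nothing: store uncompressed. */
	if (padded >= BLOCK_SIZE)
		return SECTORS_PER_BLOCK;

	return padded / SECTOR_SIZE;
}
```

With the example above, payload_sectors(1500) is 3: the data pads to 1536 bytes, leaving 36 spare bytes for the stored length. A compressed size of 1533 leaves only 3 spare bytes, so the payload grows to 4 sectors.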

If the IO size is bigger than one block, we can store the data as an extent.
The data of the whole extent is compressed and stored in the same way as
above. The first block of the extent is the head; all others are tails. If
the extent is 1 block, that block is the head. We have 1 bit of meta to
represent whether a block is a head or a tail. If the 4 meta bits of the head
block can't hold the extent payload size, we borrow the tail blocks' meta bits
to store it. The max allowed extent size is 128k, so we never
compress/decompress too much data at once.
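
The head/tail encoding can be sketched with one unpacked meta byte per block. The mask values below are assumptions (the header that defines the patch's DMCP_TAIL_MASK and DMCP_LENGTH_MASK is not shown here), and `extent_set`/`extent_get` are hypothetical userspace helpers that mirror the idea of spreading the payload size over the head and tail entries.

```c
#include <stddef.h>
#include <stdint.h>

#define META_LENGTH_MASK  0x0Fu	/* assumed: low 4 bits = payload sectors */
#define META_TAIL_MASK    0x10u	/* assumed: bit 4 set = tail block */
#define SECTORS_PER_BLOCK 8u

/*
 * Record an extent of 'blocks' blocks starting at 'first' whose payload
 * is 'payload' sectors. Each entry holds at most 8 sectors; what does
 * not fit in the head entry spills into the tail entries.
 */
static void extent_set(uint8_t *meta, size_t first, size_t blocks,
		       uint32_t payload)
{
	size_t i;

	for (i = 0; i < blocks; i++) {
		uint32_t len = payload < SECTORS_PER_BLOCK ?
				payload : SECTORS_PER_BLOCK;

		payload -= len;
		meta[first + i] = (uint8_t)(len | (i ? META_TAIL_MASK : 0));
	}
}

/*
 * Starting from any block of an extent, find its head, its length in
 * blocks and its total payload in sectors.
 */
static void extent_get(const uint8_t *meta, size_t nblocks, size_t block,
		       size_t *head, size_t *blocks, uint32_t *payload)
{
	/* Walk back over tail entries until the head is reached. */
	while (block > 0 && (meta[block] & META_TAIL_MASK))
		block--;

	*head = block;
	*blocks = 1;
	*payload = meta[block] & META_LENGTH_MASK;

	/* Walk forward, summing the sizes stored in the tail entries. */
	for (block++; block < nblocks && (meta[block] & META_TAIL_MASK);
	     block++) {
		(*blocks)++;
		*payload += meta[block] & META_LENGTH_MASK;
	}
}
```

For a 4-block extent with an 11-sector payload, the head entry stores 8, the first tail stores 3, and the remaining tails store 0; a lookup from any of the four blocks recovers the head, the 4-block length and the 11-sector payload.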

Meta:
Modifying data modifies meta too. Meta is written (flushed) to disk
depending on the meta write policy. We support writeback and writethrough
modes. In writeback mode, meta is written to disk at an interval or on a FLUSH
request. In writethrough mode, data and meta are written to disk together.
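
In writeback mode, a periodic flush only needs to write the meta pages that changed. A minimal sketch of that idea, assuming a plain array of per-page dirty flags instead of the kernel page flags the patch uses: consecutive dirty pages are merged into runs so each run can go out as one write. `struct page_run` and `collect_dirty_runs` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_run {
	size_t start;	/* index of the first dirty page in the run */
	size_t count;	/* number of consecutive dirty pages */
};

/*
 * Scan the dirty flags and collect runs of consecutive dirty pages.
 * Returns the number of runs stored in 'out' (at most max_runs).
 */
static size_t collect_dirty_runs(const bool *dirty, size_t npages,
				 struct page_run *out, size_t max_runs)
{
	size_t i, n = 0, pending = 0, start = 0;

	/* i == npages acts as a clean sentinel that closes the last run. */
	for (i = 0; i <= npages; i++) {
		bool d = i < npages && dirty[i];

		if (d) {
			if (pending == 0)
				start = i;
			pending++;
		} else if (pending > 0) {
			if (n == max_runs)
				break;
			out[n].start = start;
			out[n].count = pending;
			n++;
			pending = 0;
		}
	}
	return n;
}
```

With dirty flags {0, 1, 1, 0, 1} this yields two runs: pages 1-2 and page 4. A real flush would also clear each flag as it is consumed and issue a cache flush after the writes.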

Advantages:

1. Simple. Since we store compressed data in-place, we don't need complicated
disk data management.
2. Efficient. For each 4k, we only need 5 bits of meta. 1T of data will use less
than 200M of meta, so we can load all meta into memory. The actual compressed
size is stored in the payload. So if IO doesn't need RMW and we use writeback
meta flush, we don't need extra IO for meta.
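
The "less than 200M" figure follows directly from 5 bits per 4k block. A small check, with `meta_bytes_for` as a hypothetical helper and 1T taken as 2^40 bytes:

```c
#include <stdint.h>

/* Bytes of meta needed for data_bytes of data: 5 bits per 4k block. */
static uint64_t meta_bytes_for(uint64_t data_bytes)
{
	uint64_t blocks = data_bytes / 4096;

	return (blocks * 5 + 7) / 8;	/* round bits up to whole bytes */
}
```

For 1T this gives 2^28 blocks * 5 bits = 167772160 bytes, i.e. 160M of meta.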

Disadvantages:

1. Holes. Since we store compressed data in-place, there are a lot of holes
(in the above example, B+3 - B+7). Holes can impact IO, because we can't do IO
merge.

2. 1:1 size. Compression doesn't change the disk size. If the disk is 1T, we
can only store 1T of data even if we do compression.

But this is for SSD only. Generally SSD firmware has an FTL layer to map
disk sectors to NAND flash. High-end SSD firmware has a filesystem-like FTL.

1. Holes. The disk has a lot of holes, but the SSD FTL can still store data
contiguously in NAND. Even if we can't do IO merge in the OS layer, SSD firmware
can do it.

2. 1:1 size. On one side, we write compressed data to the SSD, which means less
data is written to the SSD. This helps SSD garbage collection, and so write
speed and life cycle. So even if this is a problem, the target is still
helpful. On the other side, an advanced SSD FTL can easily do thin
provisioning. For example, if the NAND is 1T, we can let the SSD report it as
2T and use the SSD as the compressed target. With such an SSD, we don't have
the 1:1 size issue.

So even if the SSD FTL cannot map non-contiguous disk sectors to contiguous
NAND, the compression target can still function well.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Ram Pai <ram.n.pai@gmail.com>
---
 drivers/md/Kconfig               |    6 +
 drivers/md/Makefile              |    1 +
 drivers/md/dm-inplace-compress.c | 1487 ++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-inplace-compress.h |  140 ++++
 4 files changed, 1634 insertions(+), 0 deletions(-)
 create mode 100644 drivers/md/dm-inplace-compress.c
 create mode 100644 drivers/md/dm-inplace-compress.h

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 02a5345..cdb1984 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -343,6 +343,12 @@ config DM_MIRROR
          Allow volume managers to mirror logical volumes, also
          needed for live data migration tools such as 'pvmove'.
 
+config DM_INPLACE_COMPRESS
+       tristate "Inplace Compression target"
+       depends on BLK_DEV_DM
+       ---help---
+         Allow volume managers to compress data for SSD.
+
 config DM_LOG_USERSPACE
 	tristate "Mirror userspace logging"
 	depends on DM_MIRROR && NET
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 52ba8dd..966eb2c 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_INPLACE_COMPRESS)	+= dm-inplace-compress.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
new file mode 100644
index 0000000..c3c3750
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.c
@@ -0,0 +1,1487 @@
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/crypto.h>
+#include <linux/lzo.h>
+#include <linux/kthread.h>
+#include <linux/page-flags.h>
+#include <linux/completion.h>
+#include <linux/vmalloc.h>
+#include "dm-inplace-compress.h"
+
+#define DM_MSG_PREFIX "dm-inplace-compress"
+
+static struct dm_icomp_compressor_data compressors[] = {
+	[DMCP_COMP_ALG_LZO] = {
+		.name = "lzo",
+		.comp_len = lzo_comp_len,
+	},
+};
+static int default_compressor;
+
+static struct kmem_cache *dm_icomp_req_cachep;
+static struct kmem_cache *dm_icomp_io_range_cachep;
+static struct kmem_cache *dm_icomp_meta_io_cachep;
+
+static struct dm_icomp_io_worker dm_icomp_io_workers[NR_CPUS];
+static struct workqueue_struct *dm_icomp_wq;
+
+static u8 dm_icomp_get_meta(struct dm_icomp_info *info, u64 block_index)
+{
+	u64 first_bit = block_index * DMCP_META_BITS;
+	int bits, offset;
+	u8 data, ret = 0;
+
+	offset = first_bit & 7;
+	bits = min_t(u8, DMCP_META_BITS, 8 - offset);
+
+	data = info->meta_bitmap[first_bit >> 3];
+	ret = (data >> offset) & ((1 << bits) - 1);
+
+	if (bits < DMCP_META_BITS) {
+		data = info->meta_bitmap[(first_bit >> 3) + 1];
+		bits = DMCP_META_BITS - bits;
+		ret |= (data & ((1 << bits) - 1)) << (DMCP_META_BITS - bits);
+	}
+	return ret;
+}
+
+static void dm_icomp_set_meta(struct dm_icomp_info *info, u64 block_index,
+		u8 meta, bool dirty_meta)
+{
+	u64 first_bit = block_index * DMCP_META_BITS;
+	int bits, offset;
+	u8 data;
+	struct page *page;
+
+	offset = first_bit & 7;
+	bits = min_t(u8, DMCP_META_BITS, 8 - offset);
+
+	data = info->meta_bitmap[first_bit >> 3];
+	data &= ~(((1 << bits) - 1) << offset);
+	data |= (meta & ((1 << bits) - 1)) << offset;
+	info->meta_bitmap[first_bit >> 3] = data;
+
+	if (info->write_mode == DMCP_WRITE_BACK) {
+		page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
+		if (dirty_meta)
+			SetPageDirty(page);
+		else
+			ClearPageDirty(page);
+	}
+
+	if (bits < DMCP_META_BITS) {
+		meta >>= bits;
+		data = info->meta_bitmap[(first_bit >> 3) + 1];
+		bits = DMCP_META_BITS - bits;
+		data = (data >> bits) << bits;
+		data |= meta & ((1 << bits) - 1);
+		info->meta_bitmap[(first_bit >> 3) + 1] = data;
+
+		if (info->write_mode == DMCP_WRITE_BACK) {
+			page = vmalloc_to_page(&info->meta_bitmap[
+						(first_bit >> 3) + 1]);
+			if (dirty_meta)
+				SetPageDirty(page);
+			else
+				ClearPageDirty(page);
+		}
+	}
+}
+
+static void dm_icomp_set_extent(struct dm_icomp_req *req, u64 block,
+	u16 logical_blocks, sector_t data_sectors)
+{
+	int i;
+	u8 data;
+
+	for (i = 0; i < logical_blocks; i++) {
+		data = min_t(sector_t, data_sectors, 8);
+		data_sectors -= data;
+		if (i != 0)
+			data |= DMCP_TAIL_MASK;
+		/* For FUA, we write out meta data directly */
+		dm_icomp_set_meta(req->info, block + i, data,
+					!(req->bio->bi_rw & REQ_FUA));
+	}
+}
+
+static void dm_icomp_get_extent(struct dm_icomp_info *info, u64 block_index,
+	u64 *first_block_index, u16 *logical_sectors, u16 *data_sectors)
+{
+	u8 data;
+
+	data = dm_icomp_get_meta(info, block_index);
+	while (data & DMCP_TAIL_MASK) {
+		block_index--;
+		data = dm_icomp_get_meta(info, block_index);
+	}
+	*first_block_index = block_index;
+	*logical_sectors = DMCP_BLOCK_SIZE >> 9;
+	*data_sectors = data & DMCP_LENGTH_MASK;
+	block_index++;
+	while (block_index < info->data_blocks) {
+		data = dm_icomp_get_meta(info, block_index);
+		if (!(data & DMCP_TAIL_MASK))
+			break;
+		*logical_sectors += DMCP_BLOCK_SIZE >> 9;
+		*data_sectors += data & DMCP_LENGTH_MASK;
+		block_index++;
+	}
+}
+
+static int dm_icomp_access_super(struct dm_icomp_info *info, void *addr, int rw)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	int ret;
+
+	region.bdev = info->dev->bdev;
+	region.sector = 0;
+	region.count = DMCP_BLOCK_SIZE >> 9;
+
+	req.bi_rw = rw;
+	req.mem.type = DM_IO_KMEM;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = addr;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	ret = dm_io(&req, 1, &region, &io_error);
+	if (ret || io_error)
+		return -EIO;
+	return 0;
+}
+
+static void dm_icomp_meta_io_done(unsigned long error, void *context)
+{
+	struct dm_icomp_meta_io *meta_io = context;
+
+	meta_io->fn(meta_io->data, error);
+	kmem_cache_free(dm_icomp_meta_io_cachep, meta_io);
+}
+
+static int dm_icomp_write_meta(struct dm_icomp_info *info, u64 start_page,
+	u64 end_page, void *data,
+	void (*fn)(void *data, unsigned long error), int rw)
+{
+	struct dm_icomp_meta_io *meta_io;
+
+	WARN_ON(end_page > info->meta_bitmap_pages);
+
+	meta_io = kmem_cache_alloc(dm_icomp_meta_io_cachep, GFP_NOIO);
+	if (!meta_io) {
+		fn(data, -ENOMEM);
+		return -ENOMEM;
+	}
+	meta_io->data = data;
+	meta_io->fn = fn;
+
+	meta_io->io_region.bdev = info->dev->bdev;
+	meta_io->io_region.sector = DMCP_META_START_SECTOR +
+					(start_page << (PAGE_SHIFT - 9));
+	meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
+
+	atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
+
+	meta_io->io_req.bi_rw = rw;
+	meta_io->io_req.mem.type = DM_IO_VMA;
+	meta_io->io_req.mem.offset = 0;
+	meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
+						(start_page << PAGE_SHIFT);
+	meta_io->io_req.notify.fn = dm_icomp_meta_io_done;
+	meta_io->io_req.notify.context = meta_io;
+	meta_io->io_req.client = info->io_client;
+
+	dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
+	return 0;
+}
+
+struct writeback_flush_data {
+	struct completion complete;
+	atomic_t cnt;
+};
+
+static void writeback_flush_io_done(void *data, unsigned long error)
+{
+	struct writeback_flush_data *wb = data;
+
+	if (atomic_dec_return(&wb->cnt))
+		return;
+	complete(&wb->complete);
+}
+
+static void dm_icomp_flush_dirty_meta(struct dm_icomp_info *info,
+			struct writeback_flush_data *data)
+{
+	struct page *page;
+	u64 start = 0, index;
+	u32 pending = 0, cnt = 0;
+	bool dirty;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
+	for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
+		if (cnt == 256) {
+			cnt = 0;
+			cond_resched();
+		}
+
+		page = vmalloc_to_page(info->meta_bitmap +
+					(index << PAGE_SHIFT));
+		dirty = TestClearPageDirty(page);
+
+		if (pending == 0 && dirty) {
+			start = index;
+			pending++;
+			continue;
+		} else if (pending == 0)
+			continue;
+		else if (pending > 0 && dirty) {
+			pending++;
+			continue;
+		}
+
+		/* pending > 0 && !dirty */
+		atomic_inc(&data->cnt);
+		dm_icomp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, WRITE);
+		pending = 0;
+	}
+
+	if (pending > 0) {
+		atomic_inc(&data->cnt);
+		dm_icomp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, WRITE);
+	}
+	blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
+	blk_finish_plug(&plug);
+}
+
+static int dm_icomp_meta_writeback_thread(void *data)
+{
+	struct dm_icomp_info *info = data;
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	while (!kthread_should_stop()) {
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(info->writeback_delay * 1000));
+		dm_icomp_flush_dirty_meta(info, &wb);
+	}
+
+	dm_icomp_flush_dirty_meta(info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+	return 0;
+}
+
+static int dm_icomp_init_meta(struct dm_icomp_info *info, bool new)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	struct blk_plug plug;
+	int ret;
+	ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
+
+	len *= sizeof(unsigned long);
+
+	region.bdev = info->dev->bdev;
+	region.sector = DMCP_META_START_SECTOR;
+	region.count = (len + 511) >> 9;
+
+	req.mem.type = DM_IO_VMA;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = info->meta_bitmap;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	blk_start_plug(&plug);
+	if (new) {
+		memset(info->meta_bitmap, 0, len);
+		req.bi_rw = WRITE_FLUSH;
+		ret = dm_io(&req, 1, &region, &io_error);
+	} else {
+		req.bi_rw = READ;
+		ret = dm_io(&req, 1, &region, &io_error);
+	}
+	blk_finish_plug(&plug);
+
+	if (ret || io_error) {
+		info->ti->error = "Access metadata error";
+		return -EIO;
+	}
+
+	if (info->write_mode == DMCP_WRITE_BACK) {
+		info->writeback_tsk = kthread_run(
+			dm_icomp_meta_writeback_thread,
+			info, "dm_icomp_writeback");
+		if (IS_ERR(info->writeback_tsk)) {
+			info->ti->error = "Create writeback thread error";
+			return PTR_ERR(info->writeback_tsk);
+		}
+	}
+
+	return 0;
+}
+
+static int dm_icomp_alloc_compressor(struct dm_icomp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		info->tfm[i] = crypto_alloc_comp(
+			compressors[info->comp_alg].name, 0, 0);
+		if (IS_ERR(info->tfm[i])) {
+			info->tfm[i] = NULL;
+			goto err;
+		}
+	}
+	return 0;
+err:
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+	return -ENOMEM;
+}
+
+static void dm_icomp_free_compressor(struct dm_icomp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+}
+
+static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
+{
+	void *addr;
+	struct dm_icomp_super_block *super;
+	u64 total_blocks;
+	u64 data_blocks, meta_blocks;
+	u32 rem, cnt;
+	bool new_super = false;
+	int ret;
+	ssize_t len;
+
+	total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
+					DMCP_BLOCK_SHIFT;
+	data_blocks = total_blocks - 1;
+	rem = do_div(data_blocks, DMCP_BLOCK_SIZE * 8 + DMCP_META_BITS);
+	meta_blocks = data_blocks * DMCP_META_BITS;
+	data_blocks *= DMCP_BLOCK_SIZE * 8;
+
+	cnt = rem;
+	rem /= (DMCP_BLOCK_SIZE * 8 / DMCP_META_BITS + 1);
+	data_blocks += rem * (DMCP_BLOCK_SIZE * 8 / DMCP_META_BITS);
+	meta_blocks += rem;
+
+	cnt %= (DMCP_BLOCK_SIZE * 8 / DMCP_META_BITS + 1);
+	meta_blocks += 1;
+	data_blocks += cnt - 1;
+
+	info->data_blocks = data_blocks;
+	info->data_start = (1 + meta_blocks) << DMCP_BLOCK_SECTOR_SHIFT;
+
+	addr = kzalloc(DMCP_BLOCK_SIZE, GFP_KERNEL);
+	if (!addr) {
+		info->ti->error = "Cannot allocate super";
+		return -ENOMEM;
+	}
+
+	super = addr;
+	ret = dm_icomp_access_super(info, addr, READ);
+	if (ret)
+		goto out;
+
+	if (le64_to_cpu(super->magic) == DMCP_SUPER_MAGIC) {
+		if (le64_to_cpu(super->meta_blocks) != meta_blocks ||
+		    le64_to_cpu(super->data_blocks) != data_blocks) {
+			info->ti->error = "Super is invalid";
+			ret = -EINVAL;
+			goto out;
+		}
+		if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
+			info->ti->error =
+					"Compression algorithm is not supported";
+			ret = -EINVAL;
+			goto out;
+		}
+	} else {
+		super->magic = cpu_to_le64(DMCP_SUPER_MAGIC);
+		super->meta_blocks = cpu_to_le64(meta_blocks);
+		super->data_blocks = cpu_to_le64(data_blocks);
+		super->comp_alg = default_compressor;
+		ret = dm_icomp_access_super(info, addr, WRITE_FUA);
+		if (ret) {
+			info->ti->error = "Access super fails";
+			goto out;
+		}
+		new_super = true;
+	}
+
+	info->comp_alg = super->comp_alg;
+	if (dm_icomp_alloc_compressor(info)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->meta_bitmap_bits = data_blocks * DMCP_META_BITS;
+	len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
+	len *= sizeof(unsigned long);
+	info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
+	if (!info->meta_bitmap) {
+		ret = -ENOMEM;
+		goto bitmap_err;
+	}
+
+	ret = dm_icomp_init_meta(info, new_super);
+	if (ret)
+		goto meta_err;
+
+	goto out;	/* ret is 0 here; 'out' also frees the super buffer */
+meta_err:
+	vfree(info->meta_bitmap);
+bitmap_err:
+	dm_icomp_free_compressor(info);
+out:
+	kfree(addr);
+	return ret;
+}
+
+/*
+ * <dev> <writethrough>/<writeback> <meta_commit_delay>
+ */
+static int dm_icomp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_icomp_info *info;
+	char write_mode[15];
+	int ret, i;
+
+	if (argc < 2) {
+		ti->error = "Invalid argument count";
+		return -EINVAL;
+	}
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		ti->error = "dm-inplace-compress: Cannot allocate context";
+		return -ENOMEM;
+	}
+	info->ti = ti;
+
+	if (sscanf(argv[1], "%14s", write_mode) != 1) {
+		ti->error = "Invalid argument";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	if (strcmp(write_mode, "writeback") == 0) {
+		if (argc != 3) {
+			ti->error = "Invalid argument";
+			ret = -EINVAL;
+			goto err_para;
+		}
+		info->write_mode = DMCP_WRITE_BACK;
+		if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
+			ti->error = "Invalid argument";
+			ret = -EINVAL;
+			goto err_para;
+		}
+	} else if (strcmp(write_mode, "writethrough") == 0) {
+		info->write_mode = DMCP_WRITE_THROUGH;
+	} else {
+		ti->error = "Invalid argument";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+							&info->dev)) {
+		ti->error = "Can't get device";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	info->io_client = dm_io_client_create();
+	if (IS_ERR(info->io_client)) {
+		ti->error = "Can't create io client";
+		ret = PTR_ERR(info->io_client);
+		goto err_ioclient;
+	}
+
+	if (bdev_logical_block_size(info->dev->bdev) != 512) {
+		ti->error = "Logical block size must be 512 bytes";
+		ret = -EINVAL;
+		goto err_blocksize;
+	}
+
+	ret = dm_icomp_read_or_create_super(info);
+	if (ret)
+		goto err_blocksize;
+
+	for (i = 0; i < BITMAP_HASH_LEN; i++) {
+		info->bitmap_locks[i].io_running = 0;
+		spin_lock_init(&info->bitmap_locks[i].wait_lock);
+		INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
+	}
+
+	atomic64_set(&info->compressed_write_size, 0);
+	atomic64_set(&info->uncompressed_write_size, 0);
+	atomic64_set(&info->meta_write_size, 0);
+	ti->num_flush_bios = 1;
+	/* ti->num_discard_bios = 1; */
+	ti->private = info;
+	return 0;
+err_blocksize:
+	dm_io_client_destroy(info->io_client);
+err_ioclient:
+	dm_put_device(ti, info->dev);
+err_para:
+	kfree(info);
+	return ret;
+}
+
+static void dm_icomp_dtr(struct dm_target *ti)
+{
+	struct dm_icomp_info *info = ti->private;
+
+	if (info->write_mode == DMCP_WRITE_BACK)
+		kthread_stop(info->writeback_tsk);
+	dm_icomp_free_compressor(info);
+	vfree(info->meta_bitmap);
+	dm_io_client_destroy(info->io_client);
+	dm_put_device(ti, info->dev);
+	kfree(info);
+}
+
+static u64 dm_icomp_sector_to_block(sector_t sect)
+{
+	return sect >> DMCP_BLOCK_SECTOR_SHIFT;
+}
+
+static struct dm_icomp_hash_lock *dm_icomp_block_hash_lock(
+		struct dm_icomp_info *info, u64 block_index)
+{
+	return &info->bitmap_locks[(block_index >> BITMAP_HASH_SHIFT) &
+			BITMAP_HASH_MASK];
+}
+
+static struct dm_icomp_hash_lock *dm_icomp_trylock_block(
+		struct dm_icomp_info *info,
+		struct dm_icomp_req *req, u64 block_index)
+{
+	struct dm_icomp_hash_lock *hash_lock;
+
+	hash_lock = dm_icomp_block_hash_lock(req->info, block_index);
+
+	spin_lock_irq(&hash_lock->wait_lock);
+	if (!hash_lock->io_running) {
+		hash_lock->io_running = 1;
+		spin_unlock_irq(&hash_lock->wait_lock);
+		return hash_lock;
+	}
+	list_add_tail(&req->sibling, &hash_lock->wait_list);
+	spin_unlock_irq(&hash_lock->wait_lock);
+	return NULL;
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+	 struct list_head *list);
+
+static void dm_icomp_unlock_block(struct dm_icomp_info *info,
+	struct dm_icomp_req *req, struct dm_icomp_hash_lock *hash_lock)
+{
+	LIST_HEAD(pending_list);
+	unsigned long flags;
+
+	spin_lock_irqsave(&hash_lock->wait_lock, flags);
+	/* wakeup all pending reqs to avoid live lock */
+	list_splice_init(&hash_lock->wait_list, &pending_list);
+	hash_lock->io_running = 0;
+	spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
+
+	dm_icomp_queue_req_list(info, &pending_list);
+}
+
+static int dm_icomp_lock_req_range(struct dm_icomp_req *req)
+{
+	u64 block_index, first_block_index;
+	u64 first_lock_block, second_lock_block;
+	u16 logical_sectors, data_sectors;
+
+	block_index = dm_icomp_sector_to_block(req->bio->bi_iter.bi_sector);
+	req->locks[0] = dm_icomp_trylock_block(req->info, req, block_index);
+	if (!req->locks[0])
+		return 0;
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+				&logical_sectors, &data_sectors);
+	if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+						req->locks[0]) {
+		dm_icomp_unlock_block(req->info, req, req->locks[0]);
+
+		first_lock_block = first_block_index;
+		second_lock_block = block_index;
+		goto two_locks;
+	}
+
+	block_index = dm_icomp_sector_to_block(bio_end_sector(req->bio) - 1);
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+				&logical_sectors, &data_sectors);
+	first_block_index += dm_icomp_sector_to_block(logical_sectors) - 1;
+	if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+						req->locks[0]) {
+		second_lock_block = first_block_index;
+		goto second_lock;
+	}
+	req->locked_locks = 1;
+	return 1;
+
+two_locks:
+	req->locks[0] = dm_icomp_trylock_block(req->info, req,
+		first_lock_block);
+	if (!req->locks[0])
+		return 0;
+second_lock:
+	req->locks[1] = dm_icomp_trylock_block(req->info, req,
+				second_lock_block);
+	if (!req->locks[1]) {
+		dm_icomp_unlock_block(req->info, req, req->locks[0]);
+		return 0;
+	}
+	/* Don't need check if meta is changed */
+	req->locked_locks = 2;
+	return 1;
+}
+
+static void dm_icomp_unlock_req_range(struct dm_icomp_req *req)
+{
+	int i;
+
+	for (i = req->locked_locks - 1; i >= 0; i--)
+		dm_icomp_unlock_block(req->info, req, req->locks[i]);
+}
+
+static void dm_icomp_queue_req(struct dm_icomp_info *info,
+		struct dm_icomp_req *req)
+{
+	unsigned long flags;
+	struct dm_icomp_io_worker *worker = &dm_icomp_io_workers[req->cpu];
+
+	spin_lock_irqsave(&worker->lock, flags);
+	list_add_tail(&req->sibling, &worker->pending);
+	spin_unlock_irqrestore(&worker->lock, flags);
+
+	queue_work_on(req->cpu, dm_icomp_wq, &worker->work);
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+		struct list_head *list)
+{
+	struct dm_icomp_req *req;
+
+	while (!list_empty(list)) {
+		req = list_first_entry(list, struct dm_icomp_req, sibling);
+		list_del_init(&req->sibling);
+		dm_icomp_queue_req(info, req);
+	}
+}
+
+static void dm_icomp_get_req(struct dm_icomp_req *req)
+{
+	atomic_inc(&req->io_pending);
+}
+
+static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
+{
+	kfree(io->decomp_data);
+	kfree(io->comp_data);
+	kmem_cache_free(dm_icomp_io_range_cachep, io);
+}
+
+static void dm_icomp_put_req(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+
+	if (atomic_dec_return(&req->io_pending))
+		return;
+
+	if (req->stage == STAGE_INIT) /* waiting for locking */
+		return;
+
+	if (req->stage == STAGE_READ_DECOMP ||
+	    req->stage == STAGE_WRITE_COMP ||
+	    req->result)
+		req->stage = STAGE_DONE;
+
+	if (req->stage != STAGE_DONE) {
+		dm_icomp_queue_req(req->info, req);
+		return;
+	}
+
+	while (!list_empty(&req->all_io)) {
+		io = list_entry(req->all_io.next,
+			struct dm_icomp_io_range, next);
+		list_del(&io->next);
+		dm_icomp_free_io_range(io);
+	}
+
+	dm_icomp_unlock_req_range(req);
+
+	req->bio->bi_error = req->result;
+	bio_endio(req->bio);
+	kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_io_range_done(unsigned long error, void *context)
+{
+	struct dm_icomp_io_range *io = context;
+
+	if (error)
+		io->req->result = error;
+	dm_icomp_put_req(io->req);
+}
+
+static inline int dm_icomp_compressor_len(struct dm_icomp_info *info, int len)
+{
+	if (compressors[info->comp_alg].comp_len)
+		return compressors[info->comp_alg].comp_len(len);
+	return len;
+}
+
+/*
+ * caller should set region.sector, region.count. bi_rw. IO always to/from
+ * comp_data
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_range(
+	struct dm_icomp_req *req, int comp_len, int decomp_len)
+{
+	struct dm_icomp_io_range *io;
+
+	io = kmem_cache_alloc(dm_icomp_io_range_cachep, GFP_NOIO);
+	if (!io)
+		return NULL;
+
+	io->comp_data = kmalloc(dm_icomp_compressor_len(req->info, comp_len),
+								GFP_NOIO);
+	io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
+	if (!io->decomp_data || !io->comp_data) {
+		kfree(io->decomp_data);
+		kfree(io->comp_data);
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	io->io_req.notify.fn = dm_icomp_io_range_done;
+	io->io_req.notify.context = io;
+	io->io_req.client = req->info->io_client;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.ptr.addr = io->comp_data;
+	io->io_req.mem.offset = 0;
+
+	io->io_region.bdev = req->info->dev->bdev;
+
+	io->decomp_len = decomp_len;
+	io->comp_len = comp_len;
+	io->req = req;
+	return io;
+}
+
+static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
+		ssize_t len, bool to_buf)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	off_t buf_off = 0;
+	ssize_t size;
+	void *addr;
+
+	WARN_ON(bio_off + len > (bio_sectors(bio) << 9));
+
+	bio_for_each_segment(bv, bio, iter) {
+		int length = bv.bv_len;
+
+		if (bio_off >= length) {
+			bio_off -= length;
+			continue;
+		}
+		addr = kmap_atomic(bv.bv_page);
+		size = min_t(ssize_t, len, length - bio_off);
+		if (to_buf)
+			memcpy(buf + buf_off, addr + bio_off + bv.bv_offset,
+				size);
+		else
+			memcpy(addr + bio_off + bv.bv_offset, buf + buf_off,
+				size);
+		kunmap_atomic(addr);
+		bio_off = 0;
+		buf_off += size;
+		len -= size;
+	}
+}
+
+/*
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ * The compressed data size is rounded up to a multiple of 512; that is the
+ * payload. The actual compressed length is stored in the last u32 of the
+ * payload. If rounding leaves no room for it, the payload grows by 512.
+ */
+static int dm_icomp_io_range_comp(struct dm_icomp_info *info, void *comp_data,
+	unsigned int *comp_len, void *decomp_data, unsigned int decomp_len,
+	bool do_comp)
+{
+	struct crypto_comp *tfm;
+	u32 *addr;
+	unsigned int actual_comp_len;
+	int ret;
+
+	if (do_comp) {
+		actual_comp_len = *comp_len;
+
+		tfm = info->tfm[get_cpu()];
+		ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
+			comp_data, &actual_comp_len);
+		put_cpu();
+
+		atomic64_add(decomp_len, &info->uncompressed_write_size);
+		if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
+			*comp_len = decomp_len;
+			atomic64_add(*comp_len, &info->compressed_write_size);
+			return 1;
+		}
+
+		*comp_len = round_up(actual_comp_len, 512);
+		if (*comp_len - actual_comp_len < sizeof(u32))
+			*comp_len += 512;
+		atomic64_add(*comp_len, &info->compressed_write_size);
+		addr = comp_data + *comp_len;
+		addr--;
+		*addr = cpu_to_le32(actual_comp_len);
+	} else {
+		if (*comp_len == decomp_len)
+			return 1;
+		addr = comp_data + *comp_len;
+		addr--;
+		actual_comp_len = le32_to_cpu(*addr);
+
+		tfm = info->tfm[get_cpu()];
+		ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
+			decomp_data, &decomp_len);
+		put_cpu();
+		if (ret)
+			return -EINVAL;
+	}
+	return 0;
+}
+
+static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	off_t bio_off = 0;
+	int ret;
+
+	req->stage = STAGE_READ_DECOMP;
+
+	if (req->result)
+		return;
+
+	list_for_each_entry(io, &req->all_io, next) {
+		ssize_t dst_off = 0, src_off = 0, len;
+
+		io->io_region.sector -= req->info->data_start;
+
+		/* Do decomp here */
+		ret = dm_icomp_io_range_comp(req->info, io->comp_data,
+			&io->comp_len, io->decomp_data, io->decomp_len, false);
+		if (ret < 0) {
+			req->result = -EIO;
+			return;
+		}
+
+		if (io->io_region.sector >= req->bio->bi_iter.bi_sector)
+			dst_off = (io->io_region.sector -
+				 req->bio->bi_iter.bi_sector) << 9;
+		else
+			src_off = (req->bio->bi_iter.bi_sector -
+				 io->io_region.sector) << 9;
+
+		len = min_t(ssize_t, io->decomp_len - src_off,
+			(bio_sectors(req->bio) << 9) - dst_off);
+
+		/* io range in all_io list is ordered for read IO */
+		while (bio_off != dst_off) {
+			ssize_t size = min_t(ssize_t, PAGE_SIZE,
+					dst_off - bio_off);
+			dm_icomp_bio_copy(req->bio, bio_off, empty_zero_page,
+					size, false);
+			bio_off += size;
+		}
+
+		if (ret == 1)
+			dm_icomp_bio_copy(req->bio, dst_off,
+					io->comp_data + src_off, len, false);
+		else
+			dm_icomp_bio_copy(req->bio, dst_off,
+					io->decomp_data + src_off, len, false);
+		bio_off = dst_off + len;
+	}
+
+	while (bio_off != (bio_sectors(req->bio) << 9)) {
+		ssize_t size = min_t(ssize_t, PAGE_SIZE,
+			(bio_sectors(req->bio) << 9) - bio_off);
+		dm_icomp_bio_copy(req->bio, bio_off, empty_zero_page,
+			size, false);
+		bio_off += size;
+	}
+}
+
+static void dm_icomp_read_one_extent(struct dm_icomp_req *req, u64 block,
+	u16 logical_sectors, u16 data_sectors)
+{
+	struct dm_icomp_io_range *io;
+
+	io = dm_icomp_create_io_range(req, data_sectors << 9,
+		logical_sectors << 9);
+	if (!io) {
+		req->result = -EIO;
+		return;
+	}
+
+	dm_icomp_get_req(req);
+	list_add_tail(&io->next, &req->all_io);
+
+	io->io_region.sector = (block << DMCP_BLOCK_SECTOR_SHIFT) +
+				req->info->data_start;
+	io->io_region.count = data_sectors;
+
+	io->io_req.bi_rw = READ;
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+}
+
+static void dm_icomp_handle_read_read_existing(struct dm_icomp_req *req)
+{
+	u64 block_index, first_block_index;
+	u16 logical_sectors, data_sectors;
+
+	req->stage = STAGE_READ_EXISTING;
+
+	block_index = dm_icomp_sector_to_block(req->bio->bi_iter.bi_sector);
+again:
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+		&logical_sectors, &data_sectors);
+	if (data_sectors > 0)
+		dm_icomp_read_one_extent(req, first_block_index,
+			logical_sectors, data_sectors);
+
+	if (req->result)
+		return;
+
+	block_index = first_block_index + (logical_sectors >>
+				DMCP_BLOCK_SECTOR_SHIFT);
+	if ((block_index << DMCP_BLOCK_SECTOR_SHIFT) < bio_end_sector(req->bio))
+		goto again;
+
+	/* A shortcut if all data is in already */
+	if (list_empty(&req->all_io))
+		dm_icomp_handle_read_decomp(req);
+}
+
+static void dm_icomp_handle_read_request(struct dm_icomp_req *req)
+{
+	dm_icomp_get_req(req);
+
+	if (req->stage == STAGE_INIT) {
+		if (!dm_icomp_lock_req_range(req)) {
+			dm_icomp_put_req(req);
+			return;
+		}
+
+		dm_icomp_handle_read_read_existing(req);
+	} else if (req->stage == STAGE_READ_EXISTING)
+		dm_icomp_handle_read_decomp(req);
+
+	dm_icomp_put_req(req);
+}
+
+static void dm_icomp_write_meta_done(void *context, unsigned long error)
+{
+	struct dm_icomp_req *req = context;
+
+	dm_icomp_put_req(req);
+}
+
+static u64 dm_icomp_block_meta_page_index(u64 block, bool end)
+{
+	u64 bits = block * DMCP_META_BITS - !!end;
+	/* (1 << 3) bits per byte */
+	return bits >> (3 + PAGE_SHIFT);
+}
+
+static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
+	u64 *meta_start, u64 *meta_end, bool *handle_bio)
+{
+	struct dm_icomp_req *req = io->req;
+	sector_t start, count;
+	unsigned int comp_len;
+	off_t offset;
+	u64 page_index;
+	int ret;
+
+	io->io_region.sector -= req->info->data_start;
+
+	/* decompress original data */
+	ret = dm_icomp_io_range_comp(req->info, io->comp_data, &io->comp_len,
+			io->decomp_data, io->decomp_len, false);
+	if (ret < 0) {
+		req->result = -EINVAL;
+		return -EIO;
+	}
+
+	start = io->io_region.sector;
+	count = io->decomp_len >> 9;
+	if (start < req->bio->bi_iter.bi_sector && start + count >
+					bio_end_sector(req->bio)) {
+		/* we don't split an extent */
+		if (ret == 1) {
+			memcpy(io->decomp_data, io->comp_data, io->decomp_len);
+			dm_icomp_bio_copy(req->bio, 0,
+			   io->decomp_data +
+			   ((req->bio->bi_iter.bi_sector - start) << 9),
+			   bio_sectors(req->bio) << 9, true);
+		} else {
+			dm_icomp_bio_copy(req->bio, 0,
+			   io->decomp_data +
+			   ((req->bio->bi_iter.bi_sector - start) << 9),
+			   bio_sectors(req->bio) << 9, true);
+
+			kfree(io->comp_data);
+			/* New compressed len might be bigger */
+			io->comp_data = kmalloc(
+				dm_icomp_compressor_len(req->info,
+					io->decomp_len), GFP_NOIO);
+			io->comp_len = io->decomp_len;
+			if (!io->comp_data) {
+				req->result = -ENOMEM;
+				return -EIO;
+			}
+			io->io_req.mem.ptr.addr = io->comp_data;
+		}
+		/* need compress data */
+		ret = 0;
+		offset = 0;
+		*handle_bio = false;
+	} else if (start < req->bio->bi_iter.bi_sector) {
+		count = req->bio->bi_iter.bi_sector - start;
+		offset = 0;
+	} else {
+		offset = bio_end_sector(req->bio) - start;
+		start = bio_end_sector(req->bio);
+		count = count - offset;
+	}
+
+	/* Original data is uncompressed, we don't need writeback */
+	if (ret == 1) {
+		comp_len = count << 9;
+		goto handle_meta;
+	}
+
+	/* assume less data compresses to less space (at least 4k less data) */
+	comp_len = io->comp_len;
+	ret = dm_icomp_io_range_comp(req->info, io->comp_data, &comp_len,
+		io->decomp_data + (offset << 9), count << 9, true);
+	if (ret < 0) {
+		req->result = -EIO;
+		return -EIO;
+	}
+
+	dm_icomp_get_req(req);
+	if (ret == 1)
+		io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
+	io->io_region.count = comp_len >> 9;
+	io->io_region.sector = start + req->info->data_start;
+
+	io->io_req.bi_rw = req->bio->bi_rw;
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+handle_meta:
+	dm_icomp_set_extent(req, start >> DMCP_BLOCK_SECTOR_SHIFT,
+		count >> DMCP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
+
+	page_index = dm_icomp_block_meta_page_index(start >>
+					DMCP_BLOCK_SECTOR_SHIFT, false);
+	if (*meta_start > page_index)
+		*meta_start = page_index;
+	page_index = dm_icomp_block_meta_page_index(
+			(start + count) >> DMCP_BLOCK_SECTOR_SHIFT, true);
+	if (*meta_end < page_index)
+		*meta_end = page_index;
+	return 0;
+}
+
+static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	sector_t count;
+	unsigned int comp_len;
+	u64 meta_start = -1L, meta_end = 0, page_index;
+	int ret;
+	bool handle_bio = true;
+
+	req->stage = STAGE_WRITE_COMP;
+
+	if (req->result)
+		return;
+
+	list_for_each_entry(io, &req->all_io, next) {
+		if (dm_icomp_handle_write_modify(io, &meta_start, &meta_end,
+						&handle_bio))
+			return;
+	}
+
+	if (!handle_bio)
+		goto update_meta;
+
+	count = bio_sectors(req->bio);
+	io = dm_icomp_create_io_range(req, count << 9, count << 9);
+	if (!io) {
+		req->result = -EIO;
+		return;
+	}
+	dm_icomp_bio_copy(req->bio, 0, io->decomp_data, count << 9, true);
+
+	/* compress data */
+	comp_len = io->comp_len;
+	ret = dm_icomp_io_range_comp(req->info, io->comp_data, &comp_len,
+		io->decomp_data, count << 9, true);
+	if (ret < 0) {
+		dm_icomp_free_io_range(io);
+		req->result = -EIO;
+		return;
+	}
+
+	dm_icomp_get_req(req);
+	list_add_tail(&io->next, &req->all_io);
+	io->io_region.sector = req->bio->bi_iter.bi_sector +
+			 req->info->data_start;
+
+	if (ret == 1)
+		io->io_req.mem.ptr.addr = io->decomp_data;
+
+	io->io_region.count = comp_len >> 9;
+	io->io_req.bi_rw = req->bio->bi_rw;
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+	dm_icomp_set_extent(req,
+		req->bio->bi_iter.bi_sector >> DMCP_BLOCK_SECTOR_SHIFT,
+		count >> DMCP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
+
+	page_index = dm_icomp_block_meta_page_index(
+		req->bio->bi_iter.bi_sector >> DMCP_BLOCK_SECTOR_SHIFT, false);
+	if (meta_start > page_index)
+		meta_start = page_index;
+
+	page_index = dm_icomp_block_meta_page_index(
+	   (req->bio->bi_iter.bi_sector + count) >> DMCP_BLOCK_SECTOR_SHIFT,
+	     true);
+
+	if (meta_end < page_index)
+		meta_end = page_index;
+update_meta:
+	if (req->info->write_mode == DMCP_WRITE_THROUGH ||
+						(req->bio->bi_rw & REQ_FUA)) {
+		dm_icomp_get_req(req);
+		dm_icomp_write_meta(req->info, meta_start, meta_end + 1, req,
+			dm_icomp_write_meta_done, req->bio->bi_rw);
+	}
+}
+
+static void dm_icomp_handle_write_read_existing(struct dm_icomp_req *req)
+{
+	u64 block_index, first_block_index;
+	u16 logical_sectors, data_sectors;
+
+	req->stage = STAGE_READ_EXISTING;
+
+	block_index = dm_icomp_sector_to_block(req->bio->bi_iter.bi_sector);
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+		&logical_sectors, &data_sectors);
+	if (data_sectors > 0 && (first_block_index < block_index ||
+	    first_block_index + dm_icomp_sector_to_block(logical_sectors) >
+	    dm_icomp_sector_to_block(bio_end_sector(req->bio))))
+		dm_icomp_read_one_extent(req, first_block_index,
+			logical_sectors, data_sectors);
+
+	if (req->result)
+		return;
+
+	if (first_block_index + dm_icomp_sector_to_block(logical_sectors) >=
+	    dm_icomp_sector_to_block(bio_end_sector(req->bio)))
+		goto out;
+
+	block_index = dm_icomp_sector_to_block(bio_end_sector(req->bio)) - 1;
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+		&logical_sectors, &data_sectors);
+	if (data_sectors > 0 &&
+	    first_block_index + dm_icomp_sector_to_block(logical_sectors) >
+	    block_index + 1)
+		dm_icomp_read_one_extent(req, first_block_index,
+			logical_sectors, data_sectors);
+
+	if (req->result)
+		return;
+out:
+	if (list_empty(&req->all_io))
+		dm_icomp_handle_write_comp(req);
+}
+
+static void dm_icomp_handle_write_request(struct dm_icomp_req *req)
+{
+	dm_icomp_get_req(req);
+
+	if (req->stage == STAGE_INIT) {
+		if (!dm_icomp_lock_req_range(req)) {
+			dm_icomp_put_req(req);
+			return;
+		}
+
+		dm_icomp_handle_write_read_existing(req);
+	} else if (req->stage == STAGE_READ_EXISTING)
+		dm_icomp_handle_write_comp(req);
+
+	dm_icomp_put_req(req);
+}
+
+/* For writeback mode */
+static void dm_icomp_handle_flush_request(struct dm_icomp_req *req)
+{
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	dm_icomp_flush_dirty_meta(req->info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+
+	req->bio->bi_error = 0;
+	bio_endio(req->bio);
+	kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_handle_request(struct dm_icomp_req *req)
+{
+	if (req->bio->bi_rw & REQ_FLUSH)
+		dm_icomp_handle_flush_request(req);
+	else if (req->bio->bi_rw & REQ_WRITE)
+		dm_icomp_handle_write_request(req);
+	else
+		dm_icomp_handle_read_request(req);
+}
+
+static void dm_icomp_do_request_work(struct work_struct *work)
+{
+	struct dm_icomp_io_worker *worker = container_of(work,
+				struct dm_icomp_io_worker, work);
+	LIST_HEAD(list);
+	struct dm_icomp_req *req;
+	struct blk_plug plug;
+	bool repeat;
+
+	blk_start_plug(&plug);
+again:
+	spin_lock_irq(&worker->lock);
+	list_splice_init(&worker->pending, &list);
+	spin_unlock_irq(&worker->lock);
+
+	repeat = !list_empty(&list);
+	while (!list_empty(&list)) {
+		req = list_first_entry(&list, struct dm_icomp_req, sibling);
+		list_del(&req->sibling);
+
+		dm_icomp_handle_request(req);
+	}
+	if (repeat)
+		goto again;
+	blk_finish_plug(&plug);
+}
+
+static int dm_icomp_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_icomp_info *info = ti->private;
+	struct dm_icomp_req *req;
+
+	if ((bio->bi_rw & REQ_FLUSH) &&
+			info->write_mode == DMCP_WRITE_THROUGH) {
+		bio->bi_bdev = info->dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+	req = kmem_cache_alloc(dm_icomp_req_cachep, GFP_NOIO);
+	if (!req)
+		return -EIO;
+
+	req->bio = bio;
+	req->info = info;
+	atomic_set(&req->io_pending, 0);
+	INIT_LIST_HEAD(&req->all_io);
+	req->result = 0;
+	req->stage = STAGE_INIT;
+	req->locked_locks = 0;
+
+	req->cpu = raw_smp_processor_id();
+	dm_icomp_queue_req(info, req);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+static void dm_icomp_status(struct dm_target *ti, status_type_t type,
+	  unsigned int status_flags, char *result, unsigned int maxlen)
+{
+	struct dm_icomp_info *info = ti->private;
+	unsigned int sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%lu %lu %lu",
+			atomic64_read(&info->uncompressed_write_size),
+			atomic64_read(&info->compressed_write_size),
+			atomic64_read(&info->meta_write_size));
+		break;
+	case STATUSTYPE_TABLE:
+		if (info->write_mode == DMCP_WRITE_BACK)
+			DMEMIT("%s %s %d", info->dev->name, "writeback",
+				info->writeback_delay);
+		else
+			DMEMIT("%s %s", info->dev->name, "writethrough");
+		break;
+	}
+}
+
+static int dm_icomp_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_icomp_info *info = ti->private;
+
+	return fn(ti, info->dev, info->data_start,
+		info->data_blocks << DMCP_BLOCK_SECTOR_SHIFT, data);
+}
+
+static void dm_icomp_io_hints(struct dm_target *ti,
+			    struct queue_limits *limits)
+{
+	/* No blk_limits_logical_block_size */
+	limits->logical_block_size = limits->physical_block_size =
+		limits->io_min = DMCP_BLOCK_SIZE;
+}
+
+static struct target_type dm_icomp_target = {
+	.name   = "inplacecompress",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr    = dm_icomp_ctr,
+	.dtr    = dm_icomp_dtr,
+	.map    = dm_icomp_map,
+	.status = dm_icomp_status,
+	.iterate_devices = dm_icomp_iterate_devices,
+	.io_hints = dm_icomp_io_hints,
+};
+
+static int __init dm_icomp_init(void)
+{
+	int r;
+
+	for (r = 0; r < ARRAY_SIZE(compressors); r++)
+		if (crypto_has_comp(compressors[r].name, 0, 0))
+			break;
+	if (r >= ARRAY_SIZE(compressors)) {
+		DMWARN("No crypto compressors are supported");
+		return -EINVAL;
+	}
+
+	default_compressor = r;
+
+	r = -ENOMEM;
+	dm_icomp_req_cachep = kmem_cache_create("dm_icomp_requests",
+		sizeof(struct dm_icomp_req), 0, 0, NULL);
+	if (!dm_icomp_req_cachep) {
+		DMWARN("Can't create request cache");
+		goto err;
+	}
+
+	dm_icomp_io_range_cachep = kmem_cache_create("dm_icomp_io_range",
+		sizeof(struct dm_icomp_io_range), 0, 0, NULL);
+	if (!dm_icomp_io_range_cachep) {
+		DMWARN("Can't create io_range cache");
+		goto err;
+	}
+
+	dm_icomp_meta_io_cachep = kmem_cache_create("dm_icomp_meta_io",
+		sizeof(struct dm_icomp_meta_io), 0, 0, NULL);
+	if (!dm_icomp_meta_io_cachep) {
+		DMWARN("Can't create meta_io cache");
+		goto err;
+	}
+
+	dm_icomp_wq = alloc_workqueue("dm_icomp_io",
+		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+	if (!dm_icomp_wq) {
+		DMWARN("Can't create io workqueue");
+		goto err;
+	}
+
+	r = dm_register_target(&dm_icomp_target);
+	if (r < 0) {
+		DMWARN("target registration failed");
+		goto err;
+	}
+
+	for_each_possible_cpu(r) {
+		INIT_LIST_HEAD(&dm_icomp_io_workers[r].pending);
+		spin_lock_init(&dm_icomp_io_workers[r].lock);
+		INIT_WORK(&dm_icomp_io_workers[r].work,
+				dm_icomp_do_request_work);
+	}
+	return 0;
+err:
+	kmem_cache_destroy(dm_icomp_req_cachep);
+	kmem_cache_destroy(dm_icomp_io_range_cachep);
+	kmem_cache_destroy(dm_icomp_meta_io_cachep);
+	if (dm_icomp_wq)
+		destroy_workqueue(dm_icomp_wq);
+
+	return r;
+}
+
+static void __exit dm_icomp_exit(void)
+{
+	dm_unregister_target(&dm_icomp_target);
+	kmem_cache_destroy(dm_icomp_req_cachep);
+	kmem_cache_destroy(dm_icomp_io_range_cachep);
+	kmem_cache_destroy(dm_icomp_meta_io_cachep);
+	destroy_workqueue(dm_icomp_wq);
+}
+
+module_init(dm_icomp_init);
+module_exit(dm_icomp_exit);
+
+MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
+MODULE_DESCRIPTION(DM_NAME " target with data inplace-compression");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
new file mode 100644
index 0000000..e07b9b7
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.h
@@ -0,0 +1,140 @@
+#ifndef __DM_INPLACE_COMPRESS_H__
+#define __DM_INPLACE_COMPRESS_H__
+#include <linux/types.h>
+
+#define DMCP_SUPER_MAGIC 0x106526c206506c09
+struct dm_icomp_super_block {
+	__le64 magic;
+	__le64 meta_blocks;
+	__le64 data_blocks;
+	u8 comp_alg;
+} __packed;
+
+#define DMCP_COMP_ALG_LZO 0
+
+#ifdef __KERNEL__
+struct dm_icomp_compressor_data {
+	char *name;
+	int (*comp_len)(int comp_len);
+};
+
+static inline int lzo_comp_len(int comp_len)
+{
+	return lzo1x_worst_compress(comp_len);
+}
+
+/*
+ * Minium logical sector size of this target is 4096 byte, which is a block.
+ * Data of a block is compressed. Compressed data is round up to 512B, which is
+ * the payload. Each block has 5 bits of metadata. Bits 0-3 hold the payload
+ * length, 0-8 sectors. If the compressed payload length is 8 sectors, we
+ * just store uncompressed data. The actual compressed data length is stored in
+ * the last 32 bits of the payload if data is compressed. On disk, the payload
+ * starts at the first logical sector of the block. If the IO size is bigger
+ * than one block, we store the whole data as an extent. Bit 4 marks the tail
+ * of an extent. The maximum allowed extent size is 128k.
+ */
+#define DMCP_BLOCK_SIZE 4096
+#define DMCP_BLOCK_SHIFT 12
+#define DMCP_BLOCK_SECTOR_SHIFT (DMCP_BLOCK_SHIFT - 9)
+
+#define DMCP_MIN_SIZE 4096
+#define DMCP_MAX_SIZE (128 * 1024)
+
+#define DMCP_LENGTH_MASK ((1 << 4) - 1)
+#define DMCP_TAIL_MASK (1 << 4)
+#define DMCP_META_BITS 5
+
+#define DMCP_META_START_SECTOR (DMCP_BLOCK_SIZE >> 9)
+
+enum DMCP_WRITE_MODE {
+	DMCP_WRITE_BACK,
+	DMCP_WRITE_THROUGH,
+};
+
+/* 128*4 = 512k, since max IO size is 128k, an IO crosses at most 2 hash */
+#define BITMAP_HASH_SHIFT 7
+#define BITMAP_HASH_MASK ((1 << 6) - 1)
+#define BITMAP_HASH_LEN 64
+struct dm_icomp_hash_lock {
+	int io_running;
+	spinlock_t wait_lock;
+	struct list_head wait_list;
+};
+
+struct dm_icomp_info {
+	struct dm_target *ti;
+	struct dm_dev *dev;
+
+	int comp_alg;
+	struct crypto_comp *tfm[NR_CPUS];
+
+	sector_t data_start;
+	u64 data_blocks;
+
+	char *meta_bitmap;
+	u64 meta_bitmap_bits;
+	u64 meta_bitmap_pages;
+	struct dm_icomp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
+
+	enum DMCP_WRITE_MODE write_mode;
+	unsigned int writeback_delay; /* second */
+	struct task_struct *writeback_tsk;
+	struct dm_io_client *io_client;
+
+	atomic64_t compressed_write_size;
+	atomic64_t uncompressed_write_size;
+	atomic64_t meta_write_size;
+};
+
+struct dm_icomp_meta_io {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	void *data;
+	void (*fn)(void *data, unsigned long error);
+};
+
+struct dm_icomp_io_range {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	void *decomp_data;
+	unsigned int decomp_len;
+	void *comp_data;
+	unsigned int comp_len; /* For write, this is estimated */
+	struct list_head next;
+	struct dm_icomp_req *req;
+};
+
+enum DMCP_REQ_STAGE {
+	STAGE_INIT,
+	STAGE_READ_EXISTING,
+	STAGE_READ_DECOMP,
+	STAGE_WRITE_COMP,
+	STAGE_DONE,
+};
+
+struct dm_icomp_req {
+	struct bio *bio;
+	struct dm_icomp_info *info;
+	struct list_head sibling;
+
+	struct list_head all_io;
+	atomic_t io_pending;
+	enum DMCP_REQ_STAGE stage;
+
+	struct dm_icomp_hash_lock *locks[2];
+	int locked_locks;
+	int result;
+
+	int cpu;
+	struct work_struct work;
+};
+
+struct dm_icomp_io_worker {
+	struct list_head pending;
+	spinlock_t lock;
+	struct work_struct work;
+};
+#endif
+
+#endif
-- 
1.7.1
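
A minimal user-space sketch of the payload sizing rule described in the
comment above dm_icomp_io_range_comp(): the compressed length is rounded up
to 512 bytes, the last 4 bytes of the payload hold the actual compressed
length, and another 512 bytes are added when the rounding leaves no room for
that trailer. The names payload_len(), round_up_sector(), SECTOR_SIZE and
TRAILER_SIZE are illustrative assumptions and do not appear in the patch.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u
#define TRAILER_SIZE ((unsigned int)sizeof(uint32_t))

/* Round x up to the next multiple of SECTOR_SIZE. */
static unsigned int round_up_sector(unsigned int x)
{
	return (x + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/*
 * Hypothetical helper mirroring the sizing logic in the patch.
 * Returns the on-disk payload length for a buffer of decomp_len bytes
 * whose compressed form is actual_comp_len bytes long. If compression
 * does not save at least one sector plus the trailer, the data is kept
 * uncompressed and decomp_len is returned.
 */
static unsigned int payload_len(unsigned int decomp_len,
				unsigned int actual_comp_len)
{
	unsigned int len;

	if (decomp_len < actual_comp_len + TRAILER_SIZE + SECTOR_SIZE)
		return decomp_len;	/* stored uncompressed */

	len = round_up_sector(actual_comp_len);
	if (len - actual_comp_len < TRAILER_SIZE)
		len += SECTOR_SIZE;	/* no room for the length trailer */
	return len;
}
```

Under these assumptions, a 4096-byte block that compresses to 1000 bytes
takes a 1024-byte payload, one that compresses to 1022 bytes takes 1536
bytes (the trailer does not fit in the 2 spare bytes), and one that
compresses to 3900 bytes is stored uncompressed as 4096 bytes.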


* [RFC PATCH 02/16] DM: Ability to choose the compressor.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Add the ability to create a block device with a compression algorithm of the
user's choice. Currently the lzo and nx842 compressors are supported.

If the compressor algorithm is not specified, the default set in
/sys/module/dm_inplace_compress/parameters/compress_algorithm is used.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |  129 ++++++++++++++++++++++++++++----------
 drivers/md/dm-inplace-compress.h |    8 ++-
 2 files changed, 103 insertions(+), 34 deletions(-)
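
The table line accepted by the new constructor has the form
<dev> [ writethrough | writeback <delay> ] [ compressor <name> ]. The
following is a minimal user-space sketch of that keyword/value parsing,
not the kernel code itself. The names parse_opts() and struct icomp_opts
are hypothetical. Unlike the patch, this sketch rejects unknown keywords
and checks that a value exists before reading it; both are assumptions
made for illustration.

```c
#include <stdio.h>
#include <string.h>

enum write_mode { WRITE_BACK, WRITE_THROUGH };

struct icomp_opts {
	enum write_mode mode;
	unsigned int writeback_delay;	/* seconds, writeback only */
	const char *compressor;		/* NULL: use the module default */
};

/*
 * Hypothetical parser: argv[0] is the device and is skipped.
 * Returns 0 on success and -1 on an unknown keyword or a keyword
 * that is missing its value.
 */
static int parse_opts(int argc, char **argv, struct icomp_opts *o)
{
	int i;

	o->mode = WRITE_BACK;
	o->writeback_delay = 0;
	o->compressor = NULL;

	for (i = 1; i < argc; i++) {
		if (strcmp(argv[i], "writethrough") == 0) {
			o->mode = WRITE_THROUGH;
		} else if (strcmp(argv[i], "writeback") == 0) {
			/* the value must exist before it is read */
			if (++i >= argc ||
			    sscanf(argv[i], "%u", &o->writeback_delay) != 1)
				return -1;
			o->mode = WRITE_BACK;
		} else if (strcmp(argv[i], "compressor") == 0) {
			if (++i >= argc)
				return -1;
			o->compressor = argv[i];
		} else {
			return -1;
		}
	}
	return 0;
}
```

With these assumptions, "writeback 5 compressor lzo" parses to write-back
mode with a 5 second delay and the lzo compressor, while a trailing
"writeback" with no delay is rejected instead of reading past argv.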

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index c3c3750..70d6c0e 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -20,8 +20,62 @@ static struct dm_icomp_compressor_data compressors[] = {
 		.name = "lzo",
 		.comp_len = lzo_comp_len,
 	},
+	[DMCP_COMP_ALG_842] = {
+		.name = "842",
+		.comp_len = nx842_comp_len,
+	},
+};
+static int default_compressor = -1;
+
+#define DMCP_ALGO_LENGTH 9
+static char dm_icomp_algorithm[DMCP_ALGO_LENGTH] = "lzo";
+static struct kparam_string dm_icomp_compressor_kparam = {
+	.string =	dm_icomp_algorithm,
+	.maxlen =	sizeof(dm_icomp_algorithm),
 };
-static int default_compressor;
+static int dm_icomp_compressor_param_set(const char *,
+		const struct kernel_param *);
+static struct kernel_param_ops dm_icomp_compressor_param_ops = {
+	.set =	dm_icomp_compressor_param_set,
+	.get =	param_get_string,
+};
+module_param_cb(compress_algorithm, &dm_icomp_compressor_param_ops,
+		&dm_icomp_compressor_kparam, 0644);
+
+static int dm_icomp_get_compressor(const char *s)
+{
+	int r, val_len;
+
+	if (crypto_has_comp(s, 0, 0)) {
+		for (r = 0; r < ARRAY_SIZE(compressors); r++) {
+			val_len = strlen(compressors[r].name);
+			if (strncmp(s, compressors[r].name, val_len) == 0)
+				return r;
+		}
+	}
+	return -1;
+}
+
+static int dm_icomp_compressor_param_set(const char *val,
+		const struct kernel_param *kp)
+{
+	int ret;
+	char str[kp->str->maxlen], *s;
+	int val_len = strlen(val)+1;
+
+	strlcpy(str, val, min_t(size_t, val_len, sizeof(str)));
+	s = strim(str);
+	ret = dm_icomp_get_compressor(s);
+	if (ret < 0) {
+		DMWARN("Compressor %s not supported", s);
+		return -1;
+	}
+	DMWARN("compressor  is %s", s);
+	default_compressor = ret;
+	strlcpy(dm_icomp_algorithm, compressors[ret].name,
+		sizeof(dm_icomp_algorithm));
+	return 0;
+}
 
 static struct kmem_cache *dm_icomp_req_cachep;
 static struct kmem_cache *dm_icomp_io_range_cachep;
@@ -417,7 +471,7 @@ static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
 			ret = -EINVAL;
 			goto out;
 		}
-		if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
+		if (!crypto_has_comp(compressors[info->comp_alg].name, 0, 0)) {
 			info->ti->error =
 					"Compressor algorithm doesn't support";
 			ret = -EINVAL;
@@ -436,7 +490,6 @@ static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
 		new_super = true;
 	}
 
-	info->comp_alg = super->comp_alg;
 	if (dm_icomp_alloc_compressor(info)) {
 		ret = -ENOMEM;
 		goto out;
@@ -467,50 +520,56 @@ out:
 }
 
 /*
- * <dev> <writethough>/<writeback> <meta_commit_delay>
+ * <dev> [ <writethrough>/<writeback> <meta_commit_delay> ]
+ *	 [ <compressor> <type> ]
  */
 static int dm_icomp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
 	struct dm_icomp_info *info;
-	char write_mode[15];
+	char mode[15];
+	int par = 0;
 	int ret, i;
 
-	if (argc < 2) {
-		ti->error = "Invalid argument count";
-		return -EINVAL;
-	}
-
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
 		ti->error = "dm-inplace-compress: Cannot allocate context";
 		return -ENOMEM;
 	}
 	info->ti = ti;
-
-	if (sscanf(argv[1], "%s", write_mode) != 1) {
-		ti->error = "Invalid argument";
-		ret = -EINVAL;
-		goto err_para;
-	}
-
-	if (strcmp(write_mode, "writeback") == 0) {
-		if (argc != 3) {
+	info->comp_alg = default_compressor;
+	while (++par < argc) {
+		if (sscanf(argv[par], "%s", mode) != 1) {
 			ti->error = "Invalid argument";
 			ret = -EINVAL;
 			goto err_para;
 		}
-		info->write_mode = DMCP_WRITE_BACK;
-		if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
-			ti->error = "Invalid argument";
-			ret = -EINVAL;
-			goto err_para;
+
+		if (strcmp(mode, "writeback") == 0) {
+			info->write_mode = DMCP_WRITE_BACK;
+			if (++par >= argc || sscanf(argv[par], "%u",
+				 &info->writeback_delay) != 1) {
+				ti->error = "Invalid argument";
+				ret = -EINVAL;
+				goto err_para;
+			}
+		} else if (strcmp(mode, "writethrough") == 0) {
+			info->write_mode = DMCP_WRITE_THROUGH;
+		} else if (strcmp(mode, "compressor") == 0) {
+			if (++par >= argc || sscanf(argv[par], "%s", mode) != 1) {
+				ti->error = "Invalid argument";
+				ret = -EINVAL;
+				goto err_para;
+			}
+			ret = dm_icomp_get_compressor(mode);
+			if (ret >= 0) {
+				DMWARN("compressor  is %s", mode);
+				info->comp_alg = ret;
+			} else {
+				ti->error = "Unsupported compressor";
+				ret = -EINVAL;
+				goto err_para;
+			}
 		}
-	} else if (strcmp(write_mode, "writethrough") == 0) {
-		info->write_mode = DMCP_WRITE_THROUGH;
-	} else {
-		ti->error = "Invalid argument";
-		ret = -EINVAL;
-		goto err_para;
 	}
 
 	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
@@ -1407,16 +1466,20 @@ static struct target_type dm_icomp_target = {
 static int __init dm_icomp_init(void)
 {
 	int r;
+	int arr_size = ARRAY_SIZE(compressors);
 
-	for (r = 0; r < ARRAY_SIZE(compressors); r++)
+	for (r = 0; r < arr_size; r++)
 		if (crypto_has_comp(compressors[r].name, 0, 0))
 			break;
-	if (r >= ARRAY_SIZE(compressors)) {
+	if (r >= arr_size) {
 		DMWARN("No crypto compressors are supported");
 		return -EINVAL;
 	}
-
 	default_compressor = r;
+	strlcpy(dm_icomp_algorithm, compressors[r].name,
+			sizeof(dm_icomp_algorithm));
+	DMWARN(" %s crypto compressor used ",
+			compressors[default_compressor].name);
 
 	r = -ENOMEM;
 	dm_icomp_req_cachep = kmem_cache_create("dm_icomp_requests",
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
index e07b9b7..b61ff0d 100644
--- a/drivers/md/dm-inplace-compress.h
+++ b/drivers/md/dm-inplace-compress.h
@@ -10,7 +10,8 @@ struct dm_icomp_super_block {
 	u8 comp_alg;
 } __packed;
 
-#define DMCP_COMP_ALG_LZO 0
+#define DMCP_COMP_ALG_LZO 1
+#define DMCP_COMP_ALG_842 0
 
 #ifdef __KERNEL__
 struct dm_icomp_compressor_data {
@@ -23,6 +24,11 @@ static inline int lzo_comp_len(int comp_len)
 	return lzo1x_worst_compress(comp_len);
 }
 
+static inline int nx842_comp_len(int comp_len)
+{
+	return comp_len;
+}
+
 /*
  * Minium logical sector size of this target is 4096 byte, which is a block.
  * Data of a block is compressed. Compressed data is round up to 512B, which is
-- 
1.7.1



* [RFC PATCH 03/16] DM: Error if enough space is not available.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

If there is not enough space to create a block device of the specified size,
error out.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)
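
A minimal user-space sketch of the capacity check this patch adds: the
number of data blocks, converted to 512-byte sectors, must cover the
requested target length. The name icomp_has_room() and the constant
BLOCK_SECTOR_SHIFT are hypothetical stand-ins for the comparison against
info->ti->len and DMCP_BLOCK_SECTOR_SHIFT in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SECTOR_SHIFT 3	/* 4096-byte block = 8 sectors of 512 bytes */

/*
 * Hypothetical helper: true when data_blocks blocks of backing storage
 * can hold a target of target_sectors 512-byte sectors.
 */
static bool icomp_has_room(uint64_t data_blocks, uint64_t target_sectors)
{
	return (data_blocks << BLOCK_SECTOR_SHIFT) >= target_sectors;
}
```

For example, 1000 data blocks provide 8000 sectors, so a target of 8000
sectors fits and a target of 8001 sectors does not.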

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index 70d6c0e..17221a1 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -453,6 +453,12 @@ static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
 	info->data_blocks = data_blocks;
 	info->data_start = (1 + meta_blocks) << DMCP_BLOCK_SECTOR_SHIFT;
 
+	if ((data_blocks << DMCP_BLOCK_SECTOR_SHIFT) < info->ti->len) {
+		info->ti->error =
+			"Insufficient sectors to satisfy requested size";
+		return -ENOMEM;
+	}
+
 	addr = kzalloc(DMCP_BLOCK_SIZE, GFP_KERNEL);
 	if (!addr) {
 		info->ti->error = "Cannot allocate super";
-- 
1.7.1


* [RFC PATCH 04/16] DM: Ensure that the read request is within the device range.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, snitzer, agk, corbet
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

If a read request is not within the device range, return an error.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)
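
A minimal user-space sketch of the range test added to
dm_icomp_read_one_extent(), written to mirror the condition in the patch
as posted, including its >= comparison and the truncating shift. The name
extent_out_of_range() and the constant BLOCK_SECTOR_SHIFT are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SECTOR_SHIFT 3	/* 8 sectors of 512 bytes per 4096-byte block */

/*
 * Hypothetical helper: true when an extent starting at block and
 * occupying data_sectors sectors on disk is treated as lying outside a
 * device of data_blocks blocks. Sector counts below one full block
 * contribute zero blocks because the shift truncates.
 */
static bool extent_out_of_range(uint64_t block, uint16_t data_sectors,
				uint64_t data_blocks)
{
	return block + (data_sectors >> BLOCK_SECTOR_SHIFT) >= data_blocks;
}
```

On a 100-block device, an 8-sector extent at block 98 is accepted and the
same extent at block 99 is rejected, while a 4-sector extent at block 99
is accepted because 4 >> 3 is 0.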

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index 17221a1..bf18028 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -1025,6 +1025,12 @@ static void dm_icomp_read_one_extent(struct dm_icomp_req *req, u64 block,
 {
 	struct dm_icomp_io_range *io;
 
+	if (block+(data_sectors>>DMCP_BLOCK_SECTOR_SHIFT) >=
+			req->info->data_blocks) {
+		req->result = -EIO;
+		return;
+	}
+
 	io = dm_icomp_create_io_range(req, data_sectors << 9,
 		logical_sectors << 9);
 	if (!io) {
@@ -1063,7 +1069,9 @@ again:
 
 	block_index = first_block_index + (logical_sectors >>
 				DMCP_BLOCK_SECTOR_SHIFT);
-	if ((block_index << DMCP_BLOCK_SECTOR_SHIFT) < bio_end_sector(req->bio))
+	if (((block_index << DMCP_BLOCK_SECTOR_SHIFT) <
+			 bio_end_sector(req->bio)) &&
+			((block_index) < req->info->data_blocks))
 		goto again;
 
 	/* A shortcut if all data is in already */
-- 
1.7.1


* [RFC PATCH 05/16] DM: allocation/free helper routines.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, snitzer, agk, corbet
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Helper functions to allocate/reallocate and free memory.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index bf18028..c11567c 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -774,6 +774,23 @@ static void dm_icomp_get_req(struct dm_icomp_req *req)
 	atomic_inc(&req->io_pending);
 }
 
+static void *dm_icomp_kmalloc(size_t size, gfp_t flags)
+{
+	return  kmalloc(size, flags);
+}
+
+static void *dm_icomp_krealloc(void *addr, size_t size,
+		 size_t orig_size, gfp_t flags)
+{
+	return krealloc(addr, size, flags);
+}
+
+static void dm_icomp_kfree(void *addr, unsigned int size)
+{
+	kfree(addr);
+}
+
+
 static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
 {
 	kfree(io->decomp_data);
-- 
1.7.1

* [RFC PATCH 06/16] DM: separate out compression and decompression routines.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Simplify the code by separating out the compression and decompression routines.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |  127 +++++++++++++++++++++-----------------
 1 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index c11567c..5c39169 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -793,8 +793,8 @@ static void dm_icomp_kfree(void *addr, unsigned int size)
 
 static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
 {
-	kfree(io->decomp_data);
-	kfree(io->comp_data);
+	dm_icomp_kfree(io->decomp_data, io->decomp_len);
+	dm_icomp_kfree(io->comp_data, io->comp_len);
 	kmem_cache_free(dm_icomp_io_range_cachep, io);
 }
 
@@ -861,12 +861,12 @@ static struct dm_icomp_io_range *dm_icomp_create_io_range(
 	if (!io)
 		return NULL;
 
-	io->comp_data = kmalloc(dm_icomp_compressor_len(req->info, comp_len),
-								GFP_NOIO);
-	io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
+	io->comp_data = dm_icomp_kmalloc(
+		dm_icomp_compressor_len(req->info, comp_len), GFP_NOIO);
+	io->decomp_data = dm_icomp_kmalloc(decomp_len, GFP_NOIO);
 	if (!io->decomp_data || !io->comp_data) {
-		kfree(io->decomp_data);
-		kfree(io->comp_data);
+		dm_icomp_kfree(io->decomp_data, io->decomp_len);
+		dm_icomp_kfree(io->comp_data, io->comp_len);
 		kmem_cache_free(dm_icomp_io_range_cachep, io);
 		return NULL;
 	}
@@ -928,51 +928,66 @@ static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
  * We store the actual compressed len in the last u32 of the payload.
  * If there is no free space, we add 512 to the payload size.
  */
-static int dm_icomp_io_range_comp(struct dm_icomp_info *info, void *comp_data,
-	unsigned int *comp_len, void *decomp_data, unsigned int decomp_len,
-	bool do_comp)
+static int dm_icomp_io_range_compress(struct dm_icomp_info *info,
+		struct dm_icomp_io_range *io, unsigned int *comp_len,
+		void *decomp_data, unsigned int decomp_len)
 {
-	struct crypto_comp *tfm;
+	unsigned int actual_comp_len = io->comp_len;
 	u32 *addr;
-	unsigned int actual_comp_len;
+	struct crypto_comp *tfm =  info->tfm[get_cpu()];
 	int ret;
 
-	if (do_comp) {
-		actual_comp_len = *comp_len;
-
-		tfm = info->tfm[get_cpu()];
-		ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
-			comp_data, &actual_comp_len);
-		put_cpu();
+	ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
+		io->comp_data, &actual_comp_len);
 
-		atomic64_add(decomp_len, &info->uncompressed_write_size);
-		if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
-			*comp_len = decomp_len;
-			atomic64_add(*comp_len, &info->compressed_write_size);
-			return 1;
-		}
+	put_cpu();
+	if (ret < 0)
+		DMWARN("CO Error %d ", ret);
 
-		*comp_len = round_up(actual_comp_len, 512);
-		if (*comp_len - actual_comp_len < sizeof(u32))
-			*comp_len += 512;
+	atomic64_add(decomp_len, &info->uncompressed_write_size);
+	if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
+		*comp_len = decomp_len;
 		atomic64_add(*comp_len, &info->compressed_write_size);
-		addr = comp_data + *comp_len;
-		addr--;
-		*addr = cpu_to_le32(actual_comp_len);
-	} else {
-		if (*comp_len == decomp_len)
-			return 1;
-		addr = comp_data + *comp_len;
-		addr--;
-		actual_comp_len = le32_to_cpu(*addr);
-
-		tfm = info->tfm[get_cpu()];
-		ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
-			decomp_data, &decomp_len);
-		put_cpu();
-		if (ret)
-			return -EINVAL;
+		return 1;
 	}
+
+	*comp_len = round_up(actual_comp_len, 512);
+	if (*comp_len - actual_comp_len < sizeof(u32))
+		*comp_len += 512;
+	atomic64_add(*comp_len, &info->compressed_write_size);
+	addr = io->comp_data + *comp_len;
+	addr--;
+	*addr = cpu_to_le32(actual_comp_len);
+	return 0;
+}
+
+/*
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ */
+static int dm_icomp_io_range_decompress(struct dm_icomp_info *info,
+	void *comp_data, unsigned int comp_len, void *decomp_data,
+	unsigned int decomp_len)
+{
+	struct crypto_comp *tfm;
+	u32 *addr;
+	int ret;
+
+	if (comp_len == decomp_len)
+		return 1;
+
+	addr = comp_data + comp_len;
+	addr--;
+	comp_len = le32_to_cpu(*addr);
+
+	tfm = info->tfm[get_cpu()];
+	ret = crypto_comp_decompress(tfm, comp_data, comp_len,
+		decomp_data, &decomp_len);
+	put_cpu();
+	if (ret)
+		return -EINVAL;
 	return 0;
 }
 
@@ -993,8 +1008,8 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 		io->io_region.sector -= req->info->data_start;
 
 		/* Do decomp here */
-		ret = dm_icomp_io_range_comp(req->info, io->comp_data,
-			&io->comp_len, io->decomp_data, io->decomp_len, false);
+		ret = dm_icomp_io_range_decompress(req->info, io->comp_data,
+			io->comp_len, io->decomp_data, io->decomp_len);
 		if (ret < 0) {
 			req->result = -EIO;
 			return;
@@ -1140,8 +1155,8 @@ static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
 	io->io_region.sector -= req->info->data_start;
 
 	/* decompress original data */
-	ret = dm_icomp_io_range_comp(req->info, io->comp_data, &io->comp_len,
-			io->decomp_data, io->decomp_len, false);
+	ret = dm_icomp_io_range_decompress(req->info, io->comp_data,
+		io->comp_len, io->decomp_data, io->decomp_len);
 	if (ret < 0) {
 		req->result = -EINVAL;
 		return -EIO;
@@ -1164,11 +1179,11 @@ static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
 			   ((req->bio->bi_iter.bi_sector - start) << 9),
 			   bio_sectors(req->bio) << 9, true);
 
-			kfree(io->comp_data);
+			dm_icomp_kfree(io->comp_data, io->comp_len);
 			/* New compressed len might be bigger */
-			io->comp_data = kmalloc(
-				dm_icomp_compressor_len(req->info,
-					io->decomp_len), GFP_NOIO);
+			io->comp_data = dm_icomp_kmalloc(
+				dm_icomp_compressor_len(
+				req->info, io->decomp_len), GFP_NOIO);
 			io->comp_len = io->decomp_len;
 			if (!io->comp_data) {
 				req->result = -ENOMEM;
@@ -1197,8 +1212,8 @@ static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
 
 	/* assume compress less data uses less space (at least 4k lsess data) */
 	comp_len = io->comp_len;
-	ret = dm_icomp_io_range_comp(req->info, io->comp_data, &comp_len,
-		io->decomp_data + (offset << 9), count << 9, true);
+	ret = dm_icomp_io_range_compress(req->info, io, &comp_len,
+		io->decomp_data + (offset << 9), count << 9);
 	if (ret < 0) {
 		req->result = -EIO;
 		return -EIO;
@@ -1260,8 +1275,8 @@ static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
 
 	/* compress data */
 	comp_len = io->comp_len;
-	ret = dm_icomp_io_range_comp(req->info, io->comp_data, &comp_len,
-		io->decomp_data, count << 9, true);
+	ret = dm_icomp_io_range_compress(req->info, io, &comp_len,
+			io->decomp_data, count << 9);
 	if (ret < 0) {
 		dm_icomp_free_io_range(io);
 		req->result = -EIO;
-- 
1.7.1

* [RFC PATCH 07/16] DM: Optimize memory allocated to hold compressed buffer.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, snitzer, agk, corbet
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

On average, the compressed size is less than 50% of the original buffer.  Use
this knowledge to reduce the amount of space allocated to hold the compressed
buffer. If the allocated size turns out to be insufficient, then reallocate
the required size.
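
A rough userspace sketch of this "allocate small, grow on demand" idea follows.
It is not the driver code: `struct comp_buf`, `worst_case_len()` and the two
helpers are hypothetical names, and the worst-case formula is only a stand-in
for whatever bound the compressor reports.

```c
#include <stddef.h>
#include <stdlib.h>

struct comp_buf {
	void *data;
	size_t len;
};

/* Hypothetical worst-case bound for a compressor's output size. */
static size_t worst_case_len(size_t decomp_len)
{
	return decomp_len + decomp_len / 16 + 64 + 3;
}

/* Optimistic first allocation: half of the worst case. */
static int comp_buf_init(struct comp_buf *b, size_t decomp_len)
{
	b->len = worst_case_len(decomp_len) / 2;
	b->data = malloc(b->len);
	if (!b->data) {
		b->len = 0;
		return -1;
	}
	return 0;
}

/*
 * Grow to the worst-case size after a failed compression attempt.
 * Returns 0 on success, -1 if the buffer is already at the maximum
 * or if the reallocation fails.
 */
static int comp_buf_grow_to_max(struct comp_buf *b, size_t decomp_len)
{
	size_t maxlen = worst_case_len(decomp_len);
	void *p;

	if (maxlen <= b->len)
		return -1;
	p = realloc(b->data, maxlen);
	if (!p)
		return -1;	/* old buffer is still valid and still owned by b */
	b->data = p;
	b->len = maxlen;
	return 0;
}
```

The caller would compress into the small buffer first and retry once after
comp_buf_grow_to_max() succeeds. The sketch keeps the old pointer when the
reallocation fails, so the caller can still free it.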

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   39 ++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-inplace-compress.h |   11 ++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index 5c39169..fe4a4c1 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -19,10 +19,12 @@ static struct dm_icomp_compressor_data compressors[] = {
 	[DMCP_COMP_ALG_LZO] = {
 		.name = "lzo",
 		.comp_len = lzo_comp_len,
+		.max_comp_len = lzo_max_comp_len,
 	},
 	[DMCP_COMP_ALG_842] = {
 		.name = "842",
 		.comp_len = nx842_comp_len,
+		.max_comp_len = nx842_max_comp_len,
 	},
 };
 static int default_compressor = -1;
@@ -848,6 +850,14 @@ static inline int dm_icomp_compressor_len(struct dm_icomp_info *info, int len)
 	return len;
 }
 
+static inline int dm_icomp_compressor_maxlen(struct dm_icomp_info *info,
+		int len)
+{
+	if (compressors[info->comp_alg].max_comp_len)
+		return compressors[info->comp_alg].max_comp_len(len);
+	return len;
+}
+
 /*
  * caller should set region.sector, region.count. bi_rw. IO always to/from
  * comp_data
@@ -919,6 +929,25 @@ static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
 	}
 }
 
+static int dm_icomp_mod_to_max_io_range(struct dm_icomp_info *info,
+			 struct dm_icomp_io_range *io)
+{
+	unsigned int maxlen = dm_icomp_compressor_maxlen(info, io->decomp_len);
+
+	if (maxlen <= io->comp_len)
+		return -ENOSPC;
+	io->io_req.mem.ptr.addr = io->comp_data =
+		dm_icomp_krealloc(io->comp_data, maxlen,
+			io->comp_len, GFP_NOIO);
+	if (!io->comp_data) {
+		DMWARN("UNFORTUNE allocation failure ");
+		io->comp_len = 0;
+		return -ENOSPC;
+	}
+	io->comp_len = maxlen;
+	return 0;
+}
+
 /*
  * return value:
  * < 0 : error
@@ -940,7 +969,17 @@ static int dm_icomp_io_range_compress(struct dm_icomp_info *info,
 	ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
 		io->comp_data, &actual_comp_len);
 
+	if (ret || actual_comp_len > io->comp_len) {
+		ret = dm_icomp_mod_to_max_io_range(info, io);
+		if (!ret) {
+			actual_comp_len = io->comp_len;
+			ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
+				io->comp_data, &actual_comp_len);
+		}
+	}
+
 	put_cpu();
+
 	if (ret < 0)
 		DMWARN("CO Error %d ", ret);
 
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
index b61ff0d..86c0ce6 100644
--- a/drivers/md/dm-inplace-compress.h
+++ b/drivers/md/dm-inplace-compress.h
@@ -17,15 +17,26 @@ struct dm_icomp_super_block {
 struct dm_icomp_compressor_data {
 	char *name;
 	int (*comp_len)(int comp_len);
+	int (*max_comp_len)(int comp_len);
 };
 
 static inline int lzo_comp_len(int comp_len)
 {
+	return lzo1x_worst_compress(comp_len) >> 1;
+}
+
+static inline int lzo_max_comp_len(int comp_len)
+{
 	return lzo1x_worst_compress(comp_len);
 }
 
 static inline int nx842_comp_len(int comp_len)
 {
+	return (comp_len>>4)*7; /* less than half: 7/16 */
+}
+
+static inline int nx842_max_comp_len(int comp_len)
+{
 	return comp_len;
 }
 
-- 
1.7.1

* [RFC PATCH 08/16] DM: Tag a magicmarker at the end of each compressed segment.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, snitzer, agk, corbet
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

We store the size of the compressed segment on the sector boundary, and later
use that location to determine the size of the compressed segment. However, if
that location is corrupted for any reason, we wouldn't know. Hence add a
magic marker to catch such corruption.
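
The resulting trailer layout can be sketched in userspace C as below. Only the
magic value is taken from the patch; the helper names are made up for this
example, and the buffer is assumed to be at least 8 bytes long.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR		512u
#define COMPRESS_MAGIC	0xfaceecafu	/* value used by the patch */

static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Round up to a sector; add one more if the 8-byte trailer does not fit. */
static size_t payload_len(size_t actual_len)
{
	size_t len = (actual_len + SECTOR - 1) / SECTOR * SECTOR;

	if (len - actual_len < 2 * sizeof(uint32_t))
		len += SECTOR;
	return len;
}

/* Last u32 holds the compressed length, the u32 before it the magic. */
static void write_trailer(uint8_t *buf, size_t len, uint32_t actual_len)
{
	put_le32(buf + len - 4, actual_len);
	put_le32(buf + len - 8, COMPRESS_MAGIC);
}

/* Returns 0 and fills *actual_len, or -1 if the magic does not match. */
static int read_trailer(const uint8_t *buf, size_t len, uint32_t *actual_len)
{
	if (get_le32(buf + len - 8) != COMPRESS_MAGIC)
		return -1;
	*actual_len = get_le32(buf + len - 4);
	return 0;
}
```

The sketch reports a missing magic as an error. The patch instead zero-fills
the decompressed buffer and returns success in that case.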

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   24 ++++++++++++++++--------
 drivers/md/dm-inplace-compress.h |    1 +
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index fe4a4c1..04decdd 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -984,19 +984,21 @@ static int dm_icomp_io_range_compress(struct dm_icomp_info *info,
 		DMWARN("CO Error %d ", ret);
 
 	atomic64_add(decomp_len, &info->uncompressed_write_size);
-	if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
+	if (ret || decomp_len < actual_comp_len + 2*sizeof(u32) + 512) {
 		*comp_len = decomp_len;
 		atomic64_add(*comp_len, &info->compressed_write_size);
 		return 1;
 	}
 
 	*comp_len = round_up(actual_comp_len, 512);
-	if (*comp_len - actual_comp_len < sizeof(u32))
+	if (*comp_len - actual_comp_len < 2*sizeof(u32))
 		*comp_len += 512;
 	atomic64_add(*comp_len, &info->compressed_write_size);
 	addr = io->comp_data + *comp_len;
 	addr--;
 	*addr = cpu_to_le32(actual_comp_len);
+	addr--;
+	*addr = cpu_to_le32(DMCP_COMPRESS_MAGIC);
 	return 0;
 }
 
@@ -1020,13 +1022,19 @@ static int dm_icomp_io_range_decompress(struct dm_icomp_info *info,
 	addr = comp_data + comp_len;
 	addr--;
 	comp_len = le32_to_cpu(*addr);
+	addr--;
 
-	tfm = info->tfm[get_cpu()];
-	ret = crypto_comp_decompress(tfm, comp_data, comp_len,
-		decomp_data, &decomp_len);
-	put_cpu();
-	if (ret)
-		return -EINVAL;
+	if (comp_len == decomp_len)
+		return 1;
+	if (le32_to_cpu(*addr) == DMCP_COMPRESS_MAGIC) {
+		tfm = info->tfm[get_cpu()];
+		ret = crypto_comp_decompress(tfm, comp_data, comp_len,
+			decomp_data, &decomp_len);
+		put_cpu();
+		if (ret)
+			return -EINVAL;
+	} else
+		memset(decomp_data, 0, decomp_len);
 	return 0;
 }
 
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
index 86c0ce6..1ce7a6e 100644
--- a/drivers/md/dm-inplace-compress.h
+++ b/drivers/md/dm-inplace-compress.h
@@ -3,6 +3,7 @@
 #include <linux/types.h>
 
 #define DMCP_SUPER_MAGIC 0x106526c206506c09
+#define DMCP_COMPRESS_MAGIC 0xfaceecaf
 struct dm_icomp_super_block {
 	__le64 magic;
 	__le64 meta_blocks;
-- 
1.7.1

* [RFC PATCH 09/16] DM: Delay allocation of decompression buffer during read.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

The read path allocates two temporary buffers, one to hold the compressed data
and one to hold the decompressed data. The buffer for the decompressed data is
not needed until the compressed data has been read from the device and is about
to be decompressed.

Hence delay the allocation of the decompression buffer until it is really needed.
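
As a minimal userspace sketch of the deferred allocation (the struct and helper
names are illustrative, with malloc() standing in for the kernel allocator):

```c
#include <stddef.h>
#include <stdlib.h>

struct io_range {
	void *decomp_data;
	size_t decomp_req_len;	/* length requested at creation time */
	size_t decomp_len;	/* length actually allocated, 0 until needed */
};

/* Record the requested length but do not allocate anything yet. */
static void io_range_init(struct io_range *io, size_t decomp_len)
{
	io->decomp_data = NULL;
	io->decomp_req_len = decomp_len;
	io->decomp_len = 0;
}

/* Allocate on first use. Returns 0 on success, 1 on allocation failure. */
static int io_range_ensure_decomp(struct io_range *io)
{
	if (io->decomp_len)
		return 0;	/* already allocated */

	io->decomp_data = malloc(io->decomp_req_len);
	if (!io->decomp_data)
		return 1;
	io->decomp_len = io->decomp_req_len;
	return 0;
}
```

The read completion path would call io_range_ensure_decomp() just before
decompressing, so no decompression buffer is held while the read is in flight.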

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   76 +++++++++++++++++++++++++++++++++-----
 drivers/md/dm-inplace-compress.h |    3 +-
 2 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index 04decdd..c56d9b7 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -863,7 +863,7 @@ static inline int dm_icomp_compressor_maxlen(struct dm_icomp_info *info,
  * comp_data
  */
 static struct dm_icomp_io_range *dm_icomp_create_io_range(
-	struct dm_icomp_req *req, int comp_len, int decomp_len)
+		struct dm_icomp_req *req, int comp_len)
 {
 	struct dm_icomp_io_range *io;
 
@@ -871,12 +871,8 @@ static struct dm_icomp_io_range *dm_icomp_create_io_range(
 	if (!io)
 		return NULL;
 
-	io->comp_data = dm_icomp_kmalloc(
-		dm_icomp_compressor_len(req->info, comp_len), GFP_NOIO);
-	io->decomp_data = dm_icomp_kmalloc(decomp_len, GFP_NOIO);
-	if (!io->decomp_data || !io->comp_data) {
-		dm_icomp_kfree(io->decomp_data, io->decomp_len);
-		dm_icomp_kfree(io->comp_data, io->comp_len);
+	io->comp_data = dm_icomp_kmalloc(comp_len, GFP_NOIO);
+	if (!io->comp_data) {
 		kmem_cache_free(dm_icomp_io_range_cachep, io);
 		return NULL;
 	}
@@ -890,12 +886,42 @@ static struct dm_icomp_io_range *dm_icomp_create_io_range(
 
 	io->io_region.bdev = req->info->dev->bdev;
 
-	io->decomp_len = decomp_len;
 	io->comp_len = comp_len;
 	io->req = req;
+
+	io->decomp_data = NULL;
+	io->decomp_len = 0;
+	io->decomp_req_len = 0;
 	return io;
 }
 
+static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
+		struct dm_icomp_req *req, int comp_len, int decomp_len)
+{
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req, comp_len);
+
+	if (io) {
+		/* note down the requested length for decompress buffer.
+		 * but dont allocate it yet.
+		 */
+		io->decomp_req_len = decomp_len;
+	}
+	return io;
+}
+
+static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io)
+{
+	if (io->decomp_len)
+		return 0;
+
+	io->decomp_data = dm_icomp_kmalloc(io->decomp_req_len, GFP_NOIO);
+	if (!io->decomp_data)
+		return 1;
+	io->decomp_len = io->decomp_req_len;
+
+	return 0;
+}
+
 static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
 		ssize_t len, bool to_buf)
 {
@@ -948,6 +974,31 @@ static int dm_icomp_mod_to_max_io_range(struct dm_icomp_info *info,
 	return 0;
 }
 
+static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
+		struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	sector_t size  = bio_sectors(req->bio)<<9;
+	int comp_len = dm_icomp_compressor_len(req->info, size);
+	void *addr;
+
+	addr  = dm_icomp_kmalloc(size, GFP_NOIO);
+	if (!addr)
+		return NULL;
+
+	io = dm_icomp_create_io_range(req, comp_len);
+	if (!io) {
+		dm_icomp_kfree(addr, size);
+		return NULL;
+	}
+
+	io->decomp_data = addr;
+	io->decomp_len = size;
+
+	dm_icomp_bio_copy(req->bio, 0, io->decomp_data, size, true);
+	return io;
+}
+
 /*
  * return value:
  * < 0 : error
@@ -1054,6 +1105,11 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 
 		io->io_region.sector -= req->info->data_start;
 
+		if (dm_icomp_update_io_read_range(io)) {
+			req->result = -EIO;
+			return;
+		}
+
 		/* Do decomp here */
 		ret = dm_icomp_io_range_decompress(req->info, io->comp_data,
 			io->comp_len, io->decomp_data, io->decomp_len);
@@ -1110,7 +1166,7 @@ static void dm_icomp_read_one_extent(struct dm_icomp_req *req, u64 block,
 		return;
 	}
 
-	io = dm_icomp_create_io_range(req, data_sectors << 9,
+	io = dm_icomp_create_io_read_range(req, data_sectors << 9,
 		logical_sectors << 9);
 	if (!io) {
 		req->result = -EIO;
@@ -1313,7 +1369,7 @@ static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
 		goto update_meta;
 
 	count = bio_sectors(req->bio);
-	io = dm_icomp_create_io_range(req, count << 9, count << 9);
+	io = dm_icomp_create_io_write_range(req);
 	if (!io) {
 		req->result = -EIO;
 		return;
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
index 1ce7a6e..d9cf05a 100644
--- a/drivers/md/dm-inplace-compress.h
+++ b/drivers/md/dm-inplace-compress.h
@@ -116,7 +116,8 @@ struct dm_icomp_io_range {
 	struct dm_io_request io_req;
 	struct dm_io_region io_region;
 	void *decomp_data;
-	unsigned int decomp_len;
+	unsigned int decomp_req_len;/* originally requested length */
+	unsigned int decomp_len; /* actual allocated/mapped length */
 	void *comp_data;
 	unsigned int comp_len; /* For write, this is estimated */
 	struct list_head next;
-- 
1.7.1

* [RFC PATCH 10/16] DM: Try to use the bio buffer for decompression instead of allocating one.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, snitzer, agk, corbet
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

The read path allocates a temporary buffer to hold decompressed data, which is
then copied into the caller's bio buffer.

Instead of allocating a temporary buffer to hold the decompressed data,
decompress the data directly into the caller's bio buffer. This can be done
only if the destination in the bio buffer is contiguous within a single bio
segment.
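
The contiguity condition can be illustrated with a simplified userspace model,
where a bio is just an array of segments. `struct segment` and
contiguous_dest() are hypothetical names, and the sketch only considers the
segment that contains the destination offset:

```c
#include <stddef.h>

struct segment {
	unsigned char *base;
	size_t len;
};

/*
 * Return a pointer into the segment that contains byte offset `off`
 * if `need` bytes fit there without crossing into the next segment.
 * Return NULL otherwise, in which case the caller falls back to a
 * temporary buffer and a copy.
 */
static unsigned char *contiguous_dest(const struct segment *segs,
				      size_t nsegs, size_t off, size_t need)
{
	size_t i;

	for (i = 0; i < nsegs; i++) {
		if (off >= segs[i].len) {
			off -= segs[i].len;	/* offset lies in a later segment */
			continue;
		}
		if (segs[i].len - off >= need)
			return segs[i].base + off;
		return NULL;	/* destination would straddle a boundary */
	}
	return NULL;		/* offset is past the end of the buffer */
}
```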

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |  143 ++++++++++++++++++++++++++-----------
 drivers/md/dm-inplace-compress.h |    2 +
 2 files changed, 102 insertions(+), 43 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index c56d9b7..f6b95e3 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -792,11 +792,34 @@ static void dm_icomp_kfree(void *addr, unsigned int size)
 	kfree(addr);
 }
 
+static void dm_icomp_release_decomp_buffer(struct dm_icomp_io_range *io)
+{
+	if (!io->decomp_data)
+		return;
 
-static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
+	if (io->decomp_kmap)
+		kunmap(io->decomp_real_data);
+	else
+		dm_icomp_kfree(io->decomp_real_data, io->decomp_req_len);
+	io->decomp_data = io->decomp_real_data = NULL;
+	io->decomp_len  = 0;
+	io->decomp_kmap = false;
+}
+
+static void dm_icomp_release_comp_buffer(struct dm_icomp_io_range *io)
 {
-	dm_icomp_kfree(io->decomp_data, io->decomp_len);
+	if (!io->comp_data)
+		return;
 	dm_icomp_kfree(io->comp_data, io->comp_len);
+
+	io->comp_data = NULL;
+	io->comp_len = 0;
+}
+
+static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
+{
+	dm_icomp_release_decomp_buffer(io);
+	dm_icomp_release_comp_buffer(io);
 	kmem_cache_free(dm_icomp_io_range_cachep, io);
 }
 
@@ -890,7 +913,9 @@ static struct dm_icomp_io_range *dm_icomp_create_io_range(
 	io->req = req;
 
 	io->decomp_data = NULL;
+	io->decomp_real_data = NULL;
 	io->decomp_len = 0;
+	io->decomp_kmap = false;
 	io->decomp_req_len = 0;
 	return io;
 }
@@ -909,15 +934,43 @@ static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
 	return io;
 }
 
-static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io)
+static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io,
+		struct bio *bio, ssize_t bio_off)
 {
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	bool just_use = false;
+
 	if (io->decomp_len)
 		return 0;
 
+	/* use a bio buffer long enough to hold the uncompressed data */
+	bio_for_each_segment(bv, bio, iter) {
+		int avail_len;
+		int length = bv.bv_len;
+
+		if (!just_use && bio_off >= length) {
+			bio_off -= length;
+			continue;
+		}
+		avail_len = just_use ? length : length-bio_off;
+		if (avail_len >= io->decomp_req_len) {
+			io->decomp_real_data = kmap(bv.bv_page);
+			io->decomp_data = io->decomp_real_data + bio_off;
+			io->decomp_len = io->decomp_req_len = avail_len;
+			io->decomp_kmap = true;
+			return 0;
+		}
+		just_use = true;
+	}
+
+	/* none available. :( Allocate one */
 	io->decomp_data = dm_icomp_kmalloc(io->decomp_req_len, GFP_NOIO);
 	if (!io->decomp_data)
 		return 1;
+	io->decomp_real_data = io->decomp_data;
 	io->decomp_len = io->decomp_req_len;
+	io->decomp_kmap = false;
 
 	return 0;
 }
@@ -978,24 +1031,34 @@ static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
 		struct dm_icomp_req *req)
 {
 	struct dm_icomp_io_range *io;
+	struct bio *bio = req->bio;
 	sector_t size  = bio_sectors(req->bio)<<9;
+	int segments = bio_segments(bio);
 	int comp_len = dm_icomp_compressor_len(req->info, size);
 	void *addr;
 
-	addr  = dm_icomp_kmalloc(size, GFP_NOIO);
+	if (segments == 1) {
+		struct bio_vec bv = bio_iovec(bio);
+
+		addr = kmap(bv.bv_page);
+	} else
+		addr  = dm_icomp_kmalloc(size, GFP_NOIO);
+
 	if (!addr)
 		return NULL;
 
 	io = dm_icomp_create_io_range(req, comp_len);
 	if (!io) {
-		dm_icomp_kfree(addr, size);
+		(segments == 1) ?  kunmap(addr) : dm_icomp_kfree(addr, size);
 		return NULL;
 	}
 
-	io->decomp_data = addr;
+	io->decomp_data = io->decomp_real_data = addr;
 	io->decomp_len = size;
 
-	dm_icomp_bio_copy(req->bio, 0, io->decomp_data, size, true);
+	io->decomp_kmap = (segments == 1);
+	if (!io->decomp_kmap)
+		dm_icomp_bio_copy(req->bio, 0, io->decomp_data, size, true);
 	return io;
 }
 
@@ -1105,7 +1168,15 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 
 		io->io_region.sector -= req->info->data_start;
 
-		if (dm_icomp_update_io_read_range(io)) {
+		if (io->io_region.sector >=
+				req->bio->bi_iter.bi_sector)
+			dst_off = (io->io_region.sector -
+				req->bio->bi_iter.bi_sector) << 9;
+		else
+			src_off = (req->bio->bi_iter.bi_sector -
+				io->io_region.sector) << 9;
+
+		if (dm_icomp_update_io_read_range(io, req->bio, dst_off)) {
 			req->result = -EIO;
 			return;
 		}
@@ -1114,20 +1185,19 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 		ret = dm_icomp_io_range_decompress(req->info, io->comp_data,
 			io->comp_len, io->decomp_data, io->decomp_len);
 		if (ret < 0) {
+			dm_icomp_release_decomp_buffer(io);
+			dm_icomp_release_comp_buffer(io);
 			req->result = -EIO;
 			return;
 		}
 
-		if (io->io_region.sector >= req->bio->bi_iter.bi_sector)
-			dst_off = (io->io_region.sector -
-				 req->bio->bi_iter.bi_sector) << 9;
-		else
-			src_off = (req->bio->bi_iter.bi_sector -
-				 io->io_region.sector) << 9;
-
 		len = min_t(ssize_t, io->decomp_len - src_off,
 			(bio_sectors(req->bio) << 9) - dst_off);
 
+		dm_icomp_bio_copy(req->bio, dst_off,
+		  ((ret == 1) ? io->comp_data : io->decomp_data) + src_off,
+		  len, false);
+
 		/* io range in all_io list is ordered for read IO */
 		while (bio_off != dst_off) {
 			ssize_t size = min_t(ssize_t, PAGE_SIZE,
@@ -1137,13 +1207,9 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 			bio_off += size;
 		}
 
-		if (ret == 1)
-			dm_icomp_bio_copy(req->bio, dst_off,
-					io->comp_data + src_off, len, false);
-		else
-			dm_icomp_bio_copy(req->bio, dst_off,
-					io->decomp_data + src_off, len, false);
 		bio_off = dst_off + len;
+		dm_icomp_release_decomp_buffer(io);
+		dm_icomp_release_comp_buffer(io);
 	}
 
 	while (bio_off != (bio_sectors(req->bio) << 9)) {
@@ -1270,29 +1336,20 @@ static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
 	if (start < req->bio->bi_iter.bi_sector && start + count >
 					bio_end_sector(req->bio)) {
 		/* we don't split an extent */
-		if (ret == 1) {
-			memcpy(io->decomp_data, io->comp_data, io->decomp_len);
-			dm_icomp_bio_copy(req->bio, 0,
-			   io->decomp_data +
-			   ((req->bio->bi_iter.bi_sector - start) << 9),
-			   bio_sectors(req->bio) << 9, true);
-		} else {
-			dm_icomp_bio_copy(req->bio, 0,
-			   io->decomp_data +
-			   ((req->bio->bi_iter.bi_sector - start) << 9),
-			   bio_sectors(req->bio) << 9, true);
-
-			dm_icomp_kfree(io->comp_data, io->comp_len);
-			/* New compressed len might be bigger */
-			io->comp_data = dm_icomp_kmalloc(
-				dm_icomp_compressor_len(
-				req->info, io->decomp_len), GFP_NOIO);
-			io->comp_len = io->decomp_len;
-			if (!io->comp_data) {
-				req->result = -ENOMEM;
-				return -EIO;
+		if (!io->decomp_kmap) {
+			if (ret == 1) {
+				memcpy(io->decomp_data, io->comp_data,
+					io->decomp_len);
+				dm_icomp_bio_copy(req->bio, 0,
+				   io->decomp_data +
+				   ((req->bio->bi_iter.bi_sector - start) << 9),
+				   bio_sectors(req->bio) << 9, true);
+			} else {
+				dm_icomp_bio_copy(req->bio, 0,
+				   io->decomp_data +
+				   ((req->bio->bi_iter.bi_sector - start) << 9),
+				   bio_sectors(req->bio) << 9, true);
 			}
-			io->io_req.mem.ptr.addr = io->comp_data;
 		}
 		/* need compress data */
 		ret = 0;
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
index d9cf05a..e144e96 100644
--- a/drivers/md/dm-inplace-compress.h
+++ b/drivers/md/dm-inplace-compress.h
@@ -115,7 +115,9 @@ struct dm_icomp_meta_io {
 struct dm_icomp_io_range {
 	struct dm_io_request io_req;
 	struct dm_io_region io_region;
+	int decomp_kmap;	/* Is the decomp_data kmapped'? */
 	void *decomp_data;
+	void *decomp_real_data; /* holds the actual start of the buffer */
 	unsigned int decomp_req_len;/* originally requested length */
 	unsigned int decomp_len; /* actual allocated/mapped length */
 	void *comp_data;
-- 
1.7.1

* [RFC PATCH 11/16] DM: Try to avoid temporary buffer allocation to hold compressed data.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

The read path
a) allocates a temporary buffer to hold the compressed data.
b) reads the compressed data into the temporary buffer.
c) decompresses the compressed data into the caller's bio-buffer.

We know that the caller's bio buffer will be at least as large as the
compressed data.  So we can save the buffer allocation in step (a) if we use
the caller's bio buffer to read in the compressed data from the disk. NOTE:
this is not always possible; it is possible only if the destination within the
bio buffer falls within a single segment.

This saves us from holding the temporary buffer across read boundaries, which
is a huge advantage under acute memory starvation, a situation that mostly
arises when this block device is used as a swap device.
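
A simplified userspace sketch of the buffer choice in step (a) follows. The
names are invented for this example, and the bio is reduced to a pointer, a
length and a segment count:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct read_buf {
	void *data;
	size_t len;
	bool borrowed;	/* true if data points into the caller's buffer */
};

/*
 * Borrow the caller's buffer for the compressed read when it is a
 * single segment that is large enough; otherwise allocate one.
 * Returns 0 on success, -1 on allocation failure.
 */
static int read_buf_setup(struct read_buf *rb, void *caller_buf,
			  size_t caller_len, size_t nsegs, size_t comp_len)
{
	if (nsegs == 1 && caller_len >= comp_len) {
		rb->data = caller_buf;
		rb->borrowed = true;
	} else {
		rb->data = malloc(comp_len);
		if (!rb->data)
			return -1;
		rb->borrowed = false;
	}
	rb->len = comp_len;
	return 0;
}

/* Free only what was allocated here; never free a borrowed buffer. */
static void read_buf_release(struct read_buf *rb)
{
	if (!rb->borrowed)
		free(rb->data);
	rb->data = NULL;
	rb->len = 0;
	rb->borrowed = false;
}
```

Where the patch warns if the single segment is smaller than the compressed
length, this sketch simply falls back to an allocation.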

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |  185 ++++++++++++++++++-------------------
 drivers/md/dm-inplace-compress.h |    1 +
 2 files changed, 91 insertions(+), 95 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index f6b95e3..bc1cf70 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -810,10 +810,15 @@ static void dm_icomp_release_comp_buffer(struct dm_icomp_io_range *io)
 {
 	if (!io->comp_data)
 		return;
-	dm_icomp_kfree(io->comp_data, io->comp_len);
+
+	if (io->comp_kmap)
+		kunmap(io->comp_data);
+	else
+		dm_icomp_kfree(io->comp_data, io->comp_len);
 
 	io->comp_data = NULL;
 	io->comp_len = 0;
+	io->comp_kmap = false;
 }
 
 static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
@@ -857,6 +862,39 @@ static void dm_icomp_put_req(struct dm_icomp_req *req)
 	kmem_cache_free(dm_icomp_req_cachep, req);
 }
 
+static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
+		ssize_t len, bool to_buf)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	off_t buf_off = 0;
+	ssize_t size;
+	void *addr;
+
+	WARN_ON(bio_off + len > (bio_sectors(bio) << 9));
+
+	bio_for_each_segment(bv, bio, iter) {
+		int length = bv.bv_len;
+
+		if (bio_off >= length) {
+			bio_off -= length;
+			continue;
+		}
+		addr = kmap_atomic(bv.bv_page);
+		size = min_t(ssize_t, len, length - bio_off);
+		if (to_buf)
+			memcpy(buf + buf_off, addr + bio_off + bv.bv_offset,
+			size);
+		else
+			memcpy(addr + bio_off + bv.bv_offset, buf + buf_off,
+			size);
+		kunmap_atomic(addr);
+		bio_off = 0;
+		buf_off += size;
+		len -= size;
+	}
+}
+
 static void dm_icomp_io_range_done(unsigned long error, void *context)
 {
 	struct dm_icomp_io_range *io = context;
@@ -886,7 +924,7 @@ static inline int dm_icomp_compressor_maxlen(struct dm_icomp_info *info,
  * comp_data
  */
 static struct dm_icomp_io_range *dm_icomp_create_io_range(
-		struct dm_icomp_req *req, int comp_len)
+		struct dm_icomp_req *req)
 {
 	struct dm_icomp_io_range *io;
 
@@ -894,120 +932,67 @@ static struct dm_icomp_io_range *dm_icomp_create_io_range(
 	if (!io)
 		return NULL;
 
-	io->comp_data = dm_icomp_kmalloc(comp_len, GFP_NOIO);
-	if (!io->comp_data) {
-		kmem_cache_free(dm_icomp_io_range_cachep, io);
-		return NULL;
-	}
-
 	io->io_req.notify.fn = dm_icomp_io_range_done;
 	io->io_req.notify.context = io;
 	io->io_req.client = req->info->io_client;
 	io->io_req.mem.type = DM_IO_KMEM;
-	io->io_req.mem.ptr.addr = io->comp_data;
 	io->io_req.mem.offset = 0;
 
 	io->io_region.bdev = req->info->dev->bdev;
-
-	io->comp_len = comp_len;
 	io->req = req;
 
-	io->decomp_data = NULL;
-	io->decomp_real_data = NULL;
-	io->decomp_len = 0;
-	io->decomp_kmap = false;
-	io->decomp_req_len = 0;
+	io->comp_data = io->decomp_data = io->decomp_real_data = NULL;
+	io->comp_len = io->decomp_len = io->decomp_req_len = 0;
+	io->comp_kmap = io->decomp_kmap = false;
 	return io;
 }
 
 static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
 		struct dm_icomp_req *req, int comp_len, int decomp_len)
 {
-	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req, comp_len);
-
-	if (io) {
-		/* note down the requested length for decompress buffer.
-		 * but dont allocate it yet.
-		 */
-		io->decomp_req_len = decomp_len;
-	}
-	return io;
-}
+	struct bio *bio = req->bio;
+	sector_t size  = bio_sectors(bio)<<9;
+	int segments = bio_segments(bio);
+	void *addr;
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
 
-static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io,
-		struct bio *bio, ssize_t bio_off)
-{
-	struct bvec_iter iter;
-	struct bio_vec bv;
-	bool just_use = false;
+	if (!io)
+		return NULL;
 
-	if (io->decomp_len)
-		return 0;
+	/* try reusing the bio buffer for compress data. */
+	if (segments == 1) {
+		struct bio_vec bv = bio_iovec(bio);
 
-	/* use a bio buffer long enough to hold the uncompressed data */
-	bio_for_each_segment(bv, bio, iter) {
-		int avail_len;
-		int length = bv.bv_len;
+		WARN_ON(size < comp_len);
+		addr = kmap(bv.bv_page);
+	} else
+		addr  = dm_icomp_kmalloc(comp_len, GFP_NOIO);
 
-		if (!just_use && bio_off >= length) {
-			bio_off -= length;
-			continue;
-		}
-		avail_len = just_use ? length : length-bio_off;
-		if (avail_len >= io->decomp_req_len) {
-			io->decomp_real_data = kmap(bv.bv_page);
-			io->decomp_data = io->decomp_real_data + bio_off;
-			io->decomp_len = io->decomp_req_len = avail_len;
-			io->decomp_kmap = true;
-			return 0;
-		}
-		just_use = true;
+	if (!addr) {
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
 	}
+	io->comp_data = io->io_req.mem.ptr.addr = addr;
+	io->comp_len = comp_len;
+	io->comp_kmap = (segments == 1);
+	/* note down the requested length for decompress buffer.
+	 * but dont allocate it yet.
+	 /
+	io->decomp_req_len = decomp_len;
+	return io;
+}
 
-	/* none available. :( Allocate one */
+static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io)
+{
 	io->decomp_data = dm_icomp_kmalloc(io->decomp_req_len, GFP_NOIO);
 	if (!io->decomp_data)
 		return 1;
 	io->decomp_real_data = io->decomp_data;
 	io->decomp_len = io->decomp_req_len;
 	io->decomp_kmap = false;
-
 	return 0;
 }
 
-static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
-		ssize_t len, bool to_buf)
-{
-	struct bio_vec bv;
-	struct bvec_iter iter;
-	off_t buf_off = 0;
-	ssize_t size;
-	void *addr;
-
-	WARN_ON(bio_off + len > (bio_sectors(bio) << 9));
-
-	bio_for_each_segment(bv, bio, iter) {
-		int length = bv.bv_len;
-
-		if (bio_off >= length) {
-			bio_off -= length;
-			continue;
-		}
-		addr = kmap_atomic(bv.bv_page);
-		size = min_t(ssize_t, len, length - bio_off);
-		if (to_buf)
-			memcpy(buf + buf_off, addr + bio_off + bv.bv_offset,
-				size);
-		else
-			memcpy(addr + bio_off + bv.bv_offset, buf + buf_off,
-				size);
-		kunmap_atomic(addr);
-		bio_off = 0;
-		buf_off += size;
-		len -= size;
-	}
-}
-
 static int dm_icomp_mod_to_max_io_range(struct dm_icomp_info *info,
 			 struct dm_icomp_io_range *io)
 {
@@ -1030,13 +1015,26 @@ static int dm_icomp_mod_to_max_io_range(struct dm_icomp_info *info,
 static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
 		struct dm_icomp_req *req)
 {
-	struct dm_icomp_io_range *io;
 	struct bio *bio = req->bio;
 	sector_t size  = bio_sectors(req->bio)<<9;
 	int segments = bio_segments(bio);
 	int comp_len = dm_icomp_compressor_len(req->info, size);
 	void *addr;
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+
+	if (!io)
+		return NULL;
+
+	addr = dm_icomp_kmalloc(comp_len, GFP_NOIO);
+	if (!addr) {
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+	io->comp_data = io->io_req.mem.ptr.addr = addr;
+	io->comp_len = comp_len;
+	io->comp_kmap = false;
 
+	/* try reusing the bio buffer for decomp data. */
 	if (segments == 1) {
 		struct bio_vec bv = bio_iovec(bio);
 
@@ -1044,21 +1042,18 @@ static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
 	} else
 		addr  = dm_icomp_kmalloc(size, GFP_NOIO);
 
-	if (!addr)
-		return NULL;
-
-	io = dm_icomp_create_io_range(req, comp_len);
-	if (!io) {
-		(segments == 1) ?  kunmap(addr) : dm_icomp_kfree(addr, size);
+	if (!addr) {
+		dm_icomp_kfree(io->comp_data, comp_len);
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
 		return NULL;
 	}
-
 	io->decomp_data = io->decomp_real_data = addr;
 	io->decomp_len = size;
 
 	io->decomp_kmap = (segments == 1);
 	if (!io->decomp_kmap)
 		dm_icomp_bio_copy(req->bio, 0, io->decomp_data, size, true);
+
 	return io;
 }
 
@@ -1176,7 +1171,7 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 			src_off = (req->bio->bi_iter.bi_sector -
 				io->io_region.sector) << 9;
 
-		if (dm_icomp_update_io_read_range(io, req->bio, dst_off)) {
+		if (dm_icomp_update_io_read_range(io)) {
 			req->result = -EIO;
 			return;
 		}
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
index e144e96..775ccbf 100644
--- a/drivers/md/dm-inplace-compress.h
+++ b/drivers/md/dm-inplace-compress.h
@@ -120,6 +120,7 @@ struct dm_icomp_io_range {
 	void *decomp_real_data; /* holds the actual start of the buffer */
 	unsigned int decomp_req_len;/* originally requested length */
 	unsigned int decomp_len; /* actual allocated/mapped length */
+	int comp_kmap;          /* Is the comp_data kmapped? */
 	void *comp_data;
 	unsigned int comp_len; /* For write, this is estimated */
 	struct list_head next;
-- 
1.7.1
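
The segment-walking copy in dm_icomp_bio_copy() can be tried outside the
kernel with a small userspace sketch. It assumes plain memory chunks in
place of bio segments: `struct seg` and `seg_copy()` are hypothetical names,
not part of the patch, and no kernel API is used. Unlike the posted helper,
the sketch stops as soon as `len` is exhausted and returns the number of
bytes it copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one bio segment: a plain chunk of memory. */
struct seg {
	unsigned char *base;
	size_t len;
};

/*
 * Copy up to len bytes between a flat buffer and a list of segments,
 * starting seg_off bytes into the segment list. A non-zero to_buf copies
 * segments -> buf, zero copies buf -> segments.
 * Returns the number of bytes actually copied.
 */
static size_t seg_copy(const struct seg *segs, size_t nsegs, size_t seg_off,
		       unsigned char *buf, size_t len, int to_buf)
{
	size_t buf_off = 0;
	size_t i;

	assert(segs != NULL && buf != NULL);

	for (i = 0; i < nsegs && len > 0; i++) {
		size_t avail, n;

		/* Skip whole segments that lie before the start offset. */
		if (seg_off >= segs[i].len) {
			seg_off -= segs[i].len;
			continue;
		}
		avail = segs[i].len - seg_off;
		n = len < avail ? len : avail;
		if (to_buf)
			memcpy(buf + buf_off, segs[i].base + seg_off, n);
		else
			memcpy(segs[i].base + seg_off, buf + buf_off, n);
		/* Only the first segment touched has a non-zero offset. */
		seg_off = 0;
		buf_off += n;
		len -= n;
	}
	return buf_off;
}
```

Returning the copied length lets a caller detect a request that runs past
the end of the segment list, which the kernel helper reports through
WARN_ON() instead.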


* [RFC PATCH 12/16] DM: release unneeded buffer as soon as possible.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Release unneeded buffers early to conserve as much free memory as possible.
Waiting to release them until the bio is done unnecessarily hogs memory.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index bc1cf70..b1c3e5f 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -977,7 +977,7 @@ static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
 	io->comp_kmap = (segments == 1);
 	/* note down the requested length for decompress buffer.
 	 * but dont allocate it yet.
-	 /
+	 */
 	io->decomp_req_len = decomp_len;
 	return io;
 }
@@ -1375,8 +1375,13 @@ static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
 	}
 
 	dm_icomp_get_req(req);
-	if (ret == 1)
+
+	if (ret == 1) {
 		io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
+		dm_icomp_release_comp_buffer(io);
+	} else
+		dm_icomp_release_decomp_buffer(io);
+
 	io->io_region.count = comp_len >> 9;
 	io->io_region.sector = start + req->info->data_start;
 
@@ -1443,8 +1448,11 @@ static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
 	io->io_region.sector = req->bio->bi_iter.bi_sector +
 			 req->info->data_start;
 
-	if (ret == 1)
+	if (ret == 1) {
 		io->io_req.mem.ptr.addr = io->decomp_data;
+		dm_icomp_release_comp_buffer(io);
+	} else
+		dm_icomp_release_decomp_buffer(io);
 
 	io->io_region.count = comp_len >> 9;
 	io->io_req.bi_rw = req->bio->bi_rw;
-- 
1.7.1


* [RFC PATCH 13/16] DM: macros to set and get the state of the request.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Consolidate code to set and get the state of a request.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   31 +++++++++++++++++--------------
 1 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index b1c3e5f..55a515b 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -44,6 +44,9 @@ static struct kernel_param_ops dm_icomp_compressor_param_ops = {
 module_param_cb(compress_algorithm, &dm_icomp_compressor_param_ops,
 		&dm_icomp_compressor_kparam, 0644);
 
+#define SET_REQ_STAGE(req, value) ((req)->stage = (value))
+#define GET_REQ_STAGE(req) ((req)->stage)
+
 static int dm_icomp_get_compressor(const char *s)
 {
 	int r, val_len;
@@ -835,15 +838,15 @@ static void dm_icomp_put_req(struct dm_icomp_req *req)
 	if (atomic_dec_return(&req->io_pending))
 		return;
 
-	if (req->stage == STAGE_INIT) /* waiting for locking */
+	if (GET_REQ_STAGE(req) == STAGE_INIT) /* waiting for locking */
 		return;
 
-	if (req->stage == STAGE_READ_DECOMP ||
-	    req->stage == STAGE_WRITE_COMP ||
+	if (GET_REQ_STAGE(req) == STAGE_READ_DECOMP ||
+	    GET_REQ_STAGE(req) == STAGE_WRITE_COMP ||
 	    req->result)
-		req->stage = STAGE_DONE;
+		SET_REQ_STAGE(req, STAGE_DONE);
 
-	if (req->stage != STAGE_DONE) {
+	if (GET_REQ_STAGE(req) != STAGE_DONE) {
 		dm_icomp_queue_req(req->info, req);
 		return;
 	}
@@ -1153,7 +1156,7 @@ static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
 	off_t bio_off = 0;
 	int ret;
 
-	req->stage = STAGE_READ_DECOMP;
+	SET_REQ_STAGE(req, STAGE_READ_DECOMP);
 
 	if (req->result)
 		return;
@@ -1250,7 +1253,7 @@ static void dm_icomp_handle_read_read_existing(struct dm_icomp_req *req)
 	u64 block_index, first_block_index;
 	u16 logical_sectors, data_sectors;
 
-	req->stage = STAGE_READ_EXISTING;
+	SET_REQ_STAGE(req, STAGE_READ_EXISTING);
 
 	block_index = dm_icomp_sector_to_block(req->bio->bi_iter.bi_sector);
 again:
@@ -1279,14 +1282,14 @@ static void dm_icomp_handle_read_request(struct dm_icomp_req *req)
 {
 	dm_icomp_get_req(req);
 
-	if (req->stage == STAGE_INIT) {
+	if (GET_REQ_STAGE(req) == STAGE_INIT) {
 		if (!dm_icomp_lock_req_range(req)) {
 			dm_icomp_put_req(req);
 			return;
 		}
 
 		dm_icomp_handle_read_read_existing(req);
-	} else if (req->stage == STAGE_READ_EXISTING)
+	} else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING)
 		dm_icomp_handle_read_decomp(req);
 
 	dm_icomp_put_req(req);
@@ -1411,7 +1414,7 @@ static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
 	int ret;
 	bool handle_bio = true;
 
-	req->stage = STAGE_WRITE_COMP;
+	SET_REQ_STAGE(req, STAGE_WRITE_COMP);
 
 	if (req->result)
 		return;
@@ -1486,7 +1489,7 @@ static void dm_icomp_handle_write_read_existing(struct dm_icomp_req *req)
 	u64 block_index, first_block_index;
 	u16 logical_sectors, data_sectors;
 
-	req->stage = STAGE_READ_EXISTING;
+	SET_REQ_STAGE(req, STAGE_READ_EXISTING);
 
 	block_index = dm_icomp_sector_to_block(req->bio->bi_iter.bi_sector);
 	dm_icomp_get_extent(req->info, block_index, &first_block_index,
@@ -1524,14 +1527,14 @@ static void dm_icomp_handle_write_request(struct dm_icomp_req *req)
 {
 	dm_icomp_get_req(req);
 
-	if (req->stage == STAGE_INIT) {
+	if (GET_REQ_STAGE(req) == STAGE_INIT) {
 		if (!dm_icomp_lock_req_range(req)) {
 			dm_icomp_put_req(req);
 			return;
 		}
 
 		dm_icomp_handle_write_read_existing(req);
-	} else if (req->stage == STAGE_READ_EXISTING)
+	} else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING)
 		dm_icomp_handle_write_comp(req);
 
 	dm_icomp_put_req(req);
@@ -1611,7 +1614,7 @@ static int dm_icomp_map(struct dm_target *ti, struct bio *bio)
 	atomic_set(&req->io_pending, 0);
 	INIT_LIST_HEAD(&req->all_io);
 	req->result = 0;
-	req->stage = STAGE_INIT;
+	SET_REQ_STAGE(req, STAGE_INIT);
 	req->locked_locks = 0;
 
 	req->cpu = raw_smp_processor_id();
-- 
1.7.1



* [RFC PATCH 14/16] DM: Wasted bio copy.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

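The copy is redundant: dm_icomp_create_io_write_range() already fills
decomp_data from the bio, either by mapping the bio page directly or by
copying it into the allocated buffer.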

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index 55a515b..31b144b 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -1434,7 +1434,6 @@ static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
 		req->result = -EIO;
 		return;
 	}
-	dm_icomp_bio_copy(req->bio, 0, io->decomp_data, count << 9, true);
 
 	/* compress data */
 	comp_len = io->comp_len;
-- 
1.7.1



* [RFC PATCH 15/16] DM: Add sysfs parameters to track total memory saved and allocated.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Add parameters to monitor the memory efficiency of the module.
dm_icomp_total_alloc_size: total memory currently in use.
dm_icomp_total_bio_save: total memory allocation saved by the optimizations.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 drivers/md/dm-inplace-compress.c |   36 ++++++++++++++++++++++++++++++++++--
 1 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
index 31b144b..0a7790a 100644
--- a/drivers/md/dm-inplace-compress.c
+++ b/drivers/md/dm-inplace-compress.c
@@ -82,6 +82,23 @@ static int dm_icomp_compressor_param_set(const char *val,
 	return 0;
 }
 
+static const struct kernel_param_ops dm_icomp_alloc_param_ops = {
+	.set    = param_set_ulong,
+	.get    = param_get_ulong,
+};
+
+static atomic64_t dm_icomp_total_alloc_size;
+#define DMCP_ALLOC(s) atomic64_add((s), &dm_icomp_total_alloc_size)
+#define DMCP_FREE_ALLOC(s) atomic64_sub((s), &dm_icomp_total_alloc_size)
+module_param_cb(dm_icomp_total_alloc_size, &dm_icomp_alloc_param_ops,
+				&dm_icomp_total_alloc_size, 0644);
+
+static atomic64_t dm_icomp_total_bio_save;
+#define DMCP_ALLOC_SAVE(s) atomic64_add((s), &dm_icomp_total_bio_save)
+module_param_cb(dm_icomp_total_bio_save, &dm_icomp_alloc_param_ops,
+				&dm_icomp_total_bio_save, 0644);
+
+
 static struct kmem_cache *dm_icomp_req_cachep;
 static struct kmem_cache *dm_icomp_io_range_cachep;
 static struct kmem_cache *dm_icomp_meta_io_cachep;
@@ -616,6 +633,9 @@ static int dm_icomp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	atomic64_set(&info->compressed_write_size, 0);
 	atomic64_set(&info->uncompressed_write_size, 0);
 	atomic64_set(&info->meta_write_size, 0);
+	atomic64_set(&dm_icomp_total_alloc_size, 0);
+	atomic64_set(&dm_icomp_total_bio_save, 0);
+
 	ti->num_flush_bios = 1;
 	/* ti->num_discard_bios = 1; */
 	ti->private = info;
@@ -781,18 +801,28 @@ static void dm_icomp_get_req(struct dm_icomp_req *req)
 
 static void *dm_icomp_kmalloc(size_t size, gfp_t flags)
 {
-	return  kmalloc(size, flags);
+	void *addr = kmalloc(size, flags);
+
+	if (addr)
+		DMCP_ALLOC(size);
+	return addr;
 }
 
 static void *dm_icomp_krealloc(void *addr, size_t size,
 		 size_t orig_size, gfp_t flags)
 {
-	return krealloc(addr, size, flags);
+	void *taddr = krealloc(addr, size, flags);
+
+	DMCP_FREE_ALLOC(orig_size);
+	if (taddr)
+		DMCP_ALLOC(size);
+	return taddr;
 }
 
 static void dm_icomp_kfree(void *addr, unsigned int size)
 {
 	kfree(addr);
+	DMCP_FREE_ALLOC(size);
 }
 
 static void dm_icomp_release_decomp_buffer(struct dm_icomp_io_range *io)
@@ -1056,6 +1086,8 @@ static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
 	io->decomp_kmap = (segments == 1);
 	if (!io->decomp_kmap)
 		dm_icomp_bio_copy(req->bio, 0, io->decomp_data, size, true);
+	else
+		DMCP_ALLOC_SAVE(size);
 
 	return io;
 }
-- 
1.7.1
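
The accounting wrappers in this patch can be modelled in userspace to check
how the counter behaves. This is a sketch under stated assumptions: it uses
C11 atomics and malloc()/realloc()/free() in place of atomic64_t and the
kernel allocators, and every `tracked_*` name is hypothetical. One
deliberate difference from the posted dm_icomp_krealloc(): the counter is
adjusted only when the reallocation succeeds, because a failed realloc()
(like a failed krealloc()) leaves the original buffer allocated. Whether
that matters for the patch depends on how callers handle a NULL return,
which is not visible in this hunk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Bytes currently allocated through the wrappers below. */
static atomic_llong total_alloc;

static void *tracked_malloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		atomic_fetch_add(&total_alloc, (long long)size);
	return p;
}

static void *tracked_realloc(void *old, size_t size, size_t orig_size)
{
	void *p;

	assert(size > 0);	/* realloc(p, 0) is implementation-defined */

	p = realloc(old, size);
	/* On failure the old buffer is still live, so leave the counter. */
	if (p) {
		atomic_fetch_sub(&total_alloc, (long long)orig_size);
		atomic_fetch_add(&total_alloc, (long long)size);
	}
	return p;
}

static void tracked_free(void *p, size_t size)
{
	if (!p)
		return;
	free(p);
	atomic_fetch_sub(&total_alloc, (long long)size);
}

static long long tracked_total(void)
{
	return atomic_load(&total_alloc);
}
```

As in the patch, the caller has to pass the size back on free and realloc,
so the counter is only as accurate as the sizes the callers remember.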


* [RFC PATCH 16/16] DM: add documentation for dm-inplace-compress.
From: Ram Pai @ 2016-08-15 17:36 UTC (permalink / raw)
  To: LKML, linux-raid, dm-devel, linux-doc; +Cc: shli, agk, snitzer, corbet, Ram Pai
In-Reply-To: <1471282613-31006-1-git-send-email-linuxram@us.ibm.com>

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 .../device-mapper/dm-inplace-compress.text         |  138 ++++++++++++++++++++
 1 files changed, 138 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/device-mapper/dm-inplace-compress.text

diff --git a/Documentation/device-mapper/dm-inplace-compress.text b/Documentation/device-mapper/dm-inplace-compress.text
new file mode 100644
index 0000000..c31e69e
--- /dev/null
+++ b/Documentation/device-mapper/dm-inplace-compress.text
@@ -0,0 +1,138 @@
+dm-inplace-compress
+====================
+
+Device-Mapper's "inplace-compress" target provides inplace compression of block
+devices using the kernel compression API.
+
+Parameters: <device path> \
+	[ <#opt_params writethrough> ]
+	[ <#opt_params writeback> <meta_commit_delay> ]
+	[ <#opt_params compressor> <type> ]
+
+
+<writethrough>
+    Write data and metadata together.
+
+<writeback> <meta_commit_delay>
+    Write metadata every 'meta_commit_delay' interval.
+
+<device path>
+    This is the device that is going to be used as backend and contains the
+    compressed data.  You can specify it as a path like /dev/xxx or a device
+    number <major>:<minor>.
+
+<compressor> <type>
+    Choose the compressor algorithm. 'lzo' and '842'
+    compressors are supported.
+
+Example scripts
+===============
+
+Create an inplace-compress block device using lzo compression. Write metadata
+and data together.
+[[
+#!/bin/sh
+# Create an inplace-compress device using dmsetup
+dmsetup create comp1 --table "0 `blockdev --getsize $1` inplacecompress $1
+		writethrough compressor lzo"
+]]
+
+
+Create an inplace-compress block device using nx-842 hardware compression. Write
+metadata periodically, every 5 seconds.
+
+[[
+#!/bin/sh
+# Create an inplace-compress device using dmsetup
+dmsetup create comp1 --table "0 `blockdev --getsize $1` inplacecompress $1
+		writeback 5 compressor 842"
+]]
+
+Description
+===========
+    This is a simple DM target supporting inplace compression. It's best suited
+    for SSDs. The underlying disk must support a 512B sector size; the target
+    only supports a 4k sector size.
+
+    Disk layout:
+    |super|...meta...|..data...|
+
+    The storage unit is 4k (a block). Super is 1 block and stores the meta and
+    data sizes and the compression algorithm. Meta is a bitmap. For each data
+    block, there are 5 bits of meta.
+
+    Data:
+
+    Data of a block is compressed. Compressed data is rounded up to 512B, which
+    is the payload. On disk, the payload is stored at the first logical sector
+    of the block. Let's look at an example. Say we store data to block A, which
+    is at sector B (A*8); its original size is 4k and its compressed size is
+    1500. Compressed data (CD) will use 3 sectors (512B each). The 3 sectors
+    are the payload. The payload will be stored at sector B.
+
+    ---------------------------------------------------
+    ... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
+    ---------------------------------------------------
+        ^B    ^B+1  ^B+2                  ^B+7 ^B+8
+
+    For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta
+    bits to represent the payload size. The compressed size (1500) isn't stored
+    in meta directly. Instead, we store it in the last 32 bits of the payload.
+    In this example, we store it at the end of sector B+2. If compressed size +
+    sizeof(32bits) crosses a sector, payload size will increase by one sector.
+    If the payload uses 8 sectors, we store uncompressed data directly.
+
+    If IO size is bigger than one block, we can store the data as an extent.
+    Data of the whole extent will be compressed and stored in a similar way as
+    above.  The first block of the extent is the head, all others are the tail.
+    If the extent is 1 block, the block is the head. We have 1 bit of meta to
+    represent whether a block is head or tail. If the 4 meta bits of the head
+    block can't store the extent payload size, we borrow tail block meta bits
+    to store the payload size. The max allowed extent size is 128k, so we don't
+    compress/decompress overly large data.
+
+    Meta:
+    Modifying data will modify meta too. Meta will be written (flushed) to disk
+    depending on the meta write policy. We support writeback and writethrough
+    modes. In writeback mode, meta will be written to disk at an interval or on
+    a FLUSH request.  In writethrough mode, data and metadata will be written
+    to disk together.
+
+    Advantages:
+
+    1. Simple. Since we store compressed data in-place, we don't need complicated
+    disk data management.
+    2. Efficient. For each 4k, we only need 5 bits of meta. 1T of data will use
+    less than 200M of meta, so we can load all meta into memory. The actual
+    compressed size is in the payload. So if IO doesn't need RMW and we use
+    writeback meta flush, we don't need extra IO for meta.
+
+    Disadvantages:
+
+    1. hole. Since we store compressed data in-place, there are a lot of holes
+    (in the above example, B+3 - B+7). Holes can impact IO, because we can't
+    merge IO.
+
+    2. 1:1 size. Compression doesn't change disk size. If the disk is 1T, we
+    can only store 1T of data even with compression.
+
+    But this is for SSD only. Generally, SSD firmware has an FTL to map disk
+    sectors to NAND flash. High-end SSD firmware has a filesystem-like FTL.
+
+    1. hole. Disk has a lot of holes, but SSD FTL can still store data contiguously
+    in nand. Even if we can't do IO merge in OS layer, SSD firmware can do it.
+
+    2. 1:1 size. On one hand, we write compressed data to SSD, which means less
+    data is written to SSD. This will be very helpful to improve SSD garbage
+    collection, and so write speed and lifespan. So even if this is a problem,
+    the target is still helpful. On the other hand, advanced SSD FTL can easily
+    do thin provisioning. For example, if nand is 1T, we can let the SSD report
+    it as 2T and use it as the compressed target, avoiding the 1:1 size issue.
+
+    So even if SSD FTL cannot map non-contiguous disk sectors to contiguous nand,
+    the compression target can still function well.
+
+
+Author:
+    Shaohua Li <shli@fusionio.com>
+    Ram Pai <ram.n.pai@gmail.com>
-- 
1.7.1
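
The payload-size rule and the "less than 200M meta for 1T" figure from the
documentation above can be checked with a short userspace sketch. It is an
illustration only, assuming the layout exactly as the text describes it
(512B sectors, 4k blocks, a 32-bit length stored at the end of the payload,
5 meta bits per block); `payload_sectors()` and `meta_bytes()` are
hypothetical helper names and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES		512u
#define BLOCK_BYTES		4096u
#define LEN_FIELD_BYTES		4u	/* 32-bit compressed length */
#define META_BITS_PER_BLOCK	5u

/*
 * Sectors needed for comp_len bytes of compressed data belonging to an
 * extent of extent_bytes. The compressed length occupies the last 32 bits
 * of the payload and the payload is rounded up to whole sectors. A result
 * equal to extent_bytes / SECTOR_BYTES means the data would be stored
 * uncompressed.
 */
static unsigned int payload_sectors(unsigned int comp_len,
				    unsigned int extent_bytes)
{
	unsigned int max_sectors = extent_bytes / SECTOR_BYTES;
	unsigned int sectors;

	assert(extent_bytes >= BLOCK_BYTES && extent_bytes % BLOCK_BYTES == 0);

	sectors = (comp_len + LEN_FIELD_BYTES + SECTOR_BYTES - 1) / SECTOR_BYTES;
	return sectors >= max_sectors ? max_sectors : sectors;
}

/* Bytes of meta bitmap needed to describe data_bytes of data. */
static uint64_t meta_bytes(uint64_t data_bytes)
{
	uint64_t blocks = data_bytes / BLOCK_BYTES;

	return (blocks * META_BITS_PER_BLOCK + 7) / 8;
}
```

With these definitions, payload_sectors(1500, 4096) gives 3, matching the
CD1..CD3 example, while a compressed size of 1534 needs 4 sectors because
the length field pushes it across a sector boundary. meta_bytes(1ULL << 40)
gives 167772160 bytes, that is 160M of meta for 1T of data.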


* Re: Raid5 reshape stuck at 0% - SuSE leap 42.1
From: Wols Lists @ 2016-08-16 10:57 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid, NeilBrown
In-Reply-To: <alpine.DEB.2.02.1608081339310.3593@uplift.swm.pp.se>

On 08/08/16 12:40, Mikael Abrahamsson wrote:
> On Sun, 7 Aug 2016, Wols Lists wrote:
> 
>> Note that I think this dmesg stuff overlaps with the last lot, namely
>> the start of this is the tail end of the array starting successfully
>> last time.
> 
> It helps if you supply output of /proc/mdstat before each operation, and
> also adding verbose output to mdadm command.
> 
Okay. Just tried to do this - and do Neil's thing where I was trying to
reduce the number of raid devices ... I now have a wedged 2-device raid
5 that I can't revert back to raid 1, or set off the required reshape.
Mikael - your --update=revert-reshape that worked fine last time, now
refuses to work ... :-( so I can't try Neil's --raid-devices=2 because I
can't get a clean full-working-order array. (The array is working fine,
so if it was a real live array I wouldn't be worried about losing
anything, but a wedged array is a wedged array - not good!)

As before, OS = SuSE Leap 42.1, "mdadm" is what comes with the OS,
"./mdadm" is Neil's git tree (a week or so old).

Attached is my xterm trace and the associated output from dmesg.

kanga:/home/anthony/mdadm # mdadm --stop /dev/md127
mdadm: stopped /dev/md127
kanga:/home/anthony/mdadm # ./mdadm --assemble /dev/md127 --verbose
--force --update=revert-reshape --invalid-backup
--backup-file=../raidbackup /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: looking for devices for /dev/md127
mdadm: No active reshape to revert on /dev/sdb
kanga:/home/anthony/mdadm # ./mdadm --assemble /dev/md127 --verbose
--force --update=revert-reshape --invalid-backup
--backup-file=../raidbackup /dev/sdc /dev/sdd /dev/sde
mdadm: looking for devices for /dev/md127
mdadm: Merging with already-assembled /dev/md/testarray
mdadm: No active reshape to revert on /dev/sdb
kanga:/home/anthony/mdadm # mdadm --assemble --scan
mdadm: Merging with already-assembled /dev/md/testarray
mdadm: /dev/md/testarray has been started with 2 drives and 2 spares.
kanga:/home/anthony/mdadm # mdadm - D /dev/md127
mdadm: An option must be given to set the mode before a second device
       (D) is listed
kanga:/home/anthony/mdadm # mdadm -D /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Fri Aug  5 18:16:24 2016
     Raid Level : raid5
     Array Size : 8380416 (7.99 GiB 8.58 GB)
  Used Dev Size : 8380416 (7.99 GiB 8.58 GB)
   Raid Devices : 2
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Aug 16 11:32:52 2016
          State : clean
 Active Devices : 2
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 2

         Layout : left-symmetric
     Chunk Size : 64K

           Name : kanga:testarray  (local to host kanga)
           UUID : cf52ebc0:886a35cd:688274b4:3f16096c
         Events : 160

    Number   Major   Minor   RaidDevice State
       4       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc

       2       8       48        -      spare   /dev/sdd
       3       8       64        -      spare   /dev/sde
kanga:/home/anthony/mdadm # mdadm --grow /dev/md127 --continue
kanga:/home/anthony/mdadm # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdd[2](S) sde[3](S) sdc[1] sdb[4]
      8380416 blocks super 1.2 level 5, 64k chunk, algorithm 2 [2/2] [UU]

unused devices: <none>
kanga:/home/anthony/mdadm # mdadm --stop /dev/md127
mdadm: stopped /dev/md127
kanga:/home/anthony/mdadm # ./mdadm --assemble /dev/md127 --verbose
--force --update=revert-reshape --invalid-backup
--backup-file=../raidbackup /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm: looking for devices for /dev/md127
mdadm: No active reshape to revert on /dev/sdb
kanga:/home/anthony/mdadm # ./mdadm --grow /dev/md127 --continue
mdadm: /dev/md127 is not an active md array - aborting
kanga:/home/anthony/mdadm # ./mdadm --assemble --scan
mdadm: Merging with already-assembled /dev/md/testarray
mdadm: /dev/md/testarray has been started with 2 drives and 2 spares.
kanga:/home/anthony/mdadm # ./mdadm --grow /dev/md127 --continue
kanga:/home/anthony/mdadm # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 sdd[2](S) sde[3](S) sdc[1] sdb[4]
      8380416 blocks super 1.2 level 5, 64k chunk, algorithm 2 [2/2] [UU]

unused devices: <none>
kanga:/home/anthony/mdadm # ps -fea | grep mdadm
root      1484  1855  0 11:42 pts/0    00:00:00 grep --color=auto mdadm
root      3586     1  0 09:53 ?        00:00:00 /sbin/mdadm --monitor -d
60 -m root@localhost --scan -c /etc/mdadm.conf
kanga:/home/anthony/mdadm #




00:00:00.000257 main     Log opened 2016-08-07T18:07:43.777555000Z
[22244.641971] 00:00:00.000481 main     OS Product: Linux
[22244.642016] 00:00:00.000526 main     OS Release: 4.1.15-8-default
[22244.642056] 00:00:00.000564 main     OS Version: #1 SMP PREEMPT Wed
Jan 20 16:41:00 UTC 2016 (0e3b3ab)
[22244.642206] 00:00:00.000604 main     Executable: /usr/sbin/VBoxService
00:00:00.000605 main     Process ID: 12443
00:00:00.000606 main     Package type: LINUX_64BITS_GENERIC (OSE)
[22244.643245] 00:00:00.001736 main     5.0.24_SUSE r108355 started.
Verbose level = 0
[22567.452608] SFW2-INext-DROP-DEFLT IN=eth0 OUT=
MAC=08:00:27:0b:0f:57:52:54:00:12:35:02:08:00 SRC=10.0.2.2 DST=10.0.2.15
LEN=576 TOS=0x10 PREC=0x00 TTL=64 ID=2252 PROTO=UDP SPT=67 DPT=68 LEN=556
[25196.727211] RAID conf printout:
[25196.727217]  --- level:5 rd:2 wd:2
[25196.727238]  disk 0, o:1, dev:sdb
[25196.727240]  disk 1, o:1, dev:sdc
[25196.727246] RAID conf printout:
[25196.727248]  --- level:5 rd:2 wd:2
[25196.727250]  disk 0, o:1, dev:sdb
[25196.727251]  disk 1, o:1, dev:sdc
[27195.944364] usb 2-1: USB disconnect, device number 2
[27196.148347] e1000: eth0 NIC Link is Down
[27196.572099] usb 2-1: new full-speed USB device number 3 using ohci-pci
[27196.832714] usb 2-1: New USB device found, idVendor=80ee, idProduct=0021
[27196.832721] usb 2-1: New USB device strings: Mfr=1, Product=3,
SerialNumber=0
[27196.832723] usb 2-1: Product: USB Tablet
[27196.832725] usb 2-1: Manufacturer: VirtualBox
[27196.841815] input: VirtualBox USB Tablet as
/devices/pci0000:00/0000:00:06.0/usb2/2-1/2-1:1.0/0003:80EE:0021.0002/input/input9
[27196.842185] hid-generic 0003:80EE:0021.0002: input,hidraw0: USB HID
v1.10 Mouse [VirtualBox USB Tablet] on usb-0000:00:06.0-1/input0
[27202.148448] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: RX
[27202.152800] SFW2-INext-DROP-DEFLT IN=eth0 OUT=
MAC=08:00:27:0b:0f:57:52:54:00:12:35:02:08:00 SRC=10.0.2.2 DST=10.0.2.15
LEN=576 TOS=0x10 PREC=0x00 TTL=64 ID=0 PROTO=UDP SPT=67 DPT=68 LEN=556
[27265.599076] BTRFS info (device sda2): relocating block group
16202596352 flags 36
[27266.319748] BTRFS info (device sda2): relocating block group
12410945536 flags 34
[27266.439613] BTRFS info (device sda2): relocating block group
16202596352 flags 34
[27266.542733] BTRFS info (device sda2): relocating block group
16236150784 flags 34
[27266.649202] BTRFS info (device sda2): relocating block group
16269705216 flags 34
[27266.757287] BTRFS info (device sda2): relocating block group
12444499968 flags 36
[27271.176254] BTRFS info (device sda2): found 1011 extents
[27271.426455] BTRFS info (device sda2): relocating block group
16303259648 flags 34
[27271.707481] BTRFS info (device sda2): relocating block group
14994636800 flags 36
[27277.442719] BTRFS info (device sda2): found 2417 extents
[27277.862520] BTRFS info (device sda2): relocating block group
13652459520 flags 36
[27283.603832] BTRFS info (device sda2): found 1696 extents
[27366.620131] md127: detected capacity change from 8581545984 to 0
[27366.620131] md: md127 stopped.
[27366.620131] md: unbind<sdb>
[27366.624318] md: export_rdev(sdb)
[27366.624318] md: unbind<sdd>
[27366.632107] md: export_rdev(sdd)
[27366.632137] md: unbind<sde>
[27366.640248] md: export_rdev(sde)
[27366.640248] md: unbind<sdc>
[27366.644324] md: export_rdev(sdc)
[27373.643627] md: md127 stopped.
[27373.775068] md: bind<sdb>
[27437.466705] md: array md127 already has disks!
[27437.467188] md: bind<sdc>
[27437.467593] md: bind<sde>
[27437.467976] md: bind<sdd>
[27437.509156] md/raid:md127: device sdc operational as raid disk 1
[27437.509161] md/raid:md127: device sdb operational as raid disk 0
[27437.509679] md/raid:md127: allocated 2250kB
[27437.520845] md/raid:md127: raid level 5 active with 2 out of 2
devices, algorithm 2
[27437.520850] RAID conf printout:
[27437.520852]  --- level:5 rd:2 wd:2
[27437.520854]  disk 0, o:1, dev:sdb
[27437.520856]  disk 1, o:1, dev:sdc
[27437.521657] md127: detected capacity change from 0 to 8581545984
[27437.523369] RAID conf printout:
[27437.523378]  --- level:5 rd:2 wd:2
[27437.523380]  disk 0, o:1, dev:sdb
[27437.523382]  disk 1, o:1, dev:sdc
[27437.523383] RAID conf printout:
[27437.523384]  --- level:5 rd:2 wd:2
[27437.523384]  disk 0, o:1, dev:sdb
[27437.523385]  disk 1, o:1, dev:sdc
[27587.102558] md127: detected capacity change from 8581545984 to 0
[27587.102569] md: md127 stopped.
[27587.102575] md: unbind<sdd>
[27587.108084] md: export_rdev(sdd)
[27587.108132] md: unbind<sde>
[27587.120239] md: export_rdev(sde)
[27587.120239] md: unbind<sdc>
[27587.128050] md: export_rdev(sdc)
[27587.128069] md: unbind<sdb>
[27587.132042] md: export_rdev(sdb)
[27603.862527] md: md127 stopped.
[27603.963036] md: bind<sdb>
anthony@kanga:/mnt/anthony>




* Re: read errors with md RAID5 array
From: Tim Small @ 2016-08-16 11:40 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <20160815145937.dsjtzpjdnwfpmxyv@metamorpher.de>


On 15/08/16 15:59, Andreas Klauer wrote:
> On Mon, Aug 15, 2016 at 02:12:23PM +0100, Tim Small wrote:
>> > I'm seeing some strange read errors whilst reading from an md RAID5
>> > array (3x 2TB SATA Drives, Intel AHCI controller).
> mdadm --examine and --examine-badblocks for all disks/partitions?
> 


Hi,

Thanks very much for your suggestions...


# for i in a c d ; do mdadm --examine-badblocks  /dev/sd${i}2 ; done
Bad-blocks on /dev/sda2:
          2321554488 for 512 sectors
          2321555000 for 512 sectors
          2321555512 for 152 sectors
Bad-blocks on /dev/sdc2:
             1656848 for 128 sectors
            28490768 for 512 sectors
            28491280 for 392 sectors
            28572344 for 120 sectors
            32760864 for 128 sectors
          2321554488 for 512 sectors
          2321555000 for 512 sectors
          2321555512 for 152 sectors
Bad-blocks on /dev/sdd2:
             1656848 for 128 sectors
            28490768 for 512 sectors
            28491280 for 392 sectors
            28572344 for 120 sectors
            32760864 for 128 sectors
          2321554488 for 512 sectors
          2321555000 for 512 sectors
          2321555512 for 152 sectors

I didn't know about the bad block functionality in md.  The mdadm manual
page doesn't say much, so is this the canonical document?

http://neil.brown.name/blog/20100519043730

Until recently, two of the drives (sda, sdc) were running a firmware
version which (as far as I can work out) made them occasionally lock up
and disappear from the OS (requiring a power cycle).  This firmware has
now been updated, so hopefully they'll now behave.

Degraded array reporting was also broken on this machine for a couple of
weeks due to an email misconfiguration (now fixed), so last week I found
it with sda (ML0220F30ZE35D) apparently missing from the machine, and
also with pending sectors on sdc (ML0220F31085KD).  The array rebuilt
quite quickly from the bitmap, and then I turned to trying to resolve
the pending sectors...

When the 'check' action didn't force the reallocations, I ran a 'repair'
action instead, thinking that perhaps the check wasn't attempting the
read+reconstruct+write for some reason.  In light of the bad block list
entries, I now assume that this was the wrong thing to do.

I'm not really sure from the blog post, under what circumstances a bad
block entry would end up being written to multiple devices in the array,
and under what circumstances it might be written to all devices in an
array?  There are no entries on these array members which appear on only
one array member, and some are present on all three drives - which seems
strange to me.
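
A minimal sketch (Python, not part of mdadm) of one way to cross-check the
listing above: it parses the --examine-badblocks output pasted earlier and
groups each range by the members that list it. The function name
parse_badblocks and the SAMPLE string are illustrative assumptions; only the
numbers come from the output above.

```python
import re
from collections import defaultdict

def parse_badblocks(text):
    """Map (start_sector, length) -> set of devices that list that range."""
    ranges = defaultdict(set)
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Bad-blocks on (\S+):", line)
        if header:
            device = header.group(1)
            continue
        entry = re.match(r"(\d+) for (\d+) sectors", line)
        if entry and device is not None:
            ranges[(int(entry.group(1)), int(entry.group(2)))].add(device)
    return ranges

# Output copied from the mdadm --examine-badblocks run above.
SAMPLE = """\
Bad-blocks on /dev/sda2:
          2321554488 for 512 sectors
          2321555000 for 512 sectors
          2321555512 for 152 sectors
Bad-blocks on /dev/sdc2:
             1656848 for 128 sectors
            28490768 for 512 sectors
            28491280 for 392 sectors
            28572344 for 120 sectors
            32760864 for 128 sectors
          2321554488 for 512 sectors
          2321555000 for 512 sectors
          2321555512 for 152 sectors
Bad-blocks on /dev/sdd2:
             1656848 for 128 sectors
            28490768 for 512 sectors
            28491280 for 392 sectors
            28572344 for 120 sectors
            32760864 for 128 sectors
          2321554488 for 512 sectors
          2321555000 for 512 sectors
          2321555512 for 152 sectors
"""

ranges = parse_badblocks(SAMPLE)
for (start, length), devices in sorted(ranges.items()):
    names = ", ".join(sorted(devices))
    print(f"{start:>12} +{length:<4} on {len(devices)} device(s): {names}")
```

Grouping by exact (start, length) is enough here because the entries are
identical across members; overlapping but unequal ranges would need interval
arithmetic instead.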

I suppose a combination of the "Firstly" and "Secondly" paragraphs would
result in the same block being marked as bad on two devices.

Will the detection of an inconsistency (e.g. via a check) mark the
stripe which was impacted as bad on all active array members?

FWIW, what I'd like to do in the future with this array, is to reshape
it into a 4 drive RAID6, and then grow it to a 5 drive RAID6, and
possibly replace one or both of sda (ML0220F30ZE35D) and sdc
(ML0220F31085KD).  However I'd like to try and do this without losing
any data which is currently on the array but marked as inaccessible.
I'd also like to avoid losing the entire array, if the reshape fails
when the array is in this state with unreadable portions.

In the meantime I'm trying to work out what data (if any) is now
inaccessible.  This is made slightly more interesting because this array
has 'bcache' sitting in front of it, so I might have good data in the
cache on the SSD which is marked bad/inaccessible on the raid5 md device.

Tim.


# for i in a c d ; do mdadm --examine  /dev/sd${i}2 ; done


/dev/sda2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : ad7ef7fa:e78344ea:a8778f06:abf07bf5
           Name : magic:2  (local to host magic)
  Creation Time : Wed Jul 15 14:43:06 2015
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 3885793456 (1852.89 GiB 1989.53 GB)
     Array Size : 3885793280 (3705.78 GiB 3979.05 GB)
  Used Dev Size : 3885793280 (1852.89 GiB 1989.53 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : fcc77733:e7e3582c:e8bff1ce:dd8d5232

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Aug 16 09:05:35 2016
  Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
       Checksum : d18d7379 - correct
         Events : 520706

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : ad7ef7fa:e78344ea:a8778f06:abf07bf5
           Name : magic:2  (local to host magic)
  Creation Time : Wed Jul 15 14:43:06 2015
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 3885793456 (1852.89 GiB 1989.53 GB)
     Array Size : 3885793280 (3705.78 GiB 3979.05 GB)
  Used Dev Size : 3885793280 (1852.89 GiB 1989.53 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : 55004cc7:b2e691de:c612612a:675ea2f3

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Aug 16 09:05:35 2016
  Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
       Checksum : 345a1f90 - correct
         Events : 520706

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x9
     Array UUID : ad7ef7fa:e78344ea:a8778f06:abf07bf5
           Name : magic:2  (local to host magic)
  Creation Time : Wed Jul 15 14:43:06 2015
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 3885793456 (1852.89 GiB 1989.53 GB)
     Array Size : 3885793280 (3705.78 GiB 3979.05 GB)
  Used Dev Size : 3885793280 (1852.89 GiB 1989.53 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=176 sectors
          State : clean
    Device UUID : 9abd8f30:29cb5ff5:2742646f:df56aa87

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Aug 16 09:05:35 2016
  Bad Block Log : 512 entries available at offset 72 sectors - bad blocks present.
       Checksum : 8d769b9e - correct
         Events : 520706

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

^ permalink raw reply

* Re: read errors with md RAID5 array
From: Tim Small @ 2016-08-16 12:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <CAJCQCtSDEWr6Tp0Hd+Jifw4h_aoM6Tqroav_vLmfwfqjoEe8Rg@mail.gmail.com>



On 15/08/16 17:23, Chris Murphy wrote:
>> > These were all reporting:
>> >
>> > SCT Error Recovery Control:
>> >            Read: Disabled
>> >           Write: Disabled
> 
> You failed to provide the value for the 2nd command. Is it something
> other than 30 for each device?

Sorry about that - it's 30 seconds for all array members.

# for i in a c d ; do cat /sys/block/sd${i}/device/timeout ; done
30
30
30
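
The same check can be sketched in Python, assuming the sysfs layout used by
the shell loop above. The helper names read_timeouts and below_threshold are
hypothetical, and the 180-second threshold is an assumption taken from
commonly given advice for drives without error recovery control; it is not
stated in this thread.

```python
from pathlib import Path

def read_timeouts(devices, sysfs_root="/sys/block"):
    """Return {device: timeout_seconds} for devices that expose a timeout file."""
    timeouts = {}
    for dev in devices:
        path = Path(sysfs_root) / dev / "device" / "timeout"
        if path.is_file():
            timeouts[dev] = int(path.read_text().strip())
    return timeouts

def below_threshold(timeouts, minimum):
    """List devices whose command timeout is shorter than `minimum` seconds."""
    return sorted(dev for dev, secs in timeouts.items() if secs < minimum)

# Devices missing on this machine are simply skipped.
current = read_timeouts(["sda", "sdc", "sdd"])
print(current)
print("shorter than 180s (assumed threshold):", below_threshold(current, 180))
```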

Thanks,

Tim.

^ permalink raw reply

* [PATCH] md: don't print the same repeated messages about delayed sync operation
From: Artur Paszkiewicz @ 2016-08-16 12:26 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

This fixes a long-standing bug that caused a flood of messages like:
"md: delaying data-check of md1 until md2 has finished (they share one
or more physical units)"

It can be reproduced like this:
1. Create at least 3 raid1 arrays on a pair of disks, each on different
   partitions.
2. Request a sync operation like 'check' or 'repair' on 2 arrays by
   writing to their md/sync_action attribute files. One operation should
   start and one should be delayed and a message like the above will be
   printed.
3. Issue a write to the third array. Each write will cause 2 copies of
   the message to be printed.

This happens when wake_up(&resync_wait) is called, usually by
md_check_recovery(). Then the delayed sync thread again prints the
message and is put to sleep. This patch adds a check in md_do_sync() to
prevent printing this message more than once for the same pair of
devices.

Reported-by: Sven Koehler <sven.koehler@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2c3ab6f..5096b48 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7862,6 +7862,7 @@ void md_do_sync(struct md_thread *thread)
 	 */
 
 	do {
+		int mddev2_minor = -1;
 		mddev->curr_resync = 2;
 
 	try_again:
@@ -7891,10 +7892,14 @@ void md_do_sync(struct md_thread *thread)
 				prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
 				if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 				    mddev2->curr_resync >= mddev->curr_resync) {
-					printk(KERN_INFO "md: delaying %s of %s"
-					       " until %s has finished (they"
-					       " share one or more physical units)\n",
-					       desc, mdname(mddev), mdname(mddev2));
+					if (mddev2_minor != mddev2->md_minor) {
+						mddev2_minor = mddev2->md_minor;
+						printk(KERN_INFO "md: delaying %s of %s"
+						       " until %s has finished (they"
+						       " share one or more physical units)\n",
+						       desc, mdname(mddev),
+						       mdname(mddev2));
+					}
 					mddev_put(mddev2);
 					if (signal_pending(current))
 						flush_signals(current);
-- 
2.9.2
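
The idea behind the check can be modelled outside the kernel. The following
is a simplified Python sketch, not the kernel code: each item in `wakeups`
stands for one pass through the wait loop in md_do_sync(), identified by the
minor number of the array that is blocking the sync. The function name
delayed_messages and the message text are illustrative assumptions.

```python
def delayed_messages(wakeups, mddev="md1", desc="data-check"):
    """Yield a message only when the blocking array differs from the last one reported."""
    last_minor = -1  # plays the role of mddev2_minor in the patch
    for blocker_minor in wakeups:
        if blocker_minor != last_minor:
            last_minor = blocker_minor
            yield (f"md: delaying {desc} of {mddev} until md{blocker_minor} "
                   "has finished (they share one or more physical units)")

# Four wake-ups caused by writes to an unrelated array: one message, not four.
for message in delayed_messages([2, 2, 2, 2]):
    print(message)
```

Remembering only the last reported minor keeps the state to a single integer;
if the blocking array changes, the message is printed again.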


^ permalink raw reply related

* Re: read errors with md RAID5 array
From: Andreas Klauer @ 2016-08-16 12:27 UTC (permalink / raw)
  To: Tim Small; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <d1ccb93d-6856-4b07-04ea-3c4ff9541d60@buttersideup.com>

On Tue, Aug 16, 2016 at 12:40:45PM +0100, Tim Small wrote:
> I didn't know about the bad block functionality in md.

I don't know how it's supposed to work either. I disable it everywhere.
(the option was --update=no-bbl but if I remember correctly it will 
accept that only if the bbl is empty)

I don't want arrays to have bad blocks. I don't want disks with bad blocks 
to be left in the array. I don't trust disks that develop defects or lose 
data, so the only choice for me is to replace them with new ones.

Silently ignoring disk errors, silently fixing errors in the background, 
keeping bad disks around: in my view this will only cause much 
more trouble later on.

I want to be notified about any and all problems md encounters so I can 
decide what to do... unfortunately not many people seem to share this 
view and the "read errors are normal" faction seems to be growing...

Identical bad blocks on multiple devices should be the reason why your 
md is reporting I/O errors: those blocks are already marked bad by md, 
so it does not even try to read them from the disks.

The last time I encountered these I ended up editing metadata 
or doing a (dangerous) re-create since I found no other way to 
get rid of them.

> In the meantime I'm trying to work out what data (if any) is now
> inaccessible.  This is made slightly more interesting because this array
> has 'bcache' sitting in front of it, so I might have good data in the
> cache on the SSD which is marked bad/inaccessible on the raid5 md device.

md won't be able to use that to repair by itself. Does bcache have some 
recovery mode that makes it dump back everything that is cached to disk? 
This comes with its own dangers, if the cache is wrong or other bugs...

Usually for such dangerous experiments you would use an overlay 
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
but I'm not sure how well that plays together with bcache either.

If you want to go with re-create in your case it would be something like

mdadm --create /dev/md42 --assume-clean \
    --metadata=1.2 --data-offset=128M --level=5 --chunk=512 --layout=ls \
    --raid-devices=3 /dev/overlay/sd{a,c,d}2

You have to specify all variables because mdadm defaults change over time.
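
As a hedged sanity check on those values, here is a small Python sketch that
derives --data-offset from the "Data Offset : 262144 sectors" line in the
--examine output earlier in the thread, and confirms that the reported Array
Size matches a 3-device RAID5. It assumes 512-byte sectors and that --examine
reports Array Size in KiB; the helper names are illustrative.

```python
SECTOR_BYTES = 512  # assumption: sizes in --examine are 512-byte sectors

def sectors_to_mib(sectors):
    """Convert a sector count to whole MiB."""
    return sectors * SECTOR_BYTES // (1024 * 1024)

def raid5_array_kib(used_dev_sectors, raid_devices):
    """RAID5 capacity in KiB: one device's worth of space holds parity."""
    return (raid_devices - 1) * used_dev_sectors * SECTOR_BYTES // 1024

data_offset_mib = sectors_to_mib(262144)       # Data Offset from --examine
array_kib = raid5_array_kib(3885793280, 3)     # Used Dev Size, Raid Devices

print(f"--data-offset={data_offset_mib}M")
print(f"expected Array Size: {array_kib} KiB")
```

If the computed size differs from what --examine shows after a re-create,
that is a sign one of the parameters does not match the original array.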

Then --stop and --assemble with --update=no-bbl before the horrors repeat...

Mount and verify files for correctness (files larger than disks*chunksize).

Then --add a fourth drive and --replace the one you said has bad sectors 
according to SMART. Book a flight to the Olympics in Rio and win a gold medal 
in hard disk long-cast throwing.

Once your RAID is running with three drives that are fully operational 
you can do your RAID6 or whatever.

If you don't have a backup, make one before doing anything else, 
while you still have some access to your stuff.

Regards
Andreas Klauer

^ permalink raw reply

* Re: Can't mount Old RHEL 6 Raid with new install of CentOS 7, now can't mount with original RHEL 6
From: John Dawson @ 2016-08-16 17:35 UTC (permalink / raw)
  To: Chris Murphy, Linux-RAID
In-Reply-To: <CAJCQCtQfBjhKh2Fu+W76JyXODFJJ9Qxjbyn7KK0i3ALs1xREgQ@mail.gmail.com>

Sorry.. my bad... I am not using RHEL... I am using CentOS...

     mdadm --version = mdadm - v3.2.3 - 23rd December 2011


When I try to assemble the raid with:

      sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1

I get the following:

      mdadm: /dev/md127 has been started with 2 drives.

When I try to mount

      sudo mount /dev/md127 /proj

I get the following:

      mount: you must specify the filesystem type

When I specify the filesystem type,

      sudo mount -t ext4 /dev/md127 /proj/

I get the following:

      mount: wrong fs type, bad option, bad superblock on /dev/md127,
             missing codepage or helper program, or other error
             In some cases useful info is found in syslog - try
             dmesg | tail  or so


===============================================
dmesg | tail
===============================================
md: md127 stopped.
md: bind<sdb1>
md: bind<sdc1>
bio: create slab <bio-1> at 1
md/raid0:md127: md_size is 3907039232 sectors.
md: RAID0 configuration for md127 - 1 zone
md: zone0=[sdc1/sdb1]
       zone-offset=         0KB, device-offset=         0KB, size=1953519616KB
EXT4-fs (md127): VFS: Can't find ext4 filesystem

===============================================
"sudo mdadm -D /dev/md127" results
===============================================

/dev/md127:
         Version : 1.2
   Creation Time : Fri Aug  5 16:46:10 2016
      Raid Level : raid0
      Array Size : 1953519616 (1863.02 GiB 2000.40 GB)
    Raid Devices : 2
   Total Devices : 2
     Persistence : Superblock is persistent

     Update Time : Fri Aug  5 16:46:10 2016
           State : clean
  Active Devices : 2
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 0

      Chunk Size : 512K

            Name : mymachine:0  (local to host mymachine)
            UUID : 8217dfb5:a97a15df:94a85926:3fea6697
          Events : 0

     Number   Major   Minor   RaidDevice State
        0       8       33        0      active sync   /dev/sdc1
        1       8       17        1      active sync   /dev/sdb1



===============================================
"sudo mdadm --examine /dev/sdb1" results
===============================================

/dev/sdb1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 8217dfb5:a97a15df:94a85926:3fea6697
            Name : mymachine:0  (local to host mymachine)
   Creation Time : Fri Aug  5 16:46:10 2016
      Raid Level : raid0
    Raid Devices : 2

  Avail Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
     Data Offset : 2048 sectors
    Super Offset : 8 sectors
           State : clean
     Device UUID : d9bab0d7:793e5168:4457d25b:24614a41

     Update Time : Fri Aug  5 16:46:10 2016
        Checksum : 744c405e - correct
          Events : 0

      Chunk Size : 512K

    Device Role : Active device 1
    Array State : AA ('A' == active, '.' == missing)


===============================================
"sudo mdadm --examine /dev/sdc1" results
===============================================

/dev/sdc1:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 8217dfb5:a97a15df:94a85926:3fea6697
            Name : mymachine:0  (local to host mymachine)
   Creation Time : Fri Aug  5 16:46:10 2016
      Raid Level : raid0
    Raid Devices : 2

  Avail Dev Size : 1953519616 (931.51 GiB 1000.20 GB)
     Data Offset : 2048 sectors
    Super Offset : 8 sectors
           State : clean
     Device UUID : a0dfb805:18718fed:6985075a:12fbb196

     Update Time : Fri Aug  5 16:46:10 2016
        Checksum : 7b2f5fd6 - correct
          Events : 0

      Chunk Size : 512K

    Device Role : Active device 0
    Array State : AA ('A' == active, '.' == missing)




On 2016-08-11 20:38, Chris Murphy wrote:
> On Thu, Aug 11, 2016 at 6:19 AM, John Dawson <linux@celticblues.com> wrote:
>> I have a machine which had a drive with RHEL 6.X installed with a raid
>> device setup on separate disk(s). Installed a new hard disk in machine and
>> installed CentOS 7. CentOS 7 wouldn't mount the raid. Put the old drive
>> back in and now RHEL 6.X won't mount the raid. Is the raid permanently
>> hosed? Can I get the data on it back? How? Thx.
> 
> RHEL comes with a support contract so you should contact Red Hat about
> that part.
> 
> Also, not anywhere near enough information has been provided, almost
> like you think what you're experiencing is a widely known problem with
> a known solution. But it isn't. So you should provide mdadm -E
> information for each member block device, whether or not the array
> assembles manually, if not what error do you get in user and kernel
> space, and what command you're using to mount the array that you say
> fails, and what the error message is.
> 
> Also include mdadm version on both systems because few people will
> have any idea what mdadm version is on the particular installation of
> RHEL and CentOS you're using, as these things aren't standardized at
> all across distros.


^ permalink raw reply

* Re: Can't mount Old RHEL 6 Raid with new install of CentOS 7, now can't mount with original RHEL 6
From: Chris Murphy @ 2016-08-16 17:58 UTC (permalink / raw)
  To: John Dawson; +Cc: Chris Murphy, Linux-RAID
In-Reply-To: <57d23adc6db12d6a365dd2c6f50d4bf5@celticblues.com>

On Tue, Aug 16, 2016 at 11:35 AM, John Dawson <linux@celticblues.com> wrote:

> ===============================================
> "sudo mdadm -D /dev/md127" results
> ===============================================
>
> /dev/md127:
>         Version : 1.2
>   Creation Time : Fri Aug  5 16:46:10 2016

Did you really create this array 10 days ago with CentOS 6 and then
upgrade to CentOS 7 and it didn't work? Or did you happen to try to
fix the problem by doing mdadm --create ?
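
For reference, a rough Python sketch of how the fields quoted above can be
read: it parses the Creation Time and Update Time strings from the pasted
mdadm output and flags an array whose superblock has not changed since it was
written (same timestamps, Events 0). The helper names and the "report date"
taken from the mail header are assumptions for illustration; time zones are
ignored.

```python
from datetime import datetime

MDADM_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Fri Aug  5 16:46:10 2016"

def parse_mdadm_time(value):
    """Parse a timestamp as printed by mdadm -D / --examine."""
    return datetime.strptime(value.strip(), MDADM_TIME_FORMAT)

def looks_freshly_created(created, updated, events):
    """True when the superblock shows no activity since creation."""
    return created == updated and events == 0

created = parse_mdadm_time("Fri Aug  5 16:46:10 2016")
updated = parse_mdadm_time("Fri Aug  5 16:46:10 2016")
reported = datetime(2016, 8, 16, 17, 35)  # date of the report (mail header)

print("age in days at time of report:", (reported - created).days)
print("untouched since creation:", looks_freshly_created(created, updated, 0))
```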





-- 
Chris Murphy

^ permalink raw reply

