Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()
From: Christoph Hellwig @ 2017-02-13 13:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-2-git-send-email-tom.leiming@gmail.com>

On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote:
> md still need bio clone(not the fast version) for behind write,
> and it is more efficient to use bio_clone_bioset_partial().
> 
> The idea is simple and just copy the bvecs range specified from
> parameters.

Given how few users bio_clone_bioset has I wonder if we shouldn't
simply add the two new arguments to it instead of adding another
indirection.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v1 2/5] md/raid1: use bio_clone_bioset_partial() in case of write behind
From: Christoph Hellwig @ 2017-02-13 13:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-3-git-send-email-tom.leiming@gmail.com>

> +		struct bio *mbio = NULL;
> +		int offset;
>  		if (!r1_bio->bios[i])
>  			continue;
>  
> -		mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> -		bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector,
> -			 max_sectors);
> +		offset = r1_bio->sector - bio->bi_iter.bi_sector;

I think offset should be a sector_t.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v1 4/5] md: remove unnecessary check on mddev
From: Christoph Hellwig @ 2017-02-13 13:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-5-git-send-email-tom.leiming@gmail.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v1 5/5] md: fast clone bio in bio_clone_mddev()
From: Christoph Hellwig @ 2017-02-13 13:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, Jens Axboe, linux-kernel, linux-raid, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <1486724177-14817-6-git-send-email-tom.leiming@gmail.com>

Please use bio_clone_fast directly and kill the wrapper.

^ permalink raw reply

* Re: ANNOUNCE: mdadm 4.0 - A tool for managing md Soft RAID under Linux
From: Jes Sorensen @ 2017-02-13 17:44 UTC (permalink / raw)
  To: zhilong, Bruce Dubbs, Brown, Neil
  Cc: Guoqing Jiang, Shaohua Li, linux-raid@vger.kernel.org, LKML
In-Reply-To: <f578e46d-8e76-0335-332b-3d90aa7848cf@suse.com>

On 02/13/2017 12:54 AM, zhilong wrote:
> On 02/13/2017 01:08 PM, zhilong wrote:
>> Hi, Jes;
>> On 01/13/2017 12:41 AM, Jes Sorensen wrote:
>>> On 01/11/17 23:24, Guoqing Jiang wrote:
>>>>
>>>> On 01/12/2017 12:59 AM, Jes Sorensen wrote:
>>>>> On 01/11/17 11:52, Shaohua Li wrote:
>>>>>> On Tue, Jan 10, 2017 at 11:49:04AM -0600, Bruce Dubbs wrote:
>>>>>>> Jes Sorensen wrote:
>>>>>>>> I am pleased to announce the availability of
>>>>>>>>       mdadm version 4.0
>>>>>>>>
>>>>>>>> It is available at the usual places:
>>>>>>>> http://www.kernel.org/pub/linux/utils/raid/mdadm/
>>>>>>>> and via git at
>>>>>>>> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>>>>>>>>       http://git.kernel.org/cgit/utils/mdadm/
>>>>>>>>
>>>>>>>> The update in major version number primarily indicates this is a
>>>>>>>> release by it's new maintainer. In addition it contains a large
>>>>>>>> number
>>>>>>>> of fixes in particular for IMSM RAID and clustered RAID
>>>>>>>> support.  In
>>>>>>>> addition this release includes support for IMSM 4k sector drives,
>>>>>>>> failfast and better documentation for journaled RAID.
>>>>>>> Thank you for the new release.  Unfortunately I get 9 failures
>>>>>>> running the
>>>>>>> test suite:
>>>>>>>
>>>>>>> tests/00raid1...          FAILED
>>>>>>> tests/07autoassemble...   FAILED
>>>>>>> tests/07changelevels...   FAILED
>>>>>>> tests/07revert-grow...    FAILED
>>>>>>> tests/07revert-inplace... FAILED
>>>>>>> tests/07testreshape5...   FAILED
>>>>>>> tests/10ddf-fail-twice... FAILED
>>>>>>> tests/20raid5journal...   FAILED
>>>>>>> tests/10ddf-incremental-wrong-order...  FAILED
>>>>>> Yep, several tests usually fail. It appears some checks aren't always
>>>>>> good.  At
>>>>>> least the 'check' function for reshape/resync isn't reliable in my
>>>>>> test, I saw
>>>>>> 07changelevelintr fails frequently.
>>>>> That is my experience as well - some of them are affected by the
>>>>> kernel
>>>>> version too. We probably need to look into making them more reliable.
>>>> If possible, it could be a potential topic for lsf/mm raid
>>>> discussion as
>>>> Coly suggested
>>>> in previous mail.
>>>>
>>>> Is current test can run the test for different raid level, say, "./test
>>>> --raidtype=raid1" could
>>>> execute all the *r1* tests, does it make sense to do it if we don't
>>>> support it now.
>>> We could have a discussion about this at LSF/MM, if someone is willing
>>> to sponsor getting it accepted and we can get the right people there.
>>>
>>> Note that the test suite also allows you to run all the 01 tests by
>>> specifying ./test 01. I do like to see the test suite improved and made
>>> more resilient.
>> I'm sorry for my late response, I'm just back to work today from
>> vacation. In the past months, I learned and worked for cluster-md
>> feature,
>> and I have draft one test suit for cluster-md feature. please refer to
>> https://github.com/zhilongliu/clustermd-autotest
>> I'm very willing to do something for improving mdadm testing part,
>> also wanna improve cluster-md test suit, welcome all comments for it.
>>
>   I would keep making cluster-md test scripts more and more stable, and
> finally apply to integrate into mdadm test part. :-)

I'd very much like to see work to improve the test suite, so that is great.

Once you have the test suites ready, please post patches and I shall be 
happy to implement them.

Please make sure to test that they don't break if people haven't built 
cluster support into their kernels.

Cheers,
Jes


^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: Shaohua Li @ 2017-02-13 18:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <8760ke4dzm.fsf@notabene.neil.brown.name>

On Mon, Feb 13, 2017 at 08:49:33PM +1100, Neil Brown wrote:
> On Thu, Feb 09 2017, Shaohua Li wrote:
> 
> > On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
> >> On Tue, Feb 07 2017, Shaohua Li wrote:
> >> 
> >> > Currently MD is rebusing some bio fields. To remove the hack, we attach
> >> > extra data to each bio. Each personablity can attach extra data to the
> >> > bios, so we don't need to rebuse bio fields.
> >> 
> >> I must say that I don't really like this approach.
> >> Temporarily modifying ->bi_private and ->bi_end_io seems
> >> .... intrusive.   I suspect it works, but I wonder if it is really
> >> robust in the long term.
> >> 
> >> How about a different approach..  Your main concern with my first patch
> >> was that it called md_write_start() and md_write_end() much more often,
> >> and these performed atomic ops on "global" variables, particular
> >> writes_pending.
> >> 
> >> We could change writes_pending to a per-cpu array which we only count
> >> occasionally when needed.  As writes_pending is updated often and
> >> checked rarely, a per-cpu array which is summed on demand seems
> >> appropriate.
> >> 
> >> The following patch is an early draft - it doesn't obviously fail and
> >> isn't obviously wrong to me.  There is certainly room for improvement
> >> and may be bugs.
> >> Next week I'll work on collection the re-factoring into separate
> >> patches, which are possible good-to-have anyway.
> >
> > For your first patch, I don't have much concern. It's ok to me. What I don't
> > like is the bi_phys_segments handling part. The patches add a lot of logic to
> > handle the reference count. They should work, but I'd say it's not easy to
> > understand and could be error prone. What we really need is a reference count
> > for the bio, so let's just add a reference count. That's my logic and it's
> > simple.
> 
> We already have two reference counts, and you want to add a third one.
> 
> bi_phys_segments is currently used for two related purposes.
> It counts the number of stripe_heads currently attached to the bio so
> that when the count reaches zero:
>  1/ ->writes_pending can be decremented
>  2/ bio_endio() can be called.
> 
> When the code was written, the __bi_remaining counter didn't exist.  Now
> it does and it is integrated with bio_endio() so it should make the code
> easier to understand if we just use bio_endio() rather and doing our own
> accounting.
> 
> That just leaves '1'.  We can easily decrement ->writes_pending directly
> instead of decrementing a per-bio refcount, and then when it reaches
> zero, decrement ->writes_pending.  As you pointed out, that comes with a
> cost.  If ->writes_pending is changed to a per-cpu array which is summed
> on demand, the cost goes away.
> 
> Having an extra refcount in the bio just adds a level of indirection
> that doesn't (that I can see) provide actual value.

Ok, fair enough. I do think an explict counter in the driver side will help us
a lot, eg, we can better control when to endio and do something there (for
example the current blk trace, or something we want to add in the future). But
I don't insist currently.

For the patches, can you repost? I think:
- patch 2 missed md_write_start for make_discard_request
- It's unnecessary to zero bi_phys_segments in patch 5. And raid5-cache need do
  the same change of bio_endio.
For the md_write_start optimization, we can do it later.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH --resend 0/2] fix writesame
From: Shaohua Li @ 2017-02-13 19:55 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

This is the writesame issue fix reported in bugzilla 118581. Sitsofe doesn't
reply to me if he tested these, but they do work for me with a scsi_debug test.
I'm going to apply them for 4.11 if no objection.

Shaohua Li (2):
  md: disable WRITE SAME if it fails for linear/raid0
  md/multipath: disable WRITE SAME if it fails for multipath

 drivers/md/linear.c    |  6 +++++-
 drivers/md/md.c        | 45 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h        |  2 ++
 drivers/md/multipath.c |  4 ++++
 drivers/md/raid0.c     |  6 +++++-
 5 files changed, 61 insertions(+), 2 deletions(-)

-- 
2.9.3


^ permalink raw reply

* [PATCH --resend 1/2] md: disable WRITE SAME if it fails for linear/raid0
From: Shaohua Li @ 2017-02-13 19:55 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb
In-Reply-To: <cover.1487015420.git.shli@fb.com>

This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/linear.c |  6 +++++-
 drivers/md/md.c     | 45 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h     |  2 ++
 drivers/md/raid0.c  |  6 +++++-
 4 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 26a73b2..bebc834 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -291,7 +291,11 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
-			generic_make_request(split);
+			if (bio_op(split) == REQ_OP_WRITE_SAME)
+				generic_make_request(md_writesame_setup(mddev,
+								split));
+			else
+				generic_make_request(split);
 		}
 	} while (split != bio);
 	return;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 13020e5..7354f0b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -312,6 +312,51 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+struct md_writesame_data {
+	struct bio *orig_bio;
+	struct mddev *mddev;
+	struct bio cloned_bio;
+};
+
+static void md_writesame_endio(struct bio *bio)
+{
+	struct md_writesame_data *data = bio->bi_private;
+
+	if (bio->bi_error == -EREMOTEIO &&
+	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
+		data->mddev->queue->limits.max_write_same_sectors = 0;
+
+	data->orig_bio->bi_error = bio->bi_error;
+	bio_endio(data->orig_bio);
+
+	kfree(data);
+}
+
+struct bio *md_writesame_setup(struct mddev *mddev, struct bio *bio)
+{
+	struct md_writesame_data *data;
+	struct bio *cloned_bio;
+
+	/*
+	 * this failure means we ignore a chance to handle writesame failure,
+	 * which isn't critcal, we can handle the failure if new writesame IO
+	 * comes
+	 */
+	data = kmalloc(sizeof(*data), GFP_NOIO | __GFP_NORETRY);
+	if (!data)
+		return bio;
+	cloned_bio = &data->cloned_bio;
+	data->mddev = mddev;
+	data->orig_bio = bio;
+	bio_init(cloned_bio, NULL, 0);
+	__bio_clone_fast(cloned_bio, bio);
+
+	cloned_bio->bi_private = data;
+	cloned_bio->bi_end_io = md_writesame_endio;
+	return cloned_bio;
+}
+EXPORT_SYMBOL_GPL(md_writesame_setup);
+
 /* mddev_suspend makes sure no new requests are submitted
  * to the device, and that any requests that have been submitted
  * are completely handled.
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a51403..5c983e9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -710,4 +710,6 @@ static inline void mddev_clear_unsupported_flags(struct mddev *mddev,
 {
 	mddev->flags &= ~unsupported_flags;
 }
+
+extern struct bio *md_writesame_setup(struct mddev *mddev, struct bio *bio);
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 848365d..e38636e 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -503,7 +503,11 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
-			generic_make_request(split);
+			if (bio_op(split) == REQ_OP_WRITE_SAME)
+				generic_make_request(md_writesame_setup(mddev,
+							split));
+			else
+				generic_make_request(split);
 		}
 	} while (split != bio);
 }
-- 
2.9.3


^ permalink raw reply related

* [PATCH --resend 2/2] md/multipath: disable WRITE SAME if it fails for multipath
From: Shaohua Li @ 2017-02-13 19:55 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb
In-Reply-To: <cover.1487015420.git.shli@fb.com>

This is the part for multipath. Since multipatch already attaches
private data into original bio, we just disable write same there.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/multipath.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index aa8c4e5c..5ee9579 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -89,6 +89,10 @@ static void multipath_end_request(struct bio *bio)
 	struct mpconf *conf = mp_bh->mddev->private;
 	struct md_rdev *rdev = conf->multipaths[mp_bh->path].rdev;
 
+	if (bio_op(bio) == REQ_OP_WRITE_SAME && bio->bi_error == -EREMOTEIO &&
+	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
+		mp_bh->mddev->queue->limits.max_write_same_sectors = 0;
+
 	if (!bio->bi_error)
 		multipath_end_bh_io(mp_bh, 0);
 	else if (!(bio->bi_opf & REQ_RAHEAD)) {
-- 
2.9.3


^ permalink raw reply related

* [PATCH v4] DM: dm-inplace-compress: inplace compressed DM target
From: Ram Pai @ 2017-02-13 20:42 UTC (permalink / raw)
  To: linux-doc, linux-kernel, dm-devel, linux-raid
  Cc: agk, snitzer, corbet, shli, hbabu, linuxram

This  patch   provides   a   generic  device-mapper  compression device.
Originally written by Shaohua Li.
https://www.redhat.com/archives/dm-devel/2013-December/msg00143.html

I have optimized and hardened the code.

Testing:
-------
This compression block device  is   tested in  the  following scenarios
a) backing a ext4 filesystem 
b) backing swap
Its further stress tested on PPC64 and x86 system.

Version v4:
	fixed kbuild errors; hopefully they are all taken care off.
       	  - no reference to zero_page
	  - convert all divide and mod operations to bit operations

Version v3:
	Fixed  sector  alignment  bugs  exposed  while   testing on x86.
	Explicitly set the maximum request size  to 128K.  Without which
	range  locking  failed,  causing  I/Os  to  stamp   each  other.
	Fixed an occasional data corruption  caused by wrong size of the
	compression buffer.
	Added  a   parameter   while  creation  of  the   block  device, 
	to  not  sleep   during  memory  allocations. This can be useful
       	if the device is used as a swap device.

Version v2:
	All patches are merged into a single patch.
	Major code re-arrangement.
	Data and metablocks allocated based on the length of the device
	map rather than the size of the backing device.
        Size   of   each entry   in  the bitmap array is explicitly set
	 to 32bits.
	Attempt  to  reuse  the  provided  bio  buffer  space   instead
	 of allocating a new one.

Version v1:
	Comments from Alasdair have been incorporated.
	https://www.redhat.com/archives/dm-devel/2013-December/msg00144.html

Your  comments  to  improve  the  code  is  very  much  appreciated. Will be 
requesting for merge if I do not receive any comments.

Ram Pai (1):
  From: Shaohua Li <shli@kernel.org>

 .../device-mapper/dm-inplace-compress.txt          |  155 ++
 drivers/md/Kconfig                                 |    6 +
 drivers/md/Makefile                                |    2 +
 drivers/md/dm-inplace-compress.c                   | 2214 ++++++++++++++++++++
 drivers/md/dm-inplace-compress.h                   |  187 ++
 5 files changed, 2564 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-inplace-compress.txt
 create mode 100644 drivers/md/dm-inplace-compress.c
 create mode 100644 drivers/md/dm-inplace-compress.h

-- 
1.8.3.1


^ permalink raw reply

* [PATCH v4 1/1] DM: inplace compressed DM target
From: Ram Pai @ 2017-02-13 20:42 UTC (permalink / raw)
  To: linux-doc, linux-kernel, dm-devel, linux-raid
  Cc: agk, snitzer, corbet, shli, hbabu, linuxram
In-Reply-To: <1487018545-5061-1-git-send-email-linuxram@us.ibm.com>

This is a simple DM target supporting inplace compression. Its best
suited for SSD. The underlying disk must support 512B sector size.
The target only supports 4k sector size.

Disk layout:
|super|...meta...|..data...|

Store unit is 4k (a block). Super is 1 block, which stores meta  and
data size and compression algorithm. Meta is a bitmap. For each data
 block, there are 5 bits meta.

Data:

Data of   a block is compressed. Compressed  data  is round up to 512B,
which is the payload. In disk, payload is  stored at  the beginning of
logical sector  of the block. Let's look  at an example.  Say we store
data to block A, which  is in sector  B(A*8), its orginal  size is 4k,
compressed size is  1500.    Compressed     data (CD)  will  use three
sectors (512B). The three  sectors  are the  payload. Payload  will be
stored at sector B.

---------------------------------------------------
... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
---------------------------------------------------
    ^B    ^B+1  ^B+2                  ^B+7 ^B+8

For this block, we will not use sector B+3 to B+7 (a hole). We use four
meta  bits  to  present payload  size. The compressed size (1500) isn't
stored in meta directly. Instead, we  store  it  at  the last 32bits of
payload. In this  example, we store it at the  end  of  sector  B+2. If
compressed size + sizeof(32bits)  crosses a   sector, payload size will
increase one sector.  If payload  uses 8 sectors, we store uncompressed
data directly.

If IO size is bigger than one block, we can store the data as an extent.
Data of the  whole extent will compressed and stored in the similar way
like above.  The first  block of the extent is the head, all others are
the tail.  If extent is 1 block,  the  block  is head. We have 1 bit of
meta to present if a  block  is  head  or  tail. If 4 meta bits of head
block can't  store  extent payload size, we will borrow tail block meta
bits to  store  payload  size.   Max  allowd extent size is 128k, so we
don't compress/decompress too big size data.

Meta:
Modifying   data   will modify meta too. Meta will be written(flush) to
disk   depending   on   meta   write   policy. We support writeback and
writethrough mode.  In  writeback mode, meta will be written to disk in
an interval or a  FLUSH  request.  In  writethrough mode, data and meta
data will be written to disk together.

Advantages:

1. Simple. Since  we  store  compressed  data  in-place,  we don't need
   complicated disk data management.
2. Efficient. For  each  4k, we only need 5 bits meta. 1T data will use
less than 200M meta, so we  can  load  all meta into memory. And actual
compression size is in payload. So   if  IO doesn't need RMW and we use
writeback meta flush, we don't  need  extra IO for meta.

Disadvantages:

1. hole. Since we   store  compressed data in-place, there are a lot of
   holes (in above  example,  B+3 - B+7) Hole can impact IO, because we
   can't do IO merge.

2. 1:1 size. Compression  doesn't  change disk  size. If disk is 1T, we
   can only store 1T data even we do compression.

But this is for SSD only. Generally SSD firmware has a FTL layer to map
disk  sectors  to flash nand. High end SSD firmware has filesystem-like
FTL.

1. hole. Disk has a lot of holes, but SSD FTL   can   still  store data
   contiguous in nand. Even if we can't do IO   merge in  OS layer, SSD
   firmware can do it.

2. 1:1 size. On one side, we write compressed data to SSD, which means
   less  data is  written to SSD. This will be very helpful to improve
   SSD garbage collection, and  so write speed and life cycle. So even
   this is a problem, the target  is still helpful. On the other side,
   advanced SSD FTL can easily do thin provision. For example, if nand
   is   1T   and   we   let   SSD   report   it   as   2T,   and   use
   the  SSD  as  compressed target. In such SSD, we don't have the 1:1
   size issue.

So even if   SSD   FTL   cannot   map   non-contiguous disk sectors to
contiguous nand, the compression target can still function well.


Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Ram Pai <ram.n.pai@gmail.com>
---
 .../device-mapper/dm-inplace-compress.txt          |  155 ++
 drivers/md/Kconfig                                 |    6 +
 drivers/md/Makefile                                |    2 +
 drivers/md/dm-inplace-compress.c                   | 2220 ++++++++++++++++++++
 drivers/md/dm-inplace-compress.h                   |  187 ++
 5 files changed, 2570 insertions(+)
 create mode 100644 Documentation/device-mapper/dm-inplace-compress.txt
 create mode 100644 drivers/md/dm-inplace-compress.c
 create mode 100644 drivers/md/dm-inplace-compress.h

diff --git a/Documentation/device-mapper/dm-inplace-compress.txt b/Documentation/device-mapper/dm-inplace-compress.txt
new file mode 100644
index 0000000..1835369
--- /dev/null
+++ b/Documentation/device-mapper/dm-inplace-compress.txt
@@ -0,0 +1,155 @@
+Device-Mapper's "inplace-compress" target provides inplace compression of block
+devices using the kernel compression API.
+
+Parameters: <device path> \
+	[ <#opt_params writethough> ]
+	[ <#opt_params <writeback> <meta_commit_delay> ]
+	[ <#opt_params compressor> <type> ]
+	[ <#opt_params critical>  ]
+
+
+<writethrough>
+    Write data and metadata together.
+
+<writeback> <meta_commit_delay>
+    Write metadata every 'meta_commit_delay' interval.
+
+<device path>
+    This is the device that is going to be used as backend and contains the
+    compressed data.  You can specify it as a path like /dev/xxx or a device
+    number <major>:<minor>.
+
+<compressor> <type>
+    Choose the compressor algorithm. 'lzo' and '842'
+    compressors are supported.
+
+<critical>
+    Block device used in critical path.
+
+Example scripts
+===============
+
+create a inplace-compress block device using lzo compression. Write metadata
+and data together.
+[[
+#!/bin/sh
+# Create a inplace-compress device using dmsetup
+device=$1  #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress $device
+		writethrough compressor lzo"
+]]
+
+
+create a inplace-compress block device using nx-842 hardware compression. Write
+metadata periodially every 5sec.
+
+[[
+#!/bin/sh
+# Create a inplace-compress device using dmsetup
+device=$1  #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress $device
+		writeback 5 compressor 842"
+]]
+
+
+Create a inplace-compress block device. Device is used in critical path;
+ex: swap device.
+
+[[
+#!/bin/sh
+# Create a inplace-compress device using dmsetup
+device=$1  #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress $device critical"
+]]
+
+Description
+===========
+    This is a simple DM target supporting inplace compression. Its best suited for
+    SSD. The underlying disk must support 512B sector size, the target only
+    supports 4k sector size.
+
+    Disk layout:
+    |super|...meta...|..data...|
+
+    Store unit is 4k (a block). Super is 1 block, which stores meta and data
+    size and compression algorithm. Meta is a bitmap. For each data block,
+    there are 5 bits meta.
+
+    Data:
+
+    Data of a block is compressed. Compressed data is round up to 512B, which
+    is the payload. In disk, payload is stored at the beginning of logical
+    sector of the block. Let's look at an example. Say we store data to block
+    A, which is in sector B(A*8), its orginal size is 4k, compressed size is
+    1500. Compressed data (CD) will use 3 sectors (512B). The 3 sectors are the
+    payload. Payload will be stored at sector B.
+
+    ---------------------------------------------------
+    ... | CD1 | CD2 | CD3 |   |   |   |   |    | ...
+    ---------------------------------------------------
+        ^B    ^B+1  ^B+2                  ^B+7 ^B+8
+
+    For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta
+    bits to present payload size. The compressed size (1500) isn't stored in
+    meta directly. Instead, we store it at the last 32bits of payload. In this
+    example, we store it at the end of sector B+2. If compressed size +
+    sizeof(32bits) crosses a sector, payload size will increase one sector. If
+    payload uses 8 sectors, we store uncompressed data directly.
+
+    If IO size is bigger than one block, we can store the data as an extent.
+    Data of the whole extent will compressed and stored in the similar way like
+    above.  The first block of the extent is the head, all others are the tail.
+    If extent is 1 block, the block is head. We have 1 bit of meta to present
+    if a block is head or tail. If 4 meta bits of head block can't store extent
+    payload size, we will borrow tail block meta bits to store payload size.
+    Max allowd extent size is 128k, so we don't compress/decompress too big
+    size data.
+
+    Meta:
+    Modifying data will modify meta too. Meta will be written(flush) to disk
+    depending on meta write policy. We support writeback and writethrough mode.
+    In writeback mode, meta will be written to disk in an interval or a FLUSH
+    request.  In writethrough mode, data and meta data will be written to disk
+    together.
+
+    Advantages:
+
+    1. Simple. Since we store compressed data in-place, we don't need complicated
+    disk data management.
+    2. Efficient. For each 4k, we only need 5 bits meta. 1T data will use less than
+    200M meta, so we can load all meta into memory. And actual compression size is
+    in payload. So if IO doesn't need RMW and we use writeback meta flush, we don't
+    need extra IO for meta.
+
+    Disadvantages:
+
+    1. hole. Since we store compressed data in-place, there are a lot of holes
+    (in above example, B+3 - B+7) Hole can impact IO, because we can't do IO
+    merge.
+
+    2. 1:1 size. Compression doesn't change disk size. If disk is 1T, we can
+    only store 1T data even we do compression.
+
+    But this is for SSD only. Generally SSD firmware has a FTL layer to map
+    disk sectors to flash nand. High end SSD firmware has filesystem-like FTL.
+
+    1. hole. Disk has a lot of holes, but SSD FTL can still store data continuous
+    in nand. Even if we can't do IO merge in OS layer, SSD firmware can do it.
+
+    2. 1:1 size. On one side, we write compressed data to SSD, which means less
+    data is written to SSD. This will be very helpful to improve SSD garbage
+    collection, and so write speed and life cycle. So even this is a problem, the
+    target is still helpful. On the other side, advanced SSD FTL can easily do thin
+    provision. For example, if nand is 1T and we let SSD report it as 2T, and use
+    the SSD as compressed target. In such SSD, we don't have the 1:1 size issue.
+
+    So even if SSD FTL cannot map non-continuous disk sectors to continuous nand,
+    the compression target can still function well.
+
+
+Author:
+	Shaohua Li <shli@fusionio.com>
+	Ram Pai <linuxram@us.ibm.com>
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da..2eece2a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -508,4 +508,10 @@ config DM_LOG_WRITES
 
 	  If unsure, say N.
 
+config DM_INPLACE_COMPRESS
+	tristate "Inplace Compression target"
+	depends on BLK_DEV_DM
+	---help---
+	  Allow volume managers to compress data for SSD.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1a..4525482 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -59,6 +59,8 @@ obj-$(CONFIG_DM_CACHE_SMQ)	+= dm-cache-smq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)		+= dm-era.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
+obj-$(CONFIG_DM_INPLACE_COMPRESS)	+= dm-inplace-compress.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
new file mode 100644
index 0000000..41312d6
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.c
@@ -0,0 +1,2220 @@
+/*
+ *  device mapper compression block device.
+ *
+ *  Released under GPL v2.
+ *
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/crypto.h>
+#include <linux/lzo.h>
+#include <linux/kthread.h>
+#include <linux/page-flags.h>
+#include <linux/completion.h>
+#include <linux/vmalloc.h>
+#include "dm-inplace-compress.h"
+
+#define DM_MSG_PREFIX "dm-inplace-compress"
+
+
+static const struct kernel_param_ops dm_icomp_alloc_param_ops = {
+	.set    = param_set_ulong,
+	.get    = param_get_ulong,
+};
+
+static atomic64_t dm_icomp_total_alloc_size;
+#define DMCP_ALLOC(s) {atomic64_add(s, &dm_icomp_total_alloc_size); }
+#define DMCP_FREE_ALLOC(s) {atomic64_sub(s, &dm_icomp_total_alloc_size); }
+module_param_cb(dm_icomp_total_alloc_size, &dm_icomp_alloc_param_ops,
+				&dm_icomp_total_alloc_size, 0644);
+
+static atomic64_t dm_icomp_total_bio_save;
+#define DMCP_ALLOC_SAVE(s) {atomic64_add(s, &dm_icomp_total_bio_save); }
+module_param_cb(dm_icomp_total_bio_save, &dm_icomp_alloc_param_ops,
+				&dm_icomp_total_bio_save, 0644);
+
+
+static struct kmem_cache *dm_icomp_req_cachep;
+static struct kmem_cache *dm_icomp_io_range_cachep;
+static struct kmem_cache *dm_icomp_meta_io_cachep;
+
+static struct dm_icomp_io_worker dm_icomp_io_workers[NR_CPUS];
+static struct workqueue_struct *dm_icomp_wq;
+
+/*
+ *****************************************************
+ * compressor selection logic
+ *****************************************************
+ */
+static struct dm_icomp_compressor_data compressors[] = {
+	[DMCP_COMP_ALG_LZO] = {
+		.name = "lzo",
+		.can_handle_overflow = false,
+		.comp_len = lzo_comp_len,
+		.max_comp_len = lzo_max_comp_len,
+	},
+	[DMCP_COMP_ALG_842] = {
+		.name = "842",
+		.can_handle_overflow = true,
+		.comp_len = nx842_comp_len,
+		.max_comp_len = nx842_max_comp_len,
+	},
+};
+
+static int default_compressor = DMCP_COMP_ALG_LZO;
+#define DMCP_ALGO_LENGTH 9
+static char dm_icomp_algorithm[DMCP_ALGO_LENGTH] = "lzo";
+static struct kparam_string dm_icomp_compressor_kparam = {
+	.string =	dm_icomp_algorithm,
+	.maxlen =	sizeof(dm_icomp_algorithm),
+};
+static int dm_icomp_compressor_param_set(const char *,
+		const struct kernel_param *);
+static struct kernel_param_ops dm_icomp_compressor_param_ops = {
+	.set =	dm_icomp_compressor_param_set,
+	.get =	param_get_string,
+};
+module_param_cb(compress_algorithm, &dm_icomp_compressor_param_ops,
+		&dm_icomp_compressor_kparam, 0644);
+
+
+
+static int get_comp_id(const char *s)
+{
+	int r, val_len;
+
+	if (!crypto_has_comp(s, 0, 0))
+		return -1;
+
+	for (r = 0; r < ARRAY_SIZE(compressors); r++) {
+		val_len = strlen(compressors[r].name);
+		if (!strncmp(s, compressors[r].name, val_len))
+			return r;
+	}
+	return -1;
+}
+
+static const char *get_comp_name(int id)
+{
+	if (id < 0 || id > ARRAY_SIZE(compressors))
+		return NULL;
+	return compressors[id].name;
+}
+
+static void set_default_compressor(int index)
+{
+	default_compressor = index;
+	strlcpy(dm_icomp_algorithm, compressors[index].name,
+			sizeof(dm_icomp_algorithm));
+	DMINFO("compressor  is %s", dm_icomp_algorithm);
+}
+
+static inline int get_default_compressor(void)
+{
+	return default_compressor;
+}
+
+static int select_default_compressor(void)
+{
+	int r;
+	int arr_size = ARRAY_SIZE(compressors);
+
+	for (r = 0; r < arr_size; r++)
+		if (crypto_has_comp(compressors[r].name, 0, 0))
+			break;
+	if (r >= arr_size) {
+		DMWARN("No crypto compressors are supported");
+		return -EINVAL;
+	}
+	set_default_compressor(r);
+	return 0;
+}
+
+static int dm_icomp_compressor_param_set(const char *val,
+		const struct kernel_param *kp)
+{
+	int ret;
+	char str[kp->str->maxlen], *s;
+	int val_len = strlen(val)+1;
+
+	strlcpy(str, val, val_len);
+	s = strim(str);
+	ret = get_comp_id(s);
+	if (ret < 0) {
+		DMWARN("Compressor %s not supported", s);
+		return -1;
+	}
+	set_default_compressor(ret);
+	return 0;
+}
+
+static void free_compressor(struct dm_icomp_info *info)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (info->tfm[i]) {
+			crypto_free_comp(info->tfm[i]);
+			info->tfm[i] = NULL;
+		}
+	}
+}
+
+static int alloc_compressor(struct dm_icomp_info *info)
+{
+	int i;
+	const char *alg_name = get_comp_name(info->comp_alg);
+
+	for_each_possible_cpu(i) {
+		info->tfm[i] = crypto_alloc_comp(
+			alg_name, 0, 0);
+		if (IS_ERR(info->tfm[i]))
+			goto err;
+	}
+	return 0;
+
+err:
+	free_compressor(info);
+	return -ENOMEM;
+}
+
+/**** END compressor select logic ****/
+
+
+/*****  metadata logic ***************/
+/*
+ * return the meta data bits corresponding to a block
+ * @block_index : the index of the block
+ */
+static u8 dm_icomp_get_meta(struct dm_icomp_info *info, u64 block_index)
+{
+	u64 first_bit = block_index * DMCP_META_BITS;
+	int bits, offset;
+	u32 data;
+	u8  ret = 0;
+
+	offset = first_bit & (DMCP_BITS_PER_ENTRY-1);
+	bits = min_t(u32, DMCP_META_BITS, DMCP_BITS_PER_ENTRY - offset);
+
+	data = (u32)info->meta_bitmap[first_bit >> DMCP_META_BITS];
+	ret = (data >> offset) & ((1 << bits) - 1);
+
+	if (bits < DMCP_META_BITS) {
+		data = info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1];
+		bits = DMCP_META_BITS - bits;
+		ret |= (data & ((1 << bits) - 1)) << (DMCP_META_BITS - bits);
+	}
+	return ret;
+}
+
+
+static void dm_icomp_mark_page(struct dm_icomp_info *info, u32 *addr,
+				bool dirty_meta)
+{
+	struct page *page;
+
+	page = vmalloc_to_page(addr);
+	if (!page)
+		return;
+	if (dirty_meta)
+		SetPageDirty(page);
+	else
+		ClearPageDirty(page);
+}
+
+/*
+ * set the meta data bits corresponding to a block
+ * @block_index : the index of the block
+ * @meta        : the meta data bits.
+ */
+static void dm_icomp_set_meta(struct dm_icomp_info *info, u64 block_index,
+		u8 meta, bool dirty_meta)
+{
+	u64 first_bit = block_index * DMCP_META_BITS;
+	int bits, offset;
+	u32 data;
+
+	offset = first_bit & (DMCP_BITS_PER_ENTRY-1);
+	bits = min_t(u32, DMCP_META_BITS, DMCP_BITS_PER_ENTRY - offset);
+
+
+	data = (u32)info->meta_bitmap[first_bit >> DMCP_META_BITS];
+	data &= ~(((1 << bits) - 1) << offset);
+	data |= (meta & ((1 << bits) - 1)) << offset;
+	info->meta_bitmap[first_bit >> DMCP_META_BITS] = (u32)data;
+
+	if (info->write_mode == DMCP_WRITE_BACK)
+		dm_icomp_mark_page(info,
+			&info->meta_bitmap[first_bit >> DMCP_META_BITS],
+			dirty_meta);
+
+	if (bits < DMCP_META_BITS) {
+		meta >>= bits;
+		data = (u32)
+			info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1];
+		bits = DMCP_META_BITS - bits;
+		data = (data >> bits) << bits;
+		data |= meta & ((1 << bits) - 1);
+		info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1] =
+				(u32)data;
+
+		if (info->write_mode == DMCP_WRITE_BACK)
+			dm_icomp_mark_page(info,
+			&info->meta_bitmap[(first_bit >> DMCP_META_BITS) + 1],
+			dirty_meta);
+	}
+}
+
+
+/*
+ * set the meta data bits corresponding to an extent
+ * @block : the index of the block
+ * @logical_blocks: the number of blocks in the extent
+ * @sectors: the number of sectors holding the compressed
+ *		data
+ */
+static void dm_icomp_set_extent(struct dm_icomp_req *req, u64 block,
+	u16 logical_blocks, sector_t data_sectors)
+{
+	int i;
+	u8 data;
+
+	for (i = 0; i < logical_blocks; i++) {
+		data = min_t(sector_t, data_sectors, 8);
+		data_sectors -= data;
+		if (i != 0)
+			data |= DMCP_TAIL_MASK;
+		/* For FUA, we write out meta data directly */
+		dm_icomp_set_meta(req->info, block + i, data,
+					!(req->bio->bi_opf & REQ_FUA));
+	}
+}
+
+/*
+ * get the meta data bits corresponding to an extent
+ * @block_index : the index of the block
+ * @logical_blocks: return the number of blocks in the extent
+ * @sectors: return the number of sectors holding the compressed
+ *		data
+ */
+static void dm_icomp_get_extent(struct dm_icomp_info *info, u64 block_index,
+	u64 *first_block_index, u16 *logical_sectors, u16 *data_sectors)
+{
+	u8 data;
+
+	data = dm_icomp_get_meta(info, block_index);
+	while (data & DMCP_TAIL_MASK) {
+		block_index--;
+		data = dm_icomp_get_meta(info, block_index);
+	}
+	*first_block_index = block_index;
+	*logical_sectors = DMCP_BYTES_TO_SECTOR(DMCP_BLOCK_SIZE);
+	*data_sectors = data & DMCP_LENGTH_MASK;
+	block_index++;
+	while (block_index < info->data_blocks) {
+		data = dm_icomp_get_meta(info, block_index);
+		if (!(data & DMCP_TAIL_MASK))
+			break;
+		*logical_sectors += DMCP_BYTES_TO_SECTOR(DMCP_BLOCK_SIZE);
+		*data_sectors += data & DMCP_LENGTH_MASK;
+		block_index++;
+	}
+}
+
+/*
+ * return the super block
+ */
+static int dm_icomp_access_super(struct dm_icomp_info *info, void *addr,
+		int op, int flag)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	int ret;
+
+	region.bdev = info->dev->bdev;
+	region.sector = 0;
+	region.count = DMCP_BYTES_TO_SECTOR(DMCP_BLOCK_SIZE);
+
+	req.bi_op = op;
+	req.bi_op_flags = flag;
+	req.mem.type = DM_IO_KMEM;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = addr;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	ret = dm_io(&req, 1, &region, &io_error);
+	if (ret || io_error)
+		return -EIO;
+	return 0;
+}
+
+static void dm_icomp_meta_io_done(unsigned long error, void *context)
+{
+	struct dm_icomp_meta_io *meta_io = context;
+
+	meta_io->fn(meta_io->data, error);
+	kmem_cache_free(dm_icomp_meta_io_cachep, meta_io);
+}
+
+/*
+ * write meta data to the meta blocks in the backing store.
+ */
+static int dm_icomp_write_meta(struct dm_icomp_info *info, u64 start_page,
+	u64 end_page, void *data,
+	void (*fn)(void *data, unsigned long error), int rw, int flags)
+{
+	struct dm_icomp_meta_io *meta_io;
+	sector_t sector, last_sector, last_meta_sector = info->data_start-1;
+
+	WARN_ON(end_page > info->meta_bitmap_pages);
+
+	sector = DMCP_META_START_SECTOR + (start_page << (PAGE_SHIFT - 9));
+	WARN_ON(sector > last_meta_sector);
+	if (sector > last_meta_sector) {
+		fn(data, -EINVAL);
+		return -EINVAL;
+	}
+	last_sector = sector + ((end_page - start_page) << (PAGE_SHIFT - 9));
+	if (last_sector > last_meta_sector)
+		last_sector = last_meta_sector;
+
+
+	meta_io = kmem_cache_alloc(dm_icomp_meta_io_cachep, GFP_NOIO);
+	if (!meta_io) {
+		fn(data, -ENOMEM);
+		return -ENOMEM;
+	}
+	meta_io->data = data;
+	meta_io->fn = fn;
+
+	meta_io->io_region.bdev = info->dev->bdev;
+
+
+	meta_io->io_region.sector = sector;
+	meta_io->io_region.count = last_sector - sector + 1;
+	atomic64_add(DMCP_SECTOR_TO_BYTES(meta_io->io_region.count),
+				&info->meta_write_size);
+
+	meta_io->io_req.bi_op = rw;
+	meta_io->io_req.bi_op_flags = flags;
+	meta_io->io_req.mem.type = DM_IO_VMA;
+	meta_io->io_req.mem.offset = 0;
+	meta_io->io_req.mem.ptr.addr = ((char *)(info->meta_bitmap)) +
+						(start_page << PAGE_SHIFT);
+	meta_io->io_req.notify.fn = dm_icomp_meta_io_done;
+	meta_io->io_req.notify.context = meta_io;
+	meta_io->io_req.client = info->io_client;
+
+	dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
+	return 0;
+}
+
+struct writeback_flush_data {
+	struct completion complete;
+	atomic_t cnt;
+};
+
+static void writeback_flush_io_done(void *data, unsigned long error)
+{
+	struct writeback_flush_data *wb = data;
+
+	if (atomic_dec_return(&wb->cnt))
+		return;
+	complete(&wb->complete);
+}
+
+static void dm_icomp_flush_dirty_meta(struct dm_icomp_info *info,
+			struct writeback_flush_data *data)
+{
+	struct page *page;
+	u64 start = 0, index;
+	u32 pending = 0, cnt = 0;
+	bool dirty;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
+	for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
+		if (cnt == 256) {
+			cnt = 0;
+			cond_resched();
+		}
+
+		page = vmalloc_to_page((char *)(info->meta_bitmap) +
+					(index << PAGE_SHIFT));
+		if (!page)
+			DMWARN("Uable to find page for block=%llu", index);
+		dirty = TestClearPageDirty(page);
+
+		if (pending == 0 && dirty) {
+			start = index;
+			pending++;
+			continue;
+		} else if (pending == 0)
+			continue;
+		else if (pending > 0 && dirty) {
+			pending++;
+			continue;
+		}
+
+		/* pending > 0 && !dirty */
+		atomic_inc(&data->cnt);
+		dm_icomp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, REQ_OP_WRITE, WRITE);
+		pending = 0;
+	}
+
+	if (pending > 0) {
+		atomic_inc(&data->cnt);
+		dm_icomp_write_meta(info, start, start + pending, data,
+			writeback_flush_io_done, REQ_OP_WRITE, WRITE);
+	}
+	blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
+	blk_finish_plug(&plug);
+}
+
+static int dm_icomp_meta_writeback_thread(void *data)
+{
+	struct dm_icomp_info *info = data;
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	while (!kthread_should_stop()) {
+		schedule_timeout_interruptible(
+			msecs_to_jiffies(info->writeback_delay * 1000));
+		dm_icomp_flush_dirty_meta(info, &wb);
+	}
+
+	dm_icomp_flush_dirty_meta(info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+	return 0;
+}
+
+static int dm_icomp_init_meta(struct dm_icomp_info *info, bool new)
+{
+	struct dm_io_region region;
+	struct dm_io_request req;
+	unsigned long io_error = 0;
+	struct blk_plug plug;
+	int ret;
+	ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits,
+			DMCP_BITS_PER_ENTRY);
+
+	len *= (DMCP_BITS_PER_ENTRY >> 3);
+
+	region.bdev = info->dev->bdev;
+	region.sector = DMCP_META_START_SECTOR;
+	region.count = DMCP_BYTES_TO_SECTOR(round_up(len,
+				DMCP_SECTOR_SIZE));
+
+	req.mem.type = DM_IO_VMA;
+	req.mem.offset = 0;
+	req.mem.ptr.addr = info->meta_bitmap;
+	req.notify.fn = NULL;
+	req.client = info->io_client;
+
+	blk_start_plug(&plug);
+	if (new) {
+		memset(info->meta_bitmap, 0, len);
+		req.bi_op = REQ_OP_WRITE;
+		req.bi_op_flags = REQ_FUA;
+		ret = dm_io(&req, 1, &region, &io_error);
+	} else {
+		req.bi_op = REQ_OP_READ;
+		req.bi_op_flags = READ;
+		ret = dm_io(&req, 1, &region, &io_error);
+	}
+	blk_finish_plug(&plug);
+
+	if (ret || io_error) {
+		info->ti->error = "Access metadata error";
+		return -EIO;
+	}
+
+	if (info->write_mode == DMCP_WRITE_BACK) {
+		info->writeback_tsk = kthread_run(
+			dm_icomp_meta_writeback_thread,
+			info, "dm_icomp_writeback");
+		if (!info->writeback_tsk) {
+			info->ti->error = "Create writeback thread error";
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/***** END metadata logic *****/
+
+
+#define SET_REQ_STAGE(req, value) (req->stage = value)
+#define GET_REQ_STAGE(req) req->stage
+
+
+static void print_max_sectors_possible(struct dm_icomp_info *info)
+{
+	u64 total_blocks, data_blocks, meta_blocks;
+	u64 no_pairs;
+	u32 pair_blocks, rem;
+
+	/* superblock takes away one block */
+	total_blocks = DMCP_BYTES_TO_BLOCK(i_size_read(
+				info->dev->bdev->bd_inode)) - 1;
+
+	/* number of datablocks representable by one metadata block. */
+	data_blocks  = DIV_ROUND_CLOSEST_ULL((DMCP_BLOCK_SIZE * 8),
+			 DMCP_META_BITS);
+	meta_blocks  = 1; /* we need this one meta data block for sure. */
+
+
+	/* how many such pairing can we make ? */
+	pair_blocks  = data_blocks + meta_blocks;
+	no_pairs     = DIV_ROUND_CLOSEST_ULL(total_blocks, pair_blocks);
+
+
+	/*
+	 * these many datablocks and these many ..
+	 * metadatablocks will support each other.
+	 */
+	data_blocks  *= no_pairs;
+	meta_blocks  *= no_pairs;
+
+	div_u64_rem(total_blocks, pair_blocks, &rem);
+	if (rem) {
+		/* we have some remaining blocks.
+		 * give one to meta and remaining to data.
+		 */
+		meta_blocks++;
+		data_blocks += (rem - 1);
+	}
+
+	DMINFO(" This device can accommodate at most %llu sector ",
+		DMCP_BLOCK_TO_SECTOR(data_blocks));
+}
+
+
+/*
+ * create a new super block and initialize its contents.
+ */
+static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
+{
+	void *addr, *bitmap_addr;
+	struct dm_icomp_super_block *super;
+	u64 total_blocks, data_blocks, meta_blocks;
+	bool new_super = false;
+	int ret;
+	ssize_t len;
+
+	info->total_sector = DMCP_BYTES_TO_SECTOR(
+			i_size_read(info->dev->bdev->bd_inode));
+	total_blocks = DMCP_SECTOR_TO_BLOCK(info->total_sector) - 1;
+
+	data_blocks =  DMCP_SECTOR_TO_BLOCK(info->ti->len);
+	meta_blocks =  ((data_blocks * DMCP_META_BITS) +
+			((DMCP_BLOCK_SIZE * 8) - 1)) / (DMCP_BLOCK_SIZE * 8);
+
+	info->data_blocks = data_blocks;
+	info->data_start = DMCP_BLOCK_TO_SECTOR(1 + meta_blocks);
+
+	DMINFO(
+	"data_start=%u data_blocks=%llu metablocks=%llu total_blocks=%llu",
+		(unsigned int)info->data_start, info->data_blocks,
+		meta_blocks, total_blocks);
+
+	if (DMCP_BLOCK_TO_SECTOR(data_blocks + meta_blocks + 1)
+			>= info->total_sector) {
+		print_max_sectors_possible(info);
+		info->ti->error =
+			"Insufficient sectors to satisfy requested size";
+		return -ENOMEM;
+	}
+
+	addr = kzalloc(DMCP_BLOCK_SIZE+DMCP_SECTOR_SIZE, GFP_KERNEL);
+	if (!addr) {
+		info->ti->error = "Cannot allocate super";
+		return -ENOMEM;
+	}
+
+	super = PTR_ALIGN(addr, DMCP_SECTOR_SIZE);
+	ret = dm_icomp_access_super(info, super, REQ_OP_READ, REQ_FUA);
+	if (ret)
+		goto out;
+
+	if (le64_to_cpu(super->magic) == DMCP_SUPER_MAGIC) {
+
+		const char *alg_name;
+
+		if (le64_to_cpu(super->meta_blocks) != meta_blocks ||
+		    le64_to_cpu(super->data_blocks) != data_blocks) {
+			info->ti->error = "Super is invalid";
+			ret = -EINVAL;
+			goto out;
+		}
+
+		alg_name = get_comp_name(super->comp_alg);
+		if (!crypto_has_comp(alg_name, 0, 0)) {
+			info->ti->error =
+				"Compressor algorithm doesn't support";
+			ret = -EINVAL;
+			goto out;
+		}
+		info->comp_alg = super->comp_alg;
+
+	} else {
+		super->magic = cpu_to_le64(DMCP_SUPER_MAGIC);
+		super->meta_blocks = cpu_to_le64(meta_blocks);
+		super->data_blocks = cpu_to_le64(data_blocks);
+		super->comp_alg = info->comp_alg;
+		ret = dm_icomp_access_super(info, super, REQ_OP_WRITE,
+				REQ_FUA);
+		if (ret) {
+			info->ti->error = "Access super fails";
+			goto out;
+		}
+		new_super = true;
+	}
+
+	if (alloc_compressor(info)) {
+		info->ti->error = "Cannot allocate compressor";
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->meta_bitmap_bits = data_blocks * DMCP_META_BITS;
+	len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, DMCP_BITS_PER_ENTRY);
+	len *= (DMCP_BITS_PER_ENTRY >> 3);
+	info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	bitmap_addr = vzalloc((info->meta_bitmap_pages * PAGE_SIZE) +
+				DMCP_SECTOR_SIZE);
+	if (!bitmap_addr) {
+		info->ti->error = "Cannot allocate bitmap";
+		ret = -ENOMEM;
+		goto bitmap_err;
+	}
+	info->meta_bitmap = PTR_ALIGN(bitmap_addr, DMCP_SECTOR_SIZE);
+
+	ret = dm_icomp_init_meta(info, new_super);
+	if (ret)
+		goto meta_err;
+
+	return 0;
+meta_err:
+	vfree(bitmap_addr);
+bitmap_err:
+	free_compressor(info);
+out:
+	kfree(addr);
+	return ret;
+}
+
+/*
+ * <dev> [ <writethough>/<writeback> <meta_commit_delay> ]
+ *	 [ <compressor> <type> ]
+ */
+static int dm_icomp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_icomp_info *info;
+	char mode[15];
+	int par = 0;
+	int ret, i;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		ti->error = "dm-inplace-compress: Cannot allocate context";
+		return -ENOMEM;
+	}
+	info->ti = ti;
+	info->comp_alg = get_default_compressor();
+	info->critical = false;
+	while (++par < argc) {
+		if (sscanf(argv[par], "%s", mode) != 1) {
+			ti->error = "Invalid argument";
+			ret = -EINVAL;
+			goto err_para;
+		}
+
+		if (strcmp(mode, "writeback") == 0) {
+			info->write_mode = DMCP_WRITE_BACK;
+			if (kstrtouint(argv[++par], 10,
+				 &info->writeback_delay)) {
+				ti->error = "Invalid argument";
+				ret = -EINVAL;
+				goto err_para;
+			}
+		} else if (strcmp(mode, "writethrough") == 0) {
+			info->write_mode = DMCP_WRITE_THROUGH;
+		} else if (strcmp(mode, "critical") == 0) {
+			info->critical = true;
+		} else if (strcmp(mode, "compressor") == 0) {
+			if (sscanf(argv[++par], "%s", mode) != 1) {
+				ti->error = "Invalid argument";
+				ret = -EINVAL;
+				goto err_para;
+			}
+			ret = get_comp_id(mode);
+			if (ret >= 0) {
+				DMINFO("compressor  is %s", mode);
+				info->comp_alg = ret;
+			} else {
+				ti->error = "Unsupported compressor";
+				ret = -EINVAL;
+				goto err_para;
+			}
+		}
+	}
+
+	if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+							&info->dev)) {
+		ti->error = "Can't get device";
+		ret = -EINVAL;
+		goto err_para;
+	}
+
+	info->io_client = dm_io_client_create();
+	if (!info->io_client) {
+		ti->error = "Can't create io client";
+		ret = -EINVAL;
+		goto err_ioclient;
+	}
+
+	if (bdev_logical_block_size(info->dev->bdev) != 512) {
+		ti->error = "Can't logical block size too big";
+		ret = -EINVAL;
+		goto err_blocksize;
+	}
+
+	if (dm_set_target_max_io_len(ti, DMCP_BYTES_TO_SECTOR(DMCP_MAX_SIZE))) {
+		ti->error = "Failed to configure device ";
+		ret = -EINVAL;
+		goto err_blocksize;
+	}
+
+	if (dm_icomp_read_or_create_super(info)) {
+		ret = -EINVAL;
+		goto err_blocksize;
+	}
+
+	for (i = 0; i < BITMAP_HASH_LEN; i++) {
+		info->bitmap_locks[i].io_running = 0;
+		spin_lock_init(&info->bitmap_locks[i].wait_lock);
+		INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
+	}
+
+	atomic64_set(&info->compressed_write_size, 0);
+	atomic64_set(&info->uncompressed_write_size, 0);
+	atomic64_set(&info->meta_write_size, 0);
+	atomic64_set(&dm_icomp_total_alloc_size, 0);
+	atomic64_set(&dm_icomp_total_bio_save, 0);
+
+	ti->num_flush_bios = 1;
+	ti->private = info;
+	return 0;
+
+err_blocksize:
+	dm_io_client_destroy(info->io_client);
+err_ioclient:
+	dm_put_device(ti, info->dev);
+err_para:
+	kfree(info);
+	return ret;
+}
+
+static void dm_icomp_dtr(struct dm_target *ti)
+{
+	struct dm_icomp_info *info = ti->private;
+
+	if (info->write_mode == DMCP_WRITE_BACK)
+		kthread_stop(info->writeback_tsk);
+	free_compressor(info);
+	vfree(info->meta_bitmap);
+	dm_io_client_destroy(info->io_client);
+	dm_put_device(ti, info->dev);
+	kfree(info);
+}
+
+/*
+ * return the range lock to this block.
+ */
+static struct dm_icomp_hash_lock *dm_icomp_block_hash_lock(
+		struct dm_icomp_info *info, u64 block_index)
+{
+	return &info->bitmap_locks[(block_index >> BITMAP_HASH_SHIFT) &
+			BITMAP_HASH_MASK];
+}
+
+/*
+ * unlock the io range correspondingg to this block.
+ */
+static struct dm_icomp_hash_lock *dm_icomp_trylock_block(
+		struct dm_icomp_info *info,
+		struct dm_icomp_req *req, u64 block_index)
+{
+	struct dm_icomp_hash_lock *hash_lock;
+
+	hash_lock = dm_icomp_block_hash_lock(req->info, block_index);
+
+	spin_lock_irq(&hash_lock->wait_lock);
+	if (!hash_lock->io_running) {
+		hash_lock->io_running = 1;
+		spin_unlock_irq(&hash_lock->wait_lock);
+		return hash_lock;
+	}
+	list_add_tail(&req->sibling, &hash_lock->wait_list);
+	spin_unlock_irq(&hash_lock->wait_lock);
+	return NULL;
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+	 struct list_head *list);
+
+static void dm_icomp_unlock_block(struct dm_icomp_info *info,
+	struct dm_icomp_req *req, struct dm_icomp_hash_lock *hash_lock)
+{
+	LIST_HEAD(pending_list);
+	unsigned long flags;
+
+	spin_lock_irqsave(&hash_lock->wait_lock, flags);
+	/* wakeup all pending reqs to avoid live lock */
+	list_splice_init(&hash_lock->wait_list, &pending_list);
+	hash_lock->io_running = 0;
+	spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
+
+	dm_icomp_queue_req_list(info, &pending_list);
+}
+
+/*
+ * lock all the range locks corresponding to this io request.
+ */
+static int dm_icomp_lock_req_range(struct dm_icomp_req *req)
+{
+	u64 block_index, first_block_index;
+	u64 first_lock_block, second_lock_block;
+	u16 logical_sectors, data_sectors;
+
+	block_index = DMCP_SECTOR_TO_BLOCK(req->bio->bi_iter.bi_sector);
+	req->locks[0] = dm_icomp_trylock_block(req->info, req, block_index);
+	if (!req->locks[0])
+		return 0;
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+				&logical_sectors, &data_sectors);
+	if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+						req->locks[0]) {
+		dm_icomp_unlock_block(req->info, req, req->locks[0]);
+		first_lock_block = first_block_index;
+		second_lock_block = block_index;
+		goto two_locks;
+	}
+
+	block_index = DMCP_SECTOR_TO_BLOCK(bio_end_sector(req->bio) - 1);
+	dm_icomp_get_extent(req->info, block_index, &first_block_index,
+				&logical_sectors, &data_sectors);
+	first_block_index += DMCP_SECTOR_TO_BLOCK(logical_sectors);
+	if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+						req->locks[0]) {
+		second_lock_block = first_block_index;
+		goto second_lock;
+	}
+	req->locked_locks = 1;
+	return 1;
+
+two_locks:
+	req->locks[0] = dm_icomp_trylock_block(req->info, req,
+		first_lock_block);
+	if (!req->locks[0])
+		return 0;
+second_lock:
+	req->locks[1] = dm_icomp_trylock_block(req->info, req,
+				second_lock_block);
+	if (!req->locks[1]) {
+		dm_icomp_unlock_block(req->info, req, req->locks[0]);
+		return 0;
+	}
+	/* Don't need check if meta is changed */
+	req->locked_locks = 2;
+	return 1;
+}
+
+
+
+/*
+ * unlock all the range locks corresponding to this io request.
+ */
+static void dm_icomp_unlock_req_range(struct dm_icomp_req *req)
+{
+	int i;
+
+	for (i = req->locked_locks - 1; i >= 0; i--)
+		dm_icomp_unlock_block(req->info, req, req->locks[i]);
+}
+
+static void dm_icomp_queue_req(struct dm_icomp_info *info,
+		struct dm_icomp_req *req)
+{
+	unsigned long flags;
+	struct dm_icomp_io_worker *worker = &dm_icomp_io_workers[req->cpu];
+
+	spin_lock_irqsave(&worker->lock, flags);
+	list_add_tail(&req->sibling, &worker->pending);
+	spin_unlock_irqrestore(&worker->lock, flags);
+
+	queue_work_on(req->cpu, dm_icomp_wq, &worker->work);
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+		struct list_head *list)
+{
+	struct dm_icomp_req *req;
+
+	while (!list_empty(list)) {
+		req = list_first_entry(list, struct dm_icomp_req, sibling);
+		list_del_init(&req->sibling);
+		dm_icomp_queue_req(info, req);
+	}
+}
+
+static void dm_icomp_get_req(struct dm_icomp_req *req)
+{
+	atomic_inc(&req->io_pending);
+}
+
+static inline int get_alloc_flag(struct dm_icomp_info *info)
+{
+	/*
+	 * Use GFP_ATOMIC allocations if the device
+	 * is used on the critical path
+	 */
+	return info->critical ? GFP_ATOMIC : GFP_NOIO;
+}
+
+static void *dm_icomp_kmalloc(size_t size, int alloc_flag)
+{
+	void *addr = kmalloc(size, alloc_flag);
+
+	if (!addr)
+		return NULL;
+	DMCP_ALLOC(size);
+	return addr;
+}
+
+static void *dm_icomp_krealloc(void *ptr, size_t size,
+		size_t origsize, int alloc_flag)
+{
+	void *addr = krealloc(ptr, size, alloc_flag);
+
+	if (!addr)
+		return NULL;
+	DMCP_FREE_ALLOC(origsize);
+	DMCP_ALLOC(size);
+	return addr;
+}
+
+static int dm_icomp_alloc_compbuffer(struct dm_icomp_io_range *io, int size)
+{
+	int alloc_len = size + DMCP_SECTOR_SIZE;
+	void *addr = dm_icomp_kmalloc(alloc_len,
+			get_alloc_flag(io->req->info));
+
+	if (!addr)
+		return 1;
+
+	io->comp_real_data = addr;
+	io->comp_kmap	= false;
+	io->comp_len	= size;
+
+	/*
+	 * comp_data is used to read and write from storage.
+	 * So align it.
+	 */
+	io->comp_data   = io->io_req.mem.ptr.addr
+			= PTR_ALIGN(addr, DMCP_SECTOR_SIZE);
+
+	return 0;
+}
+
+static int dm_icomp_realloc_compbuffer(struct dm_icomp_io_range *io, int size)
+{
+	void *addr = dm_icomp_krealloc(io->comp_real_data,
+			size+DMCP_SECTOR_SIZE, io->comp_len,
+				get_alloc_flag(io->req->info));
+	if (!addr)
+		return 1;
+
+	io->comp_real_data = addr;
+	io->comp_kmap	   = false;
+	io->comp_data      = io->io_req.mem.ptr.addr = PTR_ALIGN(addr,
+				DMCP_SECTOR_SIZE);
+	io->comp_len	   = size;
+	return 0;
+}
+
+static void dm_icomp_kfree(void *addr, unsigned int size)
+{
+	kfree(addr);
+	DMCP_FREE_ALLOC(size);
+}
+
+static void dm_icomp_release_decomp_buffer(struct dm_icomp_io_range *io)
+{
+	if (!io->decomp_data)
+		return;
+
+	if (io->decomp_kmap)
+		kunmap(io->decomp_real_data);
+	else
+		dm_icomp_kfree(io->decomp_real_data, io->decomp_len);
+
+	io->decomp_data = io->decomp_real_data = NULL;
+	io->decomp_len  = 0;
+	io->decomp_kmap = false;
+}
+
+static void dm_icomp_release_comp_buffer(struct dm_icomp_io_range *io)
+{
+	if (!io->comp_data)
+		return;
+
+	if (io->comp_kmap)
+		kunmap(io->comp_real_data);
+	else
+		dm_icomp_kfree(io->comp_real_data, io->comp_len);
+
+	io->comp_real_data = io->comp_data = NULL;
+	io->comp_len = 0;
+	io->comp_kmap = false;
+}
+
+static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
+{
+	dm_icomp_release_decomp_buffer(io);
+	dm_icomp_release_comp_buffer(io);
+	kmem_cache_free(dm_icomp_io_range_cachep, io);
+}
+
+static void dm_icomp_put_req(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+
+	if (atomic_dec_return(&req->io_pending))
+		return;
+
+	if (GET_REQ_STAGE(req) == STAGE_INIT) /* waiting for locking */
+		return;
+
+	if (GET_REQ_STAGE(req) == STAGE_READ_DECOMP ||
+	    GET_REQ_STAGE(req) == STAGE_WRITE_COMP)
+		SET_REQ_STAGE(req, STAGE_DONE);
+
+	if (!!!req->result && GET_REQ_STAGE(req) != STAGE_DONE) {
+		dm_icomp_queue_req(req->info, req);
+		return;
+	}
+
+	while (!list_empty(&req->all_io)) {
+		io = list_entry(req->all_io.next,
+			struct dm_icomp_io_range, next);
+		list_del(&io->next);
+		dm_icomp_free_io_range(io);
+	}
+
+	dm_icomp_unlock_req_range(req);
+
+	req->bio->bi_error = req->result;
+
+	bio_endio(req->bio);
+	kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
+		ssize_t len, bool to_buf)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	off_t buf_off = 0;
+	ssize_t size;
+	void *addr;
+
+	WARN_ON(bio_off + len > DMCP_SECTOR_TO_BYTES(bio_sectors(bio)));
+
+	bio_for_each_segment(bv, bio, iter) {
+		int length = bv.bv_len;
+
+		if (bio_off > length) {
+			bio_off -= length;
+			continue;
+		}
+		addr = kmap_atomic(bv.bv_page);
+		size = min_t(ssize_t, len, length - bio_off);
+		if (!buf)
+			memset(addr + bio_off + bv.bv_offset, 0, size);
+		else if (to_buf)
+			memcpy(buf + buf_off, addr + bio_off + bv.bv_offset,
+				size);
+		else
+			memcpy(addr + bio_off + bv.bv_offset, buf + buf_off,
+				size);
+		kunmap_atomic(addr);
+		bio_off = 0;
+		buf_off += size;
+
+		if (len <= size)
+			break;
+
+		len -= size;
+	}
+}
+
+static void dm_icomp_io_range_done(unsigned long error, void *context)
+{
+	struct dm_icomp_io_range *io = context;
+
+	if (error)
+		io->req->result = error;
+
+	dm_icomp_put_req(io->req);
+}
+
+static inline int dm_icomp_compressor_len(struct dm_icomp_info *info, int len)
+{
+	if (compressors[info->comp_alg].comp_len)
+		return compressors[info->comp_alg].comp_len(len);
+	return len;
+}
+
+static inline bool dm_icomp_can_handle_overflow(struct dm_icomp_info *info)
+{
+	return compressors[info->comp_alg].can_handle_overflow;
+}
+
+static inline int dm_icomp_compressor_maxlen(struct dm_icomp_info *info,
+		int len)
+{
+	if (compressors[info->comp_alg].max_comp_len)
+		return compressors[info->comp_alg].max_comp_len(len);
+	return len;
+}
+
+/*
+ * caller should set region.sector, region.count. bi_rw. IO always to/from
+ * comp_data
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_range(
+		struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+
+	io = kmem_cache_alloc(dm_icomp_io_range_cachep, GFP_NOIO);
+	if (!io)
+		return NULL;
+
+	io->io_req.notify.fn = dm_icomp_io_range_done;
+	io->io_req.notify.context = io;
+	io->io_req.client = req->info->io_client;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.offset = 0;
+
+	io->io_region.bdev = req->info->dev->bdev;
+	io->req = req;
+
+	io->comp_data = io->comp_real_data =
+			io->decomp_data = io->decomp_real_data = NULL;
+
+	io->data_bytes = io->comp_len =
+			io->decomp_len = io->logical_bytes = 0;
+
+	io->comp_kmap = io->decomp_kmap = false;
+	return io;
+}
+
+
+/*
+ * return an address, within the bio. The address corresponds to
+ * the requested offset 'bio_off' and is contiguous of size 'len'
+ */
+static void *get_addr(struct bio *bio,  int len, u64 bio_off, u64 *offset)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	void *addr;
+
+	bio_for_each_segment(bv, bio, iter) {
+		int length = bv.bv_len;
+
+		if (bio_off > length) {
+			bio_off -= length;
+			continue;
+		}
+		addr = bv.bv_page;
+		if (bv.bv_offset + bio_off + len >= length) {
+			*offset = bv.bv_offset + bio_off;
+			return kmap(addr);
+		}
+		break;
+	}
+	return NULL;
+}
+
+
+/*
+ * create a io range for tracking  predominantly a read request.
+ * @req		: the read request
+ * @comp_len	: allocation size of the compress buffer
+ * @decomp_len	: allocation size of the decompress buffer
+ * @actual_comp_len : real size of the compress data
+ * @bio_off	: offset within the bio read buffer this request corresponds to.
+ *		try to reuse and read into the bio buffer. -1 means don't reuse.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
+		struct dm_icomp_req *req, int comp_len, int decomp_len,
+		long bio_off, int actual_comp_len)
+{
+	struct bio *bio = req->bio;
+	void *addr = NULL;
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+	u64 offset;
+
+	if (!io)
+		return NULL;
+
+	WARN_ON(comp_len % DMCP_SECTOR_SIZE);
+
+	/* try reusing the bio if possible */
+	if (bio_off >= 0) {
+		addr = get_addr(bio, comp_len, (u64)bio_off, &offset);
+		if (addr) {
+			io->comp_real_data =  addr;
+			io->comp_data = io->io_req.mem.ptr.addr = addr + offset;
+			io->comp_kmap = true;
+			io->comp_len  = comp_len;
+		}
+	}
+
+	if (!addr && dm_icomp_alloc_compbuffer(io, comp_len)) {
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	io->data_bytes	= actual_comp_len;  /* NOTE, this value can change */
+
+	/*
+	 * note requested length for decompress buffer. Do not allocate it yet.
+	 * Value once set is final.
+	 */
+	io->logical_bytes = decomp_len;
+
+	return io;
+}
+
+/*
+ *  ensure that the io range has all its buffers; of the correct size,
+ *  allocated.
+ */
+static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io)
+{
+	WARN_ON(!io->comp_data);
+	WARN_ON(io->decomp_data || io->decomp_len);
+	io->decomp_data = dm_icomp_kmalloc(io->logical_bytes,
+				get_alloc_flag(io->req->info));
+	if (!io->decomp_data)
+		return 1;
+	io->decomp_real_data = io->decomp_data;
+	io->decomp_len = io->logical_bytes;
+	io->decomp_kmap = false;
+	return 0;
+}
+
+/*
+ *  resize the comp buffer to its largest possible size.
+ */
+static int dm_icomp_mod_to_max_io_range(struct dm_icomp_info *info,
+			 struct dm_icomp_io_range *io)
+{
+	unsigned int maxlen = dm_icomp_compressor_maxlen(info, io->decomp_len);
+
+	WARN_ON(maxlen > io->logical_bytes);
+
+	if (io->comp_kmap) {
+		WARN_ON(io->comp_kmap);
+		kunmap(io->comp_real_data);
+		io->comp_kmap = false;
+		io->comp_real_data = io->comp_data = NULL;
+	}
+
+	if (dm_icomp_realloc_compbuffer(io, maxlen)) {
+		io->comp_len = 0;
+		return -ENOSPC;
+	}
+	io->comp_len = maxlen;
+	return 0;
+}
+
+/*
+ * create a io range for tracking a write request.
+ * @req		: the write request
+ * @count	: size of the write in sectors.
+ * @offset	: offset within the bio read buffer this request correspond to.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
+	struct dm_icomp_req *req, sector_t offset, sector_t count)
+{
+	struct bio *bio = req->bio;
+	int size  = DMCP_SECTOR_TO_BYTES(count);
+	u64 of;
+	int comp_len = dm_icomp_compressor_len(req->info, size);
+	void *addr;
+	struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+
+	if (!io)
+		return NULL;
+
+	WARN_ON(io->comp_data);
+
+	if (dm_icomp_alloc_compbuffer(io, comp_len)) {
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	/* we donot know the size of the compress segment yet. */
+	io->data_bytes = 0;
+
+
+	WARN_ON(io->decomp_data);
+
+	io->decomp_kmap = false;
+
+	/* try reusing the bio buffer for decomp data. */
+	addr = get_addr(bio, size, DMCP_SECTOR_TO_BYTES(offset), &of);
+	if (addr)
+		io->decomp_kmap = true;
+	else
+		addr  = dm_icomp_kmalloc(size,
+				get_alloc_flag(req->info));
+
+	if (!addr) {
+		dm_icomp_kfree(io->comp_data, comp_len);
+		kmem_cache_free(dm_icomp_io_range_cachep, io);
+		return NULL;
+	}
+
+	io->logical_bytes = io->decomp_len = size;
+
+	if (io->decomp_kmap) {
+		io->decomp_real_data = addr;
+		io->decomp_data = addr + of;
+		DMCP_ALLOC_SAVE(size);
+	} else {
+		io->decomp_data = io->decomp_real_data = addr;
+		dm_icomp_bio_copy(req->bio, DMCP_SECTOR_TO_BYTES(offset),
+			io->decomp_data, size, true);
+	}
+
+	return io;
+}
+
+static unsigned int round_to_next_sector(unsigned int val)
+{
+	unsigned int c = round_up(val, DMCP_SECTOR_SIZE);
+
+	if ((c - val) < 2*sizeof(u32))
+		c += DMCP_SECTOR_SIZE;
+	return c;
+}
+
+/*
+ * compress and store the data in compress buffer.
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ * Compressed data size is roundup of 512, which makes the payload.
+ * We store the actual compressed len in the last u32 of the payload.
+ * If there is no free space, we add 512 to the payload size.
+ */
+static int dm_icomp_io_range_compress(struct dm_icomp_info *info,
+		struct dm_icomp_io_range *io, unsigned int *comp_len)
+{
+	unsigned int actual_comp_len = io->comp_len;
+	u32 *addr;
+	struct crypto_comp *tfm =  info->tfm[get_cpu()];
+	unsigned int decomp_len = io->logical_bytes;
+	int ret;
+
+	actual_comp_len = io->comp_len;
+	ret = crypto_comp_compress(tfm, io->decomp_data, decomp_len,
+		io->comp_data, &actual_comp_len);
+
+	if (ret || round_to_next_sector(actual_comp_len) > io->comp_len) {
+		ret = dm_icomp_mod_to_max_io_range(info, io);
+		if (!ret) {
+			actual_comp_len = io->comp_len;
+			ret = crypto_comp_compress(tfm, io->decomp_data,
+				decomp_len, io->comp_data,
+				&actual_comp_len);
+		}
+	}
+
+	put_cpu();
+
+	atomic64_add(decomp_len, &info->uncompressed_write_size);
+	io->data_bytes = *comp_len = round_to_next_sector(actual_comp_len);
+	if (ret || decomp_len < *comp_len) {
+		*comp_len = decomp_len;
+		memcpy(io->comp_data, io->decomp_data, *comp_len);
+		atomic64_add(*comp_len, &info->compressed_write_size);
+	} else {
+		atomic64_add(*comp_len, &info->compressed_write_size);
+		addr = (u32 *)((char *)io->comp_data + *comp_len);
+		addr--;
+		*addr = cpu_to_le32(actual_comp_len);
+		addr--;
+		*addr = cpu_to_le32(DMCP_COMPRESS_MAGIC);
+	}
+
+	return 0;
+}
+
+/*
+ * decompress and store the data in decompress buffer.
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ */
+static int dm_icomp_io_range_decompress(struct dm_icomp_info *info,
+		struct dm_icomp_io_range *io, unsigned int *decomp_len)
+{
+	struct crypto_comp *tfm;
+	u32 *addr;
+	int ret;
+	int comp_len = io->data_bytes;
+
+	WARN_ON(!io->data_bytes);
+
+	if (comp_len == io->logical_bytes) {
+		memcpy(io->decomp_data, io->comp_data, comp_len);
+		*decomp_len = comp_len;
+		return 0;
+	}
+
+	WARN_ON(io->comp_data != io->io_req.mem.ptr.addr);
+
+	addr = (u32 *)((char *)(io->comp_data) + comp_len);
+	addr--;
+	comp_len = le32_to_cpu(*addr);
+	addr--;
+
+	if (le32_to_cpu(*addr) == DMCP_COMPRESS_MAGIC) {
+		tfm = info->tfm[get_cpu()];
+		*decomp_len = io->logical_bytes;
+		ret = crypto_comp_decompress(tfm, io->comp_data, comp_len,
+			io->decomp_data, decomp_len);
+		WARN_ON(*decomp_len != io->decomp_len);
+		put_cpu();
+		if (ret)
+			return -EINVAL;
+		return 0;
+	}
+
+	DMWARN("Decompress Error ");
+	return -1;
+}
+
+/*
+ *  fill the bio with the corresponding decompressed data.
+ */
+static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	off_t bio_off = 0;
+	int ret;
+	sector_t bio_len  = DMCP_SECTOR_TO_BYTES(bio_sectors(req->bio));
+
+	SET_REQ_STAGE(req, STAGE_READ_DECOMP);
+
+	if (req->result)
+		return;
+
+	list_for_each_entry(io, &req->all_io, next) {
+		ssize_t dst_off = 0, src_off = 0, len;
+		unsigned int decomp_len;
+
+		io->io_region.sector -= req->info->data_start;
+
+		if (io->io_region.sector >=
+				req->bio->bi_iter.bi_sector)
+			dst_off = DMCP_SECTOR_TO_BYTES(
+				io->io_region.sector -
+				req->bio->bi_iter.bi_sector);
+		else
+			src_off = DMCP_SECTOR_TO_BYTES(
+				req->bio->bi_iter.bi_sector -
+				io->io_region.sector);
+
+		if (dm_icomp_update_io_read_range(io)) {
+			req->result = -EIO;
+			return;
+		}
+
+		/* Do decomp here */
+		ret = dm_icomp_io_range_decompress(req->info, io, &decomp_len);
+		if (ret < 0) {
+			dm_icomp_release_decomp_buffer(io);
+			dm_icomp_release_comp_buffer(io);
+			req->result = -EIO;
+			return;
+		}
+
+		len = min_t(ssize_t,
+			max_t(ssize_t, decomp_len - src_off, 0),
+			max_t(ssize_t, bio_len - dst_off, 0));
+
+		dm_icomp_bio_copy(req->bio, dst_off,
+			   io->decomp_data + src_off, len, false);
+
+		/* io range in all_io list is ordered for read IO */
+		while (bio_off < dst_off) {
+			ssize_t size = min_t(ssize_t, PAGE_SIZE,
+					dst_off - bio_off);
+			dm_icomp_bio_copy(req->bio, bio_off, NULL,
+					size, false);
+			bio_off += size;
+		}
+
+		bio_off = dst_off + len;
+		dm_icomp_release_decomp_buffer(io);
+		dm_icomp_release_comp_buffer(io);
+	}
+
+	while (bio_off < bio_len) {
+		ssize_t size = min_t(ssize_t, PAGE_SIZE, (bio_len - bio_off));
+
+		dm_icomp_bio_copy(req->bio, bio_off, NULL,
+			size, false);
+		bio_off += size;
+	}
+}
+
+
+/*
+ * read an extent
+ * @req        : the read request
+ * @block      : the block to be read
+ * @logical_sectors   : no of sectors occupied by the decompressed data
+ * @data_sectors      : no of sectors occupied by the compressed data
+ * @may_resize : the compress data size may change during its life.
+ */
+static void dm_icomp_read_one_extent(struct dm_icomp_req *req, u64 block,
+	u16 logical_sectors, u16 data_sectors, bool may_resize)
+{
+	struct dm_icomp_io_range *io;
+	long bio_off = 0, comp_len;
+	int actual_comp_len = DMCP_SECTOR_TO_BYTES(data_sectors);
+	int actual_decomp_len = DMCP_SECTOR_TO_BYTES(logical_sectors);
+
+	comp_len = actual_comp_len;
+	if (may_resize && !dm_icomp_can_handle_overflow(req->info))
+		comp_len = dm_icomp_compressor_maxlen(req->info,
+				actual_decomp_len);
+
+	bio_off	 =  (may_resize) ? -1 :
+			 DMCP_BLOCK_TO_SECTOR(block) -
+				req->bio->bi_iter.bi_sector;
+
+	io = dm_icomp_create_io_read_range(req, comp_len,
+		actual_decomp_len,
+		bio_off,
+		actual_comp_len);
+	if (!io) {
+		req->result = -EIO;
+		return;
+	}
+
+	dm_icomp_get_req(req);
+	list_add_tail(&io->next, &req->all_io);
+
+	io->io_region.sector = DMCP_BLOCK_TO_SECTOR(block) +
+				req->info->data_start;
+	io->io_region.count = data_sectors;
+	io->io_req.mem.ptr.addr = io->comp_data;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.offset = 0;
+	io->io_req.bi_op = REQ_OP_READ;
+	io->io_req.bi_op_flags = (req->bio->bi_opf & REQ_FUA);
+
+	WARN_ON((io->io_region.sector + io->io_region.count)
+		>= req->info->total_sector);
+
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+}
+
+
+/*
+ * read the data corresponding to this request.
+ * @req   : the request.
+ * @reuse : the read data may be modified. So plan accordingly.
+ */
+static void dm_icomp_handle_read_existing(struct dm_icomp_req *req, bool reuse)
+{
+	u64 block_index, first_block_index;
+	u16 logical_sectors, data_sectors;
+
+	SET_REQ_STAGE(req, STAGE_READ_EXISTING);
+
+	block_index = DMCP_SECTOR_TO_BLOCK(req->bio->bi_iter.bi_sector);
+
+	while (!!!req->result &&
+		(block_index <= DMCP_SECTOR_TO_BLOCK(
+				bio_end_sector(req->bio)-1)) &&
+		(block_index < req->info->data_blocks)) {
+
+		dm_icomp_get_extent(req->info, block_index, &first_block_index,
+			&logical_sectors, &data_sectors);
+
+		if (data_sectors)
+			dm_icomp_read_one_extent(req, first_block_index,
+				logical_sectors, data_sectors, reuse);
+
+		block_index = first_block_index +
+				DMCP_SECTOR_TO_BLOCK(logical_sectors);
+	}
+}
+
+/*
+ * read existing data
+ */
+static void dm_icomp_handle_read_read_existing(struct dm_icomp_req *req)
+{
+	dm_icomp_handle_read_existing(req, false);
+
+	if (req->result)
+		return;
+
+	/* A shortcut if all data is in already */
+	if (list_empty(&req->all_io))
+		dm_icomp_handle_read_decomp(req);
+}
+
+static void dm_icomp_handle_read_request(struct dm_icomp_req *req)
+{
+	dm_icomp_get_req(req);
+
+	if (GET_REQ_STAGE(req) == STAGE_INIT) {
+		if (!dm_icomp_lock_req_range(req)) {
+			dm_icomp_put_req(req);
+			return;
+		}
+		dm_icomp_handle_read_read_existing(req);
+	} else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING) {
+		dm_icomp_handle_read_decomp(req);
+	}
+
+	dm_icomp_put_req(req);
+}
+
+static void dm_icomp_write_meta_done(void *context, unsigned long error)
+{
+	struct dm_icomp_req *req = context;
+
+	dm_icomp_put_req(req);
+}
+
+static u64 dm_icomp_block_meta_page_index(u64 block, bool end)
+{
+	u64 bits = block * DMCP_META_BITS - !!end;
+	/*
+	 * >> 5; 32 bits per entry
+	 * << 2; each entry is 4 bytes
+	 * >> PAGE_SHIFT; PAGE_SHIFT pages
+	 */
+	return bits >> (5 - 2 + PAGE_SHIFT);
+}
+
+
+/*
+ * write compressed data to the backing storage.
+ * @io : io range
+ * @sector_start : the sector on backing storage to which the
+ *	compressed data needs to be written.
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end  : start and end blocks.
+ */
+static int dm_icomp_compress_write(struct dm_icomp_io_range *io,
+		sector_t sector_start, u64 *meta_start, u64 *meta_end)
+{
+	struct dm_icomp_req *req = io->req;
+	sector_t count = DMCP_BYTES_TO_SECTOR(io->decomp_len);
+	unsigned int comp_len, ret;
+	u64 page_index;
+
+	/* comp_data must be able to accommadate a larger compress buffer */
+	ret = dm_icomp_io_range_compress(req->info, io, &comp_len);
+	if (ret < 0) {
+		req->result = -EIO;
+		return -EIO;
+	}
+	WARN_ON(comp_len > io->comp_len);
+
+	dm_icomp_get_req(req);
+
+	io->io_req.bi_op = REQ_OP_WRITE;
+	io->io_req.bi_op_flags = (req->bio->bi_opf & REQ_FUA);
+	io->io_req.mem.ptr.addr = io->comp_data;
+	io->io_req.mem.type = DM_IO_KMEM;
+	io->io_req.mem.offset = 0;
+	io->io_region.count = DMCP_BYTES_TO_SECTOR(comp_len);
+	io->io_region.sector = sector_start + req->info->data_start;
+
+	dm_icomp_release_decomp_buffer(io);
+
+
+	WARN_ON((io->io_region.sector + io->io_region.count)
+			>= req->info->total_sector);
+
+	dm_io(&io->io_req, 1, &io->io_region, NULL);
+
+	/* update the meta data bits */
+	dm_icomp_set_extent(req, DMCP_SECTOR_TO_BLOCK(sector_start),
+		DMCP_SECTOR_TO_BLOCK(count), DMCP_BYTES_TO_SECTOR(comp_len));
+
+	page_index = dm_icomp_block_meta_page_index(
+		DMCP_SECTOR_TO_BLOCK(sector_start), false);
+	if (*meta_start > page_index)
+		*meta_start = page_index;
+
+	page_index = dm_icomp_block_meta_page_index(
+			DMCP_SECTOR_TO_BLOCK(sector_start + count), true);
+	if (*meta_end < page_index)
+		*meta_end = page_index;
+	return 0;
+}
+
+/*
+ * modify and write compressed data to the backing storage.
+ * @io : io range
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end  : start and end blocks.
+ */
+static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
+	u64 *meta_start, u64 *meta_end)
+{
+	struct dm_icomp_req *req = io->req;
+	sector_t bio_start, bio_end, buf_start, buf_end, overlap;
+	off_t bio_off, buf_off;
+	int ret;
+	unsigned int decomp_len;
+
+	io->io_region.sector -= req->info->data_start;
+
+	/* decompress original data */
+	if (dm_icomp_update_io_read_range(io)) {
+		req->result = -EIO;
+		return -EIO;
+	}
+
+	ret = dm_icomp_io_range_decompress(req->info, io, &decomp_len);
+	if (ret < 0) {
+		req->result = -EINVAL;
+		return -EIO;
+	}
+
+	bio_start = req->bio->bi_iter.bi_sector;
+	bio_end = bio_end_sector(req->bio) - 1;
+
+	buf_start = io->io_region.sector;
+	buf_end = buf_start + DMCP_BYTES_TO_SECTOR(decomp_len) - 1;
+
+	/* if no overlap, nothing to do. Just return */
+	if (bio_start >= buf_end || bio_end <= buf_start)
+		return 0;
+
+	bio_off = (buf_start > bio_start) ?  (buf_start - bio_start) : 0;
+	buf_off = (bio_start > buf_start) ?  (bio_start - buf_start) : 0;
+
+	/*
+	 * overlap = sizeof(block1) + sizeof(block2) - sizeof(left_side_shift) -
+	 *		sizeof(right_side_shift)  / 2  +  1
+	 */
+	overlap  =  (((bio_end - bio_start) + (buf_end - buf_start) -
+		abs(buf_end - bio_end) - abs(buf_start - bio_start)) >> 1) + 1;
+
+
+	dm_icomp_bio_copy(req->bio, DMCP_SECTOR_TO_BYTES(bio_off),
+		   io->decomp_data + DMCP_SECTOR_TO_BYTES(buf_off),
+		   DMCP_SECTOR_TO_BYTES(overlap), true);
+
+	return dm_icomp_compress_write(io, io->io_region.sector,
+			meta_start, meta_end);
+}
+
+
+/*
+ * create and write new extents. Each extent is not more than
+ * 256 sectors.
+ * @req : the request
+ * @sec_start: the start sector of the request
+ * @total  : the total sectors
+ * @list  : collect each 256 sector size io request in this list
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end  : start and end blocks.
+ *
+ */
+static void dm_icomp_handle_write_create(struct dm_icomp_req *req,
+	sector_t sec_start, sector_t total,
+	struct list_head *list, u64 *meta_start, u64 *meta_end)
+{
+	struct dm_icomp_io_range *io;
+	sector_t count, offset = 0;
+	int ret;
+
+	while (total) {
+
+		/* max i/o is 128kbytes i.e 256 sectors */
+		count = min_t(sector_t, total, 256);
+		io = dm_icomp_create_io_write_range(req, offset, count);
+		if (!io) {
+			req->result = -EIO;
+			return;
+		}
+
+		ret = dm_icomp_compress_write(io, sec_start, meta_start,
+			meta_end);
+		if (ret) {
+			dm_icomp_free_io_range(io);
+			return;
+		}
+
+
+		list_add_tail(&io->next, list);
+		total -= count;
+		sec_start += count;
+		offset += count;
+
+	}
+}
+
+/*
+ *  handle the write request.
+ */
+static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
+{
+	struct dm_icomp_io_range *io;
+	sector_t io_start, req_start, req_end;
+	u64 meta_start = -1L, meta_end = 0;
+	LIST_HEAD(newlist);
+
+	SET_REQ_STAGE(req, STAGE_WRITE_COMP);
+
+	if (req->result)
+		return;
+
+	req_start = req->bio->bi_iter.bi_sector;
+	list_for_each_entry(io, &req->all_io, next) {
+
+		io_start = io->io_region.sector - req->info->data_start;
+
+		if (req_start < io_start) {
+			/* fill the gap */
+			dm_icomp_handle_write_create(req, req_start,
+				(io_start - req_start), &newlist,
+				&meta_start, &meta_end);
+		}
+
+		dm_icomp_handle_write_modify(io, &meta_start, &meta_end);
+
+		req_start = io_start + DMCP_BYTES_TO_SECTOR(io->logical_bytes);
+	}
+
+	req_end =  bio_end_sector(req->bio);
+	if (req_start < req_end) {
+		/* fill the gap */
+		dm_icomp_handle_write_create(req, req_start,
+			 req_end-req_start, &newlist, &meta_start,
+			&meta_end);
+	}
+
+	list_splice_tail(&newlist, &req->all_io);
+
+	if (req->info->write_mode == DMCP_WRITE_THROUGH ||
+				(req->bio->bi_opf & REQ_FUA)) {
+		if (meta_start == -1)
+			return;
+		dm_icomp_get_req(req);
+		dm_icomp_write_meta(req->info, meta_start,
+			meta_end+1, req,
+			dm_icomp_write_meta_done,
+			REQ_OP_WRITE, req->bio->bi_opf);
+	}
+}
+
+/*
+ *  read the data, modify and write it back to the backing store.
+ */
+static void dm_icomp_handle_write_read_existing(struct dm_icomp_req *req)
+{
+	dm_icomp_handle_read_existing(req, true);
+	if (req->result)
+		return;
+
+	if (list_empty(&req->all_io))
+		dm_icomp_handle_write_comp(req);
+}
+
+static void dm_icomp_handle_write_request(struct dm_icomp_req *req)
+{
+	dm_icomp_get_req(req);
+
+	if (GET_REQ_STAGE(req) == STAGE_INIT) {
+		if (!dm_icomp_lock_req_range(req)) {
+			dm_icomp_put_req(req);
+			return;
+		}
+		dm_icomp_handle_write_read_existing(req);
+	} else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING) {
+		dm_icomp_handle_write_comp(req);
+	}
+
+	dm_icomp_put_req(req);
+}
+
+/* For writeback mode */
+static void dm_icomp_handle_flush_request(struct dm_icomp_req *req)
+{
+	struct writeback_flush_data wb;
+
+	atomic_set(&wb.cnt, 1);
+	init_completion(&wb.complete);
+
+	dm_icomp_flush_dirty_meta(req->info, &wb);
+
+	writeback_flush_io_done(&wb, 0);
+	wait_for_completion(&wb.complete);
+
+	req->bio->bi_error = 0;
+	bio_endio(req->bio);
+	kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_handle_request(struct dm_icomp_req *req)
+{
+	if (req->bio->bi_opf & REQ_PREFLUSH)
+		dm_icomp_handle_flush_request(req);
+	else if (op_is_write(bio_op(req->bio)))
+		dm_icomp_handle_write_request(req);
+	else
+		dm_icomp_handle_read_request(req);
+}
+
+static void dm_icomp_do_request_work(struct work_struct *work)
+{
+	struct dm_icomp_io_worker *worker = container_of(work,
+				struct dm_icomp_io_worker, work);
+	LIST_HEAD(list);
+	struct dm_icomp_req *req;
+	struct blk_plug plug;
+	bool repeat;
+
+	blk_start_plug(&plug);
+again:
+	spin_lock_irq(&worker->lock);
+	list_splice_init(&worker->pending, &list);
+	spin_unlock_irq(&worker->lock);
+
+	repeat = !list_empty(&list);
+	while (!list_empty(&list)) {
+		req = list_first_entry(&list, struct dm_icomp_req, sibling);
+		list_del(&req->sibling);
+
+		schedule();
+		dm_icomp_handle_request(req);
+	}
+	if (repeat)
+		goto again;
+	blk_finish_plug(&plug);
+}
+
+static bool valid_request(struct bio *bio, struct dm_icomp_info *info)
+{
+	sector_t dev_end	=  info->ti->len;
+	sector_t req_end	=  bio_end_sector(bio) - 1;
+
+	return (req_end <= dev_end);
+}
+
+static int dm_icomp_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_icomp_info *info = ti->private;
+	struct dm_icomp_req *req;
+
+	if ((bio->bi_opf & REQ_PREFLUSH) &&
+			info->write_mode == DMCP_WRITE_THROUGH) {
+		bio->bi_bdev = info->dev->bdev;
+		return DM_MAPIO_REMAPPED;
+	}
+
+
+	req = kmem_cache_alloc(dm_icomp_req_cachep, GFP_NOIO);
+	if (!req)
+		return -EIO;
+
+	req->bio = bio;
+	if (!(bio->bi_opf & REQ_PREFLUSH) && !valid_request(bio, info)) {
+		req->bio = bio;
+		req->bio->bi_error = -EINVAL;
+		bio_endio(req->bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	req->info = info;
+	atomic_set(&req->io_pending, 0);
+	INIT_LIST_HEAD(&req->all_io);
+	req->result = 0;
+	SET_REQ_STAGE(req, STAGE_INIT);
+	req->locked_locks = 0;
+
+	req->cpu = raw_smp_processor_id();
+	dm_icomp_queue_req(info, req);
+
+	return DM_MAPIO_SUBMITTED;
+}
+
+static void dm_icomp_status(struct dm_target *ti, status_type_t type,
+	  unsigned int status_flags, char *result, unsigned int maxlen)
+{
+	struct dm_icomp_info *info = ti->private;
+	unsigned int sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%ld %ld %ld",
+			atomic64_read(&info->uncompressed_write_size),
+			atomic64_read(&info->compressed_write_size),
+			atomic64_read(&info->meta_write_size));
+		break;
+	case STATUSTYPE_TABLE:
+		if (info->write_mode == DMCP_WRITE_BACK)
+			DMEMIT("%s %s:%d %s:%s %s:%d", info->dev->name,
+				"writeback", info->writeback_delay,
+				"compressor", compressors[info->comp_alg].name,
+				"critical", info->critical);
+		else
+			DMEMIT("%s %s %s:%s %s:%d", info->dev->name,
+				"writethrough",
+				"compressor", compressors[info->comp_alg].name,
+				"critical", info->critical);
+		break;
+	}
+}
+
+static int dm_icomp_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_icomp_info *info = ti->private;
+
+	return fn(ti, info->dev, info->data_start,
+		DMCP_BLOCK_TO_SECTOR(info->data_blocks), data);
+}
+
+static void dm_icomp_io_hints(struct dm_target *ti,
+			    struct queue_limits *limits)
+{
+	/* No blk_limits_logical_block_size */
+	limits->logical_block_size = limits->physical_block_size =
+		limits->io_min = DMCP_BLOCK_SIZE;
+	limits->max_sectors = limits->max_hw_sectors =
+		DMCP_BYTES_TO_SECTOR(DMCP_MAX_SIZE);
+}
+
+static struct target_type dm_icomp_target = {
+	.name   = "inplacecompress",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr    = dm_icomp_ctr,
+	.dtr    = dm_icomp_dtr,
+	.map    = dm_icomp_map,
+	.status = dm_icomp_status,
+	.iterate_devices = dm_icomp_iterate_devices,
+	.io_hints = dm_icomp_io_hints,
+};
+
+static int __init dm_icomp_init(void)
+{
+	int r;
+
+	if (select_default_compressor())
+		return -EINVAL;
+
+	r = -ENOMEM;
+	dm_icomp_req_cachep = kmem_cache_create("dm_icomp_requests",
+		sizeof(struct dm_icomp_req), 0, 0, NULL);
+	if (!dm_icomp_req_cachep) {
+		DMWARN("Can't create request cache");
+		goto err;
+	}
+
+	dm_icomp_io_range_cachep = kmem_cache_create("dm_icomp_io_range",
+		sizeof(struct dm_icomp_io_range), 0, 0, NULL);
+	if (!dm_icomp_io_range_cachep) {
+		DMWARN("Can't create io_range cache");
+		goto err;
+	}
+
+	dm_icomp_meta_io_cachep = kmem_cache_create("dm_icomp_meta_io",
+		sizeof(struct dm_icomp_meta_io), 0, 0, NULL);
+	if (!dm_icomp_meta_io_cachep) {
+		DMWARN("Can't create meta_io cache");
+		goto err;
+	}
+
+	dm_icomp_wq = alloc_workqueue("dm_icomp_io",
+		WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+	if (!dm_icomp_wq) {
+		DMWARN("Can't create io workqueue");
+		goto err;
+	}
+
+	r = dm_register_target(&dm_icomp_target);
+	if (r < 0) {
+		DMWARN("target registration failed");
+		goto err;
+	}
+
+	for_each_possible_cpu(r) {
+		INIT_LIST_HEAD(&dm_icomp_io_workers[r].pending);
+		spin_lock_init(&dm_icomp_io_workers[r].lock);
+		INIT_WORK(&dm_icomp_io_workers[r].work,
+				dm_icomp_do_request_work);
+	}
+	return 0;
+err:
+	kmem_cache_destroy(dm_icomp_req_cachep);
+	kmem_cache_destroy(dm_icomp_io_range_cachep);
+	kmem_cache_destroy(dm_icomp_meta_io_cachep);
+	if (dm_icomp_wq)
+		destroy_workqueue(dm_icomp_wq);
+
+	return r;
+}
+
+static void __exit dm_icomp_exit(void)
+{
+	dm_unregister_target(&dm_icomp_target);
+	kmem_cache_destroy(dm_icomp_req_cachep);
+	kmem_cache_destroy(dm_icomp_io_range_cachep);
+	kmem_cache_destroy(dm_icomp_meta_io_cachep);
+	destroy_workqueue(dm_icomp_wq);
+}
+
+module_init(dm_icomp_init);
+module_exit(dm_icomp_exit);
+
+MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
+MODULE_DESCRIPTION(DM_NAME " target with data inplace-compression");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
new file mode 100644
index 0000000..757799a
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.h
@@ -0,0 +1,187 @@
+#ifndef __DM_INPLACE_COMPRESS_H__
+#define __DM_INPLACE_COMPRESS_H__
+#include <linux/types.h>
+
+#define DMCP_SUPER_MAGIC 0x106526c206506c09
+#define DMCP_COMPRESS_MAGIC 0xfaceecaf
+struct dm_icomp_super_block {
+	__le64 magic;
+	__le64 meta_blocks;
+	__le64 data_blocks;
+	u8 comp_alg;
+} __packed;
+
+#define DMCP_COMP_ALG_LZO 1
+#define DMCP_COMP_ALG_842 0
+
+#ifdef __KERNEL__
+/*
+ * Minium logical size of this target is 4096 byte, which is a block.
+ * Data of a block is compressed. Compressed data is round up to 512B, which is
+ * the payload. For each block, we have 5 bits meta data. bit 0 - 3 stands
+ * payload length(0 - 8 sectors). If compressed payload length is 8 sectors, we
+ * just store uncompressed data. Actual compressed data length is stored at the
+ * last 32 bits of payload if data is compressed. In disk, payload is stored at
+ * the beginning of logical sector of the block. If IO size is bigger than one
+ * block, we store the whole data as an extent. Bit 4 stands tail for an
+ * extent. Max allowed extent size is 128k.
+ */
+#define DMCP_BLOCK_SHIFT	12
+#define DMCP_BLOCK_SIZE		(1 << DMCP_BLOCK_SHIFT)
+#define DMCP_SECTOR_SHIFT	SECTOR_SHIFT
+#define DMCP_SECTOR_SIZE	(1 << SECTOR_SHIFT)
+#define DMCP_BLOCK_SECTOR_SHIFT (DMCP_BLOCK_SHIFT - DMCP_SECTOR_SHIFT)
+#define DMCP_BLOCK_TO_SECTOR(b) ((b) << DMCP_BLOCK_SECTOR_SHIFT)
+#define DMCP_SECTOR_TO_BLOCK(s) ((s) >> DMCP_BLOCK_SECTOR_SHIFT)
+#define DMCP_SECTOR_TO_BYTES(s) ((s) << DMCP_SECTOR_SHIFT)
+#define DMCP_BYTES_TO_SECTOR(b) ((b) >> DMCP_SECTOR_SHIFT)
+#define DMCP_BYTES_TO_BLOCK(b)	((b) >> DMCP_BLOCK_SHIFT)
+
+#define DMCP_MIN_SIZE	DMCP_BLOCK_SIZE
+#define DMCP_MAX_SIZE	(128 * 2 * DMCP_SECTOR_SIZE) /* 128k */
+
+#define DMCP_BITS_PER_ENTRY	32
+#define DMCP_META_BITS		5
+#define DMCP_LENGTH_BITS	4
+#define DMCP_TAIL_MASK		(1 << DMCP_LENGTH_BITS)
+#define DMCP_LENGTH_MASK	(DMCP_TAIL_MASK - 1)
+
+#define DMCP_META_START_SECTOR (DMCP_BLOCK_SIZE >> DMCP_SECTOR_SHIFT)
+
+enum DMCP_WRITE_MODE {
+	DMCP_WRITE_BACK,
+	DMCP_WRITE_THROUGH,
+};
+
+/*
+ * a lock spans 128 blocks i.e 512kbytes.
+ * max I/O is 128K, which can at-most span two locks.
+ */
+#define BITMAP_HASH_SHIFT 7
+#define BITMAP_HASH_LEN (1<<6)
+#define BITMAP_HASH_MASK (BITMAP_HASH_LEN - 1)
+struct dm_icomp_hash_lock {
+	int io_running;
+	spinlock_t wait_lock;
+	struct list_head wait_list;
+};
+
+struct dm_icomp_info {
+	struct dm_target *ti;
+	struct dm_dev *dev;
+
+	int comp_alg;
+	bool critical;
+	struct crypto_comp *tfm[NR_CPUS];
+
+	sector_t total_sector;	/* total sectors in the backing store */
+	sector_t data_start;
+	u64 data_blocks;
+	u64 no_of_sectors;
+
+	u32 *meta_bitmap;
+	u64 meta_bitmap_bits;
+	u64 meta_bitmap_pages;
+	struct dm_icomp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
+
+	enum DMCP_WRITE_MODE write_mode;
+	unsigned int writeback_delay; /* second */
+	struct task_struct *writeback_tsk;
+	struct dm_io_client *io_client;
+
+	atomic64_t compressed_write_size;
+	atomic64_t uncompressed_write_size;
+	atomic64_t meta_write_size;
+};
+
+struct dm_icomp_meta_io {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	void *data;
+	void (*fn)(void *data, unsigned long error);
+};
+
+struct dm_icomp_io_range {
+	struct dm_io_request io_req;
+	struct dm_io_region io_region;
+	bool decomp_kmap;	     /* Is the decomp_data kmapped'? */
+	void *decomp_data;
+	void *decomp_real_data;      /* holds the actual start of the buffer */
+	unsigned int decomp_len;     /* actual allocated/mapped length */
+	unsigned int logical_bytes;  /* decompressed size of the extent */
+	bool comp_kmap;		     /* Is the comp_data kmapped'? */
+	void *comp_data;
+	void *comp_real_data;	     /* holds the actual start of the buffer */
+	unsigned int comp_len;	     /* actual allocated/mapped length */
+	unsigned int data_bytes;     /* compressed size of the extent */
+	struct list_head next;
+	struct dm_icomp_req *req;
+};
+
+enum DMCP_REQ_STAGE {
+	STAGE_INIT,
+	STAGE_READ_EXISTING,
+	STAGE_READ_DECOMP,
+	STAGE_WRITE_COMP,
+	STAGE_DONE,
+};
+
+struct dm_icomp_req {
+	struct bio *bio;
+	struct dm_icomp_info *info;
+	struct list_head sibling;
+	struct list_head all_io;
+	atomic_t io_pending;
+	enum DMCP_REQ_STAGE stage;
+	struct dm_icomp_hash_lock *locks[2];
+	int locked_locks;
+	int result;
+	int cpu;
+	struct work_struct work;
+};
+
+struct dm_icomp_io_worker {
+	struct list_head pending;
+	spinlock_t lock;
+	struct work_struct work;
+};
+
+struct dm_icomp_compressor_data {
+	char *name;
+	bool can_handle_overflow;
+	int (*comp_len)(int comp_len);
+	int (*max_comp_len)(int comp_len);
+};
+
+static inline int lzo_comp_len(int comp_len)
+{
+	/* lzo compression overshoots the comp buffer
+	 * if the buffer size is insufficient.
+	 * Once that bug is fixed we can return half
+	 * the length.
+	 *
+	 * return lzo1x_worst_compress(comp_len) >> 1;
+	 *
+	 * For now its the full length.
+	 */
+	return lzo1x_worst_compress(comp_len);
+}
+
+static inline int lzo_max_comp_len(int comp_len)
+{
+	return lzo1x_worst_compress(comp_len);
+}
+
+static inline int nx842_comp_len(int comp_len)
+{
+	return (comp_len >> 1);
+}
+
+static inline int nx842_max_comp_len(int comp_len)
+{
+	return comp_len;
+}
+
+#endif /* __KERNEL__ */
+
+#endif /* __DM_INPLACE_COMPRESS_H__ */
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH --resend 1/2] md: disable WRITE SAME if it fails for linear/raid0
From: NeilBrown @ 2017-02-13 23:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <c98a106b47626695f3b6976b616230da4b638df8.1487015420.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 2568 bytes --]

On Mon, Feb 13 2017, Shaohua Li wrote:

> This makes md do the same thing as dm for write same IO failure. Please
> see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
> this.
>
> Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/linear.c |  6 +++++-
>  drivers/md/md.c     | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/md.h     |  2 ++
>  drivers/md/raid0.c  |  6 +++++-
>  4 files changed, 57 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index 26a73b2..bebc834 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -291,7 +291,11 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
>  				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
>  						      split, disk_devt(mddev->gendisk),
>  						      bio_sector);
> -			generic_make_request(split);
> +			if (bio_op(split) == REQ_OP_WRITE_SAME)
> +				generic_make_request(md_writesame_setup(mddev,
> +								split));
> +			else
> +				generic_make_request(split);
>  		}
>  	} while (split != bio);
>  	return;
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 13020e5..7354f0b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -312,6 +312,51 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  	return BLK_QC_T_NONE;
>  }
>  
> +struct md_writesame_data {
> +	struct bio *orig_bio;
> +	struct mddev *mddev;
> +	struct bio cloned_bio;
> +};
> +
> +static void md_writesame_endio(struct bio *bio)
> +{
> +	struct md_writesame_data *data = bio->bi_private;
> +
> +	if (bio->bi_error == -EREMOTEIO &&
> +	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
> +		data->mddev->queue->limits.max_write_same_sectors = 0;

What would be *really* nice is if a block device could send a
reconfigure message to its 'holder' (bd_holder).  This could include
device size changes and, for this, changes to max_write_same_sectors.
There are probably other changes that can usefully be propagated.

But for this patch, wouldn't it be easier, and maybe more efficient, to
do

  if (bio_op(split) == REQ_OP_WRITE_SAME &&
      !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
           mddev->queue->limits.max_write_same_sectors = 0;
  generic_make_request(split);

???
If there is some reason that can't work, then the patch as it stands
look OK to me.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH --resend 1/2] md: disable WRITE SAME if it fails for linear/raid0
From: Shaohua Li @ 2017-02-14  0:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid
In-Reply-To: <8737fh4pzr.fsf@notabene.neil.brown.name>

On Tue, Feb 14, 2017 at 10:42:32AM +1100, Neil Brown wrote:
> On Mon, Feb 13 2017, Shaohua Li wrote:
> 
> > This makes md do the same thing as dm for write same IO failure. Please
> > see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
> > this.
> >
> > Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/linear.c |  6 +++++-
> >  drivers/md/md.c     | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >  drivers/md/md.h     |  2 ++
> >  drivers/md/raid0.c  |  6 +++++-
> >  4 files changed, 57 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> > index 26a73b2..bebc834 100644
> > --- a/drivers/md/linear.c
> > +++ b/drivers/md/linear.c
> > @@ -291,7 +291,11 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
> >  				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
> >  						      split, disk_devt(mddev->gendisk),
> >  						      bio_sector);
> > -			generic_make_request(split);
> > +			if (bio_op(split) == REQ_OP_WRITE_SAME)
> > +				generic_make_request(md_writesame_setup(mddev,
> > +								split));
> > +			else
> > +				generic_make_request(split);
> >  		}
> >  	} while (split != bio);
> >  	return;
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 13020e5..7354f0b 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -312,6 +312,51 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
> >  	return BLK_QC_T_NONE;
> >  }
> >  
> > +struct md_writesame_data {
> > +	struct bio *orig_bio;
> > +	struct mddev *mddev;
> > +	struct bio cloned_bio;
> > +};
> > +
> > +static void md_writesame_endio(struct bio *bio)
> > +{
> > +	struct md_writesame_data *data = bio->bi_private;
> > +
> > +	if (bio->bi_error == -EREMOTEIO &&
> > +	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
> > +		data->mddev->queue->limits.max_write_same_sectors = 0;
> 
> What would be *really* nice is if a block device could send a
> reconfigure message to its 'holder' (bd_holder).  This could include
> device size changes and, for this, changes to max_write_same_sectors.
> There are probably other changes that can usefully be propagated.
> 
> But for this patch, wouldn't it be easier, and maybe more efficient, to
> do
> 
>   if (bio_op(split) == REQ_OP_WRITE_SAME &&
>       !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
>            mddev->queue->limits.max_write_same_sectors = 0;
>   generic_make_request(split);
> 
> ???
> If there is some reason that can't work, then the patch as it stands
> look OK to me.

So we don't disable writesame in the first IO error and do it until a new
writesame comes? Good idea and much simpler! Let me check if it works.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 1/1] DM: inplace compressed DM target
From: kbuild test robot @ 2017-02-14  0:16 UTC (permalink / raw)
  Cc: kbuild-all, linux-doc, linux-kernel, dm-devel, linux-raid, agk,
	snitzer, corbet, shli, hbabu, linuxram
In-Reply-To: <1487018545-5061-2-git-send-email-linuxram@us.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2426 bytes --]

Hi Ram,

[auto build test WARNING on dm/for-next]
[also build test WARNING on v4.10-rc8 next-20170213]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ram-Pai/DM-inplace-compressed-DM-target/20170214-055727
base:   https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git for-next
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All warnings (new ones prefixed by >>):

   drivers/md/dm-inplace-compress.c: In function 'dm_icomp_status':
>> drivers/md/dm-inplace-compress.c:2096:3: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long int' [-Wformat=]
      DMEMIT("%ld %ld %ld",
      ^
   drivers/md/dm-inplace-compress.c:2096:3: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'long long int' [-Wformat=]
   drivers/md/dm-inplace-compress.c:2096:3: warning: format '%ld' expects argument of type 'long int', but argument 6 has type 'long long int' [-Wformat=]

vim +2096 drivers/md/dm-inplace-compress.c

  2080		req->locked_locks = 0;
  2081	
  2082		req->cpu = raw_smp_processor_id();
  2083		dm_icomp_queue_req(info, req);
  2084	
  2085		return DM_MAPIO_SUBMITTED;
  2086	}
  2087	
  2088	static void dm_icomp_status(struct dm_target *ti, status_type_t type,
  2089		  unsigned int status_flags, char *result, unsigned int maxlen)
  2090	{
  2091		struct dm_icomp_info *info = ti->private;
  2092		unsigned int sz = 0;
  2093	
  2094		switch (type) {
  2095		case STATUSTYPE_INFO:
> 2096			DMEMIT("%ld %ld %ld",
  2097				atomic64_read(&info->uncompressed_write_size),
  2098				atomic64_read(&info->compressed_write_size),
  2099				atomic64_read(&info->meta_write_size));
  2100			break;
  2101		case STATUSTYPE_TABLE:
  2102			if (info->write_mode == DMCP_WRITE_BACK)
  2103				DMEMIT("%s %s:%d %s:%s %s:%d", info->dev->name,
  2104					"writeback", info->writeback_delay,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 38918 bytes --]

^ permalink raw reply

* [PATCH V2] md: disable WRITE SAME if it fails in underlayer disks
From: Shaohua Li @ 2017-02-14  0:21 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
this.

We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/linear.c    | 1 +
 drivers/md/md.h        | 7 +++++++
 drivers/md/multipath.c | 1 +
 drivers/md/raid0.c     | 1 +
 4 files changed, 10 insertions(+)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 26a73b2..789008b 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -291,6 +291,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
 				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
+			mddev_check_writesame(mddev, split);
 			generic_make_request(split);
 		}
 	} while (split != bio);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a51403..42f8398 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -710,4 +710,11 @@ static inline void mddev_clear_unsupported_flags(struct mddev *mddev,
 {
 	mddev->flags &= ~unsupported_flags;
 }
+
+static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
+{
+	if (bio_op(bio) == REQ_OP_WRITE_SAME &&
+	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
+		mddev->queue->limits.max_write_same_sectors = 0;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index aa8c4e5c..065fe28 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -138,6 +138,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
 	mp_bh->bio.bi_opf |= REQ_FAILFAST_TRANSPORT;
 	mp_bh->bio.bi_end_io = multipath_end_request;
 	mp_bh->bio.bi_private = mp_bh;
+	mddev_check_writesame(mddev, &mp_bh->bio);
 	generic_make_request(&mp_bh->bio);
 	return;
 }
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 848365d..b3d2644 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -503,6 +503,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
 				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
 						      split, disk_devt(mddev->gendisk),
 						      bio_sector);
+			mddev_check_writesame(mddev, split);
 			generic_make_request(split);
 		}
 	} while (split != bio);
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()
From: Ming Lei @ 2017-02-14  1:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shaohua Li, Jens Axboe, Linux Kernel Mailing List,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block,
	NeilBrown
In-Reply-To: <20170213134654.GB22818@infradead.org>

On Mon, Feb 13, 2017 at 9:46 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote:
>> md still need bio clone(not the fast version) for behind write,
>> and it is more efficient to use bio_clone_bioset_partial().
>>
>> The idea is simple and just copy the bvecs range specified from
>> parameters.
>
> Given how few users bio_clone_bioset has I wonder if we shouldn't
> simply add the two new arguments to it instead of adding another
> indirection.

For md write-behind, looks we have to provide the two arguments,
could you explain a bit how we can do that by adding another indirection?

>
> Otherwise looks fine:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH V2] md: disable WRITE SAME if it fails in underlayer disks
From: NeilBrown @ 2017-02-14  2:39 UTC (permalink / raw)
  To: Shaohua Li, linux-raid
In-Reply-To: <728f3fb3d63e6c9512a68fe5fdc61812d95aa2bd.1487031632.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 3152 bytes --]

On Mon, Feb 13 2017, Shaohua Li wrote:

> This makes md do the same thing as dm for write same IO failure. Please
> see 7eee4ae(dm: disable WRITE SAME if it fails) for details why we need
> this.
>
> We did a little bit different than dm. Instead of disabling writesame in
> the first IO error, we disable it till next writesame IO coming after
> the first IO error. This way we don't need to clone a bio.
>
> Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581
>
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Looks good to be, thanks.
  Acked-by: NeilBrown <neilb@suse.com>

NeilBrown

> ---
>  drivers/md/linear.c    | 1 +
>  drivers/md/md.h        | 7 +++++++
>  drivers/md/multipath.c | 1 +
>  drivers/md/raid0.c     | 1 +
>  4 files changed, 10 insertions(+)
>
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index 26a73b2..789008b 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -291,6 +291,7 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
>  				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
>  						      split, disk_devt(mddev->gendisk),
>  						      bio_sector);
> +			mddev_check_writesame(mddev, split);
>  			generic_make_request(split);
>  		}
>  	} while (split != bio);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2a51403..42f8398 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -710,4 +710,11 @@ static inline void mddev_clear_unsupported_flags(struct mddev *mddev,
>  {
>  	mddev->flags &= ~unsupported_flags;
>  }
> +
> +static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
> +{
> +	if (bio_op(bio) == REQ_OP_WRITE_SAME &&
> +	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
> +		mddev->queue->limits.max_write_same_sectors = 0;
> +}
>  #endif /* _MD_MD_H */
> diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
> index aa8c4e5c..065fe28 100644
> --- a/drivers/md/multipath.c
> +++ b/drivers/md/multipath.c
> @@ -138,6 +138,7 @@ static void multipath_make_request(struct mddev *mddev, struct bio * bio)
>  	mp_bh->bio.bi_opf |= REQ_FAILFAST_TRANSPORT;
>  	mp_bh->bio.bi_end_io = multipath_end_request;
>  	mp_bh->bio.bi_private = mp_bh;
> +	mddev_check_writesame(mddev, &mp_bh->bio);
>  	generic_make_request(&mp_bh->bio);
>  	return;
>  }
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 848365d..b3d2644 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -503,6 +503,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
>  				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
>  						      split, disk_devt(mddev->gendisk),
>  						      bio_sector);
> +			mddev_check_writesame(mddev, split);
>  			generic_make_request(split);
>  		}
>  	} while (split != bio);
> -- 
> 2.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 1/5] MD: attach data to each bio
From: NeilBrown @ 2017-02-14  2:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Shaohua Li, linux-raid, khlebnikov, hch
In-Reply-To: <20170213184942.x3s2hawueq3ryzj3@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3753 bytes --]

On Mon, Feb 13 2017, Shaohua Li wrote:

> On Mon, Feb 13, 2017 at 08:49:33PM +1100, Neil Brown wrote:
>> On Thu, Feb 09 2017, Shaohua Li wrote:
>> 
>> > On Fri, Feb 10, 2017 at 05:08:54PM +1100, Neil Brown wrote:
>> >> On Tue, Feb 07 2017, Shaohua Li wrote:
>> >> 
>> >> > Currently MD is rebusing some bio fields. To remove the hack, we attach
>> >> > extra data to each bio. Each personablity can attach extra data to the
>> >> > bios, so we don't need to rebuse bio fields.
>> >> 
>> >> I must say that I don't really like this approach.
>> >> Temporarily modifying ->bi_private and ->bi_end_io seems
>> >> .... intrusive.   I suspect it works, but I wonder if it is really
>> >> robust in the long term.
>> >> 
>> >> How about a different approach..  Your main concern with my first patch
>> >> was that it called md_write_start() and md_write_end() much more often,
>> >> and these performed atomic ops on "global" variables, particular
>> >> writes_pending.
>> >> 
>> >> We could change writes_pending to a per-cpu array which we only count
>> >> occasionally when needed.  As writes_pending is updated often and
>> >> checked rarely, a per-cpu array which is summed on demand seems
>> >> appropriate.
>> >> 
>> >> The following patch is an early draft - it doesn't obviously fail and
>> >> isn't obviously wrong to me.  There is certainly room for improvement
>> >> and may be bugs.
>> >> Next week I'll work on collection the re-factoring into separate
>> >> patches, which are possible good-to-have anyway.
>> >
>> > For your first patch, I don't have much concern. It's ok to me. What I don't
>> > like is the bi_phys_segments handling part. The patches add a lot of logic to
>> > handle the reference count. They should work, but I'd say it's not easy to
>> > understand and could be error prone. What we really need is a reference count
>> > for the bio, so let's just add a reference count. That's my logic and it's
>> > simple.
>> 
>> We already have two reference counts, and you want to add a third one.
>> 
>> bi_phys_segments is currently used for two related purposes.
>> It counts the number of stripe_heads currently attached to the bio so
>> that when the count reaches zero:
>>  1/ ->writes_pending can be decremented
>>  2/ bio_endio() can be called.
>> 
>> When the code was written, the __bi_remaining counter didn't exist.  Now
>> it does and it is integrated with bio_endio() so it should make the code
>> easier to understand if we just use bio_endio() rather and doing our own
>> accounting.
>> 
>> That just leaves '1'.  We can easily decrement ->writes_pending directly
>> instead of decrementing a per-bio refcount, and then when it reaches
>> zero, decrement ->writes_pending.  As you pointed out, that comes with a
>> cost.  If ->writes_pending is changed to a per-cpu array which is summed
>> on demand, the cost goes away.
>> 
>> Having an extra refcount in the bio just adds a level of indirection
>> that doesn't (that I can see) provide actual value.
>
> Ok, fair enough. I do think an explict counter in the driver side will help us
> a lot, eg, we can better control when to endio and do something there (for
> example the current blk trace, or something we want to add in the future). But
> I don't insist currently.
>
> For the patches, can you repost? I think:
> - patch 2 missed md_write_start for make_discard_request
> - It's unnecessary to zero bi_phys_segments in patch 5. And raid5-cache need do
>   the same change of bio_endio.
> For the md_write_start optimization, we can do it later.

Sure. I agree those two changes are needed.  I'll try to send something
in the next day or so.

NeilBrown


>
> Thanks,
> Shaohua

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v4 1/1] DM: inplace compressed DM target
From: kbuild test robot @ 2017-02-14  3:59 UTC (permalink / raw)
  Cc: kbuild-all, linux-doc, linux-kernel, dm-devel, linux-raid, agk,
	snitzer, corbet, shli, hbabu, linuxram
In-Reply-To: <1487018545-5061-2-git-send-email-linuxram@us.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]

Hi Ram,

[auto build test WARNING on dm/for-next]
[also build test WARNING on v4.10-rc8 next-20170213]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ram-Pai/DM-inplace-compressed-DM-target/20170214-055727
base:   https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git for-next
config: openrisc-allyesconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All warnings (new ones prefixed by >>):

   drivers/md/dm-inplace-compress.c: In function 'dm_icomp_status':
>> drivers/md/dm-inplace-compress.c:2096:3: warning: format '%ld' expects type 'long int', but argument 4 has type 'long long int'
   drivers/md/dm-inplace-compress.c:2096:3: warning: format '%ld' expects type 'long int', but argument 5 has type 'long long int'
   drivers/md/dm-inplace-compress.c:2096:3: warning: format '%ld' expects type 'long int', but argument 6 has type 'long long int'

vim +2096 drivers/md/dm-inplace-compress.c

  2080		req->locked_locks = 0;
  2081	
  2082		req->cpu = raw_smp_processor_id();
  2083		dm_icomp_queue_req(info, req);
  2084	
  2085		return DM_MAPIO_SUBMITTED;
  2086	}
  2087	
  2088	static void dm_icomp_status(struct dm_target *ti, status_type_t type,
  2089		  unsigned int status_flags, char *result, unsigned int maxlen)
  2090	{
  2091		struct dm_icomp_info *info = ti->private;
  2092		unsigned int sz = 0;
  2093	
  2094		switch (type) {
  2095		case STATUSTYPE_INFO:
> 2096			DMEMIT("%ld %ld %ld",
  2097				atomic64_read(&info->uncompressed_write_size),
  2098				atomic64_read(&info->compressed_write_size),
  2099				atomic64_read(&info->meta_write_size));
  2100			break;
  2101		case STATUSTYPE_TABLE:
  2102			if (info->write_mode == DMCP_WRITE_BACK)
  2103				DMEMIT("%s %s:%d %s:%s %s:%d", info->dev->name,
  2104					"writeback", info->writeback_delay,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40204 bytes --]

^ permalink raw reply

* Re: [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-14  4:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos-dThk-h09Yox5AN37po3OqR1oReoZPfqjZVqTqVkCxUA@mail.gmail.com>

On Mon, Feb 13, 2017 at 2:43 PM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Fri, Feb 10, 2017 at 11:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Fri, Feb 10, 2017 at 1:07 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>> The Broadcom stream buffer accelerator (SBA) provides offloading
>>> capabilities for RAID operations. This SBA offload engine is
>>> accessible via Broadcom SoC specific ring manager.
>>>
>>> This patch adds Broadcom SBA RAID driver which provides one
>>> DMA device with RAID capabilities using one or more Broadcom
>>> SoC specific ring manager channels. The SBA RAID driver in its
>>> current shape implements memcpy, xor, and pq operations.
>>>
>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
>>> ---
>>>  drivers/dma/Kconfig        |   13 +
>>>  drivers/dma/Makefile       |    1 +
>>>  drivers/dma/bcm-sba-raid.c | 1711 ++++++++++++++++++++++++++++++++++++++++++++
>>>  3 files changed, 1725 insertions(+)
>>>  create mode 100644 drivers/dma/bcm-sba-raid.c
>>>
>>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>>> index 263495d..bf8fb84 100644
>>> --- a/drivers/dma/Kconfig
>>> +++ b/drivers/dma/Kconfig
>>> @@ -99,6 +99,19 @@ config AXI_DMAC
>>>           controller is often used in Analog Device's reference designs for FPGA
>>>           platforms.
>>>
>>> +config BCM_SBA_RAID
>>> +       tristate "Broadcom SBA RAID engine support"
>>> +       depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>>> +       select DMA_ENGINE
>>> +       select DMA_ENGINE_RAID
>>> +       select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>
>> ASYNC_TX_ENABLE_CHANNEL_SWITCH violates the DMA mapping API and
>> Russell has warned it's especially problematic on ARM [1].  If you
>> need channel switching for this offload engine to be useful then you
>> need to move DMA mapping and channel switching responsibilities to MD
>> itself.
>>
>> [1]: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html
>
> Actually driver works fine with/without
> ASYNC_TX_ENABLE_CHANNEL_SWITCH enabled
> so I am fine with removing dependency on this config option.

I stand corrected.

Previously, when I had tried removing
ASYNC_TX_ENABLE_CHANNEL_SWITCH
from BCM_SBA_RAID config option it worked because other
drivers such xgene-dma and mv_xor_v2 are selecting this option.

The BCM-SBA-RAID driver requires
ASYNC_TX_ENABLE_CHANNEL_SWITCH option

There is no issue reported for
ASYNC_TX_ENABLE_CHANNEL_SWITCH
with ARM64 kernel.

The issue you pointed out was with ARM kernel.

We will have to select
ASYNC_TX_ENABLE_CHANNEL_SWITCH
for BCM-SBA-RAID driver
just like other ARM64 RAID drivers such
as xgene-dma and mv_xor_v2.
(Refer, XGENE_DMA and MV_XOR_V2 options)

Regards,
Anup

^ permalink raw reply

* Re: [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-14  5:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback,
	dmaengine-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Device Tree,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA, linux-raid
In-Reply-To: <CAPcyv4hE5gDiHhfaiHDHbhA2xKa45UdzKcSxnQXK-W92sr3Z1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Feb 10, 2017 at 11:20 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, Feb 10, 2017 at 1:07 AM, Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>> The Broadcom stream buffer accelerator (SBA) provides offloading
>> capabilities for RAID operations. This SBA offload engine is
>> accessible via Broadcom SoC specific ring manager.
>>
>> This patch adds Broadcom SBA RAID driver which provides one
>> DMA device with RAID capabilities using one or more Broadcom
>> SoC specific ring manager channels. The SBA RAID driver in its
>> current shape implements memcpy, xor, and pq operations.
>>
>> Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>> Reviewed-by: Ray Jui <ray.jui-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
>> ---
>>  drivers/dma/Kconfig        |   13 +
>>  drivers/dma/Makefile       |    1 +
>>  drivers/dma/bcm-sba-raid.c | 1711 ++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 1725 insertions(+)
>>  create mode 100644 drivers/dma/bcm-sba-raid.c
>>
>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>> index 263495d..bf8fb84 100644
>> --- a/drivers/dma/Kconfig
>> +++ b/drivers/dma/Kconfig
>> @@ -99,6 +99,19 @@ config AXI_DMAC
>>           controller is often used in Analog Device's reference designs for FPGA
>>           platforms.
>>
>> +config BCM_SBA_RAID
>> +       tristate "Broadcom SBA RAID engine support"
>> +       depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>> +       select DMA_ENGINE
>> +       select DMA_ENGINE_RAID
>> +       select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>
> ASYNC_TX_ENABLE_CHANNEL_SWITCH violates the DMA mapping API and
> Russell has warned it's especially problematic on ARM [1].  If you
> need channel switching for this offload engine to be useful then you
> need to move DMA mapping and channel switching responsibilities to MD
> itself.
>
> [1]: http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html

In case of BCM-SBA-RAID, the underlying "struct device" for each
DMA channel is the mailbox controller "struct device" (i.e. Ring
Manager device). This is because Ring Manager HW is the front
facing device which we program to submit work to BCM-SBA-RAID
HW.

This means DMA channels provided by BCM-SBA-RAID driver
will use same "struct device" for DMA mappings hence channel
switching between BCM-SBA-RAID DMA channels is safe.

Due to above, we can safely enable
ASYNC_TX_ENABLE_CHANNEL_SWITCH option for
BCM-SBA-RAID driver.

Regards,
Anup
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] Monitor: triggers core dump when stat2devnm return NULL
From: Zhilong Liu @ 2017-02-14  5:31 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

    ensure that the devnm should be a block device when uses
    --wait parameter, such as the 'f' and 'd' type file would
    be triggered core dumped.

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/Monitor.c b/Monitor.c
index 802a9d9..1900db3 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -1002,6 +1002,10 @@ int Wait(char *dev)
 			strerror(errno));
 		return 2;
 	}
+	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
+		pr_err("%s is not a block device.\n", dev);
+		return 2;
+	}
 	strcpy(devnm, stat2devnm(&stb));
 
 	while(1) {
diff --git a/lib.c b/lib.c
index b640634..7116298 100644
--- a/lib.c
+++ b/lib.c
@@ -89,9 +89,6 @@ char *devid2kname(int devid)
 
 char *stat2kname(struct stat *st)
 {
-	if ((S_IFMT & st->st_mode) != S_IFBLK)
-		return NULL;
-
 	return devid2kname(st->st_rdev);
 }
 
-- 
2.6.6


^ permalink raw reply related

* Re: [PATCH v4 1/1] DM: inplace compressed DM target (fwd)
From: Julia Lawall @ 2017-02-14  6:09 UTC (permalink / raw)
  Cc: linux-doc, linux-kernel, dm-devel, linux-raid, agk, snitzer,
	corbet, shli, hbabu, linuxram

On line 1759, since ret is unsigned it will not be less than 0.

julia

---------- Forwarded message ----------
Date: Tue, 14 Feb 2017 09:00:39 +0800
From: kbuild test robot <fengguang.wu@intel.com>
To: kbuild@01.org
Cc: Julia Lawall <julia.lawall@lip6.fr>
Subject: Re: [PATCH v4 1/1] DM: inplace compressed DM target

CC: kbuild-all@01.org
In-Reply-To: <1487018545-5061-2-git-send-email-linuxram@us.ibm.com>

Hi Ram,

[auto build test WARNING on dm/for-next]
[also build test WARNING on v4.10-rc8 next-20170213]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ram-Pai/DM-inplace-compressed-DM-target/20170214-055727
base:   https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git for-next
:::::: branch date: 3 hours ago
:::::: commit date: 3 hours ago

>> drivers/md/dm-inplace-compress.c:1759:5-8: WARNING: Unsigned expression compared with zero: ret < 0

git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout e7924efaaba5efdcd28f32efbb949ed1153c932c
vim +1759 drivers/md/dm-inplace-compress.c

e7924efa Ram Pai 2017-02-13  1743   * @io : io range
e7924efa Ram Pai 2017-02-13  1744   * @sector_start : the sector on backing storage to which the
e7924efa Ram Pai 2017-02-13  1745   *	compressed data needs to be written.
e7924efa Ram Pai 2017-02-13  1746   * @meta_start: the page index of the bits corresponding to
e7924efa Ram Pai 2017-02-13  1747   * @meta_end  : start and end blocks.
e7924efa Ram Pai 2017-02-13  1748   */
e7924efa Ram Pai 2017-02-13  1749  static int dm_icomp_compress_write(struct dm_icomp_io_range *io,
e7924efa Ram Pai 2017-02-13  1750  		sector_t sector_start, u64 *meta_start, u64 *meta_end)
e7924efa Ram Pai 2017-02-13  1751  {
e7924efa Ram Pai 2017-02-13  1752  	struct dm_icomp_req *req = io->req;
e7924efa Ram Pai 2017-02-13  1753  	sector_t count = DMCP_BYTES_TO_SECTOR(io->decomp_len);
e7924efa Ram Pai 2017-02-13  1754  	unsigned int comp_len, ret;
e7924efa Ram Pai 2017-02-13  1755  	u64 page_index;
e7924efa Ram Pai 2017-02-13  1756
e7924efa Ram Pai 2017-02-13  1757  	/* comp_data must be able to accommadate a larger compress buffer */
e7924efa Ram Pai 2017-02-13  1758  	ret = dm_icomp_io_range_compress(req->info, io, &comp_len);
e7924efa Ram Pai 2017-02-13 @1759  	if (ret < 0) {
e7924efa Ram Pai 2017-02-13  1760  		req->result = -EIO;
e7924efa Ram Pai 2017-02-13  1761  		return -EIO;
e7924efa Ram Pai 2017-02-13  1762  	}
e7924efa Ram Pai 2017-02-13  1763  	WARN_ON(comp_len > io->comp_len);
e7924efa Ram Pai 2017-02-13  1764
e7924efa Ram Pai 2017-02-13  1765  	dm_icomp_get_req(req);
e7924efa Ram Pai 2017-02-13  1766
e7924efa Ram Pai 2017-02-13  1767  	io->io_req.bi_op = REQ_OP_WRITE;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH v4 0/4] Broadcom SBA RAID support
From: Anup Patel @ 2017-02-14  6:51 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA, Anup Patel

The Broadcom SBA RAID is a stream-based device which provides
RAID5/6 offload.

It requires a SoC specific ring manager (such as Broadcom FlexRM
ring manager) to provide ring-based programming interface. Due to
this, the Broadcom SBA RAID driver (mailbox client) implements
DMA device having one DMA channel using a set of mailbox channels
provided by Broadcom SoC specific ring manager driver (mailbox
controller).

The Broadcom SBA RAID hardware requires PQ disk position instead
of PQ disk coefficient. To address this, we have added raid_gflog
table which will help driver to convert PQ disk coefficient to PQ
disk position.

This patchset is based on Linux-4.10-rc2 and depends on patchset
"[PATCH v4 0/2] Broadcom FlexRM ring manager support"

It is also available at sba-raid-v4 branch of
https://github.com/Broadcom/arm64-linux.git

Changes since v3:
 - Replaced SBA_ENC() with sba_cmd_enc() inline function
 - Use list_first_entry_or_null() wherever possible
 - Remove unwanted brances around loops wherever possible
 - Use lockdep_assert_held() where required

Changes since v2:
 - Droped patch to handle DMA devices having support for fewer
   PQ coefficients in Linux Async Tx
 - Added work-around in bcm-sba-raid driver to handle unsupported
   PQ coefficients using multiple SBA requests

Changes since v1:
 - Droped patch to add mbox_channel_device() API
 - Used GENMASK and BIT macros wherever possible in bcm-sba-raid driver
 - Replaced C_MDATA macros with static inline functions in
   bcm-sba-raid driver
 - Removed sba_alloc_chan_resources() callback in bcm-sba-raid driver
 - Used dev_err() instead of dev_info() wherever applicable
 - Removed call to sba_issue_pending() from sba_tx_submit() in
   bcm-sba-raid driver
 - Implemented SBA request chaning for handling (len > sba->req_size)
   in bcm-sba-raid driver
 - Implemented device_terminate_all() callback in bcm-sba-raid driver

Anup Patel (4):
  lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
  async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
  dmaengine: Add Broadcom SBA RAID driver
  dt-bindings: Add DT bindings document for Broadcom SBA RAID driver

 .../devicetree/bindings/dma/brcm,iproc-sba.txt     |   29 +
 crypto/async_tx/async_pq.c                         |    5 +-
 drivers/dma/Kconfig                                |   13 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/bcm-sba-raid.c                         | 1694 ++++++++++++++++++++
 include/linux/raid/pq.h                            |    1 +
 lib/raid6/mktables.c                               |   20 +
 7 files changed, 1760 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
 create mode 100644 drivers/dma/bcm-sba-raid.c

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v4 1/4] lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
From: Anup Patel @ 2017-02-14  6:51 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA, Anup Patel
In-Reply-To: <1487055112-5185-1-git-send-email-anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

The raid6_gfexp table represents {2}^n values for 0 <= n < 256. The
Linux async_tx framework pass values from raid6_gfexp as coefficients
for each source to prep_dma_pq() callback of DMA channel with PQ
capability. This creates problem for RAID6 offload engines (such as
Broadcom SBA) which take disk position (i.e. log of {2}) instead of
multiplicative cofficients from raid6_gfexp table.

This patch adds raid6_gflog table having log-of-2 value for any given
x such that 0 <= x < 256. For any given disk coefficient x, the
corresponding disk position is given by raid6_gflog[x]. The RAID6
offload engine driver can use this newly added raid6_gflog table to
get disk position from multiplicative coefficient.

Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reviewed-by: Scott Branden <scott.branden-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reviewed-by: Ray Jui <ray.jui-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
---
 include/linux/raid/pq.h |  1 +
 lib/raid6/mktables.c    | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..30f9453 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -142,6 +142,7 @@ int raid6_select_algo(void);
 extern const u8 raid6_gfmul[256][256] __attribute__((aligned(256)));
 extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gflog[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 
diff --git a/lib/raid6/mktables.c b/lib/raid6/mktables.c
index 39787db..e824d08 100644
--- a/lib/raid6/mktables.c
+++ b/lib/raid6/mktables.c
@@ -125,6 +125,26 @@ int main(int argc, char *argv[])
 	printf("EXPORT_SYMBOL(raid6_gfexp);\n");
 	printf("#endif\n");
 
+	/* Compute log-of-2 table */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gflog[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			v = 255;
+			for (k = 0; k < 256; k++)
+				if (exptbl[k] == (i + j)) {
+					v = k;
+					break;
+				}
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+		}
+	}
+	printf("};\n");
+	printf("#ifdef __KERNEL__\n");
+	printf("EXPORT_SYMBOL(raid6_gflog);\n");
+	printf("#endif\n");
+
 	/* Compute inverse table x^-1 == x^254 */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
 	       "raid6_gfinv[256] =\n" "{\n");
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox