* Re: [PATCH 3/4] mdadm:triggers core dump when stat2devnm return NULL
From: zhilong @ 2017-03-13 5:34 UTC (permalink / raw)
To: NeilBrown, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <8760jef6d4.fsf@notabene.neil.brown.name>
On 03/13/2017 07:00 AM, NeilBrown wrote:
> On Wed, Mar 08 2017, Zhilong Liu wrote:
>
>> Ensure that the device is a block device when the --wait
>> parameter is used; passing a regular file ('f') or a
>> directory ('d') triggers a core dump.
>> For example, ./mdadm --wait /dev/md/ dumps core.
>>
>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>>
>> diff --git a/Monitor.c b/Monitor.c
>> index 802a9d9..1900db3 100644
>> --- a/Monitor.c
>> +++ b/Monitor.c
>> @@ -1002,6 +1002,10 @@ int Wait(char *dev)
>> strerror(errno));
>> return 2;
>> }
>> + if ((S_IFMT & stb.st_mode) != S_IFBLK) {
>> + pr_err("%s is not a block device.\n", dev);
>> + return 2;
>> + }
>> strcpy(devnm, stat2devnm(&stb));
> Surely it would be cleaner to do something like:
>
> tmp = stat2devnm(&stb);
> if (!tmp) {
> pr_err("%s is not a block device.\n", dev);
> return 2;
> }
> strcpy(devnm, tmp);
>
> This makes it more obvious how you have fixed the crash.
Yes, this approach is much better than mine. Many thanks for the
improvement.
>>
>> while(1) {
>> diff --git a/lib.c b/lib.c
>> index b640634..7116298 100644
>> --- a/lib.c
>> +++ b/lib.c
>> @@ -89,9 +89,6 @@ char *devid2kname(int devid)
>>
>> char *stat2kname(struct stat *st)
>> {
>> - if ((S_IFMT & st->st_mode) != S_IFBLK)
>> - return NULL;
>> -
>> return devid2kname(st->st_rdev);
>> }
> Why are you removing this test? It has nothing to do with the other
> part of the patch.
Yes, I will drop this part in the next revision.
Thanks,
-Zhilong
> NeilBrown
^ permalink raw reply
* [PATCH] md:fix a trivial typo in comments
From: Zhilong Liu @ 2017-03-13 5:53 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Zhilong Liu
fix a trivial typo in freeze_array() of raid1.c
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
drivers/md/raid1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..2ec0617 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -958,7 +958,7 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
static void freeze_array(struct r1conf *conf, int extra)
{
/* stop syncio and normal IO and wait for everything to
- * go quite.
+ * go quit.
* We wait until nr_pending match nr_queued+extra
* This is called in the context of one normal IO request
* that has failed. Thus any sync request that might be pending
--
2.6.6
^ permalink raw reply related
* [PATCH 3/4 v1] mdadm:triggers core dump when stat2devnm return NULL
From: Zhilong Liu @ 2017-03-13 7:01 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, neilb, Zhilong Liu
In-Reply-To: <8760jef6d4.fsf@notabene.neil.brown.name>
monitor: ensure that the device is a block device when the
--wait parameter is used; passing a regular file ('f') or a
directory ('d') triggers a core dump.
For example, ./mdadm --wait /dev/md/ dumps core.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
Monitor.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/Monitor.c b/Monitor.c
index 802a9d9..f8850d3 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -1002,7 +1002,12 @@ int Wait(char *dev)
strerror(errno));
return 2;
}
- strcpy(devnm, stat2devnm(&stb));
+ char *tmp = stat2devnm(&stb);
+ if (!tmp) {
+ pr_err("%s is not a block device.\n", dev);
+ return 2;
+ }
+ strcpy(devnm, tmp);
while(1) {
struct mdstat_ent *ms = mdstat_read(1, 0);
--
2.6.6
^ permalink raw reply related
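The check-before-copy pattern from the patch above can be sketched outside mdadm as follows. This is a minimal user-space illustration, not mdadm code: devnm_or_null() is a hypothetical stand-in for stat2devnm() that only mimics its "NULL for anything that is not a block device" behaviour, and copy_devnm() is a hypothetical wrapper playing the role of the start of Wait().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stand-in for stat2devnm(): yields NULL unless the
 * stat result describes a block device. The returned name is a
 * placeholder, not a real kernel device name. */
const char *devnm_or_null(const struct stat *st)
{
	if (!S_ISBLK(st->st_mode))
		return NULL;
	return "blockdev";
}

/* Returns 0 on success and 2 on failure, like the error paths in Wait(). */
int copy_devnm(const char *path, char *out, size_t outlen)
{
	struct stat stb;
	const char *tmp;

	if (stat(path, &stb) != 0) {
		fprintf(stderr, "Cannot find %s: %s\n", path, strerror(errno));
		return 2;
	}
	/* Check the helper's result before copying; copying from NULL
	 * is what produced the core dump. */
	tmp = devnm_or_null(&stb);
	if (!tmp) {
		fprintf(stderr, "%s is not a block device.\n", path);
		return 2;
	}
	snprintf(out, outlen, "%s", tmp);
	return 0;
}
```

Calling copy_devnm() on a directory such as "/" takes the "not a block device" path and returns 2 instead of crashing.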
* Re: [PATCH 2/4] mdadm:external bitmap only supports ext filesystem
From: zhilong @ 2017-03-13 8:16 UTC (permalink / raw)
To: Jes.Sorensen, Shaohua Li; +Cc: linux-raid
In-Reply-To: <20170308075144.24873-1-zlliu@suse.com>
Since the purpose is only to improve the message shown when using
external bitmap mode, pushing this patch to mdadm may be a little redundant.
As for the errno, RUN_ARRAY does return EINVAL, and the man page
states that an external bitmap only works on an ext[2-4] file system.
I think it would be more user-friendly to print a message as well as
returning EINVAL when bmap() fails, so that the user knows where the
EINVAL comes from.
For example:
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9fb2cca..0bff96b 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -381,6 +381,7 @@ static int read_page(struct file *file, unsigned long index,
bh->b_blocknr = bmap(inode, block);
if (bh->b_blocknr == 0) {
/* Cannot use this file! */
+ pr_err("writing external bitmap only supports a file under ext[2-4] filesystem.\n");
ret = -EINVAL;
goto out;
}
Thanks,
-Zhilong
On 03/08/2017 03:51 PM, Zhilong Liu wrote:
> mdadm: ensure that the external bitmap_file is
> stored on an ext[2-4] file system, because the bmap()
> call in drivers/md/bitmap.c fails immediately when
> the bitmap_file isn't suitable. mdadm should make
> users aware of this scenario and print a message.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Create.c b/Create.c
> index 2721884..9a951b0 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -831,11 +831,6 @@ int Create(struct supertype *st, char *mddev,
> goto abort_locked;
> }
> bitmap_fd = open(s->bitmap_file, O_RDWR);
> - if (bitmap_fd < 0) {
> - pr_err("weird: %s cannot be openned\n",
> - s->bitmap_file);
> - goto abort_locked;
> - }
> if (ioctl(mdfd, SET_BITMAP_FILE, bitmap_fd) < 0) {
> pr_err("Cannot set bitmap file for %s: %s\n",
> mddev, strerror(errno));
> diff --git a/mdadm.c b/mdadm.c
> index d6ad8dc..19a06db 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -28,6 +28,7 @@
> #include "mdadm.h"
> #include "md_p.h"
> #include <ctype.h>
> +#include <sys/vfs.h>
>
> static int scan_assemble(struct supertype *ss,
> struct context *c,
> @@ -1143,6 +1144,21 @@ int main(int argc, char *argv[])
> strcmp(optarg, "none") == 0 ||
> strchr(optarg, '/') != NULL) {
> s.bitmap_file = optarg;
> + if (strchr(s.bitmap_file, '/') != NULL) {
> + bitmap_fd = open(s.bitmap_file, O_RDWR);
> + if (bitmap_fd < 0) {
> + pr_err("weird: %s cannot be openned\n", s.bitmap_file);
> + exit(2);
> + }
> + close(bitmap_fd);
> + struct statfs ext_bitmap;
> + statfs(s.bitmap_file, &ext_bitmap);
> + if (ext_bitmap.f_type != 0xEF53){
> + pr_err("external bitmap only supports ext[2-4] filesystem, %s.\n",
> + s.bitmap_file);
> + exit(2);
> + }
> + }
> continue;
> }
> if (strcmp(optarg, "clustered") == 0) {
^ permalink raw reply related
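The mdadm patch quoted above calls statfs() without checking its return value, so f_type could be read uninitialised if the call fails. A minimal, Linux-only sketch of the same check with the error path handled might look like this. The names on_ext_filesystem() and EXT_FAMILY_MAGIC are made up for illustration; 0xEF53 is the value taken from the patch.

```c
#include <sys/vfs.h>

/* Magic number shared by ext2, ext3 and ext4, as used in the patch above. */
#define EXT_FAMILY_MAGIC 0xEF53

/* Returns 1 if path lives on an ext[2-4] file system, 0 if it lives
 * on some other file system, and -1 if statfs() itself fails. */
int on_ext_filesystem(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;	/* errno is set by statfs() */
	return sfs.f_type == EXT_FAMILY_MAGIC ? 1 : 0;
}
```

A caller can then distinguish "file cannot be examined" (-1) from "wrong file system" (0) and print a different message for each case.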
* Re: [PATCH] md:fix a trivial typo in comments
From: Jack Wang @ 2017-03-13 9:10 UTC (permalink / raw)
To: Zhilong Liu; +Cc: shli, linux-raid
In-Reply-To: <1489384407-12672-1-git-send-email-zlliu@suse.com>
2017-03-13 6:53 GMT+01:00 Zhilong Liu <zlliu@suse.com>:
> fix a trivial typo in freeze_array() of raid1.c
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
> drivers/md/raid1.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b0f647..2ec0617 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -958,7 +958,7 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> static void freeze_array(struct r1conf *conf, int extra)
> {
> /* stop syncio and normal IO and wait for everything to
> - * go quite.
> + * go quit.
> * We wait until nr_pending match nr_queued+extra
> * This is called in the context of one normal IO request
> * that has failed. Thus any sync request that might be pending
> --
s/quite/quietly ?
Cheers,
Jack
> 2.6.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] md:fix a trivial typo in comments
From: Guoqing Jiang @ 2017-03-13 9:23 UTC (permalink / raw)
To: Jack Wang, Zhilong Liu; +Cc: shli, linux-raid
In-Reply-To: <CA+res+SABO1y1FAXX3CrncmmYHKis5QQiNUm=AVcV9i2Ej3-rA@mail.gmail.com>
On 03/13/2017 05:10 PM, Jack Wang wrote:
> 2017-03-13 6:53 GMT+01:00 Zhilong Liu <zlliu@suse.com>:
>> fix a trivial typo in freeze_array() of raid1.c
>>
>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>> ---
>> drivers/md/raid1.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7b0f647..2ec0617 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -958,7 +958,7 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>> static void freeze_array(struct r1conf *conf, int extra)
>> {
>> /* stop syncio and normal IO and wait for everything to
>> - * go quite.
>> + * go quit.
>> * We wait until nr_pending match nr_queued+extra
>> * This is called in the context of one normal IO request
>> * that has failed. Thus any sync request that might be pending
>> --
> s/quite/quietly ?
I guess it should be "quiet", going by the comment in freeze_array() in raid10.c.
Thanks,
Guoqing
^ permalink raw reply
* [RFC PATCH] md/raid10: refactor some codes from raid10_write_request
From: Guoqing Jiang @ 2017-03-13 9:23 UTC (permalink / raw)
To: linux-raid; +Cc: shli, neilb, Guoqing Jiang
Previously, we cloned both bio and repl_bio in raid10_write_request,
then added the cloned bio to plug->pending or conf->pending_bio_list
depending on whether a plug was present; most of the logic is the same
in the two cases.
So introduce handle_clonebio (a better name is welcome) for it, and
use the replacement parameter to distinguish between them. No functional
changes in the patch.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
Another reason for this is to improve the readability of the code, but
I haven't touched raid10 before, so this is labeled as RFC.
drivers/md/raid10.c | 172 ++++++++++++++++++++++------------------------------
1 file changed, 72 insertions(+), 100 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b1b1f982a722..02d8eff8d26e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1188,18 +1188,81 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
return;
}
-static void raid10_write_request(struct mddev *mddev, struct bio *bio,
- struct r10bio *r10_bio)
+static void handle_clonebio(struct mddev *mddev, struct r10bio *r10_bio,
+ struct bio *bio, int i, int replacement,
+ int max_sectors)
{
- struct r10conf *conf = mddev->private;
- int i;
const int op = bio_op(bio);
const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
unsigned long flags;
- struct md_rdev *blocked_rdev;
struct blk_plug_cb *cb;
struct raid10_plug_cb *plug = NULL;
+ struct r10conf *conf = mddev->private;
+ struct md_rdev *rdev;
+ int devnum = r10_bio->devs[i].devnum;
+ struct bio *mbio;
+
+ if (replacement) {
+ rdev = conf->mirrors[devnum].replacement;
+ if (rdev == NULL) {
+ /* Replacement just got moved to main 'rdev' */
+ smp_mb();
+ rdev = conf->mirrors[devnum].rdev;
+ }
+ } else
+ rdev = conf->mirrors[devnum].rdev;
+
+ mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
+ bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector, max_sectors);
+ if (replacement)
+ r10_bio->devs[i].repl_bio = mbio;
+ else
+ r10_bio->devs[i].bio = mbio;
+
+ mbio->bi_iter.bi_sector = (r10_bio->devs[i].addr +
+ choose_data_offset(r10_bio, rdev));
+ mbio->bi_bdev = rdev->bdev;
+ mbio->bi_end_io = raid10_end_write_request;
+ bio_set_op_attrs(mbio, op, do_sync | do_fua);
+ if (!replacement && test_bit(FailFast, &conf->mirrors[devnum].rdev->flags)
+ && enough(conf, devnum))
+ mbio->bi_opf |= MD_FAILFAST;
+ mbio->bi_private = r10_bio;
+
+ if (conf->mddev->gendisk)
+ trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
+ mbio, disk_devt(conf->mddev->gendisk),
+ r10_bio->sector);
+ /* flush_pending_writes() needs access to the rdev so...*/
+ mbio->bi_bdev = (void *)rdev;
+
+ atomic_inc(&r10_bio->remaining);
+
+ cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
+ if (cb)
+ plug = container_of(cb, struct raid10_plug_cb, cb);
+ else
+ plug = NULL;
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (plug) {
+ bio_list_add(&plug->pending, mbio);
+ plug->pending_cnt++;
+ } else {
+ bio_list_add(&conf->pending_bio_list, mbio);
+ conf->pending_count++;
+ }
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ if (!plug)
+ md_wakeup_thread(mddev->thread);
+}
+
+static void raid10_write_request(struct mddev *mddev, struct bio *bio,
+ struct r10bio *r10_bio)
+{
+ struct r10conf *conf = mddev->private;
+ int i;
+ struct md_rdev *blocked_rdev;
sector_t sectors;
int sectors_handled;
int max_sectors;
@@ -1402,101 +1465,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
bitmap_startwrite(mddev->bitmap, r10_bio->sector, r10_bio->sectors, 0);
for (i = 0; i < conf->copies; i++) {
- struct bio *mbio;
- int d = r10_bio->devs[i].devnum;
- if (r10_bio->devs[i].bio) {
- struct md_rdev *rdev = conf->mirrors[d].rdev;
- mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
- bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector,
- max_sectors);
- r10_bio->devs[i].bio = mbio;
-
- mbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
- choose_data_offset(r10_bio, rdev));
- mbio->bi_bdev = rdev->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- bio_set_op_attrs(mbio, op, do_sync | do_fua);
- if (test_bit(FailFast, &conf->mirrors[d].rdev->flags) &&
- enough(conf, d))
- mbio->bi_opf |= MD_FAILFAST;
- mbio->bi_private = r10_bio;
-
- if (conf->mddev->gendisk)
- trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
- mbio, disk_devt(conf->mddev->gendisk),
- r10_bio->sector);
- /* flush_pending_writes() needs access to the rdev so...*/
- mbio->bi_bdev = (void*)rdev;
-
- atomic_inc(&r10_bio->remaining);
-
- cb = blk_check_plugged(raid10_unplug, mddev,
- sizeof(*plug));
- if (cb)
- plug = container_of(cb, struct raid10_plug_cb,
- cb);
- else
- plug = NULL;
- spin_lock_irqsave(&conf->device_lock, flags);
- if (plug) {
- bio_list_add(&plug->pending, mbio);
- plug->pending_cnt++;
- } else {
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- }
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!plug)
- md_wakeup_thread(mddev->thread);
- }
-
- if (r10_bio->devs[i].repl_bio) {
- struct md_rdev *rdev = conf->mirrors[d].replacement;
- if (rdev == NULL) {
- /* Replacement just got moved to main 'rdev' */
- smp_mb();
- rdev = conf->mirrors[d].rdev;
- }
- mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
- bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector,
- max_sectors);
- r10_bio->devs[i].repl_bio = mbio;
-
- mbio->bi_iter.bi_sector = (r10_bio->devs[i].addr +
- choose_data_offset(r10_bio, rdev));
- mbio->bi_bdev = rdev->bdev;
- mbio->bi_end_io = raid10_end_write_request;
- bio_set_op_attrs(mbio, op, do_sync | do_fua);
- mbio->bi_private = r10_bio;
-
- if (conf->mddev->gendisk)
- trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
- mbio, disk_devt(conf->mddev->gendisk),
- r10_bio->sector);
- /* flush_pending_writes() needs access to the rdev so...*/
- mbio->bi_bdev = (void*)rdev;
-
- atomic_inc(&r10_bio->remaining);
-
- cb = blk_check_plugged(raid10_unplug, mddev,
- sizeof(*plug));
- if (cb)
- plug = container_of(cb, struct raid10_plug_cb,
- cb);
- else
- plug = NULL;
- spin_lock_irqsave(&conf->device_lock, flags);
- if (plug) {
- bio_list_add(&plug->pending, mbio);
- plug->pending_cnt++;
- } else {
- bio_list_add(&conf->pending_bio_list, mbio);
- conf->pending_count++;
- }
- spin_unlock_irqrestore(&conf->device_lock, flags);
- if (!plug)
- md_wakeup_thread(mddev->thread);
- }
+ if (r10_bio->devs[i].bio)
+ handle_clonebio(mddev, r10_bio, bio, i, 0, max_sectors);
+ if (r10_bio->devs[i].repl_bio)
+ handle_clonebio(mddev, r10_bio, bio, i, 1, max_sectors);
}
/* Don't remove the bias on 'remaining' (one_write_done) until
--
2.6.2
^ permalink raw reply related
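The device selection at the top of handle_clonebio() can be sketched in plain C as follows. This is a toy model under stated assumptions: struct toy_mirror and toy_pick_target() are hypothetical names, the devices are represented by strings, and the smp_mb() barrier from the kernel code is deliberately left out because the sketch is single-threaded.

```c
#include <stddef.h>

/* Toy stand-in for one mirror slot: only the two pointers that the
 * selection logic looks at. */
struct toy_mirror {
	const char *rdev;
	const char *replacement;
};

/* Pick the write target for a slot. With replacement != 0 the
 * replacement device is preferred, falling back to the main device
 * if the replacement pointer is NULL. */
const char *toy_pick_target(const struct toy_mirror *m, int replacement)
{
	if (replacement && m->replacement != NULL)
		return m->replacement;
	return m->rdev;
}
```

Folding the choice into one helper with a flag is what lets the two nearly identical branches in raid10_write_request collapse into two calls.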
* Re: [PATCH] md:fix a trivial typo in comments
From: zhilong @ 2017-03-13 9:26 UTC (permalink / raw)
To: Guoqing Jiang, Jack Wang; +Cc: shli, linux-raid
In-Reply-To: <58C6651A.3090106@suse.com>
On 03/13/2017 05:23 PM, Guoqing Jiang wrote:
>
>
> On 03/13/2017 05:10 PM, Jack Wang wrote:
>> 2017-03-13 6:53 GMT+01:00 Zhilong Liu <zlliu@suse.com>:
>>> fix a trivial typo in freeze_array() of raid1.c
>>>
>>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>>> ---
>>> drivers/md/raid1.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7b0f647..2ec0617 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -958,7 +958,7 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>> static void freeze_array(struct r1conf *conf, int extra)
>>> {
>>> /* stop syncio and normal IO and wait for everything to
>>> - * go quite.
>>> + * go quit.
>>> * We wait until nr_pending match nr_queued+extra
>>> * This is called in the context of one normal IO request
>>> * that has failed. Thus any sync request that might be pending
>>> --
>> s/quite/quietly ?
>
> I guess it should be "quiet" if referring from freeze_array in raid10.c.
Many thanks for your review and correction. Shall I use 'quiet' here, the
same as raid10.c?
Thanks,
-Zhilong
>
> Thanks,
> Guoqing
^ permalink raw reply
* RE: [PATCH 10/29] drivers, md: convert stripe_head.count from atomic_t to refcount_t
From: Reshetova, Elena @ 2017-03-13 9:49 UTC (permalink / raw)
To: Shaohua Li
Cc: gregkh@linuxfoundation.org, linux-raid@vger.kernel.org,
Hans Liljestrand, Kees Cook, David Windsor
In-Reply-To: <20170309171829.33z7z6czwdivztp4@kernel.org>
> On Wed, Mar 08, 2017 at 09:39:30AM +0000, Reshetova, Elena wrote:
> > > On Mon, Mar 06, 2017 at 04:20:57PM +0200, Elena Reshetova wrote:
> > > > refcount_t type and corresponding API should be
> > > > used instead of atomic_t when the variable is used as
> > > > a reference counter. This allows to avoid accidental
> > > > refcounter overflows that might lead to use-after-free
> > > > situations.
> > > >
> > > > Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
> > > > Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
> > > > Signed-off-by: Kees Cook <keescook@chromium.org>
> > > > Signed-off-by: David Windsor <dwindsor@gmail.com>
> > > > ---
> > > > drivers/md/raid5-cache.c | 8 +++---
> > > > drivers/md/raid5.c | 66 ++++++++++++++++++++++++------------------------
> > > > drivers/md/raid5.h | 3 ++-
> > > > 3 files changed, 39 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> > > > index 3f307be..6c05e12 100644
> > > > --- a/drivers/md/raid5-cache.c
> > > > +++ b/drivers/md/raid5-cache.c
> > >
> > > snip
> > > > sh->check_state, sh->reconstruct_state);
> > > >
> > > > analyse_stripe(sh, &s);
> > > > @@ -4924,7 +4924,7 @@ static void activate_bit_delay(struct r5conf *conf,
> > > > struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
> > > > int hash;
> > > > list_del_init(&sh->lru);
> > > > - atomic_inc(&sh->count);
> > > > + refcount_inc(&sh->count);
> > > > hash = sh->hash_lock_index;
> > > > __release_stripe(conf, sh, &temp_inactive_list[hash]);
> > > > }
> > > > @@ -5240,7 +5240,7 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> > > > sh->group = NULL;
> > > > }
> > > > list_del_init(&sh->lru);
> > > > - BUG_ON(atomic_inc_return(&sh->count) != 1);
> > > > + BUG_ON(refcount_inc_not_zero(&sh->count));
> > >
> > > This changes the behavior. refcount_inc_not_zero doesn't inc if original value is 0
> >
> > Hm.. So, you want to inc here in any case and BUG if the end result differs from 1.
> > So essentially you want to only increment here from zero to one under normal conditions... This is a challenge for refcount_t and against the design.
> > Is it ok just to maybe do this here:
> >
> > - BUG_ON(atomic_inc_return(&sh->count) != 1);
> > + BUG_ON(refcount_read(&sh->count) != 0);
> > + refcount_set(&sh->count, 1);
>
> this looks ok
>
>
> > Do we have an issue with locking in this case? Or maybe it is then better to leave this one to be atomic_t without protection since it isn't a real refcounter as it turns out.
>
> There is no lock issue, the count should be 0 in the list. It's a refcounter actually, so good to do the convert.
Ok, great, I will send an updated patch!
Best Regards,
Elena.
>
> Thanks,
> Shaohua
^ permalink raw reply
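The behavioural difference discussed above can be shown with a small user-space model built on C11 atomics. This is only a sketch: the model_* functions are hypothetical names that imitate the semantics described in the thread, not the kernel's atomic_t or refcount_t implementations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models atomic_inc_return(): always increments, returns the new value. */
int model_inc_return(atomic_int *v)
{
	return atomic_fetch_add(v, 1) + 1;
}

/* Models an "increment unless zero" operation: a zero count is left
 * untouched and false is returned. */
bool model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Models the proposed replacement: require 0, then set to 1.
 * Returns false where the kernel code would hit BUG_ON(). The two
 * steps are not atomic as a whole; the thread states that no extra
 * locking is needed at this call site. */
bool model_set_from_zero(atomic_int *v)
{
	if (atomic_load(v) != 0)
		return false;
	atomic_store(v, 1);
	return true;
}
```

Starting from a count of 0, model_inc_return() yields 1 while model_inc_not_zero() leaves the count at 0, which is why the first conversion changed behaviour and the read-then-set form was suggested instead.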
* RE: linux-next: WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 refcount_inc+0x37/0x40
From: Reshetova, Elena @ 2017-03-13 10:04 UTC (permalink / raw)
To: Shaohua Li, Andrei Vagin; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <20170310205413.wjs64c4zvrqvswg7@kernel.org>
> On Fri, Mar 10, 2017 at 12:01:06PM -0800, Andrei Vagin wrote:
> > Hello,
> >
> > We run CRIU tests for linux-next kernels and here is a new issue:
> >
> > All logs are here: https://api.travis-ci.org/jobs/209680974/log.txt?deansi=true
> > The kernel version is 4.11.0-rc1-next-20170310
>
> Thanks for the report. It is caused by 731d126 (drivers, md: convert
> mddev.active from atomic_t to refcount_t). It turns out the count doesn't match
> the refcount usage. I'll drop the patch temporarily.
The log below indicates that you are using your refcounter in a slightly unusual way in mddev_find().
However, just by reading the code, I can't find the place where you would increment the refcounter from zero (as opposed to setting it to one).
It looks like you either iterate over existing nodes (and increment their counters, which should be >= 1 at the time of the increment) or create a new node, in which case mddev_init() sets the counter to 1.
Do you somehow reuse the objects?
Best Regards,
Elena.
>
> Thanks,
> Shaohua
> >
> > [ 2.324763] md: Waiting for all devices to be available before autodetect
> > [ 2.331707] md: If you don't use raid, use raid=noautodetect
> > [ 2.338189] ------------[ cut here ]------------
> > [ 2.342965] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 refcount_inc+0x37/0x40
> > [ 2.350427] refcount_t: increment on 0; use-after-free.
> > [ 2.355794] Modules linked in:
> > [ 2.358979] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310 #1
> > [ 2.362966] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > [ 2.362966] Call Trace:
> > [ 2.362966] dump_stack+0x85/0xc9
> > [ 2.362966] __warn+0xd1/0xf0
> > [ 2.362966] warn_slowpath_fmt+0x4f/0x60
> > [ 2.362966] refcount_inc+0x37/0x40
> > [ 2.362966] mddev_find+0x1f1/0x2b0
> > [ 2.362966] md_open+0x1a/0xd0
> > [ 2.362966] __blkdev_get+0x85/0x4c0
> > [ 2.362966] blkdev_get+0x1d3/0x340
> > [ 2.362966] ? _raw_spin_unlock+0x27/0x40
> > [ 2.362966] blkdev_open+0x5b/0x70
> > [ 2.362966] do_dentry_open+0x213/0x330
> > [ 2.362966] ? bd_acquire+0xd0/0xd0
> > [ 2.362966] vfs_open+0x4f/0x80
> > [ 2.362966] ? may_open+0x9b/0x100
> > [ 2.362966] path_openat+0x48a/0xd50
> > [ 2.362966] ? console_unlock+0x2f9/0x560
> > [ 2.362966] do_filp_open+0x7e/0xd0
> > [ 2.362966] ? _raw_spin_unlock+0x27/0x40
> > [ 2.362966] ? __alloc_fd+0xf7/0x210
> > [ 2.362966] do_sys_open+0x115/0x1f0
> > [ 2.362966] SyS_open+0x1e/0x20
> > [ 2.362966] md_run_setup+0x71/0x9a
> > [ 2.362966] prepare_namespace+0x36/0x1a4
> > [ 2.362966] kernel_init_freeable+0x254/0x269
> > [ 2.362966] ? set_debug_rodata+0x12/0x12
> > [ 2.362966] ? rest_init+0x140/0x140
> > [ 2.362966] kernel_init+0xe/0x100
> > [ 2.362966] ret_from_fork+0x31/0x40
> > [ 2.482465] ---[ end trace a822b43a79b1f9f5 ]---
> > [ 2.487353] md: Autodetecting RAID arrays.
> > [ 2.491647] md: autorun ...
> > [ 2.494592] md: ... autorun DONE.
> > [ 2.503263] EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities
> > [ 2.511467] ------------[ cut here ]------------
> > [ 2.511477] WARNING: CPU: 0 PID: 21 at lib/refcount.c:207 refcount_dec_not_one+0x75/0x80
> > [ 2.511478] refcount_t: underflow; use-after-free.
> > [ 2.511480] Modules linked in:
> > [ 2.511485] CPU: 0 PID: 21 Comm: kworker/0:1 Tainted: G W 4.11.0-rc1-next-20170310 #1
> > [ 2.511486] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > [ 2.511490] Workqueue: events delayed_fput
> > [ 2.511492] Call Trace:
> > [ 2.511496] dump_stack+0x85/0xc9
> > [ 2.511501] __warn+0xd1/0xf0
> > [ 2.511505] warn_slowpath_fmt+0x4f/0x60
> > [ 2.511509] refcount_dec_not_one+0x75/0x80
> > [ 2.511511] refcount_dec_and_lock+0x16/0x50
> > [ 2.511515] mddev_put+0x22/0x150
> > [ 2.511517] md_release+0x21/0x30
> > [ 2.511521] __blkdev_put+0x2df/0x340
> > [ 2.511526] blkdev_put+0x50/0x150
> > [ 2.511529] blkdev_close+0x25/0x30
> > [ 2.511531] __fput+0xfa/0x230
> > [ 2.511535] delayed_fput+0x25/0x30
> > [ 2.511538] process_one_work+0x1e1/0x670
> > [ 2.511539] ? process_one_work+0x162/0x670
> > [ 2.511544] worker_thread+0x137/0x4b0
> > [ 2.511546] ? trace_hardirqs_on+0xd/0x10
> > [ 2.511551] kthread+0x10c/0x140
> > [ 2.511552] ? process_one_work+0x670/0x670
> > [ 2.511554] ? kthread_create_on_node+0x40/0x40
> > [ 2.511558] ret_from_fork+0x31/0x40
> > [ 2.511566] ---[ end trace a822b43a79b1f9f6 ]---
^ permalink raw reply
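One possible reading of the two warnings in the log, offered as an interpretation rather than something confirmed in the thread: a get on a counter that is already 0 is refused and reported, the counter stays at 0, and the matching put then reports an underflow. A toy, single-threaded model of that sequence, with hypothetical names (struct model_ref, model_get(), model_put()), might look like this:

```c
/* Toy, single-threaded model of a checked reference counter. */
struct model_ref {
	unsigned int count;
	unsigned int warnings;	/* how many times a check fired */
};

/* Taking a reference on a zero count is refused and reported. */
void model_get(struct model_ref *r)
{
	if (r->count == 0) {
		r->warnings++;	/* corresponds to "increment on 0" */
		return;
	}
	r->count++;
}

/* Dropping a reference on a zero count is refused and reported. */
void model_put(struct model_ref *r)
{
	if (r->count == 0) {
		r->warnings++;	/* corresponds to "underflow" */
		return;
	}
	r->count--;
}
```

With a counter that starts at 0, one model_get() followed by one model_put() fires both checks, matching the order of the two traces (open path first, release path second).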
* [PATCH] drivers, md: convert stripe_head.count from atomic_t to refcount_t
From: Elena Reshetova @ 2017-03-13 10:42 UTC (permalink / raw)
To: shli
Cc: linux-kernel, linux-raid, Elena Reshetova, Hans Liljestrand,
Kees Cook, David Windsor
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
---
drivers/md/raid5-cache.c | 8 +++---
drivers/md/raid5.c | 68 +++++++++++++++++++++++++-----------------------
drivers/md/raid5.h | 3 ++-
3 files changed, 41 insertions(+), 38 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be..6c05e12 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -979,7 +979,7 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
* don't delay.
*/
clear_bit(STRIPE_DELAYED, &sh->state);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
mutex_lock(&log->io_mutex);
/* meta + data */
@@ -1321,7 +1321,7 @@ static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
assert_spin_locked(&conf->device_lock);
list_del_init(&sh->lru);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
set_bit(STRIPE_HANDLE, &sh->state);
atomic_inc(&conf->active_stripes);
@@ -1424,7 +1424,7 @@ static void r5c_do_reclaim(struct r5conf *conf)
*/
if (!list_empty(&sh->lru) &&
!test_bit(STRIPE_HANDLE, &sh->state) &&
- atomic_read(&sh->count) == 0) {
+ refcount_read(&sh->count) == 0) {
r5c_flush_stripe(conf, sh);
if (count++ >= R5C_RECLAIM_STRIPE_GROUP)
break;
@@ -2650,7 +2650,7 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
* don't delay.
*/
clear_bit(STRIPE_DELAYED, &sh->state);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
mutex_lock(&log->io_mutex);
/* meta + data */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2ce23b0..7e3913a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -296,7 +296,7 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
static void __release_stripe(struct r5conf *conf, struct stripe_head *sh,
struct list_head *temp_inactive_list)
{
- if (atomic_dec_and_test(&sh->count))
+ if (refcount_dec_and_test(&sh->count))
do_release_stripe(conf, sh, temp_inactive_list);
}
@@ -388,7 +388,7 @@ void raid5_release_stripe(struct stripe_head *sh)
/* Avoid release_list until the last reference.
*/
- if (atomic_add_unless(&sh->count, -1, 1))
+ if (refcount_dec_not_one(&sh->count))
return;
if (unlikely(!conf->mddev->thread) ||
@@ -401,7 +401,7 @@ void raid5_release_stripe(struct stripe_head *sh)
slow_path:
local_irq_save(flags);
/* we are ok here if STRIPE_ON_RELEASE_LIST is set or not */
- if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
+ if (refcount_dec_and_lock(&sh->count, &conf->device_lock)) {
INIT_LIST_HEAD(&list);
hash = sh->hash_lock_index;
do_release_stripe(conf, sh, &list);
@@ -491,7 +491,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
struct r5conf *conf = sh->raid_conf;
int i, seq;
- BUG_ON(atomic_read(&sh->count) != 0);
+ BUG_ON(refcount_read(&sh->count) != 0);
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
BUG_ON(stripe_operations_active(sh));
BUG_ON(sh->batch_head);
@@ -668,11 +668,11 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
&conf->cache_state);
} else {
init_stripe(sh, sector, previous);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
}
- } else if (!atomic_inc_not_zero(&sh->count)) {
+ } else if (!refcount_inc_not_zero(&sh->count)) {
spin_lock(&conf->device_lock);
- if (!atomic_read(&sh->count)) {
+ if (!refcount_read(&sh->count)) {
if (!test_bit(STRIPE_HANDLE, &sh->state))
atomic_inc(&conf->active_stripes);
BUG_ON(list_empty(&sh->lru) &&
@@ -688,7 +688,7 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
sh->group = NULL;
}
}
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
spin_unlock(&conf->device_lock);
}
} while (sh == NULL);
@@ -752,9 +752,9 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
hash = stripe_hash_locks_hash(head_sector);
spin_lock_irq(conf->hash_locks + hash);
head = __find_stripe(conf, head_sector, conf->generation);
- if (head && !atomic_inc_not_zero(&head->count)) {
+ if (head && !refcount_inc_not_zero(&head->count)) {
spin_lock(&conf->device_lock);
- if (!atomic_read(&head->count)) {
+ if (!refcount_read(&head->count)) {
if (!test_bit(STRIPE_HANDLE, &head->state))
atomic_inc(&conf->active_stripes);
BUG_ON(list_empty(&head->lru) &&
@@ -770,7 +770,7 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
head->group = NULL;
}
}
- atomic_inc(&head->count);
+ refcount_inc(&head->count);
spin_unlock(&conf->device_lock);
}
spin_unlock_irq(conf->hash_locks + hash);
@@ -833,7 +833,7 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
sh->batch_head->bm_seq = seq;
}
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
unlock_out:
unlock_two_stripes(head, sh);
out:
@@ -1036,9 +1036,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
pr_debug("%s: for %llu schedule op %d on disc %d\n",
__func__, (unsigned long long)sh->sector,
bi->bi_opf, i);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
if (sh != head_sh)
- atomic_inc(&head_sh->count);
+ refcount_inc(&head_sh->count);
if (use_new_offset(conf, sh))
bi->bi_iter.bi_sector = (sh->sector
+ rdev->new_data_offset);
@@ -1097,9 +1097,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
"replacement disc %d\n",
__func__, (unsigned long long)sh->sector,
rbi->bi_opf, i);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
if (sh != head_sh)
- atomic_inc(&head_sh->count);
+ refcount_inc(&head_sh->count);
if (use_new_offset(conf, sh))
rbi->bi_iter.bi_sector = (sh->sector
+ rrdev->new_data_offset);
@@ -1275,7 +1275,7 @@ static void ops_run_biofill(struct stripe_head *sh)
}
}
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_biofill, sh, NULL);
async_trigger_callback(&submit);
}
@@ -1353,7 +1353,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
if (i != target)
xor_srcs[count++] = sh->dev[i].page;
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
@@ -1441,7 +1441,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
dest = tgt->page;
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
if (target == qd_idx) {
count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
@@ -1516,7 +1516,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
pr_debug("%s: stripe: %llu faila: %d failb: %d\n",
__func__, (unsigned long long)sh->sector, faila, failb);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
if (failb == syndrome_disks+1) {
/* Q disk is one of the missing disks */
@@ -1784,7 +1784,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
break;
}
if (i >= sh->disks) {
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
set_bit(R5_Discard, &sh->dev[pd_idx].flags);
ops_complete_reconstruct(sh);
return;
@@ -1825,7 +1825,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu,
flags = ASYNC_TX_ACK |
(prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
- atomic_inc(&head_sh->count);
+ refcount_inc(&head_sh->count);
init_async_submit(&submit, flags, tx, ops_complete_reconstruct, head_sh,
to_addr_conv(sh, percpu, j));
} else {
@@ -1867,7 +1867,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
break;
}
if (i >= sh->disks) {
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
set_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
set_bit(R5_Discard, &sh->dev[sh->qd_idx].flags);
ops_complete_reconstruct(sh);
@@ -1891,7 +1891,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu,
struct stripe_head, batch_list) == head_sh;
if (last_stripe) {
- atomic_inc(&head_sh->count);
+ refcount_inc(&head_sh->count);
init_async_submit(&submit, txflags, tx, ops_complete_reconstruct,
head_sh, to_addr_conv(sh, percpu, j));
} else
@@ -1948,7 +1948,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu)
tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
&sh->ops.zero_sum_result, &submit);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_check, sh, NULL);
tx = async_trigger_callback(&submit);
}
@@ -1967,7 +1967,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
if (!checkp)
srcs[count] = NULL;
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check,
sh, to_addr_conv(sh, percpu, 0));
async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE,
@@ -2057,7 +2057,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
INIT_LIST_HEAD(&sh->lru);
INIT_LIST_HEAD(&sh->r5c);
INIT_LIST_HEAD(&sh->log_list);
- atomic_set(&sh->count, 1);
+ refcount_set(&sh->count, 1);
sh->log_start = MaxSector;
for (i = 0; i < disks; i++) {
struct r5dev *dev = &sh->dev[i];
@@ -2354,7 +2354,7 @@ static int drop_one_stripe(struct r5conf *conf)
spin_unlock_irq(conf->hash_locks + hash);
if (!sh)
return 0;
- BUG_ON(atomic_read(&sh->count));
+ BUG_ON(refcount_read(&sh->count));
shrink_buffers(sh);
kmem_cache_free(conf->slab_cache, sh);
atomic_dec(&conf->active_stripes);
@@ -2386,7 +2386,7 @@ static void raid5_end_read_request(struct bio * bi)
break;
pr_debug("end_read_request %llu/%d, count: %d, error %d.\n",
- (unsigned long long)sh->sector, i, atomic_read(&sh->count),
+ (unsigned long long)sh->sector, i, refcount_read(&sh->count),
bi->bi_error);
if (i == disks) {
bio_reset(bi);
@@ -2523,7 +2523,7 @@ static void raid5_end_write_request(struct bio *bi)
}
}
pr_debug("end_write_request %llu/%d, count %d, error: %d.\n",
- (unsigned long long)sh->sector, i, atomic_read(&sh->count),
+ (unsigned long long)sh->sector, i, refcount_read(&sh->count),
bi->bi_error);
if (i == disks) {
bio_reset(bi);
@@ -4545,7 +4545,7 @@ static void handle_stripe(struct stripe_head *sh)
pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
"pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
(unsigned long long)sh->sector, sh->state,
- atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
+ refcount_read(&sh->count), sh->pd_idx, sh->qd_idx,
sh->check_state, sh->reconstruct_state);
analyse_stripe(sh, &s);
@@ -4924,7 +4924,7 @@ static void activate_bit_delay(struct r5conf *conf,
struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
int hash;
list_del_init(&sh->lru);
- atomic_inc(&sh->count);
+ refcount_inc(&sh->count);
hash = sh->hash_lock_index;
__release_stripe(conf, sh, &temp_inactive_list[hash]);
}
@@ -5240,7 +5240,9 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
sh->group = NULL;
}
list_del_init(&sh->lru);
- BUG_ON(atomic_inc_return(&sh->count) != 1);
+ BUG_ON(refcount_read(&sh->count) != 0);
+ refcount_set(&sh->count, 1);
+
return sh;
}
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4bb27b9..a1ed351 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -3,6 +3,7 @@
#include <linux/raid/xor.h>
#include <linux/dmaengine.h>
+#include <linux/refcount.h>
/*
*
@@ -207,7 +208,7 @@ struct stripe_head {
short ddf_layout;/* use DDF ordering to calculate Q */
short hash_lock_index;
unsigned long state; /* state flags */
- atomic_t count; /* nr of active thread/requests */
+ refcount_t count; /* nr of active thread/requests */
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
int overwrite_disks; /* total overwrite disks in stripe,
--
2.7.4
^ permalink raw reply related
* Re: [PATCH] md:fix a trivial typo in comments
From: John Stoffel @ 2017-03-13 14:27 UTC (permalink / raw)
To: Zhilong Liu; +Cc: shli, linux-raid
In-Reply-To: <1489384407-12672-1-git-send-email-zlliu@suse.com>
>>>>> "Zhilong" == Zhilong Liu <zlliu@suse.com> writes:
Zhilong> fix a trivial typo in freeze_array() of raid1.c
Zhilong> Signed-off-by: Zhilong Liu <zlliu@suse.com>
Zhilong> ---
Zhilong> drivers/md/raid1.c | 2 +-
Zhilong> 1 file changed, 1 insertion(+), 1 deletion(-)
Zhilong> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
Zhilong> index 7b0f647..2ec0617 100644
Zhilong> --- a/drivers/md/raid1.c
Zhilong> +++ b/drivers/md/raid1.c
Zhilong> @@ -958,7 +958,7 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
Zhilong> static void freeze_array(struct r1conf *conf, int extra)
Zhilong> {
Zhilong> /* stop syncio and normal IO and wait for everything to
Zhilong> - * go quite.
Zhilong> + * go quit.
Zhilong> * We wait until nr_pending match nr_queued+extra
Zhilong> * This is called in the context of one normal IO request
Zhilong> * that has failed. Thus any sync request that might be pending
Zhilong> --
Zhilong> 2.6.6
Don't you mean "quiet" instead?
John
^ permalink raw reply
* Re: [PATCH] md:fix a trivial typo in comments
From: Coly Li @ 2017-03-13 14:59 UTC (permalink / raw)
To: Zhilong Liu, shli; +Cc: linux-raid
In-Reply-To: <1489384407-12672-1-git-send-email-zlliu@suse.com>
On 2017/3/13 1:53 PM, Zhilong Liu wrote:
> fix a trivial typo in freeze_array() of raid1.c
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
> drivers/md/raid1.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b0f647..2ec0617 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -958,7 +958,7 @@ static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> static void freeze_array(struct r1conf *conf, int extra)
> {
> /* stop syncio and normal IO and wait for everything to
> - * go quite.
It should be "quiet", this is a typo, thanks for catching it :-)
Coly
^ permalink raw reply
* [PATCH] md/r5cache: fix set_syndrome_sources()
From: Song Liu @ 2017-03-13 20:44 UTC (permalink / raw)
To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu
With srctype == SYNDROME_SRC_WRITTEN, we need to include both
devs with non-null ->written and devs with R5_InJournal set.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1c554a8..88cc898 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1499,7 +1499,8 @@ static int set_syndrome_sources(struct page **srcs,
(test_bit(R5_Wantdrain, &dev->flags) ||
test_bit(R5_InJournal, &dev->flags))) ||
(srctype == SYNDROME_SRC_WRITTEN &&
- dev->written)) {
+ (dev->written ||
+ test_bit(R5_InJournal, &dev->flags)))) {
if (test_bit(R5_InJournal, &dev->flags))
srcs[slot] = sh->dev[i].orig_page;
else
--
2.9.3
^ permalink raw reply related
* Re: [PATCH] md/r5cache: fix set_syndrome_sources()
From: Dan Williams @ 2017-03-13 20:58 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, Shaohua Li, NeilBrown, Kernel Team, Christoph Hellwig
In-Reply-To: <20170313204435.1732089-1-songliubraving@fb.com>
On Mon, Mar 13, 2017 at 1:44 PM, Song Liu <songliubraving@fb.com> wrote:
> With srctype == SYNDROME_SRC_WRITTEN, we need include both
> dev with non-null ->written and dev with R5_InJournal.
>
Can you say a bit more about what this fixes? How does this bug
manifest itself (user visible effect), and should this be marked for
-stable?
^ permalink raw reply
* Re: [PATCH] md/r5cache: fix set_syndrome_sources()
From: Song Liu @ 2017-03-13 21:20 UTC (permalink / raw)
To: Dan Williams
Cc: linux-raid, Shaohua Li, NeilBrown, Kernel Team, Christoph Hellwig
In-Reply-To: <CAPcyv4h6XBdDRrq5+cs9v70v93sHnkTUoFrKuN_-Ox_wjJvvkw@mail.gmail.com>
> On Mar 13, 2017, at 1:58 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Mar 13, 2017 at 1:44 PM, Song Liu <songliubraving@fb.com> wrote:
>> With srctype == SYNDROME_SRC_WRITTEN, we need include both
>> dev with non-null ->written and dev with R5_InJournal.
>>
>
> Can you say a bit more about what this fixes? How does this bug
> manifest itself (user visible effect), and should this be marked for
> -stable?
Before this patch, a device with R5_InJournal set was included in prexor
(SYNDROME_SRC_WANT_DRAIN) but not in reconstruct (SYNDROME_SRC_WRITTEN),
so the resulting parity was wrong.
This fixes logic in 1e6d690b9334b7e1b31d25fd8d93e980e449a5f9.
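To make the failure mode concrete, here is a toy sketch of a read-modify-write parity update. It is an illustration only, under loud assumptions: it uses plain XOR parity with one byte per device rather than the RAID-6 syndrome that set_syndrome_sources() feeds, and the names full_parity() and rmw_parity() are hypothetical, not functions from raid5.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parity of a full stripe in this toy model: XOR of every data byte. */
static uint8_t full_parity(const uint8_t *data, size_t ndev)
{
	uint8_t p = 0;

	for (size_t i = 0; i < ndev; i++)
		p ^= data[i];
	return p;
}

/*
 * Hypothetical read-modify-write update.
 * Step 1 (prexor): XOR the old data of the selected devices out of the parity.
 * Step 2 (reconstruct): XOR the new data of the selected devices back in.
 * The result matches the stripe only if both steps select the same devices.
 */
static uint8_t rmw_parity(uint8_t old_parity,
			  const uint8_t *old_data, const uint8_t *new_data,
			  const bool *in_prexor, const bool *in_reconstruct,
			  size_t ndev)
{
	uint8_t p = old_parity;

	assert(old_data && new_data && in_prexor && in_reconstruct);

	for (size_t i = 0; i < ndev; i++)
		if (in_prexor[i])
			p ^= old_data[i];
	for (size_t i = 0; i < ndev; i++)
		if (in_reconstruct[i])
			p ^= new_data[i];
	return p;
}
```

If a device is selected for prexor but skipped in reconstruct, its old data is removed from the parity and nothing replaces it, so the parity no longer matches what is on the data disks.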
Thanks,
Song
^ permalink raw reply
* [PATCH v5] DM: dm-inplace-compress: inplace compressed DM target
From: Ram Pai @ 2017-03-13 21:30 UTC (permalink / raw)
To: agk, snitzer, dm-devel
Cc: corbet, shli, linux-doc, linux-kernel, linux-raid, hbabu,
linuxram, julia.lawall
This patch provides a generic device-mapper compression device.
Originally written by Shaohua Li.
https://www.redhat.com/archives/dm-devel/2013-December/msg00143.html
I have optimized and hardened the code.
I have not received any negative comments so far and feel confident in the
code. Please consider merging it upstream.
Testing:
-------
This compression block device was tested in the following scenarios:
a) backing a ext4/xfs/btrfs filesystem
b) backing swap
Ran 'badblocks' test on the compressed block device.
Thoroughly stress tested on PPC64 and x86 systems.
I have included a test-script that I used to test the block device.
Version v5:
Modified the parameter list format to use token=value.
Fixed a coding issue noted by Julia Lawall.
Fixed a data corruption issue when the compressed size was the same as
the original data size.
Modified the allowed maximum I/O size to be as large as two pages.
Larger I/O needs larger buffers to temporarily hold compressed data,
which can lead to an inability to satisfy memory allocation
requests.
Version v4:
Fixed kbuild errors; hopefully they are all taken care of.
- no reference to zero_page
- convert all divide and mod operations to bit operations
Version v3:
Fixed sector alignment bugs exposed while testing on x86.
Explicitly set the maximum request size to 128K. Without this,
range locking failed, causing I/Os to stomp on each other.
Fixed an occasional data corruption caused by wrong size of the
compression buffer.
Added a parameter, given when creating the block device,
to not sleep during memory allocations. This can be useful
if the device is used as a swap device.
Version v2:
All patches are merged into a single patch.
Major code re-arrangement.
Data and metablocks allocated based on the length of the device
map rather than the size of the backing device.
Size of each entry in the bitmap array is explicitly set
to 32bits.
Attempt to reuse the provided bio buffer space instead
of allocating a new one.
Version v1:
Comments from Alasdair have been incorporated.
https://www.redhat.com/archives/dm-devel/2013-December/msg00144.html
Ram Pai (1):
From: Shaohua Li <shli@kernel.org>
Documentation/device-mapper/dm-inplace-compress.txt | 174 +
drivers/md/Kconfig | 6
drivers/md/Makefile | 2
drivers/md/dm-inplace-compress.c | 2295 ++++++++++++++++++++
drivers/md/dm-inplace-compress.h | 194 +
5 files changed, 2671 insertions(+)
---------- Test script -------------
#!/bin/bash
# a test program to verify the correctness of
# the dm-inplace-compression target
# - Ram Pai
compdevname="__compdev"
usage()
{
echo
echo
echo "$1: -d <device path> [ -h ] [ -c <compdevicename> ]"
echo "-d <device path> path to the block device to "
echo " back the compression device"
echo "-c <compdevicename> some unique name of the compress"
echo " device name to be used. Defaults to $compdevname"
echo "-h help"
echo
echo
return
}
getsize()
{
#the target will spew out the maximum size that
#it can accommodate for the device. So start with
#insane number and let it fail.
insane=999999999999999999990009
dmsetup create $2 --table \
"0 $insane inplacecompress device=$1, \
writethrough,compressor=lzo" 2>/dev/null
echo $(dmesg | \
grep 'dm-inplace-compress: This device can accommodate at most'\
| tail -1 | awk '{print $(NF-1)}')
}
MYNAME=$(basename $0)
OPTIND=0
while getopts "d:c:h" args $OPTIONS
do
case "$args" in
d) device=$OPTARG
;;
c) compdevname=$OPTARG
;;
*) usage $MYNAME
exit 1
;;
esac
done
if [ -z "$device" ]
then
usage $MYNAME
exit 1;
fi
if [ ! -b "$device" ]
then
usage $MYNAME
echo ERROR: $device is not a block device
exit 1;
fi
if [ -b "/dev/mapper/$compdevname" ]
then
echo "WARNING: $compdevname already exists"
usage $MYNAME
fi
echo -n "ANY DATA ON $device WILL BE LOST. Continue using $device?: y/n:"
read yesorno
if [ "$yesorno" != "y" ]
then
echo "Ok, exiting"
exit 1;
fi
dmsetup targets | grep inplacecompress > /dev/null
if [ $? -ne 0 ]
then
echo "Please enable the inplacecompress target in the kernel"
echo "Try: modprobe dm-inplace-compress"
exit 1
fi
#clean and init the device
dd if=/dev/zero of=$device count=100 2> /dev/null
dmsetup remove $compdevname 2> /dev/null
size=$(getsize $device $compdevname)
if [ -z "$size" ]
then
echo "FAILURE: determining the maximum possible size of the device"
exit 1
fi
ret=0
for mode in writethrough writeback=2
do
for i in lzo 842
do
cat /proc/crypto | grep -w "$i$" > /dev/null
if [ $? -ne 0 ]
then
continue
fi
echo "Testing: $i $mode....:"
#generate the device
dmsetup create $compdevname --table "0 $size inplacecompress device=$device, $mode ,compressor=$i"
if [ $? -ne 0 ]
then
echo "FAILURE: creating the device"
exit 1
fi
badblocks -wsv /dev/mapper/$compdevname
if [ $? -ne 0 ]
then
echo "FAILURE: badblocks reported errors"
ret=1
else
echo "PASS"
fi
##
##
## ADD MORE TESTS HERE
##
##
dmsetup remove $compdevname
done
done
if [ $ret -eq 0 ]
then
echo CONGRATS IT WORKS
exit 0
else
echo FAIL
exit 1
fi
------------------------
^ permalink raw reply
* [PATCH v5 1/1] DM: inplace compressed DM target
From: Ram Pai @ 2017-03-13 21:30 UTC (permalink / raw)
To: agk, snitzer, dm-devel
Cc: corbet, shli, linux-doc, linux-kernel, linux-raid, hbabu,
linuxram, julia.lawall
In-Reply-To: <1489440641-8305-1-git-send-email-linuxram@us.ibm.com>
This is a simple DM target supporting inplace compression. It's best
suited for SSD. The underlying disk must support 512B sector size;
the target only supports 4k sector size.
Disk layout:
|super|...meta...|..data...|
Store unit is 4k (a block). Super is 1 block, which stores meta and
data size and compression algorithm. Meta is a bitmap. For each data
block, there are 5 bits meta.
Data:
Data of a block is compressed. Compressed data is rounded up to a
multiple of 512B, which is the payload. On disk, payload is stored at
the beginning of the logical sector of the block. Let's look at an
example. Say we store data to block A, which is in sector B(A*8); its
original size is 4k and its compressed size is 1500. Compressed data
(CD) will use three sectors (512B each). The three sectors are the
payload. Payload will be stored at sector B.
---------------------------------------------------
... | CD1 | CD2 | CD3 | | | | | | ...
---------------------------------------------------
^B ^B+1 ^B+2 ^B+7 ^B+8
For this block, we will not use sector B+3 to B+7 (a hole). We use four
meta bits to represent payload size. The compressed size (1500) isn't
stored in meta directly. Instead, we store it at the last 32bits of
payload. In this example, we store it at the end of sector B+2. If
compressed size + sizeof(32bits) crosses a sector, payload size will
increase one sector. If payload uses 8 sectors, we store uncompressed
data directly.
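The sector arithmetic in the example above can be sketched as follows. This is a minimal illustration, not code from the patch: payload_sectors() and block_first_sector() are hypothetical names, and the size of the trailer that records the compressed length is passed in as a parameter (4 bytes per the text above).

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE		512u
#define BLOCK_SIZE		4096u
#define SECTORS_PER_BLOCK	(BLOCK_SIZE / SECTOR_SIZE)	/* 8 */

/* First 512B sector of 4k block 'block' (sector B = A * 8 in the text). */
static uint64_t block_first_sector(uint64_t block)
{
	return block * SECTORS_PER_BLOCK;
}

/*
 * Number of 512B sectors the payload occupies: compressed data plus the
 * trailer holding the compressed length, rounded up to whole sectors.
 * A result of SECTORS_PER_BLOCK means compression saved nothing, and the
 * block would be stored uncompressed.
 */
static unsigned int payload_sectors(unsigned int compressed_len,
				    unsigned int trailer_bytes)
{
	unsigned int total = compressed_len + trailer_bytes;
	unsigned int sectors = (total + SECTOR_SIZE - 1) / SECTOR_SIZE;

	assert(compressed_len > 0);

	return sectors >= SECTORS_PER_BLOCK ? SECTORS_PER_BLOCK : sectors;
}
```

For the 1500-byte example, payload_sectors(1500, 4) gives 3, so sectors B to B+2 are used and B+3 to B+7 form the hole. A compressed size of 1534 would push the trailer across a sector boundary and cost a fourth sector.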
If IO size is bigger than one block, we can store the data as an extent.
Data of the whole extent will be compressed and stored in a similar way
to the above. The first block of the extent is the head, all others are
the tail. If extent is 1 block, the block is head. We have 1 bit of
meta to indicate whether a block is head or tail. If 4 meta bits of head
block can't store extent payload size, we will borrow tail block meta
bits to store payload size. Max allowed extent size is 128k, so we
don't compress/decompress too much data at once.
Meta:
Modifying data will modify meta too. Meta will be written (flushed) to
disk depending on meta write policy. We support writeback and
writethrough mode. In writeback mode, meta will be written to disk
periodically or on a FLUSH request. In writethrough mode, data and
metadata will be written to disk together.
Advantages:
1. Simple. Since we store compressed data in-place, we don't need
complicated disk data management.
2. Efficient. For each 4k, we only need 5 bits meta. 1T data will use
less than 200M meta, so we can load all meta into memory. And actual
compression size is in payload. So if IO doesn't need RMW and we use
writeback meta flush, we don't need extra IO for meta.
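The "less than 200M" figure follows directly from 5 bits per 4k block. A quick sketch of the arithmetic, where meta_bytes() is a hypothetical helper and "1T" is taken to mean 1 TiB:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE	4096ULL
#define META_BITS	5ULL	/* metadata bits per 4k data block */

/* Bytes of bitmap metadata needed to describe 'data_bytes' of data. */
static uint64_t meta_bytes(uint64_t data_bytes)
{
	uint64_t blocks;

	assert(data_bytes % BLOCK_SIZE == 0);

	blocks = data_bytes / BLOCK_SIZE;
	return (blocks * META_BITS + 7) / 8;	/* round bits up to bytes */
}
```

For 1 TiB this is 2^28 blocks * 5 bits = 1342177280 bits, or 167772160 bytes (160 MiB), small enough to keep entirely in memory.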
Disadvantages:
1. hole. Since we store compressed data in-place, there are a lot of
holes (in above example, B+3 - B+7). Holes can impact IO, because we
can't do IO merge.
2. 1:1 size. Compression doesn't change disk size. If disk is 1T, we
can only store 1T data even if we compress it.
But this target is meant for SSD only. Generally SSD firmware has an FTL
layer to map disk sectors to flash nand. High-end SSD firmware has a
filesystem-like FTL.
1. hole. Disk has a lot of holes, but SSD FTL can still store data
contiguously in nand. Even if we can't do IO merge in OS layer, SSD
firmware can do it.
2. 1:1 size. On one side, we write compressed data to SSD, which means
less data is written to SSD. This helps SSD garbage collection, and
therefore write speed and life span. So even if the 1:1 size is a
problem, the target is still helpful. On the other side, advanced SSD
FTL can easily do thin provisioning. For example, if nand is 1T, we
can let the SSD report it as 2T and use the SSD as the compressed
target. With such an SSD, we don't have the 1:1 size issue.
So even if SSD FTL cannot map non-contiguous disk sectors to
contiguous nand, the compression target can still function well.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
.../device-mapper/dm-inplace-compress.txt | 174 ++
drivers/md/Kconfig | 6 +
drivers/md/Makefile | 2 +
drivers/md/dm-inplace-compress.c | 2295 ++++++++++++++++++++
drivers/md/dm-inplace-compress.h | 194 ++
5 files changed, 2671 insertions(+)
create mode 100644 Documentation/device-mapper/dm-inplace-compress.txt
create mode 100644 drivers/md/dm-inplace-compress.c
create mode 100644 drivers/md/dm-inplace-compress.h
diff --git a/Documentation/device-mapper/dm-inplace-compress.txt b/Documentation/device-mapper/dm-inplace-compress.txt
new file mode 100644
index 0000000..2fa0d58
--- /dev/null
+++ b/Documentation/device-mapper/dm-inplace-compress.txt
@@ -0,0 +1,174 @@
+Device-Mapper's "inplace-compress" target provides inplace compression of block
+devices using the kernel compression API.
+
+Parameters: <device>=<device path> | <device>:<device path>
+ [, <#opt_params writethrough>, ]
+ [, <#opt_params <writeback>=<meta_commit_delay> ]
+ [, <#opt_params <writeback>:<meta_commit_delay> ]
+ [, <#opt_params compressor>=<type> ]
+ [, <#opt_params compressor>:<type> ]
+ [, <#opt_params critical> ]
+
+
+<writethrough>
+ Write data and metadata together.
+
+<writeback>=<meta_commit_delay>
+ Write metadata every 'meta_commit_delay' interval.
+
+<device>=<device path>
+ This is the device that is going to be used as backend and contains the
+ compressed data. You can specify it as a path like /dev/xxx or a device
+ number <major>:<minor>.
+
+<compressor>=<type>
+ Choose the compressor algorithm. 'lzo' and '842' compressors are supported.
+
+<critical>
+ Block device used in critical path.
+
+Example scripts
+===============
+
+Create an inplace-compress block device using lzo compression. Write metadata
+and data together.
+
+[[
+#!/bin/sh
+# Create an inplace-compress device using dmsetup
+device=$1 #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress device=$device,
+ writethrough, compressor=lzo"
+]]
+
+
+Create an inplace-compress block device using nx-842 hardware compression. Write
+metadata periodically, every 5 seconds.
+
+[[
+#!/bin/sh
+# Create an inplace-compress device using dmsetup
+device=$1 #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress device=$device,
+ writeback=5, compressor=842"
+]]
+
+
+Create an inplace-compress block device. Device is used in critical path, e.g. a
+swap device.
+
+[[
+#!/bin/sh
+# Create an inplace-compress device using dmsetup
+device=$1 #your backing storage eg: /dev/sdc1
+size=80000 #size of your new compressed block device
+dmsetup create comp1 --table "0 $size inplacecompress device=$device,critical"
+]]
+
+Description
+===========
+
+ This is a simple DM target supporting inplace compression. It's best suited
+ for SSD. The underlying disk must support 512B sector size; the target only
+ supports 4k sector size.
+
+
+
+ Disk layout:
+ |super|...meta...|..data...|
+
+ Store unit is 4k (a block). Superblock is 1 block. It stores meta and data
+ size and compression algorithm. Metablock is a bitmap. For each data block,
+ there are 5 bits meta.
+
+
+
+ Data:
+
+ Data of a block is compressed. Compressed data is rounded up to 512B, which
+ is the payload. On disk, payload is stored at the beginning of the logical
+ sector of the block. Let's look at an example. Say we store data to block
+ A, which is in sector B(A*8); its original size is 4k, compressed size is
+ 1500. Compressed data (CD) will use three sectors (512B each). The three
+ sectors are the payload. Payload will be stored at sector B.
+
+ ---------------------------------------------------
+ ... | CD1 | CD2 | CD3 | | | | | | ...
+ ---------------------------------------------------
+ ^B ^B+1 ^B+2 ^B+7 ^B+8
+
+ For this block, we will not use sector B+3 to B+7 (a hole). We use 4 meta
+ bits to represent payload size. The compressed size (1500) is not stored in
+ meta directly. Instead, we store it at the last 64bits of payload. In this
+ example, we store it at the end of sector B+2. If compressed size +
+ sizeof(64bits) crosses a sector, payload size will increase one sector. If
+ payload uses 8 sectors, we store uncompressed data directly. A compressed
+ size is 32bits and it is tagged with a 32bit magic number, to ensure its
+ integrity.
+
+ If IO size is bigger than one block, we can store the data as an extent.
+ Data of the whole extent is compressed and stored in a similar way to the
+ above. The first block of the extent is the head, all others are the tail.
+ If extent is one block, the block is head. We have one bit of meta to
+ indicate if a block is head or tail. If four meta bits of head block can't
+ store extent payload size, we will borrow tail block meta bits to store
+ payload size. Max allowed extent size is 128k. This is to guard against
+ compression/decompression of data that is too large.
+
+ Meta:
+
+ Modifying data modifies meta as well. Metadata is written (flushed) to disk
+ depending on metadata write policy. We support writeback and writethrough
+ mode. In writeback mode, meta will be written to disk periodically or when
+ a FLUSH request is initiated. In writethrough mode, data and metadata
+ will be written to disk together.
+
+ Advantages:
+
+ 1. Simple. Since we store compressed data in-place, we don't need
+ complicated disk data management.
+
+ 2. Efficient. For each 4k, we only need 5 bits meta. 1T data will use less
+ than 200M meta, so we can load all meta into memory. Actual compression
+ size is in payload. This saves a metadata write if the IO does not need
+ RMW.
+
+
+
+ Disadvantages:
+
+ 1. Hole. Since we store compressed data in-place, there are a lot of holes
+ (in above example, B+3 - B+7). Holes can impact IO, because we can't merge
+ the IO.
+
+ 2. 1:1 size. Compression does not change disk size. If disk is 1T, we can
+ only store 1T data even if we compress it.
+
+ The above disadvantages can be mitigated by using SSDs or NVMe devices.
+ Generally the firmware of these devices has an FTL layer to map disk sectors
+ to flash nand. Some high-end device firmware has a filesystem-like FTL.
+
+ 1. Hole. Disk has a lot of holes, but SSD FTL can still store data
+ contiguously in nand. Even if we can't do IO merge in OS layer, SSD firmware
+ can do it.
+
+ 2. 1:1 size. We write compressed data to SSD, which means less data is
+ written to SSD. This helps SSD garbage collection and write speed, and
+ increases its life span. Advanced device FTL can easily do thin
+ provisioning. For example, if nand is 1T, we can let the device report it
+ as 2T and use the SSD as the compressed target. In such cases, we
+ alleviate the issue.
+
+
+
+
+ This target need not necessarily be backed by an FTL-supporting device in
+ order to be functional. However, having such a device can help maximize
+ the benefits.
+
+
+Author:
+ Shaohua Li <shli@fusionio.com>
+ Ram Pai <linuxram@us.ibm.com>
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b7767da..2eece2a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -508,4 +508,10 @@ config DM_LOG_WRITES
If unsure, say N.
+config DM_INPLACE_COMPRESS
+ tristate "Inplace Compression target"
+ depends on BLK_DEV_DM
+ ---help---
+ Allow volume managers to compress data for SSD.
+
endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1a..4525482 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -59,6 +59,8 @@ obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o
obj-$(CONFIG_DM_ERA) += dm-era.o
obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
+obj-$(CONFIG_DM_LOG_WRITES) += dm-log-writes.o
+obj-$(CONFIG_DM_INPLACE_COMPRESS) += dm-inplace-compress.o
ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
diff --git a/drivers/md/dm-inplace-compress.c b/drivers/md/dm-inplace-compress.c
new file mode 100644
index 0000000..bc3866b
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.c
@@ -0,0 +1,2295 @@
+/*
+ * device mapper compression block device.
+ *
+ * Released under GPL v2.
+ *
+ */
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/crypto.h>
+#include <linux/lzo.h>
+#include <linux/kthread.h>
+#include <linux/page-flags.h>
+#include <linux/completion.h>
+#include <linux/vmalloc.h>
+#include <linux/parser.h>
+#include "dm-inplace-compress.h"
+
+#define DM_MSG_PREFIX "dm-inplace-compress"
+
+
+static const struct kernel_param_ops dm_icomp_alloc_param_ops = {
+ .set = param_set_ulong,
+ .get = param_get_ulong,
+};
+
+static atomic64_t dm_icomp_total_alloc_size;
+#define DMICP_ALLOC(s) atomic64_add(s, &dm_icomp_total_alloc_size)
+#define DMICP_FREE_ALLOC(s) atomic64_sub(s, &dm_icomp_total_alloc_size)
+module_param_cb(dm_icomp_total_alloc_size, &dm_icomp_alloc_param_ops,
+ &dm_icomp_total_alloc_size, 0644);
+
+static atomic64_t dm_icomp_total_bio_save;
+#define DMICP_ALLOC_SAVE(s) {atomic64_add(s, &dm_icomp_total_bio_save); }
+module_param_cb(dm_icomp_total_bio_save, &dm_icomp_alloc_param_ops,
+ &dm_icomp_total_bio_save, 0644);
+
+
+static struct kmem_cache *dm_icomp_req_cachep;
+static struct kmem_cache *dm_icomp_io_range_cachep;
+static struct kmem_cache *dm_icomp_meta_io_cachep;
+
+static struct dm_icomp_io_worker dm_icomp_io_workers[NR_CPUS];
+static struct workqueue_struct *dm_icomp_wq;
+
+/*
+ *****************************************************
+ * compressor selection logic
+ *****************************************************
+ */
+static struct dm_icomp_compressor_data compressors[] = {
+ [DMICP_COMP_ALG_LZO] = {
+ .name = "lzo",
+ .can_handle_overflow = false,
+ .comp_len = lzo_comp_len,
+ .max_comp_len = lzo_max_comp_len,
+ },
+ [DMICP_COMP_ALG_842] = {
+ .name = "842",
+ .can_handle_overflow = true,
+ .comp_len = nx842_comp_len,
+ .max_comp_len = nx842_max_comp_len,
+ },
+};
+
+static int default_compressor = DMICP_COMP_ALG_LZO;
+#define DMICP_ALGO_LENGTH 9
+static char dm_icomp_algorithm[DMICP_ALGO_LENGTH] = "lzo";
+static struct kparam_string dm_icomp_compressor_kparam = {
+ .string = dm_icomp_algorithm,
+ .maxlen = sizeof(dm_icomp_algorithm),
+};
+static int dm_icomp_compressor_param_set(const char *,
+ const struct kernel_param *);
+static struct kernel_param_ops dm_icomp_compressor_param_ops = {
+ .set = dm_icomp_compressor_param_set,
+ .get = param_get_string,
+};
+module_param_cb(compress_algorithm, &dm_icomp_compressor_param_ops,
+ &dm_icomp_compressor_kparam, 0644);
+
+
+
+static int get_comp_id(const char *s)
+{
+ int r, val_len;
+
+ if (!crypto_has_comp(s, 0, 0))
+ return -1;
+
+ for (r = 0; r < ARRAY_SIZE(compressors); r++) {
+ val_len = strlen(compressors[r].name);
+ if (!strncmp(s, compressors[r].name, val_len))
+ return r;
+ }
+ return -1;
+}
+
+static const char *get_comp_name(int id)
+{
+ if (id < 0 || id >= ARRAY_SIZE(compressors))
+ return NULL;
+ return compressors[id].name;
+}
+
+static void set_default_compressor(int index)
+{
+ default_compressor = index;
+ strlcpy(dm_icomp_algorithm, compressors[index].name,
+ sizeof(dm_icomp_algorithm));
+ DMINFO("compressor is %s", dm_icomp_algorithm);
+}
+
+static inline int get_default_compressor(void)
+{
+ return default_compressor;
+}
+
+static int select_default_compressor(void)
+{
+ int r;
+ int arr_size = ARRAY_SIZE(compressors);
+
+ for (r = 0; r < arr_size; r++)
+ if (crypto_has_comp(compressors[r].name, 0, 0))
+ break;
+ if (r >= arr_size) {
+ DMWARN("No crypto compressors are supported");
+ return -EINVAL;
+ }
+ set_default_compressor(r);
+ return 0;
+}
+
+static int dm_icomp_compressor_param_set(const char *val,
+ const struct kernel_param *kp)
+{
+ int ret;
+ char str[kp->str->maxlen], *s;
+ int val_len = strlen(val)+1;
+
+ strlcpy(str, val, min_t(size_t, val_len, sizeof(str)));
+ s = strim(str);
+ ret = get_comp_id(s);
+ if (ret < 0) {
+ DMWARN("Compressor %s not supported", s);
+ return -EINVAL;
+ }
+ set_default_compressor(ret);
+ return 0;
+}
+
+static void free_compressor(struct dm_icomp_info *info)
+{
+ int i;
+
+ for_each_possible_cpu(i) {
+ if (info->tfm[i]) {
+ crypto_free_comp(info->tfm[i]);
+ info->tfm[i] = NULL;
+ }
+ }
+}
+
+static int alloc_compressor(struct dm_icomp_info *info)
+{
+ int i;
+ const char *alg_name = get_comp_name(info->comp_alg);
+
+ for_each_possible_cpu(i) {
+ info->tfm[i] = crypto_alloc_comp(
+ alg_name, 0, 0);
+ if (IS_ERR(info->tfm[i])) {
+ /* free_compressor() must not see an ERR_PTR */
+ info->tfm[i] = NULL;
+ goto err;
+ }
+ }
+ return 0;
+
+err:
+ free_compressor(info);
+ return -ENOMEM;
+}
+
+/**** END compressor select logic ****/
+
+
+/***** metadata logic ***************/
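+/*
+ * Worked example of the bit packing used by dm_icomp_get_meta() and
+ * dm_icomp_set_meta(). Illustration only: it assumes
+ * DMICP_META_BITS == 5 and DMICP_BITS_PER_ENTRY == 32, which are
+ * defined elsewhere and may differ.
+ *
+ * block_index = 6 -> first_bit = 6 * 5 = 30
+ * entry = 30 >> 5 = 0, offset = 30 & 31 = 30
+ * bits = min(5, 32 - 30) = 2
+ *
+ * The low 2 bits of the value live in bits 30..31 of entry 0, the
+ * remaining 3 bits live in bits 0..2 of entry 1.
+ */
+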
+/*
+ * return the meta data bits corresponding to a block
+ * @block_index : the index of the block
+ */
+static u8 dm_icomp_get_meta(struct dm_icomp_info *info, u64 block_index)
+{
+ u64 first_bit = block_index * DMICP_META_BITS;
+ int bits, offset;
+ u32 data;
+ u8 ret = 0;
+
+ offset = first_bit & (DMICP_BITS_PER_ENTRY-1);
+ bits = min_t(u32, DMICP_META_BITS, DMICP_BITS_PER_ENTRY - offset);
+
+ data = (u32)info->meta_bitmap[first_bit >> DMICP_META_BITS];
+ ret = (data >> offset) & ((1 << bits) - 1);
+
+ if (bits < DMICP_META_BITS) {
+ data = info->meta_bitmap[(first_bit >> DMICP_META_BITS) + 1];
+ bits = DMICP_META_BITS - bits;
+ ret |= (data & ((1 << bits) - 1)) << (DMICP_META_BITS - bits);
+ }
+ return ret;
+}
+
+
+static void dm_icomp_mark_page(struct dm_icomp_info *info, u32 *addr,
+ bool dirty_meta)
+{
+ struct page *page;
+
+ page = vmalloc_to_page(addr);
+ if (!page)
+ return;
+ if (dirty_meta)
+ SetPageDirty(page);
+ else
+ ClearPageDirty(page);
+}
+
+/*
+ * set the meta data bits corresponding to a block
+ * @block_index : the index of the block
+ * @meta : the meta data bits.
+ */
+static void dm_icomp_set_meta(struct dm_icomp_info *info, u64 block_index,
+ u8 meta, bool dirty_meta)
+{
+ u64 first_bit = block_index * DMICP_META_BITS;
+ int bits, offset;
+ u32 data;
+
+ offset = first_bit & (DMICP_BITS_PER_ENTRY-1);
+ bits = min_t(u32, DMICP_META_BITS, DMICP_BITS_PER_ENTRY - offset);
+
+
+ data = (u32)info->meta_bitmap[first_bit >> DMICP_META_BITS];
+ data &= ~(((1 << bits) - 1) << offset);
+ data |= (meta & ((1 << bits) - 1)) << offset;
+ info->meta_bitmap[first_bit >> DMICP_META_BITS] = (u32)data;
+
+ if (info->write_mode == DMICP_WRITE_BACK)
+ dm_icomp_mark_page(info,
+ &info->meta_bitmap[first_bit >> DMICP_META_BITS],
+ dirty_meta);
+
+ if (bits < DMICP_META_BITS) {
+ meta >>= bits;
+ data = (u32)
+ info->meta_bitmap[(first_bit >> DMICP_META_BITS) + 1];
+ bits = DMICP_META_BITS - bits;
+ data = (data >> bits) << bits;
+ data |= meta & ((1 << bits) - 1);
+ info->meta_bitmap[(first_bit >> DMICP_META_BITS) + 1] =
+ (u32)data;
+
+ if (info->write_mode == DMICP_WRITE_BACK)
+ dm_icomp_mark_page(info,
+ &info->meta_bitmap[(first_bit >> DMICP_META_BITS) + 1],
+ dirty_meta);
+ }
+}
+
+
+/*
+ * set the meta data bits corresponding to an extent
+ * @block : the index of the first block of the extent
+ * @logical_blocks: the number of blocks in the extent
+ * @data_sectors: the number of sectors holding the compressed
+ * data
+ */
+static void dm_icomp_set_extent(struct dm_icomp_req *req, u64 block,
+ u16 logical_blocks, sector_t data_sectors)
+{
+ int i;
+ u8 data;
+
+ for (i = 0; i < logical_blocks; i++) {
+ data = min_t(sector_t, data_sectors, 8);
+ data_sectors -= data;
+ if (i != 0)
+ data |= DMICP_TAIL_MASK;
+ /* For FUA, we write out meta data directly */
+ dm_icomp_set_meta(req->info, block + i, data,
+ !(req->bio->bi_opf & REQ_FUA));
+ }
+}
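+
+/*
+ * Worked example of the extent encoding written above (illustration
+ * only, derived from the loop in dm_icomp_set_extent()):
+ *
+ * logical_blocks = 3, data_sectors = 11
+ *
+ * block + 0 : 8 (head block, no tail flag)
+ * block + 1 : 3 | DMICP_TAIL_MASK
+ * block + 2 : 0 | DMICP_TAIL_MASK
+ *
+ * dm_icomp_get_extent() walks back to the head block and sums the
+ * length fields again: 8 + 3 + 0 = 11 data sectors.
+ */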
+
+/*
+ * get the meta data bits corresponding to an extent
+ * @block_index : the index of any block inside the extent
+ * @first_block_index: return the index of the first block of the extent
+ * @logical_sectors: return the number of logical sectors in the extent
+ * @data_sectors: return the number of sectors holding the compressed
+ * data
+ */
+static void dm_icomp_get_extent(struct dm_icomp_info *info, u64 block_index,
+ u64 *first_block_index, u16 *logical_sectors, u16 *data_sectors)
+{
+ u8 data;
+
+ data = dm_icomp_get_meta(info, block_index);
+ while (data & DMICP_TAIL_MASK) {
+ block_index--;
+ data = dm_icomp_get_meta(info, block_index);
+ }
+ *first_block_index = block_index;
+ *logical_sectors = DMICP_BYTES_TO_SECTOR(DMICP_BLOCK_SIZE);
+ *data_sectors = data & DMICP_LENGTH_MASK;
+ block_index++;
+ while (block_index < info->data_blocks) {
+ data = dm_icomp_get_meta(info, block_index);
+ if (!(data & DMICP_TAIL_MASK))
+ break;
+ *logical_sectors += DMICP_BYTES_TO_SECTOR(DMICP_BLOCK_SIZE);
+ *data_sectors += data & DMICP_LENGTH_MASK;
+ block_index++;
+ }
+}
+
+/*
+ * read or write the super block, depending on @op
+ */
+static int dm_icomp_access_super(struct dm_icomp_info *info, void *addr,
+ int op, int flag)
+{
+ struct dm_io_region region;
+ struct dm_io_request req;
+ unsigned long io_error = 0;
+ int ret;
+
+ region.bdev = info->dev->bdev;
+ region.sector = 0;
+ region.count = DMICP_BYTES_TO_SECTOR(DMICP_BLOCK_SIZE);
+
+ req.bi_op = op;
+ req.bi_op_flags = flag;
+ req.mem.type = DM_IO_KMEM;
+ req.mem.offset = 0;
+ req.mem.ptr.addr = addr;
+ req.notify.fn = NULL;
+ req.client = info->io_client;
+
+ ret = dm_io(&req, 1, &region, &io_error);
+ if (ret || io_error)
+ return -EIO;
+ return 0;
+}
+
+static void dm_icomp_meta_io_done(unsigned long error, void *context)
+{
+ struct dm_icomp_meta_io *meta_io = context;
+
+ meta_io->fn(meta_io->data, error);
+ kmem_cache_free(dm_icomp_meta_io_cachep, meta_io);
+}
+
+static inline int get_alloc_flag(struct dm_icomp_info *info)
+{
+ /*
+ * Use GFP_ATOMIC allocations if the device
+ * is used on the critical path
+ */
+ return info->critical ? GFP_ATOMIC : GFP_NOIO;
+}
+
+/*
+ * write meta data to the meta blocks in the backing store.
+ */
+static int dm_icomp_write_meta(struct dm_icomp_info *info, u64 start_page,
+ u64 end_page, void *data,
+ void (*fn)(void *data, unsigned long error), int rw, int flags)
+{
+ struct dm_icomp_meta_io *meta_io;
+ sector_t sector, last_sector, last_meta_sector = info->data_start-1;
+
+ WARN_ON(end_page > info->meta_bitmap_pages);
+
+ sector = DMICP_META_START_SECTOR + (start_page <<
+ (PAGE_SHIFT - SECTOR_SHIFT));
+ WARN_ON(sector > last_meta_sector);
+ if (sector > last_meta_sector) {
+ fn(data, -EINVAL);
+ return -EINVAL;
+ }
+ last_sector = sector + ((end_page - start_page) <<
+ (PAGE_SHIFT - SECTOR_SHIFT));
+ if (last_sector > last_meta_sector)
+ last_sector = last_meta_sector;
+
+
+ meta_io = kmem_cache_alloc(dm_icomp_meta_io_cachep,
+ get_alloc_flag(info));
+ if (!meta_io) {
+ fn(data, -ENOMEM);
+ return -ENOMEM;
+ }
+ meta_io->data = data;
+ meta_io->fn = fn;
+
+ meta_io->io_region.bdev = info->dev->bdev;
+
+
+ meta_io->io_region.sector = sector;
+ meta_io->io_region.count = last_sector - sector + 1;
+ atomic64_add(DMICP_SECTOR_TO_BYTES(meta_io->io_region.count),
+ &info->meta_write_size);
+
+ meta_io->io_req.bi_op = rw;
+ meta_io->io_req.bi_op_flags = flags;
+ meta_io->io_req.mem.type = DM_IO_VMA;
+ meta_io->io_req.mem.offset = 0;
+ meta_io->io_req.mem.ptr.addr = ((char *)(info->meta_bitmap)) +
+ (start_page << PAGE_SHIFT);
+ meta_io->io_req.notify.fn = dm_icomp_meta_io_done;
+ meta_io->io_req.notify.context = meta_io;
+ meta_io->io_req.client = info->io_client;
+
+ dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
+ return 0;
+}
+
+struct writeback_flush_data {
+ struct completion complete;
+ atomic_t cnt;
+};
+
+static void writeback_flush_io_done(void *data, unsigned long error)
+{
+ struct writeback_flush_data *wb = data;
+
+ if (atomic_dec_return(&wb->cnt))
+ return;
+ complete(&wb->complete);
+}
+
+static void dm_icomp_flush_dirty_meta(struct dm_icomp_info *info,
+ struct writeback_flush_data *data)
+{
+ struct page *page;
+ u64 start = 0, index;
+ u32 pending = 0, cnt = 0;
+ bool dirty;
+ struct blk_plug plug;
+
+ blk_start_plug(&plug);
+ for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
+ if (cnt == 256) {
+ cnt = 0;
+ cond_resched();
+ }
+
+ page = vmalloc_to_page((char *)(info->meta_bitmap) +
+ (index << PAGE_SHIFT));
+ if (!page)
+ DMWARN("Unable to find page for block=%llu", index);
+ dirty = page ? TestClearPageDirty(page) : false;
+
+ if (pending == 0 && dirty) {
+ start = index;
+ pending++;
+ continue;
+ } else if (pending == 0)
+ continue;
+ else if (pending > 0 && dirty) {
+ pending++;
+ continue;
+ }
+
+ /* pending > 0 && !dirty */
+ atomic_inc(&data->cnt);
+ dm_icomp_write_meta(info, start, start + pending, data,
+ writeback_flush_io_done, REQ_OP_WRITE, WRITE);
+ pending = 0;
+ }
+
+ if (pending > 0) {
+ atomic_inc(&data->cnt);
+ dm_icomp_write_meta(info, start, start + pending, data,
+ writeback_flush_io_done, REQ_OP_WRITE, WRITE);
+ }
+ blkdev_issue_flush(info->dev->bdev, get_alloc_flag(info), NULL);
+ blk_finish_plug(&plug);
+}
+
+static int dm_icomp_meta_writeback_thread(void *data)
+{
+ struct dm_icomp_info *info = data;
+ struct writeback_flush_data wb;
+
+ atomic_set(&wb.cnt, 1);
+ init_completion(&wb.complete);
+
+ while (!kthread_should_stop()) {
+ schedule_timeout_interruptible(
+ msecs_to_jiffies(info->writeback_delay * 1000));
+ dm_icomp_flush_dirty_meta(info, &wb);
+ }
+
+ dm_icomp_flush_dirty_meta(info, &wb);
+
+ writeback_flush_io_done(&wb, 0);
+ wait_for_completion(&wb.complete);
+ return 0;
+}
+
+static int dm_icomp_init_meta(struct dm_icomp_info *info, bool new)
+{
+ struct dm_io_region region;
+ struct dm_io_request req;
+ unsigned long io_error = 0;
+ struct blk_plug plug;
+ int ret;
+ ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits,
+ DMICP_BITS_PER_ENTRY);
+
+ len *= (DMICP_BITS_PER_ENTRY >> 3);
+
+ region.bdev = info->dev->bdev;
+ region.sector = DMICP_META_START_SECTOR;
+ region.count = DMICP_BYTES_TO_SECTOR(round_up(len,
+ DMICP_SECTOR_SIZE));
+
+ req.mem.type = DM_IO_VMA;
+ req.mem.offset = 0;
+ req.mem.ptr.addr = info->meta_bitmap;
+ req.notify.fn = NULL;
+ req.client = info->io_client;
+
+ blk_start_plug(&plug);
+ if (new) {
+ memset(info->meta_bitmap, 0, len);
+ req.bi_op = REQ_OP_WRITE;
+ req.bi_op_flags = REQ_FUA;
+ ret = dm_io(&req, 1, &region, &io_error);
+ } else {
+ req.bi_op = REQ_OP_READ;
+ req.bi_op_flags = READ;
+ ret = dm_io(&req, 1, &region, &io_error);
+ }
+ blk_finish_plug(&plug);
+
+ if (ret || io_error) {
+ info->ti->error = "Access metadata error";
+ return -EIO;
+ }
+
+ if (info->write_mode == DMICP_WRITE_BACK) {
+ info->writeback_tsk = kthread_run(
+ dm_icomp_meta_writeback_thread,
+ info, "dm_icomp_writeback");
+ if (IS_ERR(info->writeback_tsk)) {
+ info->ti->error = "Create writeback thread error";
+ return PTR_ERR(info->writeback_tsk);
+ }
+ }
+
+ return 0;
+}
+
+/***** END metadata logic *****/
+
+
+#define SET_REQ_STAGE(req, value) ((req)->stage = (value))
+#define GET_REQ_STAGE(req) ((req)->stage)
+
+
+static void print_max_sectors_possible(struct dm_icomp_info *info)
+{
+ u64 total_blocks, data_blocks, meta_blocks, no_pairs;
+ u32 pair_blocks, rem;
+
+ /* superblock takes away one block */
+ total_blocks = DMICP_BYTES_TO_BLOCK(i_size_read(
+ info->dev->bdev->bd_inode)) - 1;
+
+ /* number of datablocks representable by one metadata block. */
+ data_blocks = div64_long((DMICP_BLOCK_SIZE * 8),
+ DMICP_META_BITS);
+ meta_blocks = 1; /* we need this one meta data block for sure. */
+
+
+ /* how many such pairing can we make ? */
+ pair_blocks = data_blocks + meta_blocks;
+ no_pairs = div64_long(total_blocks, pair_blocks);
+
+ /*
+ * no_pairs full groups fit on the device, each made of one
+ * metadata block plus the data blocks it describes.
+ */
+ data_blocks *= no_pairs;
+ meta_blocks *= no_pairs;
+
+ div_u64_rem(total_blocks, pair_blocks, &rem);
+ if (rem) {
+ /* we have some remaining blocks.
+ * give one to meta and remaining to data.
+ */
+ meta_blocks++;
+ data_blocks += (rem - 1);
+ }
+
+ DMINFO("This device can accommodate at most %llu sectors",
+ DMICP_BLOCK_TO_SECTOR(data_blocks));
+}
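+
+/*
+ * Worked example for print_max_sectors_possible(). Illustration only:
+ * it assumes DMICP_BLOCK_SIZE == 4096 and DMICP_META_BITS == 5, which
+ * are defined elsewhere and may differ.
+ *
+ * data blocks per meta block = (4096 * 8) / 5 = 6553
+ * pair_blocks = 6553 + 1 = 6554
+ * total_blocks = 100000 -> no_pairs = 15, rem = 1690
+ * data_blocks = 15 * 6553 + (1690 - 1) = 99984
+ * meta_blocks = 15 + 1 = 16
+ *
+ * 99984 + 16 == 100000, so every block after the super block is used.
+ */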
+
+
+/*
+ * read the super block; if no valid one is found, create a new
+ * super block and initialize its contents.
+ */
+static int dm_icomp_read_or_create_super(struct dm_icomp_info *info)
+{
+ void *addr, *bitmap_addr;
+ struct dm_icomp_super_block *super;
+ u64 total_blocks, data_blocks, meta_blocks;
+ bool new_super = false;
+ int ret;
+ ssize_t len;
+
+ info->total_sector = DMICP_BYTES_TO_SECTOR(
+ i_size_read(info->dev->bdev->bd_inode));
+ total_blocks = DMICP_SECTOR_TO_BLOCK(info->total_sector) - 1;
+
+ data_blocks = DMICP_SECTOR_TO_BLOCK(info->ti->len);
+ meta_blocks = div64_long(((data_blocks * DMICP_META_BITS) +
+ ((DMICP_BLOCK_SIZE * 8) - 1)), (DMICP_BLOCK_SIZE * 8));
+
+
+ info->data_blocks = data_blocks;
+ info->data_start = DMICP_BLOCK_TO_SECTOR(1 + meta_blocks);
+
+ DMINFO(
+ "data_start=%u data_blocks=%llu metablocks=%llu total_blocks=%llu",
+ (unsigned int)info->data_start, info->data_blocks,
+ meta_blocks, total_blocks);
+
+ if (DMICP_BLOCK_TO_SECTOR(data_blocks + meta_blocks + 1)
+ > info->total_sector) {
+ print_max_sectors_possible(info);
+ info->ti->error =
+ "Insufficient sectors to satisfy requested size";
+ return -ENOMEM;
+ }
+
+ addr = kzalloc(DMICP_BLOCK_SIZE+DMICP_SECTOR_SIZE, GFP_KERNEL);
+ if (!addr) {
+ info->ti->error = "Cannot allocate super";
+ return -ENOMEM;
+ }
+
+ super = PTR_ALIGN(addr, DMICP_SECTOR_SIZE);
+ ret = dm_icomp_access_super(info, super, REQ_OP_READ, REQ_FUA);
+ if (ret)
+ goto out;
+
+ if (le64_to_cpu(super->magic) == DMICP_SUPER_MAGIC) {
+
+ const char *alg_name;
+
+ if (le64_to_cpu(super->meta_blocks) != meta_blocks ||
+ le64_to_cpu(super->data_blocks) != data_blocks) {
+ info->ti->error = "Super is invalid";
+ ret = -EINVAL;
+ goto out;
+ }
+
+ alg_name = get_comp_name(super->comp_alg);
+ if (!alg_name || !crypto_has_comp(alg_name, 0, 0)) {
+ info->ti->error =
+ "Compressor algorithm is not supported";
+ ret = -EINVAL;
+ goto out;
+ }
+ info->comp_alg = super->comp_alg;
+
+ } else {
+ super->magic = cpu_to_le64(DMICP_SUPER_MAGIC);
+ super->meta_blocks = cpu_to_le64(meta_blocks);
+ super->data_blocks = cpu_to_le64(data_blocks);
+ super->comp_alg = info->comp_alg;
+ ret = dm_icomp_access_super(info, super, REQ_OP_WRITE,
+ REQ_FUA);
+ if (ret) {
+ info->ti->error = "Access super fails";
+ goto out;
+ }
+ new_super = true;
+ }
+
+ if (alloc_compressor(info)) {
+ info->ti->error = "Cannot allocate compressor";
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ info->meta_bitmap_bits = data_blocks * DMICP_META_BITS;
+ len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, DMICP_BITS_PER_ENTRY);
+ len *= (DMICP_BITS_PER_ENTRY >> 3);
+ info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ bitmap_addr = vzalloc((info->meta_bitmap_pages * PAGE_SIZE) +
+ DMICP_SECTOR_SIZE);
+ if (!bitmap_addr) {
+ info->ti->error = "Cannot allocate bitmap";
+ ret = -ENOMEM;
+ goto bitmap_err;
+ }
+ info->meta_bitmap = PTR_ALIGN(bitmap_addr, DMICP_SECTOR_SIZE);
+
+ ret = dm_icomp_init_meta(info, new_super);
+ if (ret)
+ goto meta_err;
+
+ /* the super block buffer is only needed during setup */
+ kfree(addr);
+ return 0;
+meta_err:
+ vfree(bitmap_addr);
+bitmap_err:
+ free_compressor(info);
+out:
+ kfree(addr);
+ return ret;
+}
+
+enum {
+ Opt_wb, Opt_wt, Opt_dev, Opt_critical, Opt_compressor, Opt_err
+};
+
+static const match_table_t dm_icomp_tokens = {
+ {Opt_wb, "writeback=%u"},
+ {Opt_wb, "writeback:%u"},
+ {Opt_dev, "device:%s"},
+ {Opt_dev, "device=%s"},
+ {Opt_wt, "writethrough"},
+ {Opt_critical, "critical"},
+ {Opt_compressor, "compressor=%s"},
+ {Opt_compressor, "compressor:%s"},
+ {Opt_err, NULL}
+};
+
+static char *generate_cmdline(unsigned int argc, char **argv)
+{
+ int i, len = 0;
+ char *cmdline;
+
+ for (i = 0 ; i < argc; i++)
+ len += strlen(argv[i]);
+ cmdline = kmalloc(len+1, GFP_KERNEL);
+ if (!cmdline)
+ return NULL;
+
+ cmdline[0] = '\0';
+ for (i = 0 ; i < argc; i++)
+ strcat(cmdline, argv[i]);
+
+ return cmdline;
+}
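+
+/*
+ * Example of what generate_cmdline() produces. Illustration only; the
+ * device path is hypothetical. The arguments are joined without any
+ * separator, so the commas must already be part of the arguments:
+ *
+ * argv[0] = "device=/dev/sdb,"
+ * argv[1] = "writeback=5"
+ *
+ * cmdline = "device=/dev/sdb,writeback=5"
+ *
+ * dm_icomp_ctr() then splits this string on ',' and matches each piece
+ * against dm_icomp_tokens.
+ */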
+
+
+/*
+ * <device>:<path to device> || <device>=<path to device>,
+ * [ <writeback>=<meta_commit_delay> ],
+ * [ <writeback>:<meta_commit_delay> ],
+ * [ <writethrough> ],
+ * [ <compressor>=<type> ],
+ * [ <compressor>:<type> ],
+ * [ <critical> ]
+ */
+static int dm_icomp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct dm_icomp_info *info;
+ substring_t args[MAX_OPT_ARGS];
+ char *cmdline = NULL, *cur, *mode, *device = NULL;
+ int ret, i;
+ char *p;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info) {
+ ti->error = "dm-inplace-compress: Cannot allocate context";
+ return -ENOMEM;
+ }
+
+ cmdline = generate_cmdline(argc, argv);
+ if (!cmdline) {
+ ti->error = "dm-inplace-compress: Cannot allocate memory";
+ ret = -ENOMEM;
+ goto err_para;
+ }
+
+ info->ti = ti;
+ info->comp_alg = get_default_compressor();
+ info->critical = false;
+
+ /* strsep() advances its argument; keep cmdline intact for kfree() */
+ cur = cmdline;
+ while ((p = strsep(&cur, ",")) != NULL) {
+
+ int token;
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, dm_icomp_tokens, args);
+
+ switch (token) {
+ case Opt_wb:
+ if (match_int(&args[0], &info->writeback_delay)) {
+ ti->error = "Invalid argument";
+ ret = -EINVAL;
+ goto err_para;
+ }
+ info->write_mode = DMICP_WRITE_BACK;
+ break;
+
+ case Opt_wt:
+ info->write_mode = DMICP_WRITE_THROUGH;
+ break;
+
+ case Opt_dev:
+ device = match_strdup(&args[0]);
+ break;
+
+ case Opt_critical:
+ info->critical = true;
+ break;
+
+ case Opt_compressor:
+ mode = match_strdup(&args[0]);
+ if (!mode) {
+ ti->error = "Invalid argument for compressor";
+ ret = -EINVAL;
+ goto err_para;
+ }
+ DMINFO("compressor is %s", mode);
+ ret = get_comp_id(mode);
+ kfree(mode);
+ if (ret < 0) {
+ ti->error = "Unsupported compressor";
+ ret = -EINVAL;
+ goto err_para;
+ }
+ info->comp_alg = ret;
+ break;
+ }
+ }
+
+ if (!device ||
+ dm_get_device(ti, device, dm_table_get_mode(ti->table),
+ &info->dev)) {
+ ti->error = "Can't get device";
+ ret = -EINVAL;
+ goto err_para;
+ }
+ kfree(device);
+
+ info->io_client = dm_io_client_create();
+ if (!info->io_client) {
+ ti->error = "Can't create io client";
+ ret = -EINVAL;
+ goto err_ioclient;
+ }
+
+ if (bdev_logical_block_size(info->dev->bdev) != DMICP_SECTOR_SIZE) {
+ ti->error = "Logical block size too big";
+ ret = -EINVAL;
+ goto err_blocksize;
+ }
+
+ if (dm_set_target_max_io_len(ti, DMICP_MAX_SECTORS)) {
+ ti->error = "Failed to configure device";
+ ret = -EINVAL;
+ goto err_blocksize;
+ }
+
+ if (dm_icomp_read_or_create_super(info)) {
+ ret = -EINVAL;
+ goto err_blocksize;
+ }
+
+ for (i = 0; i < BITMAP_HASH_LEN; i++) {
+ info->bitmap_locks[i].io_running = 0;
+ spin_lock_init(&info->bitmap_locks[i].wait_lock);
+ INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
+ }
+
+ kfree(cmdline);
+ atomic64_set(&info->compressed_write_size, 0);
+ atomic64_set(&info->uncompressed_write_size, 0);
+ atomic64_set(&info->meta_write_size, 0);
+ atomic64_set(&dm_icomp_total_alloc_size, 0);
+ atomic64_set(&dm_icomp_total_bio_save, 0);
+
+ ti->num_flush_bios = 1;
+ ti->private = info;
+ return 0;
+
+err_blocksize:
+ dm_io_client_destroy(info->io_client);
+err_ioclient:
+ dm_put_device(ti, info->dev);
+err_para:
+ kfree(cmdline);
+ kfree(info);
+ return ret;
+}
+
+static void dm_icomp_dtr(struct dm_target *ti)
+{
+ struct dm_icomp_info *info = ti->private;
+
+ if (info->write_mode == DMICP_WRITE_BACK)
+ kthread_stop(info->writeback_tsk);
+ free_compressor(info);
+ vfree(info->meta_bitmap);
+ dm_io_client_destroy(info->io_client);
+ dm_put_device(ti, info->dev);
+ kfree(info);
+}
+
+/*
+ * return the range lock for this block.
+ */
+static struct dm_icomp_hash_lock *dm_icomp_block_hash_lock(
+ struct dm_icomp_info *info, u64 block_index)
+{
+ return &info->bitmap_locks[(block_index >> BITMAP_HASH_SHIFT) &
+ BITMAP_HASH_MASK];
+}
+
+/*
+ * try to lock the io range corresponding to this block. If the range
+ * is already locked, queue the request on its wait list and return NULL.
+ */
+static struct dm_icomp_hash_lock *dm_icomp_trylock_block(
+ struct dm_icomp_info *info,
+ struct dm_icomp_req *req, u64 block_index)
+{
+ struct dm_icomp_hash_lock *hash_lock;
+
+ hash_lock = dm_icomp_block_hash_lock(req->info, block_index);
+
+ spin_lock_irq(&hash_lock->wait_lock);
+ if (!hash_lock->io_running) {
+ hash_lock->io_running = 1;
+ spin_unlock_irq(&hash_lock->wait_lock);
+ return hash_lock;
+ }
+ list_add_tail(&req->sibling, &hash_lock->wait_list);
+ spin_unlock_irq(&hash_lock->wait_lock);
+ return NULL;
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+ struct list_head *list);
+
+static void dm_icomp_unlock_block(struct dm_icomp_info *info,
+ struct dm_icomp_req *req, struct dm_icomp_hash_lock *hash_lock)
+{
+ LIST_HEAD(pending_list);
+ unsigned long flags;
+
+ spin_lock_irqsave(&hash_lock->wait_lock, flags);
+ /* wakeup all pending reqs to avoid live lock */
+ list_splice_init(&hash_lock->wait_list, &pending_list);
+ hash_lock->io_running = 0;
+ spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
+
+ dm_icomp_queue_req_list(info, &pending_list);
+}
+
+/*
+ * lock all the range locks corresponding to this io request.
+ */
+static int dm_icomp_lock_req_range(struct dm_icomp_req *req)
+{
+ u64 block_index, first_block_index;
+ u64 first_lock_block, second_lock_block;
+ u16 logical_sectors, data_sectors;
+
+ block_index = DMICP_SECTOR_TO_BLOCK(req->bio->bi_iter.bi_sector);
+ req->locks[0] = dm_icomp_trylock_block(req->info, req, block_index);
+ if (!req->locks[0])
+ return 0;
+ dm_icomp_get_extent(req->info, block_index, &first_block_index,
+ &logical_sectors, &data_sectors);
+ if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+ req->locks[0]) {
+ dm_icomp_unlock_block(req->info, req, req->locks[0]);
+ first_lock_block = first_block_index;
+ second_lock_block = block_index;
+ goto two_locks;
+ }
+
+ block_index = DMICP_SECTOR_TO_BLOCK(bio_end_sector(req->bio) - 1);
+ dm_icomp_get_extent(req->info, block_index, &first_block_index,
+ &logical_sectors, &data_sectors);
+ first_block_index += DMICP_SECTOR_TO_BLOCK(logical_sectors);
+ if (dm_icomp_block_hash_lock(req->info, first_block_index) !=
+ req->locks[0]) {
+ second_lock_block = first_block_index;
+ goto second_lock;
+ }
+ req->locked_locks = 1;
+ return 1;
+
+two_locks:
+ req->locks[0] = dm_icomp_trylock_block(req->info, req,
+ first_lock_block);
+ if (!req->locks[0])
+ return 0;
+second_lock:
+ req->locks[1] = dm_icomp_trylock_block(req->info, req,
+ second_lock_block);
+ if (!req->locks[1]) {
+ dm_icomp_unlock_block(req->info, req, req->locks[0]);
+ return 0;
+ }
+ /* No need to check whether the meta data has changed */
+ req->locked_locks = 2;
+ return 1;
+}
+
+
+
+/*
+ * unlock all the range locks corresponding to this io request.
+ */
+static void dm_icomp_unlock_req_range(struct dm_icomp_req *req)
+{
+ int i;
+
+ for (i = req->locked_locks - 1; i >= 0; i--)
+ dm_icomp_unlock_block(req->info, req, req->locks[i]);
+}
+
+static void dm_icomp_queue_req(struct dm_icomp_info *info,
+ struct dm_icomp_req *req)
+{
+ unsigned long flags;
+ struct dm_icomp_io_worker *worker = &dm_icomp_io_workers[req->cpu];
+
+ spin_lock_irqsave(&worker->lock, flags);
+ list_add_tail(&req->sibling, &worker->pending);
+ spin_unlock_irqrestore(&worker->lock, flags);
+
+ queue_work_on(req->cpu, dm_icomp_wq, &worker->work);
+}
+
+static void dm_icomp_queue_req_list(struct dm_icomp_info *info,
+ struct list_head *list)
+{
+ struct dm_icomp_req *req;
+
+ while (!list_empty(list)) {
+ req = list_first_entry(list, struct dm_icomp_req, sibling);
+ list_del_init(&req->sibling);
+ dm_icomp_queue_req(info, req);
+ }
+}
+
+static void dm_icomp_get_req(struct dm_icomp_req *req)
+{
+ atomic_inc(&req->io_pending);
+}
+
+static void *dm_icomp_kmalloc(size_t size, int alloc_flag)
+{
+ void *addr = kmalloc(size, alloc_flag);
+
+ if (!addr)
+ return NULL;
+ DMICP_ALLOC(size);
+ return addr;
+}
+
+static void *dm_icomp_krealloc(void *ptr, size_t size,
+ size_t origsize, int alloc_flag)
+{
+ void *addr = krealloc(ptr, size, alloc_flag);
+
+ if (!addr)
+ return NULL;
+ DMICP_FREE_ALLOC(origsize);
+ DMICP_ALLOC(size);
+ return addr;
+}
+
+static int dm_icomp_alloc_compbuffer(struct dm_icomp_io_range *io, int size)
+{
+ int alloc_len = size + DMICP_SECTOR_SIZE;
+ void *addr = dm_icomp_kmalloc(alloc_len,
+ get_alloc_flag(io->req->info));
+
+ if (!addr)
+ return 1;
+
+ io->comp_real_data = addr;
+ io->comp_kmap = false;
+ io->comp_len = size;
+
+ /*
+ * comp_data is used to read and write from storage.
+ * So align it.
+ */
+ io->comp_data = io->io_req.mem.ptr.addr
+ = PTR_ALIGN(addr, DMICP_SECTOR_SIZE);
+
+ return 0;
+}
+
+static int dm_icomp_realloc_comp_buffer(struct dm_icomp_io_range *io, int size)
+{
+ void *addr = dm_icomp_krealloc(io->comp_real_data,
+ size+DMICP_SECTOR_SIZE, io->comp_len+DMICP_SECTOR_SIZE,
+ get_alloc_flag(io->req->info));
+ if (!addr)
+ return 1;
+
+ io->comp_real_data = addr;
+ io->comp_kmap = false;
+ io->comp_data = io->io_req.mem.ptr.addr = PTR_ALIGN(addr,
+ DMICP_SECTOR_SIZE);
+ io->comp_len = size;
+ return 0;
+}
+
+static void dm_icomp_kfree(void *addr, unsigned int size)
+{
+ kfree(addr);
+ DMICP_FREE_ALLOC(size);
+}
+
+static void dm_icomp_release_decomp_buffer(struct dm_icomp_io_range *io)
+{
+ if (!io->decomp_data)
+ return;
+
+ if (io->decomp_kmap)
+ kunmap(io->decomp_real_data);
+ else
+ dm_icomp_kfree(io->decomp_real_data, io->decomp_len);
+
+ io->decomp_data = io->decomp_real_data = NULL;
+ io->decomp_len = 0;
+ io->decomp_kmap = false;
+}
+
+static void dm_icomp_release_comp_buffer(struct dm_icomp_io_range *io)
+{
+ if (!io->comp_data)
+ return;
+
+ if (io->comp_kmap)
+ kunmap(io->comp_real_data);
+ else
+ dm_icomp_kfree(io->comp_real_data,
+ io->comp_len+DMICP_SECTOR_SIZE);
+
+ io->comp_real_data = io->comp_data = NULL;
+ io->comp_len = 0;
+ io->comp_kmap = false;
+}
+
+static void dm_icomp_free_io_range(struct dm_icomp_io_range *io)
+{
+ dm_icomp_release_decomp_buffer(io);
+ dm_icomp_release_comp_buffer(io);
+ kmem_cache_free(dm_icomp_io_range_cachep, io);
+}
+
+static void dm_icomp_put_req(struct dm_icomp_req *req)
+{
+ struct dm_icomp_io_range *io;
+
+ if (atomic_dec_return(&req->io_pending))
+ return;
+
+ if (GET_REQ_STAGE(req) == STAGE_INIT) /* waiting for locking */
+ return;
+
+ if (GET_REQ_STAGE(req) == STAGE_READ_DECOMP ||
+ GET_REQ_STAGE(req) == STAGE_WRITE_COMP)
+ SET_REQ_STAGE(req, STAGE_DONE);
+
+ if (!req->result && GET_REQ_STAGE(req) != STAGE_DONE) {
+ dm_icomp_queue_req(req->info, req);
+ return;
+ }
+
+ while (!list_empty(&req->all_io)) {
+ io = list_entry(req->all_io.next,
+ struct dm_icomp_io_range, next);
+ list_del(&io->next);
+ dm_icomp_free_io_range(io);
+ }
+
+ dm_icomp_unlock_req_range(req);
+
+ req->bio->bi_error = req->result;
+
+ bio_endio(req->bio);
+ kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
+ ssize_t len, bool to_buf)
+{
+ struct bio_vec bv;
+ struct bvec_iter iter;
+ off_t buf_off = 0;
+ ssize_t size;
+ void *addr;
+
+ WARN_ON(bio_off + len > DMICP_SECTOR_TO_BYTES(bio_sectors(bio)));
+
+ bio_for_each_segment(bv, bio, iter) {
+ int length = bv.bv_len;
+
+ if (bio_off > length) {
+ bio_off -= length;
+ continue;
+ }
+ addr = kmap_atomic(bv.bv_page);
+ size = min_t(ssize_t, len, length - bio_off);
+ if (!buf)
+ memset(addr + bio_off + bv.bv_offset, 0, size);
+ else if (to_buf)
+ memcpy(buf + buf_off, addr + bio_off + bv.bv_offset,
+ size);
+ else
+ memcpy(addr + bio_off + bv.bv_offset, buf + buf_off,
+ size);
+ kunmap_atomic(addr);
+ bio_off = 0;
+ buf_off += size;
+
+ if (len <= size)
+ break;
+
+ len -= size;
+ }
+}
+
+static void dm_icomp_io_range_done(unsigned long error, void *context)
+{
+ struct dm_icomp_io_range *io = context;
+
+ if (error)
+ io->req->result = error;
+
+ dm_icomp_put_req(io->req);
+}
+
+static inline int dm_icomp_compressor_len(struct dm_icomp_info *info, int len)
+{
+ if (compressors[info->comp_alg].comp_len)
+ return compressors[info->comp_alg].comp_len(len);
+ return len;
+}
+
+static inline bool dm_icomp_can_handle_overflow(struct dm_icomp_info *info)
+{
+ return compressors[info->comp_alg].can_handle_overflow;
+}
+
+static inline int dm_icomp_compressor_maxlen(struct dm_icomp_info *info,
+ int len)
+{
+ if (compressors[info->comp_alg].max_comp_len)
+ return compressors[info->comp_alg].max_comp_len(len);
+ return len;
+}
+
+/*
+ * the caller should set region.sector, region.count and bi_rw.
+ * IO always goes to/from comp_data.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_range(
+ struct dm_icomp_req *req)
+{
+ struct dm_icomp_io_range *io;
+
+ io = kmem_cache_alloc(dm_icomp_io_range_cachep,
+ get_alloc_flag(req->info));
+ if (!io)
+ return NULL;
+
+ io->io_req.notify.fn = dm_icomp_io_range_done;
+ io->io_req.notify.context = io;
+ io->io_req.client = req->info->io_client;
+ io->io_req.mem.type = DM_IO_KMEM;
+ io->io_req.mem.offset = 0;
+
+ io->io_region.bdev = req->info->dev->bdev;
+ io->req = req;
+
+ io->comp_data = io->comp_real_data =
+ io->decomp_data = io->decomp_real_data = NULL;
+
+ io->data_bytes = io->comp_len =
+ io->decomp_len = io->logical_bytes = 0;
+
+ io->comp_kmap = io->decomp_kmap = false;
+ return io;
+}
+
+
+/*
+ * return an address, within the bio. The address corresponds to
+ * the requested offset 'bio_off' and is contiguous of size 'len'
+ */
+static void *get_addr(struct bio *bio, int len, u64 bio_off, u64 *offset)
+{
+ struct bio_vec bv;
+ struct bvec_iter iter;
+
+ bio_for_each_segment(bv, bio, iter) {
+
+ if (bio_off < bv.bv_len) {
+ if ((bio_off + len) > bv.bv_len)
+ break;
+ *offset = bv.bv_offset + bio_off;
+ return kmap(bv.bv_page);
+ }
+ bio_off -= bv.bv_len;
+
+ }
+ return NULL;
+}
+
+
+/*
+ * create an io range for tracking predominantly a read request.
+ * @req : the read request
+ * @comp_len : allocation size of the compress buffer
+ * @decomp_len : allocation size of the decompress buffer
+ * @actual_comp_len : real size of the compress data
+ * @bio_off : offset within the bio read buffer this request corresponds to.
+ * try to reuse and read into the bio buffer. -1 means don't reuse.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_read_range(
+ struct dm_icomp_req *req, int comp_len, int decomp_len,
+ sector_t bio_off, int actual_comp_len)
+{
+ struct bio *bio = req->bio;
+ void *addr = NULL;
+ struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+ u64 offset;
+
+ if (!io)
+ return NULL;
+
+ /* try reusing the bio if possible */
+ if (bio_off != (sector_t)-1) { /* sector_t is unsigned */
+ addr = get_addr(bio, comp_len,
+ (u64)DMICP_SECTOR_TO_BYTES(bio_off), &offset);
+ if (addr) {
+ io->comp_real_data = addr;
+ io->comp_data = io->io_req.mem.ptr.addr = addr + offset;
+ io->comp_kmap = true;
+ io->comp_len = comp_len;
+ }
+ }
+
+ if (!addr && dm_icomp_alloc_compbuffer(io, comp_len)) {
+ kmem_cache_free(dm_icomp_io_range_cachep, io);
+ return NULL;
+ }
+
+ io->data_bytes = actual_comp_len; /* NOTE, this value can change */
+
+ /*
+ * note requested length for decompress buffer. Do not allocate it yet.
+ * Value once set is final.
+ */
+ io->logical_bytes = decomp_len;
+
+ return io;
+}
+
+/*
+ * allocate the decompress buffer of the io range, sized to the
+ * logical_bytes recorded when the range was created.
+ */
+static int dm_icomp_update_io_read_range(struct dm_icomp_io_range *io)
+{
+ WARN_ON(!io->comp_data);
+ WARN_ON(io->decomp_data || io->decomp_len);
+ io->decomp_data = dm_icomp_kmalloc(io->logical_bytes,
+ get_alloc_flag(io->req->info));
+ if (!io->decomp_data)
+ return 1;
+ io->decomp_real_data = io->decomp_data;
+ io->decomp_len = io->logical_bytes;
+ io->decomp_kmap = false;
+ return 0;
+}
+
+/*
+ * resize the comp buffer to its largest possible size.
+ */
+static int dm_icomp_mod_to_max_io_range(struct dm_icomp_io_range *io)
+{
+ unsigned int maxlen = dm_icomp_compressor_maxlen(io->req->info,
+ io->logical_bytes);
+
+ if (maxlen <= io->comp_len)
+ return 0;
+
+ if (io->comp_kmap) {
+ WARN_ON(io->comp_kmap);
+ kunmap(io->comp_real_data);
+ io->comp_kmap = false;
+ io->comp_real_data = io->comp_data = NULL;
+ }
+
+ if (dm_icomp_realloc_comp_buffer(io, maxlen)) {
+ io->comp_len = 0;
+ return -ENOSPC;
+ }
+ io->comp_len = maxlen;
+ return 0;
+}
+
+/*
+ * create an io range for tracking a write request.
+ * @req : the write request
+ * @count : size of the write in sectors.
+ * @offset : offset within the bio buffer this request corresponds to.
+ */
+static struct dm_icomp_io_range *dm_icomp_create_io_write_range(
+ struct dm_icomp_req *req, sector_t offset, sector_t count)
+{
+ struct bio *bio = req->bio;
+ int size = DMICP_SECTOR_TO_BYTES(count);
+ u64 of;
+ int comp_len = dm_icomp_compressor_len(req->info, size);
+ void *addr;
+ struct dm_icomp_io_range *io = dm_icomp_create_io_range(req);
+
+ if (!io)
+ return NULL;
+
+ WARN_ON(io->comp_data);
+
+ if (dm_icomp_alloc_compbuffer(io, comp_len)) {
+ kmem_cache_free(dm_icomp_io_range_cachep, io);
+ return NULL;
+ }
+
+ /* we do not know the size of the compress segment yet. */
+ io->data_bytes = 0;
+
+
+ WARN_ON(io->decomp_data);
+
+ io->decomp_kmap = false;
+
+ /* try reusing the bio buffer for decomp data. */
+ addr = get_addr(bio, size, DMICP_SECTOR_TO_BYTES(offset), &of);
+ if (addr)
+ io->decomp_kmap = true;
+ else
+ addr = dm_icomp_kmalloc(size, get_alloc_flag(req->info));
+
+ if (!addr) {
+ dm_icomp_kfree(io->comp_data, comp_len);
+ kmem_cache_free(dm_icomp_io_range_cachep, io);
+ return NULL;
+ }
+
+ io->logical_bytes = io->decomp_len = size;
+
+ if (io->decomp_kmap) {
+ io->decomp_real_data = addr;
+ io->decomp_data = addr + of;
+ DMICP_ALLOC_SAVE(size);
+ } else {
+ io->decomp_data = io->decomp_real_data = addr;
+ dm_icomp_bio_copy(req->bio, DMICP_SECTOR_TO_BYTES(offset),
+ io->decomp_data, size, true);
+ }
+
+ return io;
+}
+
+static unsigned int round_to_next_sector(unsigned int val)
+{
+ unsigned int c = round_up(val, DMICP_SECTOR_SIZE);
+
+ if ((c - val) < 2*sizeof(u32))
+ c += DMICP_SECTOR_SIZE;
+ return c;
+}
+
+/*
+ * compress and store the data in compress buffer.
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ * Compressed data size is roundup of DMICP_SECTOR_SIZE, which makes
+ * the payload.
+ * We store the actual compressed len in the last u32 of the payload.
+ * If there is no free space, we add DMICP_SECTOR_SIZE to the
+ * payload size.
+ */
+static int dm_icomp_io_range_compress(struct dm_icomp_info *info,
+ struct dm_icomp_io_range *io, unsigned int *comp_len)
+{
+ unsigned int actual_comp_len = io->comp_len;
+ u32 *addr;
+ struct crypto_comp *tfm = info->tfm[get_cpu()];
+ unsigned int decomp_len = io->logical_bytes;
+ int ret;
+
+ actual_comp_len = io->comp_len;
+ ret = crypto_comp_compress(tfm, io->decomp_data, decomp_len,
+ io->comp_data, &actual_comp_len);
+
+ if (ret || round_to_next_sector(actual_comp_len) > io->comp_len) {
+ ret = dm_icomp_mod_to_max_io_range(io);
+ if (!ret) {
+ actual_comp_len = io->comp_len;
+ ret = crypto_comp_compress(tfm, io->decomp_data,
+ decomp_len, io->comp_data,
+ &actual_comp_len);
+ }
+ }
+
+ put_cpu();
+
+ if (!ret)
+ *comp_len = round_to_next_sector(actual_comp_len);
+
+ if (ret || *comp_len >= decomp_len) {
+ WARN_ON(decomp_len > io->comp_len);
+ *comp_len = decomp_len;
+ memcpy(io->comp_data, io->decomp_data, decomp_len);
+ /* compressed_write_size is accounted once, below, for both paths */
+ } else {
+ addr = (u32 *)((char *)io->comp_data + *comp_len);
+ addr--;
+ *addr = cpu_to_le32(actual_comp_len);
+ addr--;
+ *addr = cpu_to_le32(DMICP_COMPRESS_MAGIC);
+ }
+ io->data_bytes = *comp_len;
+ atomic64_add(decomp_len, &info->uncompressed_write_size);
+ atomic64_add(*comp_len, &info->compressed_write_size);
+
+ return 0;
+}
+
+/*
+ * decompress and store the data in decompress buffer.
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ */
+static int dm_icomp_io_range_decompress(struct dm_icomp_info *info,
+ struct dm_icomp_io_range *io, unsigned int *decomp_len)
+{
+ struct crypto_comp *tfm;
+ u32 *addr;
+ int ret;
+ int comp_len = io->data_bytes;
+
+ WARN_ON(!comp_len);
+ WARN_ON(io->comp_data != io->io_req.mem.ptr.addr);
+
+ if (comp_len == io->logical_bytes) {
+ memcpy(io->decomp_data, io->comp_data, comp_len);
+ *decomp_len = comp_len;
+ return 0;
+ }
+
+ addr = (u32 *)((char *)(io->comp_data) + comp_len);
+ addr--;
+ comp_len = le32_to_cpu(*addr);
+ addr--;
+
+ if (le32_to_cpu(*addr) != DMICP_COMPRESS_MAGIC) {
+ DMWARN("Decompress Error ");
+ return -EINVAL;
+ }
+
+ tfm = info->tfm[get_cpu()];
+ *decomp_len = io->logical_bytes;
+ ret = crypto_comp_decompress(tfm, io->comp_data, comp_len,
+ io->decomp_data, decomp_len);
+ WARN_ON(*decomp_len != io->decomp_len);
+ put_cpu();
+
+ return ret;
+}
+
+/*
+ * fill the bio with the corresponding decompressed data.
+ */
+static void dm_icomp_handle_read_decomp(struct dm_icomp_req *req)
+{
+ struct dm_icomp_io_range *io;
+ off_t bio_off = 0;
+ int ret;
+ sector_t bio_len = DMICP_SECTOR_TO_BYTES(bio_sectors(req->bio));
+
+ SET_REQ_STAGE(req, STAGE_READ_DECOMP);
+
+ if (req->result)
+ return;
+
+ list_for_each_entry(io, &req->all_io, next) {
+ ssize_t dst_off = 0, src_off = 0, len;
+ unsigned int decomp_len;
+
+ io->io_region.sector -= req->info->data_start;
+
+ if (io->io_region.sector >=
+ req->bio->bi_iter.bi_sector)
+ dst_off = DMICP_SECTOR_TO_BYTES(
+ io->io_region.sector -
+ req->bio->bi_iter.bi_sector);
+ else
+ src_off = DMICP_SECTOR_TO_BYTES(
+ req->bio->bi_iter.bi_sector -
+ io->io_region.sector);
+
+ if (dm_icomp_update_io_read_range(io)) {
+ req->result = -EIO;
+ return;
+ }
+
+ /* Do decomp here */
+ ret = dm_icomp_io_range_decompress(req->info, io, &decomp_len);
+ if (ret < 0) {
+ req->result = ret;
+ goto out;
+ }
+
+ len = min_t(ssize_t,
+ max_t(ssize_t, decomp_len - src_off, 0),
+ max_t(ssize_t, bio_len - dst_off, 0));
+
+ dm_icomp_bio_copy(req->bio, dst_off,
+ io->decomp_data + src_off, len, false);
+
+ dm_icomp_release_decomp_buffer(io);
+ dm_icomp_release_comp_buffer(io);
+
+ /* io range in all_io list is ordered for read IO */
+ while (bio_off < dst_off) {
+ ssize_t size = min_t(ssize_t, PAGE_SIZE,
+ dst_off - bio_off);
+ dm_icomp_bio_copy(req->bio, bio_off, NULL,
+ size, false);
+ bio_off += size;
+ }
+
+ bio_off = dst_off + len;
+ }
+
+ while (bio_off < bio_len) {
+ ssize_t size = min_t(ssize_t, PAGE_SIZE, (bio_len - bio_off));
+
+ dm_icomp_bio_copy(req->bio, bio_off, NULL,
+ size, false);
+ bio_off += size;
+ }
+ return;
+
+out:
+ list_for_each_entry(io, &req->all_io, next) {
+ dm_icomp_release_decomp_buffer(io);
+ dm_icomp_release_comp_buffer(io);
+ }
+}
+
+
+/*
+ * read an extent
+ * @req : the read request
+ * @block : the block to be read
+ * @logical_sectors : number of sectors occupied by the decompressed data
+ * @data_sectors : number of sectors occupied by the compressed data
+ * @may_resize : the compress data size may change during its life.
+ */
+static void dm_icomp_read_one_extent(struct dm_icomp_req *req, u64 block,
+ u16 logical_sectors, u16 data_sectors, bool may_resize)
+{
+ struct dm_icomp_io_range *io;
+ sector_t offset = 0;
+ int comp_len;
+ int actual_comp_len = DMICP_SECTOR_TO_BYTES(data_sectors);
+ int actual_decomp_len = DMICP_SECTOR_TO_BYTES(logical_sectors);
+
+ comp_len = actual_comp_len;
+
+ offset = (may_resize) ? -1 :
+ DMICP_BLOCK_TO_SECTOR(block) -
+ req->bio->bi_iter.bi_sector;
+
+ io = dm_icomp_create_io_read_range(req, comp_len,
+ actual_decomp_len,
+ offset,
+ actual_comp_len);
+ if (!io) {
+ req->result = -ENOMEM;
+ return;
+ }
+
+ dm_icomp_get_req(req);
+ list_add_tail(&io->next, &req->all_io);
+
+ io->io_region.sector = DMICP_BLOCK_TO_SECTOR(block) +
+ req->info->data_start;
+ io->io_region.count = data_sectors;
+ io->io_req.mem.ptr.addr = io->comp_data;
+ io->io_req.mem.type = DM_IO_KMEM;
+ io->io_req.mem.offset = 0;
+ io->io_req.bi_op = REQ_OP_READ;
+ io->io_req.bi_op_flags = (req->bio->bi_opf & REQ_FUA);
+
+ WARN_ON((io->io_region.sector + io->io_region.count)
+ >= req->info->total_sector);
+
+ dm_io(&io->io_req, 1, &io->io_region, NULL);
+}
+
+
+/*
+ * read the data corresponding to this request.
+ * @req : the request.
+ * @reuse : the read data may be modified. So plan accordingly.
+ */
+static void dm_icomp_handle_read_existing(struct dm_icomp_req *req, bool reuse)
+{
+ u64 block_index, first_block_index;
+ u16 logical_sectors, data_sectors;
+
+ SET_REQ_STAGE(req, STAGE_READ_EXISTING);
+
+ block_index = DMICP_SECTOR_TO_BLOCK(req->bio->bi_iter.bi_sector);
+
+ while (!req->result &&
+ (block_index <= DMICP_SECTOR_TO_BLOCK(
+ bio_end_sector(req->bio)-1)) &&
+ (block_index < req->info->data_blocks)) {
+
+ dm_icomp_get_extent(req->info, block_index, &first_block_index,
+ &logical_sectors, &data_sectors);
+
+ if (data_sectors)
+ dm_icomp_read_one_extent(req, first_block_index,
+ logical_sectors, data_sectors, reuse);
+
+ block_index = first_block_index +
+ DMICP_SECTOR_TO_BLOCK(logical_sectors);
+ }
+}
+
+/*
+ * read existing data
+ */
+static void dm_icomp_handle_read_read_existing(struct dm_icomp_req *req)
+{
+ dm_icomp_handle_read_existing(req, false);
+
+ if (req->result)
+ return;
+
+ /* A shortcut if all data is in already */
+ if (list_empty(&req->all_io))
+ dm_icomp_handle_read_decomp(req);
+}
+
+static void dm_icomp_handle_read_request(struct dm_icomp_req *req)
+{
+ dm_icomp_get_req(req);
+
+ if (GET_REQ_STAGE(req) == STAGE_INIT && dm_icomp_lock_req_range(req))
+ dm_icomp_handle_read_read_existing(req);
+ else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING)
+ dm_icomp_handle_read_decomp(req);
+
+ dm_icomp_put_req(req);
+}
+
+static void dm_icomp_write_meta_done(void *context, unsigned long error)
+{
+ struct dm_icomp_req *req = context;
+
+ dm_icomp_put_req(req);
+}
+
+static u64 dm_icomp_block_meta_page_index(u64 block, bool end)
+{
+ u64 bits = block * DMICP_META_BITS - !!end;
+ /*
+ * >> 5; 32 bits per entry
+ * << 2; each entry is 4 bytes
+ * >> PAGE_SHIFT; bytes to pages
+ */
+ return bits >> (5 - 2 + PAGE_SHIFT);
+}
+
+
+/*
+ * write compressed data to the backing storage.
+ * @io : io range
+ * @sector_start : the sector on backing storage to which the
+ * compressed data needs to be written.
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end : start and end blocks.
+ */
+static int dm_icomp_compress_write(struct dm_icomp_io_range *io,
+ sector_t sector_start, u64 *meta_start, u64 *meta_end)
+{
+ struct dm_icomp_req *req = io->req;
+ sector_t count = DMICP_BYTES_TO_SECTOR(io->decomp_len);
+ unsigned int comp_len;
+ int ret;
+ u64 page_index;
+
+ /* comp_data must be able to accommodate a larger compress buffer */
+ ret = dm_icomp_io_range_compress(req->info, io, &comp_len);
+ if (ret < 0) {
+ req->result = -EIO;
+ return -EIO;
+ }
+ WARN_ON(comp_len > io->comp_len);
+
+ dm_icomp_get_req(req);
+
+ io->io_req.bi_op = REQ_OP_WRITE;
+ io->io_req.bi_op_flags = (req->bio->bi_opf & REQ_FUA);
+ io->io_req.mem.ptr.addr = io->comp_data;
+ io->io_req.mem.type = DM_IO_KMEM;
+ io->io_req.mem.offset = 0;
+ io->io_region.count = DMICP_BYTES_TO_SECTOR(comp_len);
+ io->io_region.sector = sector_start + req->info->data_start;
+
+ dm_icomp_release_decomp_buffer(io);
+
+
+ WARN_ON((io->io_region.sector + io->io_region.count)
+ >= req->info->total_sector);
+
+ dm_io(&io->io_req, 1, &io->io_region, NULL);
+
+ /* update the meta data bits */
+ dm_icomp_set_extent(req, DMICP_SECTOR_TO_BLOCK(sector_start),
+ DMICP_SECTOR_TO_BLOCK(count), DMICP_BYTES_TO_SECTOR(comp_len));
+
+ page_index = dm_icomp_block_meta_page_index(
+ DMICP_SECTOR_TO_BLOCK(sector_start), false);
+ if (*meta_start > page_index)
+ *meta_start = page_index;
+
+ page_index = dm_icomp_block_meta_page_index(
+ DMICP_SECTOR_TO_BLOCK(sector_start + count), true);
+ if (*meta_end < page_index)
+ *meta_end = page_index;
+ return 0;
+}
+
+/*
+ * modify and write compressed data to the backing storage.
+ * @io : io range
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end : start and end blocks.
+ */
+static int dm_icomp_handle_write_modify(struct dm_icomp_io_range *io,
+ u64 *meta_start, u64 *meta_end)
+{
+ struct dm_icomp_req *req = io->req;
+ sector_t bio_start, bio_end, buf_start, buf_end, overlap;
+ off_t bio_off, buf_off;
+ int ret;
+ unsigned int decomp_len;
+
+ io->io_region.sector -= req->info->data_start;
+
+ if (dm_icomp_update_io_read_range(io)) {
+ req->result = -EIO;
+ return -EIO;
+ }
+
+ /* decompress original data */
+ ret = dm_icomp_io_range_decompress(req->info, io, &decomp_len);
+ if (ret < 0) {
+ req->result = ret;
+ return ret;
+ }
+
+ bio_start = req->bio->bi_iter.bi_sector;
+ bio_end = bio_end_sector(req->bio) - 1;
+
+ buf_start = io->io_region.sector;
+ buf_end = buf_start + DMICP_BYTES_TO_SECTOR(decomp_len) - 1;
+
+ /* if no overlap, nothing to do. Just return */
+ if (bio_start >= buf_end || bio_end <= buf_start)
+ return 0;
+
+ bio_off = (buf_start > bio_start) ? (buf_start - bio_start) : 0;
+ buf_off = (bio_start > buf_start) ? (bio_start - buf_start) : 0;
+
+ /*
+ * overlap = (sizeof(block1) + sizeof(block2) - sizeof(left_side_shift) -
+ * sizeof(right_side_shift)) / 2 + 1
+ */
+ overlap = (((bio_end - bio_start) + (buf_end - buf_start) -
+ abs(buf_end - bio_end) - abs(buf_start - bio_start)) >> 1) + 1;
+
+
+ dm_icomp_bio_copy(req->bio, DMICP_SECTOR_TO_BYTES(bio_off),
+ io->decomp_data + DMICP_SECTOR_TO_BYTES(buf_off),
+ DMICP_SECTOR_TO_BYTES(overlap), true);
+
+ if (!dm_icomp_can_handle_overflow(req->info)) {
+ /* resize the compress buffer to the max range */
+ ret = dm_icomp_mod_to_max_io_range(io);
+ if (ret < 0) {
+ req->result = ret;
+ return ret;
+ }
+ }
+
+ return dm_icomp_compress_write(io, io->io_region.sector,
+ meta_start, meta_end);
+}
+
+
+/*
+ * create and write new extents. Each extent is not more than
+ * DMICP_MAX_SECTORS sectors.
+ * @req : the request
+ * @sec_start: the start sector of the request
+ * @total : the total sectors
+ * @list : collect each DMICP_MAX_SECTORS sector size io request in this list
+ * @meta_start: the page index of the bits corresponding to
+ * @meta_end : start and end blocks.
+ *
+ */
+static void dm_icomp_handle_write_create(struct dm_icomp_req *req,
+ sector_t sec_start, sector_t total,
+ struct list_head *list, u64 *meta_start, u64 *meta_end)
+{
+ struct dm_icomp_io_range *io;
+ sector_t count, offset = 0;
+ int ret;
+
+ while (total) {
+
+ /* max i/o DMICP_MAX_SECTORS sectors */
+ count = min_t(sector_t, total, DMICP_MAX_SECTORS);
+ io = dm_icomp_create_io_write_range(req, offset, count);
+ if (!io) {
+ req->result = -ENOMEM;
+ return;
+ }
+
+ ret = dm_icomp_compress_write(io, sec_start, meta_start,
+ meta_end);
+ if (ret) {
+ dm_icomp_free_io_range(io);
+ return;
+ }
+
+
+ list_add_tail(&io->next, list);
+ total -= count;
+ sec_start += count;
+ offset += count;
+
+ }
+}
+
+/*
+ * handle the write request.
+ */
+static void dm_icomp_handle_write_comp(struct dm_icomp_req *req)
+{
+ struct dm_icomp_io_range *io;
+ sector_t io_start, req_start, req_end;
+ u64 meta_start = -1L, meta_end = 0;
+ LIST_HEAD(newlist);
+
+ SET_REQ_STAGE(req, STAGE_WRITE_COMP);
+
+ if (req->result)
+ return;
+
+ req_start = req->bio->bi_iter.bi_sector;
+ list_for_each_entry(io, &req->all_io, next) {
+
+ io_start = io->io_region.sector - req->info->data_start;
+
+ if (req_start < io_start) {
+ /* fill the gap */
+ dm_icomp_handle_write_create(req, req_start,
+ (io_start - req_start), &newlist,
+ &meta_start, &meta_end);
+ }
+
+ dm_icomp_handle_write_modify(io, &meta_start, &meta_end);
+
+ req_start = io_start + DMICP_BYTES_TO_SECTOR(io->logical_bytes);
+ }
+
+ req_end = bio_end_sector(req->bio);
+ if (req_start < req_end) {
+ /* fill the gap */
+ dm_icomp_handle_write_create(req, req_start,
+ req_end-req_start, &newlist, &meta_start,
+ &meta_end);
+ }
+
+ list_splice_tail(&newlist, &req->all_io);
+
+ if (req->info->write_mode == DMICP_WRITE_THROUGH ||
+ (req->bio->bi_opf & REQ_FUA)) {
+ if (meta_start == -1)
+ return;
+ dm_icomp_get_req(req);
+ dm_icomp_write_meta(req->info, meta_start,
+ meta_end+1, req,
+ dm_icomp_write_meta_done,
+ REQ_OP_WRITE, req->bio->bi_opf);
+ }
+}
+
+/*
+ * read the data, modify and write it back to the backing store.
+ */
+static void dm_icomp_handle_write_read_existing(struct dm_icomp_req *req)
+{
+ dm_icomp_handle_read_existing(req, true);
+ if (req->result)
+ return;
+
+ if (list_empty(&req->all_io))
+ dm_icomp_handle_write_comp(req);
+}
+
+static void dm_icomp_handle_write_request(struct dm_icomp_req *req)
+{
+ dm_icomp_get_req(req);
+
+ if (GET_REQ_STAGE(req) == STAGE_INIT && dm_icomp_lock_req_range(req))
+ dm_icomp_handle_write_read_existing(req);
+ else if (GET_REQ_STAGE(req) == STAGE_READ_EXISTING)
+ dm_icomp_handle_write_comp(req);
+
+ dm_icomp_put_req(req);
+}
+
+/* For writeback mode */
+static void dm_icomp_handle_flush_request(struct dm_icomp_req *req)
+{
+ struct writeback_flush_data wb;
+
+ atomic_set(&wb.cnt, 1);
+ init_completion(&wb.complete);
+
+ dm_icomp_flush_dirty_meta(req->info, &wb);
+
+ writeback_flush_io_done(&wb, 0);
+ wait_for_completion(&wb.complete);
+
+ req->bio->bi_error = 0;
+ bio_endio(req->bio);
+ kmem_cache_free(dm_icomp_req_cachep, req);
+}
+
+static void dm_icomp_handle_request(struct dm_icomp_req *req)
+{
+ if (req->bio->bi_opf & REQ_PREFLUSH)
+ dm_icomp_handle_flush_request(req);
+ else if (op_is_write(bio_op(req->bio)))
+ dm_icomp_handle_write_request(req);
+ else
+ dm_icomp_handle_read_request(req);
+}
+
+static void dm_icomp_do_request_work(struct work_struct *work)
+{
+ struct dm_icomp_io_worker *worker = container_of(work,
+ struct dm_icomp_io_worker, work);
+ LIST_HEAD(list);
+ struct dm_icomp_req *req;
+ struct blk_plug plug;
+ bool repeat;
+
+ blk_start_plug(&plug);
+again:
+ spin_lock_irq(&worker->lock);
+ list_splice_init(&worker->pending, &list);
+ spin_unlock_irq(&worker->lock);
+
+ repeat = !list_empty(&list);
+ while (!list_empty(&list)) {
+ req = list_first_entry(&list, struct dm_icomp_req, sibling);
+ list_del(&req->sibling);
+
+ schedule();
+ dm_icomp_handle_request(req);
+ }
+ if (repeat)
+ goto again;
+ blk_finish_plug(&plug);
+}
+
+static bool valid_request(struct bio *bio, struct dm_icomp_info *info)
+{
+ sector_t dev_end = info->ti->len;
+ sector_t req_end = bio_end_sector(bio) - 1;
+
+ return (req_end <= dev_end);
+}
+
+static int dm_icomp_map(struct dm_target *ti, struct bio *bio)
+{
+ struct dm_icomp_info *info = ti->private;
+ struct dm_icomp_req *req;
+
+ if ((bio->bi_opf & REQ_PREFLUSH) &&
+ info->write_mode == DMICP_WRITE_THROUGH) {
+ bio->bi_bdev = info->dev->bdev;
+ return DM_MAPIO_REMAPPED;
+ }
+
+
+ req = kmem_cache_alloc(dm_icomp_req_cachep, get_alloc_flag(info));
+ if (!req)
+ return -ENOMEM;
+
+ req->bio = bio;
+ if (!(bio->bi_opf & REQ_PREFLUSH) && !valid_request(bio, info)) {
+ kmem_cache_free(dm_icomp_req_cachep, req); /* req was never queued */
+ bio->bi_error = -EINVAL;
+ bio_endio(bio);
+ return DM_MAPIO_SUBMITTED;
+ }
+
+ req->info = info;
+ atomic_set(&req->io_pending, 0);
+ INIT_LIST_HEAD(&req->all_io);
+ req->result = 0;
+ SET_REQ_STAGE(req, STAGE_INIT);
+ req->locked_locks = 0;
+
+ req->cpu = raw_smp_processor_id();
+ dm_icomp_queue_req(info, req);
+
+ return DM_MAPIO_SUBMITTED;
+}
+
+static void dm_icomp_status(struct dm_target *ti, status_type_t type,
+ unsigned int status_flags, char *result, unsigned int maxlen)
+{
+ struct dm_icomp_info *info = ti->private;
+ unsigned int sz = 0;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%ld %ld %ld",
+ (long) atomic64_read(&info->uncompressed_write_size),
+ (long) atomic64_read(&info->compressed_write_size),
+ (long) atomic64_read(&info->meta_write_size));
+ break;
+ case STATUSTYPE_TABLE:
+ if (info->write_mode == DMICP_WRITE_BACK)
+ DMEMIT("%s %s:%d %s:%s %s:%d", info->dev->name,
+ "writeback", info->writeback_delay,
+ "compressor", compressors[info->comp_alg].name,
+ "critical", info->critical);
+ else
+ DMEMIT("%s %s %s:%s %s:%d", info->dev->name,
+ "writethrough",
+ "compressor", compressors[info->comp_alg].name,
+ "critical", info->critical);
+ break;
+ }
+}
+
+static int dm_icomp_iterate_devices(struct dm_target *ti,
+ iterate_devices_callout_fn fn, void *data)
+{
+ struct dm_icomp_info *info = ti->private;
+
+ return fn(ti, info->dev, info->data_start,
+ DMICP_BLOCK_TO_SECTOR(info->data_blocks), data);
+}
+
+static void dm_icomp_io_hints(struct dm_target *ti,
+ struct queue_limits *limits)
+{
+ /* No blk_limits_logical_block_size */
+ limits->logical_block_size = limits->physical_block_size =
+ limits->io_min = DMICP_BLOCK_SIZE;
+ limits->max_sectors = limits->max_hw_sectors =
+ DMICP_MAX_SECTORS;
+}
+
+static struct target_type dm_icomp_target = {
+ .name = "inplacecompress",
+ .version = {1, 0, 0},
+ .module = THIS_MODULE,
+ .ctr = dm_icomp_ctr,
+ .dtr = dm_icomp_dtr,
+ .map = dm_icomp_map,
+ .status = dm_icomp_status,
+ .iterate_devices = dm_icomp_iterate_devices,
+ .io_hints = dm_icomp_io_hints,
+};
+
+static int __init dm_icomp_init(void)
+{
+ int r;
+
+ if (select_default_compressor())
+ return -EINVAL;
+
+ r = -ENOMEM;
+ dm_icomp_req_cachep = kmem_cache_create("dm_icomp_requests",
+ sizeof(struct dm_icomp_req), 0, 0, NULL);
+ if (!dm_icomp_req_cachep) {
+ DMWARN("Can't create request cache");
+ goto err;
+ }
+
+ dm_icomp_io_range_cachep = kmem_cache_create("dm_icomp_io_range",
+ sizeof(struct dm_icomp_io_range), 0, 0, NULL);
+ if (!dm_icomp_io_range_cachep) {
+ DMWARN("Can't create io_range cache");
+ goto err;
+ }
+
+ dm_icomp_meta_io_cachep = kmem_cache_create("dm_icomp_meta_io",
+ sizeof(struct dm_icomp_meta_io), 0, 0, NULL);
+ if (!dm_icomp_meta_io_cachep) {
+ DMWARN("Can't create meta_io cache");
+ goto err;
+ }
+
+ dm_icomp_wq = alloc_workqueue("dm_icomp_io",
+ WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+ if (!dm_icomp_wq) {
+ DMWARN("Can't create io workqueue");
+ goto err;
+ }
+
+ r = dm_register_target(&dm_icomp_target);
+ if (r < 0) {
+ DMWARN("target registration failed");
+ goto err;
+ }
+
+ for_each_possible_cpu(r) {
+ INIT_LIST_HEAD(&dm_icomp_io_workers[r].pending);
+ spin_lock_init(&dm_icomp_io_workers[r].lock);
+ INIT_WORK(&dm_icomp_io_workers[r].work,
+ dm_icomp_do_request_work);
+ }
+ return 0;
+err:
+ kmem_cache_destroy(dm_icomp_req_cachep);
+ kmem_cache_destroy(dm_icomp_io_range_cachep);
+ kmem_cache_destroy(dm_icomp_meta_io_cachep);
+ if (dm_icomp_wq)
+ destroy_workqueue(dm_icomp_wq);
+
+ return r;
+}
+
+static void __exit dm_icomp_exit(void)
+{
+ dm_unregister_target(&dm_icomp_target);
+ kmem_cache_destroy(dm_icomp_req_cachep);
+ kmem_cache_destroy(dm_icomp_io_range_cachep);
+ kmem_cache_destroy(dm_icomp_meta_io_cachep);
+ destroy_workqueue(dm_icomp_wq);
+}
+
+module_init(dm_icomp_init);
+module_exit(dm_icomp_exit);
+
+MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
+MODULE_DESCRIPTION(DM_NAME " target with data inplace-compression");
+MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-inplace-compress.h b/drivers/md/dm-inplace-compress.h
new file mode 100644
index 0000000..4a73d4e
--- /dev/null
+++ b/drivers/md/dm-inplace-compress.h
@@ -0,0 +1,194 @@
+#ifndef __DM_INPLACE_COMPRESS_H__
+#define __DM_INPLACE_COMPRESS_H__
+#include <linux/types.h>
+
+#define DMICP_SUPER_MAGIC 0x106526c206506c09
+#define DMICP_COMPRESS_MAGIC 0xfaceecaf
+struct dm_icomp_super_block {
+ __le64 magic;
+ __le64 meta_blocks;
+ __le64 data_blocks;
+ u8 comp_alg;
+} __packed;
+
+#define DMICP_COMP_ALG_LZO 1
+#define DMICP_COMP_ALG_842 0
+
+#ifdef __KERNEL__
+/*
+ * A block, the logical unit of the target, is 4096B. Data within a
+ * block is compressed. The compressed data (the payload) is rounded
+ * up to a multiple of 512B and stored at the start of the block's
+ * logical sectors on disk. The last 32 bits of the payload hold the
+ * payload length. However, if the payload length is 4096B, the data
+ * is stored uncompressed. If the IO size is larger than a block, the
+ * data is stored as extents. Each block has 5 bits of metadata.
+ * Bits 0 - 3 capture the payload length (0 - 8 sectors) for that extent.
+ * Bit 4 indicates the head/tail information for that extent.
+ * Maximum allowed extent size is DMICP_MAX_SECTORS.
+ */
+#define DMICP_BLOCK_SHIFT 12
+#define DMICP_BLOCK_SIZE (1 << DMICP_BLOCK_SHIFT)
+#define DMICP_SECTOR_SHIFT SECTOR_SHIFT
+#define DMICP_SECTOR_SIZE (1 << SECTOR_SHIFT)
+#define DMICP_BLOCK_SECTOR_SHIFT (DMICP_BLOCK_SHIFT - DMICP_SECTOR_SHIFT)
+#define DMICP_BLOCK_TO_SECTOR(b) ((b) << DMICP_BLOCK_SECTOR_SHIFT)
+#define DMICP_SECTOR_TO_BLOCK(s) ((s) >> DMICP_BLOCK_SECTOR_SHIFT)
+#define DMICP_SECTOR_TO_BYTES(s) ((s) << DMICP_SECTOR_SHIFT)
+#define DMICP_BYTES_TO_SECTOR(b) ((b) >> DMICP_SECTOR_SHIFT)
+#define DMICP_BYTES_TO_BLOCK(b) ((b) >> DMICP_BLOCK_SHIFT)
+
+#define DMICP_MIN_SIZE DMICP_BLOCK_SIZE
+
+/*
+ * maximum sectors is twice the number of sectors a page can
+ * hold.
+ */
+#define DMICP_MAX_SECTORS (DMICP_BYTES_TO_SECTOR(PAGE_SIZE) << 1)
+#define DMICP_MAX_SIZE DMICP_SECTOR_TO_BYTES(DMICP_MAX_SECTORS)
+
+#define DMICP_BITS_PER_ENTRY 32
+#define DMICP_META_BITS 5
+#define DMICP_LENGTH_BITS 4
+#define DMICP_TAIL_MASK (1 << DMICP_LENGTH_BITS)
+#define DMICP_LENGTH_MASK (DMICP_TAIL_MASK - 1)
+
+#define DMICP_META_START_SECTOR (DMICP_BLOCK_SIZE >> DMICP_SECTOR_SHIFT)
+
+enum DMICP_WRITE_MODE {
+ DMICP_WRITE_BACK,
+ DMICP_WRITE_THROUGH,
+};
+
+/*
+ * A lock spans 128 blocks, i.e. 512 KB. Maximum I/O is much smaller than
+ * that. Hence an I/O can span at most two locks.
+ */
+#define BITMAP_HASH_SHIFT 7
+#define BITMAP_HASH_LEN (1<<6)
+#define BITMAP_HASH_MASK (BITMAP_HASH_LEN - 1)
+struct dm_icomp_hash_lock {
+ int io_running;
+ spinlock_t wait_lock;
+ struct list_head wait_list;
+};
+
+struct dm_icomp_info {
+ struct dm_target *ti;
+ struct dm_dev *dev;
+
+ int comp_alg;
+ bool critical;
+ struct crypto_comp *tfm[NR_CPUS];
+
+ sector_t total_sector; /* total sectors in the backing store */
+ sector_t data_start;
+ u64 data_blocks;
+ u64 no_of_sectors;
+
+ u32 *meta_bitmap;
+ u64 meta_bitmap_bits;
+ u64 meta_bitmap_pages;
+ struct dm_icomp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
+
+ enum DMICP_WRITE_MODE write_mode;
+ unsigned int writeback_delay; /* seconds */
+ struct task_struct *writeback_tsk;
+ struct dm_io_client *io_client;
+
+ atomic64_t compressed_write_size;
+ atomic64_t uncompressed_write_size;
+ atomic64_t meta_write_size;
+};
+
+struct dm_icomp_meta_io {
+ struct dm_io_request io_req;
+ struct dm_io_region io_region;
+ void *data;
+ void (*fn)(void *data, unsigned long error);
+};
+
+struct dm_icomp_io_range {
+ struct dm_io_request io_req;
+ struct dm_io_region io_region;
+ bool decomp_kmap; /* Is the decomp_data kmapped? */
+ void *decomp_data;
+ void *decomp_real_data; /* holds the actual start of the buffer */
+ unsigned int decomp_len; /* actual allocated/mapped length */
+ unsigned int logical_bytes; /* decompressed size of the extent */
+ bool comp_kmap; /* Is the comp_data kmapped? */
+ void *comp_data;
+ void *comp_real_data; /* holds the actual start of the buffer */
+ unsigned int comp_len; /* actual allocated/mapped length */
+ unsigned int data_bytes; /* compressed size of the extent */
+ struct list_head next;
+ struct dm_icomp_req *req;
+};
+
+enum DMICP_REQ_STAGE {
+ STAGE_INIT,
+ STAGE_READ_EXISTING,
+ STAGE_READ_DECOMP,
+ STAGE_WRITE_COMP,
+ STAGE_DONE,
+};
+
+struct dm_icomp_req {
+ struct bio *bio;
+ struct dm_icomp_info *info;
+ struct list_head sibling;
+ struct list_head all_io;
+ atomic_t io_pending;
+ enum DMICP_REQ_STAGE stage;
+ struct dm_icomp_hash_lock *locks[2];
+ int locked_locks;
+ int result;
+ int cpu;
+ struct work_struct work;
+};
+
+struct dm_icomp_io_worker {
+ struct list_head pending;
+ spinlock_t lock;
+ struct work_struct work;
+};
+
+struct dm_icomp_compressor_data {
+ char *name;
+ bool can_handle_overflow;
+ int (*comp_len)(int comp_len);
+ int (*max_comp_len)(int comp_len);
+};
+
+static inline int lzo_comp_len(int len)
+{
+ /* lzo compression overshoots the comp buffer
+ * if the buffer size is insufficient.
+ * Once that bug is fixed we can return half
+ * the length.
+ *
+ * return lzo1x_worst_compress(len) >> 1;
+ *
+ * For now it's the full length.
+ */
+ return lzo1x_worst_compress(len);
+}
+
+static inline int lzo_max_comp_len(int len)
+{
+ return lzo1x_worst_compress(len);
+}
+
+static inline int nx842_comp_len(int len)
+{
+ return (len >> 1);
+}
+
+static inline int nx842_max_comp_len(int len)
+{
+ return len;
+}
+
+#endif /* __KERNEL__ */
+
+#endif /* __DM_INPLACE_COMPRESS_H__ */
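As an illustration of the payload layout described in the header comment, here is a minimal userspace sketch, not the driver code. It assumes 512-byte sectors and mirrors round_to_next_sector() and the trailer written by dm_icomp_io_range_compress(): the last u32 of the payload holds the compressed length and the u32 before it holds DMICP_COMPRESS_MAGIC, both little-endian. The helper names payload_size(), write_trailer() and read_trailer() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE    512u          /* assumed sector size */
#define COMPRESS_MAGIC 0xfaceecafu   /* value of DMICP_COMPRESS_MAGIC */
#define TRAILER_BYTES  8u            /* magic (u32) + length (u32) */

/* Round up to a whole sector, leaving room for the 8-byte trailer. */
static unsigned int payload_size(unsigned int comp_len)
{
	unsigned int c = (comp_len + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;

	if (c - comp_len < TRAILER_BYTES)
		c += SECTOR_SIZE;
	return c;
}

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Load a 32-bit little-endian value. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write the trailer into the last 8 bytes of the payload. */
static void write_trailer(uint8_t *payload, unsigned int payload_len,
			  uint32_t comp_len)
{
	assert(payload_len >= comp_len + TRAILER_BYTES);
	put_le32(payload + payload_len - 4, comp_len);
	put_le32(payload + payload_len - 8, COMPRESS_MAGIC);
}

/* Return the compressed length, or -1 if the magic is missing. */
static long read_trailer(const uint8_t *payload, unsigned int payload_len)
{
	if (payload_len < TRAILER_BYTES)
		return -1;
	if (get_le32(payload + payload_len - 8) != COMPRESS_MAGIC)
		return -1;
	return (long)get_le32(payload + payload_len - 4);
}
```

A compressed length of 504 bytes still fits in one sector, while 505 bytes needs two, because fewer than 8 bytes would remain for the trailer.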
--
1.8.3.1
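The metadata page lookup in dm_icomp_block_meta_page_index() can be sketched in userspace as follows. This is only an illustration: it assumes 4096-byte pages (a PAGE_SHIFT of 12), and meta_page_index() is a hypothetical name. With 5 bits per block, a page holds 32768 bits, so an entry can straddle a page boundary; the "end" lookup therefore uses the last bit of the preceding block's entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define META_BITS          5u   /* DMICP_META_BITS: metadata bits per block */
#define ASSUMED_PAGE_SHIFT 12u  /* assumption: 4096-byte pages */

/*
 * Page index of the metadata bit for a block.
 * With end == false, the first bit of the block's entry is used.
 * With end == true, the last bit of the previous block's entry is used,
 * so block must be greater than zero in that case.
 */
static uint64_t meta_page_index(uint64_t block, bool end)
{
	uint64_t bits;

	assert(!end || block > 0);
	bits = block * META_BITS - (end ? 1u : 0u);

	/* bits -> bytes is >> 3, bytes -> pages is >> ASSUMED_PAGE_SHIFT */
	return bits >> (3u + ASSUMED_PAGE_SHIFT);
}
```

Under these assumptions, block 6553 starts at bit 32765 on page 0, while block 6554 starts at bit 32770 on page 1.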
^ permalink raw reply related
* Re: on assembly and recovery of a hardware RAID
From: NeilBrown @ 2017-03-13 21:32 UTC (permalink / raw)
To: Alfred Matthews, linux-raid
In-Reply-To: <CAAZLhTdaVM4fagN5=bqWpPx-DjyU6UZ69bJFf51FWMX7oeyJNw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1605 bytes --]
On Fri, Mar 10 2017, Alfred Matthews wrote:
> Hello list. I'm facing a non-redundant Western Digital hardware RAID,
> for which, hardware seems to cause a kernel panic at about 3 seconds
> running time.
>
> I've assembled the customary testing. The drives appear to be striped RAID 0.
>
> Output: http://pastebin.com/c361jGVx
>
> Evidently WD metadata changes over time, since a new console (adding
> USB) will not recognize the drives without erasing them. Files are
> visible as files for the short period of controller health.
>
> I imagine I'm trying to assemble a 2 x 3TB RAID array from the
> original WD disks when mounted as SATA.
>
> Seeking input on proper mdadm configuration for this.
>
> Then I imagine that I may recover files-as-files from this 2 x 3TB to
> standalone disks. Ultimately they would need to move to a new RAID.
>
> Failing: WD My Book Thunderbolt Duo, 2x3TB
> New, incompatible: WD My Book Pro, 2x3TB.
>
> Thanks for any comment.
>
> Thanks for your time.
>
Does
dmraid -b /dev/sda /dev/sdb /dev/sdc
tell you anything useful?
You will probably want a command like:
mdadm --build /dev/md0 -l0 -n2 --chunk=SOMETHING /dev/sdXX /dev/sdYY
where SOMETHING is the chunk size. e.g. 64K or 512K or something.
Doing this is non-destructive so you can try several different times,
using "mdadm --stop /dev/md0" to reset before trying again.
After building the array, try "cfdisk /dev/md0" or maybe "fdisk
/dev/md0" to look at the partition table.
What filesystem(s) did you have on the device? Maybe "fsck -n /dev/md0p1"
might tell you if the filesystem looks OK.
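To see why the chunk size matters, here is a rough sketch of how a simple striped layout maps a logical byte offset to a member disk. It assumes equal-sized members and plain round-robin striping; whether the WD enclosure uses exactly this layout is an assumption, and raid0_map() is a made-up name for illustration only.

```c
#include <assert.h>
#include <stdint.h>

struct raid0_loc {
	unsigned int disk;   /* index of the member disk */
	uint64_t offset;     /* byte offset within that disk */
};

/* Map a logical byte offset onto a member disk for round-robin striping. */
static struct raid0_loc raid0_map(uint64_t logical, uint64_t chunk,
				  unsigned int ndisks)
{
	struct raid0_loc loc;
	uint64_t chunk_no;

	assert(chunk > 0 && ndisks > 0);
	chunk_no = logical / chunk;

	loc.disk = (unsigned int)(chunk_no % ndisks);
	loc.offset = (chunk_no / ndisks) * chunk + logical % chunk;
	return loc;
}
```

With a 64K chunk, logical offset 65536 lands at offset 0 of the second disk; with a 512K chunk the same offset lands at offset 65536 of the first disk. A wrong guess therefore scrambles everything past the first chunk, which is why the partition table may look fine while the filesystem does not.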
NeilBrown
^ permalink raw reply
* Re: Auto replace disk
From: NeilBrown @ 2017-03-13 21:36 UTC (permalink / raw)
To: Gandalf Corvotempesta, Wols Lists; +Cc: linux-raid
In-Reply-To: <CAJH6TXjjjnQ3_OJ87Gv8Spsd=7BZ7RGrfCJ0kqeRXTQN_1Q3KQ@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1875 bytes --]
On Wed, Mar 08 2017, Gandalf Corvotempesta wrote:
> 2017-03-08 19:17 GMT+01:00 Wols Lists <antlists@youngman.org.uk>:
>> Do you mean you remove an old disk, and put a new blank disk in?
>
> Yes
>
>> If that's what you mean, then no, it's not possible. mdadm doesn't have
>> a clue about disks, what it sees is "block devices".
>
> Ok but mdadm.conf man page seems to say the opposite:
> https://linux.die.net/man/5/mdadm.conf
>
> "POLICY
> This is used to specify what automatic behavior is allowed on devices
> newly appearing in the system and provides a way of marking spares
> that can be moved to other arrays as well as the migration domains.
>
> action=include, re-add, spare, spare-same-slot, or force-spare
> auto= yes, no, or homehost.
>
> The action item determines the automatic behavior allowed for devices
> matching the path and type in the same line. If a device matches
> several lines with different actions then the most permissive will
> apply. The ordering of policy lines is irrelevant to the end result.
>
> include
>   allows adding a disk to an array if metadata on that disk matches
>   that array
> re-add
>   will include the device in the array if it appears to be a current
>   member or a member that was recently removed
> spare
>   as above and additionally: if the device is bare it can become a
>   spare if there is any array that it is a candidate for based on
>   domains and metadata.
> spare-same-slot
>   as above and additionally if given slot was used by an array that
>   went degraded recently and the device plugged in has no metadata
>   then it will be automatically added to that array (or it's
>   container)
> force-spare
>   as above and the disk will become a spare in remaining cases
> "
Clearly you have read the documentation - excellent!
What exactly are you asking?
Presumably you have tried something and it didn't work. What (exactly)
did you try?
NeilBrown
^ permalink raw reply
* Re: LSI RAID
From: NeilBrown @ 2017-03-13 21:38 UTC (permalink / raw)
To: Gandalf Corvotempesta, Hannes Reinecke; +Cc: linux-raid
In-Reply-To: <CAJH6TXhYDMP_Xm+nZyoGBfAR77_W4h18_3bdTKeymEKYpUu-bw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 433 bytes --]
On Tue, Feb 28 2017, Gandalf Corvotempesta wrote:
> 2017-02-28 10:06 GMT+01:00 Hannes Reinecke <hare@suse.de>:
>> Sure.
>> The recent mdadm should be able to create DDF metadata.
>
> This means that I'll be able to import a configuration created with an
> LSI MegaRaid controller and use it with mdadm?
> If yes, how?
Try it and see.
i.e. plug the devices into a Linux box with mdadm installed, and see
what happens.
NeilBrown
* [PATCH] md/r5cache: flush data in memory during journal device failure
From: Song Liu @ 2017-03-13 23:36 UTC (permalink / raw)
To: linux-raid; +Cc: shli, neilb, kernel-team, dan.j.williams, hch, Song Liu
For raid456 with a writeback cache, when the journal device fails during
normal operation, it is still possible to persist all data, because all
pending data is still in the stripe cache. However, the stripe is marked
as failed because s.log_failed is set, so write-out from the stripe cache
cannot make progress.
To unblock write-out after a journal failure, this patch allows stripes
with data in the journal (s.injournal > 0) to make progress.
The array should be read-only after a journal failure. Therefore, pending
writes (in dev->towrite) are excluded from this write-out (in delay_towrite()).
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3233975..447d9dd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3069,6 +3069,10 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
* When LOG_CRITICAL, stripes with injournal == 0 will be sent to
* no_space_stripes list.
*
+ * 3. during journal failure
+ * In journal failure, we try to flush all cached data to raid disks
+ * based on data in stripe cache. The array is read-only to upper
+ * layers, so we would skip all pending writes.
*/
static inline bool delay_towrite(struct r5conf *conf,
struct r5dev *dev,
@@ -3082,6 +3086,9 @@ static inline bool delay_towrite(struct r5conf *conf,
if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
s->injournal > 0)
return true;
+ /* case 3 above */
+ if (s->log_failed && s->injournal)
+ return true;
return false;
}
@@ -4721,7 +4728,8 @@ static void handle_stripe(struct stripe_head *sh)
/* check if the array has lost more than max_degraded devices and,
* if so, some requests might need to be failed.
*/
- if (s.failed > conf->max_degraded || s.log_failed) {
+ if (s.failed > conf->max_degraded ||
+ (s.log_failed && s.injournal == 0)) {
sh->check_state = 0;
sh->reconstruct_state = 0;
break_stripe_batch_list(sh, 0);
--
2.9.3
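The two conditions changed by the patch above can be read together as a small decision table. The following is only a minimal userspace sketch of that logic, not kernel code: struct stripe_state_model and both function names are hypothetical stand-ins, the fields merely borrow the names failed, injournal and log_failed from the diff, and the other cases handled by the real delay_towrite() are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-stripe state in the patch. */
struct stripe_state_model {
	int failed;      /* failed member devices seen by this stripe */
	int injournal;   /* blocks of this stripe that exist in the journal */
	bool log_failed; /* the journal device has failed */
};

/*
 * Models the patched test in handle_stripe(): a journal failure only
 * forces the stripe down the failure path when it holds no journal data.
 */
bool stripe_must_fail(const struct stripe_state_model *s, int max_degraded)
{
	return s->failed > max_degraded ||
	       (s->log_failed && s->injournal == 0);
}

/*
 * Models only "case 3" added to delay_towrite(): after a journal failure,
 * a stripe with journal data flushes its cached data and leaves new
 * pending writes out of that write-out.
 */
bool delay_towrite_on_log_failure(const struct stripe_state_model *s)
{
	return s->log_failed && s->injournal != 0;
}

/* Walks the three situations described in the commit message. */
void check_journal_failure_model(void)
{
	struct stripe_state_model cached = { 0, 2, true };
	struct stripe_state_model empty = { 0, 0, true };
	struct stripe_state_model degraded = { 2, 3, false };

	/* Journal failed, data cached: flush proceeds, new writes wait. */
	assert(!stripe_must_fail(&cached, 1));
	assert(delay_towrite_on_log_failure(&cached));

	/* Journal failed, nothing cached: same failure path as before. */
	assert(stripe_must_fail(&empty, 1));
	assert(!delay_towrite_on_log_failure(&empty));

	/* Too many failed members: fails regardless of journal state. */
	assert(stripe_must_fail(&degraded, 1));
}
```

Under these assumptions, the only behaviour that differs from the pre-patch code is the first case: a stripe with journal data is no longer failed outright when the journal device goes away.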
* [PATCH] r5cache: allow adding journal to array without journal
From: Song Liu @ 2017-03-14 0:09 UTC (permalink / raw)
To: linux-raid
Cc: shli, neilb, kernel-team, dan.j.williams, hch, jsorensen,
Song Liu
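Remove the journal_device_required check in Manage_add(). Adding a
journal to an array that was created without one seems to just work.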
Signed-off-by: Song Liu <songliubraving@fb.com>
---
Manage.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/Manage.c b/Manage.c
index 5c3d2b9..9ab999d 100644
--- a/Manage.c
+++ b/Manage.c
@@ -963,11 +963,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
sysfs_free(mdp);
- tst->ss->getinfo_super(tst, &mdi, NULL);
- if (mdi.journal_device_required == 0) {
- pr_err("%s does not support journal device.\n", devname);
- return -1;
- }
disc.raid_disk = 0;
}
--
2.9.3
* [PATCH v2] r5cache: allow adding journal to array without journal
From: Song Liu @ 2017-03-14 0:11 UTC (permalink / raw)
To: linux-raid
Cc: shli, neilb, kernel-team, dan.j.williams, hch, jsorensen,
Song Liu
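Remove the journal_device_required check in Manage_add(), along with the
now-unused mdi variable. Adding a journal to an array that was created
without one seems to just work.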
Signed-off-by: Song Liu <songliubraving@fb.com>
---
Manage.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/Manage.c b/Manage.c
index 5c3d2b9..5ff5de3 100644
--- a/Manage.c
+++ b/Manage.c
@@ -946,7 +946,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
/* only add journal to array that supports journaling */
if (dv->disposition == 'j') {
- struct mdinfo mdi;
struct mdinfo *mdp;
mdp = sysfs_read(fd, NULL, GET_ARRAY_STATE);
@@ -963,11 +962,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
sysfs_free(mdp);
- tst->ss->getinfo_super(tst, &mdi, NULL);
- if (mdi.journal_device_required == 0) {
- pr_err("%s does not support journal device.\n", devname);
- return -1;
- }
disc.raid_disk = 0;
}
--
2.9.3