Linux RAID subsystem development


* [PATCH 1/2] raid5-cache: suspend reclaim thread instead of shutdown
From: Shaohua Li @ 2016-11-21 18:29 UTC
  To: linux-raid; +Cc: Kernel-team, songliubraving, neilb

There is a mechanism to suspend a kernel thread. Use it instead of playing
the create/destroy game.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c          |  4 +++-
 drivers/md/raid5-cache.c | 18 +++++-------------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d3cef77..f548469 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7136,10 +7136,12 @@ static int md_thread(void *arg)
 		wait_event_interruptible_timeout
 			(thread->wqueue,
 			 test_bit(THREAD_WAKEUP, &thread->flags)
-			 || kthread_should_stop(),
+			 || kthread_should_stop() || kthread_should_park(),
 			 thread->timeout);
 
 		clear_bit(THREAD_WAKEUP, &thread->flags);
+		if (kthread_should_park())
+			kthread_parkme();
 		if (!kthread_should_stop())
 			thread->run(thread);
 	}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..5f817bd 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -19,6 +19,7 @@
 #include <linux/raid/md_p.h>
 #include <linux/crc32c.h>
 #include <linux/random.h>
+#include <linux/kthread.h>
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
@@ -1437,23 +1438,14 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	struct mddev *mddev;
 	if (!log || state == 2)
 		return;
-	if (state == 0) {
-		/*
-		 * This is a special case for hotadd. In suspend, the array has
-		 * no journal. In resume, journal is initialized as well as the
-		 * reclaim thread.
-		 */
-		if (log->reclaim_thread)
-			return;
-		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
-					log->rdev->mddev, "reclaim");
-		log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
-	} else if (state == 1) {
+	if (state == 0)
+		kthread_unpark(log->reclaim_thread->tsk);
+	else if (state == 1) {
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
+		kthread_park(log->reclaim_thread->tsk);
 		r5l_wake_reclaim(log, MaxSector);
-		md_unregister_thread(&log->reclaim_thread);
 		r5l_do_reclaim(log);
 	}
 }
-- 
2.9.3



* Re: [BUG 4.4.26] bio->bi_bdev == NULL in raid6 return_io()
From: Konstantin Khlebnikov @ 2016-11-21 15:32 UTC
  To: NeilBrown, Konstantin Khlebnikov, Shaohua Li
  Cc: linux-kernel@vger.kernel.org, linux-raid, linux-block, Jens Axboe,
	Christoph Hellwig
In-Reply-To: <87r365eidd.fsf@notabene.neil.brown.name>

On 21.11.2016 04:23, NeilBrown wrote:
> On Sun, Nov 20 2016, Konstantin Khlebnikov wrote:
>
>> On 07.11.2016 23:34, Konstantin Khlebnikov wrote:
>>> On Mon, Nov 7, 2016 at 10:46 PM, Shaohua Li <shli@kernel.org> wrote:
>>>> On Sat, Nov 05, 2016 at 01:48:45PM +0300, Konstantin Khlebnikov wrote:
>>>>> return_io() resolves request_queue even if trace point isn't active:
>>>>>
>>>>> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>>>>> {
>>>>>       return bdev->bd_disk->queue;    /* this is never NULL */
>>>>> }
>>>>>
>>>>> static void return_io(struct bio_list *return_bi)
>>>>> {
>>>>>       struct bio *bi;
>>>>>       while ((bi = bio_list_pop(return_bi)) != NULL) {
>>>>>               bi->bi_iter.bi_size = 0;
>>>>>               trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
>>>>>                                        bi, 0);
>>>>>               bio_endio(bi);
>>>>>       }
>>>>> }
>>>>
>>>> I can't see how this could happen. What kind of tests/environment are these running?
>>>
>>> That was a random piece of production somewhere.
>>> According to the timing, all crashes happened soon after reboot.
>>> There're several raids, probably some of them were still under resync.
>>>
>>> For now we have only a few machines with this kernel. But I'm sure that
>>> I'll get many more soon =)
>>
>> I've added this debug patch for catching overflow of active stripes in bio
>>
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -164,6 +164,7 @@ static inline void raid5_inc_bi_active_stripes(struct bio *bio)
>>   {
>>          atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
>>          atomic_inc(segments);
>> +       BUG_ON(!(atomic_read(segments) & 0xffff));
>>   }
>>
>> And got this. Counter in %edx = 0x00010000
>>
>> So, looks like one bio (discard?) can cover more than 65535 stripes
>
> 65535 stripes - 256M.  I guess that is possible.  Christoph has
> suggested that now would be a good time to stop using bi_phys_segments
> like this.
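
A minimal userspace sketch of the overflow that the BUG_ON in the debug
patch caught. It assumes, as the `& 0xffff` mask there suggests, that the
low 16 bits of the 32-bit word hold the active-stripe count and the high
16 bits hold a second counter. The helper names are made up for
illustration and are not the raid5.c functions.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the packed counter; not the kernel helpers. */

/* Low 16 bits: stripes the bio is still attached to (assumed layout). */
static uint32_t active_stripes(uint32_t word)
{
	return word & 0xffff;
}

/* High 16 bits: the neighbouring field that an overflow spills into. */
static uint32_t upper_count(uint32_t word)
{
	return word >> 16;
}

/* Attach the bio to n more stripes, one increment each, no bounds check. */
static uint32_t attach_stripes(uint32_t word, uint32_t n)
{
	while (n--)
		word++;
	return word;
}
```

Starting from zero, 65536 increments leave the word at 0x00010000: the
active count reads back as 0 and the upper field has gained 1, which is
the value reported in %edx.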

Is it possible to fix this by limiting max_hw_sectors and
max_hw_discard_sectors for the raid queue?

This should be much easier to backport into stable kernels.

I've found that the setup also has dm/lvm on top of md raid, so
that might be a more complicated problem, because I cannot see how
a bio could be big enough to overflow that counter.
That was raid6 with 10 disks and 256k chunk. max_hw_discard_sectors and
max_hw_sectors cannot be bigger than UINT_MAX. Thus in this case a bio
cannot cover more than 16384 data chunks, 20480 chunks including checksums.
Please correct me if I'm wrong.
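
The arithmetic above can be checked with a small userspace sketch. It
assumes the limit works out to about 4 GiB of data per bio (the figure
that yields 16384 chunks of 256k), 8 data disks out of 10 for raid6, and
4 KiB per stripe for the 256M estimate quoted earlier. The function names
are illustrative only.

```c
#include <stdint.h>

/* Whole units of unit_bytes that fit in io_bytes. */
static uint64_t units_covered(uint64_t io_bytes, uint64_t unit_bytes)
{
	return io_bytes / unit_bytes;
}

/* Scale a data-chunk count up to include the parity chunks of each row. */
static uint64_t chunks_with_parity(uint64_t data_chunks,
				   unsigned int disks, unsigned int parity)
{
	return data_chunks * disks / (disks - parity);
}
```

With these inputs a 4 GiB bio spans 16384 data chunks (20480 with
parity), while at 4 KiB per stripe 256 MiB is already enough to reach
65536 stripes, so the answer depends on which unit the counter actually
tracks.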

>
> I have some patches which should fix this.  I'll post them shortly.  I'd
> appreciate it if you would test and confirm that they work (and don't
> break anything else)

Ok, I'll try to check that patchset.

-- 
Konstantin


* Re: [PATCH 08/12] dm: dm.c: replace 'bio->bi_vcnt == 1' with !bio_multiple_segments
From: Mike Snitzer @ 2016-11-21 14:50 UTC
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Alasdair Kergon,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1478865957-25252-9-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11 2016 at  7:05am -0500,
Ming Lei <tom.leiming@gmail.com> wrote:

> Avoid accessing .bi_vcnt directly, because the bio can be
> split by the block layer, and .bi_vcnt should never have
> been used here.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

I've staged this for 4.10


* Re: [PATCH 07/12] dm: use bvec iterator helpers to implement .get_page and .next_page
From: Mike Snitzer @ 2016-11-21 14:49 UTC
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Alasdair Kergon,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1478865957-25252-8-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11 2016 at  7:05am -0500,
Ming Lei <tom.leiming@gmail.com> wrote:

> Firstly, we have mature bvec/bio iterator helpers for iterating over
> each page in a bio, so there is no need to reinvent the wheel.
> 
> Secondly, the coming multipage bvecs require this patch.
> 
> Also add comments about the direct access to bvec table.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

I've staged this for 4.10


* Re: [PATCH 06/12] dm: crypt: use bio_add_page()
From: Mike Snitzer @ 2016-11-21 14:49 UTC
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-fsdevel,
	Christoph Hellwig, Alasdair Kergon,
	maintainer:DEVICE-MAPPER (LVM), Shaohua Li,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1478865957-25252-7-git-send-email-tom.leiming@gmail.com>

On Fri, Nov 11 2016 at  7:05am -0500,
Ming Lei <tom.leiming@gmail.com> wrote:

> We have a standard interface to add a page to a bio, so don't
> do that in a hacky way.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

I've staged this for 4.10


* Re: MD Remnants After --stop
From: Marc Smith @ 2016-11-21 14:08 UTC
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87fumlebwo.fsf@notabene.neil.brown.name>

On Sun, Nov 20, 2016 at 10:42 PM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Nov 19 2016, Marc Smith wrote:
>
>> On Mon, Nov 7, 2016 at 12:44 AM, NeilBrown <neilb@suse.com> wrote:
>>> On Sat, Nov 05 2016, Marc Smith wrote:
>>>
>>>> Hi,
>>>>
>>>> It may be that I've never noticed this before, so maybe it's not a
>>>> problem... after using '--stop' to deactivate/stop an MD array, there
>>>> are remnants of it lingering, namely an entry in /sys/block (eg,
>>>> /sys/block/md127) and the device node in /dev remains (eg,
>>>> /dev/md127).
>>>>
>>>> Is this normal? Like I said, it probably is, and I've just never
>>>> noticed it before. I assume it's not going to hurt anything, but is
>>>> there a way to clean it up, without rebooting? Obviously I could
>>>> remove the /dev entry, but what about /sys/block?
>>>>
>>>
>>> You can remove them both by running
>>>   mdadm -S /dev/md127
>>>
>>> but they'll probably just reappear again.
>>>
>>> This seems to be an on-going battle between md and udev.  I've "fixed"
>>> it at least once, but it keeps coming back.
>>>
>>> When md removes the md127 device, a message is sent to udev.
>>> As part of its response to this message, udev tries to open /dev/md127.
>>> Because of the rather unusual way that md devices are created (it made
>>> sense nearly 20 years ago when it was designed), opening /dev/md127
>>> causes md to create device md127 again.
>>>
>>> You could
>>>   mv /dev/md127 /dev/md127X
>>>   mdadm -S /dev/md127X
>>>   rm /dev/md127X
>>> that stops udev from opening /dev/md127.  It seems to work reliably.
>>>
>>> md used to generate a CHANGE event before the REMOVE event, and only the
>>> CHANGE event caused udev to open the device file.  I removed that and
>>> the problem went away.  Apparently some change has happened to udev and
>>> now it opens the file in response to REMOVE as well.
>>
>> I used "udevadm monitor -pku" to watch the events when running "mdadm
>> --stop /dev/md127" and this is what I see:
>>
>> --snip--
>> KERNEL[163074.119778] change   /devices/virtual/block/md127 (block)
>> ACTION=change
>> DEVNAME=/dev/md127
>> DEVPATH=/devices/virtual/block/md127
>> DEVTYPE=disk
>> MAJOR=9
>> MINOR=127
>> SEQNUM=3701
>> SUBSYSTEM=block
>>
>> UDEV  [163074.121569] change   /devices/virtual/block/md127 (block)
>> ACTION=change
>> DEVNAME=/dev/md127
>> DEVPATH=/devices/virtual/block/md127
>> DEVTYPE=disk
>> MAJOR=9
>> MINOR=127
>> SEQNUM=3701
>> SUBSYSTEM=block
>> SYSTEMD_READY=0
>> USEC_INITIALIZED=370470
>> --snip--
>>
>> I don't see any 'remove' event generated. I should mention if I hadn't
>> already that I'm testing md-cluster (--bitmap=clustered), and
>> currently using Linux 4.9-rc3.
>
> What version of mdadm are you using?

v3.4


> You need one which contains
> Commit: 229e66cb9689 ("Manage.c: Only issue change events for kernels older than 2.6.28")
>
> which hasn't made it into a release yet.  But if you are playing with
> md-cluster, I would guess you are using the latest from git...

Wasn't, but I will now. Thanks.

--Marc


>
> NeilBrown


* Re: linux raid wiki - backup files
From: Phil Turmel @ 2016-11-21 14:07 UTC
  To: Wols Lists, linux-raid
In-Reply-To: <5831B7B9.8090008@youngman.org.uk>

On 11/20/2016 09:48 AM, Wols Lists wrote:
> On 20/11/16 00:27, Phil Turmel wrote:
>> Yes.  But the new stripes lay on top of the old stripes, unless you move
>> the data offset.  Which is why a backup file holds the old stripe just
>> in case.  If you can move the offset, you use the lower offset for the
>> lower addresses in the array, and the higher offset for the higher
>> addresses, on either side of the reshape position.
> 
> Okay, understood. So v0.9 and v1.0 always need a backup for a reshape.
> 
> But if we have a data offset with v1.2, a reshape will use that space if
> it can rather than needing a backup file?

If there's room to move the entire data area (direction varies with
operation), yes.

> (And changing topic slightly, that space is also used for an internal
> bitmap?)

And bad block log.

Phil


* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: Christoph Hellwig @ 2016-11-21 14:01 UTC
  To: NeilBrown
  Cc: Shaohua Li, Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

Hi Neil,

I only took a cursory look, but these changes look very nice to me.
Thanks for looking into this!


* Re: [PATCH v2] RAID1: Avoid unnecessary loop to decrease conf->nr_queued in raid1d()
From: Coly Li @ 2016-11-21 10:16 UTC
  To: Shaohua Li; +Cc: linux-raid, Neil Brown
In-Reply-To: <20161116200519.v6y6t4dsbl2zshk2@kernel.org>

On 2016/11/17 4:05 AM, Shaohua Li wrote:
> On Wed, Nov 16, 2016 at 10:36:32PM +0800, Coly Li wrote:
>> On 2016/11/16 10:19 PM, Coly Li wrote:
>> [snip]
>>> ---
>>>  drivers/md/raid1.c | 9 +++++----
>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> Index: linux-raid1/drivers/md/raid1.c
>>> ===================================================================
>>> --- linux-raid1.orig/drivers/md/raid1.c
>>> +++ linux-raid1/drivers/md/raid1.c
>>> @@ -2387,17 +2387,17 @@ static void raid1d(struct md_thread *thr
>> [snip]
>>>  		while (!list_empty(&tmp)) {
>>>  			r1_bio = list_first_entry(&tmp, struct r1bio,
>>>  						  retry_list);
>>>  			list_del(&r1_bio->retry_list);
>>> +			spin_lock_irqsave(&conf->device_lock, flags);
>>> +			conf->nr_queued--;
>>> +			spin_unlock_irqrestore(&conf->device_lock, flags);
>> [snip]
>>
>> I am now working on two more patches, a simpler I/O barrier and
>> lockless I/O submission on raid1, where conf->nr_queued will be an
>> atomic_t, so the spin lock expense will not exist any more. Just FYI.
> 
> I'd like to hold this patch till you post the simpler I/O barrier, as the patch
> itself currently doesn't make the process faster (lock/unlock is much heavier
> than the loop).

Yeah, I will combine this patch with the new barrier patch, and send out
an RFC patch set with two patches. One is the new barrier patch, the other
is the lockless wait_barrier() patch.

Thanks.

Coly



* [RFC PATCH] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-11-21 10:10 UTC
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Binoy Jayan
In-Reply-To: <1479723009-11113-1-git-send-email-binoy.jayan@linaro.org>

Currently, the IV generation algorithms are implemented in dm-crypt.c.
The goal is to move these algorithms from the dm layer to the kernel
crypto layer by implementing them as template ciphers so they can be used
in conjunction with algorithms like aes, and with multiple modes like cbc,
ecb etc. As part of this patchset, the IV generation code is moved from the
dm layer to the crypto layer. The dm layer can later be optimized to
encrypt larger block sizes in a single call to the crypto engine. The IV
generation algorithms implemented in geniv.c include plain, plain64,
essiv, benbi, null, lmk and tcw. These templates have to be configured
and invoked as:

crypto_alloc_skcipher("plain(cbc(aes))", 0, 0);
crypto_alloc_skcipher("essiv(cbc(aes))", 0, 0);
...

from the dm layer.
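
For readers unfamiliar with the IV modes named above, here is a minimal
userspace sketch of the three simplest generators (plain, plain64 and
benbi), mirroring what the crypt_iv_*_gen functions in this patch compute.
It uses plain byte stores instead of the kernel's cpu_to_le32,
cpu_to_le64 and cpu_to_be64 helpers, and the iv_* function names are
invented for the example; it is not the geniv.c code.

```c
#include <stdint.h>
#include <string.h>

/* plain: zero the IV, then store the low 32 bits of the sector, little-endian. */
static void iv_plain(uint8_t *iv, size_t iv_size, uint64_t sector)
{
	memset(iv, 0, iv_size);
	for (size_t i = 0; i < 4 && i < iv_size; i++)
		iv[i] = (uint8_t)(sector >> (8 * i));
}

/* plain64: zero the IV, then store the full 64-bit sector, little-endian. */
static void iv_plain64(uint8_t *iv, size_t iv_size, uint64_t sector)
{
	memset(iv, 0, iv_size);
	for (size_t i = 0; i < 8 && i < iv_size; i++)
		iv[i] = (uint8_t)(sector >> (8 * i));
}

/*
 * benbi: big-endian 64-bit cipher-block count, starting at 1, placed in
 * the last 8 bytes of the IV. shift is 9 - log2(cipher block size), so a
 * 16-byte block cipher uses shift = 5. Requires iv_size >= 8.
 */
static void iv_benbi(uint8_t *iv, size_t iv_size, uint64_t sector,
		     unsigned int shift)
{
	uint64_t val = (sector << shift) + 1;

	memset(iv, 0, iv_size);
	for (size_t i = 0; i < 8; i++)
		iv[iv_size - 1 - i] = (uint8_t)(val >> (8 * i));
}
```

For a 16-byte IV and sector 1, iv_benbi() with shift 5 stores the value
33 in the final byte and zeroes everywhere else, since sector 1 begins at
cipher block 32 and the count is one-based.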

Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 crypto/Kconfig            |    8 +
 crypto/Makefile           |    1 +
 crypto/geniv.c            | 1113 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-crypt.c     |  725 +++--------------------------
 include/crypto/geniv.h    |  109 +++++
 include/crypto/skcipher.h |   17 +
 6 files changed, 1309 insertions(+), 664 deletions(-)
 create mode 100644 crypto/geniv.c
 create mode 100644 include/crypto/geniv.h

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 84d7148..7125bc2 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -326,6 +326,14 @@ config CRYPTO_CTS
 	  This mode is required for Kerberos gss mechanism support
 	  for AES encryption.
 
+config CRYPTO_GENIV
+	tristate "IV Generation for dm-crypt"
+	select CRYPTO_BLKCIPHER
+	help
+	  GENIV: IV Generation for dm-crypt
+	  Algorithms to generate Initialization Vector for ciphers
+	  used by dm-crypt.
+
 config CRYPTO_ECB
 	tristate "ECB support"
 	select CRYPTO_BLKCIPHER
diff --git a/crypto/Makefile b/crypto/Makefile
index 99cc64a..fc81a82 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -74,6 +74,7 @@ obj-$(CONFIG_CRYPTO_TGR192) += tgr192.o
 obj-$(CONFIG_CRYPTO_GF128MUL) += gf128mul.o
 obj-$(CONFIG_CRYPTO_ECB) += ecb.o
 obj-$(CONFIG_CRYPTO_CBC) += cbc.o
+obj-$(CONFIG_CRYPTO_GENIV) += geniv.o
 obj-$(CONFIG_CRYPTO_PCBC) += pcbc.o
 obj-$(CONFIG_CRYPTO_CTS) += cts.o
 obj-$(CONFIG_CRYPTO_LRW) += lrw.o
diff --git a/crypto/geniv.c b/crypto/geniv.c
new file mode 100644
index 0000000..46988d5
--- /dev/null
+++ b/crypto/geniv.c
@@ -0,0 +1,1113 @@
+/*
+ * geniv: IV generation algorithms
+ *
+ * Copyright (c) 2016, Linaro Ltd.
+ * Copyright (C) 2006-2015 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2013 Milan Broz <gmazyland@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/module.h>
+#include <linux/scatterlist.h>
+#include <linux/slab.h>
+#include <linux/completion.h>
+#include <linux/crypto.h>
+#include <linux/workqueue.h>
+#include <linux/backing-dev.h>
+#include <linux/atomic.h>
+#include <linux/rbtree.h>
+#include <crypto/hash.h>
+#include <crypto/md5.h>
+#include <crypto/algapi.h>
+#include <crypto/skcipher.h>
+#include <asm/unaligned.h>
+#include <crypto/geniv.h>
+
+struct crypto_geniv_req_ctx {
+	struct skcipher_request subreq CRYPTO_MINALIGN_ATTR;
+};
+
+static struct crypto_skcipher *any_tfm(struct geniv_ctx_data *cd)
+{
+	return cd->tfm;
+}
+
+static int crypt_iv_plain_gen(struct geniv_ctx_data *cd, u8 *iv,
+			      struct dm_crypt_request *dmreq)
+{
+	memset(iv, 0, cd->iv_size);
+	*(__le32 *)iv = cpu_to_le32(dmreq->iv_sector & 0xffffffff);
+
+	return 0;
+}
+
+static int crypt_iv_plain64_gen(struct geniv_ctx_data *cd, u8 *iv,
+				struct dm_crypt_request *dmreq)
+{
+	memset(iv, 0, cd->iv_size);
+	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+
+	return 0;
+}
+
+/* Initialise ESSIV - compute salt but no local memory allocations */
+static int crypt_iv_essiv_init(struct geniv_ctx_data *cd)
+{
+	struct geniv_essiv_private *essiv = &cd->iv_gen_private.essiv;
+	struct scatterlist sg;
+	struct crypto_cipher *essiv_tfm;
+	int err;
+	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
+
+	sg_init_one(&sg, cd->key, cd->key_size);
+	ahash_request_set_tfm(req, essiv->hash_tfm);
+	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
+	ahash_request_set_crypt(req, &sg, essiv->salt, cd->key_size);
+
+	err = crypto_ahash_digest(req);
+	ahash_request_zero(req);
+	if (err)
+		return err;
+
+	essiv_tfm = cd->iv_private;
+
+	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
+			    crypto_ahash_digestsize(essiv->hash_tfm));
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/* Wipe salt and reset key derived from volume key */
+static int crypt_iv_essiv_wipe(struct geniv_ctx_data *cd)
+{
+	struct geniv_essiv_private *essiv = &cd->iv_gen_private.essiv;
+	unsigned int salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
+	struct crypto_cipher *essiv_tfm;
+	int r, err = 0;
+
+	memset(essiv->salt, 0, salt_size);
+
+	essiv_tfm = cd->iv_private;
+	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
+	if (r)
+		err = r;
+
+	return err;
+}
+
+/* Set up per cpu cipher state */
+static struct crypto_cipher *setup_essiv_cpu(struct geniv_ctx_data *cd,
+					     u8 *salt, unsigned int saltsize)
+{
+	struct crypto_cipher *essiv_tfm;
+	int err;
+
+	/* Setup the essiv_tfm with the given salt */
+	essiv_tfm = crypto_alloc_cipher(cd->cipher, 0, CRYPTO_ALG_ASYNC);
+
+	if (IS_ERR(essiv_tfm)) {
+		pr_err("Error allocating crypto tfm for ESSIV\n");
+		return essiv_tfm;
+	}
+
+	if (crypto_cipher_blocksize(essiv_tfm) !=
+	    crypto_skcipher_ivsize(any_tfm(cd))) {
+		pr_err("Block size of ESSIV cipher does not match IV size of block cipher\n");
+		crypto_free_cipher(essiv_tfm);
+		return ERR_PTR(-EINVAL);
+	}
+
+	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
+	if (err) {
+		pr_err("Failed to set key for ESSIV cipher\n");
+		crypto_free_cipher(essiv_tfm);
+		return ERR_PTR(err);
+	}
+	return essiv_tfm;
+}
+
+static void crypt_iv_essiv_dtr(struct geniv_ctx_data *cd)
+{
+	struct crypto_cipher *essiv_tfm;
+	struct geniv_essiv_private *essiv = &cd->iv_gen_private.essiv;
+
+	crypto_free_ahash(essiv->hash_tfm);
+	essiv->hash_tfm = NULL;
+
+	kzfree(essiv->salt);
+	essiv->salt = NULL;
+
+	essiv_tfm = cd->iv_private;
+
+	if (essiv_tfm)
+		crypto_free_cipher(essiv_tfm);
+
+	cd->iv_private = NULL;
+}
+
+static int crypt_iv_essiv_ctr(struct geniv_ctx_data *cd)
+{
+	struct crypto_cipher *essiv_tfm = NULL;
+	struct crypto_ahash *hash_tfm = NULL;
+	u8 *salt = NULL;
+	int err;
+
+	if (!cd->ivopts) {
+		pr_err("Digest algorithm missing for ESSIV mode\n");
+		return -EINVAL;
+	}
+
+	/* Allocate hash algorithm */
+	hash_tfm = crypto_alloc_ahash(cd->ivopts, 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(hash_tfm)) {
+		err = PTR_ERR(hash_tfm);
+		pr_err("Error initializing ESSIV hash. err=%d\n", err);
+		goto bad;
+	}
+
+	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
+	if (!salt) {
+		err = -ENOMEM;
+		goto bad;
+	}
+
+	cd->iv_gen_private.essiv.salt = salt;
+	cd->iv_gen_private.essiv.hash_tfm = hash_tfm;
+
+	essiv_tfm = setup_essiv_cpu(cd, salt,
+				crypto_ahash_digestsize(hash_tfm));
+	if (IS_ERR(essiv_tfm)) {
+		crypt_iv_essiv_dtr(cd);
+		return PTR_ERR(essiv_tfm);
+	}
+	cd->iv_private = essiv_tfm;
+
+	return 0;
+
+bad:
+	if (hash_tfm && !IS_ERR(hash_tfm))
+		crypto_free_ahash(hash_tfm);
+	kfree(salt);
+	return err;
+}
+
+static int crypt_iv_essiv_gen(struct geniv_ctx_data *cd, u8 *iv,
+			      struct dm_crypt_request *dmreq)
+{
+	struct crypto_cipher *essiv_tfm = cd->iv_private;
+
+	memset(iv, 0, cd->iv_size);
+	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
+
+	return 0;
+}
+
+static int crypt_iv_benbi_ctr(struct geniv_ctx_data *cd)
+{
+	unsigned int bs = crypto_skcipher_blocksize(any_tfm(cd));
+	int log = ilog2(bs);
+
+	/* we need to calculate how far we must shift the sector count
+	 * to get the cipher block count, we use this shift in _gen
+	 */
+
+	if (1 << log != bs) {
+		pr_err("cypher blocksize is not a power of 2\n");
+		return -EINVAL;
+	}
+
+	if (log > 9) {
+		pr_err("cypher blocksize is > 512\n");
+		return -EINVAL;
+	}
+
+	cd->iv_gen_private.benbi.shift = 9 - log;
+
+	return 0;
+}
+
+static int crypt_iv_benbi_gen(struct geniv_ctx_data *cd, u8 *iv,
+			      struct dm_crypt_request *dmreq)
+{
+	__be64 val;
+
+	memset(iv, 0, cd->iv_size - sizeof(u64)); /* rest is cleared below */
+
+	val = cpu_to_be64(((u64) dmreq->iv_sector <<
+			  cd->iv_gen_private.benbi.shift) + 1);
+	put_unaligned(val, (__be64 *)(iv + cd->iv_size - sizeof(u64)));
+
+	return 0;
+}
+
+static int crypt_iv_null_gen(struct geniv_ctx_data *cd, u8 *iv,
+			     struct dm_crypt_request *dmreq)
+{
+	memset(iv, 0, cd->iv_size);
+
+	return 0;
+}
+
+static void crypt_iv_lmk_dtr(struct geniv_ctx_data *cd)
+{
+	struct geniv_lmk_private *lmk = &cd->iv_gen_private.lmk;
+
+	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
+		crypto_free_shash(lmk->hash_tfm);
+	lmk->hash_tfm = NULL;
+
+	kzfree(lmk->seed);
+	lmk->seed = NULL;
+}
+
+static int crypt_iv_lmk_ctr(struct geniv_ctx_data *cd)
+{
+	struct geniv_lmk_private *lmk = &cd->iv_gen_private.lmk;
+
+	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
+	if (IS_ERR(lmk->hash_tfm)) {
+		pr_err("Error initializing LMK hash; err=%ld\n",
+				PTR_ERR(lmk->hash_tfm));
+		return PTR_ERR(lmk->hash_tfm);
+	}
+
+	/* No seed in LMK version 2 */
+	if (cd->key_parts == cd->tfms_count) {
+		lmk->seed = NULL;
+		return 0;
+	}
+
+	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
+	if (!lmk->seed) {
+		crypt_iv_lmk_dtr(cd);
+		pr_err("Error kmallocing seed storage in LMK\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int crypt_iv_lmk_init(struct geniv_ctx_data *cd)
+{
+	struct geniv_lmk_private *lmk = &cd->iv_gen_private.lmk;
+	int subkey_size = cd->key_size / cd->key_parts;
+
+	/* LMK seed is on the position of LMK_KEYS + 1 key */
+	if (lmk->seed)
+		memcpy(lmk->seed, cd->key + (cd->tfms_count * subkey_size),
+		       crypto_shash_digestsize(lmk->hash_tfm));
+
+	return 0;
+}
+
+static int crypt_iv_lmk_wipe(struct geniv_ctx_data *cd)
+{
+	struct geniv_lmk_private *lmk = &cd->iv_gen_private.lmk;
+
+	if (lmk->seed)
+		memset(lmk->seed, 0, LMK_SEED_SIZE);
+
+	return 0;
+}
+
+static int crypt_iv_lmk_one(struct geniv_ctx_data *cd, u8 *iv,
+			    struct dm_crypt_request *dmreq, u8 *data)
+{
+	struct geniv_lmk_private *lmk = &cd->iv_gen_private.lmk;
+	struct md5_state md5state;
+	__le32 buf[4];
+	int i, r;
+	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
+
+	desc->tfm = lmk->hash_tfm;
+	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+
+	r = crypto_shash_init(desc);
+	if (r)
+		return r;
+
+	if (lmk->seed) {
+		r = crypto_shash_update(desc, lmk->seed, LMK_SEED_SIZE);
+		if (r)
+			return r;
+	}
+
+	/* Sector is always 512B, block size 16, add data of blocks 1-31 */
+	r = crypto_shash_update(desc, data + 16, 16 * 31);
+	if (r)
+		return r;
+
+	/* Sector is cropped to 56 bits here */
+	buf[0] = cpu_to_le32(dmreq->iv_sector & 0xFFFFFFFF);
+	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF)
+			     | 0x80000000);
+	buf[2] = cpu_to_le32(4024);
+	buf[3] = 0;
+	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
+	if (r)
+		return r;
+
+	/* No MD5 padding here */
+	r = crypto_shash_export(desc, &md5state);
+	if (r)
+		return r;
+
+	for (i = 0; i < MD5_HASH_WORDS; i++)
+		__cpu_to_le32s(&md5state.hash[i]);
+	memcpy(iv, &md5state.hash, cd->iv_size);
+
+	return 0;
+}
+
+static int crypt_iv_lmk_gen(struct geniv_ctx_data *cd, u8 *iv,
+			      struct dm_crypt_request *dmreq)
+{
+	u8 *src;
+	int r = 0;
+
+	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE) {
+		src = kmap_atomic(sg_page(&dmreq->sg_in));
+		r = crypt_iv_lmk_one(cd, iv, dmreq, src + dmreq->sg_in.offset);
+		kunmap_atomic(src);
+	} else
+		memset(iv, 0, cd->iv_size);
+
+	return r;
+}
+
+static int crypt_iv_lmk_post(struct geniv_ctx_data *cd, u8 *iv,
+			     struct dm_crypt_request *dmreq)
+{
+	u8 *dst;
+	int r;
+
+	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE)
+		return 0;
+
+	dst = kmap_atomic(sg_page(&dmreq->sg_out));
+	r = crypt_iv_lmk_one(cd, iv, dmreq, dst + dmreq->sg_out.offset);
+
+	/* Tweak the first block of plaintext sector */
+	if (!r)
+		crypto_xor(dst + dmreq->sg_out.offset, iv, cd->iv_size);
+
+	kunmap_atomic(dst);
+	return r;
+}
+
+static void crypt_iv_tcw_dtr(struct geniv_ctx_data *cd)
+{
+	struct geniv_tcw_private *tcw = &cd->iv_gen_private.tcw;
+
+	kzfree(tcw->iv_seed);
+	tcw->iv_seed = NULL;
+	kzfree(tcw->whitening);
+	tcw->whitening = NULL;
+
+	if (tcw->crc32_tfm && !IS_ERR(tcw->crc32_tfm))
+		crypto_free_shash(tcw->crc32_tfm);
+	tcw->crc32_tfm = NULL;
+}
+
+static int crypt_iv_tcw_ctr(struct geniv_ctx_data *cd)
+{
+	struct geniv_tcw_private *tcw = &cd->iv_gen_private.tcw;
+
+	if (cd->key_size <= (cd->iv_size + TCW_WHITENING_SIZE)) {
+		pr_err("Wrong key size (%d) for TCW. Choose a value > %d bytes\n",
+			cd->key_size,
+			cd->iv_size + TCW_WHITENING_SIZE);
+		return -EINVAL;
+	}
+
+	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
+	if (IS_ERR(tcw->crc32_tfm)) {
+		pr_err("Error initializing CRC32 in TCW; err=%ld\n",
+			PTR_ERR(tcw->crc32_tfm));
+		return PTR_ERR(tcw->crc32_tfm);
+	}
+
+	tcw->iv_seed = kzalloc(cd->iv_size, GFP_KERNEL);
+	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
+	if (!tcw->iv_seed || !tcw->whitening) {
+		crypt_iv_tcw_dtr(cd);
+		pr_err("Error allocating seed storage in TCW\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int crypt_iv_tcw_init(struct geniv_ctx_data *cd)
+{
+	struct geniv_tcw_private *tcw = &cd->iv_gen_private.tcw;
+	int key_offset = cd->key_size - cd->iv_size - TCW_WHITENING_SIZE;
+
+	memcpy(tcw->iv_seed, &cd->key[key_offset], cd->iv_size);
+	memcpy(tcw->whitening, &cd->key[key_offset + cd->iv_size],
+	       TCW_WHITENING_SIZE);
+
+	return 0;
+}
+
+static int crypt_iv_tcw_wipe(struct geniv_ctx_data *cd)
+{
+	struct geniv_tcw_private *tcw = &cd->iv_gen_private.tcw;
+
+	memset(tcw->iv_seed, 0, cd->iv_size);
+	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
+
+	return 0;
+}
+
+static int crypt_iv_tcw_whitening(struct geniv_ctx_data *cd,
+				  struct dm_crypt_request *dmreq, u8 *data)
+{
+	struct geniv_tcw_private *tcw = &cd->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(dmreq->iv_sector);
+	u8 buf[TCW_WHITENING_SIZE];
+	int i, r;
+	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
+
+	/* xor whitening with sector number */
+	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
+	crypto_xor(buf, (u8 *)&sector, 8);
+	crypto_xor(&buf[8], (u8 *)&sector, 8);
+
+	/* calculate crc32 for every 32bit part and xor it */
+	desc->tfm = tcw->crc32_tfm;
+	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+	for (i = 0; i < 4; i++) {
+		r = crypto_shash_init(desc);
+		if (r)
+			goto out;
+		r = crypto_shash_update(desc, &buf[i * 4], 4);
+		if (r)
+			goto out;
+		r = crypto_shash_final(desc, &buf[i * 4]);
+		if (r)
+			goto out;
+	}
+	crypto_xor(&buf[0], &buf[12], 4);
+	crypto_xor(&buf[4], &buf[8], 4);
+
+	/* apply whitening (8 bytes) to whole sector */
+	for (i = 0; i < ((1 << SECTOR_SHIFT) / 8); i++)
+		crypto_xor(data + i * 8, buf, 8);
+out:
+	memzero_explicit(buf, sizeof(buf));
+	return r;
+}
+
+static int crypt_iv_tcw_gen(struct geniv_ctx_data *cd, u8 *iv,
+			      struct dm_crypt_request *dmreq)
+{
+	struct geniv_tcw_private *tcw = &cd->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(dmreq->iv_sector);
+	u8 *src;
+	int r = 0;
+
+	/* Remove whitening from ciphertext */
+	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE) {
+		src = kmap_atomic(sg_page(&dmreq->sg_in));
+		r = crypt_iv_tcw_whitening(cd, dmreq,
+					   src + dmreq->sg_in.offset);
+		kunmap_atomic(src);
+	}
+
+	/* Calculate IV */
+	memcpy(iv, tcw->iv_seed, cd->iv_size);
+	crypto_xor(iv, (u8 *)&sector, 8);
+	if (cd->iv_size > 8)
+		crypto_xor(&iv[8], (u8 *)&sector, cd->iv_size - 8);
+
+	return r;
+}
+
+static int crypt_iv_tcw_post(struct geniv_ctx_data *cd, u8 *iv,
+			     struct dm_crypt_request *dmreq)
+{
+	u8 *dst;
+	int r;
+
+	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE)
+		return 0;
+
+	/* Apply whitening on ciphertext */
+	dst = kmap_atomic(sg_page(&dmreq->sg_out));
+	r = crypt_iv_tcw_whitening(cd, dmreq, dst + dmreq->sg_out.offset);
+	kunmap_atomic(dst);
+
+	return r;
+}
+
+static struct geniv_operations crypt_iv_plain_ops = {
+	.generator = crypt_iv_plain_gen
+};
+
+static struct geniv_operations crypt_iv_plain64_ops = {
+	.generator = crypt_iv_plain64_gen
+};
+
+static struct geniv_operations crypt_iv_essiv_ops = {
+	.ctr       = crypt_iv_essiv_ctr,
+	.dtr       = crypt_iv_essiv_dtr,
+	.init      = crypt_iv_essiv_init,
+	.wipe      = crypt_iv_essiv_wipe,
+	.generator = crypt_iv_essiv_gen
+};
+
+static struct geniv_operations crypt_iv_benbi_ops = {
+	.ctr	   = crypt_iv_benbi_ctr,
+	.generator = crypt_iv_benbi_gen
+};
+
+static struct geniv_operations crypt_iv_null_ops = {
+	.generator = crypt_iv_null_gen
+};
+
+static struct geniv_operations crypt_iv_lmk_ops = {
+	.ctr	   = crypt_iv_lmk_ctr,
+	.dtr	   = crypt_iv_lmk_dtr,
+	.init	   = crypt_iv_lmk_init,
+	.wipe	   = crypt_iv_lmk_wipe,
+	.generator = crypt_iv_lmk_gen,
+	.post	   = crypt_iv_lmk_post
+};
+
+static struct geniv_operations crypt_iv_tcw_ops = {
+	.ctr	   = crypt_iv_tcw_ctr,
+	.dtr	   = crypt_iv_tcw_dtr,
+	.init	   = crypt_iv_tcw_init,
+	.wipe	   = crypt_iv_tcw_wipe,
+	.generator = crypt_iv_tcw_gen,
+	.post	   = crypt_iv_tcw_post
+};
+
+static int geniv_setkey_set(struct geniv_ctx_data *cd)
+{
+	int ret = 0;
+
+	if (cd->iv_gen_ops && cd->iv_gen_ops->init)
+		ret = cd->iv_gen_ops->init(cd);
+	return ret;
+}
+
+static int geniv_setkey_wipe(struct geniv_ctx_data *cd)
+{
+	int ret = 0;
+
+	if (cd->iv_gen_ops && cd->iv_gen_ops->wipe) {
+		ret = cd->iv_gen_ops->wipe(cd);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+static int geniv_setkey_init_ctx(struct geniv_ctx_data *cd)
+{
+	int ret = -EINVAL;
+
+	pr_debug("IV Generation algorithm : %s\n", cd->ivmode);
+
+	if (cd->ivmode == NULL)
+		cd->iv_gen_ops = NULL;
+	else if (strcmp(cd->ivmode, "plain") == 0)
+		cd->iv_gen_ops = &crypt_iv_plain_ops;
+	else if (strcmp(cd->ivmode, "plain64") == 0)
+		cd->iv_gen_ops = &crypt_iv_plain64_ops;
+	else if (strcmp(cd->ivmode, "essiv") == 0)
+		cd->iv_gen_ops = &crypt_iv_essiv_ops;
+	else if (strcmp(cd->ivmode, "benbi") == 0)
+		cd->iv_gen_ops = &crypt_iv_benbi_ops;
+	else if (strcmp(cd->ivmode, "null") == 0)
+		cd->iv_gen_ops = &crypt_iv_null_ops;
+	else if (strcmp(cd->ivmode, "lmk") == 0)
+		cd->iv_gen_ops = &crypt_iv_lmk_ops;
+	else if (strcmp(cd->ivmode, "tcw") == 0) {
+		cd->iv_gen_ops = &crypt_iv_tcw_ops;
+		cd->key_parts += 2; /* IV + whitening */
+		cd->key_extra_size = cd->iv_size + TCW_WHITENING_SIZE;
+	} else {
+		ret = -EINVAL;
+		pr_err("Invalid IV mode %s\n", cd->ivmode);
+		goto end;
+	}
+
+	/* Allocate IV */
+	if (cd->iv_gen_ops && cd->iv_gen_ops->ctr) {
+		ret = cd->iv_gen_ops->ctr(cd);
+		if (ret < 0) {
+			pr_err("Error creating IV for %s\n", cd->ivmode);
+			goto end;
+		}
+	}
+
+	ret = 0;
+	/* Initialize IV (set keys for ESSIV etc) */
+	if (cd->iv_gen_ops && cd->iv_gen_ops->init) {
+		ret = cd->iv_gen_ops->init(cd);
+		if (ret < 0)
+			pr_err("Error initializing IV for %s\n", cd->ivmode);
+	}
+end:
+	return ret;
+}
+
+static int crypto_geniv_set_ctx(struct crypto_skcipher *cipher,
+				void *newctx, unsigned int len)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(cipher);
+	/*
+	 * TODO:
+	 * Do we really need this API or can we append the context
+	 * 'struct geniv_ctx' to the cipher from dm-crypt and use
+	 * the same here?
+	 */
+	memcpy(ctx, (char *) newctx, len);
+	return geniv_setkey_init_ctx(&ctx->data);
+}
+
+static int crypto_geniv_setkey(struct crypto_skcipher *parent,
+				const u8 *key, unsigned int keylen)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	struct crypto_skcipher *child = ctx->child;
+	int err;
+
+	pr_debug("SETKEY Operation : %d\n", ctx->data.keyop);
+
+	switch (ctx->data.keyop) {
+	case SETKEY_OP_INIT:
+		err = geniv_setkey_init_ctx(&ctx->data);
+		break;
+	case SETKEY_OP_SET:
+		err = geniv_setkey_set(&ctx->data);
+		break;
+	case SETKEY_OP_WIPE:
+		err = geniv_setkey_wipe(&ctx->data);
+		break;
+	}
+
+	crypto_skcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+	crypto_skcipher_set_flags(child, crypto_skcipher_get_flags(parent) &
+					 CRYPTO_TFM_REQ_MASK);
+	err = crypto_skcipher_setkey(child, key, keylen);
+	crypto_skcipher_set_flags(parent, crypto_skcipher_get_flags(child) &
+					  CRYPTO_TFM_RES_MASK);
+	return err;
+}
+
+static struct dm_crypt_request *dmreq_of_req(struct crypto_skcipher *tfm,
+					     struct skcipher_request *req)
+{
+	struct geniv_ctx *ctx;
+
+	ctx = crypto_skcipher_ctx(tfm);
+	return (struct dm_crypt_request *) ((char *) req + ctx->data.dmoffset);
+}
+
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error)
+{
+	struct skcipher_request *req = async_req->data;
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_ctx_data *cd = &ctx->data;
+	struct dm_crypt_request *dmreq = dmreq_of_req(tfm, req);
+	struct convert_context *cctx = dmreq->ctx;
+	unsigned long align = crypto_skcipher_alignmask(tfm);
+	struct crypto_geniv_req_ctx *rctx =
+		(void *) PTR_ALIGN((u8 *)skcipher_request_ctx(req), align + 1);
+	struct skcipher_request *subreq = &rctx->subreq;
+
+	/*
+	 * A request from crypto driver backlog is going to be processed now,
+	 * finish the completion and continue in crypt_convert().
+	 * (Callback will be called for the second time for this request.)
+	 */
+	if (error == -EINPROGRESS) {
+		complete(&cctx->restart);
+		return;
+	}
+
+	if (!error && cd->iv_gen_ops && cd->iv_gen_ops->post)
+		error = cd->iv_gen_ops->post(cd, req->iv, dmreq);
+
+	skcipher_request_set_callback(subreq, req->base.flags,
+				      req->base.complete, req->base.data);
+	skcipher_request_complete(req, error);
+}
+
+static inline int crypto_geniv_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_ctx_data *cd = &ctx->data;
+	struct crypto_skcipher *child = ctx->child;
+	struct dm_crypt_request *dmreq;
+	unsigned long align = crypto_skcipher_alignmask(tfm);
+	struct crypto_geniv_req_ctx *rctx =
+		(void *) PTR_ALIGN((u8 *)skcipher_request_ctx(req), align + 1);
+	struct skcipher_request *subreq = &rctx->subreq;
+	int ret = 0;
+	u8 *iv = req->iv;
+
+	dmreq = dmreq_of_req(tfm, req);
+
+	if (cd->iv_gen_ops)
+		ret = cd->iv_gen_ops->generator(cd, iv, dmreq);
+
+	if (ret < 0) {
+		pr_err("Error in generating IV ret: %d\n", ret);
+		goto end;
+	}
+
+	skcipher_request_set_tfm(subreq, child);
+	skcipher_request_set_callback(subreq, req->base.flags,
+				      geniv_async_done, req);
+	skcipher_request_set_crypt(subreq, req->src, req->dst,
+				   req->cryptlen, iv);
+
+	if (encrypt)
+		ret = crypto_skcipher_encrypt(subreq);
+	else
+		ret = crypto_skcipher_decrypt(subreq);
+
+	if (!ret && cd->iv_gen_ops && cd->iv_gen_ops->post)
+		ret = cd->iv_gen_ops->post(cd, iv, dmreq);
+
+end:
+	return ret;
+}
+
+static int crypto_geniv_encrypt(struct skcipher_request *req)
+{
+	return crypto_geniv_crypt(req, true);
+}
+
+static int crypto_geniv_decrypt(struct skcipher_request *req)
+{
+	return crypto_geniv_crypt(req, false);
+}
+
+static int crypto_geniv_init_tfm(struct crypto_skcipher *tfm)
+{
+	struct skcipher_instance *inst = skcipher_alg_instance(tfm);
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_ctx_data *cd;
+	struct crypto_skcipher *cipher;
+	unsigned long align;
+	unsigned int reqsize, extrasize;
+
+	cipher = crypto_spawn_skcipher2(spawn);
+	if (IS_ERR(cipher))
+		return PTR_ERR(cipher);
+
+	ctx->child = cipher;
+
+	/* Setup the current cipher's request structure */
+	align = crypto_skcipher_alignmask(tfm);
+	align &= ~(crypto_tfm_ctx_alignment() - 1);
+	reqsize = align + sizeof(struct crypto_geniv_req_ctx) +
+		  crypto_skcipher_reqsize(cipher);
+	crypto_skcipher_set_reqsize(tfm, reqsize);
+
+	/*
+	 * Set the extra context parameters. The caller of the cipher lays
+	 * out the request, its context and the extra context as follows:
+	 *   struct skcipher_request   --+
+	 *      context                  |   Request context
+	 *      padding                --+
+	 *   struct dm_crypt_request   --+
+	 *      padding                  |   Extra context
+	 *   IV                        --+
+	 */
+	cd = &ctx->data;
+	cd->dmoffset  = sizeof(struct skcipher_request);
+	cd->dmoffset += crypto_skcipher_reqsize(tfm);
+	cd->dmoffset  = ALIGN(cd->dmoffset,
+			__alignof__(struct dm_crypt_request));
+	extrasize = cd->dmoffset + sizeof(struct dm_crypt_request);
+
+	return 0;
+}
+
+static void crypto_geniv_exit_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_ctx_data *cd = &ctx->data;
+
+	if (cd->iv_gen_ops && cd->iv_gen_ops->dtr)
+		cd->iv_gen_ops->dtr(cd);
+
+	crypto_free_skcipher(ctx->child);
+}
+
+static void crypto_geniv_free(struct skcipher_instance *inst)
+{
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+
+	crypto_drop_skcipher(spawn);
+	kfree(inst);
+}
+
+static int crypto_geniv_create(struct crypto_template *tmpl,
+				 struct rtattr **tb, char *algname)
+{
+	struct crypto_attr_type *algt;
+	struct skcipher_instance *inst;
+	struct skcipher_alg *alg;
+	struct crypto_skcipher_spawn *spawn;
+	const char *cipher_name;
+	int err;
+
+	algt = crypto_get_attr_type(tb);
+
+	if (IS_ERR(algt))
+		return PTR_ERR(algt);
+
+	if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
+		return -EINVAL;
+
+	cipher_name = crypto_attr_alg_name(tb[1]);
+
+	if (IS_ERR(cipher_name))
+		return PTR_ERR(cipher_name);
+
+	inst = kzalloc(sizeof(*inst) + sizeof(*spawn), GFP_KERNEL);
+	if (!inst)
+		return -ENOMEM;
+
+	spawn = skcipher_instance_ctx(inst);
+
+	crypto_set_skcipher_spawn(spawn, skcipher_crypto_instance(inst));
+	err = crypto_grab_skcipher2(spawn, cipher_name, 0,
+				    crypto_requires_sync(algt->type,
+							 algt->mask));
+
+	if (err)
+		goto err_free_inst;
+
+	alg = crypto_spawn_skcipher_alg(spawn);
+
+	/* 16-byte IV check disabled; block size must be a power of 2. */
+	err = -EINVAL;
+	/*
+	 * if (crypto_skcipher_alg_ivsize(alg) != 16)
+	 *	goto err_drop_spawn;
+	 */
+
+	if (!is_power_of_2(alg->base.cra_blocksize))
+		goto err_drop_spawn;
+
+	err = -ENAMETOOLONG;
+	if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		     algname, alg->base.cra_name) >= CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+	if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME,
+		     "%s(%s)", algname, alg->base.cra_driver_name) >=
+	    CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+
+	inst->alg.base.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.base.cra_priority = alg->base.cra_priority;
+	inst->alg.base.cra_blocksize = alg->base.cra_blocksize;
+	inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
+	inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
+	inst->alg.ivsize = alg->base.cra_blocksize;
+	inst->alg.chunksize = crypto_skcipher_alg_chunksize(alg);
+	inst->alg.min_keysize = crypto_skcipher_alg_min_keysize(alg);
+	inst->alg.max_keysize = crypto_skcipher_alg_max_keysize(alg);
+
+	inst->alg.setkey = crypto_geniv_setkey;
+	inst->alg.set_ctx = crypto_geniv_set_ctx;
+	inst->alg.encrypt = crypto_geniv_encrypt;
+	inst->alg.decrypt = crypto_geniv_decrypt;
+
+	inst->alg.base.cra_ctxsize = sizeof(struct geniv_ctx);
+
+	inst->alg.init = crypto_geniv_init_tfm;
+	inst->alg.exit = crypto_geniv_exit_tfm;
+
+	inst->free = crypto_geniv_free;
+
+	err = skcipher_register_instance(tmpl, inst);
+	if (err)
+		goto err_drop_spawn;
+
+out:
+	return err;
+
+err_drop_spawn:
+	crypto_drop_skcipher(spawn);
+err_free_inst:
+	kfree(inst);
+	goto out;
+}
+
+static int crypto_plain_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "plain");
+}
+
+static int crypto_plain64_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "plain64");
+}
+
+static int crypto_essiv_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "essiv");
+}
+
+static int crypto_benbi_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "benbi");
+}
+
+static int crypto_null_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "null");
+}
+
+static int crypto_lmk_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "lmk");
+}
+
+static int crypto_tcw_create(struct crypto_template *tmpl,
+				struct rtattr **tb)
+{
+	return crypto_geniv_create(tmpl, tb, "tcw");
+}
+
+static struct crypto_template crypto_plain_tmpl = {
+	.name   = "plain",
+	.create = crypto_plain_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_plain64_tmpl = {
+	.name   = "plain64",
+	.create = crypto_plain64_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_essiv_tmpl = {
+	.name   = "essiv",
+	.create = crypto_essiv_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_benbi_tmpl = {
+	.name   = "benbi",
+	.create = crypto_benbi_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_null_tmpl = {
+	.name   = "null",
+	.create = crypto_null_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_lmk_tmpl = {
+	.name   = "lmk",
+	.create = crypto_lmk_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_tcw_tmpl = {
+	.name   = "tcw",
+	.create = crypto_tcw_create,
+	.module = THIS_MODULE,
+};
+
+static int __init crypto_geniv_module_init(void)
+{
+	int err;
+
+	err = crypto_register_template(&crypto_plain_tmpl);
+	if (err)
+		goto out;
+
+	err = crypto_register_template(&crypto_plain64_tmpl);
+	if (err)
+		goto out_undo_plain;
+
+	err = crypto_register_template(&crypto_essiv_tmpl);
+	if (err)
+		goto out_undo_plain64;
+
+	err = crypto_register_template(&crypto_benbi_tmpl);
+	if (err)
+		goto out_undo_essiv;
+
+	err = crypto_register_template(&crypto_null_tmpl);
+	if (err)
+		goto out_undo_benbi;
+
+	err = crypto_register_template(&crypto_lmk_tmpl);
+	if (err)
+		goto out_undo_null;
+
+	err = crypto_register_template(&crypto_tcw_tmpl);
+	if (!err)
+		goto out;
+
+	crypto_unregister_template(&crypto_lmk_tmpl);
+out_undo_null:
+	crypto_unregister_template(&crypto_null_tmpl);
+out_undo_benbi:
+	crypto_unregister_template(&crypto_benbi_tmpl);
+out_undo_essiv:
+	crypto_unregister_template(&crypto_essiv_tmpl);
+out_undo_plain64:
+	crypto_unregister_template(&crypto_plain64_tmpl);
+out_undo_plain:
+	crypto_unregister_template(&crypto_plain_tmpl);
+out:
+	return err;
+}
+
+static void __exit crypto_geniv_module_exit(void)
+{
+	crypto_unregister_template(&crypto_plain_tmpl);
+	crypto_unregister_template(&crypto_plain64_tmpl);
+	crypto_unregister_template(&crypto_essiv_tmpl);
+	crypto_unregister_template(&crypto_benbi_tmpl);
+	crypto_unregister_template(&crypto_null_tmpl);
+	crypto_unregister_template(&crypto_lmk_tmpl);
+	crypto_unregister_template(&crypto_tcw_tmpl);
+}
+
+module_init(crypto_geniv_module_init);
+module_exit(crypto_geniv_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("IV generation algorithms");
+MODULE_ALIAS_CRYPTO("geniv");
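
Note (not part of the patch): crypto_geniv_module_init() above unwinds by
hand, with one label per template. A minimal userspace sketch of the same
register-or-roll-back idea, driven by an array, might look like the
following. struct demo_template and the demo_* functions are hypothetical
stand-ins for illustration only; they are not struct crypto_template,
crypto_register_template() or crypto_unregister_template().

```c
#include <stddef.h>

/* Hypothetical stand-in for a template descriptor (not the kernel type). */
struct demo_template {
	const char *name;
	int registered;
	int fail;	/* set to 1 to simulate a registration failure */
};

/* Hypothetical stand-in for a register call: 0 on success, -1 on failure. */
static int demo_register(struct demo_template *t)
{
	if (t->fail)
		return -1;
	t->registered = 1;
	return 0;
}

/* Hypothetical stand-in for the matching unregister call. */
static void demo_unregister(struct demo_template *t)
{
	t->registered = 0;
}

/*
 * Register every entry in order. On the first failure, unregister the
 * entries that already succeeded, in reverse order, and return the error.
 */
static int demo_register_all(struct demo_template *tmpls, size_t count)
{
	size_t i;
	int err = 0;

	for (i = 0; i < count; i++) {
		err = demo_register(&tmpls[i]);
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	/* i indexes the entry that failed; roll back entries i-1 .. 0. */
	while (i-- > 0)
		demo_unregister(&tmpls[i]);
	return err;
}
```

With a table, adding an eighth IV mode would mean adding one array entry,
and the rollback order stays correct without a new label.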
+
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index a276883..05c2677 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -29,26 +29,13 @@
 #include <crypto/md5.h>
 #include <crypto/algapi.h>
 #include <crypto/skcipher.h>
+#include <crypto/geniv.h>
 
 #include <linux/device-mapper.h>
 
 #define DM_MSG_PREFIX "crypt"
 
 /*
- * context holding the current state of a multi-part conversion
- */
-struct convert_context {
-	struct completion restart;
-	struct bio *bio_in;
-	struct bio *bio_out;
-	struct bvec_iter iter_in;
-	struct bvec_iter iter_out;
-	sector_t cc_sector;
-	atomic_t cc_pending;
-	struct skcipher_request *req;
-};
-
-/*
  * per bio private data
  */
 struct dm_crypt_io {
@@ -65,13 +52,6 @@ struct dm_crypt_io {
 	struct rb_node rb_node;
 } CRYPTO_MINALIGN_ATTR;
 
-struct dm_crypt_request {
-	struct convert_context *ctx;
-	struct scatterlist sg_in;
-	struct scatterlist sg_out;
-	sector_t iv_sector;
-};
-
 struct crypt_config;
 
 struct crypt_iv_operations {
@@ -141,7 +121,6 @@ struct crypt_config {
 	char *cipher;
 	char *cipher_string;
 
-	struct crypt_iv_operations *iv_gen_ops;
 	union {
 		struct iv_essiv_private essiv;
 		struct iv_benbi_private benbi;
@@ -241,567 +220,6 @@ static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
  * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454
  */
 
-static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
-{
-	memset(iv, 0, cc->iv_size);
-	*(__le32 *)iv = cpu_to_le32(dmreq->iv_sector & 0xffffffff);
-
-	return 0;
-}
-
-static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
-				struct dm_crypt_request *dmreq)
-{
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
-
-	return 0;
-}
-
-/* Initialise ESSIV - compute salt but no local memory allocations */
-static int crypt_iv_essiv_init(struct crypt_config *cc)
-{
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
-	struct scatterlist sg;
-	struct crypto_cipher *essiv_tfm;
-	int err;
-
-	sg_init_one(&sg, cc->key, cc->key_size);
-	ahash_request_set_tfm(req, essiv->hash_tfm);
-	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
-	ahash_request_set_crypt(req, &sg, essiv->salt, cc->key_size);
-
-	err = crypto_ahash_digest(req);
-	ahash_request_zero(req);
-	if (err)
-		return err;
-
-	essiv_tfm = cc->iv_private;
-
-	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
-			    crypto_ahash_digestsize(essiv->hash_tfm));
-	if (err)
-		return err;
-
-	return 0;
-}
-
-/* Wipe salt and reset key derived from volume key */
-static int crypt_iv_essiv_wipe(struct crypt_config *cc)
-{
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	unsigned salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
-	struct crypto_cipher *essiv_tfm;
-	int r, err = 0;
-
-	memset(essiv->salt, 0, salt_size);
-
-	essiv_tfm = cc->iv_private;
-	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
-	if (r)
-		err = r;
-
-	return err;
-}
-
-/* Set up per cpu cipher state */
-static struct crypto_cipher *setup_essiv_cpu(struct crypt_config *cc,
-					     struct dm_target *ti,
-					     u8 *salt, unsigned saltsize)
-{
-	struct crypto_cipher *essiv_tfm;
-	int err;
-
-	/* Setup the essiv_tfm with the given salt */
-	essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
-	if (IS_ERR(essiv_tfm)) {
-		ti->error = "Error allocating crypto tfm for ESSIV";
-		return essiv_tfm;
-	}
-
-	if (crypto_cipher_blocksize(essiv_tfm) !=
-	    crypto_skcipher_ivsize(any_tfm(cc))) {
-		ti->error = "Block size of ESSIV cipher does "
-			    "not match IV size of block cipher";
-		crypto_free_cipher(essiv_tfm);
-		return ERR_PTR(-EINVAL);
-	}
-
-	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
-	if (err) {
-		ti->error = "Failed to set key for ESSIV cipher";
-		crypto_free_cipher(essiv_tfm);
-		return ERR_PTR(err);
-	}
-
-	return essiv_tfm;
-}
-
-static void crypt_iv_essiv_dtr(struct crypt_config *cc)
-{
-	struct crypto_cipher *essiv_tfm;
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-
-	crypto_free_ahash(essiv->hash_tfm);
-	essiv->hash_tfm = NULL;
-
-	kzfree(essiv->salt);
-	essiv->salt = NULL;
-
-	essiv_tfm = cc->iv_private;
-
-	if (essiv_tfm)
-		crypto_free_cipher(essiv_tfm);
-
-	cc->iv_private = NULL;
-}
-
-static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
-{
-	struct crypto_cipher *essiv_tfm = NULL;
-	struct crypto_ahash *hash_tfm = NULL;
-	u8 *salt = NULL;
-	int err;
-
-	if (!opts) {
-		ti->error = "Digest algorithm missing for ESSIV mode";
-		return -EINVAL;
-	}
-
-	/* Allocate hash algorithm */
-	hash_tfm = crypto_alloc_ahash(opts, 0, CRYPTO_ALG_ASYNC);
-	if (IS_ERR(hash_tfm)) {
-		ti->error = "Error initializing ESSIV hash";
-		err = PTR_ERR(hash_tfm);
-		goto bad;
-	}
-
-	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
-	if (!salt) {
-		ti->error = "Error kmallocing salt storage in ESSIV";
-		err = -ENOMEM;
-		goto bad;
-	}
-
-	cc->iv_gen_private.essiv.salt = salt;
-	cc->iv_gen_private.essiv.hash_tfm = hash_tfm;
-
-	essiv_tfm = setup_essiv_cpu(cc, ti, salt,
-				crypto_ahash_digestsize(hash_tfm));
-	if (IS_ERR(essiv_tfm)) {
-		crypt_iv_essiv_dtr(cc);
-		return PTR_ERR(essiv_tfm);
-	}
-	cc->iv_private = essiv_tfm;
-
-	return 0;
-
-bad:
-	if (hash_tfm && !IS_ERR(hash_tfm))
-		crypto_free_ahash(hash_tfm);
-	kfree(salt);
-	return err;
-}
-
-static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
-{
-	struct crypto_cipher *essiv_tfm = cc->iv_private;
-
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
-	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
-
-	return 0;
-}
-
-static int crypt_iv_benbi_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
-{
-	unsigned bs = crypto_skcipher_blocksize(any_tfm(cc));
-	int log = ilog2(bs);
-
-	/* we need to calculate how far we must shift the sector count
-	 * to get the cipher block count, we use this shift in _gen */
-
-	if (1 << log != bs) {
-		ti->error = "cypher blocksize is not a power of 2";
-		return -EINVAL;
-	}
-
-	if (log > 9) {
-		ti->error = "cypher blocksize is > 512";
-		return -EINVAL;
-	}
-
-	cc->iv_gen_private.benbi.shift = 9 - log;
-
-	return 0;
-}
-
-static void crypt_iv_benbi_dtr(struct crypt_config *cc)
-{
-}
-
-static int crypt_iv_benbi_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
-{
-	__be64 val;
-
-	memset(iv, 0, cc->iv_size - sizeof(u64)); /* rest is cleared below */
-
-	val = cpu_to_be64(((u64)dmreq->iv_sector << cc->iv_gen_private.benbi.shift) + 1);
-	put_unaligned(val, (__be64 *)(iv + cc->iv_size - sizeof(u64)));
-
-	return 0;
-}
-
-static int crypt_iv_null_gen(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	memset(iv, 0, cc->iv_size);
-
-	return 0;
-}
-
-static void crypt_iv_lmk_dtr(struct crypt_config *cc)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-
-	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
-		crypto_free_shash(lmk->hash_tfm);
-	lmk->hash_tfm = NULL;
-
-	kzfree(lmk->seed);
-	lmk->seed = NULL;
-}
-
-static int crypt_iv_lmk_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-
-	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
-	if (IS_ERR(lmk->hash_tfm)) {
-		ti->error = "Error initializing LMK hash";
-		return PTR_ERR(lmk->hash_tfm);
-	}
-
-	/* No seed in LMK version 2 */
-	if (cc->key_parts == cc->tfms_count) {
-		lmk->seed = NULL;
-		return 0;
-	}
-
-	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
-	if (!lmk->seed) {
-		crypt_iv_lmk_dtr(cc);
-		ti->error = "Error kmallocing seed storage in LMK";
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static int crypt_iv_lmk_init(struct crypt_config *cc)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	int subkey_size = cc->key_size / cc->key_parts;
-
-	/* LMK seed is on the position of LMK_KEYS + 1 key */
-	if (lmk->seed)
-		memcpy(lmk->seed, cc->key + (cc->tfms_count * subkey_size),
-		       crypto_shash_digestsize(lmk->hash_tfm));
-
-	return 0;
-}
-
-static int crypt_iv_lmk_wipe(struct crypt_config *cc)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-
-	if (lmk->seed)
-		memset(lmk->seed, 0, LMK_SEED_SIZE);
-
-	return 0;
-}
-
-static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq,
-			    u8 *data)
-{
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
-	struct md5_state md5state;
-	__le32 buf[4];
-	int i, r;
-
-	desc->tfm = lmk->hash_tfm;
-	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
-
-	r = crypto_shash_init(desc);
-	if (r)
-		return r;
-
-	if (lmk->seed) {
-		r = crypto_shash_update(desc, lmk->seed, LMK_SEED_SIZE);
-		if (r)
-			return r;
-	}
-
-	/* Sector is always 512B, block size 16, add data of blocks 1-31 */
-	r = crypto_shash_update(desc, data + 16, 16 * 31);
-	if (r)
-		return r;
-
-	/* Sector is cropped to 56 bits here */
-	buf[0] = cpu_to_le32(dmreq->iv_sector & 0xFFFFFFFF);
-	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
-	buf[2] = cpu_to_le32(4024);
-	buf[3] = 0;
-	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
-	if (r)
-		return r;
-
-	/* No MD5 padding here */
-	r = crypto_shash_export(desc, &md5state);
-	if (r)
-		return r;
-
-	for (i = 0; i < MD5_HASH_WORDS; i++)
-		__cpu_to_le32s(&md5state.hash[i]);
-	memcpy(iv, &md5state.hash, cc->iv_size);
-
-	return 0;
-}
-
-static int crypt_iv_lmk_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	u8 *src;
-	int r = 0;
-
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_lmk_one(cc, iv, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	} else
-		memset(iv, 0, cc->iv_size);
-
-	return r;
-}
-
-static int crypt_iv_lmk_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
-
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE)
-		return 0;
-
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_lmk_one(cc, iv, dmreq, dst + dmreq->sg_out.offset);
-
-	/* Tweak the first block of plaintext sector */
-	if (!r)
-		crypto_xor(dst + dmreq->sg_out.offset, iv, cc->iv_size);
-
-	kunmap_atomic(dst);
-	return r;
-}
-
-static void crypt_iv_tcw_dtr(struct crypt_config *cc)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-
-	kzfree(tcw->iv_seed);
-	tcw->iv_seed = NULL;
-	kzfree(tcw->whitening);
-	tcw->whitening = NULL;
-
-	if (tcw->crc32_tfm && !IS_ERR(tcw->crc32_tfm))
-		crypto_free_shash(tcw->crc32_tfm);
-	tcw->crc32_tfm = NULL;
-}
-
-static int crypt_iv_tcw_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-
-	if (cc->key_size <= (cc->iv_size + TCW_WHITENING_SIZE)) {
-		ti->error = "Wrong key size for TCW";
-		return -EINVAL;
-	}
-
-	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
-	if (IS_ERR(tcw->crc32_tfm)) {
-		ti->error = "Error initializing CRC32 in TCW";
-		return PTR_ERR(tcw->crc32_tfm);
-	}
-
-	tcw->iv_seed = kzalloc(cc->iv_size, GFP_KERNEL);
-	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
-	if (!tcw->iv_seed || !tcw->whitening) {
-		crypt_iv_tcw_dtr(cc);
-		ti->error = "Error allocating seed storage in TCW";
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static int crypt_iv_tcw_init(struct crypt_config *cc)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	int key_offset = cc->key_size - cc->iv_size - TCW_WHITENING_SIZE;
-
-	memcpy(tcw->iv_seed, &cc->key[key_offset], cc->iv_size);
-	memcpy(tcw->whitening, &cc->key[key_offset + cc->iv_size],
-	       TCW_WHITENING_SIZE);
-
-	return 0;
-}
-
-static int crypt_iv_tcw_wipe(struct crypt_config *cc)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-
-	memset(tcw->iv_seed, 0, cc->iv_size);
-	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
-
-	return 0;
-}
-
-static int crypt_iv_tcw_whitening(struct crypt_config *cc,
-				  struct dm_crypt_request *dmreq,
-				  u8 *data)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 buf[TCW_WHITENING_SIZE];
-	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
-	int i, r;
-
-	/* xor whitening with sector number */
-	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
-	crypto_xor(buf, (u8 *)&sector, 8);
-	crypto_xor(&buf[8], (u8 *)&sector, 8);
-
-	/* calculate crc32 for every 32bit part and xor it */
-	desc->tfm = tcw->crc32_tfm;
-	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
-	for (i = 0; i < 4; i++) {
-		r = crypto_shash_init(desc);
-		if (r)
-			goto out;
-		r = crypto_shash_update(desc, &buf[i * 4], 4);
-		if (r)
-			goto out;
-		r = crypto_shash_final(desc, &buf[i * 4]);
-		if (r)
-			goto out;
-	}
-	crypto_xor(&buf[0], &buf[12], 4);
-	crypto_xor(&buf[4], &buf[8], 4);
-
-	/* apply whitening (8 bytes) to whole sector */
-	for (i = 0; i < ((1 << SECTOR_SHIFT) / 8); i++)
-		crypto_xor(data + i * 8, buf, 8);
-out:
-	memzero_explicit(buf, sizeof(buf));
-	return r;
-}
-
-static int crypt_iv_tcw_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 *src;
-	int r = 0;
-
-	/* Remove whitening from ciphertext */
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_tcw_whitening(cc, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	}
-
-	/* Calculate IV */
-	memcpy(iv, tcw->iv_seed, cc->iv_size);
-	crypto_xor(iv, (u8 *)&sector, 8);
-	if (cc->iv_size > 8)
-		crypto_xor(&iv[8], (u8 *)&sector, cc->iv_size - 8);
-
-	return r;
-}
-
-static int crypt_iv_tcw_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
-
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE)
-		return 0;
-
-	/* Apply whitening on ciphertext */
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_tcw_whitening(cc, dmreq, dst + dmreq->sg_out.offset);
-	kunmap_atomic(dst);
-
-	return r;
-}
-
-static struct crypt_iv_operations crypt_iv_plain_ops = {
-	.generator = crypt_iv_plain_gen
-};
-
-static struct crypt_iv_operations crypt_iv_plain64_ops = {
-	.generator = crypt_iv_plain64_gen
-};
-
-static struct crypt_iv_operations crypt_iv_essiv_ops = {
-	.ctr       = crypt_iv_essiv_ctr,
-	.dtr       = crypt_iv_essiv_dtr,
-	.init      = crypt_iv_essiv_init,
-	.wipe      = crypt_iv_essiv_wipe,
-	.generator = crypt_iv_essiv_gen
-};
-
-static struct crypt_iv_operations crypt_iv_benbi_ops = {
-	.ctr	   = crypt_iv_benbi_ctr,
-	.dtr	   = crypt_iv_benbi_dtr,
-	.generator = crypt_iv_benbi_gen
-};
-
-static struct crypt_iv_operations crypt_iv_null_ops = {
-	.generator = crypt_iv_null_gen
-};
-
-static struct crypt_iv_operations crypt_iv_lmk_ops = {
-	.ctr	   = crypt_iv_lmk_ctr,
-	.dtr	   = crypt_iv_lmk_dtr,
-	.init	   = crypt_iv_lmk_init,
-	.wipe	   = crypt_iv_lmk_wipe,
-	.generator = crypt_iv_lmk_gen,
-	.post	   = crypt_iv_lmk_post
-};
-
-static struct crypt_iv_operations crypt_iv_tcw_ops = {
-	.ctr	   = crypt_iv_tcw_ctr,
-	.dtr	   = crypt_iv_tcw_dtr,
-	.init	   = crypt_iv_tcw_init,
-	.wipe	   = crypt_iv_tcw_wipe,
-	.generator = crypt_iv_tcw_gen,
-	.post	   = crypt_iv_tcw_post
-};
-
 static void crypt_convert_init(struct crypt_config *cc,
 			       struct convert_context *ctx,
 			       struct bio *bio_out, struct bio *bio_in,
@@ -862,12 +280,6 @@ static int crypt_convert_block(struct crypt_config *cc,
 	bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
 	bio_advance_iter(ctx->bio_out, &ctx->iter_out, 1 << SECTOR_SHIFT);
 
-	if (cc->iv_gen_ops) {
-		r = cc->iv_gen_ops->generator(cc, iv, dmreq);
-		if (r < 0)
-			return r;
-	}
-
 	skcipher_request_set_crypt(req, &dmreq->sg_in, &dmreq->sg_out,
 				   1 << SECTOR_SHIFT, iv);
 
@@ -876,9 +288,6 @@ static int crypt_convert_block(struct crypt_config *cc,
 	else
 		r = crypto_skcipher_decrypt(req);
 
-	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		r = cc->iv_gen_ops->post(cc, iv, dmreq);
-
 	return r;
 }
 
@@ -1363,19 +772,6 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 	struct dm_crypt_io *io = container_of(ctx, struct dm_crypt_io, ctx);
 	struct crypt_config *cc = io->cc;
 
-	/*
-	 * A request from crypto driver backlog is going to be processed now,
-	 * finish the completion and continue in crypt_convert().
-	 * (Callback will be called for the second time for this request.)
-	 */
-	if (error == -EINPROGRESS) {
-		complete(&ctx->restart);
-		return;
-	}
-
-	if (!error && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		error = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
-
 	if (error < 0)
 		io->error = -EIO;
 
@@ -1517,6 +913,39 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	return r;
 }
 
+static void crypt_init_context(struct dm_target *ti, char *key,
+			      struct crypto_skcipher *tfm,
+			      char *ivmode, char *ivopts)
+{
+	struct crypt_config *cc = ti->private;
+	struct geniv_ctx *ctx = (struct geniv_ctx *) (tfm + 1);
+
+	ctx->data.iv_size = crypto_skcipher_ivsize(tfm);
+	ctx->data.cipher = cc->cipher;
+	ctx->data.ivmode = ivmode;
+	ctx->data.tfms_count = cc->tfms_count;
+	ctx->data.tfm = tfm;
+	ctx->data.ivopts = ivopts;
+	ctx->data.key_size = cc->key_size;
+	ctx->data.key_parts = cc->key_parts;
+	ctx->data.key = cc->key;
+}
+
+static int crypt_init_all_cpus(struct dm_target *ti, char *key,
+			       char *ivmode, char *ivopts)
+{
+	struct crypt_config *cc = ti->private;
+	int ret, i;
+
+	for (i = 0; i < cc->tfms_count; i++)
+		crypt_init_context(ti, key, cc->tfms[i], ivmode, ivopts);
+
+	ret = crypt_set_key(cc, key);
+	if (ret < 0)
+		ti->error = "Error decoding and setting key";
+	return ret;
+}
+
 static int crypt_wipe_key(struct crypt_config *cc)
 {
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
@@ -1550,9 +979,6 @@ static void crypt_dtr(struct dm_target *ti)
 	mempool_destroy(cc->page_pool);
 	mempool_destroy(cc->req_pool);
 
-	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
-		cc->iv_gen_ops->dtr(cc);
-
 	if (cc->dev)
 		dm_put_device(ti, cc->dev);
 
@@ -1629,8 +1055,14 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	if (!cipher_api)
 		goto bad_mem;
 
-	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME,
-		       "%s(%s)", chainmode, cipher);
+create_cipher:
+	/* Call underlying cipher directly if it does not support iv */
+	if (ivmode)
+		ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME, "%s(%s(%s))",
+				ivmode, chainmode, cipher);
+	else
+		ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+				chainmode, cipher);
 	if (ret < 0) {
 		kfree(cipher_api);
 		goto bad_mem;
@@ -1652,23 +1084,10 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	else if (ivmode) {
 		DMWARN("Selected cipher does not support IVs");
 		ivmode = NULL;
+		goto create_cipher;
 	}
 
-	/* Choose ivmode, see comments at iv code. */
-	if (ivmode == NULL)
-		cc->iv_gen_ops = NULL;
-	else if (strcmp(ivmode, "plain") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain_ops;
-	else if (strcmp(ivmode, "plain64") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain64_ops;
-	else if (strcmp(ivmode, "essiv") == 0)
-		cc->iv_gen_ops = &crypt_iv_essiv_ops;
-	else if (strcmp(ivmode, "benbi") == 0)
-		cc->iv_gen_ops = &crypt_iv_benbi_ops;
-	else if (strcmp(ivmode, "null") == 0)
-		cc->iv_gen_ops = &crypt_iv_null_ops;
-	else if (strcmp(ivmode, "lmk") == 0) {
-		cc->iv_gen_ops = &crypt_iv_lmk_ops;
+	if (strcmp(ivmode, "lmk") == 0) {
 		/*
 		 * Version 2 and 3 is recognised according
 		 * to length of provided multi-key string.
@@ -1680,39 +1099,14 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 			cc->key_extra_size = cc->key_size / cc->key_parts;
 		}
 	} else if (strcmp(ivmode, "tcw") == 0) {
-		cc->iv_gen_ops = &crypt_iv_tcw_ops;
 		cc->key_parts += 2; /* IV + whitening */
 		cc->key_extra_size = cc->iv_size + TCW_WHITENING_SIZE;
-	} else {
-		ret = -EINVAL;
-		ti->error = "Invalid IV mode";
-		goto bad;
 	}
 
 	/* Initialize and set key */
-	ret = crypt_set_key(cc, key);
-	if (ret < 0) {
-		ti->error = "Error decoding and setting key";
+	ret = crypt_init_all_cpus(ti, key, ivmode, ivopts);
+	if (ret < 0)
 		goto bad;
-	}
-
-	/* Allocate IV */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->ctr) {
-		ret = cc->iv_gen_ops->ctr(cc, ti, ivopts);
-		if (ret < 0) {
-			ti->error = "Error creating IV";
-			goto bad;
-		}
-	}
-
-	/* Initialize IV (set keys for ESSIV etc) */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->init) {
-		ret = cc->iv_gen_ops->init(cc);
-		if (ret < 0) {
-			ti->error = "Error initialising IV";
-			goto bad;
-		}
-	}
 
 	ret = 0;
 bad:
@@ -2007,6 +1401,18 @@ static void crypt_resume(struct dm_target *ti)
 	clear_bit(DM_CRYPT_SUSPENDED, &cc->flags);
 }
 
+static void crypt_setkey_op_allcpus(struct crypt_config *cc,
+				    enum setkey_op keyop)
+{
+	int i;
+	struct geniv_ctx *ctx;
+
+	for (i = 0; i < cc->tfms_count; i++) {
+		ctx = (struct geniv_ctx *) (cc->tfms[i] + 1);
+		ctx->data.keyop = keyop;
+	}
+}
+
 /* Message interface
  *	key set <key>
  *	key wipe
@@ -2014,7 +1420,6 @@ static void crypt_resume(struct dm_target *ti)
 static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 {
 	struct crypt_config *cc = ti->private;
-	int ret = -EINVAL;
 
 	if (argc < 2)
 		goto error;
@@ -2025,19 +1430,11 @@ static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 			return -EINVAL;
 		}
 		if (argc == 3 && !strcasecmp(argv[1], "set")) {
-			ret = crypt_set_key(cc, argv[2]);
-			if (ret)
-				return ret;
-			if (cc->iv_gen_ops && cc->iv_gen_ops->init)
-				ret = cc->iv_gen_ops->init(cc);
-			return ret;
+			crypt_setkey_op_allcpus(cc, SETKEY_OP_SET);
+			return crypt_set_key(cc, argv[2]);
 		}
 		if (argc == 2 && !strcasecmp(argv[1], "wipe")) {
-			if (cc->iv_gen_ops && cc->iv_gen_ops->wipe) {
-				ret = cc->iv_gen_ops->wipe(cc);
-				if (ret)
-					return ret;
-			}
+			crypt_setkey_op_allcpus(cc, SETKEY_OP_WIPE);
 			return crypt_wipe_key(cc);
 		}
 	}
diff --git a/include/crypto/geniv.h b/include/crypto/geniv.h
new file mode 100644
index 0000000..1325843
--- /dev/null
+++ b/include/crypto/geniv.h
@@ -0,0 +1,109 @@
+/*
+ * geniv: common data structures for IV generation algorithms
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef _CRYPTO_GENIV_
+#define _CRYPTO_GENIV_
+
+#define SECTOR_SHIFT            9
+
+struct geniv_essiv_private {
+	struct crypto_ahash *hash_tfm;
+	u8 *salt;
+};
+
+struct geniv_benbi_private {
+	int shift;
+};
+
+#define LMK_SEED_SIZE 64 /* hash + 0 */
+struct geniv_lmk_private {
+	struct crypto_shash *hash_tfm;
+	u8 *seed;
+};
+
+#define TCW_WHITENING_SIZE 16
+struct geniv_tcw_private {
+	struct crypto_shash *crc32_tfm;
+	u8 *iv_seed;
+	u8 *whitening;
+};
+
+enum setkey_op {
+	SETKEY_OP_INIT,
+	SETKEY_OP_SET,
+	SETKEY_OP_WIPE,
+};
+
+/*
+ * context holding the current state of a multi-part conversion
+ */
+struct convert_context {
+	struct completion restart;
+	struct bio *bio_in;
+	struct bio *bio_out;
+	struct bvec_iter iter_in;
+	struct bvec_iter iter_out;
+	sector_t cc_sector;
+	atomic_t cc_pending;
+	struct skcipher_request *req;
+};
+
+struct dm_crypt_request {
+	struct convert_context *ctx;
+	struct scatterlist sg_in;
+	struct scatterlist sg_out;
+	sector_t iv_sector;
+};
+
+
+struct geniv_ctx_data;
+
+struct geniv_operations {
+	int (*ctr)(struct geniv_ctx_data *cd);
+	void (*dtr)(struct geniv_ctx_data *cd);
+	int (*init)(struct geniv_ctx_data *cd);
+	int (*wipe)(struct geniv_ctx_data *cd);
+	int (*generator)(struct geniv_ctx_data *req, u8 *iv,
+			 struct dm_crypt_request *dmreq);
+	int (*post)(struct geniv_ctx_data *cd, u8 *iv,
+			 struct dm_crypt_request *dmreq);
+};
+
+struct geniv_ctx_data {
+	unsigned int tfms_count;
+	char *ivmode;
+	unsigned int iv_size;
+	char *ivopts;
+	unsigned int dmoffset;
+
+	char *cipher;
+	struct geniv_operations *iv_gen_ops;
+	union {
+		struct geniv_essiv_private essiv;
+		struct geniv_benbi_private benbi;
+		struct geniv_lmk_private lmk;
+		struct geniv_tcw_private tcw;
+	} iv_gen_private;
+	void *iv_private;
+	struct crypto_skcipher *tfm;
+	unsigned int key_size;
+	unsigned int key_extra_size;
+	unsigned int key_parts;      /* independent parts in key buffer */
+	enum setkey_op keyop;
+	char *msg;
+	u8 *key;
+};
+
+struct geniv_ctx {
+	struct crypto_skcipher *child;
+	struct geniv_ctx_data data;
+};
+
+#endif
+
diff --git a/include/crypto/skcipher.h b/include/crypto/skcipher.h
index cc4d98a..290c848 100644
--- a/include/crypto/skcipher.h
+++ b/include/crypto/skcipher.h
@@ -122,6 +122,8 @@ struct crypto_skcipher {
 struct skcipher_alg {
 	int (*setkey)(struct crypto_skcipher *tfm, const u8 *key,
 	              unsigned int keylen);
+	int (*set_ctx)(struct crypto_skcipher *tfm, void *ctx,
+		       unsigned int len);
 	int (*encrypt)(struct skcipher_request *req);
 	int (*decrypt)(struct skcipher_request *req);
 	int (*init)(struct crypto_skcipher *tfm);
@@ -366,6 +368,21 @@ static inline int crypto_skcipher_setkey(struct crypto_skcipher *tfm,
 {
 	return tfm->setkey(tfm, key, keylen);
 }
+/**
+ * crypto_skcipher_set_ctx() - set initial context for cipher
+ * @tfm: cipher handle
+ * @ctx: buffer holding the context data
+ * @len: length of the context data structure
+ *
+ */
+static inline void crypto_skcipher_set_ctx(struct crypto_skcipher *tfm,
+					 void *ctx, unsigned int len)
+{
+	struct skcipher_alg *alg = crypto_skcipher_alg(tfm);
+
+	alg->set_ctx(tfm, ctx, len);
+}
+
 
 static inline bool crypto_skcipher_has_setkey(struct crypto_skcipher *tfm)
 {
-- 
Binoy Jayan

^ permalink raw reply related

* [RFC PATCH] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2016-11-21 10:10 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Binoy Jayan

===============================================================================
GENIV Template cipher
===============================================================================

Currently, the IV generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers, so that they can be combined with
ciphers like aes and with multiple modes like cbc, ecb etc. As part of this
patchset, the IV generation code is moved from the dm layer to the crypto layer.
The dm layer can later be optimized to encrypt larger block sizes in a single
call to the crypto engine.

One challenge in doing so is 'essiv', which creates the IV by hashing the
512-byte sector number. This in fact limits the block size to 512 bytes.
A way to get around this problem has to be explored. Another thing to note is
that the algorithms share their context data structures (cipher context and
request context) with the caller, i.e. dm-crypt here. I am not sure whether
this coupling is acceptable. If not, it has to be removed. A new crypto API,
'crypto_skcipher_set_ctx', defined in 'include/crypto/skcipher.h', was
initially written to address this but is not used now. Even if it were used,
the data structure definitions would still be shared.
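
To make the coupling concrete: the patch reaches the shared context with
'(struct geniv_ctx *) (tfm + 1)', i.e. it assumes the context sits directly
behind the transform structure. The userspace sketch below only models that
layout. 'toy_tfm', 'toy_ctx' and the helper names are made up for illustration
and are not kernel types; only 'enum setkey_op' is copied from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Copied from include/crypto/geniv.h in the patch. */
enum setkey_op {
	SETKEY_OP_INIT,
	SETKEY_OP_SET,
	SETKEY_OP_WIPE,
};

/* Hypothetical stand-in for struct crypto_skcipher. */
struct toy_tfm {
	unsigned int ivsize;
};

/* Hypothetical stand-in for struct geniv_ctx. */
struct toy_ctx {
	enum setkey_op keyop;
	unsigned int key_size;
};

/* The (tfm + 1) cast is only valid if the context needs no extra padding. */
_Static_assert(_Alignof(struct toy_ctx) <= _Alignof(struct toy_tfm),
	       "context must not need stricter alignment than the tfm");

/* Allocate the transform and its trailing context in one zeroed block. */
static struct toy_tfm *toy_tfm_alloc(unsigned int ivsize)
{
	struct toy_tfm *tfm;

	tfm = calloc(1, sizeof(struct toy_tfm) + sizeof(struct toy_ctx));
	if (tfm)
		tfm->ivsize = ivsize;
	return tfm;
}

/* Same pointer arithmetic as the patch: the context starts right after tfm. */
static struct toy_ctx *toy_tfm_ctx(struct toy_tfm *tfm)
{
	assert(tfm);
	return (struct toy_ctx *)(tfm + 1);
}

/* Models crypt_setkey_op_allcpus(): tag every instance with the operation. */
static void toy_setkey_op_all(struct toy_tfm **tfms, unsigned int count,
			      enum setkey_op op)
{
	for (unsigned int i = 0; i < count; i++)
		toy_tfm_ctx(tfms[i])->keyop = op;
}
```

The caller writing into memory owned by the cipher instance is exactly the
sharing described above: both sides must agree on the structure definition.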

The following ASCII art decomposes the kernel crypto API layers when using the
skcipher with automated IV generation. The example shown is the one used by the
DM layer. The ASCII art applies to other use cases of cbc(aes) as well, but
such a caller may not use a separate IV generator, in which case the caller
must generate the IV itself. The depicted example decomposes <geniv>(cbc(aes))
based on the generic C implementations (geniv.c, cbc.c and aes-generic.c).
The generic implementation depicts the dependency between the template ciphers
used in implementing geniv using the kernel crypto API.
Here, <geniv> indicates one of the following algorithms:

1. plain
2. plain64
3. essiv
4. benbi
5. null
6. lmk
7. tcw
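
For orientation, the simplest of these generators can be sketched in a few
lines of userspace C. This is an illustration based on the descriptions in the
dm-crypt source comments (plain: 32-bit little-endian sector number, plain64:
64-bit little-endian, null: all zeros, benbi: 64-bit big-endian block count
starting at 1), not the code from geniv.c. The function names and the 'shift'
parameter (9 minus log2 of the cipher block size, so 5 for AES) are chosen
here for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store the low n bytes of v into out in little-endian order (n <= 8). */
static void put_le(uint8_t *out, uint64_t v, size_t n)
{
	for (size_t i = 0; i < n; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}

/* plain: low 32 bits of the sector number, little-endian, zero padded. */
static void iv_plain(uint8_t *iv, size_t iv_size, uint64_t sector)
{
	assert(iv_size >= 4);
	memset(iv, 0, iv_size);
	put_le(iv, sector & 0xffffffffu, 4);
}

/* plain64: full 64-bit sector number, little-endian, zero padded. */
static void iv_plain64(uint8_t *iv, size_t iv_size, uint64_t sector)
{
	assert(iv_size >= 8);
	memset(iv, 0, iv_size);
	put_le(iv, sector, 8);
}

/* null: the IV is always zero. */
static void iv_null(uint8_t *iv, size_t iv_size)
{
	memset(iv, 0, iv_size);
}

/*
 * benbi: 64-bit big-endian count of cipher blocks, starting at 1,
 * stored in the last 8 bytes of the IV.
 */
static void iv_benbi(uint8_t *iv, size_t iv_size, uint64_t sector,
		     unsigned int shift)
{
	uint64_t count = (sector << shift) + 1;

	assert(iv_size >= 8);
	memset(iv, 0, iv_size);
	for (size_t i = 0; i < 8; i++)
		iv[iv_size - 1 - i] = (uint8_t)(count >> (8 * i));
}
```

essiv, lmk and tcw additionally need hash or cipher state and are left out of
this sketch.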

It is possible that some streamlined cipher implementations (like AES-NI)
provide implementations merging aspects which in the view of the kernel crypto
API cannot be decomposed into layers any more. Each block in the following
ASCII art is an independent cipher instance obtained from the kernel crypto
API. Each block is accessed by the caller or by other blocks using the API
functions defined by the kernel crypto API for the cipher implementation type.
The blocks below indicate the cipher type as well as the specific logic
implemented in the cipher.

The ASCII art picture also indicates the call structure, i.e. who calls which
component. The arrows point to the invoked block where the caller uses the API
applicable to the cipher type specified for the block. For the purpose of
illustration, here we take the example of the aes mode 'cbc'. However, the IV
generation algorithm could be used with other aes modes like ecb as well.

-------------------------------------------------------------------------------
Geniv implementation
-------------------------------------------------------------------------------

NB: The ASCII art below is best viewed in a fixed-width font.

                             crypt_convert_block()              (DM Layer)
                                      |
                                      | (1)
                                      |
                                      v
+------------+   +-----------+   +-----------+       +-----------+
|            |   |           |   |           |  (2)  |           |
|  skcipher  |   | skcipher  |   | skcipher  |----+  | skcipher  |   Blocks for
| (plain/64) |   | (benbi)   |   | (essiv)   |    |  |  (null)   |   lmk, tcw
+------------+   +-----------+   +-----------+    |  +-----------+ 
     |                |               |           v         |
     | (3)            | (3)      (3)  |    +-----------+    |
     |                |               |    |           |    |
     |                |               |    |   ahash   |    | (3)
     |                |               |    |           |    |
     |                |               |    +-----------+    |
     |                |               v                     |     (Crypto API
     |                |         +-----------+               |        Layer)
     |                v         |           |               |
     +------------------------> |  skcipher | <-------------+
                                |   (cbc)   |
                                +-----------+  (AES Mode Template cipher)
                                      | (4)
                                      v
                               +-----------+
                               |           |
                               |   cipher  |   (Base generic-AES cipher)
                               |   (aes)   |
                               +-----------+

The following call sequence is applicable when the DM layer triggers an
encryption operation with the crypt_convert_block() function. During
configuration, the administrator sets up the use of <geniv>(cbc(aes)) as the
template cipher. 'geniv' can be one of plain, plain64, essiv, benbi, null,
lmk, or tcw, which are all implemented as separate templates.
The following template ciphers are implemented as part of 'geniv.c':

1. plain(cbc(aes))
2. plain64(cbc(aes))
3. essiv(cbc(aes))
4. benbi(cbc(aes))
5. null(cbc(aes))
6. lmk(cbc(aes))
7. tcw(cbc(aes))
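
These names are what dm-crypt hands to the crypto API. The helper below is a
userspace sketch of the name construction done in crypt_ctr_cipher() in the
patch: with an IV mode the name is "<ivmode>(<chainmode>(<cipher>))", without
one it falls back to "<chainmode>(<cipher>)". The function name and the
truncation check are additions for this example; the patch itself only tests
for 'ret < 0'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the crypto API algorithm name into buf.
 * ivmode may be NULL when the cipher does not use an IV.
 * Returns 0 on success, -1 on error or if the name does not fit.
 */
static int build_cipher_api(char *buf, size_t len, const char *ivmode,
			    const char *chainmode, const char *cipher)
{
	int ret;

	assert(buf && chainmode && cipher);

	if (ivmode)
		ret = snprintf(buf, len, "%s(%s(%s))",
			       ivmode, chainmode, cipher);
	else
		ret = snprintf(buf, len, "%s(%s)", chainmode, cipher);

	/* snprintf reports the untruncated length; treat truncation as error. */
	if (ret < 0 || (size_t)ret >= len)
		return -1;
	return 0;
}
```

Calling it with ("essiv", "cbc", "aes") yields "essiv(cbc(aes))", and with a
NULL ivmode and ("ecb", "aes") it yields "ecb(aes)".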

The following call sequence is now depicted in the ASCII art above:

1. crypt_convert_block() invokes crypto_skcipher_encrypt() to trigger the
   encryption of a single block (i.e. sector), passing the sector number as
   the IV. For example, with essiv, the IV generation implementation is
   registered with a call to 'crypto_register_template(&crypto_essiv_tmpl)'.

2. During instantiation of the 'geniv' handle, the IV generation algorithm is
   instantiated. For the purpose of illustration, we take the example of essiv.
   In this case, the ahash cipher is instantiated to calculate the hash of the
   sector to generate the IV.

3. Now, geniv uses the skcipher api calls to invoke the associated cipher. In
   our case, during the instantiation of geniv, the cipher handle for cbc is
   provided to geniv. The geniv skcipher type implementation now invokes the
   skcipher api with the instantiated cbc(aes) cipher handle. During the
   instantiation of the cbc(aes) cipher, the cipher type generic-aes is also
   instantiated. That means that the SKCIPHER implementation of cbc(aes) only
   implements the Cipher-block chaining mode. After performing block chaining
   operation, the cipher implementation of aes is invoked. The skcipher of
   cbc(aes) now invokes the cipher api with the aes cipher handle to encrypt
   one block.
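
The delegation in step 3 can be modelled in userspace as a wrapper that first
derives the IV and then calls the wrapped cipher. Everything below is a toy
under stated assumptions: the struct, the function names and the XOR "cipher"
are invented for this sketch and merely stand in for the geniv template and
the cbc(aes) handle. No kernel crypto API call is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_IV_SIZE 16

/* In this model a cipher transforms buf in place using the given IV. */
typedef void (*toy_encrypt_fn)(uint8_t *buf, size_t len, const uint8_t *iv);
/* An IV generator turns a sector number into an IV. */
typedef void (*toy_ivgen_fn)(uint8_t *iv, uint64_t sector);

/* The outer layer: an IV generator plus a handle to the inner cipher. */
struct toy_geniv {
	toy_ivgen_fn gen_iv;
	toy_encrypt_fn child;
};

/* Stand-in for cbc(aes): XOR with the IV. This is NOT a real cipher. */
static void toy_child_encrypt(uint8_t *buf, size_t len, const uint8_t *iv)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= iv[i % TOY_IV_SIZE];
}

/* plain64-style generator: 64-bit little-endian sector number. */
static void toy_plain64_iv(uint8_t *iv, uint64_t sector)
{
	memset(iv, 0, TOY_IV_SIZE);
	for (size_t i = 0; i < 8; i++)
		iv[i] = (uint8_t)(sector >> (8 * i));
}

/*
 * The caller passes only the sector number. The outer layer derives the
 * IV and then hands the request to the inner cipher.
 */
static void toy_geniv_encrypt(const struct toy_geniv *g, uint8_t *buf,
			      size_t len, uint64_t sector)
{
	uint8_t iv[TOY_IV_SIZE];

	assert(g && g->gen_iv && g->child);
	g->gen_iv(iv, sector);
	g->child(buf, len, iv);
}
```

Swapping 'gen_iv' changes the IV scheme without touching the inner cipher,
which is the property the template approach is after.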

-------------------------------------------------------------------------------
Clarifications
-------------------------------------------------------------------------------

1. Changes to testmgr.c
2. How to encrypt blocks bigger than 512 bytes while using essiv?
   Sectors are tied to the IV in the case of 'essiv'.
   Will changing the block size make it backward incompatible, also
   with other platforms (like Windows) which support LUKS volumes?
3. The key management code was not moved from dm-crypt to the crypto layer
   for keycount > 1, as multiple ciphers are instantiated from the dm layer
   and each cipher instance is allotted a part of the key provided.

-------------------------------------------------------------------------------
Test procedure
-------------------------------------------------------------------------------

The algorithms are tested using 'cryptsetup' utility to create LUKS
compatible volumes on Qemu.

NB: '/dev/sdb' is a second disk volume (configured in qemu)

# One time setup - Format the device compatible with LUKS.
# Choose one of the following IV generation algorithms at a time
cryptsetup -y -c aes-cbc-plain -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-plain64 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-essiv:sha256 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-benbi -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-null -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-lmk -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes-cbc-tcw -s 256 --hash sha256 luksFormat /dev/sdb

# With a keycount
cryptsetup -y -c aes:2-cbc-plain -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-plain64 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-essiv:sha256 -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-null -s 256 --hash sha256 luksFormat /dev/sdb
cryptsetup -y -c aes:2-cbc-lmk -s 256 --hash sha256 luksFormat /dev/sdb

# Add additional key - optional
cryptsetup luksAddKey /dev/sdb

# The above lists only a limited number of tests with the aes cipher.
# The IV generation algorithms may be tested with other ciphers as well.

cryptsetup luksDump --dump-master-key /dev/sdb

# create a luks volume and open the device
cryptsetup luksOpen /dev/sdb crypt_fun
dmsetup table --showkeys

# Write some data to the device
cat data.txt > /dev/mapper/crypt_fun

# Read 100 bytes back
dd if=/dev/mapper/crypt_fun of=out.txt bs=100 count=1
cat out.txt

mkfs.ext4 -j /dev/mapper/crypt_fun

# Mount if fs creation succeeds
mount -t ext4 /dev/mapper/crypt_fun /mnt

<-- Use the encrypted file system -->

umount /mnt
cryptsetup luksClose crypt_fun
cryptsetup luksRemoveKey /dev/sdb

This seems to work well. The file system mounts successfully and the files
written to the file system persist across reboots.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 crypto/Kconfig            |    8 +
 crypto/Makefile           |    1 +
 crypto/geniv.c            | 1113 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-crypt.c     |  725 +++--------------------------
 include/crypto/geniv.h    |  109 +++++
 include/crypto/skcipher.h |   17 +
 6 files changed, 1309 insertions(+), 664 deletions(-)
 create mode 100644 crypto/geniv.c
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan

^ permalink raw reply

* Re: Newly-created arrays don't auto-assemble - related to hostname change?
From: Andy Smith @ 2016-11-21  6:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <877f7xe9lh.fsf@notabene.neil.brown.name>

Hi Neil,

On Mon, Nov 21, 2016 at 03:32:42PM +1100, NeilBrown wrote:
> If you still want to get to the bottom of this, you might need to revert
> your work-around, then try the "udevadm monitor" and "udevadm info" and "udevadm
> trigger" while the array is not assembled.

I have reverted my addition of "mpt3sas" from
/etc/initramfs-tools/modules and rebooted, so that md5 is again not
assembled.

Result of

    udevadm info /dev/sdc

P: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc
N: sdc
S: disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633
S: disk/by-id/wwn-0x5002538c0007e7a8
S: disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
E: DEVLINKS=/dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633 /dev/disk/by-id/wwn-0x5002538c0007e7a8 /dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
E: DEVNAME=/dev/sdc
E: DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc 
E: DEVTYPE=disk
E: ID_ATA=1
E: ID_ATA_DOWNLOAD_MICROCODE=1
E: ID_ATA_FEATURE_SET_HPA=1
E: ID_ATA_FEATURE_SET_HPA_ENABLED=1
E: ID_ATA_FEATURE_SET_PM=1
E: ID_ATA_FEATURE_SET_PM_ENABLED=1
E: ID_ATA_FEATURE_SET_SECURITY=1
E: ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
E: ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=32
E: ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=32
E: ID_ATA_FEATURE_SET_SMART=1
E: ID_ATA_FEATURE_SET_SMART_ENABLED=1
E: ID_ATA_ROTATION_RATE_RPM=0
E: ID_ATA_SATA=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN1=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN2=1
E: ID_ATA_WRITE_CACHE=1
E: ID_ATA_WRITE_CACHE_ENABLED=1
E: ID_BUS=ata
E: ID_FS_LABEL=tbd:5
E: ID_FS_LABEL_ENC=tbd:5
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_USAGE=raid
E: ID_FS_UUID=957030cf-c09f-023d-ceae-bb27e546f095
E: ID_FS_UUID_ENC=957030cf-c09f-023d-ceae-bb27e546f095
E: ID_FS_UUID_SUB=4ac82c29-2d10-9465-7fff-9b228c411c1e
E: ID_FS_VERSION=1.2
E: ID_MODEL=SAMSUNG_MZ7KM1T9HAJM-00005
E: ID_MODEL_ENC=SAMSUNG\x20MZ7KM1T9HAJM-00005\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_PATH=pci-0000:01:00.0-sas-0x4433221100000000-lun-0
E: ID_PATH_TAG=pci-0000_01_00_0-sas-0x4433221100000000-lun-0
E: ID_REVISION=GXM1003Q
E: ID_SERIAL=SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633
E: ID_SERIAL_SHORT=S2HNNAAH200633
E: ID_TYPE=disk
E: ID_WWN=0x5002538c0007e7a8
E: ID_WWN_WITH_EXTENSION=0x5002538c0007e7a8
E: MAJOR=8
E: MINOR=32
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=43226

I then issued

    echo change > /sys/block/sdc/uevent

and

    echo change > /sys/block/sdd/uevent

which resulted in the monitor showing:

KERNEL[572.956390] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc (block)
UDEV  [572.960256] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc (block)
KERNEL[593.140178] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:1/end_device-10:1/target10:0:1/10:0:1:0/block/sdd (block)
UDEV  [593.143824] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:1/end_device-10:1/target10:0:1/10:0:1:0/block/sdd (block)

But no assembly of md5.

Afterwards,

    udevadm info /dev/sdc

showed:

P: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc
N: sdc
S: disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633
S: disk/by-id/wwn-0x5002538c0007e7a8
S: disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
E: DEVLINKS=/dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633 /dev/disk/by-id/wwn-0x5002538c0007e7a8 /dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
E: DEVNAME=/dev/sdc
E: DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sdc
E: DEVTYPE=disk
E: ID_ATA=1
E: ID_ATA_DOWNLOAD_MICROCODE=1
E: ID_ATA_FEATURE_SET_HPA=1
E: ID_ATA_FEATURE_SET_HPA_ENABLED=1
E: ID_ATA_FEATURE_SET_PM=1
E: ID_ATA_FEATURE_SET_PM_ENABLED=1
E: ID_ATA_FEATURE_SET_SECURITY=1
E: ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
E: ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=32
E: ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=32
E: ID_ATA_FEATURE_SET_SMART=1
E: ID_ATA_FEATURE_SET_SMART_ENABLED=1
E: ID_ATA_ROTATION_RATE_RPM=0
E: ID_ATA_SATA=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN1=1
E: ID_ATA_SATA_SIGNAL_RATE_GEN2=1
E: ID_ATA_WRITE_CACHE=1
E: ID_ATA_WRITE_CACHE_ENABLED=1
E: ID_BUS=ata
E: ID_FS_LABEL=tbd:5
E: ID_FS_LABEL_ENC=tbd:5
E: ID_FS_TYPE=linux_raid_member
E: ID_FS_USAGE=raid
E: ID_FS_UUID=957030cf-c09f-023d-ceae-bb27e546f095
E: ID_FS_UUID_ENC=957030cf-c09f-023d-ceae-bb27e546f095
E: ID_FS_UUID_SUB=4ac82c29-2d10-9465-7fff-9b228c411c1e
E: ID_FS_UUID_SUB_ENC=4ac82c29-2d10-9465-7fff-9b228c411c1e
E: ID_FS_VERSION=1.2
E: ID_MODEL=SAMSUNG_MZ7KM1T9HAJM-00005
E: ID_MODEL_ENC=SAMSUNG\x20MZ7KM1T9HAJM-00005\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_PATH=pci-0000:01:00.0-sas-0x4433221100000000-lun-0
E: ID_PATH_TAG=pci-0000_01_00_0-sas-0x4433221100000000-lun-0
E: ID_REVISION=GXM1003Q
E: ID_SERIAL=SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633
E: ID_SERIAL_SHORT=S2HNNAAH200633
E: ID_TYPE=disk
E: ID_WWN=0x5002538c0007e7a8
E: ID_WWN_WITH_EXTENSION=0x5002538c0007e7a8
E: MAJOR=8
E: MINOR=32
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=43226

Thanks,
Andy

^ permalink raw reply

* Re: [PATCH] md/r5cache: handle alloc_page failure
From: NeilBrown @ 2016-11-21  5:05 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161119072057.1302854-1-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 1886 bytes --]

On Sat, Nov 19 2016, Song Liu wrote:

> RMW of r5c write back cache uses an extra page to store old data for
> prexor. handle_stripe_dirtying() allocates this page by calling
> alloc_page(). However, alloc_page() may fail.
>
> To handle alloc_page() failures, this patch adds a small mempool
> in r5l_log. When alloc_page fails, the stripe is added to a waiting
> list. Then, these stripes get pages from the mempool (from work queue).
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/md/raid5.c       |  34 +++++++++++-----
>  drivers/md/raid5.h       |   6 +++
>  3 files changed, 130 insertions(+), 10 deletions(-)

This looks *way* more complex than I feel comfortable with.  I cannot
see any obvious errors, but I find it hard to be confident there aren't
any, or that errors won't creep in.

We already have mechanisms for delaying stripes.  It would be nice to
re-use those.

Ideally, when using a mempool, a single allocation from the mempool
should provide all you need for a single transaction.  For that to work
in this case, each object in the mempool would need to hold raid_disks-1
pages.  However in many cases we don't need that many, usually fewer
than raid_disks/2.  So it would be wasteful or clumsy, or both.
So I would:

 - preallocate a single set of spare pages, and store them, one each, in
   the disk_info structures.
 - add a flag to cache_state to record if these pages are in use.
 - if alloc_page() fails and test_and_set() on the bit succeeds, then
     use pages from disk_info
 - if alloc_page() fails and test_and_set() fails, set STRIPE_DELAYED
 - raid5_activate_delayed() doesn't activate stripes if the flag is set,
   showing that the spare pages are in use.
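
As a rough userspace model of the scheme above (not raid5 code): an atomic
flag guards one preallocated spare, an allocation failure first tries to claim
the spare, and a second failure while the spare is busy reports that the
stripe has to be delayed. The names, the single spare page (the proposal keeps
one per disk_info) and the pluggable allocator are simplifications made for
this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SZ 4096

enum page_result {
	PAGE_ALLOCATED,   /* normal allocation succeeded */
	PAGE_FROM_SPARE,  /* allocation failed, spare page claimed */
	PAGE_MUST_DELAY,  /* allocation failed and spare is busy */
};

struct spare_pool {
	atomic_bool in_use;  /* models the proposed cache_state flag */
	void *spare;         /* preallocated page, owned by the pool */
};

static void spare_pool_init(struct spare_pool *pool, void *spare)
{
	atomic_init(&pool->in_use, false);
	pool->spare = spare;
}

/* Helper that simulates an alloc_page() failure in this sketch. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

static enum page_result get_extra_page(struct spare_pool *pool,
				       void *(*alloc)(size_t), void **out)
{
	assert(pool && alloc && out);

	*out = alloc(PAGE_SZ);
	if (*out)
		return PAGE_ALLOCATED;

	/* The exchange returns the old value: false means we own the spare. */
	if (!atomic_exchange(&pool->in_use, true)) {
		*out = pool->spare;
		return PAGE_FROM_SPARE;
	}

	*out = NULL;
	return PAGE_MUST_DELAY;
}

/* Release the spare once the stripe that borrowed it is done. */
static void put_spare_page(struct spare_pool *pool)
{
	atomic_store(&pool->in_use, false);
}

/* Delayed stripes are only activated while the spare is free. */
static bool may_activate_delayed(struct spare_pool *pool)
{
	return !atomic_load(&pool->in_use);
}
```

The point of the flag is that at most one stripe at a time can hold the spare
pages, so a waiting list or work queue is not needed.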

I think that would be much simpler, and should be just as effective.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: Newly-created arrays don't auto-assemble - related to hostname change?
From: NeilBrown @ 2016-11-21  4:32 UTC (permalink / raw)
  To: Andy Smith; +Cc: linux-raid
In-Reply-To: <20161118041759.GE1804@bitfolk.com>

[-- Attachment #1: Type: text/plain, Size: 3663 bytes --]

On Fri, Nov 18 2016, Andy Smith wrote:

> Hi Neil,
>
> On Fri, Nov 18, 2016 at 03:08:23PM +1100, NeilBrown wrote:
>> Up to you, but I have an idea.
>> The udev rules files depends on 'blkid' having been run.
>>   /lib/udev/rules.d/60-persistent-storage.rules
>> does this, but not for
>>   KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*|zram*|mmcblk[0-9]*rpmb"
>> 
>> ... though that wouldn't apply to you.
>> 
>> what does
>>   udevadm info /dev/sdc
>
> (Since mpt3sas got loaded early the device identifiers have all
> changed; what was sd{a,b} have now shifted to the end as sd{e,f}, so
> the two members of md5 are now sd{a,b})
>
> $ sudo udevadm info /dev/sda
> P: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sda
> N: sda
> S: disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633
> S: disk/by-id/wwn-0x5002538c0007e7a8
> S: disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
> E: DEVLINKS=/dev/disk/by-id/ata-SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633 /dev/disk/by-id/wwn-0x5002538c0007e7a8 /dev/disk/by-path/pci-0000:01:00.0-sas-0x4433221100000000-lun-0
> E: DEVNAME=/dev/sda
> E: DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sda
> E: DEVTYPE=disk
> E: ID_ATA=1
> E: ID_ATA_DOWNLOAD_MICROCODE=1
> E: ID_ATA_FEATURE_SET_HPA=1
> E: ID_ATA_FEATURE_SET_HPA_ENABLED=1
> E: ID_ATA_FEATURE_SET_PM=1
> E: ID_ATA_FEATURE_SET_PM_ENABLED=1
> E: ID_ATA_FEATURE_SET_SECURITY=1
> E: ID_ATA_FEATURE_SET_SECURITY_ENABLED=0
> E: ID_ATA_FEATURE_SET_SECURITY_ENHANCED_ERASE_UNIT_MIN=32
> E: ID_ATA_FEATURE_SET_SECURITY_ERASE_UNIT_MIN=32
> E: ID_ATA_FEATURE_SET_SMART=1
> E: ID_ATA_FEATURE_SET_SMART_ENABLED=1
> E: ID_ATA_ROTATION_RATE_RPM=0
> E: ID_ATA_SATA=1
> E: ID_ATA_SATA_SIGNAL_RATE_GEN1=1
> E: ID_ATA_SATA_SIGNAL_RATE_GEN2=1
> E: ID_ATA_WRITE_CACHE=1
> E: ID_ATA_WRITE_CACHE_ENABLED=1
> E: ID_BUS=ata
> E: ID_FS_LABEL=tbd:5
> E: ID_FS_LABEL_ENC=tbd:5
> E: ID_FS_TYPE=linux_raid_member

This is encouraging.  It means that blkid ran and udev knows that this
is part of an md array.

However there are no "MD_" ... I guess that is normal if the latest udev
event happened after the array was assembled.

If you still want to get to the bottom of this, you might need to revert
your work-around, then try the "udevadm monitor" and "udevadm info" and "udevadm
trigger" while the array is not assembled.

You could possibly try stopping the array, then running "udevadm
trigger".
If that works, you revert the recent change to module loading.
If it doesn't result in the array being assembled, then that would be a good
time to try "udevadm info" again.

NeilBrown


> E: ID_FS_USAGE=raid
> E: ID_FS_UUID=957030cf-c09f-023d-ceae-bb27e546f095
> E: ID_FS_UUID_ENC=957030cf-c09f-023d-ceae-bb27e546f095
> E: ID_FS_UUID_SUB=4ac82c29-2d10-9465-7fff-9b228c411c1e
> E: ID_FS_UUID_SUB_ENC=4ac82c29-2d10-9465-7fff-9b228c411c1e
> E: ID_FS_VERSION=1.2
> E: ID_MODEL=SAMSUNG_MZ7KM1T9HAJM-00005
> E: ID_MODEL_ENC=SAMSUNG\x20MZ7KM1T9HAJM-00005\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
> E: ID_PATH=pci-0000:01:00.0-sas-0x4433221100000000-lun-0
> E: ID_PATH_TAG=pci-0000_01_00_0-sas-0x4433221100000000-lun-0
> E: ID_REVISION=GXM1003Q
> E: ID_SERIAL=SAMSUNG_MZ7KM1T9HAJM-00005_S2HNNAAH200633
> E: ID_SERIAL_SHORT=S2HNNAAH200633
> E: ID_TYPE=disk
> E: ID_WWN=0x5002538c0007e7a8
> E: ID_WWN_WITH_EXTENSION=0x5002538c0007e7a8
> E: MAJOR=8
> E: MINOR=0
> E: SUBSYSTEM=block
> E: TAGS=:systemd:
> E: UDEV_LOG=7
> E: USEC_INITIALIZED=38597
>
> Cheers,
> Andy

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v2] md: add block tracing for bio_remapping
From: NeilBrown @ 2016-11-21  4:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161118175005.dodwcprytycwyuyq@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2244 bytes --]

On Sat, Nov 19 2016, Shaohua Li wrote:

> On Fri, Nov 18, 2016 at 09:26:22AM -0800, Shaohua Li wrote:
>> On Fri, Nov 18, 2016 at 01:22:04PM +1100, Neil Brown wrote:
>> > 
>> > The block tracing infrastructure (accessed with blktrace/blkparse)
>> > supports the tracing of mapping bios from one device to another.
>> > This is currently used when a bio in a partition is mapped to the
>> > whole device, when bios are mapped by dm, and for mapping in md/raid5.
>> > Other md personalities do not include this tracing yet, so add it.
>> > 
>> > When a read-error is detected we redirect the request to a different device.
>> > This could justifiably be seen as a new mapping for the original bio,
>> > or a secondary mapping for the bio that errors.  This patch uses
>> > the second option.
>> > 
>> > When md is used under dm-raid, the mappings are not traced as we do
>> > not have access to the block device number of the parent.
>> 
>> thanks, applied patch 1, 3, 4.
>
> BTW, I added below patch
>

Yes, that looks good.  Thanks!

NeilBrown

>
> commit 504634f60f463e73e7d58c6810a04437da942dba
> Author: Shaohua Li <shli@fb.com>
> Date:   Fri Nov 18 09:44:08 2016 -0800
>
>     md: add blktrace event for writes to superblock
>     
>     superblock write is an expensive operation. With raid5-cache, it can be called
>     regularly. Tracing to help performance debug.
>     
>     Signed-off-by: Shaohua Li <shli@fb.com>
>     Cc: NeilBrown <neilb@suse.com>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1f1c7f0..d3cef77 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -64,6 +64,7 @@
>  #include <linux/raid/md_p.h>
>  #include <linux/raid/md_u.h>
>  #include <linux/slab.h>
> +#include <trace/events/block.h>
>  #include "md.h"
>  #include "bitmap.h"
>  #include "md-cluster.h"
> @@ -2403,6 +2404,8 @@ void md_update_sb(struct mddev *mddev, int force_change)
>  	pr_debug("md: updating %s RAID superblock on device (in sync %d)\n",
>  		 mdname(mddev), mddev->in_sync);
>  
> +	if (mddev->queue)
> +		blk_add_trace_msg(mddev->queue, "md md_update_sb");
>  	bitmap_update_sb(mddev->bitmap);
>  	rdev_for_each(rdev, mddev) {
>  		char b[BDEVNAME_SIZE];


* Re: MD Remnants After --stop
From: NeilBrown @ 2016-11-21  3:42 UTC (permalink / raw)
  To: Marc Smith; +Cc: linux-raid
In-Reply-To: <CAHkw+LcfszsWDGQJzDaQU0-PsL+8K7t5msW_bLbv+YyDEFmn8Q@mail.gmail.com>


On Sat, Nov 19 2016, Marc Smith wrote:

> On Mon, Nov 7, 2016 at 12:44 AM, NeilBrown <neilb@suse.com> wrote:
>> On Sat, Nov 05 2016, Marc Smith wrote:
>>
>>> Hi,
>>>
>>> It may be that I've never noticed this before, so maybe it's not a
>>> problem... after using '--stop' to deactivate/stop an MD array, there
>>> are remnants of it lingering, namely an entry in /sys/block (eg,
>>> /sys/block/md127) and the device node in /dev remains (eg,
>>> /dev/md127).
>>>
>>> Is this normal? Like I said, it probably is, and I've just never
>>> noticed it before. I assume it's not going to hurt anything, but is
>>> there a way to clean it up, without rebooting? Obviously I could
>>> remove the /dev entry, but what about /sys/block?
>>>
>>
>> You can remove them both by running
>>   mdadm -S /dev/md127
>>
>> but they'll probably just reappear again.
>>
>> This seems to be an on-going battle between md and udev.  I've "fixed"
>> it at least once, but it keeps coming back.
>>
>> When md removes the md127 device, a message is sent to udev.
>> As part of its response to this message, udev tries to open /dev/md127.
>> Because of the rather unusual way that md devices are created (it made
>> sense nearly 20 years ago when it was designed), opening /dev/md127
>> causes md to create device md127 again.
>>
>> You could
>>   mv /dev/md127 /dev/md127X
>>   mdadm -S /dev/md127X
>>   rm /dev/md127X
>> that stops udev from opening /dev/md127.  It seems to work reliably.
>>
>> md used to generate a CHANGE event before the REMOVE event, and only the
>> CHANGE event caused udev to open the device file.  I removed that and
>> the problem went away.  Apparently some change has happened to udev and
>> now it opens the file in response to REMOVE as well.
>
> I used "udevadm monitor -pku" to watch the events when running "mdadm
> --stop /dev/md127" and this is what I see:
>
> --snip--
> KERNEL[163074.119778] change   /devices/virtual/block/md127 (block)
> ACTION=change
> DEVNAME=/dev/md127
> DEVPATH=/devices/virtual/block/md127
> DEVTYPE=disk
> MAJOR=9
> MINOR=127
> SEQNUM=3701
> SUBSYSTEM=block
>
> UDEV  [163074.121569] change   /devices/virtual/block/md127 (block)
> ACTION=change
> DEVNAME=/dev/md127
> DEVPATH=/devices/virtual/block/md127
> DEVTYPE=disk
> MAJOR=9
> MINOR=127
> SEQNUM=3701
> SUBSYSTEM=block
> SYSTEMD_READY=0
> USEC_INITIALIZED=370470
> --snip--
>
> I don't see any 'remove' event generated. I should mention if I hadn't
> already that I'm testing md-cluster (--bitmap=clustered), and
> currently using Linux 4.9-rc3.

What version of mdadm are you using?
You need one which contains
Commit: 229e66cb9689 ("Manage.c: Only issue change events for kernels older than 2.6.28")

which hasn't made it into a release yet.  But if you are playing with
md-cluster, I would guess you are using the latest from git...

NeilBrown


* [PATCH - v2] block: call trace_block_split() from bio_split()
From: NeilBrown @ 2016-11-21  3:33 UTC (permalink / raw)
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-raid,
	linux-kernel
In-Reply-To: <20161118125533.GA27741@infradead.org>



Somewhere around
Commit: 20d0189b1012 ("block: Introduce new bio_split()")
and
Commit: 4b1faf931650 ("block: Kill bio_pair_split()")

in 3.14 we lost the call to trace_block_split() from bio_split().

Commit: cda22646adaa ("block: add call to split trace point")

in 4.5 added it back for blk_queue_split(), but not for other users of
bio_split(), and particularly not for md.

This patch moves the trace_block_split() call from blk_queue_split()
to bio_split().
As blk_queue_split() calls bio_split() (via various helper functions)
the same events that were traced before will still be traced.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---

Thanks Christoph.
This adds the wrap and the reviewed-by.

NeilBrown


 block/bio.c       | 2 ++
 block/blk-merge.c | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index db85c5753a76..0aa755abd10b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1804,6 +1804,8 @@ struct bio *bio_split(struct bio *bio, int sectors,
 		bio_integrity_trim(split, 0, sectors);
 
 	bio_advance(bio, split->bi_iter.bi_size);
+	trace_block_split(bdev_get_queue(bio->bi_bdev), split,
+			  bio->bi_iter.bi_sector);
 
 	return split;
 }
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2642e5fc8b69..82cdd35a9f07 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -217,7 +217,6 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 		split->bi_opf |= REQ_NOMERGE;
 
 		bio_chain(split, *bio);
-		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
 		generic_make_request(*bio);
 		*bio = split;
 	}
-- 
2.10.2



* [md PATCH 6/5] md/raid5: remove over-loading of ->bi_phys_segments.
From: NeilBrown @ 2016-11-21  2:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>



When a read request, which bypassed the cache, fails, we need to retry
it through the cache.
This involves attaching it to a sequence of stripe_heads, and it may not
be possible to get all the stripe_heads we need at once.
We do what we can, and record how far we got in ->bi_phys_segments so
we can pick up again later.

There is only ever one bio which may have a non-zero offset stored in
->bi_phys_segments, the one that is either active in the single thread
which calls retry_aligned_read(), or is in conf->retry_read_aligned
waiting for retry_aligned_read() to be called again.

So we only need to store one offset value.  This can be in a local
variable passed between remove_bio_from_retry() and
retry_aligned_read(), or in the r5conf structure next to the
->retry_read_aligned pointer.

Storing it there allows the last usage of ->bi_phys_segments to be
removed from md/raid5.c.
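
As a rough illustration of the idea above, here is a userspace sketch
(not kernel code) of keeping the resume offset next to the single
pending pointer rather than inside the request itself.  struct fake_conf,
fake_retry_read() and fake_remove_from_retry() are hypothetical names
invented for this sketch; only the two field names are borrowed from the
patch, and "avail" stands in for the number of stripe_heads that happen
to be free.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct r5conf. */
struct fake_conf {
	const void *retry_read_aligned;	/* the one bio waiting to resume */
	unsigned int retry_read_offset;	/* stripes of it already handled */
};

/*
 * Handle stripes [offset, total) of 'bio' while 'avail' stripe_heads
 * remain.  Returns the number handled in this call.  If resources run
 * out, the bio and the position to resume from are recorded in 'conf'.
 */
static unsigned int fake_retry_read(struct fake_conf *conf, const void *bio,
				    unsigned int total, unsigned int offset,
				    unsigned int avail)
{
	unsigned int scnt, handled = 0;

	for (scnt = 0; scnt < total; scnt++) {
		if (scnt < offset)
			continue;	/* already done this stripe */
		if (avail == 0) {
			/* must wait: remember the bio and how far we got */
			conf->retry_read_aligned = bio;
			conf->retry_read_offset = scnt;
			return handled;
		}
		avail--;
		handled++;
	}
	return handled;
}

/* Take the pending bio, if any, and report its resume offset. */
static const void *fake_remove_from_retry(struct fake_conf *conf,
					  unsigned int *offset)
{
	const void *bio = conf->retry_read_aligned;

	if (bio) {
		*offset = conf->retry_read_offset;
		conf->retry_read_aligned = NULL;
	}
	return bio;
}
```

Because at most one request can be parked in retry_read_aligned at a
time, a single offset in the containing structure is enough; nothing
needs to be stored in the request.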

Signed-off-by: NeilBrown <neilb@suse.com>
---

This applies on top of the previous set of 5, and finishes the
task of eradicating usage of bi_phys_segments.

NeilBrown


 drivers/md/raid5.c | 46 ++++++++++++----------------------------------
 drivers/md/raid5.h |  1 +
 2 files changed, 13 insertions(+), 34 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7169420bfde5..b7be5a097ead 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -144,28 +144,6 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 		return NULL;
 }
 
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
-	unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
-
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
-}
-
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
@@ -4679,12 +4657,14 @@ static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
 	md_wakeup_thread(conf->mddev->thread);
 }
 
-static struct bio *remove_bio_from_retry(struct r5conf *conf)
+static struct bio *remove_bio_from_retry(struct r5conf *conf,
+					 unsigned int *offset)
 {
 	struct bio *bi;
 
 	bi = conf->retry_read_aligned;
 	if (bi) {
+		*offset = conf->retry_read_offset;
 		conf->retry_read_aligned = NULL;
 		return bi;
 	}
@@ -4692,11 +4672,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 	if(bi) {
 		conf->retry_read_aligned_list = bi->bi_next;
 		bi->bi_next = NULL;
-		/*
-		 * this sets the active strip count to 1 and the processed
-		 * strip count to zero (upper 8 bits)
-		 */
-		raid5_set_bi_processed_stripes(bi, 0);
+		*offset = 0;
 	}
 
 	return bi;
@@ -5620,7 +5596,8 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	return STRIPE_SECTORS;
 }
 
-static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
+static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
+			       unsigned int offset)
 {
 	/* We may not be able to submit a whole bio at once as there
 	 * may not be enough stripe_heads available.
@@ -5649,7 +5626,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
 
-		if (scnt < raid5_bi_processed_stripes(raid_bio))
+		if (scnt < offset)
 			/* already done this stripe */
 			continue;
 
@@ -5657,15 +5634,15 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
 			raid5_release_stripe(sh);
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
@@ -5788,6 +5765,7 @@ static void raid5d(struct md_thread *thread)
 	while (1) {
 		struct bio *bio;
 		int batch_size, released;
+		unsigned int offset;
 
 		released = release_stripe_list(conf, conf->temp_inactive_list);
 		if (released)
@@ -5805,10 +5783,10 @@ static void raid5d(struct md_thread *thread)
 		}
 		raid5_activate_delayed(conf);
 
-		while ((bio = remove_bio_from_retry(conf))) {
+		while ((bio = remove_bio_from_retry(conf, &offset))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
-			ok = retry_aligned_read(conf, bio);
+			ok = retry_aligned_read(conf, bio, offset);
 			spin_lock_irq(&conf->device_lock);
 			if (!ok)
 				break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 799f84b26838..ec2be7677bfb 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -472,6 +472,7 @@ struct r5conf {
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
+	unsigned int		retry_read_offset; /* sector offset into retry_read_aligned */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;
-- 
2.10.2



* Re: [BUG 4.4.26] bio->bi_bdev == NULL in raid6 return_io()
From: NeilBrown @ 2016-11-21  1:23 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Konstantin Khlebnikov, Shaohua Li
  Cc: linux-kernel@vger.kernel.org, linux-raid, linux-block, Jens Axboe,
	Christoph Hellwig
In-Reply-To: <c7f3d4e3-0ef6-a6d8-7c8b-bbdb903af7a9@yandex-team.ru>


On Sun, Nov 20 2016, Konstantin Khlebnikov wrote:

> On 07.11.2016 23:34, Konstantin Khlebnikov wrote:
>> On Mon, Nov 7, 2016 at 10:46 PM, Shaohua Li <shli@kernel.org> wrote:
>>> On Sat, Nov 05, 2016 at 01:48:45PM +0300, Konstantin Khlebnikov wrote:
>>>> return_io() resolves request_queue even if trace point isn't active:
>>>>
>>>> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>>>> {
>>>>       return bdev->bd_disk->queue;    /* this is never NULL */
>>>> }
>>>>
>>>> static void return_io(struct bio_list *return_bi)
>>>> {
>>>>       struct bio *bi;
>>>>       while ((bi = bio_list_pop(return_bi)) != NULL) {
>>>>               bi->bi_iter.bi_size = 0;
>>>>               trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
>>>>                                        bi, 0);
>>>>               bio_endio(bi);
>>>>       }
>>>> }
>>>
>>> I can't see how this could happen. What kind of tests/environment are these running?
>>
>> That was a random piece of production somewhere.
>> According to the timing, all crashes happened soon after a reboot.
>> There're several raids, probably some of them were still under resync.
>>
>> For now we have only few machines with this kernel. But I'm sure that
>> I'll get much more soon =)
>
> I've added this debug patch for catching overflow of active stripes in bio
>
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -164,6 +164,7 @@ static inline void raid5_inc_bi_active_stripes(struct bio *bio)
>   {
>          atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
>          atomic_inc(segments);
> +       BUG_ON(!(atomic_read(segments) & 0xffff));
>   }
>
> And got this. Counter in %edx = 0x00010000
>
> So, looks like one bio (discard?) can cover more than 65535 stripes

65535 stripes - 256M.  I guess that is possible.  Christoph has
suggested that now would be a good time to stop using bi_phys_segments
like this.
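
To make the arithmetic concrete, here is a small userspace model (not
kernel code) of the packed counter: the low 16 bits hold the biased
active count and the high 16 bits hold the processed count.  The helper
names are hypothetical, and the 4 KiB per stripe unit is an assumption
(it matches the "65535 stripes - 256M" estimate above).

```c
#include <stdint.h>

/* Low 16 bits: biased count of active stripes. */
static inline uint32_t packed_active(uint32_t segments)
{
	return segments & 0xffff;
}

/* High 16 bits: count of processed stripes. */
static inline uint32_t packed_processed(uint32_t segments)
{
	return (segments >> 16) & 0xffff;
}

/* Attach the bio to n more stripes, one increment of the word each. */
static inline uint32_t packed_attach(uint32_t segments, uint32_t n)
{
	while (n--)
		segments++;	/* models atomic_inc() on the whole word */
	return segments;
}

/* Bytes covered by n stripes, assuming 4 KiB per stripe unit. */
static inline uint64_t stripes_to_bytes(uint64_t n)
{
	return n * 4096u;
}
```

Starting from the biased value 1, packed_attach(1, 65535) yields
0x00010000, the value reported in %edx: the active count reads as zero
and the carry has spilled into the processed count.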

I have some patches which should fix this.  I'll post them shortly.  I'd
appreciate it if you would test and confirm that they work (and don't
break anything else)

Thanks,
NeilBrown


* [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

md/raid5 needs to keep track of how many stripe_heads are processing a
bio so that it can delay calling bio_endio() until all stripe_heads
have completed.  It currently uses 16 bits of ->bi_phys_segments for
this purpose.

16 bits is only enough for requests of up to 256MB, and it is possible
for a single bio to be larger than this, which overflows the counter.  Also, the
bio struct contains a larger counter, __bi_remaining, which has a
purpose very similar to the purpose of our counter.  So stop using
->bi_phys_segments, and instead use __bi_remaining.

This means we don't need to initialize the counter, as our caller
initializes it to '1'.  It also means we can call bio_endio() directly
as it tests this counter internally.
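
The counting pattern described above can be sketched in userspace C
roughly as follows.  This is only a model: struct fake_bio and the
fake_bio_*() helpers are hypothetical stand-ins, and the kernel's
bio_inc_remaining() and bio_endio() carry extra details that are
omitted here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct bio; only what the sketch needs. */
struct fake_bio {
	atomic_int remaining;	/* models __bi_remaining */
	bool completed;		/* set once the final reference is dropped */
};

/* The submitter hands the bio over holding one reference. */
static void fake_bio_init(struct fake_bio *bio)
{
	atomic_init(&bio->remaining, 1);
	bio->completed = false;
}

/* Called once per stripe_head the bio is attached to. */
static void fake_bio_inc_remaining(struct fake_bio *bio)
{
	atomic_fetch_add(&bio->remaining, 1);
}

/* Drops one reference; only the last caller completes the bio. */
static void fake_bio_endio(struct fake_bio *bio)
{
	/* atomic_fetch_sub() returns the value before the decrement */
	if (atomic_fetch_sub(&bio->remaining, 1) == 1)
		bio->completed = true;
}
```

With this shape, every completion path can call the endio helper
unconditionally, and the request path drops the initial reference with
one final call once it has finished attaching stripes.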

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   68 +++++++++++-----------------------------------------
 1 file changed, 14 insertions(+), 54 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6f3154c80fbf..7169420bfde5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -154,18 +154,6 @@ static inline int raid5_bi_processed_stripes(struct bio *bio)
 	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_inc(segments);
-}
-
 static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	unsigned int cnt)
 {
@@ -178,12 +166,6 @@ static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	} while (atomic_cmpxchg(segments, old, new) != old);
 }
 
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_set(segments, cnt);
-}
-
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
@@ -1190,8 +1172,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
-				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_endio(rbi);
+				bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
@@ -2983,7 +2964,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	raid5_inc_bi_active_stripes(bi);
+	bio_inc_remaining(bi);
 	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
@@ -3108,8 +3089,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3131,8 +3111,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3157,8 +3136,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 					r5_next_bio(bi, sh->dev[i].sector);
 
 				bi->bi_error = -EIO;
-				if (!raid5_dec_bi_active_stripes(bi))
-					bio_endio(bi);
+				bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3475,8 +3453,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
-					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_endio(wbi);
+					bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4719,7 +4696,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		raid5_set_bi_stripes(bi, 1); /* biased count of active stripes */
+		raid5_set_bi_processed_stripes(bi, 0);
 	}
 
 	return bi;
@@ -5036,7 +5013,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	struct r5conf *conf = mddev->private;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
-	int remaining;
 	int stripe_sectors;
 
 	if (mddev->reshape_position != MaxSector)
@@ -5047,7 +5023,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
+	bi->bi_phys_segments = 0;
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5093,7 +5069,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 				continue;
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
-			raid5_inc_bi_active_stripes(bi);
+			bio_inc_remaining(bi);
 			sh->overwrite_disks++;
 		}
 		spin_unlock_irq(&sh->stripe_lock);
@@ -5117,10 +5093,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	}
 
 	md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static void raid5_make_request(struct mddev *mddev, struct bio * bi)
@@ -5131,7 +5104,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
-	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
 
@@ -5167,7 +5139,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	bi->bi_phys_segments = 0;
 	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
@@ -5299,14 +5271,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	if (rw == WRITE)
 		md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-
-
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
@@ -5671,7 +5636,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	int dd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
-	int remaining;
 	int handled = 0;
 
 	logical_sector = raid_bio->bi_iter.bi_sector &
@@ -5710,12 +5674,8 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		raid5_release_stripe(sh);
 		handled++;
 	}
-	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
-		bio_endio(raid_bio);
-	}
+	bio_endio(raid_bio);
+
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
 		wake_up(&conf->wait_for_quiescent);
 	return handled;




* [md PATCH 4/5] md/raid5: call bio_endio() directly rather than queuing for later.
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

We gather bios that need to be returned into a bio_list
and call bio_endio() on them all together.
The original reason for this was to avoid making the calls while
holding a spinlock.
Locking has changed a lot since then, and that reason is no longer
valid.

So discard return_io() and various return_bi lists, and just call
bio_endio() directly as needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   36 +++++++++---------------------------
 drivers/md/raid5.h |    1 -
 2 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e53b8f499a4c..6f3154c80fbf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -223,17 +223,6 @@ static int raid6_idx_to_slot(int idx, struct stripe_head *sh,
 	return slot;
 }
 
-static void return_io(struct bio_list *return_bi)
-{
-	struct bio *bi;
-	while ((bi = bio_list_pop(return_bi)) != NULL) {
-		bi->bi_iter.bi_size = 0;
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
-		bio_endio(bi);
-	}
-}
-
 static void print_raid5_conf (struct r5conf *conf);
 
 static int stripe_operations_active(struct stripe_head *sh)
@@ -1178,7 +1167,6 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	struct bio_list return_bi = BIO_EMPTY_LIST;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
@@ -1203,15 +1191,13 @@ static void ops_complete_biofill(void *stripe_head_ref)
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
 				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_list_add(&return_bi, rbi);
+					bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
 	}
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
-	return_io(&return_bi);
-
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
 }
@@ -3075,8 +3061,7 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
 
 static void
 handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
-				struct stripe_head_state *s, int disks,
-				struct bio_list *return_bi)
+		     struct stripe_head_state *s, int disks)
 {
 	int i;
 	BUG_ON(sh->batch_head);
@@ -3124,7 +3109,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3147,7 +3132,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3173,7 +3158,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 				bi->bi_error = -EIO;
 				if (!raid5_dec_bi_active_stripes(bi))
-					bio_list_add(return_bi, bi);
+					bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3457,7 +3442,7 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
  * never LOCKED, so we don't need to test 'failed' directly.
  */
 static void handle_stripe_clean_event(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi)
+	struct stripe_head *sh, int disks)
 {
 	int i;
 	struct r5dev *dev;
@@ -3491,7 +3476,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
 					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_list_add(return_bi, wbi);
+						bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4378,7 +4363,7 @@ static void handle_stripe(struct stripe_head *sh)
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
 		if (s.to_read+s.to_write+s.written)
-			handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
+			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
 	}
@@ -4443,7 +4428,7 @@ static void handle_stripe(struct stripe_head *sh)
 			     && !test_bit(R5_LOCKED, &qdev->flags)
 			     && (test_bit(R5_UPTODATE, &qdev->flags) ||
 				 test_bit(R5_Discard, &qdev->flags))))))
-		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+		handle_stripe_clean_event(conf, sh, disks);
 
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
@@ -4633,9 +4618,6 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi))
-		return_io(&s.return_bi);
-
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f654f8207a44..799f84b26838 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -269,7 +269,6 @@ struct stripe_head_state {
 	int dec_preread_active;
 	unsigned long ops_request;
 
-	struct bio_list return_bi;
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;




* [md PATCH 3/5] md/raid5: simplify delaying of writes while metadata is updated.
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowledged.

Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
added code for this, but it was unnecessarily complicated.  We already
had similar functionality for handling updates to the bad-block list, thanks to
Commit: de393cdea66c ("md: make it easier to wait for bad blocks to be acknowledged.")

So revert most of the former commit, and instead avoid collecting
completed writes if MD_CHANGE_PENDING is set.  raid5d will then flush
the metadata and retry the stripe_head.

We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously.  After analyse_stripe(), we have collected stable data
about the state of devices, which will be used to make decisions.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   27 ++++-----------------------
 drivers/md/raid5.h |    3 ---
 2 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d07d2dce6856..e53b8f499a4c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4344,7 +4344,8 @@ static void handle_stripe(struct stripe_head *sh)
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
-	if (s.handle_bad_blocks) {
+	if (s.handle_bad_blocks ||
+	    test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
 		goto finish;
 	}
@@ -4632,15 +4633,8 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi)) {
-		if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
-			spin_lock_irq(&conf->device_lock);
-			bio_list_merge(&conf->return_bi, &s.return_bi);
-			spin_unlock_irq(&conf->device_lock);
-			md_wakeup_thread(conf->mddev->thread);
-		} else
-			return_io(&s.return_bi);
-	}
+	if (!bio_list_empty(&s.return_bi))
+		return_io(&s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
@@ -5846,18 +5840,6 @@ static void raid5d(struct md_thread *thread)
 
 	md_check_recovery(mddev);
 
-	if (!bio_list_empty(&conf->return_bi) &&
-	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-		struct bio_list tmp = BIO_EMPTY_LIST;
-		spin_lock_irq(&conf->device_lock);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			bio_list_merge(&tmp, &conf->return_bi);
-			bio_list_init(&conf->return_bi);
-		}
-		spin_unlock_irq(&conf->device_lock);
-		return_io(&tmp);
-	}
-
 	blk_start_plug(&plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
@@ -6490,7 +6472,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
-	bio_list_init(&conf->return_bi);
 	init_llist_head(&conf->released_stripes);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 57ec49f0839e..f654f8207a44 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -482,9 +482,6 @@ struct r5conf {
 	int			skip_copy; /* Don't copy data from bio to stripe cache */
 	struct list_head	*last_hold; /* detect hold_list promotions */
 
-	/* bios to have bi_end_io called after metadata is synced */
-	struct bio_list		return_bi;
-
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.




* [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it to count the stripe_heads that a
WRITE bio has been attached to.

So raid5_make_request() calls md_write_start() and then md_write_end().
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
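
The accounting described above can be sketched in userspace as follows.
This is only an illustrative model, not the kernel code: every toy_*
name is hypothetical, and the real md_write_start()/md_write_end() do
considerably more than bump a counter.

```c
#include <assert.h>

/* Hypothetical stand-in for the ->writes_pending counter in struct mddev. */
struct toy_mddev {
	int writes_pending;	/* current count of pending writes */
	int peak;		/* highest value seen, for illustration only */
};

/* Models md_write_start(): take one reference. */
static void toy_write_start(struct toy_mddev *m)
{
	m->writes_pending++;
	if (m->writes_pending > m->peak)
		m->peak = m->writes_pending;
}

/* Models md_write_end(): drop one reference. */
static void toy_write_end(struct toy_mddev *m)
{
	assert(m->writes_pending > 0);
	m->writes_pending--;
}

/*
 * Models the submission path for a WRITE bio spanning nr_stripes
 * stripe_heads: one reference is held across the loop, and one more is
 * taken each time the bio is attached to a stripe_head.
 */
static void toy_make_request(struct toy_mddev *m, int nr_stripes)
{
	int i;

	toy_write_start(m);		/* held while attaching to stripes */
	for (i = 0; i < nr_stripes; i++)
		toy_write_start(m);	/* per-stripe, as in add_stripe_bio() */
	toy_write_end(m);		/* submission-side reference dropped */
}

/* Models a completion routine: always drops the per-stripe reference. */
static void toy_stripe_done(struct toy_mddev *m)
{
	toy_write_end(m);
}
```

In this model a write that touches three stripe_heads leaves the count
at three after submission, and the count returns to zero only once all
three stripes have completed, without consulting any per-bio counter.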

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index df88656d8798..d07d2dce6856 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2998,6 +2998,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		bi->bi_next = *bip;
 	*bip = bi;
 	raid5_inc_bi_active_stripes(bi);
+	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -3121,10 +3122,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3145,10 +3145,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = bi2;
 		}
 
@@ -3490,10 +3489,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				while (wbi && wbi->bi_iter.bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_active_stripes(wbi)) {
-						md_write_end(conf->mddev);
+					md_write_end(conf->mddev);
+					if (!raid5_dec_bi_active_stripes(wbi))
 						bio_list_add(return_bi, wbi);
-					}
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -5142,9 +5140,9 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		release_stripe_plug(mddev, sh);
 	}
 
+	md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-		md_write_end(mddev);
 		bio_endio(bi);
 	}
 }
@@ -5173,8 +5171,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		/* ret == -EAGAIN, fallback */
 	}
 
-	md_write_start(mddev, bi);
-
 	/*
 	 * If array is degraded, better not do chunk aligned read because
 	 * later we might have to read it again in order to reconstruct
@@ -5196,6 +5192,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5324,11 +5321,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (rw == WRITE)
+		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
 
-		if ( rw == WRITE )
-			md_write_end(mddev);
 
 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
 					 bi, 0);




* [md PATCH 1/5] md: optimize md_write_start() slightly.
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

If md_write_start() finds that ->writes_pending is non-zero, it
should be able to avoid most of the other checks.

To ensure that a non-zero ->writes_pending does mean that the other
checks have completed, move the increment down until after ->in_sync is
known to be clear.  To avoid races with places like array_state_store()
which possibly set ->in_sync, we need to increment ->writes_pending
inside the locked region.  As ->writes_pending is now incremented
*after* ->in_sync is tested, we must always take the spin_lock on that
path, but that path is only taken if ->writes_pending was found to be
zero.

If ->writes_pending is found to be non-zero, we still need to wait if
MD_CHANGE_PENDING is set.

In the common case, md_write_start() will now only
 - check if data_dir is WRITE
 - increment ->writes_pending
 - check MD_CHANGE_PENDING is cleared.
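
The fast-path/slow-path split can be illustrated with a small userspace
sketch using C11 atomics.  It is an assumption-laden model rather than
the patch itself: the toy_* names are hypothetical, a trivial spinlock
stands in for mddev->lock, and the ->ro handling and the final wait on
MD_CHANGE_PENDING are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few mddev fields this path touches. */
struct toy_array {
	atomic_int writes_pending;
	atomic_flag lock;	/* toy spinlock, stands in for mddev->lock */
	bool in_sync;
};

static void toy_array_init(struct toy_array *a)
{
	atomic_init(&a->writes_pending, 0);
	atomic_flag_clear(&a->lock);
	a->in_sync = true;
}

/* Userspace equivalent of the idea behind atomic_inc_not_zero(). */
static bool toy_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Returns true if this call had to clear in_sync (the slow path). */
static bool toy_write_start(struct toy_array *a)
{
	bool did_change = false;

	/* Fast path: another write is already pending, checks are done. */
	if (toy_inc_not_zero(&a->writes_pending))
		return false;

	/* Slow path: increment only after in_sync is known to be clear. */
	while (atomic_flag_test_and_set(&a->lock))
		;
	if (a->in_sync) {
		a->in_sync = false;
		did_change = true;
	}
	atomic_fetch_add(&a->writes_pending, 1);
	atomic_flag_clear(&a->lock);

	return did_change;
}
```

Because the increment on the slow path happens under the lock and after
->in_sync has been cleared, a caller that sees a non-zero count on the
fast path can rely on the array already being marked as not in sync.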

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f007b68..2f21f6c7156f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7686,20 +7686,18 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	int did_change = 0;
 	if (bio_data_dir(bi) != WRITE)
 		return;
-
-	BUG_ON(mddev->ro == 1);
-	if (mddev->ro == 2) {
-		/* need to switch to read/write */
-		mddev->ro = 0;
-		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
-		md_wakeup_thread(mddev->sync_thread);
-		did_change = 1;
-	}
-	atomic_inc(&mddev->writes_pending);
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
-	if (mddev->in_sync) {
+	if (!atomic_inc_not_zero(&mddev->writes_pending)) {
+		BUG_ON(mddev->ro == 1);
+		if (mddev->ro == 2) {
+			/* need to switch to read/write */
+			mddev->ro = 0;
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			md_wakeup_thread(mddev->thread);
+			md_wakeup_thread(mddev->sync_thread);
+			did_change = 1;
+		}
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
@@ -7708,10 +7706,12 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 			md_wakeup_thread(mddev->thread);
 			did_change = 1;
 		}
+		atomic_inc(&mddev->writes_pending);
 		spin_unlock(&mddev->lock);
+
+		if (did_change)
+			sysfs_notify_dirent_safe(mddev->sysfs_state);
 	}
-	if (did_change)
-		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 }




* [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

There are 2 problems with using bi_phys_segments as a counter:
1/ we only use 16 bits, which limits bios to 256M
2/ it is poor form to reuse a field like this.  It interferes
   with other changes to bios.
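
As a rough check of the 256M figure in point 1, the arithmetic can be
written out as below.  The sketch assumes that each counted stripe
covers one 4 KiB page and that the counter is 16 bits wide; both
constants and the toy_* names are illustrative assumptions, not values
read from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one counted stripe corresponds to one 4 KiB page. */
#define TOY_STRIPE_BYTES	4096u
/* Assumption: the active-stripe count is kept in a 16-bit field. */
#define TOY_COUNTER_BITS	16u

/*
 * Approximate upper bound on the bio size that such a counter can
 * track: the number of distinct counter values times the stripe size.
 */
static uint64_t toy_max_bio_bytes(unsigned int counter_bits,
				  uint64_t stripe_bytes)
{
	assert(counter_bits < 64);
	return ((uint64_t)1 << counter_bits) * stripe_bytes;
}
```

Under those assumptions, toy_max_bio_bytes(TOY_COUNTER_BITS,
TOY_STRIPE_BYTES) evaluates to 2^16 * 4096 bytes, i.e. 256 MiB.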

We need to clean up a few things before we can change to use the
counter which is now available inside a bio.

I have only tested this lightly.  More review and testing would be
appreciated.

NeilBrown

---

NeilBrown (5):
      md: optimize md_write_start() slightly.
      md/raid5: use md_write_start to count stripes, not bios
      md/raid5: simplfy delaying of writes while metadata is updated.
      md/raid5: call bio_endio() directly rather than queuing for later.
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter

 drivers/md/md.c    |   32 ++++++------
 drivers/md/raid5.c |  136 +++++++++++-----------------------------------------
 drivers/md/raid5.h |    4 --
 3 files changed, 44 insertions(+), 128 deletions(-)


