Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: Level change from 4 disk RAID5 to 4 disk RAID6
From: NeilBrown @ 2017-04-10  1:04 UTC (permalink / raw)
  To: LM, linux-raid
In-Reply-To: <20170408214239.GF10267@lars-laptop>

[-- Attachment #1: Type: text/plain, Size: 1632 bytes --]

On Sat, Apr 08 2017, LM wrote:

> Hi,
>
> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
> replace the 4 disks by 4 disks with a size of 2TB each.
>
> As far as I understand the man page, this can be achieved by replacing
> the devices one after another and for each device rebuild the degraded
> array with:
>
>    mdadm /dev/md0 --add /dev/sdX1
>
> Then the level change can be done together with growing the array:
>
>    mdadm --grow /dev/md0 --level=raid6 --backup-file=/root/backup-md0
>
> Does this work?
>
> I am asking if it works, because the man page also says:
>
>> mdadm --grow /dev/md4 --level=6 --backup-file=/root/backup-md4
>>        The array /dev/md4 which is currently a RAID5 array will
>>        be converted to RAID6.  There should normally already be
>>        a spare drive attached to the array as a RAID6 needs one
>>        more drive than a matching RAID5.
>
> And in my case only the size of disks is increased, not their number.
>

Yes, it probably works, and you probably don't need a backup file.
Though you might need to explicitly tell mdadm to keep the number of
devices unchanged by specifying "--raid-disk=4".

You probably aren't very encouraged that I say "probably" and "might",
and this is deliberate.

I recommend that you crate 4 10Meg files, use losetup to create 10M
devices, and build a RAID5 over them with --size=5M.
Then try the --grow --level=6 command, and see what happens.
If you mess up, you can easily start from scratch and try again.
If it works, you can have some confidence that the same process will
have the same result on real devices.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: NeilBrown @ 2017-04-09 23:01 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <wrfjinmgqkjc.fsf@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 583 bytes --]

On Fri, Apr 07 2017, jes.sorensen@gmail.com wrote:

>
> Next question since I am wearing my 'what is this old stuff doing' hat.
> mdassemble? Does anything still use this? The reason is a lot of the
> newer features are explicitly included, and switching to sysfs is
> effectively going to kill it, unless it gets a major upgrade.
>

I was never a big fan, of mdassemble, but it is smaller than mdadm and
some people apparently have (or had) space-constrained boot
environments.
Maybe post to the linux-raid list with a subject "mdassemble is going
way, do you care?". ??

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Level change from 4 disk RAID5 to 4 disk RAID6
From: LM @ 2017-04-08 21:42 UTC (permalink / raw)
  To: linux-raid

Hi,

I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
replace the 4 disks by 4 disks with a size of 2TB each.

As far as I understand the man page, this can be achieved by replacing
the devices one after another and for each device rebuild the degraded
array with:

   mdadm /dev/md0 --add /dev/sdX1

Then the level change can be done together with growing the array:

   mdadm --grow /dev/md0 --level=raid6 --backup-file=/root/backup-md0

Does this work?

I am asking if it works, because the man page also says:

> mdadm --grow /dev/md4 --level=6 --backup-file=/root/backup-md4
>        The array /dev/md4 which is currently a RAID5 array will
>        be converted to RAID6.  There should normally already be
>        a spare drive attached to the array as a RAID6 needs one
>        more drive than a matching RAID5.

And in my case only the size of disks is increased, not their number.

Thanks,
Lars

^ permalink raw reply

* Re: always use REQ_OP_WRITE_ZEROES for zeroing offload V2
From: Jens Axboe @ 2017-04-08 17:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: martin.petersen, agk, snitzer, shli, philipp.reisner,
	lars.ellenberg, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid
In-Reply-To: <20170405172125.22600-1-hch@lst.de>

On Wed, Apr 05 2017, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.

Added for 4.12, thanks Christoph.

-- 
Jens Axboe

^ permalink raw reply

* 1506 linux-raid
From: xa0ajutor @ 2017-04-08  4:44 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: 9165107405.zip --]
[-- Type: application/zip, Size: 3589 bytes --]

^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: jes.sorensen @ 2017-04-07 15:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <87vaqhyx4i.fsf@notabene.neil.brown.name>

NeilBrown <neilb@suse.com> writes:
> On Thu, Apr 06 2017, Jes Sorensen wrote:
>> Neil,
>>
>> I see, thanks for explaining.
>>
>> The goal is to eventually get out of the ioctl() business and get to a 
>> state where we can do everything via sysfs/configfs. Right now we have a 
>> big mix between ioctl and sysfs where neither interface does everything. 
>> The recent issues with PPL (I think it was) showed that we had to add 
>> more ioctl support because the interfaces needed to do it for sysfs 
>> weren't quite there. My long term goal is to get that situation improved 
>> so we can avoid adding anymore ioctl interfaces and eventually allow for 
>> distros to build mdadm with ioctl support disabled. We had a discussion 
>> at LSF/MM in Boston about this (Hannes, Shaohua, Song, and myself).
>
> Sounds like a good goal, if approached cautiously (and it does sound
> like you are showing proper caution).
> And I don't think we need the "/configfs" bit.  Configfs is just for
> people who don't understand sysfs ;-)

Glad to hear we're in agreement on this. I definitely want to be
cautious about this. While change is good, we have a responsibility
towards existing users. This isn't a desktop environment after all :)

I am not really tied to sysfs/configfs on this, whoever does that part
of the work gets to show us what he/she works best and explain why.

>> I think it's fair to draw a line in the sand and say that mdadm-4.1+ 
>> will not support kernels older than 2.6.15. I am open to the kernel 
>> version we pick here, but I would like to start deprecating some of the 
>> really old code. I have patches that does this in my tree, but I need to 
>> add a check for kernel version > 2.6.15. I am not aware what SuSE's 
>> enterprise kernel versions look like, but checking RHEL/CentOS RHEL5 was 
>> 2.6.18, while RHEL4 was 2.6.9 - and RHEL4 has been unsupported for quite 
>> a while. At least for RHEL/CentOS 2.6.15 as the line in the sand seems fine.
>
> With my SUSE hat on, I'm happy for new mdadm to not support kernels
> older than 3.0.  Probably even 3.12.

I just pushed the first set of changes into git for this. We no longer
support kernels older than 2.6.15.

If it breaks something else, I am ready to take public blame! If it ends
up biting another distro for a slightly older kernel, we can look at
fixing that up.

>> For the kernel to expose features to userland in the future, I would 
>> prefer to go with a feature-flag style interface exposed via sysfs. That 
>> way a distro could enable one feature, but not the other in their kernel 
>> without having to worry about actual version numbers.
>
> I think there are usually better ways than feature flags.
> If the new feature requires a new file in sysfs, then the existence of
> the file signals the presence of the feature.

That is pretty much how I see feature flags.

Next question since I am wearing my 'what is this old stuff doing' hat.
mdassemble? Does anything still use this? The reason is a lot of the
newer features are explicitly included, and switching to sysfs is
effectively going to kill it, unless it gets a major upgrade.

Cheers,
Jes

^ permalink raw reply

* Re: [PATCH 0/2] Trace completion of all bios
From: Jens Axboe @ 2017-04-07 15:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <149152735878.17489.16036848644242802943.stgit@noble>

On 04/06/2017 07:10 PM, NeilBrown wrote:
> Hi Jens,
>  I think all objections to this patch have been answered so I'm
>  resending.
>  I've added a small cleanup patch first which makes some small
>  enhancements to the documentation and #defines for the bio->flags
>  field.

Added for 4.12. I hand applied 2/2, since it did not apply directly to
the 4.12 branch.

-- 
Jens Axboe

^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Jes Sorensen @ 2017-04-07 14:54 UTC (permalink / raw)
  To: Coly Li, NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <ff4b41cf-1ece-bac0-f39a-4361cad0354e@suse.de>

On 04/06/2017 11:54 PM, Coly Li wrote:
> On 2017/4/7 上午2:14, Jes Sorensen wrote:
>> A volunteer! A volunteer!
>>
>> I don't think we need to set a firm timeline for removing code from the
>> kernel at this point. What we can do is implement the new interfaces via
>> sysfs/configfs, and then make ioctl support a config option. Down the
>> line it will be evident if/when we can rip out the old code, but I see
>> that being years away.
>
> Not removing code, just after a timeline, new patch which adding new
> interface should go into configfs.
> I start to look into this now, hope to compose a first version patch set
> for comments in not too long time.
>
> Coly

Sounds good - make sure to also consult with Shaohua how it likes to see 
it implemented on the kernel side.

Cheers,
Jes

^ permalink raw reply

* Re: mdadm:compiled warning in mdadm.c:1974:29 treats as errors
From: Jes Sorensen @ 2017-04-07 14:53 UTC (permalink / raw)
  To: NeilBrown, Liu Zhilong; +Cc: linux-raid
In-Reply-To: <87r315yupy.fsf@notabene.neil.brown.name>

On 04/06/2017 07:36 PM, NeilBrown wrote:
> On Thu, Apr 06 2017, Jes Sorensen wrote:
>
>> On 04/06/2017 04:21 AM, Liu Zhilong wrote:
>>> hi,
>>>
>>> I found this compiling warning, and can reproduce on my Leap42.1,
>>> Leap42.2 and SLES12 SP2 with v4.8.5 version of gcc.
>>>
>>> # gcc -v
>>> Using built-in specs.
>>> COLLECT_GCC=gcc
>>> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
>>> Target: x86_64-suse-linux
>>> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
>>> --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
>>> --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
>>> --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8
>>> --enable-ssp --disable-libssp --disable-plugin
>>> --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
>>> --disable-libgcj --disable-libmudflap --with-slibdir=/lib64
>>> --with-system-zlib --enable-__cxa_atexit
>>> --enable-libstdcxx-allocator=new --disable-libstdcxx-pch
>>> --enable-version-specific-runtime-libs --enable-linker-build-id
>>> --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind
>>> --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
>>> --host=x86_64-suse-linux
>>> Thread model: posix
>>> gcc version 4.8.5 (SUSE Linux)
>>>
>>> # make everything
>>> ... ...
>>> mdadm.c: In function ‘main’:
>>> mdadm.c:1974:29: error: ‘mdfd’ may be used uninitialized in this
>>> function [-Werror=maybe-uninitialized]
>>>    if (dv->devname[0] == '/' || mdfd < 0)
>>>                              ^
>>> mdadm.c:1914:7: note: ‘mdfd’ was declared here
>>>    int mdfd;
>>>        ^
>>> cc1: all warnings being treated as errors
>>> Makefile:206: recipe for target 'mdadm.Os' failed
>>> make: *** [mdadm.Os] Error 1
>>
>> Crappy compiler, you're running an old gcc. But sure, send me a patch
>> that initializes mdfd to -1.
>
> I would rather make the code more obviously correct.
>
> 		if (dv->devname[0] == '/' ||
> 		    (mdfd = open_dev(dv->devname)) < 0)
> 			mdfd = open_mddev(dv->devname, 1);
>
> well... it is more obvious to me.

I really don't like these convoluted if statements, especially the 
recursive assignment of mdfd here.

Cheers,
Jes



^ permalink raw reply

* [RFC PATCH v5] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2017-04-07 10:47 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad,
	Binoy Jayan
In-Reply-To: <1491562064-23591-1-git-send-email-binoy.jayan@linaro.org>

Currently, the iv generation algorithms are implemented in dm-crypt.c.
The goal is to move these algorithms from the dm layer to the kernel
crypto layer by implementing them as template ciphers so they can be
implemented in hardware for performance. As part of this patchset, the
iv-generation code is moved from the dm layer to the crypto layer and
adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains an in memory representation of physically
contiguous disk blocks. The dm layer sets up a chained scatterlist of
these blocks split into physically contiguous segments in memory so that
DMA can be performed. Also, the key management code is moved from dm layer
to the cryto layer since the key selection for encrypting neighboring
sectors depend on the keycount.

Synchronous crypto requests to encrypt/decrypt a sector are processed
sequentially. Asynchronous requests if processed in parallel, are freed
in the async callback. The dm layer allocates space for iv. The hardware
implementations can choose to make use of this space to generate their IVs
sequentially or allocate it on their own.
Interface to the crypto layer - include/crypto/geniv.h

Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 drivers/md/dm-crypt.c  | 1916 ++++++++++++++++++++++++++++++++++--------------
 include/crypto/geniv.h |   47 ++
 2 files changed, 1424 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a363..ce2bb80 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -32,170 +32,113 @@
 #include <crypto/algapi.h>
 #include <crypto/skcipher.h>
 #include <keys/user-type.h>
-
 #include <linux/device-mapper.h>
-
-#define DM_MSG_PREFIX "crypt"
-
-/*
- * context holding the current state of a multi-part conversion
- */
-struct convert_context {
-	struct completion restart;
-	struct bio *bio_in;
-	struct bio *bio_out;
-	struct bvec_iter iter_in;
-	struct bvec_iter iter_out;
-	sector_t cc_sector;
-	atomic_t cc_pending;
-	struct skcipher_request *req;
+#include <crypto/internal/skcipher.h>
+#include <linux/backing-dev.h>
+#include <linux/log2.h>
+#include <crypto/geniv.h>
+
+#define DM_MSG_PREFIX		"crypt"
+#define MAX_SG_LIST		(BIO_MAX_PAGES * 8)
+#define MIN_IOS			64
+#define LMK_SEED_SIZE		64 /* hash + 0 */
+#define TCW_WHITENING_SIZE	16
+
+struct geniv_ctx;
+struct geniv_req_ctx;
+
+/* Sub request for each of the skcipher_request's for a segment */
+struct geniv_subreq {
+	struct scatterlist src;
+	struct scatterlist dst;
+	struct geniv_req_ctx *rctx;
+	struct skcipher_request req CRYPTO_MINALIGN_ATTR;
 };
 
-/*
- * per bio private data
- */
-struct dm_crypt_io {
-	struct crypt_config *cc;
-	struct bio *base_bio;
-	struct work_struct work;
-
-	struct convert_context ctx;
-
-	atomic_t io_pending;
-	int error;
-	sector_t sector;
-
-	struct rb_node rb_node;
-} CRYPTO_MINALIGN_ATTR;
-
-struct dm_crypt_request {
-	struct convert_context *ctx;
-	struct scatterlist sg_in;
-	struct scatterlist sg_out;
+struct geniv_req_ctx {
+	struct geniv_subreq *subreq;
+	int is_write;
 	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+	struct completion restart;
+	atomic_t req_pending;
+	struct skcipher_request *req;
 };
 
-struct crypt_config;
-
 struct crypt_iv_operations {
-	int (*ctr)(struct crypt_config *cc, struct dm_target *ti,
-		   const char *opts);
-	void (*dtr)(struct crypt_config *cc);
-	int (*init)(struct crypt_config *cc);
-	int (*wipe)(struct crypt_config *cc);
-	int (*generator)(struct crypt_config *cc, u8 *iv,
-			 struct dm_crypt_request *dmreq);
-	int (*post)(struct crypt_config *cc, u8 *iv,
-		    struct dm_crypt_request *dmreq);
+	int (*ctr)(struct geniv_ctx *ctx);
+	void (*dtr)(struct geniv_ctx *ctx);
+	int (*init)(struct geniv_ctx *ctx);
+	int (*wipe)(struct geniv_ctx *ctx);
+	int (*generator)(struct geniv_ctx *ctx,
+			 struct geniv_req_ctx *rctx,
+			 struct geniv_subreq *subreq);
+	int (*post)(struct geniv_ctx *ctx,
+		    struct geniv_req_ctx *rctx,
+		    struct geniv_subreq *subreq);
 };
 
-struct iv_essiv_private {
+struct geniv_essiv_private {
 	struct crypto_ahash *hash_tfm;
 	u8 *salt;
 };
 
-struct iv_benbi_private {
+struct geniv_benbi_private {
 	int shift;
 };
 
-#define LMK_SEED_SIZE 64 /* hash + 0 */
-struct iv_lmk_private {
+struct geniv_lmk_private {
 	struct crypto_shash *hash_tfm;
 	u8 *seed;
 };
 
-#define TCW_WHITENING_SIZE 16
-struct iv_tcw_private {
+struct geniv_tcw_private {
 	struct crypto_shash *crc32_tfm;
 	u8 *iv_seed;
 	u8 *whitening;
 };
 
-/*
- * Crypt: maps a linear range of a block device
- * and encrypts / decrypts at the same time.
- */
-enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
-
-/*
- * The fields in here must be read only after initialization.
- */
-struct crypt_config {
-	struct dm_dev *dev;
-	sector_t start;
-
-	/*
-	 * pool for per bio private data, crypto requests and
-	 * encryption requeusts/buffer pages
-	 */
-	mempool_t *req_pool;
-	mempool_t *page_pool;
-	struct bio_set *bs;
-	struct mutex bio_alloc_lock;
-
-	struct workqueue_struct *io_queue;
-	struct workqueue_struct *crypt_queue;
-
-	struct task_struct *write_thread;
-	wait_queue_head_t write_thread_wait;
-	struct rb_root write_tree;
-
+struct geniv_ctx {
+	unsigned int tfms_count;
+	struct crypto_skcipher *child;
+	struct crypto_skcipher **tfms;
+	char *ivmode;
+	unsigned int iv_size;
+	char *algname;
+	char *ivopts;
 	char *cipher;
-	char *cipher_string;
-	char *key_string;
-
+	char *ciphermode;
 	const struct crypt_iv_operations *iv_gen_ops;
 	union {
-		struct iv_essiv_private essiv;
-		struct iv_benbi_private benbi;
-		struct iv_lmk_private lmk;
-		struct iv_tcw_private tcw;
+		struct geniv_essiv_private essiv;
+		struct geniv_benbi_private benbi;
+		struct geniv_lmk_private lmk;
+		struct geniv_tcw_private tcw;
 	} iv_gen_private;
-	sector_t iv_offset;
-	unsigned int iv_size;
-
-	/* ESSIV: struct crypto_cipher *essiv_tfm */
 	void *iv_private;
-	struct crypto_skcipher **tfms;
-	unsigned tfms_count;
-
-	/*
-	 * Layout of each crypto request:
-	 *
-	 *   struct skcipher_request
-	 *      context
-	 *      padding
-	 *   struct dm_crypt_request
-	 *      padding
-	 *   IV
-	 *
-	 * The padding is added so that dm_crypt_request and the IV are
-	 * correctly aligned.
-	 */
-	unsigned int dmreq_start;
-
-	unsigned int per_bio_data_size;
-
-	unsigned long flags;
+	struct crypto_skcipher *tfm;
+	mempool_t *subreq_pool;
 	unsigned int key_size;
+	unsigned int key_extra_size;
 	unsigned int key_parts;      /* independent parts in key buffer */
-	unsigned int key_extra_size; /* additional keys length */
-	u8 key[0];
+	enum setkey_op keyop;
+	char *msg;
+	u8 *key;
 };
 
-#define MIN_IOS        64
-
-static void clone_init(struct dm_crypt_io *, struct bio *);
-static void kcryptd_queue_crypt(struct dm_crypt_io *io);
-static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+static struct crypto_skcipher *any_tfm(struct geniv_ctx *ctx)
+{
+	return ctx->tfms[0];
+}
 
-/*
- * Use this to access cipher attributes that are the same for each CPU.
- */
-static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
+static inline
+struct geniv_req_ctx *geniv_req_ctx(struct skcipher_request *req)
 {
-	return cc->tfms[0];
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	unsigned long align = crypto_skcipher_alignmask(tfm);
+
+	return (void *) PTR_ALIGN((u8 *) skcipher_request_ctx(req), align + 1);
 }
 
 /*
@@ -245,44 +188,50 @@ static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
  * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454
  */
 
-static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_plain_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
-	*(__le32 *)iv = cpu_to_le32(dmreq->iv_sector & 0xffffffff);
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le32 *)iv = cpu_to_le32(rctx->iv_sector & 0xffffffff);
 
 	return 0;
 }
 
-static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
-				struct dm_crypt_request *dmreq)
+static int crypt_iv_plain64_gen(struct geniv_ctx *ctx,
+				struct geniv_req_ctx *rctx,
+				struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
 
 	return 0;
 }
 
 /* Initialise ESSIV - compute salt but no local memory allocations */
-static int crypt_iv_essiv_init(struct crypt_config *cc)
+static int crypt_iv_essiv_init(struct geniv_ctx *ctx)
 {
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
 	struct scatterlist sg;
 	struct crypto_cipher *essiv_tfm;
 	int err;
+	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
 
-	sg_init_one(&sg, cc->key, cc->key_size);
+	sg_init_one(&sg, ctx->key, ctx->key_size);
 	ahash_request_set_tfm(req, essiv->hash_tfm);
 	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
-	ahash_request_set_crypt(req, &sg, essiv->salt, cc->key_size);
+	ahash_request_set_crypt(req, &sg, essiv->salt, ctx->key_size);
 
 	err = crypto_ahash_digest(req);
 	ahash_request_zero(req);
 	if (err)
 		return err;
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 
 	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
 			    crypto_ahash_digestsize(essiv->hash_tfm));
@@ -293,16 +242,16 @@ static int crypt_iv_essiv_init(struct crypt_config *cc)
 }
 
 /* Wipe salt and reset key derived from volume key */
-static int crypt_iv_essiv_wipe(struct crypt_config *cc)
+static int crypt_iv_essiv_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	unsigned salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
+	unsigned int salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
 	struct crypto_cipher *essiv_tfm;
 	int r, err = 0;
 
 	memset(essiv->salt, 0, salt_size);
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
 	if (r)
 		err = r;
@@ -311,42 +260,40 @@ static int crypt_iv_essiv_wipe(struct crypt_config *cc)
 }
 
 /* Set up per cpu cipher state */
-static struct crypto_cipher *setup_essiv_cpu(struct crypt_config *cc,
-					     struct dm_target *ti,
-					     u8 *salt, unsigned saltsize)
+static struct crypto_cipher *setup_essiv_cpu(struct geniv_ctx *ctx,
+					     u8 *salt, unsigned int saltsize)
 {
 	struct crypto_cipher *essiv_tfm;
 	int err;
 
 	/* Setup the essiv_tfm with the given salt */
-	essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
+	essiv_tfm = crypto_alloc_cipher(ctx->cipher, 0, CRYPTO_ALG_ASYNC);
+
 	if (IS_ERR(essiv_tfm)) {
-		ti->error = "Error allocating crypto tfm for ESSIV";
+		DMERR("Error allocating crypto tfm for ESSIV\n");
 		return essiv_tfm;
 	}
 
 	if (crypto_cipher_blocksize(essiv_tfm) !=
-	    crypto_skcipher_ivsize(any_tfm(cc))) {
-		ti->error = "Block size of ESSIV cipher does "
-			    "not match IV size of block cipher";
+	    crypto_skcipher_ivsize(any_tfm(ctx))) {
+		DMERR("Block size of ESSIV cipher does not match IV size of block cipher\n");
 		crypto_free_cipher(essiv_tfm);
 		return ERR_PTR(-EINVAL);
 	}
 
 	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
 	if (err) {
-		ti->error = "Failed to set key for ESSIV cipher";
+		DMERR("Failed to set key for ESSIV cipher\n");
 		crypto_free_cipher(essiv_tfm);
 		return ERR_PTR(err);
 	}
-
 	return essiv_tfm;
 }
 
-static void crypt_iv_essiv_dtr(struct crypt_config *cc)
+static void crypt_iv_essiv_dtr(struct geniv_ctx *ctx)
 {
 	struct crypto_cipher *essiv_tfm;
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
 
 	crypto_free_ahash(essiv->hash_tfm);
 	essiv->hash_tfm = NULL;
@@ -354,52 +301,50 @@ static void crypt_iv_essiv_dtr(struct crypt_config *cc)
 	kzfree(essiv->salt);
 	essiv->salt = NULL;
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 
 	if (essiv_tfm)
 		crypto_free_cipher(essiv_tfm);
 
-	cc->iv_private = NULL;
+	ctx->iv_private = NULL;
 }
 
-static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
+static int crypt_iv_essiv_ctr(struct geniv_ctx *ctx)
 {
 	struct crypto_cipher *essiv_tfm = NULL;
 	struct crypto_ahash *hash_tfm = NULL;
 	u8 *salt = NULL;
 	int err;
 
-	if (!opts) {
-		ti->error = "Digest algorithm missing for ESSIV mode";
+	if (!ctx->ivopts) {
+		DMERR("Digest algorithm missing for ESSIV mode\n");
 		return -EINVAL;
 	}
 
 	/* Allocate hash algorithm */
-	hash_tfm = crypto_alloc_ahash(opts, 0, CRYPTO_ALG_ASYNC);
+	hash_tfm = crypto_alloc_ahash(ctx->ivopts, 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(hash_tfm)) {
-		ti->error = "Error initializing ESSIV hash";
 		err = PTR_ERR(hash_tfm);
+		DMERR("Error initializing ESSIV hash. err=%d\n", err);
 		goto bad;
 	}
 
 	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
 	if (!salt) {
-		ti->error = "Error kmallocing salt storage in ESSIV";
 		err = -ENOMEM;
 		goto bad;
 	}
 
-	cc->iv_gen_private.essiv.salt = salt;
-	cc->iv_gen_private.essiv.hash_tfm = hash_tfm;
+	ctx->iv_gen_private.essiv.salt = salt;
+	ctx->iv_gen_private.essiv.hash_tfm = hash_tfm;
 
-	essiv_tfm = setup_essiv_cpu(cc, ti, salt,
+	essiv_tfm = setup_essiv_cpu(ctx, salt,
 				crypto_ahash_digestsize(hash_tfm));
 	if (IS_ERR(essiv_tfm)) {
-		crypt_iv_essiv_dtr(cc);
+		crypt_iv_essiv_dtr(ctx);
 		return PTR_ERR(essiv_tfm);
 	}
-	cc->iv_private = essiv_tfm;
+	ctx->iv_private = essiv_tfm;
 
 	return 0;
 
@@ -410,70 +355,73 @@ static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
 	return err;
 }
 
-static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_essiv_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
-	struct crypto_cipher *essiv_tfm = cc->iv_private;
+	u8 *iv = rctx->iv;
+	struct crypto_cipher *essiv_tfm = ctx->iv_private;
 
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
 	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
 
 	return 0;
 }
 
-static int crypt_iv_benbi_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
+static int crypt_iv_benbi_ctr(struct geniv_ctx *ctx)
 {
-	unsigned bs = crypto_skcipher_blocksize(any_tfm(cc));
+	unsigned int bs = crypto_skcipher_blocksize(any_tfm(ctx));
 	int log = ilog2(bs);
 
 	/* we need to calculate how far we must shift the sector count
-	 * to get the cipher block count, we use this shift in _gen */
+	 * to get the cipher block count, we use this shift in _gen
+	 */
 
 	if (1 << log != bs) {
-		ti->error = "cypher blocksize is not a power of 2";
+		DMERR("cypher blocksize is not a power of 2\n");
 		return -EINVAL;
 	}
 
 	if (log > 9) {
-		ti->error = "cypher blocksize is > 512";
+		DMERR("cypher blocksize is > 512\n");
 		return -EINVAL;
 	}
 
-	cc->iv_gen_private.benbi.shift = 9 - log;
+	ctx->iv_gen_private.benbi.shift = 9 - log;
 
 	return 0;
 }
 
-static void crypt_iv_benbi_dtr(struct crypt_config *cc)
-{
-}
-
-static int crypt_iv_benbi_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_benbi_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
+	u8 *iv = rctx->iv;
 	__be64 val;
 
-	memset(iv, 0, cc->iv_size - sizeof(u64)); /* rest is cleared below */
+	memset(iv, 0, ctx->iv_size - sizeof(u64)); /* rest is cleared below */
 
-	val = cpu_to_be64(((u64)dmreq->iv_sector << cc->iv_gen_private.benbi.shift) + 1);
-	put_unaligned(val, (__be64 *)(iv + cc->iv_size - sizeof(u64)));
+	val = cpu_to_be64(((u64) rctx->iv_sector <<
+			  ctx->iv_gen_private.benbi.shift) + 1);
+	put_unaligned(val, (__be64 *)(iv + ctx->iv_size - sizeof(u64)));
 
 	return 0;
 }
 
-static int crypt_iv_null_gen(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
+static int crypt_iv_null_gen(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
+	u8 *iv = rctx->iv;
 
+	memset(iv, 0, ctx->iv_size);
 	return 0;
 }
 
-static void crypt_iv_lmk_dtr(struct crypt_config *cc)
+static void crypt_iv_lmk_dtr(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
 		crypto_free_shash(lmk->hash_tfm);
@@ -483,49 +431,49 @@ static void crypt_iv_lmk_dtr(struct crypt_config *cc)
 	lmk->seed = NULL;
 }
 
-static int crypt_iv_lmk_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
+static int crypt_iv_lmk_ctr(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
 	if (IS_ERR(lmk->hash_tfm)) {
-		ti->error = "Error initializing LMK hash";
+		DMERR("Error initializing LMK hash; err=%ld\n",
+		      PTR_ERR(lmk->hash_tfm));
 		return PTR_ERR(lmk->hash_tfm);
 	}
 
 	/* No seed in LMK version 2 */
-	if (cc->key_parts == cc->tfms_count) {
+	if (ctx->key_parts == ctx->tfms_count) {
 		lmk->seed = NULL;
 		return 0;
 	}
 
 	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
 	if (!lmk->seed) {
-		crypt_iv_lmk_dtr(cc);
-		ti->error = "Error kmallocing seed storage in LMK";
+		crypt_iv_lmk_dtr(ctx);
+		DMERR("Error kmallocing seed storage in LMK\n");
 		return -ENOMEM;
 	}
 
 	return 0;
 }
 
-static int crypt_iv_lmk_init(struct crypt_config *cc)
+static int crypt_iv_lmk_init(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	int subkey_size = cc->key_size / cc->key_parts;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+	int subkey_size = ctx->key_size / ctx->key_parts;
 
 	/* LMK seed is on the position of LMK_KEYS + 1 key */
 	if (lmk->seed)
-		memcpy(lmk->seed, cc->key + (cc->tfms_count * subkey_size),
+		memcpy(lmk->seed, ctx->key + (ctx->tfms_count * subkey_size),
 		       crypto_shash_digestsize(lmk->hash_tfm));
 
 	return 0;
 }
 
-static int crypt_iv_lmk_wipe(struct crypt_config *cc)
+static int crypt_iv_lmk_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	if (lmk->seed)
 		memset(lmk->seed, 0, LMK_SEED_SIZE);
@@ -533,15 +481,14 @@ static int crypt_iv_lmk_wipe(struct crypt_config *cc)
 	return 0;
 }
 
-static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq,
-			    u8 *data)
+static int crypt_iv_lmk_one(struct geniv_ctx *ctx, u8 *iv,
+			    struct geniv_req_ctx *rctx, u8 *data)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 	struct md5_state md5state;
 	__le32 buf[4];
 	int i, r;
+	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
 
 	desc->tfm = lmk->hash_tfm;
 	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
@@ -562,8 +509,9 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 		return r;
 
 	/* Sector is cropped to 56 bits here */
-	buf[0] = cpu_to_le32(dmreq->iv_sector & 0xFFFFFFFF);
-	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
+	buf[0] = cpu_to_le32(rctx->iv_sector & 0xFFFFFFFF);
+	buf[1] = cpu_to_le32((((u64)rctx->iv_sector >> 32) & 0x00FFFFFF)
+			     | 0x80000000);
 	buf[2] = cpu_to_le32(4024);
 	buf[3] = 0;
 	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
@@ -577,50 +525,54 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 
 	for (i = 0; i < MD5_HASH_WORDS; i++)
 		__cpu_to_le32s(&md5state.hash[i]);
-	memcpy(iv, &md5state.hash, cc->iv_size);
+	memcpy(iv, &md5state.hash, ctx->iv_size);
 
 	return 0;
 }
 
-static int crypt_iv_lmk_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
+static int crypt_iv_lmk_gen(struct geniv_ctx *ctx,
+			    struct geniv_req_ctx *rctx,
+			    struct geniv_subreq *subreq)
 {
 	u8 *src;
+	u8 *iv = rctx->iv;
 	int r = 0;
 
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_lmk_one(cc, iv, dmreq, src + dmreq->sg_in.offset);
+	if (rctx->is_write) {
+		src = kmap_atomic(sg_page(&subreq->src));
+		r = crypt_iv_lmk_one(ctx, iv, rctx, src + subreq->src.offset);
 		kunmap_atomic(src);
 	} else
-		memset(iv, 0, cc->iv_size);
+		memset(iv, 0, ctx->iv_size);
 
 	return r;
 }
 
-static int crypt_iv_lmk_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
+static int crypt_iv_lmk_post(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
 {
 	u8 *dst;
+	u8 *iv = rctx->iv;
 	int r;
 
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE)
+	if (rctx->is_write)
 		return 0;
 
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_lmk_one(cc, iv, dmreq, dst + dmreq->sg_out.offset);
+	dst = kmap_atomic(sg_page(&subreq->dst));
+	r = crypt_iv_lmk_one(ctx, iv, rctx, dst + subreq->dst.offset);
 
 	/* Tweak the first block of plaintext sector */
 	if (!r)
-		crypto_xor(dst + dmreq->sg_out.offset, iv, cc->iv_size);
+		crypto_xor(dst + subreq->dst.offset, iv, ctx->iv_size);
 
 	kunmap_atomic(dst);
 	return r;
 }
 
-static void crypt_iv_tcw_dtr(struct crypt_config *cc)
+static void crypt_iv_tcw_dtr(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
 	kzfree(tcw->iv_seed);
 	tcw->iv_seed = NULL;
@@ -632,64 +584,65 @@ static void crypt_iv_tcw_dtr(struct crypt_config *cc)
 	tcw->crc32_tfm = NULL;
 }
 
-static int crypt_iv_tcw_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
+static int crypt_iv_tcw_ctr(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
-	if (cc->key_size <= (cc->iv_size + TCW_WHITENING_SIZE)) {
-		ti->error = "Wrong key size for TCW";
+	if (ctx->key_size <= (ctx->iv_size + TCW_WHITENING_SIZE)) {
+		DMERR("Wrong key size (%d) for TCW. Choose a value > %d bytes\n",
+			ctx->key_size,
+			ctx->iv_size + TCW_WHITENING_SIZE);
 		return -EINVAL;
 	}
 
 	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
 	if (IS_ERR(tcw->crc32_tfm)) {
-		ti->error = "Error initializing CRC32 in TCW";
+		DMERR("Error initializing CRC32 in TCW; err=%ld\n",
+			PTR_ERR(tcw->crc32_tfm));
 		return PTR_ERR(tcw->crc32_tfm);
 	}
 
-	tcw->iv_seed = kzalloc(cc->iv_size, GFP_KERNEL);
+	tcw->iv_seed = kzalloc(ctx->iv_size, GFP_KERNEL);
 	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
 	if (!tcw->iv_seed || !tcw->whitening) {
-		crypt_iv_tcw_dtr(cc);
-		ti->error = "Error allocating seed storage in TCW";
+		crypt_iv_tcw_dtr(ctx);
+		DMERR("Error allocating seed storage in TCW\n");
 		return -ENOMEM;
 	}
 
 	return 0;
 }
 
-static int crypt_iv_tcw_init(struct crypt_config *cc)
+static int crypt_iv_tcw_init(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	int key_offset = cc->key_size - cc->iv_size - TCW_WHITENING_SIZE;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	int key_offset = ctx->key_size - ctx->iv_size - TCW_WHITENING_SIZE;
 
-	memcpy(tcw->iv_seed, &cc->key[key_offset], cc->iv_size);
-	memcpy(tcw->whitening, &cc->key[key_offset + cc->iv_size],
+	memcpy(tcw->iv_seed, &ctx->key[key_offset], ctx->iv_size);
+	memcpy(tcw->whitening, &ctx->key[key_offset + ctx->iv_size],
 	       TCW_WHITENING_SIZE);
 
 	return 0;
 }
 
-static int crypt_iv_tcw_wipe(struct crypt_config *cc)
+static int crypt_iv_tcw_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
-	memset(tcw->iv_seed, 0, cc->iv_size);
+	memset(tcw->iv_seed, 0, ctx->iv_size);
 	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
 
 	return 0;
 }
 
-static int crypt_iv_tcw_whitening(struct crypt_config *cc,
-				  struct dm_crypt_request *dmreq,
-				  u8 *data)
+static int crypt_iv_tcw_whitening(struct geniv_ctx *ctx,
+				  struct geniv_req_ctx *rctx, u8 *data)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
 	u8 buf[TCW_WHITENING_SIZE];
-	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 	int i, r;
+	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 
 	/* xor whitening with sector number */
 	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
@@ -713,99 +666,1032 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
 	crypto_xor(&buf[0], &buf[12], 4);
 	crypto_xor(&buf[4], &buf[8], 4);
 
-	/* apply whitening (8 bytes) to whole sector */
-	for (i = 0; i < ((1 << SECTOR_SHIFT) / 8); i++)
-		crypto_xor(data + i * 8, buf, 8);
-out:
-	memzero_explicit(buf, sizeof(buf));
-	return r;
-}
+	/* apply whitening (8 bytes) to whole sector */
+	for (i = 0; i < (SECTOR_SIZE / 8); i++)
+		crypto_xor(data + i * 8, buf, 8);
+out:
+	memzero_explicit(buf, sizeof(buf));
+	return r;
+}
+
+static int crypt_iv_tcw_gen(struct geniv_ctx *ctx,
+			    struct geniv_req_ctx *rctx,
+			    struct geniv_subreq *subreq)
+{
+	u8 *iv = rctx->iv;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
+	u8 *src;
+	int r = 0;
+
+	/* Remove whitening from ciphertext */
+	if (!rctx->is_write) {
+		src = kmap_atomic(sg_page(&subreq->src));
+		r = crypt_iv_tcw_whitening(ctx, rctx,
+					   src + subreq->src.offset);
+		kunmap_atomic(src);
+	}
+
+	/* Calculate IV */
+	memcpy(iv, tcw->iv_seed, ctx->iv_size);
+	crypto_xor(iv, (u8 *)&sector, 8);
+	if (ctx->iv_size > 8)
+		crypto_xor(&iv[8], (u8 *)&sector, ctx->iv_size - 8);
+
+	return r;
+}
+
+static int crypt_iv_tcw_post(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
+{
+	u8 *dst;
+	int r;
+
+	if (!rctx->is_write)
+		return 0;
+
+	/* Apply whitening on ciphertext */
+	dst = kmap_atomic(sg_page(&subreq->dst));
+	r = crypt_iv_tcw_whitening(ctx, rctx, dst + subreq->dst.offset);
+	kunmap_atomic(dst);
+
+	return r;
+}
+
+static const struct crypt_iv_operations crypt_iv_plain_ops = {
+	.generator = crypt_iv_plain_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_plain64_ops = {
+	.generator = crypt_iv_plain64_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_essiv_ops = {
+	.ctr       = crypt_iv_essiv_ctr,
+	.dtr       = crypt_iv_essiv_dtr,
+	.init      = crypt_iv_essiv_init,
+	.wipe      = crypt_iv_essiv_wipe,
+	.generator = crypt_iv_essiv_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_benbi_ops = {
+	.ctr	   = crypt_iv_benbi_ctr,
+	.generator = crypt_iv_benbi_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_null_ops = {
+	.generator = crypt_iv_null_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_lmk_ops = {
+	.ctr	   = crypt_iv_lmk_ctr,
+	.dtr	   = crypt_iv_lmk_dtr,
+	.init	   = crypt_iv_lmk_init,
+	.wipe	   = crypt_iv_lmk_wipe,
+	.generator = crypt_iv_lmk_gen,
+	.post	   = crypt_iv_lmk_post
+};
+
+static const struct crypt_iv_operations crypt_iv_tcw_ops = {
+	.ctr	   = crypt_iv_tcw_ctr,
+	.dtr	   = crypt_iv_tcw_dtr,
+	.init	   = crypt_iv_tcw_init,
+	.wipe	   = crypt_iv_tcw_wipe,
+	.generator = crypt_iv_tcw_gen,
+	.post	   = crypt_iv_tcw_post
+};
+
+static int geniv_setkey_set(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init)
+		ret = ctx->iv_gen_ops->init(ctx);
+	return ret;
+}
+
+static int geniv_setkey_wipe(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->wipe) {
+		ret = ctx->iv_gen_ops->wipe(ctx);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+static int geniv_init_iv(struct geniv_ctx *ctx)
+{
+	int ret = -EINVAL;
+
+	DMDEBUG("IV Generation algorithm : %s\n", ctx->ivmode);
+
+	if (ctx->ivmode == NULL)
+		ctx->iv_gen_ops = NULL;
+	else if (strcmp(ctx->ivmode, "plain") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain_ops;
+	else if (strcmp(ctx->ivmode, "plain64") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain64_ops;
+	else if (strcmp(ctx->ivmode, "essiv") == 0)
+		ctx->iv_gen_ops = &crypt_iv_essiv_ops;
+	else if (strcmp(ctx->ivmode, "benbi") == 0)
+		ctx->iv_gen_ops = &crypt_iv_benbi_ops;
+	else if (strcmp(ctx->ivmode, "null") == 0)
+		ctx->iv_gen_ops = &crypt_iv_null_ops;
+	else if (strcmp(ctx->ivmode, "lmk") == 0)
+		ctx->iv_gen_ops = &crypt_iv_lmk_ops;
+	else if (strcmp(ctx->ivmode, "tcw") == 0) {
+		ctx->iv_gen_ops = &crypt_iv_tcw_ops;
+		ctx->key_parts += 2; /* IV + whitening */
+		ctx->key_extra_size = ctx->iv_size + TCW_WHITENING_SIZE;
+	} else {
+		ret = -EINVAL;
+		DMERR("Invalid IV mode %s\n", ctx->ivmode);
+		goto end;
+	}
+
+	/* Allocate IV */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->ctr) {
+		ret = ctx->iv_gen_ops->ctr(ctx);
+		if (ret < 0) {
+			DMERR("Error creating IV for %s\n", ctx->ivmode);
+			goto end;
+		}
+	}
+
+	/* Initialize IV (set keys for ESSIV etc) */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init) {
+		ret = ctx->iv_gen_ops->init(ctx);
+		if (ret < 0)
+			DMERR("Error creating IV for %s\n", ctx->ivmode);
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+static void geniv_free_tfms(struct geniv_ctx *ctx)
+{
+	unsigned int i;
+
+	if (!ctx->tfms)
+		return;
+
+	for (i = 0; i < ctx->tfms_count; i++)
+		if (ctx->tfms[i] && !IS_ERR(ctx->tfms[i])) {
+			crypto_free_skcipher(ctx->tfms[i]);
+			ctx->tfms[i] = NULL;
+		}
+
+	kfree(ctx->tfms);
+	ctx->tfms = NULL;
+}
+
+/* Allocate memory for the underlying cipher algorithm. Ex: cbc(aes)
+ */
+
+static int geniv_alloc_tfms(struct crypto_skcipher *parent,
+			    struct geniv_ctx *ctx)
+{
+	unsigned int i, reqsize, align;
+	int err = 0;
+
+	ctx->tfms = kcalloc(ctx->tfms_count, sizeof(struct crypto_skcipher *),
+			   GFP_KERNEL);
+	if (!ctx->tfms) {
+		err = -ENOMEM;
+		goto end;
+	}
+
+	/* First instance is already allocated in geniv_init_tfm */
+	ctx->tfms[0] = ctx->child;
+	for (i = 1; i < ctx->tfms_count; i++) {
+		ctx->tfms[i] = crypto_alloc_skcipher(ctx->ciphermode, 0, 0);
+		if (IS_ERR(ctx->tfms[i])) {
+			err = PTR_ERR(ctx->tfms[i]);
+			geniv_free_tfms(ctx);
+			goto end;
+		}
+
+		/* Setup the current cipher's request structure */
+		align = crypto_skcipher_alignmask(parent);
+		align &= ~(crypto_tfm_ctx_alignment() - 1);
+		reqsize = align + sizeof(struct geniv_req_ctx) +
+			  crypto_skcipher_reqsize(ctx->tfms[i]);
+		crypto_skcipher_set_reqsize(parent, reqsize);
+	}
+
+end:
+	return err;
+}
+
+/* Initialize the cipher's context with the key, ivmode and other parameters.
+ * Also allocate IV generation template ciphers and initialize them.
+ */
+
+static int geniv_setkey_init(struct crypto_skcipher *parent,
+			     struct geniv_key_info *info)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	int ret = -ENOMEM;
+
+	ctx->iv_size = crypto_skcipher_ivsize(parent);
+	ctx->tfms_count = info->tfms_count;
+	ctx->key = info->key;
+	ctx->key_size = info->key_size;
+	ctx->key_parts = info->key_parts;
+	ctx->ivopts = info->ivopts;
+
+	ret = geniv_alloc_tfms(parent, ctx);
+	if (ret)
+		goto end;
+
+	ret = geniv_init_iv(ctx);
+
+end:
+	return ret;
+}
+
+static int geniv_setkey_tfms(struct crypto_skcipher *parent,
+			     struct geniv_ctx *ctx,
+			     struct geniv_key_info *info)
+{
+	unsigned int subkey_size;
+	int ret = 0, i;
+
+	/* Ignore extra keys (which are used for IV etc) */
+	subkey_size = (ctx->key_size - ctx->key_extra_size)
+		      >> ilog2(ctx->tfms_count);
+
+	for (i = 0; i < ctx->tfms_count; i++) {
+		struct crypto_skcipher *child = ctx->tfms[i];
+		char *subkey = ctx->key + (subkey_size) * i;
+
+		crypto_skcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+		crypto_skcipher_set_flags(child,
+					  crypto_skcipher_get_flags(parent) &
+					  CRYPTO_TFM_REQ_MASK);
+		ret = crypto_skcipher_setkey(child, subkey, subkey_size);
+		if (ret) {
+			DMERR("Error setting key for tfms[%d]\n", i);
+			break;
+		}
+		crypto_skcipher_set_flags(parent,
+					  crypto_skcipher_get_flags(child) &
+					  CRYPTO_TFM_RES_MASK);
+	}
+
+	return ret;
+}
+
+static int geniv_setkey(struct crypto_skcipher *parent,
+			const u8 *key, unsigned int keylen)
+{
+	int err = 0;
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	struct geniv_key_info *info = (struct geniv_key_info *) key;
+
+	DMDEBUG("SETKEY Operation : %d\n", info->keyop);
+
+	switch (info->keyop) {
+	case SETKEY_OP_INIT:
+		err = geniv_setkey_init(parent, info);
+		break;
+	case SETKEY_OP_SET:
+		err = geniv_setkey_set(ctx);
+		break;
+	case SETKEY_OP_WIPE:
+		err = geniv_setkey_wipe(ctx);
+		break;
+	}
+
+	if (err)
+		goto end;
+
+	err = geniv_setkey_tfms(parent, ctx, info);
+
+end:
+	return err;
+}
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error);
+
+static int geniv_alloc_subreq(struct skcipher_request *req,
+			      struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx)
+{
+	int key_index, r = 0;
+	struct skcipher_request *sreq;
+
+	if (!rctx->subreq) {
+		rctx->subreq = mempool_alloc(ctx->subreq_pool, GFP_NOIO);
+		if (!rctx->subreq)
+			r = -ENOMEM;
+	}
+
+	sreq = &rctx->subreq->req;
+	rctx->subreq->rctx = rctx;
+
+	key_index = rctx->iv_sector & (ctx->tfms_count - 1);
+
+	skcipher_request_set_tfm(sreq, ctx->tfms[key_index]);
+	skcipher_request_set_callback(sreq, req->base.flags,
+				      geniv_async_done, rctx->subreq);
+	return r;
+}
+
+/* Asynchronous IO completion callback for each sector in a segment. When all
+ * pending i/o are completed the parent cipher's async function is called.
+ */
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error)
+{
+	struct geniv_subreq *subreq =
+		(struct geniv_subreq *) async_req->data;
+	struct geniv_req_ctx *rctx = subreq->rctx;
+	struct skcipher_request *req = rctx->req;
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	/*
+	 * A request from crypto driver backlog is going to be processed now,
+	 * finish the completion and continue in crypt_convert().
+	 * (Callback will be called for the second time for this request.)
+	 */
+
+	if (error == -EINPROGRESS) {
+		complete(&rctx->restart);
+		return;
+	}
+
+	if (!error && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+		error = ctx->iv_gen_ops->post(ctx, rctx, subreq);
+
+	mempool_free(subreq, ctx->subreq_pool);
+
+	/* req_pending needs to be checked before req->base.complete is called
+	 * as we need 'req_pending' to be equal to 1 to ensure all subrequests
+	 * are processed.
+	 */
+	if (!atomic_dec_and_test(&rctx->req_pending)) {
+		/* Call the parent cipher's completion function */
+		skcipher_request_complete(req, error);
+	}
+}
+
+static unsigned int geniv_get_sectors(struct scatterlist *sg1,
+				      struct scatterlist *sg2,
+				      unsigned int segments)
+{
+	unsigned int i, n1, n2, nents;
+
+	n1 = n2 = 0;
+	for (i = 0; i < segments ; i++) {
+		n1 += sg1[i].length >> SECTOR_SHIFT;
+		n1 += (sg1[i].length & ~SECTOR_MASK) ? 1 : 0;
+	}
+
+	for (i = 0; i < segments ; i++) {
+		n2 += sg2[i].length >> SECTOR_SHIFT;
+		n2 += (sg2[i].length & ~SECTOR_MASK) ? 1 : 0;
+	}
+
+	nents = n1 > n2 ? n1 : n2;
+	return nents;
+}
+
+/* Iterate scatterlist of segments to retrieve the 512-byte sectors so that
+ * unique IVs could be generated for each 512-byte sector. This split may not
+ * be necessary e.g. when these ciphers are modelled in hardware, where it can
+ * make use of the hardware's IV generation capabilities.
+ */
+
+static int geniv_iter_block(struct skcipher_request *req,
+			    struct geniv_subreq *subreq,
+			    struct geniv_req_ctx *rctx,
+			    unsigned int *seg_no,
+			    unsigned int *done)
+
+{
+	unsigned int srcoff, dstoff, len, rem;
+	struct scatterlist *src1, *dst1, *src2, *dst2;
+
+	if (unlikely(*seg_no >= rctx->nents))
+		return 0; /* done */
+
+	src1 = &req->src[*seg_no];
+	dst1 = &req->dst[*seg_no];
+	src2 = &subreq->src;
+	dst2 = &subreq->dst;
+
+	if (*done >= src1->length) {
+		(*seg_no)++;
+
+		if (*seg_no >= rctx->nents)
+			return 0; /* done */
+
+		src1 = &req->src[*seg_no];
+		dst1 = &req->dst[*seg_no];
+		*done = 0;
+	}
+
+	srcoff = src1->offset + *done;
+	dstoff = dst1->offset + *done;
+	rem = src1->length - *done;
+
+	len = rem > SECTOR_SIZE ? SECTOR_SIZE : rem;
+
+	DMDEBUG("segment:(%d/%u), srcoff:%d, dstoff:%d, done:%d, rem:%d\n",
+		*seg_no + 1, rctx->nents, srcoff, dstoff, *done, rem);
+
+	sg_init_table(src2, 1);
+	sg_set_page(src2, sg_page(src1), len, srcoff);
+	sg_init_table(dst2, 1);
+	sg_set_page(dst2, sg_page(dst1), len, dstoff);
+
+	*done += len;
+
+	return len; /* bytes returned */
+}
+
+/* Common encryt/decrypt function for geniv template cipher. Before the crypto
+ * operation, it splits the memory segments (in the scatterlist) into 512 byte
+ * sectors. The initialization vector(IV) used is based on a unique sector
+ * number which is generated here.
+ */
+static int geniv_crypt(struct skcipher_request *req, int encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_req_ctx *rctx = geniv_req_ctx(req);
+	struct geniv_req_info *rinfo = (struct geniv_req_info *) req->iv;
+	int i, bytes, cryptlen, ret = 0;
+	unsigned int sectors, segno = 0, done = 0;
+	char *str __maybe_unused = encrypt ? "encrypt" : "decrypt";
+
+	/* Instance of 'struct geniv_req_info' is stored in IV ptr */
+	rctx->is_write = encrypt;
+	rctx->iv_sector = rinfo->iv_sector;
+	rctx->nents = rinfo->nents;
+	rctx->iv = rinfo->iv;
+	rctx->req = req;
+	rctx->subreq = NULL;
+	cryptlen = req->cryptlen;
+
+	DMDEBUG("geniv:%s: starting sector=%d, #segments=%u\n", str,
+		(unsigned int) rctx->iv_sector, rctx->nents);
+
+	sectors = geniv_get_sectors(req->src, req->dst, rctx->nents);
+
+	init_completion(&rctx->restart);
+	atomic_set(&rctx->req_pending, 1);
+
+	for (i = 0; i < sectors; i++) {
+		struct geniv_subreq *subreq;
+
+		ret = geniv_alloc_subreq(req, ctx, rctx);
+		if (ret)
+			goto end;
+
+		subreq = rctx->subreq;
+		subreq->rctx = rctx;
+
+		atomic_inc(&rctx->req_pending);
+		bytes = geniv_iter_block(req, subreq, rctx, &segno, &done);
+
+		if (bytes == 0)
+			break;
+
+		cryptlen -= bytes;
+
+		if (ctx->iv_gen_ops)
+			ret = ctx->iv_gen_ops->generator(ctx, rctx, subreq);
+
+		if (ret < 0) {
+			DMERR("Error in generating IV ret: %d\n", ret);
+			goto end;
+		}
+
+		skcipher_request_set_crypt(&subreq->req, &subreq->src,
+					   &subreq->dst, bytes, rctx->iv);
+
+		if (encrypt)
+			ret = crypto_skcipher_encrypt(&subreq->req);
+
+		else
+			ret = crypto_skcipher_decrypt(&subreq->req);
+
+		if (!ret && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+			ret = ctx->iv_gen_ops->post(ctx, rctx, subreq);
+
+		switch (ret) {
+		/*
+		 * The request was queued by a crypto driver
+		 * but the driver request queue is full, let's wait.
+		 */
+		case -EBUSY:
+			wait_for_completion(&rctx->restart);
+			reinit_completion(&rctx->restart);
+			/* fall through */
+		/*
+		 * The request is queued and processed asynchronously,
+		 * completion function geniv_async_done() is called.
+		 */
+		case -EINPROGRESS:
+			/* Marking this NULL lets the creation of a new sub-
+			 * request when 'geniv_alloc_subreq' is called.
+			 */
+			rctx->subreq = NULL;
+			rctx->iv_sector++;
+			cond_resched();
+			break;
+		/*
+		 * The request was already processed (synchronously).
+		 */
+		case 0:
+			atomic_dec(&rctx->req_pending);
+			rctx->iv_sector++;
+			cond_resched();
+			continue;
+
+		/* There was an error while processing the request. */
+		default:
+			atomic_dec(&rctx->req_pending);
+			return ret;
+		}
+
+		if (ret)
+			break;
+	}
+
+	if (rctx->subreq && atomic_read(&rctx->req_pending) == 1) {
+		DMDEBUG("geniv:%s: Freeing sub request\n", str);
+		mempool_free(rctx->subreq, ctx->subreq_pool);
+	}
+
+end:
+	return ret;
+}
+
+static int geniv_encrypt(struct skcipher_request *req)
+{
+	return geniv_crypt(req, 1);
+}
+
+static int geniv_decrypt(struct skcipher_request *req)
+{
+	return geniv_crypt(req, 0);
+}
+
+static int geniv_init_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	unsigned int reqsize, align;
+	char *algname, *chainmode;
+	int psize, ret = 0;
+
+	algname = (char *) crypto_tfm_alg_name(crypto_skcipher_tfm(tfm));
+	ctx->ciphermode = kmalloc(CRYPTO_MAX_ALG_NAME, GFP_KERNEL);
+	if (!ctx->ciphermode) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ctx->algname = kmalloc(CRYPTO_MAX_ALG_NAME, GFP_KERNEL);
+	if (!ctx->algname) {
+		ret = -ENOMEM;
+		goto free_ciphermode;
+	}
+
+	strlcpy(ctx->algname, algname, CRYPTO_MAX_ALG_NAME);
+	algname = ctx->algname;
+
+	/* Parse the algorithm name 'ivmode(chainmode(cipher))' */
+	ctx->ivmode	= strsep(&algname, "(");
+	chainmode	= strsep(&algname, "(");
+	ctx->cipher	= strsep(&algname, ")");
+
+	snprintf(ctx->ciphermode, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		 chainmode, ctx->cipher);
+
+	DMDEBUG("ciphermode=%s, ivmode=%s\n", ctx->ciphermode, ctx->ivmode);
+
+	/*
+	 * Usually the underlying cipher instances are spawned here, but since
+	 * the value of tfms_count (which is equal to the key_count) is not
+	 * known yet, create only one instance and delay the creation of the
+	 * rest of the instances of the underlying cipher 'cbc(aes)' until
+	 * the setkey operation is invoked.
+	 * The first instance created i.e. ctx->child will later be assigned as
+	 * the 1st element in the array ctx->tfms. Creation of atleast one
+	 * instance of the cipher is necessary to be created here to uncover
+	 * any errors earlier than during the setkey operation later where the
+	 * remaining instances are created.
+	 */
+	ctx->child = crypto_alloc_skcipher(ctx->ciphermode, 0, 0);
+	if (IS_ERR(ctx->child)) {
+		ret = PTR_ERR(ctx->child);
+		DMERR("Failed to create skcipher %s. err %d\n",
+		      ctx->ciphermode, ret);
+		goto free_algname;
+	}
+
+	/* Setup the current cipher's request structure */
+	align = crypto_skcipher_alignmask(tfm);
+	align &= ~(crypto_tfm_ctx_alignment() - 1);
+	reqsize = align + sizeof(struct geniv_req_ctx)
+			+ crypto_skcipher_reqsize(ctx->child);
+	crypto_skcipher_set_reqsize(tfm, reqsize);
+
+	/* create memory pool for sub-request structure */
+	psize = sizeof(struct geniv_subreq)
+			+ crypto_skcipher_reqsize(ctx->child);
+	ctx->subreq_pool = mempool_create_kmalloc_pool(MIN_IOS, psize);
+	if (!ctx->subreq_pool) {
+		ret = -ENOMEM;
+		DMERR("Could not allocate crypt sub-request mempool\n");
+		goto free_skcipher;
+	}
+out:
+	return ret;
+
+free_skcipher:
+	crypto_free_skcipher(ctx->child);
+free_algname:
+	kfree(ctx->algname);
+free_ciphermode:
+	kfree(ctx->ciphermode);
+	goto out;
+}
+
+static void geniv_exit_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->dtr)
+		ctx->iv_gen_ops->dtr(ctx);
+
+	mempool_destroy(ctx->subreq_pool);
+	geniv_free_tfms(ctx);
+	kfree(ctx->ciphermode);
+	kfree(ctx->algname);
+}
+
+static void geniv_free(struct skcipher_instance *inst)
+{
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+
+	crypto_drop_skcipher(spawn);
+	kfree(inst);
+}
+
+static int geniv_create(struct crypto_template *tmpl,
+			struct rtattr **tb, char *algname)
+{
+	struct crypto_attr_type *algt;
+	struct skcipher_instance *inst;
+	struct skcipher_alg *alg;
+	struct crypto_skcipher_spawn *spawn;
+	const char *cipher_name;
+	int err;
+
+	algt = crypto_get_attr_type(tb);
+
+	if (IS_ERR(algt))
+		return PTR_ERR(algt);
+
+	if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
+		return -EINVAL;
+
+	cipher_name = crypto_attr_alg_name(tb[1]);
+
+	if (IS_ERR(cipher_name))
+		return PTR_ERR(cipher_name);
+
+	inst = kzalloc(sizeof(*inst) + sizeof(*spawn), GFP_KERNEL);
+	if (!inst)
+		return -ENOMEM;
+
+	spawn = skcipher_instance_ctx(inst);
+
+	crypto_set_skcipher_spawn(spawn, skcipher_crypto_instance(inst));
+	err = crypto_grab_skcipher(spawn, cipher_name, 0,
+				    crypto_requires_sync(algt->type,
+							 algt->mask));
+
+	if (err)
+		goto err_free_inst;
+
+	alg = crypto_spawn_skcipher_alg(spawn);
+
+	err = -EINVAL;
+
+	/* Only support blocks of size which is of a power of 2 */
+	if (!is_power_of_2(alg->base.cra_blocksize))
+		goto err_drop_spawn;
+
+	/* algname: essiv, base.cra_name: cbc(aes) */
+	err = -ENAMETOOLONG;
+	if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		     algname, alg->base.cra_name) >= CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+	if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME,
+		     "%s(%s)", algname, alg->base.cra_driver_name) >=
+	    CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+
+	inst->alg.base.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.base.cra_priority = alg->base.cra_priority;
+	inst->alg.base.cra_blocksize = alg->base.cra_blocksize;
+	inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
+	inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
+	inst->alg.ivsize = alg->base.cra_blocksize;
+	inst->alg.chunksize = crypto_skcipher_alg_chunksize(alg);
+	inst->alg.min_keysize = crypto_skcipher_alg_min_keysize(alg);
+	inst->alg.max_keysize = crypto_skcipher_alg_max_keysize(alg);
+
+	inst->alg.setkey = geniv_setkey;
+	inst->alg.encrypt = geniv_encrypt;
+	inst->alg.decrypt = geniv_decrypt;
+
+	inst->alg.base.cra_ctxsize = sizeof(struct geniv_ctx);
+
+	inst->alg.init = geniv_init_tfm;
+	inst->alg.exit = geniv_exit_tfm;
+
+	inst->free = geniv_free;
+
+	err = skcipher_register_instance(tmpl, inst);
+	if (err)
+		goto err_drop_spawn;
+
+out:
+	return err;
+
+err_drop_spawn:
+	crypto_drop_skcipher(spawn);
+err_free_inst:
+	kfree(inst);
+	goto out;
+}
+
+static int crypto_plain_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "plain");
+}
+
+static int crypto_plain64_create(struct crypto_template *tmpl,
+				 struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "plain64");
+}
+
+static int crypto_essiv_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "essiv");
+}
+
+static int crypto_benbi_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "benbi");
+}
+
+static int crypto_null_create(struct crypto_template *tmpl,
+			      struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "null");
+}
+
+static int crypto_lmk_create(struct crypto_template *tmpl,
+			     struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "lmk");
+}
+
+static int crypto_tcw_create(struct crypto_template *tmpl,
+			     struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "tcw");
+}
+
+static struct crypto_template crypto_plain_tmpl = {
+	.name   = "plain",
+	.create = crypto_plain_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_plain64_tmpl = {
+	.name   = "plain64",
+	.create = crypto_plain64_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_essiv_tmpl = {
+	.name   = "essiv",
+	.create = crypto_essiv_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_benbi_tmpl = {
+	.name   = "benbi",
+	.create = crypto_benbi_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_null_tmpl = {
+	.name   = "null",
+	.create = crypto_null_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_lmk_tmpl = {
+	.name   = "lmk",
+	.create = crypto_lmk_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_tcw_tmpl = {
+	.name   = "tcw",
+	.create = crypto_tcw_create,
+	.module = THIS_MODULE,
+};
+
+static int __init geniv_register_algs(void)
+{
+	int err;
+
+	err = crypto_register_template(&crypto_plain_tmpl);
+	if (err)
+		goto out;
+
+	err = crypto_register_template(&crypto_plain64_tmpl);
+	if (err)
+		goto out_undo_plain;
+
+	err = crypto_register_template(&crypto_essiv_tmpl);
+	if (err)
+		goto out_undo_plain64;
+
+	err = crypto_register_template(&crypto_benbi_tmpl);
+	if (err)
+		goto out_undo_essiv;
+
+	err = crypto_register_template(&crypto_null_tmpl);
+	if (err)
+		goto out_undo_benbi;
+
+	err = crypto_register_template(&crypto_lmk_tmpl);
+	if (err)
+		goto out_undo_null;
+
+	err = crypto_register_template(&crypto_tcw_tmpl);
+	if (!err)
+		goto out;
+
+	crypto_unregister_template(&crypto_lmk_tmpl);
+out_undo_null:
+	crypto_unregister_template(&crypto_null_tmpl);
+out_undo_benbi:
+	crypto_unregister_template(&crypto_benbi_tmpl);
+out_undo_essiv:
+	crypto_unregister_template(&crypto_essiv_tmpl);
+out_undo_plain64:
+	crypto_unregister_template(&crypto_plain64_tmpl);
+out_undo_plain:
+	crypto_unregister_template(&crypto_plain_tmpl);
+out:
+	return err;
+}
+
+static void __exit geniv_deregister_algs(void)
+{
+	crypto_unregister_template(&crypto_plain_tmpl);
+	crypto_unregister_template(&crypto_plain64_tmpl);
+	crypto_unregister_template(&crypto_essiv_tmpl);
+	crypto_unregister_template(&crypto_benbi_tmpl);
+	crypto_unregister_template(&crypto_null_tmpl);
+	crypto_unregister_template(&crypto_lmk_tmpl);
+	crypto_unregister_template(&crypto_tcw_tmpl);
+}
+
+/* End of geniv template cipher algorithms */
+
+/*
+ * context holding the current state of a multi-part conversion
+ */
+struct convert_context {
+	struct completion restart;
+	struct bio *bio_in;
+	struct bio *bio_out;
+	struct bvec_iter iter_in;
+	struct bvec_iter iter_out;
+	sector_t cc_sector;
+	atomic_t cc_pending;
+	struct skcipher_request *req;
+};
+
+/*
+ * per bio private data
+ */
+struct dm_crypt_io {
+	struct crypt_config *cc;
+	struct bio *base_bio;
+	struct work_struct work;
+
+	struct convert_context ctx;
 
-static int crypt_iv_tcw_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 *src;
-	int r = 0;
+	atomic_t io_pending;
+	int error;
+	sector_t sector;
 
-	/* Remove whitening from ciphertext */
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_tcw_whitening(cc, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	}
+	struct rb_node rb_node;
+} CRYPTO_MINALIGN_ATTR;
 
-	/* Calculate IV */
-	memcpy(iv, tcw->iv_seed, cc->iv_size);
-	crypto_xor(iv, (u8 *)&sector, 8);
-	if (cc->iv_size > 8)
-		crypto_xor(&iv[8], (u8 *)&sector, cc->iv_size - 8);
+struct dm_crypt_request {
+	struct convert_context *ctx;
+	struct scatterlist *sg_in;
+	struct scatterlist *sg_out;
+	sector_t iv_sector;
+};
 
-	return r;
-}
+struct crypt_config;
 
-static int crypt_iv_tcw_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
+/*
+ * Crypt: maps a linear range of a block device
+ * and encrypts / decrypts at the same time.
+ */
+enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
+	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
 
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE)
-		return 0;
+/*
+ * The fields in here must be read only after initialization.
+ */
+struct crypt_config {
+	struct dm_dev *dev;
+	sector_t start;
 
-	/* Apply whitening on ciphertext */
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_tcw_whitening(cc, dmreq, dst + dmreq->sg_out.offset);
-	kunmap_atomic(dst);
+	/*
+	 * pool for per bio private data, crypto requests and
+	 * encryption requeusts/buffer pages
+	 */
+	mempool_t *req_pool;
+	mempool_t *page_pool;
+	struct bio_set *bs;
+	struct mutex bio_alloc_lock;
 
-	return r;
-}
+	struct workqueue_struct *io_queue;
+	struct workqueue_struct *crypt_queue;
 
-static const struct crypt_iv_operations crypt_iv_plain_ops = {
-	.generator = crypt_iv_plain_gen
-};
+	struct task_struct *write_thread;
+	wait_queue_head_t write_thread_wait;
+	struct rb_root write_tree;
 
-static const struct crypt_iv_operations crypt_iv_plain64_ops = {
-	.generator = crypt_iv_plain64_gen
-};
+	char *cipher;
+	char *cipher_string;
+	char *key_string;
 
-static const struct crypt_iv_operations crypt_iv_essiv_ops = {
-	.ctr       = crypt_iv_essiv_ctr,
-	.dtr       = crypt_iv_essiv_dtr,
-	.init      = crypt_iv_essiv_init,
-	.wipe      = crypt_iv_essiv_wipe,
-	.generator = crypt_iv_essiv_gen
-};
+	sector_t iv_offset;
+	unsigned int iv_size;
 
-static const struct crypt_iv_operations crypt_iv_benbi_ops = {
-	.ctr	   = crypt_iv_benbi_ctr,
-	.dtr	   = crypt_iv_benbi_dtr,
-	.generator = crypt_iv_benbi_gen
-};
+	/* ESSIV: struct crypto_cipher *essiv_tfm */
+	void *iv_private;
+	struct crypto_skcipher *tfm;
+	unsigned int tfms_count;
 
-static const struct crypt_iv_operations crypt_iv_null_ops = {
-	.generator = crypt_iv_null_gen
-};
+	/*
+	 * Layout of each crypto request:
+	 *
+	 *   struct skcipher_request
+	 *      context
+	 *      padding
+	 *   struct dm_crypt_request
+	 *      padding
+	 *   IV
+	 *
+	 * The padding is added so that dm_crypt_request and the IV are
+	 * correctly aligned.
+	 */
+	unsigned int dmreq_start;
 
-static const struct crypt_iv_operations crypt_iv_lmk_ops = {
-	.ctr	   = crypt_iv_lmk_ctr,
-	.dtr	   = crypt_iv_lmk_dtr,
-	.init	   = crypt_iv_lmk_init,
-	.wipe	   = crypt_iv_lmk_wipe,
-	.generator = crypt_iv_lmk_gen,
-	.post	   = crypt_iv_lmk_post
-};
+	unsigned int per_bio_data_size;
 
-static const struct crypt_iv_operations crypt_iv_tcw_ops = {
-	.ctr	   = crypt_iv_tcw_ctr,
-	.dtr	   = crypt_iv_tcw_dtr,
-	.init	   = crypt_iv_tcw_init,
-	.wipe	   = crypt_iv_tcw_wipe,
-	.generator = crypt_iv_tcw_gen,
-	.post	   = crypt_iv_tcw_post
+	unsigned long flags;
+	unsigned int key_size;
+	unsigned int key_parts;      /* independent parts in key buffer */
+	unsigned int key_extra_size; /* additional keys length */
+	u8 key[0];
 };
 
+static void clone_init(struct dm_crypt_io *, struct bio *);
+static void kcryptd_queue_crypt(struct dm_crypt_io *io);
+static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+
 static void crypt_convert_init(struct crypt_config *cc,
 			       struct convert_context *ctx,
 			       struct bio *bio_out, struct bio *bio_in,
@@ -837,53 +1723,7 @@ static u8 *iv_of_dmreq(struct crypt_config *cc,
 		       struct dm_crypt_request *dmreq)
 {
 	return (u8 *)ALIGN((unsigned long)(dmreq + 1),
-		crypto_skcipher_alignmask(any_tfm(cc)) + 1);
-}
-
-static int crypt_convert_block(struct crypt_config *cc,
-			       struct convert_context *ctx,
-			       struct skcipher_request *req)
-{
-	struct bio_vec bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
-	struct bio_vec bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
-	struct dm_crypt_request *dmreq;
-	u8 *iv;
-	int r;
-
-	dmreq = dmreq_of_req(cc, req);
-	iv = iv_of_dmreq(cc, dmreq);
-
-	dmreq->iv_sector = ctx->cc_sector;
-	dmreq->ctx = ctx;
-	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in.bv_page, 1 << SECTOR_SHIFT,
-		    bv_in.bv_offset);
-
-	sg_init_table(&dmreq->sg_out, 1);
-	sg_set_page(&dmreq->sg_out, bv_out.bv_page, 1 << SECTOR_SHIFT,
-		    bv_out.bv_offset);
-
-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, 1 << SECTOR_SHIFT);
-
-	if (cc->iv_gen_ops) {
-		r = cc->iv_gen_ops->generator(cc, iv, dmreq);
-		if (r < 0)
-			return r;
-	}
-
-	skcipher_request_set_crypt(req, &dmreq->sg_in, &dmreq->sg_out,
-				   1 << SECTOR_SHIFT, iv);
-
-	if (bio_data_dir(ctx->bio_in) == WRITE)
-		r = crypto_skcipher_encrypt(req);
-	else
-		r = crypto_skcipher_decrypt(req);
-
-	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		r = cc->iv_gen_ops->post(cc, iv, dmreq);
-
-	return r;
+		crypto_skcipher_alignmask(cc->tfm) + 1);
 }
 
 static void kcryptd_async_done(struct crypto_async_request *async_req,
@@ -892,12 +1732,10 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void crypt_alloc_req(struct crypt_config *cc,
 			    struct convert_context *ctx)
 {
-	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
-
 	if (!ctx->req)
 		ctx->req = mempool_alloc(cc->req_pool, GFP_NOIO);
 
-	skcipher_request_set_tfm(ctx->req, cc->tfms[key_index]);
+	skcipher_request_set_tfm(ctx->req, cc->tfm);
 
 	/*
 	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
@@ -920,57 +1758,97 @@ static void crypt_free_req(struct crypt_config *cc,
 /*
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
-static int crypt_convert(struct crypt_config *cc,
-			 struct convert_context *ctx)
+
+static int crypt_convert_bio(struct crypt_config *cc,
+			     struct convert_context *ctx)
 {
+	unsigned int cryptlen, n1, n2, nents, i = 0, bytes = 0;
+	struct skcipher_request *req;
+	struct dm_crypt_request *dmreq;
+	struct geniv_req_info rinfo;
+	struct bio_vec bv_in, bv_out;
 	int r;
+	u8 *iv;
 
 	atomic_set(&ctx->cc_pending, 1);
+	crypt_alloc_req(cc, ctx);
+
+	req = ctx->req;
+	dmreq = dmreq_of_req(cc, req);
+	iv = iv_of_dmreq(cc, dmreq);
 
-	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
+	n1 = bio_segments(ctx->bio_in);
+	n2 = bio_segments(ctx->bio_out);
+	nents = n1 > n2 ? n1 : n2;
+	nents = nents > MAX_SG_LIST ? MAX_SG_LIST : nents;
+	cryptlen = ctx->iter_in.bi_size;
 
-		crypt_alloc_req(cc, ctx);
+	DMDEBUG("dm-crypt:%s: segments:[in=%u, out=%u] bi_size=%u\n",
+		bio_data_dir(ctx->bio_in) == WRITE ? "write" : "read",
+		n1, n2, cryptlen);
 
-		atomic_inc(&ctx->cc_pending);
+	dmreq->sg_in  = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	dmreq->sg_out = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	if (!dmreq->sg_in || !dmreq->sg_out) {
+		DMERR("dm-crypt: Failed to allocate scatterlist\n");
+		r = -ENOMEM;
+		goto end;
+	}
+	dmreq->ctx = ctx;
 
-		r = crypt_convert_block(cc, ctx, ctx->req);
+	sg_init_table(dmreq->sg_in, nents);
+	sg_init_table(dmreq->sg_out, nents);
 
-		switch (r) {
-		/*
-		 * The request was queued by a crypto driver
-		 * but the driver request queue is full, let's wait.
-		 */
-		case -EBUSY:
-			wait_for_completion(&ctx->restart);
-			reinit_completion(&ctx->restart);
-			/* fall through */
-		/*
-		 * The request is queued and processed asynchronously,
-		 * completion function kcryptd_async_done() will be called.
-		 */
-		case -EINPROGRESS:
-			ctx->req = NULL;
-			ctx->cc_sector++;
-			continue;
-		/*
-		 * The request was already processed (synchronously).
-		 */
-		case 0:
-			atomic_dec(&ctx->cc_pending);
-			ctx->cc_sector++;
-			cond_resched();
-			continue;
+	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size && i < nents) {
+		bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
+		bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
 
-		/* There was an error while processing the request. */
-		default:
-			atomic_dec(&ctx->cc_pending);
-			return r;
-		}
+		sg_set_page(&dmreq->sg_in[i], bv_in.bv_page, bv_in.bv_len,
+			    bv_in.bv_offset);
+		sg_set_page(&dmreq->sg_out[i], bv_out.bv_page, bv_out.bv_len,
+			    bv_out.bv_offset);
+
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, bv_in.bv_len);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, bv_out.bv_len);
+
+		bytes += bv_in.bv_len;
+		i++;
 	}
 
-	return 0;
+	DMDEBUG("dm-crypt: Processed %u of %u bytes\n", bytes, cryptlen);
+
+	rinfo.iv_sector = ctx->cc_sector;
+	rinfo.nents = nents;
+	rinfo.iv = iv;
+
+	skcipher_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,
+				   bytes, &rinfo);
+
+	if (bio_data_dir(ctx->bio_in) == WRITE)
+		r = crypto_skcipher_encrypt(req);
+	else
+		r = crypto_skcipher_decrypt(req);
+
+	switch (r) {
+	/* The request was queued so wait. */
+	case -EBUSY:
+		wait_for_completion(&ctx->restart);
+		reinit_completion(&ctx->restart);
+		/* fall through */
+	/*
+	 * The request is queued and processed asynchronously,
+	 * completion function kcryptd_async_done() is called.
+	 */
+	case -EINPROGRESS:
+		ctx->req = NULL;
+		cond_resched();
+		break;
+	}
+end:
+	return r;
 }
 
+
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
 
 /*
@@ -1070,11 +1948,17 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
 	struct bio *base_bio = io->base_bio;
+	struct dm_crypt_request *dmreq;
 	int error = io->error;
 
 	if (!atomic_dec_and_test(&io->io_pending))
 		return;
 
+	dmreq = dmreq_of_req(cc, io->ctx.req);
+	DMDEBUG("dm-crypt: Freeing scatterlists [sync]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	if (io->ctx.req)
 		crypt_free_req(cc, io->ctx.req, base_bio);
 
@@ -1313,7 +2197,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 	sector += bio_sectors(clone);
 
 	crypt_inc_pending(io);
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
 	if (r)
 		io->error = -EIO;
 	crypt_finished = atomic_dec_and_test(&io->ctx.cc_pending);
@@ -1343,7 +2227,8 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 	crypt_convert_init(cc, &io->ctx, io->base_bio, io->base_bio,
 			   io->sector);
 
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
+
 	if (r < 0)
 		io->error = -EIO;
 
@@ -1371,12 +2256,13 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 		return;
 	}
 
-	if (!error && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		error = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
-
 	if (error < 0)
 		io->error = -EIO;
 
+	DMDEBUG("dm-crypt: Freeing scatterlists and request struct [async]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	crypt_free_req(cc, req_of_dmreq(cc, dmreq), io->base_bio);
 
 	if (!atomic_dec_and_test(&ctx->cc_pending))
@@ -1430,62 +2316,38 @@ static int crypt_decode_key(u8 *key, char *hex, unsigned int size)
 	return 0;
 }
 
-static void crypt_free_tfms(struct crypt_config *cc)
+static void crypt_free_tfm(struct crypt_config *cc)
 {
-	unsigned i;
-
-	if (!cc->tfms)
+	if (!cc->tfm)
 		return;
 
-	for (i = 0; i < cc->tfms_count; i++)
-		if (cc->tfms[i] && !IS_ERR(cc->tfms[i])) {
-			crypto_free_skcipher(cc->tfms[i]);
-			cc->tfms[i] = NULL;
-		}
+	if (cc->tfm && !IS_ERR(cc->tfm))
+		crypto_free_skcipher(cc->tfm);
 
-	kfree(cc->tfms);
-	cc->tfms = NULL;
+	cc->tfm = NULL;
 }
 
-static int crypt_alloc_tfms(struct crypt_config *cc, char *ciphermode)
+static int crypt_alloc_tfm(struct crypt_config *cc, char *ciphermode)
 {
-	unsigned i;
 	int err;
 
-	cc->tfms = kzalloc(cc->tfms_count * sizeof(struct crypto_skcipher *),
-			   GFP_KERNEL);
-	if (!cc->tfms)
-		return -ENOMEM;
-
-	for (i = 0; i < cc->tfms_count; i++) {
-		cc->tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0);
-		if (IS_ERR(cc->tfms[i])) {
-			err = PTR_ERR(cc->tfms[i]);
-			crypt_free_tfms(cc);
-			return err;
-		}
+	cc->tfm = crypto_alloc_skcipher(ciphermode, 0, 0);
+	if (IS_ERR(cc->tfm)) {
+		err = PTR_ERR(cc->tfm);
+		crypt_free_tfm(cc);
+		return err;
 	}
 
 	return 0;
 }
 
-static int crypt_setkey(struct crypt_config *cc)
+static inline int crypt_setkey(struct crypt_config *cc, enum setkey_op keyop,
+			       char *ivopts)
 {
-	unsigned subkey_size;
-	int err = 0, i, r;
-
-	/* Ignore extra keys (which are used for IV etc) */
-	subkey_size = (cc->key_size - cc->key_extra_size) >> ilog2(cc->tfms_count);
-
-	for (i = 0; i < cc->tfms_count; i++) {
-		r = crypto_skcipher_setkey(cc->tfms[i],
-					   cc->key + (i * subkey_size),
-					   subkey_size);
-		if (r)
-			err = r;
-	}
+	DECLARE_GENIV_KEY(kinfo, keyop, cc->tfms_count, cc->key, cc->key_size,
+			  cc->key_parts, ivopts);
 
-	return err;
+	return crypto_skcipher_setkey(cc->tfm, (u8 *) &kinfo, sizeof(kinfo));
 }
 
 #ifdef CONFIG_KEYS
@@ -1498,7 +2360,9 @@ static bool contains_whitespace(const char *str)
 	return false;
 }
 
-static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string)
+static int crypt_set_keyring_key(struct crypt_config *cc,
+				 const char *key_string,
+				 enum setkey_op keyop, char *ivopts)
 {
 	char *new_key_string, *key_desc;
 	int ret;
@@ -1559,7 +2423,7 @@ static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string
 	/* clear the flag since following operations may invalidate previously valid key */
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
-	ret = crypt_setkey(cc);
+	ret = crypt_setkey(cc, keyop, ivopts);
 
 	/* wipe the kernel key payload copy in each case */
 	memset(cc->key, 0, cc->key_size * sizeof(u8));
@@ -1599,7 +2463,9 @@ static int get_key_size(char **key_string)
 
 #else
 
-static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string)
+static int crypt_set_keyring_key(struct crypt_config *cc,
+				 const char *key_string,
+				 enum setkey_op keyop, char *ivopts)
 {
 	return -EINVAL;
 }
@@ -1611,7 +2477,8 @@ static int get_key_size(char **key_string)
 
 #endif
 
-static int crypt_set_key(struct crypt_config *cc, char *key)
+static int crypt_set_key(struct crypt_config *cc, enum setkey_op keyop,
+			 char *key, char *ivopts)
 {
 	int r = -EINVAL;
 	int key_string_len = strlen(key);
@@ -1622,7 +2489,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 
 	/* ':' means the key is in kernel keyring, short-circuit normal key processing */
 	if (key[0] == ':') {
-		r = crypt_set_keyring_key(cc, key + 1);
+		r = crypt_set_keyring_key(cc, key + 1, keyop, ivopts);
 		goto out;
 	}
 
@@ -1636,7 +2503,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	if (cc->key_size && crypt_decode_key(cc->key, key, cc->key_size) < 0)
 		goto out;
 
-	r = crypt_setkey(cc);
+	r = crypt_setkey(cc, keyop, ivopts);
 	if (!r)
 		set_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
@@ -1647,6 +2514,17 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	return r;
 }
 
+static int crypt_init_key(struct dm_target *ti, char *key, char *ivopts)
+{
+	struct crypt_config *cc = ti->private;
+	int ret;
+
+	ret = crypt_set_key(cc, SETKEY_OP_INIT, key, ivopts);
+	if (ret < 0)
+		ti->error = "Error decoding and setting key";
+	return ret;
+}
+
 static int crypt_wipe_key(struct crypt_config *cc)
 {
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
@@ -1654,7 +2532,7 @@ static int crypt_wipe_key(struct crypt_config *cc)
 	kzfree(cc->key_string);
 	cc->key_string = NULL;
 
-	return crypt_setkey(cc);
+	return crypt_setkey(cc, SETKEY_OP_WIPE, NULL);
 }
 
 static void crypt_dtr(struct dm_target *ti)
@@ -1674,7 +2552,7 @@ static void crypt_dtr(struct dm_target *ti)
 	if (cc->crypt_queue)
 		destroy_workqueue(cc->crypt_queue);
 
-	crypt_free_tfms(cc);
+	crypt_free_tfm(cc);
 
 	if (cc->bs)
 		bioset_free(cc->bs);
@@ -1682,9 +2560,6 @@ static void crypt_dtr(struct dm_target *ti)
 	mempool_destroy(cc->page_pool);
 	mempool_destroy(cc->req_pool);
 
-	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
-		cc->iv_gen_ops->dtr(cc);
-
 	if (cc->dev)
 		dm_put_device(ti, cc->dev);
 
@@ -1762,22 +2637,30 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	if (!cipher_api)
 		goto bad_mem;
 
-	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME,
-		       "%s(%s)", chainmode, cipher);
+create_cipher:
+	/* For those ciphers which do not support IVs,
+	 * use the 'null' template cipher
+	 */
+
+	if (!ivmode)
+		ivmode = "null";
+
+	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME, "%s(%s(%s))",
+		       ivmode, chainmode, cipher);
 	if (ret < 0) {
 		kfree(cipher_api);
 		goto bad_mem;
 	}
 
 	/* Allocate cipher */
-	ret = crypt_alloc_tfms(cc, cipher_api);
+	ret = crypt_alloc_tfm(cc, cipher_api);
 	if (ret < 0) {
 		ti->error = "Error allocating crypto tfm";
 		goto bad;
 	}
 
 	/* Initialize IV */
-	cc->iv_size = crypto_skcipher_ivsize(any_tfm(cc));
+	cc->iv_size = crypto_skcipher_ivsize(cc->tfm);
 	if (cc->iv_size)
 		/* at least a 64 bit sector number should fit in our buffer */
 		cc->iv_size = max(cc->iv_size,
@@ -1785,23 +2668,10 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	else if (ivmode) {
 		DMWARN("Selected cipher does not support IVs");
 		ivmode = NULL;
+		goto create_cipher;
 	}
 
-	/* Choose ivmode, see comments at iv code. */
-	if (ivmode == NULL)
-		cc->iv_gen_ops = NULL;
-	else if (strcmp(ivmode, "plain") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain_ops;
-	else if (strcmp(ivmode, "plain64") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain64_ops;
-	else if (strcmp(ivmode, "essiv") == 0)
-		cc->iv_gen_ops = &crypt_iv_essiv_ops;
-	else if (strcmp(ivmode, "benbi") == 0)
-		cc->iv_gen_ops = &crypt_iv_benbi_ops;
-	else if (strcmp(ivmode, "null") == 0)
-		cc->iv_gen_ops = &crypt_iv_null_ops;
-	else if (strcmp(ivmode, "lmk") == 0) {
-		cc->iv_gen_ops = &crypt_iv_lmk_ops;
+	if (strcmp(ivmode, "lmk") == 0) {
 		/*
 		 * Version 2 and 3 is recognised according
 		 * to length of provided multi-key string.
@@ -1813,39 +2683,14 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 			cc->key_extra_size = cc->key_size / cc->key_parts;
 		}
 	} else if (strcmp(ivmode, "tcw") == 0) {
-		cc->iv_gen_ops = &crypt_iv_tcw_ops;
 		cc->key_parts += 2; /* IV + whitening */
 		cc->key_extra_size = cc->iv_size + TCW_WHITENING_SIZE;
-	} else {
-		ret = -EINVAL;
-		ti->error = "Invalid IV mode";
-		goto bad;
 	}
 
 	/* Initialize and set key */
-	ret = crypt_set_key(cc, key);
-	if (ret < 0) {
-		ti->error = "Error decoding and setting key";
+	ret = crypt_init_key(ti, key, ivopts);
+	if (ret < 0)
 		goto bad;
-	}
-
-	/* Allocate IV */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->ctr) {
-		ret = cc->iv_gen_ops->ctr(cc, ti, ivopts);
-		if (ret < 0) {
-			ti->error = "Error creating IV";
-			goto bad;
-		}
-	}
-
-	/* Initialize IV (set keys for ESSIV etc) */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->init) {
-		ret = cc->iv_gen_ops->init(cc);
-		if (ret < 0) {
-			ti->error = "Error initialising IV";
-			goto bad;
-		}
-	}
 
 	ret = 0;
 bad:
@@ -1901,20 +2746,20 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 
 	cc->dmreq_start = sizeof(struct skcipher_request);
-	cc->dmreq_start += crypto_skcipher_reqsize(any_tfm(cc));
+	cc->dmreq_start += crypto_skcipher_reqsize(cc->tfm);
 	cc->dmreq_start = ALIGN(cc->dmreq_start, __alignof__(struct dm_crypt_request));
 
-	if (crypto_skcipher_alignmask(any_tfm(cc)) < CRYPTO_MINALIGN) {
+	if (crypto_skcipher_alignmask(cc->tfm) < CRYPTO_MINALIGN) {
 		/* Allocate the padding exactly */
 		iv_size_padding = -(cc->dmreq_start + sizeof(struct dm_crypt_request))
-				& crypto_skcipher_alignmask(any_tfm(cc));
+				& crypto_skcipher_alignmask(cc->tfm);
 	} else {
 		/*
 		 * If the cipher requires greater alignment than kmalloc
 		 * alignment, we don't know the exact position of the
 		 * initialization vector. We must assume worst case.
 		 */
-		iv_size_padding = crypto_skcipher_alignmask(any_tfm(cc));
+		iv_size_padding = crypto_skcipher_alignmask(cc->tfm);
 	}
 
 	ret = -ENOMEM;
@@ -2072,8 +2917,9 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
 	if (bio_data_dir(io->base_bio) == READ) {
 		if (kcryptd_io_read(io, GFP_NOWAIT))
 			kcryptd_queue_read(io);
-	} else
+	} else {
 		kcryptd_queue_crypt(io);
+	}
 
 	return DM_MAPIO_SUBMITTED;
 }
@@ -2155,7 +3001,7 @@ static void crypt_resume(struct dm_target *ti)
 static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 {
 	struct crypt_config *cc = ti->private;
-	int key_size, ret = -EINVAL;
+	int key_size;
 
 	if (argc < 2)
 		goto error;
@@ -2173,19 +3019,9 @@ static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 				return -EINVAL;
 			}
 
-			ret = crypt_set_key(cc, argv[2]);
-			if (ret)
-				return ret;
-			if (cc->iv_gen_ops && cc->iv_gen_ops->init)
-				ret = cc->iv_gen_ops->init(cc);
-			return ret;
+			return crypt_set_key(cc, SETKEY_OP_SET, argv[2], NULL);
 		}
 		if (argc == 2 && !strcasecmp(argv[1], "wipe")) {
-			if (cc->iv_gen_ops && cc->iv_gen_ops->wipe) {
-				ret = cc->iv_gen_ops->wipe(cc);
-				if (ret)
-					return ret;
-			}
 			return crypt_wipe_key(cc);
 		}
 	}
@@ -2216,7 +3052,7 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
 
 static struct target_type crypt_target = {
 	.name   = "crypt",
-	.version = {1, 15, 0},
+	.version = {1, 16, 0},
 	.module = THIS_MODULE,
 	.ctr    = crypt_ctr,
 	.dtr    = crypt_dtr,
@@ -2234,6 +3070,7 @@ static int __init dm_crypt_init(void)
 {
 	int r;
 
+	geniv_register_algs();
 	r = dm_register_target(&crypt_target);
 	if (r < 0)
 		DMERR("register failed %d", r);
@@ -2244,6 +3081,7 @@ static int __init dm_crypt_init(void)
 static void __exit dm_crypt_exit(void)
 {
 	dm_unregister_target(&crypt_target);
+	geniv_deregister_algs();
 }
 
 module_init(dm_crypt_init);
diff --git a/include/crypto/geniv.h b/include/crypto/geniv.h
new file mode 100644
index 0000000..599ce62
--- /dev/null
+++ b/include/crypto/geniv.h
@@ -0,0 +1,47 @@
+/*
+ * geniv: common interface for IV generation algorithms
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef _CRYPTO_GENIV_
+#define _CRYPTO_GENIV_
+
+#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
+#define SECTOR_MASK		(~((1 << SECTOR_SHIFT) - 1))
+
+enum setkey_op {
+	SETKEY_OP_INIT,
+	SETKEY_OP_SET,
+	SETKEY_OP_WIPE,
+};
+
+struct geniv_key_info {
+	enum setkey_op keyop;
+	unsigned int tfms_count;
+	u8 *key;
+	unsigned int key_size;
+	unsigned int key_parts;
+	char *ivopts;
+};
+
+#define DECLARE_GENIV_KEY(c, op, n, k, sz, kp, opts)	\
+	struct geniv_key_info c = {			\
+		.keyop = op,				\
+		.tfms_count = n,			\
+		.key = k,				\
+		.key_size = sz,				\
+		.key_parts = kp,			\
+		.ivopts = opts,				\
+	}
+
+struct geniv_req_info {
+	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+};
+
+#endif
-- 
Binoy Jayan

^ permalink raw reply related

* [RFC PATCH v5] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-04-07 10:47 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad,
	Binoy Jayan

===============================================================================
dm-crypt optimization for larger block sizes
===============================================================================

Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers so they can be used in relation with
algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
patchset, the iv-generation code is moved from the dm layer to the crypto layer
and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. Since the bio itself may not be contiguous in main
memory, the dm layer sets up a chained scatterlist of these blocks split into
physically contiguous segments in memory so that DMA can be performed.

One challenge in doing so is that the IVs are generated based on a 512-byte
sector number. This infact limits the block sizes to 512 bytes. But this should
not be a problem if a hardware with iv generation support is used. The geniv
itself splits the segments into sectors so it could choose the IV based on
sector number. But it could be modelled in hardware effectively by not
splitting up the segments in the bio.

Another challenge faced is that dm-crypt has an option to use multiple keys.
The key selection is done based on the sector number. If the whole bio is
encrypted / decrypted with the same key, the encrypted volumes will not be
compatible with the original dm-crypt [without the changes]. So, the key
selection code is moved to crypto layer so the neighboring sectors are
encrypted with a different key.

The dm layer allocates space for iv. The hardware drivers can choose to make
use of this space to generate their IVs sequentially or allocate it on their
own. This can be moved to crypto layer too. Postponing this decision until
the requirement to integrate milan's changes are clear.

Interface to the crypto layer - include/crypto/geniv.h

More information on test procedure can be found in v1.

-------------------------------------------------------------------------------
Peformance comparison [Tests on 1 GiB Volume] on db410c
Test script:
https://github.com/binoyjayan/utilities/blob/master/utils/dmtest
dmtest -d <block device> -o out.log -s 1024 -r 384 -f 768
-------------------------------------------------------------------------------

This includes tests done with dd, fio and bonnie++ with the original dm-crypt
and the proposed solution with algorithm 'essiv(cbc(aes-arm))' implemented
in software. The hardware is yet to be evaluated. These tests are to make sure
there is no drastic performance degradation on systems without hw crypto.

Tests with dd [direct i/o]

Sequential read            -0.134 %
Sequential Write           +0.091 %

Tests with fio [Aggregate bandwidth - aggrb]

Random Read                +0.358 %
Random Write               +0.010 %

Tests with bonnie++ [768 MiB File, 384 MiB Ram]
after mounting dm-crypt target as ext4

Sequential o/p [per-char]  -2.876 %
Sequential o/p [per-blk]   +0.992 %
Sequential o/p [re-write]  +4.465 %

Sequential i/p [per-char] -0.453 %
Sequential i/p [per-blk]  -0.740 %

Sequential create         -0.255 %
Sequential delete         +0.042 %
Random create             -0.007 %
Random delete             +0.454 %

NB: The '+' sign shows improvement and '-' shows degradation.
    The tests were performed with minimal cpu load.
    Tests with higher cpu load to be done

Revisions:
----------

v1: https://patchwork.kernel.org/patch/9439175
v2: https://patchwork.kernel.org/patch/9471923
v3: https://lkml.org/lkml/2017/1/18/170
v4: https://patchwork.kernel.org/patch/9559665

v4 --> v5
----------

1. Fix for the multiple instance issue in /proc/crypto
2. Few cosmetic changes including struct alignment
3. Simplified 'struct geniv_req_info'

v3 --> v4
----------
Fix for the bug reported by Gilad Ben-Yossef.
The element '__ctx' in 'struct skcipher_request req' overflowed into the
element 'struct scatterlist src' which immediately follows 'req' in
'struct geniv_subreq' and corrupted src.

v2 --> v3
----------

1. Moved iv algorithms in dm-crypt.c for control
2. Key management code moved from dm layer to cryto layer
   so that cipher instance selection can be made depending on key_index
3. The revision v2 had scatterlist nodes created for every sector in the bio.
   It is modified to create only once scatterlist node to reduce memory
   foot print. Synchronous requests are processed sequentially. Asynchronous
   requests are processed in parallel and is freed in the async callback.
4. Changed allocation for sub-requests using mempool

v1 --> v2
----------

1. dm-crypt changes to process larger block sizes (one segment in a bio)
2. Incorporated changes w.r.t. comments from Herbert.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 drivers/md/dm-crypt.c  | 1916 ++++++++++++++++++++++++++++++++++--------------
 include/crypto/geniv.h |   47 ++
 2 files changed, 1424 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan

^ permalink raw reply

* Re: Degraded RAID reshaping
From: Zhilong Liu @ 2017-04-07  9:06 UTC (permalink / raw)
  To: Victor Helmholtz, linux-raid
In-Reply-To: <7C33B6CD-B871-4D42-86F6-D80BDDC6D639@gmail.com>



On 04/07/2017 04:38 PM, Victor Helmholtz wrote:
> Hi,
>
> I have problem with reshaping a RAID6. I had a drive failure in 8 disk RAID6, and since I
> don't need that much space anymore I decided to shrink array instead of buying replacement
> disk. I executed following commands:
>
> e2fsck -f /dev/md2
> mdadm --grow -n7 /dev/md2
> mdadm: this change will reduce the size of the array.
>         use --grow --array-size first to truncate array.
>         e.g. mdadm --grow /dev/md2 --array-size 14650664960
> resize2fs /dev/md2 3500000000
> mdadm /dev/md2 --grow --array-size=14650664960
> e2fsck -f /dev/md2
> mdadm --grow -n7 /dev/md2 --backup-file /root/mdadm-md2.backup
>
>
> There was no errors and 'cat /proc/mdstat' reports reshape in progress:
>
> Personalities : [raid6] [raid5] [raid4]
> md2 : active raid6 sde1[1] sdn1[8] sdb1[9] sdr1[11] sdp1[10] sdl1[4] sdi1[3]
>        14650664960 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
>        [>....................]  reshape =  0.0% (1/2930132992) finish=9641968871.8min speed=0K/sec
>        bitmap: 22/22 pages [88KB], 65536KB chunk
>
> unused devices: <none>
>
>
> The problem is that there was no progress for more than an hour, the reshaping has stopped
> at the fist chunk. Is this a bug or is it not possible to reshape degraded RAID? What
> shall I do with the array, can I abort the reshape or is it going to reshape eventually?
>
> Output of "mdadm --detail /dev/md2":
> /dev/md2:
>          Version : 1.2
>    Creation Time : Sun Oct 19 22:10:51 2014
>       Raid Level : raid6
>       Array Size : 14650664960 (13971.96 GiB 15002.28 GB)
>    Used Dev Size : 2930132992 (2794.39 GiB 3000.46 GB)
>     Raid Devices : 7
>    Total Devices : 7
>      Persistence : Superblock is persistent
>
>    Intent Bitmap : Internal
>
>      Update Time : Fri Apr  7 08:38:29 2017
>            State : clean, degraded, reshaping
>   Active Devices : 7
> Working Devices : 7
>   Failed Devices : 0
>    Spare Devices : 0
>
>           Layout : left-symmetric
>       Chunk Size : 512K
>
>   Reshape Status : 0% complete
>    Delta Devices : -1, (8->7)
>
>             Name : borox:2  (local to host borox)
>             UUID : 216515ea:4a08e3b7:022786cd:534b5f0f
>           Events : 148046
>
>      Number   Major   Minor   RaidDevice State
>         0       0        0        0      removed
>         1       8       65        1      active sync   /dev/sde1
>         3       8      129        2      active sync   /dev/sdi1
>         4       8      177        3      active sync   /dev/sdl1
>        10       8      241        4      active sync   /dev/sdp1
>        11      65       17        5      active sync   /dev/sdr1
>         9       8       17        6      active sync   /dev/sdb1
>
>         8       8      209        7      active sync   /dev/sdn1

please type "mdadm ---grow --continue /dev/md2" and recheck,
check the "systemctl status mdadm-grow-continue@md2.service"
check the "journalctl -xn", to make sure whether or not the instance
of mdadm-grow-continue@.service has worked well.

Thanks,
-Zhilong
> Thanks,
> Victor
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply

* Degraded RAID reshaping
From: Victor Helmholtz @ 2017-04-07  8:38 UTC (permalink / raw)
  To: linux-raid

Hi,

I have problem with reshaping a RAID6. I had a drive failure in 8 disk RAID6, and since I
don't need that much space anymore I decided to shrink array instead of buying replacement
disk. I executed following commands:

e2fsck -f /dev/md2
mdadm --grow -n7 /dev/md2
mdadm: this change will reduce the size of the array.
       use --grow --array-size first to truncate array.
       e.g. mdadm --grow /dev/md2 --array-size 14650664960
resize2fs /dev/md2 3500000000
mdadm /dev/md2 --grow --array-size=14650664960
e2fsck -f /dev/md2
mdadm --grow -n7 /dev/md2 --backup-file /root/mdadm-md2.backup


There was no errors and 'cat /proc/mdstat' reports reshape in progress:

Personalities : [raid6] [raid5] [raid4] 
md2 : active raid6 sde1[1] sdn1[8] sdb1[9] sdr1[11] sdp1[10] sdl1[4] sdi1[3]
      14650664960 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
      [>....................]  reshape =  0.0% (1/2930132992) finish=9641968871.8min speed=0K/sec
      bitmap: 22/22 pages [88KB], 65536KB chunk

unused devices: <none>


The problem is that there was no progress for more than an hour, the reshaping has stopped
at the fist chunk. Is this a bug or is it not possible to reshape degraded RAID? What
shall I do with the array, can I abort the reshape or is it going to reshape eventually?

Output of "mdadm --detail /dev/md2":
/dev/md2:
        Version : 1.2
  Creation Time : Sun Oct 19 22:10:51 2014
     Raid Level : raid6
     Array Size : 14650664960 (13971.96 GiB 15002.28 GB)
  Used Dev Size : 2930132992 (2794.39 GiB 3000.46 GB)
   Raid Devices : 7
  Total Devices : 7
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Apr  7 08:38:29 2017
          State : clean, degraded, reshaping 
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

 Reshape Status : 0% complete
  Delta Devices : -1, (8->7)

           Name : borox:2  (local to host borox)
           UUID : 216515ea:4a08e3b7:022786cd:534b5f0f
         Events : 148046

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       65        1      active sync   /dev/sde1
       3       8      129        2      active sync   /dev/sdi1
       4       8      177        3      active sync   /dev/sdl1
      10       8      241        4      active sync   /dev/sdp1
      11      65       17        5      active sync   /dev/sdr1
       9       8       17        6      active sync   /dev/sdb1

       8       8      209        7      active sync   /dev/sdn1

Thanks,
Victor


^ permalink raw reply

* [PATCH] bcache: Keep the labels the same in cache dev and cache set.
From: lixiubo @ 2017-04-07  6:26 UTC (permalink / raw)
  To: kent.overstreet; +Cc: inux-bcache, linux-raid, linux-kernel, Xiubo Li

From: Xiubo Li <lixiubo@cmss.chinamobile.com>

Since the unusual characters from sysfs attributes have been stript
for dc superblock label, so should here.

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
---
 drivers/md/bcache/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index f90f136..d449f37 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -249,7 +249,7 @@
 		bch_write_bdev_super(dc, NULL);
 		if (dc->disk.c) {
 			memcpy(dc->disk.c->uuids[dc->disk.id].label,
-			       buf, SB_LABEL_SIZE);
+			       dc->sb.label, strlen(dc->sb.label));
 			bch_uuid_write(dc->disk.c);
 		}
 		env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
-- 
1.8.3.1

^ permalink raw reply related

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Coly Li @ 2017-04-07  3:54 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <0cdb9fe9-0eea-3c4f-af93-5894d04d680f@gmail.com>

On 2017/4/7 上午2:14, Jes Sorensen wrote:
> On 04/06/2017 02:10 PM, Coly Li wrote:
>> On 2017/4/6 下午11:31, Jes Sorensen wrote:
>>> Neil,
>>>
>>> I see, thanks for explaining.
>>>
>>> The goal is to eventually get out of the ioctl() business and get to a
>>> state where we can do everything via sysfs/configfs. Right now we have a
>>> big mix between ioctl and sysfs where neither interface does everything.
>>> The recent issues with PPL (I think it was) showed that we had to add
>>> more ioctl support because the interfaces needed to do it for sysfs
>>> weren't quite there. My long term goal is to get that situation improved
>>> so we can avoid adding anymore ioctl interfaces and eventually allow for
>>> distros to build mdadm with ioctl support disabled. We had a discussion
>>> at LSF/MM in Boston about this (Hannes, Shaohua, Song, and myself).
>>>
>>> Reading the code I found it confusing that it was so tied to the patch
>>> level, but didn't do anything with the version numbers. At least
>>> intuitively if I bumped the version number, I would reset the patchlevel
>>> which would break things.
>>>
>>> I think it's fair to draw a line in the sand and say that mdadm-4.1+
>>> will not support kernels older than 2.6.15. I am open to the kernel
>>> version we pick here, but I would like to start deprecating some of the
>>> really old code. I have patches that does this in my tree, but I need to
>>> add a check for kernel version > 2.6.15. I am not aware what SuSE's
>>> enterprise kernel versions look like, but checking RHEL/CentOS RHEL5 was
>>> 2.6.18, while RHEL4 was 2.6.9 - and RHEL4 has been unsupported for quite
>>> a while. At least for RHEL/CentOS 2.6.15 as the line in the sand seems
>>> fine.
>>>
>>> For the kernel to expose features to userland in the future, I would
>>> prefer to go with a feature-flag style interface exposed via sysfs. That
>>> way a distro could enable one feature, but not the other in their kernel
>>> without having to worry about actual version numbers.
>>
>> BTW, I'd like to take on the kernel side stuffs.
>> IMHO, it would be better that we also set a timeline, after the
>> timeline, we should add new interface via configfs, and keep all legancy
>> ioctl implementation before this timeline.
>>
>> Coly
> 
> A volunteer! A volunteer!
> 
> I don't think we need to set a firm timeline for removing code from the
> kernel at this point. What we can do is implement the new interfaces via
> sysfs/configfs, and then make ioctl support a config option. Down the
> line it will be evident if/when we can rip out the old code, but I see
> that being years away.

Not removing code, just after a timeline, new patch which adding new
interface should go into configfs.
I start to look into this now, hope to compose a first version patch set
for comments in not too long time.

Coly

^ permalink raw reply

* Re: mdadm:compiled warning in mdadm.c:1974:29 treats as errors
From: Liuzhilong @ 2017-04-07  3:53 UTC (permalink / raw)
  To: NeilBrown, Jes Sorensen, Liu Zhilong; +Cc: linux-raid
In-Reply-To: <87r315yupy.fsf@notabene.neil.brown.name>



On 04/07/2017 07:36 AM, NeilBrown wrote:
> On Thu, Apr 06 2017, Jes Sorensen wrote:
>
>> On 04/06/2017 04:21 AM, Liu Zhilong wrote:
>>> hi,
>>>
>>> I found this compiling warning, and can reproduce on my Leap42.1,
>>> Leap42.2 and SLES12 SP2 with v4.8.5 version of gcc.
>>>
>>> # gcc -v
>>> Using built-in specs.
>>> COLLECT_GCC=gcc
>>> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
>>> Target: x86_64-suse-linux
>>> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
>>> --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
>>> --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
>>> --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8
>>> --enable-ssp --disable-libssp --disable-plugin
>>> --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
>>> --disable-libgcj --disable-libmudflap --with-slibdir=/lib64
>>> --with-system-zlib --enable-__cxa_atexit
>>> --enable-libstdcxx-allocator=new --disable-libstdcxx-pch
>>> --enable-version-specific-runtime-libs --enable-linker-build-id
>>> --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind
>>> --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
>>> --host=x86_64-suse-linux
>>> Thread model: posix
>>> gcc version 4.8.5 (SUSE Linux)
>>>
>>> # make everything
>>> ... ...
>>> mdadm.c: In function ‘main’:
>>> mdadm.c:1974:29: error: ‘mdfd’ may be used uninitialized in this
>>> function [-Werror=maybe-uninitialized]
>>>     if (dv->devname[0] == '/' || mdfd < 0)
>>>                               ^
>>> mdadm.c:1914:7: note: ‘mdfd’ was declared here
>>>     int mdfd;
>>>         ^
>>> cc1: all warnings being treated as errors
>>> Makefile:206: recipe for target 'mdadm.Os' failed
>>> make: *** [mdadm.Os] Error 1
>> Crappy compiler, you're running an old gcc. But sure, send me a patch
>> that initializes mdfd to -1.
> I would rather make the code more obviously correct.
>
> 		if (dv->devname[0] == '/' ||
> 		    (mdfd = open_dev(dv->devname)) < 0)
> 			mdfd = open_mddev(dv->devname, 1);
>
> well... it is more obvious to me.

I also consider that it's better method for this compiling warning.
And I can send this sample as a patch once jes agrees.

Thanks,
-Zhilong

> NeilBrown


^ permalink raw reply

* [dm-devel] [PATCH] Fix for find_lowest_key in dm-btree.c
From: Vinothkumar Raja @ 2017-04-07  2:09 UTC (permalink / raw)
  To: agk, snitzer
  Cc: dm-devel, shli, linux-raid, linux-kernel, ezk, npanpalia, tarasov,
	Vinothkumar Raja, Erez Zadok

We are working on dm-dedup which is a device-mapper's dedup target that 
provides transparent data deduplication of block devices. Every write 
coming to a dm-dedup instance is deduplicated against previously written 
data.  We’ve been working on this project for several years now. The 
Github link for the same is https://github.com/dmdedup. Detailed design 
and performance evaluation can be found in the following paper: 
http://www.fsl.cs.stonybrook.edu/docs/ols-dmdedup/dmdedup-ols14.pdf.

We are currently working on garbage collection for which we traverse our 
btrees from lowest key to highest key. While using find_lowest_key and 
find_highest_key, we noticed that find_lowest_key is giving incorrect 
results. While the function find_key traverses the btree correctly for 
finding the highest key, we found that there is an error in the way it 
traverses the btree for retrieving the lowest key. The find_lowest_key 
function fetches the first key of the rightmost block of the btree 
instead of fetching the first key from the leftmost block. This patch 
fixes the bug and gives us the correct result.  

Signed-off-by: Erez Zadok <ezk@fsl.cs.sunysb.edu>
Signed-off-by: Vinothkumar Raja <vinraja@cs.stonybrook.edu>
Signed-off-by: Nidhi Panpalia <npanpalia@cs.stonybrook.edu>

---
 drivers/md/persistent-data/dm-btree.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/persistent-data/dm-btree.c b/drivers/md/persistent-data/dm-btree.c
index 02e2ee0..83121d1 100644
--- a/drivers/md/persistent-data/dm-btree.c
+++ b/drivers/md/persistent-data/dm-btree.c
@@ -902,9 +902,12 @@ static int find_key(struct ro_spine *s, dm_block_t block, bool find_highest,
 		else
 			*result_key = le64_to_cpu(ro_node(s)->keys[0]);

-		if (next_block || flags & INTERNAL_NODE)
-			block = value64(ro_node(s), i);
-
+		if (next_block || flags & INTERNAL_NODE) {
+			if (find_highest)
+				block = value64(ro_node(s), i);
+			else
+				block = value64(ro_node(s), 0);
+		}
 	} while (flags & INTERNAL_NODE);

 	if (next_block)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 2/2] block: trace completion of all bios.
From: NeilBrown @ 2017-04-07  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mike Snitzer, Ming Lei, linux-kernel,
	linux-raid, dm-devel, linux-block, Shaohua Li, Alasdair Kergon
In-Reply-To: <149152735878.17489.16036848644242802943.stgit@noble>

Currently only dm and md/raid5 bios trigger
trace_block_bio_complete().  Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete.  Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
   trace event at the 'request' level, there is no point generating
   one at the bio level too.  In this case the bi_sector and bi_size
   will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
   early, then a trace event could be confusing.  Some filesystems
   call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
   then restoring it and calling bio_endio() again.  This would produce
   two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication.  A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component.  To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c               |   13 +++++++++++++
 block/blk-core.c          |   10 +++++++++-
 drivers/md/dm.c           |    1 -
 drivers/md/raid5.c        |    2 --
 include/linux/blk_types.h |    2 ++
 5 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 12c2837c4277..3db98fc3ee87 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1789,6 +1789,11 @@ static inline bool bio_remaining_done(struct bio *bio)
  *   bio_endio() will end I/O on the whole bio. bio_endio() is the preferred
  *   way to end I/O on a bio. No one should call bi_end_io() directly on a
  *   bio unless they own it and thus know that it has an end_io function.
+ *
+ *   bio_endio() can be called several times on a bio that has been chained
+ *   using bio_chain().  The ->bi_end_io() function will only be called the
+ *   last time.  At this point the BLK_TA_COMPLETE tracing event will be
+ *   generated if BIO_TRACE_COMPLETION is set.
  **/
 void bio_endio(struct bio *bio)
 {
@@ -1809,6 +1814,11 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
+	}
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
@@ -1847,6 +1857,9 @@ struct bio *bio_split(struct bio *bio, int sectors,
 
 	bio_advance(bio, split->bi_iter.bi_size);
 
+	if (bio_flagged(bio, BIO_TRACE_COMPLETION))
+		bio_set_flag(bio, BIO_TRACE_COMPLETION);
+
 	return split;
 }
 EXPORT_SYMBOL(bio_split);
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..2a7993063a7e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1936,7 +1936,13 @@ generic_make_request_checks(struct bio *bio)
 	if (!blkcg_bio_issue_check(q, bio))
 		return false;
 
-	trace_block_bio_queue(q, bio);
+	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
+		trace_block_bio_queue(q, bio);
+		/* Now that enqueuing has been traced, we need to trace
+		 * completion as well.
+		 */
+		bio_set_flag(bio, BIO_TRACE_COMPLETION);
+	}
 	return true;
 
 not_supported:
@@ -2601,6 +2607,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 		if (bio_bytes == bio->bi_iter.bi_size)
 			req->bio = bio->bi_next;
 
+		/* Completion has already been traced */
+		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
 		req_bio_endio(req, bio, bio_bytes, error);
 
 		total_bytes += bio_bytes;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..cd93a3b9ceca 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -810,7 +810,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d8bd25f235dc..14557259dd69 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5138,8 +5138,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ff07d891f38f..56c2a8180ce1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -102,6 +102,8 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
+#define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
+				 * of this bio. */
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*

^ permalink raw reply related

* [PATCH 1/2] block: simple improvements for bio->flags
From: NeilBrown @ 2017-04-07  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <149152735878.17489.16036848644242802943.stgit@noble>

The comment for the 'flags' field of 'bio' mentions
"command" which is no longer stored there, and doesn't
mention the bvec pool number, which is.

BIO_RESET_BITS is set in such a way that it would need to be
updated if new bits were added, which is easy to miss.

BVEC_POOL_BITS is larger than needed.  The BVEC_POOL_IDX()
ranges from 0 to 6, so 3 bits are sufficient.

This patch make improvements in each of these areas.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/blk_types.h |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d703acb55d0f..ff07d891f38f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -29,7 +29,7 @@ struct bio {
 						 * top bits REQ_OP. Use
 						 * accessors.
 						 */
-	unsigned short		bi_flags;	/* status, command, etc */
+	unsigned short		bi_flags;	/* status, etc and bvec pool number */
 	unsigned short		bi_ioprio;
 
 	struct bvec_iter	bi_iter;
@@ -102,12 +102,7 @@ struct bio {
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */
-
-/*
- * Flags starting here get preserved by bio_reset() - this includes
- * BVEC_POOL_IDX()
- */
-#define BIO_RESET_BITS	10
+/* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*
  * We support 6 different bvec pools, the last one is magic in that it
@@ -117,13 +112,22 @@ struct bio {
 #define BVEC_POOL_MAX		(BVEC_POOL_NR - 1)
 
 /*
- * Top 4 bits of bio flags indicate the pool the bvecs came from.  We add
+ * Top 3 bits of bio flags indicate the pool the bvecs came from.  We add
  * 1 to the actual index so that 0 indicates that there are no bvecs to be
  * freed.
  */
-#define BVEC_POOL_BITS		(4)
+#define BVEC_POOL_BITS		(3)
 #define BVEC_POOL_OFFSET	(16 - BVEC_POOL_BITS)
 #define BVEC_POOL_IDX(bio)	((bio)->bi_flags >> BVEC_POOL_OFFSET)
+#if (1<< BVEC_POOL_BITS) < (BVEC_POOL_NR+1)
+# error "BVEC_POOL_BITS is too small"
+#endif
+
+/*
+ * Flags starting here get preserved by bio_reset() - this includes
+ * only BVEC_POOL_IDX()
+ */
+#define BIO_RESET_BITS	BVEC_POOL_OFFSET
 
 /*
  * Operations and flags common to the bio and request structures.

^ permalink raw reply related

* [PATCH 0/2] Trace completion of all bios
From: NeilBrown @ 2017-04-07  1:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon

Hi Jens,
 I think all objections to this patch have been answered so I'm
 resending.
 I've added a small cleanup patch first which makes some small
 enhancements to the documentation and #defines for the bio->flags
 field.

Thanks,
NeilBrown

---

NeilBrown (2):
      block: simple improvements for bio->flags
      block: trace completion of all bios.


 block/bio.c               |   13 +++++++++++++
 block/blk-core.c          |   10 +++++++++-
 drivers/md/dm.c           |    1 -
 drivers/md/raid5.c        |    2 --
 include/linux/blk_types.h |   24 +++++++++++++++---------
 5 files changed, 37 insertions(+), 13 deletions(-)

--
Signature

^ permalink raw reply

* Re: mdadm:compiled warning in mdadm.c:1974:29 treats as errors
From: NeilBrown @ 2017-04-06 23:36 UTC (permalink / raw)
  To: Jes Sorensen, Liu Zhilong; +Cc: linux-raid
In-Reply-To: <a5f537c0-9b73-9a58-7566-0a242e0ff840@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2090 bytes --]

On Thu, Apr 06 2017, Jes Sorensen wrote:

> On 04/06/2017 04:21 AM, Liu Zhilong wrote:
>> hi,
>>
>> I found this compiling warning, and can reproduce on my Leap42.1,
>> Leap42.2 and SLES12 SP2 with v4.8.5 version of gcc.
>>
>> # gcc -v
>> Using built-in specs.
>> COLLECT_GCC=gcc
>> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
>> Target: x86_64-suse-linux
>> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
>> --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
>> --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
>> --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8
>> --enable-ssp --disable-libssp --disable-plugin
>> --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
>> --disable-libgcj --disable-libmudflap --with-slibdir=/lib64
>> --with-system-zlib --enable-__cxa_atexit
>> --enable-libstdcxx-allocator=new --disable-libstdcxx-pch
>> --enable-version-specific-runtime-libs --enable-linker-build-id
>> --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind
>> --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
>> --host=x86_64-suse-linux
>> Thread model: posix
>> gcc version 4.8.5 (SUSE Linux)
>>
>> # make everything
>> ... ...
>> mdadm.c: In function ‘main’:
>> mdadm.c:1974:29: error: ‘mdfd’ may be used uninitialized in this
>> function [-Werror=maybe-uninitialized]
>>    if (dv->devname[0] == '/' || mdfd < 0)
>>                              ^
>> mdadm.c:1914:7: note: ‘mdfd’ was declared here
>>    int mdfd;
>>        ^
>> cc1: all warnings being treated as errors
>> Makefile:206: recipe for target 'mdadm.Os' failed
>> make: *** [mdadm.Os] Error 1
>
> Crappy compiler, you're running an old gcc. But sure, send me a patch 
> that initializes mdfd to -1.

I would rather make the code more obviously correct.

		if (dv->devname[0] == '/' ||
		    (mdfd = open_dev(dv->devname)) < 0)
			mdfd = open_mddev(dv->devname, 1);

well... it is more obvious to me.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: NeilBrown @ 2017-04-06 22:44 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <070d7f50-c8f0-d5df-89ed-adb8b7582d8a@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6619 bytes --]

On Thu, Apr 06 2017, Jes Sorensen wrote:

> On 04/05/2017 06:32 PM, NeilBrown wrote:
>> On Thu, Apr 06 2017, jes.sorensen@gmail.com wrote:
>>
>>> jes.sorensen@gmail.com writes:
>>>> Hi Neil,
>>>>
>>>> Looking through the code in mdadm, I noticed a number of cases calling
>>>> ioctl(RAID_VERSION). At first I had it confused with metadata version,
>>>> but it looks like RAID_VERSION will always return 90000 if it's a valid
>>>> raid device.
>>>>
>>>> In the cases we want to confirm the fd is a valid raid array,
>>>> ioctl(GET_ARRAY_INFO) should do, or sysfs_read(GET_VERSION).
>>>>
>>>> Am I missing something obvious here, or do you see any reason for
>>>> leaving this around?
>>>
>>> Sorry the above is wrong, it will always return 900, not 90000. Some of
>>> the code that stood out is in util.c:
>>>
>>> int md_get_version(int fd)
>>> {
>>>         struct stat stb;
>>>         mdu_version_t vers;
>>>
>>>         if (fstat(fd, &stb)<0)
>>>                 return -1;
>>>         if ((S_IFMT&stb.st_mode) != S_IFBLK)
>>>                 return -1;
>>>
>>>         if (ioctl(fd, RAID_VERSION, &vers) == 0)
>>>                 return  (vers.major*10000) + (vers.minor*100) + vers.patchlevel;
>>>         if (errno == EACCES)
>>>                 return -1;
>>>         if (major(stb.st_rdev) == MD_MAJOR)
>>>                 return (3600);
>>>         return -1;
>>> }
>>>
>>> ....
>>>
>>> int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info)
>>> {
>>>         /* Initialise kernel's knowledge of array.
>>>          * This varies between externally managed arrays
>>>          * and older kernels
>>>          */
>>>         int vers = md_get_version(mdfd);
>>>         int rv;
>>>
>>> #ifndef MDASSEMBLE
>>>         if (st->ss->external)
>>>                 rv = sysfs_set_array(info, vers);
>>>         else
>>> #endif
>>>                 if ((vers % 100) >= 1) { /* can use different versions */
>>>                 mdu_array_info_t inf;
>>>                 memset(&inf, 0, sizeof(inf));
>>>                 inf.major_version = info->array.major_version;
>>>                 inf.minor_version = info->array.minor_version;
>>>                 rv = ioctl(mdfd, SET_ARRAY_INFO, &inf);
>>>         } else
>>>                 rv = ioctl(mdfd, SET_ARRAY_INFO, NULL);
>>>         return rv;
>>> }
>>>
>>> This has been around since at least 2008, the current code came in
>>> f35f25259279573c6274e2783536c0b0a399bdd4, but it looks like even the
>>> prior code made the same assumptions.
>>>
>>> In either case, the above 'if ((vers % 100) >= 1)' will always trigger
>>> since the kernel does #define MD_PATCHLEVEL_VERSION 3
>>>
>>> It's not like we have been updating MD_PATCHLEVEL_VERSION for a
>>> while. Was the code meant to be looking at the superblock minor version?
>>> I've been staring at this for a while now, so please beat me over the
>>> head if I missed something blatantly obvious.
>>>
>>> Jes
>>
>> It is hard to get versioning right...
>>
>> The version returned by the RAID_VERSION ioctl is meant to reflect the
>> capabilities of the implementation.  We could use the kernel version
>> number for that (and sometimes do), but as distro's often backport
>> features, that isn't always reliable.
>>
>> I've incremented the MD_PATCHLEVEL_VERSION when a change is made that
>> cannot easily be detected from user-space.  As you note, we are up to
>> three.  The last change was in 2.6.15.
>> I've never contemplated changing the other two numbers that RAID_VERSION
>> return.  They don't seem to mean anything useful.
>>
>> What exactly do you mean by "deprecate" the ioctl?
>> If you remove the code in mdadm that calls it, mdadm will not work
>> correctly on kernels older than 2.6.15, and it will be harder to
>> and an future capability that is not easily visible from user space.
>> If you remove the code in the kernel that handles it, you'll break
>> mdadm.
>
> Neil,
>
> I see, thanks for explaining.
>
> The goal is to eventually get out of the ioctl() business and get to a 
> state where we can do everything via sysfs/configfs. Right now we have a 
> big mix between ioctl and sysfs where neither interface does everything. 
> The recent issues with PPL (I think it was) showed that we had to add 
> more ioctl support because the interfaces needed to do it for sysfs 
> weren't quite there. My long term goal is to get that situation improved 
> so we can avoid adding anymore ioctl interfaces and eventually allow for 
> distros to build mdadm with ioctl support disabled. We had a discussion 
> at LSF/MM in Boston about this (Hannes, Shaohua, Song, and myself).

Sounds like a good goal, if approached cautiously (and it does sound
like you are showing proper caution).
And I don't think we need the "/configfs" bit.  Configfs is just for
people who don't understand sysfs ;-)

>
> Reading the code I found it confusing that it was so tied to the patch 
> level, but didn't do anything with the version numbers. At least 
> intuitively if I bumped the version number, I would reset the patchlevel 
> which would break things.

I assumed the version number would never change, because it didn't mean
anything useful.

>
> I think it's fair to draw a line in the sand and say that mdadm-4.1+ 
> will not support kernels older than 2.6.15. I am open to the kernel 
> version we pick here, but I would like to start deprecating some of the 
> really old code. I have patches that does this in my tree, but I need to 
> add a check for kernel version > 2.6.15. I am not aware what SuSE's 
> enterprise kernel versions look like, but checking RHEL/CentOS RHEL5 was 
> 2.6.18, while RHEL4 was 2.6.9 - and RHEL4 has been unsupported for quite 
> a while. At least for RHEL/CentOS 2.6.15 as the line in the sand seems fine.

With my SUSE hat on, I'm happy for new mdadm to not support kernels
older than 3.0.  Probably even 3.12.

>
> For the kernel to expose features to userland in the future, I would 
> prefer to go with a feature-flag style interface exposed via sysfs. That 
> way a distro could enable one feature, but not the other in their kernel 
> without having to worry about actual version numbers.

I think there are usually better ways than feature flags.
If the new feature requires a new file in sysfs, then the existence of
the file signals the presence of the feature.

I'm happy with a plan to freeze the patchlevel at its current value and
use other mechanisms going forward.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: mdadm:compiled warning in mdadm.c:1974:29 treats as errors
From: Jes Sorensen @ 2017-04-06 19:10 UTC (permalink / raw)
  To: Liu Zhilong; +Cc: linux-raid
In-Reply-To: <0492b02e-1c4d-02e7-a196-f15e8684e4f3@suse.com>

On 04/06/2017 04:21 AM, Liu Zhilong wrote:
> hi,
>
> I found this compiling warning, and can reproduce on my Leap42.1,
> Leap42.2 and SLES12 SP2 with v4.8.5 version of gcc.
>
> # gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/usr/lib64/gcc/x86_64-suse-linux/4.8/lto-wrapper
> Target: x86_64-suse-linux
> Configured with: ../configure --prefix=/usr --infodir=/usr/share/info
> --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64
> --enable-languages=c,c++,objc,fortran,obj-c++,java,ada
> --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.8
> --enable-ssp --disable-libssp --disable-plugin
> --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux'
> --disable-libgcj --disable-libmudflap --with-slibdir=/lib64
> --with-system-zlib --enable-__cxa_atexit
> --enable-libstdcxx-allocator=new --disable-libstdcxx-pch
> --enable-version-specific-runtime-libs --enable-linker-build-id
> --enable-linux-futex --program-suffix=-4.8 --without-system-libunwind
> --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
> --host=x86_64-suse-linux
> Thread model: posix
> gcc version 4.8.5 (SUSE Linux)
>
> # make everything
> ... ...
> mdadm.c: In function ‘main’:
> mdadm.c:1974:29: error: ‘mdfd’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>    if (dv->devname[0] == '/' || mdfd < 0)
>                              ^
> mdadm.c:1914:7: note: ‘mdfd’ was declared here
>    int mdfd;
>        ^
> cc1: all warnings being treated as errors
> Makefile:206: recipe for target 'mdadm.Os' failed
> make: *** [mdadm.Os] Error 1

Crappy compiler, you're running an old gcc. But sure, send me a patch 
that initializes mdfd to -1.

Jes



^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Jes Sorensen @ 2017-04-06 18:14 UTC (permalink / raw)
  To: Coly Li, NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <7f1cdca1-2781-d6da-265c-6a51061b3da0@suse.de>

On 04/06/2017 02:10 PM, Coly Li wrote:
> On 2017/4/6 下午11:31, Jes Sorensen wrote:
>> Neil,
>>
>> I see, thanks for explaining.
>>
>> The goal is to eventually get out of the ioctl() business and get to a
>> state where we can do everything via sysfs/configfs. Right now we have a
>> big mix between ioctl and sysfs where neither interface does everything.
>> The recent issues with PPL (I think it was) showed that we had to add
>> more ioctl support because the interfaces needed to do it for sysfs
>> weren't quite there. My long term goal is to get that situation improved
>> so we can avoid adding anymore ioctl interfaces and eventually allow for
>> distros to build mdadm with ioctl support disabled. We had a discussion
>> at LSF/MM in Boston about this (Hannes, Shaohua, Song, and myself).
>>
>> Reading the code I found it confusing that it was so tied to the patch
>> level, but didn't do anything with the version numbers. At least
>> intuitively if I bumped the version number, I would reset the patchlevel
>> which would break things.
>>
>> I think it's fair to draw a line in the sand and say that mdadm-4.1+
>> will not support kernels older than 2.6.15. I am open to the kernel
>> version we pick here, but I would like to start deprecating some of the
>> really old code. I have patches that does this in my tree, but I need to
>> add a check for kernel version > 2.6.15. I am not aware what SuSE's
>> enterprise kernel versions look like, but checking RHEL/CentOS RHEL5 was
>> 2.6.18, while RHEL4 was 2.6.9 - and RHEL4 has been unsupported for quite
>> a while. At least for RHEL/CentOS 2.6.15 as the line in the sand seems
>> fine.
>>
>> For the kernel to expose features to userland in the future, I would
>> prefer to go with a feature-flag style interface exposed via sysfs. That
>> way a distro could enable one feature, but not the other in their kernel
>> without having to worry about actual version numbers.
>
> BTW, I'd like to take on the kernel side stuffs.
> IMHO, it would be better that we also set a timeline, after the
> timeline, we should add new interface via configfs, and keep all legancy
> ioctl implementation before this timeline.
>
> Coly

A volunteer! A volunteer!

I don't think we need to set a firm timeline for removing code from the 
kernel at this point. What we can do is implement the new interfaces via 
sysfs/configfs, and then make ioctl support a config option. Down the 
line it will be evident if/when we can rip out the old code, but I see 
that being years away.

Cheers,
Jes


^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Coly Li @ 2017-04-06 18:10 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <070d7f50-c8f0-d5df-89ed-adb8b7582d8a@gmail.com>

On 2017/4/6 下午11:31, Jes Sorensen wrote:
> On 04/05/2017 06:32 PM, NeilBrown wrote:
>> On Thu, Apr 06 2017, jes.sorensen@gmail.com wrote:
>>
>>> jes.sorensen@gmail.com writes:
>>>> Hi Neil,
>>>>
>>>> Looking through the code in mdadm, I noticed a number of cases calling
>>>> ioctl(RAID_VERSION). At first I had it confused with metadata version,
>>>> but it looks like RAID_VERSION will always return 90000 if it's a valid
>>>> raid device.
>>>>
>>>> In the cases we want to confirm the fd is a valid raid array,
>>>> ioctl(GET_ARRAY_INFO) should do, or sysfs_read(GET_VERSION).
>>>>
>>>> Am I missing something obvious here, or do you see any reason for
>>>> leaving this around?
>>>
>>> Sorry the above is wrong, it will always return 900, not 90000. Some of
>>> the code that stood out is in util.c:
>>>
>>> int md_get_version(int fd)
>>> {
>>>         struct stat stb;
>>>         mdu_version_t vers;
>>>
>>>         if (fstat(fd, &stb)<0)
>>>                 return -1;
>>>         if ((S_IFMT&stb.st_mode) != S_IFBLK)
>>>                 return -1;
>>>
>>>         if (ioctl(fd, RAID_VERSION, &vers) == 0)
>>>                 return  (vers.major*10000) + (vers.minor*100) +
>>> vers.patchlevel;
>>>         if (errno == EACCES)
>>>                 return -1;
>>>         if (major(stb.st_rdev) == MD_MAJOR)
>>>                 return (3600);
>>>         return -1;
>>> }
>>>
>>> ....
>>>
>>> int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info)
>>> {
>>>         /* Initialise kernel's knowledge of array.
>>>          * This varies between externally managed arrays
>>>          * and older kernels
>>>          */
>>>         int vers = md_get_version(mdfd);
>>>         int rv;
>>>
>>> #ifndef MDASSEMBLE
>>>         if (st->ss->external)
>>>                 rv = sysfs_set_array(info, vers);
>>>         else
>>> #endif
>>>                 if ((vers % 100) >= 1) { /* can use different
>>> versions */
>>>                 mdu_array_info_t inf;
>>>                 memset(&inf, 0, sizeof(inf));
>>>                 inf.major_version = info->array.major_version;
>>>                 inf.minor_version = info->array.minor_version;
>>>                 rv = ioctl(mdfd, SET_ARRAY_INFO, &inf);
>>>         } else
>>>                 rv = ioctl(mdfd, SET_ARRAY_INFO, NULL);
>>>         return rv;
>>> }
>>>
>>> This has been around since at least 2008, the current code came in
>>> f35f25259279573c6274e2783536c0b0a399bdd4, but it looks like even the
>>> prior code made the same assumptions.
>>>
>>> In either case, the above 'if ((vers % 100) >= 1)' will always trigger
>>> since the kernel does #define MD_PATCHLEVEL_VERSION 3
>>>
>>> It's not like we have been updating MD_PATCHLEVEL_VERSION for a
>>> while. Was the code meant to be looking at the superblock minor version?
>>> I've been staring at this for a while now, so please beat me over the
>>> head if I missed something blatantly obvious.
>>>
>>> Jes
>>
>> It is hard to get versioning right...
>>
>> The version returned by the RAID_VERSION ioctl is meant to reflect the
>> capabilities of the implementation.  We could use the kernel version
>> number for that (and sometimes do), but as distro's often backport
>> features, that isn't always reliable.
>>
>> I've incremented the MD_PATCHLEVEL_VERSION when a change is made that
>> cannot easily be detected from user-space.  As you note, we are up to
>> three.  The last change was in 2.6.15.
>> I've never contemplated changing the other two numbers that RAID_VERSION
>> return.  They don't seem to mean anything useful.
>>
>> What exactly do you mean by "deprecate" the ioctl?
>> If you remove the code in mdadm that calls it, mdadm will not work
>> correctly on kernels older than 2.6.15, and it will be harder to
>> and an future capability that is not easily visible from user space.
>> If you remove the code in the kernel that handles it, you'll break
>> mdadm.
> 
> Neil,
> 
> I see, thanks for explaining.
> 
> The goal is to eventually get out of the ioctl() business and get to a
> state where we can do everything via sysfs/configfs. Right now we have a
> big mix between ioctl and sysfs where neither interface does everything.
> The recent issues with PPL (I think it was) showed that we had to add
> more ioctl support because the interfaces needed to do it for sysfs
> weren't quite there. My long term goal is to get that situation improved
> so we can avoid adding anymore ioctl interfaces and eventually allow for
> distros to build mdadm with ioctl support disabled. We had a discussion
> at LSF/MM in Boston about this (Hannes, Shaohua, Song, and myself).
> 
> Reading the code I found it confusing that it was so tied to the patch
> level, but didn't do anything with the version numbers. At least
> intuitively if I bumped the version number, I would reset the patchlevel
> which would break things.
> 
> I think it's fair to draw a line in the sand and say that mdadm-4.1+
> will not support kernels older than 2.6.15. I am open to the kernel
> version we pick here, but I would like to start deprecating some of the
> really old code. I have patches that does this in my tree, but I need to
> add a check for kernel version > 2.6.15. I am not aware what SuSE's
> enterprise kernel versions look like, but checking RHEL/CentOS RHEL5 was
> 2.6.18, while RHEL4 was 2.6.9 - and RHEL4 has been unsupported for quite
> a while. At least for RHEL/CentOS 2.6.15 as the line in the sand seems
> fine.
> 
> For the kernel to expose features to userland in the future, I would
> prefer to go with a feature-flag style interface exposed via sysfs. That
> way a distro could enable one feature, but not the other in their kernel
> without having to worry about actual version numbers.

BTW, I'd like to take on the kernel side stuffs.
IMHO, it would be better that we also set a timeline, after the
timeline, we should add new interface via configfs, and keep all legancy
ioctl implementation before this timeline.

Coly


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox