Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* [PATCH v2 1/5] lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
From: Anup Patel @ 2017-02-07  8:16 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486455406-11202-1-git-send-email-anup.patel@broadcom.com>

The raid6_gfexp table represents {2}^n values for 0 <= n < 256. The
Linux async_tx framework pass values from raid6_gfexp as coefficients
for each source to prep_dma_pq() callback of DMA channel with PQ
capability. This creates problem for RAID6 offload engines (such as
Broadcom SBA) which take disk position (i.e. log of {2}) instead of
multiplicative cofficients from raid6_gfexp table.

This patch adds raid6_gflog table having log-of-2 value for any given
x such that 0 <= x < 256. For any given disk coefficient x, the
corresponding disk position is given by raid6_gflog[x]. The RAID6
offload engine driver can use this newly added raid6_gflog table to
get disk position from multiplicative coefficient.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
---
 include/linux/raid/pq.h |  1 +
 lib/raid6/mktables.c    | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..30f9453 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -142,6 +142,7 @@ int raid6_select_algo(void);
 extern const u8 raid6_gfmul[256][256] __attribute__((aligned(256)));
 extern const u8 raid6_vgfmul[256][32] __attribute__((aligned(256)));
 extern const u8 raid6_gfexp[256]      __attribute__((aligned(256)));
+extern const u8 raid6_gflog[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfinv[256]      __attribute__((aligned(256)));
 extern const u8 raid6_gfexi[256]      __attribute__((aligned(256)));
 
diff --git a/lib/raid6/mktables.c b/lib/raid6/mktables.c
index 39787db..e824d08 100644
--- a/lib/raid6/mktables.c
+++ b/lib/raid6/mktables.c
@@ -125,6 +125,26 @@ int main(int argc, char *argv[])
 	printf("EXPORT_SYMBOL(raid6_gfexp);\n");
 	printf("#endif\n");
 
+	/* Compute log-of-2 table */
+	printf("\nconst u8 __attribute__((aligned(256)))\n"
+	       "raid6_gflog[256] =\n" "{\n");
+	for (i = 0; i < 256; i += 8) {
+		printf("\t");
+		for (j = 0; j < 8; j++) {
+			v = 255;
+			for (k = 0; k < 256; k++)
+				if (exptbl[k] == (i + j)) {
+					v = k;
+					break;
+				}
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+		}
+	}
+	printf("};\n");
+	printf("#ifdef __KERNEL__\n");
+	printf("EXPORT_SYMBOL(raid6_gflog);\n");
+	printf("#endif\n");
+
 	/* Compute inverse table x^-1 == x^254 */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
 	       "raid6_gfinv[256] =\n" "{\n");
-- 
2.7.4


^ permalink raw reply related

* [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-07  8:16 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA, Anup Patel
In-Reply-To: <1486455406-11202-1-git-send-email-anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

The DMAENGINE framework assumes that if PQ offload is supported by a
DMA device then all 256 PQ coefficients are supported. This assumption
does not hold anymore because we now have BCM-SBA-RAID offload engine
which supports PQ offload with limited number of PQ coefficients.

This patch extends async_tx APIs to handle DMA devices with support
for fewer PQ coefficients.

Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reviewed-by: Scott Branden <scott.branden-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
---
 crypto/async_tx/async_pq.c          |  3 +++
 crypto/async_tx/async_raid6_recov.c | 12 ++++++++++--
 include/linux/dmaengine.h           | 19 +++++++++++++++++++
 include/linux/raid/pq.h             |  3 +++
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index f83de99..16c6526 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -187,6 +187,9 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks,
 
 	BUG_ON(disks > 255 || !(P(blocks, disks) || Q(blocks, disks)));
 
+	if (device && dma_maxpqcoef(device) < src_cnt)
+		device = NULL;
+
 	if (device)
 		unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOWAIT);
 
diff --git a/crypto/async_tx/async_raid6_recov.c b/crypto/async_tx/async_raid6_recov.c
index 8fab627..2916f95 100644
--- a/crypto/async_tx/async_raid6_recov.c
+++ b/crypto/async_tx/async_raid6_recov.c
@@ -352,6 +352,7 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
 {
 	void *scribble = submit->scribble;
 	int non_zero_srcs, i;
+	struct dma_chan *chan = async_dma_find_channel(DMA_PQ);
 
 	BUG_ON(faila == failb);
 	if (failb < faila)
@@ -359,12 +360,15 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
 
 	pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);
 
+	if (chan && dma_maxpqcoef(chan->device) < RAID6_PQ_MAX_COEF)
+		chan = NULL;
+
 	/* if a dma resource is not available or a scribble buffer is not
 	 * available punt to the synchronous path.  In the 'dma not
 	 * available' case be sure to use the scribble buffer to
 	 * preserve the content of 'blocks' as the caller intended.
 	 */
-	if (!async_dma_find_channel(DMA_PQ) || !scribble) {
+	if (!chan || !scribble) {
 		void **ptrs = scribble ? scribble : (void **) blocks;
 
 		async_tx_quiesce(&submit->depend_tx);
@@ -432,15 +436,19 @@ async_raid6_datap_recov(int disks, size_t bytes, int faila,
 	void *scribble = submit->scribble;
 	int good_srcs, good, i;
 	struct page *srcs[2];
+	struct dma_chan *chan = async_dma_find_channel(DMA_PQ);
 
 	pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);
 
+	if (chan && dma_maxpqcoef(chan->device) < RAID6_PQ_MAX_COEF)
+		chan = NULL;
+
 	/* if a dma resource is not available or a scribble buffer is not
 	 * available punt to the synchronous path.  In the 'dma not
 	 * available' case be sure to use the scribble buffer to
 	 * preserve the content of 'blocks' as the caller intended.
 	 */
-	if (!async_dma_find_channel(DMA_PQ) || !scribble) {
+	if (!chan || !scribble) {
 		void **ptrs = scribble ? scribble : (void **) blocks;
 
 		async_tx_quiesce(&submit->depend_tx);
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index feee6ec..d938a8b 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -24,6 +24,7 @@
 #include <linux/scatterlist.h>
 #include <linux/bitmap.h>
 #include <linux/types.h>
+#include <linux/raid/pq.h>
 #include <asm/page.h>
 
 /**
@@ -668,6 +669,7 @@ struct dma_filter {
  * @cap_mask: one or more dma_capability flags
  * @max_xor: maximum number of xor sources, 0 if no capability
  * @max_pq: maximum number of PQ sources and PQ-continue capability
+ * @max_pqcoef: maximum number of PQ coefficients, 0 if all supported
  * @copy_align: alignment shift for memcpy operations
  * @xor_align: alignment shift for xor operations
  * @pq_align: alignment shift for pq operations
@@ -727,11 +729,13 @@ struct dma_device {
 	dma_cap_mask_t  cap_mask;
 	unsigned short max_xor;
 	unsigned short max_pq;
+	unsigned short max_pqcoef;
 	enum dmaengine_alignment copy_align;
 	enum dmaengine_alignment xor_align;
 	enum dmaengine_alignment pq_align;
 	enum dmaengine_alignment fill_align;
 	#define DMA_HAS_PQ_CONTINUE (1 << 15)
+	#define DMA_HAS_FEWER_PQ_COEF (1 << 15)
 
 	int dev_id;
 	struct device *dev;
@@ -1122,6 +1126,21 @@ static inline int dma_maxpq(struct dma_device *dma, enum dma_ctrl_flags flags)
 	BUG();
 }
 
+static inline void dma_set_maxpqcoef(struct dma_device *dma,
+				     unsigned short max_pqcoef)
+{
+	if (max_pqcoef < RAID6_PQ_MAX_COEF) {
+		dma->max_pqcoef = max_pqcoef;
+		dma->max_pqcoef |= DMA_HAS_FEWER_PQ_COEF;
+	}
+}
+
+static inline unsigned short dma_maxpqcoef(struct dma_device *dma)
+{
+	return (dma->max_pqcoef & DMA_HAS_FEWER_PQ_COEF) ?
+		(dma->max_pqcoef & ~DMA_HAS_FEWER_PQ_COEF) : RAID6_PQ_MAX_COEF;
+}
+
 static inline size_t dmaengine_get_icg(bool inc, bool sgl, size_t icg,
 				      size_t dir_icg)
 {
diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 30f9453..f3a04bb 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -15,6 +15,9 @@
 
 #ifdef __KERNEL__
 
+/* Max number of PQ coefficients */
+#define RAID6_PQ_MAX_COEF 256
+
 /* Set to 1 to use kernel-wide empty_zero_page */
 #define RAID6_USE_EMPTY_ZERO_PAGE 0
 #include <linux/blkdev.h>
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 3/5] async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
From: Anup Patel @ 2017-02-07  8:16 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA, Anup Patel
In-Reply-To: <1486455406-11202-1-git-send-email-anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

The DMA_PREP_FENCE is to be used when preparing Tx descriptor if output
of Tx descriptor is to be used by next/dependent Tx descriptor.

The DMA_PREP_FENSE will not be set correctly in do_async_gen_syndrome()
when calling dma->device_prep_dma_pq() under following conditions:
1. ASYNC_TX_FENCE not set in submit->flags
2. DMA_PREP_FENCE not set in dma_flags
3. src_cnt (= (disks - 2)) is greater than dma_maxpq(dma, dma_flags)

This patch fixes DMA_PREP_FENCE usage in do_async_gen_syndrome() taking
inspiration from do_async_xor() implementation.

Signed-off-by: Anup Patel <anup.patel-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reviewed-by: Ray Jui <ray.jui-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Reviewed-by: Scott Branden <scott.branden-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
---
 crypto/async_tx/async_pq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index 16c6526..947cf35 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -62,9 +62,6 @@ do_async_gen_syndrome(struct dma_chan *chan,
 	dma_addr_t dma_dest[2];
 	int src_off = 0;
 
-	if (submit->flags & ASYNC_TX_FENCE)
-		dma_flags |= DMA_PREP_FENCE;
-
 	while (src_cnt > 0) {
 		submit->flags = flags_orig;
 		pq_src_cnt = min(src_cnt, dma_maxpq(dma, dma_flags));
@@ -83,6 +80,8 @@ do_async_gen_syndrome(struct dma_chan *chan,
 			if (cb_fn_orig)
 				dma_flags |= DMA_PREP_INTERRUPT;
 		}
+		if (submit->flags & ASYNC_TX_FENCE)
+			dma_flags |= DMA_PREP_FENCE;
 
 		/* Drivers force forward progress in case they can not provide
 		 * a descriptor
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 4/5] dmaengine: Add Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-07  8:16 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486455406-11202-1-git-send-email-anup.patel@broadcom.com>

The Broadcom stream buffer accelerator (SBA) provides offloading
capabilities for RAID operations. This SBA offload engine is
accessible via Broadcom SoC specific ring manager.

This patch adds Broadcom SBA RAID driver which provides one
DMA device with RAID capabilities using one or more Broadcom
SoC specific ring manager channels. The SBA RAID driver in its
current shape implements memcpy, xor, and pq operations.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
 drivers/dma/Kconfig        |   13 +
 drivers/dma/Makefile       |    1 +
 drivers/dma/bcm-sba-raid.c | 1470 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1484 insertions(+)
 create mode 100644 drivers/dma/bcm-sba-raid.c

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 263495d..bf8fb84 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -99,6 +99,19 @@ config AXI_DMAC
 	  controller is often used in Analog Device's reference designs for FPGA
 	  platforms.
 
+config BCM_SBA_RAID
+	tristate "Broadcom SBA RAID engine support"
+	depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
+	select DMA_ENGINE
+	select DMA_ENGINE_RAID
+	select ASYNC_TX_ENABLE_CHANNEL_SWITCH
+	default ARCH_BCM_IPROC
+	help
+	  Enable support for Broadcom SBA RAID Engine. The SBA RAID
+	  engine is available on most of the Broadcom iProc SoCs. It
+	  has the capability to offload memcpy, xor and pq computation
+	  for raid5/6.
+
 config COH901318
 	bool "ST-Ericsson COH901318 DMA support"
 	select DMA_ENGINE
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index a4fa336..ba96bdd 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_AMCC_PPC440SPE_ADMA) += ppc4xx/
 obj-$(CONFIG_AT_HDMAC) += at_hdmac.o
 obj-$(CONFIG_AT_XDMAC) += at_xdmac.o
 obj-$(CONFIG_AXI_DMAC) += dma-axi-dmac.o
+obj-$(CONFIG_BCM_SBA_RAID) += bcm-sba-raid.o
 obj-$(CONFIG_COH901318) += coh901318.o coh901318_lli.o
 obj-$(CONFIG_DMA_BCM2835) += bcm2835-dma.o
 obj-$(CONFIG_DMA_JZ4740) += dma-jz4740.o
diff --git a/drivers/dma/bcm-sba-raid.c b/drivers/dma/bcm-sba-raid.c
new file mode 100644
index 0000000..c823462
--- /dev/null
+++ b/drivers/dma/bcm-sba-raid.c
@@ -0,0 +1,1470 @@
+/*
+ * Copyright (C) 2017 Broadcom
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+/*
+ * Broadcom SBA RAID Driver
+ *
+ * The Broadcom stream buffer accelerator (SBA) provides offloading
+ * capabilities for RAID operations. The SBA offload engine is accessible
+ * via Broadcom SoC specific ring manager. Two or more offload engines
+ * can share same Broadcom SoC specific ring manager due to this Broadcom
+ * SoC specific ring manager driver is implemented as a mailbox controller
+ * driver and offload engine drivers are implemented as mallbox clients.
+ *
+ * Typically, Broadcom SoC specific ring manager will implement larger
+ * number of hardware rings over one or more SBA hardware devices. By
+ * design, the internal buffer size of SBA hardware device is limited
+ * but all offload operations supported by SBA can be broken down into
+ * multiple small size requests and executed parallely on multiple SBA
+ * hardware devices for achieving high through-put.
+ *
+ * The Broadcom SBA RAID driver does not require any register programming
+ * except submitting request to SBA hardware device via mailbox channels.
+ * This driver implements a DMA device with one DMA channel using a set
+ * of mailbox channels provided by Broadcom SoC specific ring manager
+ * driver. To exploit parallelism (as described above), all DMA request
+ * coming to SBA RAID DMA channel are broken down to smaller requests
+ * and submitted to multiple mailbox channels in round-robin fashion.
+ * For having more SBA DMA channels, we can create more SBA device nodes
+ * in Broadcom SoC specific DTS based on number of hardware rings supported
+ * by Broadcom SoC ring manager.
+ */
+
+#include <linux/bitops.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+#include <linux/list.h>
+#include <linux/mailbox_client.h>
+#include <linux/mailbox/brcm-message.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/slab.h>
+#include <linux/raid/pq.h>
+
+#include "dmaengine.h"
+
+/* SBA command helper macros */
+#define SBA_DEC(_d, _s, _m)		(((_d) >> (_s)) & (_m))
+#define SBA_ENC(_d, _v, _s, _m)					\
+		do {						\
+			(_d) &= ~((u64)(_m) << (_s));		\
+			(_d) |= (((u64)(_v) & (_m)) << (_s));	\
+		} while (0)
+
+/* SBA command related defines */
+#define SBA_TYPE_SHIFT					48
+#define SBA_TYPE_MASK					GENMASK(1, 0)
+#define SBA_TYPE_A					0x0
+#define SBA_TYPE_B					0x2
+#define SBA_TYPE_C					0x3
+#define SBA_USER_DEF_SHIFT				32
+#define SBA_USER_DEF_MASK				GENMASK(15, 0)
+#define SBA_R_MDATA_SHIFT				24
+#define SBA_R_MDATA_MASK				GENMASK(7, 0)
+#define SBA_C_MDATA_MS_SHIFT				18
+#define SBA_C_MDATA_MS_MASK				GENMASK(1, 0)
+#define SBA_INT_SHIFT					17
+#define SBA_INT_MASK					BIT(0)
+#define SBA_RESP_SHIFT					16
+#define SBA_RESP_MASK					BIT(0)
+#define SBA_C_MDATA_SHIFT				8
+#define SBA_C_MDATA_MASK				GENMASK(7, 0)
+#define SBA_C_MDATA_BNUMx_SHIFT(__bnum)			(2 * (__bnum))
+#define SBA_C_MDATA_BNUMx_MASK				GENMASK(1, 0)
+#define SBA_C_MDATA_DNUM_SHIFT				5
+#define SBA_C_MDATA_DNUM_MASK				GENMASK(4, 0)
+#define SBA_C_MDATA_LS(__v)				((__v) & 0xff)
+#define SBA_C_MDATA_MS(__v)				(((__v) >> 8) & 0x3)
+#define SBA_CMD_SHIFT					0
+#define SBA_CMD_MASK					GENMASK(3, 0)
+#define SBA_CMD_ZERO_ALL_BUFFERS			0x8
+#define SBA_CMD_LOAD_BUFFER				0x9
+#define SBA_CMD_XOR					0xa
+#define SBA_CMD_GALOIS_XOR				0xb
+#define SBA_CMD_ZERO_BUFFER				0x4
+#define SBA_CMD_WRITE_BUFFER				0xc
+
+/* Driver helper macros */
+#define to_sba_request(tx)		\
+	container_of(tx, struct sba_request, tx)
+#define to_sba_device(dchan)		\
+	container_of(dchan, struct sba_device, dma_chan)
+
+enum sba_request_state {
+	SBA_REQUEST_STATE_FREE = 1,
+	SBA_REQUEST_STATE_ALLOCED = 2,
+	SBA_REQUEST_STATE_PENDING = 3,
+	SBA_REQUEST_STATE_ACTIVE = 4,
+	SBA_REQUEST_STATE_COMPLETED = 5,
+	SBA_REQUEST_STATE_ABORTED = 6,
+};
+
+struct sba_request {
+	/* Global state */
+	struct list_head node;
+	struct sba_device *sba;
+	enum sba_request_state state;
+	bool fence;
+	/* Chained requests management */
+	struct sba_request *first;
+	struct list_head next;
+	unsigned int next_count;
+	atomic_t next_pending_count;
+	/* BRCM message data */
+	void *resp;
+	dma_addr_t resp_dma;
+	struct brcm_sba_command *cmds;
+	struct brcm_message *msgs;
+	struct brcm_message bmsg;
+	atomic_t msgs_pending_count;
+	struct dma_async_tx_descriptor tx;
+};
+
+enum sba_version {
+	SBA_VER_1 = 0,
+	SBA_VER_2
+};
+
+struct sba_device {
+	/* Underlying device */
+	struct device *dev;
+	/* DT configuration parameters */
+	enum sba_version ver;
+	u32 max_req;
+	u32 req_size;
+	/* Derived configuration parameters */
+	u32 hw_buf_size;
+	u32 hw_resp_size;
+	u32 max_pq_coefs;
+	u32 max_pq_srcs;
+	u32 max_msg_per_req;
+	u32 max_cmd_per_msg;
+	u32 max_cmd_per_req;
+	u32 max_xor_srcs;
+	u32 max_resp_pool_size;
+	u32 max_cmds_pool_size;
+	/* Maibox client and Mailbox channels */
+	struct mbox_client client;
+	int mchans_count;
+	atomic_t mchans_current;
+	struct mbox_chan **mchans;
+	struct device *mbox_dev;
+	/* DMA device and DMA channel */
+	struct dma_device dma_dev;
+	struct dma_chan dma_chan;
+	/* DMA channel resources */
+	void *resp_base;
+	dma_addr_t resp_dma_base;
+	void *cmds_base;
+	dma_addr_t cmds_dma_base;
+	spinlock_t reqs_lock;
+	struct sba_request *reqs;
+	bool reqs_fence;
+	struct list_head reqs_alloc_list;
+	struct list_head reqs_pending_list;
+	struct list_head reqs_active_list;
+	struct list_head reqs_completed_list;
+	struct list_head reqs_aborted_list;
+	struct list_head reqs_free_list;
+	int reqs_free_count;
+};
+
+/* ====== C_MDATA helper routines ===== */
+
+static inline u32 sba_cmd_load_c_mdata(u32 b0)
+{
+	return b0 & SBA_C_MDATA_BNUMx_MASK;
+}
+
+static inline u32 sba_cmd_write_c_mdata(u32 b0)
+{
+	return b0 & SBA_C_MDATA_BNUMx_MASK;
+}
+
+static inline u32 sba_cmd_xor_c_mdata(u32 b1, u32 b0)
+{
+	return (b0 & SBA_C_MDATA_BNUMx_MASK) |
+	       ((b1 & SBA_C_MDATA_BNUMx_MASK) << SBA_C_MDATA_BNUMx_SHIFT(1));
+}
+
+static inline u32 sba_cmd_pq_c_mdata(u32 d, u32 b1, u32 b0)
+{
+	return (b0 & SBA_C_MDATA_BNUMx_MASK) |
+	       ((b1 & SBA_C_MDATA_BNUMx_MASK) << SBA_C_MDATA_BNUMx_SHIFT(1)) |
+	       ((d & SBA_C_MDATA_DNUM_MASK) << SBA_C_MDATA_DNUM_SHIFT);
+}
+
+/* ====== Channel resource management routines ===== */
+
+static struct sba_request *sba_alloc_request(struct sba_device *sba)
+{
+	unsigned long flags;
+	struct sba_request *req = NULL;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	if (!list_empty(&sba->reqs_free_list)) {
+		req = list_first_entry(&sba->reqs_free_list,
+				       struct sba_request,
+				       node);
+
+		list_move_tail(&req->node, &sba->reqs_alloc_list);
+		req->state = SBA_REQUEST_STATE_ALLOCED;
+		req->fence = false;
+		req->first = req;
+		INIT_LIST_HEAD(&req->next);
+		req->next_count = 1;
+		atomic_set(&req->next_pending_count, 1);
+		atomic_set(&req->msgs_pending_count, 0);
+
+		sba->reqs_free_count--;
+
+		dma_async_tx_descriptor_init(&req->tx, &sba->dma_chan);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+	return req;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_pending_request(struct sba_device *sba,
+				 struct sba_request *req)
+{
+	req->state = SBA_REQUEST_STATE_PENDING;
+	list_move_tail(&req->node, &sba->reqs_pending_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static bool _sba_active_request(struct sba_device *sba,
+				struct sba_request *req)
+{
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+	if (sba->reqs_fence)
+		return false;
+	req->state = SBA_REQUEST_STATE_ACTIVE;
+	list_move_tail(&req->node, &sba->reqs_active_list);
+	if (req->fence)
+		sba->reqs_fence = true;
+	return true;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_abort_request(struct sba_device *sba,
+			       struct sba_request *req)
+{
+	req->state = SBA_REQUEST_STATE_ABORTED;
+	list_move_tail(&req->node, &sba->reqs_aborted_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+}
+
+/* Note: Must be called with sba->reqs_lock held */
+static void _sba_free_request(struct sba_device *sba,
+			      struct sba_request *req)
+{
+	req->state = SBA_REQUEST_STATE_FREE;
+	list_move_tail(&req->node, &sba->reqs_free_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+	sba->reqs_free_count++;
+}
+
+static void sba_complete_chained_requests(struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_request *nreq;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	list_for_each_entry(nreq, &req->next, next) {
+		nreq->state = SBA_REQUEST_STATE_COMPLETED;
+		list_move_tail(&nreq->node, &sba->reqs_completed_list);
+	}
+	req->state = SBA_REQUEST_STATE_COMPLETED;
+	list_move_tail(&req->node, &sba->reqs_completed_list);
+	if (list_empty(&sba->reqs_active_list))
+		sba->reqs_fence = false;
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_free_chained_requests(struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_request *nreq;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	list_for_each_entry(nreq, &req->next, next) {
+		_sba_free_request(sba, nreq);
+	}
+	_sba_free_request(sba, req);
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_chain_request(struct sba_request *first,
+			      struct sba_request *req)
+{
+	unsigned long flags;
+	struct sba_device *sba = req->sba;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	list_add_tail(&req->next, &first->next);
+	req->first = first;
+	first->next_count++;
+	atomic_set(&first->next_pending_count, first->next_count);
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_cleanup_nonpending_requests(struct sba_device *sba)
+{
+	unsigned long flags;
+	struct sba_request *req, *req1;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Freeup all alloced request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_alloc_list, node) {
+		_sba_free_request(sba, req);
+	}
+
+	/* Freeup all completed request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
+		_sba_free_request(sba, req);
+	}
+
+	/* Set all active requests as aborted */
+	list_for_each_entry_safe(req, req1, &sba->reqs_active_list, node) {
+		_sba_abort_request(sba, req);
+	}
+
+	/*
+	 * Note: We expect that aborted request will be eventually
+	 * freed by sba_receive_message()
+	 */
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static void sba_cleanup_pending_requests(struct sba_device *sba)
+{
+	unsigned long flags;
+	struct sba_request *req, *req1;
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Freeup all pending request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_pending_list, node) {
+		if (req->bmsg.batch.msgs_queued < req->bmsg.batch.msgs_count)
+			/* Set partially-queued request as aborted */
+			_sba_abort_request(sba, req);
+		else
+			/* Freeup rest of the pending request */
+			_sba_free_request(sba, req);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+/* ====== DMAENGINE callbacks ===== */
+
+static void sba_free_chan_resources(struct dma_chan *dchan)
+{
+	/*
+	 * Channel resources are pre-alloced so we just free-up
+	 * whatever we can so that we can re-use pre-alloced
+	 * channel resources next time.
+	 */
+	sba_cleanup_nonpending_requests(to_sba_device(dchan));
+}
+
+static int sba_device_terminate_all(struct dma_chan *dchan)
+{
+	/* Cleanup all pending requests */
+	sba_cleanup_pending_requests(to_sba_device(dchan));
+
+	return 0;
+}
+
+static int sba_send_mbox_request(struct sba_device *sba,
+				 struct sba_request *req)
+{
+	int mchans_idx, ret = 0;
+
+	/* Select mailbox channel in round-robin fashion */
+	mchans_idx = atomic_inc_return(&sba->mchans_current);
+	mchans_idx = mchans_idx % sba->mchans_count;
+
+	/* Send batch message for the request */
+	req->bmsg.batch.msgs_queued = 0;
+	ret = mbox_send_message(sba->mchans[mchans_idx], &req->bmsg);
+	if (ret < 0) {
+		dev_err(sba->dev, "channel %d message %d (total %d)",
+			mchans_idx, req->bmsg.batch.msgs_queued,
+			req->bmsg.batch.msgs_count);
+		dev_err(sba->dev, "send message failed with error %d", ret);
+		return ret;
+	}
+	ret = req->bmsg.error;
+	if (ret < 0) {
+		dev_err(sba->dev,
+			"mbox channel %d message %d (total %d)",
+			mchans_idx, req->bmsg.batch.msgs_queued,
+			req->bmsg.batch.msgs_count);
+		dev_err(sba->dev, "message error %d", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void sba_issue_pending(struct dma_chan *dchan)
+{
+	int ret;
+	unsigned long flags;
+	struct sba_request *req, *req1;
+	struct sba_device *sba = to_sba_device(dchan);
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Process all pending request */
+	list_for_each_entry_safe(req, req1, &sba->reqs_pending_list, node) {
+		/* Try to make request active */
+		if (!_sba_active_request(sba, req))
+			break;
+
+		/* Send request to mailbox channel */
+		spin_unlock_irqrestore(&sba->reqs_lock, flags);
+		ret = sba_send_mbox_request(sba, req);
+		spin_lock_irqsave(&sba->reqs_lock, flags);
+
+		/* If something went wrong then keep request pending */
+		if (ret < 0) {
+			_sba_pending_request(sba, req);
+			break;
+		}
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+}
+
+static dma_cookie_t sba_tx_submit(struct dma_async_tx_descriptor *tx)
+{
+	unsigned long flags;
+	dma_cookie_t cookie;
+	struct sba_device *sba;
+	struct sba_request *req, *nreq;
+
+	if (unlikely(!tx))
+		return -EINVAL;
+
+	sba = to_sba_device(tx->chan);
+	req = to_sba_request(tx);
+
+	/* Assign cookie and mark all chained requests pending */
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+	cookie = dma_cookie_assign(tx);
+	list_for_each_entry(nreq, &req->next, next) {
+		_sba_pending_request(sba, nreq);
+	}
+	_sba_pending_request(sba, req);
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+	return cookie;
+}
+
+static enum dma_status sba_tx_status(struct dma_chan *dchan,
+				     dma_cookie_t cookie,
+				     struct dma_tx_state *txstate)
+{
+	int mchan_idx;
+	enum dma_status ret;
+	struct sba_device *sba = to_sba_device(dchan);
+
+	for (mchan_idx = 0; mchan_idx < sba->mchans_count; mchan_idx++)
+		mbox_client_peek_data(sba->mchans[mchan_idx]);
+
+	ret = dma_cookie_status(dchan, cookie, txstate);
+	if (ret == DMA_COMPLETE)
+		return ret;
+
+	return dma_cookie_status(dchan, cookie, txstate);
+}
+
+static unsigned int sba_fillup_memcpy_msg(struct sba_request *req,
+					  struct brcm_sba_command *cmds,
+					  struct brcm_message *msg,
+					  dma_addr_t msg_offset, size_t msg_len,
+					  dma_addr_t dst, dma_addr_t src)
+{
+	u64 cmd;
+	u32 c_mdata;
+	struct brcm_sba_command *cmdsp = cmds;
+
+	/* Type-B command to load data into buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	c_mdata = sba_cmd_load_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+	cmdsp->data = src + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Type-A command to write buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_write_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	if (req->sba->hw_resp_size) {
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+		cmdsp->resp = req->resp_dma;
+		cmdsp->resp_len = req->sba->hw_resp_size;
+	}
+	cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+	cmdsp->data = dst + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+
+	return cmdsp - cmds;
+}
+
+static struct sba_request *
+sba_prep_dma_memcpy_req(struct sba_device *sba,
+			dma_addr_t off, dma_addr_t dst, dma_addr_t src,
+			size_t len, unsigned long flags)
+{
+	size_t msg_len;
+	dma_addr_t msg_offset = 0;
+	unsigned int msgs_count = 0, cmds_count, cmds_idx = 0;
+	struct sba_request *req = NULL;
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request messages */
+	while (len) {
+		msg_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+		cmds_count = sba_fillup_memcpy_msg(req,
+					&req->cmds[cmds_idx],
+					&req->msgs[msgs_count],
+					off + msg_offset, msg_len, dst, src);
+		msgs_count++;
+		cmds_idx += cmds_count;
+		msg_offset += msg_len;
+		len -= msg_len;
+	}
+	req->bmsg.type = BRCM_MESSAGE_BATCH;
+	req->bmsg.batch.msgs = &req->msgs[0];
+	req->bmsg.batch.msgs_queued = 0;
+	req->bmsg.batch.msgs_count = msgs_count;
+	req->bmsg.ctx = req;
+	req->bmsg.error = 0;
+	atomic_set(&req->msgs_pending_count, msgs_count);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_memcpy(struct dma_chan *dchan, dma_addr_t dst, dma_addr_t src,
+		    size_t len, unsigned long flags)
+{
+	size_t req_len;
+	dma_addr_t off = 0;
+	struct sba_device *sba = to_sba_device(dchan);
+	struct sba_request *first = NULL, *req;
+
+	/* Create chained requests where each request is upto req_size */
+	while (len) {
+		req_len = (len < sba->req_size) ? len : sba->req_size;
+
+		req = sba_prep_dma_memcpy_req(sba, off, dst, src,
+					      req_len, flags);
+		if (!req) {
+			if (first)
+				sba_free_chained_requests(first);
+			return NULL;
+		}
+
+		if (first)
+			sba_chain_request(first, req);
+		else
+			first = req;
+
+		off += req_len;
+		len -= req_len;
+	}
+
+	return (first) ? &first->tx : NULL;
+}
+
+static unsigned int sba_fillup_xor_msg(struct sba_request *req,
+				struct brcm_sba_command *cmds,
+				struct brcm_message *msg,
+				dma_addr_t msg_offset, size_t msg_len,
+				dma_addr_t dst, dma_addr_t *src, u32 src_cnt)
+{
+	u64 cmd;
+	u32 c_mdata;
+	unsigned int i;
+	struct brcm_sba_command *cmdsp = cmds;
+
+	/* Type-B command to load data into buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	c_mdata = sba_cmd_load_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+	cmdsp->data = src[0] + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Type-B commands to xor data with buf0 and put it back in buf0 */
+	for (i = 1; i < src_cnt; i++) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_xor_c_mdata(0, 0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_XOR, SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = src[i] + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf0 */
+	cmd = 0;
+	SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+	SBA_ENC(cmd, msg_len,
+		SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+	SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+	c_mdata = sba_cmd_write_c_mdata(0);
+	SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+		SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+	SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+		SBA_CMD_SHIFT, SBA_CMD_MASK);
+	cmdsp->cmd = cmd;
+	*cmdsp->cmd_dma = cpu_to_le64(cmd);
+	cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+	if (req->sba->hw_resp_size) {
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+		cmdsp->resp = req->resp_dma;
+		cmdsp->resp_len = req->sba->hw_resp_size;
+	}
+	cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+	cmdsp->data = dst + msg_offset;
+	cmdsp->data_len = msg_len;
+	cmdsp++;
+
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+
+	return cmdsp - cmds;
+}
+
+struct sba_request *
+sba_prep_dma_xor_req(struct sba_device *sba,
+		     dma_addr_t off, dma_addr_t dst, dma_addr_t *src,
+		     u32 src_cnt, size_t len, unsigned long flags)
+{
+	size_t msg_len;
+	dma_addr_t msg_offset = 0;
+	unsigned int msgs_count = 0, cmds_count, cmds_idx = 0;
+	struct sba_request *req = NULL;
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request messages */
+	while (len) {
+		msg_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+		cmds_count = sba_fillup_xor_msg(req,
+				     &req->cmds[cmds_idx],
+				     &req->msgs[msgs_count],
+				     off + msg_offset, msg_len,
+				     dst, src, src_cnt);
+		msgs_count++;
+		cmds_idx += cmds_count;
+		msg_offset += msg_len;
+		len -= msg_len;
+	}
+	req->bmsg.type = BRCM_MESSAGE_BATCH;
+	req->bmsg.batch.msgs = &req->msgs[0];
+	req->bmsg.batch.msgs_queued = 0;
+	req->bmsg.batch.msgs_count = msgs_count;
+	req->bmsg.ctx = req;
+	req->bmsg.error = 0;
+	atomic_set(&req->msgs_pending_count, msgs_count);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_xor(struct dma_chan *dchan, dma_addr_t dst, dma_addr_t *src,
+		 u32 src_cnt, size_t len, unsigned long flags)
+{
+	size_t req_len;
+	dma_addr_t off = 0;
+	struct sba_device *sba = to_sba_device(dchan);
+	struct sba_request *first = NULL, *req;
+
+	/* Sanity checks */
+	if (unlikely(src_cnt > sba->max_xor_srcs))
+		return NULL;
+
+	/* Create chained requests where each request is upto req_size */
+	while (len) {
+		req_len = (len < sba->req_size) ? len : sba->req_size;
+
+		req = sba_prep_dma_xor_req(sba, off, dst, src, src_cnt,
+					   req_len, flags);
+		if (!req) {
+			if (first)
+				sba_free_chained_requests(first);
+			return NULL;
+		}
+
+		if (first)
+			sba_chain_request(first, req);
+		else
+			first = req;
+
+		off += req_len;
+		len -= req_len;
+	}
+
+	return (first) ? &first->tx : NULL;
+}
+
+static unsigned int sba_fillup_pq_msg(struct sba_request *req,
+				bool pq_continue,
+				struct brcm_sba_command *cmds,
+				struct brcm_message *msg,
+				dma_addr_t msg_offset, size_t msg_len,
+				dma_addr_t *dst_p, dma_addr_t *dst_q,
+				const u8 *scf, dma_addr_t *src, u32 src_cnt)
+{
+	u64 cmd;
+	u32 c_mdata;
+	unsigned int i;
+	struct brcm_sba_command *cmdsp = cmds;
+
+	if (pq_continue) {
+		/* Type-B command to load old P into buf0 */
+		if (dst_p) {
+			cmd = 0;
+			SBA_ENC(cmd, SBA_TYPE_B,
+				SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+			SBA_ENC(cmd, msg_len,
+				SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+			c_mdata = sba_cmd_load_c_mdata(0);
+			SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+				SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+			SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+				SBA_CMD_SHIFT, SBA_CMD_MASK);
+			cmdsp->cmd = cmd;
+			*cmdsp->cmd_dma = cpu_to_le64(cmd);
+			cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+			cmdsp->data = *dst_p + msg_offset;
+			cmdsp->data_len = msg_len;
+			cmdsp++;
+		}
+
+		/* Type-B command to load old Q into buf1 */
+		if (dst_q) {
+			cmd = 0;
+			SBA_ENC(cmd, SBA_TYPE_B,
+				SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+			SBA_ENC(cmd, msg_len,
+				SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+			c_mdata = sba_cmd_load_c_mdata(1);
+			SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+				SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+			SBA_ENC(cmd, SBA_CMD_LOAD_BUFFER,
+				SBA_CMD_SHIFT, SBA_CMD_MASK);
+			cmdsp->cmd = cmd;
+			*cmdsp->cmd_dma = cpu_to_le64(cmd);
+			cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+			cmdsp->data = *dst_q + msg_offset;
+			cmdsp->data_len = msg_len;
+			cmdsp++;
+		}
+	} else {
+		/* Type-A command to load data into buf0 */
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, SBA_CMD_ZERO_ALL_BUFFERS,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		cmdsp++;
+	}
+
+	/* Type-B commands for generate P onto buf0 and Q onto buf1 */
+	for (i = 0; i < src_cnt; i++) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_B, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		c_mdata = sba_cmd_pq_c_mdata(raid6_gflog[scf[i]], 1, 0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_C_MDATA_MS(c_mdata),
+			SBA_C_MDATA_MS_SHIFT, SBA_C_MDATA_MS_MASK);
+		SBA_ENC(cmd, SBA_CMD_GALOIS_XOR,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_B;
+		cmdsp->data = src[i] + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf0 */
+	if (dst_p) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+		c_mdata = sba_cmd_write_c_mdata(0);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		if (req->sba->hw_resp_size) {
+			cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+			cmdsp->resp = req->resp_dma;
+			cmdsp->resp_len = req->sba->hw_resp_size;
+		}
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+		cmdsp->data = *dst_p + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Type-A command to write buf1 */
+	if (dst_q) {
+		cmd = 0;
+		SBA_ENC(cmd, SBA_TYPE_A, SBA_TYPE_SHIFT, SBA_TYPE_MASK);
+		SBA_ENC(cmd, msg_len,
+			SBA_USER_DEF_SHIFT, SBA_USER_DEF_MASK);
+		SBA_ENC(cmd, 0x1, SBA_RESP_SHIFT, SBA_RESP_MASK);
+		c_mdata = sba_cmd_write_c_mdata(1);
+		SBA_ENC(cmd, SBA_C_MDATA_LS(c_mdata),
+			SBA_C_MDATA_SHIFT, SBA_C_MDATA_MASK);
+		SBA_ENC(cmd, SBA_CMD_WRITE_BUFFER,
+			SBA_CMD_SHIFT, SBA_CMD_MASK);
+		cmdsp->cmd = cmd;
+		*cmdsp->cmd_dma = cpu_to_le64(cmd);
+		cmdsp->flags = BRCM_SBA_CMD_TYPE_A;
+		if (req->sba->hw_resp_size) {
+			cmdsp->flags |= BRCM_SBA_CMD_HAS_RESP;
+			cmdsp->resp = req->resp_dma;
+			cmdsp->resp_len = req->sba->hw_resp_size;
+		}
+		cmdsp->flags |= BRCM_SBA_CMD_HAS_OUTPUT;
+		cmdsp->data = *dst_q + msg_offset;
+		cmdsp->data_len = msg_len;
+		cmdsp++;
+	}
+
+	/* Fillup brcm_message */
+	msg->type = BRCM_MESSAGE_SBA;
+	msg->sba.cmds = cmds;
+	msg->sba.cmds_count = cmdsp - cmds;
+	msg->ctx = req;
+	msg->error = 0;
+
+	return cmdsp - cmds;
+}
+
+struct sba_request *
+sba_prep_dma_pq_req(struct sba_device *sba,
+		    dma_addr_t off, dma_addr_t *dst, dma_addr_t *src,
+		    u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
+{
+	size_t dst_count, msg_len;
+	unsigned int msgs_count = 0, cmds_count, cmds_idx = 0;
+	dma_addr_t *dst_p = NULL, *dst_q = NULL;
+	dma_addr_t msg_offset = 0;
+	struct sba_request *req = NULL;
+
+	/* Figure-out P and Q destination addresses */
+	dst_count = 0;
+	if (!(flags & DMA_PREP_PQ_DISABLE_P))
+		dst_p = &dst[dst_count++];
+	if (!(flags & DMA_PREP_PQ_DISABLE_Q))
+		dst_q = &dst[dst_count++];
+
+	/* Alloc new request */
+	req = sba_alloc_request(sba);
+	if (!req)
+		return NULL;
+	req->fence = (flags & DMA_PREP_FENCE) ? true : false;
+
+	/* Fillup request messages */
+	while (len) {
+		msg_len = (len < sba->hw_buf_size) ? len : sba->hw_buf_size;
+		cmds_count = sba_fillup_pq_msg(req, dmaf_continue(flags),
+				    &req->cmds[cmds_idx],
+				    &req->msgs[msgs_count],
+				    off + msg_offset, msg_len,
+				    dst_p, dst_q, scf, src, src_cnt);
+		msgs_count++;
+		cmds_idx += cmds_count;
+		msg_offset += msg_len;
+		len -= msg_len;
+	}
+	req->bmsg.type = BRCM_MESSAGE_BATCH;
+	req->bmsg.batch.msgs = &req->msgs[0];
+	req->bmsg.batch.msgs_queued = 0;
+	req->bmsg.batch.msgs_count = msgs_count;
+	req->bmsg.ctx = req;
+	req->bmsg.error = 0;
+	atomic_set(&req->msgs_pending_count, msgs_count);
+
+	/* Init async_tx descriptor */
+	req->tx.flags = flags;
+	req->tx.cookie = -EBUSY;
+
+	return req;
+}
+
+static struct dma_async_tx_descriptor *
+sba_prep_dma_pq(struct dma_chan *dchan, dma_addr_t *dst, dma_addr_t *src,
+		u32 src_cnt, const u8 *scf, size_t len, unsigned long flags)
+{
+	u32 i;
+	size_t req_len;
+	dma_addr_t off = 0;
+	struct sba_device *sba = to_sba_device(dchan);
+	struct sba_request *first = NULL, *req;
+
+	/* Sanity checks */
+	if (unlikely(src_cnt > sba->max_pq_srcs))
+		return NULL;
+	for (i = 0; i < src_cnt; i++)
+		if (sba->max_pq_coefs <= raid6_gflog[scf[i]])
+			return NULL;
+
+	/* Create chained requests where each request is upto req_size */
+	while (len) {
+		req_len = (len < sba->req_size) ? len : sba->req_size;
+
+		req = sba_prep_dma_pq_req(sba, off, dst, src, src_cnt,
+					  scf, req_len, flags);
+		if (!req) {
+			if (first)
+				sba_free_chained_requests(first);
+			return NULL;
+		}
+
+		if (first)
+			sba_chain_request(first, req);
+		else
+			first = req;
+
+		off += req_len;
+		len -= req_len;
+	}
+
+	return (first) ? &first->tx : NULL;
+}
+
+/* ====== Mailbox callbacks ===== */
+
+static void sba_dma_tx_actions(struct sba_request *req)
+{
+	struct dma_async_tx_descriptor *tx = &req->tx;
+
+	WARN_ON(tx->cookie < 0);
+
+	if (tx->cookie > 0) {
+		dma_cookie_complete(tx);
+
+		/* call the callback (must not sleep or submit new
+		 * operations to this channel)
+		 */
+		if (tx->callback)
+			tx->callback(tx->callback_param);
+
+		dma_descriptor_unmap(tx);
+	}
+
+	/* run dependent operations */
+	dma_run_dependencies(tx);
+
+	/* If waiting for 'ack' then move to completed list */
+	if (!async_tx_test_ack(&req->tx))
+		sba_complete_chained_requests(req);
+	else
+		sba_free_chained_requests(req);
+}
+
+static void sba_receive_message(struct mbox_client *cl, void *msg)
+{
+	unsigned long flags;
+	struct brcm_message *m = msg;
+	struct sba_request *req = m->ctx, *req1;
+	struct sba_device *sba = req->sba;
+
+	/*  error count if message has error */
+	if (m->error < 0) {
+		dev_err(sba->dev, "%s got message with error %d",
+			dma_chan_name(&sba->dma_chan), m->error);
+	}
+
+	/* Wait for all messages of a request to be completed */
+	if (atomic_dec_return(&req->msgs_pending_count))
+		return;
+
+	/* Wait for all chained request to be completed */
+	if (atomic_dec_return(&req->first->next_pending_count))
+		return;
+
+	/* Point to first request */
+	req = req->first;
+
+	/* Update request */
+	if (req->state == SBA_REQUEST_STATE_ACTIVE)
+		sba_dma_tx_actions(req);
+	else
+		sba_free_chained_requests(req);
+
+	spin_lock_irqsave(&sba->reqs_lock, flags);
+
+	/* Re-check all completed request waiting for 'ack' */
+	list_for_each_entry_safe(req, req1, &sba->reqs_completed_list, node) {
+		spin_unlock_irqrestore(&sba->reqs_lock, flags);
+		sba_dma_tx_actions(req);
+		spin_lock_irqsave(&sba->reqs_lock, flags);
+	}
+
+	spin_unlock_irqrestore(&sba->reqs_lock, flags);
+
+	/* Try to submit pending request */
+	sba_issue_pending(&sba->dma_chan);
+}
+
+/* ====== Platform driver routines ===== */
+
+static int sba_prealloc_channel_resources(struct sba_device *sba)
+{
+	int i, j, p, ret = 0;
+	struct sba_request *req = NULL;
+
+	sba->resp_base = dma_alloc_coherent(sba->dma_dev.dev,
+					    sba->max_resp_pool_size,
+					    &sba->resp_dma_base, GFP_KERNEL);
+	if (!sba->resp_base)
+		return -ENOMEM;
+
+	sba->cmds_base = dma_alloc_coherent(sba->dma_dev.dev,
+					    sba->max_cmds_pool_size,
+					    &sba->cmds_dma_base, GFP_KERNEL);
+	if (!sba->cmds_base) {
+		ret = -ENOMEM;
+		goto fail_free_resp_pool;
+	}
+
+	spin_lock_init(&sba->reqs_lock);
+	sba->reqs_fence = false;
+	INIT_LIST_HEAD(&sba->reqs_alloc_list);
+	INIT_LIST_HEAD(&sba->reqs_pending_list);
+	INIT_LIST_HEAD(&sba->reqs_active_list);
+	INIT_LIST_HEAD(&sba->reqs_completed_list);
+	INIT_LIST_HEAD(&sba->reqs_aborted_list);
+	INIT_LIST_HEAD(&sba->reqs_free_list);
+
+	sba->reqs = devm_kcalloc(sba->dev, sba->max_req,
+				 sizeof(*req), GFP_KERNEL);
+	if (!sba->reqs) {
+		ret = -ENOMEM;
+		goto fail_free_cmds_pool;
+	}
+
+	for (i = 0, p = 0; i < sba->max_req; i++) {
+		req = &sba->reqs[i];
+		INIT_LIST_HEAD(&req->node);
+		req->sba = sba;
+		req->state = SBA_REQUEST_STATE_FREE;
+		INIT_LIST_HEAD(&req->next);
+		req->next_count = 1;
+		atomic_set(&req->next_pending_count, 0);
+		req->fence = false;
+		req->resp = sba->resp_base + p;
+		req->resp_dma = sba->resp_dma_base + p;
+		p += sba->hw_resp_size;
+		req->cmds = devm_kcalloc(sba->dev, sba->max_cmd_per_req,
+					 sizeof(*req->cmds), GFP_KERNEL);
+		if (!req->cmds) {
+			ret = -ENOMEM;
+			goto fail_free_cmds_pool;
+		}
+		for (j = 0; j < sba->max_cmd_per_req; j++) {
+			req->cmds[j].cmd = 0;
+			req->cmds[j].cmd_dma = sba->cmds_base +
+				(i * sba->max_cmd_per_req + j) * sizeof(u64);
+			req->cmds[j].cmd_dma_addr = sba->cmds_dma_base +
+				(i * sba->max_cmd_per_req + j) * sizeof(u64);
+			req->cmds[j].flags = 0;
+		}
+		req->msgs = devm_kcalloc(sba->dev, sba->max_msg_per_req,
+					 sizeof(*req->msgs), GFP_KERNEL);
+		if (!req->msgs) {
+			ret = -ENOMEM;
+			goto fail_free_cmds_pool;
+		}
+		memset(&req->bmsg, 0, sizeof(req->bmsg));
+		atomic_set(&req->msgs_pending_count, 0);
+		dma_async_tx_descriptor_init(&req->tx, &sba->dma_chan);
+		req->tx.tx_submit = sba_tx_submit;
+		req->tx.phys = req->resp_dma;
+		list_add_tail(&req->node, &sba->reqs_free_list);
+	}
+
+	sba->reqs_free_count = sba->max_req;
+
+	return 0;
+
+fail_free_cmds_pool:
+	dma_free_coherent(sba->dma_dev.dev,
+			  sba->max_cmds_pool_size,
+			  sba->cmds_base, sba->cmds_dma_base);
+fail_free_resp_pool:
+	dma_free_coherent(sba->dma_dev.dev,
+			  sba->max_resp_pool_size,
+			  sba->resp_base, sba->resp_dma_base);
+	return ret;
+}
+
+static void sba_freeup_channel_resources(struct sba_device *sba)
+{
+	dmaengine_terminate_all(&sba->dma_chan);
+	dma_free_coherent(sba->dma_dev.dev, sba->max_cmds_pool_size,
+			  sba->cmds_base, sba->cmds_dma_base);
+	dma_free_coherent(sba->dma_dev.dev, sba->max_resp_pool_size,
+			  sba->resp_base, sba->resp_dma_base);
+	sba->resp_base = NULL;
+	sba->resp_dma_base = 0;
+}
+
+static int sba_async_register(struct sba_device *sba)
+{
+	int ret;
+	struct dma_device *dma_dev = &sba->dma_dev;
+
+	/* Initialize DMA channel cookie */
+	sba->dma_chan.device = dma_dev;
+	dma_cookie_init(&sba->dma_chan);
+
+	/* Initialize DMA device capability mask */
+	dma_cap_zero(dma_dev->cap_mask);
+	dma_cap_set(DMA_MEMCPY, dma_dev->cap_mask);
+	dma_cap_set(DMA_XOR, dma_dev->cap_mask);
+	dma_cap_set(DMA_PQ, dma_dev->cap_mask);
+
+	/*
+	 * Set mailbox channel device as the base device of
+	 * our dma_device because the actual memory accesses
+	 * will be done by mailbox controller
+	 */
+	dma_dev->dev = sba->mbox_dev;
+
+	/* Set base prep routines */
+	dma_dev->device_free_chan_resources = sba_free_chan_resources;
+	dma_dev->device_terminate_all = sba_device_terminate_all;
+	dma_dev->device_issue_pending = sba_issue_pending;
+	dma_dev->device_tx_status = sba_tx_status;
+
+	/* Set memcpy routines and capability */
+	if (dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask))
+		dma_dev->device_prep_dma_memcpy = sba_prep_dma_memcpy;
+
+	/* Set xor routines and capability */
+	if (dma_has_cap(DMA_XOR, dma_dev->cap_mask)) {
+		dma_dev->device_prep_dma_xor = sba_prep_dma_xor;
+		dma_dev->max_xor = sba->max_xor_srcs;
+	}
+
+	/* Set pq routines and capability */
+	if (dma_has_cap(DMA_PQ, dma_dev->cap_mask)) {
+		dma_dev->device_prep_dma_pq = sba_prep_dma_pq;
+		dma_set_maxpq(dma_dev, sba->max_pq_srcs, 0);
+		dma_set_maxpqcoef(dma_dev, sba->max_pq_coefs);
+	}
+
+	/* Initialize DMA device channel list */
+	INIT_LIST_HEAD(&dma_dev->channels);
+	list_add_tail(&sba->dma_chan.device_node, &dma_dev->channels);
+
+	/* Register with Linux async DMA framework*/
+	ret = dma_async_device_register(dma_dev);
+	if (ret) {
+		dev_err(sba->dev, "async device register error %d", ret);
+		return ret;
+	}
+
+	dev_info(sba->dev, "%s capabilities: %s%s%s\n",
+		 dma_chan_name(&sba->dma_chan),
+		 dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask) ? "memcpy " : "",
+		 dma_has_cap(DMA_XOR, dma_dev->cap_mask) ? "xor " : "",
+		 dma_has_cap(DMA_PQ, dma_dev->cap_mask) ? "pq " : "");
+
+	return 0;
+}
+
+static int sba_probe(struct platform_device *pdev)
+{
+	int i, ret = 0, mchans_count;
+	struct sba_device *sba;
+	struct platform_device *mbox_pdev;
+	struct of_phandle_args args;
+
+	/* Allocate main SBA struct */
+	sba = devm_kzalloc(&pdev->dev, sizeof(*sba), GFP_KERNEL);
+	if (!sba)
+		return -ENOMEM;
+
+	sba->dev = &pdev->dev;
+	platform_set_drvdata(pdev, sba);
+
+	/* Determine SBA version from DT compatible string */
+	if (of_device_is_compatible(sba->dev->of_node, "brcm,iproc-sba"))
+		sba->ver = SBA_VER_1;
+	else if (of_device_is_compatible(sba->dev->of_node,
+					 "brcm,iproc-sba-v2"))
+		sba->ver = SBA_VER_2;
+	else
+		return -ENODEV;
+
+	/* Derived Configuration parameters */
+	switch (sba->ver) {
+	case SBA_VER_1:
+		sba->max_req = 256;
+		sba->req_size = PAGE_SIZE;
+		sba->hw_buf_size = 4096;
+		sba->hw_resp_size = 8;
+		sba->max_pq_coefs = 6;
+		sba->max_pq_srcs = 6;
+		break;
+	case SBA_VER_2:
+		sba->max_req = 256;
+		sba->req_size = PAGE_SIZE;
+		sba->hw_buf_size = 4096;
+		sba->hw_resp_size = 8;
+		sba->max_pq_coefs = 30;
+		/*
+		 * We can support max_pq_srcs == max_pq_coefs because
+		 * we are limited by number of SBA commands that we can
+		 * fit in one message for underlying ring manager HW.
+		 */
+		sba->max_pq_srcs = 12;
+		break;
+	default:
+		return -EINVAL;
+	}
+	sba->max_msg_per_req = sba->req_size / sba->hw_buf_size;
+	if ((sba->max_msg_per_req * sba->hw_buf_size) < sba->req_size)
+		sba->max_msg_per_req++;
+	sba->max_cmd_per_msg = sba->max_pq_srcs + 3;
+	sba->max_cmd_per_req = sba->max_msg_per_req * sba->max_cmd_per_msg;
+	sba->max_xor_srcs = sba->max_cmd_per_msg - 1;
+	sba->max_resp_pool_size = sba->max_req * sba->hw_resp_size;
+	sba->max_cmds_pool_size = sba->max_req *
+				  sba->max_cmd_per_req * sizeof(u64);
+
+	/* Setup mailbox client */
+	sba->client.dev			= &pdev->dev;
+	sba->client.rx_callback		= sba_receive_message;
+	sba->client.tx_block		= false;
+	sba->client.knows_txdone	= false;
+	sba->client.tx_tout		= 0;
+
+	/* Number of channels equals number of mailbox channels */
+	ret = of_count_phandle_with_args(pdev->dev.of_node,
+					 "mboxes", "#mbox-cells");
+	if (ret <= 0)
+		return -ENODEV;
+	mchans_count = ret;
+	sba->mchans_count = 0;
+	atomic_set(&sba->mchans_current, 0);
+
+	/* Allocate mailbox channel array */
+	sba->mchans = devm_kcalloc(&pdev->dev, sba->mchans_count,
+				   sizeof(*sba->mchans), GFP_KERNEL);
+	if (!sba->mchans)
+		return -ENOMEM;
+
+	/* Request mailbox channels */
+	for (i = 0; i < mchans_count; i++) {
+		sba->mchans[i] = mbox_request_channel(&sba->client, i);
+		if (IS_ERR(sba->mchans[i])) {
+			ret = PTR_ERR(sba->mchans[i]);
+			goto fail_free_mchans;
+		}
+		sba->mchans_count++;
+	}
+
+	/* Find-out underlying mailbox device */
+	ret = of_parse_phandle_with_args(pdev->dev.of_node,
+					 "mboxes", "#mbox-cells", 0, &args);
+	if (ret)
+		goto fail_free_mchans;
+	mbox_pdev = of_find_device_by_node(args.np);
+	of_node_put(args.np);
+	if (!mbox_pdev) {
+		ret = -ENODEV;
+		goto fail_free_mchans;
+	}
+	sba->mbox_dev = &mbox_pdev->dev;
+
+	/* All mailbox channels should be of same ring manager device */
+	for (i = 1; i < mchans_count; i++) {
+		ret = of_parse_phandle_with_args(pdev->dev.of_node,
+					 "mboxes", "#mbox-cells", i, &args);
+		if (ret)
+			goto fail_free_mchans;
+		mbox_pdev = of_find_device_by_node(args.np);
+		of_node_put(args.np);
+		if (sba->mbox_dev != &mbox_pdev->dev) {
+			ret = -EINVAL;
+			goto fail_free_mchans;
+		}
+	}
+
+	/* Register DMA device with linux async framework */
+	ret = sba_async_register(sba);
+	if (ret)
+		goto fail_free_mchans;
+
+	/* Prealloc channel resource */
+	ret = sba_prealloc_channel_resources(sba);
+	if (ret)
+		goto fail_async_dev_unreg;
+
+	/* Print device info */
+	dev_info(sba->dev, "%s using SBAv%d and %d mailbox channels",
+		 dma_chan_name(&sba->dma_chan), sba->ver+1,
+		 sba->mchans_count);
+
+	return 0;
+
+fail_async_dev_unreg:
+	dma_async_device_unregister(&sba->dma_dev);
+fail_free_mchans:
+	for (i = 0; i < sba->mchans_count; i++)
+		mbox_free_channel(sba->mchans[i]);
+	return ret;
+}
+
+static int sba_remove(struct platform_device *pdev)
+{
+	int i;
+	struct sba_device *sba = platform_get_drvdata(pdev);
+
+	sba_freeup_channel_resources(sba);
+
+	dma_async_device_unregister(&sba->dma_dev);
+
+	for (i = 0; i < sba->mchans_count; i++)
+		mbox_free_channel(sba->mchans[i]);
+
+	return 0;
+}
+
+static const struct of_device_id sba_of_match[] = {
+	{ .compatible = "brcm,iproc-sba", },
+	{ .compatible = "brcm,iproc-sba-v2", },
+	{},
+};
+MODULE_DEVICE_TABLE(of, sba_of_match);
+
+static struct platform_driver sba_driver = {
+	.probe = sba_probe,
+	.remove = sba_remove,
+	.driver = {
+		.name = "bcm-sba-raid",
+		.of_match_table = sba_of_match,
+	},
+};
+module_platform_driver(sba_driver);
+
+MODULE_DESCRIPTION("Broadcom SBA RAID driver");
+MODULE_AUTHOR("Anup Patel <anup.patel@broadcom.com>");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 5/5] dt-bindings: Add DT bindings document for Broadcom SBA RAID driver
From: Anup Patel @ 2017-02-07  8:16 UTC (permalink / raw)
  To: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar
  Cc: Dan Williams, Ray Jui, Scott Branden, Jon Mason, Rob Rice,
	bcm-kernel-feedback-list, dmaengine, devicetree, linux-arm-kernel,
	linux-kernel, linux-crypto, linux-raid, Anup Patel
In-Reply-To: <1486455406-11202-1-git-send-email-anup.patel@broadcom.com>

This patch adds the DT bindings document for newly added Broadcom
SBA RAID driver.

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
---
 .../devicetree/bindings/dma/brcm,iproc-sba.txt     | 29 ++++++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt

diff --git a/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
new file mode 100644
index 0000000..092913a
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt
@@ -0,0 +1,29 @@
+* Broadcom SBA RAID engine
+
+Required properties:
+- compatible: Should be one of the following
+	      "brcm,iproc-sba"
+	      "brcm,iproc-sba-v2"
+  The "brcm,iproc-sba" has support for only 6 PQ coefficients
+  The "brcm,iproc-sba-v2" has support for only 30 PQ coefficients
+- mboxes: List of phandle and mailbox channel specifiers
+
+Example:
+
+raid_mbox: mbox@67400000 {
+	...
+	#mbox-cells = <3>;
+	...
+};
+
+raid0 {
+	compatible = "brcm,iproc-sba-v2";
+	mboxes = <&raid_mbox 0 0x1 0xffff>,
+		 <&raid_mbox 1 0x1 0xffff>,
+		 <&raid_mbox 2 0x1 0xffff>,
+		 <&raid_mbox 3 0x1 0xffff>,
+		 <&raid_mbox 4 0x1 0xffff>,
+		 <&raid_mbox 5 0x1 0xffff>,
+		 <&raid_mbox 6 0x1 0xffff>,
+		 <&raid_mbox 7 0x1 0xffff>;
+};
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Dan Williams @ 2017-02-07  8:27 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <1486455406-11202-3-git-send-email-anup.patel@broadcom.com>

On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> The DMAENGINE framework assumes that if PQ offload is supported by a
> DMA device then all 256 PQ coefficients are supported. This assumption
> does not hold anymore because we now have BCM-SBA-RAID offload engine
> which supports PQ offload with limited number of PQ coefficients.
>
> This patch extends async_tx APIs to handle DMA devices with support
> for fewer PQ coefficients.
>
> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
> Reviewed-by: Scott Branden <scott.branden@broadcom.com>

I don't like this approach. Define an interface for md to query the
offload engine once at the beginning of time. We should not be adding
any new extensions to async_tx.

^ permalink raw reply

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Anup Patel @ 2017-02-07  9:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mark Rutland, Device Tree, Herbert Xu, Scott Branden, Vinod Koul,
	Ray Jui, Jassi Brar, linux-kernel@vger.kernel.org, linux-raid,
	Jon Mason, Rob Herring, BCM Kernel Feedback, linux-crypto,
	Rob Rice, dmaengine@vger.kernel.org, David S . Miller,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <CAPcyv4gbV1yf52JrejW+Dn3wOPn7Er+8HWdu2qsW0+Qm+cRr9g@mail.gmail.com>

On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>> The DMAENGINE framework assumes that if PQ offload is supported by a
>> DMA device then all 256 PQ coefficients are supported. This assumption
>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>> which supports PQ offload with limited number of PQ coefficients.
>>
>> This patch extends async_tx APIs to handle DMA devices with support
>> for fewer PQ coefficients.
>>
>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>
> I don't like this approach. Define an interface for md to query the
> offload engine once at the beginning of time. We should not be adding
> any new extensions to async_tx.

Even if we do capability checks in Linux MD, we still need a way
for DMAENGINE drivers to advertise number of PQ coefficients
handled by the HW.

I agree capability checks should be done once in Linux MD but I don't
see why this has to be part of BCM-SBA-RAID driver patches. We need
separate patchsets to address limitations of async_tx framework.

Regards,
Anup

^ permalink raw reply

* [RFC PATCH v4]  IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-02-07 10:35 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad,
	Binoy Jayan

===============================================================================
dm-crypt optimization for larger block sizes
===============================================================================

Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers so they can be used in relation with
algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
patchset, the iv-generation code is moved from the dm layer to the crypto layer
and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. Since the bio itself may not be contiguous in main
memory, the dm layer sets up a chained scatterlist of these blocks split into
physically contiguous segments in memory so that DMA can be performed.

One challenge in doing so is that the IVs are generated based on a 512-byte
sector number. This infact limits the block sizes to 512 bytes. But this should
not be a problem if a hardware with iv generation support is used. The geniv
itself splits the segments into sectors so it could choose the IV based on
sector number. But it could be modelled in hardware effectively by not
splitting up the segments in the bio.

Another challenge faced is that dm-crypt has an option to use multiple keys.
The key selection is done based on the sector number. If the whole bio is
encrypted / decrypted with the same key, the encrypted volumes will not be
compatible with the original dm-crypt [without the changes]. So, the key
selection code is moved to crypto layer so the neighboring sectors are
encrypted with a different key.

The dm layer allocates space for iv. The hardware drivers can choose to make
use of this space to generate their IVs sequentially or allocate it on their
own. This can be moved to crypto layer too. Postponing this decision until
the requirement to integrate milan's changes are clear.

Interface to the crypto layer - include/crypto/geniv.h

Revisions:
----------

v1: https://patchwork.kernel.org/patch/9439175
v2: https://patchwork.kernel.org/patch/9471923
v3: https://lkml.org/lkml/2017/1/18/170

v3 --> v4
----------
Fix for the bug reported by Gilad Ben-Yossef.
The element '__ctx' in 'struct skcipher_request req' overflowed into the
element 'struct scatterlist src' which immediately follows 'req' in
'struct geniv_subreq' and corrupted src.

v2 --> v3
----------

1. Moved iv algorithms in dm-crypt.c for control
2. Key management code moved from dm layer to cryto layer
   so that cipher instance selection can be made depending on key_index
3. The revision v2 had scatterlist nodes created for every sector in the bio.
   It is modified to create only once scatterlist node to reduce memory
   foot print. Synchronous requests are processed sequentially. Asynchronous
   requests are processed in parallel and is freed in the async callback.
4. Changed allocation for sub-requests using mempool

v1 --> v2
----------

1. dm-crypt changes to process larger block sizes (one segment in a bio)
2. Incorporated changes w.r.t. comments from Herbert.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 drivers/md/dm-crypt.c  | 1894 ++++++++++++++++++++++++++++++++++--------------
 include/crypto/geniv.h |   47 ++
 2 files changed, 1402 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan

^ permalink raw reply

* [RFC PATCH v4] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2017-02-07 10:35 UTC (permalink / raw)
  To: Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad,
	Binoy Jayan
In-Reply-To: <1486463731-6224-1-git-send-email-binoy.jayan@linaro.org>

Currently, the iv generation algorithms are implemented in dm-crypt.c.
The goal is to move these algorithms from the dm layer to the kernel
crypto layer by implementing them as template ciphers so they can be
implemented in hardware for performance. As part of this patchset, the
iv-generation code is moved from the dm layer to the crypto layer and
adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains an in memory representation of physically
contiguous disk blocks. The dm layer sets up a chained scatterlist of
these blocks split into physically contiguous segments in memory so that
DMA can be performed. Also, the key management code is moved from dm layer
to the cryto layer since the key selection for encrypting neighboring
sectors depend on the keycount.

Synchronous crypto requests to encrypt/decrypt a sector are processed
sequentially. Asynchronous requests if processed in parallel, are freed
in the async callback. The dm layer allocates space for iv. The hardware
implementations can choose to make use of this space to generate their IVs
sequentially or allocate it on their own.
Interface to the crypto layer - include/crypto/geniv.h

Signed-off-by: Binoy Jayan <binoy.jayan@linaro.org>
---
 drivers/md/dm-crypt.c  | 1894 ++++++++++++++++++++++++++++++++++--------------
 include/crypto/geniv.h |   47 ++
 2 files changed, 1402 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 7c6c572..8540c0f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -32,170 +32,113 @@
 #include <crypto/algapi.h>
 #include <crypto/skcipher.h>
 #include <keys/user-type.h>
-
 #include <linux/device-mapper.h>
-
-#define DM_MSG_PREFIX "crypt"
-
-/*
- * context holding the current state of a multi-part conversion
- */
-struct convert_context {
-	struct completion restart;
-	struct bio *bio_in;
-	struct bio *bio_out;
-	struct bvec_iter iter_in;
-	struct bvec_iter iter_out;
-	sector_t cc_sector;
-	atomic_t cc_pending;
-	struct skcipher_request *req;
+#include <crypto/internal/skcipher.h>
+#include <linux/backing-dev.h>
+#include <linux/log2.h>
+#include <crypto/geniv.h>
+
+#define DM_MSG_PREFIX		"crypt"
+#define MAX_SG_LIST		(BIO_MAX_PAGES * 8)
+#define MIN_IOS			64
+#define LMK_SEED_SIZE		64 /* hash + 0 */
+#define TCW_WHITENING_SIZE	16
+
+struct geniv_ctx;
+struct geniv_req_ctx;
+
+/* Sub request for each of the skcipher_request's for a segment */
+struct geniv_subreq {
+	struct scatterlist src;
+	struct scatterlist dst;
+	int n;
+	struct geniv_req_ctx *rctx;
+	struct skcipher_request req CRYPTO_MINALIGN_ATTR;
 };
 
-/*
- * per bio private data
- */
-struct dm_crypt_io {
-	struct crypt_config *cc;
-	struct bio *base_bio;
-	struct work_struct work;
-
-	struct convert_context ctx;
-
-	atomic_t io_pending;
-	int error;
-	sector_t sector;
-
-	struct rb_node rb_node;
-} CRYPTO_MINALIGN_ATTR;
-
-struct dm_crypt_request {
-	struct convert_context *ctx;
-	struct scatterlist sg_in;
-	struct scatterlist sg_out;
+struct geniv_req_ctx {
+	struct geniv_subreq *subreq;
+	bool is_write;
 	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+	struct completion restart;
+	atomic_t req_pending;
+	struct skcipher_request *req;
 };
 
-struct crypt_config;
-
 struct crypt_iv_operations {
-	int (*ctr)(struct crypt_config *cc, struct dm_target *ti,
-		   const char *opts);
-	void (*dtr)(struct crypt_config *cc);
-	int (*init)(struct crypt_config *cc);
-	int (*wipe)(struct crypt_config *cc);
-	int (*generator)(struct crypt_config *cc, u8 *iv,
-			 struct dm_crypt_request *dmreq);
-	int (*post)(struct crypt_config *cc, u8 *iv,
-		    struct dm_crypt_request *dmreq);
+	int (*ctr)(struct geniv_ctx *ctx);
+	void (*dtr)(struct geniv_ctx *ctx);
+	int (*init)(struct geniv_ctx *ctx);
+	int (*wipe)(struct geniv_ctx *ctx);
+	int (*generator)(struct geniv_ctx *ctx,
+			 struct geniv_req_ctx *rctx,
+			 struct geniv_subreq *subreq);
+	int (*post)(struct geniv_ctx *ctx,
+		    struct geniv_req_ctx *rctx,
+		    struct geniv_subreq *subreq);
 };
 
-struct iv_essiv_private {
+struct geniv_essiv_private {
 	struct crypto_ahash *hash_tfm;
 	u8 *salt;
 };
 
-struct iv_benbi_private {
+struct geniv_benbi_private {
 	int shift;
 };
 
-#define LMK_SEED_SIZE 64 /* hash + 0 */
-struct iv_lmk_private {
+struct geniv_lmk_private {
 	struct crypto_shash *hash_tfm;
 	u8 *seed;
 };
 
-#define TCW_WHITENING_SIZE 16
-struct iv_tcw_private {
+struct geniv_tcw_private {
 	struct crypto_shash *crc32_tfm;
 	u8 *iv_seed;
 	u8 *whitening;
 };
 
-/*
- * Crypt: maps a linear range of a block device
- * and encrypts / decrypts at the same time.
- */
-enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
-
-/*
- * The fields in here must be read only after initialization.
- */
-struct crypt_config {
-	struct dm_dev *dev;
-	sector_t start;
-
-	/*
-	 * pool for per bio private data, crypto requests and
-	 * encryption requeusts/buffer pages
-	 */
-	mempool_t *req_pool;
-	mempool_t *page_pool;
-	struct bio_set *bs;
-	struct mutex bio_alloc_lock;
-
-	struct workqueue_struct *io_queue;
-	struct workqueue_struct *crypt_queue;
-
-	struct task_struct *write_thread;
-	wait_queue_head_t write_thread_wait;
-	struct rb_root write_tree;
-
+struct geniv_ctx {
+	unsigned int tfms_count;
+	struct crypto_skcipher *child;
+	struct crypto_skcipher **tfms;
+	char *ivmode;
+	unsigned int iv_size;
+	char *ivopts;
 	char *cipher;
-	char *cipher_string;
-	char *key_string;
-
+	char *ciphermode;
 	const struct crypt_iv_operations *iv_gen_ops;
 	union {
-		struct iv_essiv_private essiv;
-		struct iv_benbi_private benbi;
-		struct iv_lmk_private lmk;
-		struct iv_tcw_private tcw;
+		struct geniv_essiv_private essiv;
+		struct geniv_benbi_private benbi;
+		struct geniv_lmk_private lmk;
+		struct geniv_tcw_private tcw;
 	} iv_gen_private;
-	sector_t iv_offset;
-	unsigned int iv_size;
-
-	/* ESSIV: struct crypto_cipher *essiv_tfm */
 	void *iv_private;
-	struct crypto_skcipher **tfms;
-	unsigned tfms_count;
-
-	/*
-	 * Layout of each crypto request:
-	 *
-	 *   struct skcipher_request
-	 *      context
-	 *      padding
-	 *   struct dm_crypt_request
-	 *      padding
-	 *   IV
-	 *
-	 * The padding is added so that dm_crypt_request and the IV are
-	 * correctly aligned.
-	 */
-	unsigned int dmreq_start;
-
-	unsigned int per_bio_data_size;
-
-	unsigned long flags;
+	struct crypto_skcipher *tfm;
+	mempool_t *subreq_pool;
 	unsigned int key_size;
+	unsigned int key_extra_size;
 	unsigned int key_parts;      /* independent parts in key buffer */
-	unsigned int key_extra_size; /* additional keys length */
-	u8 key[0];
+	enum setkey_op keyop;
+	char *msg;
+	u8 *key;
 };
 
-#define MIN_IOS        64
-
-static void clone_init(struct dm_crypt_io *, struct bio *);
-static void kcryptd_queue_crypt(struct dm_crypt_io *io);
-static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+static struct crypto_skcipher *any_tfm(struct geniv_ctx *ctx)
+{
+	return ctx->tfms[0];
+}
 
-/*
- * Use this to access cipher attributes that are the same for each CPU.
- */
-static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
+static inline
+struct geniv_req_ctx *geniv_req_ctx(struct skcipher_request *req)
 {
-	return cc->tfms[0];
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	unsigned long align = crypto_skcipher_alignmask(tfm);
+
+	return (void *) PTR_ALIGN((u8 *) skcipher_request_ctx(req), align + 1);
 }
 
 /*
@@ -245,44 +188,50 @@ static struct crypto_skcipher *any_tfm(struct crypt_config *cc)
  * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454
  */
 
-static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_plain_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
-	*(__le32 *)iv = cpu_to_le32(dmreq->iv_sector & 0xffffffff);
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le32 *)iv = cpu_to_le32(rctx->iv_sector & 0xffffffff);
 
 	return 0;
 }
 
-static int crypt_iv_plain64_gen(struct crypt_config *cc, u8 *iv,
-				struct dm_crypt_request *dmreq)
+static int crypt_iv_plain64_gen(struct geniv_ctx *ctx,
+				struct geniv_req_ctx *rctx,
+				struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	u8 *iv = rctx->iv;
+
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
 
 	return 0;
 }
 
 /* Initialise ESSIV - compute salt but no local memory allocations */
-static int crypt_iv_essiv_init(struct crypt_config *cc)
+static int crypt_iv_essiv_init(struct geniv_ctx *ctx)
 {
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
 	struct scatterlist sg;
 	struct crypto_cipher *essiv_tfm;
 	int err;
+	AHASH_REQUEST_ON_STACK(req, essiv->hash_tfm);
 
-	sg_init_one(&sg, cc->key, cc->key_size);
+	sg_init_one(&sg, ctx->key, ctx->key_size);
 	ahash_request_set_tfm(req, essiv->hash_tfm);
 	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
-	ahash_request_set_crypt(req, &sg, essiv->salt, cc->key_size);
+	ahash_request_set_crypt(req, &sg, essiv->salt, ctx->key_size);
 
 	err = crypto_ahash_digest(req);
 	ahash_request_zero(req);
 	if (err)
 		return err;
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 
 	err = crypto_cipher_setkey(essiv_tfm, essiv->salt,
 			    crypto_ahash_digestsize(essiv->hash_tfm));
@@ -293,16 +242,16 @@ static int crypt_iv_essiv_init(struct crypt_config *cc)
 }
 
 /* Wipe salt and reset key derived from volume key */
-static int crypt_iv_essiv_wipe(struct crypt_config *cc)
+static int crypt_iv_essiv_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
-	unsigned salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
+	unsigned int salt_size = crypto_ahash_digestsize(essiv->hash_tfm);
 	struct crypto_cipher *essiv_tfm;
 	int r, err = 0;
 
 	memset(essiv->salt, 0, salt_size);
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 	r = crypto_cipher_setkey(essiv_tfm, essiv->salt, salt_size);
 	if (r)
 		err = r;
@@ -311,42 +260,40 @@ static int crypt_iv_essiv_wipe(struct crypt_config *cc)
 }
 
 /* Set up per cpu cipher state */
-static struct crypto_cipher *setup_essiv_cpu(struct crypt_config *cc,
-					     struct dm_target *ti,
-					     u8 *salt, unsigned saltsize)
+static struct crypto_cipher *setup_essiv_cpu(struct geniv_ctx *ctx,
+					     u8 *salt, unsigned int saltsize)
 {
 	struct crypto_cipher *essiv_tfm;
 	int err;
 
 	/* Setup the essiv_tfm with the given salt */
-	essiv_tfm = crypto_alloc_cipher(cc->cipher, 0, CRYPTO_ALG_ASYNC);
+	essiv_tfm = crypto_alloc_cipher(ctx->cipher, 0, CRYPTO_ALG_ASYNC);
+
 	if (IS_ERR(essiv_tfm)) {
-		ti->error = "Error allocating crypto tfm for ESSIV";
+		DMERR("Error allocating crypto tfm for ESSIV\n");
 		return essiv_tfm;
 	}
 
 	if (crypto_cipher_blocksize(essiv_tfm) !=
-	    crypto_skcipher_ivsize(any_tfm(cc))) {
-		ti->error = "Block size of ESSIV cipher does "
-			    "not match IV size of block cipher";
+	    crypto_skcipher_ivsize(any_tfm(ctx))) {
+		DMERR("Block size of ESSIV cipher does not match IV size of block cipher\n");
 		crypto_free_cipher(essiv_tfm);
 		return ERR_PTR(-EINVAL);
 	}
 
 	err = crypto_cipher_setkey(essiv_tfm, salt, saltsize);
 	if (err) {
-		ti->error = "Failed to set key for ESSIV cipher";
+		DMERR("Failed to set key for ESSIV cipher\n");
 		crypto_free_cipher(essiv_tfm);
 		return ERR_PTR(err);
 	}
-
 	return essiv_tfm;
 }
 
-static void crypt_iv_essiv_dtr(struct crypt_config *cc)
+static void crypt_iv_essiv_dtr(struct geniv_ctx *ctx)
 {
 	struct crypto_cipher *essiv_tfm;
-	struct iv_essiv_private *essiv = &cc->iv_gen_private.essiv;
+	struct geniv_essiv_private *essiv = &ctx->iv_gen_private.essiv;
 
 	crypto_free_ahash(essiv->hash_tfm);
 	essiv->hash_tfm = NULL;
@@ -354,52 +301,50 @@ static void crypt_iv_essiv_dtr(struct crypt_config *cc)
 	kzfree(essiv->salt);
 	essiv->salt = NULL;
 
-	essiv_tfm = cc->iv_private;
+	essiv_tfm = ctx->iv_private;
 
 	if (essiv_tfm)
 		crypto_free_cipher(essiv_tfm);
 
-	cc->iv_private = NULL;
+	ctx->iv_private = NULL;
 }
 
-static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
+static int crypt_iv_essiv_ctr(struct geniv_ctx *ctx)
 {
 	struct crypto_cipher *essiv_tfm = NULL;
 	struct crypto_ahash *hash_tfm = NULL;
 	u8 *salt = NULL;
 	int err;
 
-	if (!opts) {
-		ti->error = "Digest algorithm missing for ESSIV mode";
+	if (!ctx->ivopts) {
+		DMERR("Digest algorithm missing for ESSIV mode\n");
 		return -EINVAL;
 	}
 
 	/* Allocate hash algorithm */
-	hash_tfm = crypto_alloc_ahash(opts, 0, CRYPTO_ALG_ASYNC);
+	hash_tfm = crypto_alloc_ahash(ctx->ivopts, 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(hash_tfm)) {
-		ti->error = "Error initializing ESSIV hash";
 		err = PTR_ERR(hash_tfm);
+		DMERR("Error initializing ESSIV hash. err=%d\n", err);
 		goto bad;
 	}
 
 	salt = kzalloc(crypto_ahash_digestsize(hash_tfm), GFP_KERNEL);
 	if (!salt) {
-		ti->error = "Error kmallocing salt storage in ESSIV";
 		err = -ENOMEM;
 		goto bad;
 	}
 
-	cc->iv_gen_private.essiv.salt = salt;
-	cc->iv_gen_private.essiv.hash_tfm = hash_tfm;
+	ctx->iv_gen_private.essiv.salt = salt;
+	ctx->iv_gen_private.essiv.hash_tfm = hash_tfm;
 
-	essiv_tfm = setup_essiv_cpu(cc, ti, salt,
+	essiv_tfm = setup_essiv_cpu(ctx, salt,
 				crypto_ahash_digestsize(hash_tfm));
 	if (IS_ERR(essiv_tfm)) {
-		crypt_iv_essiv_dtr(cc);
+		crypt_iv_essiv_dtr(ctx);
 		return PTR_ERR(essiv_tfm);
 	}
-	cc->iv_private = essiv_tfm;
+	ctx->iv_private = essiv_tfm;
 
 	return 0;
 
@@ -410,70 +355,73 @@ static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti,
 	return err;
 }
 
-static int crypt_iv_essiv_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_essiv_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
-	struct crypto_cipher *essiv_tfm = cc->iv_private;
+	u8 *iv = rctx->iv;
+	struct crypto_cipher *essiv_tfm = ctx->iv_private;
 
-	memset(iv, 0, cc->iv_size);
-	*(__le64 *)iv = cpu_to_le64(dmreq->iv_sector);
+	memset(iv, 0, ctx->iv_size);
+	*(__le64 *)iv = cpu_to_le64(rctx->iv_sector);
 	crypto_cipher_encrypt_one(essiv_tfm, iv, iv);
 
 	return 0;
 }
 
-static int crypt_iv_benbi_ctr(struct crypt_config *cc, struct dm_target *ti,
-			      const char *opts)
+static int crypt_iv_benbi_ctr(struct geniv_ctx *ctx)
 {
-	unsigned bs = crypto_skcipher_blocksize(any_tfm(cc));
+	unsigned int bs = crypto_skcipher_blocksize(any_tfm(ctx));
 	int log = ilog2(bs);
 
 	/* we need to calculate how far we must shift the sector count
-	 * to get the cipher block count, we use this shift in _gen */
+	 * to get the cipher block count, we use this shift in _gen
+	 */
 
 	if (1 << log != bs) {
-		ti->error = "cypher blocksize is not a power of 2";
+		DMERR("cypher blocksize is not a power of 2\n");
 		return -EINVAL;
 	}
 
 	if (log > 9) {
-		ti->error = "cypher blocksize is > 512";
+		DMERR("cypher blocksize is > 512\n");
 		return -EINVAL;
 	}
 
-	cc->iv_gen_private.benbi.shift = 9 - log;
+	ctx->iv_gen_private.benbi.shift = 9 - log;
 
 	return 0;
 }
 
-static void crypt_iv_benbi_dtr(struct crypt_config *cc)
-{
-}
-
-static int crypt_iv_benbi_gen(struct crypt_config *cc, u8 *iv,
-			      struct dm_crypt_request *dmreq)
+static int crypt_iv_benbi_gen(struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx,
+			      struct geniv_subreq *subreq)
 {
+	u8 *iv = rctx->iv;
 	__be64 val;
 
-	memset(iv, 0, cc->iv_size - sizeof(u64)); /* rest is cleared below */
+	memset(iv, 0, ctx->iv_size - sizeof(u64)); /* rest is cleared below */
 
-	val = cpu_to_be64(((u64)dmreq->iv_sector << cc->iv_gen_private.benbi.shift) + 1);
-	put_unaligned(val, (__be64 *)(iv + cc->iv_size - sizeof(u64)));
+	val = cpu_to_be64(((u64) rctx->iv_sector <<
+			  ctx->iv_gen_private.benbi.shift) + 1);
+	put_unaligned(val, (__be64 *)(iv + ctx->iv_size - sizeof(u64)));
 
 	return 0;
 }
 
-static int crypt_iv_null_gen(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
+static int crypt_iv_null_gen(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
 {
-	memset(iv, 0, cc->iv_size);
+	u8 *iv = rctx->iv;
 
+	memset(iv, 0, ctx->iv_size);
 	return 0;
 }
 
-static void crypt_iv_lmk_dtr(struct crypt_config *cc)
+static void crypt_iv_lmk_dtr(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	if (lmk->hash_tfm && !IS_ERR(lmk->hash_tfm))
 		crypto_free_shash(lmk->hash_tfm);
@@ -483,49 +431,49 @@ static void crypt_iv_lmk_dtr(struct crypt_config *cc)
 	lmk->seed = NULL;
 }
 
-static int crypt_iv_lmk_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
+static int crypt_iv_lmk_ctr(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
 	if (IS_ERR(lmk->hash_tfm)) {
-		ti->error = "Error initializing LMK hash";
+		DMERR("Error initializing LMK hash; err=%ld\n",
+		      PTR_ERR(lmk->hash_tfm));
 		return PTR_ERR(lmk->hash_tfm);
 	}
 
 	/* No seed in LMK version 2 */
-	if (cc->key_parts == cc->tfms_count) {
+	if (ctx->key_parts == ctx->tfms_count) {
 		lmk->seed = NULL;
 		return 0;
 	}
 
 	lmk->seed = kzalloc(LMK_SEED_SIZE, GFP_KERNEL);
 	if (!lmk->seed) {
-		crypt_iv_lmk_dtr(cc);
-		ti->error = "Error kmallocing seed storage in LMK";
+		crypt_iv_lmk_dtr(ctx);
+		DMERR("Error kmallocing seed storage in LMK\n");
 		return -ENOMEM;
 	}
 
 	return 0;
 }
 
-static int crypt_iv_lmk_init(struct crypt_config *cc)
+static int crypt_iv_lmk_init(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	int subkey_size = cc->key_size / cc->key_parts;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
+	int subkey_size = ctx->key_size / ctx->key_parts;
 
 	/* LMK seed is on the position of LMK_KEYS + 1 key */
 	if (lmk->seed)
-		memcpy(lmk->seed, cc->key + (cc->tfms_count * subkey_size),
+		memcpy(lmk->seed, ctx->key + (ctx->tfms_count * subkey_size),
 		       crypto_shash_digestsize(lmk->hash_tfm));
 
 	return 0;
 }
 
-static int crypt_iv_lmk_wipe(struct crypt_config *cc)
+static int crypt_iv_lmk_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 
 	if (lmk->seed)
 		memset(lmk->seed, 0, LMK_SEED_SIZE);
@@ -533,15 +481,14 @@ static int crypt_iv_lmk_wipe(struct crypt_config *cc)
 	return 0;
 }
 
-static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq,
-			    u8 *data)
+static int crypt_iv_lmk_one(struct geniv_ctx *ctx, u8 *iv,
+			    struct geniv_req_ctx *rctx, u8 *data)
 {
-	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
+	struct geniv_lmk_private *lmk = &ctx->iv_gen_private.lmk;
 	struct md5_state md5state;
 	__le32 buf[4];
 	int i, r;
+	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
 
 	desc->tfm = lmk->hash_tfm;
 	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
@@ -562,8 +509,9 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 		return r;
 
 	/* Sector is cropped to 56 bits here */
-	buf[0] = cpu_to_le32(dmreq->iv_sector & 0xFFFFFFFF);
-	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
+	buf[0] = cpu_to_le32(rctx->iv_sector & 0xFFFFFFFF);
+	buf[1] = cpu_to_le32((((u64)rctx->iv_sector >> 32) & 0x00FFFFFF)
+			     | 0x80000000);
 	buf[2] = cpu_to_le32(4024);
 	buf[3] = 0;
 	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
@@ -577,50 +525,54 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 
 	for (i = 0; i < MD5_HASH_WORDS; i++)
 		__cpu_to_le32s(&md5state.hash[i]);
-	memcpy(iv, &md5state.hash, cc->iv_size);
+	memcpy(iv, &md5state.hash, ctx->iv_size);
 
 	return 0;
 }
 
-static int crypt_iv_lmk_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
+static int crypt_iv_lmk_gen(struct geniv_ctx *ctx,
+			    struct geniv_req_ctx *rctx,
+			    struct geniv_subreq *subreq)
 {
 	u8 *src;
+	u8 *iv = rctx->iv;
 	int r = 0;
 
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_lmk_one(cc, iv, dmreq, src + dmreq->sg_in.offset);
+	if (rctx->is_write) {
+		src = kmap_atomic(sg_page(&subreq->src));
+		r = crypt_iv_lmk_one(ctx, iv, rctx, src + subreq->src.offset);
 		kunmap_atomic(src);
 	} else
-		memset(iv, 0, cc->iv_size);
+		memset(iv, 0, ctx->iv_size);
 
 	return r;
 }
 
-static int crypt_iv_lmk_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
+static int crypt_iv_lmk_post(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
 {
 	u8 *dst;
+	u8 *iv = rctx->iv;
 	int r;
 
-	if (bio_data_dir(dmreq->ctx->bio_in) == WRITE)
+	if (rctx->is_write)
 		return 0;
 
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_lmk_one(cc, iv, dmreq, dst + dmreq->sg_out.offset);
+	dst = kmap_atomic(sg_page(&subreq->dst));
+	r = crypt_iv_lmk_one(ctx, iv, rctx, dst + subreq->dst.offset);
 
 	/* Tweak the first block of plaintext sector */
 	if (!r)
-		crypto_xor(dst + dmreq->sg_out.offset, iv, cc->iv_size);
+		crypto_xor(dst + subreq->dst.offset, iv, ctx->iv_size);
 
 	kunmap_atomic(dst);
 	return r;
 }
 
-static void crypt_iv_tcw_dtr(struct crypt_config *cc)
+static void crypt_iv_tcw_dtr(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
 	kzfree(tcw->iv_seed);
 	tcw->iv_seed = NULL;
@@ -632,64 +584,65 @@ static void crypt_iv_tcw_dtr(struct crypt_config *cc)
 	tcw->crc32_tfm = NULL;
 }
 
-static int crypt_iv_tcw_ctr(struct crypt_config *cc, struct dm_target *ti,
-			    const char *opts)
+static int crypt_iv_tcw_ctr(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
-	if (cc->key_size <= (cc->iv_size + TCW_WHITENING_SIZE)) {
-		ti->error = "Wrong key size for TCW";
+	if (ctx->key_size <= (ctx->iv_size + TCW_WHITENING_SIZE)) {
+		DMERR("Wrong key size (%d) for TCW. Choose a value > %d bytes\n",
+			ctx->key_size,
+			ctx->iv_size + TCW_WHITENING_SIZE);
 		return -EINVAL;
 	}
 
 	tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
 	if (IS_ERR(tcw->crc32_tfm)) {
-		ti->error = "Error initializing CRC32 in TCW";
+		DMERR("Error initializing CRC32 in TCW; err=%ld\n",
+			PTR_ERR(tcw->crc32_tfm));
 		return PTR_ERR(tcw->crc32_tfm);
 	}
 
-	tcw->iv_seed = kzalloc(cc->iv_size, GFP_KERNEL);
+	tcw->iv_seed = kzalloc(ctx->iv_size, GFP_KERNEL);
 	tcw->whitening = kzalloc(TCW_WHITENING_SIZE, GFP_KERNEL);
 	if (!tcw->iv_seed || !tcw->whitening) {
-		crypt_iv_tcw_dtr(cc);
-		ti->error = "Error allocating seed storage in TCW";
+		crypt_iv_tcw_dtr(ctx);
+		DMERR("Error allocating seed storage in TCW\n");
 		return -ENOMEM;
 	}
 
 	return 0;
 }
 
-static int crypt_iv_tcw_init(struct crypt_config *cc)
+static int crypt_iv_tcw_init(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	int key_offset = cc->key_size - cc->iv_size - TCW_WHITENING_SIZE;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	int key_offset = ctx->key_size - ctx->iv_size - TCW_WHITENING_SIZE;
 
-	memcpy(tcw->iv_seed, &cc->key[key_offset], cc->iv_size);
-	memcpy(tcw->whitening, &cc->key[key_offset + cc->iv_size],
+	memcpy(tcw->iv_seed, &ctx->key[key_offset], ctx->iv_size);
+	memcpy(tcw->whitening, &ctx->key[key_offset + ctx->iv_size],
 	       TCW_WHITENING_SIZE);
 
 	return 0;
 }
 
-static int crypt_iv_tcw_wipe(struct crypt_config *cc)
+static int crypt_iv_tcw_wipe(struct geniv_ctx *ctx)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
 
-	memset(tcw->iv_seed, 0, cc->iv_size);
+	memset(tcw->iv_seed, 0, ctx->iv_size);
 	memset(tcw->whitening, 0, TCW_WHITENING_SIZE);
 
 	return 0;
 }
 
-static int crypt_iv_tcw_whitening(struct crypt_config *cc,
-				  struct dm_crypt_request *dmreq,
-				  u8 *data)
+static int crypt_iv_tcw_whitening(struct geniv_ctx *ctx,
+				  struct geniv_req_ctx *rctx, u8 *data)
 {
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
 	u8 buf[TCW_WHITENING_SIZE];
-	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 	int i, r;
+	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 
 	/* xor whitening with sector number */
 	memcpy(buf, tcw->whitening, TCW_WHITENING_SIZE);
@@ -713,99 +666,1009 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
 	crypto_xor(&buf[0], &buf[12], 4);
 	crypto_xor(&buf[4], &buf[8], 4);
 
-	/* apply whitening (8 bytes) to whole sector */
-	for (i = 0; i < ((1 << SECTOR_SHIFT) / 8); i++)
-		crypto_xor(data + i * 8, buf, 8);
-out:
-	memzero_explicit(buf, sizeof(buf));
-	return r;
-}
+	/* apply whitening (8 bytes) to whole sector */
+	for (i = 0; i < (SECTOR_SIZE / 8); i++)
+		crypto_xor(data + i * 8, buf, 8);
+out:
+	memzero_explicit(buf, sizeof(buf));
+	return r;
+}
+
+static int crypt_iv_tcw_gen(struct geniv_ctx *ctx,
+			    struct geniv_req_ctx *rctx,
+			    struct geniv_subreq *subreq)
+{
+	u8 *iv = rctx->iv;
+	struct geniv_tcw_private *tcw = &ctx->iv_gen_private.tcw;
+	__le64 sector = cpu_to_le64(rctx->iv_sector);
+	u8 *src;
+	int r = 0;
+
+	/* Remove whitening from ciphertext */
+	if (!rctx->is_write) {
+		src = kmap_atomic(sg_page(&subreq->src));
+		r = crypt_iv_tcw_whitening(ctx, rctx,
+					   src + subreq->src.offset);
+		kunmap_atomic(src);
+	}
+
+	/* Calculate IV */
+	memcpy(iv, tcw->iv_seed, ctx->iv_size);
+	crypto_xor(iv, (u8 *)&sector, 8);
+	if (ctx->iv_size > 8)
+		crypto_xor(&iv[8], (u8 *)&sector, ctx->iv_size - 8);
+
+	return r;
+}
+
+static int crypt_iv_tcw_post(struct geniv_ctx *ctx,
+			     struct geniv_req_ctx *rctx,
+			     struct geniv_subreq *subreq)
+{
+	u8 *dst;
+	int r;
+
+	if (!rctx->is_write)
+		return 0;
+
+	/* Apply whitening on ciphertext */
+	dst = kmap_atomic(sg_page(&subreq->dst));
+	r = crypt_iv_tcw_whitening(ctx, rctx, dst + subreq->dst.offset);
+	kunmap_atomic(dst);
+
+	return r;
+}
+
+static const struct crypt_iv_operations crypt_iv_plain_ops = {
+	.generator = crypt_iv_plain_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_plain64_ops = {
+	.generator = crypt_iv_plain64_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_essiv_ops = {
+	.ctr       = crypt_iv_essiv_ctr,
+	.dtr       = crypt_iv_essiv_dtr,
+	.init      = crypt_iv_essiv_init,
+	.wipe      = crypt_iv_essiv_wipe,
+	.generator = crypt_iv_essiv_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_benbi_ops = {
+	.ctr	   = crypt_iv_benbi_ctr,
+	.generator = crypt_iv_benbi_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_null_ops = {
+	.generator = crypt_iv_null_gen
+};
+
+static const struct crypt_iv_operations crypt_iv_lmk_ops = {
+	.ctr	   = crypt_iv_lmk_ctr,
+	.dtr	   = crypt_iv_lmk_dtr,
+	.init	   = crypt_iv_lmk_init,
+	.wipe	   = crypt_iv_lmk_wipe,
+	.generator = crypt_iv_lmk_gen,
+	.post	   = crypt_iv_lmk_post
+};
+
+static const struct crypt_iv_operations crypt_iv_tcw_ops = {
+	.ctr	   = crypt_iv_tcw_ctr,
+	.dtr	   = crypt_iv_tcw_dtr,
+	.init	   = crypt_iv_tcw_init,
+	.wipe	   = crypt_iv_tcw_wipe,
+	.generator = crypt_iv_tcw_gen,
+	.post	   = crypt_iv_tcw_post
+};
+
+static int geniv_setkey_set(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init)
+		ret = ctx->iv_gen_ops->init(ctx);
+	return ret;
+}
+
+static int geniv_setkey_wipe(struct geniv_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->wipe) {
+		ret = ctx->iv_gen_ops->wipe(ctx);
+		if (ret)
+			return ret;
+	}
+	return ret;
+}
+
+static int geniv_init_iv(struct geniv_ctx *ctx)
+{
+	int ret = -EINVAL;
+
+	DMDEBUG("IV Generation algorithm : %s\n", ctx->ivmode);
+
+	if (ctx->ivmode == NULL)
+		ctx->iv_gen_ops = NULL;
+	else if (strcmp(ctx->ivmode, "plain") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain_ops;
+	else if (strcmp(ctx->ivmode, "plain64") == 0)
+		ctx->iv_gen_ops = &crypt_iv_plain64_ops;
+	else if (strcmp(ctx->ivmode, "essiv") == 0)
+		ctx->iv_gen_ops = &crypt_iv_essiv_ops;
+	else if (strcmp(ctx->ivmode, "benbi") == 0)
+		ctx->iv_gen_ops = &crypt_iv_benbi_ops;
+	else if (strcmp(ctx->ivmode, "null") == 0)
+		ctx->iv_gen_ops = &crypt_iv_null_ops;
+	else if (strcmp(ctx->ivmode, "lmk") == 0)
+		ctx->iv_gen_ops = &crypt_iv_lmk_ops;
+	else if (strcmp(ctx->ivmode, "tcw") == 0) {
+		ctx->iv_gen_ops = &crypt_iv_tcw_ops;
+		ctx->key_parts += 2; /* IV + whitening */
+		ctx->key_extra_size = ctx->iv_size + TCW_WHITENING_SIZE;
+	} else {
+		ret = -EINVAL;
+		DMERR("Invalid IV mode %s\n", ctx->ivmode);
+		goto end;
+	}
+
+	/* Allocate IV */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->ctr) {
+		ret = ctx->iv_gen_ops->ctr(ctx);
+		if (ret < 0) {
+			DMERR("Error creating IV for %s\n", ctx->ivmode);
+			goto end;
+		}
+	}
+
+	/* Initialize IV (set keys for ESSIV etc) */
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->init) {
+		ret = ctx->iv_gen_ops->init(ctx);
+		if (ret < 0)
+			DMERR("Error creating IV for %s\n", ctx->ivmode);
+	}
+	ret = 0;
+end:
+	return ret;
+}
+
+static void geniv_free_tfms(struct geniv_ctx *ctx)
+{
+	unsigned int i;
+
+	if (!ctx->tfms)
+		return;
+
+	for (i = 0; i < ctx->tfms_count; i++)
+		if (ctx->tfms[i] && !IS_ERR(ctx->tfms[i])) {
+			crypto_free_skcipher(ctx->tfms[i]);
+			ctx->tfms[i] = NULL;
+		}
+
+	kfree(ctx->tfms);
+	ctx->tfms = NULL;
+}
+
+/* Allocate memory for the underlying cipher algorithm. Ex: cbc(aes)
+ */
+
+static int geniv_alloc_tfms(struct crypto_skcipher *parent,
+			    struct geniv_ctx *ctx)
+{
+	unsigned int i, reqsize, align;
+	int err = 0;
+
+	ctx->tfms = kcalloc(ctx->tfms_count, sizeof(struct crypto_skcipher *),
+			   GFP_KERNEL);
+	if (!ctx->tfms) {
+		err = -ENOMEM;
+		goto end;
+	}
+
+	/* First instance is already allocated in geniv_init_tfm */
+	ctx->tfms[0] = ctx->child;
+	for (i = 1; i < ctx->tfms_count; i++) {
+		ctx->tfms[i] = crypto_alloc_skcipher(ctx->ciphermode, 0, 0);
+		if (IS_ERR(ctx->tfms[i])) {
+			err = PTR_ERR(ctx->tfms[i]);
+			geniv_free_tfms(ctx);
+			goto end;
+		}
+
+		/* Setup the current cipher's request structure */
+		align = crypto_skcipher_alignmask(parent);
+		align &= ~(crypto_tfm_ctx_alignment() - 1);
+		reqsize = align + sizeof(struct geniv_req_ctx) +
+			  crypto_skcipher_reqsize(ctx->tfms[i]);
+		crypto_skcipher_set_reqsize(parent, reqsize);
+	}
+
+end:
+	return err;
+}
+
+/* Initialize the cipher's context with the key, ivmode and other parameters.
+ * Also allocate IV generation template ciphers and initialize them.
+ */
+
+static int geniv_setkey_init(struct crypto_skcipher *parent,
+			     struct geniv_key_info *info)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	int ret = -ENOMEM;
+
+	ctx->iv_size = crypto_skcipher_ivsize(parent);
+	ctx->tfms_count = info->tfms_count;
+	ctx->key = info->key;
+	ctx->key_size = info->key_size;
+	ctx->key_parts = info->key_parts;
+	ctx->ivopts = info->ivopts;
+
+	ret = geniv_alloc_tfms(parent, ctx);
+	if (ret)
+		goto end;
+
+	ret = geniv_init_iv(ctx);
+
+end:
+	return ret;
+}
+
+static int geniv_setkey_tfms(struct crypto_skcipher *parent,
+			     struct geniv_ctx *ctx,
+			     struct geniv_key_info *info)
+{
+	unsigned int subkey_size;
+	int ret = 0, i;
+
+	/* Ignore extra keys (which are used for IV etc) */
+	subkey_size = (ctx->key_size - ctx->key_extra_size)
+		      >> ilog2(ctx->tfms_count);
+
+	for (i = 0; i < ctx->tfms_count; i++) {
+		struct crypto_skcipher *child = ctx->tfms[i];
+		char *subkey = ctx->key + (subkey_size) * i;
+
+		crypto_skcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+		crypto_skcipher_set_flags(child,
+					  crypto_skcipher_get_flags(parent) &
+					  CRYPTO_TFM_REQ_MASK);
+		ret = crypto_skcipher_setkey(child, subkey, subkey_size);
+		if (ret) {
+			DMERR("Error setting key for tfms[%d]\n", i);
+			break;
+		}
+		crypto_skcipher_set_flags(parent,
+					  crypto_skcipher_get_flags(child) &
+					  CRYPTO_TFM_RES_MASK);
+	}
+
+	return ret;
+}
+
+static int geniv_setkey(struct crypto_skcipher *parent,
+			const u8 *key, unsigned int keylen)
+{
+	int err = 0;
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(parent);
+	struct geniv_key_info *info = (struct geniv_key_info *) key;
+
+	DMDEBUG("SETKEY Operation : %d\n", info->keyop);
+
+	switch (info->keyop) {
+	case SETKEY_OP_INIT:
+		err = geniv_setkey_init(parent, info);
+		break;
+	case SETKEY_OP_SET:
+		err = geniv_setkey_set(ctx);
+		break;
+	case SETKEY_OP_WIPE:
+		err = geniv_setkey_wipe(ctx);
+		break;
+	}
+
+	if (err)
+		goto end;
+
+	err = geniv_setkey_tfms(parent, ctx, info);
+
+end:
+	return err;
+}
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error);
+
+static int geniv_alloc_subreq(struct skcipher_request *req,
+			      struct geniv_ctx *ctx,
+			      struct geniv_req_ctx *rctx)
+{
+	int key_index, r = 0;
+	struct skcipher_request *sreq;
+
+	if (!rctx->subreq) {
+		rctx->subreq = mempool_alloc(ctx->subreq_pool, GFP_NOIO);
+		if (!rctx->subreq)
+			r = -ENOMEM;
+	}
+
+	sreq = &rctx->subreq->req;
+	rctx->subreq->rctx = rctx;
+
+	key_index = rctx->iv_sector & (ctx->tfms_count - 1);
+
+	skcipher_request_set_tfm(sreq, ctx->tfms[key_index]);
+	skcipher_request_set_callback(sreq, req->base.flags,
+				      geniv_async_done, rctx->subreq);
+	return r;
+}
+
+/* Asynchronous IO completion callback for each sector in a segment. When all
+ * pending i/o are completed the parent cipher's async function is called.
+ */
+
+static void geniv_async_done(struct crypto_async_request *async_req, int error)
+{
+	struct geniv_subreq *subreq =
+		(struct geniv_subreq *) async_req->data;
+	struct geniv_req_ctx *rctx = subreq->rctx;
+	struct skcipher_request *req = rctx->req;
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	/*
+	 * A request from crypto driver backlog is going to be processed now,
+	 * finish the completion and continue in crypt_convert().
+	 * (Callback will be called for the second time for this request.)
+	 */
+
+	if (error == -EINPROGRESS) {
+		complete(&rctx->restart);
+		return;
+	}
+
+	if (!error && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+		error = ctx->iv_gen_ops->post(ctx, rctx, subreq);
+
+	mempool_free(subreq, ctx->subreq_pool);
+
+	/* req_pending needs to be checked before req->base.complete is called
+	 * as we need 'req_pending' to be equal to 1 to ensure all subrequests
+	 * are processed.
+	 */
+	if (!atomic_dec_and_test(&rctx->req_pending)) {
+		/* Call the parent cipher's completion function */
+		skcipher_request_complete(req, error);
+	}
+}
+
+static unsigned int geniv_get_sectors(struct scatterlist *sg1,
+				      struct scatterlist *sg2,
+				      unsigned int segments)
+{
+	unsigned int i, n1, n2, nents;
+
+	n1 = n2 = 0;
+	for (i = 0; i < segments ; i++)
+		n1 += sg1[i].length >> SECTOR_SHIFT;
+
+	for (i = 0; i < segments ; i++)
+		n2 += sg2[i].length >> SECTOR_SHIFT;
+
+	nents = n1 > n2 ? n1 : n2;
+	return nents;
+}
+
+/* Iterate scatterlist of segments to retrieve the 512-byte sectors so that
+ * unique IVs could be generated for each 512-byte sector. This split may not
+ * be necessary e.g. when these ciphers are modelled in hardware, where it can
+ * make use of the hardware's IV generation capabilities.
+ */
+
+static int geniv_iter_block(struct skcipher_request *req,
+			    struct geniv_subreq *subreq,
+			    struct geniv_req_ctx *rctx,
+			    unsigned int *seg_no,
+			    unsigned int *done)
+
+{
+	unsigned int srcoff, dstoff, len, rem;
+	struct scatterlist *src1, *dst1, *src2, *dst2;
+
+	if (unlikely(*seg_no >= rctx->nents))
+		return 0; /* done */
+
+	src1 = &req->src[*seg_no];
+	dst1 = &req->dst[*seg_no];
+	src2 = &subreq->src;
+	dst2 = &subreq->dst;
+
+	if (*done >= src1->length) {
+		(*seg_no)++;
+
+		if (*seg_no >= rctx->nents)
+			return 0; /* done */
+
+		src1 = &req->src[*seg_no];
+		dst1 = &req->dst[*seg_no];
+		*done = 0;
+	}
+
+	srcoff = src1->offset + *done;
+	dstoff = dst1->offset + *done;
+	rem = src1->length - *done;
+
+	len = rem > SECTOR_SIZE ? SECTOR_SIZE : rem;
+
+	DMDEBUG("segment:(%d/%u), srcoff:%d, dstoff:%d, done:%d, rem:%d\n",
+		*seg_no + 1, rctx->nents, srcoff, dstoff, *done, rem);
+
+	sg_init_table(src2, 1);
+	sg_set_page(src2, sg_page(src1), len, srcoff);
+	sg_init_table(dst2, 1);
+	sg_set_page(dst2, sg_page(dst1), len, dstoff);
+
+	*done += len;
+
+	return len; /* bytes returned */
+}
+
+/* Common encryt/decrypt function for geniv template cipher. Before the crypto
+ * operation, it splits the memory segments (in the scatterlist) into 512 byte
+ * sectors. The initialization vector(IV) used is based on a unique sector
+ * number which is generated here.
+ */
+static int geniv_crypt(struct skcipher_request *req, bool encrypt)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct geniv_req_ctx *rctx = geniv_req_ctx(req);
+	struct geniv_req_info *rinfo = (struct geniv_req_info *) req->iv;
+	int i, bytes, cryptlen, ret = 0;
+	unsigned int sectors, segno = 0, done = 0;
+	char *str __maybe_unused = encrypt ? "encrypt" : "decrypt";
+
+	/* Instance of 'struct geniv_req_info' is stored in IV ptr */
+	rctx->is_write = rinfo->is_write;
+	rctx->iv_sector = rinfo->iv_sector;
+	rctx->nents = rinfo->nents;
+	rctx->iv = rinfo->iv;
+	rctx->req = req;
+	rctx->subreq = NULL;
+	cryptlen = req->cryptlen;
+
+	DMDEBUG("geniv:%s: starting sector=%d, #segments=%u\n", str,
+		(unsigned int) rctx->iv_sector, rctx->nents);
+
+	sectors = geniv_get_sectors(req->src, req->dst, rctx->nents);
+
+	init_completion(&rctx->restart);
+	atomic_set(&rctx->req_pending, 1);
+
+	for (i = 0; i < sectors; i++) {
+		struct geniv_subreq *subreq;
+
+		ret = geniv_alloc_subreq(req, ctx, rctx);
+		if (ret)
+			goto end;
+
+		subreq = rctx->subreq;
+		subreq->rctx = rctx;
+
+		atomic_inc(&rctx->req_pending);
+		bytes = geniv_iter_block(req, subreq, rctx, &segno, &done);
+
+		if (bytes == 0)
+			break;
+
+		cryptlen -= bytes;
+
+		if (ctx->iv_gen_ops)
+			ret = ctx->iv_gen_ops->generator(ctx, rctx, subreq);
+
+		if (ret < 0) {
+			DMERR("Error in generating IV ret: %d\n", ret);
+			goto end;
+		}
+
+		skcipher_request_set_crypt(&subreq->req, &subreq->src,
+					   &subreq->dst, bytes, rctx->iv);
+
+		if (encrypt)
+			ret = crypto_skcipher_encrypt(&subreq->req);
+
+		else
+			ret = crypto_skcipher_decrypt(&subreq->req);
+
+		if (!ret && ctx->iv_gen_ops && ctx->iv_gen_ops->post)
+			ret = ctx->iv_gen_ops->post(ctx, rctx, subreq);
+
+		switch (ret) {
+		/*
+		 * The request was queued by a crypto driver
+		 * but the driver request queue is full, let's wait.
+		 */
+		case -EBUSY:
+			wait_for_completion(&rctx->restart);
+			reinit_completion(&rctx->restart);
+			/* fall through */
+		/*
+		 * The request is queued and processed asynchronously,
+		 * completion function geniv_async_done() is called.
+		 */
+		case -EINPROGRESS:
+			/* Marking this NULL lets the creation of a new sub-
+			 * request when 'geniv_alloc_subreq' is called.
+			 */
+			rctx->subreq = NULL;
+			rctx->iv_sector++;
+			cond_resched();
+			break;
+		/*
+		 * The request was already processed (synchronously).
+		 */
+		case 0:
+			atomic_dec(&rctx->req_pending);
+			rctx->iv_sector++;
+			cond_resched();
+			continue;
+
+		/* There was an error while processing the request. */
+		default:
+			atomic_dec(&rctx->req_pending);
+			return ret;
+		}
+
+		if (ret)
+			break;
+	}
+
+	if (rctx->subreq && atomic_read(&rctx->req_pending) == 1) {
+		DMDEBUG("geniv:%s: Freeing sub request\n", str);
+		mempool_free(rctx->subreq, ctx->subreq_pool);
+	}
+
+end:
+	return ret;
+}
+
+static int geniv_encrypt(struct skcipher_request *req)
+{
+	return geniv_crypt(req, true);
+}
+
+static int geniv_decrypt(struct skcipher_request *req)
+{
+	return geniv_crypt(req, false);
+}
+
+static int geniv_init_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+	unsigned int reqsize, align;
+	char *algname, *chainmode;
+	int psize, ret = 0;
+
+	algname = (char *) crypto_tfm_alg_name(crypto_skcipher_tfm(tfm));
+	ctx->ciphermode = kmalloc(CRYPTO_MAX_ALG_NAME, GFP_KERNEL);
+	if (!ctx->ciphermode) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	/* Parse algorithm name 'ivmode(chainmode(cipher))' */
+	ctx->ivmode	= strsep(&algname, "(");
+	chainmode	= strsep(&algname, "(");
+	ctx->cipher	= strsep(&algname, ")");
+
+	snprintf(ctx->ciphermode, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		 chainmode, ctx->cipher);
+
+	DMDEBUG("ciphermode=%s, ivmode=%s\n", ctx->ciphermode, ctx->ivmode);
+
+	/*
+	 * Usually the underlying cipher instances are spawned here, but since
+	 * the value of tfms_count (which is equal to the key_count) is not
+	 * known yet, create only one instance and delay the creation of the
+	 * rest of the instances of the underlying cipher 'cbc(aes)' until
+	 * the setkey operation is invoked.
+	 * The first instance created i.e. ctx->child will later be assigned as
+	 * the 1st element in the array ctx->tfms. Creation of atleast one
+	 * instance of the cipher is necessary to be created here to uncover
+	 * any errors earlier than during the setkey operation later where the
+	 * remaining instances are created.
+	 */
+	ctx->child = crypto_alloc_skcipher(ctx->ciphermode, 0, 0);
+	if (IS_ERR(ctx->child)) {
+		ret = PTR_ERR(ctx->child);
+		DMERR("Failed to create skcipher %s. err %d\n",
+		      ctx->ciphermode, ret);
+		goto end;
+	}
+
+	/* Setup the current cipher's request structure */
+	align = crypto_skcipher_alignmask(tfm);
+	align &= ~(crypto_tfm_ctx_alignment() - 1);
+	reqsize = align + sizeof(struct geniv_req_ctx)
+			+ crypto_skcipher_reqsize(ctx->child);
+	crypto_skcipher_set_reqsize(tfm, reqsize);
+
+	/* create memory pool for sub-request structure */
+	psize = sizeof(struct geniv_subreq)
+			+ crypto_skcipher_reqsize(ctx->child);
+	ctx->subreq_pool = mempool_create_kmalloc_pool(MIN_IOS, psize);
+	if (!ctx->subreq_pool) {
+		ret = -ENOMEM;
+		DMERR("Could not allocate crypt sub-request mempool\n");
+	}
+end:
+	return ret;
+}
+
+static void geniv_exit_tfm(struct crypto_skcipher *tfm)
+{
+	struct geniv_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (ctx->iv_gen_ops && ctx->iv_gen_ops->dtr)
+		ctx->iv_gen_ops->dtr(ctx);
+
+	mempool_destroy(ctx->subreq_pool);
+	geniv_free_tfms(ctx);
+	kfree(ctx->ciphermode);
+}
+
+static void geniv_free(struct skcipher_instance *inst)
+{
+	struct crypto_skcipher_spawn *spawn = skcipher_instance_ctx(inst);
+
+	crypto_drop_skcipher(spawn);
+	kfree(inst);
+}
+
+static int geniv_create(struct crypto_template *tmpl,
+			struct rtattr **tb, char *algname)
+{
+	struct crypto_attr_type *algt;
+	struct skcipher_instance *inst;
+	struct skcipher_alg *alg;
+	struct crypto_skcipher_spawn *spawn;
+	const char *cipher_name;
+	int err;
+
+	algt = crypto_get_attr_type(tb);
+
+	if (IS_ERR(algt))
+		return PTR_ERR(algt);
+
+	if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
+		return -EINVAL;
+
+	cipher_name = crypto_attr_alg_name(tb[1]);
+
+	if (IS_ERR(cipher_name))
+		return PTR_ERR(cipher_name);
+
+	inst = kzalloc(sizeof(*inst) + sizeof(*spawn), GFP_KERNEL);
+	if (!inst)
+		return -ENOMEM;
+
+	spawn = skcipher_instance_ctx(inst);
+
+	crypto_set_skcipher_spawn(spawn, skcipher_crypto_instance(inst));
+	err = crypto_grab_skcipher(spawn, cipher_name, 0,
+				    crypto_requires_sync(algt->type,
+							 algt->mask));
+
+	if (err)
+		goto err_free_inst;
+
+	alg = crypto_spawn_skcipher_alg(spawn);
+
+	err = -EINVAL;
+
+	/* Only support blocks of size which is of a power of 2 */
+	if (!is_power_of_2(alg->base.cra_blocksize))
+		goto err_drop_spawn;
+
+	/* algname: essiv, base.cra_name: cbc(aes) */
+	err = -ENAMETOOLONG;
+	if (snprintf(inst->alg.base.cra_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
+		     algname, alg->base.cra_name) >= CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+	if (snprintf(inst->alg.base.cra_driver_name, CRYPTO_MAX_ALG_NAME,
+		     "%s(%s)", algname, alg->base.cra_driver_name) >=
+	    CRYPTO_MAX_ALG_NAME)
+		goto err_drop_spawn;
+
+	inst->alg.base.cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER;
+	inst->alg.base.cra_priority = alg->base.cra_priority;
+	inst->alg.base.cra_blocksize = alg->base.cra_blocksize;
+	inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
+	inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
+	inst->alg.ivsize = alg->base.cra_blocksize;
+	inst->alg.chunksize = crypto_skcipher_alg_chunksize(alg);
+	inst->alg.min_keysize = crypto_skcipher_alg_min_keysize(alg);
+	inst->alg.max_keysize = crypto_skcipher_alg_max_keysize(alg);
+
+	inst->alg.setkey = geniv_setkey;
+	inst->alg.encrypt = geniv_encrypt;
+	inst->alg.decrypt = geniv_decrypt;
+
+	inst->alg.base.cra_ctxsize = sizeof(struct geniv_ctx);
+
+	inst->alg.init = geniv_init_tfm;
+	inst->alg.exit = geniv_exit_tfm;
+
+	inst->free = geniv_free;
+
+	err = skcipher_register_instance(tmpl, inst);
+	if (err)
+		goto err_drop_spawn;
+
+out:
+	return err;
+
+err_drop_spawn:
+	crypto_drop_skcipher(spawn);
+err_free_inst:
+	kfree(inst);
+	goto out;
+}
+
+static int crypto_plain_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "plain");
+}
+
+static int crypto_plain64_create(struct crypto_template *tmpl,
+				 struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "plain64");
+}
+
+static int crypto_essiv_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "essiv");
+}
+
+static int crypto_benbi_create(struct crypto_template *tmpl,
+			       struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "benbi");
+}
+
+static int crypto_null_create(struct crypto_template *tmpl,
+			      struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "null");
+}
+
+static int crypto_lmk_create(struct crypto_template *tmpl,
+			     struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "lmk");
+}
+
+static int crypto_tcw_create(struct crypto_template *tmpl,
+			     struct rtattr **tb)
+{
+	return geniv_create(tmpl, tb, "tcw");
+}
+
+static struct crypto_template crypto_plain_tmpl = {
+	.name   = "plain",
+	.create = crypto_plain_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_plain64_tmpl = {
+	.name   = "plain64",
+	.create = crypto_plain64_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_essiv_tmpl = {
+	.name   = "essiv",
+	.create = crypto_essiv_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_benbi_tmpl = {
+	.name   = "benbi",
+	.create = crypto_benbi_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_null_tmpl = {
+	.name   = "null",
+	.create = crypto_null_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_lmk_tmpl = {
+	.name   = "lmk",
+	.create = crypto_lmk_create,
+	.module = THIS_MODULE,
+};
+
+static struct crypto_template crypto_tcw_tmpl = {
+	.name   = "tcw",
+	.create = crypto_tcw_create,
+	.module = THIS_MODULE,
+};
+
+static int __init geniv_register_algs(void)
+{
+	int err;
+
+	err = crypto_register_template(&crypto_plain_tmpl);
+	if (err)
+		goto out;
+
+	err = crypto_register_template(&crypto_plain64_tmpl);
+	if (err)
+		goto out_undo_plain;
+
+	err = crypto_register_template(&crypto_essiv_tmpl);
+	if (err)
+		goto out_undo_plain64;
+
+	err = crypto_register_template(&crypto_benbi_tmpl);
+	if (err)
+		goto out_undo_essiv;
+
+	err = crypto_register_template(&crypto_null_tmpl);
+	if (err)
+		goto out_undo_benbi;
+
+	err = crypto_register_template(&crypto_lmk_tmpl);
+	if (err)
+		goto out_undo_null;
+
+	err = crypto_register_template(&crypto_tcw_tmpl);
+	if (!err)
+		goto out;
+
+	crypto_unregister_template(&crypto_lmk_tmpl);
+out_undo_null:
+	crypto_unregister_template(&crypto_null_tmpl);
+out_undo_benbi:
+	crypto_unregister_template(&crypto_benbi_tmpl);
+out_undo_essiv:
+	crypto_unregister_template(&crypto_essiv_tmpl);
+out_undo_plain64:
+	crypto_unregister_template(&crypto_plain64_tmpl);
+out_undo_plain:
+	crypto_unregister_template(&crypto_plain_tmpl);
+out:
+	return err;
+}
+
+static void __exit geniv_deregister_algs(void)
+{
+	crypto_unregister_template(&crypto_plain_tmpl);
+	crypto_unregister_template(&crypto_plain64_tmpl);
+	crypto_unregister_template(&crypto_essiv_tmpl);
+	crypto_unregister_template(&crypto_benbi_tmpl);
+	crypto_unregister_template(&crypto_null_tmpl);
+	crypto_unregister_template(&crypto_lmk_tmpl);
+	crypto_unregister_template(&crypto_tcw_tmpl);
+}
+
+/* End of geniv template cipher algorithms */
+
+/*
+ * context holding the current state of a multi-part conversion
+ */
+struct convert_context {
+	struct completion restart;
+	struct bio *bio_in;
+	struct bio *bio_out;
+	struct bvec_iter iter_in;
+	struct bvec_iter iter_out;
+	sector_t cc_sector;
+	atomic_t cc_pending;
+	struct skcipher_request *req;
+};
+
+/*
+ * per bio private data
+ */
+struct dm_crypt_io {
+	struct crypt_config *cc;
+	struct bio *base_bio;
+	struct work_struct work;
+
+	struct convert_context ctx;
 
-static int crypt_iv_tcw_gen(struct crypt_config *cc, u8 *iv,
-			    struct dm_crypt_request *dmreq)
-{
-	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
-	__le64 sector = cpu_to_le64(dmreq->iv_sector);
-	u8 *src;
-	int r = 0;
+	atomic_t io_pending;
+	int error;
+	sector_t sector;
 
-	/* Remove whitening from ciphertext */
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE) {
-		src = kmap_atomic(sg_page(&dmreq->sg_in));
-		r = crypt_iv_tcw_whitening(cc, dmreq, src + dmreq->sg_in.offset);
-		kunmap_atomic(src);
-	}
+	struct rb_node rb_node;
+} CRYPTO_MINALIGN_ATTR;
 
-	/* Calculate IV */
-	memcpy(iv, tcw->iv_seed, cc->iv_size);
-	crypto_xor(iv, (u8 *)&sector, 8);
-	if (cc->iv_size > 8)
-		crypto_xor(&iv[8], (u8 *)&sector, cc->iv_size - 8);
+struct dm_crypt_request {
+	struct convert_context *ctx;
+	struct scatterlist *sg_in;
+	struct scatterlist *sg_out;
+	sector_t iv_sector;
+};
 
-	return r;
-}
+struct crypt_config;
 
-static int crypt_iv_tcw_post(struct crypt_config *cc, u8 *iv,
-			     struct dm_crypt_request *dmreq)
-{
-	u8 *dst;
-	int r;
+/*
+ * Crypt: maps a linear range of a block device
+ * and encrypts / decrypts at the same time.
+ */
+enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
+	     DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
 
-	if (bio_data_dir(dmreq->ctx->bio_in) != WRITE)
-		return 0;
+/*
+ * The fields in here must be read only after initialization.
+ */
+struct crypt_config {
+	struct dm_dev *dev;
+	sector_t start;
 
-	/* Apply whitening on ciphertext */
-	dst = kmap_atomic(sg_page(&dmreq->sg_out));
-	r = crypt_iv_tcw_whitening(cc, dmreq, dst + dmreq->sg_out.offset);
-	kunmap_atomic(dst);
+	/*
+	 * pool for per bio private data, crypto requests and
+	 * encryption requeusts/buffer pages
+	 */
+	mempool_t *req_pool;
+	mempool_t *page_pool;
+	struct bio_set *bs;
+	struct mutex bio_alloc_lock;
 
-	return r;
-}
+	struct workqueue_struct *io_queue;
+	struct workqueue_struct *crypt_queue;
 
-static const struct crypt_iv_operations crypt_iv_plain_ops = {
-	.generator = crypt_iv_plain_gen
-};
+	struct task_struct *write_thread;
+	wait_queue_head_t write_thread_wait;
+	struct rb_root write_tree;
 
-static const struct crypt_iv_operations crypt_iv_plain64_ops = {
-	.generator = crypt_iv_plain64_gen
-};
+	char *cipher;
+	char *cipher_string;
+	char *key_string;
 
-static const struct crypt_iv_operations crypt_iv_essiv_ops = {
-	.ctr       = crypt_iv_essiv_ctr,
-	.dtr       = crypt_iv_essiv_dtr,
-	.init      = crypt_iv_essiv_init,
-	.wipe      = crypt_iv_essiv_wipe,
-	.generator = crypt_iv_essiv_gen
-};
+	sector_t iv_offset;
+	unsigned int iv_size;
 
-static const struct crypt_iv_operations crypt_iv_benbi_ops = {
-	.ctr	   = crypt_iv_benbi_ctr,
-	.dtr	   = crypt_iv_benbi_dtr,
-	.generator = crypt_iv_benbi_gen
-};
+	/* ESSIV: struct crypto_cipher *essiv_tfm */
+	void *iv_private;
+	struct crypto_skcipher *tfm;
+	unsigned int tfms_count;
 
-static const struct crypt_iv_operations crypt_iv_null_ops = {
-	.generator = crypt_iv_null_gen
-};
+	/*
+	 * Layout of each crypto request:
+	 *
+	 *   struct skcipher_request
+	 *      context
+	 *      padding
+	 *   struct dm_crypt_request
+	 *      padding
+	 *   IV
+	 *
+	 * The padding is added so that dm_crypt_request and the IV are
+	 * correctly aligned.
+	 */
+	unsigned int dmreq_start;
 
-static const struct crypt_iv_operations crypt_iv_lmk_ops = {
-	.ctr	   = crypt_iv_lmk_ctr,
-	.dtr	   = crypt_iv_lmk_dtr,
-	.init	   = crypt_iv_lmk_init,
-	.wipe	   = crypt_iv_lmk_wipe,
-	.generator = crypt_iv_lmk_gen,
-	.post	   = crypt_iv_lmk_post
-};
+	unsigned int per_bio_data_size;
 
-static const struct crypt_iv_operations crypt_iv_tcw_ops = {
-	.ctr	   = crypt_iv_tcw_ctr,
-	.dtr	   = crypt_iv_tcw_dtr,
-	.init	   = crypt_iv_tcw_init,
-	.wipe	   = crypt_iv_tcw_wipe,
-	.generator = crypt_iv_tcw_gen,
-	.post	   = crypt_iv_tcw_post
+	unsigned long flags;
+	unsigned int key_size;
+	unsigned int key_parts;      /* independent parts in key buffer */
+	unsigned int key_extra_size; /* additional keys length */
+	u8 key[0];
 };
 
+static void clone_init(struct dm_crypt_io *, struct bio *);
+static void kcryptd_queue_crypt(struct dm_crypt_io *io);
+static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);
+
 static void crypt_convert_init(struct crypt_config *cc,
 			       struct convert_context *ctx,
 			       struct bio *bio_out, struct bio *bio_in,
@@ -837,53 +1700,7 @@ static u8 *iv_of_dmreq(struct crypt_config *cc,
 		       struct dm_crypt_request *dmreq)
 {
 	return (u8 *)ALIGN((unsigned long)(dmreq + 1),
-		crypto_skcipher_alignmask(any_tfm(cc)) + 1);
-}
-
-static int crypt_convert_block(struct crypt_config *cc,
-			       struct convert_context *ctx,
-			       struct skcipher_request *req)
-{
-	struct bio_vec bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
-	struct bio_vec bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
-	struct dm_crypt_request *dmreq;
-	u8 *iv;
-	int r;
-
-	dmreq = dmreq_of_req(cc, req);
-	iv = iv_of_dmreq(cc, dmreq);
-
-	dmreq->iv_sector = ctx->cc_sector;
-	dmreq->ctx = ctx;
-	sg_init_table(&dmreq->sg_in, 1);
-	sg_set_page(&dmreq->sg_in, bv_in.bv_page, 1 << SECTOR_SHIFT,
-		    bv_in.bv_offset);
-
-	sg_init_table(&dmreq->sg_out, 1);
-	sg_set_page(&dmreq->sg_out, bv_out.bv_page, 1 << SECTOR_SHIFT,
-		    bv_out.bv_offset);
-
-	bio_advance_iter(ctx->bio_in, &ctx->iter_in, 1 << SECTOR_SHIFT);
-	bio_advance_iter(ctx->bio_out, &ctx->iter_out, 1 << SECTOR_SHIFT);
-
-	if (cc->iv_gen_ops) {
-		r = cc->iv_gen_ops->generator(cc, iv, dmreq);
-		if (r < 0)
-			return r;
-	}
-
-	skcipher_request_set_crypt(req, &dmreq->sg_in, &dmreq->sg_out,
-				   1 << SECTOR_SHIFT, iv);
-
-	if (bio_data_dir(ctx->bio_in) == WRITE)
-		r = crypto_skcipher_encrypt(req);
-	else
-		r = crypto_skcipher_decrypt(req);
-
-	if (!r && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		r = cc->iv_gen_ops->post(cc, iv, dmreq);
-
-	return r;
+		crypto_skcipher_alignmask(cc->tfm) + 1);
 }
 
 static void kcryptd_async_done(struct crypto_async_request *async_req,
@@ -892,12 +1709,10 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void crypt_alloc_req(struct crypt_config *cc,
 			    struct convert_context *ctx)
 {
-	unsigned key_index = ctx->cc_sector & (cc->tfms_count - 1);
-
 	if (!ctx->req)
 		ctx->req = mempool_alloc(cc->req_pool, GFP_NOIO);
 
-	skcipher_request_set_tfm(ctx->req, cc->tfms[key_index]);
+	skcipher_request_set_tfm(ctx->req, cc->tfm);
 
 	/*
 	 * Use REQ_MAY_BACKLOG so a cipher driver internally backlogs
@@ -920,57 +1735,98 @@ static void crypt_free_req(struct crypt_config *cc,
 /*
  * Encrypt / decrypt data from one bio to another one (can be the same one)
  */
-static int crypt_convert(struct crypt_config *cc,
-			 struct convert_context *ctx)
+
+static int crypt_convert_bio(struct crypt_config *cc,
+			     struct convert_context *ctx)
 {
+	unsigned int cryptlen, n1, n2, nents, i = 0, bytes = 0;
+	struct skcipher_request *req;
+	struct dm_crypt_request *dmreq;
+	struct geniv_req_info rinfo;
+	struct bio_vec bv_in, bv_out;
 	int r;
+	u8 *iv;
 
 	atomic_set(&ctx->cc_pending, 1);
+	crypt_alloc_req(cc, ctx);
+
+	req = ctx->req;
+	dmreq = dmreq_of_req(cc, req);
+	iv = iv_of_dmreq(cc, dmreq);
 
-	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
+	n1 = bio_segments(ctx->bio_in);
+	n2 = bio_segments(ctx->bio_out);
+	nents = n1 > n2 ? n1 : n2;
+	nents = nents > MAX_SG_LIST ? MAX_SG_LIST : nents;
+	cryptlen = ctx->iter_in.bi_size;
 
-		crypt_alloc_req(cc, ctx);
+	DMDEBUG("dm-crypt:%s: segments:[in=%u, out=%u] bi_size=%u\n",
+		bio_data_dir(ctx->bio_in) == WRITE ? "write" : "read",
+		n1, n2, cryptlen);
 
-		atomic_inc(&ctx->cc_pending);
+	dmreq->sg_in  = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	dmreq->sg_out = kcalloc(nents, sizeof(struct scatterlist), GFP_KERNEL);
+	if (!dmreq->sg_in || !dmreq->sg_out) {
+		DMERR("dm-crypt: Failed to allocate scatterlist\n");
+		r = -ENOMEM;
+		goto end;
+	}
+	dmreq->ctx = ctx;
 
-		r = crypt_convert_block(cc, ctx, ctx->req);
+	sg_init_table(dmreq->sg_in, nents);
+	sg_init_table(dmreq->sg_out, nents);
 
-		switch (r) {
-		/*
-		 * The request was queued by a crypto driver
-		 * but the driver request queue is full, let's wait.
-		 */
-		case -EBUSY:
-			wait_for_completion(&ctx->restart);
-			reinit_completion(&ctx->restart);
-			/* fall through */
-		/*
-		 * The request is queued and processed asynchronously,
-		 * completion function kcryptd_async_done() will be called.
-		 */
-		case -EINPROGRESS:
-			ctx->req = NULL;
-			ctx->cc_sector++;
-			continue;
-		/*
-		 * The request was already processed (synchronously).
-		 */
-		case 0:
-			atomic_dec(&ctx->cc_pending);
-			ctx->cc_sector++;
-			cond_resched();
-			continue;
+	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size && i < nents) {
+		bv_in = bio_iter_iovec(ctx->bio_in, ctx->iter_in);
+		bv_out = bio_iter_iovec(ctx->bio_out, ctx->iter_out);
 
-		/* There was an error while processing the request. */
-		default:
-			atomic_dec(&ctx->cc_pending);
-			return r;
-		}
+		sg_set_page(&dmreq->sg_in[i], bv_in.bv_page, bv_in.bv_len,
+			    bv_in.bv_offset);
+		sg_set_page(&dmreq->sg_out[i], bv_out.bv_page, bv_out.bv_len,
+			    bv_out.bv_offset);
+
+		bio_advance_iter(ctx->bio_in, &ctx->iter_in, bv_in.bv_len);
+		bio_advance_iter(ctx->bio_out, &ctx->iter_out, bv_out.bv_len);
+
+		bytes += bv_in.bv_len;
+		i++;
 	}
 
-	return 0;
+	DMDEBUG("dm-crypt: Processed %u of %u bytes\n", bytes, cryptlen);
+
+	rinfo.is_write = (bio_data_dir(ctx->bio_in) == WRITE);
+	rinfo.iv_sector = ctx->cc_sector;
+	rinfo.nents = nents;
+	rinfo.iv = iv;
+
+	skcipher_request_set_crypt(req, dmreq->sg_in, dmreq->sg_out,
+				   bytes, &rinfo);
+
+	if (bio_data_dir(ctx->bio_in) == WRITE)
+		r = crypto_skcipher_encrypt(req);
+	else
+		r = crypto_skcipher_decrypt(req);
+
+	switch (r) {
+	/* The request was queued so wait. */
+	case -EBUSY:
+		wait_for_completion(&ctx->restart);
+		reinit_completion(&ctx->restart);
+		/* fall through */
+	/*
+	 * The request is queued and processed asynchronously,
+	 * completion function kcryptd_async_done() is called.
+	 */
+	case -EINPROGRESS:
+		ctx->req = NULL;
+		cond_resched();
+		break;
+	}
+end:
+	return r;
 }
 
+
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
 
 /*
@@ -1070,11 +1926,17 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
 {
 	struct crypt_config *cc = io->cc;
 	struct bio *base_bio = io->base_bio;
+	struct dm_crypt_request *dmreq;
 	int error = io->error;
 
 	if (!atomic_dec_and_test(&io->io_pending))
 		return;
 
+	dmreq = dmreq_of_req(cc, io->ctx.req);
+	DMDEBUG("dm-crypt: Freeing scatterlists [sync]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	if (io->ctx.req)
 		crypt_free_req(cc, io->ctx.req, base_bio);
 
@@ -1313,7 +2175,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
 	sector += bio_sectors(clone);
 
 	crypt_inc_pending(io);
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
 	if (r)
 		io->error = -EIO;
 	crypt_finished = atomic_dec_and_test(&io->ctx.cc_pending);
@@ -1343,7 +2205,8 @@ static void kcryptd_crypt_read_convert(struct dm_crypt_io *io)
 	crypt_convert_init(cc, &io->ctx, io->base_bio, io->base_bio,
 			   io->sector);
 
-	r = crypt_convert(cc, &io->ctx);
+	r = crypt_convert_bio(cc, &io->ctx);
+
 	if (r < 0)
 		io->error = -EIO;
 
@@ -1371,12 +2234,13 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 		return;
 	}
 
-	if (!error && cc->iv_gen_ops && cc->iv_gen_ops->post)
-		error = cc->iv_gen_ops->post(cc, iv_of_dmreq(cc, dmreq), dmreq);
-
 	if (error < 0)
 		io->error = -EIO;
 
+	DMDEBUG("dm-crypt: Freeing scatterlists and request struct [async]\n");
+	kfree(dmreq->sg_in);
+	kfree(dmreq->sg_out);
+
 	crypt_free_req(cc, req_of_dmreq(cc, dmreq), io->base_bio);
 
 	if (!atomic_dec_and_test(&ctx->cc_pending))
@@ -1430,62 +2294,38 @@ static int crypt_decode_key(u8 *key, char *hex, unsigned int size)
 	return 0;
 }
 
-static void crypt_free_tfms(struct crypt_config *cc)
+static void crypt_free_tfm(struct crypt_config *cc)
 {
-	unsigned i;
-
-	if (!cc->tfms)
+	if (!cc->tfm)
 		return;
 
-	for (i = 0; i < cc->tfms_count; i++)
-		if (cc->tfms[i] && !IS_ERR(cc->tfms[i])) {
-			crypto_free_skcipher(cc->tfms[i]);
-			cc->tfms[i] = NULL;
-		}
+	if (cc->tfm && !IS_ERR(cc->tfm))
+		crypto_free_skcipher(cc->tfm);
 
-	kfree(cc->tfms);
-	cc->tfms = NULL;
+	cc->tfm = NULL;
 }
 
-static int crypt_alloc_tfms(struct crypt_config *cc, char *ciphermode)
+static int crypt_alloc_tfm(struct crypt_config *cc, char *ciphermode)
 {
-	unsigned i;
 	int err;
 
-	cc->tfms = kzalloc(cc->tfms_count * sizeof(struct crypto_skcipher *),
-			   GFP_KERNEL);
-	if (!cc->tfms)
-		return -ENOMEM;
-
-	for (i = 0; i < cc->tfms_count; i++) {
-		cc->tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0);
-		if (IS_ERR(cc->tfms[i])) {
-			err = PTR_ERR(cc->tfms[i]);
-			crypt_free_tfms(cc);
-			return err;
-		}
+	cc->tfm = crypto_alloc_skcipher(ciphermode, 0, 0);
+	if (IS_ERR(cc->tfm)) {
+		err = PTR_ERR(cc->tfm);
+		crypt_free_tfm(cc);
+		return err;
 	}
 
 	return 0;
 }
 
-static int crypt_setkey(struct crypt_config *cc)
+static inline int crypt_setkey(struct crypt_config *cc, enum setkey_op keyop,
+			       char *ivopts)
 {
-	unsigned subkey_size;
-	int err = 0, i, r;
-
-	/* Ignore extra keys (which are used for IV etc) */
-	subkey_size = (cc->key_size - cc->key_extra_size) >> ilog2(cc->tfms_count);
-
-	for (i = 0; i < cc->tfms_count; i++) {
-		r = crypto_skcipher_setkey(cc->tfms[i],
-					   cc->key + (i * subkey_size),
-					   subkey_size);
-		if (r)
-			err = r;
-	}
+	DECLARE_GENIV_KEY(kinfo, keyop, cc->tfms_count, cc->key, cc->key_size,
+			  cc->key_parts, ivopts);
 
-	return err;
+	return crypto_skcipher_setkey(cc->tfm, (u8 *) &kinfo, sizeof(kinfo));
 }
 
 #ifdef CONFIG_KEYS
@@ -1498,7 +2338,9 @@ static bool contains_whitespace(const char *str)
 	return false;
 }
 
-static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string)
+static int crypt_set_keyring_key(struct crypt_config *cc,
+				 const char *key_string,
+				 enum setkey_op keyop, char *ivopts)
 {
 	char *new_key_string, *key_desc;
 	int ret;
@@ -1559,7 +2401,7 @@ static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string
 	/* clear the flag since following operations may invalidate previously valid key */
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
-	ret = crypt_setkey(cc);
+	ret = crypt_setkey(cc, keyop, ivopts);
 
 	/* wipe the kernel key payload copy in each case */
 	memset(cc->key, 0, cc->key_size * sizeof(u8));
@@ -1599,7 +2441,9 @@ static int get_key_size(char **key_string)
 
 #else
 
-static int crypt_set_keyring_key(struct crypt_config *cc, const char *key_string)
+static int crypt_set_keyring_key(struct crypt_config *cc,
+				 const char *key_string,
+				 enum setkey_op keyop, char *ivopts)
 {
 	return -EINVAL;
 }
@@ -1611,7 +2455,8 @@ static int get_key_size(char **key_string)
 
 #endif
 
-static int crypt_set_key(struct crypt_config *cc, char *key)
+static int crypt_set_key(struct crypt_config *cc, enum setkey_op keyop,
+			 char *key, char *ivopts)
 {
 	int r = -EINVAL;
 	int key_string_len = strlen(key);
@@ -1622,7 +2467,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 
 	/* ':' means the key is in kernel keyring, short-circuit normal key processing */
 	if (key[0] == ':') {
-		r = crypt_set_keyring_key(cc, key + 1);
+		r = crypt_set_keyring_key(cc, key + 1, keyop, ivopts);
 		goto out;
 	}
 
@@ -1636,7 +2481,7 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	if (cc->key_size && crypt_decode_key(cc->key, key, cc->key_size) < 0)
 		goto out;
 
-	r = crypt_setkey(cc);
+	r = crypt_setkey(cc, keyop, ivopts);
 	if (!r)
 		set_bit(DM_CRYPT_KEY_VALID, &cc->flags);
 
@@ -1647,6 +2492,17 @@ static int crypt_set_key(struct crypt_config *cc, char *key)
 	return r;
 }
 
+static int crypt_init_key(struct dm_target *ti, char *key, char *ivopts)
+{
+	struct crypt_config *cc = ti->private;
+	int ret;
+
+	ret = crypt_set_key(cc, SETKEY_OP_INIT, key, ivopts);
+	if (ret < 0)
+		ti->error = "Error decoding and setting key";
+	return ret;
+}
+
 static int crypt_wipe_key(struct crypt_config *cc)
 {
 	clear_bit(DM_CRYPT_KEY_VALID, &cc->flags);
@@ -1654,7 +2510,7 @@ static int crypt_wipe_key(struct crypt_config *cc)
 	kzfree(cc->key_string);
 	cc->key_string = NULL;
 
-	return crypt_setkey(cc);
+	return crypt_setkey(cc, SETKEY_OP_WIPE, NULL);
 }
 
 static void crypt_dtr(struct dm_target *ti)
@@ -1674,7 +2530,7 @@ static void crypt_dtr(struct dm_target *ti)
 	if (cc->crypt_queue)
 		destroy_workqueue(cc->crypt_queue);
 
-	crypt_free_tfms(cc);
+	crypt_free_tfm(cc);
 
 	if (cc->bs)
 		bioset_free(cc->bs);
@@ -1682,9 +2538,6 @@ static void crypt_dtr(struct dm_target *ti)
 	mempool_destroy(cc->page_pool);
 	mempool_destroy(cc->req_pool);
 
-	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
-		cc->iv_gen_ops->dtr(cc);
-
 	if (cc->dev)
 		dm_put_device(ti, cc->dev);
 
@@ -1762,22 +2615,30 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	if (!cipher_api)
 		goto bad_mem;
 
-	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME,
-		       "%s(%s)", chainmode, cipher);
+create_cipher:
+	/* For those ciphers which do not support IVs,
+	 * use the 'null' template cipher
+	 */
+
+	if (!ivmode)
+		ivmode = "null";
+
+	ret = snprintf(cipher_api, CRYPTO_MAX_ALG_NAME, "%s(%s(%s))",
+		       ivmode, chainmode, cipher);
 	if (ret < 0) {
 		kfree(cipher_api);
 		goto bad_mem;
 	}
 
 	/* Allocate cipher */
-	ret = crypt_alloc_tfms(cc, cipher_api);
+	ret = crypt_alloc_tfm(cc, cipher_api);
 	if (ret < 0) {
 		ti->error = "Error allocating crypto tfm";
 		goto bad;
 	}
 
 	/* Initialize IV */
-	cc->iv_size = crypto_skcipher_ivsize(any_tfm(cc));
+	cc->iv_size = crypto_skcipher_ivsize(cc->tfm);
 	if (cc->iv_size)
 		/* at least a 64 bit sector number should fit in our buffer */
 		cc->iv_size = max(cc->iv_size,
@@ -1785,23 +2646,10 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 	else if (ivmode) {
 		DMWARN("Selected cipher does not support IVs");
 		ivmode = NULL;
+		goto create_cipher;
 	}
 
-	/* Choose ivmode, see comments at iv code. */
-	if (ivmode == NULL)
-		cc->iv_gen_ops = NULL;
-	else if (strcmp(ivmode, "plain") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain_ops;
-	else if (strcmp(ivmode, "plain64") == 0)
-		cc->iv_gen_ops = &crypt_iv_plain64_ops;
-	else if (strcmp(ivmode, "essiv") == 0)
-		cc->iv_gen_ops = &crypt_iv_essiv_ops;
-	else if (strcmp(ivmode, "benbi") == 0)
-		cc->iv_gen_ops = &crypt_iv_benbi_ops;
-	else if (strcmp(ivmode, "null") == 0)
-		cc->iv_gen_ops = &crypt_iv_null_ops;
-	else if (strcmp(ivmode, "lmk") == 0) {
-		cc->iv_gen_ops = &crypt_iv_lmk_ops;
+	if (strcmp(ivmode, "lmk") == 0) {
 		/*
 		 * Version 2 and 3 is recognised according
 		 * to length of provided multi-key string.
@@ -1813,39 +2661,14 @@ static int crypt_ctr_cipher(struct dm_target *ti,
 			cc->key_extra_size = cc->key_size / cc->key_parts;
 		}
 	} else if (strcmp(ivmode, "tcw") == 0) {
-		cc->iv_gen_ops = &crypt_iv_tcw_ops;
 		cc->key_parts += 2; /* IV + whitening */
 		cc->key_extra_size = cc->iv_size + TCW_WHITENING_SIZE;
-	} else {
-		ret = -EINVAL;
-		ti->error = "Invalid IV mode";
-		goto bad;
 	}
 
 	/* Initialize and set key */
-	ret = crypt_set_key(cc, key);
-	if (ret < 0) {
-		ti->error = "Error decoding and setting key";
+	ret = crypt_init_key(ti, key, ivopts);
+	if (ret < 0)
 		goto bad;
-	}
-
-	/* Allocate IV */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->ctr) {
-		ret = cc->iv_gen_ops->ctr(cc, ti, ivopts);
-		if (ret < 0) {
-			ti->error = "Error creating IV";
-			goto bad;
-		}
-	}
-
-	/* Initialize IV (set keys for ESSIV etc) */
-	if (cc->iv_gen_ops && cc->iv_gen_ops->init) {
-		ret = cc->iv_gen_ops->init(cc);
-		if (ret < 0) {
-			ti->error = "Error initialising IV";
-			goto bad;
-		}
-	}
 
 	ret = 0;
 bad:
@@ -1901,20 +2724,20 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 
 	cc->dmreq_start = sizeof(struct skcipher_request);
-	cc->dmreq_start += crypto_skcipher_reqsize(any_tfm(cc));
+	cc->dmreq_start += crypto_skcipher_reqsize(cc->tfm);
 	cc->dmreq_start = ALIGN(cc->dmreq_start, __alignof__(struct dm_crypt_request));
 
-	if (crypto_skcipher_alignmask(any_tfm(cc)) < CRYPTO_MINALIGN) {
+	if (crypto_skcipher_alignmask(cc->tfm) < CRYPTO_MINALIGN) {
 		/* Allocate the padding exactly */
 		iv_size_padding = -(cc->dmreq_start + sizeof(struct dm_crypt_request))
-				& crypto_skcipher_alignmask(any_tfm(cc));
+				& crypto_skcipher_alignmask(cc->tfm);
 	} else {
 		/*
 		 * If the cipher requires greater alignment than kmalloc
 		 * alignment, we don't know the exact position of the
 		 * initialization vector. We must assume worst case.
 		 */
-		iv_size_padding = crypto_skcipher_alignmask(any_tfm(cc));
+		iv_size_padding = crypto_skcipher_alignmask(cc->tfm);
 	}
 
 	ret = -ENOMEM;
@@ -2072,8 +2895,9 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
 	if (bio_data_dir(io->base_bio) == READ) {
 		if (kcryptd_io_read(io, GFP_NOWAIT))
 			kcryptd_queue_read(io);
-	} else
+	} else {
 		kcryptd_queue_crypt(io);
+	}
 
 	return DM_MAPIO_SUBMITTED;
 }
@@ -2155,7 +2979,7 @@ static void crypt_resume(struct dm_target *ti)
 static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 {
 	struct crypt_config *cc = ti->private;
-	int key_size, ret = -EINVAL;
+	int key_size;
 
 	if (argc < 2)
 		goto error;
@@ -2173,19 +2997,9 @@ static int crypt_message(struct dm_target *ti, unsigned argc, char **argv)
 				return -EINVAL;
 			}
 
-			ret = crypt_set_key(cc, argv[2]);
-			if (ret)
-				return ret;
-			if (cc->iv_gen_ops && cc->iv_gen_ops->init)
-				ret = cc->iv_gen_ops->init(cc);
-			return ret;
+			return crypt_set_key(cc, SETKEY_OP_SET, argv[2], NULL);
 		}
 		if (argc == 2 && !strcasecmp(argv[1], "wipe")) {
-			if (cc->iv_gen_ops && cc->iv_gen_ops->wipe) {
-				ret = cc->iv_gen_ops->wipe(cc);
-				if (ret)
-					return ret;
-			}
 			return crypt_wipe_key(cc);
 		}
 	}
@@ -2216,7 +3030,7 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
 
 static struct target_type crypt_target = {
 	.name   = "crypt",
-	.version = {1, 15, 0},
+	.version = {1, 16, 0},
 	.module = THIS_MODULE,
 	.ctr    = crypt_ctr,
 	.dtr    = crypt_dtr,
@@ -2234,6 +3048,7 @@ static int __init dm_crypt_init(void)
 {
 	int r;
 
+	geniv_register_algs();
 	r = dm_register_target(&crypt_target);
 	if (r < 0)
 		DMERR("register failed %d", r);
@@ -2244,6 +3059,7 @@ static int __init dm_crypt_init(void)
 static void __exit dm_crypt_exit(void)
 {
 	dm_unregister_target(&crypt_target);
+	geniv_deregister_algs();
 }
 
 module_init(dm_crypt_init);
diff --git a/include/crypto/geniv.h b/include/crypto/geniv.h
new file mode 100644
index 0000000..b472507
--- /dev/null
+++ b/include/crypto/geniv.h
@@ -0,0 +1,47 @@
+/*
+ * geniv: common interface for IV generation algorithms
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef _CRYPTO_GENIV_
+#define _CRYPTO_GENIV_
+
+#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
+
+enum setkey_op {
+	SETKEY_OP_INIT,
+	SETKEY_OP_SET,
+	SETKEY_OP_WIPE,
+};
+
+struct geniv_key_info {
+	enum setkey_op keyop;
+	unsigned int tfms_count;
+	u8 *key;
+	unsigned int key_size;
+	unsigned int key_parts;
+	char *ivopts;
+};
+
+#define DECLARE_GENIV_KEY(c, op, n, k, sz, kp, opts)	\
+	struct geniv_key_info c = {			\
+		.keyop = op,				\
+		.tfms_count = n,			\
+		.key = k,				\
+		.key_size = sz,				\
+		.key_parts = kp,			\
+		.ivopts = opts,				\
+	}
+
+struct geniv_req_info {
+	bool is_write;
+	sector_t iv_sector;
+	unsigned int nents;
+	u8 *iv;
+};
+
+#endif
-- 
Binoy Jayan

^ permalink raw reply related

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Vinod Koul @ 2017-02-07 16:42 UTC (permalink / raw)
  To: Anup Patel
  Cc: Dan Williams, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos_yhXufdt79KWY0eugjF5oyJ8+NcPh2E-dEXnSQu2WhhA@mail.gmail.com>

On Tue, Feb 07, 2017 at 02:32:15PM +0530, Anup Patel wrote:
> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> >> The DMAENGINE framework assumes that if PQ offload is supported by a
> >> DMA device then all 256 PQ coefficients are supported. This assumption
> >> does not hold anymore because we now have BCM-SBA-RAID offload engine
> >> which supports PQ offload with limited number of PQ coefficients.
> >>
> >> This patch extends async_tx APIs to handle DMA devices with support
> >> for fewer PQ coefficients.
> >>
> >> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
> >> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
> >
> > I don't like this approach. Define an interface for md to query the
> > offload engine once at the beginning of time. We should not be adding
> > any new extensions to async_tx.
> 
> Even if we do capability checks in Linux MD, we still need a way
> for DMAENGINE drivers to advertise number of PQ coefficients
> handled by the HW.

If the question is only for advertising caps, then why not do as done
for dma_get_slave_caps(). you can add dma_get_pq_caps() so that clients (md)
in this case would know the HW capability.

> I agree capability checks should be done once in Linux MD but I don't
> see why this has to be part of BCM-SBA-RAID driver patches. We need
> separate patchsets to address limitations of async_tx framework.
> 
> Regards,
> Anup

-- 
~Vinod

^ permalink raw reply

* [PATCH 0/5] MD: don't abuse bi_phys_segements
From: Shaohua Li @ 2017-02-07 16:55 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb

This patchset removes the hack of bio->by_phys_segements usage

Shaohua Li (5):
  MD: attach data to each bio
  md/raid5: don't abuse bio->bi_phys_segments
  md/raid5: change disk limits
  md/raid1: don't abuse bio->bi_phys_segments
  md/raid10: don't abuse bio->bi_phys_segments

 drivers/md/md.c     | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h     | 18 ++++++++++++++
 drivers/md/raid1.c  | 44 +++++++++++++++++++---------------
 drivers/md/raid10.c | 40 ++++++++++++++++++-------------
 drivers/md/raid5.c  | 21 ++++++++---------
 drivers/md/raid5.h  | 37 +++++++++++++++--------------
 6 files changed, 164 insertions(+), 64 deletions(-)

-- 
2.9.3


^ permalink raw reply

* [PATCH 1/5] MD: attach data to each bio
From: Shaohua Li @ 2017-02-07 16:55 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <cover.1486485935.git.shli@fb.com>

Currently MD is rebusing some bio fields. To remove the hack, we attach
extra data to each bio. Each personablity can attach extra data to the
bios, so we don't need to rebuse bio fields.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/md.h | 18 +++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 67a1854..6f23964 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -248,6 +248,47 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
 		_tmp = _tmp->next;})					\
 		)
 
+#define MD_MEMPOOL_SIZE (32)
+static mempool_t *md_alloc_per_bio_pool(struct md_personality *per)
+{
+	mempool_t *p;
+
+	if (!per->per_bio_data_size)
+		return NULL;
+	p = mempool_create_kmalloc_pool(MD_MEMPOOL_SIZE,
+				MD_PER_BIO_DATA_SIZE(per));
+	if (p)
+		return p;
+	return ERR_PTR(-ENOMEM);
+}
+
+static void md_bio_end_io(struct bio *bio)
+{
+	struct md_per_bio_data *data = bio->bi_private;
+
+	bio->bi_private = data->orig_private;
+	bio->bi_end_io = data->orig_endio;
+	bio_endio(bio);
+
+	mempool_free(data, data->mddev->per_bio_pool);
+}
+
+void md_bio_attach_data(struct mddev *mddev, struct bio *bio)
+{
+	struct md_per_bio_data *data;
+
+	if (!mddev->per_bio_pool)
+		return;
+	data = mempool_alloc(mddev->per_bio_pool, GFP_NOIO);
+	data->orig_endio = bio->bi_end_io;
+	data->orig_private = bio->bi_private;
+	data->mddev = mddev;
+
+	bio->bi_private = data;
+	bio->bi_end_io = md_bio_end_io;
+}
+EXPORT_SYMBOL(md_bio_attach_data);
+
 /* Rather than calling directly into the personality make_request function,
  * IO requests come here first so that we can check if the device is
  * being suspended pending a reconfiguration.
@@ -274,6 +315,8 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 		bio_endio(bio);
 		return BLK_QC_T_NONE;
 	}
+	md_bio_attach_data(mddev, bio);
+
 	smp_rmb(); /* Ensure implications of  'active' are visible */
 	rcu_read_lock();
 	if (mddev->suspended) {
@@ -3513,6 +3556,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	long level;
 	void *priv, *oldpriv;
 	struct md_rdev *rdev;
+	mempool_t *new_pool;
 
 	if (slen == 0 || slen >= sizeof(clevel))
 		return -EINVAL;
@@ -3580,7 +3624,15 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 		rv = len;
 		goto out_unlock;
 	}
+	new_pool = md_alloc_per_bio_pool(pers);
+	if (IS_ERR(new_pool)) {
+		module_put(pers->owner);
+		rv = -EINVAL;
+		goto out_unlock;
+	}
+
 	if (!pers->takeover) {
+		mempool_destroy(new_pool);
 		module_put(pers->owner);
 		pr_warn("md: %s: %s does not support personality takeover\n",
 			mdname(mddev), clevel);
@@ -3596,6 +3648,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	 */
 	priv = pers->takeover(mddev);
 	if (IS_ERR(priv)) {
+		mempool_destroy(new_pool);
 		mddev->new_level = mddev->level;
 		mddev->new_layout = mddev->layout;
 		mddev->new_chunk_sectors = mddev->chunk_sectors;
@@ -3660,6 +3713,9 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 
 	module_put(oldpers->owner);
 
+	mempool_destroy(mddev->per_bio_pool);
+	mddev->per_bio_pool = new_pool;
+
 	rdev_for_each(rdev, mddev) {
 		if (rdev->raid_disk < 0)
 			continue;
@@ -5209,6 +5265,7 @@ int md_run(struct mddev *mddev)
 	int err;
 	struct md_rdev *rdev;
 	struct md_personality *pers;
+	mempool_t *new_pool;
 
 	if (list_empty(&mddev->disks))
 		/* cannot run an array with no devices.. */
@@ -5299,6 +5356,13 @@ int md_run(struct mddev *mddev)
 		return -EINVAL;
 	}
 
+	new_pool = md_alloc_per_bio_pool(pers);
+	if (IS_ERR(new_pool)) {
+		module_put(pers->owner);
+		return -EINVAL;
+	}
+	mddev->per_bio_pool = new_pool;
+
 	if (pers->sync_request) {
 		/* Warn if this is a potentially silly
 		 * configuration.
@@ -5364,6 +5428,8 @@ int md_run(struct mddev *mddev)
 
 	}
 	if (err) {
+		mempool_destroy(new_pool);
+		mddev->per_bio_pool = NULL;
 		mddev_detach(mddev);
 		if (mddev->private)
 			pers->free(mddev, mddev->private);
@@ -5619,6 +5685,8 @@ static void __md_stop(struct mddev *mddev)
 	mddev->pers = NULL;
 	spin_unlock(&mddev->lock);
 	pers->free(mddev, mddev->private);
+	mempool_destroy(mddev->per_bio_pool);
+	mddev->per_bio_pool = NULL;
 	mddev->private = NULL;
 	if (pers->sync_request && mddev->to_remove == NULL)
 		mddev->to_remove = &md_redundancy_group;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 968bbe7..3d2fe21 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -450,6 +450,7 @@ struct mddev {
 	void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
 	struct md_cluster_info		*cluster_info;
 	unsigned int			good_device_nr;	/* good device num within cluster raid */
+	mempool_t			*per_bio_pool;
 };
 
 enum recovery_flags {
@@ -540,8 +541,25 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	size_t per_bio_data_size;
 };
 
+struct md_per_bio_data {
+	bio_end_io_t *orig_endio;
+	void *orig_private;
+	struct mddev *mddev;
+};
+
+#define MD_PER_BIO_DATA_SIZE(per) (sizeof(struct md_per_bio_data) + \
+	roundup(per->per_bio_data_size, __alignof__(struct md_per_bio_data)))
+
+static inline void *md_get_per_bio_data(struct bio *bio)
+{
+	return ((struct md_per_bio_data *)bio->bi_private) + 1;
+}
+
+extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);
+
 struct md_sysfs_entry {
 	struct attribute attr;
 	ssize_t (*show)(struct mddev *, char *);
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/5] md/raid5: don't abuse bio->bi_phys_segments
From: Shaohua Li @ 2017-02-07 16:55 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <cover.1486485935.git.shli@fb.com>

Allocates extra data for activate_stripes and processed_stripes
accounting. Don't need to abuse bio->bi_phys_segments.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 12 ++++++++----
 drivers/md/raid5.h | 37 +++++++++++++++++++------------------
 2 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b0d1345..fd3e5ce 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3103,7 +3103,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		(unsigned long long)sh->sector);
 
 	/*
-	 * If several bio share a stripe. The bio bi_phys_segments acts as a
+	 * If several bio share a stripe. The raid5_bio_data has a
 	 * reference count to avoid race. The reference count should already be
 	 * increased before this function is called (for example, in
 	 * raid5_make_request()), so other bio sharing this stripe will not free the
@@ -5146,6 +5146,7 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
 		if (sectors < bio_sectors(raid_bio)) {
 			split = bio_split(raid_bio, sectors, GFP_NOIO, fs_bio_set);
 			bio_chain(split, raid_bio);
+			md_bio_attach_data(mddev, split);
 		} else
 			split = raid_bio;
 
@@ -5335,7 +5336,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
+	raid5_set_bi_stripes(bi, 1);
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5463,7 +5464,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	raid5_set_bi_stripes(bi, 1);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5963,7 +5964,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	 * We cannot pre-allocate enough stripe_heads as we may need
 	 * more than exist in the cache (if we allow ever large chunks).
 	 * So we do one stripe head at a time and record in
-	 * ->bi_hw_segments how many have been done.
+	 * ->raid5_bio_data.processed_stripes how many have been done.
 	 *
 	 * We *know* that this entire raid_bio is in one chunk, so
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
@@ -8188,6 +8189,7 @@ static struct md_personality raid6_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid6_takeover,
 	.congested	= raid5_congested,
+	.per_bio_data_size = sizeof(struct raid5_bio_data),
 };
 static struct md_personality raid5_personality =
 {
@@ -8211,6 +8213,7 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
+	.per_bio_data_size = sizeof(struct raid5_bio_data),
 };
 
 static struct md_personality raid4_personality =
@@ -8235,6 +8238,7 @@ static struct md_personality raid4_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid4_takeover,
 	.congested	= raid5_congested,
+	.per_bio_data_size = sizeof(struct raid5_bio_data),
 };
 
 static int __init raid5_init(void)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index c0687df..9232463 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -481,48 +481,49 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 		return NULL;
 }
 
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
+struct raid5_bio_data {
+	atomic_t active_stripes;
+	atomic_t processed_stripes;
+};
+
+#define raid5_get_bio_data(bio) \
+	((struct raid5_bio_data *)md_get_per_bio_data(bio))
+
 static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data= raid5_get_bio_data(bio);
 
-	return (atomic_read(segments) >> 16) & 0xffff;
+	return atomic_read(&data->processed_stripes);
 }
 
 static inline int raid5_dec_bi_active_stripes(struct bio *bio)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data= raid5_get_bio_data(bio);
 
-	return atomic_sub_return(1, segments) & 0xffff;
+	return atomic_sub_return(1, &data->active_stripes);
 }
 
 static inline void raid5_inc_bi_active_stripes(struct bio *bio)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data= raid5_get_bio_data(bio);
 
-	atomic_inc(segments);
+	atomic_inc(&data->active_stripes);
 }
 
 static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	unsigned int cnt)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
+	struct raid5_bio_data *data= raid5_get_bio_data(bio);
 
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
+	atomic_set(&data->processed_stripes, cnt);
 }
 
 static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
 {
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	struct raid5_bio_data *data= raid5_get_bio_data(bio);
 
-	atomic_set(segments, cnt);
+	atomic_set(&data->active_stripes, cnt);
+	atomic_set(&data->processed_stripes, 0);
 }
 
 /* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
-- 
2.9.3


^ permalink raw reply related

* [PATCH 3/5] md/raid5: change disk limits
From: Shaohua Li @ 2017-02-07 16:56 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <cover.1486485935.git.shli@fb.com>

Now we are using atomic_t to track stripes for a bio, the original limit
(because of 16-bits data for the tracking) doesn't apply any more.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index fd3e5ce..467ad4f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7255,13 +7255,8 @@ static int raid5_run(struct mddev *mddev)
 		mddev->queue->limits.discard_alignment = stripe;
 		mddev->queue->limits.discard_granularity = stripe;
 
-		/*
-		 * We use 16-bit counter of active stripes in bi_phys_segments
-		 * (minus one for over-loaded initialization)
-		 */
-		blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
-		blk_queue_max_discard_sectors(mddev->queue,
-					      0xfffe * STRIPE_SECTORS);
+		blk_queue_max_hw_sectors(mddev->queue, BLK_DEF_MAX_SECTORS);
+		blk_queue_max_discard_sectors(mddev->queue, UINT_MAX);
 
 		/*
 		 * unaligned part of discard request will be ignored, so can't
-- 
2.9.3


^ permalink raw reply related

* [PATCH 4/5] md/raid1: don't abuse bio->bi_phys_segments
From: Shaohua Li @ 2017-02-07 16:56 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <cover.1486485935.git.shli@fb.com>

Allocates extra data (an 'unsigned int') for bio reference accounting.
Don't need to abuse bio->bi_phys_segments.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..daa4038 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -102,6 +102,11 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
 #define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
+static unsigned int *raid1_bio_ref_ptr(struct bio *bio)
+{
+	return md_get_per_bio_data(bio);
+}
+
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct pool_info *pi = data;
@@ -246,15 +251,15 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
-	if (bio->bi_phys_segments) {
+	if (*raid1_bio_ref_ptr(bio)) {
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
+		(*raid1_bio_ref_ptr(bio))--;
+		done = (*raid1_bio_ref_ptr(bio) == 0);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		/*
 		 * make_request() might be waiting for
-		 * bi_phys_segments to decrease
+		 * bio reference count to decrease
 		 */
 		wake_up(&conf->wait_barrier);
 	} else
@@ -1138,10 +1143,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 				   - bio->bi_iter.bi_sector);
 		r1_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid1_bio_ref_ptr(bio) == 0)
+			*raid1_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid1_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 
 		/*
@@ -1316,13 +1321,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		start_next_window = wait_barrier(conf, bio);
 		/*
 		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
+		 * the same value of reference count
 		 */
-		if (bio->bi_phys_segments && old &&
+		if (*raid1_bio_ref_ptr(bio) && old &&
 		    old != start_next_window)
 			/* Wait for the former r1bio(s) to complete */
 			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+				   *raid1_bio_ref_ptr(bio) == 1);
 		goto retry_write;
 	}
 
@@ -1332,10 +1337,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		 */
 		r1_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid1_bio_ref_ptr(bio) == 0)
+			*raid1_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid1_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 	}
 	sectors_handled = r1_bio->sector + max_sectors - bio->bi_iter.bi_sector;
@@ -1428,7 +1433,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (sectors_handled < bio_sectors(bio)) {
 		r1_bio_write_done(r1_bio);
 		/* We need another r1_bio.  It has already been counted
-		 * in bio->bi_phys_segments
+		 * in bio reference count
 		 */
 		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
 		r1_bio->master_bio = bio;
@@ -1466,11 +1471,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	/*
 	 * We might need to issue multiple reads to different devices if there
 	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
+	 * bio reference count.  If this is 0, there is only one r1_bio and
 	 * no locking will be needed when requests complete.  If it is
 	 * non-zero, then it is the number of not-completed requests.
 	 */
-	bio->bi_phys_segments = 0;
+	*raid1_bio_ref_ptr(bio) = 0;
 	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (bio_data_dir(bio) == READ)
@@ -2438,10 +2443,10 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 					       - mbio->bi_iter.bi_sector);
 			r1_bio->sectors = max_sectors;
 			spin_lock_irq(&conf->device_lock);
-			if (mbio->bi_phys_segments == 0)
-				mbio->bi_phys_segments = 2;
+			if (*raid1_bio_ref_ptr(mbio) == 0)
+				*raid1_bio_ref_ptr(mbio) = 2;
 			else
-				mbio->bi_phys_segments++;
+				(*raid1_bio_ref_ptr(mbio))++;
 			spin_unlock_irq(&conf->device_lock);
 			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
 					      bio, bio_dev, bio_sector);
@@ -3289,6 +3294,7 @@ static struct md_personality raid1_personality =
 	.quiesce	= raid1_quiesce,
 	.takeover	= raid1_takeover,
 	.congested	= raid1_congested,
+	.per_bio_data_size = sizeof(unsigned int),
 };
 
 static int __init raid_init(void)
-- 
2.9.3


^ permalink raw reply related

* [PATCH 5/5] md/raid10: don't abuse bio->bi_phys_segments
From: Shaohua Li @ 2017-02-07 16:56 UTC (permalink / raw)
  To: linux-raid; +Cc: khlebnikov, hch, neilb
In-Reply-To: <cover.1486485935.git.shli@fb.com>

Allocates extra data (an 'unsigned int') for bio reference accounting.
Don't need to abuse bio->bi_phys_segments.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid10.c | 40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1920756..77623ff 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -133,6 +133,12 @@ static void r10bio_pool_free(void *r10_bio, void *data)
 /* maximum number of concurrent requests, memory permitting */
 #define RESYNC_DEPTH (32*1024*1024/RESYNC_BLOCK_SIZE)
 
+/* The bio has a counter attached */
+static unsigned int *raid10_bio_ref_ptr(struct bio *bio)
+{
+	return md_get_per_bio_data(bio);
+}
+
 /*
  * When performing a resync, we need to read and compare, so
  * we need as many pages are there are copies.
@@ -304,11 +310,11 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
 	int done;
 	struct r10conf *conf = r10_bio->mddev->private;
 
-	if (bio->bi_phys_segments) {
+	if (*raid10_bio_ref_ptr(bio)) {
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
+		(*raid10_bio_ref_ptr(bio))--;
+		done = (*raid10_bio_ref_ptr(bio) == 0);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	} else
 		done = 1;
@@ -1162,10 +1168,10 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 				   - bio->bi_iter.bi_sector);
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid10_bio_ref_ptr(bio) == 0)
+			*raid10_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid10_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 		/*
 		 * Cannot call generic_make_request directly as that will be
@@ -1262,7 +1268,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 * writing to those blocks.  This potentially requires several
 	 * writes to write around the bad blocks.  Each set of writes
 	 * gets its own r10_bio with a set of bios attached.  The number
-	 * of r10_bios is recored in bio->bi_phys_segments just as with
+	 * of r10_bios is recored in a counter just as with
 	 * the read case.
 	 */
 
@@ -1389,10 +1395,10 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 		 */
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
+		if (*raid10_bio_ref_ptr(bio) == 0)
+			*raid10_bio_ref_ptr(bio) = 2;
 		else
-			bio->bi_phys_segments++;
+			(*raid10_bio_ref_ptr(bio))++;
 		spin_unlock_irq(&conf->device_lock);
 	}
 	sectors_handled = r10_bio->sector + max_sectors -
@@ -1493,7 +1499,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	if (sectors_handled < bio_sectors(bio)) {
 		one_write_done(r10_bio);
 		/* We need another r10_bio.  It has already been counted
-		 * in bio->bi_phys_segments.
+		 * in bio's counter.
 		 */
 		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
@@ -1525,11 +1531,11 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	/*
 	 * We might need to issue multiple reads to different devices if there
 	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r10_bio and
+	 * a counter.  If this is 0, there is only one r10_bio and
 	 * no locking will be needed when the request completes.  If it is
 	 * non-zero, then it is the number of not-completed requests.
 	 */
-	bio->bi_phys_segments = 0;
+	*raid10_bio_ref_ptr(bio) = 0;
 	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (bio_data_dir(bio) == READ)
@@ -1567,6 +1573,7 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 					   (chunk_sects - 1)),
 					  GFP_NOIO, fs_bio_set);
 			bio_chain(split, bio);
+			md_bio_attach_data(mddev, split);
 		} else {
 			split = bio;
 		}
@@ -2667,10 +2674,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 			- mbio->bi_iter.bi_sector;
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
-		if (mbio->bi_phys_segments == 0)
-			mbio->bi_phys_segments = 2;
+		if (*raid10_bio_ref_ptr(mbio) == 0)
+			*raid10_bio_ref_ptr(mbio) = 2;
 		else
-			mbio->bi_phys_segments++;
+			(*raid10_bio_ref_ptr(mbio))++;
 		spin_unlock_irq(&conf->device_lock);
 		generic_make_request(bio);
 
@@ -4816,6 +4823,7 @@ static struct md_personality raid10_personality =
 	.start_reshape	= raid10_start_reshape,
 	.finish_reshape	= raid10_finish_reshape,
 	.congested	= raid10_congested,
+	.per_bio_data_size = sizeof(unsigned int),
 };
 
 static int __init raid_init(void)
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients
From: Dan Williams @ 2017-02-07 18:16 UTC (permalink / raw)
  To: Anup Patel
  Cc: Vinod Koul, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Jassi Brar, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine@vger.kernel.org,
	Device Tree, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-crypto, linux-raid
In-Reply-To: <CAALAos_yhXufdt79KWY0eugjF5oyJ8+NcPh2E-dEXnSQu2WhhA@mail.gmail.com>

On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel <anup.patel@broadcom.com> wrote:
> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel <anup.patel@broadcom.com> wrote:
>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>> which supports PQ offload with limited number of PQ coefficients.
>>>
>>> This patch extends async_tx APIs to handle DMA devices with support
>>> for fewer PQ coefficients.
>>>
>>> Signed-off-by: Anup Patel <anup.patel@broadcom.com>
>>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>
>> I don't like this approach. Define an interface for md to query the
>> offload engine once at the beginning of time. We should not be adding
>> any new extensions to async_tx.
>
> Even if we do capability checks in Linux MD, we still need a way
> for DMAENGINE drivers to advertise number of PQ coefficients
> handled by the HW.
>
> I agree capability checks should be done once in Linux MD but I don't
> see why this has to be part of BCM-SBA-RAID driver patches. We need
> separate patchsets to address limitations of async_tx framework.

Right, separate enabling before we pile on new hardware support to a
known broken framework.

^ permalink raw reply

* Re: [PATCH v3 3/9] md: superblock changes for PPL
From: Shaohua Li @ 2017-02-07 21:20 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170130185953.30428-4-artur.paszkiewicz@intel.com>

On Mon, Jan 30, 2017 at 07:59:47PM +0100, Artur Paszkiewicz wrote:
> Include information about PPL location and size into mdp_superblock_1
> and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
> put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
> 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
> to mddev->flags to indicate that PPL is enabled on an array.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/md.c                | 15 +++++++++++++++
>  drivers/md/md.h                |  8 ++++++++
>  drivers/md/raid0.c             |  3 ++-
>  drivers/md/raid1.c             |  3 ++-
>  include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
>  5 files changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 85ac98417a08..e96f73572e23 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1566,6 +1566,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>  	} else if (sb->bblog_offset != 0)
>  		rdev->badblocks.shift = 0;
>  
> +	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
> +		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
> +		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
> +		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
> +	}
> +
>  	if (!refdev) {
>  		ret = 1;
>  	} else {
> @@ -1678,6 +1684,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>  
>  		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>  			set_bit(MD_HAS_JOURNAL, &mddev->flags);
> +
> +		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)
> +			set_bit(MD_HAS_PPL, &mddev->flags);

I think we should check wrong configuration and bail out. For example, both
MD_FEATURE_PPL and MD_FEATURE_BITMAP_OFFSET set, or both MD_FEATURE_PPL and
MD_FEATURE_JOURNAL set.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 4/9] raid5: calculate partial parity for a stripe
From: Shaohua Li @ 2017-02-07 21:25 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170130185953.30428-5-artur.paszkiewicz@intel.com>

On Mon, Jan 30, 2017 at 07:59:48PM +0100, Artur Paszkiewicz wrote:
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
> 
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
> 
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
> 
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
> 
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
> 
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.h |  2 ++
>  2 files changed, 100 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index d1cba941951e..e1e238da32ba 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -466,6 +466,11 @@ static void shrink_buffers(struct stripe_head *sh)
>  		sh->dev[i].page = NULL;
>  		put_page(p);
>  	}
> +
> +	if (sh->ppl_page) {
> +		put_page(sh->ppl_page);
> +		sh->ppl_page = NULL;
> +	}
>  }
>  
>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
> @@ -482,6 +487,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>  		sh->dev[i].page = page;
>  		sh->dev[i].orig_page = page;
>  	}
> +
> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {

I think the test should be something like:
	if (raid5_ppl_enabled())

Having the feature doesn't mean the feature is enabled.

>  	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
>  		__func__, (unsigned long long)sh->sector,
>  		s->locked, s->ops_request);
> @@ -3105,6 +3175,34 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>  	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
>  		goto overlap;
>  
> +	if (forwrite && test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
> +		/*
> +		 * With PPL only writes to consecutive data chunks within a
> +		 * stripe are allowed. Not really an overlap, but
> +		 * wait_for_overlap can be used to handle this.
> +		 */

Please describe why the data must be consecutive.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-07 21:42 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170130185953.30428-6-artur.paszkiewicz@intel.com>

On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> This implements the PPL write logging functionality, using the
> raid5-cache policy logic introduced in previous patches. The description
> of PPL is added to the documentation. More details can be found in the
> comments in raid5-ppl.c.
> 
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  Documentation/admin-guide/md.rst |  53 ++++
>  drivers/md/Makefile              |   2 +-
>  drivers/md/raid5-cache.c         |  13 +-
>  drivers/md/raid5-cache.h         |   8 +
>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c               |  15 +-
>  include/uapi/linux/raid/md_p.h   |  26 ++
>  7 files changed, 661 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/md/raid5-ppl.c
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index e449fb5f277c..7104ef757e73 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>  that all auto-detected arrays are assembled as partitionable.
>  
> +
> +.. _dirty_degraded_boot:
> +
>  Boot time assembly of degraded/dirty arrays
>  -------------------------------------------
>  
> @@ -176,6 +179,56 @@ and its role in the array.
>  Once started with RUN_ARRAY, uninitialized spares can be added with
>  HOT_ADD_DISK.
>  
> +.. _ppl:
> +
> +Partial Parity Log
> +------------------
Please put this part to a separate file under Documentation/md. The admin-guide
is supposed to serve as a doc for user. The Documentation/md files will server
as a doc for developer.

> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> +may become inconsistent with data on other member disks. If the array is also
> +in degraded state, there is no way to recalculate parity, because one of the
> +disks is missing. This can lead to silent data corruption when rebuilding the
> +array or using it is as degraded - data calculated from parity for array blocks
> +that have not been touched by a write request during the unclean shutdown can
> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
> +this, md by default does not allow starting a dirty degraded array, see
> +:ref:`dirty_degraded_boot`.
> +
> +Partial parity for a write operation is the XOR of stripe data chunks not
> +modified by this write. It is just enough data needed for recovering from the
> +write hole. XORing partial parity with the modified chunks produces parity for
> +the stripe, consistent with its state before the write operation, regardless of
> +which chunk writes have completed. If one of the not modified data disks of
> +this stripe is missing, this updated parity can be used to recover its
> +contents. PPL recovery is also performed when starting an array after an
> +unclean shutdown and all disks are available, eliminating the need to resync
> +the array. Because of this, using write-intent bitmap and PPL together is not
> +supported.
> +
> +When handling a write request PPL writes partial parity before new data and
> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> +array member drives in the metadata area, on the parity drive of a particular
> +stripe.  It does not require a dedicated journaling drive. Write performance is
> +reduced by up to 30%-40% but it scales with the number of drives in the array
> +and the journaling drive does not become a bottleneck or a single point of
> +failure.
> +
> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> +not a true journal. It does not protect from losing in-flight data, only from
> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> +performed for this stripe (parity is not updated). So it is possible to have
> +arbitrary data in the written part of a stripe if that disk is lost. In such
> +case the behavior is the same as in plain raid5.
> +
> +PPL is available for md version-1 metadata and external (specifically IMSM)
> +metadata arrays. It can be enabled using mdadm option
> +``--consistency-policy=ppl``.
> +
> +Currently, volatile write-back cache should be disabled on all member drives
> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> +failure.

The limitation is weird and dumb. Is this an implementation limitation which
can be fixed later or a design limitation?

> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> new file mode 100644
> index 000000000000..6bc246c80f6b
> --- /dev/null
> +++ b/drivers/md/raid5-ppl.c
> @@ -0,0 +1,551 @@
> +/*
> + * Partial Parity Log for closing the RAID5 write hole
> + * Copyright (c) 2017, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/blkdev.h>
> +#include <linux/slab.h>
> +#include <linux/crc32c.h>
> +#include <linux/raid/md_p.h>
> +#include "md.h"
> +#include "raid5.h"
> +#include "raid5-cache.h"
> +
> +/*
> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> + * partial parity data. The header contains an array of entries
> + * (struct ppl_header_entry) which describe the logged write requests.
> + * Partial parity for the entries comes after the header, written in the same
> + * sequence as the entries:
> + *
> + * Header
> + *   entry0
> + *   ...
> + *   entryN
> + * PP data
> + *   PP for entry0
> + *   ...
> + *   PP for entryN
> + *
> + * Every entry holds a checksum of its partial parity, the header also has a
> + * checksum of the header itself. Entries for full stripes writes contain no
> + * partial parity, they only mark the stripes for which parity should be
> + * recalculated after an unclean shutdown.
> + *
> + * A write request is always logged to the PPL instance stored on the parity
> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> + * used to handle logging for this disk, independently from others. They are
> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> + * common parent r5l_log. This parent log serves as a proxy and is used in
> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> + *
> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> + * be appended to the last entry if the chunks to write are the same, otherwise
> + * a new entry is added. Checksums of entries are calculated incrementally as
> + * stripes containing partial parity are being added to entries.
> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> + * associated with the io_unit are released and raid5d starts writing their data
> + * and parity. When all stripes are written, the io_unit is freed and the next
> + * can be submitted.
> + *
> + * An io_unit is used to gather stripes until it is submitted or becomes full
> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> + * can't be submitted until the previous has completed (PPL and stripe
> + * data+parity is written). The log->running_ios list tracks all io_units of
> + * a log (for a single member disk). New io_units are added to the end of the
> + * list and the first io_unit is submitted, if it is not submitted already.
> + * The current io_unit accepting new stripes is always the last on the list.
> + */

I don't like the way that io_unit and r5l_log are reused here. The code below
shares very little with raid5-cache. There are a lot of fields in the data
structures which are not used by ppl. It's not a good practice to share data
structure like this way, so please define ppl its own data structure. The
workflow between raid5 cache and ppl might be similar, but what we really need
to share is the hooks to raid5 core.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
From: Shaohua Li @ 2017-02-07 21:49 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <20170130185953.30428-7-artur.paszkiewicz@intel.com>

On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
> Add 'consistency_policy' attribute for array. It indicates how the array
> maintains consistency in case of unexpected shutdown.
> 
> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
> and size of the PPL space on the device.
> 
> These attributes are writable to allow enabling PPL for external
> metadata arrays and (later) to enable/disable PPL for a running array.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 144 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 7104ef757e73..2c153b3d798f 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -329,14 +329,14 @@ All md devices contain:
>       array creation it will default to 0, though starting the array as
>       ``clean`` will set it much larger.
>  
> -   new_dev
> +  new_dev
>       This file can be written but not read.  The value written should
>       be a block device number as major:minor.  e.g. 8:0
>       This will cause that device to be attached to the array, if it is
>       available.  It will then appear at md/dev-XXX (depending on the
>       name of the device) and further configuration is then possible.
>  
> -   safe_mode_delay
> +  safe_mode_delay
>       When an md array has seen no write requests for a certain period
>       of time, it will be marked as ``clean``.  When another write
>       request arrives, the array is marked as ``dirty`` before the write
> @@ -345,7 +345,7 @@ All md devices contain:
>       period as a number of seconds.  The default is 200msec (0.200).
>       Writing a value of 0 disables safemode.
>  
> -   array_state
> +  array_state
>       This file contains a single word which describes the current
>       state of the array.  In many cases, the state can be set by
>       writing the word for the desired state, however some states
> @@ -454,7 +454,30 @@ All md devices contain:
>       once the array becomes non-degraded, and this fact has been
>       recorded in the metadata.
>  
> +  consistency_policy
> +     This indicates how the array maintains consistency in case of unexpected
> +     shutdown. It can be:
>  
> +     none
> +       Array has no redundancy information, e.g. raid0, linear.
> +
> +     resync
> +       Full resync is performed and all redundancy is regenerated when the
> +       array is started after unclean shutdown.
> +
> +     bitmap
> +       Resync assisted by a write-intent bitmap.
> +
> +     journal
> +       For raid4/5/6, journal device is used to log transactions and replay
> +       after unclean shutdown.
> +
> +     ppl
> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
> +       resync.
> +
> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> +     used to enable and disable PPL.
>  
>  
>  As component devices are added to an md array, they appear in the ``md``
> @@ -616,6 +639,9 @@ Each directory contains:
>  	adds bad blocks without acknowledging them. This is largely
>  	for testing.
>  
> +      ppl_sector, ppl_size
> +        Location and size (in sectors) of the space used for Partial Parity Log
> +        on this device.
>  
>  
>  An active md device will also contain an entry for each active device
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e96f73572e23..47ab3afe87cb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>  
> +static ssize_t
> +ppl_sector_show(struct md_rdev *rdev, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
> +}
> +
> +static ssize_t
> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
> +{
> +	unsigned long long sector;
> +
> +	if (kstrtoull(buf, 10, &sector) < 0)
> +		return -EINVAL;
> +	if (sector != (sector_t)sector)
> +	        return -EINVAL;
> +
> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> +	    rdev->raid_disk >= 0)
> +		return -EBUSY;
> +
> +	if (rdev->mddev->persistent) {
> +		if (rdev->mddev->major_version == 0)
> +			return -EINVAL;
> +		if ((sector > rdev->sb_start &&
> +		     sector - rdev->sb_start > S16_MAX) ||
> +		    (sector < rdev->sb_start &&
> +		     rdev->sb_start - sector > -S16_MIN))
> +			return -EINVAL;
> +		rdev->ppl.offset = sector - rdev->sb_start;

Why do we allow non-external array changes the ppl log position and size?

> +	} else if (!rdev->mddev->external) {
> +		return -EBUSY;
> +	}
> +	rdev->ppl.sector = sector;
> +	return len;
> +}
> +
> +static struct rdev_sysfs_entry rdev_ppl_sector =
> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
> +
> +static ssize_t
> +ppl_size_show(struct md_rdev *rdev, char *page)
> +{
> +	return sprintf(page, "%u\n", rdev->ppl.size);
> +}
> +
> +static ssize_t
> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
> +{
> +	unsigned int size;
> +
> +	if (kstrtouint(buf, 10, &size) < 0)
> +		return -EINVAL;
> +
> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> +	    rdev->raid_disk >= 0)
> +		return -EBUSY;
> +
> +	if (rdev->mddev->persistent) {
> +		if (rdev->mddev->major_version == 0)
> +			return -EINVAL;
> +		if (size > U16_MAX)
> +			return -EINVAL;
> +	} else if (!rdev->mddev->external) {
> +		return -EBUSY;
> +	}
> +	rdev->ppl.size = size;
> +	return len;
> +}
> +
> +static struct rdev_sysfs_entry rdev_ppl_size =
> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
> +
>  static struct attribute *rdev_default_attrs[] = {
>  	&rdev_state.attr,
>  	&rdev_errors.attr,
> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
>  	&rdev_recovery_start.attr,
>  	&rdev_bad_blocks.attr,
>  	&rdev_unack_bad_blocks.attr,
> +	&rdev_ppl_sector.attr,
> +	&rdev_ppl_size.attr,
>  	NULL,
>  };
>  static ssize_t
> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
>         array_size_store);
>  
> +static ssize_t
> +consistency_policy_show(struct mddev *mddev, char *page)
> +{
> +	int ret;
> +
> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> +		ret = sprintf(page, "journal\n");
> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
> +		ret = sprintf(page, "ppl\n");
> +	} else if (mddev->bitmap) {
> +		ret = sprintf(page, "bitmap\n");
> +	} else if (mddev->pers) {
> +		if (mddev->pers->sync_request)
> +			ret = sprintf(page, "resync\n");
> +		else
> +			ret = sprintf(page, "none\n");
> +	} else {
> +		ret = sprintf(page, "unknown\n");
> +	}
> +
> +	return ret;
> +}
> +
> +static ssize_t
> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +	if (mddev->pers) {
> +		return -EBUSY;
> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
> +		set_bit(MD_HAS_PPL, &mddev->flags);
> +		return len;

Here we should check if ppl position and size are valid before we set the flag.

And shouldn't we move this part to .change_consistency_policy added in patch 9?
This is a consistency policy change too.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH V2] MD: add doc for raid5-cache
From: Shaohua Li @ 2017-02-08  1:03 UTC (permalink / raw)
  To: Anthony Youngman
  Cc: Shaohua Li, linux-raid, philip, songliubraving, neilb,
	jure.erznoznik, rramesh2400
In-Reply-To: <c2c87a20-05b5-f769-8eb6-4d7b164dfad6@youngman.org.uk>

On Mon, Feb 06, 2017 at 09:02:32PM +0000, Anthony Youngman wrote:
> 
> 
> On 06/02/17 19:23, Shaohua Li wrote:
> > I'm starting document of the raid5-cache feature. Please note this is a
> > kernel doc instead of a mdadm manual, so I don't add the details about
> > how to use the feature in mdadm side. Please let me know what else we
> > should put into the document. Of course, comments are welcome!
> > 
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  Documentation/md/raid5-cache.txt | 109 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 109 insertions(+)
> >  create mode 100644 Documentation/md/raid5-cache.txt
> 
> Note that the kernel documentation is moving over to a new format - .rst.
> They're using a new system called Sphinx. There was an article on lwn about
> it. I've lost the reference to lwn, but if you look at
> https://www.kernel.org/doc/html/latest/ you will see that it is a match to
> Documentation/index.rst.
> 
> So of course, converting all the md stuff has made its way onto my "to do"
> list, but I don't want to start until I've got a converted kernel running on
> my system. I'm currently running 4.4.6 and I think it came out with
> something like 4.6. Problem is, running gentoo, my system is well out of
> date because I need a helper system to get me to upgrade and going from KDE4
> to KDE5 is likely to give me grief :-)
> 
> You're probably working with an up-to-date kernel, but it's up to you - do
> you want to create a .rst file, or submit it as .txt and let me convert it
> and go through the grief of working out how to do it :-). Conversion is
> probably rather more than just converting the one file, it'll need
> converting the directory structure too, I expect. Any problems, Jon Corbet's
> probably our go-to guy.

Yep, I saw some directories are converted to .rst, but most not yet. I'll
commit this as-is if nobody objects, being lazy to learn/install the rst
stuffes :). It would be great you can help convert the md directory to new
format, but I don't think it's in a hurry.

Thanks,
Shaohua

^ permalink raw reply

* Re: [systemd-devel] Errorneous detection of degraded array
From: NeilBrown @ 2017-02-08  4:10 UTC (permalink / raw)
  To: Andrei Borzenkov
  Cc: Luke Pyzowski, systemd-devel@lists.freedesktop.org, linux-raid
In-Reply-To: <e32430d2-5f59-40af-6027-715dfde1045d@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2681 bytes --]

On Tue, Jan 31 2017, Andrei Borzenkov wrote:

>
>>>> Changing the
>>>>   Conflicts=sys-devices-virtual-block-%i.device
>>>> lines to
>>>>   ConditionPathExists=/sys/devices/virtual/block/%i
>>>> might make the problem go away, without any negative consequences.
>>>>
>>>
>>> Ugly, but yes, may be this is the only way using current systemd.
>>>
>
> This won't work. sysfs node appears as soon as the very first array
> member is found and array is still inactive, while what we need is
> condition "array is active".

Of course, you are right.
A suitable "array is active" test is the existence of
  .../md/sync_action
which appears when an array is activated (except for RAID0 and Linear,
which don't need last-resort support).

So this is what I propose to post upstream.  Could you please confirm
that it works for you?  It appears to work for me.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Wed, 8 Feb 2017 15:01:05 +1100
Subject: [PATCH] systemd/mdadm-last-resort: use ConditionPathExists instead of
 Conflicts

Commit cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")

added a 'Conflicts' directive to the mdadm-last-resort@.service file in
the hope that this would make sure the service didn't run after the device
was active, even if the timer managed to get started, which is possible in
race conditions.

This seemed to work is testing, but it isn't clear why, and it is known
to cause problems.
If systemd happens to know that the mentioned device is a dependency of a
mount point, the Conflicts can unmount that mountpoint, which is certainly
not wanted.

So remove the "Conflicts" and instead use
 ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

The "sync_action" file exists for any array which require last-resort
handling, and only appear when the array is activated.  So it is safe
to reliy on it to determine if the last-resort is really needed.

Fixes: cec72c071bbe ("systemd/mdadm-last-resort: add Conflicts to .service file.")
Signed-off-by: NeilBrown <neilb@suse.com>
---
 systemd/mdadm-last-resort@.service | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/systemd/mdadm-last-resort@.service b/systemd/mdadm-last-resort@.service
index e93d72b2b45e..f9d4d12738a3 100644
--- a/systemd/mdadm-last-resort@.service
+++ b/systemd/mdadm-last-resort@.service
@@ -1,7 +1,7 @@
 [Unit]
 Description=Activate md array even though degraded
 DefaultDependencies=no
-Conflicts=sys-devices-virtual-block-%i.device
+ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

 [Service]
 Type=oneshot
-- 
2.11.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH 1/5] MD: attach data to each bio
From: Shaohua Li @ 2017-02-08  5:24 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: Shaohua Li, linux-raid, khlebnikov, hch, neilb
In-Reply-To: <589ADFEC.1070902@suse.com>

On Wed, Feb 08, 2017 at 05:07:56PM +0800, Guoqing Jiang wrote:
> 
> 
> On 02/08/2017 12:55 AM, Shaohua Li wrote:
> > Currently MD is rebusing some bio fields. To remove the hack, we attach
> > extra data to each bio. Each personablity can attach extra data to the
> > bios, so we don't need to rebuse bio fields.
> 
> rebuse? Maybe it should be reuse, or abuse.

oops, this is a typo, should be abuse 
> [snip]
> 
> > +static inline void *md_get_per_bio_data(struct bio *bio)
> > +{
> > +	return ((struct md_per_bio_data *)bio->bi_private) + 1;
> > +}
> > +
> > +extern void md_bio_attach_data(struct mddev *mddev, struct bio *bio);
> 
> Why not make raid1_bio_ref_ptr and raid10_bio_ref_ptr as macro
> like raid5_get_bio_data? Or add inline before the two funcs since
> both of them are called frequently.

I'll add 'inline'. Actually it doesn't really matter. The compiler will inline
simple functions automatically. In this case, I'm sure compiler will do the
inline. On the other hand, 'inline' doesn't guarantee compiler will do inline
unless you use __attribute__((always_inline)).

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-02-08  5:34 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen
In-Reply-To: <43b52cb7-b9d6-f615-d7c0-d78fdbb9a0cd@intel.com>

On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
> On 02/07/2017 10:42 PM, Shaohua Li wrote:
> > On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> >> This implements the PPL write logging functionality, using the
> >> raid5-cache policy logic introduced in previous patches. The description
> >> of PPL is added to the documentation. More details can be found in the
> >> comments in raid5-ppl.c.
> >>
> >> Put the PPL metadata structures to md_p.h because userspace tools
> >> (mdadm) will also need to read/write PPL.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/admin-guide/md.rst |  53 ++++
> >>  drivers/md/Makefile              |   2 +-
> >>  drivers/md/raid5-cache.c         |  13 +-
> >>  drivers/md/raid5-cache.h         |   8 +
> >>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
> >>  drivers/md/raid5.c               |  15 +-
> >>  include/uapi/linux/raid/md_p.h   |  26 ++
> >>  7 files changed, 661 insertions(+), 7 deletions(-)
> >>  create mode 100644 drivers/md/raid5-ppl.c
> >>
> >> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >> index e449fb5f277c..7104ef757e73 100644
> >> --- a/Documentation/admin-guide/md.rst
> >> +++ b/Documentation/admin-guide/md.rst
> >> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
> >>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
> >>  that all auto-detected arrays are assembled as partitionable.
> >>  
> >> +
> >> +.. _dirty_degraded_boot:
> >> +
> >>  Boot time assembly of degraded/dirty arrays
> >>  -------------------------------------------
> >>  
> >> @@ -176,6 +179,56 @@ and its role in the array.
> >>  Once started with RUN_ARRAY, uninitialized spares can be added with
> >>  HOT_ADD_DISK.
> >>  
> >> +.. _ppl:
> >> +
> >> +Partial Parity Log
> >> +------------------
> > Please put this part to a separate file under Documentation/md. The admin-guide
> > is supposed to serve as a doc for user. The Documentation/md files will server
> > as a doc for developer.
> 
> OK, I will move it.
> 
> >> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> >> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> >> +may become inconsistent with data on other member disks. If the array is also
> >> +in degraded state, there is no way to recalculate parity, because one of the
> >> +disks is missing. This can lead to silent data corruption when rebuilding the
> >> +array or using it is as degraded - data calculated from parity for array blocks
> >> +that have not been touched by a write request during the unclean shutdown can
> >> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
> >> +this, md by default does not allow starting a dirty degraded array, see
> >> +:ref:`dirty_degraded_boot`.
> >> +
> >> +Partial parity for a write operation is the XOR of stripe data chunks not
> >> +modified by this write. It is just enough data needed for recovering from the
> >> +write hole. XORing partial parity with the modified chunks produces parity for
> >> +the stripe, consistent with its state before the write operation, regardless of
> >> +which chunk writes have completed. If one of the not modified data disks of
> >> +this stripe is missing, this updated parity can be used to recover its
> >> +contents. PPL recovery is also performed when starting an array after an
> >> +unclean shutdown and all disks are available, eliminating the need to resync
> >> +the array. Because of this, using write-intent bitmap and PPL together is not
> >> +supported.
> >> +
> >> +When handling a write request PPL writes partial parity before new data and
> >> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> >> +array member drives in the metadata area, on the parity drive of a particular
> >> +stripe.  It does not require a dedicated journaling drive. Write performance is
> >> +reduced by up to 30%-40% but it scales with the number of drives in the array
> >> +and the journaling drive does not become a bottleneck or a single point of
> >> +failure.
> >> +
> >> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> >> +not a true journal. It does not protect from losing in-flight data, only from
> >> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> >> +performed for this stripe (parity is not updated). So it is possible to have
> >> +arbitrary data in the written part of a stripe if that disk is lost. In such
> >> +case the behavior is the same as in plain raid5.
> >> +
> >> +PPL is available for md version-1 metadata and external (specifically IMSM)
> >> +metadata arrays. It can be enabled using mdadm option
> >> +``--consistency-policy=ppl``.
> >> +
> >> +Currently, volatile write-back cache should be disabled on all member drives
> >> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> >> +failure.
> > 
> > The limitation is weird and dumb. Is this an implementation limitation which
> > can be fixed later or a design limitation?
> 
> This can be fixed later and I'm planning to do this. The limitation
> results from the fact that partial parity is the xor of "old" data.
> During recovery we xor it with "new" data and that produces valid
> parity, assuming the "old" data has not changed after dirty shutdown.
> But it's not necessarily true, since the "old" data might have been only
> in cache and not on persistent storage. So fixing this will require
> making sure the old data is on persistent media before writing PPL.

so it's the implementation which doesn't flush disk cache before writting new
data to disks, right? That makes sense then. I think we should fix this soon.
Don't think people will (or remember to) disable write-back cache before using
ppl.

> >> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >> new file mode 100644
> >> index 000000000000..6bc246c80f6b
> >> --- /dev/null
> >> +++ b/drivers/md/raid5-ppl.c
> >> @@ -0,0 +1,551 @@
> >> +/*
> >> + * Partial Parity Log for closing the RAID5 write hole
> >> + * Copyright (c) 2017, Intel Corporation.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify it
> >> + * under the terms and conditions of the GNU General Public License,
> >> + * version 2, as published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope it will be useful, but WITHOUT
> >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> >> + * more details.
> >> + */
> >> +
> >> +#include <linux/kernel.h>
> >> +#include <linux/blkdev.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/crc32c.h>
> >> +#include <linux/raid/md_p.h>
> >> +#include "md.h"
> >> +#include "raid5.h"
> >> +#include "raid5-cache.h"
> >> +
> >> +/*
> >> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> >> + * partial parity data. The header contains an array of entries
> >> + * (struct ppl_header_entry) which describe the logged write requests.
> >> + * Partial parity for the entries comes after the header, written in the same
> >> + * sequence as the entries:
> >> + *
> >> + * Header
> >> + *   entry0
> >> + *   ...
> >> + *   entryN
> >> + * PP data
> >> + *   PP for entry0
> >> + *   ...
> >> + *   PP for entryN
> >> + *
> >> + * Every entry holds a checksum of its partial parity, the header also has a
> >> + * checksum of the header itself. Entries for full stripes writes contain no
> >> + * partial parity, they only mark the stripes for which parity should be
> >> + * recalculated after an unclean shutdown.
> >> + *
> >> + * A write request is always logged to the PPL instance stored on the parity
> >> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> >> + * used to handle logging for this disk, independently from others. They are
> >> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> >> + * common parent r5l_log. This parent log serves as a proxy and is used in
> >> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> >> + *
> >> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> >> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> >> + * be appended to the last entry if the chunks to write are the same, otherwise
> >> + * a new entry is added. Checksums of entries are calculated incrementally as
> >> + * stripes containing partial parity are being added to entries.
> >> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> >> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> >> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> >> + * associated with the io_unit are released and raid5d starts writing their data
> >> + * and parity. When all stripes are written, the io_unit is freed and the next
> >> + * can be submitted.
> >> + *
> >> + * An io_unit is used to gather stripes until it is submitted or becomes full
> >> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> >> + * can't be submitted until the previous has completed (PPL and stripe
> >> + * data+parity is written). The log->running_ios list tracks all io_units of
> >> + * a log (for a single member disk). New io_units are added to the end of the
> >> + * list and the first io_unit is submitted, if it is not submitted already.
> >> + * The current io_unit accepting new stripes is always the last on the list.
> >> + */
> > 
> > I don't like the way that io_unit and r5l_log are reused here. The code below
> > shares very little with raid5-cache. There are a lot of fields in the data
> > structures which are not used by ppl. It's not a good practice to share data
> > structure like this way, so please define ppl its own data structure. The
> > workflow between raid5 cache and ppl might be similar, but what we really need
> > to share is the hooks to raid5 core.
> 
> Reusing the structures seemed like a good idea at first, but now I must
> admit you are right, especially since the structures were extended to
> support write-back caching. One instance of r5l_log will still be
> necessary though, to use for the hooks. Its private field will be used
> for holding the ppl structures. Is that ok?

I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
can store whatever data in the log_private. And convert all callbacks to use a
'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
data structure.

Thanks,
Shaohua

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox