Linux cryptographic layer development
 help / color / mirror / Atom feed
* [PATCH 03/14] crypto: caam - desc.h fixes
From: Horia Geantă @ 2016-11-09  8:46 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto
In-Reply-To: <1478681184-9442-1-git-send-email-horia.geanta@nxp.com>

1. fix HDR_START_IDX_MASK, HDR_SD_SHARE_MASK, HDR_JD_SHARE_MASK
Define HDR_START_IDX_MASK consistently with the other masks:
mask = bitmask << offset

2. OP_ALG_TYPE_CLASS1 and OP_ALG_TYPE_CLASS2 must be shifted.

3. fix FIFO_STORE output data type value for AFHA S-Box

4. fix OPERATION pkha modular arithmetic source mask

5. rename LDST_SRCDST_WORD_CLASS1_ICV_SZ to
LDST_SRCDST_WORD_CLASS1_IV_SZ (it refers to IV, not ICV).

Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
---
 drivers/crypto/caam/desc.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/caam/desc.h b/drivers/crypto/caam/desc.h
index 513b6646bb36..61059abef737 100644
--- a/drivers/crypto/caam/desc.h
+++ b/drivers/crypto/caam/desc.h
@@ -90,8 +90,8 @@ struct sec4_sg_entry {
 #define HDR_ZRO			0x00008000
 
 /* Start Index or SharedDesc Length */
-#define HDR_START_IDX_MASK	0x3f
 #define HDR_START_IDX_SHIFT	16
+#define HDR_START_IDX_MASK	(0x3f << HDR_START_IDX_SHIFT)
 
 /* If shared descriptor header, 6-bit length */
 #define HDR_DESCLEN_SHR_MASK	0x3f
@@ -121,10 +121,10 @@ struct sec4_sg_entry {
 #define HDR_PROP_DNR		0x00000800
 
 /* JobDesc/SharedDesc share property */
-#define HDR_SD_SHARE_MASK	0x03
 #define HDR_SD_SHARE_SHIFT	8
-#define HDR_JD_SHARE_MASK	0x07
+#define HDR_SD_SHARE_MASK	(0x03 << HDR_SD_SHARE_SHIFT)
 #define HDR_JD_SHARE_SHIFT	8
+#define HDR_JD_SHARE_MASK	(0x07 << HDR_JD_SHARE_SHIFT)
 
 #define HDR_SHARE_NEVER		(0x00 << HDR_SD_SHARE_SHIFT)
 #define HDR_SHARE_WAIT		(0x01 << HDR_SD_SHARE_SHIFT)
@@ -235,7 +235,7 @@ struct sec4_sg_entry {
 #define LDST_SRCDST_WORD_DECO_MATH2	(0x0a << LDST_SRCDST_SHIFT)
 #define LDST_SRCDST_WORD_DECO_AAD_SZ	(0x0b << LDST_SRCDST_SHIFT)
 #define LDST_SRCDST_WORD_DECO_MATH3	(0x0b << LDST_SRCDST_SHIFT)
-#define LDST_SRCDST_WORD_CLASS1_ICV_SZ	(0x0c << LDST_SRCDST_SHIFT)
+#define LDST_SRCDST_WORD_CLASS1_IV_SZ	(0x0c << LDST_SRCDST_SHIFT)
 #define LDST_SRCDST_WORD_ALTDS_CLASS1	(0x0f << LDST_SRCDST_SHIFT)
 #define LDST_SRCDST_WORD_PKHA_A_SZ	(0x10 << LDST_SRCDST_SHIFT)
 #define LDST_SRCDST_WORD_PKHA_B_SZ	(0x11 << LDST_SRCDST_SHIFT)
@@ -400,7 +400,7 @@ struct sec4_sg_entry {
 #define FIFOST_TYPE_PKHA_N	 (0x08 << FIFOST_TYPE_SHIFT)
 #define FIFOST_TYPE_PKHA_A	 (0x0c << FIFOST_TYPE_SHIFT)
 #define FIFOST_TYPE_PKHA_B	 (0x0d << FIFOST_TYPE_SHIFT)
-#define FIFOST_TYPE_AF_SBOX_JKEK (0x10 << FIFOST_TYPE_SHIFT)
+#define FIFOST_TYPE_AF_SBOX_JKEK (0x20 << FIFOST_TYPE_SHIFT)
 #define FIFOST_TYPE_AF_SBOX_TKEK (0x21 << FIFOST_TYPE_SHIFT)
 #define FIFOST_TYPE_PKHA_E_JKEK	 (0x22 << FIFOST_TYPE_SHIFT)
 #define FIFOST_TYPE_PKHA_E_TKEK	 (0x23 << FIFOST_TYPE_SHIFT)
@@ -1107,8 +1107,8 @@ struct sec4_sg_entry {
 /* For non-protocol/alg-only op commands */
 #define OP_ALG_TYPE_SHIFT	24
 #define OP_ALG_TYPE_MASK	(0x7 << OP_ALG_TYPE_SHIFT)
-#define OP_ALG_TYPE_CLASS1	2
-#define OP_ALG_TYPE_CLASS2	4
+#define OP_ALG_TYPE_CLASS1	(2 << OP_ALG_TYPE_SHIFT)
+#define OP_ALG_TYPE_CLASS2	(4 << OP_ALG_TYPE_SHIFT)
 
 #define OP_ALG_ALGSEL_SHIFT	16
 #define OP_ALG_ALGSEL_MASK	(0xff << OP_ALG_ALGSEL_SHIFT)
@@ -1249,7 +1249,7 @@ struct sec4_sg_entry {
 #define OP_ALG_PKMODE_MOD_PRIMALITY	0x00f
 
 /* PKHA mode copy-memory functions */
-#define OP_ALG_PKMODE_SRC_REG_SHIFT	13
+#define OP_ALG_PKMODE_SRC_REG_SHIFT	17
 #define OP_ALG_PKMODE_SRC_REG_MASK	(7 << OP_ALG_PKMODE_SRC_REG_SHIFT)
 #define OP_ALG_PKMODE_DST_REG_SHIFT	10
 #define OP_ALG_PKMODE_DST_REG_MASK	(7 << OP_ALG_PKMODE_DST_REG_SHIFT)
-- 
2.4.4

^ permalink raw reply related

* Re: [PATCH 11/14] Revert "crypto: caam - get rid of tasklet"
From: Russell King - ARM Linux @ 2016-11-09  8:53 UTC (permalink / raw)
  To: Horia Geantă, Thomas Gleixner
  Cc: Herbert Xu, David S. Miller, linux-crypto
In-Reply-To: <1478681184-9442-12-git-send-email-horia.geanta@nxp.com>

Please include Thomas in this.

On Wed, Nov 09, 2016 at 10:46:21AM +0200, Horia Geantă wrote:
> This reverts commit 66d2e2028091a074aa1290d2eeda5ddb1a6c329c.
> 
> Quoting from Russell's findings:
> https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg21136.html
> 
> [quote]
> Okay, I've re-tested, using a different way of measuring, because using
> openssl speed is impractical for off-loaded engines.  I've decided to
> use this way to measure the performance:
> 
> dd if=/dev/zero bs=1048576 count=128 | /usr/bin/time openssl dgst -md5
> 
> For the threaded IRQs case gives:
> 
> 0.05user 2.74system 0:05.30elapsed 52%CPU (0avgtext+0avgdata 2400maxresident)k
> 0.06user 2.52system 0:05.18elapsed 49%CPU (0avgtext+0avgdata 2404maxresident)k
> 0.12user 2.60system 0:05.61elapsed 48%CPU (0avgtext+0avgdata 2460maxresident)k
> 	=> 5.36s => 25.0MB/s
> 
> and the tasklet case:
> 
> 0.08user 2.53system 0:04.83elapsed 54%CPU (0avgtext+0avgdata 2468maxresident)k
> 0.09user 2.47system 0:05.16elapsed 49%CPU (0avgtext+0avgdata 2368maxresident)k
> 0.10user 2.51system 0:04.87elapsed 53%CPU (0avgtext+0avgdata 2460maxresident)k
> 	=> 4.95 => 27.1MB/s
> 
> which corresponds to an 8% slowdown for the threaded IRQ case.  So,
> tasklets are indeed faster than threaded IRQs.
> 
> [...]
> 
> I think I've proven from the above that this patch needs to be reverted
> due to the performance regression, and that there _is_ most definitely
> a deterimental effect of switching from tasklets to threaded IRQs.
> [/quote]
> 
> Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
> ---
>  drivers/crypto/caam/intern.h |  1 +
>  drivers/crypto/caam/jr.c     | 25 ++++++++++++++++---------
>  2 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/crypto/caam/intern.h b/drivers/crypto/caam/intern.h
> index 5d4c05074a5c..e2bcacc1a921 100644
> --- a/drivers/crypto/caam/intern.h
> +++ b/drivers/crypto/caam/intern.h
> @@ -41,6 +41,7 @@ struct caam_drv_private_jr {
>  	struct device		*dev;
>  	int ridx;
>  	struct caam_job_ring __iomem *rregs;	/* JobR's register space */
> +	struct tasklet_struct irqtask;
>  	int irq;			/* One per queue */
>  
>  	/* Number of scatterlist crypt transforms active on the JobR */
> diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c
> index 7331ea734f37..c8604dfadbf5 100644
> --- a/drivers/crypto/caam/jr.c
> +++ b/drivers/crypto/caam/jr.c
> @@ -73,6 +73,8 @@ static int caam_jr_shutdown(struct device *dev)
>  
>  	ret = caam_reset_hw_jr(dev);
>  
> +	tasklet_kill(&jrp->irqtask);
> +
>  	/* Release interrupt */
>  	free_irq(jrp->irq, dev);
>  
> @@ -128,7 +130,7 @@ static irqreturn_t caam_jr_interrupt(int irq, void *st_dev)
>  
>  	/*
>  	 * Check the output ring for ready responses, kick
> -	 * the threaded irq if jobs done.
> +	 * tasklet if jobs done.
>  	 */
>  	irqstate = rd_reg32(&jrp->rregs->jrintstatus);
>  	if (!irqstate)
> @@ -150,13 +152,18 @@ static irqreturn_t caam_jr_interrupt(int irq, void *st_dev)
>  	/* Have valid interrupt at this point, just ACK and trigger */
>  	wr_reg32(&jrp->rregs->jrintstatus, irqstate);
>  
> -	return IRQ_WAKE_THREAD;
> +	preempt_disable();
> +	tasklet_schedule(&jrp->irqtask);
> +	preempt_enable();
> +
> +	return IRQ_HANDLED;
>  }
>  
> -static irqreturn_t caam_jr_threadirq(int irq, void *st_dev)
> +/* Deferred service handler, run as interrupt-fired tasklet */
> +static void caam_jr_dequeue(unsigned long devarg)
>  {
>  	int hw_idx, sw_idx, i, head, tail;
> -	struct device *dev = st_dev;
> +	struct device *dev = (struct device *)devarg;
>  	struct caam_drv_private_jr *jrp = dev_get_drvdata(dev);
>  	void (*usercall)(struct device *dev, u32 *desc, u32 status, void *arg);
>  	u32 *userdesc, userstatus;
> @@ -230,8 +237,6 @@ static irqreturn_t caam_jr_threadirq(int irq, void *st_dev)
>  
>  	/* reenable / unmask IRQs */
>  	clrsetbits_32(&jrp->rregs->rconfig_lo, JRCFG_IMSK, 0);
> -
> -	return IRQ_HANDLED;
>  }
>  
>  /**
> @@ -389,10 +394,11 @@ static int caam_jr_init(struct device *dev)
>  
>  	jrp = dev_get_drvdata(dev);
>  
> +	tasklet_init(&jrp->irqtask, caam_jr_dequeue, (unsigned long)dev);
> +
>  	/* Connect job ring interrupt handler. */
> -	error = request_threaded_irq(jrp->irq, caam_jr_interrupt,
> -				     caam_jr_threadirq, IRQF_SHARED,
> -				     dev_name(dev), dev);
> +	error = request_irq(jrp->irq, caam_jr_interrupt, IRQF_SHARED,
> +			    dev_name(dev), dev);
>  	if (error) {
>  		dev_err(dev, "can't connect JobR %d interrupt (%d)\n",
>  			jrp->ridx, jrp->irq);
> @@ -454,6 +460,7 @@ static int caam_jr_init(struct device *dev)
>  out_free_irq:
>  	free_irq(jrp->irq, dev);
>  out_kill_deq:
> +	tasklet_kill(&jrp->irqtask);
>  	return error;
>  }
>  
> -- 
> 2.4.4
> 

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* [PATCH 04/14] crypto: caam - fix sparse warnings
From: Horia Geantă @ 2016-11-09  8:46 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto
In-Reply-To: <1478681184-9442-1-git-send-email-horia.geanta@nxp.com>

Fix the following sparse warning (note that endianness issues
are not not addressed in current patch):

drivers/crypto/caam/ctrl.c:388:24: warning: incorrect type in argument 1 (different address spaces)
drivers/crypto/caam/ctrl.c:388:24:    expected void [noderef] <asn:2>*reg
drivers/crypto/caam/ctrl.c:388:24:    got unsigned int *<noident>
drivers/crypto/caam/ctrl.c:390:24: warning: incorrect type in argument 1 (different address spaces)
drivers/crypto/caam/ctrl.c:390:24:    expected void [noderef] <asn:2>*reg
drivers/crypto/caam/ctrl.c:390:24:    got unsigned int *<noident>
drivers/crypto/caam/ctrl.c:548:24: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:548:24:    expected struct caam_ctrl [noderef] <asn:2>*ctrl
drivers/crypto/caam/ctrl.c:548:24:    got struct caam_ctrl *<noident>
drivers/crypto/caam/ctrl.c:550:30: warning: cast removes address space of expression
drivers/crypto/caam/ctrl.c:549:26: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:549:26:    expected struct caam_assurance [noderef] <asn:2>*assure
drivers/crypto/caam/ctrl.c:549:26:    got struct caam_assurance *<noident>
drivers/crypto/caam/ctrl.c:554:28: warning: cast removes address space of expression
drivers/crypto/caam/ctrl.c:553:24: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:553:24:    expected struct caam_deco [noderef] <asn:2>*deco
drivers/crypto/caam/ctrl.c:553:24:    got struct caam_deco *<noident>
drivers/crypto/caam/ctrl.c:634:48: warning: cast removes address space of expression
drivers/crypto/caam/ctrl.c:633:44: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:633:44:    expected struct caam_job_ring [noderef] <asn:2>*<noident>
drivers/crypto/caam/ctrl.c:633:44:    got struct caam_job_ring *<noident>
drivers/crypto/caam/ctrl.c:648:34: warning: cast removes address space of expression
drivers/crypto/caam/ctrl.c:647:30: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:647:30:    expected struct caam_queue_if [noderef] <asn:2>*qi
drivers/crypto/caam/ctrl.c:647:30:    got struct caam_queue_if *<noident>
drivers/crypto/caam/ctrl.c:806:37: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:806:37:    expected void *data
drivers/crypto/caam/ctrl.c:806:37:    got unsigned int [noderef] <asn:2>*
drivers/crypto/caam/ctrl.c:814:38: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:814:38:    expected void *data
drivers/crypto/caam/ctrl.c:814:38:    got unsigned int [noderef] <asn:2>*
drivers/crypto/caam/ctrl.c:822:38: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/ctrl.c:822:38:    expected void *data
drivers/crypto/caam/ctrl.c:822:38:    got unsigned int [noderef] <asn:2>*
drivers/crypto/caam/jr.c:492:23: warning: incorrect type in assignment (different address spaces)
drivers/crypto/caam/jr.c:492:23:    expected struct caam_job_ring [noderef] <asn:2>*rregs
drivers/crypto/caam/jr.c:492:23:    got struct caam_job_ring *<noident>
drivers/crypto/caam/caampkc.c:398:35: warning: Using plain integer as NULL pointer
drivers/crypto/caam/caampkc.c:444:35: warning: Using plain integer as NULL pointer

Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
---
 drivers/crypto/caam/caampkc.c |  4 ++--
 drivers/crypto/caam/ctrl.c    | 40 +++++++++++++++++-----------------------
 drivers/crypto/caam/jr.c      |  2 +-
 3 files changed, 20 insertions(+), 26 deletions(-)

diff --git a/drivers/crypto/caam/caampkc.c b/drivers/crypto/caam/caampkc.c
index 851015e652b8..32100c4851dd 100644
--- a/drivers/crypto/caam/caampkc.c
+++ b/drivers/crypto/caam/caampkc.c
@@ -395,7 +395,7 @@ static int caam_rsa_set_pub_key(struct crypto_akcipher *tfm, const void *key,
 				unsigned int keylen)
 {
 	struct caam_rsa_ctx *ctx = akcipher_tfm_ctx(tfm);
-	struct rsa_key raw_key = {0};
+	struct rsa_key raw_key = {NULL};
 	struct caam_rsa_key *rsa_key = &ctx->key;
 	int ret;
 
@@ -441,7 +441,7 @@ static int caam_rsa_set_priv_key(struct crypto_akcipher *tfm, const void *key,
 				 unsigned int keylen)
 {
 	struct caam_rsa_ctx *ctx = akcipher_tfm_ctx(tfm);
-	struct rsa_key raw_key = {0};
+	struct rsa_key raw_key = {NULL};
 	struct caam_rsa_key *rsa_key = &ctx->key;
 	int ret;
 
diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c
index a79937d68c26..be62a7f482ac 100644
--- a/drivers/crypto/caam/ctrl.c
+++ b/drivers/crypto/caam/ctrl.c
@@ -365,11 +365,8 @@ static void kick_trng(struct platform_device *pdev, int ent_delay)
 	 */
 	val = (rd_reg32(&r4tst->rtsdctl) & RTSDCTL_ENT_DLY_MASK)
 	      >> RTSDCTL_ENT_DLY_SHIFT;
-	if (ent_delay <= val) {
-		/* put RNG4 into run mode */
-		clrsetbits_32(&r4tst->rtmctl, RTMCTL_PRGM, 0);
-		return;
-	}
+	if (ent_delay <= val)
+		goto start_rng;
 
 	val = rd_reg32(&r4tst->rtsdctl);
 	val = (val & ~RTSDCTL_ENT_DLY_MASK) |
@@ -381,15 +378,12 @@ static void kick_trng(struct platform_device *pdev, int ent_delay)
 	wr_reg32(&r4tst->rtfrqmax, RTFRQMAX_DISABLE);
 	/* read the control register */
 	val = rd_reg32(&r4tst->rtmctl);
+start_rng:
 	/*
 	 * select raw sampling in both entropy shifter
-	 * and statistical checker
+	 * and statistical checker; ; put RNG4 into run mode
 	 */
-	clrsetbits_32(&val, 0, RTMCTL_SAMP_MODE_RAW_ES_SC);
-	/* put RNG4 into run mode */
-	clrsetbits_32(&val, RTMCTL_PRGM, 0);
-	/* write back the control register */
-	wr_reg32(&r4tst->rtmctl, val);
+	clrsetbits_32(&r4tst->rtmctl, RTMCTL_PRGM, RTMCTL_SAMP_MODE_RAW_ES_SC);
 }
 
 /**
@@ -545,13 +539,13 @@ static int caam_probe(struct platform_device *pdev)
 	else
 		BLOCK_OFFSET = PG_SIZE_64K;
 
-	ctrlpriv->ctrl = (struct caam_ctrl __force *)ctrl;
-	ctrlpriv->assure = (struct caam_assurance __force *)
-			   ((uint8_t *)ctrl +
+	ctrlpriv->ctrl = (struct caam_ctrl __iomem __force *)ctrl;
+	ctrlpriv->assure = (struct caam_assurance __iomem __force *)
+			   ((__force uint8_t *)ctrl +
 			    BLOCK_OFFSET * ASSURE_BLOCK_NUMBER
 			   );
-	ctrlpriv->deco = (struct caam_deco __force *)
-			 ((uint8_t *)ctrl +
+	ctrlpriv->deco = (struct caam_deco __iomem __force *)
+			 ((__force uint8_t *)ctrl +
 			 BLOCK_OFFSET * DECO_BLOCK_NUMBER
 			 );
 
@@ -630,8 +624,8 @@ static int caam_probe(struct platform_device *pdev)
 					ring);
 				continue;
 			}
-			ctrlpriv->jr[ring] = (struct caam_job_ring __force *)
-					     ((uint8_t *)ctrl +
+			ctrlpriv->jr[ring] = (struct caam_job_ring __iomem __force *)
+					     ((__force uint8_t *)ctrl +
 					     (ring + JR_BLOCK_NUMBER) *
 					      BLOCK_OFFSET
 					     );
@@ -644,8 +638,8 @@ static int caam_probe(struct platform_device *pdev)
 			!!(rd_reg32(&ctrl->perfmon.comp_parms_ms) &
 			   CTPR_MS_QI_MASK);
 	if (ctrlpriv->qi_present) {
-		ctrlpriv->qi = (struct caam_queue_if __force *)
-			       ((uint8_t *)ctrl +
+		ctrlpriv->qi = (struct caam_queue_if __iomem __force *)
+			       ((__force uint8_t *)ctrl +
 				 BLOCK_OFFSET * QI_BLOCK_NUMBER
 			       );
 		/* This is all that's required to physically enable QI */
@@ -803,7 +797,7 @@ static int caam_probe(struct platform_device *pdev)
 				    &caam_fops_u32_ro);
 
 	/* Internal covering keys (useful in non-secure mode only) */
-	ctrlpriv->ctl_kek_wrap.data = &ctrlpriv->ctrl->kek[0];
+	ctrlpriv->ctl_kek_wrap.data = (__force void *)&ctrlpriv->ctrl->kek[0];
 	ctrlpriv->ctl_kek_wrap.size = KEK_KEY_SIZE * sizeof(u32);
 	ctrlpriv->ctl_kek = debugfs_create_blob("kek",
 						S_IRUSR |
@@ -811,7 +805,7 @@ static int caam_probe(struct platform_device *pdev)
 						ctrlpriv->ctl,
 						&ctrlpriv->ctl_kek_wrap);
 
-	ctrlpriv->ctl_tkek_wrap.data = &ctrlpriv->ctrl->tkek[0];
+	ctrlpriv->ctl_tkek_wrap.data = (__force void *)&ctrlpriv->ctrl->tkek[0];
 	ctrlpriv->ctl_tkek_wrap.size = KEK_KEY_SIZE * sizeof(u32);
 	ctrlpriv->ctl_tkek = debugfs_create_blob("tkek",
 						 S_IRUSR |
@@ -819,7 +813,7 @@ static int caam_probe(struct platform_device *pdev)
 						 ctrlpriv->ctl,
 						 &ctrlpriv->ctl_tkek_wrap);
 
-	ctrlpriv->ctl_tdsk_wrap.data = &ctrlpriv->ctrl->tdsk[0];
+	ctrlpriv->ctl_tdsk_wrap.data = (__force void *)&ctrlpriv->ctrl->tdsk[0];
 	ctrlpriv->ctl_tdsk_wrap.size = KEK_KEY_SIZE * sizeof(u32);
 	ctrlpriv->ctl_tdsk = debugfs_create_blob("tdsk",
 						 S_IRUSR |
diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c
index 757c27f9953d..7331ea734f37 100644
--- a/drivers/crypto/caam/jr.c
+++ b/drivers/crypto/caam/jr.c
@@ -489,7 +489,7 @@ static int caam_jr_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	}
 
-	jrpriv->rregs = (struct caam_job_ring __force *)ctrl;
+	jrpriv->rregs = (struct caam_job_ring __iomem __force *)ctrl;
 
 	if (sizeof(dma_addr_t) == sizeof(u64))
 		if (of_device_is_compatible(nprop, "fsl,sec-v5.0-job-ring"))
-- 
2.4.4

^ permalink raw reply related

* [PATCH 01/14] crypto: caam - fix AEAD givenc descriptors
From: Horia Geantă @ 2016-11-09  8:46 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, linux-crypto, Alex Porosanu
In-Reply-To: <1478681184-9442-1-git-send-email-horia.geanta@nxp.com>

From: Alex Porosanu <alexandru.porosanu@nxp.com>

The AEAD givenc descriptor relies on moving the IV through the
output FIFO and then back to the CTX2 for authentication. The
SEQ FIFO STORE could be scheduled before the data can be
read from OFIFO, especially since the SEQ FIFO LOAD needs
to wait for the SEQ FIFO LOAD SKIP to finish first. The
SKIP takes more time when the input is SG than when it's
a contiguous buffer. If the SEQ FIFO LOAD is not scheduled
before the STORE, the DECO will hang waiting for data
to be available in the OFIFO so it can be transferred to C2.
In order to overcome this, first force transfer of IV to C2
by starting the "cryptlen" transfer first and then starting to
store data from OFIFO to the output buffer.

Fixes: 1acebad3d8db8 ("crypto: caam - faster aead implementation")
Cc: <stable@vger.kernel.org> # 3.2+
Signed-off-by: Alex Porosanu <alexandru.porosanu@nxp.com>
Signed-off-by: Horia Geantă <horia.geanta@nxp.com>
---
 drivers/crypto/caam/caamalg.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c
index 8de85dfb1b04..5317d8cad44d 100644
--- a/drivers/crypto/caam/caamalg.c
+++ b/drivers/crypto/caam/caamalg.c
@@ -736,7 +736,9 @@ static int aead_set_sh_desc(struct crypto_aead *aead)
 
 	/* Will read cryptlen */
 	append_math_add(desc, VARSEQINLEN, SEQINLEN, REG0, CAAM_CMD_SZ);
-	aead_append_src_dst(desc, FIFOLD_TYPE_MSG1OUT2);
+	append_seq_fifo_load(desc, 0, FIFOLD_CLASS_BOTH | KEY_VLF |
+			     FIFOLD_TYPE_MSG1OUT2 | FIFOLD_TYPE_LASTBOTH);
+	append_seq_fifo_store(desc, 0, FIFOST_TYPE_MESSAGE_DATA | KEY_VLF);
 
 	/* Write ICV */
 	append_seq_store(desc, ctx->authsize, LDST_CLASS_2_CCB |
-- 
2.4.4

^ permalink raw reply related

* Re: [PATCH] crypto: rsa: rename two rsa key files
From: yjin @ 2016-11-09  6:39 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, linux-kernel, linux-crypto, jinyanjiang
In-Reply-To: <20161108120900.GA5267@gondor.apana.org.au>

Thanks for Herbert's reminder.
I have drop this patch in a previous mail.

Regards!
Yanjiang

On 2016年11月08日 20:09, Herbert Xu wrote:
> yanjiang.jin@windriver.com wrote:
>> From: Yanjiang Jin <yanjiang.jin@windriver.com>
>>
>> This is to eliminate the below compile error:
>>
>> crypto/rsa_helper.c:19:29: fatal error: rsaprivkey-asn1.h: No such file or directory
>> #include "rsaprivkey-asn1.h"
>>                              ^
>> compilation terminated.
>>
>> Signed-off-by: Yanjiang Jin <yanjiang.jin@windriver.com>
> This patch is bogus.  The header files are meant to be generated.
> Please find out why they are not being generated in your case.
>
> Thanks,

^ permalink raw reply

* [PATCH RESEND] crypto: gf128mul - remove dead gf128mul_64k_lle code
From: Alex Cope @ 2016-11-09  1:16 UTC (permalink / raw)
  To: linux-crypto; +Cc: Alex Cope

This code is unlikely to be useful in the future because transforms
don't know how often keys will be changed, new algorithms are unlikely
to use lle representation, and tables should be replaced with
carryless multiplication instructions when available.

Signed-off-by: Alex Cope <alexcope@google.com>
---
 crypto/gf128mul.c         | 55 -----------------------------------------------
 include/crypto/gf128mul.h | 13 ++++++-----
 2 files changed, 6 insertions(+), 62 deletions(-)

diff --git a/crypto/gf128mul.c b/crypto/gf128mul.c
index 5276607..57c85dd 100644
--- a/crypto/gf128mul.c
+++ b/crypto/gf128mul.c
@@ -263,48 +263,6 @@ EXPORT_SYMBOL(gf128mul_bbe);
  * t[1][BYTE] contains g*x^8*BYTE
  *  ..
  * t[15][BYTE] contains g*x^120*BYTE */
-struct gf128mul_64k *gf128mul_init_64k_lle(const be128 *g)
-{
-	struct gf128mul_64k *t;
-	int i, j, k;
-
-	t = kzalloc(sizeof(*t), GFP_KERNEL);
-	if (!t)
-		goto out;
-
-	for (i = 0; i < 16; i++) {
-		t->t[i] = kzalloc(sizeof(*t->t[i]), GFP_KERNEL);
-		if (!t->t[i]) {
-			gf128mul_free_64k(t);
-			t = NULL;
-			goto out;
-		}
-	}
-
-	t->t[0]->t[128] = *g;
-	for (j = 64; j > 0; j >>= 1)
-		gf128mul_x_lle(&t->t[0]->t[j], &t->t[0]->t[j + j]);
-
-	for (i = 0;;) {
-		for (j = 2; j < 256; j += j)
-			for (k = 1; k < j; ++k)
-				be128_xor(&t->t[i]->t[j + k],
-					  &t->t[i]->t[j], &t->t[i]->t[k]);
-
-		if (++i >= 16)
-			break;
-
-		for (j = 128; j > 0; j >>= 1) {
-			t->t[i]->t[j] = t->t[i - 1]->t[j];
-			gf128mul_x8_lle(&t->t[i]->t[j]);
-		}
-	}
-
-out:
-	return t;
-}
-EXPORT_SYMBOL(gf128mul_init_64k_lle);
-
 struct gf128mul_64k *gf128mul_init_64k_bbe(const be128 *g)
 {
 	struct gf128mul_64k *t;
@@ -357,19 +315,6 @@ void gf128mul_free_64k(struct gf128mul_64k *t)
 }
 EXPORT_SYMBOL(gf128mul_free_64k);
 
-void gf128mul_64k_lle(be128 *a, struct gf128mul_64k *t)
-{
-	u8 *ap = (u8 *)a;
-	be128 r[1];
-	int i;
-
-	*r = t->t[0]->t[ap[0]];
-	for (i = 1; i < 16; ++i)
-		be128_xor(r, r, &t->t[i]->t[ap[i]]);
-	*a = *r;
-}
-EXPORT_SYMBOL(gf128mul_64k_lle);
-
 void gf128mul_64k_bbe(be128 *a, struct gf128mul_64k *t)
 {
 	u8 *ap = (u8 *)a;
diff --git a/include/crypto/gf128mul.h b/include/crypto/gf128mul.h
index da2530e..b611aa9 100644
--- a/include/crypto/gf128mul.h
+++ b/include/crypto/gf128mul.h
@@ -181,20 +181,19 @@ static inline void gf128mul_free_4k(struct gf128mul_4k *t)
 }
 
 
-/* 64k table optimization, implemented for lle and bbe */
+/* 64k table optimization, implemented for bbe */
 
 struct gf128mul_64k {
 	struct gf128mul_4k *t[16];
 };
 
-/* first initialize with the constant factor with which you
- * want to multiply and then call gf128_64k_lle with the other
- * factor in the first argument, the table in the second and a
- * scratch register in the third. Afterwards *a = *r. */
-struct gf128mul_64k *gf128mul_init_64k_lle(const be128 *g);
+/* First initialize with the constant factor with which you
+ * want to multiply and then call gf128mul_64k_bbe with the other
+ * factor in the first argument, and the table in the second.
+ * Afterwards, the result is stored in *a.
+ */
 struct gf128mul_64k *gf128mul_init_64k_bbe(const be128 *g);
 void gf128mul_free_64k(struct gf128mul_64k *t);
-void gf128mul_64k_lle(be128 *a, struct gf128mul_64k *t);
 void gf128mul_64k_bbe(be128 *a, struct gf128mul_64k *t);
 
 #endif /* _CRYPTO_GF128MUL_H */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH] crypto: dh - Consistenly return negative error codes
From: Mat Martineau @ 2016-11-08 23:48 UTC (permalink / raw)
  To: linux-crypto, herbert; +Cc: Mat Martineau, salvatore.benedetto

Fix the single instance where a positive EINVAL was returned.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 crypto/dh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/dh.c b/crypto/dh.c
index 9d19360..ddcb528 100644
--- a/crypto/dh.c
+++ b/crypto/dh.c
@@ -118,7 +118,7 @@ static int dh_compute_value(struct kpp_request *req)
 	if (req->src) {
 		base = mpi_read_raw_from_sgl(req->src, req->src_len);
 		if (!base) {
-			ret = EINVAL;
+			ret = -EINVAL;
 			goto err_free_val;
 		}
 	} else {
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
From: Eric Biggers @ 2016-11-08 17:26 UTC (permalink / raw)
  To: Martin Willi
  Cc: Jason A. Donenfeld, Herbert Xu, David S. Miller, linux-crypto,
	linux-kernel, René van Dorst
In-Reply-To: <1478591559.5216.7.camel@strongswan.org>

On Tue, Nov 08, 2016 at 08:52:39AM +0100, Martin Willi wrote:
> 
> 
> Not sure what the exact alignment rules for key/iv are, but maybe we
> want to replace the same function in chacha20_generic.c as well?
> 
> Martin

chacha20-generic provides a blkcipher API and sets an alignmask of sizeof(u32)
- 1.  This applies to the key buffer for ->setkey() and to the mapped addresses
for the input/output buffers and IV during the blkcipher walk.  So it does not
appear to have the problems that poly1305 has.  

I do however see one small problem which is that
'u8 stream[CHACHA20_BLOCK_SIZE];' is, strictly speaking, not guaranteed to be
aligned to u32.

Eric

^ permalink raw reply

* Re: [PATCH 6/6] Add support for AEAD algos.
From: Harsh Jain @ 2016-11-08 14:21 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: dan.carpenter, herbert, linux-crypto, jlulla, atul.gupta,
	yeshaswi, hariprasad
In-Reply-To: <46208469.tzXHy1sfql@positron.chronox.de>



On 08-11-2016 18:29, Stephan Mueller wrote:
> Am Dienstag, 8. November 2016, 17:16:38 CET schrieb Harsh Jain:
>
> Hi Harsh,
>
>> On 08-11-2016 16:45, Stephan Mueller wrote:
>>> Am Donnerstag, 27. Oktober 2016, 15:36:08 CET schrieb Harsh Jain:
>>>
>>> Hi Harsh,
>>>
>>>>>> +static void chcr_verify_tag(struct aead_request *req, u8 *input, int
>>>>>> *err)
>>>>>> +{
>>>>>> +	u8 temp[SHA512_DIGEST_SIZE];
>>>>>> +	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
>>>>>> +	int authsize = crypto_aead_authsize(tfm);
>>>>>> +	struct cpl_fw6_pld *fw6_pld;
>>>>>> +	int cmp = 0;
>>>>>> +
>>>>>> +	fw6_pld = (struct cpl_fw6_pld *)input;
>>>>>> +	if ((get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_RFC4106) ||
>>>>>> +	    (get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_GCM)) {
>>>>>> +		cmp = memcmp(&fw6_pld->data[2], (fw6_pld + 1), authsize);
>>>>>> +	} else {
>>>>>> +
>>>>>> +		sg_pcopy_to_buffer(req->src, sg_nents(req->src), temp,
>>>>>> +				authsize, req->assoclen +
>>>>>> +				req->cryptlen - authsize);
>>>>> I am wondering whether the math is correct here in any case. It is
>>>>> permissible that we have an AAD size of 0 and even a zero-sized
>>>>> ciphertext. How is such scenario covered here?
>>>> Here we are trying to copy user supplied tag to local buffer(temp) for
>>>> decrypt operation only. relative index of tag in src sg list will not
>>>> change when AAD is zero and in decrypt operation cryptlen > authsize.
>>> I am just wondering where this is checked. Since all of these
>>> implementations are directly accessible from unprivileged user space, we
>>> should be careful.
>> chcr_verify_tag() will be called when req->verify is set to "VERIFY_SW", 
>> same will set in decrypt callback function of Algo(like chcr_aead_decrypt)
>> only. It will ensure calling of chcr_verify_tag() in de-crypt operation
>> only.
> I think that limiting to the decryption path may not be enough. What happens 
> if a caller sets some assoclen, but when invoking the decryption operation it 
> provides input data that is smaller than the assoclen? The API allows this 
> scenario.
If I understand correctly, in this case passed sg list will be smaller. We should return with error -EINVAL at entry point only (like create_gcm_wr), control should not reach to chcr_verify_tag().

>
> Ciao
> Stephan

^ permalink raw reply

* Re: [PATCH 6/6] Add support for AEAD algos.
From: Stephan Mueller @ 2016-11-08 12:59 UTC (permalink / raw)
  To: Harsh Jain
  Cc: dan.carpenter, herbert, linux-crypto, jlulla, atul.gupta,
	yeshaswi, hariprasad
In-Reply-To: <ab295a27-cb83-0ee7-0cf1-03d7c05b20c5@chelsio.com>

Am Dienstag, 8. November 2016, 17:16:38 CET schrieb Harsh Jain:

Hi Harsh,

> On 08-11-2016 16:45, Stephan Mueller wrote:
> > Am Donnerstag, 27. Oktober 2016, 15:36:08 CET schrieb Harsh Jain:
> > 
> > Hi Harsh,
> > 
> >>>> +static void chcr_verify_tag(struct aead_request *req, u8 *input, int
> >>>> *err)
> >>>> +{
> >>>> +	u8 temp[SHA512_DIGEST_SIZE];
> >>>> +	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
> >>>> +	int authsize = crypto_aead_authsize(tfm);
> >>>> +	struct cpl_fw6_pld *fw6_pld;
> >>>> +	int cmp = 0;
> >>>> +
> >>>> +	fw6_pld = (struct cpl_fw6_pld *)input;
> >>>> +	if ((get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_RFC4106) ||
> >>>> +	    (get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_GCM)) {
> >>>> +		cmp = memcmp(&fw6_pld->data[2], (fw6_pld + 1), authsize);
> >>>> +	} else {
> >>>> +
> >>>> +		sg_pcopy_to_buffer(req->src, sg_nents(req->src), temp,
> >>>> +				authsize, req->assoclen +
> >>>> +				req->cryptlen - authsize);
> >>> 
> >>> I am wondering whether the math is correct here in any case. It is
> >>> permissible that we have an AAD size of 0 and even a zero-sized
> >>> ciphertext. How is such scenario covered here?
> >> 
> >> Here we are trying to copy user supplied tag to local buffer(temp) for
> >> decrypt operation only. relative index of tag in src sg list will not
> >> change when AAD is zero and in decrypt operation cryptlen > authsize.
> > 
> > I am just wondering where this is checked. Since all of these
> > implementations are directly accessible from unprivileged user space, we
> > should be careful.
> chcr_verify_tag() will be called when req->verify is set to "VERIFY_SW", 
> same will set in decrypt callback function of Algo(like chcr_aead_decrypt)
> only. It will ensure calling of chcr_verify_tag() in de-crypt operation
> only.

I think that limiting to the decryption path may not be enough. What happens 
if a caller sets some assoclen, but when invoking the decryption operation it 
provides input data that is smaller than the assoclen? The API allows this 
scenario.

Ciao
Stephan

^ permalink raw reply

* Re: [PATCH] crypto: rsa: rename two rsa key files
From: Herbert Xu @ 2016-11-08 12:09 UTC (permalink / raw)
  To: yanjiang.jin; +Cc: davem, linux-kernel, linux-crypto, jinyanjiang
In-Reply-To: <1478240017-23117-1-git-send-email-yanjiang.jin@windriver.com>

yanjiang.jin@windriver.com wrote:
> From: Yanjiang Jin <yanjiang.jin@windriver.com>
> 
> This is to eliminate the below compile error:
> 
> crypto/rsa_helper.c:19:29: fatal error: rsaprivkey-asn1.h: No such file or directory
> #include "rsaprivkey-asn1.h"
>                             ^
> compilation terminated.
> 
> Signed-off-by: Yanjiang Jin <yanjiang.jin@windriver.com>

This patch is bogus.  The header files are meant to be generated.
Please find out why they are not being generated in your case.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] crypto: gf128mul - remove dead gf128mul_64k_lle code
From: Herbert Xu @ 2016-11-08 12:04 UTC (permalink / raw)
  To: Alex Cope; +Cc: linux-crypto
In-Reply-To: <CA+cSK10Q4aLK+uXtRwWuUR4_vuZZWaVadPpofNpWf9G=QkGSyg@mail.gmail.com>

Alex Cope <alexcope@google.com> wrote:
> This code is unlikely to be useful in the future because transforms
> don't know how often keys will be changed, new algorithms are unlikely
> to use lle representation, and tables should be replaced with
> carryless multiplication instructions when available.
> 
> Signed-off-by: Alex Cope <alexcope@google.com>

Your patch doesn't apply.  Please fix it and resubmit.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 6/6] Add support for AEAD algos.
From: Harsh Jain @ 2016-11-08 11:46 UTC (permalink / raw)
  To: Stephan Mueller
  Cc: dan.carpenter, herbert, linux-crypto, jlulla, atul.gupta,
	yeshaswi, hariprasad
In-Reply-To: <3617075.SV0YkRBZ96@positron.chronox.de>



On 08-11-2016 16:45, Stephan Mueller wrote:
> Am Donnerstag, 27. Oktober 2016, 15:36:08 CET schrieb Harsh Jain:
>
> Hi Harsh,
>
>>>> +static void chcr_verify_tag(struct aead_request *req, u8 *input, int
>>>> *err)
>>>> +{
>>>> +	u8 temp[SHA512_DIGEST_SIZE];
>>>> +	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
>>>> +	int authsize = crypto_aead_authsize(tfm);
>>>> +	struct cpl_fw6_pld *fw6_pld;
>>>> +	int cmp = 0;
>>>> +
>>>> +	fw6_pld = (struct cpl_fw6_pld *)input;
>>>> +	if ((get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_RFC4106) ||
>>>> +	    (get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_GCM)) {
>>>> +		cmp = memcmp(&fw6_pld->data[2], (fw6_pld + 1), authsize);
>>>> +	} else {
>>>> +
>>>> +		sg_pcopy_to_buffer(req->src, sg_nents(req->src), temp,
>>>> +				authsize, req->assoclen +
>>>> +				req->cryptlen - authsize);
>>> I am wondering whether the math is correct here in any case. It is
>>> permissible that we have an AAD size of 0 and even a zero-sized
>>> ciphertext. How is such scenario covered here?
>> Here we are trying to copy user supplied tag to local buffer(temp) for
>> decrypt operation only. relative index of tag in src sg list will not
>> change when AAD is zero and in decrypt operation cryptlen > authsize.
> I am just wondering where this is checked. Since all of these implementations 
> are directly accessible from unprivileged user space, we should be careful.
chcr_verify_tag() will be called when req->verify is set to "VERIFY_SW",  same will set in decrypt callback function of Algo(like chcr_aead_decrypt) only. It will ensure calling of chcr_verify_tag() in de-crypt operation only.


>
>>>> +		cmp = memcmp(temp, (fw6_pld + 1), authsize);
>>> I would guess in both cases memcmp should be replaced with crypto_memneq
>> Yes can be done
>>
>>>> +	}
>>>> +	if (cmp)
>>>> +		*err = -EBADMSG;
>>>> +	else
>>>> +		*err = 0;
>>> What do you think about memzero_explicit(tmp)?
>> No Idea why we needs explicitly setting of zero for local variable.  Please
>> share some online resources to understand this.
> In dumps, the stack is also produced. Yet I see that stack memory is very 
> volatile and thus will be overwritten soon. Thus my common approach for 
> sensitive data is that heap variables must be zeroized. Stack variables are 
> suggested to be zeroized. As far as I understand the code, temp will hold a 
> copy of the tag value, i.e. a public piece of information. If this is correct, 
> that I concur that a memset may not be needed after all.
Yes, temp contains user supplied tag. We can ignore memset here. I will review the other function weather they need similar memset or not.
>
> Ciao
> Stephan

^ permalink raw reply

* Re: [PATCH 6/6] Add support for AEAD algos.
From: Stephan Mueller @ 2016-11-08 11:15 UTC (permalink / raw)
  To: Harsh Jain
  Cc: dan.carpenter, herbert, linux-crypto, jlulla, atul.gupta,
	yeshaswi, hariprasad
In-Reply-To: <3420937b-51f3-cc6e-8919-d698ba087170@chelsio.com>

Am Donnerstag, 27. Oktober 2016, 15:36:08 CET schrieb Harsh Jain:

Hi Harsh,

> >> +static void chcr_verify_tag(struct aead_request *req, u8 *input, int
> >> *err)
> >> +{
> >> +	u8 temp[SHA512_DIGEST_SIZE];
> >> +	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
> >> +	int authsize = crypto_aead_authsize(tfm);
> >> +	struct cpl_fw6_pld *fw6_pld;
> >> +	int cmp = 0;
> >> +
> >> +	fw6_pld = (struct cpl_fw6_pld *)input;
> >> +	if ((get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_RFC4106) ||
> >> +	    (get_aead_subtype(tfm) == CRYPTO_ALG_SUB_TYPE_AEAD_GCM)) {
> >> +		cmp = memcmp(&fw6_pld->data[2], (fw6_pld + 1), authsize);
> >> +	} else {
> >> +
> >> +		sg_pcopy_to_buffer(req->src, sg_nents(req->src), temp,
> >> +				authsize, req->assoclen +
> >> +				req->cryptlen - authsize);
> > 
> > I am wondering whether the math is correct here in any case. It is
> > permissible that we have an AAD size of 0 and even a zero-sized
> > ciphertext. How is such scenario covered here?
> 
> Here we are trying to copy user supplied tag to local buffer(temp) for
> decrypt operation only. relative index of tag in src sg list will not
> change when AAD is zero and in decrypt operation cryptlen > authsize.

I am just wondering where this is checked. Since all of these implementations 
are directly accessible from unprivileged user space, we should be careful.

> >> +		cmp = memcmp(temp, (fw6_pld + 1), authsize);
> > 
> > I would guess in both cases memcmp should be replaced with crypto_memneq
> 
> Yes can be done
> 
> >> +	}
> >> +	if (cmp)
> >> +		*err = -EBADMSG;
> >> +	else
> >> +		*err = 0;
> > 
> > What do you think about memzero_explicit(tmp)?
> 
> No Idea why we needs explicitly setting of zero for local variable.  Please
> share some online resources to understand this.

In dumps, the stack is also produced. Yet I see that stack memory is very 
volatile and thus will be overwritten soon. Thus my common approach for 
sensitive data is that heap variables must be zeroized. Stack variables are 
suggested to be zeroized. As far as I understand the code, temp will hold a 
copy of the tag value, i.e. a public piece of information. If this is correct, 
that I concur that a memset may not be needed after all.

Ciao
Stephan

^ permalink raw reply

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
From: Martin Willi @ 2016-11-08  7:52 UTC (permalink / raw)
  To: Jason A. Donenfeld, Herbert Xu, David S. Miller, linux-crypto,
	linux-kernel, Eric Biggers, René van Dorst
In-Reply-To: <20161107194709.20457-1-Jason@zx2c4.com>


> By using the unaligned access helpers, we drastically improve
> performance on small MIPS routers that have to go through the
> exception fix-up handler for these unaligned accesses.

I couldn't measure any slowdown here, so:

Acked-by: Martin Willi <martin@strongswan.org>

> -       dctx->s[0] = le32_to_cpuvp(key +  0);
> +       dctx->s[0] = get_unaligned_le32(key +  0);

Not sure what the exact alignment rules for key/iv are, but maybe we
want to replace the same function in chacha20_generic.c as well?

Martin

^ permalink raw reply

* Re: [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
From: Eric Biggers @ 2016-11-07 20:40 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, René van Dorst
In-Reply-To: <20161107194709.20457-1-Jason@zx2c4.com>

On Mon, Nov 07, 2016 at 08:47:09PM +0100, Jason A. Donenfeld wrote:
> By using the unaligned access helpers, we drastically improve
> performance on small MIPS routers that have to go through the exception
> fix-up handler for these unaligned accesses.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---

Reviewed-by: Eric Biggers <ebiggers@google.com>

Nit: the subject line is a little unclear about what was changed.
"make generic C faster on chips with slow unaligned access" would be better.

^ permalink raw reply

* [PATCH v4] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 19:47 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, Eric Biggers, René van Dorst
  Cc: Jason A. Donenfeld
In-Reply-To: <20161107191253.17998-1-Jason@zx2c4.com>

By using the unaligned access helpers, we drastically improve
performance on small MIPS routers that have to go through the exception
fix-up handler for these unaligned accesses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..b1c2d57 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -17,6 +17,7 @@
 #include <linux/crypto.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <asm/unaligned.h>
 
 static inline u64 mlt(u64 a, u64 b)
 {
@@ -33,11 +34,6 @@ static inline u32 and(u32 v, u32 mask)
 	return v & mask;
 }
 
-static inline u32 le32_to_cpuvp(const void *p)
-{
-	return le32_to_cpup(p);
-}
-
 int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -65,19 +61,19 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
-	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
-	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
-	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
-	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
-	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
+	dctx->r[0] = (get_unaligned_le32(key +  0) >> 0) & 0x3ffffff;
+	dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
+	dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
+	dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
+	dctx->r[4] = (get_unaligned_le32(key + 12) >> 8) & 0x00fffff;
 }
 
 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
-	dctx->s[0] = le32_to_cpuvp(key +  0);
-	dctx->s[1] = le32_to_cpuvp(key +  4);
-	dctx->s[2] = le32_to_cpuvp(key +  8);
-	dctx->s[3] = le32_to_cpuvp(key + 12);
+	dctx->s[0] = get_unaligned_le32(key +  0);
+	dctx->s[1] = get_unaligned_le32(key +  4);
+	dctx->s[2] = get_unaligned_le32(key +  8);
+	dctx->s[3] = get_unaligned_le32(key + 12);
 }
 
 unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
@@ -137,11 +133,11 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
 
 		/* h += m[i] */
-		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
-		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
-		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
-		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
-		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
+		h4 += (get_unaligned_le32(src + 12) >> 8) | hibit;
 
 		/* h *= r */
 		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-- 
2.10.2

^ permalink raw reply related

* [PATCH v3] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 19:43 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, Eric Biggers, René van Dorst
  Cc: Jason A. Donenfeld
In-Reply-To: <20161107191253.17998-1-Jason@zx2c4.com>

By using the unaligned access helpers, we drastically improve
performance on small MIPS routers that have to go through the exception
fix-up handler for these unaligned accesses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..7a77cfc 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -33,11 +33,6 @@ static inline u32 and(u32 v, u32 mask)
 	return v & mask;
 }
 
-static inline u32 le32_to_cpuvp(const void *p)
-{
-	return le32_to_cpup(p);
-}
-
 int crypto_poly1305_init(struct shash_desc *desc)
 {
 	struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -65,19 +60,19 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
-	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
-	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
-	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
-	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
-	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
+	dctx->r[0] = (get_unaligned_le32(key +  0) >> 0) & 0x3ffffff;
+	dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
+	dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
+	dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
+	dctx->r[4] = (get_unaligned_le32(key + 12) >> 8) & 0x00fffff;
 }
 
 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
-	dctx->s[0] = le32_to_cpuvp(key +  0);
-	dctx->s[1] = le32_to_cpuvp(key +  4);
-	dctx->s[2] = le32_to_cpuvp(key +  8);
-	dctx->s[3] = le32_to_cpuvp(key + 12);
+	dctx->s[0] = get_unaligned_le32(key +  0);
+	dctx->s[1] = get_unaligned_le32(key +  4);
+	dctx->s[2] = get_unaligned_le32(key +  8);
+	dctx->s[3] = get_unaligned_le32(key + 12);
 }
 
 unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
@@ -137,11 +132,11 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
 
 		/* h += m[i] */
-		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
-		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
-		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
-		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
-		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+		h0 += (get_unaligned_le32(src +  0) >> 0) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
+		h4 += (get_unaligned_le32(src + 12) >> 8) | hibit;
 
 		/* h *= r */
 		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 19:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list
In-Reply-To: <20161107192505.GB34388@google.com>

On Mon, Nov 7, 2016 at 8:25 PM, Eric Biggers <ebiggers@google.com> wrote:
> No it does *not* buffer all incoming blocks, which is why the source pointer can
> fall out of alignment.  Yes, I actually tested this.  In fact this situation is
> even hit, in both possible places, in the self-tests.

Urgh! v3 coming right up...

^ permalink raw reply

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: Eric Biggers @ 2016-11-07 19:25 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list
In-Reply-To: <CAHmME9rzBST8Lqapzykwf2rUyxLXuyOF0wX+Sy259Qtx9a4YSg@mail.gmail.com>

On Mon, Nov 07, 2016 at 08:02:35PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
> >
> > I was not referring to any users in particular, only what users could do.  As an
> > example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> > underlying algorithm is poly1305-generic, the last block would end up
> > misaligned.  This doesn't appear possible with your pseudocode because it only
> > passes in multiples of the block size until the very end.  However I don't see
> > it claimed anywhere that shash API users have to do that.
> 
> Actually it appears that crypto/poly1305_generic.c already buffers
> incoming blocks to a buffer that definitely looks aligned, to prevent
> this condition!
> 

No it does *not* buffer all incoming blocks, which is why the source pointer can
fall out of alignment.  Yes, I actually tested this.  In fact this situation is
even hit, in both possible places, in the self-tests.

Eric

^ permalink raw reply

* [PATCH v2] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 19:12 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, linux-crypto, linux-kernel,
	Martin Willi, Eric Biggers, René van Dorst
  Cc: Jason A. Donenfeld
In-Reply-To: <20161102175810.18647-1-Jason@zx2c4.com>

By using the unaligned access helpers, we drastically improve
performance on small MIPS routers that have to go through the exception
fix-up handler for these unaligned accesses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 crypto/poly1305_generic.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..0b86f4e 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -66,9 +66,9 @@ static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
 	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
-	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
-	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
-	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
+	dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
+	dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
+	dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
 	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
 }
 
@@ -138,9 +138,9 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 
 		/* h += m[i] */
 		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
-		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
-		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
-		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
+		h1 += (get_unaligned_le32(src +  3) >> 2) & 0x3ffffff;
+		h2 += (get_unaligned_le32(src +  6) >> 4) & 0x3ffffff;
+		h3 += (get_unaligned_le32(src +  9) >> 6) & 0x3ffffff;
 		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
 
 		/* h *= r */
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 19:02 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list
In-Reply-To: <20161107182646.GA34388@google.com>

On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers <ebiggers@google.com> wrote:
>
> I was not referring to any users in particular, only what users could do.  As an
> example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
> underlying algorithm is poly1305-generic, the last block would end up
> misaligned.  This doesn't appear possible with your pseudocode because it only
> passes in multiples of the block size until the very end.  However I don't see
> it claimed anywhere that shash API users have to do that.

Actually it appears that crypto/poly1305_generic.c already buffers
incoming blocks to a buffer that definitely looks aligned, to prevent
this condition!

I'll submit a v2 with only the inner unaligned operations changed.

^ permalink raw reply

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: Eric Biggers @ 2016-11-07 18:26 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list
In-Reply-To: <CAHmME9oejs+USiGJX2P0TgvnR6XRzV1HYZPrq=c-CjGzq59=NQ@mail.gmail.com>

On Mon, Nov 07, 2016 at 07:08:22PM +0100, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
> 
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
> 
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

I was not referring to any users in particular, only what users could do.  As an
example, if you did crypto_shash_update() with 32, 15, then 17 bytes, and the
underlying algorithm is poly1305-generic, the last block would end up
misaligned.  This doesn't appear possible with your pseudocode because it only
passes in multiples of the block size until the very end.  However I don't see
it claimed anywhere that shash API users have to do that.

Eric

^ permalink raw reply

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 18:23 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list
In-Reply-To: <CAHmME9oejs+USiGJX2P0TgvnR6XRzV1HYZPrq=c-CjGzq59=NQ@mail.gmail.com>

On Mon, Nov 7, 2016 at 7:08 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

In fact, I'm not so sure this happens here. In the while loop, each
new walk.src.virt.addr will be aligned to BLOCK_SIZE or be aligned by
virtue of being at the start of a new page. In the subsequent if
block, walk.src.virt.addr will either be
some_aligned_address+BLOCK_SIZE, which will be aligned, or it will be
a start of a new page, which will be aligned.

So what did you have in mind exactly?

I don't think anybody is running code like:

for (size_t i = 0; i < len; i += 17)
    poly1305_update(&poly, &buffer[i], 17);

(And if so, those consumers should be fixed.)

^ permalink raw reply

* Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-07 18:08 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Martin Willi, LKML, linux-crypto, David Miller,
	WireGuard mailing list
In-Reply-To: <20161104173723.GB34176@google.com>

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@google.com> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks().  In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed.  Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call.  Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);
struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);
blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
while (walk.nbytes >= BLOCK_SIZE) {
    size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
    poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
    blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}
if (walk.nbytes) {
    poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
    blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that that in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?

If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers like David suggested is just as fast as that
handrolled bit magic in the v1.

Regards,
Jason

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox