Linux cryptographic layer development
* Re: [PATCH 1/16] crypto: skcipher - Add skcipher walk interface
From: Herbert Xu @ 2016-11-11 11:19 UTC
  To: Eric Biggers; +Cc: Linux Crypto Mailing List
In-Reply-To: <20161102205420.GA17645@google.com>

On Wed, Nov 02, 2016 at 01:54:20PM -0700, Eric Biggers wrote:
>
> I think the case where skcipher_copy_iv() fails may be handled incorrectly.
> Wouldn't it need to set walk.nbytes to 0 so as to not confuse callers which
> expect that behavior?  Or maybe it should be calling skcipher_walk_done().

Good catch.  I'll fix and repost.
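
To illustrate the convention under discussion, here is a minimal userspace
sketch. It is not the kernel code: struct demo_walk, demo_walk_start(),
demo_walk_done() and demo_process() are hypothetical stand-ins, and the
16-byte step size is arbitrary. It only shows why a failing start should
leave nbytes at 0 when callers loop on that field.

```c
#include <errno.h>

/* Hypothetical userspace stand-in for a walk state; not the kernel struct. */
struct demo_walk {
	unsigned int nbytes;	/* bytes available in the current step */
	unsigned int total;	/* bytes still to be processed */
};

/*
 * Start a walk.  'iv_copy_fails' simulates an allocation failure while
 * copying the IV.  On error, nbytes is forced to 0 so that a caller
 * looping on walk->nbytes never enters its loop.
 */
static int demo_walk_start(struct demo_walk *walk, unsigned int total,
			   int iv_copy_fails)
{
	walk->total = total;
	if (iv_copy_fails) {
		walk->nbytes = 0;
		return -ENOMEM;
	}
	walk->nbytes = total < 16 ? total : 16;
	return 0;
}

/* Finish the current step and set up the next one. */
static int demo_walk_done(struct demo_walk *walk)
{
	walk->total -= walk->nbytes;
	walk->nbytes = walk->total < 16 ? walk->total : 16;
	return 0;
}

/* Typical caller shape: loop on nbytes, return the last error seen. */
static int demo_process(unsigned int total, int iv_copy_fails,
			unsigned int *steps)
{
	struct demo_walk walk;
	int err = demo_walk_start(&walk, total, iv_copy_fails);

	*steps = 0;
	while (walk.nbytes) {
		(*steps)++;
		err = demo_walk_done(&walk);
	}
	return err;
}
```

With the simulated failure, demo_process() performs no steps and returns
the error from the start call unchanged.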

> Setting walk->in.sg and walk->out.sg is redundant with the scatterwalk_start()
> calls.

Will remove.

> This gets called with uninitialized 'walk.flags'.  This was somewhat of a
> theoretical problem with the old blkcipher_walk code but it looks like now it
> will interact badly with the new SKCIPHER_WALK_SLEEP flag.  As far as I can see,
> whether the flag will end up set or not can depend on the uninitialized value.
> It would be nice if this problem could be avoided entirely by setting flags=0.

Right.  I'll fix this as well.
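
The hazard can be sketched in a few lines of userspace C. The flag values
and the demo_* names below are assumptions made for illustration and do not
mirror the actual patch. The point is only that a pure read-modify-write on
a field that was never assigned lets stale bits through, whereas assigning
0 first makes the result depend on the arguments alone.

```c
#include <stdbool.h>

#define DEMO_WALK_PHYS	(1u << 0)	/* hypothetical flag bits */
#define DEMO_WALK_SLEEP	(1u << 3)

struct demo_walk {
	unsigned int flags;
};

/* Read-modify-write only: whatever was in flags before leaks through. */
static void demo_start_rmw(struct demo_walk *walk, bool may_sleep)
{
	walk->flags &= ~DEMO_WALK_PHYS;
	if (may_sleep)
		walk->flags |= DEMO_WALK_SLEEP;
}

/* Assign first: the result depends on the arguments alone. */
static void demo_start_init(struct demo_walk *walk, bool may_sleep)
{
	walk->flags = 0;
	if (may_sleep)
		walk->flags |= DEMO_WALK_SLEEP;
}
```

Seeding flags with a stale value such as 0xdeadbeef before each call shows
the difference: the first variant leaves the sleep bit set even when
may_sleep is false, while the second always yields 0 in that case.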

> I'm also wondering about the choice to not look at 'atomic' until after the call
> to skcipher_walk_skcipher().  Wouldn't this mean that the choice of 'atomic'
> would not be respected in e.g. the kmalloc() in skcipher_copy_iv()?

The atomic flag is meant to be used in cases such as aesni where
you need to do kernel_fpu_begin after the call to start the walk.
IOW sleeping is fine at the start but not on subsequent walk calls.
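
A rough userspace model of that ordering follows. All demo_* names are
hypothetical and the sleep counter merely stands in for operations that
could block. It assumes the start call may sleep unconditionally and that
'atomic' only clears the permission for the steps that follow, which is
where a caller such as aesni would sit between kernel_fpu_begin() and
kernel_fpu_end().

```c
#include <stdbool.h>

#define DEMO_WALK_SLEEP	(1u << 0)	/* hypothetical flag bit */

struct demo_walk {
	unsigned int flags;
	unsigned int nbytes;
};

/* Counts how many operations were permitted to sleep. */
static int demo_sleep_calls;

static void demo_maybe_sleep(const struct demo_walk *walk)
{
	if (walk->flags & DEMO_WALK_SLEEP)
		demo_sleep_calls++;
}

/* Start of the walk: sleeping is allowed here, e.g. to allocate. */
static int demo_walk_virt(struct demo_walk *walk, unsigned int total,
			  bool atomic)
{
	walk->flags = DEMO_WALK_SLEEP;
	demo_maybe_sleep(walk);
	walk->nbytes = total;

	/* Only the subsequent steps must not sleep. */
	if (atomic)
		walk->flags &= ~DEMO_WALK_SLEEP;
	return 0;
}

/* Later step: runs inside the caller's FPU section when atomic. */
static int demo_walk_done(struct demo_walk *walk, unsigned int left)
{
	demo_maybe_sleep(walk);
	walk->nbytes = left;
	return 0;
}
```

In this model an atomic walk records exactly one sleep-capable operation
(the start), no matter how many steps follow.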

> I don't see any users of the "async" walking being introduced; are some planned?

skcipher_walk is meant to unite blkcipher_walk and ablkcipher_walk.
The latter will use the async case.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* [PATCH] crypto: jitterentropy - drop duplicate header module.h
From: Geliang Tang @ 2016-11-11 12:45 UTC
  To: Herbert Xu, David S. Miller; +Cc: Geliang Tang, linux-crypto, linux-kernel

Drop duplicate header module.h from jitterentropy-kcapi.c.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 crypto/jitterentropy-kcapi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/crypto/jitterentropy-kcapi.c b/crypto/jitterentropy-kcapi.c
index c4938497..787dccc 100644
--- a/crypto/jitterentropy-kcapi.c
+++ b/crypto/jitterentropy-kcapi.c
@@ -39,7 +39,6 @@
 
 #include <linux/module.h>
 #include <linux/slab.h>
-#include <linux/module.h>
 #include <linux/fips.h>
 #include <linux/time.h>
 #include <linux/crypto.h>
-- 
2.9.3

* [PATCH] crypto: nx - drop duplicate header types.h
From: Geliang Tang @ 2016-11-11 12:50 UTC
  To: Leonidas S. Barbosa, Paulo Flabiano Smorigo,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Herbert Xu, David S. Miller
  Cc: Geliang Tang, linux-crypto, linuxppc-dev, linux-kernel

Drop duplicate header types.h from nx.c.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 drivers/crypto/nx/nx.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/crypto/nx/nx.c b/drivers/crypto/nx/nx.c
index 42f0f22..036057a 100644
--- a/drivers/crypto/nx/nx.c
+++ b/drivers/crypto/nx/nx.c
@@ -32,7 +32,6 @@
 #include <linux/scatterlist.h>
 #include <linux/device.h>
 #include <linux/of.h>
-#include <linux/types.h>
 #include <asm/hvcall.h>
 #include <asm/vio.h>
 
-- 
2.9.3

* Re: algif_aead: AIO broken with more than one iocb
From: Stephan Mueller @ 2016-11-11 13:46 UTC
  To: Herbert Xu; +Cc: linux-crypto
In-Reply-To: <20160913101246.GA30851@gondor.apana.org.au>

On Tuesday, 13 September 2016, 18:12:46 CET, Herbert Xu wrote:

Hi Herbert,

> On Sun, Sep 11, 2016 at 04:59:19AM +0200, Stephan Mueller wrote:
> > Hi Herbert,
> > 
> > The AIO support for algif_aead is broken when submitting more than one
> > iocb.
> > 
> > The break happens in aead_recvmsg_async at the following code:
> >         /* ensure output buffer is sufficiently large */
> >         if (usedpages < outlen)
> >                 goto free;
> > 
> > The reason is that when submitting, say, two iocb, ctx->used contains the
> > buffer length for two AEAD operations (as expected). However, the recvmsg
> > code
> I don't think we should allow that.  We should make it so that you
> must start a recvmsg before you can send data for a new request.
> 
> Remember that the async path should be identical to the sync path,
> except that you don't wait for completion.

Just as a follow-up: with the patch submitted the other day to cover the AAD
and tag handling, algif_aead now also supports multiple iocbs.

Ciao
Stephan

* [PATCH] crypto: arm64/sha2: integrate OpenSSL implementations of SHA256/SHA512
From: Ard Biesheuvel @ 2016-11-11 13:51 UTC
  To: linux-crypto, linux-arm-kernel, herbert
  Cc: daniel.thompson, Ard Biesheuvel, catalin.marinas, will.deacon,
	appro, victor.chong

This integrates both the accelerated scalar and the NEON implementations
of SHA-224/256 as well as SHA-384/512 from the OpenSSL project.

Relative performance compared to the respective generic C versions:

                 |  SHA256-scalar  | SHA256-NEON* |  SHA512  |
     ------------+-----------------+--------------+----------+
     Cortex-A53  |      1.63x      |     1.63x    |   2.34x  |
     Cortex-A57  |      1.43x      |     1.59x    |   1.95x  |
     Cortex-A73  |      1.26x      |     1.56x    |     ?    |

The core crypto code was authored by Andy Polyakov of the OpenSSL
project, in collaboration with whom the upstream code was adapted so
that this module can be built from the same version of sha512-armv8.pl.

The version in this patch was taken from OpenSSL commit

   866e505e0d66 sha/asm/sha512-armv8.pl: add NEON version of SHA256.

* The core SHA algorithm is fundamentally sequential, but there is a
  secondary transformation involved, called the schedule update, which
  can be performed independently. The NEON version of SHA-224/SHA-256
  only implements this part of the algorithm using NEON instructions;
  the sequential part is always done using scalar instructions.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---

This supersedes the SHA-256-NEON-only patch I sent out about 6 weeks ago.

Will, Catalin: note that this pulls in a .pl script, and adds a build rule
locally in arch/arm64/crypto to generate .S files on the fly from Perl
scripts. I will leave it to you to decide whether you are ok with this as
is, or whether you prefer .S_shipped files, in which case the Perl script
is only included as a reference (this is how we did it for arch/arm in the
past, but given that it adds about 3000 lines of generated code to the patch,
I think we may want to simply keep it as below)
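
As an aside on the schedule update mentioned in the footnote above, here is
a minimal portable C sketch of the SHA-256 message schedule. It is not
taken from the OpenSSL code: the demo_* names are hypothetical, and only
the rotation and shift amounts match the @sigma0/@sigma1 values in the Perl
script. Each W[t] for t >= 16 depends only on earlier schedule words and
never on the working variables a..h, which is why this part can be computed
independently of the sequential rounds.

```c
#include <stdint.h>

/* Rotate right; n must be in the range 1..31. */
static uint32_t demo_ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* Small sigma functions of SHA-256 (FIPS 180-4). */
static uint32_t demo_sigma0(uint32_t x)
{
	return demo_ror32(x, 7) ^ demo_ror32(x, 18) ^ (x >> 3);
}

static uint32_t demo_sigma1(uint32_t x)
{
	return demo_ror32(x, 17) ^ demo_ror32(x, 19) ^ (x >> 10);
}

/*
 * Expand one 64-byte block into the 64-word message schedule.
 * No hash state is involved: the inputs are schedule words only.
 */
static void demo_sha256_schedule(const uint8_t block[64], uint32_t w[64])
{
	int t;

	/* W[0..15]: the block itself, loaded as big-endian words. */
	for (t = 0; t < 16; t++)
		w[t] = ((uint32_t)block[4 * t] << 24) |
		       ((uint32_t)block[4 * t + 1] << 16) |
		       ((uint32_t)block[4 * t + 2] << 8) |
		       (uint32_t)block[4 * t + 3];

	/* W[16..63]: the schedule update. */
	for (t = 16; t < 64; t++)
		w[t] = demo_sigma1(w[t - 2]) + w[t - 7] +
		       demo_sigma0(w[t - 15]) + w[t - 16];
}
```

The compression rounds would then consume w[t] + K[t] one word at a time;
that part stays scalar in the NEON variant described above.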
 
 arch/arm64/crypto/Kconfig         |   8 +
 arch/arm64/crypto/Makefile        |  15 +
 arch/arm64/crypto/sha256-glue.c   | 185 +++++
 arch/arm64/crypto/sha512-armv8.pl | 778 ++++++++++++++++++++
 arch/arm64/crypto/sha512-glue.c   |  94 +++
 5 files changed, 1080 insertions(+)

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 2cf32e9887e1..5f4a617e2957 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -8,6 +8,14 @@ menuconfig ARM64_CRYPTO
 
 if ARM64_CRYPTO
 
+config CRYPTO_SHA256_ARM64
+	tristate "SHA-224/SHA-256 digest algorithm for arm64"
+	select CRYPTO_HASH
+
+config CRYPTO_SHA512_ARM64
+	tristate "SHA-384/SHA-512 digest algorithm for arm64"
+	select CRYPTO_HASH
+
 config CRYPTO_SHA1_ARM64_CE
 	tristate "SHA-1 digest algorithm (ARMv8 Crypto Extensions)"
 	depends on ARM64 && KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index abb79b3cfcfe..861589faf6ef 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -29,6 +29,12 @@ aes-ce-blk-y := aes-glue-ce.o aes-ce.o
 obj-$(CONFIG_CRYPTO_AES_ARM64_NEON_BLK) += aes-neon-blk.o
 aes-neon-blk-y := aes-glue-neon.o aes-neon.o
 
+obj-$(CONFIG_CRYPTO_SHA256_ARM64) += sha256-arm64.o
+sha256-arm64-y := sha256-glue.o sha256-core.o
+
+obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
+sha512-arm64-y := sha512-glue.o sha512-core.o
+
 AFLAGS_aes-ce.o		:= -DINTERLEAVE=4
 AFLAGS_aes-neon.o	:= -DINTERLEAVE=4
 
@@ -40,3 +46,12 @@ CFLAGS_crc32-arm64.o	:= -mcpu=generic+crc
 
 $(obj)/aes-glue-%.o: $(src)/aes-glue.c FORCE
 	$(call if_changed_rule,cc_o_c)
+
+quiet_cmd_perl = PERLASM $@
+      cmd_perl = $(PERL) $(<) void $(@)
+
+$(obj)/sha256-core.S: $(src)/sha512-armv8.pl
+	$(call cmd,perl)
+
+$(obj)/sha512-core.S: $(src)/sha512-armv8.pl
+	$(call cmd,perl)
diff --git a/arch/arm64/crypto/sha256-glue.c b/arch/arm64/crypto/sha256-glue.c
new file mode 100644
index 000000000000..a2226f841960
--- /dev/null
+++ b/arch/arm64/crypto/sha256-glue.c
@@ -0,0 +1,185 @@
+/*
+ * Linux/arm64 port of the OpenSSL SHA256 implementation for AArch64
+ *
+ * Copyright (c) 2016 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <asm/hwcap.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/hash.h>
+#include <crypto/sha.h>
+#include <crypto/sha256_base.h>
+#include <linux/cryptohash.h>
+#include <linux/types.h>
+#include <linux/string.h>
+
+MODULE_DESCRIPTION("SHA-224/SHA-256 secure hash for arm64");
+MODULE_AUTHOR("Andy Polyakov <appro@openssl.org>");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("sha224");
+MODULE_ALIAS_CRYPTO("sha256");
+
+asmlinkage void sha256_block_data_order(u32 *digest, const void *data,
+					unsigned int num_blks);
+
+asmlinkage void sha256_block_neon(u32 *digest, const void *data,
+				  unsigned int num_blks);
+
+static int sha256_update(struct shash_desc *desc, const u8 *data,
+			 unsigned int len)
+{
+	return sha256_base_do_update(desc, data, len,
+				(sha256_block_fn *)sha256_block_data_order);
+}
+
+static int sha256_finup(struct shash_desc *desc, const u8 *data,
+			unsigned int len, u8 *out)
+{
+	if (len)
+		sha256_base_do_update(desc, data, len,
+				(sha256_block_fn *)sha256_block_data_order);
+	sha256_base_do_finalize(desc,
+				(sha256_block_fn *)sha256_block_data_order);
+
+	return sha256_base_finish(desc, out);
+}
+
+static int sha256_final(struct shash_desc *desc, u8 *out)
+{
+	return sha256_finup(desc, NULL, 0, out);
+}
+
+static struct shash_alg algs[] = { {
+	.digestsize		= SHA256_DIGEST_SIZE,
+	.init			= sha256_base_init,
+	.update			= sha256_update,
+	.final			= sha256_final,
+	.finup			= sha256_finup,
+	.descsize		= sizeof(struct sha256_state),
+	.base.cra_name		= "sha256",
+	.base.cra_driver_name	= "sha256-arm64",
+	.base.cra_priority	= 100,
+	.base.cra_flags		= CRYPTO_ALG_TYPE_SHASH,
+	.base.cra_blocksize	= SHA256_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+}, {
+	.digestsize		= SHA224_DIGEST_SIZE,
+	.init			= sha224_base_init,
+	.update			= sha256_update,
+	.final			= sha256_final,
+	.finup			= sha256_finup,
+	.descsize		= sizeof(struct sha256_state),
+	.base.cra_name		= "sha224",
+	.base.cra_driver_name	= "sha224-arm64",
+	.base.cra_priority	= 100,
+	.base.cra_flags		= CRYPTO_ALG_TYPE_SHASH,
+	.base.cra_blocksize	= SHA224_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+} };
+
+static int sha256_update_neon(struct shash_desc *desc, const u8 *data,
+			      unsigned int len)
+{
+	/*
+	 * Stacking and unstacking a substantial slice of the NEON register
+	 * file may significantly affect performance for small updates when
+	 * executing in interrupt context, so fall back to the scalar code
+	 * in that case.
+	 */
+	if (!may_use_simd())
+		return sha256_base_do_update(desc, data, len,
+				(sha256_block_fn *)sha256_block_data_order);
+
+	kernel_neon_begin();
+	sha256_base_do_update(desc, data, len,
+				(sha256_block_fn *)sha256_block_neon);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int sha256_finup_neon(struct shash_desc *desc, const u8 *data,
+			     unsigned int len, u8 *out)
+{
+	if (!may_use_simd()) {
+		if (len)
+			sha256_base_do_update(desc, data, len,
+				(sha256_block_fn *)sha256_block_data_order);
+		sha256_base_do_finalize(desc,
+				(sha256_block_fn *)sha256_block_data_order);
+	} else {
+		kernel_neon_begin();
+		if (len)
+			sha256_base_do_update(desc, data, len,
+				(sha256_block_fn *)sha256_block_neon);
+		sha256_base_do_finalize(desc,
+				(sha256_block_fn *)sha256_block_neon);
+		kernel_neon_end();
+	}
+	return sha256_base_finish(desc, out);
+}
+
+static int sha256_final_neon(struct shash_desc *desc, u8 *out)
+{
+	return sha256_finup_neon(desc, NULL, 0, out);
+}
+
+static struct shash_alg neon_algs[] = { {
+	.digestsize		= SHA256_DIGEST_SIZE,
+	.init			= sha256_base_init,
+	.update			= sha256_update_neon,
+	.final			= sha256_final_neon,
+	.finup			= sha256_finup_neon,
+	.descsize		= sizeof(struct sha256_state),
+	.base.cra_name		= "sha256",
+	.base.cra_driver_name	= "sha256-arm64-neon",
+	.base.cra_priority	= 150,
+	.base.cra_flags		= CRYPTO_ALG_TYPE_SHASH,
+	.base.cra_blocksize	= SHA256_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+}, {
+	.digestsize		= SHA224_DIGEST_SIZE,
+	.init			= sha224_base_init,
+	.update			= sha256_update_neon,
+	.final			= sha256_final_neon,
+	.finup			= sha256_finup_neon,
+	.descsize		= sizeof(struct sha256_state),
+	.base.cra_name		= "sha224",
+	.base.cra_driver_name	= "sha224-arm64-neon",
+	.base.cra_priority	= 150,
+	.base.cra_flags		= CRYPTO_ALG_TYPE_SHASH,
+	.base.cra_blocksize	= SHA224_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+} };
+
+static int __init sha256_mod_init(void)
+{
+	int ret = crypto_register_shashes(algs, ARRAY_SIZE(algs));
+	if (ret)
+		return ret;
+
+	if (elf_hwcap & HWCAP_ASIMD) {
+		ret = crypto_register_shashes(neon_algs, ARRAY_SIZE(neon_algs));
+		if (ret)
+			crypto_unregister_shashes(algs, ARRAY_SIZE(algs));
+	}
+	return ret;
+}
+
+static void __exit sha256_mod_fini(void)
+{
+	if (elf_hwcap & HWCAP_ASIMD)
+		crypto_unregister_shashes(neon_algs, ARRAY_SIZE(neon_algs));
+	crypto_unregister_shashes(algs, ARRAY_SIZE(algs));
+}
+
+module_init(sha256_mod_init);
+module_exit(sha256_mod_fini);
diff --git a/arch/arm64/crypto/sha512-armv8.pl b/arch/arm64/crypto/sha512-armv8.pl
new file mode 100644
index 000000000000..ffae5f23bcd8
--- /dev/null
+++ b/arch/arm64/crypto/sha512-armv8.pl
@@ -0,0 +1,778 @@
+#! /usr/bin/env perl
+# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the OpenSSL license (the "License").  You may not use
+# this file except in compliance with the License.  You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+
+# ====================================================================
+# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+#
+# Permission to use under GPLv2 terms is granted.
+# ====================================================================
+#
+# SHA256/512 for ARMv8.
+#
+# Performance in cycles per processed byte and improvement coefficient
+# over code generated with "default" compiler:
+#
+#		SHA256-hw	SHA256(*)	SHA512
+# Apple A7	1.97		10.5 (+33%)	6.73 (-1%(**))
+# Cortex-A53	2.38		15.5 (+115%)	10.0 (+150%(***))
+# Cortex-A57	2.31		11.6 (+86%)	7.51 (+260%(***))
+# Denver	2.01		10.5 (+26%)	6.70 (+8%)
+# X-Gene			20.0 (+100%)	12.8 (+300%(***))
+# Mongoose	2.36		13.0 (+50%)	8.36 (+33%)
+#
+# (*)	Software SHA256 results are of lesser relevance, presented
+#	mostly for informational purposes.
+# (**)	The result is a trade-off: it's possible to improve it by
+#	10% (or by 1 cycle per round), but at the cost of 20% loss
+#	on Cortex-A53 (or by 4 cycles per round).
+# (***)	Super-impressive coefficients over gcc-generated code are
+#	indication of some compiler "pathology", most notably code
+#	generated with -mgeneral-regs-only is significantly faster
+#	and the gap is only 40-90%.
+#
+# October 2016.
+#
+# Originally it was reckoned that it makes no sense to implement NEON
+# version of SHA256 for 64-bit processors. This is because performance
+# improvement on most wide-spread Cortex-A5x processors was observed
+# to be marginal, same on Cortex-A53 and ~10% on A57. But then it was
+# observed that 32-bit NEON SHA256 performs significantly better than
+# 64-bit scalar version on *some* of the more recent processors. As
+# result 64-bit NEON version of SHA256 was added to provide best
+# all-round performance. For example it executes ~30% faster on X-Gene
+# and Mongoose. [For reference, NEON version of SHA512 is bound to
+# deliver much less improvement, likely *negative* on Cortex-A5x.
+# Which is why NEON support is limited to SHA256.]
+
+$output=pop;
+$flavour=pop;
+
+if ($flavour && $flavour ne "void") {
+    $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+    ( $xlate="${dir}arm-xlate.pl" and -f $xlate ) or
+    ( $xlate="${dir}../../perlasm/arm-xlate.pl" and -f $xlate) or
+    die "can't locate arm-xlate.pl";
+
+    open OUT,"| \"$^X\" $xlate $flavour $output";
+    *STDOUT=*OUT;
+} else {
+    open STDOUT,">$output";
+}
+
+if ($output =~ /512/) {
+	$BITS=512;
+	$SZ=8;
+	@Sigma0=(28,34,39);
+	@Sigma1=(14,18,41);
+	@sigma0=(1,  8, 7);
+	@sigma1=(19,61, 6);
+	$rounds=80;
+	$reg_t="x";
+} else {
+	$BITS=256;
+	$SZ=4;
+	@Sigma0=( 2,13,22);
+	@Sigma1=( 6,11,25);
+	@sigma0=( 7,18, 3);
+	@sigma1=(17,19,10);
+	$rounds=64;
+	$reg_t="w";
+}
+
+$func="sha${BITS}_block_data_order";
+
+($ctx,$inp,$num,$Ktbl)=map("x$_",(0..2,30));
+
+@X=map("$reg_t$_",(3..15,0..2));
+@V=($A,$B,$C,$D,$E,$F,$G,$H)=map("$reg_t$_",(20..27));
+($t0,$t1,$t2,$t3)=map("$reg_t$_",(16,17,19,28));
+
+sub BODY_00_xx {
+my ($i,$a,$b,$c,$d,$e,$f,$g,$h)=@_;
+my $j=($i+1)&15;
+my ($T0,$T1,$T2)=(@X[($i-8)&15],@X[($i-9)&15],@X[($i-10)&15]);
+   $T0=@X[$i+3] if ($i<11);
+
+$code.=<<___	if ($i<16);
+#ifndef	__ARMEB__
+	rev	@X[$i],@X[$i]			// $i
+#endif
+___
+$code.=<<___	if ($i<13 && ($i&1));
+	ldp	@X[$i+1],@X[$i+2],[$inp],#2*$SZ
+___
+$code.=<<___	if ($i==13);
+	ldp	@X[14],@X[15],[$inp]
+___
+$code.=<<___	if ($i>=14);
+	ldr	@X[($i-11)&15],[sp,#`$SZ*(($i-11)%4)`]
+___
+$code.=<<___	if ($i>0 && $i<16);
+	add	$a,$a,$t1			// h+=Sigma0(a)
+___
+$code.=<<___	if ($i>=11);
+	str	@X[($i-8)&15],[sp,#`$SZ*(($i-8)%4)`]
+___
+# While ARMv8 specifies merged rotate-n-logical operation such as
+# 'eor x,y,z,ror#n', it was found to negatively affect performance
+# on Apple A7. The reason seems to be that it requires even 'y' to
+# be available earlier. This means that such merged instruction is
+# not necessarily best choice on critical path... On the other hand
+# Cortex-A5x handles merged instructions much better than disjoint
+# rotate and logical... See (**) footnote above.
+$code.=<<___	if ($i<15);
+	ror	$t0,$e,#$Sigma1[0]
+	add	$h,$h,$t2			// h+=K[i]
+	eor	$T0,$e,$e,ror#`$Sigma1[2]-$Sigma1[1]`
+	and	$t1,$f,$e
+	bic	$t2,$g,$e
+	add	$h,$h,@X[$i&15]			// h+=X[i]
+	orr	$t1,$t1,$t2			// Ch(e,f,g)
+	eor	$t2,$a,$b			// a^b, b^c in next round
+	eor	$t0,$t0,$T0,ror#$Sigma1[1]	// Sigma1(e)
+	ror	$T0,$a,#$Sigma0[0]
+	add	$h,$h,$t1			// h+=Ch(e,f,g)
+	eor	$t1,$a,$a,ror#`$Sigma0[2]-$Sigma0[1]`
+	add	$h,$h,$t0			// h+=Sigma1(e)
+	and	$t3,$t3,$t2			// (b^c)&=(a^b)
+	add	$d,$d,$h			// d+=h
+	eor	$t3,$t3,$b			// Maj(a,b,c)
+	eor	$t1,$T0,$t1,ror#$Sigma0[1]	// Sigma0(a)
+	add	$h,$h,$t3			// h+=Maj(a,b,c)
+	ldr	$t3,[$Ktbl],#$SZ		// *K++, $t2 in next round
+	//add	$h,$h,$t1			// h+=Sigma0(a)
+___
+$code.=<<___	if ($i>=15);
+	ror	$t0,$e,#$Sigma1[0]
+	add	$h,$h,$t2			// h+=K[i]
+	ror	$T1,@X[($j+1)&15],#$sigma0[0]
+	and	$t1,$f,$e
+	ror	$T2,@X[($j+14)&15],#$sigma1[0]
+	bic	$t2,$g,$e
+	ror	$T0,$a,#$Sigma0[0]
+	add	$h,$h,@X[$i&15]			// h+=X[i]
+	eor	$t0,$t0,$e,ror#$Sigma1[1]
+	eor	$T1,$T1,@X[($j+1)&15],ror#$sigma0[1]
+	orr	$t1,$t1,$t2			// Ch(e,f,g)
+	eor	$t2,$a,$b			// a^b, b^c in next round
+	eor	$t0,$t0,$e,ror#$Sigma1[2]	// Sigma1(e)
+	eor	$T0,$T0,$a,ror#$Sigma0[1]
+	add	$h,$h,$t1			// h+=Ch(e,f,g)
+	and	$t3,$t3,$t2			// (b^c)&=(a^b)
+	eor	$T2,$T2,@X[($j+14)&15],ror#$sigma1[1]
+	eor	$T1,$T1,@X[($j+1)&15],lsr#$sigma0[2]	// sigma0(X[i+1])
+	add	$h,$h,$t0			// h+=Sigma1(e)
+	eor	$t3,$t3,$b			// Maj(a,b,c)
+	eor	$t1,$T0,$a,ror#$Sigma0[2]	// Sigma0(a)
+	eor	$T2,$T2,@X[($j+14)&15],lsr#$sigma1[2]	// sigma1(X[i+14])
+	add	@X[$j],@X[$j],@X[($j+9)&15]
+	add	$d,$d,$h			// d+=h
+	add	$h,$h,$t3			// h+=Maj(a,b,c)
+	ldr	$t3,[$Ktbl],#$SZ		// *K++, $t2 in next round
+	add	@X[$j],@X[$j],$T1
+	add	$h,$h,$t1			// h+=Sigma0(a)
+	add	@X[$j],@X[$j],$T2
+___
+	($t2,$t3)=($t3,$t2);
+}
+
+$code.=<<___;
+#ifndef	__KERNEL__
+# include "arm_arch.h"
+#endif
+
+.text
+
+.extern	OPENSSL_armcap_P
+.globl	$func
+.type	$func,%function
+.align	6
+$func:
+___
+$code.=<<___	if ($SZ==4);
+#ifndef	__KERNEL__
+# ifdef	__ILP32__
+	ldrsw	x16,.LOPENSSL_armcap_P
+# else
+	ldr	x16,.LOPENSSL_armcap_P
+# endif
+	adr	x17,.LOPENSSL_armcap_P
+	add	x16,x16,x17
+	ldr	w16,[x16]
+	tst	w16,#ARMV8_SHA256
+	b.ne	.Lv8_entry
+	tst	w16,#ARMV7_NEON
+	b.ne	.Lneon_entry
+#endif
+___
+$code.=<<___;
+	stp	x29,x30,[sp,#-128]!
+	add	x29,sp,#0
+
+	stp	x19,x20,[sp,#16]
+	stp	x21,x22,[sp,#32]
+	stp	x23,x24,[sp,#48]
+	stp	x25,x26,[sp,#64]
+	stp	x27,x28,[sp,#80]
+	sub	sp,sp,#4*$SZ
+
+	ldp	$A,$B,[$ctx]				// load context
+	ldp	$C,$D,[$ctx,#2*$SZ]
+	ldp	$E,$F,[$ctx,#4*$SZ]
+	add	$num,$inp,$num,lsl#`log(16*$SZ)/log(2)`	// end of input
+	ldp	$G,$H,[$ctx,#6*$SZ]
+	adr	$Ktbl,.LK$BITS
+	stp	$ctx,$num,[x29,#96]
+
+.Loop:
+	ldp	@X[0],@X[1],[$inp],#2*$SZ
+	ldr	$t2,[$Ktbl],#$SZ			// *K++
+	eor	$t3,$B,$C				// magic seed
+	str	$inp,[x29,#112]
+___
+for ($i=0;$i<16;$i++)	{ &BODY_00_xx($i,@V); unshift(@V,pop(@V)); }
+$code.=".Loop_16_xx:\n";
+for (;$i<32;$i++)	{ &BODY_00_xx($i,@V); unshift(@V,pop(@V)); }
+$code.=<<___;
+	cbnz	$t2,.Loop_16_xx
+
+	ldp	$ctx,$num,[x29,#96]
+	ldr	$inp,[x29,#112]
+	sub	$Ktbl,$Ktbl,#`$SZ*($rounds+1)`		// rewind
+
+	ldp	@X[0],@X[1],[$ctx]
+	ldp	@X[2],@X[3],[$ctx,#2*$SZ]
+	add	$inp,$inp,#14*$SZ			// advance input pointer
+	ldp	@X[4],@X[5],[$ctx,#4*$SZ]
+	add	$A,$A,@X[0]
+	ldp	@X[6],@X[7],[$ctx,#6*$SZ]
+	add	$B,$B,@X[1]
+	add	$C,$C,@X[2]
+	add	$D,$D,@X[3]
+	stp	$A,$B,[$ctx]
+	add	$E,$E,@X[4]
+	add	$F,$F,@X[5]
+	stp	$C,$D,[$ctx,#2*$SZ]
+	add	$G,$G,@X[6]
+	add	$H,$H,@X[7]
+	cmp	$inp,$num
+	stp	$E,$F,[$ctx,#4*$SZ]
+	stp	$G,$H,[$ctx,#6*$SZ]
+	b.ne	.Loop
+
+	ldp	x19,x20,[x29,#16]
+	add	sp,sp,#4*$SZ
+	ldp	x21,x22,[x29,#32]
+	ldp	x23,x24,[x29,#48]
+	ldp	x25,x26,[x29,#64]
+	ldp	x27,x28,[x29,#80]
+	ldp	x29,x30,[sp],#128
+	ret
+.size	$func,.-$func
+
+.align	6
+.type	.LK$BITS,%object
+.LK$BITS:
+___
+$code.=<<___ if ($SZ==8);
+	.quad	0x428a2f98d728ae22,0x7137449123ef65cd
+	.quad	0xb5c0fbcfec4d3b2f,0xe9b5dba58189dbbc
+	.quad	0x3956c25bf348b538,0x59f111f1b605d019
+	.quad	0x923f82a4af194f9b,0xab1c5ed5da6d8118
+	.quad	0xd807aa98a3030242,0x12835b0145706fbe
+	.quad	0x243185be4ee4b28c,0x550c7dc3d5ffb4e2
+	.quad	0x72be5d74f27b896f,0x80deb1fe3b1696b1
+	.quad	0x9bdc06a725c71235,0xc19bf174cf692694
+	.quad	0xe49b69c19ef14ad2,0xefbe4786384f25e3
+	.quad	0x0fc19dc68b8cd5b5,0x240ca1cc77ac9c65
+	.quad	0x2de92c6f592b0275,0x4a7484aa6ea6e483
+	.quad	0x5cb0a9dcbd41fbd4,0x76f988da831153b5
+	.quad	0x983e5152ee66dfab,0xa831c66d2db43210
+	.quad	0xb00327c898fb213f,0xbf597fc7beef0ee4
+	.quad	0xc6e00bf33da88fc2,0xd5a79147930aa725
+	.quad	0x06ca6351e003826f,0x142929670a0e6e70
+	.quad	0x27b70a8546d22ffc,0x2e1b21385c26c926
+	.quad	0x4d2c6dfc5ac42aed,0x53380d139d95b3df
+	.quad	0x650a73548baf63de,0x766a0abb3c77b2a8
+	.quad	0x81c2c92e47edaee6,0x92722c851482353b
+	.quad	0xa2bfe8a14cf10364,0xa81a664bbc423001
+	.quad	0xc24b8b70d0f89791,0xc76c51a30654be30
+	.quad	0xd192e819d6ef5218,0xd69906245565a910
+	.quad	0xf40e35855771202a,0x106aa07032bbd1b8
+	.quad	0x19a4c116b8d2d0c8,0x1e376c085141ab53
+	.quad	0x2748774cdf8eeb99,0x34b0bcb5e19b48a8
+	.quad	0x391c0cb3c5c95a63,0x4ed8aa4ae3418acb
+	.quad	0x5b9cca4f7763e373,0x682e6ff3d6b2b8a3
+	.quad	0x748f82ee5defb2fc,0x78a5636f43172f60
+	.quad	0x84c87814a1f0ab72,0x8cc702081a6439ec
+	.quad	0x90befffa23631e28,0xa4506cebde82bde9
+	.quad	0xbef9a3f7b2c67915,0xc67178f2e372532b
+	.quad	0xca273eceea26619c,0xd186b8c721c0c207
+	.quad	0xeada7dd6cde0eb1e,0xf57d4f7fee6ed178
+	.quad	0x06f067aa72176fba,0x0a637dc5a2c898a6
+	.quad	0x113f9804bef90dae,0x1b710b35131c471b
+	.quad	0x28db77f523047d84,0x32caab7b40c72493
+	.quad	0x3c9ebe0a15c9bebc,0x431d67c49c100d4c
+	.quad	0x4cc5d4becb3e42b6,0x597f299cfc657e2a
+	.quad	0x5fcb6fab3ad6faec,0x6c44198c4a475817
+	.quad	0	// terminator
+___
+$code.=<<___ if ($SZ==4);
+	.long	0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5
+	.long	0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5
+	.long	0xd807aa98,0x12835b01,0x243185be,0x550c7dc3
+	.long	0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174
+	.long	0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc
+	.long	0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da
+	.long	0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7
+	.long	0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967
+	.long	0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13
+	.long	0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85
+	.long	0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3
+	.long	0xd192e819,0xd6990624,0xf40e3585,0x106aa070
+	.long	0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5
+	.long	0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3
+	.long	0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208
+	.long	0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
+	.long	0	//terminator
+___
+$code.=<<___;
+.size	.LK$BITS,.-.LK$BITS
+#ifndef	__KERNEL__
+.align	3
+.LOPENSSL_armcap_P:
+# ifdef	__ILP32__
+	.long	OPENSSL_armcap_P-.
+# else
+	.quad	OPENSSL_armcap_P-.
+# endif
+#endif
+.asciz	"SHA$BITS block transform for ARMv8, CRYPTOGAMS by <appro\@openssl.org>"
+.align	2
+___
+
+if ($SZ==4) {
+my $Ktbl="x3";
+
+my ($ABCD,$EFGH,$abcd)=map("v$_.16b",(0..2));
+my @MSG=map("v$_.16b",(4..7));
+my ($W0,$W1)=("v16.4s","v17.4s");
+my ($ABCD_SAVE,$EFGH_SAVE)=("v18.16b","v19.16b");
+
+$code.=<<___;
+#ifndef	__KERNEL__
+.type	sha256_block_armv8,%function
+.align	6
+sha256_block_armv8:
+.Lv8_entry:
+	stp		x29,x30,[sp,#-16]!
+	add		x29,sp,#0
+
+	ld1.32		{$ABCD,$EFGH},[$ctx]
+	adr		$Ktbl,.LK256
+
+.Loop_hw:
+	ld1		{@MSG[0]-@MSG[3]},[$inp],#64
+	sub		$num,$num,#1
+	ld1.32		{$W0},[$Ktbl],#16
+	rev32		@MSG[0],@MSG[0]
+	rev32		@MSG[1],@MSG[1]
+	rev32		@MSG[2],@MSG[2]
+	rev32		@MSG[3],@MSG[3]
+	orr		$ABCD_SAVE,$ABCD,$ABCD		// offload
+	orr		$EFGH_SAVE,$EFGH,$EFGH
+___
+for($i=0;$i<12;$i++) {
+$code.=<<___;
+	ld1.32		{$W1},[$Ktbl],#16
+	add.i32		$W0,$W0,@MSG[0]
+	sha256su0	@MSG[0],@MSG[1]
+	orr		$abcd,$ABCD,$ABCD
+	sha256h		$ABCD,$EFGH,$W0
+	sha256h2	$EFGH,$abcd,$W0
+	sha256su1	@MSG[0],@MSG[2],@MSG[3]
+___
+	($W0,$W1)=($W1,$W0);	push(@MSG,shift(@MSG));
+}
+$code.=<<___;
+	ld1.32		{$W1},[$Ktbl],#16
+	add.i32		$W0,$W0,@MSG[0]
+	orr		$abcd,$ABCD,$ABCD
+	sha256h		$ABCD,$EFGH,$W0
+	sha256h2	$EFGH,$abcd,$W0
+
+	ld1.32		{$W0},[$Ktbl],#16
+	add.i32		$W1,$W1,@MSG[1]
+	orr		$abcd,$ABCD,$ABCD
+	sha256h		$ABCD,$EFGH,$W1
+	sha256h2	$EFGH,$abcd,$W1
+
+	ld1.32		{$W1},[$Ktbl]
+	add.i32		$W0,$W0,@MSG[2]
+	sub		$Ktbl,$Ktbl,#$rounds*$SZ-16	// rewind
+	orr		$abcd,$ABCD,$ABCD
+	sha256h		$ABCD,$EFGH,$W0
+	sha256h2	$EFGH,$abcd,$W0
+
+	add.i32		$W1,$W1,@MSG[3]
+	orr		$abcd,$ABCD,$ABCD
+	sha256h		$ABCD,$EFGH,$W1
+	sha256h2	$EFGH,$abcd,$W1
+
+	add.i32		$ABCD,$ABCD,$ABCD_SAVE
+	add.i32		$EFGH,$EFGH,$EFGH_SAVE
+
+	cbnz		$num,.Loop_hw
+
+	st1.32		{$ABCD,$EFGH},[$ctx]
+
+	ldr		x29,[sp],#16
+	ret
+.size	sha256_block_armv8,.-sha256_block_armv8
+#endif
+___
+}
+
+if ($SZ==4) {	######################################### NEON stuff #
+# You'll surely note a lot of similarities with sha256-armv4 module,
+# and of course it's not a coincidence. sha256-armv4 was used as
+# initial template, but was adapted for ARMv8 instruction set and
+# extensively re-tuned for all-round performance.
+
+my @V = ($A,$B,$C,$D,$E,$F,$G,$H) = map("w$_",(3..10));
+my ($t0,$t1,$t2,$t3,$t4) = map("w$_",(11..15));
+my $Ktbl="x16";
+my $Xfer="x17";
+my @X = map("q$_",(0..3));
+my ($T0,$T1,$T2,$T3,$T4,$T5,$T6,$T7) = map("q$_",(4..7,16..19));
+my $j=0;
+
+sub AUTOLOAD()          # thunk [simplified] x86-style perlasm
+{ my $opcode = $AUTOLOAD; $opcode =~ s/.*:://; $opcode =~ s/_/\./;
+  my $arg = pop;
+    $arg = "#$arg" if ($arg*1 eq $arg);
+    $code .= "\t$opcode\t".join(',',@_,$arg)."\n";
+}
+
+sub Dscalar { shift =~ m|[qv]([0-9]+)|?"d$1":""; }
+sub Dlo     { shift =~ m|[qv]([0-9]+)|?"v$1.d[0]":""; }
+sub Dhi     { shift =~ m|[qv]([0-9]+)|?"v$1.d[1]":""; }
+
+sub Xupdate()
+{ use integer;
+  my $body = shift;
+  my @insns = (&$body,&$body,&$body,&$body);
+  my ($a,$b,$c,$d,$e,$f,$g,$h);
+
+	&ext_8		($T0,@X[0],@X[1],4);	# X[1..4]
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&ext_8		($T3,@X[2],@X[3],4);	# X[9..12]
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&mov		(&Dscalar($T7),&Dhi(@X[3]));	# X[14..15]
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&ushr_32	($T2,$T0,$sigma0[0]);
+	 eval(shift(@insns));
+	&ushr_32	($T1,$T0,$sigma0[2]);
+	 eval(shift(@insns));
+	&add_32 	(@X[0],@X[0],$T3);	# X[0..3] += X[9..12]
+	 eval(shift(@insns));
+	&sli_32		($T2,$T0,32-$sigma0[0]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&ushr_32	($T3,$T0,$sigma0[1]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&eor_8		($T1,$T1,$T2);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&sli_32		($T3,$T0,32-$sigma0[1]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &ushr_32	($T4,$T7,$sigma1[0]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&eor_8		($T1,$T1,$T3);		# sigma0(X[1..4])
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &sli_32	($T4,$T7,32-$sigma1[0]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &ushr_32	($T5,$T7,$sigma1[2]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &ushr_32	($T3,$T7,$sigma1[1]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&add_32		(@X[0],@X[0],$T1);	# X[0..3] += sigma0(X[1..4])
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &sli_u32	($T3,$T7,32-$sigma1[1]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &eor_8	($T5,$T5,$T4);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &eor_8	($T5,$T5,$T3);		# sigma1(X[14..15])
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&add_32		(@X[0],@X[0],$T5);	# X[0..1] += sigma1(X[14..15])
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &ushr_32	($T6,@X[0],$sigma1[0]);
+	 eval(shift(@insns));
+	  &ushr_32	($T7,@X[0],$sigma1[2]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &sli_32	($T6,@X[0],32-$sigma1[0]);
+	 eval(shift(@insns));
+	  &ushr_32	($T5,@X[0],$sigma1[1]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &eor_8	($T7,$T7,$T6);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	  &sli_32	($T5,@X[0],32-$sigma1[1]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&ld1_32		("{$T0}","[$Ktbl], #16");
+	 eval(shift(@insns));
+	  &eor_8	($T7,$T7,$T5);		# sigma1(X[16..17])
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&eor_8		($T5,$T5,$T5);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&mov		(&Dhi($T5), &Dlo($T7));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&add_32		(@X[0],@X[0],$T5);	# X[2..3] += sigma1(X[16..17])
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&add_32		($T0,$T0,@X[0]);
+	 while($#insns>=1) { eval(shift(@insns)); }
+	&st1_32		("{$T0}","[$Xfer], #16");
+	 eval(shift(@insns));
+
+	push(@X,shift(@X));		# "rotate" X[]
+}
+
+sub Xpreload()
+{ use integer;
+  my $body = shift;
+  my @insns = (&$body,&$body,&$body,&$body);
+  my ($a,$b,$c,$d,$e,$f,$g,$h);
+
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&ld1_8		("{@X[0]}","[$inp],#16");
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&ld1_32		("{$T0}","[$Ktbl],#16");
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&rev32		(@X[0],@X[0]);
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	 eval(shift(@insns));
+	&add_32		($T0,$T0,@X[0]);
+	 foreach (@insns) { eval; }	# remaining instructions
+	&st1_32		("{$T0}","[$Xfer], #16");
+
+	push(@X,shift(@X));		# "rotate" X[]
+}
+
+sub body_00_15 () {
+	(
+	'($a,$b,$c,$d,$e,$f,$g,$h)=@V;'.
+	'&add	($h,$h,$t1)',			# h+=X[i]+K[i]
+	'&add	($a,$a,$t4);'.			# h+=Sigma0(a) from the past
+	'&and	($t1,$f,$e)',
+	'&bic	($t4,$g,$e)',
+	'&eor	($t0,$e,$e,"ror#".($Sigma1[1]-$Sigma1[0]))',
+	'&add	($a,$a,$t2)',			# h+=Maj(a,b,c) from the past
+	'&orr	($t1,$t1,$t4)',			# Ch(e,f,g)
+	'&eor	($t0,$t0,$e,"ror#".($Sigma1[2]-$Sigma1[0]))',	# Sigma1(e)
+	'&eor	($t4,$a,$a,"ror#".($Sigma0[1]-$Sigma0[0]))',
+	'&add	($h,$h,$t1)',			# h+=Ch(e,f,g)
+	'&ror	($t0,$t0,"#$Sigma1[0]")',
+	'&eor	($t2,$a,$b)',			# a^b, b^c in next round
+	'&eor	($t4,$t4,$a,"ror#".($Sigma0[2]-$Sigma0[0]))',	# Sigma0(a)
+	'&add	($h,$h,$t0)',			# h+=Sigma1(e)
+	'&ldr	($t1,sprintf "[sp,#%d]",4*(($j+1)&15))	if (($j&15)!=15);'.
+	'&ldr	($t1,"[$Ktbl]")				if ($j==15);'.
+	'&and	($t3,$t3,$t2)',			# (b^c)&=(a^b)
+	'&ror	($t4,$t4,"#$Sigma0[0]")',
+	'&add	($d,$d,$h)',			# d+=h
+	'&eor	($t3,$t3,$b)',			# Maj(a,b,c)
+	'$j++;	unshift(@V,pop(@V)); ($t2,$t3)=($t3,$t2);'
+	)
+}
+
+$code.=<<___;
+#ifdef	__KERNEL__
+.globl	sha256_block_neon
+#endif
+.type	sha256_block_neon,%function
+.align	4
+sha256_block_neon:
+.Lneon_entry:
+	stp	x29, x30, [sp, #-16]!
+	mov	x29, sp
+	sub	sp,sp,#16*4
+
+	adr	$Ktbl,.LK256
+	add	$num,$inp,$num,lsl#6	// len to point at the end of inp
+
+	ld1.8	{@X[0]},[$inp], #16
+	ld1.8	{@X[1]},[$inp], #16
+	ld1.8	{@X[2]},[$inp], #16
+	ld1.8	{@X[3]},[$inp], #16
+	ld1.32	{$T0},[$Ktbl], #16
+	ld1.32	{$T1},[$Ktbl], #16
+	ld1.32	{$T2},[$Ktbl], #16
+	ld1.32	{$T3},[$Ktbl], #16
+	rev32	@X[0],@X[0]		// yes, even on
+	rev32	@X[1],@X[1]		// big-endian
+	rev32	@X[2],@X[2]
+	rev32	@X[3],@X[3]
+	mov	$Xfer,sp
+	add.32	$T0,$T0,@X[0]
+	add.32	$T1,$T1,@X[1]
+	add.32	$T2,$T2,@X[2]
+	st1.32	{$T0-$T1},[$Xfer], #32
+	add.32	$T3,$T3,@X[3]
+	st1.32	{$T2-$T3},[$Xfer]
+	sub	$Xfer,$Xfer,#32
+
+	ldp	$A,$B,[$ctx]
+	ldp	$C,$D,[$ctx,#8]
+	ldp	$E,$F,[$ctx,#16]
+	ldp	$G,$H,[$ctx,#24]
+	ldr	$t1,[sp,#0]
+	mov	$t2,wzr
+	eor	$t3,$B,$C
+	mov	$t4,wzr
+	b	.L_00_48
+
+.align	4
+.L_00_48:
+___
+	&Xupdate(\&body_00_15);
+	&Xupdate(\&body_00_15);
+	&Xupdate(\&body_00_15);
+	&Xupdate(\&body_00_15);
+$code.=<<___;
+	cmp	$t1,#0				// check for K256 terminator
+	ldr	$t1,[sp,#0]
+	sub	$Xfer,$Xfer,#64
+	bne	.L_00_48
+
+	sub	$Ktbl,$Ktbl,#256		// rewind $Ktbl
+	cmp	$inp,$num
+	mov	$Xfer, #64
+	csel	$Xfer, $Xfer, xzr, eq
+	sub	$inp,$inp,$Xfer			// avoid SEGV
+	mov	$Xfer,sp
+___
+	&Xpreload(\&body_00_15);
+	&Xpreload(\&body_00_15);
+	&Xpreload(\&body_00_15);
+	&Xpreload(\&body_00_15);
+$code.=<<___;
+	add	$A,$A,$t4			// h+=Sigma0(a) from the past
+	ldp	$t0,$t1,[$ctx,#0]
+	add	$A,$A,$t2			// h+=Maj(a,b,c) from the past
+	ldp	$t2,$t3,[$ctx,#8]
+	add	$A,$A,$t0			// accumulate
+	add	$B,$B,$t1
+	ldp	$t0,$t1,[$ctx,#16]
+	add	$C,$C,$t2
+	add	$D,$D,$t3
+	ldp	$t2,$t3,[$ctx,#24]
+	add	$E,$E,$t0
+	add	$F,$F,$t1
+	 ldr	$t1,[sp,#0]
+	stp	$A,$B,[$ctx,#0]
+	add	$G,$G,$t2
+	 mov	$t2,wzr
+	stp	$C,$D,[$ctx,#8]
+	add	$H,$H,$t3
+	stp	$E,$F,[$ctx,#16]
+	 eor	$t3,$B,$C
+	stp	$G,$H,[$ctx,#24]
+	 mov	$t4,wzr
+	 mov	$Xfer,sp
+	b.ne	.L_00_48
+
+	ldr	x29,[x29]
+	add	sp,sp,#16*4+16
+	ret
+.size	sha256_block_neon,.-sha256_block_neon
+___
+}
+
+$code.=<<___;
+#ifndef	__KERNEL__
+.comm	OPENSSL_armcap_P,4,4
+#endif
+___
+
+{   my  %opcode = (
+	"sha256h"	=> 0x5e004000,	"sha256h2"	=> 0x5e005000,
+	"sha256su0"	=> 0x5e282800,	"sha256su1"	=> 0x5e006000	);
+
+    sub unsha256 {
+	my ($mnemonic,$arg)=@_;
+
+	$arg =~ m/[qv]([0-9]+)[^,]*,\s*[qv]([0-9]+)[^,]*(?:,\s*[qv]([0-9]+))?/o
+	&&
+	sprintf ".inst\t0x%08x\t//%s %s",
+			$opcode{$mnemonic}|$1|($2<<5)|($3<<16),
+			$mnemonic,$arg;
+    }
+}
+
+open SELF,$0;
+while(<SELF>) {
+        next if (/^#!/);
+        last if (!s/^#/\/\// and !/^$/);
+        print;
+}
+close SELF;
+
+foreach(split("\n",$code)) {
+
+	s/\`([^\`]*)\`/eval($1)/ge;
+
+	s/\b(sha256\w+)\s+([qv].*)/unsha256($1,$2)/ge;
+
+	s/\bq([0-9]+)\b/v$1.16b/g;		# old->new registers
+
+	s/\.[ui]?8(\s)/$1/;
+	s/\.\w?32\b//		and s/\.16b/\.4s/g;
+	m/(ld|st)1[^\[]+\[0\]/	and s/\.4s/\.s/g;
+
+	print $_,"\n";
+}
+
+close STDOUT;
diff --git a/arch/arm64/crypto/sha512-glue.c b/arch/arm64/crypto/sha512-glue.c
new file mode 100644
index 000000000000..aff35c9992a4
--- /dev/null
+++ b/arch/arm64/crypto/sha512-glue.c
@@ -0,0 +1,94 @@
+/*
+ * Linux/arm64 port of the OpenSSL SHA512 implementation for AArch64
+ *
+ * Copyright (c) 2016 Linaro Ltd. <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include <crypto/internal/hash.h>
+#include <linux/cryptohash.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <crypto/sha.h>
+#include <crypto/sha512_base.h>
+#include <asm/neon.h>
+
+MODULE_DESCRIPTION("SHA-384/SHA-512 secure hash for arm64");
+MODULE_AUTHOR("Andy Polyakov <appro@openssl.org>");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("sha384");
+MODULE_ALIAS_CRYPTO("sha512");
+
+asmlinkage void sha512_block_data_order(u32 *digest, const void *data,
+					unsigned int num_blks);
+
+static int sha512_update(struct shash_desc *desc, const u8 *data,
+			 unsigned int len)
+{
+	return sha512_base_do_update(desc, data, len,
+			(sha512_block_fn *)sha512_block_data_order);
+}
+
+static int sha512_finup(struct shash_desc *desc, const u8 *data,
+			unsigned int len, u8 *out)
+{
+	if (len)
+		sha512_base_do_update(desc, data, len,
+			(sha512_block_fn *)sha512_block_data_order);
+	sha512_base_do_finalize(desc,
+			(sha512_block_fn *)sha512_block_data_order);
+
+	return sha512_base_finish(desc, out);
+}
+
+static int sha512_final(struct shash_desc *desc, u8 *out)
+{
+	return sha512_finup(desc, NULL, 0, out);
+}
+
+static struct shash_alg algs[] = { {
+	.digestsize		= SHA512_DIGEST_SIZE,
+	.init			= sha512_base_init,
+	.update			= sha512_update,
+	.final			= sha512_final,
+	.finup			= sha512_finup,
+	.descsize		= sizeof(struct sha512_state),
+	.base.cra_name		= "sha512",
+	.base.cra_driver_name	= "sha512-arm64",
+	.base.cra_priority	= 150,
+	.base.cra_flags		= CRYPTO_ALG_TYPE_SHASH,
+	.base.cra_blocksize	= SHA512_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+}, {
+	.digestsize		= SHA384_DIGEST_SIZE,
+	.init			= sha384_base_init,
+	.update			= sha512_update,
+	.final			= sha512_final,
+	.finup			= sha512_finup,
+	.descsize		= sizeof(struct sha512_state),
+	.base.cra_name		= "sha384",
+	.base.cra_driver_name	= "sha384-arm64",
+	.base.cra_priority	= 150,
+	.base.cra_flags		= CRYPTO_ALG_TYPE_SHASH,
+	.base.cra_blocksize	= SHA384_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+} };
+
+static int __init sha512_mod_init(void)
+{
+	return crypto_register_shashes(algs, ARRAY_SIZE(algs));
+}
+
+static void __exit sha512_mod_fini(void)
+{
+	crypto_unregister_shashes(algs, ARRAY_SIZE(algs));
+}
+
+module_init(sha512_mod_init);
+module_exit(sha512_mod_fini);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v3] crypto: only call put_page on referenced and used pages
From: Stephan Mueller @ 2016-11-11 14:28 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-crypto
In-Reply-To: <6581903.GBJMzZudEe@tauon.atsec.com>

On Tuesday, 13 September 2016, 13:27:34 CET, Stephan Mueller wrote:

Hi Herbert,

> On Tuesday, 13 September 2016, 18:08:16 CEST, Herbert Xu wrote:
> 
> Hi Herbert,
> 
> > This patch appears to be papering over a real bug.
> > 
> > The async path should be exactly the same as the sync path, except
> > that we don't wait for completion.  So the question is why are we
> > getting this crash here for async but not sync?
> 
> At least one reason is found in skcipher_recvmsg_async with the following
> code path:
> 
>  if (txbufs == tx_nents) {
>                         struct scatterlist *tmp;
>                         int x;
>                         /* Ran out of tx slots in async request
>                          * need to expand */
>                         tmp = kcalloc(tx_nents * 2, sizeof(*tmp),
>                                       GFP_KERNEL);
>                         if (!tmp)
>                                 goto free;
> 
>                         sg_init_table(tmp, tx_nents * 2);
>                         for (x = 0; x < tx_nents; x++)
>                                 sg_set_page(&tmp[x], sg_page(&sreq->tsg[x]),
>                                             sreq->tsg[x].length,
>                                             sreq->tsg[x].offset);
>                         kfree(sreq->tsg);
>                         sreq->tsg = tmp;
>                         tx_nents *= 2;
>                         mark = true;
>                 }
> 
> 
> ==> The code allocates twice the previously existing memory and copies
> the existing SGs over, but does not set the remaining SGs to anything.
> If the caller provides fewer pages than the number of allocated SGs,
> some SGs are unset. Hence, the deallocation must not do anything with
> the still-uninitialized SGs.

I looked into the issue a bit deeper. In addition to the aforementioned code, 
the following code seems to be a second culprit:

	tx_nents = skcipher_all_sg_nents(ctx);
	sreq->tsg = kcalloc(tx_nents, sizeof(*sg), GFP_KERNEL);
	if (unlikely(!sreq->tsg))
		goto unlock;
	sg_init_table(sreq->tsg, tx_nents);

Here again, an SGL is initialized, but there are no pages mapped to the SGs.
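
The point above can be illustrated with a minimal userspace sketch. It assumes hypothetical stand-ins (`struct page_stub`, `struct sg_slot`, `release_used_slots`) rather than the kernel's `struct scatterlist` and `put_page()`, and only shows the idea of skipping entries that were allocated and initialized but never had a page mapped:

```c
#include <stddef.h>

/* Hypothetical stand-in for a page with a reference count. */
struct page_stub {
	int refcount;
};

/* Hypothetical stand-in for one scatterlist entry; not the kernel struct. */
struct sg_slot {
	struct page_stub *page;	/* NULL while no page has been mapped */
	unsigned int length;
};

/*
 * Drop one reference on every slot that actually points at a page and
 * skip slots that were allocated (zeroed) but never populated.
 * Returns the number of references dropped.
 */
static unsigned int release_used_slots(struct sg_slot *slots,
				       unsigned int nents)
{
	unsigned int i, released = 0;

	for (i = 0; i < nents; i++) {
		if (!slots[i].page)
			continue;	/* never mapped: nothing to release */
		slots[i].page->refcount--;
		slots[i].page = NULL;
		released++;
	}
	return released;
}
```

With a four-entry table in which only the first two entries carry pages, the helper releases exactly two references and leaves the untouched entries alone; a second call is a no-op.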

May I ask you to reconsider this patch as well as the patch "[PATCH] crypto: 
call put_page on used pages only" from September 10 since the current code of 
libkcapi can easily trigger these bugs and lead to a kernel crash.

If you consider the patches papering over the heart of the problem, may I ask 
for suggestions on how the mentioned code should be changed such that the 
issues are removed? If the suggestion is to re-architect the memory handling 
in the async part, may I ask to at least apply the patches for now with the 
goal to have time for re-architecting the async code and yet have no open 
holes that lead to crashes?

Thanks.

Ciao
Stephan

^ permalink raw reply

* [PATCH -next] hwrng: atmel - use clk_disable_unprepare instead of clk_disable
From: Wei Yongjun @ 2016-11-11 14:56 UTC (permalink / raw)
  To: Matt Mackall, Herbert Xu, Wenyou Yang, Nicolas Ferre
  Cc: Wei Yongjun, linux-crypto

From: Wei Yongjun <weiyongjun1@huawei.com>

Since clk_prepare_enable() is used to enable trng->clk, we should
use clk_disable_unprepare() to release it in the error path.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/char/hw_random/atmel-rng.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/hw_random/atmel-rng.c b/drivers/char/hw_random/atmel-rng.c
index ae7cae5..661c82c 100644
--- a/drivers/char/hw_random/atmel-rng.c
+++ b/drivers/char/hw_random/atmel-rng.c
@@ -94,7 +94,7 @@ static int atmel_trng_probe(struct platform_device *pdev)
 	return 0;
 
 err_register:
-	clk_disable(trng->clk);
+	clk_disable_unprepare(trng->clk);
 	return ret;
 }

^ permalink raw reply related

* Re: [PATCH v2 00/11] getting back -Wmaybe-uninitialized
From: Linus Torvalds @ 2016-11-11 17:13 UTC (permalink / raw)
  To: Arnd Bergmann, Srinivas Kandagatla, sayli karnik,
	Jonathan Cameron, Mark Brown
  Cc: Andrew Morton, Anna Schumaker, David S. Miller, Herbert Xu,
	Ilya Dryomov, Javier Martinez Canillas, Jiri Kosina, Ley Foon Tan,
	Luis R . Rodriguez, Martin Schwidefsky, Mauro Carvalho Chehab,
	Michal Marek, Russell King, Sean Young, Sebastian Ott,
	Trond Myklebust, the arch/x86 maintainers,
	Linux Kbuild mailing list, Linux
In-Reply-To: <20161110164454.293477-1-arnd-r2nGTMty4D4@public.gmane.org>

On Thu, Nov 10, 2016 at 8:44 AM, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote:
>
> Please merge these directly if you are happy with the result.

I will take this.

I do see two warnings, but they both seem to be valid and recent,
though, so I have no issues with the spurious cases.

Warning #1:

  sound/soc/qcom/lpass-platform.c: In function ‘lpass_platform_pcmops_open’:
  sound/soc/qcom/lpass-platform.c:83:29: warning: ‘dma_ch’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
    drvdata->substream[dma_ch] = substream;
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~

and 'dma_ch' usage there really is crazy and wrong. Broken by
022d00ee0b55 ("ASoC: lpass-platform: Fix broken pcm data usage")

Warning #2 is not a real bug, but it's reasonable that gcc doesn't
know that storage_bytes (chip->read_size) has to be 2/4. Again,
introduced recently by commit 231147ee77f3 ("iio: maxim_thermocouple:
Align 16 bit big endian value of raw reads"), so you didn't see it.

  drivers/iio/temperature/maxim_thermocouple.c: In function
‘maxim_thermocouple_read_raw’:
  drivers/iio/temperature/maxim_thermocouple.c:141:5: warning: ‘ret’
may be used uninitialized in this function [-Wmaybe-uninitialized]
    if (ret)
       ^
  drivers/iio/temperature/maxim_thermocouple.c:128:6: note: ‘ret’ was
declared here
    int ret;
        ^~~

and I guess that code can just initialize 'ret' to '-EINVAL' or
something to just make the theoretical "somehow we had a wrong
chip->read_size" case error out cleanly.
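
A minimal sketch of that suggestion, assuming a hypothetical helper `read_raw_be()` in which `storage_bytes` plays the role of chip->read_size; this is not the driver's actual code, only the pattern of defaulting 'ret' to an error:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical decode helper: assembles a big-endian value of 2 or 4
 * bytes. 'ret' starts out as -EINVAL, so any other size falls through
 * the switch and errors out instead of returning an uninitialized value.
 */
static int read_raw_be(const uint8_t *buf, unsigned int storage_bytes,
		       uint32_t *val)
{
	int ret = -EINVAL;

	switch (storage_bytes) {
	case 2:
		*val = ((uint32_t)buf[0] << 8) | buf[1];
		ret = 0;
		break;
	case 4:
		*val = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		       ((uint32_t)buf[2] << 8) | buf[3];
		ret = 0;
		break;
	}

	return ret;
}
```

Because every path now assigns 'ret', the compiler no longer has to prove that the size is always 2 or 4.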

                Linus

^ permalink raw reply

* Re: [PATCH v2 00/11] getting back -Wmaybe-uninitialized
From: Arnd Bergmann @ 2016-11-11 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sean Young, sayli karnik, Trond Myklebust, Srinivas Kandagatla,
	linux-s390, Herbert Xu, the arch/x86 maintainers, Sebastian Ott,
	Russell King, Javier Martinez Canillas, Ilya Dryomov, arcml,
	Linux Media Mailing List, Linux Kbuild mailing list, Jiri Kosina,
	Mark Brown, nios2-dev, Mauro Carvalho Chehab,
	Linux NFS Mailing List, gregkh,
	Linux Kernel Mailing List <linux
In-Reply-To: <CA+55aFx_scFVFKU__TBmoffw_iHvrdAU2dj5u1WKfWJXAkS4QA@mail.gmail.com>

On Friday, November 11, 2016 9:13:00 AM CET Linus Torvalds wrote:
> On Thu, Nov 10, 2016 at 8:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Please merge these directly if you are happy with the result.
> 
> I will take this.

Thanks a lot!
 
> I do see two warnings, but they both seem to be valid and recent,
> though, so I have no issues with the spurious cases.

Ok, both of them should have my fixes coming your way already.

> Warning #1:
> 
>   sound/soc/qcom/lpass-platform.c: In function ‘lpass_platform_pcmops_open’:
>   sound/soc/qcom/lpass-platform.c:83:29: warning: ‘dma_ch’ may be used
> uninitialized in this function [-Wmaybe-uninitialized]
>     drvdata->substream[dma_ch] = substream;
>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
> 
> and 'dma_ch' usage there really is crazy and wrong. Broken by
> 022d00ee0b55 ("ASoC: lpass-platform: Fix broken pcm data usage")

Right, the patches crossed here: the bugfix patch that introduced
this came into linux-next over the kernel summit, and the fix I
sent on Tuesday made it into Mark Brown's tree on Wednesday, but not
before you pulled the alsa tree. It should be fixed the next time you
pull from the alsa tree; the commit is

3b89e4b77ef9 ("ASoC: lpass-platform: initialize dma channel number")
 
> Warning #2 is not a real bug, but it's reasonable that gcc doesn't
> know that storage_bytes (chip->read_size) has to be 2/4. Again,
> introduced recently by commit 231147ee77f3 ("iio: maxim_thermocouple:
> Align 16 bit big endian value of raw reads"), so you didn't see it.

This is the one I mentioned in the commit message as one that
is fixed in linux-next and that should make it in soon.

>   drivers/iio/temperature/maxim_thermocouple.c: In function
> ‘maxim_thermocouple_read_raw’:
>   drivers/iio/temperature/maxim_thermocouple.c:141:5: warning: ‘ret’
> may be used uninitialized in this function [-Wmaybe-uninitialized]
>     if (ret)
>        ^
>   drivers/iio/temperature/maxim_thermocouple.c:128:6: note: ‘ret’ was
> declared here
>     int ret;
>         ^~~
> 
> and I guess that code can just initialize 'ret' to '-EINVAL' or
> something to just make the theoretical "somehow we had a wrong
> chip->read_size" case error out cleanly.

Right, that was my conclusion too. I sent the bugfix on Oct 25
for linux-next but it didn't make it in until this Monday, after
you pulled the patch that introduced it on Oct 29.

The commit in staging-testing is
32cb7d27e65d ("iio: maxim_thermocouple: detect invalid storage size in read()")

Greg and Jonathan, I see now that this is part of the 'iio-for-4.10b'
branch, so I suspect you were not planning to send this before the
merge window. Could you make sure this ends up in v4.9 so we get
a clean build when -Wmaybe-uninitialized gets enabled again?

	Arnd

^ permalink raw reply

* Re: [PATCH] crypto: arm64/sha2: integrate OpenSSL implementations of SHA256/SHA512
From: Will Deacon @ 2016-11-11 19:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, linux-arm-kernel, herbert, catalin.marinas, appro,
	victor.chong, daniel.thompson
In-Reply-To: <1478872273-16382-1-git-send-email-ard.biesheuvel@linaro.org>

On Fri, Nov 11, 2016 at 09:51:13PM +0800, Ard Biesheuvel wrote:
> This integrates both the accelerated scalar and the NEON implementations
> of SHA-224/256 as well as SHA-384/512 from the OpenSSL project.
> 
> Relative performance compared to the respective generic C versions:
> 
>                  |  SHA256-scalar  | SHA256-NEON* |  SHA512  |
>      ------------+-----------------+--------------+----------+
>      Cortex-A53  |      1.63x      |     1.63x    |   2.34x  |
>      Cortex-A57  |      1.43x      |     1.59x    |   1.95x  |
>      Cortex-A73  |      1.26x      |     1.56x    |     ?    |
> 
> The core crypto code was authored by Andy Polyakov of the OpenSSL
> project, in collaboration with whom the upstream code was adapted so
> that this module can be built from the same version of sha512-armv8.pl.
> 
> The version in this patch was taken from OpenSSL commit
> 
>    866e505e0d66 sha/asm/sha512-armv8.pl: add NEON version of SHA256.
> 
> * The core SHA algorithm is fundamentally sequential, but there is a
>   secondary transformation involved, called the schedule update, which
>   can be performed independently. The NEON version of SHA-224/SHA-256
>   only implements this part of the algorithm using NEON instructions,
>   the sequential part is always done using scalar instructions.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> 
> This supersedes the SHA-256-NEON-only patch I sent out about 6 weeks ago.
> 
> Will, Catalin: note that this pulls in a .pl script, and adds a build rule
> locally in arch/arm64/crypto to generate .S files on the fly from Perl
> scripts. I will leave it to you to decide whether you are ok with this as
> is, or whether you prefer .S_shipped files, in which case the Perl script
> is only included as a reference (this is how we did it for arch/arm in the
> past, but given that it adds about 3000 lines of generated code to the patch,
> I think we may want to simply keep it as below)

I think we should include the shipped files too. 3000 lines isn't that much
in the grand scheme of things, and there will be people who complain about
the unconditional perl dependency.

Will

^ permalink raw reply

* Re: [PATCH 2/3] crypto: AF_ALG - disregard AAD buffer space for output
From: Mat Martineau @ 2016-11-12  0:26 UTC (permalink / raw)
  To: Stephan Mueller; +Cc: herbert, linux-crypto
In-Reply-To: <3506033.FskOdlTquT@positron.chronox.de>


Stephan,

On Thu, 10 Nov 2016, Stephan Mueller wrote:

> The kernel crypto API AEAD cipher operation generates output such that
> space for the AAD is reserved in the output buffer without being
> touched. The processed ciphertext/plaintext is appended to the reserved
> AAD buffer.
>
> The user space interface followed that approach. However, this is a
> violation of the POSIX read definition which requires that any read data
> is placed at the beginning of the caller-provided buffer. As the kernel
> crypto API would leave room for the AAD, the old approach did not fully
> comply with the POSIX specification.
>
> The patch changes the user space AF_ALG AEAD interface such that the
> processed ciphertext/plaintext are now placed at the beginning of the
> user buffer provided with the read system call. That means the user
> space interface now deviates from the in-kernel output buffer handling.
>
> For the cipher operation, the AAD buffer provided during input is
> pointed to by a new SGL which is chained with the output buffer SGL.
> With this approach, only pointers to one copy of the AAD are maintained
> to avoid data duplication.
>
> With this solution, the caller must not use sendpage with the exact same
> buffers for input and output. The following rationale applies: When
> the caller sends the same buffer for input/output to the sendpage
> operation, the cipher operation now will write the ciphertext to the
> beginning of the buffer where the AAD used to be. The subsequent tag
> calculation will now use the data it finds where the AAD is expected.
> As the cipher operation has already replaced the AAD with the ciphertext,
> the tag calculation will take the ciphertext as AAD and thus calculate
> a wrong tag.

If it's not much overhead, I suggest checking for this condition and 
returning an error.
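
One way such a check could look, as a userspace sketch only: the helper name `check_distinct_buffers()` and the use of plain address ranges are assumptions, and an in-kernel check on the sendpage path would have to compare the pages behind the SGLs instead:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: reject input and output ranges that overlap.
 * Address arithmetic is assumed not to wrap around.
 */
static int check_distinct_buffers(const void *in, size_t in_len,
				  const void *out, size_t out_len)
{
	uintptr_t a = (uintptr_t)in;
	uintptr_t b = (uintptr_t)out;

	if (!in_len || !out_len)
		return 0;	/* an empty range cannot overlap */

	/* [a, a + in_len) and [b, b + out_len) intersect */
	if (a < b + out_len && b < a + in_len)
		return -EINVAL;

	return 0;
}
```

Identical or partially overlapping ranges return -EINVAL, while adjacent ranges pass.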

Other than that, I've done a quick test of the patches using sendmsg() and 
read() and found that they work as expected.

Thanks,
Mat



> Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
> Signed-off-by: Stephan Mueller <smueller@chronox.de>
> ---
> crypto/algif_aead.c | 143 +++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 113 insertions(+), 30 deletions(-)
>
> diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
> index c54bcb8..0212cc2 100644
> --- a/crypto/algif_aead.c
> +++ b/crypto/algif_aead.c
> @@ -32,6 +32,7 @@ struct aead_sg_list {
> struct aead_async_rsgl {
> 	struct af_alg_sgl sgl;
> 	struct list_head list;
> +	bool new_page;
> };
>
> struct aead_async_req {
> @@ -405,6 +406,61 @@ static void aead_async_cb(struct crypto_async_request *_req, int err)
> 	iocb->ki_complete(iocb, err, err);
> }
>
> +/**
> + * scatterwalk_get_part() - get a subset of a scatterlist
> + *
> + * @dst: destination SGL to receive the pointers from source SGL
> + * @src: source SGL
> + * @len: data length in bytes to get from source SGL
> + * @max_sgs: number of SGs present in dst SGL to prevent overstepping boundaries
> + *
> + * @return: number of SG entries in dst
> + */
> +static inline int scatterwalk_get_part(struct scatterlist *dst,
> +				       struct scatterlist *src,
> +				       unsigned int len, unsigned int max_sgs)
> +{
> +	/* leave one SG entry for chaining */
> +	unsigned int j = 1;
> +
> +	while (len && j < max_sgs) {
> +		unsigned int todo = min_t(unsigned int, len, src->length);
> +
> +		sg_set_page(dst, sg_page(src), todo, src->offset);
> +		if (src->length >= len) {
> +			sg_mark_end(dst);
> +			break;
> +		}
> +		len -= todo;
> +		j++;
> +		src = sg_next(src);
> +		dst = sg_next(dst);
> +	}
> +
> +	return j;
> +}
> +
> +static inline int aead_alloc_rsgl(struct sock *sk, struct aead_async_rsgl **ret)
> +{
> +	struct aead_async_rsgl *rsgl =
> +				sock_kmalloc(sk, sizeof(*rsgl), GFP_KERNEL);
> +	if (unlikely(!rsgl))
> +		return -ENOMEM;
> +	*ret = rsgl;
> +	return 0;
> +}
> +
> +static inline int aead_get_rsgl_areq(struct sock *sk,
> +				     struct aead_async_req *areq,
> +				     struct aead_async_rsgl **ret)
> +{
> +	if (list_empty(&areq->list)) {
> +		*ret = &areq->first_rsgl;
> +		return 0;
> +	} else
> +		return aead_alloc_rsgl(sk, ret);
> +}
> +
> static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
> 			      int flags)
> {
> @@ -433,7 +489,7 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
> 	if (!aead_sufficient_data(ctx))
> 		goto unlock;
>
> -	used = ctx->used;
> +	used = ctx->used - ctx->aead_assoclen;
> 	if (ctx->enc)
> 		outlen = used + as;
> 	else
> @@ -452,7 +508,6 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
> 	aead_request_set_ad(req, ctx->aead_assoclen);
> 	aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> 				  aead_async_cb, sk);
> -	used -= ctx->aead_assoclen;
>
> 	/* take over all tx sgls from ctx */
> 	areq->tsgl = sock_kmalloc(sk, sizeof(*areq->tsgl) * sgl->cur,
> @@ -467,21 +522,26 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
>
> 	areq->tsgls = sgl->cur;
>
> +	/* set AAD buffer */
> +	err = aead_get_rsgl_areq(sk, areq, &rsgl);
> +	if (err)
> +		goto free;
> +	list_add_tail(&rsgl->list, &areq->list);
> +	sg_init_table(rsgl->sgl.sg, ALG_MAX_PAGES);
> +	rsgl->sgl.npages = scatterwalk_get_part(rsgl->sgl.sg, sgl->sg,
> +						ctx->aead_assoclen,
> +						ALG_MAX_PAGES);
> +	rsgl->new_page = false;
> +	last_rsgl = rsgl;
> +
> 	/* create rx sgls */
> 	while (outlen > usedpages && iov_iter_count(&msg->msg_iter)) {
> 		size_t seglen = min_t(size_t, iov_iter_count(&msg->msg_iter),
> 				      (outlen - usedpages));
>
> -		if (list_empty(&areq->list)) {
> -			rsgl = &areq->first_rsgl;
> -
> -		} else {
> -			rsgl = sock_kmalloc(sk, sizeof(*rsgl), GFP_KERNEL);
> -			if (unlikely(!rsgl)) {
> -				err = -ENOMEM;
> -				goto free;
> -			}
> -		}
> +		err = aead_get_rsgl_areq(sk, areq, &rsgl);
> +		if (err)
> +			goto free;
> 		rsgl->sgl.npages = 0;
> 		list_add_tail(&rsgl->list, &areq->list);
>
> @@ -491,6 +551,7 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
> 			goto free;
>
> 		usedpages += err;
> +		rsgl->new_page = true;
>
> 		/* chain the new scatterlist with previous one */
> 		if (last_rsgl)
> @@ -507,7 +568,7 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
>
> 		if (used < less) {
> 			err = -EINVAL;
> -			goto unlock;
> +			goto free;
> 		}
> 		used -= less;
> 		outlen -= less;
> @@ -531,7 +592,8 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
>
> free:
> 	list_for_each_entry(rsgl, &areq->list, list) {
> -		af_alg_free_sg(&rsgl->sgl);
> +		if (rsgl->new_page)
> +			af_alg_free_sg(&rsgl->sgl);
> 		if (rsgl != &areq->first_rsgl)
> 			sock_kfree_s(sk, rsgl, sizeof(*rsgl));
> 	}
> @@ -545,6 +607,16 @@ static int aead_recvmsg_async(struct socket *sock, struct msghdr *msg,
> 	return err ? err : outlen;
> }
>
> +static inline int aead_get_rsgl_ctx(struct sock *sk, struct aead_ctx *ctx,
> +				    struct aead_async_rsgl **ret)
> +{
> +	if (list_empty(&ctx->list)) {
> +		*ret = &ctx->first_rsgl;
> +		return 0;
> +	} else
> +		return aead_alloc_rsgl(sk, ret);
> +}
> +
> static int aead_recvmsg_sync(struct socket *sock, struct msghdr *msg, int flags)
> {
> 	struct sock *sk = sock->sk;
> @@ -582,9 +654,6 @@ static int aead_recvmsg_sync(struct socket *sock, struct msghdr *msg, int flags)
> 			goto unlock;
> 	}
>
> -	/* data length provided by caller via sendmsg/sendpage */
> -	used = ctx->used;
> -
> 	/*
> 	 * Make sure sufficient data is present -- note, the same check is
> 	 * is also present in sendmsg/sendpage. The checks in sendpage/sendmsg
> @@ -598,6 +667,12 @@ static int aead_recvmsg_sync(struct socket *sock, struct msghdr *msg, int flags)
> 		goto unlock;
>
> 	/*
> +	 * The cipher operation input data is reduced by the associated data
> +	 * as the destination buffer will not hold the AAD.
> +	 */
> +	used = ctx->used - ctx->aead_assoclen;
> +
> +	/*
> 	 * Calculate the minimum output buffer size holding the result of the
> 	 * cipher operation. When encrypting data, the receiving buffer is
> 	 * larger by the tag length compared to the input buffer as the
> @@ -611,25 +686,29 @@ static int aead_recvmsg_sync(struct socket *sock, struct msghdr *msg, int flags)
> 		outlen = used - as;
>
> 	/*
> -	 * The cipher operation input data is reduced by the associated data
> -	 * length as this data is processed separately later on.
> +	 * Pre-pend the AAD buffer from the source SGL to the destination SGL.
> +	 * As the AAD buffer is not touched by the AEAD operation, the source
> +	 * SG buffers remain unchanged.
> 	 */
> -	used -= ctx->aead_assoclen;
> +	err = aead_get_rsgl_ctx(sk, ctx, &rsgl);
> +	if (err)
> +		goto unlock;
> +	list_add_tail(&rsgl->list, &ctx->list);
> +	sg_init_table(rsgl->sgl.sg, ALG_MAX_PAGES);
> +	rsgl->sgl.npages = scatterwalk_get_part(rsgl->sgl.sg, sgl->sg,
> +						ctx->aead_assoclen,
> +						ALG_MAX_PAGES);
> +	rsgl->new_page = false;
> +	last_rsgl = rsgl;
>
> 	/* convert iovecs of output buffers into scatterlists */
> 	while (outlen > usedpages && iov_iter_count(&msg->msg_iter)) {
> 		size_t seglen = min_t(size_t, iov_iter_count(&msg->msg_iter),
> 				      (outlen - usedpages));
>
> -		if (list_empty(&ctx->list)) {
> -			rsgl = &ctx->first_rsgl;
> -		} else {
> -			rsgl = sock_kmalloc(sk, sizeof(*rsgl), GFP_KERNEL);
> -			if (unlikely(!rsgl)) {
> -				err = -ENOMEM;
> -				goto unlock;
> -			}
> -		}
> +		err = aead_get_rsgl_ctx(sk, ctx, &rsgl);
> +		if (err)
> +			goto unlock;
> 		rsgl->sgl.npages = 0;
> 		list_add_tail(&rsgl->list, &ctx->list);
>
> @@ -637,7 +716,10 @@ static int aead_recvmsg_sync(struct socket *sock, struct msghdr *msg, int flags)
> 		err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen);
> 		if (err < 0)
> 			goto unlock;
> +
> 		usedpages += err;
> +		rsgl->new_page = true;
> +
> 		/* chain the new scatterlist with previous one */
> 		if (last_rsgl)
> 			af_alg_link_sg(&last_rsgl->sgl, &rsgl->sgl);
> @@ -688,7 +770,8 @@ static int aead_recvmsg_sync(struct socket *sock, struct msghdr *msg, int flags)
>
> unlock:
> 	list_for_each_entry_safe(rsgl, tmp, &ctx->list, list) {
> -		af_alg_free_sg(&rsgl->sgl);
> +		if (rsgl->new_page)
> +			af_alg_free_sg(&rsgl->sgl);
> 		if (rsgl != &ctx->first_rsgl)
> 			sock_kfree_s(sk, rsgl, sizeof(*rsgl));
> 		list_del(&rsgl->list);
> -- 
> 2.7.4
>
>
>

--
Mat Martineau
Intel OTC

^ permalink raw reply

* Re: [PATCH 2/3] crypto: AF_ALG - disregard AAD buffer space for output
From: Herbert Xu @ 2016-11-12  1:55 UTC (permalink / raw)
  To: Stephan Mueller; +Cc: mathew.j.martineau, linux-crypto
In-Reply-To: <3506033.FskOdlTquT@positron.chronox.de>

On Thu, Nov 10, 2016 at 04:32:03AM +0100, Stephan Mueller wrote:
> The kernel crypto API AEAD cipher operation generates output such that
> space for the AAD is reserved in the output buffer without being
> touched. The processed ciphertext/plaintext is appended to the reserved
> AAD buffer.
> 
> The user space interface followed that approach. However, this is a
> violation of the POSIX read definition which requires that any read data
> is placed at the beginning of the caller-provided buffer. As the kernel
> crypto API would leave room for the AAD, the old approach did not fully
> comply with the POSIX specification.

Nack.  The kernel AEAD API will copy the AD as is; it definitely
does not leave the output untouched, unless of course it is
an in-place operation.  The user-space operation should operate
in the same manner.
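
As a rough illustration of the destination layout this implies (AD first, then the processed text, with the tag appended when encrypting), here is a small sketch. The struct and helper names are hypothetical and not part of the kernel API; only the length arithmetic is shown:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical description of an AEAD destination buffer. */
struct aead_dst_layout {
	size_t ad_len;		/* AD copied as is, at offset 0 */
	size_t text_off;	/* ciphertext/plaintext starts after the AD */
	size_t text_len;	/* includes the tag when encrypting */
	size_t total_len;	/* bytes the destination must hold */
};

/*
 * 'cryptlen' is the input text length: plaintext when encrypting,
 * ciphertext plus tag when decrypting.
 */
static int aead_dst_layout_init(struct aead_dst_layout *l, size_t assoclen,
				size_t cryptlen, size_t authsize, int enc)
{
	if (!enc && cryptlen < authsize)
		return -EINVAL;	/* input too short to contain a tag */

	l->ad_len = assoclen;
	l->text_off = assoclen;
	l->text_len = enc ? cryptlen + authsize : cryptlen - authsize;
	l->total_len = l->text_off + l->text_len;
	return 0;
}
```

For 16 bytes of AD, 64 bytes of plaintext and a 16-byte tag, encryption needs a 96-byte destination with the ciphertext starting at offset 16.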

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
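
[Editorial sketch] The layout under discussion, with the AD copied to the
front of the destination followed by the processed text and (on encryption)
the tag, can be written down as plain offset arithmetic. This is a userspace
illustration only. The struct and function names are hypothetical, and the
cryptlen convention (plaintext length on encryption, ciphertext plus tag on
decryption) is assumed to match the kernel AEAD request documentation.

```c
#include <stddef.h>

/*
 * Illustrative only: byte layout of an AEAD destination buffer,
 * assuming AD || text || tag. All names here are hypothetical.
 */
struct aead_layout {
	size_t ad_off;   /* copied associated data starts here */
	size_t text_off; /* ciphertext (enc) or plaintext (dec) */
	size_t tag_off;  /* end of text; the tag follows on encryption */
	size_t total;    /* bytes of the destination that are written */
};

/*
 * cryptlen is the plaintext length when encrypting and the
 * ciphertext length plus tag when decrypting (assumed convention).
 * Returns -1 if a decryption request is shorter than the tag.
 */
int aead_dst_layout(size_t assoclen, size_t cryptlen, size_t authsize,
		    int enc, struct aead_layout *l)
{
	size_t textlen;

	if (!enc && cryptlen < authsize)
		return -1;

	textlen = enc ? cryptlen : cryptlen - authsize;
	l->ad_off = 0;
	l->text_off = assoclen;
	l->tag_off = assoclen + textlen;
	l->total = l->tag_off + (enc ? authsize : 0);
	return 0;
}
```

With 16 bytes of AD, 32 bytes of plaintext and a 16-byte tag, the text starts
at offset 16 and the tag at offset 48, so a reader of the destination buffer
has to skip assoclen bytes to reach the ciphertext.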

^ permalink raw reply

* Re: [PATCH 2/3] crypto: AF_ALG - disregard AAD buffer space for output
From: Stephan Mueller @ 2016-11-12  2:12 UTC (permalink / raw)
  To: Mat Martineau; +Cc: herbert, linux-crypto
In-Reply-To: <alpine.OSX.2.20.1611111622500.1815@mjmartin-mac01.sea.intel.com>

On Friday, 11 November 2016, 16:26:12 CET, Mat Martineau wrote:

Hi Mat,
> > 
> > With this solution, the caller must not use sendpage with the exact same
> > buffers for input and output. The following rationale applies: When
> > the caller sends the same buffer for input/output to the sendpage
> > operation, the cipher operation now will write the ciphertext to the
> > beginning of the buffer where the AAD used to be. The subsequent tag
> > calculation will now use the data it finds where the AAD is expected.
> > As the cipher operation has already replaced the AAD with the ciphertext,
> > the tag calculation will take the ciphertext as AAD and thus calculate
> > a wrong tag.
> 
> If it's not much overhead, I suggest checking for this condition and
> returning an error.

I can certainly look into that, but Herbert's NACK makes it unlikely that
this patch will be accepted.
> 
> Other than that, I've done a quick test of the patches using sendmsg() and
> read() and found that they work as expected.
> 
Thanks for testing.

Ciao
Stephan

^ permalink raw reply

* Re: [PATCH 2/3] crypto: AF_ALG - disregard AAD buffer space for output
From: Stephan Mueller @ 2016-11-12  2:03 UTC (permalink / raw)
  To: Herbert Xu; +Cc: mathew.j.martineau, linux-crypto
In-Reply-To: <20161112015519.GA32234@gondor.apana.org.au>

On Saturday, 12 November 2016, 09:55:19 CET, Herbert Xu wrote:

Hi Herbert,

> On Thu, Nov 10, 2016 at 04:32:03AM +0100, Stephan Mueller wrote:
> > The kernel crypto API AEAD cipher operation generates output such that
> > space for the AAD is reserved in the output buffer without being
> > touched. The processed ciphertext/plaintext is appended to the reserved
> > AAD buffer.
> > 
> > The user space interface followed that approach. However, this is a
> > violation of the POSIX read definition which requires that any read data
> > is placed at the beginning of the caller-provided buffer. As the kernel
> > crypto API would leave room for the AAD, the old approach did not fully
> > comply with the POSIX specification.
> 
> Nack.  The kernel AEAD API will copy the AD as is, it definitely
> does not leave the output untouched unless of course when it is
> an in-place operation.  The user-space operation should operate
> in the same manner.

When you have separate buffers, the kernel does not seem to copy the AD over 
to the target buffer.
> 
> Cheers,


Ciao
Stephan

^ permalink raw reply

* Re: [PATCH 2/3] crypto: AF_ALG - disregard AAD buffer space for output
From: Herbert Xu @ 2016-11-12  2:13 UTC (permalink / raw)
  To: Stephan Mueller; +Cc: mathew.j.martineau, linux-crypto
In-Reply-To: <11739696.UFOoJjMX73@positron.chronox.de>

On Sat, Nov 12, 2016 at 03:03:36AM +0100, Stephan Mueller wrote:
> 
> When you have separate buffers, the kernel does not seem to copy the AD over 
> to the target buffer.

OK we should definitely fix that.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] crypto: arm64/sha2: integrate OpenSSL implementations of SHA256/SHA512
From: Ard Biesheuvel @ 2016-11-12 12:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-crypto@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Herbert Xu, Catalin Marinas,
	Andy Polyakov, Victor Chong, Daniel Thompson
In-Reply-To: <20161111195607.GB4457@arm.com>

On 11 November 2016 at 20:56, Will Deacon <will.deacon@arm.com> wrote:
> On Fri, Nov 11, 2016 at 09:51:13PM +0800, Ard Biesheuvel wrote:
>> This integrates both the accelerated scalar and the NEON implementations
>> of SHA-224/256 as well as SHA-384/512 from the OpenSSL project.
>>
>> Relative performance compared to the respective generic C versions:
>>
>>                  |  SHA256-scalar  | SHA256-NEON* |  SHA512  |
>>      ------------+-----------------+--------------+----------+
>>      Cortex-A53  |      1.63x      |     1.63x    |   2.34x  |
>>      Cortex-A57  |      1.43x      |     1.59x    |   1.95x  |
>>      Cortex-A73  |      1.26x      |     1.56x    |     ?    |
>>
>> The core crypto code was authored by Andy Polyakov of the OpenSSL
>> project, in collaboration with whom the upstream code was adapted so
>> that this module can be built from the same version of sha512-armv8.pl.
>>
>> The version in this patch was taken from OpenSSL commit
>>
>>    866e505e0d66 sha/asm/sha512-armv8.pl: add NEON version of SHA256.
>>
>> * The core SHA algorithm is fundamentally sequential, but there is a
>>   secondary transformation involved, called the schedule update, which
>>   can be performed independently. The NEON version of SHA-224/SHA-256
>>   only implements this part of the algorithm using NEON instructions,
>>   the sequential part is always done using scalar instructions.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>
>> This supersedes the SHA-256-NEON-only patch I sent out about 6 weeks ago.
>>
>> Will, Catalin: note that this pulls in a .pl script, and adds a build rule
>> locally in arch/arm64/crypto to generate .S files on the fly from Perl
>> scripts. I will leave it to you to decide whether you are ok with this as
>> is, or whether you prefer .S_shipped files, in which case the Perl script
>> is only included as a reference (this is how we did it for arch/arm in the
>> past, but given that it adds about 3000 lines of generated code to the patch,
>> I think we may want to simply keep it as below)
>
> I think we should include the shipped files too. 3000 lines isn't that much
> in the grand scheme of things, and there will be people who complain about
> the unconditional perl dependency.
>

OK, fair enough. I will repost with the generated files included.
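
[Editorial sketch] The "schedule update" mentioned in the quoted commit
message is the SHA-256 message schedule expansion from FIPS 180-4. Each
schedule word depends only on earlier schedule words and never on the
working state of the compression rounds, which is why it can be computed
apart from the sequential part. The plain C version below is a reference
sketch, not the OpenSSL or kernel code, and the function names are
hypothetical.

```c
#include <stdint.h>

/* Rotate right; only called with 0 < n < 32 below. */
static inline uint32_t rotr32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* FIPS 180-4 "small sigma" functions used by the schedule. */
static inline uint32_t ssig0(uint32_t x)
{
	return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static inline uint32_t ssig1(uint32_t x)
{
	return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}

/*
 * Expand the 16 message words of one block (already converted from
 * big-endian) into the 64-word schedule consumed by the rounds.
 */
void sha256_schedule(const uint32_t block[16], uint32_t w[64])
{
	int t;

	for (t = 0; t < 16; t++)
		w[t] = block[t];
	for (t = 16; t < 64; t++)
		w[t] = ssig1(w[t - 2]) + w[t - 7] + ssig0(w[t - 15]) + w[t - 16];
}
```

Note that w[t] still depends on w[t - 2], so a vector implementation cannot
naively compute four consecutive words at once; how the NEON code handles
that is not covered here.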

^ permalink raw reply

* Re: [PATCH v2 00/11] getting back -Wmaybe-uninitialized
From: Jonathan Cameron @ 2016-11-12 13:27 UTC (permalink / raw)
  To: Arnd Bergmann, Linus Torvalds
  Cc: Srinivas Kandagatla, sayli karnik, Mark Brown, Andrew Morton,
	Anna Schumaker, David S. Miller, Herbert Xu, Ilya Dryomov,
	Javier Martinez Canillas, Jiri Kosina, Ley Foon Tan,
	Luis R . Rodriguez, Martin Schwidefsky, Mauro Carvalho Chehab,
	Michal Marek, Russell King, Sean Young, Sebastian Ott,
	Trond Myklebust <trond.myklebu
In-Reply-To: <2695221.kyRJMsRMjs@wuerfel>

On 11/11/16 19:49, Arnd Bergmann wrote:
> On Friday, November 11, 2016 9:13:00 AM CET Linus Torvalds wrote:
>> On Thu, Nov 10, 2016 at 8:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>>
>>> Please merge these directly if you are happy with the result.
>>
>> I will take this.
> 
> Thanks a lot!
>  
>> I do see two warnings, but they both seem to be valid and recent,
>> though, so I have no issues with the spurious cases.
> 
> Ok, both of them should have my fixes coming your way already.
> 
>> Warning #1:
>>
>>   sound/soc/qcom/lpass-platform.c: In function ‘lpass_platform_pcmops_open’:
>>   sound/soc/qcom/lpass-platform.c:83:29: warning: ‘dma_ch’ may be used
>> uninitialized in this function [-Wmaybe-uninitialized]
>>     drvdata->substream[dma_ch] = substream;
>>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
>>
>> and 'dma_ch' usage there really is crazy and wrong. Broken by
>> 022d00ee0b55 ("ASoC: lpass-platform: Fix broken pcm data usage")
> 
> Right, the patches crossed here, the bugfix patch that introduced
> this came into linux-next over the kernel summit, and the fix I
> sent on Tuesday made it into Mark Brown's tree on Wednesday but not
> before you pulled alsa tree. It should be fixed the next time you
> pull from the alsa tree, the commit is
> 
> 3b89e4b77ef9 ("ASoC: lpass-platform: initialize dma channel number")
>  
>> Warning #2 is not a real bug, but it's reasonable that gcc doesn't
>> know that storage_bytes (chip->read_size) has to be 2/4. Again,
>> introduced recently by commit 231147ee77f3 ("iio: maxim_thermocouple:
>> Align 16 bit big endian value of raw reads"), so you didn't see it.
> 
> This is the one I mentioned in the commit message as one that
> is fixed in linux-next and that should make it in soon.
> 
>>   drivers/iio/temperature/maxim_thermocouple.c: In function
>> ‘maxim_thermocouple_read_raw’:
>>   drivers/iio/temperature/maxim_thermocouple.c:141:5: warning: ‘ret’
>> may be used uninitialized in this function [-Wmaybe-uninitialized]
>>     if (ret)
>>        ^
>>   drivers/iio/temperature/maxim_thermocouple.c:128:6: note: ‘ret’ was
>> declared here
>>     int ret;
>>         ^~~
>>
>> and I guess that code can just initialize 'ret' to '-EINVAL' or
>> something to just make the theoretical "somehow we had a wrong
>> chip->read_size" case error out cleanly.
> 
> Right, that was my conclusion too. I sent the bugfix on Oct 25
> for linux-next but it didn't make it in until this Monday, after
> you pulled the patch that introduced it on Oct 29.
> 
> The commit in staging-testing is
> 32cb7d27e65d ("iio: maxim_thermocouple: detect invalid storage size in read()")
> 
> Greg and Jonathan, I see now that this is part of the 'iio-for-4.10b'
> branch, so I suspect you were not planning to send this before the
> merge window. Could you make sure this ends up in v4.9 so we get
> a clean build when -Wmaybe-uninitialized gets enabled again?
I'll queue this up and send a pull to Greg tomorrow.

Was highly doubtful that a false warning suppression (be it an
understandable one) was worth sending mid cycle, hence it was
taking the slow route.

Jonathan
> 
> 	Arnd
> 


^ permalink raw reply

* Re: [PATCH v3] crypto: arm64/sha2: integrate OpenSSL implementations of SHA256/SHA512
From: Will Deacon @ 2016-11-12 22:15 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-crypto, linux-arm-kernel, herbert, catalin.marinas
In-Reply-To: <1478953953-11523-1-git-send-email-ard.biesheuvel@linaro.org>

Hi Ard,

On Sat, Nov 12, 2016 at 01:32:33PM +0100, Ard Biesheuvel wrote:
> This integrates both the accelerated scalar and the NEON implementations
> of SHA-224/256 as well as SHA-384/512 from the OpenSSL project.
> 
> Relative performance compared to the respective generic C versions:
> 
>                  |  SHA256-scalar  | SHA256-NEON* |  SHA512  |
>      ------------+-----------------+--------------+----------+
>      Cortex-A53  |      1.63x      |     1.63x    |   2.34x  |
>      Cortex-A57  |      1.43x      |     1.59x    |   1.95x  |
>      Cortex-A73  |      1.26x      |     1.56x    |     ?    |
> 
> The core crypto code was authored by Andy Polyakov of the OpenSSL
> project, in collaboration with whom the upstream code was adapted so
> that this module can be built from the same version of sha512-armv8.pl.
> 
> The version in this patch was taken from OpenSSL commit
> 
>    866e505e0d66 sha/asm/sha512-armv8.pl: add NEON version of SHA256.
> 
> * The core SHA algorithm is fundamentally sequential, but there is a
>   secondary transformation involved, called the schedule update, which
>   can be performed independently. The NEON version of SHA-224/SHA-256
>   only implements this part of the algorithm using NEON instructions,
>   the sequential part is always done using scalar instructions.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> v3: at Will's request, the generated assembly files are now included
>     as .S_shipped files, for which generic build rules are defined
>     already. Note that this has caused issues in the past with
>     patchwork, so for Herbert's convenience, the patch can be pulled
>     from http://git.kernel.org/cgit/linux/kernel/git/ardb/linux.git,
>     branch arm64-sha256 (based on today's cryptodev)

Thanks.

Looking at the generated code, I see references to __ARMEB__ and __ILP32__.
The former is probably a bug, whilst the second is not required. There are
also some commented out instructions, which is weird.

Will

^ permalink raw reply

* Re: [PATCH v3] poly1305: generic C can be faster on chips with slow unaligned access
From: kbuild test robot @ 2016-11-12 23:27 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: kbuild-all, Herbert Xu, David S. Miller, linux-crypto,
	linux-kernel, Martin Willi, Eric Biggers, René van Dorst,
	Jason A. Donenfeld
In-Reply-To: <20161107194345.19955-1-Jason@zx2c4.com>

[-- Attachment #1: Type: text/plain, Size: 1708 bytes --]

Hi Jason,

[auto build test ERROR on cryptodev/master]
[also build test ERROR on v4.9-rc4 next-20161111]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jason-A-Donenfeld/poly1305-generic-C-can-be-faster-on-chips-with-slow-unaligned-access/20161108-053912
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
config: openrisc-allmodconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All errors (new ones prefixed by >>):

   crypto/poly1305_generic.c: In function 'poly1305_setrkey':
>> crypto/poly1305_generic.c:63:2: error: implicit declaration of function 'get_unaligned_le32'

vim +/get_unaligned_le32 +63 crypto/poly1305_generic.c

    57	}
    58	EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
    59	
    60	static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
    61	{
    62		/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
  > 63		dctx->r[0] = (get_unaligned_le32(key +  0) >> 0) & 0x3ffffff;
    64		dctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    65		dctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    66		dctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39242 bytes --]

^ permalink raw reply

* Re: [PATCH v3] poly1305: generic C can be faster on chips with slow unaligned access
From: Jason A. Donenfeld @ 2016-11-12 23:31 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Herbert Xu, David S. Miller, linux-crypto, LKML,
	Martin Willi, Eric Biggers, René van Dorst

Hello friendly test robot,

On Sun, Nov 13, 2016 at 12:27 AM, kbuild test robot <lkp@intel.com> wrote:
> Hi Jason,
>
> [auto build test ERROR on cryptodev/master]

That error was fixed by v4 in this series. The version that should be
tested and ultimately applied is v4 and can be found here:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1265977.html

Regards from a human,
Jason
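
[Editorial sketch] The reported error is an implicit declaration, meaning no
header visible to that file declared get_unaligned_le32() for this
configuration. In kernels of that era the helper was provided through
<asm/unaligned.h>; the thread does not state how v4 resolved the error. For
readers outside the kernel tree, a portable userspace stand-in with the same
meaning (the name load_le32 is hypothetical) could look like this:

```c
#include <stdint.h>

/*
 * Read a 32-bit little-endian value from a pointer with no alignment
 * requirement. Assembling the value byte by byte avoids unaligned
 * word loads and is independent of host endianness.
 */
uint32_t load_le32(const void *p)
{
	const uint8_t *b = p;

	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}
```

This matters for the quoted poly1305 code because the key is read at offsets
3, 6 and 9, which are not 4-byte aligned.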

^ permalink raw reply

* Re: [PATCH v2 00/11] getting back -Wmaybe-uninitialized
From: Greg KH @ 2016-11-13  8:47 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Arnd Bergmann, Linus Torvalds, Srinivas Kandagatla, sayli karnik,
	Mark Brown, Andrew Morton, Anna Schumaker, David S. Miller,
	Herbert Xu, Ilya Dryomov, Javier Martinez Canillas, Jiri Kosina,
	Ley Foon Tan, Luis R . Rodriguez, Martin Schwidefsky,
	Mauro Carvalho Chehab, Michal Marek, Russell King, Sean Young
In-Reply-To: <f6dccd27-09d2-1842-220b-24aa84043674@kernel.org>

On Sat, Nov 12, 2016 at 01:27:12PM +0000, Jonathan Cameron wrote:
> On 11/11/16 19:49, Arnd Bergmann wrote:
> > On Friday, November 11, 2016 9:13:00 AM CET Linus Torvalds wrote:
> >> On Thu, Nov 10, 2016 at 8:44 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >>>
> >>> Please merge these directly if you are happy with the result.
> >>
> >> I will take this.
> > 
> > Thanks a lot!
> >  
> >> I do see two warnings, but they both seem to be valid and recent,
> >> though, so I have no issues with the spurious cases.
> > 
> > Ok, both of them should have my fixes coming your way already.
> > 
> >> Warning #1:
> >>
> >>   sound/soc/qcom/lpass-platform.c: In function ‘lpass_platform_pcmops_open’:
> >>   sound/soc/qcom/lpass-platform.c:83:29: warning: ‘dma_ch’ may be used
> >> uninitialized in this function [-Wmaybe-uninitialized]
> >>     drvdata->substream[dma_ch] = substream;
> >>     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
> >>
> >> and 'dma_ch' usage there really is crazy and wrong. Broken by
> >> 022d00ee0b55 ("ASoC: lpass-platform: Fix broken pcm data usage")
> > 
> > Right, the patches crossed here, the bugfix patch that introduced
> > this came into linux-next over the kernel summit, and the fix I
> > sent on Tuesday made it into Mark Brown's tree on Wednesday but not
> > before you pulled alsa tree. It should be fixed the next time you
> > pull from the alsa tree, the commit is
> > 
> > 3b89e4b77ef9 ("ASoC: lpass-platform: initialize dma channel number")
> >  
> >> Warning #2 is not a real bug, but it's reasonable that gcc doesn't
> >> know that storage_bytes (chip->read_size) has to be 2/4. Again,
> >> introduced recently by commit 231147ee77f3 ("iio: maxim_thermocouple:
> >> Align 16 bit big endian value of raw reads"), so you didn't see it.
> > 
> > This is the one I mentioned in the commit message as one that
> > is fixed in linux-next and that should make it in soon.
> > 
> >>   drivers/iio/temperature/maxim_thermocouple.c: In function
> >> ‘maxim_thermocouple_read_raw’:
> >>   drivers/iio/temperature/maxim_thermocouple.c:141:5: warning: ‘ret’
> >> may be used uninitialized in this function [-Wmaybe-uninitialized]
> >>     if (ret)
> >>        ^
> >>   drivers/iio/temperature/maxim_thermocouple.c:128:6: note: ‘ret’ was
> >> declared here
> >>     int ret;
> >>         ^~~
> >>
> >> and I guess that code can just initialize 'ret' to '-EINVAL' or
> >> something to just make the theoretical "somehow we had a wrong
> >> chip->read_size" case error out cleanly.
> > 
> > Right, that was my conclusion too. I sent the bugfix on Oct 25
> > for linux-next but it didn't make it in until this Monday, after
> > you pulled the patch that introduced it on Oct 29.
> > 
> > The commit in staging-testing is
> > 32cb7d27e65d ("iio: maxim_thermocouple: detect invalid storage size in read()")
> > 
> > Greg and Jonathan, I see now that this is part of the 'iio-for-4.10b'
> > branch, so I suspect you were not planning to send this before the
> > merge window. Could you make sure this ends up in v4.9 so we get
> > a clean build when -Wmaybe-uninitialized gets enabled again?
> I'll queue this up and send a pull to Greg tomorrow.
> 
> Was highly doubtful that a false warning suppression (be it an
> understandable one) was worth sending mid cycle, hence it was
> taking the slow route.

I can just cherry-pick this, no need to send a separate pull request.

greg k-h
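
[Editorial sketch] The pattern discussed above for warning #2, where the
return value starts out as an error so that an unexpected size fails cleanly
instead of returning an uninitialized value, can be illustrated as follows.
This is a hypothetical stand-in, not the maxim_thermocouple driver code; the
function name and signature are invented for the example.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Decode a big-endian sample of 2 or 4 bytes. Any other size leaves
 * *val untouched and returns -EINVAL.
 */
int read_sample_be(const uint8_t *buf, unsigned int storage_bytes,
		   uint32_t *val)
{
	int ret = -EINVAL;	/* unexpected sizes fall through as an error */

	switch (storage_bytes) {
	case 2:
		*val = ((uint32_t)buf[0] << 8) | buf[1];
		ret = 0;
		break;
	case 4:
		*val = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		       ((uint32_t)buf[2] << 8) | buf[3];
		ret = 0;
		break;
	}

	return ret;
}
```

Because ret is assigned on every path, the compiler no longer has to prove
that storage_bytes can only be 2 or 4.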

^ permalink raw reply

* Re: [PATCH V2 6/9] crypto: ccp - Add support for RSA on the CCP
From: Herbert Xu @ 2016-11-13  9:39 UTC (permalink / raw)
  To: Gary R Hook; +Cc: linux-crypto, thomas.lendacky, davem
In-Reply-To: <20161104160432.18155.29136.stgit@taos>

On Fri, Nov 04, 2016 at 11:04:32AM -0500, Gary R Hook wrote:
>
> +	ctx->u.rsa.pkey.e = mpi_read_raw_data(raw_key.e, raw_key.e_sz);
> +	if (!ctx->u.rsa.pkey.e)
> +		goto e_ret;
> +	ctx->u.rsa.e_buf = mpi_get_buffer(ctx->u.rsa.pkey.e,
> +					  &ctx->u.rsa.e_len, NULL);

You're converting a raw integer into an MPI and then back again.
Why?

In general drivers shouldn't touch the MPI stuff at all since the
hardware generally deals with raw integers.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
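
[Editorial sketch] The thread does not say why the driver round-trips the
key through MPI. If the goal were only to normalize the raw big-endian
integer by dropping leading zero bytes (an assumption), that can be done
directly on the raw buffer. The helper below is hypothetical and is not
taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return a pointer to the first non-zero byte of a big-endian integer
 * and store the remaining length in *out_len. An all-zero or empty
 * input yields *out_len == 0.
 */
const uint8_t *skip_leading_zeros(const uint8_t *buf, size_t len,
				  size_t *out_len)
{
	while (len > 0 && *buf == 0) {
		buf++;
		len--;
	}
	*out_len = len;
	return buf;
}
```

The caller can then copy the trimmed bytes into whatever buffer the hardware
expects, without allocating an intermediate multi-precision integer.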

^ permalink raw reply

* Re: [PATCH v2] crypto: caam: do not register AES-XTS mode on LP units
From: Herbert Xu @ 2016-11-13  9:46 UTC (permalink / raw)
  To: Sven Ebenfeld
  Cc: linux-crypto, linux-kernel, stable, horia.geanta, davem,
	cata.vasile
In-Reply-To: <1478541094-73173-1-git-send-email-sven.ebenfeld@gmail.com>

On Mon, Nov 07, 2016 at 06:51:34PM +0100, Sven Ebenfeld wrote:
> When using AES-XTS on a Wandboard, we receive a Mode error:
> caam_jr 2102000.jr1: 20001311: CCB: desc idx 19: AES: Mode error.
> 
> According to the Security Reference Manual, the Low Power AES units
> of the i.MX6 do not support the XTS mode. Therefore we must not
> register XTS implementations in the Crypto API.
> 
> Signed-off-by: Sven Ebenfeld <sven.ebenfeld@gmail.com>
> Reviewed-by: Horia Geantă <horia.geanta@nxp.com>
> 
> Cc: <stable@vger.kernel.org> # 4.4+
> Fixes: c6415a6016bf "crypto: caam - add support for acipher xts(aes)"

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

