* sha512: make it work, undo percpu message schedule
@ 2012-01-11  0:00 Alexey Dobriyan
  2012-01-11  0:12 ` Alexey Dobriyan
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-11  0:00 UTC (permalink / raw)
  To: herbert; +Cc: linux-crypto, netdev, ken

commit f9e2bca6c22d75a289a349f869701214d63b5060
aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
created a global message schedule area.

If sha512_update is ever entered twice, hilarity ensues.

Probably the easiest way to notice incorrect hashes being calculated is
to run 2 ping floods over AH with hmac(sha512):

	#!/usr/sbin/setkey -f
	flush;
	spdflush;
	add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
	add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
	spdadd IP1 IP2 any -P out ipsec ah/transport//require;
	spdadd IP2 IP1 any -P in ipsec ah/transport//require;

XfrmInStateProtoError will start ticking, with -EBADMSG being returned
from ah_input(). This never happens with, say, hmac(sha1).

I personally don't understand this changelog entry:

	"The message schedule W (u64[80]) is too big for the stack."

The hash context is dynamically allocated. W[80] stack usage can be
trimmed like in SHA-1; that is for a separate patch.

P.S.: If only one ping flood runs, XfrmInStateProtoError will also tick,
but very slowly. I personally can't explain this.

With the patch applied (on BOTH sides), XfrmInStateProtoError does not
tick with multiple bidirectional ping flood streams, just as it doesn't
tick with SHA-1.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 crypto/sha512_generic.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
 	return z ^ (x & (y ^ z));
@@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 W[80];
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
@@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
 }
 
 static int

^ permalink raw reply	[flat|nested] 42+ messages in thread
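[Editor's note: the failure mode above — two hash computations sharing one static scratch buffer — can be demonstrated outside the kernel. The following is a hedged userspace C sketch, not kernel code: `toy_transform`, `softirq_user`, and the multiplier constant are all illustrative. A "transform" stages data in a single shared buffer; if a second user re-enters it mid-computation (as a softirq would with the percpu `msg_schedule`), the first user's result is silently wrong.]

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 16

static uint64_t shared_W[WORDS];	/* analogue of the static percpu area */

/* Toy "transform": stage the input in the shared scratch area, then
 * digest it.  midpoint_hook, if non-null, models a softirq arriving
 * between the two phases and running another transform on the SAME
 * scratch area. */
static uint64_t toy_transform(const uint64_t *input,
			      void (*midpoint_hook)(void))
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < WORDS; i++)		/* the "message schedule" */
		shared_W[i] = input[i] * 0x9e3779b97f4a7c15ULL;
	if (midpoint_hook)
		midpoint_hook();		/* scribbles over shared_W */
	for (i = 0; i < WORDS; i++)
		sum += shared_W[i];
	return sum;				/* wrong if we were preempted */
}

/* The "softirq user": hashes a different message, reusing shared_W. */
static void softirq_user(void)
{
	uint64_t other[WORDS] = { 7 };
	toy_transform(other, 0);
}
```

A private (stack or per-context) buffer makes the midpoint interruption harmless, which is what the patch above restores.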
* Re: sha512: make it work, undo percpu message schedule
  2012-01-11  0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan
@ 2012-01-11  0:12 ` Alexey Dobriyan
  2012-01-11  0:36 ` Herbert Xu
  2012-01-11  1:12 ` Adrian-Ken Rueegsegger
  2 siblings, 0 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-11  0:12 UTC (permalink / raw)
  To: herbert; +Cc: linux-crypto, netdev, ken

On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:

> -	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));

And, yes, this is intentional -- modern gcc pisses on stone age data
clearing.

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-11  0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan
  2012-01-11  0:12 ` Alexey Dobriyan
@ 2012-01-11  0:36 ` Herbert Xu
  2012-01-12 23:55   ` Alexey Dobriyan
  2012-01-13  6:22   ` Steffen Klassert
  2012-01-11  1:12 ` Adrian-Ken Rueegsegger
  2 siblings, 2 replies; 42+ messages in thread
From: Herbert Xu @ 2012-01-11  0:36 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-crypto, netdev, ken

On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> commit f9e2bca6c22d75a289a349f869701214d63b5060
> aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> created a global message schedule area.
>
> If sha512_update is ever entered twice, hilarity ensues.

Hmm, do you know why this happens? On the face of it this shouldn't
be possible, as preemption is disabled.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-11  0:36 ` Herbert Xu
@ 2012-01-12 23:55   ` Alexey Dobriyan
  2012-01-13  0:19     ` Herbert Xu
  2012-01-13  7:08     ` Herbert Xu
  1 sibling, 2 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-12 23:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-crypto, netdev, ken

On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote:
> On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> > commit f9e2bca6c22d75a289a349f869701214d63b5060
> > aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> > created a global message schedule area.
> >
> > If sha512_update is ever entered twice, hilarity ensues.
>
> Hmm, do you know why this happens? On the face of it this shouldn't
> be possible, as preemption is disabled.

Herbert, I couldn't come up with a single scenario. :-(
But the bug is easy to reproduce.

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-12 23:55   ` Alexey Dobriyan
@ 2012-01-13  0:19     ` Herbert Xu
  2012-01-13  7:08     ` Herbert Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Herbert Xu @ 2012-01-13  0:19 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-crypto, netdev, ken

On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
>
> Herbert, I couldn't come up with a single scenario. :-(
> But the bug is easy to reproduce.

OK, I'll try to reproduce it here.

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-12 23:55   ` Alexey Dobriyan
  2012-01-13  0:19     ` Herbert Xu
@ 2012-01-13  7:08     ` Herbert Xu
  2012-01-13 10:35       ` Eric Dumazet
  1 sibling, 1 reply; 42+ messages in thread
From: Herbert Xu @ 2012-01-13  7:08 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet

On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
>
> Herbert, I couldn't come up with a single scenario. :-(
> But the bug is easy to reproduce.

OK, does this patch work for you?

commit 31f4e55c09c1170f8b813c14b1299b70f50db414
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Fri Jan 13 18:06:50 2012 +1100

    crypto: sha512 - Fix msg_schedule race

    The percpu msg_schedule setup was unsafe, as a user in process
    context can be interrupted by a softirq user which would then
    scribble over the exact same work area.  This was discovered by
    Steffen Klassert.

    This patch, based on ideas from Eric Dumazet, fixes this by using
    two independent work areas.

    Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..a49c457 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -11,6 +11,7 @@
  *
  */
 #include <crypto/internal/hash.h>
+#include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
@@ -21,7 +22,9 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
+#define SHA512_SCHEDULE_SIZE 80
+
+static DEFINE_PER_CPU(u64[SHA512_SCHEDULE_SIZE * 2], msg_schedule);
 
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
@@ -89,7 +92,8 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 *W = get_cpu_var(msg_schedule) +
+		 SHA512_SCHEDULE_SIZE * !!in_interrupt();
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
@@ -128,7 +132,7 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
+	memset(W, 0, SHA512_SCHEDULE_SIZE * sizeof(*W));
 	put_cpu_var(msg_schedule);
 }

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 42+ messages in thread
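[Editor's note: the core of Herbert's fix is the indexing trick — one backing array holds two independent schedules, and execution context selects the half. A userspace sketch of just that trick (`sched_area` and the `in_softirq` flag are stand-ins for the kernel's per-CPU variable and `in_interrupt()`):]

```c
#include <assert.h>
#include <stdint.h>

#define SHA512_SCHEDULE_SIZE 80

/* One backing area, two halves: words 0..79 serve process context,
 * words 80..159 serve interrupt context.  A softirq that preempts a
 * process-context hash therefore scribbles only on its own half. */
static uint64_t msg_schedule[SHA512_SCHEDULE_SIZE * 2];

/* in_softirq stands in for the kernel's in_interrupt() test; !! maps
 * any non-zero value to 1, exactly as in the patch above. */
static uint64_t *sched_area(int in_softirq)
{
	return msg_schedule + SHA512_SCHEDULE_SIZE * !!in_softirq;
}
```

The two callers still must not migrate between the two phases, which is why the kernel version keeps the `get_cpu_var`/`put_cpu_var` pair (it disables preemption).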
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13  7:08     ` Herbert Xu
@ 2012-01-13 10:35       ` Eric Dumazet
  2012-01-13 10:41         ` Eric Dumazet
                          ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 10:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 18:08 +1100, Herbert Xu wrote:
> On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
> >
> > Herbert, I couldn't come up with a single scenario. :-(
> > But the bug is easy to reproduce.
>
> OK, does this patch work for you?
>
> commit 31f4e55c09c1170f8b813c14b1299b70f50db414
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Fri Jan 13 18:06:50 2012 +1100
>
>     crypto: sha512 - Fix msg_schedule race
>
>     The percpu msg_schedule setup was unsafe, as a user in process
>     context can be interrupted by a softirq user which would then
>     scribble over the exact same work area.  This was discovered by
>     Steffen Klassert.
>
>     This patch, based on ideas from Eric Dumazet, fixes this by using
>     two independent work areas.
>
>     Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
>     Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I wonder ...

With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
this function ?

What about the following patch instead ?

(Try a dynamic memory allocation, and fall back on a single
pre-allocated block of memory, shared by all cpus, protected by a
spinlock.)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..5c80a76 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,7 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
 
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
@@ -87,10 +86,16 @@ static void
 sha512_transform(u64 *state, const u8 *input)
 {
 	u64 a, b, c, d, e, f, g, h, t1, t2;
-
+	static u64 msg_schedule[80];
+	static DEFINE_SPINLOCK(msg_schedule_lock);
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
+
+	if (!W) {
+		spin_lock_bh(&msg_schedule_lock);
+		W = msg_schedule;
+	}
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
 		LOAD_OP(i, W, input);
@@ -128,8 +133,11 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
+	memset(W, 0, sizeof(msg_schedule));
+	if (W == msg_schedule)
+		spin_unlock_bh(&msg_schedule_lock);
+	else
+		kfree(W);
 }
 
 static int

^ permalink raw reply related	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 10:35       ` Eric Dumazet
@ 2012-01-13 10:41         ` Eric Dumazet
  2012-01-13 10:57           ` Eric Dumazet
  2012-01-13 11:02         ` Steffen Klassert
  2012-01-13 11:45         ` David Laight
  2 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 10:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 11:35 +0100, Eric Dumazet wrote:
> I wonder ...
>
> With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
> this function ?
>
> What about the following patch instead ?
>
> (Try a dynamic memory allocation, and fall back on a single
> pre-allocated block of memory, shared by all cpus, protected by a
> spinlock.)
>
> [... patch snipped ...]
>
> -	u64 *W = get_cpu_var(msg_schedule);
> +	u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);

And a plain kmalloc() is enough, since we fully initialize the array a
bit later:

	for (i = 0; i < 16; i++)
		LOAD_OP(i, W, input);
	for (i = 16; i < 80; i++) {
		BLEND_OP(i, W);
	}

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 10:41         ` Eric Dumazet
@ 2012-01-13 10:57           ` Eric Dumazet
  2012-01-13 11:33             ` Alexey Dobriyan
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 10:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 11:41 +0100, Eric Dumazet wrote:
> And a plain kmalloc() is enough, since we fully initialize the array a
> bit later:
>
> 	for (i = 0; i < 16; i++)
> 		LOAD_OP(i, W, input);
> 	for (i = 16; i < 80; i++) {
> 		BLEND_OP(i, W);
> 	}

Here is an official patch ;)

[PATCH v3] crypto: sha512 - Fix msg_schedule race

The percpu msg_schedule setup was unsafe, as a user in process context
can be interrupted by a softirq user which would then scribble over the
exact same work area.

Bug reported by Alexey Dobriyan, and diagnosed by Steffen Klassert.

Bug added in commit f9e2bca6c22 (crypto: sha512 - Move message schedule
W[80] to static percpu area)

Try a dynamic memory allocation, and fall back to a single static area,
guarded by a spinlock.

This removes the use of percpu memory.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Adrian-Ken Rueegsegger <ken@codelabs.ch>
CC: Steffen Klassert <steffen.klassert@secunet.com>
---
 crypto/sha512_generic.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..b52ef9b 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -18,10 +18,8 @@
 #include <linux/crypto.h>
 #include <linux/types.h>
 #include <crypto/sha.h>
-#include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
 
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
@@ -87,10 +85,15 @@ static void
 sha512_transform(u64 *state, const u8 *input)
 {
 	u64 a, b, c, d, e, f, g, h, t1, t2;
-
+	static u64 msg_schedule[80];
+	static DEFINE_SPINLOCK(msg_schedule_lock);
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 *W = kmalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
 
+	if (!W) {
+		spin_lock_bh(&msg_schedule_lock);
+		W = msg_schedule;
+	}
 	/* load the input */
 	for (i = 0; i < 16; i++)
 		LOAD_OP(i, W, input);
@@ -128,8 +131,11 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
+	memset(W, 0, sizeof(msg_schedule));
+	if (W == msg_schedule)
+		spin_unlock_bh(&msg_schedule_lock);
+	else
+		kfree(W);
 }
 
 static int

^ permalink raw reply related	[flat|nested] 42+ messages in thread
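[Editor's note: the allocation strategy in Eric's patch is a general pattern — try an atomic heap allocation, and when it fails fall back to a single static buffer serialized by a lock. A hedged userspace analogue (`sched_get`/`sched_put` are invented names; `malloc` and a pthread mutex stand in for `kmalloc(GFP_ATOMIC)` and `spin_lock_bh`):]

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define SCHED_WORDS 80

static unsigned long long fallback_buf[SCHED_WORDS];
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Acquire a scratch area: prefer the heap; if allocation fails, take
 * the shared static buffer, serialized by a lock so two users can
 * never scribble on it at once. */
unsigned long long *sched_get(void)
{
	unsigned long long *W = malloc(SCHED_WORDS * sizeof(*W));

	if (!W) {
		pthread_mutex_lock(&fallback_lock);
		W = fallback_buf;
	}
	return W;
}

/* Release it: erase sensitive data, then unlock or free as appropriate. */
void sched_put(unsigned long long *W)
{
	memset(W, 0, SCHED_WORDS * sizeof(*W));
	if (W == fallback_buf)
		pthread_mutex_unlock(&fallback_lock);
	else
		free(W);
}
```

The trade-off mirrors the thread: the common case pays one allocation per transform, and the rare out-of-memory case degrades to serialized (but still correct) hashing instead of corruption.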
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 10:57           ` Eric Dumazet
@ 2012-01-13 11:33             ` Alexey Dobriyan
  2012-01-13 12:34               ` Eric Dumazet
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-13 11:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert

On 1/13/12, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> +	static u64 msg_schedule[80];
> +	static DEFINE_SPINLOCK(msg_schedule_lock);

No guys, no.

SHA-512 only needs a u64[16] running window for message scheduling.

I'm sending a whitespace-mangled patch which is only tested with the
crypto selftest passing, so you won't apply something complex.

Stack space usage drops down like this:

-139:	48 81 ec c8 02 00 00	sub    $0x2c8,%rsp
+136:	48 81 ec 18 01 00 00	sub    $0x118,%rsp

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
 	return z ^ (x & (y ^ z));
@@ -80,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)
 
 static inline void BLEND_OP(int I, u64 *W)
 {
-	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+	W[I%16] = s1(W[(I-2)%16]) + W[(I-7)%16] + s0(W[(I-15)%16]) + W[I%16];
 }
 
 static void
@@ -89,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 W[16];
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
 		LOAD_OP(i, W, input);
 
-	for (i = 16; i < 80; i++) {
-		BLEND_OP(i, W);
-	}
-
 	/* load the state into our registers */
 	a=state[0];   b=state[1];   c=state[2];   d=state[3];
 	e=state[4];   f=state[5];   g=state[6];   h=state[7];
 
-	/* now iterate */
-	for (i=0; i<80; i+=8) {
-		t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i  ] + W[i  ];
-		t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-		t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
-		t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-		t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
-		t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-		t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
-		t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-		t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
-		t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-		t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
-		t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-		t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
-		t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-		t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
-		t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+#define SHA512_0_15(i, a, b, c, d, e, f, g, h)			\
+	t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];	\
+	t2 = e0(a) + Maj(a,b,c);				\
+	d += t1;						\
+	h = t1 + t2
+
+#define SHA512_16_79(i, a, b, c, d, e, f, g, h)			\
+	BLEND_OP(i, W);						\
+	t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16];	\
+	t2 = e0(a) + Maj(a,b,c);				\
+	d += t1;						\
+	h = t1 + t2
+
+	for (i = 0; i < 16; i += 8) {
+		SHA512_0_15(i, a, b, c, d, e, f, g, h);
+		SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
+		SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
+		SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
+		SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
+		SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
+		SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
+		SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
+	}
+	for (i = 16; i < 80; i += 8) {
+		SHA512_16_79(i, a, b, c, d, e, f, g, h);
+		SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
+		SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
+		SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
+		SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
+		SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
+		SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
+		SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
 	}
 
 	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
@@ -128,8 +136,6 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
 }
 
 static int

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 11:33             ` Alexey Dobriyan
@ 2012-01-13 12:34               ` Eric Dumazet
  2012-01-14 18:20                 ` Alexey Dobriyan
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 12:34 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 13:33 +0200, Alexey Dobriyan wrote:
> On 1/13/12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > +	static u64 msg_schedule[80];
> > +	static DEFINE_SPINLOCK(msg_schedule_lock);
>
> No guys, no.
>
> SHA-512 only needs a u64[16] running window for message scheduling.
>
> I'm sending a whitespace-mangled patch which is only tested with the
> crypto selftest passing, so you won't apply something complex.
>
> Stack space usage drops down like this:
>
> -139:	48 81 ec c8 02 00 00	sub    $0x2c8,%rsp
> +136:	48 81 ec 18 01 00 00	sub    $0x118,%rsp
>
> [... full patch snipped ...]

Even if it's true, it's not stable material.

stable teams want obvious patches.

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 12:34               ` Eric Dumazet
@ 2012-01-14 18:20                 ` Alexey Dobriyan
  2012-01-14 18:27                   ` [PATCH 1/3] " Alexey Dobriyan
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-14 18:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert

On Fri, Jan 13, 2012 at 01:34:13PM +0100, Eric Dumazet wrote:
> On Friday 13 January 2012 at 13:33 +0200, Alexey Dobriyan wrote:
> > On 1/13/12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > > +	static u64 msg_schedule[80];
> > > +	static DEFINE_SPINLOCK(msg_schedule_lock);
> >
> > No guys, no.
> >
> > SHA-512 only needs a u64[16] running window for message scheduling.
> >
> > [... full patch snipped ...]
>
> Even if it's true, it's not stable material.
>
> stable teams want obvious patches.

I understand that.

But it _is_ obvious if you look at what the macro does and at the
original code, and read FIPS 180-2 6.3.2 "SHA-512 Hash Computation",
from which it is obvious that W[i] depends on W[i - 16] and doesn't
depend on earlier values.

This stack reduction patch completely fixes the original bug and doesn't
show any AH errors. Given the nature of crypto code, where a one-bit
mistake ruins everything, it should be enough.

^ permalink raw reply	[flat|nested] 42+ messages in thread
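[Editor's note: Alexey's argument is easy to check numerically — because round i of the schedule reads only W[i-2], W[i-7], W[i-15] and W[i-16], a 16-word circular window reproduces the full 80-word schedule exactly. A standalone C check of that claim (function names are illustrative; `s0`/`s1` are the SHA-512 small-sigma functions from FIPS 180-2):]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t ror64(uint64_t x, int n)
{
	return (x >> n) | (x << (64 - n));
}

/* FIPS 180-2 sigma0/sigma1 for the SHA-512 message schedule */
static uint64_t s0(uint64_t x) { return ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return ror64(x, 19) ^ ror64(x, 61) ^ (x >> 6); }

/* Expand rounds 16..79 both ways and report whether every value agrees. */
static int schedules_agree(const uint64_t in[16])
{
	uint64_t full[80], win[16];
	int i;

	memcpy(full, in, 16 * sizeof(uint64_t));
	memcpy(win, in, sizeof(win));
	for (i = 16; i < 80; i++) {
		/* original code: full 80-word array */
		full[i] = s1(full[i-2]) + full[i-7] + s0(full[i-15]) + full[i-16];
		/* patched code: slot i%16 still holds W[i-16], hence "+=" */
		win[i % 16] += s1(win[(i-2) % 16]) + win[(i-7) % 16] +
			       s0(win[(i-15) % 16]);
		if (win[i % 16] != full[i])
			return 0;
	}
	return 1;
}
```

Both loops perform identical mod-2^64 arithmetic on the same values, so the comparison holds for any input block.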
* [PATCH 1/3] sha512: make it work, undo percpu message schedule
  2012-01-14 18:20                 ` Alexey Dobriyan
@ 2012-01-14 18:27                   ` Alexey Dobriyan
  2012-01-14 18:40                     ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan
                                      ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-14 18:27 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security

commit f9e2bca6c22d75a289a349f869701214d63b5060
aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
created a global message schedule area.

If sha512_update is ever entered twice, the hash will be silently
calculated incorrectly.

Probably the easiest way to notice incorrect hashes being calculated is
to run 2 ping floods over AH with hmac(sha512):

	#!/usr/sbin/setkey -f
	flush;
	spdflush;
	add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
	add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
	spdadd IP1 IP2 any -P out ipsec ah/transport//require;
	spdadd IP2 IP1 any -P in ipsec ah/transport//require;

XfrmInStateProtoError will start ticking, with -EBADMSG being returned
from ah_input(). This never happens with, say, hmac(sha1).

With the patch applied (on BOTH sides), XfrmInStateProtoError does not
tick with multiple bidirectional ping flood streams, just as it doesn't
tick with SHA-1.

After this patch, sha512_transform() will start using ~750 bytes of
stack on x86_64. This is OK for simple loads; for anything heavier,
stack reduction will be done separately.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
---
 crypto/sha512_generic.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
 	return z ^ (x & (y ^ z));
@@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 W[80];
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
@@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
 }
 
 static int

^ permalink raw reply	[flat|nested] 42+ messages in thread
* [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan @ 2012-01-14 18:40 ` Alexey Dobriyan 2012-01-14 19:08 ` Linus Torvalds 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet 2012-01-15 1:43 ` Herbert Xu 2 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-14 18:40 UTC (permalink / raw) To: Herbert Xu Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15] and W[i - 16]. Consequently, keeping the whole W[80] array on the stack is unnecessary; only 16 values are really needed. Using W[16] instead of W[80] greatly reduces stack usage (~750 bytes to ~340 bytes on x86_64). Line by line explanation: * BLEND_OP array is "circular" now, all indexes have to be modulo 16. Round number is positive, so remainder operation should be without surprises. * initial full message scheduling is trimmed to the first 16 values, which come from the data block; the rest is calculated before it's needed. * original loop body is an unrolled version of the new SHA512_0_15 and SHA512_16_79 macros, unrolling was done to avoid explicit variable renaming. Otherwise it's the very same code after preprocessing. See sha1_transform() code which does the same trick. Patch survives in-tree crypto test and original bugreport test (ping flood with hmac(sha512)). See FIPS 180-2 for SHA-512 definition http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- This patch is for stable if 750-byte stack usage is not considered safe.
crypto/sha512_generic.c | 58 ++++++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 24 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input) static inline void BLEND_OP(int I, u64 *W) { - W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16]; + W[I % 16] += s1(W[(I-2) % 16]) + W[(I-7) % 16] + s0(W[(I-15) % 16]); } static void @@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input) u64 a, b, c, d, e, f, g, h, t1, t2; int i; - u64 W[80]; + u64 W[16]; /* load the input */ for (i = 0; i < 16; i++) LOAD_OP(i, W, input); - for (i = 16; i < 80; i++) { - BLEND_OP(i, W); - } - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; - /* now iterate */ - for (i=0; i<80; i+=8) { - t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ]; - t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2; - t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1]; - t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2; - t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2]; - t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2; - t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3]; - t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2; - t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4]; - t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2; - t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5]; - t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2; - t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6]; - t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2; - t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7]; - t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2; +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ + t2 = e0(a) + Maj(a, b, c); \ + d += t1; \ + h = t1 + t2 + +#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ + BLEND_OP(i, W); \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16]; \ + t2 = e0(a) + Maj(a, b, c); \ + 
d += t1; \ + h = t1 + t2 + + for (i = 0; i < 16; i += 8) { + SHA512_0_15(i, a, b, c, d, e, f, g, h); + SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); + SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); + SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); + SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); + SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); + SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); + } + for (i = 16; i < 80; i += 8) { + SHA512_16_79(i, a, b, c, d, e, f, g, h); + SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); + SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); + SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); + SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); + SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); + SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); + SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan @ 2012-01-14 19:08 ` Linus Torvalds 2012-01-14 20:41 ` Alexey Dobriyan 0 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2012-01-14 19:08 UTC (permalink / raw) To: Alexey Dobriyan Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan <adobriyan@gmail.com> wrote: > > Line by line explanation: > * BLEND_OP > array is "circular" now, all indexes have to be modulo 16. > Round number is positive, so remainder operation should be > without surprises. Don't use "%" except on unsigned values. Even if it's positive, if it's a signed number and the compiler doesn't *see* that it is absolutely positive, division is nontrivial. Even when you divide by a constant. For example, "% 16" on an 'int' on x86-64 will generate movl %edi, %edx sarl $31, %edx shrl $28, %edx leal (%rdi,%rdx), %eax andl $15, %eax subl %edx, %eax in order to get the signed case right. The fact that the end result is correct for unsigned numbers is irrelevant: it's still stupid and slow. With an unsigned int, '% 16' will generate the obvious andl $15, %eax instead. Quite frankly, stop using division in the first place. Dividing by powers-of-two and expecting the compiler to fix things up is just stupid, *exactly* because of issues like these: you either have to think about it carefully, or the compiler may end up creating crap code. So just use "& 15" instead. That doesn't have these kinds of issues. It is a *good* thing when the C code is close to the end result you want to generate. It is *not* a good thing to write code that looks nothing like the end result and just expect the compiler to do the right thing. Even if the compiler does do the right thing, what was the advantage? Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 19:08 ` Linus Torvalds @ 2012-01-14 20:41 ` Alexey Dobriyan 2012-01-14 21:14 ` Linus Torvalds 2012-01-16 9:56 ` David Laight 0 siblings, 2 replies; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-14 20:41 UTC (permalink / raw) To: Linus Torvalds Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 11:08:45AM -0800, Linus Torvalds wrote: > On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > Line by line explanation: > > * BLEND_OP > > array is "circular" now, all indexes have to be modulo 16. > > Round number is positive, so remainder operation should be > > without surprises. > > Don't use "%" except on unsigned values. Even if it's positive, if > it's a signed number and the compiler doesn't *see* that it is > absolutely positive, division is nontrivial. Even when you divide by a > constant. > > For example, "% 16" on an 'int' on x86-64 will generate > > movl %edi, %edx > sarl $31, %edx > shrl $28, %edx > leal (%rdi,%rdx), %eax > andl $15, %eax > subl %edx, %eax > > in order to get the signed case right. The fact that the end result is > correct for unsigned numbers is irrelevant: it's still stupid and > slow. > > With an unsigned int, '% 16' will generate the obvious > > andl $15, %eax > > instead. > > Quite frankly, stop using division in the first place. Dividing by > powers-of-two and expecting the compiler to fix things up is just > stupid, *exactly* because of issues like these: you either have to > think about it carefully, or the compiler may end up creating crap > code. For the record, it generates "andl $15" here. > So just use "& 15" instead. That doesn't have these kinds of issues. > It is a *good* thing when the C code is close to the end result you > want to generate. 
It is *not* a good thing to write code that looks > nothing like the end result and just expect the compiler to do the > right thing. Even if the compiler does do the right thing, what was > the advantage? Here is an updated patch which explicitly uses & (equally tested): --------------------------------------------------------------- For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15] and W[i - 16]. Consequently, keeping the whole W[80] array on the stack is unnecessary; only 16 values are really needed. Using W[16] instead of W[80] greatly reduces stack usage (~750 bytes to ~340 bytes on x86_64). Line by line explanation: * BLEND_OP array is "circular" now, all indexes have to be modulo 16. * initial full message scheduling is trimmed to the first 16 values, which come from the data block; the rest is calculated right before it's needed. * original loop body is an unrolled version of the new SHA512_0_15 and SHA512_16_79 macros, unrolling was done to avoid explicit variable renaming. Otherwise it's the very same code after preprocessing. See sha1_transform() code which does the same trick. Patch survives in-tree crypto test and original bugreport test (ping flood with hmac(sha512)). See FIPS 180-2 for SHA-512 definition http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- This patch is for stable if 750-byte stack usage is not considered safe.
crypto/sha512_generic.c | 58 ++++++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 24 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input) static inline void BLEND_OP(int I, u64 *W) { - W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16]; + W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]); } static void @@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input) u64 a, b, c, d, e, f, g, h, t1, t2; int i; - u64 W[80]; + u64 W[16]; /* load the input */ for (i = 0; i < 16; i++) LOAD_OP(i, W, input); - for (i = 16; i < 80; i++) { - BLEND_OP(i, W); - } - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; - /* now iterate */ - for (i=0; i<80; i+=8) { - t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ]; - t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2; - t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1]; - t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2; - t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2]; - t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2; - t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3]; - t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2; - t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4]; - t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2; - t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5]; - t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2; - t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6]; - t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2; - t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7]; - t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2; +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ + t2 = e0(a) + Maj(a, b, c); \ + d += t1; \ + h = t1 + t2 + +#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ + BLEND_OP(i, W); \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ + t2 = e0(a) + Maj(a, b, c); \ + 
d += t1; \ + h = t1 + t2 + + for (i = 0; i < 16; i += 8) { + SHA512_0_15(i, a, b, c, d, e, f, g, h); + SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); + SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); + SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); + SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); + SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); + SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); + } + for (i = 16; i < 80; i += 8) { + SHA512_16_79(i, a, b, c, d, e, f, g, h); + SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); + SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); + SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); + SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); + SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); + SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); + SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 20:41 ` Alexey Dobriyan @ 2012-01-14 21:14 ` Linus Torvalds 2012-01-16 9:56 ` David Laight 1 sibling, 0 replies; 42+ messages in thread From: Linus Torvalds @ 2012-01-14 21:14 UTC (permalink / raw) To: Alexey Dobriyan Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 12:41 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote: > > For the record, it generates "andl $15" here. Ok. That means that gcc was able to prove that it never had any signed values (which is certainly reasonable when you do things like "for (i=0; i<X;i++)"). But it's better to simply not rely on gcc always getting details like this right. It's also better to use a model that simply doesn't even require you as a programmer to have to even *think* about signed values. It's easy to get "%" wrong by mistake (signed integer modulus didn't even use to have well-defined semantics in traditional C), and there is almost never any excuse for using it for powers-of-two. > Here is updated patch which explicitly uses & (equally tested): Thanks, Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 20:41 ` Alexey Dobriyan 2012-01-14 21:14 ` Linus Torvalds @ 2012-01-16 9:56 ` David Laight 2012-01-16 10:20 ` Alexey Dobriyan 2012-01-16 10:23 ` Eric Dumazet 1 sibling, 2 replies; 42+ messages in thread From: David Laight @ 2012-01-16 9:56 UTC (permalink / raw) To: Alexey Dobriyan, Linus Torvalds Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security Doesn't this badly overflow W[] .. > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ ... > + for (i = 0; i < 16; i += 8) { ... > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); > + } David ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 9:56 ` David Laight @ 2012-01-16 10:20 ` Alexey Dobriyan 2012-01-16 10:23 ` Eric Dumazet 1 sibling, 0 replies; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-16 10:20 UTC (permalink / raw) To: David Laight Cc: Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On 1/16/12, David Laight <David.Laight@aculab.com> wrote: > Doesn't this badly overflow W[] .. > >> +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ >> + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ > ... >> + for (i = 0; i < 16; i += 8) { > ... >> + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); >> + } No, why should it? i can be only 0 and 8. ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 9:56 ` David Laight 2012-01-16 10:20 ` Alexey Dobriyan @ 2012-01-16 10:23 ` Eric Dumazet 2012-01-16 11:37 ` David Laight 2012-01-17 12:03 ` Alexey Dobriyan 1 sibling, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-16 10:23 UTC (permalink / raw) To: David Laight Cc: Alexey Dobriyan, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security Le lundi 16 janvier 2012 à 09:56 +0000, David Laight a écrit : > Doesn't this badly overflow W[] .. > > > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ > > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ > ... > > + for (i = 0; i < 16; i += 8) { > ... > > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); > > + } > > David > > No overflow since loop is done for only i=0 and i=8 By the way, I suspect previous code was chosen years ago because this version uses less stack but adds much more code bloat. size crypto/sha512_generic.o crypto/sha512_generic_old.o text data bss dec hex filename 17369 704 0 18073 4699 crypto/sha512_generic.o 8249 704 0 8953 22f9 crypto/sha512_generic_old.o ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 10:23 ` Eric Dumazet @ 2012-01-16 11:37 ` David Laight 2012-01-17 12:03 ` Alexey Dobriyan 1 sibling, 0 replies; 42+ messages in thread From: David Laight @ 2012-01-16 11:37 UTC (permalink / raw) To: Eric Dumazet Cc: Alexey Dobriyan, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security > By the way, I suspect previous code was chosen years ago because this > version uses less stack but adds much more code bloat. > > size crypto/sha512_generic.o crypto/sha512_generic_old.o > text data bss dec hex filename > 17369 704 0 18073 4699 crypto/sha512_generic.o > 8249 704 0 8953 22f9 > crypto/sha512_generic_old.o I wonder what the difference in execution times is? I know this is very dependant in the actual cpu and the cold/warm cache state... But some measurements might be useful. David ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 10:23 ` Eric Dumazet 2012-01-16 11:37 ` David Laight @ 2012-01-17 12:03 ` Alexey Dobriyan 2012-01-18 18:02 ` [PATCH 4/3] sha512: reduce stack usage even on i386 Alexey Dobriyan 1 sibling, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-17 12:03 UTC (permalink / raw) To: Eric Dumazet Cc: David Laight, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On 1/16/12, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 16 janvier 2012 à 09:56 +0000, David Laight a écrit : >> Doesn't this badly overflow W[] .. >> >> > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ >> > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ >> ... >> > + for (i = 0; i < 16; i += 8) { >> ... >> > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); >> > + } >> >> David >> >> > > No overflow since loop is done for only i=0 and i=8 > > By the way, I suspect previous code was chosen years ago because this > version uses less stack but adds much more code bloat. I think W[80] was used because it's the most straightforward way to write this code by following the spec. All SHA definitions have the full message schedule pseudocoded before the hash computation. > size crypto/sha512_generic.o crypto/sha512_generic_old.o > text data bss dec hex filename > 17369 704 0 18073 4699 crypto/sha512_generic.o > 8249 704 0 8953 22f9 crypto/sha512_generic_old.o This is because SHA-512 is a fundamentally 64-bit algorithm, multiplied by excessive unrolling. Surprisingly, doing variable renaming by hand like in the spec: t1 = ... t2 = ... h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2; brings stack space on i386 under control too. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-17 12:03 ` Alexey Dobriyan @ 2012-01-18 18:02 ` Alexey Dobriyan 2012-01-26 2:35 ` Herbert Xu 0 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-18 18:02 UTC (permalink / raw) To: Herbert Xu Cc: David Laight, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet Fix still excessive stack usage on i386. There is too much loop unrolling going on, despite W[16] being used, gcc screws up this for some reason. So, don't be smart, use simple code from SHA-512 definition, this keeps code size _and_ stack usage back under control even on i386: -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp +149: 81 ec 64 01 00 00 sub $0x164,%esp $ size ../sha512_generic-i386-00* text data bss dec hex filename 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o 4225 712 0 4937 1349 ../sha512_generic-i386-001.o Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- crypto/sha512_generic.c | 42 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 22 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -95,35 +95,33 @@ sha512_transform(u64 *state, const u8 *input) #define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 #define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i & 15]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - - for (i = 0; i < 16; i += 8) { + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 + + for (i = 0; i < 16; i++) { SHA512_0_15(i, a, b, c, d, e, f, g, h); - SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); - 
SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); - SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); - SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); - SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); - SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); - SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); } - for (i = 16; i < 80; i += 8) { + for (i = 16; i < 80; i++) { SHA512_16_79(i, a, b, c, d, e, f, g, h); - SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); - SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); - SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); - SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); - SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); - SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); - SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-18 18:02 ` [PATCH 4/3] sha512: reduce stack usage even on i386 Alexey Dobriyan @ 2012-01-26 2:35 ` Herbert Xu 2012-01-27 17:51 ` Alexey Dobriyan 0 siblings, 1 reply; 42+ messages in thread From: Herbert Xu @ 2012-01-26 2:35 UTC (permalink / raw) To: Alexey Dobriyan Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Wed, Jan 18, 2012 at 09:02:10PM +0300, Alexey Dobriyan wrote: > Fix still excessive stack usage on i386. > > There is too much loop unrolling going on, despite W[16] being used, > gcc screws up this for some reason. So, don't be smart, use simple code > from SHA-512 definition, this keeps code size _and_ stack usage back > under control even on i386: > > -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp > +149: 81 ec 64 01 00 00 sub $0x164,%esp > > $ size ../sha512_generic-i386-00* > text data bss dec hex filename > 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o > 4225 712 0 4937 1349 ../sha512_generic-i386-001.o > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > Cc: stable@vger.kernel.org Hmm, your patch doesn't apply against my crypto tree. Please regenerate. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-26 2:35 ` Herbert Xu @ 2012-01-27 17:51 ` Alexey Dobriyan 2012-01-27 22:32 ` Herbert Xu 0 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-27 17:51 UTC (permalink / raw) To: Herbert Xu Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Thu, Jan 26, 2012 at 01:35:02PM +1100, Herbert Xu wrote: > On Wed, Jan 18, 2012 at 09:02:10PM +0300, Alexey Dobriyan wrote: > > Fix still excessive stack usage on i386. > > > > There is too much loop unrolling going on, despite W[16] being used, > > gcc screws up this for some reason. So, don't be smart, use simple code > > from SHA-512 definition, this keeps code size _and_ stack usage back > > under control even on i386: > > > > -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp > > +149: 81 ec 64 01 00 00 sub $0x164,%esp > > > > $ size ../sha512_generic-i386-00* > > text data bss dec hex filename > > 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o > > 4225 712 0 4937 1349 ../sha512_generic-i386-001.o > > > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > > Cc: stable@vger.kernel.org > > Hmm, your patch doesn't apply against my crypto tree. Please > regenerate. I think this is because your tree contained "%16" code instead of "&15". Now that it contains "&15" it should become applicable. Anyway. -------------------------------------------------------------------------- [PATCH] sha512: reduce stack usage even on i386 Fix still excessive stack usage on i386. There is too much loop unrolling going on, despite W[16] being used, gcc screws up this for some reason.
So, don't be smart, use simple code from SHA-512 definition, this keeps code size _and_ stack usage back under control even on i386: -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp +149: 81 ec 64 01 00 00 sub $0x164,%esp $ size ../sha512_generic-i386-00* text data bss dec hex filename 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o 4225 712 0 4937 1349 ../sha512_generic-i386-001.o Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- crypto/sha512_generic.c | 42 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 22 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -100,35 +100,33 @@ sha512_transform(u64 *state, const u8 *input) #define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 #define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i & 15]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - - for (i = 0; i < 16; i += 8) { + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 + + for (i = 0; i < 16; i++) { SHA512_0_15(i, a, b, c, d, e, f, g, h); - SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); - SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); - SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); - SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); - SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); - SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); - SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); } - for (i = 16; i < 80; i += 8) { + for (i = 16; i < 80; i++) { SHA512_16_79(i, a, b, c, d, e, f, g, h); - SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); - SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); - SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); - SHA512_16_79(i 
+ 4, e, f, g, h, a, b, c, d); - SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); - SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); - SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-27 17:51 ` Alexey Dobriyan @ 2012-01-27 22:32 ` Herbert Xu 2012-01-30 11:10 ` Alexey Dobriyan 0 siblings, 1 reply; 42+ messages in thread From: Herbert Xu @ 2012-01-27 22:32 UTC (permalink / raw) To: Alexey Dobriyan Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Fri, Jan 27, 2012 at 08:51:30PM +0300, Alexey Dobriyan wrote: > > I think this is because your tree contained "%16" code instead of "&15". > Now that it contains "&15" it should become applicable. OK. > -------------------------------------------------------------------------- > [PATCH] sha512: reduce stack usage even on i386 Can you try the approach that git takes with using asm to read and write W (see previous email from Linus in response to my push request)? As it stands your patch is simply relying on gcc's ability to optimise. At least with asm volatile we know that gcc will leave it alone. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-27 22:32 ` Herbert Xu @ 2012-01-30 11:10 ` Alexey Dobriyan 2012-02-03 3:34 ` Herbert Xu 0 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-30 11:10 UTC (permalink / raw) To: Herbert Xu Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On 1/28/12, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Fri, Jan 27, 2012 at 08:51:30PM +0300, Alexey Dobriyan wrote: >> >> I think this is because your tree contained "%16" code instead if "&15". >> Now that it contains "&15" it should become applicable. > > OK. > >> -------------------------------------------------------------------------- >> [PATCH] sha512: reduce stack usage even on i386 > > Can you try the approach that git takes with using asm to read > and write W (see previous email from Linus in respone to my push > request)? As it stands your patch is simply relying on gcc's > ability to optimise. At least with asm volatile we know that > gcc will leave it alone. For some reason it doesn't. :-( I've also tried full barriers. With this patch, stack usage is still ~900 bytes. 
diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c index dd0439d..35e7ae7 100644 --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -66,16 +66,6 @@ static const u64 sha512_K[80] = { #define s0(x) (ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7)) #define s1(x) (ror64(x,19) ^ ror64(x,61) ^ (x >> 6)) -static inline void LOAD_OP(int I, u64 *W, const u8 *input) -{ - W[I] = __be64_to_cpu( ((__be64*)(input))[I] ); -} - -static inline void BLEND_OP(int I, u64 *W) -{ - W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]); -} - static void sha512_transform(u64 *state, const u8 *input) { @@ -84,26 +74,29 @@ sha512_transform(u64 *state, const u8 *input) int i; u64 W[16]; - /* load the input */ - for (i = 0; i < 16; i++) - LOAD_OP(i, W, input); - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; #define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ + { \ + u64 tmp = be64_to_cpu(*((__be64 *)input + (i))); \ + *(volatile u64 *)&W[i] = tmp; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + tmp; \ t2 = e0(a) + Maj(a, b, c); \ d += t1; \ - h = t1 + t2 + h = t1 + t2; \ + } #define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ - BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15];\ + { \ + u64 tmp = W[(i) & 15] + s1(W[(i-2) & 15]) + W[(i-7) & 15] + s0(W[(i-15) & 15]);\ + *(volatile u64 *)&W[(i) & 15] = tmp; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + tmp; \ t2 = e0(a) + Maj(a, b, c); \ d += t1; \ - h = t1 + t2 + h = t1 + t2; \ + } for (i = 0; i < 16; i += 8) { SHA512_0_15(i, a, b, c, d, e, f, g, h); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-30 11:10 ` Alexey Dobriyan @ 2012-02-03 3:34 ` Herbert Xu 0 siblings, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-02-03 3:34 UTC (permalink / raw) To: Alexey Dobriyan Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Mon, Jan 30, 2012 at 01:10:48PM +0200, Alexey Dobriyan wrote: > > For some reason it doesn't. :-( I've also tried full barriers. OK, I don't think it worked in the sha1 case either, as on my i386 here lib/sha1.c ends up using nearly 500 bytes of stack space. It appears that the crux of the issue here is the use of 64-bit ops. gcc just doesn't like that at all on i386 and ends up putting every other intermediate result on the stack in its own unique place. While your de-unrolling patch does reduce the stack use, it also incurs a speed penalty of around 15%, and that's on top of the 4% we've already added by getting rid of per_cpu. However, this patch seems to work for me, and reduces the stack usage to a level about the same as lib/sha1.c + crypto/sha1_generic.c without adding any new overhead. So let me know if you have any better ideas.
diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c index 3edebfd..f04af93 100644 --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -89,46 +89,42 @@ sha512_transform(u64 *state, const u8 *input) int i; u64 W[16]; - /* load the input */ - for (i = 0; i < 16; i++) - LOAD_OP(i, W, input); - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; -#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ - t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - -#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ - BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ - t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - - for (i = 0; i < 16; i += 8) { - SHA512_0_15(i, a, b, c, d, e, f, g, h); - SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); - SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); - SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); - SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); - SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); - SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); - SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); - } - for (i = 16; i < 80; i += 8) { - SHA512_16_79(i, a, b, c, d, e, f, g, h); - SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); - SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); - SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); - SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); - SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); - SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); - SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); + /* now iterate */ + for (i=0; i<80; i+=8) { + if (!(i & 8)) { + int j; + + if (i < 16) { + /* load the input */ + for (j = 0; j < 16; j++) + LOAD_OP(i + j, W, input); + } else { + for (j = 0; j < 16; j++) { + BLEND_OP(i + j, W); + } + } + } + + t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[(i & 15)]; + t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2; + t1 = g + e1(d) + Ch(d,e,f) + 
sha512_K[i+1] + W[(i & 15) + 1]; + t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2; + t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[(i & 15) + 2]; + t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2; + t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[(i & 15) + 3]; + t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2; + t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[(i & 15) + 4]; + t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2; + t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[(i & 15) + 5]; + t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2; + t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[(i & 15) + 6]; + t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2; + t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[(i & 15) + 7]; + t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2; } state[0] += a; state[1] += b; state[2] += c; state[3] += d; Thanks! -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan @ 2012-01-14 21:46 ` Eric Dumazet 2012-01-14 21:52 ` Linus Torvalds 2012-01-15 1:43 ` Herbert Xu 2 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2012-01-14 21:46 UTC (permalink / raw) To: Alexey Dobriyan Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On Saturday, 14 January 2012 at 21:27 +0300, Alexey Dobriyan wrote: > commit f9e2bca6c22d75a289a349f869701214d63b5060 > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > created global message schedule area. > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > Cc: stable@vger.kernel.org > --- > > crypto/sha512_generic.c | 6 +----- > 1 file changed, 1 insertion(+), 5 deletions(-) > > --- a/crypto/sha512_generic.c > +++ b/crypto/sha512_generic.c > @@ -21,8 +21,6 @@ > #include <linux/percpu.h> > #include <asm/byteorder.h> > > -static DEFINE_PER_CPU(u64[80], msg_schedule); > - > static inline u64 Ch(u64 x, u64 y, u64 z) > { > return z ^ (x & (y ^ z)); > @@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input) > u64 a, b, c, d, e, f, g, h, t1, t2; > > int i; > - u64 *W = get_cpu_var(msg_schedule); > + u64 W[80]; > > /* load the input */ > for (i = 0; i < 16; i++) > @@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input) > > /* erase our data */ > a = b = c = d = e = f = g = h = t1 = t2 = 0; > - memset(W, 0, sizeof(__get_cpu_var(msg_schedule))); > - put_cpu_var(msg_schedule); > } > > static int Is it just me, or are you ignoring what the crypto maintainer and others thought of your patch? You are re-introducing a 640-byte stack array; how can it really be safe? This is too risky, and we provided an alternate patch, not just for fun. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet @ 2012-01-14 21:52 ` Linus Torvalds 2012-01-14 22:00 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2012-01-14 21:52 UTC (permalink / raw) To: Eric Dumazet Cc: Alexey Dobriyan, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On Sat, Jan 14, 2012 at 1:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > This is too risky, and we provided an alternate patch, not just for fun. Did you see the second patch? The one that got rid of the *stupid* 80-entry array? I don't know why so many sha implementations do that idiotic full array, when the circular one is much better. In fact, the 16-entry circular array allows machines with lots of registers to keep all the state in registers and the C implementation can often be as good as hand-tuned assembly. At least that's true for sha1, I'm not sure you can do the same with sha512. But that actually *requires* that the 16-entry array be done on the stack as an automatic array. Anything else, and the compiler won't be able to do it. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 21:52 ` Linus Torvalds @ 2012-01-14 22:00 ` Eric Dumazet 0 siblings, 0 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-14 22:00 UTC (permalink / raw) To: Linus Torvalds Cc: Alexey Dobriyan, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On Saturday, 14 January 2012 at 13:52 -0800, Linus Torvalds wrote: > On Sat, Jan 14, 2012 at 1:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > This is too risky, and we provided an alternate patch, not just for fun. > > Did you see the second patch? > I saw it and felt it was not stable material. And Herbert sent an alternate patch, then I sent a V3 patch. > The one that got rid of the *stupid* 80-entry array? > Maybe, I didn't write this code, and I am very glad this code can be smarter and all. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet @ 2012-01-15 1:43 ` Herbert Xu 2 siblings, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-15 1:43 UTC (permalink / raw) To: Alexey Dobriyan Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 09:27:37PM +0300, Alexey Dobriyan wrote: > commit f9e2bca6c22d75a289a349f869701214d63b5060 > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > created global message schedule area. > > If sha512_update will ever be entered twice, hash will be silently > calculated incorrectly. > > Probably the easiest way to notice incorrect hashes being calculated is > to run 2 ping floods over AH with hmac(sha512): > > #!/usr/sbin/setkey -f > flush; > spdflush; > add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025; > add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052; > spdadd IP1 IP2 any -P out ipsec ah/transport//require; > spdadd IP2 IP1 any -P in ipsec ah/transport//require; > > XfrmInStateProtoError will start ticking with -EBADMSG being returned > from ah_input(). This never happens with, say, hmac(sha1). > > With patch applied (on BOTH sides), XfrmInStateProtoError does not tick > with multiple bidirectional ping flood streams like it doesn't tick > with SHA-1. > > After this patch sha512_transform() will start using ~750 bytes of stack on x86_64. > This is OK for simple loads, for something more heavy, stack reduction will be done > separately.
> > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > Cc: stable@vger.kernel.org OK, I've applied patches 1-2 to crypto and patch 3 to cryptodev. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 10:35 ` Eric Dumazet 2012-01-13 10:41 ` Eric Dumazet @ 2012-01-13 11:02 ` Steffen Klassert 2012-01-15 1:43 ` Herbert Xu 2012-01-13 11:45 ` David Laight 2 siblings, 1 reply; 42+ messages in thread From: Steffen Klassert @ 2012-01-13 11:02 UTC (permalink / raw) To: Eric Dumazet; +Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 11:35:42AM +0100, Eric Dumazet wrote: > On Friday, 13 January 2012 at 18:08 +1100, Herbert Xu wrote: > > On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote: > > > > > > Herbert, I couldn't come up with a single scenario. :-( > > > But the bug is easy to reproduce. > > > > OK, does this patch work for you? > > > > commit 31f4e55c09c1170f8b813c14b1299b70f50db414 > > Author: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Fri Jan 13 18:06:50 2012 +1100 > > > > crypto: sha512 - Fix msg_schedule race > > > > The percpu msg_schedule setup was unsafe as a user in a process > > context can be interrupted by a softirq user which would then > > scribble over the exact same work area. This was discovered by > > Steffen Klassert. > > > > This patch based on ideas from Eric Dumazet fixes this by using > > two independent work areas. > > > > Reported-by: Alexey Dobriyan <adobriyan@gmail.com> > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > > > > I wonder ... > > With 4096 cpus, do we really want to reserve 5242880 bytes of memory for > this function ? > > What about following patch instead ? > > (Trying a dynamic memory allocation, and fallback on a single > pre-allocated block of memory, shared by all cpus, protected by a > spinlock) If we want to do dynamic memory allocation, we could place it to the shash_desc context where we already store the intermediate digest value. This is preallocated anyway, so we don't need to do another allocation. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 11:02 ` Steffen Klassert @ 2012-01-15 1:43 ` Herbert Xu 0 siblings, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-15 1:43 UTC (permalink / raw) To: Steffen Klassert; +Cc: Eric Dumazet, Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 12:02:55PM +0100, Steffen Klassert wrote: > > If we want to do dynamic memory allocation, we could place it to the > shash_desc context where we already store the intermediate digest value. > This is preallocated anyway, so we don't need to do another allocation. The problem with that is that some call paths place shash_desc on the stack too. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: sha512: make it work, undo percpu message schedule 2012-01-13 10:35 ` Eric Dumazet 2012-01-13 10:41 ` Eric Dumazet 2012-01-13 11:02 ` Steffen Klassert @ 2012-01-13 11:45 ` David Laight 2012-01-13 12:35 ` Eric Dumazet 2 siblings, 1 reply; 42+ messages in thread From: David Laight @ 2012-01-13 11:45 UTC (permalink / raw) To: Eric Dumazet, Herbert Xu Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert > Trying a dynamic memory allocation, and fallback on a single > pre-allocated block of memory, shared by all cpus, protected by a > spinlock ... > - > + static u64 msg_schedule[80]; > + static DEFINE_SPINLOCK(msg_schedule_lock); > int i; > - u64 *W = get_cpu_var(msg_schedule); > + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN); > > + > + if (!W) { > + spin_lock_bh(&msg_schedule_lock); > + W = msg_schedule; > + } If this code can be called from an ISR, is the kmalloc() call safe? If the above is safe, wouldn't it be better to: 1) try to use the static buffer 2) try to kmalloc() a buffer 3) spinwait for the static buffer to be free David ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: sha512: make it work, undo percpu message schedule 2012-01-13 11:45 ` David Laight @ 2012-01-13 12:35 ` Eric Dumazet 0 siblings, 0 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-13 12:35 UTC (permalink / raw) To: David Laight Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert On Friday, 13 January 2012 at 11:45 +0000, David Laight wrote: > > Trying a dynamic memory allocation, and fallback on a single > > pre-allocated block of memory, shared by all cpus, protected by a > > spinlock > ... > > - > > + static u64 msg_schedule[80]; > > + static DEFINE_SPINLOCK(msg_schedule_lock); > > int i; > > - u64 *W = get_cpu_var(msg_schedule); > > + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | > __GFP_NOWARN); > > > > + > > + if (!W) { > > + spin_lock_bh(&msg_schedule_lock); > > + W = msg_schedule; > > + } > > If this code can be called from an ISR is the kmalloc() > call safe? > Yes, obviously, kmalloc() is IRQ safe. > If the above is safe, wouldn't it be better to: > 1) try to use the static buffer > 2) try to kalloc() a buffer > 3) spinwait for the static buffer to be free > No idea what you mean, or why you think it's better. kmalloc() probably can give us a block already hot in cpu cache. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-11 0:36 ` Herbert Xu 2012-01-12 23:55 ` Alexey Dobriyan @ 2012-01-13 6:22 ` Steffen Klassert 2012-01-13 6:46 ` Herbert Xu 2012-01-13 6:48 ` Eric Dumazet 1 sibling, 2 replies; 42+ messages in thread From: Steffen Klassert @ 2012-01-13 6:22 UTC (permalink / raw) To: Herbert Xu; +Cc: Alexey Dobriyan, linux-crypto, netdev, ken On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote: > On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote: > > commit f9e2bca6c22d75a289a349f869701214d63b5060 > > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > > created global message schedule area. > > > > If sha512_update will ever be entered twice, hilarity ensues. > > Hmm, do you know why this happens? On the face of it this shouldn't > be possible as preemption is disabled. > I did not try to reproduce, but this looks like a race of the 'local out' and the receive packet path. On 'local out' bottom halves are enabled, so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update. The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the hash value. My guess could be checked easily by disabling the bottom halves before the percpu value is fetched. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 6:22 ` Steffen Klassert @ 2012-01-13 6:46 ` Herbert Xu 2012-01-13 6:48 ` Eric Dumazet 1 sibling, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-13 6:46 UTC (permalink / raw) To: Steffen Klassert; +Cc: Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 07:22:56AM +0100, Steffen Klassert wrote: > > I did not try to reproduce, but this looks like a race of the 'local out' > and the receive packet path. On 'local out' bottom halves are enabled, > so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update. > The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the > hash value. My guess could be checked easily by disabling the bottom halves > before the percpu value is fetched. Good point. I'll send out a patch for Alexey to test. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 6:22 ` Steffen Klassert 2012-01-13 6:46 ` Herbert Xu @ 2012-01-13 6:48 ` Eric Dumazet 2012-01-13 6:50 ` Herbert Xu 2012-01-13 9:45 ` David Laight 1 sibling, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-13 6:48 UTC (permalink / raw) To: Steffen Klassert; +Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken On Friday, 13 January 2012 at 07:22 +0100, Steffen Klassert wrote: > On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote: > > On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote: > > > commit f9e2bca6c22d75a289a349f869701214d63b5060 > > > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > > > created global message schedule area. > > > > > > If sha512_update will ever be entered twice, hilarity ensues. > > > > Hmm, do you know why this happens? On the face of it this shouldn't > > be possible as preemption is disabled. > > > > I did not try to reproduce, but this looks like a race of the 'local out' > and the receive packet path. On 'local out' bottom halves are enabled, > so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update. > The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the > hash value. My guess could be checked easily by disabling the bottom halves > before the percpu value is fetched. Good catch. It can be generalized to any interrupts (soft and hard). Another solution is using two blocks, one used from interrupt context. static DEFINE_PER_CPU(u64[80], msg_schedule); static DEFINE_PER_CPU(u64[80], msg_schedule_irq); (Like we do for SNMP mibs on !x86 arches) ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 6:48 ` Eric Dumazet @ 2012-01-13 6:50 ` Herbert Xu 2012-01-13 9:45 ` David Laight 1 sibling, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-13 6:50 UTC (permalink / raw) To: Eric Dumazet; +Cc: Steffen Klassert, Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 07:48:38AM +0100, Eric Dumazet wrote: > > Another solution is using two blocks, one used from interrupt context. > > static DEFINE_PER_CPU(u64[80], msg_schedule); > static DEFINE_PER_CPU(u64[80], msg_schedule_irq); > > (Like we do for SNMP mibs on !x86 arches) Yes that sounds good. Thanks Eric! -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: sha512: make it work, undo percpu message schedule 2012-01-13 6:48 ` Eric Dumazet 2012-01-13 6:50 ` Herbert Xu @ 2012-01-13 9:45 ` David Laight 1 sibling, 0 replies; 42+ messages in thread From: David Laight @ 2012-01-13 9:45 UTC (permalink / raw) To: Eric Dumazet, Steffen Klassert Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken > Good catch. It can be generalized to any interrupts (soft and hard) > > Another solution is using two blocks, one used from interrupt context. > > static DEFINE_PER_CPU(u64[80], msg_schedule); > static DEFINE_PER_CPU(u64[80], msg_schedule_irq); > > (Like we do for SNMP mibs on !x86 arches) Don't you need one block per interrupt level ? It also means that the functions need to know where they are being called from - which may not be true. A thought is that if you are going to reserve this memory for each cpu, it might as well be reserved as part of the interrupt stack (assuming interrupts switch stacks). Process level code would still need to use a reserved buffer. David ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-11 0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan 2012-01-11 0:12 ` Alexey Dobriyan 2012-01-11 0:36 ` Herbert Xu @ 2012-01-11 1:12 ` Adrian-Ken Rueegsegger 2 siblings, 0 replies; 42+ messages in thread From: Adrian-Ken Rueegsegger @ 2012-01-11 1:12 UTC (permalink / raw) To: Alexey Dobriyan; +Cc: herbert, linux-crypto, netdev On 01/11/2012 01:00 AM, Alexey Dobriyan wrote: > commit f9e2bca6c22d75a289a349f869701214d63b5060 > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > created global message schedule area. [snip] > I personally don't understand this changelog entry: > > "The message schedule W (u64[80]) is too big for the stack." > > Hash context is dynamically allocated. My original patch did the same thing as yours and put the message schedule on the stack in sha512_transform. Herbert argued [1], that it was too big and suggested to store it in a static per-cpu area. [1] - http://www.mail-archive.com/linux-crypto@vger.kernel.org/msg02527.html ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2012-02-03 3:34 UTC | newest] Thread overview: 42+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-01-11 0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan 2012-01-11 0:12 ` Alexey Dobriyan 2012-01-11 0:36 ` Herbert Xu 2012-01-12 23:55 ` Alexey Dobriyan 2012-01-13 0:19 ` Herbert Xu 2012-01-13 7:08 ` Herbert Xu 2012-01-13 10:35 ` Eric Dumazet 2012-01-13 10:41 ` Eric Dumazet 2012-01-13 10:57 ` Eric Dumazet 2012-01-13 11:33 ` Alexey Dobriyan 2012-01-13 12:34 ` Eric Dumazet 2012-01-14 18:20 ` Alexey Dobriyan 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan 2012-01-14 19:08 ` Linus Torvalds 2012-01-14 20:41 ` Alexey Dobriyan 2012-01-14 21:14 ` Linus Torvalds 2012-01-16 9:56 ` David Laight 2012-01-16 10:20 ` Alexey Dobriyan 2012-01-16 10:23 ` Eric Dumazet 2012-01-16 11:37 ` David Laight 2012-01-17 12:03 ` Alexey Dobriyan 2012-01-18 18:02 ` [PATCH 4/3] sha512: reduce stack usage even on i386 Alexey Dobriyan 2012-01-26 2:35 ` Herbert Xu 2012-01-27 17:51 ` Alexey Dobriyan 2012-01-27 22:32 ` Herbert Xu 2012-01-30 11:10 ` Alexey Dobriyan 2012-02-03 3:34 ` Herbert Xu 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet 2012-01-14 21:52 ` Linus Torvalds 2012-01-14 22:00 ` Eric Dumazet 2012-01-15 1:43 ` Herbert Xu 2012-01-13 11:02 ` Steffen Klassert 2012-01-15 1:43 ` Herbert Xu 2012-01-13 11:45 ` David Laight 2012-01-13 12:35 ` Eric Dumazet 2012-01-13 6:22 ` Steffen Klassert 2012-01-13 6:46 ` Herbert Xu 2012-01-13 6:48 ` Eric Dumazet 2012-01-13 6:50 ` Herbert Xu 2012-01-13 9:45 ` David Laight 2012-01-11 1:12 ` Adrian-Ken Rueegsegger