* sha512: make it work, undo percpu message schedule
@ 2012-01-11  0:00 Alexey Dobriyan
  2012-01-11  0:12 ` Alexey Dobriyan
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-11  0:00 UTC (permalink / raw)
  To: herbert; +Cc: linux-crypto, netdev, ken

commit f9e2bca6c22d75a289a349f869701214d63b5060
aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
created a global message schedule area.

If sha512_update is ever entered twice, hilarity ensues.

Probably the easiest way to notice incorrect hashes being calculated is
to run 2 ping floods over AH with hmac(sha512):

	#!/usr/sbin/setkey -f
	flush;
	spdflush;
	add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
	add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
	spdadd IP1 IP2 any -P out ipsec ah/transport//require;
	spdadd IP2 IP1 any -P in ipsec ah/transport//require;

XfrmInStateProtoError will start ticking, with -EBADMSG being returned
from ah_input(). This never happens with, say, hmac(sha1).

I personally don't understand this changelog entry:

	"The message schedule W (u64[80]) is too big for the stack."

The hash context is dynamically allocated. W[80] stack usage can be
trimmed like in SHA-1; that is for a separate patch.

P.S.: If only one ping flood runs, XfrmInStateProtoError will also tick,
but very slowly. I personally can't explain this.

With the patch applied (on BOTH sides), XfrmInStateProtoError does not
tick with multiple bidirectional ping flood streams, just as it doesn't
tick with SHA-1.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 crypto/sha512_generic.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
 	return z ^ (x & (y ^ z));
@@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 W[80];
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
@@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
 }
 
 static int

^ permalink raw reply	[flat|nested] 42+ messages in thread
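[Editor's note: the failure mode above — two hash computations sharing one static scratch buffer — can be demonstrated outside the kernel. The following is a hedged userspace C sketch, not kernel code: `toy_transform`, `softirq_user`, and the multiplier constant are all illustrative. A "transform" stages data in a single shared buffer; if a second user re-enters it mid-computation (as a softirq would with the percpu `msg_schedule`), the first user's result is silently wrong.]

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 16

static uint64_t shared_W[WORDS];	/* analogue of the static percpu area */

/* Toy "transform": stage the input in the shared scratch area, then
 * digest it.  midpoint_hook, if non-null, models a softirq arriving
 * between the two phases and running another transform on the SAME
 * scratch area. */
static uint64_t toy_transform(const uint64_t *input,
			      void (*midpoint_hook)(void))
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < WORDS; i++)		/* the "message schedule" */
		shared_W[i] = input[i] * 0x9e3779b97f4a7c15ULL;
	if (midpoint_hook)
		midpoint_hook();		/* scribbles over shared_W */
	for (i = 0; i < WORDS; i++)
		sum += shared_W[i];
	return sum;				/* wrong if we were preempted */
}

/* The "softirq user": hashes a different message, reusing shared_W. */
static void softirq_user(void)
{
	uint64_t other[WORDS] = { 7 };
	toy_transform(other, 0);
}
```

A private (stack or per-context) buffer makes the midpoint interruption harmless, which is what the patch above restores.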
* Re: sha512: make it work, undo percpu message schedule
  2012-01-11  0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan
@ 2012-01-11  0:12 ` Alexey Dobriyan
  2012-01-11  0:36 ` Herbert Xu
  2012-01-11  1:12 ` Adrian-Ken Rueegsegger
  2 siblings, 0 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-11  0:12 UTC (permalink / raw)
  To: herbert; +Cc: linux-crypto, netdev, ken

On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:

> -	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));

And, yes, this is intentional -- modern gcc pisses on stone age data
clearing.

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-11  0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan
  2012-01-11  0:12 ` Alexey Dobriyan
@ 2012-01-11  0:36 ` Herbert Xu
  2012-01-12 23:55   ` Alexey Dobriyan
  2012-01-13  6:22   ` Steffen Klassert
  2012-01-11  1:12 ` Adrian-Ken Rueegsegger
  2 siblings, 2 replies; 42+ messages in thread
From: Herbert Xu @ 2012-01-11  0:36 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-crypto, netdev, ken

On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> commit f9e2bca6c22d75a289a349f869701214d63b5060
> aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> created a global message schedule area.
>
> If sha512_update is ever entered twice, hilarity ensues.

Hmm, do you know why this happens? On the face of it this shouldn't
be possible, as preemption is disabled.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-11  0:36 ` Herbert Xu
@ 2012-01-12 23:55   ` Alexey Dobriyan
  2012-01-13  0:19     ` Herbert Xu
  2012-01-13  7:08     ` Herbert Xu
  1 sibling, 2 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-12 23:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-crypto, netdev, ken

On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote:
> On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> > commit f9e2bca6c22d75a289a349f869701214d63b5060
> > aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> > created a global message schedule area.
> >
> > If sha512_update is ever entered twice, hilarity ensues.
>
> Hmm, do you know why this happens? On the face of it this shouldn't
> be possible, as preemption is disabled.

Herbert, I couldn't come up with a single scenario. :-(
But the bug is easy to reproduce.

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-12 23:55   ` Alexey Dobriyan
@ 2012-01-13  0:19     ` Herbert Xu
  2012-01-13  7:08     ` Herbert Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Herbert Xu @ 2012-01-13  0:19 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-crypto, netdev, ken

On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
>
> Herbert, I couldn't come up with a single scenario. :-(
> But the bug is easy to reproduce.

OK, I'll try to reproduce it here.

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-12 23:55   ` Alexey Dobriyan
  2012-01-13  0:19     ` Herbert Xu
@ 2012-01-13  7:08     ` Herbert Xu
  2012-01-13 10:35       ` Eric Dumazet
  1 sibling, 1 reply; 42+ messages in thread
From: Herbert Xu @ 2012-01-13  7:08 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet

On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
>
> Herbert, I couldn't come up with a single scenario. :-(
> But the bug is easy to reproduce.

OK, does this patch work for you?

commit 31f4e55c09c1170f8b813c14b1299b70f50db414
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Fri Jan 13 18:06:50 2012 +1100

    crypto: sha512 - Fix msg_schedule race

    The percpu msg_schedule setup was unsafe, as a user in process
    context can be interrupted by a softirq user which would then
    scribble over the exact same work area.  This was discovered by
    Steffen Klassert.

    This patch, based on ideas from Eric Dumazet, fixes this by using
    two independent work areas.

    Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..a49c457 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -11,6 +11,7 @@
  *
  */
 #include <crypto/internal/hash.h>
+#include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
@@ -21,7 +22,9 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
+#define SHA512_SCHEDULE_SIZE 80
+
+static DEFINE_PER_CPU(u64[SHA512_SCHEDULE_SIZE * 2], msg_schedule);
 
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
@@ -89,7 +92,8 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 *W = get_cpu_var(msg_schedule) +
+		 SHA512_SCHEDULE_SIZE * !!in_interrupt();
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
@@ -128,7 +132,7 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
+	memset(W, 0, SHA512_SCHEDULE_SIZE * sizeof(*W));
 	put_cpu_var(msg_schedule);
 }

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 42+ messages in thread
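[Editor's note: the core of Herbert's fix is the indexing trick — one backing array holds two independent schedules, and execution context selects the half. A userspace sketch of just that trick (`sched_area` and the `in_softirq` flag are stand-ins for the kernel's per-CPU variable and `in_interrupt()`):]

```c
#include <assert.h>
#include <stdint.h>

#define SHA512_SCHEDULE_SIZE 80

/* One backing area, two halves: words 0..79 serve process context,
 * words 80..159 serve interrupt context.  A softirq that preempts a
 * process-context hash therefore scribbles only on its own half. */
static uint64_t msg_schedule[SHA512_SCHEDULE_SIZE * 2];

/* in_softirq stands in for the kernel's in_interrupt() test; !! maps
 * any non-zero value to 1, exactly as in the patch above. */
static uint64_t *sched_area(int in_softirq)
{
	return msg_schedule + SHA512_SCHEDULE_SIZE * !!in_softirq;
}
```

The two callers still must not migrate between the two phases, which is why the kernel version keeps the `get_cpu_var`/`put_cpu_var` pair (it disables preemption).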
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13  7:08     ` Herbert Xu
@ 2012-01-13 10:35       ` Eric Dumazet
  2012-01-13 10:41         ` Eric Dumazet
                          ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 10:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 18:08 +1100, Herbert Xu wrote:
> On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
> >
> > Herbert, I couldn't come up with a single scenario. :-(
> > But the bug is easy to reproduce.
>
> OK, does this patch work for you?
>
> commit 31f4e55c09c1170f8b813c14b1299b70f50db414
> Author: Herbert Xu <herbert@gondor.apana.org.au>
> Date:   Fri Jan 13 18:06:50 2012 +1100
>
>     crypto: sha512 - Fix msg_schedule race
>
>     The percpu msg_schedule setup was unsafe, as a user in process
>     context can be interrupted by a softirq user which would then
>     scribble over the exact same work area.  This was discovered by
>     Steffen Klassert.
>
>     This patch, based on ideas from Eric Dumazet, fixes this by using
>     two independent work areas.
>
>     Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
>     Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I wonder ...

With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
this function ?

What about the following patch instead ?

(Try a dynamic memory allocation, and fall back on a single
pre-allocated block of memory, shared by all cpus, protected by a
spinlock.)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..5c80a76 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,7 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
 
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
@@ -87,10 +86,16 @@ static void
 sha512_transform(u64 *state, const u8 *input)
 {
 	u64 a, b, c, d, e, f, g, h, t1, t2;
-
+	static u64 msg_schedule[80];
+	static DEFINE_SPINLOCK(msg_schedule_lock);
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
+
+	if (!W) {
+		spin_lock_bh(&msg_schedule_lock);
+		W = msg_schedule;
+	}
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
 		LOAD_OP(i, W, input);
@@ -128,8 +133,11 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
+	memset(W, 0, sizeof(msg_schedule));
+	if (W == msg_schedule)
+		spin_unlock_bh(&msg_schedule_lock);
+	else
+		kfree(W);
 }
 
 static int

^ permalink raw reply related	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 10:35       ` Eric Dumazet
@ 2012-01-13 10:41         ` Eric Dumazet
  2012-01-13 10:57           ` Eric Dumazet
  2012-01-13 11:02         ` Steffen Klassert
  2012-01-13 11:45         ` David Laight
  2 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 10:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 11:35 +0100, Eric Dumazet wrote:
> I wonder ...
>
> With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
> this function ?
>
> What about the following patch instead ?
>
> (Try a dynamic memory allocation, and fall back on a single
> pre-allocated block of memory, shared by all cpus, protected by a
> spinlock.)
>
> [... patch snipped ...]
>
> -	u64 *W = get_cpu_var(msg_schedule);
> +	u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);

And a plain kmalloc() is enough, since we fully initialize the array a
bit later:

	for (i = 0; i < 16; i++)
		LOAD_OP(i, W, input);
	for (i = 16; i < 80; i++) {
		BLEND_OP(i, W);
	}

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 10:41         ` Eric Dumazet
@ 2012-01-13 10:57           ` Eric Dumazet
  2012-01-13 11:33             ` Alexey Dobriyan
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 10:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 11:41 +0100, Eric Dumazet wrote:
> And a plain kmalloc() is enough, since we fully initialize the array a
> bit later:
>
> 	for (i = 0; i < 16; i++)
> 		LOAD_OP(i, W, input);
> 	for (i = 16; i < 80; i++) {
> 		BLEND_OP(i, W);
> 	}

Here is an official patch ;)

[PATCH v3] crypto: sha512 - Fix msg_schedule race

The percpu msg_schedule setup was unsafe, as a user in process context
can be interrupted by a softirq user which would then scribble over the
exact same work area.

Bug reported by Alexey Dobriyan, and diagnosed by Steffen Klassert.

Bug added in commit f9e2bca6c22 (crypto: sha512 - Move message schedule
W[80] to static percpu area)

Try a dynamic memory allocation, and fall back to a single static area,
guarded by a spinlock.

This removes the use of percpu memory.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Adrian-Ken Rueegsegger <ken@codelabs.ch>
CC: Steffen Klassert <steffen.klassert@secunet.com>
---
 crypto/sha512_generic.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..b52ef9b 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -18,10 +18,8 @@
 #include <linux/crypto.h>
 #include <linux/types.h>
 #include <crypto/sha.h>
-#include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
 
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
@@ -87,10 +85,15 @@ static void
 sha512_transform(u64 *state, const u8 *input)
 {
 	u64 a, b, c, d, e, f, g, h, t1, t2;
-
+	static u64 msg_schedule[80];
+	static DEFINE_SPINLOCK(msg_schedule_lock);
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 *W = kmalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
 
+	if (!W) {
+		spin_lock_bh(&msg_schedule_lock);
+		W = msg_schedule;
+	}
 	/* load the input */
 	for (i = 0; i < 16; i++)
 		LOAD_OP(i, W, input);
@@ -128,8 +131,11 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
+	memset(W, 0, sizeof(msg_schedule));
+	if (W == msg_schedule)
+		spin_unlock_bh(&msg_schedule_lock);
+	else
+		kfree(W);
 }
 
 static int

^ permalink raw reply related	[flat|nested] 42+ messages in thread
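[Editor's note: the allocation strategy in Eric's patch is a general pattern — try an atomic heap allocation, and when it fails fall back to a single static buffer serialized by a lock. A hedged userspace analogue (`sched_get`/`sched_put` are invented names; `malloc` and a pthread mutex stand in for `kmalloc(GFP_ATOMIC)` and `spin_lock_bh`):]

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define SCHED_WORDS 80

static unsigned long long fallback_buf[SCHED_WORDS];
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Acquire a scratch area: prefer the heap; if allocation fails, take
 * the shared static buffer, serialized by a lock so two users can
 * never scribble on it at once. */
unsigned long long *sched_get(void)
{
	unsigned long long *W = malloc(SCHED_WORDS * sizeof(*W));

	if (!W) {
		pthread_mutex_lock(&fallback_lock);
		W = fallback_buf;
	}
	return W;
}

/* Release it: erase sensitive data, then unlock or free as appropriate. */
void sched_put(unsigned long long *W)
{
	memset(W, 0, SCHED_WORDS * sizeof(*W));
	if (W == fallback_buf)
		pthread_mutex_unlock(&fallback_lock);
	else
		free(W);
}
```

The trade-off mirrors the thread: the common case pays one allocation per transform, and the rare out-of-memory case degrades to serialized (but still correct) hashing instead of corruption.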
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 10:57           ` Eric Dumazet
@ 2012-01-13 11:33             ` Alexey Dobriyan
  2012-01-13 12:34               ` Eric Dumazet
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-13 11:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert

On 1/13/12, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> +	static u64 msg_schedule[80];
> +	static DEFINE_SPINLOCK(msg_schedule_lock);

No guys, no.

SHA-512 only needs a u64[16] running window for message scheduling.

I'm sending a whitespace-mangled patch which is only tested with the
crypto selftest passing, so you won't apply something complex.

Stack space usage drops down like this:

-139:	48 81 ec c8 02 00 00	sub    $0x2c8,%rsp
+136:	48 81 ec 18 01 00 00	sub    $0x118,%rsp

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
 	return z ^ (x & (y ^ z));
@@ -80,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)
 
 static inline void BLEND_OP(int I, u64 *W)
 {
-	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+	W[I%16] = s1(W[(I-2)%16]) + W[(I-7)%16] + s0(W[(I-15)%16]) + W[I%16];
 }
 
 static void
@@ -89,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 W[16];
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
 		LOAD_OP(i, W, input);
 
-	for (i = 16; i < 80; i++) {
-		BLEND_OP(i, W);
-	}
-
 	/* load the state into our registers */
 	a=state[0];   b=state[1];   c=state[2];   d=state[3];
 	e=state[4];   f=state[5];   g=state[6];   h=state[7];
 
-	/* now iterate */
-	for (i=0; i<80; i+=8) {
-		t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i  ] + W[i  ];
-		t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-		t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
-		t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-		t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
-		t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-		t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
-		t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-		t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
-		t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-		t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
-		t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-		t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
-		t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-		t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
-		t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+#define SHA512_0_15(i, a, b, c, d, e, f, g, h)			\
+	t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];	\
+	t2 = e0(a) + Maj(a,b,c);				\
+	d += t1;						\
+	h = t1 + t2
+
+#define SHA512_16_79(i, a, b, c, d, e, f, g, h)			\
+	BLEND_OP(i, W);						\
+	t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16];	\
+	t2 = e0(a) + Maj(a,b,c);				\
+	d += t1;						\
+	h = t1 + t2
+
+	for (i = 0; i < 16; i += 8) {
+		SHA512_0_15(i, a, b, c, d, e, f, g, h);
+		SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
+		SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
+		SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
+		SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
+		SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
+		SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
+		SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
+	}
+	for (i = 16; i < 80; i += 8) {
+		SHA512_16_79(i, a, b, c, d, e, f, g, h);
+		SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
+		SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
+		SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
+		SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
+		SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
+		SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
+		SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
 	}
 
 	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
@@ -128,8 +136,6 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
 }
 
 static int

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 11:33             ` Alexey Dobriyan
@ 2012-01-13 12:34               ` Eric Dumazet
  2012-01-14 18:20                 ` Alexey Dobriyan
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Dumazet @ 2012-01-13 12:34 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert

On Friday 13 January 2012 at 13:33 +0200, Alexey Dobriyan wrote:
> On 1/13/12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > +	static u64 msg_schedule[80];
> > +	static DEFINE_SPINLOCK(msg_schedule_lock);
>
> No guys, no.
>
> SHA-512 only needs a u64[16] running window for message scheduling.
>
> I'm sending a whitespace-mangled patch which is only tested with the
> crypto selftest passing, so you won't apply something complex.
>
> Stack space usage drops down like this:
>
> -139:	48 81 ec c8 02 00 00	sub    $0x2c8,%rsp
> +136:	48 81 ec 18 01 00 00	sub    $0x118,%rsp
>
> [... full patch snipped ...]

Even if it's true, it's not stable material.

stable teams want obvious patches.

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule
  2012-01-13 12:34               ` Eric Dumazet
@ 2012-01-14 18:20                 ` Alexey Dobriyan
  2012-01-14 18:27                   ` [PATCH 1/3] " Alexey Dobriyan
  0 siblings, 1 reply; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-14 18:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert

On Fri, Jan 13, 2012 at 01:34:13PM +0100, Eric Dumazet wrote:
> On Friday 13 January 2012 at 13:33 +0200, Alexey Dobriyan wrote:
> > On 1/13/12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > > +	static u64 msg_schedule[80];
> > > +	static DEFINE_SPINLOCK(msg_schedule_lock);
> >
> > No guys, no.
> >
> > SHA-512 only needs a u64[16] running window for message scheduling.
> >
> > [... full patch snipped ...]
>
> Even if it's true, it's not stable material.
>
> stable teams want obvious patches.

I understand that.

But it _is_ obvious if you look at what the macro does and at the
original code, and read FIPS 180-2 6.3.2 "SHA-512 Hash Computation",
from which it is obvious that W[i] depends on W[i - 16] and doesn't
depend on earlier values.

This stack reduction patch completely fixes the original bug and doesn't
show any AH errors. Given the nature of crypto code, where a one-bit
mistake ruins everything, it should be enough.

^ permalink raw reply	[flat|nested] 42+ messages in thread
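[Editor's note: Alexey's argument is easy to check numerically — because round i of the schedule reads only W[i-2], W[i-7], W[i-15] and W[i-16], a 16-word circular window reproduces the full 80-word schedule exactly. A standalone C check of that claim (function names are illustrative; `s0`/`s1` are the SHA-512 small-sigma functions from FIPS 180-2):]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t ror64(uint64_t x, int n)
{
	return (x >> n) | (x << (64 - n));
}

/* FIPS 180-2 sigma0/sigma1 for the SHA-512 message schedule */
static uint64_t s0(uint64_t x) { return ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return ror64(x, 19) ^ ror64(x, 61) ^ (x >> 6); }

/* Expand rounds 16..79 both ways and report whether every value agrees. */
static int schedules_agree(const uint64_t in[16])
{
	uint64_t full[80], win[16];
	int i;

	memcpy(full, in, 16 * sizeof(uint64_t));
	memcpy(win, in, sizeof(win));
	for (i = 16; i < 80; i++) {
		/* original code: full 80-word array */
		full[i] = s1(full[i-2]) + full[i-7] + s0(full[i-15]) + full[i-16];
		/* patched code: slot i%16 still holds W[i-16], hence "+=" */
		win[i % 16] += s1(win[(i-2) % 16]) + win[(i-7) % 16] +
			       s0(win[(i-15) % 16]);
		if (win[i % 16] != full[i])
			return 0;
	}
	return 1;
}
```

Both loops perform identical mod-2^64 arithmetic on the same values, so the comparison holds for any input block.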
* [PATCH 1/3] sha512: make it work, undo percpu message schedule
  2012-01-14 18:20                 ` Alexey Dobriyan
@ 2012-01-14 18:27                   ` Alexey Dobriyan
  2012-01-14 18:40                     ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan
                                      ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alexey Dobriyan @ 2012-01-14 18:27 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security

commit f9e2bca6c22d75a289a349f869701214d63b5060
aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
created a global message schedule area.

If sha512_update is ever entered twice, the hash will be silently
calculated incorrectly.

Probably the easiest way to notice incorrect hashes being calculated is
to run 2 ping floods over AH with hmac(sha512):

	#!/usr/sbin/setkey -f
	flush;
	spdflush;
	add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
	add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
	spdadd IP1 IP2 any -P out ipsec ah/transport//require;
	spdadd IP2 IP1 any -P in ipsec ah/transport//require;

XfrmInStateProtoError will start ticking, with -EBADMSG being returned
from ah_input(). This never happens with, say, hmac(sha1).

With the patch applied (on BOTH sides), XfrmInStateProtoError does not
tick with multiple bidirectional ping flood streams, just as it doesn't
tick with SHA-1.

After this patch, sha512_transform() will start using ~750 bytes of
stack on x86_64. This is OK for simple loads; for anything heavier,
stack reduction will be done separately.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
---
 crypto/sha512_generic.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
 #include <linux/percpu.h>
 #include <asm/byteorder.h>
 
-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
 static inline u64 Ch(u64 x, u64 y, u64 z)
 {
 	return z ^ (x & (y ^ z));
@@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
 	u64 a, b, c, d, e, f, g, h, t1, t2;
 
 	int i;
-	u64 *W = get_cpu_var(msg_schedule);
+	u64 W[80];
 
 	/* load the input */
 	for (i = 0; i < 16; i++)
@@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)
 
 	/* erase our data */
 	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
-	put_cpu_var(msg_schedule);
 }
 
 static int

^ permalink raw reply	[flat|nested] 42+ messages in thread
* [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan @ 2012-01-14 18:40 ` Alexey Dobriyan 2012-01-14 19:08 ` Linus Torvalds 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet 2012-01-15 1:43 ` Herbert Xu 2 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-14 18:40 UTC (permalink / raw) To: Herbert Xu Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15] and W[i - 16]. Consequently, keeping the whole W[80] array on the stack is unnecessary; only 16 values are really needed. Using W[16] instead of W[80] greatly reduces stack usage (~750 bytes to ~340 bytes on x86_64). Line by line explanation: * BLEND_OP array is "circular" now, all indexes have to be modulo 16. Round number is positive, so remainder operation should be without surprises. * initial full message scheduling is trimmed to the first 16 values, which come from the data block; the rest is calculated before it's needed. * original loop body is an unrolled version of the new SHA512_0_15 and SHA512_16_79 macros, unrolling was done to avoid explicit variable renaming. Otherwise it's the very same code after preprocessing. See sha1_transform() code which does the same trick. Patch survives in-tree crypto test and original bugreport test (ping flood with hmac(sha512)). See FIPS 180-2 for SHA-512 definition http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- This patch is for stable if 750-byte stack usage is not considered safe.
crypto/sha512_generic.c | 58 ++++++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 24 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input) static inline void BLEND_OP(int I, u64 *W) { - W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16]; + W[I % 16] += s1(W[(I-2) % 16]) + W[(I-7) % 16] + s0(W[(I-15) % 16]); } static void @@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input) u64 a, b, c, d, e, f, g, h, t1, t2; int i; - u64 W[80]; + u64 W[16]; /* load the input */ for (i = 0; i < 16; i++) LOAD_OP(i, W, input); - for (i = 16; i < 80; i++) { - BLEND_OP(i, W); - } - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; - /* now iterate */ - for (i=0; i<80; i+=8) { - t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ]; - t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2; - t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1]; - t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2; - t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2]; - t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2; - t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3]; - t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2; - t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4]; - t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2; - t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5]; - t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2; - t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6]; - t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2; - t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7]; - t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2; +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ + t2 = e0(a) + Maj(a, b, c); \ + d += t1; \ + h = t1 + t2 + +#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ + BLEND_OP(i, W); \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16]; \ + t2 = e0(a) + Maj(a, b, c); \ + 
d += t1; \ + h = t1 + t2 + + for (i = 0; i < 16; i += 8) { + SHA512_0_15(i, a, b, c, d, e, f, g, h); + SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); + SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); + SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); + SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); + SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); + SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); + } + for (i = 16; i < 80; i += 8) { + SHA512_16_79(i, a, b, c, d, e, f, g, h); + SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); + SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); + SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); + SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); + SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); + SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); + SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan @ 2012-01-14 19:08 ` Linus Torvalds 2012-01-14 20:41 ` Alexey Dobriyan 0 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2012-01-14 19:08 UTC (permalink / raw) To: Alexey Dobriyan Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan <adobriyan@gmail.com> wrote: > > Line by line explanation: > * BLEND_OP > array is "circular" now, all indexes have to be modulo 16. > Round number is positive, so remainder operation should be > without surprises. Don't use "%" except on unsigned values. Even if it's positive, if it's a signed number and the compiler doesn't *see* that it is absolutely positive, division is nontrivial. Even when you divide by a constant. For example, "% 16" on an 'int' on x86-64 will generate movl %edi, %edx sarl $31, %edx shrl $28, %edx leal (%rdi,%rdx), %eax andl $15, %eax subl %edx, %eax in order to get the signed case right. The fact that the end result is correct for unsigned numbers is irrelevant: it's still stupid and slow. With an unsigned int, '% 16' will generate the obvious andl $15, %eax instead. Quite frankly, stop using division in the first place. Dividing by powers-of-two and expecting the compiler to fix things up is just stupid, *exactly* because of issues like these: you either have to think about it carefully, or the compiler may end up creating crap code. So just use "& 15" instead. That doesn't have these kinds of issues. It is a *good* thing when the C code is close to the end result you want to generate. It is *not* a good thing to write code that looks nothing like the end result and just expect the compiler to do the right thing. Even if the compiler does do the right thing, what was the advantage? Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 19:08 ` Linus Torvalds @ 2012-01-14 20:41 ` Alexey Dobriyan 2012-01-14 21:14 ` Linus Torvalds 2012-01-16 9:56 ` David Laight 0 siblings, 2 replies; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-14 20:41 UTC (permalink / raw) To: Linus Torvalds Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 11:08:45AM -0800, Linus Torvalds wrote: > On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > > Line by line explanation: > > * BLEND_OP > > array is "circular" now, all indexes have to be modulo 16. > > Round number is positive, so remainder operation should be > > without surprises. > > Don't use "%" except on unsigned values. Even if it's positive, if > it's a signed number and the compiler doesn't *see* that it is > absolutely positive, division is nontrivial. Even when you divide by a > constant. > > For example, "% 16" on an 'int' on x86-64 will generate > > movl %edi, %edx > sarl $31, %edx > shrl $28, %edx > leal (%rdi,%rdx), %eax > andl $15, %eax > subl %edx, %eax > > in order to get the signed case right. The fact that the end result is > correct for unsigned numbers is irrelevant: it's still stupid and > slow. > > With an unsigned int, '% 16' will generate the obvious > > andl $15, %eax > > instead. > > Quite frankly, stop using division in the first place. Dividing by > powers-of-two and expecting the compiler to fix things up is just > stupid, *exactly* because of issues like these: you either have to > think about it carefully, or the compiler may end up creating crap > code. For the record, it generates "andl $15" here. > So just use "& 15" instead. That doesn't have these kinds of issues. > It is a *good* thing when the C code is close to the end result you > want to generate. 
It is *not* a good thing to write code that looks > nothing like the end result and just expect the compiler to do the > right thing. Even if the compiler does do the right thing, what was > the advantage? Here is an updated patch which explicitly uses & (equally tested): --------------------------------------------------------------- For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15] and W[i - 16]. Consequently, keeping the whole W[80] array on the stack is unnecessary; only 16 values are really needed. Using W[16] instead of W[80] greatly reduces stack usage (~750 bytes to ~340 bytes on x86_64). Line by line explanation: * BLEND_OP array is "circular" now, all indexes have to be modulo 16. * initial full message scheduling is trimmed to the first 16 values, which come from the data block; the rest is calculated right before it's needed. * original loop body is an unrolled version of the new SHA512_0_15 and SHA512_16_79 macros, unrolling was done to avoid explicit variable renaming. Otherwise it's the very same code after preprocessing. See sha1_transform() code which does the same trick. Patch survives in-tree crypto test and original bugreport test (ping flood with hmac(sha512)). See FIPS 180-2 for SHA-512 definition http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- This patch is for stable if 750-byte stack usage is not considered safe.
crypto/sha512_generic.c | 58 ++++++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 24 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input) static inline void BLEND_OP(int I, u64 *W) { - W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16]; + W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]); } static void @@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input) u64 a, b, c, d, e, f, g, h, t1, t2; int i; - u64 W[80]; + u64 W[16]; /* load the input */ for (i = 0; i < 16; i++) LOAD_OP(i, W, input); - for (i = 16; i < 80; i++) { - BLEND_OP(i, W); - } - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; - /* now iterate */ - for (i=0; i<80; i+=8) { - t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ]; - t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2; - t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1]; - t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2; - t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2]; - t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2; - t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3]; - t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2; - t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4]; - t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2; - t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5]; - t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2; - t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6]; - t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2; - t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7]; - t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2; +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ + t2 = e0(a) + Maj(a, b, c); \ + d += t1; \ + h = t1 + t2 + +#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ + BLEND_OP(i, W); \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ + t2 = e0(a) + Maj(a, b, c); \ + 
d += t1; \ + h = t1 + t2 + + for (i = 0; i < 16; i += 8) { + SHA512_0_15(i, a, b, c, d, e, f, g, h); + SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); + SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); + SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); + SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); + SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); + SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); + } + for (i = 16; i < 80; i += 8) { + SHA512_16_79(i, a, b, c, d, e, f, g, h); + SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); + SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); + SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); + SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); + SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); + SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); + SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 20:41 ` Alexey Dobriyan @ 2012-01-14 21:14 ` Linus Torvalds 2012-01-16 9:56 ` David Laight 1 sibling, 0 replies; 42+ messages in thread From: Linus Torvalds @ 2012-01-14 21:14 UTC (permalink / raw) To: Alexey Dobriyan Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 12:41 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote: > > For the record, it generates "andl $15" here. Ok. That means that gcc was able to prove that it never had any signed values (which is certainly reasonable when you do things like "for (i=0; i<X;i++)"). But it's better to simply not rely on gcc always getting details like this right. It's also better to use a model that simply doesn't even require you as a programmer to have to even *think* about signed values. It's easy to get "%" wrong by mistake (signed integer modulus didn't even use to have well-defined semantics in traditional C), and there is almost never any excuse for using it for powers-of-two. > Here is updated patch which explicitly uses & (equally tested): Thanks, Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-14 20:41 ` Alexey Dobriyan 2012-01-14 21:14 ` Linus Torvalds @ 2012-01-16 9:56 ` David Laight 2012-01-16 10:20 ` Alexey Dobriyan 2012-01-16 10:23 ` Eric Dumazet 1 sibling, 2 replies; 42+ messages in thread From: David Laight @ 2012-01-16 9:56 UTC (permalink / raw) To: Alexey Dobriyan, Linus Torvalds Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security Doesn't this badly overflow W[] .. > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ ... > + for (i = 0; i < 16; i += 8) { ... > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); > + } David ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 9:56 ` David Laight @ 2012-01-16 10:20 ` Alexey Dobriyan 2012-01-16 10:23 ` Eric Dumazet 1 sibling, 0 replies; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-16 10:20 UTC (permalink / raw) To: David Laight Cc: Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On 1/16/12, David Laight <David.Laight@aculab.com> wrote: > Doesn't this badly overflow W[] .. > >> +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ >> + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ > ... >> + for (i = 0; i < 16; i += 8) { > ... >> + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); >> + } No, why should it? i can be only 0 and 8. ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 9:56 ` David Laight 2012-01-16 10:20 ` Alexey Dobriyan @ 2012-01-16 10:23 ` Eric Dumazet 2012-01-16 11:37 ` David Laight 2012-01-17 12:03 ` Alexey Dobriyan 1 sibling, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-16 10:23 UTC (permalink / raw) To: David Laight Cc: Alexey Dobriyan, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security Le lundi 16 janvier 2012 à 09:56 +0000, David Laight a écrit : > Doesn't this badly overflow W[] .. > > > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ > > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ > ... > > + for (i = 0; i < 16; i += 8) { > ... > > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); > > + } > > David > > No overflow since loop is done for only i=0 and i=8 By the way, I suspect previous code was chosen years ago because this version uses less stack but adds much more code bloat. size crypto/sha512_generic.o crypto/sha512_generic_old.o text data bss dec hex filename 17369 704 0 18073 4699 crypto/sha512_generic.o 8249 704 0 8953 22f9 crypto/sha512_generic_old.o ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 10:23 ` Eric Dumazet @ 2012-01-16 11:37 ` David Laight 2012-01-17 12:03 ` Alexey Dobriyan 1 sibling, 0 replies; 42+ messages in thread From: David Laight @ 2012-01-16 11:37 UTC (permalink / raw) To: Eric Dumazet Cc: Alexey Dobriyan, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security > By the way, I suspect previous code was chosen years ago because this > version uses less stack but adds much more code bloat. > > size crypto/sha512_generic.o crypto/sha512_generic_old.o > text data bss dec hex filename > 17369 704 0 18073 4699 crypto/sha512_generic.o > 8249 704 0 8953 22f9 > crypto/sha512_generic_old.o I wonder what the difference in execution times is? I know this is very dependant in the actual cpu and the cold/warm cache state... But some measurements might be useful. David ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/3] sha512: reduce stack usage to safe number 2012-01-16 10:23 ` Eric Dumazet 2012-01-16 11:37 ` David Laight @ 2012-01-17 12:03 ` Alexey Dobriyan 2012-01-18 18:02 ` [PATCH 4/3] sha512: reduce stack usage even on i386 Alexey Dobriyan 1 sibling, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-17 12:03 UTC (permalink / raw) To: Eric Dumazet Cc: David Laight, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On 1/16/12, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 16 janvier 2012 à 09:56 +0000, David Laight a écrit : >> Doesn't this badly overflow W[] .. >> >> > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ >> > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ >> ... >> > + for (i = 0; i < 16; i += 8) { >> ... >> > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); >> > + } >> >> David >> >> > > No overflow since loop is done for only i=0 and i=8 > > By the way, I suspect previous code was chosen years ago because this > version uses less stack but adds much more code bloat. I think W[80] was used because it's the most straightforward way to write this code by following the spec. All SHA definitions have the full message schedule pseudocoded before the hash computation. > size crypto/sha512_generic.o crypto/sha512_generic_old.o > text data bss dec hex filename > 17369 704 0 18073 4699 crypto/sha512_generic.o > 8249 704 0 8953 22f9 crypto/sha512_generic_old.o This is because SHA-512 is a fundamentally 64-bit algorithm, multiplied by excessive unrolling. Surprisingly, doing variable renaming by hand like in the spec: t1 = ... t2 = ... h = g; g = f; f = e; e = d + t1; d = c; c = b; b = a; a = t1 + t2; brings stack space on i386 under control too. ^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-17 12:03 ` Alexey Dobriyan @ 2012-01-18 18:02 ` Alexey Dobriyan 2012-01-26 2:35 ` Herbert Xu 0 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-18 18:02 UTC (permalink / raw) To: Herbert Xu Cc: David Laight, Linus Torvalds, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet Fix still excessive stack usage on i386. There is too much loop unrolling going on, despite W[16] being used, gcc screws up this for some reason. So, don't be smart, use simple code from SHA-512 definition, this keeps code size _and_ stack usage back under control even on i386: -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp +149: 81 ec 64 01 00 00 sub $0x164,%esp $ size ../sha512_generic-i386-00* text data bss dec hex filename 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o 4225 712 0 4937 1349 ../sha512_generic-i386-001.o Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- crypto/sha512_generic.c | 42 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 22 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -95,35 +95,33 @@ sha512_transform(u64 *state, const u8 *input) #define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 #define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i & 15]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - - for (i = 0; i < 16; i += 8) { + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 + + for (i = 0; i < 16; i++) { SHA512_0_15(i, a, b, c, d, e, f, g, h); - SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); - 
SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); - SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); - SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); - SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); - SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); - SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); } - for (i = 16; i < 80; i += 8) { + for (i = 16; i < 80; i++) { SHA512_16_79(i, a, b, c, d, e, f, g, h); - SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); - SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); - SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); - SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); - SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); - SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); - SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-18 18:02 ` [PATCH 4/3] sha512: reduce stack usage even on i386 Alexey Dobriyan @ 2012-01-26 2:35 ` Herbert Xu 2012-01-27 17:51 ` Alexey Dobriyan 0 siblings, 1 reply; 42+ messages in thread From: Herbert Xu @ 2012-01-26 2:35 UTC (permalink / raw) To: Alexey Dobriyan Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Wed, Jan 18, 2012 at 09:02:10PM +0300, Alexey Dobriyan wrote: > Fix still excessive stack usage on i386. > > There is too much loop unrolling going on, despite W[16] being used, > gcc screws up this for some reason. So, don't be smart, use simple code > from SHA-512 definition, this keeps code size _and_ stack usage back > under control even on i386: > > -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp > +149: 81 ec 64 01 00 00 sub $0x164,%esp > > $ size ../sha512_generic-i386-00* > text data bss dec hex filename > 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o > 4225 712 0 4937 1349 ../sha512_generic-i386-001.o > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > Cc: stable@vger.kernel.org Hmm, your patch doesn't apply against my crypto tree. Please regenerate. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-26 2:35 ` Herbert Xu @ 2012-01-27 17:51 ` Alexey Dobriyan 2012-01-27 22:32 ` Herbert Xu 0 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-27 17:51 UTC (permalink / raw) To: Herbert Xu Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Thu, Jan 26, 2012 at 01:35:02PM +1100, Herbert Xu wrote: > On Wed, Jan 18, 2012 at 09:02:10PM +0300, Alexey Dobriyan wrote: > > Fix still excessive stack usage on i386. > > > > There is too much loop unrolling going on, despite W[16] being used, > > gcc screws up this for some reason. So, don't be smart, use simple code > > from SHA-512 definition, this keeps code size _and_ stack usage back > > under control even on i386: > > > > -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp > > +149: 81 ec 64 01 00 00 sub $0x164,%esp > > > > $ size ../sha512_generic-i386-00* > > text data bss dec hex filename > > 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o > > 4225 712 0 4937 1349 ../sha512_generic-i386-001.o > > > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > > Cc: stable@vger.kernel.org > > Hmm, your patch doesn't apply against my crypto tree. Please > regenerate. I think this is because your tree contained "%16" code instead of "&15". Now that it contains "&15" it should become applicable. Anyway. -------------------------------------------------------------------------- [PATCH] sha512: reduce stack usage even on i386 Fix still excessive stack usage on i386. There is too much loop unrolling going on, despite W[16] being used, gcc screws up this for some reason.
So, don't be smart, use simple code from SHA-512 definition, this keeps code size _and_ stack usage back under control even on i386: -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp +149: 81 ec 64 01 00 00 sub $0x164,%esp $ size ../sha512_generic-i386-00* text data bss dec hex filename 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o 4225 712 0 4937 1349 ../sha512_generic-i386-001.o Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable@vger.kernel.org --- crypto/sha512_generic.c | 42 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 22 deletions(-) --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -100,35 +100,33 @@ sha512_transform(u64 *state, const u8 *input) #define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 #define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i & 15]; \ t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - - for (i = 0; i < 16; i += 8) { + h = g; \ + g = f; \ + f = e; \ + e = d + t1; \ + d = c; \ + c = b; \ + b = a; \ + a = t1 + t2 + + for (i = 0; i < 16; i++) { SHA512_0_15(i, a, b, c, d, e, f, g, h); - SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); - SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); - SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); - SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); - SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); - SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); - SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); } - for (i = 16; i < 80; i += 8) { + for (i = 16; i < 80; i++) { SHA512_16_79(i, a, b, c, d, e, f, g, h); - SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); - SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); - SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); - SHA512_16_79(i 
+ 4, e, f, g, h, a, b, c, d); - SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); - SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); - SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); } state[0] += a; state[1] += b; state[2] += c; state[3] += d; ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-27 17:51 ` Alexey Dobriyan @ 2012-01-27 22:32 ` Herbert Xu 2012-01-30 11:10 ` Alexey Dobriyan 0 siblings, 1 reply; 42+ messages in thread From: Herbert Xu @ 2012-01-27 22:32 UTC (permalink / raw) To: Alexey Dobriyan Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Fri, Jan 27, 2012 at 08:51:30PM +0300, Alexey Dobriyan wrote: > > I think this is because your tree contained "%16" code instead of "&15". > Now that it contains "&15" it should become applicable. OK. > -------------------------------------------------------------------------- > [PATCH] sha512: reduce stack usage even on i386 Can you try the approach that git takes with using asm to read and write W (see previous email from Linus in response to my push request)? As it stands your patch is simply relying on gcc's ability to optimise. At least with asm volatile we know that gcc will leave it alone. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-27 22:32 ` Herbert Xu @ 2012-01-30 11:10 ` Alexey Dobriyan 2012-02-03 3:34 ` Herbert Xu 0 siblings, 1 reply; 42+ messages in thread From: Alexey Dobriyan @ 2012-01-30 11:10 UTC (permalink / raw) To: Herbert Xu Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On 1/28/12, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Fri, Jan 27, 2012 at 08:51:30PM +0300, Alexey Dobriyan wrote: >> >> I think this is because your tree contained "%16" code instead if "&15". >> Now that it contains "&15" it should become applicable. > > OK. > >> -------------------------------------------------------------------------- >> [PATCH] sha512: reduce stack usage even on i386 > > Can you try the approach that git takes with using asm to read > and write W (see previous email from Linus in respone to my push > request)? As it stands your patch is simply relying on gcc's > ability to optimise. At least with asm volatile we know that > gcc will leave it alone. For some reason it doesn't. :-( I've also tried full barriers. With this patch, stack usage is still ~900 bytes. 
diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c index dd0439d..35e7ae7 100644 --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -66,16 +66,6 @@ static const u64 sha512_K[80] = { #define s0(x) (ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7)) #define s1(x) (ror64(x,19) ^ ror64(x,61) ^ (x >> 6)) -static inline void LOAD_OP(int I, u64 *W, const u8 *input) -{ - W[I] = __be64_to_cpu( ((__be64*)(input))[I] ); -} - -static inline void BLEND_OP(int I, u64 *W) -{ - W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]); -} - static void sha512_transform(u64 *state, const u8 *input) { @@ -84,26 +74,29 @@ sha512_transform(u64 *state, const u8 *input) int i; u64 W[16]; - /* load the input */ - for (i = 0; i < 16; i++) - LOAD_OP(i, W, input); - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; #define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ + { \ + u64 tmp = be64_to_cpu(*((__be64 *)input + (i))); \ + *(volatile u64 *)&W[i] = tmp; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + tmp; \ t2 = e0(a) + Maj(a, b, c); \ d += t1; \ - h = t1 + t2 + h = t1 + t2; \ + } #define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ - BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15];\ + { \ + u64 tmp = W[(i) & 15] + s1(W[(i-2) & 15]) + W[(i-7) & 15] + s0(W[(i-15) & 15]);\ + *(volatile u64 *)&W[(i) & 15] = tmp; \ + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + tmp; \ t2 = e0(a) + Maj(a, b, c); \ d += t1; \ - h = t1 + t2 + h = t1 + t2; \ + } for (i = 0; i < 16; i += 8) { SHA512_0_15(i, a, b, c, d, e, f, g, h); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 4/3] sha512: reduce stack usage even on i386 2012-01-30 11:10 ` Alexey Dobriyan @ 2012-02-03 3:34 ` Herbert Xu 0 siblings, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-02-03 3:34 UTC (permalink / raw) To: Alexey Dobriyan Cc: David Laight, Linus Torvalds, linux-crypto, netdev, ken, Steffen Klassert, security, Eric Dumazet On Mon, Jan 30, 2012 at 01:10:48PM +0200, Alexey Dobriyan wrote: > > For some reason it doesn't. :-( I've also tried full barriers. OK, I don't think it worked in the sha1 case either, as on my i386 here lib/sha1.c ends up using nearly 500 bytes of stack space. It appears that the crux of the issue here is the use of 64-bit ops. gcc just doesn't like that at all on i386 and ends up putting every other intermediate result on the stack in its own unique place. While your de-unrolling patch does reduce the stack use, it also incurs a speed penalty of around 15%, and that's on top of the 4% we've already added by getting rid of per_cpu. However, this patch seems to work for me, and reduces the stack usage to a level about the same as lib/sha1.c + crypto/sha1_generic.c without adding any new overhead. So let me know if you have any better ideas.
diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c index 3edebfd..f04af93 100644 --- a/crypto/sha512_generic.c +++ b/crypto/sha512_generic.c @@ -89,46 +89,42 @@ sha512_transform(u64 *state, const u8 *input) int i; u64 W[16]; - /* load the input */ - for (i = 0; i < 16; i++) - LOAD_OP(i, W, input); - /* load the state into our registers */ a=state[0]; b=state[1]; c=state[2]; d=state[3]; e=state[4]; f=state[5]; g=state[6]; h=state[7]; -#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \ - t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - -#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \ - BLEND_OP(i, W); \ - t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \ - t2 = e0(a) + Maj(a, b, c); \ - d += t1; \ - h = t1 + t2 - - for (i = 0; i < 16; i += 8) { - SHA512_0_15(i, a, b, c, d, e, f, g, h); - SHA512_0_15(i + 1, h, a, b, c, d, e, f, g); - SHA512_0_15(i + 2, g, h, a, b, c, d, e, f); - SHA512_0_15(i + 3, f, g, h, a, b, c, d, e); - SHA512_0_15(i + 4, e, f, g, h, a, b, c, d); - SHA512_0_15(i + 5, d, e, f, g, h, a, b, c); - SHA512_0_15(i + 6, c, d, e, f, g, h, a, b); - SHA512_0_15(i + 7, b, c, d, e, f, g, h, a); - } - for (i = 16; i < 80; i += 8) { - SHA512_16_79(i, a, b, c, d, e, f, g, h); - SHA512_16_79(i + 1, h, a, b, c, d, e, f, g); - SHA512_16_79(i + 2, g, h, a, b, c, d, e, f); - SHA512_16_79(i + 3, f, g, h, a, b, c, d, e); - SHA512_16_79(i + 4, e, f, g, h, a, b, c, d); - SHA512_16_79(i + 5, d, e, f, g, h, a, b, c); - SHA512_16_79(i + 6, c, d, e, f, g, h, a, b); - SHA512_16_79(i + 7, b, c, d, e, f, g, h, a); + /* now iterate */ + for (i=0; i<80; i+=8) { + if (!(i & 8)) { + int j; + + if (i < 16) { + /* load the input */ + for (j = 0; j < 16; j++) + LOAD_OP(i + j, W, input); + } else { + for (j = 0; j < 16; j++) { + BLEND_OP(i + j, W); + } + } + } + + t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[(i & 15)]; + t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2; + t1 = g + e1(d) + Ch(d,e,f) + 
sha512_K[i+1] + W[(i & 15) + 1]; + t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2; + t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[(i & 15) + 2]; + t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2; + t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[(i & 15) + 3]; + t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2; + t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[(i & 15) + 4]; + t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2; + t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[(i & 15) + 5]; + t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2; + t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[(i & 15) + 6]; + t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2; + t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[(i & 15) + 7]; + t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2; } state[0] += a; state[1] += b; state[2] += c; state[3] += d; Thanks! -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan @ 2012-01-14 21:46 ` Eric Dumazet 2012-01-14 21:52 ` Linus Torvalds 2012-01-15 1:43 ` Herbert Xu 2 siblings, 1 reply; 42+ messages in thread From: Eric Dumazet @ 2012-01-14 21:46 UTC (permalink / raw) To: Alexey Dobriyan Cc: Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On Saturday, 14 January 2012 at 21:27 +0300, Alexey Dobriyan wrote: > commit f9e2bca6c22d75a289a349f869701214d63b5060 > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > created global message schedule area. > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > Cc: stable@vger.kernel.org > --- > > crypto/sha512_generic.c | 6 +----- > 1 file changed, 1 insertion(+), 5 deletions(-) > > --- a/crypto/sha512_generic.c > +++ b/crypto/sha512_generic.c > @@ -21,8 +21,6 @@ > #include <linux/percpu.h> > #include <asm/byteorder.h> > > -static DEFINE_PER_CPU(u64[80], msg_schedule); > - > static inline u64 Ch(u64 x, u64 y, u64 z) > { > return z ^ (x & (y ^ z)); > @@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input) > u64 a, b, c, d, e, f, g, h, t1, t2; > > int i; > - u64 *W = get_cpu_var(msg_schedule); > + u64 W[80]; > > /* load the input */ > for (i = 0; i < 16; i++) > @@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input) > > /* erase our data */ > a = b = c = d = e = f = g = h = t1 = t2 = 0; > - memset(W, 0, sizeof(__get_cpu_var(msg_schedule))); > - put_cpu_var(msg_schedule); > } > > static int Is it just me, or are you ignoring what the crypto maintainer and others thought of your patch? You are re-introducing a 640-byte stack array; how can it really be safe? This is too risky, and we provided an alternate patch, not just for fun. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet @ 2012-01-14 21:52 ` Linus Torvalds 2012-01-14 22:00 ` Eric Dumazet 0 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2012-01-14 21:52 UTC (permalink / raw) To: Eric Dumazet Cc: Alexey Dobriyan, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On Sat, Jan 14, 2012 at 1:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > This is too risky, and we provided an alternate patch, not just for fun. Did you see the second patch? The one that got rid of the *stupid* 80-entry array? I don't know why so many sha implementations do that idiotic full array, when the circular one is much better. In fact, the 16-entry circular array allows machines with lots of registers to keep all the state in registers and the C implementation can often be as good as hand-tuned assembly. At least that's true for sha1, I'm not sure you can do the same with sha512. But that actually *requires* that the 16-entry array be done on the stack as an automatic array. Anything else, and the compiler won't be able to do it. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 21:52 ` Linus Torvalds @ 2012-01-14 22:00 ` Eric Dumazet 0 siblings, 0 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-14 22:00 UTC (permalink / raw) To: Linus Torvalds Cc: Alexey Dobriyan, Herbert Xu, linux-crypto, netdev, ken, Steffen Klassert, security On Saturday, 14 January 2012 at 13:52 -0800, Linus Torvalds wrote: > On Sat, Jan 14, 2012 at 1:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > This is too risky, and we provided an alternate patch, not just for fun. > > Did you see the second patch? > I saw it and felt it was not stable material. And Herbert sent an alternate patch, then I sent a V3 patch. > The one that got rid of the *stupid* 80-entry array? > Maybe, I didn't write this code, and I am very glad this code can be smarter and all. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet @ 2012-01-15 1:43 ` Herbert Xu 2 siblings, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-15 1:43 UTC (permalink / raw) To: Alexey Dobriyan Cc: linux-crypto, netdev, ken, Steffen Klassert, Eric Dumazet, security On Sat, Jan 14, 2012 at 09:27:37PM +0300, Alexey Dobriyan wrote: > commit f9e2bca6c22d75a289a349f869701214d63b5060 > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > created global message schedule area. > > If sha512_update will ever be entered twice, hash will be silently > calculated incorrectly. > > Probably the easiest way to notice incorrect hashes being calculated is > to run 2 ping floods over AH with hmac(sha512): > > #!/usr/sbin/setkey -f > flush; > spdflush; > add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025; > add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052; > spdadd IP1 IP2 any -P out ipsec ah/transport//require; > spdadd IP2 IP1 any -P in ipsec ah/transport//require; > > XfrmInStateProtoError will start ticking with -EBADMSG being returned > from ah_input(). This never happens with, say, hmac(sha1). > > With patch applied (on BOTH sides), XfrmInStateProtoError does not tick > with multiple bidirectional ping flood streams like it doesn't tick > with SHA-1. > > After this patch sha512_transform() will start using ~750 bytes of stack on x86_64. > This is OK for simple loads, for something more heavy, stack reduction will be done > separately.
> > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > Cc: stable@vger.kernel.org OK, I've applied patches 1-2 to crypto and patch 3 to cryptodev. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 10:35 ` Eric Dumazet 2012-01-13 10:41 ` Eric Dumazet @ 2012-01-13 11:02 ` Steffen Klassert 2012-01-15 1:43 ` Herbert Xu 2012-01-13 11:45 ` David Laight 2 siblings, 1 reply; 42+ messages in thread From: Steffen Klassert @ 2012-01-13 11:02 UTC (permalink / raw) To: Eric Dumazet; +Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 11:35:42AM +0100, Eric Dumazet wrote: > On Friday, 13 January 2012 at 18:08 +1100, Herbert Xu wrote: > > On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote: > > > > > > Herbert, I couldn't come up with a single scenario. :-( > > > But the bug is easy to reproduce. > > > > OK, does this patch work for you? > > > > commit 31f4e55c09c1170f8b813c14b1299b70f50db414 > > Author: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Fri Jan 13 18:06:50 2012 +1100 > > > > crypto: sha512 - Fix msg_schedule race > > > > The percpu msg_schedule setup was unsafe as a user in a process > > context can be interrupted by a softirq user which would then > > scribble over the exact same work area. This was discovered by > > Steffen Klassert. > > > > This patch based on ideas from Eric Dumazet fixes this by using > > two independent work areas. > > > > Reported-by: Alexey Dobriyan <adobriyan@gmail.com> > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > > > > I wonder ... > > With 4096 cpus, do we really want to reserve 5242880 bytes of memory for > this function ? > > What about following patch instead ? > > (Trying a dynamic memory allocation, and fallback on a single > pre-allocated block of memory, shared by all cpus, protected by a > spinlock) If we want to do dynamic memory allocation, we could place it to the shash_desc context where we already store the intermediate digest value. This is preallocated anyway, so we don't need to do another allocation. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 11:02 ` Steffen Klassert @ 2012-01-15 1:43 ` Herbert Xu 0 siblings, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-15 1:43 UTC (permalink / raw) To: Steffen Klassert; +Cc: Eric Dumazet, Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 12:02:55PM +0100, Steffen Klassert wrote: > > If we want to do dynamic memory allocation, we could place it to the > shash_desc context where we already store the intermediate digest value. > This is preallocated anyway, so we don't need to do another allocation. The problem with that is that some call paths place shash_desc on the stack too. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: sha512: make it work, undo percpu message schedule 2012-01-13 10:35 ` Eric Dumazet 2012-01-13 10:41 ` Eric Dumazet 2012-01-13 11:02 ` Steffen Klassert @ 2012-01-13 11:45 ` David Laight 2012-01-13 12:35 ` Eric Dumazet 2 siblings, 1 reply; 42+ messages in thread From: David Laight @ 2012-01-13 11:45 UTC (permalink / raw) To: Eric Dumazet, Herbert Xu Cc: Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert > Trying a dynamic memory allocation, and fallback on a single > pre-allocated block of memory, shared by all cpus, protected by a > spinlock ... > - > + static u64 msg_schedule[80]; > + static DEFINE_SPINLOCK(msg_schedule_lock); > int i; > - u64 *W = get_cpu_var(msg_schedule); > + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN); > > + > + if (!W) { > + spin_lock_bh(&msg_schedule_lock); > + W = msg_schedule; > + } If this code can be called from an ISR, is the kmalloc() call safe? If the above is safe, wouldn't it be better to: 1) try to use the static buffer 2) try to kmalloc() a buffer 3) spinwait for the static buffer to be free David ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: sha512: make it work, undo percpu message schedule 2012-01-13 11:45 ` David Laight @ 2012-01-13 12:35 ` Eric Dumazet 0 siblings, 0 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-13 12:35 UTC (permalink / raw) To: David Laight Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken, Steffen Klassert On Friday, 13 January 2012 at 11:45 +0000, David Laight wrote: > > Trying a dynamic memory allocation, and fallback on a single > > pre-allocated block of memory, shared by all cpus, protected by a > > spinlock > ... > > - > > + static u64 msg_schedule[80]; > > + static DEFINE_SPINLOCK(msg_schedule_lock); > > int i; > > - u64 *W = get_cpu_var(msg_schedule); > > + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | > __GFP_NOWARN); > > > > + > > + if (!W) { > > + spin_lock_bh(&msg_schedule_lock); > > + W = msg_schedule; > > + } > > If this code can be called from an ISR is the kmalloc() > call safe? > Yes, obviously, kmalloc() is IRQ safe. > If the above is safe, wouldn't it be better to: > 1) try to use the static buffer > 2) try to kalloc() a buffer > 3) spinwait for the static buffer to be free > No idea what you mean, or why you think it's better. kmalloc() probably can give us a block already hot in cpu cache. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-11 0:36 ` Herbert Xu 2012-01-12 23:55 ` Alexey Dobriyan @ 2012-01-13 6:22 ` Steffen Klassert 2012-01-13 6:46 ` Herbert Xu 2012-01-13 6:48 ` Eric Dumazet 1 sibling, 2 replies; 42+ messages in thread From: Steffen Klassert @ 2012-01-13 6:22 UTC (permalink / raw) To: Herbert Xu; +Cc: Alexey Dobriyan, linux-crypto, netdev, ken On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote: > On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote: > > commit f9e2bca6c22d75a289a349f869701214d63b5060 > > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > > created global message schedule area. > > > > If sha512_update will ever be entered twice, hilarity ensues. > > Hmm, do you know why this happens? On the face of it this shouldn't > be possible as preemption is disabled. > I did not try to reproduce, but this looks like a race of the 'local out' and the receive packet path. On 'local out' bottom halves are enabled, so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update. The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the hash value. My guess could be checked easily by disabling the bottom halves before the percpu value is fetched. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 6:22 ` Steffen Klassert @ 2012-01-13 6:46 ` Herbert Xu 2012-01-13 6:48 ` Eric Dumazet 1 sibling, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-13 6:46 UTC (permalink / raw) To: Steffen Klassert; +Cc: Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 07:22:56AM +0100, Steffen Klassert wrote: > > I did not try to reproduce, but this looks like a race of the 'local out' > and the receive packet path. On 'local out' bottom halves are enabled, > so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update. > The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the > hash value. My guess could be checked easily by disabling the bottom halves > before the percpu value is fetched. Good point. I'll send out a patch for Alexey to test. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 6:22 ` Steffen Klassert 2012-01-13 6:46 ` Herbert Xu @ 2012-01-13 6:48 ` Eric Dumazet 2012-01-13 6:50 ` Herbert Xu 2012-01-13 9:45 ` David Laight 1 sibling, 2 replies; 42+ messages in thread From: Eric Dumazet @ 2012-01-13 6:48 UTC (permalink / raw) To: Steffen Klassert; +Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken On Friday, 13 January 2012 at 07:22 +0100, Steffen Klassert wrote: > On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote: > > On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote: > > > commit f9e2bca6c22d75a289a349f869701214d63b5060 > > > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > > > created global message schedule area. > > > > > > If sha512_update will ever be entered twice, hilarity ensues. > > > > Hmm, do you know why this happens? On the face of it this shouldn't > > be possible as preemption is disabled. > > > > I did not try to reproduce, but this looks like a race of the 'local out' > and the receive packet path. On 'local out' bottom halves are enabled, > so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update. > The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the > hash value. My guess could be checked easily by disabling the bottom halves > before the percpu value is fetched. Good catch. It can be generalized to any interrupts (soft and hard). Another solution is using two blocks, one used from interrupt context. static DEFINE_PER_CPU(u64[80], msg_schedule); static DEFINE_PER_CPU(u64[80], msg_schedule_irq); (Like we do for SNMP mibs on !x86 arches) ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-13 6:48 ` Eric Dumazet @ 2012-01-13 6:50 ` Herbert Xu 2012-01-13 9:45 ` David Laight 1 sibling, 0 replies; 42+ messages in thread From: Herbert Xu @ 2012-01-13 6:50 UTC (permalink / raw) To: Eric Dumazet; +Cc: Steffen Klassert, Alexey Dobriyan, linux-crypto, netdev, ken On Fri, Jan 13, 2012 at 07:48:38AM +0100, Eric Dumazet wrote: > > Another solution is using two blocks, one used from interrupt context. > > static DEFINE_PER_CPU(u64[80], msg_schedule); > static DEFINE_PER_CPU(u64[80], msg_schedule_irq); > > (Like we do for SNMP mibs on !x86 arches) Yes that sounds good. Thanks Eric! -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: sha512: make it work, undo percpu message schedule 2012-01-13 6:48 ` Eric Dumazet 2012-01-13 6:50 ` Herbert Xu @ 2012-01-13 9:45 ` David Laight 1 sibling, 0 replies; 42+ messages in thread From: David Laight @ 2012-01-13 9:45 UTC (permalink / raw) To: Eric Dumazet, Steffen Klassert Cc: Herbert Xu, Alexey Dobriyan, linux-crypto, netdev, ken > Good catch. It can be generalized to any interrupts (soft and hard) > > Another solution is using two blocks, one used from interrupt context. > > static DEFINE_PER_CPU(u64[80], msg_schedule); > static DEFINE_PER_CPU(u64[80], msg_schedule_irq); > > (Like we do for SNMP mibs on !x86 arches) Don't you need one block per interrupt level ? It also means that the functions need to know where they are being called from - which may not be true. A thought is that if you are going to reserve this memory for each cpu, it might as well be reserved as part of the interrupt stack (assuming interrupts switch stacks). Process level code would still need to use a reserved buffer. David ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: sha512: make it work, undo percpu message schedule 2012-01-11 0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan 2012-01-11 0:12 ` Alexey Dobriyan 2012-01-11 0:36 ` Herbert Xu @ 2012-01-11 1:12 ` Adrian-Ken Rueegsegger 2 siblings, 0 replies; 42+ messages in thread From: Adrian-Ken Rueegsegger @ 2012-01-11 1:12 UTC (permalink / raw) To: Alexey Dobriyan; +Cc: herbert, linux-crypto, netdev On 01/11/2012 01:00 AM, Alexey Dobriyan wrote: > commit f9e2bca6c22d75a289a349f869701214d63b5060 > aka "crypto: sha512 - Move message schedule W[80] to static percpu area" > created global message schedule area. [snip] > I personally don't understand this changelog entry: > > "The message schedule W (u64[80]) is too big for the stack." > > Hash context is dynamically allocated. My original patch did the same thing as yours and put the message schedule on the stack in sha512_transform. Herbert argued [1], that it was too big and suggested to store it in a static per-cpu area. [1] - http://www.mail-archive.com/linux-crypto@vger.kernel.org/msg02527.html ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2012-02-03 3:34 UTC | newest] Thread overview: 42+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-01-11 0:00 sha512: make it work, undo percpu message schedule Alexey Dobriyan 2012-01-11 0:12 ` Alexey Dobriyan 2012-01-11 0:36 ` Herbert Xu 2012-01-12 23:55 ` Alexey Dobriyan 2012-01-13 0:19 ` Herbert Xu 2012-01-13 7:08 ` Herbert Xu 2012-01-13 10:35 ` Eric Dumazet 2012-01-13 10:41 ` Eric Dumazet 2012-01-13 10:57 ` Eric Dumazet 2012-01-13 11:33 ` Alexey Dobriyan 2012-01-13 12:34 ` Eric Dumazet 2012-01-14 18:20 ` Alexey Dobriyan 2012-01-14 18:27 ` [PATCH 1/3] " Alexey Dobriyan 2012-01-14 18:40 ` [PATCH 2/3] sha512: reduce stack usage to safe number Alexey Dobriyan 2012-01-14 19:08 ` Linus Torvalds 2012-01-14 20:41 ` Alexey Dobriyan 2012-01-14 21:14 ` Linus Torvalds 2012-01-16 9:56 ` David Laight 2012-01-16 10:20 ` Alexey Dobriyan 2012-01-16 10:23 ` Eric Dumazet 2012-01-16 11:37 ` David Laight 2012-01-17 12:03 ` Alexey Dobriyan 2012-01-18 18:02 ` [PATCH 4/3] sha512: reduce stack usage even on i386 Alexey Dobriyan 2012-01-26 2:35 ` Herbert Xu 2012-01-27 17:51 ` Alexey Dobriyan 2012-01-27 22:32 ` Herbert Xu 2012-01-30 11:10 ` Alexey Dobriyan 2012-02-03 3:34 ` Herbert Xu 2012-01-14 21:46 ` [PATCH 1/3] sha512: make it work, undo percpu message schedule Eric Dumazet 2012-01-14 21:52 ` Linus Torvalds 2012-01-14 22:00 ` Eric Dumazet 2012-01-15 1:43 ` Herbert Xu 2012-01-13 11:02 ` Steffen Klassert 2012-01-15 1:43 ` Herbert Xu 2012-01-13 11:45 ` David Laight 2012-01-13 12:35 ` Eric Dumazet 2012-01-13 6:22 ` Steffen Klassert 2012-01-13 6:46 ` Herbert Xu 2012-01-13 6:48 ` Eric Dumazet 2012-01-13 6:50 ` Herbert Xu 2012-01-13 9:45 ` David Laight 2012-01-11 1:12 ` Adrian-Ken Rueegsegger