Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
@ 2004-01-30 15:43 Arnd Bergmann
  2004-01-30 16:57 ` James Morris
  0 siblings, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2004-01-30 15:43 UTC (permalink / raw)
  To: James Morris; +Cc: linux-kernel, R CHAN

James Morris wrote:
> Have you noticed if this happens for any of the other crypto algorithms?

Just as a reminder, there is still an issue with extreme stack usage
of some of the algorithms, depending on compiler version and
flags.

The worst I have seen was around 16kb for twofish_setkey on 64 bit
s390 with gcc-3.1 (iirc). Right now, I get up to 4kb for this
function with gcc-3.3.1, which probably works but is definitely
a bad sign. I've seen this as well on other architectures (iirc
on x86_64), but not as severe.

Other algorithms are bad as well, these are the top scores from
Jörn Engel's checkstack.pl (s390 64bit 2.6.1 gcc-3.3.1):

0x00000a twofish_setkey:           lay    %r15,-3960(%r15)
0x0026fc aes_decrypt:              lay    %r15,-1168(%r15)
0x000c0c aes_encrypt:              lay    %r15,-1000(%r15)
0x00000e sha512_transform:         lay    %r15,-936(%r15)
0x001292 test_deflate:             lay    %r15,-784(%r15)
0x0028a2 cast6_decrypt:            lay    %r15,-696(%r15)
0x00d1a0 twofish_encrypt:          lay    %r15,-664(%r15)
0x001b34 setkey:                   lay    %r15,-656(%r15)
0x00e2b0 twofish_decrypt:          lay    %r15,-624(%r15)
0x000c9e cast6_encrypt:            lay    %r15,-600(%r15)
0x000014 sha1_transform:           lay    %r15,-504(%r15)
                                                 ^
		This is the stack size   --------|

       Arnd <><

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 15:43 [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4 Arnd Bergmann
@ 2004-01-30 16:57 ` James Morris
  2004-01-30 18:02   ` Randy.Dunlap
  0 siblings, 1 reply; 22+ messages in thread
From: James Morris @ 2004-01-30 16:57 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-kernel, R CHAN

On Fri, 30 Jan 2004, Arnd Bergmann wrote:

> James Morris wrote:
> > Have you noticed if this happens for any of the other crypto algorithms?
> 
> Just as a reminder, there is still an issue with extreme stack usage
> of some of the algorithms, depending on compiler version and
> flags.
> 
> The worst I have seen was around 16kb for twofish_setkey on 64 bit
> s390 with gcc-3.1 (iirc). Right now, I get up to 4kb for this
> function with gcc-3.3.1, which probably works but is definitely
> a bad sign. I've seen this as well on other architectures (iirc
> on x86_64), but not as severe.
> 
> Other algorithms are bad as well, these are the top scores from
> Jörn Engel's checkstack.pl (s390 64bit 2.6.1 gcc-3.3.1):
> 
> 0x00000a twofish_setkey:           lay    %r15,-3960(%r15)
> 0x0026fc aes_decrypt:              lay    %r15,-1168(%r15)
> 0x000c0c aes_encrypt:              lay    %r15,-1000(%r15)
> 0x00000e sha512_transform:         lay    %r15,-936(%r15)
> 0x001292 test_deflate:             lay    %r15,-784(%r15)
> 0x0028a2 cast6_decrypt:            lay    %r15,-696(%r15)
> 0x00d1a0 twofish_encrypt:          lay    %r15,-664(%r15)
> 0x001b34 setkey:                   lay    %r15,-656(%r15)
> 0x00e2b0 twofish_decrypt:          lay    %r15,-624(%r15)
> 0x000c9e cast6_encrypt:            lay    %r15,-600(%r15)
> 0x000014 sha1_transform:           lay    %r15,-504(%r15)

I'm not seeing anything like this on x86 (e.g. stack usage is 8 bytes), 
and can't see an obvious way to fix the problem for your compiler.


- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 16:57 ` James Morris
@ 2004-01-30 18:02   ` Randy.Dunlap
  2004-01-30 18:35     ` James Morris
  0 siblings, 1 reply; 22+ messages in thread
From: Randy.Dunlap @ 2004-01-30 18:02 UTC (permalink / raw)
  To: James Morris; +Cc: arnd, linux-kernel, rspchan

On Fri, 30 Jan 2004 11:57:20 -0500 (EST) James Morris <jmorris@redhat.com> wrote:

| On Fri, 30 Jan 2004, Arnd Bergmann wrote:
| 
| > James Morris wrote:
| > > Have you noticed if this happens for any of the other crypto algorithms?
| > 
| > Just as a reminder, there is still an issue with extreme stack usage
| > of some of the algorithms, depending on compiler version and
| > flags.
| > 
| > The worst I have seen was around 16kb for twofish_setkey on 64 bit
| > s390 with gcc-3.1 (iirc). Right now, I get up to 4kb for this
| > function with gcc-3.3.1, which probably works but is definitely
| > a bad sign. I've seen this as well on other architectures (iirc
| > on x86_64), but not as severe.
| > 
| > Other algorithms are bad as well, these are the top scores from
| > Jörn Engel's checkstack.pl (s390 64bit 2.6.1 gcc-3.3.1):
| > 
| > 0x00000a twofish_setkey:           lay    %r15,-3960(%r15)
| > 0x0026fc aes_decrypt:              lay    %r15,-1168(%r15)
| > 0x000c0c aes_encrypt:              lay    %r15,-1000(%r15)
| > 0x00000e sha512_transform:         lay    %r15,-936(%r15)
| > 0x001292 test_deflate:             lay    %r15,-784(%r15)
| > 0x0028a2 cast6_decrypt:            lay    %r15,-696(%r15)
| > 0x00d1a0 twofish_encrypt:          lay    %r15,-664(%r15)
| > 0x001b34 setkey:                   lay    %r15,-656(%r15)
| > 0x00e2b0 twofish_decrypt:          lay    %r15,-624(%r15)
| > 0x000c9e cast6_encrypt:            lay    %r15,-600(%r15)
| > 0x000014 sha1_transform:           lay    %r15,-504(%r15)
| 
| I'm not seeing anything like this on x86 (e.g. stack usage is 8 bytes), 
| and can't see an obvious way to fix the problem for your compiler.

Here's what I see on x86 and gcc 3.2 (for Linux 2.6.1).
What Linux version of the code are you looking at?


$0x180,%esp: c01fd5aa <aes_encrypt+4/1750>
$0x1b0,%esp: c01fecfa <aes_decrypt+4/17da>
$0x230,%esp: c0206acd <test_deflate+b/2f8>
$0x10c,%esp: c0205c90 <test_hmac+6/4fc>
$0x1fc,%esp: c01e9ed8 <sha1_transform+4/178a>
$0x120,%esp: c01eb842 <sha256_transform+6/1ef0>
$0x384,%esp: c01ed988 <sha512_transform+4/17e8>
$0x10c,%esp: c01f1c90 <twofish_setkey+4/7480>


--
~Randy
kernel-janitors project:  http://janitor.kernelnewbies.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 18:02   ` Randy.Dunlap
@ 2004-01-30 18:35     ` James Morris
  2004-01-30 18:53       ` James Morris
  0 siblings, 1 reply; 22+ messages in thread
From: James Morris @ 2004-01-30 18:35 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: arnd, linux-kernel, rspchan

On Fri, 30 Jan 2004, Randy.Dunlap wrote:

> Here's what I see on x86 and gcc 3.2 (for Linux 2.6.1).
> What Linux version of the code are you looking at?
> 
> 
> $0x180,%esp: c01fd5aa <aes_encrypt+4/1750>
> $0x1b0,%esp: c01fecfa <aes_decrypt+4/17da>
> $0x230,%esp: c0206acd <test_deflate+b/2f8>
> $0x10c,%esp: c0205c90 <test_hmac+6/4fc>
> $0x1fc,%esp: c01e9ed8 <sha1_transform+4/178a>
> $0x120,%esp: c01eb842 <sha256_transform+6/1ef0>
> $0x384,%esp: c01ed988 <sha512_transform+4/17e8>
> $0x10c,%esp: c01f1c90 <twofish_setkey+4/7480>
> 

2.6.2-rc2-mm2 with gcc version 3.3.1 20030915 (Red Hat Linux 3.3.1-5)


- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 18:35     ` James Morris
@ 2004-01-30 18:53       ` James Morris
  0 siblings, 0 replies; 22+ messages in thread
From: James Morris @ 2004-01-30 18:53 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: arnd, linux-kernel, rspchan

On Fri, 30 Jan 2004, James Morris wrote:

> On Fri, 30 Jan 2004, Randy.Dunlap wrote:
> 
> > Here's what I see on x86 and gcc 3.2 (for Linux 2.6.1).
> > What Linux version of the code are you looking at?
> > 
> > 
> > $0x180,%esp: c01fd5aa <aes_encrypt+4/1750>
> > $0x1b0,%esp: c01fecfa <aes_decrypt+4/17da>
> > $0x230,%esp: c0206acd <test_deflate+b/2f8>
> > $0x10c,%esp: c0205c90 <test_hmac+6/4fc>
> > $0x1fc,%esp: c01e9ed8 <sha1_transform+4/178a>
> > $0x120,%esp: c01eb842 <sha256_transform+6/1ef0>
> > $0x384,%esp: c01ed988 <sha512_transform+4/17e8>
> > $0x10c,%esp: c01f1c90 <twofish_setkey+4/7480>
> > 
> 
> 2.6.2-rc2-mm2 with gcc version 3.3.1 20030915 (Red Hat Linux 3.3.1-5)
> 

Actually, 24 bytes (not 8):

c02120b0 <twofish_setkey>:
c02120b0:       55                      push   %ebp
c02120b1:       57                      push   %edi
c02120b2:       56                      push   %esi
c02120b3:       53                      push   %ebx
c02120b4:       83 ec 18                sub    $0x18,%esp



- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch   pentium3,4
@ 2004-02-02  4:08 linux
  2004-02-05 19:40 ` Jean-Luc Cooke
  0 siblings, 1 reply; 22+ messages in thread
From: linux @ 2004-02-02  4:08 UTC (permalink / raw)
  To: linux-kernel

If it helps anyone, here's a modified sha256.c file that takes 256 bytes
off the stack and shrinks the main code loop by a factor of 8.

The speed penalty is pretty minor, and might actually be negative due to
better icache usage.

Includes x86 stubs for standalone self-test using the NIST test vectors.
Enable the #if 0 parts and remove the commented-out debug lines
for production use.

A few other additions:
- Uses optimized unaligned-fetch and byte-swap routines.
- Reordered the the Ch() and Maj() functions slightly to work better on
  2-address instruction sets.
- Got rid of two temporaries in the round function
- Removed the need to memset() 256 bytes on the stack every call to
  sha256_transform
- Uses memset() rather than a padding array.

A few questions:
- Should we just use a u64 to count the bytes rather than doing carries by
  hand?
- Do we actually need support for hashing more than 2^32 bytes in the first
  place?
- This does the aligned/unaligned check at the outermost feasible loop
  position in the code, leading to quite a bit of duplication.
  If the processor supports unaligned loads natively, it isn't even
  necessary at all.  An better ideas?  I can understand:
  - Moving the test to once per call to sha256_transform, on the
    grounds that the latter outweights it by a large amount.
  - Always using the unaligned load, likewise.
  - Rolling my own unaligned block move-and-byte-swap (using aligned
    loads, shifts, and algned stores), since the existing unaligned code
    isn't quite what we want.
  - Just using memcpy and byte-swap in place, and letting the L1 cache
    take care of it.
- The e() functions as written take a an temp register in addition to the
  input and output.  Would it be better to rewrite them as e.g.
  better to rewrite them as e.g.
	static inline u32 e0(u32 x)
	{
		u32 y = x;
		y = RORu32(y, 22-13) ^ x;
		y = RORu32(y, 13-2) ^ x;
		return RORu32(y, 2);
	}
  to get rid of the need?  (And can someone figure out a similar trick
  for the s() functions?  I don't think it's as important because there's
  less register pressure when they're used.)

(P.S. These modifications are in the public domain; copyright abandoned.)


/*
 * Cryptographic API.
 *
 * SHA-256, as specified in
 * http://csrc.nist.gov/cryptval/shs/sha256-384-512.pdf
 *
 * SHA-256 code by Jean-Luc Cooke <jlcooke@certainkey.com>.
 *
 * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
 * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
 * Copyright (c) 2002 James Morris <jmorris@intercode.com.au>
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License as published by the Free
 * Software Foundation; either version 2 of the License, or (at your option) 
 * any later version.
 *
 */
#if 0
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/crypto.h>
#include <asm/scatterlist.h>
#include <asm/byteorder.h>
#else
typedef unsigned int u32;
typedef unsigned char u8;
#include <string.h>
#define get_unaligned(p) (*(p))
#define be32_to_cpus(p) asm("bswap %0" : "+r" (*(p)));
static inline u32
be32_to_cpup(u32 const *p)
{
	u32 x;
	asm("bswap %0" : "=r" (x) : "0" (*p));
	return x;
}
static inline u32
be32_to_cpu(u32 x)
{
	asm("bswap %0" : "+r" (x));
	return x;
}
#define cpu_to_be32s be32_to_cpus
#define cpu_to_be32p be32_to_cpup
#define cpu_to_be32 be32_to_cpu
#endif


#define SHA256_DIGEST_SIZE	32
#define SHA256_HMAC_BLOCK_SIZE	64

struct sha256_ctx {
	u32 count[2];	/* Bytes so far: Low, then high */
	u32 state[8];
	u32 W[64];
};

/* Bit-wise choose x ? y : z = (x & y) + (~x & z) = z ^ (x & (y ^ z)) */
#define Ch(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))

/*
 * Majority function.  There's no really good way to optimize this,
 * although x is the most recently computed vaue, so put it later in
 * the dependency chain.  Also use ^ in the OR so that the halves can
 * be merged with + which gives the compiler more flexibility to
 * rearrange the surrounding sum.
 * x&y | y&z | z&x = x&(y|z) | y&z = x&(y^z) + y&z
 */
#define Maj(x, y, z) (((x) & ((y) ^ (z))) + ((y) & (z)))

static inline u32 RORu32(u32 x, u32 y)
{
	return (x >> y) | (x << (32 - y));
}

#define e0(x)       (RORu32(x, 2) ^ RORu32(x,13) ^ RORu32(x,22))
#define e1(x)       (RORu32(x, 6) ^ RORu32(x,11) ^ RORu32(x,25))
#define s0(x)       (RORu32(x, 7) ^ RORu32(x,18) ^ (x >> 3))
#define s1(x)       (RORu32(x,17) ^ RORu32(x,19) ^ (x >> 10))

#if __BYTE_ORDER == __BIG_ENDIAN
#error WTF?
# define loadin_inplace(w, len) (void)0
#else
static inline void
loadin_inplace(u32 *W, unsigned i)
{
	unsigned j;
	for (j = 0; j < i; j++)
		be32_to_cpus(W+j);
}
#endif

static inline void
loadin_aligned(u32 *W, u32 const *input, unsigned i)
{
	unsigned j;
	for (j = 0; j < i; j++)
		W[j] = be32_to_cpup(input+j);
}
/*
 * This may be a bit bigger than we want an inline function, but
 * it's only called from one place.
 */
static inline void
loadin_unaligned(u32 *W, u8 const *input, unsigned i)
{
	unsigned j;
	for (j = 0; j < i; j++)
		W[j] = be32_to_cpu(get_unaligned(input+4*j));
}

static inline void BLEND_OP(int I, u32 *W)
{
	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
}

/*
 * This expects input in the first 16 words of W[], and uses all
 * 64 words.  Doing it this way avoinds the need to allocate
 * and zeroize a stack temporary each time.
 */
static void sha256_transform(u32 state[8], u32 W[64])
{
	u32 a, b, c, d, e, f, g, h;
	int i;
	static u32 const k[64] = {
		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5, 
		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
	};
//printf("in: %08x %08x %08x %08x %08x %08x %08x %08x\n",W[0],W[1],W[2],W[3],W[4],W[5],W[6],W[7]);
//printf("in: %08x %08x %08x %08x %08x %08x %08x %08x\n",W[8],W[9],W[10],W[11],W[12],W[13],W[14],W[15]);

	/* The first 16 words of W already contain the input - now blend */
	/* This is a sort of key-scheduling operation */
	for (i = 16; i < 64; i++)
		W[i] = s1(W[i-2]) + W[i-7] + s0(W[i-15]) + W[i-16];
    
	/* load the state into our registers */
	a=state[0];  b=state[1];  c=state[2];  d=state[3];
	e=state[4];  f=state[5];  g=state[6];  h=state[7];
	/*
	 * Now iterate the actual round function.  This is actually 64
	 * copies of the same round function with the variables
	 * rotated each time, but here we unroll it 8 times to
	 * reduce the amount of data motion.  You could roll it up
	 * more if code size is a priority.
	 */
	for (i = 0; i < 64; i += 8) {
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+0,a,b,c,d,e,f,g,h);
		d += h += e1(e) + Ch(e,f,g) + k[i+0] + W[i+0];
		h += e0(a) + Maj(a,b,c);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+1,h,a,b,c,d,e,f,g);
		c += g += e1(d) + Ch(d,e,f) + k[i+1] + W[i+1];
		g += e0(h) + Maj(h,a,b);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+2,g,h,a,b,c,d,e,f);
		b += f += e1(c) + Ch(c,d,e) + k[i+2] + W[i+2];
		f += e0(g) + Maj(g,h,a);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+3,f,g,h,a,b,c,d,e);
		a += e += e1(b) + Ch(b,c,d) + k[i+3] + W[i+3];
		e += e0(f) + Maj(f,g,h);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+4,e,f,g,h,a,b,c,d);
		h += d += e1(a) + Ch(a,b,c) + k[i+4] + W[i+4];
		d += e0(e) + Maj(e,f,g);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+5,d,e,f,g,h,a,b,c);
		g += c += e1(h) + Ch(h,a,b) + k[i+5] + W[i+5];
		c += e0(d) + Maj(d,e,f);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+6,c,d,e,f,g,h,a,b);
		f += b += e1(g) + Ch(g,h,a) + k[i+6] + W[i+6];
		b += e0(c) + Maj(c,d,e);
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+7,b,c,d,e,f,g,h,a);
		e += a += e1(f) + Ch(f,g,h) + k[i+7] + W[i+7];
		a += e0(b) + Maj(b,c,d);
	}
//printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+0,a,b,c,d,e,f,g,h);

	/* Add back to the state to make hash one-way */
	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
	state[4] += e; state[5] += f; state[6] += g; state[7] += h;

	/* clear any sensitive info... */
	a = b = c = d = e = f = g = h = 0;
}

static void
sha256_init(void *ctx)
{
	struct sha256_ctx *sctx = ctx;
	static u32 const Hinit[SHA256_DIGEST_SIZE/4] = {
		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
	};

	memcpy(sctx->state, Hinit, sizeof Hinit);
	sctx->count[0] = sctx->count[1] = 0;
}

static void
sha256_update(void *ctx, const u8 *data, unsigned int len)
{
	struct sha256_ctx *sctx = ctx;

	/* Compute number of bytes mod 63 */
	unsigned index = (unsigned)sctx->count[0] & (SHA256_HMAC_BLOCK_SIZE-1);

//printf("sha256_update(%.*s)\n", (int)len, data);

	/* Add 32-bit len to 64-bit sctx->count */
	sctx->count[1] += ((sctx->count[0] += len) < len);

	if (len + index >= SHA256_HMAC_BLOCK_SIZE) {
		if (index) {
			memcpy((char *)sctx->W + index, data, len);
			loadin_inplace(sctx->W, 16);
			sha256_transform(sctx->state, sctx->W);
			index = SHA256_HMAC_BLOCK_SIZE-1-index;
			data += index;
			len -= index;
			index = 0;
		}
		/*
		 * I wish there was a way to tell if unaligned loads
		 * required a special instruction sequence and there
		 * was a point to splitting this...
		 */
		if ((unsigned)data % 4 == 0) {
			/* Aligned case */
			while (len >= SHA256_HMAC_BLOCK_SIZE) {
				loadin_aligned(sctx->W, (u32 const *)data, 16);
				sha256_transform(sctx->state, sctx->W);
				data += SHA256_HMAC_BLOCK_SIZE;
				len -= SHA256_HMAC_BLOCK_SIZE;
			}
		} else {
			/* Unaligned case */
			while (len >= SHA256_HMAC_BLOCK_SIZE) {
				loadin_unaligned(sctx->W, data, 16);
				sha256_transform(sctx->state, sctx->W);
				data += SHA256_HMAC_BLOCK_SIZE;
				len -= SHA256_HMAC_BLOCK_SIZE;
			}
		}
	}
	/* Buffer any leftover data */
	memcpy((char *)sctx->W + index, data, len);
}

static void
sha256_final(void* ctx, u8 *out)
{
	struct sha256_ctx *sctx = ctx;

	/* Next byte to store */
	unsigned index = (unsigned)sctx->count[0] & (SHA256_HMAC_BLOCK_SIZE-1);

	/* Add padding bit sequence 1000...; we're limited to 8-bit bytes. */
	((u8 *)sctx->W)[index++] = 0x80;
	if (index > 56) {
		/* Wups, need a whole extra block to fit the 8-bit count */
		memset((u8 *)sctx->W + index, 0, 64-index);
		loadin_inplace(sctx->W, 16);
		sha256_transform(sctx->state, sctx->W);
		index = 0;
	}
	memset((u8 *)sctx->W + index, 0, 56-index);
	loadin_inplace(sctx->W, 14);

	/* Append number of bits in final 8 bytes, big-endian */
	/* QUESTION: Are we ever going to hash more than 256 MiB? */
	sctx->W[14] = (sctx->count[1] << 3) + (sctx->count[0] >> 29);
	sctx->W[15] = (sctx->count[0] << 3);

	sha256_transform(sctx->state, sctx->W);

	/* Output the state */
	for (index = 0; index < SHA256_DIGEST_SIZE/4; index++)
		cpu_to_be32s(sctx->state + index);
	memcpy(out, sctx->state, SHA256_DIGEST_SIZE);

	/* Zeroize sensitive information. */
	memset(sctx, 0, sizeof *sctx);
}

#if 0

static struct crypto_alg alg = {
	.cra_name	=	"sha256",
	.cra_flags	=	CRYPTO_ALG_TYPE_DIGEST,
	.cra_blocksize	=	SHA256_HMAC_BLOCK_SIZE,
	.cra_ctxsize	=	sizeof(struct sha256_ctx),
	.cra_module	=	THIS_MODULE,
	.cra_list       =       LIST_HEAD_INIT(alg.cra_list),
	.cra_u		=	{ .digest = {
		.dia_digestsize	=	SHA256_DIGEST_SIZE,
		.dia_init   	= 	sha256_init,
		.dia_update 	=	sha256_update,
		.dia_final  	=	sha256_final }
       	}
};

static int __init init(void)
{
	return crypto_register_alg(&alg);
}

static void __exit fini(void)
{
	crypto_unregister_alg(&alg);
}

module_init(init);
module_exit(fini);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("SHA256 Secure Hash Algorithm");
#else

#include <stdio.h>

static void
dump_hash(u8 const buf[32])
{
	unsigned i;
	for (i = 0; i < 32; i++) {
		if (i % 4 == 0)
			putchar(' ');
		printf("%02x", buf[i]);
	}
}

static int
check_hash(u8 const buf[32], u8 const expected[32])
{
	printf(" Result:");
	dump_hash(buf);
	if (memcmp(buf, expected, 32) != 0) {
		printf("Correct:");
		puts("*** MISMATCH ***");
		return 1;
	}
	return 0;
}

int
main(void)
{
	/* Test driver using the test vectors from FIPS 180-2 */
	char tv1[3] = "abc";
	static u8 const out1[32] =  {
		0xba,0x78,0x16,0xbf, 0x8f,0x01,0xcf,0xea, 0x41,0x41,0x40,0xde,
	       	0x5d,0xae,0x22,0x23, 0xb0,0x03,0x61,0xa3, 0x96,0x17,0x7a,0x9c,
	       	0xb4,0x10,0xff,0x61, 0xf2,0x00,0x15,0xad
	};

	char tv2[56] = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
	static u8 const out2[32] =  {
		0x24,0x8d,0x6a,0x61, 0xd2,0x06,0x38,0xb8, 0xe5,0xc0,0x26,0x93,
	       	0x0c,0x3e,0x60,0x39, 0xa3,0x3c,0xe4,0x59, 0x64,0xff,0x21,0x67,
	       	0xf6,0xec,0xed,0xd4, 0x19,0xdb,0x06,0xc1
	};

	char tv3[1024];
	static u8 const out3[32] =  {
		0xcd,0xc7,0x6e,0x5c, 0x99,0x14,0xfb,0x92, 0x81,0xa1,0xc7,0xe2,
	       	0x84,0xd7,0x3e,0x67, 0xf1,0x80,0x9a,0x48, 0xa4,0x97,0x20,0x0e,
	       	0x04,0x6d,0x39,0xcc, 0xc7,0x11,0x2c,0xd0
	};
	u8 hash[32];
	struct sha256_ctx ctx;
	unsigned i, j, errors = 0;

	/* tv1 */
	for (i = 1; i <= 3; i++) {
		sha256_init(&ctx);
		for (j = 0; j < 3-i; j += i)
			sha256_update(&ctx, tv1+j, i);
		sha256_update(&ctx, tv1+j, 3-j);
		sha256_final(&ctx, hash);
		printf("Hash computed %u bytes at a time:\n", i);
		errors += check_hash(hash, out1);
	}

	/* tv2 */
	for (i = 1; i <= 56; i++) {
		sha256_init(&ctx);
		for (j = 0; j < 56-i; j += i)
			sha256_update(&ctx, tv2+j, i);
		sha256_update(&ctx, tv2+j, 56-j);
		sha256_final(&ctx, hash);
		printf("Hash computed %u bytes at a time:\n", i);
		errors += check_hash(hash, out2);
	}

	/* tv3 */
	memset(tv3, 'a', sizeof tv3);
	for (i = 1; i <= sizeof tv3; i *= 2) {
		sha256_init(&ctx);
		for (j = 0; j < 1000000-i; j += i)
			sha256_update(&ctx, tv3, i);
		sha256_update(&ctx, tv3, 1000000-j);
		sha256_final(&ctx, hash);
		printf("Hash computed %u bytes at a time:\n", i);
		errors += check_hash(hash, out3);
	}
	printf("Check complete, %u errors\n", errors);
	return errors != 0;
}

#endif

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch   pentium3,4
  2004-02-02  4:08 linux
@ 2004-02-05 19:40 ` Jean-Luc Cooke
  0 siblings, 0 replies; 22+ messages in thread
From: Jean-Luc Cooke @ 2004-02-05 19:40 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel

"Linux",

I've always wanted to have done the compression function like this (loop, not
unrolled).

Your changes to choice and majority are already submitted (see my path from a
few weeks ago).

Your argument to changing big/little sigma is thin in my opinion.

Also, the memset(W, 0, 64 * sizeof(u32)) is is needed should you me computing
a MAC for an encrypted message (MACs are done on the unencrypted portion you
see, and parts of the message would still be in memory).  Can someone
confirm the kernel MemMgr doesn't wipe memory?

Here is a patch to sha256.c and sha512.c which does not use full
unrolling, saving space.
  http://jlcooke.ca/lkml/smaller_sha2.patch

This has been tested against the existing sha256/512.  Performance:
  10,000,000 sha256_transform()'s
   - manually unrolled (current):
     21sec or about 476,190 sha256 compression function calls per second
   - for(i=0; i<64; i++){ ... }
     26sec or about 384,615 sha256 compression function calls per second
   - for(i=0; i<64; i++){ ... } with -funroll-all-loops -funroll-loops
     24sec or about 416,667 sha256 compression function calls per second

So you see, it's very tight.

  1,000,000 sha512_transform()'s
   - manually unrolled by 8 (current)
     for(i=0; i<80; i+=8) {...}
     10sec or 100,000 per sec
   - for(i=0; i<80; i+=1) {...}
     10sec or 100,000 per sec
   - for(i=0; i<80; i+=1) {...} with -funroll-all-loops -funroll-loops
     10sec or 100,000 per sec

And here it's even tighter!

JLC - all tests were on a AMD Athlon(tm) XP 1700+ 1.4GHz using time(&timeval)

On Mon, Feb 02, 2004 at 04:08:37AM -0000, linux@horizon.com wrote:
> If it helps anyone, here's a modified sha256.c file that takes 256 bytes
> off the stack and shrinks the main code loop by a factor of 8.
> 
> The speed penalty is pretty minor, and might actually be negative due to
> better icache usage.
> 
> Includes x86 stubs for standalone self-test using the NIST test vectors.
> Enable the #if 0 parts and remove the commented-out debug lines
> for production use.
> 
> A few other additions:
> - Uses optimized unaligned-fetch and byte-swap routines.
> - Reordered the the Ch() and Maj() functions slightly to work better on
>   2-address instruction sets.
> - Got rid of two temporaries in the round function
> - Removed the need to memset() 256 bytes on the stack every call to
>   sha256_transform
> - Uses memset() rather than a padding array.
> 
> A few questions:
> - Should we just use a u64 to count the bytes rather than doing carries by
>   hand?
> - Do we actually need support for hashing more than 2^32 bytes in the first
>   place?
> - This does the aligned/unaligned check at the outermost feasible loop
>   position in the code, leading to quite a bit of duplication.
>   If the processor supports unaligned loads natively, it isn't even
>   necessary at all.  An better ideas?  I can understand:
>   - Moving the test to once per call to sha256_transform, on the
>     grounds that the latter outweights it by a large amount.
>   - Always using the unaligned load, likewise.
>   - Rolling my own unaligned block move-and-byte-swap (using aligned
>     loads, shifts, and algned stores), since the existing unaligned code
>     isn't quite what we want.
>   - Just using memcpy and byte-swap in place, and letting the L1 cache
>     take care of it.
> - The e() functions as written take a an temp register in addition to the
>   input and output.  Would it be better to rewrite them as e.g.
>   better to rewrite them as e.g.
> 	static inline u32 e0(u32 x)
> 	{
> 		u32 y = x;
> 		y = RORu32(y, 22-13) ^ x;
> 		y = RORu32(y, 13-2) ^ x;
> 		return RORu32(y, 2);
> 	}
>   to get rid of the need?  (And can someone figure out a similar trick
>   for the s() functions?  I don't think it's as important because there's
>   less register pressure when they're used.)
> 
> (P.S. These modifications are in the public domain; copyright abandoned.)
> 
> 
> /*
>  * Cryptographic API.
>  *
>  * SHA-256, as specified in
>  * http://csrc.nist.gov/cryptval/shs/sha256-384-512.pdf
>  *
>  * SHA-256 code by Jean-Luc Cooke <jlcooke@certainkey.com>.
>  *
>  * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
>  * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
>  * Copyright (c) 2002 James Morris <jmorris@intercode.com.au>
>  *
>  * This program is free software; you can redistribute it and/or modify it
>  * under the terms of the GNU General Public License as published by the Free
>  * Software Foundation; either version 2 of the License, or (at your option) 
>  * any later version.
>  *
>  */
> #if 0
> #include <linux/init.h>
> #include <linux/module.h>
> #include <linux/mm.h>
> #include <linux/crypto.h>
> #include <asm/scatterlist.h>
> #include <asm/byteorder.h>
> #else
> typedef unsigned int u32;
> typedef unsigned char u8;
> #include <string.h>
> #define get_unaligned(p) (*(p))
> #define be32_to_cpus(p) asm("bswap %0" : "+r" (*(p)));
> static inline u32
> be32_to_cpup(u32 const *p)
> {
> 	u32 x;
> 	asm("bswap %0" : "=r" (x) : "0" (*p));
> 	return x;
> }
> static inline u32
> be32_to_cpu(u32 x)
> {
> 	asm("bswap %0" : "+r" (x));
> 	return x;
> }
> #define cpu_to_be32s be32_to_cpus
> #define cpu_to_be32p be32_to_cpup
> #define cpu_to_be32 be32_to_cpu
> #endif
> 
> 
> #define SHA256_DIGEST_SIZE	32
> #define SHA256_HMAC_BLOCK_SIZE	64
> 
> struct sha256_ctx {
> 	u32 count[2];	/* Bytes so far: Low, then high */
> 	u32 state[8];
> 	u32 W[64];
> };
> 
> /* Bit-wise choose x ? y : z = (x & y) + (~x & z) = z ^ (x & (y ^ z)) */
> #define Ch(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
> 
> /*
>  * Majority function.  There's no really good way to optimize this,
>  * although x is the most recently computed vaue, so put it later in
>  * the dependency chain.  Also use ^ in the OR so that the halves can
>  * be merged with + which gives the compiler more flexibility to
>  * rearrange the surrounding sum.
>  * x&y | y&z | z&x = x&(y|z) | y&z = x&(y^z) + y&z
>  */
> #define Maj(x, y, z) (((x) & ((y) ^ (z))) + ((y) & (z)))
> 
> static inline u32 RORu32(u32 x, u32 y)
> {
> 	return (x >> y) | (x << (32 - y));
> }
> 
> #define e0(x)       (RORu32(x, 2) ^ RORu32(x,13) ^ RORu32(x,22))
> #define e1(x)       (RORu32(x, 6) ^ RORu32(x,11) ^ RORu32(x,25))
> #define s0(x)       (RORu32(x, 7) ^ RORu32(x,18) ^ (x >> 3))
> #define s1(x)       (RORu32(x,17) ^ RORu32(x,19) ^ (x >> 10))
> 
> #if __BYTE_ORDER == __BIG_ENDIAN
> #error WTF?
> # define loadin_inplace(w, len) (void)0
> #else
> static inline void
> loadin_inplace(u32 *W, unsigned i)
> {
> 	unsigned j;
> 	for (j = 0; j < i; j++)
> 		be32_to_cpus(W+j);
> }
> #endif
> 
> static inline void
> loadin_aligned(u32 *W, u32 const *input, unsigned i)
> {
> 	unsigned j;
> 	for (j = 0; j < i; j++)
> 		W[j] = be32_to_cpup(input+j);
> }
> /*
>  * This may be a bit bigger than we want an inline function, but
>  * it's only called from one place.
>  */
> static inline void
> loadin_unaligned(u32 *W, u8 const *input, unsigned i)
> {
> 	unsigned j;
> 	for (j = 0; j < i; j++)
> 		W[j] = be32_to_cpu(get_unaligned(input+4*j));
> }
> 
> static inline void BLEND_OP(int I, u32 *W)
> {
> 	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
> }
> 
> /*
>  * This expects input in the first 16 words of W[], and uses all
>  * 64 words.  Doing it this way avoinds the need to allocate
>  * and zeroize a stack temporary each time.
>  */
> static void sha256_transform(u32 state[8], u32 W[64])
> {
> 	u32 a, b, c, d, e, f, g, h;
> 	int i;
> 	static u32 const k[64] = {
> 		0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
> 		0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5, 
> 		0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
> 		0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
> 		0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
> 		0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
> 		0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
> 		0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
> 		0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
> 		0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
> 		0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
> 		0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
> 		0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
> 		0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
> 		0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
> 		0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
> 	};
> //printf("in: %08x %08x %08x %08x %08x %08x %08x %08x\n",W[0],W[1],W[2],W[3],W[4],W[5],W[6],W[7]);
> //printf("in: %08x %08x %08x %08x %08x %08x %08x %08x\n",W[8],W[9],W[10],W[11],W[12],W[13],W[14],W[15]);
> 
> 	/* The first 16 words of W already contain the input - now blend */
> 	/* This is a sort of key-scheduling operation */
> 	for (i = 16; i < 64; i++)
> 		W[i] = s1(W[i-2]) + W[i-7] + s0(W[i-15]) + W[i-16];
>     
> 	/* load the state into our registers */
> 	a=state[0];  b=state[1];  c=state[2];  d=state[3];
> 	e=state[4];  f=state[5];  g=state[6];  h=state[7];
> 	/*
> 	 * Now iterate the actual round function.  This is actually 64
> 	 * copies of the same round function with the variables
> 	 * rotated each time, but here we unroll it 8 times to
> 	 * reduce the amount of data motion.  You could roll it up
> 	 * more if code size is a priority.
> 	 */
> 	for (i = 0; i < 64; i += 8) {
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+0,a,b,c,d,e,f,g,h);
> 		d += h += e1(e) + Ch(e,f,g) + k[i+0] + W[i+0];
> 		h += e0(a) + Maj(a,b,c);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+1,h,a,b,c,d,e,f,g);
> 		c += g += e1(d) + Ch(d,e,f) + k[i+1] + W[i+1];
> 		g += e0(h) + Maj(h,a,b);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+2,g,h,a,b,c,d,e,f);
> 		b += f += e1(c) + Ch(c,d,e) + k[i+2] + W[i+2];
> 		f += e0(g) + Maj(g,h,a);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+3,f,g,h,a,b,c,d,e);
> 		a += e += e1(b) + Ch(b,c,d) + k[i+3] + W[i+3];
> 		e += e0(f) + Maj(f,g,h);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+4,e,f,g,h,a,b,c,d);
> 		h += d += e1(a) + Ch(a,b,c) + k[i+4] + W[i+4];
> 		d += e0(e) + Maj(e,f,g);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+5,d,e,f,g,h,a,b,c);
> 		g += c += e1(h) + Ch(h,a,b) + k[i+5] + W[i+5];
> 		c += e0(d) + Maj(d,e,f);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+6,c,d,e,f,g,h,a,b);
> 		f += b += e1(g) + Ch(g,h,a) + k[i+6] + W[i+6];
> 		b += e0(c) + Maj(c,d,e);
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+7,b,c,d,e,f,g,h,a);
> 		e += a += e1(f) + Ch(f,g,h) + k[i+7] + W[i+7];
> 		a += e0(b) + Maj(b,c,d);
> 	}
> //printf("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", i+0,a,b,c,d,e,f,g,h);
> 
> 	/* Add back to the state to make hash one-way */
> 	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
> 	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
> 
> 	/* clear any sensitive info... */
> 	a = b = c = d = e = f = g = h = 0;
> }
> 
> static void
> sha256_init(void *ctx)
> {
> 	struct sha256_ctx *sctx = ctx;
> 	static u32 const Hinit[SHA256_DIGEST_SIZE/4] = {
> 		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
> 		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
> 	};
> 
> 	memcpy(sctx->state, Hinit, sizeof Hinit);
> 	sctx->count[0] = sctx->count[1] = 0;
> }
> 
> static void
> sha256_update(void *ctx, const u8 *data, unsigned int len)
> {
> 	struct sha256_ctx *sctx = ctx;
> 
> 	/* Compute number of bytes mod 63 */
> 	unsigned index = (unsigned)sctx->count[0] & (SHA256_HMAC_BLOCK_SIZE-1);
> 
> //printf("sha256_update(%.*s)\n", (int)len, data);
> 
> 	/* Add 32-bit len to 64-bit sctx->count */
> 	sctx->count[1] += ((sctx->count[0] += len) < len);
> 
> 	if (len + index >= SHA256_HMAC_BLOCK_SIZE) {
> 		if (index) {
> 			memcpy((char *)sctx->W + index, data, len);
> 			loadin_inplace(sctx->W, 16);
> 			sha256_transform(sctx->state, sctx->W);
> 			index = SHA256_HMAC_BLOCK_SIZE-1-index;
> 			data += index;
> 			len -= index;
> 			index = 0;
> 		}
> 		/*
> 		 * I wish there was a way to tell if unaligned loads
> 		 * required a special instruction sequence and there
> 		 * was a point to splitting this...
> 		 */
> 		if ((unsigned)data % 4 == 0) {
> 			/* Aligned case */
> 			while (len >= SHA256_HMAC_BLOCK_SIZE) {
> 				loadin_aligned(sctx->W, (u32 const *)data, 16);
> 				sha256_transform(sctx->state, sctx->W);
> 				data += SHA256_HMAC_BLOCK_SIZE;
> 				len -= SHA256_HMAC_BLOCK_SIZE;
> 			}
> 		} else {
> 			/* Unaligned case */
> 			while (len >= SHA256_HMAC_BLOCK_SIZE) {
> 				loadin_unaligned(sctx->W, data, 16);
> 				sha256_transform(sctx->state, sctx->W);
> 				data += SHA256_HMAC_BLOCK_SIZE;
> 				len -= SHA256_HMAC_BLOCK_SIZE;
> 			}
> 		}
> 	}
> 	/* Buffer any leftover data */
> 	memcpy((char *)sctx->W + index, data, len);
> }
> 
> static void
> sha256_final(void* ctx, u8 *out)
> {
> 	struct sha256_ctx *sctx = ctx;
> 
> 	/* Next byte to store */
> 	unsigned index = (unsigned)sctx->count[0] & (SHA256_HMAC_BLOCK_SIZE-1);
> 
> 	/* Add padding bit sequence 1000...; we're limited to 8-bit bytes. */
> 	((u8 *)sctx->W)[index++] = 0x80;
> 	if (index > 56) {
> 		/* Wups, need a whole extra block to fit the 8-bit count */
> 		memset((u8 *)sctx->W + index, 0, 64-index);
> 		loadin_inplace(sctx->W, 16);
> 		sha256_transform(sctx->state, sctx->W);
> 		index = 0;
> 	}
> 	memset((u8 *)sctx->W + index, 0, 56-index);
> 	loadin_inplace(sctx->W, 14);
> 
> 	/* Append number of bits in final 8 bytes, big-endian */
> 	/* QUESTION: Are we ever going to hash more than 256 MiB? */
> 	sctx->W[14] = (sctx->count[1] << 3) + (sctx->count[0] >> 29);
> 	sctx->W[15] = (sctx->count[0] << 3);
> 
> 	sha256_transform(sctx->state, sctx->W);
> 
> 	/* Output the state */
> 	for (index = 0; index < SHA256_DIGEST_SIZE/4; index++)
> 		cpu_to_be32s(sctx->state + index);
> 	memcpy(out, sctx->state, SHA256_DIGEST_SIZE);
> 
> 	/* Zeroize sensitive information. */
> 	memset(sctx, 0, sizeof *sctx);
> }
> 
> #if 0
> 
> static struct crypto_alg alg = {
> 	.cra_name	=	"sha256",
> 	.cra_flags	=	CRYPTO_ALG_TYPE_DIGEST,
> 	.cra_blocksize	=	SHA256_HMAC_BLOCK_SIZE,
> 	.cra_ctxsize	=	sizeof(struct sha256_ctx),
> 	.cra_module	=	THIS_MODULE,
> 	.cra_list       =       LIST_HEAD_INIT(alg.cra_list),
> 	.cra_u		=	{ .digest = {
> 		.dia_digestsize	=	SHA256_DIGEST_SIZE,
> 		.dia_init   	= 	sha256_init,
> 		.dia_update 	=	sha256_update,
> 		.dia_final  	=	sha256_final }
>        	}
> };
> 
> static int __init init(void)
> {
> 	return crypto_register_alg(&alg);
> }
> 
> static void __exit fini(void)
> {
> 	crypto_unregister_alg(&alg);
> }
> 
> module_init(init);
> module_exit(fini);
> 
> MODULE_LICENSE("GPL");
> MODULE_DESCRIPTION("SHA256 Secure Hash Algorithm");
> #else
> 
> #include <stdio.h>
> 
> static void
> dump_hash(u8 const buf[32])
> {
> 	unsigned i;
> 	for (i = 0; i < 32; i++) {
> 		if (i % 4 == 0)
> 			putchar(' ');
> 		printf("%02x", buf[i]);
> 	}
> }
> 
> static int
> check_hash(u8 const buf[32], u8 const expected[32])
> {
> 	printf(" Result:");
> 	dump_hash(buf);
> 	if (memcmp(buf, expected, 32) != 0) {
> 		printf("Correct:");
> 		puts("*** MISMATCH ***");
> 		return 1;
> 	}
> 	return 0;
> }
> 
> int
> main(void)
> {
> 	/* Test driver using the test vectors from FIPS 180-2 */
> 	char tv1[3] = "abc";
> 	static u8 const out1[32] =  {
> 		0xba,0x78,0x16,0xbf, 0x8f,0x01,0xcf,0xea, 0x41,0x41,0x40,0xde,
> 	       	0x5d,0xae,0x22,0x23, 0xb0,0x03,0x61,0xa3, 0x96,0x17,0x7a,0x9c,
> 	       	0xb4,0x10,0xff,0x61, 0xf2,0x00,0x15,0xad
> 	};
> 
> 	char tv2[56] = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
> 	static u8 const out2[32] =  {
> 		0x24,0x8d,0x6a,0x61, 0xd2,0x06,0x38,0xb8, 0xe5,0xc0,0x26,0x93,
> 	       	0x0c,0x3e,0x60,0x39, 0xa3,0x3c,0xe4,0x59, 0x64,0xff,0x21,0x67,
> 	       	0xf6,0xec,0xed,0xd4, 0x19,0xdb,0x06,0xc1
> 	};
> 
> 	char tv3[1024];
> 	static u8 const out3[32] =  {
> 		0xcd,0xc7,0x6e,0x5c, 0x99,0x14,0xfb,0x92, 0x81,0xa1,0xc7,0xe2,
> 	       	0x84,0xd7,0x3e,0x67, 0xf1,0x80,0x9a,0x48, 0xa4,0x97,0x20,0x0e,
> 	       	0x04,0x6d,0x39,0xcc, 0xc7,0x11,0x2c,0xd0
> 	};
> 	u8 hash[32];
> 	struct sha256_ctx ctx;
> 	unsigned i, j, errors = 0;
> 
> 	/* tv1 */
> 	for (i = 1; i <= 3; i++) {
> 		sha256_init(&ctx);
> 		for (j = 0; j < 3-i; j += i)
> 			sha256_update(&ctx, tv1+j, i);
> 		sha256_update(&ctx, tv1+j, 3-j);
> 		sha256_final(&ctx, hash);
> 		printf("Hash computed %u bytes at a time:\n", i);
> 		errors += check_hash(hash, out1);
> 	}
> 
> 	/* tv2 */
> 	for (i = 1; i <= 56; i++) {
> 		sha256_init(&ctx);
> 		for (j = 0; j < 56-i; j += i)
> 			sha256_update(&ctx, tv2+j, i);
> 		sha256_update(&ctx, tv2+j, 56-j);
> 		sha256_final(&ctx, hash);
> 		printf("Hash computed %u bytes at a time:\n", i);
> 		errors += check_hash(hash, out2);
> 	}
> 
> 	/* tv3 */
> 	memset(tv3, 'a', sizeof tv3);
> 	for (i = 1; i <= sizeof tv3; i *= 2) {
> 		sha256_init(&ctx);
> 		for (j = 0; j < 1000000-i; j += i)
> 			sha256_update(&ctx, tv3, i);
> 		sha256_update(&ctx, tv3, 1000000-j);
> 		sha256_final(&ctx, hash);
> 		printf("Hash computed %u bytes at a time:\n", i);
> 		errors += check_hash(hash, out3);
> 	}
> 	printf("Check complete, %u errors\n", errors);
> 	return errors != 0;
> }
> 
> #endif
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
http://www.certainkey.com
Suite 4560 CTTC
1125 Colonel By Dr.
Ottawa ON, K1S 5B6

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
@ 2004-01-30  3:21 R CHAN
  2004-01-30 14:39 ` James Morris
  0 siblings, 1 reply; 22+ messages in thread
From: R CHAN @ 2004-01-30  3:21 UTC (permalink / raw)
  To: linux-kernel

2.6.2-rc2 sha256.c is miscompiled with gcc 3.2.3.
(User space is RedHat 3ES.)

Just an observation:

gcc 3.2.3 is miscompiling sha256.c when using
-O2 -march=pentium3
or pentium4.

gcc 3.3.x is ok, or the problem disappears
if I use arch=i686 or reduce the optimisation level to -O2.

Sympthoms are all the sha256 test vectors fail.
If I extract the guts of the file to compile in user space
the same problem occurs.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30  3:21 R CHAN
@ 2004-01-30 14:39 ` James Morris
  2004-01-30 15:04   ` Dave Paris
  0 siblings, 1 reply; 22+ messages in thread
From: James Morris @ 2004-01-30 14:39 UTC (permalink / raw)
  To: R CHAN; +Cc: linux-kernel, David S. Miller

On Fri, 30 Jan 2004, R CHAN wrote:

> 2.6.2-rc2 sha256.c is miscompiled with gcc 3.2.3.
> (User space is RedHat 3ES.)
> 
> Just an observation:
> 
> gcc 3.2.3 is miscompiling sha256.c when using
> -O2 -march=pentium3
> or pentium4.
> 
> gcc 3.3.x is ok, or the problem disappears
> if I use arch=i686 or reduce the optimisation level to -O2.
> 
> Sympthoms are all the sha256 test vectors fail.
> If I extract the guts of the file to compile in user space
> the same problem occurs.

Have you noticed if this happens for any of the other crypto algorithms?



- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 14:39 ` James Morris
@ 2004-01-30 15:04   ` Dave Paris
  2004-01-30 15:24     ` R Chan
  2004-01-30 15:28     ` Jakub Jelinek
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Paris @ 2004-01-30 15:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: rspchan

Has this been demonstrated on *any* system/arch using GCC 3.2.3 (or other
3.2 series) or is it limited in scope to the description below?  Does it
seem to do this with other implementations (other than sha256.c) or other
kernels?  Just trying to get an idea if this is a complier optimization bug
or something much more limited in scope.

My personal lab is currently being unboxed and I won't be able to run my own
tests for another week or so.  (apologies in advance)

In any case, this is *extremely* serious from a number of angles.

Kind Regards,
-dsp

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of James Morris
Sent: Friday, January 30, 2004 9:40 AM
To: R CHAN
Cc: linux-kernel@vger.kernel.org; David S. Miller
Subject: Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch
pentium3,4


On Fri, 30 Jan 2004, R CHAN wrote:

> 2.6.2-rc2 sha256.c is miscompiled with gcc 3.2.3.
> (User space is RedHat 3ES.)
>
> Just an observation:
>
> gcc 3.2.3 is miscompiling sha256.c when using
> -O2 -march=pentium3
> or pentium4.
>
> gcc 3.3.x is ok, or the problem disappears
> if I use arch=i686 or reduce the optimisation level to -O2.
>
> Sympthoms are all the sha256 test vectors fail.
> If I extract the guts of the file to compile in user space
> the same problem occurs.

Have you noticed if this happens for any of the other crypto algorithms?



- James
--
James Morris
<jmorris@redhat.com>


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 15:04   ` Dave Paris
@ 2004-01-30 15:24     ` R Chan
  2004-01-30 15:28     ` Jakub Jelinek
  1 sibling, 0 replies; 22+ messages in thread
From: R Chan @ 2004-01-30 15:24 UTC (permalink / raw)
  To: Dave Paris; +Cc: linux-kernel, jmorris

I've narrowed the problem - it seems to be a specific optimisation bug with
gcc-3.2.3-24 that comes with RedHat 3ES. The problem is not kernel 
related (sorry!)
as the guts of the file exhibits the same problem when compiled in user 
space.
(All the non-sha256 test vectors in tcrypt.ko pass BTW.)

The older gcc32 that comes with Fedora Cora 1 and the gcc33 there get it 
right.

For the actual files please check bug #114610 in 
https://bugzilla.redhat.com.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=114610
 - the bug report includes sha256.c, tcrypt.c, and types.h necessary to 
demonstrate
the problem in user space.





Dave Paris wrote:

>Has this been demonstrated on *any* system/arch using GCC 3.2.3 (or other
>3.2 series) or is it limited in scope to the description below?  Does it
>seem to do this with other implementations (other than sha256.c) or other
>kernels?  Just trying to get an idea if this is a complier optimization bug
>or something much more limited in scope.
>
>My personal lab is currently being unboxed and I won't be able to run my own
>tests for another week or so.  (apologies in advance)
>
>In any case, this is *extremely* serious from a number of angles.
>
>Kind Regards,
>-dsp
>
>-----Original Message-----
>From: linux-kernel-owner@vger.kernel.org
>[mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of James Morris
>Sent: Friday, January 30, 2004 9:40 AM
>To: R CHAN
>Cc: linux-kernel@vger.kernel.org; David S. Miller
>Subject: Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch
>pentium3,4
>
>
>On Fri, 30 Jan 2004, R CHAN wrote:
>
>  
>
>>2.6.2-rc2 sha256.c is miscompiled with gcc 3.2.3.
>>(User space is RedHat 3ES.)
>>
>>Just an observation:
>>
>>gcc 3.2.3 is miscompiling sha256.c when using
>>-O2 -march=pentium3
>>or pentium4.
>>
>>gcc 3.3.x is ok, or the problem disappears
>>if I use arch=i686 or reduce the optimisation level to -O2.
>>
>>Sympthoms are all the sha256 test vectors fail.
>>If I extract the guts of the file to compile in user space
>>the same problem occurs.
>>    
>>
>
>Have you noticed if this happens for any of the other crypto algorithms?
>
>
>
>- James
>--
>James Morris
><jmorris@redhat.com>
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
>
>
>
>  
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 15:04   ` Dave Paris
  2004-01-30 15:24     ` R Chan
@ 2004-01-30 15:28     ` Jakub Jelinek
  2004-01-30 16:35       ` James Morris
  2004-01-30 17:26       ` Jean-Luc Cooke
  1 sibling, 2 replies; 22+ messages in thread
From: Jakub Jelinek @ 2004-01-30 15:28 UTC (permalink / raw)
  To: Dave Paris; +Cc: linux-kernel, rspchan

On Fri, Jan 30, 2004 at 10:04:00AM -0500, Dave Paris wrote:
> Has this been demonstrated on *any* system/arch using GCC 3.2.3 (or other
> 3.2 series) or is it limited in scope to the description below?  Does it
> seem to do this with other implementations (other than sha256.c) or other
> kernels?  Just trying to get an idea if this is a complier optimization bug
> or something much more limited in scope.
> 
> My personal lab is currently being unboxed and I won't be able to run my own
> tests for another week or so.  (apologies in advance)
> 
> In any case, this is *extremely* serious from a number of angles.

GCC handling of automatic const variables with initializers is very fragile.
There have been numerous bugs in it in the past and last one has been fixed
just yesterday (on GCC trunk only for the time being).  It will be
eventually backported once it gets some more testing on GCC mainline.

The problematic line in sha256.c is:
static void sha256_final(void* ctx, u8 *out)
{
...
	const u8 padding[64] = { 0x80, };

where if you are unlucky, scheduler will with various different GCC versions
on various architectures reorder instructions so that store of 0x80 into the
struct is before clearing of the 64 bytes.

On the other side, doing this in sha256.c seems quite inefficient, IMHO much
better would be to have there static u8 padding[64] = { 0x80 };
because that will not mean clearing 64 bytes and writing another one on top
of it every time the routine is executed.

	Jakub

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 15:28     ` Jakub Jelinek
@ 2004-01-30 16:35       ` James Morris
  2004-01-30 17:14         ` Andy Isaacson
  2004-01-30 21:14         ` David S. Miller
  2004-01-30 17:26       ` Jean-Luc Cooke
  1 sibling, 2 replies; 22+ messages in thread
From: James Morris @ 2004-01-30 16:35 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Dave Paris, linux-kernel, rspchan, David S. Miller

On Fri, 30 Jan 2004, Jakub Jelinek wrote:

> The problematic line in sha256.c is:
> static void sha256_final(void* ctx, u8 *out)
> {
> ...
> 	const u8 padding[64] = { 0x80, };
> 
> where if you are unlucky, scheduler will with various different GCC versions
> on various architectures reorder instructions so that store of 0x80 into the
> struct is before clearing of the 64 bytes.
> 
> On the other side, doing this in sha256.c seems quite inefficient, IMHO much
> better would be to have there static u8 padding[64] = { 0x80 };
> because that will not mean clearing 64 bytes and writing another one on top
> of it every time the routine is executed.

Proposed patch below.  I think sha512 would have been ok, but might as 
well make them the same.

R Chan, please test and let us know if it fixes the problem for you.


- James
-- 
James Morris
<jmorris@redhat.com>

diff -urN -X dontdiff linux-2.6.2-rc2-mm2.o/crypto/sha256.c linux-2.6.2-rc2-mm2.w/crypto/sha256.c
--- linux-2.6.2-rc2-mm2.o/crypto/sha256.c	2003-09-27 20:50:15.000000000 -0400
+++ linux-2.6.2-rc2-mm2.w/crypto/sha256.c	2004-01-30 11:27:26.465531792 -0500
@@ -295,7 +295,7 @@
 	u8 bits[8];
 	unsigned int index, pad_len, t;
 	int i, j;
-	const u8 padding[64] = { 0x80, };
+	static u8 padding[64] = { 0x80, };
 
 	/* Save number of bits */
 	t = sctx->count[0];
diff -urN -X dontdiff linux-2.6.2-rc2-mm2.o/crypto/sha512.c linux-2.6.2-rc2-mm2.w/crypto/sha512.c
--- linux-2.6.2-rc2-mm2.o/crypto/sha512.c	2003-09-27 20:51:04.000000000 -0400
+++ linux-2.6.2-rc2-mm2.w/crypto/sha512.c	2004-01-30 11:32:16.182488120 -0500
@@ -249,7 +249,7 @@
 {
         struct sha512_ctx *sctx = ctx;
 	
-        const static u8 padding[128] = { 0x80, };
+        static u8 padding[128] = { 0x80, };
 
         u32 t;
 	u64 t2;


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 16:35       ` James Morris
@ 2004-01-30 17:14         ` Andy Isaacson
  2004-01-30 19:49           ` Jakub Jelinek
  2004-01-30 21:14         ` David S. Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Isaacson @ 2004-01-30 17:14 UTC (permalink / raw)
  To: James Morris; +Cc: linux-kernel

On Fri, Jan 30, 2004 at 11:35:20AM -0500, James Morris wrote:
> -	const u8 padding[64] = { 0x80, };
> +	static u8 padding[64] = { 0x80, };

The RedHat bug suggests 'static const' as the appropriate replacement.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=114610#c4

Unfortunately that probably means an extra 64 bytes of text, rather than
the 10 or so bytes of instructions to do the memset and store.  Ideally
padding[] would be allocated in BSS rather than text or the stack (and
initialized with { 0x80, } at runtime), but I guess you can't have
everything.

-andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 17:14         ` Andy Isaacson
@ 2004-01-30 19:49           ` Jakub Jelinek
  2004-02-01 21:17             ` Bill Davidsen
  0 siblings, 1 reply; 22+ messages in thread
From: Jakub Jelinek @ 2004-01-30 19:49 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: James Morris, linux-kernel

On Fri, Jan 30, 2004 at 11:14:07AM -0600, Andy Isaacson wrote:
> On Fri, Jan 30, 2004 at 11:35:20AM -0500, James Morris wrote:
> > -	const u8 padding[64] = { 0x80, };
> > +	static u8 padding[64] = { 0x80, };
> 
> The RedHat bug suggests 'static const' as the appropriate replacement.
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=114610#c4
> 
> Unfortunately that probably means an extra 64 bytes of text, rather than
> the 10 or so bytes of instructions to do the memset and store.  Ideally
> padding[] would be allocated in BSS rather than text or the stack (and
> initialized with { 0x80, } at runtime), but I guess you can't have
> everything.

Or you can use
	u8 padding[64] = { 0x80 };
if you really want to initialize it at runtime and want to work around the
compiler bug.  It shouldn't be any less efficient than
	const u8 padding[64] = { 0x80 };
since it is used just once, passed to non-inlined function.

	Jakub

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 19:49           ` Jakub Jelinek
@ 2004-02-01 21:17             ` Bill Davidsen
  0 siblings, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2004-02-01 21:17 UTC (permalink / raw)
  To: linux-kernel, Jakub Jelinek; +Cc: Andy Isaacson, James Morris, linux-kernel

Jakub Jelinek wrote:
> On Fri, Jan 30, 2004 at 11:14:07AM -0600, Andy Isaacson wrote:
> 
>>On Fri, Jan 30, 2004 at 11:35:20AM -0500, James Morris wrote:
>>
>>>-	const u8 padding[64] = { 0x80, };
>>>+	static u8 padding[64] = { 0x80, };
>>
>>The RedHat bug suggests 'static const' as the appropriate replacement.
>>
>>https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=114610#c4
>>
>>Unfortunately that probably means an extra 64 bytes of text, rather than
>>the 10 or so bytes of instructions to do the memset and store.  Ideally
>>padding[] would be allocated in BSS rather than text or the stack (and
>>initialized with { 0x80, } at runtime), but I guess you can't have
>>everything.
> 
> 
> Or you can use
> 	u8 padding[64] = { 0x80 };
> if you really want to initialize it at runtime and want to work around the
> compiler bug.  It shouldn't be any less efficient than
> 	const u8 padding[64] = { 0x80 };
> since it is used just once, passed to non-inlined function.

I like this, as it avoids bloating the 64 bytes into the kernel.


-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 16:35       ` James Morris
  2004-01-30 17:14         ` Andy Isaacson
@ 2004-01-30 21:14         ` David S. Miller
  2004-02-01 21:18           ` Bill Davidsen
  1 sibling, 1 reply; 22+ messages in thread
From: David S. Miller @ 2004-01-30 21:14 UTC (permalink / raw)
  To: James Morris; +Cc: jakub, dparis, linux-kernel, rspchan

On Fri, 30 Jan 2004 11:35:20 -0500 (EST)
James Morris <jmorris@redhat.com> wrote:

> Proposed patch below.  I think sha512 would have been ok, but might as 
> well make them the same.
> 
> R Chan, please test and let us know if it fixes the problem for you.

I'm putting this into my tree(s), thanks James.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch   pentium3,4
  2004-01-30 21:14         ` David S. Miller
@ 2004-02-01 21:18           ` Bill Davidsen
  2004-02-01 23:22             ` David S. Miller
  2004-02-02  1:00             ` Linus Torvalds
  0 siblings, 2 replies; 22+ messages in thread
From: Bill Davidsen @ 2004-02-01 21:18 UTC (permalink / raw)
  To: linux-kernel, David S. Miller
  Cc: James Morris, jakub, dparis, linux-kernel, rspchan

David S. Miller wrote:
> On Fri, 30 Jan 2004 11:35:20 -0500 (EST)
> James Morris <jmorris@redhat.com> wrote:
> 
> 
>>Proposed patch below.  I think sha512 would have been ok, but might as 
>>well make them the same.
>>
>>R Chan, please test and let us know if it fixes the problem for you.
> 
> 
> I'm putting this into my tree(s), thanks James.

What didn't you like about Jakob's patch which avoids the 64 byte size 
penalty?

-- 
bill davidsen <davidsen@tmr.com>
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch   pentium3,4
  2004-02-01 21:18           ` Bill Davidsen
@ 2004-02-01 23:22             ` David S. Miller
  2004-02-02  1:00             ` Linus Torvalds
  1 sibling, 0 replies; 22+ messages in thread
From: David S. Miller @ 2004-02-01 23:22 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: jmorris, jakub, dparis, linux-kernel, rspchan

On Sun, 01 Feb 2004 16:18:48 -0500
Bill Davidsen <davidsen@tmr.com> wrote:

> What didn't you like about Jakob's patch which avoids the 64 byte size 
> penalty?

Because it means memset'ing the thing every time the function is called.
And this function is called for every transform.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch   pentium3,4
  2004-02-01 21:18           ` Bill Davidsen
  2004-02-01 23:22             ` David S. Miller
@ 2004-02-02  1:00             ` Linus Torvalds
  2004-02-02 18:55               ` Bill Davidsen
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2004-02-02  1:00 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: David S. Miller, James Morris, jakub, dparis, Kernel Mailing List,
	rspchan

On Sun, 1 Feb 2004, Bill Davidsen wrote:
> 
> What didn't you like about Jakob's patch which avoids the 64 byte size 
> penalty?

What size penalty?

The data has to be allocated somewhere, and on the stack is simply not 
acceptable. So there can be no size penalty.

Yes, the text size of the binary is slightly bigger, because a "static 
const" ends up in the ro-section, but that's _purely_ an accounting thing. 
It has to be somewhere, be it .text, .data or .bss. Who would ever care 
where it is?

Having it in .ro means that there are no initialization issues, and a 
compressed kernel compresses the zero bytes better than having init-time 
code to initialize the array (or, worse, doing it over and over at 
run-time).

So where does this size penalty idea come from?

In short: "static const" is correct.

		Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch   pentium3,4
  2004-02-02  1:00             ` Linus Torvalds
@ 2004-02-02 18:55               ` Bill Davidsen
  0 siblings, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2004-02-02 18:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, jakub, Kernel Mailing List

On Sun, 1 Feb 2004, Linus Torvalds wrote:

> 
> 
> On Sun, 1 Feb 2004, Bill Davidsen wrote:
> > 
> > What didn't you like about Jakob's patch which avoids the 64 byte size 
> > penalty?
> 
> What size penalty?
> 
> The data has to be allocated somewhere, and on the stack is simply not 
> acceptable. So there can be no size penalty.

The point was that it doesn't have to be allocated somewhere unless it's
actually used, and many kernels will want the capability but in practice
won't use it. David pointed out the overhead of repeated initialize, and
that's an overpowering reason not to do runtime allocation.

> 
> Yes, the text size of the binary is slightly bigger, because a "static 
> const" ends up in the ro-section, but that's _purely_ an accounting thing. 
> It has to be somewhere, be it .text, .data or .bss. Who would ever care 
> where it is?
> 
> Having it in .ro means that there are no initialization issues, and a 
> compressed kernel compresses the zero bytes better than having init-time 
> code to initialize the array (or, worse, doing it over and over at 
> run-time).
> 
> So where does this size penalty idea come from?

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4
  2004-01-30 15:28     ` Jakub Jelinek
  2004-01-30 16:35       ` James Morris
@ 2004-01-30 17:26       ` Jean-Luc Cooke
  1 sibling, 0 replies; 22+ messages in thread
From: Jean-Luc Cooke @ 2004-01-30 17:26 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Dave Paris, linux-kernel, rspchan

Jakub said something like...
> GCC handling of automatic const variables with initializers is very fragile.
> There have been numerous bugs in it in the past and last one has been fixed
> just yesterday (on GCC trunk only for the time being).  It will be
> eventually backported once it gets some more testing on GCC mainline.
> 
> The problematic line in sha256.c is:
> static void sha256_final(void* ctx, u8 *out)
> {
> ...
> 	const u8 padding[64] = { 0x80, };

Verry interesting.  Good work Jakub.

JLC

-- 
http://www.certainkey.com
Suite 4560 CTTC
1125 Colonel By Dr.
Ottawa ON, K1S 5B6

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2004-02-05 19:49 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-01-30 15:43 [CRYPTO]: Miscompiling sha256.c by gcc 3.2.3 and arch pentium3,4 Arnd Bergmann
2004-01-30 16:57 ` James Morris
2004-01-30 18:02   ` Randy.Dunlap
2004-01-30 18:35     ` James Morris
2004-01-30 18:53       ` James Morris
  -- strict thread matches above, loose matches on Subject: below --
2004-02-02  4:08 linux
2004-02-05 19:40 ` Jean-Luc Cooke
2004-01-30  3:21 R CHAN
2004-01-30 14:39 ` James Morris
2004-01-30 15:04   ` Dave Paris
2004-01-30 15:24     ` R Chan
2004-01-30 15:28     ` Jakub Jelinek
2004-01-30 16:35       ` James Morris
2004-01-30 17:14         ` Andy Isaacson
2004-01-30 19:49           ` Jakub Jelinek
2004-02-01 21:17             ` Bill Davidsen
2004-01-30 21:14         ` David S. Miller
2004-02-01 21:18           ` Bill Davidsen
2004-02-01 23:22             ` David S. Miller
2004-02-02  1:00             ` Linus Torvalds
2004-02-02 18:55               ` Bill Davidsen
2004-01-30 17:26       ` Jean-Luc Cooke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox