Linux cryptographic layer development

Linux cryptographic layer development
 help / color / mirror / Atom feed

* Re: x86-64: Maintain 16-byte stack alignment
From: Herbert Xu @ 2017-01-11  4:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Linux Crypto Mailing List, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski, Ard Biesheuvel
In-Reply-To: <CA+55aFzypqpMog8aTczWYok0MqOcMi4gGzCor6k-NdJzBoLWyQ@mail.gmail.com>

On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
>
> That said, I do think that the "don't assume stack alignment, do it by
> hand" may be the safer thing. Because who knows what the random rules
> will be on other architectures.

Sure we can ban the use of attribute aligned on stacks.  But
what about indirect uses through structures?  For example, if
someone does

struct foo {
} __attribute__ ((__aligned__(16)));

int bar(...)
{
	struct foo f;

	return baz(&f);
}

then baz will end up with an unaligned argument.  The worst part
is that it is not at all obvious to the person writing the function
bar.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Linus Torvalds @ 2017-01-11  4:17 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, Linux Crypto Mailing List, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski, Ard Biesheuvel
In-Reply-To: <CA+55aFxJOzMim_d-O2E2yip8JWo0NdYs_72sNwFKSkTjy8q0Sw@mail.gmail.com>

On Tue, Jan 10, 2017 at 7:30 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If you really want more stack alignment, you have to generate that
> alignment yourself by hand (and have a bigger buffer that you do that
> alignment inside).

Side note: gcc can (and does) actually generate forced alignment using
"and" instructions on %rsp rather than assuming pre-existing
alignment.  And that would be valid.

The problem with "alignof(16)" is not that gcc couldn't generate the
alignment itself, it's just the broken "it's already aligned to 16
bytes" assumption because -mpreferred-stack-boundary=3 doesn't work.

You *could* try to hack around it by forcing a 32-byte alignment
instead. That (I think) will make gcc generate the "and" instruction
mess.

And it shouldn't actually use any more memory than doing it by hand
(by having twice the alignment and hand-aligning the pointer).

So we *could* try to just have a really hacky rule saying that you can
align stack data to 8 or 32 bytes, but *not* to 16 bytes.

That said, I do think that the "don't assume stack alignment, do it by
hand" may be the safer thing. Because who knows what the random rules
will be on other architectures.

               Linus

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Linus Torvalds @ 2017-01-11  3:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, Linux Crypto Mailing List, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski, Ard Biesheuvel
In-Reply-To: <20170111031124.GA4515@gondor.apana.org.au>

On Tue, Jan 10, 2017 at 7:11 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Well the only other alternative I see is to ban compilers which
> enforce 16-byte stack alignment, such as gcc 4.7.2.

No,  you don't have to ban the compiler - it's just a "generate overly
stupid code that just uses extra instructions to likely mis-align the
stack more" issue. So it's "stupid code generation" vs "buggy".

What we should ban is code that assumes that stack objects can be
aligned to more than word boundary.

__attribute__((align)) simply doesn't work on stack objects, because
the stack isn't aligned.

If you really want more stack alignment, you have to generate that
alignment yourself by hand (and have a bigger buffer that you do that
alignment inside).

So this was just simply buggy:

      u32 state[16] __aligned(CHACHA20_STATE_ALIGN);

because you just can't do that. It's that simple. There is a reason
why the code does the dance with

    u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];

    state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);

rather than ask the compiler to do something invalid.

                Linus

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Herbert Xu @ 2017-01-11  3:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ard Biesheuvel, Linux Kernel Mailing List,
	Linux Crypto Mailing List, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CALCETrW4DZeq=tcEnEyPc=H3Z35J-o6BGPwxEfcdKjg2tDqBGg@mail.gmail.com>

On Tue, Jan 10, 2017 at 03:25:47PM -0800, Andy Lutomirski wrote:
>
> > If it does what it says on the tin, it should fix the issue, but after
> > adding the attribute, I get the exact same object output, so there's
> > something dodgy going on here.
> 
> Ugh, that's annoying.  Maybe it needs noinline too?

Perhaps something to do with

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66697

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Herbert Xu @ 2017-01-11  3:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ard Biesheuvel, Linux Kernel Mailing List,
	Linux Crypto Mailing List, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CALCETrW7mXxntZBTTHOgvaRLV4=9hcsNoTKM3o2O5Jod_6RR6Q@mail.gmail.com>

On Tue, Jan 10, 2017 at 11:22:15AM -0800, Andy Lutomirski wrote:
>
> > Actually, the breakage is introduced by the patch Herbert refers to
> >
> > https://patchwork.kernel.org/patch/9468391/
> >
> > where the state is replaced by a simple
> >
> > u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
> >
> > which seemed harmless enough to me. So the code above works fine.
> 
> So how about just the one-line patch of adding the
> force_align_arg_pointer?  Would that solve the problem?

It probably does.  However, this is too error-prone.  Surely
you can't expect random kernel developers to know to add this
force_align_arg_pointer every time they try to align a stack
variable to 16 bytes?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Herbert Xu @ 2017-01-11  3:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ard Biesheuvel, Linux Kernel Mailing List,
	Linux Crypto Mailing List, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CALCETrWGihhtbdkKcSD0kBrx-fWxM_TyUOUafvhsKMYXq5yEEQ@mail.gmail.com>

On Tue, Jan 10, 2017 at 11:00:31AM -0800, Andy Lutomirski wrote:
> 
> Here's what I think is really going on.  This is partially from
> memory, so I could be off base.  The kernel is up against
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
> on some GCC versions (like the bad one and maybe even current ones),
> things compiled without -mno-sse can't have the stack alignment set
> properly.  IMO we should fix this in the affected code, not the entry

No that's not it.  My compiler (gcc 4.7.2) doesn't support it period:

$ gcc -S -O2 -mno-sse -mpreferred-stack-boundary=3 a.c
a.c:1:0: error: -mpreferred-stack-boundary=3 is not between 4 and 12
$ 

So you either have to ban all compilers older than whatever version
that started supporting 8-byte stack alignment, or fix the kernel.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Herbert Xu @ 2017-01-11  3:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Linux Kernel Mailing List, Linux Crypto Mailing List,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CAKv+Gu8xVrfR=J5b2pGK7R5Tv-M3xZhL_dnTvvM7nTZLLtC-EQ@mail.gmail.com>

On Tue, Jan 10, 2017 at 05:30:48PM +0000, Ard Biesheuvel wrote:
> 
> Apologies for introducing this breakage. It seemed like an obvious and
> simple cleanup, so I didn't even bother to mention it in the commit
> log, but if the kernel does not guarantee 16 byte alignment, I guess
> we should revert to the old method. If SSE instructions are the only
> ones that require this alignment, then I suppose not having a ABI
> conforming stack pointer should not be an issue in general.

I think we need to address this regardless of your patch.  You
won't be the last person to use __attribute__ to get 16-byte
alignment on the stack.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v2 7/8] net: Rename TCA*BPF_DIGEST to ..._SHA256
From: Andy Lutomirski @ 2017-01-11  3:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Lutomirski, Netdev, LKML, Linux Crypto Mailing List,
	Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Alexei Starovoitov
In-Reply-To: <58758169.2020408@iogearbox.net>

On Tue, Jan 10, 2017 at 4:50 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 01/11/2017 12:24 AM, Andy Lutomirski wrote:
>>
>> This makes it easier to add another digest algorithm down the road if
>> needed.  It also serves to force any programs that might have been
>> written against a kernel that had the old field name to notice the
>> change and make any necessary changes.
>>
>> This shouldn't violate any stable API policies, as no released kernel
>> has ever had TCA*BPF_DIGEST.
>
>
> Imho, this and patch 6/8 is not really needed. Should there ever
> another digest alg be used (doubt it), then you'd need a new nl
> attribute and fdinfo line anyway to keep existing stuff intact.
> Nobody made the claim that you can just change this underneath
> and not respecting abi for existing applications when I read from
> above that such apps now will get "forced" to notice a change.

Fair enough.  I was more concerned about prerelease iproute2 versions,
but maybe that's a nonissue.  I'll drop these two patches.

--Andy

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Herbert Xu @ 2017-01-11  3:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Linux Crypto Mailing List, Ingo Molnar,
	Thomas Gleixner, Andy Lutomirski, Ard Biesheuvel
In-Reply-To: <CA+55aFyuUCr3aUBhFt6T3vKyZOskLyH8Kd_2jeFkPYxxvEaXEg@mail.gmail.com>

On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> On Tue, Jan 10, 2017 at 6:39 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > BTW this is with Debian gcc 4.7.2 which does not allow an 8-byte
> > stack alignment as attempted by the Makefile:
> 
> I'm pretty sure we have random asm code that may not maintain a
> 16-byte stack alignment when it calls other code (including, in some
> cases, calling C code).
> 
> So I'm not at all convinced that this is a good idea. We shouldn't
> expect 16-byte alignment to be something trustworthy.

Well the only other alternative I see is to ban compilers which
enforce 16-byte stack alignment, such as gcc 4.7.2.  Or is there
another way?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v2 0/8] Switch BPF's digest to SHA256
From: Alexei Starovoitov @ 2017-01-11  1:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List,
	Jason A. Donenfeld, Hannes Frederic Sowa, Eric Dumazet,
	Eric Biggers, Tom Herbert, David S. Miller
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

On Tue, Jan 10, 2017 at 03:24:38PM -0800, Andy Lutomirski wrote:
> I can imagine future uses for the new-in-4.10 BPF digest feature that
> would be problematic if malicious users could produce collisions, and
> SHA-1 is no longer consdiered to be collision-free.  Even without
> needing collision resistance, SHA-1 is no longer recommended for new
> applications.  Switch the BPF digest to SHA-256 instead.

Using this reasoning we must immediately change 'git' to use sha256
as well. malicious users are coming!
Seriously, NACK for bpf bits again, since you somehow missed
the reasons I gave earlier, here they are again:
.......
This statement is also bogus. The only reason we added prog_digest is
to improve debuggability and introspection of bpf programs.
As I said in the previous thread "collisions are ok" and we could have
used jhash here to avoid patches like this ever appearing
and wasting everyones time.

sha1 is 20 bytes which is already a bit long to print and copy paste by humans.
whereas 4 byte jhash is a bit too short, since collisions are not that rare
and may lead to incorrect assumptions from the users that develop the programs.
I would prefer something in 6-10 byte range that prevents collisions most of
the time and short to print as hex, but I don't know of anything like this
in the existing kernel and inventing bpf specific hash is not great.
Another requirement for debugging (and prog_digest) that user space
should be able to produce the same hash without asking kernel, so
sha1 fits that as well, since it's well known and easy to put into library.

sha256 doesn't fit either of these requirements. 32-bytes are too long to print
and when we use it as a substitue for the prog name for jited ksym, looking
at long function names will screw up all tools like perf, which we don't
want. sha256 is equally not easy for user space app like iproute2,
so not an acceptable choice from that pov either.
...........

tldr: I see only Cons for sha1->sha256 switch. Not a single Pro, hence
nack for bpf patches 4+
The patches 1-3 are nice and useful on their own.

^ permalink raw reply

* Re: [PATCH v2 7/8] net: Rename TCA*BPF_DIGEST to ..._SHA256
From: Daniel Borkmann @ 2017-01-11  0:50 UTC (permalink / raw)
  To: Andy Lutomirski, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Alexei Starovoitov
In-Reply-To: <f284a4d87cde103cd78de64b8607e3c80a17e2e5.1484090585.git.luto@kernel.org>

On 01/11/2017 12:24 AM, Andy Lutomirski wrote:
> This makes it easier to add another digest algorithm down the road if
> needed.  It also serves to force any programs that might have been
> written against a kernel that had the old field name to notice the
> change and make any necessary changes.
>
> This shouldn't violate any stable API policies, as no released kernel
> has ever had TCA*BPF_DIGEST.

Imho, this and patch 6/8 is not really needed. Should there ever
another digest alg be used (doubt it), then you'd need a new nl
attribute and fdinfo line anyway to keep existing stuff intact.
Nobody made the claim that you can just change this underneath
and not respecting abi for existing applications when I read from
above that such apps now will get "forced" to notice a change.

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Andy Lutomirski @ 2017-01-10 23:25 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, Linux Kernel Mailing List, Linux Crypto Mailing List,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CAKv+Gu_Te-D5V3eG-GHtpP2hdpb+aTCgxBhmvf1DVK-pKPT54g@mail.gmail.com>

On Tue, Jan 10, 2017 at 12:00 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 10 January 2017 at 19:22, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>>>> <ard.biesheuvel@linaro.org> wrote:
>>>>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>>>> I recently applied the patch
>>>>>>
>>>>>>         https://patchwork.kernel.org/patch/9468391/
>>>>>>
>>>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>>>> code.  It turned out that the patch changed a manually aligned
>>>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>>>> which is not actually the case in the kernel.
>>>>>>
>>>>>
>>>>> Apologies for introducing this breakage. It seemed like an obvious and
>>>>> simple cleanup, so I didn't even bother to mention it in the commit
>>>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>>>> we should revert to the old method. If SSE instructions are the only
>>>>> ones that require this alignment, then I suppose not having a ABI
>>>>> conforming stack pointer should not be an issue in general.
>>>>
>>>> Here's what I think is really going on.  This is partially from
>>>> memory, so I could be off base.  The kernel is up against
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>>>> on some GCC versions (like the bad one and maybe even current ones),
>>>> things compiled without -mno-sse can't have the stack alignment set
>>>> properly.  IMO we should fix this in the affected code, not the entry
>>>> code.  In fact, I think that fixing it in the entry code won't even
>>>> fully fix it because modern GCC will compile the rest of the kernel
>>>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>>>> 4.8 and newer).
>>>>
>>>> Can we just add __attribute__((force_align_arg_pointer)) to the
>>>> affected functions?  Maybe have:
>>>>
>>>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>>>
>>>> on affected gcc versions?
>>>>
>>>> ***HOWEVER***
>>>>
>>>> I think this is missing the tree for the supposed forest.  The actual
>>>> affected code appears to be:
>>>>
>>>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>>>>                          struct scatterlist *src, unsigned int nbytes)
>>>> {
>>>>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>>>
>>>> ...
>>>>
>>>>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>>>
>>>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>>>> and optimizes out the roundup.  How about just declaring an actual
>>>> __aligned(16) buffer, marking the function
>>>> __attribute__((force_align_arg_pointer)), and being done with it?
>>>> After all, we need that forcible alignment on *all* gcc versions.
>>>>
>>>
>>> Actually, the breakage is introduced by the patch Herbert refers to
>>>
>>> https://patchwork.kernel.org/patch/9468391/
>>>
>>> where the state is replaced by a simple
>>>
>>> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>>>
>>> which seemed harmless enough to me. So the code above works fine.
>>
>> So how about just the one-line patch of adding the
>> force_align_arg_pointer?  Would that solve the problem?
>
> If it does what it says on the tin, it should fix the issue, but after
> adding the attribute, I get the exact same object output, so there's
> something dodgy going on here.

Ugh, that's annoying.  Maybe it needs noinline too?

--Andy

^ permalink raw reply

* [PATCH v2 6/8] bpf: Rename fdinfo's prog_digest to prog_sha256
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Alexei Starovoitov
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

This makes it easier to add another digest algorithm down the road
if needed.  It also serves to force any programs that might have
been written against a kernel that had 'prog_digest' to be updated.

This shouldn't violate any stable API policies, as no released
kernel has ever had 'prog_digest'.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e89acea22ecf..956370b80296 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,7 +694,7 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp)
 	seq_printf(m,
 		   "prog_type:\t%u\n"
 		   "prog_jited:\t%u\n"
-		   "prog_digest:\t%s\n"
+		   "prog_sha256:\t%s\n"
 		   "memlock:\t%llu\n",
 		   prog->type,
 		   prog->jited,
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 0/8] Switch BPF's digest to SHA256
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski

I can imagine future uses for the new-in-4.10 BPF digest feature that
would be problematic if malicious users could produce collisions, and
SHA-1 is no longer consdiered to be collision-free.  Even without
needing collision resistance, SHA-1 is no longer recommended for new
applications.  Switch the BPF digest to SHA-256 instead.

The actual switchover is trivial.  Most of this series consists of
cleanups to the SHA256 code to make it usable as a standalone library
(since BPF should not depend on crypto).

The cleaned up library is much more user-friendly than the SHA-1 code,
so this also significantly tidies up the BPF digest code.

This is intended for 4.10.  If this series misses 4.10 and nothing
takes its place, then we'll have an unpleasant ABI stability
situation.

NB: I would be happy to make parallel changes to the SHA-512 code if
the crypto folks would like for me to do so.  I haven't yet because I
wanted to minimize churn.  Also, the changes will be essentially
identical to the SHA-256 changes and I want to get the latter
reviewed first.

Andy Lutomirski (8):
  crypto/sha256: Factor out the parts of base API that don't use
    shash_desc
  crypto/sha256: Export a sha256_{init,update,final}_direct() API
  crypto/sha256: Build the SHA256 core separately from the crypto module
  bpf: Use SHA256 instead of SHA1 for bpf digests
  bpf: Avoid copying the entire BPF program when hashing it
  bpf: Rename fdinfo's prog_digest to prog_sha256
  net: Rename TCA*BPF_DIGEST to ..._SHA256
  crypto/testmgr: Allocate only the required output size for hash tests

 crypto/Kconfig                     |   8 ++
 crypto/Makefile                    |   1 +
 crypto/sha256_direct.c             | 238 +++++++++++++++++++++++++++++++++++++
 crypto/sha256_generic.c            | 215 ++-------------------------------
 crypto/testmgr.c                   |   9 +-
 include/crypto/sha.h               |  24 ++++
 include/crypto/sha256_base.h       |  58 +++++----
 include/linux/filter.h             |  11 +-
 include/uapi/linux/pkt_cls.h       |   2 +-
 include/uapi/linux/tc_act/tc_bpf.h |   2 +-
 init/Kconfig                       |   1 +
 kernel/bpf/core.c                  |  63 +++-------
 kernel/bpf/syscall.c               |   2 +-
 net/sched/act_bpf.c                |   2 +-
 net/sched/cls_bpf.c                |   2 +-
 15 files changed, 343 insertions(+), 295 deletions(-)
 create mode 100644 crypto/sha256_direct.c

-- 
2.9.3

^ permalink raw reply

* [PATCH v2 8/8] crypto/testmgr: Allocate only the required output size for hash tests
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Ard Biesheuvel, Herbert Xu
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

There are some hashes (e.g. sha224) that have some internal trickery
to make sure that only the correct number of output bytes are
generated.  If something goes wrong, they could potentially overrun
the output buffer.

Make the test more robust by allocating only enough space for the
correct output size so that memory debugging will catch the error if
the output is overrun.

Tested by intentionally breaking sha224 to output all 256
internally-generated bits while running on KASAN.

Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 crypto/testmgr.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index f616ad74cce7..575fc28a9ab2 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -265,6 +265,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct hash_testvec *template,
 		       const int align_offset)
 {
 	const char *algo = crypto_tfm_alg_driver_name(crypto_ahash_tfm(tfm));
+	size_t digest_size = crypto_ahash_digestsize(tfm);
 	unsigned int i, j, k, temp;
 	struct scatterlist sg[8];
 	char *result;
@@ -275,7 +276,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct hash_testvec *template,
 	char *xbuf[XBUFSIZE];
 	int ret = -ENOMEM;
 
-	result = kmalloc(MAX_DIGEST_SIZE, GFP_KERNEL);
+	result = kmalloc(digest_size, GFP_KERNEL);
 	if (!result)
 		return ret;
 	key = kmalloc(MAX_KEYLEN, GFP_KERNEL);
@@ -305,7 +306,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct hash_testvec *template,
 			goto out;
 
 		j++;
-		memset(result, 0, MAX_DIGEST_SIZE);
+		memset(result, 0, digest_size);
 
 		hash_buff = xbuf[0];
 		hash_buff += align_offset;
@@ -380,7 +381,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct hash_testvec *template,
 			continue;
 
 		j++;
-		memset(result, 0, MAX_DIGEST_SIZE);
+		memset(result, 0, digest_size);
 
 		temp = 0;
 		sg_init_table(sg, template[i].np);
@@ -458,7 +459,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct hash_testvec *template,
 			continue;
 
 		j++;
-		memset(result, 0, MAX_DIGEST_SIZE);
+		memset(result, 0, digest_size);
 
 		ret = -EINVAL;
 		hash_buff = xbuf[0];
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 7/8] net: Rename TCA*BPF_DIGEST to ..._SHA256
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Alexei Starovoitov
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

This makes it easier to add another digest algorithm down the road if
needed.  It also serves to force any programs that might have been
written against a kernel that had the old field name to notice the
change and make any necessary changes.

This shouldn't violate any stable API policies, as no released kernel
has ever had TCA*BPF_DIGEST.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/uapi/linux/pkt_cls.h       | 2 +-
 include/uapi/linux/tc_act/tc_bpf.h | 2 +-
 net/sched/act_bpf.c                | 2 +-
 net/sched/cls_bpf.c                | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index cb4bcdc58543..ac6b300c1550 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -397,7 +397,7 @@ enum {
 	TCA_BPF_NAME,
 	TCA_BPF_FLAGS,
 	TCA_BPF_FLAGS_GEN,
-	TCA_BPF_DIGEST,
+	TCA_BPF_SHA256,
 	__TCA_BPF_MAX,
 };
 
diff --git a/include/uapi/linux/tc_act/tc_bpf.h b/include/uapi/linux/tc_act/tc_bpf.h
index a6b88a6f7f71..eae18a7430eb 100644
--- a/include/uapi/linux/tc_act/tc_bpf.h
+++ b/include/uapi/linux/tc_act/tc_bpf.h
@@ -27,7 +27,7 @@ enum {
 	TCA_ACT_BPF_FD,
 	TCA_ACT_BPF_NAME,
 	TCA_ACT_BPF_PAD,
-	TCA_ACT_BPF_DIGEST,
+	TCA_ACT_BPF_SHA256,
 	__TCA_ACT_BPF_MAX,
 };
 #define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1)
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 1c60317f0121..3868a66d0b24 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -123,7 +123,7 @@ static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf *prog,
 	    nla_put_string(skb, TCA_ACT_BPF_NAME, prog->bpf_name))
 		return -EMSGSIZE;
 
-	nla = nla_reserve(skb, TCA_ACT_BPF_DIGEST,
+	nla = nla_reserve(skb, TCA_ACT_BPF_SHA256,
 			  sizeof(prog->filter->digest));
 	if (nla == NULL)
 		return -EMSGSIZE;
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index adc776048d1a..6fc110321621 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -555,7 +555,7 @@ static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog *prog,
 	    nla_put_string(skb, TCA_BPF_NAME, prog->bpf_name))
 		return -EMSGSIZE;
 
-	nla = nla_reserve(skb, TCA_BPF_DIGEST, sizeof(prog->filter->digest));
+	nla = nla_reserve(skb, TCA_BPF_SHA256, sizeof(prog->filter->digest));
 	if (nla == NULL)
 		return -EMSGSIZE;
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 5/8] bpf: Avoid copying the entire BPF program when hashing it
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Alexei Starovoitov
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

The sha256 helpers can consume a message incrementally, so there's no need
to allocate a buffer to store the whole blob to be hashed.

This may be a slight slowdown for very long messages because gcc can't
inline the sha256_update() calls.  For reasonably-sized programs,
however, this should be a considerable speedup as vmalloc() is quite
slow.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 kernel/bpf/core.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 668b92f6ab58..106162a1bc54 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -149,44 +149,37 @@ void __bpf_prog_free(struct bpf_prog *fp)
 int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
 	struct sha256_state sha;
-	u32 i, psize;
-	struct bpf_insn *dst;
+	u32 i;
 	bool was_ld_map;
-	u8 *raw;
-
-	psize = bpf_prog_insn_size(fp);
-	raw = vmalloc(psize);
-	if (!raw)
-		return -ENOMEM;
 
 	sha256_init_direct(&sha);
 
 	/* We need to take out the map fd for the digest calculation
 	 * since they are unstable from user space side.
 	 */
-	dst = (void *)raw;
 	for (i = 0, was_ld_map = false; i < fp->len; i++) {
-		dst[i] = fp->insnsi[i];
+		struct bpf_insn insn = fp->insnsi[i];
+
 		if (!was_ld_map &&
-		    dst[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
-		    dst[i].src_reg == BPF_PSEUDO_MAP_FD) {
+		    insn.code == (BPF_LD | BPF_IMM | BPF_DW) &&
+		    insn.src_reg == BPF_PSEUDO_MAP_FD) {
 			was_ld_map = true;
-			dst[i].imm = 0;
+			insn.imm = 0;
 		} else if (was_ld_map &&
-			   dst[i].code == 0 &&
-			   dst[i].dst_reg == 0 &&
-			   dst[i].src_reg == 0 &&
-			   dst[i].off == 0) {
+			   insn.code == 0 &&
+			   insn.dst_reg == 0 &&
+			   insn.src_reg == 0 &&
+			   insn.off == 0) {
 			was_ld_map = false;
-			dst[i].imm = 0;
+			insn.imm = 0;
 		} else {
 			was_ld_map = false;
 		}
+
+		sha256_update_direct(&sha, (const u8 *)&insn, sizeof(insn));
 	}
 
-	sha256_update_direct(&sha, raw, psize);
 	sha256_final_direct(&sha, fp->digest);
-	vfree(raw);
 	return 0;
 }
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 4/8] bpf: Use SHA256 instead of SHA1 for bpf digests
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Alexei Starovoitov
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

SHA1 is considered obsolete.  It is no longer considered to be
collision resistant and, in general, it should not be used for new
applications.  Change the new-in-4.10 BPF digest to SHA-256.  This
means that potential future applications of the digest that need
collision resistance will be able to use the BPF digest.  Applications
that just want a short identifier for a BPF program are welcome to
truncate the digest.

This is also a cleanup IMO -- the new sha256_*_direct() API is much
nicer than the old SHA1 library helpers.  It will also enable
incremental hashing so the BPF code can avoid calling vmalloc().

I moved the digest field to keep all of the bpf program metadata in
the same cache line.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/filter.h | 11 +++--------
 init/Kconfig           |  1 +
 kernel/bpf/core.c      | 42 ++++++++----------------------------------
 3 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 702314253797..23df2574e30c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -14,7 +14,8 @@
 #include <linux/workqueue.h>
 #include <linux/sched.h>
 #include <linux/capability.h>
-#include <linux/cryptohash.h>
+
+#include <crypto/sha.h>
 
 #include <net/sch_generic.h>
 
@@ -408,11 +409,11 @@ struct bpf_prog {
 	kmemcheck_bitfield_end(meta);
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	u32			len;		/* Number of filter blocks */
-	u32			digest[SHA_DIGEST_WORDS]; /* Program digest */
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
 	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
+	u8			digest[SHA256_DIGEST_SIZE]; /* Program digest */
 	/* Instructions for interpreter */
 	union {
 		struct sock_filter	insns[0];
@@ -519,12 +520,6 @@ static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
 	return prog->len * sizeof(struct bpf_insn);
 }
 
-static inline u32 bpf_prog_digest_scratch_size(const struct bpf_prog *prog)
-{
-	return round_up(bpf_prog_insn_size(prog) +
-			sizeof(__be64) + 1, SHA_MESSAGE_BYTES);
-}
-
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
 	return max(sizeof(struct bpf_prog),
diff --git a/init/Kconfig b/init/Kconfig
index 223b734abccd..f1ea6d023f8c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1389,6 +1389,7 @@ config HAVE_PCSPKR_PLATFORM
 # interpreter that classic socket filters depend on
 config BPF
 	bool
+	select CRYPTO_SHA256_DIRECT
 
 menuconfig EXPERT
 	bool "Configure standard kernel features (expert users)"
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1eb4f1303756..668b92f6ab58 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -148,22 +148,18 @@ void __bpf_prog_free(struct bpf_prog *fp)
 
 int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
-	const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
-	u32 raw_size = bpf_prog_digest_scratch_size(fp);
-	u32 ws[SHA_WORKSPACE_WORDS];
-	u32 i, bsize, psize, blocks;
+	struct sha256_state sha;
+	u32 i, psize;
 	struct bpf_insn *dst;
 	bool was_ld_map;
-	u8 *raw, *todo;
-	__be32 *result;
-	__be64 *bits;
+	u8 *raw;
 
-	raw = vmalloc(raw_size);
+	psize = bpf_prog_insn_size(fp);
+	raw = vmalloc(psize);
 	if (!raw)
 		return -ENOMEM;
 
-	sha_init(fp->digest);
-	memset(ws, 0, sizeof(ws));
+	sha256_init_direct(&sha);
 
 	/* We need to take out the map fd for the digest calculation
 	 * since they are unstable from user space side.
@@ -188,30 +184,8 @@ int bpf_prog_calc_digest(struct bpf_prog *fp)
 		}
 	}
 
-	psize = bpf_prog_insn_size(fp);
-	memset(&raw[psize], 0, raw_size - psize);
-	raw[psize++] = 0x80;
-
-	bsize  = round_up(psize, SHA_MESSAGE_BYTES);
-	blocks = bsize / SHA_MESSAGE_BYTES;
-	todo   = raw;
-	if (bsize - psize >= sizeof(__be64)) {
-		bits = (__be64 *)(todo + bsize - sizeof(__be64));
-	} else {
-		bits = (__be64 *)(todo + bsize + bits_offset);
-		blocks++;
-	}
-	*bits = cpu_to_be64((psize - 1) << 3);
-
-	while (blocks--) {
-		sha_transform(fp->digest, todo, ws);
-		todo += SHA_MESSAGE_BYTES;
-	}
-
-	result = (__force __be32 *)fp->digest;
-	for (i = 0; i < SHA_DIGEST_WORDS; i++)
-		result[i] = cpu_to_be32(fp->digest[i]);
-
+	sha256_update_direct(&sha, raw, psize);
+	sha256_final_direct(&sha, fp->digest);
 	vfree(raw);
 	return 0;
 }
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 3/8] crypto/sha256: Build the SHA256 core separately from the crypto module
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Ard Biesheuvel, Herbert Xu
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

This just moves code around -- no code changes in this patch.  This
wil let BPF-based tracing link against the SHA256 core code without
depending on the crypto core.

Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 crypto/Kconfig          |   8 ++
 crypto/Makefile         |   1 +
 crypto/sha256_direct.c  | 238 ++++++++++++++++++++++++++++++++++++++++++++++++
 crypto/sha256_generic.c | 214 -------------------------------------------
 4 files changed, 247 insertions(+), 214 deletions(-)
 create mode 100644 crypto/sha256_direct.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 160f08e721cc..b83ae6789e78 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -10,6 +10,13 @@ config XOR_BLOCKS
 source "crypto/async_tx/Kconfig"
 
 #
+# Cryptographic algorithms that are usable without the Crypto API.
+# None of these should have visible config options.
+#
+config CRYPTO_SHA256_DIRECT
+	bool
+
+#
 # Cryptographic API Configuration
 #
 menuconfig CRYPTO
@@ -763,6 +770,7 @@ config CRYPTO_SHA512_MB
 
 config CRYPTO_SHA256
 	tristate "SHA224 and SHA256 digest algorithm"
+	select CRYPTO_SHA256_DIRECT
 	select CRYPTO_HASH
 	help
 	  SHA256 secure hash standard (DFIPS 180-2).
diff --git a/crypto/Makefile b/crypto/Makefile
index b8f0e3eb0791..e0118a3b0d99 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_CRYPTO_RMD160) += rmd160.o
 obj-$(CONFIG_CRYPTO_RMD256) += rmd256.o
 obj-$(CONFIG_CRYPTO_RMD320) += rmd320.o
 obj-$(CONFIG_CRYPTO_SHA1) += sha1_generic.o
+obj-$(CONFIG_CRYPTO_SHA256_DIRECT) += sha256_direct.o
 obj-$(CONFIG_CRYPTO_SHA256) += sha256_generic.o
 obj-$(CONFIG_CRYPTO_SHA512) += sha512_generic.o
 obj-$(CONFIG_CRYPTO_SHA3) += sha3_generic.o
diff --git a/crypto/sha256_direct.c b/crypto/sha256_direct.c
new file mode 100644
index 000000000000..2029e4c08339
--- /dev/null
+++ b/crypto/sha256_direct.c
@@ -0,0 +1,238 @@
+/*
+ * Cryptographic API.
+ *
+ * SHA-256, as specified in
+ * http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf
+ *
+ * SHA-256 code by Jean-Luc Cooke <jlcooke@certainkey.com>.
+ *
+ * Copyright (c) Jean-Luc Cooke <jlcooke@certainkey.com>
+ * Copyright (c) Andrew McDonald <andrew@mcdonald.org.uk>
+ * Copyright (c) 2002 James Morris <jmorris@intercode.com.au>
+ * SHA224 Support Copyright 2007 Intel Corporation <jonathan.lynch@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option) 
+ * any later version.
+ *
+ */
+#include <linux/types.h>
+#include <crypto/sha.h>
+#include <crypto/sha256_base.h>
+#include <asm/byteorder.h>
+#include <asm/unaligned.h>
+
+static inline u32 Ch(u32 x, u32 y, u32 z)
+{
+	return z ^ (x & (y ^ z));
+}
+
+static inline u32 Maj(u32 x, u32 y, u32 z)
+{
+	return (x & y) | (z & (x | y));
+}
+
+#define e0(x)       (ror32(x, 2) ^ ror32(x,13) ^ ror32(x,22))
+#define e1(x)       (ror32(x, 6) ^ ror32(x,11) ^ ror32(x,25))
+#define s0(x)       (ror32(x, 7) ^ ror32(x,18) ^ (x >> 3))
+#define s1(x)       (ror32(x,17) ^ ror32(x,19) ^ (x >> 10))
+
+static inline void LOAD_OP(int I, u32 *W, const u8 *input)
+{
+	W[I] = get_unaligned_be32((__u32 *)input + I);
+}
+
+static inline void BLEND_OP(int I, u32 *W)
+{
+	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+}
+
+static void sha256_transform(u32 *state, const u8 *input)
+{
+	u32 a, b, c, d, e, f, g, h, t1, t2;
+	u32 W[64];
+	int i;
+
+	/* load the input */
+	for (i = 0; i < 16; i++)
+		LOAD_OP(i, W, input);
+
+	/* now blend */
+	for (i = 16; i < 64; i++)
+		BLEND_OP(i, W);
+
+	/* load the state into our registers */
+	a=state[0];  b=state[1];  c=state[2];  d=state[3];
+	e=state[4];  f=state[5];  g=state[6];  h=state[7];
+
+	/* now iterate */
+	t1 = h + e1(e) + Ch(e,f,g) + 0x428a2f98 + W[ 0];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0x71374491 + W[ 1];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0xb5c0fbcf + W[ 2];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0xe9b5dba5 + W[ 3];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0x3956c25b + W[ 4];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0x59f111f1 + W[ 5];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0x923f82a4 + W[ 6];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0xab1c5ed5 + W[ 7];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0xd807aa98 + W[ 8];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0x12835b01 + W[ 9];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0x243185be + W[10];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0x550c7dc3 + W[11];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0x72be5d74 + W[12];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0x80deb1fe + W[13];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0x9bdc06a7 + W[14];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0xc19bf174 + W[15];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0xe49b69c1 + W[16];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0xefbe4786 + W[17];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0x0fc19dc6 + W[18];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0x240ca1cc + W[19];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0x2de92c6f + W[20];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0x4a7484aa + W[21];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0x5cb0a9dc + W[22];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0x76f988da + W[23];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0x983e5152 + W[24];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0xa831c66d + W[25];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0xb00327c8 + W[26];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0xbf597fc7 + W[27];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0xc6e00bf3 + W[28];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0xd5a79147 + W[29];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0x06ca6351 + W[30];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0x14292967 + W[31];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0x27b70a85 + W[32];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0x2e1b2138 + W[33];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0x4d2c6dfc + W[34];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0x53380d13 + W[35];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0x650a7354 + W[36];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0x766a0abb + W[37];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0x81c2c92e + W[38];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0x92722c85 + W[39];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0xa2bfe8a1 + W[40];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0xa81a664b + W[41];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0xc24b8b70 + W[42];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0xc76c51a3 + W[43];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0xd192e819 + W[44];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0xd6990624 + W[45];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0xf40e3585 + W[46];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0x106aa070 + W[47];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0x19a4c116 + W[48];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0x1e376c08 + W[49];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0x2748774c + W[50];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0x34b0bcb5 + W[51];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0x391c0cb3 + W[52];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0x4ed8aa4a + W[53];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0x5b9cca4f + W[54];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0x682e6ff3 + W[55];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	t1 = h + e1(e) + Ch(e,f,g) + 0x748f82ee + W[56];
+	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
+	t1 = g + e1(d) + Ch(d,e,f) + 0x78a5636f + W[57];
+	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
+	t1 = f + e1(c) + Ch(c,d,e) + 0x84c87814 + W[58];
+	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
+	t1 = e + e1(b) + Ch(b,c,d) + 0x8cc70208 + W[59];
+	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
+	t1 = d + e1(a) + Ch(a,b,c) + 0x90befffa + W[60];
+	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
+	t1 = c + e1(h) + Ch(h,a,b) + 0xa4506ceb + W[61];
+	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
+	t1 = b + e1(g) + Ch(g,h,a) + 0xbef9a3f7 + W[62];
+	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
+	t1 = a + e1(f) + Ch(f,g,h) + 0xc67178f2 + W[63];
+	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
+
+	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
+	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
+
+	/* clear any sensitive info... */
+	a = b = c = d = e = f = g = h = t1 = t2 = 0;
+	memzero_explicit(W, 64 * sizeof(u32));
+}
+
+static void sha256_generic_block_fn(struct sha256_state *sst, u8 const *src,
+				    int blocks)
+{
+	while (blocks--) {
+		sha256_transform(sst->state, src);
+		src += SHA256_BLOCK_SIZE;
+	}
+}
+
+void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
+			  unsigned int len)
+{
+	__sha256_base_do_update(sctx, data, len, sha256_generic_block_fn);
+}
+EXPORT_SYMBOL(sha256_update_direct);
+
+void __sha256_final_direct(struct sha256_state *sctx, unsigned int digest_size,
+			   u8 *out)
+{
+	sha256_do_finalize_direct(sctx, sha256_generic_block_fn);
+	__sha256_base_finish(sctx, digest_size, out);
+}
+EXPORT_SYMBOL(sha256_final_direct);
+
+MODULE_LICENSE("GPL");
diff --git a/crypto/sha256_generic.c b/crypto/sha256_generic.c
index 573e114382f9..ffc444ef627c 100644
--- a/crypto/sha256_generic.c
+++ b/crypto/sha256_generic.c
@@ -24,8 +24,6 @@
 #include <linux/types.h>
 #include <crypto/sha.h>
 #include <crypto/sha256_base.h>
-#include <asm/byteorder.h>
-#include <asm/unaligned.h>
 
 const u8 sha224_zero_message_hash[SHA224_DIGEST_SIZE] = {
 	0xd1, 0x4a, 0x02, 0x8c, 0x2a, 0x3a, 0x2b, 0xc9, 0x47,
@@ -43,210 +41,6 @@ const u8 sha256_zero_message_hash[SHA256_DIGEST_SIZE] = {
 };
 EXPORT_SYMBOL_GPL(sha256_zero_message_hash);
 
-static inline u32 Ch(u32 x, u32 y, u32 z)
-{
-	return z ^ (x & (y ^ z));
-}
-
-static inline u32 Maj(u32 x, u32 y, u32 z)
-{
-	return (x & y) | (z & (x | y));
-}
-
-#define e0(x)       (ror32(x, 2) ^ ror32(x,13) ^ ror32(x,22))
-#define e1(x)       (ror32(x, 6) ^ ror32(x,11) ^ ror32(x,25))
-#define s0(x)       (ror32(x, 7) ^ ror32(x,18) ^ (x >> 3))
-#define s1(x)       (ror32(x,17) ^ ror32(x,19) ^ (x >> 10))
-
-static inline void LOAD_OP(int I, u32 *W, const u8 *input)
-{
-	W[I] = get_unaligned_be32((__u32 *)input + I);
-}
-
-static inline void BLEND_OP(int I, u32 *W)
-{
-	W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
-}
-
-static void sha256_transform(u32 *state, const u8 *input)
-{
-	u32 a, b, c, d, e, f, g, h, t1, t2;
-	u32 W[64];
-	int i;
-
-	/* load the input */
-	for (i = 0; i < 16; i++)
-		LOAD_OP(i, W, input);
-
-	/* now blend */
-	for (i = 16; i < 64; i++)
-		BLEND_OP(i, W);
-
-	/* load the state into our registers */
-	a=state[0];  b=state[1];  c=state[2];  d=state[3];
-	e=state[4];  f=state[5];  g=state[6];  h=state[7];
-
-	/* now iterate */
-	t1 = h + e1(e) + Ch(e,f,g) + 0x428a2f98 + W[ 0];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0x71374491 + W[ 1];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0xb5c0fbcf + W[ 2];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0xe9b5dba5 + W[ 3];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0x3956c25b + W[ 4];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0x59f111f1 + W[ 5];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0x923f82a4 + W[ 6];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0xab1c5ed5 + W[ 7];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0xd807aa98 + W[ 8];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0x12835b01 + W[ 9];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0x243185be + W[10];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0x550c7dc3 + W[11];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0x72be5d74 + W[12];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0x80deb1fe + W[13];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0x9bdc06a7 + W[14];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0xc19bf174 + W[15];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0xe49b69c1 + W[16];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0xefbe4786 + W[17];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0x0fc19dc6 + W[18];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0x240ca1cc + W[19];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0x2de92c6f + W[20];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0x4a7484aa + W[21];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0x5cb0a9dc + W[22];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0x76f988da + W[23];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0x983e5152 + W[24];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0xa831c66d + W[25];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0xb00327c8 + W[26];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0xbf597fc7 + W[27];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0xc6e00bf3 + W[28];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0xd5a79147 + W[29];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0x06ca6351 + W[30];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0x14292967 + W[31];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0x27b70a85 + W[32];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0x2e1b2138 + W[33];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0x4d2c6dfc + W[34];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0x53380d13 + W[35];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0x650a7354 + W[36];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0x766a0abb + W[37];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0x81c2c92e + W[38];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0x92722c85 + W[39];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0xa2bfe8a1 + W[40];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0xa81a664b + W[41];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0xc24b8b70 + W[42];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0xc76c51a3 + W[43];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0xd192e819 + W[44];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0xd6990624 + W[45];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0xf40e3585 + W[46];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0x106aa070 + W[47];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0x19a4c116 + W[48];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0x1e376c08 + W[49];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0x2748774c + W[50];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0x34b0bcb5 + W[51];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0x391c0cb3 + W[52];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0x4ed8aa4a + W[53];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0x5b9cca4f + W[54];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0x682e6ff3 + W[55];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	t1 = h + e1(e) + Ch(e,f,g) + 0x748f82ee + W[56];
-	t2 = e0(a) + Maj(a,b,c);    d+=t1;    h=t1+t2;
-	t1 = g + e1(d) + Ch(d,e,f) + 0x78a5636f + W[57];
-	t2 = e0(h) + Maj(h,a,b);    c+=t1;    g=t1+t2;
-	t1 = f + e1(c) + Ch(c,d,e) + 0x84c87814 + W[58];
-	t2 = e0(g) + Maj(g,h,a);    b+=t1;    f=t1+t2;
-	t1 = e + e1(b) + Ch(b,c,d) + 0x8cc70208 + W[59];
-	t2 = e0(f) + Maj(f,g,h);    a+=t1;    e=t1+t2;
-	t1 = d + e1(a) + Ch(a,b,c) + 0x90befffa + W[60];
-	t2 = e0(e) + Maj(e,f,g);    h+=t1;    d=t1+t2;
-	t1 = c + e1(h) + Ch(h,a,b) + 0xa4506ceb + W[61];
-	t2 = e0(d) + Maj(d,e,f);    g+=t1;    c=t1+t2;
-	t1 = b + e1(g) + Ch(g,h,a) + 0xbef9a3f7 + W[62];
-	t2 = e0(c) + Maj(c,d,e);    f+=t1;    b=t1+t2;
-	t1 = a + e1(f) + Ch(f,g,h) + 0xc67178f2 + W[63];
-	t2 = e0(b) + Maj(b,c,d);    e+=t1;    a=t1+t2;
-
-	state[0] += a; state[1] += b; state[2] += c; state[3] += d;
-	state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-
-	/* clear any sensitive info... */
-	a = b = c = d = e = f = g = h = t1 = t2 = 0;
-	memzero_explicit(W, 64 * sizeof(u32));
-}
-
-static void sha256_generic_block_fn(struct sha256_state *sst, u8 const *src,
-				    int blocks)
-{
-	while (blocks--) {
-		sha256_transform(sst->state, src);
-		src += SHA256_BLOCK_SIZE;
-	}
-}
-
-void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
-			  unsigned int len)
-{
-	__sha256_base_do_update(sctx, data, len, sha256_generic_block_fn);
-}
-EXPORT_SYMBOL(sha256_update_direct);
-
 int crypto_sha256_update(struct shash_desc *desc, const u8 *data,
 			  unsigned int len)
 {
@@ -262,14 +56,6 @@ static int sha256_final(struct shash_desc *desc, u8 *out)
 	return 0;
 }
 
-void __sha256_final_direct(struct sha256_state *sctx, unsigned int digest_size,
-			   u8 *out)
-{
-	sha256_do_finalize_direct(sctx, sha256_generic_block_fn);
-	__sha256_base_finish(sctx, digest_size, out);
-}
-EXPORT_SYMBOL(sha256_final_direct);
-
 int crypto_sha256_finup(struct shash_desc *desc, const u8 *data,
 			unsigned int len, u8 *hash)
 {
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 2/8] crypto/sha256: Export a sha256_{init,update,final}_direct() API
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Ard Biesheuvel, Herbert Xu
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

This provides a very simple interface for kernel code to use to do
synchronous, unaccelerated, virtual-address-based SHA256 hashing
without needing to create a crypto context.

Subsequent patches will make this work without building the crypto
core and will use to avoid making BPF-based tracing depend on
crypto.

Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 crypto/sha256_generic.c      | 31 ++++++++++++++++++++++++++-----
 include/crypto/sha.h         | 24 ++++++++++++++++++++++++
 include/crypto/sha256_base.h | 13 -------------
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/crypto/sha256_generic.c b/crypto/sha256_generic.c
index 8f9c47e1a96e..573e114382f9 100644
--- a/crypto/sha256_generic.c
+++ b/crypto/sha256_generic.c
@@ -240,24 +240,45 @@ static void sha256_generic_block_fn(struct sha256_state *sst, u8 const *src,
 	}
 }
 
+void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
+			  unsigned int len)
+{
+	__sha256_base_do_update(sctx, data, len, sha256_generic_block_fn);
+}
+EXPORT_SYMBOL(sha256_update_direct);
+
 int crypto_sha256_update(struct shash_desc *desc, const u8 *data,
 			  unsigned int len)
 {
-	return sha256_base_do_update(desc, data, len, sha256_generic_block_fn);
+	sha256_update_direct(shash_desc_ctx(desc), data, len);
+	return 0;
 }
 EXPORT_SYMBOL(crypto_sha256_update);
 
 static int sha256_final(struct shash_desc *desc, u8 *out)
 {
-	sha256_base_do_finalize(desc, sha256_generic_block_fn);
-	return sha256_base_finish(desc, out);
+	__sha256_final_direct(shash_desc_ctx(desc),
+			      crypto_shash_digestsize(desc->tfm), out);
+	return 0;
 }
 
+void __sha256_final_direct(struct sha256_state *sctx, unsigned int digest_size,
+			   u8 *out)
+{
+	sha256_do_finalize_direct(sctx, sha256_generic_block_fn);
+	__sha256_base_finish(sctx, digest_size, out);
+}
+EXPORT_SYMBOL(sha256_final_direct);
+
 int crypto_sha256_finup(struct shash_desc *desc, const u8 *data,
 			unsigned int len, u8 *hash)
 {
-	sha256_base_do_update(desc, data, len, sha256_generic_block_fn);
-	return sha256_final(desc, hash);
+	struct sha256_state *sctx = shash_desc_ctx(desc);
+	unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
+
+	sha256_update_direct(sctx, data, len);
+	__sha256_final_direct(sctx, digest_size, hash);
+	return 0;
 }
 EXPORT_SYMBOL(crypto_sha256_finup);
 
diff --git a/include/crypto/sha.h b/include/crypto/sha.h
index c94d3eb1cefd..0df1a0e42c95 100644
--- a/include/crypto/sha.h
+++ b/include/crypto/sha.h
@@ -88,6 +88,30 @@ struct sha512_state {
 	u8 buf[SHA512_BLOCK_SIZE];
 };
 
+static inline void sha256_init_direct(struct sha256_state *sctx)
+{
+	sctx->state[0] = SHA256_H0;
+	sctx->state[1] = SHA256_H1;
+	sctx->state[2] = SHA256_H2;
+	sctx->state[3] = SHA256_H3;
+	sctx->state[4] = SHA256_H4;
+	sctx->state[5] = SHA256_H5;
+	sctx->state[6] = SHA256_H6;
+	sctx->state[7] = SHA256_H7;
+	sctx->count = 0;
+}
+
+extern void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
+				 unsigned int len);
+
+extern void __sha256_final_direct(struct sha256_state *sctx,
+				  unsigned int digest_size, u8 *out);
+
+static inline void sha256_final_direct(struct sha256_state *sctx, u8 *out)
+{
+	__sha256_final_direct(sctx, SHA256_DIGEST_SIZE, out);
+}
+
 struct shash_desc;
 
 extern int crypto_sha1_update(struct shash_desc *desc, const u8 *data,
diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index fc77b8e099a7..9bbe73ce458f 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -37,19 +37,6 @@ static inline int sha224_base_init(struct shash_desc *desc)
 	return 0;
 }
 
-static inline void sha256_init_direct(struct sha256_state *sctx)
-{
-	sctx->state[0] = SHA256_H0;
-	sctx->state[1] = SHA256_H1;
-	sctx->state[2] = SHA256_H2;
-	sctx->state[3] = SHA256_H3;
-	sctx->state[4] = SHA256_H4;
-	sctx->state[5] = SHA256_H5;
-	sctx->state[6] = SHA256_H6;
-	sctx->state[7] = SHA256_H7;
-	sctx->count = 0;
-}
-
 static inline int sha256_base_init(struct shash_desc *desc)
 {
 	sha256_init_direct(shash_desc_ctx(desc));
-- 
2.9.3

^ permalink raw reply related

* [PATCH v2 1/8] crypto/sha256: Factor out the parts of base API that don't use shash_desc
From: Andy Lutomirski @ 2017-01-10 23:24 UTC (permalink / raw)
  To: Daniel Borkmann, Netdev, LKML, Linux Crypto Mailing List
  Cc: Jason A. Donenfeld, Hannes Frederic Sowa, Alexei Starovoitov,
	Eric Dumazet, Eric Biggers, Tom Herbert, David S. Miller,
	Andy Lutomirski, Ard Biesheuvel, Herbert Xu
In-Reply-To: <cover.1484090585.git.luto@kernel.org>

I want to expose a minimal SHA256 API that can be used without the
depending on the crypto core.  To prepare for this, factor out the
meat of the sha256_base_*() helpers.

Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/crypto/sha256_base.h | 53 ++++++++++++++++++++++++++++++--------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index d1f2195bb7de..fc77b8e099a7 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -18,10 +18,8 @@
 typedef void (sha256_block_fn)(struct sha256_state *sst, u8 const *src,
 			       int blocks);
 
-static inline int sha224_base_init(struct shash_desc *desc)
+static inline void sha224_init_direct(struct sha256_state *sctx)
 {
-	struct sha256_state *sctx = shash_desc_ctx(desc);
-
 	sctx->state[0] = SHA224_H0;
 	sctx->state[1] = SHA224_H1;
 	sctx->state[2] = SHA224_H2;
@@ -31,14 +29,16 @@ static inline int sha224_base_init(struct shash_desc *desc)
 	sctx->state[6] = SHA224_H6;
 	sctx->state[7] = SHA224_H7;
 	sctx->count = 0;
+}
 
+static inline int sha224_base_init(struct shash_desc *desc)
+{
+	sha224_init_direct(shash_desc_ctx(desc));
 	return 0;
 }
 
-static inline int sha256_base_init(struct shash_desc *desc)
+static inline void sha256_init_direct(struct sha256_state *sctx)
 {
-	struct sha256_state *sctx = shash_desc_ctx(desc);
-
 	sctx->state[0] = SHA256_H0;
 	sctx->state[1] = SHA256_H1;
 	sctx->state[2] = SHA256_H2;
@@ -48,16 +48,19 @@ static inline int sha256_base_init(struct shash_desc *desc)
 	sctx->state[6] = SHA256_H6;
 	sctx->state[7] = SHA256_H7;
 	sctx->count = 0;
+}
 
+static inline int sha256_base_init(struct shash_desc *desc)
+{
+	sha256_init_direct(shash_desc_ctx(desc));
 	return 0;
 }
 
-static inline int sha256_base_do_update(struct shash_desc *desc,
-					const u8 *data,
-					unsigned int len,
-					sha256_block_fn *block_fn)
+static inline void __sha256_base_do_update(struct sha256_state *sctx,
+					   const u8 *data,
+					   unsigned int len,
+					   sha256_block_fn *block_fn)
 {
-	struct sha256_state *sctx = shash_desc_ctx(desc);
 	unsigned int partial = sctx->count % SHA256_BLOCK_SIZE;
 
 	sctx->count += len;
@@ -86,15 +89,21 @@ static inline int sha256_base_do_update(struct shash_desc *desc,
 	}
 	if (len)
 		memcpy(sctx->buf + partial, data, len);
+}
 
+static inline int sha256_base_do_update(struct shash_desc *desc,
+					const u8 *data,
+					unsigned int len,
+					sha256_block_fn *block_fn)
+{
+	__sha256_base_do_update(shash_desc_ctx(desc), data, len, block_fn);
 	return 0;
 }
 
-static inline int sha256_base_do_finalize(struct shash_desc *desc,
-					  sha256_block_fn *block_fn)
+static inline void sha256_do_finalize_direct(struct sha256_state *sctx,
+					     sha256_block_fn *block_fn)
 {
 	const int bit_offset = SHA256_BLOCK_SIZE - sizeof(__be64);
-	struct sha256_state *sctx = shash_desc_ctx(desc);
 	__be64 *bits = (__be64 *)(sctx->buf + bit_offset);
 	unsigned int partial = sctx->count % SHA256_BLOCK_SIZE;
 
@@ -109,14 +118,18 @@ static inline int sha256_base_do_finalize(struct shash_desc *desc,
 	memset(sctx->buf + partial, 0x0, bit_offset - partial);
 	*bits = cpu_to_be64(sctx->count << 3);
 	block_fn(sctx, sctx->buf, 1);
+}
 
+static inline int sha256_base_do_finalize(struct shash_desc *desc,
+					  sha256_block_fn *block_fn)
+{
+	sha256_do_finalize_direct(shash_desc_ctx(desc), block_fn);
 	return 0;
 }
 
-static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
+static inline void __sha256_base_finish(struct sha256_state *sctx,
+					unsigned int digest_size, u8 *out)
 {
-	unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
-	struct sha256_state *sctx = shash_desc_ctx(desc);
 	__be32 *digest = (__be32 *)out;
 	int i;
 
@@ -124,5 +137,11 @@ static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
 		put_unaligned_be32(sctx->state[i], digest++);
 
 	*sctx = (struct sha256_state){};
+}
+
+static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
+{
+	__sha256_base_finish(shash_desc_ctx(desc),
+			     crypto_shash_digestsize(desc->tfm), out);
 	return 0;
 }
-- 
2.9.3

^ permalink raw reply related

* Re: x86-64: Maintain 16-byte stack alignment
From: Ard Biesheuvel @ 2017-01-10 20:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Herbert Xu, Linux Kernel Mailing List, Linux Crypto Mailing List,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CALCETrW7mXxntZBTTHOgvaRLV4=9hcsNoTKM3o2O5Jod_6RR6Q@mail.gmail.com>

On 10 January 2017 at 19:22, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>>> <ard.biesheuvel@linaro.org> wrote:
>>>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>>> I recently applied the patch
>>>>>
>>>>>         https://patchwork.kernel.org/patch/9468391/
>>>>>
>>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>>> code.  It turned out that the patch changed a manually aligned
>>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>>> which is not actually the case in the kernel.
>>>>>
>>>>
>>>> Apologies for introducing this breakage. It seemed like an obvious and
>>>> simple cleanup, so I didn't even bother to mention it in the commit
>>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>>> we should revert to the old method. If SSE instructions are the only
>>>> ones that require this alignment, then I suppose not having a ABI
>>>> conforming stack pointer should not be an issue in general.
>>>
>>> Here's what I think is really going on.  This is partially from
>>> memory, so I could be off base.  The kernel is up against
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>>> on some GCC versions (like the bad one and maybe even current ones),
>>> things compiled without -mno-sse can't have the stack alignment set
>>> properly.  IMO we should fix this in the affected code, not the entry
>>> code.  In fact, I think that fixing it in the entry code won't even
>>> fully fix it because modern GCC will compile the rest of the kernel
>>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>>> 4.8 and newer).
>>>
>>> Can we just add __attribute__((force_align_arg_pointer)) to the
>>> affected functions?  Maybe have:
>>>
>>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>>
>>> on affected gcc versions?
>>>
>>> ***HOWEVER***
>>>
>>> I think this is missing the tree for the supposed forest.  The actual
>>> affected code appears to be:
>>>
>>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>>>                          struct scatterlist *src, unsigned int nbytes)
>>> {
>>>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>>
>>> ...
>>>
>>>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>>
>>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>>> and optimizes out the roundup.  How about just declaring an actual
>>> __aligned(16) buffer, marking the function
>>> __attribute__((force_align_arg_pointer)), and being done with it?
>>> After all, we need that forcible alignment on *all* gcc versions.
>>>
>>
>> Actually, the breakage is introduced by the patch Herbert refers to
>>
>> https://patchwork.kernel.org/patch/9468391/
>>
>> where the state is replaced by a simple
>>
>> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>>
>> which seemed harmless enough to me. So the code above works fine.
>
> So how about just the one-line patch of adding the
> force_align_arg_pointer?  Would that solve the problem?

If it does what it says on the tin, it should fix the issue, but after
adding the attribute, I get the exact same object output, so there's
something dodgy going on here.

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Andy Lutomirski @ 2017-01-10 19:22 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, Linux Kernel Mailing List, Linux Crypto Mailing List,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CAKv+Gu8XVUXQrWCzgL6WRA=zY=5iTbbJ3Jp2kegCm0R7d6LwGA@mail.gmail.com>

On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>>> I recently applied the patch
>>>>
>>>>         https://patchwork.kernel.org/patch/9468391/
>>>>
>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>> code.  It turned out that the patch changed a manually aligned
>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>> which is not actually the case in the kernel.
>>>>
>>>
>>> Apologies for introducing this breakage. It seemed like an obvious and
>>> simple cleanup, so I didn't even bother to mention it in the commit
>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>> we should revert to the old method. If SSE instructions are the only
>>> ones that require this alignment, then I suppose not having a ABI
>>> conforming stack pointer should not be an issue in general.
>>
>> Here's what I think is really going on.  This is partially from
>> memory, so I could be off base.  The kernel is up against
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>> on some GCC versions (like the bad one and maybe even current ones),
>> things compiled without -mno-sse can't have the stack alignment set
>> properly.  IMO we should fix this in the affected code, not the entry
>> code.  In fact, I think that fixing it in the entry code won't even
>> fully fix it because modern GCC will compile the rest of the kernel
>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>> 4.8 and newer).
>>
>> Can we just add __attribute__((force_align_arg_pointer)) to the
>> affected functions?  Maybe have:
>>
>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>
>> on affected gcc versions?
>>
>> ***HOWEVER***
>>
>> I think this is missing the tree for the supposed forest.  The actual
>> affected code appears to be:
>>
>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>>                          struct scatterlist *src, unsigned int nbytes)
>> {
>>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>
>> ...
>>
>>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>
>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>> and optimizes out the roundup.  How about just declaring an actual
>> __aligned(16) buffer, marking the function
>> __attribute__((force_align_arg_pointer)), and being done with it?
>> After all, we need that forcible alignment on *all* gcc versions.
>>
>
> Actually, the breakage is introduced by the patch Herbert refers to
>
> https://patchwork.kernel.org/patch/9468391/
>
> where the state is replaced by a simple
>
> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>
> which seemed harmless enough to me. So the code above works fine.

So how about just the one-line patch of adding the
force_align_arg_pointer?  Would that solve the problem?

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Ard Biesheuvel @ 2017-01-10 19:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Herbert Xu, Linux Kernel Mailing List, Linux Crypto Mailing List,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CALCETrWGihhtbdkKcSD0kBrx-fWxM_TyUOUafvhsKMYXq5yEEQ@mail.gmail.com>

On 10 January 2017 at 19:00, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>> I recently applied the patch
>>>
>>>         https://patchwork.kernel.org/patch/9468391/
>>>
>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>> code.  It turned out that the patch changed a manually aligned
>>> stack buffer to one that is aligned by gcc.  What was happening was
>>> that gcc can stack align to any value on x86-64 except 16.  The
>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>> which is not actually the case in the kernel.
>>>
>>
>> Apologies for introducing this breakage. It seemed like an obvious and
>> simple cleanup, so I didn't even bother to mention it in the commit
>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>> we should revert to the old method. If SSE instructions are the only
>> ones that require this alignment, then I suppose not having a ABI
>> conforming stack pointer should not be an issue in general.
>
> Here's what I think is really going on.  This is partially from
> memory, so I could be off base.  The kernel is up against
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
> on some GCC versions (like the bad one and maybe even current ones),
> things compiled without -mno-sse can't have the stack alignment set
> properly.  IMO we should fix this in the affected code, not the entry
> code.  In fact, I think that fixing it in the entry code won't even
> fully fix it because modern GCC will compile the rest of the kernel
> with 8-byte alignment and the stack will get randomly unaligned (GCC
> 4.8 and newer).
>
> Can we just add __attribute__((force_align_arg_pointer)) to the
> affected functions?  Maybe have:
>
> #define __USES_SSE __attribute__((force_align_arg_pointer))
>
> on affected gcc versions?
>
> ***HOWEVER***
>
> I think this is missing the tree for the supposed forest.  The actual
> affected code appears to be:
>
> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
>                          struct scatterlist *src, unsigned int nbytes)
> {
>         u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>
> ...
>
>         state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>
> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
> and optimizes out the roundup.  How about just declaring an actual
> __aligned(16) buffer, marking the function
> __attribute__((force_align_arg_pointer)), and being done with it?
> After all, we need that forcible alignment on *all* gcc versions.
>

Actually, the breakage is introduced by the patch Herbert refers to

https://patchwork.kernel.org/patch/9468391/

where the state is replaced by a simple

u32 state[16] __aligned(CHACHA20_STATE_ALIGN);

which seemed harmless enough to me. So the code above works fine.

^ permalink raw reply

* Re: x86-64: Maintain 16-byte stack alignment
From: Andy Lutomirski @ 2017-01-10 19:00 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Herbert Xu, Linux Kernel Mailing List, Linux Crypto Mailing List,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andy Lutomirski
In-Reply-To: <CAKv+Gu8xVrfR=J5b2pGK7R5Tv-M3xZhL_dnTvvM7nTZLLtC-EQ@mail.gmail.com>

On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 10 January 2017 at 14:33, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> I recently applied the patch
>>
>>         https://patchwork.kernel.org/patch/9468391/
>>
>> and ended up with a boot crash when it tried to run the x86 chacha20
>> code.  It turned out that the patch changed a manually aligned
>> stack buffer to one that is aligned by gcc.  What was happening was
>> that gcc can stack align to any value on x86-64 except 16.  The
>> reason is that gcc assumes that the stack is always 16-byte aligned,
>> which is not actually the case in the kernel.
>>
>
> Apologies for introducing this breakage. It seemed like an obvious and
> simple cleanup, so I didn't even bother to mention it in the commit
> log, but if the kernel does not guarantee 16 byte alignment, I guess
> we should revert to the old method. If SSE instructions are the only
> ones that require this alignment, then I suppose not having a ABI
> conforming stack pointer should not be an issue in general.

Here's what I think is really going on.  This is partially from
memory, so I could be off base.  The kernel is up against
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
on some GCC versions (like the bad one and maybe even current ones),
things compiled without -mno-sse can't have the stack alignment set
properly.  IMO we should fix this in the affected code, not the entry
code.  In fact, I think that fixing it in the entry code won't even
fully fix it because modern GCC will compile the rest of the kernel
with 8-byte alignment and the stack will get randomly unaligned (GCC
4.8 and newer).

Can we just add __attribute__((force_align_arg_pointer)) to the
affected functions?  Maybe have:

#define __USES_SSE __attribute__((force_align_arg_pointer))

on affected gcc versions?

***HOWEVER***

I think this is missing the tree for the supposed forest.  The actual
affected code appears to be:

static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
                         struct scatterlist *src, unsigned int nbytes)
{
        u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];

...

        state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);

gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
and optimizes out the roundup.  How about just declaring an actual
__aligned(16) buffer, marking the function
__attribute__((force_align_arg_pointer)), and being done with it?
After all, we need that forcible alignment on *all* gcc versions.

--Andy

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox