* [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
From: Ard Biesheuvel @ 2023-05-31 11:22 UTC
  To: qemu-arm
  Cc: qemu-devel, Ard Biesheuvel, Peter Maydell, Alex Bennée,
	Richard Henderson, Philippe Mathieu-Daudé

Use the host native instructions to implement the AES instructions
exposed by the emulated target. The mapping is not 1:1, so it requires a
bit of fiddling to get the right result.

This is still RFC material - the current approach feels too ad-hoc, but
given the non-1:1 correspondence, doing a proper abstraction is rather
difficult.

Changes since v1/RFC:
- add second patch to implement x86 AES instructions on ARM hosts - this
  helps illustrate what an abstraction should cover.
- use cpuinfo framework to detect host support for AES instructions.
- implement ARM aesimc using x86 aesimc directly

Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
tcrypt benchmark (mode=500).
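
(For reference, tcrypt is driven from the guest along the lines of

    # modprobe tcrypt mode=500 sec=1

where mode=500 selects the asynchronous AES cipher speed tests and the
results land in the kernel log.)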

Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
the fact that ARM uses two instructions to implement a single AES round,
whereas x86 only uses one.
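
As an illustrative aside, one full encryption round per host looks
roughly like this, assuming st and rk are suitable 128-bit vector
variables; note the round key is applied at opposite ends of the round,
so a loop consumes the key schedule at a different point:

    /* x86: one instruction; the round key is XORed in at the end */
    asm("aesenc %1, %0" : "+x"(st) : "x"(rk));

    /* AArch64: two instructions; aese XORs the round key in first */
    asm("   aese  %0.16b, %1.16b    \n" /* AddRoundKey+ShiftRows+SubBytes */
        "   aesmc %0.16b, %0.16b    \n" /* MixColumns */
        : "+w"(st) : "w"(rk));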

Note that using the ARM intrinsics is fiddly with Clang, as it does not
declare the prototypes unless the builtin CPP macro __ARM_FEATURE_AES
is defined, which the compiler sets based on the command line arch/cpu
options. However, setting this globally for a compilation unit is
dubious, given that we test cpuinfo for AES support and only emit the
instructions conditionally. So I used inline asm() instead.
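
As a sketch of the two options (function names are illustrative; the
first variant only compiles when the whole unit is built with something
like -march=armv8-a+crypto):

    #include <arm_neon.h>

    /* prototype only visible when __ARM_FEATURE_AES is defined */
    uint8x16_t one_aese_intrinsic(uint8x16_t st, uint8x16_t rk)
    {
        return vaeseq_u8(st, rk);
    }

    /* inline asm needs no global arch flag; only this insn is emitted */
    uint8x16_t one_aese_asm(uint8x16_t st, uint8x16_t rk)
    {
        asm("   .arch_extension aes         \n"
            "   aese    %0.16b, %1.16b      \n"
            : "+w"(st) : "w"(rk));
        return st;
    }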

As for the design of an abstraction: I imagine we could introduce a
host/aes.h API that implements some building blocks that the TCG helper
implementation could use.

Quoting from my reply to Richard:

Using the primitive operations defined in the AES paper, we basically
perform the following transformation for n rounds of AES (for n in {10,
12, 14}):

for (n-1 rounds) {
  AddRoundKey
  ShiftRows
  SubBytes
  MixColumns
}
AddRoundKey
ShiftRows
SubBytes
AddRoundKey

AddRoundKey is just XOR, but it is incorporated into the instructions
that combine a couple of these steps.
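
In isolation, AddRoundKey is nothing more than:

    /* AddRoundKey: bytewise XOR of the state with the round key,
     * using the CRYPTO_STATE layout from crypto_helper.c */
    static void add_round_key(union CRYPTO_STATE *st,
                              const union CRYPTO_STATE *rk)
    {
        for (int i = 0; i < 16; i++) {
            st->bytes[i] ^= rk->bytes[i];
        }
    }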

So on x86, we have

aesenc:
  ShiftRows
  SubBytes
  MixColumns
  AddRoundKey

aesenclast:
  ShiftRows
  SubBytes
  AddRoundKey

and on ARM we have

aese:
  AddRoundKey
  ShiftRows
  SubBytes

aesmc:
  MixColumns

So a generic routine that does only ShiftRows+SubBytes could be backed by
x86's aesenclast and ARM's aese, using an all-zero round key argument in
each case. Then, it would be up to the TCG helper code for either ARM or
x86 to incorporate those routines in the right way (see the sketch below).
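
A minimal sketch of what such a building block might look like, under a
hypothetical name; both variants rely on the all-zero round key trick:

    /* Hypothetical host/aes.h primitive: SubBytes + ShiftRows only */
    #ifdef __x86_64__
    static inline __m128i aes_sub_shift(__m128i st)
    {
        return _mm_aesenclast_si128(st, (__m128i){}); /* SR+SB+AK(0) */
    }
    #else /* __aarch64__ */
    static inline uint8x16_t aes_sub_shift(uint8x16_t st)
    {
        return vaeseq_u8(st, (uint8x16_t){});         /* AK(0)+SR+SB */
    }
    #endif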

I suppose it really depends on whether there is a third host
architecture that could make use of this, and how its AES instructions
map onto the primitive AES ops above.

Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>

Ard Biesheuvel (2):
  target/arm: use x86 intrinsics to implement AES instructions
  target/i386: Implement AES instructions using AArch64 counterparts

 host/include/aarch64/host/cpuinfo.h |  1 +
 host/include/i386/host/cpuinfo.h    |  1 +
 target/arm/tcg/crypto_helper.c      | 37 ++++++++++-
 target/i386/ops_sse.h               | 69 ++++++++++++++++++++
 util/cpuinfo-aarch64.c              |  1 +
 util/cpuinfo-i386.c                 |  1 +
 6 files changed, 107 insertions(+), 3 deletions(-)

-- 
2.39.2




* [PATCH v2 1/2] target/arm: use x86 intrinsics to implement AES instructions
From: Ard Biesheuvel @ 2023-05-31 11:22 UTC
  To: qemu-arm
  Cc: qemu-devel, Ard Biesheuvel, Peter Maydell, Alex Bennée,
	Richard Henderson, Philippe Mathieu-Daudé

ARM intrinsics for AES deviate from the x86 ones in the way they cover
the different stages of each round, so mapping one onto the other is not
entirely straightforward. However, with a bit of care, we can still use
the x86 instructions to emulate the ARM ones, which makes them constant
time (an important property in crypto) and substantially more efficient.
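
(The core equivalence the hunk below relies on: ARM's aese folds the
round key in first, x86's aesenclast last, so with an all-zero key z

    aese(st, rk)  ==  _mm_aesenclast_si128(st ^ rk, z)

which yields exactly AddRoundKey + ShiftRows + SubBytes.)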

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 host/include/i386/host/cpuinfo.h |  1 +
 target/arm/tcg/crypto_helper.c   | 37 ++++++++++++++++++--
 util/cpuinfo-i386.c              |  1 +
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/host/include/i386/host/cpuinfo.h b/host/include/i386/host/cpuinfo.h
index a6537123cf80ec5b..073d0a426f31487d 100644
--- a/host/include/i386/host/cpuinfo.h
+++ b/host/include/i386/host/cpuinfo.h
@@ -26,6 +26,7 @@
 #define CPUINFO_AVX512VBMI2     (1u << 15)
 #define CPUINFO_ATOMIC_VMOVDQA  (1u << 16)
 #define CPUINFO_ATOMIC_VMOVDQU  (1u << 17)
+#define CPUINFO_AES             (1u << 18)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/target/arm/tcg/crypto_helper.c b/target/arm/tcg/crypto_helper.c
index d28690321f0b86ea..747c061b5a1b0e5e 100644
--- a/target/arm/tcg/crypto_helper.c
+++ b/target/arm/tcg/crypto_helper.c
@@ -18,10 +18,21 @@
 #include "crypto/sm4.h"
 #include "vec_internal.h"
 
+#ifdef __x86_64__
+#include "host/cpuinfo.h"
+#include <wmmintrin.h>
+#define TARGET_AES  __attribute__((__target__("aes")))
+#else
+#define TARGET_AES
+#endif
+
 union CRYPTO_STATE {
     uint8_t    bytes[16];
     uint32_t   words[4];
     uint64_t   l[2];
+#ifdef __x86_64__
+    __m128i    vec;
+#endif
 };
 
 #if HOST_BIG_ENDIAN
@@ -45,8 +56,8 @@ static void clear_tail_16(void *vd, uint32_t desc)
     clear_tail(vd, opr_sz, max_sz);
 }
 
-static void do_crypto_aese(uint64_t *rd, uint64_t *rn,
-                           uint64_t *rm, bool decrypt)
+static void TARGET_AES do_crypto_aese(uint64_t *rd, uint64_t *rn,
+                                      uint64_t *rm, bool decrypt)
 {
     static uint8_t const * const sbox[2] = { AES_sbox, AES_isbox };
     static uint8_t const * const shift[2] = { AES_shifts, AES_ishifts };
@@ -54,6 +65,16 @@ static void do_crypto_aese(uint64_t *rd, uint64_t *rn,
     union CRYPTO_STATE st = { .l = { rn[0], rn[1] } };
     int i;
 
+#ifdef __x86_64__
+    if (cpuinfo & CPUINFO_AES) {
+        __m128i *d = (__m128i *)rd, z = {};
+
+        *d = decrypt ? _mm_aesdeclast_si128(rk.vec ^ st.vec, z)
+                     : _mm_aesenclast_si128(rk.vec ^ st.vec, z);
+        return;
+    }
+#endif
+
     /* xor state vector with round key */
     rk.l[0] ^= st.l[0];
     rk.l[1] ^= st.l[1];
@@ -78,7 +99,7 @@ void HELPER(crypto_aese)(void *vd, void *vn, void *vm, uint32_t desc)
     clear_tail(vd, opr_sz, simd_maxsz(desc));
 }
 
-static void do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
+static void TARGET_AES do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
 {
     static uint32_t const mc[][256] = { {
         /* MixColumns lookup table */
@@ -217,6 +238,16 @@ static void do_crypto_aesmc(uint64_t *rd, uint64_t *rm, bool decrypt)
     union CRYPTO_STATE st = { .l = { rm[0], rm[1] } };
     int i;
 
+#ifdef __x86_64__
+    if (cpuinfo & CPUINFO_AES) {
+        __m128i *d = (__m128i *)rd, z = {};
+
+        *d = decrypt ? _mm_aesimc_si128(st.vec)
+                     : _mm_aesenc_si128(_mm_aesdeclast_si128(st.vec, z), z);
+        return;
+    }
+#endif
+
     for (i = 0; i < 16; i += 4) {
         CR_ST_WORD(st, i >> 2) =
             mc[decrypt][CR_ST_BYTE(st, i)] ^
diff --git a/util/cpuinfo-i386.c b/util/cpuinfo-i386.c
index ab6143d9e77291f1..3043f066c0182dc8 100644
--- a/util/cpuinfo-i386.c
+++ b/util/cpuinfo-i386.c
@@ -39,6 +39,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
         info |= (c & bit_SSE4_1 ? CPUINFO_SSE4 : 0);
         info |= (c & bit_MOVBE ? CPUINFO_MOVBE : 0);
         info |= (c & bit_POPCNT ? CPUINFO_POPCNT : 0);
+        info |= (c & bit_AES ? CPUINFO_AES : 0);
 
         /* For AVX features, we must check available and usable. */
         if ((c & bit_AVX) && (c & bit_OSXSAVE)) {
-- 
2.39.2




* [PATCH v2 2/2] target/i386: Implement AES instructions using AArch64 counterparts
From: Ard Biesheuvel @ 2023-05-31 11:22 UTC
  To: qemu-arm
  Cc: qemu-devel, Ard Biesheuvel, Peter Maydell, Alex Bennée,
	Richard Henderson, Philippe Mathieu-Daudé

When available, use the AArch64 AES instructions to implement the x86
ones. These are not a 1:1 fit, but they are considerably more efficient,
and free of data-dependent timing.

For a typical benchmark (linux tcrypt mode=500), this gives a 2-3x
speedup when running on ThunderX2.
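
(The core equivalence the asm sequences below rely on: with an all-zero
key z,

    x86 aesenc(st, rk)  ==  eor(aesmc(aese(st, z)), rk)

i.e. aese+aesmc supply SubBytes/ShiftRows/MixColumns and a trailing eor
adds the round key; aesenclast simply drops the aesmc step.)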

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 host/include/aarch64/host/cpuinfo.h |  1 +
 target/i386/ops_sse.h               | 69 ++++++++++++++++++++
 util/cpuinfo-aarch64.c              |  1 +
 3 files changed, 71 insertions(+)

diff --git a/host/include/aarch64/host/cpuinfo.h b/host/include/aarch64/host/cpuinfo.h
index 82227890b4b4db03..05feeb4f4369fc19 100644
--- a/host/include/aarch64/host/cpuinfo.h
+++ b/host/include/aarch64/host/cpuinfo.h
@@ -9,6 +9,7 @@
 #define CPUINFO_ALWAYS          (1u << 0)  /* so cpuinfo is nonzero */
 #define CPUINFO_LSE             (1u << 1)
 #define CPUINFO_LSE2            (1u << 2)
+#define CPUINFO_AES             (1u << 3)
 
 /* Initialized with a constructor. */
 extern unsigned cpuinfo;
diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index fb63af7afa21588d..db79132778efd211 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -20,6 +20,11 @@
 
 #include "crypto/aes.h"
 
+#ifdef __aarch64__
+#include "host/cpuinfo.h"
+typedef uint8_t aes_vec_t __attribute__((vector_size(16)));
+#endif
+
 #if SHIFT == 0
 #define Reg MMXReg
 #define XMM_ONLY(...)
@@ -2165,6 +2170,20 @@ void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
     Reg st = *v;
     Reg rk = *s;
 
+#ifdef __aarch64__
+    if (cpuinfo & CPUINFO_AES) {
+        asm("   .arch_extension aes             \n"
+            "   aesd    %0.16b, %1.16b          \n"
+            "   aesimc  %0.16b, %0.16b          \n"
+            "   eor     %0.16b, %0.16b, %2.16b  \n"
+            :   "=w"(*(aes_vec_t *)d)
+            :   "w"((aes_vec_t){}),
+                "w"(*(aes_vec_t *)s),
+                "0"(*(aes_vec_t *)v));
+        return;
+    }
+#endif
+
     for (i = 0 ; i < 2 << SHIFT ; i++) {
         int j = i & 3;
         d->L(i) = rk.L(i) ^ bswap32(AES_Td0[st.B(AES_ishifts[4 * j + 0])] ^
@@ -2180,6 +2199,19 @@ void glue(helper_aesdeclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
     Reg st = *v;
     Reg rk = *s;
 
+#ifdef __aarch64__
+    if (cpuinfo & CPUINFO_AES) {
+        asm("   .arch_extension aes             \n"
+            "   aesd    %0.16b, %1.16b          \n"
+            "   eor     %0.16b, %0.16b, %2.16b  \n"
+            :   "=w"(*(aes_vec_t *)d)
+            :   "w"((aes_vec_t){}),
+                "w"(*(aes_vec_t *)s),
+                "0"(*(aes_vec_t *)v));
+        return;
+    }
+#endif
+
     for (i = 0; i < 8 << SHIFT; i++) {
         d->B(i) = rk.B(i) ^ (AES_isbox[st.B(AES_ishifts[i & 15] + (i & ~15))]);
     }
@@ -2191,6 +2223,20 @@ void glue(helper_aesenc, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
     Reg st = *v;
     Reg rk = *s;
 
+#ifdef __aarch64__
+    if (cpuinfo & CPUINFO_AES) {
+        asm("   .arch_extension aes             \n"
+            "   aese    %0.16b, %1.16b          \n"
+            "   aesmc   %0.16b, %0.16b          \n"
+            "   eor     %0.16b, %0.16b, %2.16b  \n"
+            :   "=w"(*(aes_vec_t *)d)
+            :   "w"((aes_vec_t){}),
+                "w"(*(aes_vec_t *)s),
+                "0"(*(aes_vec_t *)v));
+        return;
+    }
+#endif
+
     for (i = 0 ; i < 2 << SHIFT ; i++) {
         int j = i & 3;
         d->L(i) = rk.L(i) ^ bswap32(AES_Te0[st.B(AES_shifts[4 * j + 0])] ^
@@ -2206,6 +2252,19 @@ void glue(helper_aesenclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
     Reg st = *v;
     Reg rk = *s;
 
+#ifdef __aarch64__
+    if (cpuinfo & CPUINFO_AES) {
+        asm("   .arch_extension aes             \n"
+            "   aese    %0.16b, %1.16b          \n"
+            "   eor     %0.16b, %0.16b, %2.16b  \n"
+            :   "=w"(*(aes_vec_t *)d)
+            :   "w"((aes_vec_t){}),
+                "w"(*(aes_vec_t *)s),
+                "0"(*(aes_vec_t *)v));
+        return;
+    }
+#endif
+
     for (i = 0; i < 8 << SHIFT; i++) {
         d->B(i) = rk.B(i) ^ (AES_sbox[st.B(AES_shifts[i & 15] + (i & ~15))]);
     }
@@ -2217,6 +2276,16 @@ void glue(helper_aesimc, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
     int i;
     Reg tmp = *s;
 
+#ifdef __aarch64__
+    if (cpuinfo & CPUINFO_AES) {
+        asm("   .arch_extension aes             \n"
+            "   aesimc  %0.16b, %1.16b          \n"
+            :   "=w"(*(aes_vec_t *)d)
+            :   "w"(*(aes_vec_t *)s));
+        return;
+    }
+#endif
+
     for (i = 0 ; i < 4 ; i++) {
         d->L(i) = bswap32(AES_imc[tmp.B(4 * i + 0)][0] ^
                           AES_imc[tmp.B(4 * i + 1)][1] ^
diff --git a/util/cpuinfo-aarch64.c b/util/cpuinfo-aarch64.c
index f99acb788454e5ab..769cdfeb2fc32d5e 100644
--- a/util/cpuinfo-aarch64.c
+++ b/util/cpuinfo-aarch64.c
@@ -56,6 +56,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
     unsigned long hwcap = qemu_getauxval(AT_HWCAP);
     info |= (hwcap & HWCAP_ATOMICS ? CPUINFO_LSE : 0);
     info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
+    info |= (hwcap & HWCAP_AES ? CPUINFO_AES : 0);
 #endif
 #ifdef CONFIG_DARWIN
     info |= sysctl_for_bool("hw.optional.arm.FEAT_LSE") * CPUINFO_LSE;
-- 
2.39.2




* Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
From: Richard Henderson @ 2023-05-31 16:33 UTC
  To: Ard Biesheuvel, qemu-arm
  Cc: qemu-devel, Peter Maydell, Alex Bennée,
	Philippe Mathieu-Daudé

On 5/31/23 04:22, Ard Biesheuvel wrote:
> Use the host native instructions to implement the AES instructions
> exposed by the emulated target. The mapping is not 1:1, so it requires a
> bit of fiddling to get the right result.
> 
> This is still RFC material - the current approach feels too ad-hoc, but
> given the non-1:1 correspondence, doing a proper abstraction is rather
> difficult.
> 
> Changes since v1/RFC:
> - add second patch to implement x86 AES instructions on ARM hosts - this
>    helps illustrate what an abstraction should cover.
> - use cpuinfo framework to detect host support for AES instructions.
> - implement ARM aesimc using x86 aesimc directly
> 
> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> tcrypt benchmark (mode=500).
> 
> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> the fact that ARM uses two instructions to implement a single AES round,
> whereas x86 only uses one.

Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and 
could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.

> As for the design of an abstraction: I imagine we could introduce a
> host/aes.h API that implements some building blocks that the TCG helper
> implementation could use.

Indeed.  I was considering interfaces like

/* Perform SubBytes + ShiftRows on state. */
Int128 aesenc_SB_SR(Int128 state);

/* Perform MixColumns on state. */
Int128 aesenc_MC(Int128 state);

/* Perform SubBytes + ShiftRows + MixColumns on state. */
Int128 aesenc_SB_SR_MC(Int128 state);

/* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);

and so forth for aesdec as well.  All but aesenc_MC should be implementable on x86 and 
Power7, and all of them on aarch64.
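
For instance, the last of those might look roughly like this (Int128
standing in for the host vector type, casts and feature tests omitted):

    Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey)
    {
    #if defined(__x86_64__)
        return _mm_aesenc_si128(state, roundkey);   /* exact match */
    #elif defined(__aarch64__)
        /* aese(state, 0) = SB+SR, aesmc = MC, then fold in the key */
        return veorq_u8(vaesmcq_u8(vaeseq_u8(state, (uint8x16_t){})),
                        roundkey);
    #endif
    }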

> I suppose it really depends on whether there is a third host
> architecture that could make use of this, and how its AES instructions
> map onto the primitive AES ops above.

There are Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im}).

What I got hung up on yesterday was understanding the different endian
requirements of x86 vs Power.

ppc64:

     asm("lxvd2x 32,0,%1;"
         "lxvd2x 33,0,%2;"
         "vcipher 0,0,1;"
         "stxvd2x 32,0,%0"
         : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");

ppc64le:

     unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
     asm("lxvd2x 32,0,%1;"
         "lxvd2x 33,0,%2;"
         "lxvd2x 34,0,%3;"
         "vperm 0,0,0,2;"
         "vperm 1,1,1,2;"
         "vcipher 0,0,1;"
         "vperm 0,0,0,2;"
         "stxvd2x 32,0,%0"
         : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");

There are also differences in their AES_Te*-based C routines, which made me
wonder whether we are handling host endianness differences correctly in
emulation right now.  I think I should most definitely add some generic-ish
tests for this...


r~



* Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
From: Ard Biesheuvel @ 2023-05-31 16:47 UTC
  To: Richard Henderson
  Cc: qemu-arm, qemu-devel, Peter Maydell, Alex Bennée,
	Philippe Mathieu-Daudé

On Wed, 31 May 2023 at 18:33, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/31/23 04:22, Ard Biesheuvel wrote:
> > Use the host native instructions to implement the AES instructions
> > exposed by the emulated target. The mapping is not 1:1, so it requires a
> > bit of fiddling to get the right result.
> >
> > This is still RFC material - the current approach feels too ad-hoc, but
> > given the non-1:1 correspondence, doing a proper abstraction is rather
> > difficult.
> >
> > Changes since v1/RFC:
> > - add second patch to implement x86 AES instructions on ARM hosts - this
> >    helps illustrate what an abstraction should cover.
> > - use cpuinfo framework to detect host support for AES instructions.
> > - implement ARM aesimc using x86 aesimc directly
> >
> > Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
> > tcrypt benchmark (mode=500).
> >
> > Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
> > the fact that ARM uses two instructions to implement a single AES round,
> > whereas x86 only uses one.
>
> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>

I don't understand what 'overhead' means in this context. Are you
saying you saw barely any improvement?

> > As for the design of an abstraction: I imagine we could introduce a
> > host/aes.h API that implements some building blocks that the TCG helper
> > implementation could use.
>
> Indeed.  I was considering interfaces like
>
> /* Perform SubBytes + ShiftRows on state. */
> Int128 aesenc_SB_SR(Int128 state);
>
> /* Perform MixColumns on state. */
> Int128 aesenc_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns on state. */
> Int128 aesenc_SB_SR_MC(Int128 state);
>
> /* Perform SubBytes + ShiftRows + MixColumns + AddRoundKey. */
> Int128 aesenc_SB_SR_MC_AK(Int128 state, Int128 roundkey);
>
> and so forth for aesdec as well.  All but aesenc_MC should be implementable on x86 and
> Power7, and all of them on aarch64.
>

aesenc_MC() can be implemented on x86 the way I did in patch #1, using
aesdeclast+aesenc.
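
That is, with a zero key z, aesdeclast contributes
InvShiftRows+InvSubBytes, which the SubBytes+ShiftRows half of aesenc
then cancels, leaving only MixColumns:

    /* MixColumns alone on x86, as in patch #1 */
    static inline __m128i aesenc_MC(__m128i st)
    {
        __m128i z = {};
        return _mm_aesenc_si128(_mm_aesdeclast_si128(st, z), z);
    }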


> > I suppose it really depends on whether there is a third host
> > architecture that could make use of this, and how its AES instructions
> > map onto the primitive AES ops above.
>
> There are Power6 (v{,n}cipher{,last}) and RISC-V Zkn (aes64{es,esm,ds,dsm,im}).
>
> What I got hung up on yesterday was understanding the different endian
> requirements of x86 vs Power.
>
> ppc64:
>
>      asm("lxvd2x 32,0,%1;"
>          "lxvd2x 33,0,%2;"
>          "vcipher 0,0,1;"
>          "stxvd2x 32,0,%0"
>          : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>
> ppc64le:
>
>      unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>      asm("lxvd2x 32,0,%1;"
>          "lxvd2x 33,0,%2;"
>          "lxvd2x 34,0,%3;"
>          "vperm 0,0,0,2;"
>          "vperm 1,1,1,2;"
>          "vcipher 0,0,1;"
>          "vperm 0,0,0,2;"
>          "stxvd2x 32,0,%0"
>          : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>
> There are also differences in their AES_Te*-based C routines, which made me
> wonder whether we are handling host endianness differences correctly in
> emulation right now.  I think I should most definitely add some generic-ish
> tests for this...
>

The above kind of sums it up, no? Or isn't this working code?



* Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
From: Richard Henderson @ 2023-05-31 17:08 UTC
  To: Ard Biesheuvel
  Cc: qemu-arm, qemu-devel, Peter Maydell, Alex Bennée,
	Philippe Mathieu-Daudé

On 5/31/23 09:47, Ard Biesheuvel wrote:
> On Wed, 31 May 2023 at 18:33, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/31/23 04:22, Ard Biesheuvel wrote:
>>> Use the host native instructions to implement the AES instructions
>>> exposed by the emulated target. The mapping is not 1:1, so it requires a
>>> bit of fiddling to get the right result.
>>>
>>> This is still RFC material - the current approach feels too ad-hoc, but
>>> given the non-1:1 correspondence, doing a proper abstraction is rather
>>> difficult.
>>>
>>> Changes since v1/RFC:
>>> - add second patch to implement x86 AES instructions on ARM hosts - this
>>>     helps illustrate what an abstraction should cover.
>>> - use cpuinfo framework to detect host support for AES instructions.
>>> - implement ARM aesimc using x86 aesimc directly
>>>
>>> Patch #1 produces a 1.5-2x speedup in tests using the Linux kernel's
>>> tcrypt benchmark (mode=500).
>>>
>>> Patch #2 produces a 2-3x speedup. The discrepancy is most likely due to
>>> the fact that ARM uses two instructions to implement a single AES round,
>>> whereas x86 only uses one.
>>
>> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>
> 
> I don't understand what 'overhead' means in this context. Are you
> saying you saw barely any improvement?

I saw, without changes, just over 1% of total system emulation time was devoted to aes, 
which gives an upper limit to the runtime improvement possible there.  But I'll have a 
look at tcrypt.
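
(For context: by Amdahl's law, if AES helpers account for p = ~1% of
runtime, the overall speedup from optimizing only them is capped at
1/(1-p), i.e. about 1%, however fast the helpers themselves get.)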

> aesenc_MC() can be implemented on x86 the way I did in patch #1, using
> aesdeclast+aesenc.

Oh, nice.  I have not read the actual patches yet.

>> ppc64:
>>
>>       asm("lxvd2x 32,0,%1;"
>>           "lxvd2x 33,0,%2;"
>>           "vcipher 0,0,1;"
>>           "stxvd2x 32,0,%0"
>>           : : "r"(o), "r"(i), "r"(k) : "memory", "v0", "v1", "v2");
>>
>> ppc64le:
>>
>>       unsigned char le[16] = {8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7};
>>       asm("lxvd2x 32,0,%1;"
>>           "lxvd2x 33,0,%2;"
>>           "lxvd2x 34,0,%3;"
>>           "vperm 0,0,0,2;"
>>           "vperm 1,1,1,2;"
>>           "vcipher 0,0,1;"
>>           "vperm 0,0,0,2;"
>>           "stxvd2x 32,0,%0"
>>           : : "r"(o), "r"(i), "r"(k), "r"(le) : "memory", "v0", "v1", "v2");
>>
>> There are also differences in their AES_Te*-based C routines, which made me
>> wonder whether we are handling host endianness differences correctly in
>> emulation right now.  I think I should most definitely add some generic-ish
>> tests for this...
>>
> 
> The above kind of sums it up, no? Or isn't this working code?

It sums up the problem.  It works to produce the same output as the x86
instructions, with input bytes in the same order.  It shows that we have
to be extra careful when emulating vcipher etc., and should have unit
tests.


r~




* Re: [PATCH v2 2/2] target/i386: Implement AES instructions using AArch64 counterparts
From: Richard Henderson @ 2023-05-31 17:13 UTC
  To: Ard Biesheuvel, qemu-arm
  Cc: qemu-devel, Peter Maydell, Alex Bennée,
	Philippe Mathieu-Daudé

On 5/31/23 04:22, Ard Biesheuvel wrote:
> +++ b/util/cpuinfo-aarch64.c
> @@ -56,6 +56,7 @@ unsigned __attribute__((constructor)) cpuinfo_init(void)
>       unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>       info |= (hwcap & HWCAP_ATOMICS ? CPUINFO_LSE : 0);
>       info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
> +    info |= (hwcap & HWCAP_AES ? CPUINFO_AES : 0);
>   #endif
>   #ifdef CONFIG_DARWIN
>       info |= sysctl_for_bool("hw.optional.arm.FEAT_LSE") * CPUINFO_LSE;

FYI, "hw.optional.arm.FEAT_AES" exists for darwin, and is set for Apple M1.
I'll incorporate that when adding the probing.
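
Presumably that would follow the existing FEAT_LSE probe in the
CONFIG_DARWIN block, i.e. something like:

    info |= sysctl_for_bool("hw.optional.arm.FEAT_AES") * CPUINFO_AES;

though that is a guess at the eventual patch.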


r~



* Re: [PATCH v2 0/2] Implement AES on ARM using x86 instructions and vv
From: Richard Henderson @ 2023-06-01  4:08 UTC
  To: Ard Biesheuvel
  Cc: qemu-arm, qemu-devel, Peter Maydell, Alex Bennée,
	Philippe Mathieu-Daudé

On 5/31/23 10:08, Richard Henderson wrote:
> On 5/31/23 09:47, Ard Biesheuvel wrote:
>> On Wed, 31 May 2023 at 18:33, Richard Henderson
>>> Thanks.  I spent some time yesterday looking at this, with an encrypted disk test case and
>>> could only measure 0.6% and 0.5% for total overhead of decrypt and encrypt respectively.
>>>
>>
>> I don't understand what 'overhead' means in this context. Are you
>> saying you saw barely any improvement?
> 
> I saw, without changes, just over 1% of total system emulation time was devoted to aes, 
> which gives an upper limit to the runtime improvement possible there.  But I'll have a 
> look at tcrypt.

Using

# insmod /lib/modules/5.10.0-21-arm64/kernel/crypto/tcrypt.ko mode=600 sec=10

I see

   25.50%  qemu-system-aar  qemu-system-aarch64      [.] helper_crypto_aese
   25.36%  qemu-system-aar  qemu-system-aarch64      [.] helper_crypto_aesmc
    6.66%  qemu-system-aar  qemu-system-aarch64      [.] rebuild_hflags_a64
    3.25%  qemu-system-aar  qemu-system-aarch64      [.] tb_lookup
    2.52%  qemu-system-aar  qemu-system-aarch64      [.] fp_exception_el
    2.35%  qemu-system-aar  qemu-system-aarch64      [.] helper_lookup_tb_ptr

Obviously a crypto-heavy test, but 51% of runtime is certainly worth more work.


r~

