[PATCH 0/2] tcg/s390x: Fix chacha20-s390

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] tcg/s390x: Fix chacha20-s390
@ 2024-01-17 21:36 Richard Henderson
  2024-01-17 21:36 ` [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns Richard Henderson
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Richard Henderson @ 2024-01-17 21:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-s390x, thuth, david, philmd, mjt

So it turns out the regression exposed by "Optimize env memory operations"
is caused by an s390x host encoding error.  This is the first time that we
have had sufficient register pressure to use more than a few vector
registers at the same time.

As such, the testcase itself is interesting, since nothing else in our
testsuite generates translation blocks with quite so many vector insns
with more than 16 simultaneously live values.

r~

Richard Henderson (2):
  tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns
  tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390

 tests/tcg/s390x/chacha.c        | 341 ++++++++++++
 tcg/s390x/tcg-target.c.inc      |   6 +-
 tests/tcg/s390x/Makefile.target |   4 +
 tests/tcg/s390x/chacha-vx.S     | 914 ++++++++++++++++++++++++++++++++
 4 files changed, 1262 insertions(+), 3 deletions(-)
 create mode 100644 tests/tcg/s390x/chacha.c
 create mode 100644 tests/tcg/s390x/chacha-vx.S

-- 
2.34.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns
  2024-01-17 21:36 [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Richard Henderson
@ 2024-01-17 21:36 ` Richard Henderson
  2024-01-18  6:50   ` Thomas Huth
  2024-01-19 21:54   ` Philippe Mathieu-Daudé
  2024-01-17 21:36 ` [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390 Richard Henderson
  2024-01-18  6:07 ` [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Michael Tokarev
  2 siblings, 2 replies; 10+ messages in thread
From: Richard Henderson @ 2024-01-17 21:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-s390x, thuth, david, philmd, mjt, qemu-stable

While the format names the second vector register 'v3',
it is still in the second position (bits 12-15) and
the argument to RXB must match.

Example error:
 -   e7 00 00 10 2a 33       verllf  %v16,%v0,16
 +   e7 00 00 10 2c 33       verllf  %v16,%v16,16

Cc: qemu-stable@nongnu.org
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Fixes: 22cb37b4172 ("tcg/s390x: Implement vector shift operations")
Fixes: 79cada8693d ("tcg/s390x: Implement tcg_out_dup*_vec")
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index fbee43d3b0..7f6b84aa2c 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -683,7 +683,7 @@ static void tcg_out_insn_VRIc(TCGContext *s, S390Opcode op,
     tcg_debug_assert(is_vector_reg(v3));
     tcg_out16(s, (op & 0xff00) | ((v1 & 0xf) << 4) | (v3 & 0xf));
     tcg_out16(s, i2);
-    tcg_out16(s, (op & 0x00ff) | RXB(v1, 0, v3, 0) | (m4 << 12));
+    tcg_out16(s, (op & 0x00ff) | RXB(v1, v3, 0, 0) | (m4 << 12));
 }
 
 static void tcg_out_insn_VRRa(TCGContext *s, S390Opcode op,
@@ -738,7 +738,7 @@ static void tcg_out_insn_VRSa(TCGContext *s, S390Opcode op, TCGReg v1,
     tcg_debug_assert(is_vector_reg(v3));
     tcg_out16(s, (op & 0xff00) | ((v1 & 0xf) << 4) | (v3 & 0xf));
     tcg_out16(s, b2 << 12 | d2);
-    tcg_out16(s, (op & 0x00ff) | RXB(v1, 0, v3, 0) | (m4 << 12));
+    tcg_out16(s, (op & 0x00ff) | RXB(v1, v3, 0, 0) | (m4 << 12));
 }
 
 static void tcg_out_insn_VRSb(TCGContext *s, S390Opcode op, TCGReg v1,
@@ -762,7 +762,7 @@ static void tcg_out_insn_VRSc(TCGContext *s, S390Opcode op, TCGReg r1,
     tcg_debug_assert(is_vector_reg(v3));
     tcg_out16(s, (op & 0xff00) | (r1 << 4) | (v3 & 0xf));
     tcg_out16(s, b2 << 12 | d2);
-    tcg_out16(s, (op & 0x00ff) | RXB(0, 0, v3, 0) | (m4 << 12));
+    tcg_out16(s, (op & 0x00ff) | RXB(0, v3, 0, 0) | (m4 << 12));
 }
 
 static void tcg_out_insn_VRX(TCGContext *s, S390Opcode op, TCGReg v1,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390
  2024-01-17 21:36 [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Richard Henderson
  2024-01-17 21:36 ` [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns Richard Henderson
@ 2024-01-17 21:36 ` Richard Henderson
  2024-01-17 22:45   ` Richard Henderson
  2024-01-18  7:17   ` Thomas Huth
  2024-01-18  6:07 ` [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Michael Tokarev
  2 siblings, 2 replies; 10+ messages in thread
From: Richard Henderson @ 2024-01-17 21:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-s390x, thuth, david, philmd, mjt

Modify and simplify the driver, as we're really only interested
in correctness of translation of chacha-vx.S.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tests/tcg/s390x/chacha.c        | 341 ++++++++++++
 tests/tcg/s390x/Makefile.target |   4 +
 tests/tcg/s390x/chacha-vx.S     | 914 ++++++++++++++++++++++++++++++++
 3 files changed, 1259 insertions(+)
 create mode 100644 tests/tcg/s390x/chacha.c
 create mode 100644 tests/tcg/s390x/chacha-vx.S

diff --git a/tests/tcg/s390x/chacha.c b/tests/tcg/s390x/chacha.c
new file mode 100644
index 0000000000..ca9e4c1959
--- /dev/null
+++ b/tests/tcg/s390x/chacha.c
@@ -0,0 +1,341 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Derived from linux kernel sources:
+ *   ./include/crypto/chacha.h
+ *   ./crypto/chacha_generic.c
+ *   ./arch/s390/crypto/chacha-glue.c
+ *   ./tools/testing/crypto/chacha20-s390/test-cipher.c
+ *   ./tools/testing/crypto/chacha20-s390/run-tests.sh
+ */
+
+#include <stdlib.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+#include <sys/random.h>
+
+typedef uint8_t u8;
+typedef uint32_t u32;
+typedef uint64_t u64;
+
+static unsigned data_size;
+static bool debug;
+
+#define CHACHA_IV_SIZE          16
+#define CHACHA_KEY_SIZE         32
+#define CHACHA_BLOCK_SIZE       64
+#define CHACHAPOLY_IV_SIZE      12
+#define CHACHA_STATE_WORDS      (CHACHA_BLOCK_SIZE / sizeof(u32))
+
+static u32 rol32(u32 val, u32 sh)
+{
+    return (val << (sh & 31)) | (val >> (-sh & 31));
+}
+
+static u32 get_unaligned_le32(const void *ptr)
+{
+    u32 val;
+    memcpy(&val, ptr, 4);
+    return __builtin_bswap32(val);
+}
+
+static void put_unaligned_le32(u32 val, void *ptr)
+{
+    val = __builtin_bswap32(val);
+    memcpy(ptr, &val, 4);
+}
+
+static void chacha_permute(u32 *x, int nrounds)
+{
+    for (int i = 0; i < nrounds; i += 2) {
+        x[0]  += x[4];    x[12] = rol32(x[12] ^ x[0],  16);
+        x[1]  += x[5];    x[13] = rol32(x[13] ^ x[1],  16);
+        x[2]  += x[6];    x[14] = rol32(x[14] ^ x[2],  16);
+        x[3]  += x[7];    x[15] = rol32(x[15] ^ x[3],  16);
+
+        x[8]  += x[12];   x[4]  = rol32(x[4]  ^ x[8],  12);
+        x[9]  += x[13];   x[5]  = rol32(x[5]  ^ x[9],  12);
+        x[10] += x[14];   x[6]  = rol32(x[6]  ^ x[10], 12);
+        x[11] += x[15];   x[7]  = rol32(x[7]  ^ x[11], 12);
+
+        x[0]  += x[4];    x[12] = rol32(x[12] ^ x[0],   8);
+        x[1]  += x[5];    x[13] = rol32(x[13] ^ x[1],   8);
+        x[2]  += x[6];    x[14] = rol32(x[14] ^ x[2],   8);
+        x[3]  += x[7];    x[15] = rol32(x[15] ^ x[3],   8);
+
+        x[8]  += x[12];   x[4]  = rol32(x[4]  ^ x[8],   7);
+        x[9]  += x[13];   x[5]  = rol32(x[5]  ^ x[9],   7);
+        x[10] += x[14];   x[6]  = rol32(x[6]  ^ x[10],  7);
+        x[11] += x[15];   x[7]  = rol32(x[7]  ^ x[11],  7);
+
+        x[0]  += x[5];    x[15] = rol32(x[15] ^ x[0],  16);
+        x[1]  += x[6];    x[12] = rol32(x[12] ^ x[1],  16);
+        x[2]  += x[7];    x[13] = rol32(x[13] ^ x[2],  16);
+        x[3]  += x[4];    x[14] = rol32(x[14] ^ x[3],  16);
+
+        x[10] += x[15];   x[5]  = rol32(x[5]  ^ x[10], 12);
+        x[11] += x[12];   x[6]  = rol32(x[6]  ^ x[11], 12);
+        x[8]  += x[13];   x[7]  = rol32(x[7]  ^ x[8],  12);
+        x[9]  += x[14];   x[4]  = rol32(x[4]  ^ x[9],  12);
+
+        x[0]  += x[5];    x[15] = rol32(x[15] ^ x[0],   8);
+        x[1]  += x[6];    x[12] = rol32(x[12] ^ x[1],   8);
+        x[2]  += x[7];    x[13] = rol32(x[13] ^ x[2],   8);
+        x[3]  += x[4];    x[14] = rol32(x[14] ^ x[3],   8);
+
+        x[10] += x[15];   x[5]  = rol32(x[5]  ^ x[10],  7);
+        x[11] += x[12];   x[6]  = rol32(x[6]  ^ x[11],  7);
+        x[8]  += x[13];   x[7]  = rol32(x[7]  ^ x[8],   7);
+        x[9]  += x[14];   x[4]  = rol32(x[4]  ^ x[9],   7);
+    }
+}
+
+static void chacha_block_generic(u32 *state, u8 *stream, int nrounds)
+{
+    u32 x[16];
+
+    memcpy(x, state, 64);
+    chacha_permute(x, nrounds);
+
+    for (int i = 0; i < 16; i++) {
+        put_unaligned_le32(x[i] + state[i], &stream[i * sizeof(u32)]);
+    }
+    state[12]++;
+}
+
+static void crypto_xor_cpy(u8 *dst, const u8 *src1,
+                           const u8 *src2, unsigned len)
+{
+    while (len--) {
+        *dst++ = *src1++ ^ *src2++;
+    }
+}
+
+static void chacha_crypt_generic(u32 *state, u8 *dst, const u8 *src,
+                                 unsigned int bytes, int nrounds)
+{
+    u8 stream[CHACHA_BLOCK_SIZE];
+
+    while (bytes >= CHACHA_BLOCK_SIZE) {
+        chacha_block_generic(state, stream, nrounds);
+        crypto_xor_cpy(dst, src, stream, CHACHA_BLOCK_SIZE);
+        bytes -= CHACHA_BLOCK_SIZE;
+        dst += CHACHA_BLOCK_SIZE;
+        src += CHACHA_BLOCK_SIZE;
+    }
+    if (bytes) {
+        chacha_block_generic(state, stream, nrounds);
+        crypto_xor_cpy(dst, src, stream, bytes);
+    }
+}
+
+enum chacha_constants { /* expand 32-byte k */
+    CHACHA_CONSTANT_EXPA = 0x61707865U,
+    CHACHA_CONSTANT_ND_3 = 0x3320646eU,
+    CHACHA_CONSTANT_2_BY = 0x79622d32U,
+    CHACHA_CONSTANT_TE_K = 0x6b206574U
+};
+
+static void chacha_init_generic(u32 *state, const u32 *key, const u8 *iv)
+{
+    state[0]  = CHACHA_CONSTANT_EXPA;
+    state[1]  = CHACHA_CONSTANT_ND_3;
+    state[2]  = CHACHA_CONSTANT_2_BY;
+    state[3]  = CHACHA_CONSTANT_TE_K;
+    state[4]  = key[0];
+    state[5]  = key[1];
+    state[6]  = key[2];
+    state[7]  = key[3];
+    state[8]  = key[4];
+    state[9]  = key[5];
+    state[10] = key[6];
+    state[11] = key[7];
+    state[12] = get_unaligned_le32(iv +  0);
+    state[13] = get_unaligned_le32(iv +  4);
+    state[14] = get_unaligned_le32(iv +  8);
+    state[15] = get_unaligned_le32(iv + 12);
+}
+
+void chacha20_vx(u8 *out, const u8 *inp, size_t len, const u32 *key,
+                 const u32 *counter);
+
+static void chacha20_crypt_s390(u32 *state, u8 *dst, const u8 *src,
+                                unsigned int nbytes, const u32 *key,
+                                u32 *counter)
+{
+    chacha20_vx(dst, src, nbytes, key, counter);
+    *counter += (nbytes + CHACHA_BLOCK_SIZE - 1) / CHACHA_BLOCK_SIZE;
+}
+
+static void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src,
+                              unsigned int bytes, int nrounds)
+{
+    /*
+     * s390 chacha20 implementation has 20 rounds hard-coded,
+     * it cannot handle a block of data or less, but otherwise
+     * it can handle data of arbitrary size
+     */
+    if (bytes <= CHACHA_BLOCK_SIZE || nrounds != 20) {
+        chacha_crypt_generic(state, dst, src, bytes, nrounds);
+    } else {
+        chacha20_crypt_s390(state, dst, src, bytes, &state[4], &state[12]);
+    }
+}
+
+static void print_hex_dump(const char *prefix_str, const void *buf, int len)
+{
+    for (int i = 0; i < len; i += 16) {
+        printf("%s%.8x: ", prefix_str, i);
+        for (int j = 0; j < 16; ++j) {
+            printf("%02x%c", *(u8 *)(buf + i + j), j == 15 ? '\n' : ' ');
+        }
+    }
+}
+
+/* Perform cipher operations with the chacha lib */
+static int test_lib_chacha(u8 *revert, u8 *cipher, u8 *plain, bool generic)
+{
+    u32 chacha_state[CHACHA_STATE_WORDS];
+    u8 iv[16], key[32];
+
+    memset(key, 'X', sizeof(key));
+    memset(iv, 'I', sizeof(iv));
+
+    if (debug) {
+        print_hex_dump("key: ", key, 32);
+        print_hex_dump("iv:  ", iv, 16);
+    }
+
+    /* Encrypt */
+    chacha_init_generic(chacha_state, (u32*)key, iv);
+
+    if (generic) {
+        chacha_crypt_generic(chacha_state, cipher, plain, data_size, 20);
+    } else {
+        chacha_crypt_arch(chacha_state, cipher, plain, data_size, 20);
+    }
+
+    if (debug) {
+        print_hex_dump("encr:", cipher,
+                       (data_size > 64 ? 64 : data_size));
+    }
+
+    /* Decrypt */
+    chacha_init_generic(chacha_state, (u32 *)key, iv);
+
+    if (generic) {
+        chacha_crypt_generic(chacha_state, revert, cipher, data_size, 20);
+    } else {
+        chacha_crypt_arch(chacha_state, revert, cipher, data_size, 20);
+    }
+
+    if (debug) {
+        print_hex_dump("decr:", revert,
+                       (data_size > 64 ? 64 : data_size));
+    }
+    return 0;
+}
+
+static int chacha_s390_test_init(void)
+{
+    u8 *plain = NULL, *revert = NULL;
+    u8 *cipher_generic = NULL, *cipher_s390 = NULL;
+    int ret = -1;
+
+    printf("s390 ChaCha20 test module: size=%d debug=%d\n",
+           data_size, debug);
+
+    /* Allocate and fill buffers */
+    plain = malloc(data_size);
+    if (!plain) {
+        printf("could not allocate plain buffer\n");
+        ret = -2;
+        goto out;
+    }
+
+    memset(plain, 'a', data_size);
+    for (unsigned i = 0, n = data_size > 256 ? 256 : data_size; i < n; ) {
+        ssize_t t = getrandom(plain + i, n - i, 0);
+        if (t < 0) {
+            break;
+        }
+        i -= t;
+    }
+
+    cipher_generic = calloc(1, data_size);
+    if (!cipher_generic) {
+        printf("could not allocate cipher_generic buffer\n");
+        ret = -2;
+        goto out;
+    }
+
+    cipher_s390 = calloc(1, data_size);
+    if (!cipher_s390) {
+        printf("could not allocate cipher_s390 buffer\n");
+        ret = -2;
+        goto out;
+    }
+
+    revert = calloc(1, data_size);
+    if (!revert) {
+        printf("could not allocate revert buffer\n");
+        ret = -2;
+        goto out;
+    }
+
+    if (debug) {
+        print_hex_dump("src: ", plain,
+                       (data_size > 64 ? 64 : data_size));
+    }
+
+    /* Use chacha20 lib */
+    test_lib_chacha(revert, cipher_generic, plain, true);
+    if (memcmp(plain, revert, data_size)) {
+        printf("generic en/decryption check FAILED\n");
+        ret = -2;
+        goto out;
+    }
+    printf("generic en/decryption check OK\n");
+
+    test_lib_chacha(revert, cipher_s390, plain, false);
+    if (memcmp(plain, revert, data_size)) {
+        printf("lib en/decryption check FAILED\n");
+        ret = -2;
+        goto out;
+    }
+    printf("lib en/decryption check OK\n");
+
+    if (memcmp(cipher_generic, cipher_s390, data_size)) {
+        printf("lib vs generic check FAILED\n");
+        ret = -2;
+        goto out;
+    }
+    printf("lib vs generic check OK\n");
+
+    printf("--- chacha20 s390 test end ---\n");
+
+out:
+    free(plain);
+    free(cipher_generic);
+    free(cipher_s390);
+    free(revert);
+    return ret;
+}
+
+int main(int ac, char **av)
+{
+    static const unsigned sizes[] = {
+        63, 64, 65, 127, 128, 129, 511, 512, 513, 4096, 65611,
+        /* too slow for tcg: 6291456, 62914560 */
+    };
+
+    debug = ac >= 2;
+    for (int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i) {
+        data_size = sizes[i];
+        if (chacha_s390_test_init() != -1) {
+            return 1;
+        }
+    }
+    return 0;
+}
diff --git a/tests/tcg/s390x/Makefile.target b/tests/tcg/s390x/Makefile.target
index 30994dcf9c..28f19a3176 100644
--- a/tests/tcg/s390x/Makefile.target
+++ b/tests/tcg/s390x/Makefile.target
@@ -66,9 +66,13 @@ Z13_TESTS+=vcksm
 Z13_TESTS+=vstl
 Z13_TESTS+=vrep
 Z13_TESTS+=precise-smc-user
+Z13_TESTS+=chacha
 $(Z13_TESTS): CFLAGS+=-march=z13 -O2
 TESTS+=$(Z13_TESTS)
 
+chacha: chacha.c chacha-vx.S
+	$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $^ -o $@
+
 ifneq ($(CROSS_CC_HAS_Z14),)
 Z14_TESTS=vfminmax
 vfminmax: LDFLAGS+=-lm
diff --git a/tests/tcg/s390x/chacha-vx.S b/tests/tcg/s390x/chacha-vx.S
new file mode 100644
index 0000000000..eee6275368
--- /dev/null
+++ b/tests/tcg/s390x/chacha-vx.S
@@ -0,0 +1,914 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Original implementation written by Andy Polyakov, @dot-asm.
+ * This is an adaptation of the original code for kernel use.
+ *
+ * Copyright (C) 2006-2019 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
+ *
+ * For qemu testing, drop <asm/vx-insn-asm.h> and assume assembler support.
+ */
+
+#define SP	%r15
+#define FRAME	(16 * 8 + 4 * 8)
+
+	.data
+	.balign	32
+
+sigma:
+	.long	0x61707865,0x3320646e,0x79622d32,0x6b206574	# endian-neutral
+	.long	1,0,0,0
+	.long	2,0,0,0
+	.long	3,0,0,0
+	.long	0x03020100,0x07060504,0x0b0a0908,0x0f0e0d0c	# byte swap
+
+	.long	0,1,2,3
+	.long	0x61707865,0x61707865,0x61707865,0x61707865	# smashed sigma
+	.long	0x3320646e,0x3320646e,0x3320646e,0x3320646e
+	.long	0x79622d32,0x79622d32,0x79622d32,0x79622d32
+	.long	0x6b206574,0x6b206574,0x6b206574,0x6b206574
+
+	.type	sigma, @object
+	.size	sigma, . - sigma
+
+	.previous
+
+	.text
+
+#############################################################################
+# void chacha20_vx_4x(u8 *out, counst u8 *inp, size_t len,
+#		      counst u32 *key, const u32 *counter)
+
+#define	OUT		%r2
+#define	INP		%r3
+#define	LEN		%r4
+#define	KEY		%r5
+#define	COUNTER		%r6
+
+#define BEPERM		%v31
+#define CTR		%v26
+
+#define K0		%v16
+#define K1		%v17
+#define K2		%v18
+#define K3		%v19
+
+#define XA0		%v0
+#define XA1		%v1
+#define XA2		%v2
+#define XA3		%v3
+
+#define XB0		%v4
+#define XB1		%v5
+#define XB2		%v6
+#define XB3		%v7
+
+#define XC0		%v8
+#define XC1		%v9
+#define XC2		%v10
+#define XC3		%v11
+
+#define XD0		%v12
+#define XD1		%v13
+#define XD2		%v14
+#define XD3		%v15
+
+#define XT0		%v27
+#define XT1		%v28
+#define XT2		%v29
+#define XT3		%v30
+
+	.balign	32
+chacha20_vx_4x:
+	stmg	%r6,%r7,6*8(SP)
+
+	larl	%r7,sigma
+	lhi	%r0,10
+	lhi	%r1,0
+
+	vl	K0,0(%r7)		# load sigma
+	vl	K1,0(KEY)		# load key
+	vl	K2,16(KEY)
+	vl	K3,0(COUNTER)		# load counter
+
+	vl	BEPERM,0x40(%r7)
+	vl	CTR,0x50(%r7)
+
+	vlm	XA0,XA3,0x60(%r7),4	# load [smashed] sigma
+
+	vrepf	XB0,K1,0		# smash the key
+	vrepf	XB1,K1,1
+	vrepf	XB2,K1,2
+	vrepf	XB3,K1,3
+
+	vrepf	XD0,K3,0
+	vrepf	XD1,K3,1
+	vrepf	XD2,K3,2
+	vrepf	XD3,K3,3
+	vaf	XD0,XD0,CTR
+
+	vrepf	XC0,K2,0
+	vrepf	XC1,K2,1
+	vrepf	XC2,K2,2
+	vrepf	XC3,K2,3
+
+.Loop_4x:
+	vaf	XA0,XA0,XB0
+	vx	XD0,XD0,XA0
+	verllf	XD0,XD0,16
+
+	vaf	XA1,XA1,XB1
+	vx	XD1,XD1,XA1
+	verllf	XD1,XD1,16
+
+	vaf	XA2,XA2,XB2
+	vx	XD2,XD2,XA2
+	verllf	XD2,XD2,16
+
+	vaf	XA3,XA3,XB3
+	vx	XD3,XD3,XA3
+	verllf	XD3,XD3,16
+
+	vaf	XC0,XC0,XD0
+	vx	XB0,XB0,XC0
+	verllf	XB0,XB0,12
+
+	vaf	XC1,XC1,XD1
+	vx	XB1,XB1,XC1
+	verllf	XB1,XB1,12
+
+	vaf	XC2,XC2,XD2
+	vx	XB2,XB2,XC2
+	verllf	XB2,XB2,12
+
+	vaf	XC3,XC3,XD3
+	vx	XB3,XB3,XC3
+	verllf	XB3,XB3,12
+
+	vaf	XA0,XA0,XB0
+	vx	XD0,XD0,XA0
+	verllf	XD0,XD0,8
+
+	vaf	XA1,XA1,XB1
+	vx	XD1,XD1,XA1
+	verllf	XD1,XD1,8
+
+	vaf	XA2,XA2,XB2
+	vx	XD2,XD2,XA2
+	verllf	XD2,XD2,8
+
+	vaf	XA3,XA3,XB3
+	vx	XD3,XD3,XA3
+	verllf	XD3,XD3,8
+
+	vaf	XC0,XC0,XD0
+	vx	XB0,XB0,XC0
+	verllf	XB0,XB0,7
+
+	vaf	XC1,XC1,XD1
+	vx	XB1,XB1,XC1
+	verllf	XB1,XB1,7
+
+	vaf	XC2,XC2,XD2
+	vx	XB2,XB2,XC2
+	verllf	XB2,XB2,7
+
+	vaf	XC3,XC3,XD3
+	vx	XB3,XB3,XC3
+	verllf	XB3,XB3,7
+
+	vaf	XA0,XA0,XB1
+	vx	XD3,XD3,XA0
+	verllf	XD3,XD3,16
+
+	vaf	XA1,XA1,XB2
+	vx	XD0,XD0,XA1
+	verllf	XD0,XD0,16
+
+	vaf	XA2,XA2,XB3
+	vx	XD1,XD1,XA2
+	verllf	XD1,XD1,16
+
+	vaf	XA3,XA3,XB0
+	vx	XD2,XD2,XA3
+	verllf	XD2,XD2,16
+
+	vaf	XC2,XC2,XD3
+	vx	XB1,XB1,XC2
+	verllf	XB1,XB1,12
+
+	vaf	XC3,XC3,XD0
+	vx	XB2,XB2,XC3
+	verllf	XB2,XB2,12
+
+	vaf	XC0,XC0,XD1
+	vx	XB3,XB3,XC0
+	verllf	XB3,XB3,12
+
+	vaf	XC1,XC1,XD2
+	vx	XB0,XB0,XC1
+	verllf	XB0,XB0,12
+
+	vaf	XA0,XA0,XB1
+	vx	XD3,XD3,XA0
+	verllf	XD3,XD3,8
+
+	vaf	XA1,XA1,XB2
+	vx	XD0,XD0,XA1
+	verllf	XD0,XD0,8
+
+	vaf	XA2,XA2,XB3
+	vx	XD1,XD1,XA2
+	verllf	XD1,XD1,8
+
+	vaf	XA3,XA3,XB0
+	vx	XD2,XD2,XA3
+	verllf	XD2,XD2,8
+
+	vaf	XC2,XC2,XD3
+	vx	XB1,XB1,XC2
+	verllf	XB1,XB1,7
+
+	vaf	XC3,XC3,XD0
+	vx	XB2,XB2,XC3
+	verllf	XB2,XB2,7
+
+	vaf	XC0,XC0,XD1
+	vx	XB3,XB3,XC0
+	verllf	XB3,XB3,7
+
+	vaf	XC1,XC1,XD2
+	vx	XB0,XB0,XC1
+	verllf	XB0,XB0,7
+	brct	%r0,.Loop_4x
+
+	vaf	XD0,XD0,CTR
+
+	vmrhf	XT0,XA0,XA1		# transpose data
+	vmrhf	XT1,XA2,XA3
+	vmrlf	XT2,XA0,XA1
+	vmrlf	XT3,XA2,XA3
+	vpdi	XA0,XT0,XT1,0b0000
+	vpdi	XA1,XT0,XT1,0b0101
+	vpdi	XA2,XT2,XT3,0b0000
+	vpdi	XA3,XT2,XT3,0b0101
+
+	vmrhf	XT0,XB0,XB1
+	vmrhf	XT1,XB2,XB3
+	vmrlf	XT2,XB0,XB1
+	vmrlf	XT3,XB2,XB3
+	vpdi	XB0,XT0,XT1,0b0000
+	vpdi	XB1,XT0,XT1,0b0101
+	vpdi	XB2,XT2,XT3,0b0000
+	vpdi	XB3,XT2,XT3,0b0101
+
+	vmrhf	XT0,XC0,XC1
+	vmrhf	XT1,XC2,XC3
+	vmrlf	XT2,XC0,XC1
+	vmrlf	XT3,XC2,XC3
+	vpdi	XC0,XT0,XT1,0b0000
+	vpdi	XC1,XT0,XT1,0b0101
+	vpdi	XC2,XT2,XT3,0b0000
+	vpdi	XC3,XT2,XT3,0b0101
+
+	vmrhf	XT0,XD0,XD1
+	vmrhf	XT1,XD2,XD3
+	vmrlf	XT2,XD0,XD1
+	vmrlf	XT3,XD2,XD3
+	vpdi	XD0,XT0,XT1,0b0000
+	vpdi	XD1,XT0,XT1,0b0101
+	vpdi	XD2,XT2,XT3,0b0000
+	vpdi	XD3,XT2,XT3,0b0101
+
+	vaf	XA0,XA0,K0
+	vaf	XB0,XB0,K1
+	vaf	XC0,XC0,K2
+	vaf	XD0,XD0,K3
+
+	vperm	XA0,XA0,XA0,BEPERM
+	vperm	XB0,XB0,XB0,BEPERM
+	vperm	XC0,XC0,XC0,BEPERM
+	vperm	XD0,XD0,XD0,BEPERM
+
+	vlm	XT0,XT3,0(INP),0
+
+	vx	XT0,XT0,XA0
+	vx	XT1,XT1,XB0
+	vx	XT2,XT2,XC0
+	vx	XT3,XT3,XD0
+
+	vstm	XT0,XT3,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+
+	vaf	XA0,XA1,K0
+	vaf	XB0,XB1,K1
+	vaf	XC0,XC1,K2
+	vaf	XD0,XD1,K3
+
+	vperm	XA0,XA0,XA0,BEPERM
+	vperm	XB0,XB0,XB0,BEPERM
+	vperm	XC0,XC0,XC0,BEPERM
+	vperm	XD0,XD0,XD0,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_4x
+
+	vlm	XT0,XT3,0(INP),0
+
+	vx	XT0,XT0,XA0
+	vx	XT1,XT1,XB0
+	vx	XT2,XT2,XC0
+	vx	XT3,XT3,XD0
+
+	vstm	XT0,XT3,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_4x
+
+	vaf	XA0,XA2,K0
+	vaf	XB0,XB2,K1
+	vaf	XC0,XC2,K2
+	vaf	XD0,XD2,K3
+
+	vperm	XA0,XA0,XA0,BEPERM
+	vperm	XB0,XB0,XB0,BEPERM
+	vperm	XC0,XC0,XC0,BEPERM
+	vperm	XD0,XD0,XD0,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_4x
+
+	vlm	XT0,XT3,0(INP),0
+
+	vx	XT0,XT0,XA0
+	vx	XT1,XT1,XB0
+	vx	XT2,XT2,XC0
+	vx	XT3,XT3,XD0
+
+	vstm	XT0,XT3,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_4x
+
+	vaf	XA0,XA3,K0
+	vaf	XB0,XB3,K1
+	vaf	XC0,XC3,K2
+	vaf	XD0,XD3,K3
+
+	vperm	XA0,XA0,XA0,BEPERM
+	vperm	XB0,XB0,XB0,BEPERM
+	vperm	XC0,XC0,XC0,BEPERM
+	vperm	XD0,XD0,XD0,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_4x
+
+	vlm	XT0,XT3,0(INP),0
+
+	vx	XT0,XT0,XA0
+	vx	XT1,XT1,XB0
+	vx	XT2,XT2,XC0
+	vx	XT3,XT3,XD0
+
+	vstm	XT0,XT3,0(OUT),0
+
+.Ldone_4x:
+	lmg	%r6,%r7,6*8(SP)
+	br	%r14
+
+.Ltail_4x: 
+	vlr	XT0,XC0
+	vlr	XT1,XD0
+
+	vst	XA0,8*8+0x00(SP)
+	vst	XB0,8*8+0x10(SP)
+	vst	XT0,8*8+0x20(SP)
+	vst	XT1,8*8+0x30(SP)
+
+	lghi	%r1,0
+
+.Loop_tail_4x:
+	llgc	%r5,0(%r1,INP)
+	llgc	%r6,8*8(%r1,SP)
+	xr	%r6,%r5
+	stc	%r6,0(%r1,OUT)
+	la	%r1,1(%r1)
+	brct	LEN,.Loop_tail_4x
+
+	lmg	%r6,%r7,6*8(SP)
+	br	%r14
+
+	.type	chacha20_vx_4x, @function
+	.size	chacha20_vx_4x, . - chacha20_vx_4x
+
+#undef	OUT
+#undef	INP
+#undef	LEN
+#undef	KEY
+#undef	COUNTER
+
+#undef BEPERM
+
+#undef K0
+#undef K1
+#undef K2
+#undef K3
+
+
+#############################################################################
+# void chacha20_vx(u8 *out, counst u8 *inp, size_t len,
+#		   counst u32 *key, const u32 *counter)
+
+#define	OUT		%r2
+#define	INP		%r3
+#define	LEN		%r4
+#define	KEY		%r5
+#define	COUNTER		%r6
+
+#define BEPERM		%v31
+
+#define K0		%v27
+#define K1		%v24
+#define K2		%v25
+#define K3		%v26
+
+#define A0		%v0
+#define B0		%v1
+#define C0		%v2
+#define D0		%v3
+
+#define A1		%v4
+#define B1		%v5
+#define C1		%v6
+#define D1		%v7
+
+#define A2		%v8
+#define B2		%v9
+#define C2		%v10
+#define D2		%v11
+
+#define A3		%v12
+#define B3		%v13
+#define C3		%v14
+#define D3		%v15
+
+#define A4		%v16
+#define B4		%v17
+#define C4		%v18
+#define D4		%v19
+
+#define A5		%v20
+#define B5		%v21
+#define C5		%v22
+#define D5		%v23
+
+#define T0		%v27
+#define T1		%v28
+#define T2		%v29
+#define T3		%v30
+
+	.balign	32
+chacha20_vx:
+	clgfi	LEN,256
+	jle	chacha20_vx_4x
+	stmg	%r6,%r7,6*8(SP)
+
+	lghi	%r1,-FRAME
+	lgr	%r0,SP
+	la	SP,0(%r1,SP)
+	stg	%r0,0(SP)		# back-chain
+
+	larl	%r7,sigma
+	lhi	%r0,10
+
+	vlm	K1,K2,0(KEY),0		# load key
+	vl	K3,0(COUNTER)		# load counter
+
+	vlm	K0,BEPERM,0(%r7),4	# load sigma, increments, ...
+
+.Loop_outer_vx:
+	vlr	A0,K0
+	vlr	B0,K1
+	vlr	A1,K0
+	vlr	B1,K1
+	vlr	A2,K0
+	vlr	B2,K1
+	vlr	A3,K0
+	vlr	B3,K1
+	vlr	A4,K0
+	vlr	B4,K1
+	vlr	A5,K0
+	vlr	B5,K1
+
+	vlr	D0,K3
+	vaf	D1,K3,T1		# K[3]+1
+	vaf	D2,K3,T2		# K[3]+2
+	vaf	D3,K3,T3		# K[3]+3
+	vaf	D4,D2,T2		# K[3]+4
+	vaf	D5,D2,T3		# K[3]+5
+
+	vlr	C0,K2
+	vlr	C1,K2
+	vlr	C2,K2
+	vlr	C3,K2
+	vlr	C4,K2
+	vlr	C5,K2
+
+	vlr	T1,D1
+	vlr	T2,D2
+	vlr	T3,D3
+
+.Loop_vx:
+	vaf	A0,A0,B0
+	vaf	A1,A1,B1
+	vaf	A2,A2,B2
+	vaf	A3,A3,B3
+	vaf	A4,A4,B4
+	vaf	A5,A5,B5
+	vx	D0,D0,A0
+	vx	D1,D1,A1
+	vx	D2,D2,A2
+	vx	D3,D3,A3
+	vx	D4,D4,A4
+	vx	D5,D5,A5
+	verllf	D0,D0,16
+	verllf	D1,D1,16
+	verllf	D2,D2,16
+	verllf	D3,D3,16
+	verllf	D4,D4,16
+	verllf	D5,D5,16
+
+	vaf	C0,C0,D0
+	vaf	C1,C1,D1
+	vaf	C2,C2,D2
+	vaf	C3,C3,D3
+	vaf	C4,C4,D4
+	vaf	C5,C5,D5
+	vx	B0,B0,C0
+	vx	B1,B1,C1
+	vx	B2,B2,C2
+	vx	B3,B3,C3
+	vx	B4,B4,C4
+	vx	B5,B5,C5
+	verllf	B0,B0,12
+	verllf	B1,B1,12
+	verllf	B2,B2,12
+	verllf	B3,B3,12
+	verllf	B4,B4,12
+	verllf	B5,B5,12
+
+	vaf	A0,A0,B0
+	vaf	A1,A1,B1
+	vaf	A2,A2,B2
+	vaf	A3,A3,B3
+	vaf	A4,A4,B4
+	vaf	A5,A5,B5
+	vx	D0,D0,A0
+	vx	D1,D1,A1
+	vx	D2,D2,A2
+	vx	D3,D3,A3
+	vx	D4,D4,A4
+	vx	D5,D5,A5
+	verllf	D0,D0,8
+	verllf	D1,D1,8
+	verllf	D2,D2,8
+	verllf	D3,D3,8
+	verllf	D4,D4,8
+	verllf	D5,D5,8
+
+	vaf	C0,C0,D0
+	vaf	C1,C1,D1
+	vaf	C2,C2,D2
+	vaf	C3,C3,D3
+	vaf	C4,C4,D4
+	vaf	C5,C5,D5
+	vx	B0,B0,C0
+	vx	B1,B1,C1
+	vx	B2,B2,C2
+	vx	B3,B3,C3
+	vx	B4,B4,C4
+	vx	B5,B5,C5
+	verllf	B0,B0,7
+	verllf	B1,B1,7
+	verllf	B2,B2,7
+	verllf	B3,B3,7
+	verllf	B4,B4,7
+	verllf	B5,B5,7
+
+	vsldb	C0,C0,C0,8
+	vsldb	C1,C1,C1,8
+	vsldb	C2,C2,C2,8
+	vsldb	C3,C3,C3,8
+	vsldb	C4,C4,C4,8
+	vsldb	C5,C5,C5,8
+	vsldb	B0,B0,B0,4
+	vsldb	B1,B1,B1,4
+	vsldb	B2,B2,B2,4
+	vsldb	B3,B3,B3,4
+	vsldb	B4,B4,B4,4
+	vsldb	B5,B5,B5,4
+	vsldb	D0,D0,D0,12
+	vsldb	D1,D1,D1,12
+	vsldb	D2,D2,D2,12
+	vsldb	D3,D3,D3,12
+	vsldb	D4,D4,D4,12
+	vsldb	D5,D5,D5,12
+
+	vaf	A0,A0,B0
+	vaf	A1,A1,B1
+	vaf	A2,A2,B2
+	vaf	A3,A3,B3
+	vaf	A4,A4,B4
+	vaf	A5,A5,B5
+	vx	D0,D0,A0
+	vx	D1,D1,A1
+	vx	D2,D2,A2
+	vx	D3,D3,A3
+	vx	D4,D4,A4
+	vx	D5,D5,A5
+	verllf	D0,D0,16
+	verllf	D1,D1,16
+	verllf	D2,D2,16
+	verllf	D3,D3,16
+	verllf	D4,D4,16
+	verllf	D5,D5,16
+
+	vaf	C0,C0,D0
+	vaf	C1,C1,D1
+	vaf	C2,C2,D2
+	vaf	C3,C3,D3
+	vaf	C4,C4,D4
+	vaf	C5,C5,D5
+	vx	B0,B0,C0
+	vx	B1,B1,C1
+	vx	B2,B2,C2
+	vx	B3,B3,C3
+	vx	B4,B4,C4
+	vx	B5,B5,C5
+	verllf	B0,B0,12
+	verllf	B1,B1,12
+	verllf	B2,B2,12
+	verllf	B3,B3,12
+	verllf	B4,B4,12
+	verllf	B5,B5,12
+
+	vaf	A0,A0,B0
+	vaf	A1,A1,B1
+	vaf	A2,A2,B2
+	vaf	A3,A3,B3
+	vaf	A4,A4,B4
+	vaf	A5,A5,B5
+	vx	D0,D0,A0
+	vx	D1,D1,A1
+	vx	D2,D2,A2
+	vx	D3,D3,A3
+	vx	D4,D4,A4
+	vx	D5,D5,A5
+	verllf	D0,D0,8
+	verllf	D1,D1,8
+	verllf	D2,D2,8
+	verllf	D3,D3,8
+	verllf	D4,D4,8
+	verllf	D5,D5,8
+
+	vaf	C0,C0,D0
+	vaf	C1,C1,D1
+	vaf	C2,C2,D2
+	vaf	C3,C3,D3
+	vaf	C4,C4,D4
+	vaf	C5,C5,D5
+	vx	B0,B0,C0
+	vx	B1,B1,C1
+	vx	B2,B2,C2
+	vx	B3,B3,C3
+	vx	B4,B4,C4
+	vx	B5,B5,C5
+	verllf	B0,B0,7
+	verllf	B1,B1,7
+	verllf	B2,B2,7
+	verllf	B3,B3,7
+	verllf	B4,B4,7
+	verllf	B5,B5,7
+
+	vsldb	C0,C0,C0,8
+	vsldb	C1,C1,C1,8
+	vsldb	C2,C2,C2,8
+	vsldb	C3,C3,C3,8
+	vsldb	C4,C4,C4,8
+	vsldb	C5,C5,C5,8
+	vsldb	B0,B0,B0,12
+	vsldb	B1,B1,B1,12
+	vsldb	B2,B2,B2,12
+	vsldb	B3,B3,B3,12
+	vsldb	B4,B4,B4,12
+	vsldb	B5,B5,B5,12
+	vsldb	D0,D0,D0,4
+	vsldb	D1,D1,D1,4
+	vsldb	D2,D2,D2,4
+	vsldb	D3,D3,D3,4
+	vsldb	D4,D4,D4,4
+	vsldb	D5,D5,D5,4
+	brct	%r0,.Loop_vx
+
+	vaf	A0,A0,K0
+	vaf	B0,B0,K1
+	vaf	C0,C0,K2
+	vaf	D0,D0,K3
+	vaf	A1,A1,K0
+	vaf	D1,D1,T1		# +K[3]+1
+
+	vperm	A0,A0,A0,BEPERM
+	vperm	B0,B0,B0,BEPERM
+	vperm	C0,C0,C0,BEPERM
+	vperm	D0,D0,D0,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_vx
+
+	vaf	D2,D2,T2		# +K[3]+2
+	vaf	D3,D3,T3		# +K[3]+3
+	vlm	T0,T3,0(INP),0
+
+	vx	A0,A0,T0
+	vx	B0,B0,T1
+	vx	C0,C0,T2
+	vx	D0,D0,T3
+
+	vlm	K0,T3,0(%r7),4		# re-load sigma and increments
+
+	vstm	A0,D0,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_vx
+
+	vaf	B1,B1,K1
+	vaf	C1,C1,K2
+
+	vperm	A0,A1,A1,BEPERM
+	vperm	B0,B1,B1,BEPERM
+	vperm	C0,C1,C1,BEPERM
+	vperm	D0,D1,D1,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_vx
+
+	vlm	A1,D1,0(INP),0
+
+	vx	A0,A0,A1
+	vx	B0,B0,B1
+	vx	C0,C0,C1
+	vx	D0,D0,D1
+
+	vstm	A0,D0,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_vx
+
+	vaf	A2,A2,K0
+	vaf	B2,B2,K1
+	vaf	C2,C2,K2
+
+	vperm	A0,A2,A2,BEPERM
+	vperm	B0,B2,B2,BEPERM
+	vperm	C0,C2,C2,BEPERM
+	vperm	D0,D2,D2,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_vx
+
+	vlm	A1,D1,0(INP),0
+
+	vx	A0,A0,A1
+	vx	B0,B0,B1
+	vx	C0,C0,C1
+	vx	D0,D0,D1
+
+	vstm	A0,D0,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_vx
+
+	vaf	A3,A3,K0
+	vaf	B3,B3,K1
+	vaf	C3,C3,K2
+	vaf	D2,K3,T3		# K[3]+3
+
+	vperm	A0,A3,A3,BEPERM
+	vperm	B0,B3,B3,BEPERM
+	vperm	C0,C3,C3,BEPERM
+	vperm	D0,D3,D3,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_vx
+
+	vaf	D3,D2,T1		# K[3]+4
+	VLM	A1,D1,0(INP),0
+
+	vx	A0,A0,A1
+	vx	B0,B0,B1
+	vx	C0,C0,C1
+	vx	D0,D0,D1
+
+	vstm	A0,D0,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_vx
+
+	vaf	A4,A4,K0
+	vaf	B4,B4,K1
+	vaf	C4,C4,K2
+	vaf	D4,D4,D3		# +K[3]+4
+	vaf	D3,D3,T1		# K[3]+5
+	vaf	K3,D2,T3		# K[3]+=6
+
+	vperm	A0,A4,A4,BEPERM
+	vperm	B0,B4,B4,BEPERM
+	vperm	C0,C4,C4,BEPERM
+	vperm	D0,D4,D4,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_vx
+
+	vlm	A1,D1,0(INP),0
+
+	vx	A0,A0,A1
+	vx	B0,B0,B1
+	vx	C0,C0,C1
+	vx	D0,D0,D1
+
+	vstm	A0,D0,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	aghi	LEN,-0x40
+	je	.Ldone_vx
+
+	vaf	A5,A5,K0
+	vaf	B5,B5,K1
+	vaf	C5,C5,K2
+	vaf	D5,D5,D3		# +K[3]+5
+
+	vperm	A0,A5,A5,BEPERM
+	vperm	B0,B5,B5,BEPERM
+	vperm	C0,C5,C5,BEPERM
+	vperm	D0,D5,D5,BEPERM
+
+	clgfi	LEN,0x40
+	jl	.Ltail_vx
+
+	vlm	A1,D1,0(INP),0
+
+	vx	A0,A0,A1
+	vx	B0,B0,B1
+	vx	C0,C0,C1
+	vx	D0,D0,D1
+
+	vstm	A0,D0,0(OUT),0
+
+	la	INP,0x40(INP)
+	la	OUT,0x40(OUT)
+	lhi	%r0,10
+	aghi	LEN,-0x40
+	jne	.Loop_outer_vx
+
+.Ldone_vx:
+	lmg	%r6,%r7,FRAME+6*8(SP)
+	la	SP,FRAME(SP)
+	br	%r14
+
+.Ltail_vx:
+	vstm	A0,D0,8*8(SP),3
+	lghi	%r1,0
+
+.Loop_tail_vx:
+	llgc	%r5,0(%r1,INP)
+	llgc	%r6,8*8(%r1,SP)
+	xr	%r6,%r5
+	stc	%r6,0(%r1,OUT)
+	la	%r1,1(%r1)
+	brct	LEN,.Loop_tail_vx
+
+	lmg	%r6,%r7,FRAME+6*8(SP)
+	la	SP,FRAME(SP)
+	br	%r14
+
+	.type	chacha20_vx, @function
+	.size	chacha20_vx, . - chacha20_vx
+	.globl	chacha20_vx
+
+.previous
+.section .note.GNU-stack,"",%progbits
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390
  2024-01-17 21:36 ` [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390 Richard Henderson
@ 2024-01-17 22:45   ` Richard Henderson
  2024-01-18  7:17   ` Thomas Huth
  1 sibling, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2024-01-17 22:45 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-s390x, thuth, david, philmd, mjt

On 1/18/24 08:36, Richard Henderson wrote:
> diff --git a/tests/tcg/s390x/Makefile.target b/tests/tcg/s390x/Makefile.target
> index 30994dcf9c..28f19a3176 100644
> --- a/tests/tcg/s390x/Makefile.target
> +++ b/tests/tcg/s390x/Makefile.target
> @@ -66,9 +66,13 @@ Z13_TESTS+=vcksm
>   Z13_TESTS+=vstl
>   Z13_TESTS+=vrep
>   Z13_TESTS+=precise-smc-user
> +Z13_TESTS+=chacha
>   $(Z13_TESTS): CFLAGS+=-march=z13 -O2
>   TESTS+=$(Z13_TESTS)
>   
> +chacha: chacha.c chacha-vx.S
> +	$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $^ -o $@

Once I started testing with cross-compilers I realize $(LDFLAGS) is needed here for e.g. 
-static.


r~



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] tcg/s390x: Fix chacha20-s390
  2024-01-17 21:36 [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Richard Henderson
  2024-01-17 21:36 ` [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns Richard Henderson
  2024-01-17 21:36 ` [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390 Richard Henderson
@ 2024-01-18  6:07 ` Michael Tokarev
  2024-01-18  7:03   ` Richard Henderson
  2 siblings, 1 reply; 10+ messages in thread
From: Michael Tokarev @ 2024-01-18  6:07 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-s390x, thuth, david, philmd

18.01.2024 00:36, Richard Henderson:
> So it turns out the regression exposed by "Optimize env memory operations"
> is caused by an s390x host encoding error.  This is the first time that we
> have had sufficient register pressure to use more than a few vector
> registers at the same time.
> 
> As such, the testcase itself is interesting, since nothing else in our
> testsuite generates translation blocks with quite so many vector insns
> with more than 16 simultaneously live values.

Tested-by: Michael Tokarev <mjt@tls.msk.ru>

Both changes - the fix and the testsuite.  With several (debian) kernels on
actual s390x hw and on a few other architectures as well.

Why the problem didn't occur on non-s390x *host*?  As I noted in my initial
email, the testcase worked on amd64 host but not on s390x host..

Thank you for the good work Richard!

/mjt


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns
  2024-01-17 21:36 ` [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns Richard Henderson
@ 2024-01-18  6:50   ` Thomas Huth
  2024-01-18  7:04     ` Richard Henderson
  2024-01-19 21:54   ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 10+ messages in thread
From: Thomas Huth @ 2024-01-18  6:50 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-s390x, david, philmd, mjt, qemu-stable

On 17/01/2024 22.36, Richard Henderson wrote:
> While the format names the second vector register 'v3',
> it is still in the second position (bits 12-15) and
> the argument to RXB must match.
> 
> Example error:
>   -   e7 00 00 10 2a 33       verllf  %v16,%v0,16
>   +   e7 00 00 10 2c 33       verllf  %v16,%v16,16
> 
> Cc: qemu-stable@nongnu.org
> Reported-by: Michael Tokarev <mjt@tls.msk.ru>
> Fixes: 22cb37b4172 ("tcg/s390x: Implement vector shift operations")
> Fixes: 79cada8693d ("tcg/s390x: Implement tcg_out_dup*_vec")
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tcg/s390x/tcg-target.c.inc | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
> index fbee43d3b0..7f6b84aa2c 100644
> --- a/tcg/s390x/tcg-target.c.inc
> +++ b/tcg/s390x/tcg-target.c.inc
> @@ -683,7 +683,7 @@ static void tcg_out_insn_VRIc(TCGContext *s, S390Opcode op,
>       tcg_debug_assert(is_vector_reg(v3));
>       tcg_out16(s, (op & 0xff00) | ((v1 & 0xf) << 4) | (v3 & 0xf));
>       tcg_out16(s, i2);
> -    tcg_out16(s, (op & 0x00ff) | RXB(v1, 0, v3, 0) | (m4 << 12));
> +    tcg_out16(s, (op & 0x00ff) | RXB(v1, v3, 0, 0) | (m4 << 12));
>   }
>   
>   static void tcg_out_insn_VRRa(TCGContext *s, S390Opcode op,
> @@ -738,7 +738,7 @@ static void tcg_out_insn_VRSa(TCGContext *s, S390Opcode op, TCGReg v1,
>       tcg_debug_assert(is_vector_reg(v3));
>       tcg_out16(s, (op & 0xff00) | ((v1 & 0xf) << 4) | (v3 & 0xf));
>       tcg_out16(s, b2 << 12 | d2);
> -    tcg_out16(s, (op & 0x00ff) | RXB(v1, 0, v3, 0) | (m4 << 12));
> +    tcg_out16(s, (op & 0x00ff) | RXB(v1, v3, 0, 0) | (m4 << 12));
>   }
>   
>   static void tcg_out_insn_VRSb(TCGContext *s, S390Opcode op, TCGReg v1,
> @@ -762,7 +762,7 @@ static void tcg_out_insn_VRSc(TCGContext *s, S390Opcode op, TCGReg r1,
>       tcg_debug_assert(is_vector_reg(v3));
>       tcg_out16(s, (op & 0xff00) | (r1 << 4) | (v3 & 0xf));
>       tcg_out16(s, b2 << 12 | d2);
> -    tcg_out16(s, (op & 0x00ff) | RXB(0, 0, v3, 0) | (m4 << 12));
> +    tcg_out16(s, (op & 0x00ff) | RXB(0, v3, 0, 0) | (m4 << 12));
>   }
>   
>   static void tcg_out_insn_VRX(TCGContext *s, S390Opcode op, TCGReg v1,

I double-checked the Principles of Operation that VRI-c, VRS-a and VRS-c are 
the only encodings where this could happen, and yes, your modification looks 
right to me:

Reviewed-by: Thomas Huth <thuth@redhat.com>

Do you want to take it through our TCG branch or shall I pick it up for my 
s390x branch?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] tcg/s390x: Fix chacha20-s390
  2024-01-18  6:07 ` [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Michael Tokarev
@ 2024-01-18  7:03   ` Richard Henderson
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2024-01-18  7:03 UTC (permalink / raw)
  To: Michael Tokarev, qemu-devel; +Cc: qemu-s390x, thuth, david, philmd

On 1/18/24 17:07, Michael Tokarev wrote:
> Why the problem didn't occur on non-s390x *host*?  As I noted in my initial
> email, the testcase worked on amd64 host but not on s390x host..

Because the error was in the s390x host tcg backend.


r~



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns
  2024-01-18  6:50   ` Thomas Huth
@ 2024-01-18  7:04     ` Richard Henderson
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2024-01-18  7:04 UTC (permalink / raw)
  To: Thomas Huth, qemu-devel; +Cc: qemu-s390x, david, philmd, mjt, qemu-stable

On 1/18/24 17:50, Thomas Huth wrote:
> Do you want to take it through our TCG branch or shall I pick it up for my s390x branch?

I can take it through tcg.


r~



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390
  2024-01-17 21:36 ` [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390 Richard Henderson
  2024-01-17 22:45   ` Richard Henderson
@ 2024-01-18  7:17   ` Thomas Huth
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Huth @ 2024-01-18  7:17 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-s390x, david, philmd, mjt

On 17/01/2024 22.36, Richard Henderson wrote:
> Modify and simplify the driver, as we're really only interested
> in correctness of translation of chacha-vx.S.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tests/tcg/s390x/chacha.c        | 341 ++++++++++++
>   tests/tcg/s390x/Makefile.target |   4 +
>   tests/tcg/s390x/chacha-vx.S     | 914 ++++++++++++++++++++++++++++++++
>   3 files changed, 1259 insertions(+)
>   create mode 100644 tests/tcg/s390x/chacha.c
>   create mode 100644 tests/tcg/s390x/chacha-vx.S
...
> +	vx	XT0,XT0,XA0
> +	vx	XT1,XT1,XB0
> +	vx	XT2,XT2,XC0
> +	vx	XT3,XT3,XD0
> +
> +	vstm	XT0,XT3,0(OUT),0
> +
> +.Ldone_4x:
> +	lmg	%r6,%r7,6*8(SP)
> +	br	%r14
> +
> +.Ltail_4x:

FWIW, my "git am" complains about a trailing white space here.

Apart from that:
Tested-by: Thomas Huth <thuth@redhat.com>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns
  2024-01-17 21:36 ` [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns Richard Henderson
  2024-01-18  6:50   ` Thomas Huth
@ 2024-01-19 21:54   ` Philippe Mathieu-Daudé
  1 sibling, 0 replies; 10+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-01-19 21:54 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-s390x, thuth, david, mjt, qemu-stable

On 17/1/24 22:36, Richard Henderson wrote:
> While the format names the second vector register 'v3',
> it is still in the second position (bits 12-15) and
> the argument to RXB must match.
> 
> Example error:
>   -   e7 00 00 10 2a 33       verllf  %v16,%v0,16
>   +   e7 00 00 10 2c 33       verllf  %v16,%v16,16
> 
> Cc: qemu-stable@nongnu.org
> Reported-by: Michael Tokarev <mjt@tls.msk.ru>
> Fixes: 22cb37b4172 ("tcg/s390x: Implement vector shift operations")
> Fixes: 79cada8693d ("tcg/s390x: Implement tcg_out_dup*_vec")
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tcg/s390x/tcg-target.c.inc | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
> index fbee43d3b0..7f6b84aa2c 100644
> --- a/tcg/s390x/tcg-target.c.inc
> +++ b/tcg/s390x/tcg-target.c.inc
> @@ -683,7 +683,7 @@ static void tcg_out_insn_VRIc(TCGContext *s, S390Opcode op,
>       tcg_debug_assert(is_vector_reg(v3));
>       tcg_out16(s, (op & 0xff00) | ((v1 & 0xf) << 4) | (v3 & 0xf));
>       tcg_out16(s, i2);
> -    tcg_out16(s, (op & 0x00ff) | RXB(v1, 0, v3, 0) | (m4 << 12));
> +    tcg_out16(s, (op & 0x00ff) | RXB(v1, v3, 0, 0) | (m4 << 12));

🎩 Chapeau.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-01-19 21:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-17 21:36 [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Richard Henderson
2024-01-17 21:36 ` [PATCH 1/2] tcg/s390x: Fix encoding of VRIc, VRSa, VRSc insns Richard Henderson
2024-01-18  6:50   ` Thomas Huth
2024-01-18  7:04     ` Richard Henderson
2024-01-19 21:54   ` Philippe Mathieu-Daudé
2024-01-17 21:36 ` [PATCH 2/2] tests/tcg/s390x: Import linux tools/testing/crypto/chacha20-s390 Richard Henderson
2024-01-17 22:45   ` Richard Henderson
2024-01-18  7:17   ` Thomas Huth
2024-01-18  6:07 ` [PATCH 0/2] tcg/s390x: Fix chacha20-s390 Michael Tokarev
2024-01-18  7:03   ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).