* RE: [PATCH] bpf: fix x86 call stack alignment, add tests
2025-12-19 18:26 [PATCH] bpf: fix x86 call stack alignment, add tests Marat Khalili
@ 2025-12-28 14:16 ` Konstantin Ananyev
2026-01-05 16:09 ` [PATCH v2] " Marat Khalili
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Konstantin Ananyev @ 2025-12-28 14:16 UTC (permalink / raw)
To: Marat Khalili, dev@dpdk.org, Ferruh Yigit; +Cc: stable@dpdk.org
> Correctly align stack pointer on x86 JIT if external calls are present.
>
> Add tests for external calls from BPF program demonstrating the problem:
> * direct verification of a local variable alignment;
> * operations with 128-bit integers;
> * aligned and unaligned SSE2 instructions;
> * memcpy and rte_memcpy (may use vector instructions in their code).
>
> (Such variety is needed because not all of these tests are available or
> reproduce the problem on all targets even when the problem exists.)
Probably worth mentioning in the commit header that by default the
x86_64 ABI expects the SP to be 16B-aligned on function entry.
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Tested-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>
> Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
>
> Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
> ---
> app/test/test_bpf.c | 386 ++++++++++++++++++++++++++++++++++++++++++
> lib/bpf/bpf_jit_x86.c | 15 +-
> 2 files changed, 400 insertions(+), 1 deletion(-)
>
> diff --git a/app/test/test_bpf.c b/app/test/test_bpf.c
> index b7c94ba1c7..e3f6b21a8b 100644
> --- a/app/test/test_bpf.c
> +++ b/app/test/test_bpf.c
> @@ -3280,6 +3280,392 @@ test_bpf(void)
>
> REGISTER_FAST_TEST(bpf_autotest, true, true, test_bpf);
>
> +/* Tests of BPF JIT stack alignment when calling external functions (xfuncs). */
> +
> +/* Function called from the BPF program in a test. */
> +typedef uint64_t (*text_xfunc_t)(uint64_t argument);
> +
> +/* Call function from BPF program, verify that it incremented its argument. */
> +static int
> +call_from_bpf_test(text_xfunc_t xfunc)
> +{
> + static const struct ebpf_insn ins[] = {
> + {
> + .code = (BPF_JMP | EBPF_CALL),
> + .imm = 0, /* xsym #0 */
> + },
> + {
> + .code = (BPF_JMP | EBPF_EXIT),
> + },
> + };
> + const struct rte_bpf_xsym xsym[] = {
> + {
> + .name = "xfunc",
> + .type = RTE_BPF_XTYPE_FUNC,
> + .func = {
> + .val = (void *)xfunc,
> + .nb_args = 1,
> + .args = {
> + {
> + .type = RTE_BPF_ARG_RAW,
> + .size = sizeof(uint64_t),
> + },
> + },
> + .ret = {
> + .type = RTE_BPF_ARG_RAW,
> + .size = sizeof(uint64_t),
> + },
> + },
> + },
> + };
> + const struct rte_bpf_prm prm = {
> + .ins = ins,
> + .nb_ins = RTE_DIM(ins),
> + .xsym = xsym,
> + .nb_xsym = RTE_DIM(xsym),
> + .prog_arg = {
> + .type = RTE_BPF_ARG_RAW,
> + .size = sizeof(uint64_t),
> + },
> + };
> +
> + struct rte_bpf_jit jit;
> +
> + struct rte_bpf *const bpf = rte_bpf_load(&prm);
> + RTE_TEST_ASSERT_NOT_EQUAL(bpf, NULL,
> + "expect rte_bpf_load() != NULL");
> +
> + RTE_TEST_ASSERT_SUCCESS(rte_bpf_get_jit(bpf, &jit),
> + "expect rte_bpf_get_jit() to succeed");
> +
> + const text_xfunc_t jit_function = (void *)jit.func;
> + if (jit_function == NULL) {
> + rte_bpf_destroy(bpf);
> + return TEST_SKIPPED;
> + }
> +
> + const uint64_t argument = 42;
> + const uint64_t result = jit_function(argument);
> + rte_bpf_destroy(bpf);
> +
> + RTE_TEST_ASSERT_EQUAL(result, argument + 1,
> + "expect result == %ju, found %ju",
> + (uintmax_t)(argument + 1), (uintmax_t)result);
> +
> + return TEST_SUCCESS;
> +}
> +
> +/*
> + * Test alignment of a local variable.
> + *
> + * NOTE: May produce false negatives with sanitizers if they replace the stack.
> + */
> +
> +/* Copy of the pointer to max_align stack variable, volatile to thwart optimization. */
> +static volatile uintptr_t stack_alignment_test_pointer;
> +
> +static uint64_t
> +stack_alignment_xfunc(uint64_t argument)
> +{
> + max_align_t max_align;
> + stack_alignment_test_pointer = (uintptr_t)&max_align;
> + return argument + 1;
> +}
> +
> +static int
> +test_stack_alignment(void)
> +{
> + const int test_rc = call_from_bpf_test(stack_alignment_xfunc);
> + if (test_rc == TEST_SKIPPED)
> + return TEST_SKIPPED;
> +
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect call_from_bpf_test(stack_alignment_xfunc) to succeed");
> +
> + const uintptr_t test_offset = stack_alignment_test_pointer;
> + RTE_TEST_ASSERT_NOT_EQUAL(test_offset, 0, "expect test_pointer != 0");
> +
> + const size_t test_alignment = test_offset % alignof(max_align_t);
> + RTE_TEST_ASSERT_EQUAL(test_alignment, 0,
> + "expect test_alignment == 0, found %zu", test_alignment);
> +
> + return TEST_SUCCESS;
> +}
> +
> +REGISTER_FAST_TEST(bpf_stack_alignment_autotest, true, true, test_stack_alignment);
> +
> +/*
> + * Test copying `__uint128_t`.
> + *
> + * This operation is used by some variations of `rte_memcpy`;
> + * it can also be produced by vectorizer in the compiler.
> + */
> +
> +#if defined(__SIZEOF_INT128__)
> +
> +static uint64_t
> +stack_copy_uint128_xfunc(uint64_t argument)
> +{
> + /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
> + char alignas(16) src_buffer[16];
> + char alignas(16) dst_buffer[16];
> + void *const src = (char *volatile)src_buffer;
> + void *const dst = (char *volatile)dst_buffer;
> + const size_t size = 16;
> +
> + memset(src, 0x2a, size);
> + memset(dst, 0x55, size);
> + const int initial_memcmp_rc = memcmp(dst, src, size);
> +
> + const __uint128_t *const src128 = (const __uint128_t *)src;
> + __uint128_t *const dst128 = (__uint128_t *)dst;
> + *dst128 = *src128;
> + const int memcmp_rc = memcmp(dst, src, size);
> +
> + return argument + 1 + !initial_memcmp_rc + memcmp_rc;
> +}
> +
> +static int
> +test_stack_copy_uint128(void)
> +{
> + const int test_rc = call_from_bpf_test(stack_copy_uint128_xfunc);
> + if (test_rc == TEST_SKIPPED)
> + return TEST_SKIPPED;
> +
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect call_from_bpf_test(stack_copy_uint128_xfunc) to succeed");
> +
> + return TEST_SUCCESS;
> +}
> +
> +#else
> +
> +static int
> +test_stack_copy_uint128(void)
> +{
> + return TEST_SKIPPED;
> +}
> +
> +#endif
> +
> +REGISTER_FAST_TEST(bpf_stack_copy_uint128_autotest, true, true, test_stack_copy_uint128);
> +
> +/*
> + * Test SSE2 load and store intrinsics.
> + *
> + * These intrinsics are used by e.g. lib/hash.
> + *
> + * Test both aligned and unaligned versions. Unaligned intrinsics may still fail
> + * when the stack is misaligned, since they only treat memory address as
> + * unaligned, not stack.
> + */
> +
> +#if defined(__SSE2__)
> +
> +static uint64_t
> +stack_sse2_aligned_xfunc(uint64_t argument)
> +{
> + /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
> + char alignas(16) src_buffer[16];
> + char alignas(16) dst_buffer[16];
> + void *const src = (char *volatile)src_buffer;
> + void *const dst = (char *volatile)dst_buffer;
> + const size_t size = 16;
> +
> + memset(src, 0x2a, size);
> + memset(dst, 0x55, size);
> + const int initial_memcmp_rc = memcmp(dst, src, size);
> +
> + const __m128i tmp = _mm_load_si128((const __m128i *)src);
> + _mm_store_si128((__m128i *)dst, tmp);
> + const int memcmp_rc = memcmp(dst, src, size);
> +
> + return argument + 1 + !initial_memcmp_rc + memcmp_rc;
> +}
> +
> +static uint64_t
> +stack_sse2_unaligned_xfunc(uint64_t argument)
> +{
> + /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
> + char alignas(16) src_buffer[17];
> + char alignas(16) dst_buffer[17];
> + void *const src = (char *volatile)src_buffer + 1;
> + void *const dst = (char *volatile)dst_buffer + 1;
> + const size_t size = 16;
> +
> + memset(src, 0x2a, size);
> + memset(dst, 0x55, size);
> + const int initial_memcmp_rc = memcmp(dst, src, size);
> +
> + const __m128i tmp = _mm_loadu_si128((const __m128i *)src);
> + _mm_storeu_si128((__m128i *)dst, tmp);
> + const int memcmp_rc = memcmp(dst, src, size);
> +
> + return argument + 1 + !initial_memcmp_rc + memcmp_rc;
> +}
> +
> +static int
> +test_stack_sse2(void)
> +{
> + int test_rc;
> +
> + test_rc = call_from_bpf_test(stack_sse2_aligned_xfunc);
> + if (test_rc == TEST_SKIPPED)
> + return test_rc;
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect call_from_bpf_test(stack_sse2_aligned_xfunc) to succeed");
> +
> + test_rc = call_from_bpf_test(stack_sse2_unaligned_xfunc);
> + if (test_rc == TEST_SKIPPED)
> + return test_rc;
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect call_from_bpf_test(stack_sse2_unaligned_xfunc) to succeed");
> +
> + return TEST_SUCCESS;
> +}
> +
> +#else
> +
> +static int
> +test_stack_sse2(void)
> +{
> + return TEST_SKIPPED;
> +}
> +
> +#endif
> +
> +REGISTER_FAST_TEST(bpf_stack_sse2_autotest, true, true, test_stack_sse2);
> +
> +/*
> + * Run memcpy and rte_memcpy with various data sizes and offsets (unaligned and aligned).
> + *
> + * May produce false negatives even if BPF breaks stack alignment since
> + * compilers may realign the stack in the beginning of the function to use
> + * vector instructions with width larger than the default stack alignment.
> + * However, represents very important use case that was broken in practice.
> + *
> + * For the reason specified above test 16-byte fixed-width memcpy explicitly.
> + */
> +
> +static void *volatile stack_memcpy_dst;
> +static const void *volatile stack_memcpy_src;
> +static size_t volatile stack_memcpy_size;
> +
> +static uint64_t
> +stack_memcpy16_xfunc(uint64_t argument)
> +{
> + RTE_ASSERT(stack_memcpy_size == 16);
> + memcpy(stack_memcpy_dst, stack_memcpy_src, 16);
> + return argument + 1;
> +}
> +
> +static uint64_t
> +stack_rte_memcpy16_xfunc(uint64_t argument)
> +{
> + RTE_ASSERT(stack_memcpy_size == 16);
> + rte_memcpy(stack_memcpy_dst, stack_memcpy_src, 16);
> + return argument + 1;
> +}
> +
> +static uint64_t
> +stack_memcpy_xfunc(uint64_t argument)
> +{
> + memcpy(stack_memcpy_dst, stack_memcpy_src, stack_memcpy_size);
> + return argument + 1;
> +}
> +
> +static uint64_t
> +stack_rte_memcpy_xfunc(uint64_t argument)
> +{
> + rte_memcpy(stack_memcpy_dst, stack_memcpy_src, stack_memcpy_size);
> + return argument + 1;
> +}
> +
> +static int
> +stack_memcpy_subtest(text_xfunc_t xfunc, size_t size, size_t src_offset, size_t dst_offset)
> +{
> + stack_memcpy_size = size;
> +
> + char *const src_buffer = malloc(size + src_offset);
> + memset(src_buffer + src_offset, 0x2a, size);
> + stack_memcpy_src = src_buffer + src_offset;
> +
> + char *const dst_buffer = malloc(size + dst_offset);
> + memset(dst_buffer + dst_offset, 0x55, size);
> + stack_memcpy_dst = dst_buffer + dst_offset;
> +
> + const int initial_memcmp_rc = memcmp(stack_memcpy_dst, stack_memcpy_src, size);
> + const int test_rc = call_from_bpf_test(xfunc);
> + const int memcmp_rc = memcmp(stack_memcpy_dst, stack_memcpy_src, size);
> +
> + free(dst_buffer);
> + free(src_buffer);
> +
> + if (test_rc == TEST_SKIPPED)
> + return TEST_SKIPPED;
> +
> + RTE_TEST_ASSERT_FAIL(initial_memcmp_rc, "expect memcmp() to fail initially");
> + RTE_TEST_ASSERT_SUCCESS(test_rc, "expect call_from_bpf_test(xfunc) to succeed");
> + RTE_TEST_ASSERT_SUCCESS(memcmp_rc, "expect memcmp() to succeed");
> +
> + return TEST_SUCCESS;
> +}
> +
> +static int
> +test_stack_memcpy(void)
> +{
> + for (int offsets = 0; offsets < 4; ++offsets) {
> + const bool src_offset = offsets & 1;
> + const bool dst_offset = offsets & 2;
> + int test_rc;
> +
> + test_rc = stack_memcpy_subtest(stack_memcpy16_xfunc,
> + 16, src_offset, dst_offset);
> + if (test_rc == TEST_SKIPPED)
> + return test_rc;
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect stack_memcpy_subtest(stack_memcpy16_xfunc, "
> + "16, %i, %i) to succeed",
> + src_offset, dst_offset);
> +
> + test_rc = stack_memcpy_subtest(stack_rte_memcpy16_xfunc,
> + 16, src_offset, dst_offset);
> + if (test_rc == TEST_SKIPPED)
> + return test_rc;
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect stack_memcpy_subtest(stack_rte_memcpy16_xfunc, "
> + "16, %i, %i) to succeed",
> + src_offset, dst_offset);
> +
> + for (size_t size = 1; size <= 1024; size <<= 1) {
> + const bool src_offset = offsets & 1;
> + const bool dst_offset = offsets & 2;
> + int test_rc;
> +
> + test_rc = stack_memcpy_subtest(stack_memcpy_xfunc,
> + size, src_offset, dst_offset);
> + if (test_rc == TEST_SKIPPED)
> + return test_rc;
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect stack_memcpy_subtest(stack_memcpy_xfunc, "
> + "%zu, %i, %i) to succeed",
> + size, src_offset, dst_offset);
> +
> + test_rc = stack_memcpy_subtest(stack_rte_memcpy_xfunc,
> + size, src_offset, dst_offset);
> + if (test_rc == TEST_SKIPPED)
> + return test_rc;
> + RTE_TEST_ASSERT_SUCCESS(test_rc,
> + "expect stack_memcpy_subtest(stack_rte_memcpy_xfunc, "
> + "%zu, %i, %i) to succeed",
> + size, src_offset, dst_offset);
> + }
> + }
> + return TEST_SUCCESS;
> +}
> +
> +REGISTER_FAST_TEST(bpf_stack_memcpy_autotest, true, true, test_stack_memcpy);
> +
> #ifdef TEST_BPF_ELF_LOAD
>
> /*
> diff --git a/lib/bpf/bpf_jit_x86.c b/lib/bpf/bpf_jit_x86.c
> index 4d74e418f8..09865acabb 100644
> --- a/lib/bpf/bpf_jit_x86.c
> +++ b/lib/bpf/bpf_jit_x86.c
> @@ -93,6 +93,8 @@ enum {
> /*
> * callee saved registers list.
> * keep RBP as the last one.
> + * RBP is marked as used every time we have external calls
> + * since we need it to save RSP before stack realignment.
> */
> static const uint32_t save_regs[] = {RBX, R12, R13, R14, R15, RBP};
>
> @@ -685,6 +687,8 @@ emit_call(struct bpf_jit_state *st, uintptr_t trg)
> const uint8_t ops = 0xFF;
> const uint8_t mods = 2;
>
> + /* Mark RBP as used to trigger stack realignment in prolog. */
> + USED(st->reguse, RBP);
> emit_ld_imm64(st, RAX, trg, trg >> 32);
> emit_bytes(st, &ops, sizeof(ops));
> emit_modregrm(st, MOD_DIRECT, mods, RAX);
> @@ -1204,7 +1208,6 @@ emit_prolog(struct bpf_jit_state *st, int32_t stack_size)
> if (spil == 0)
> return;
>
> -
> emit_alu_imm(st, EBPF_ALU64 | BPF_SUB | BPF_K, RSP,
> spil * sizeof(uint64_t));
>
> @@ -1220,6 +1223,16 @@ emit_prolog(struct bpf_jit_state *st, int32_t stack_size)
> if (INUSE(st->reguse, RBP) != 0) {
> emit_mov_reg(st, EBPF_ALU64 | EBPF_MOV | BPF_X, RSP, RBP);
> emit_alu_imm(st, EBPF_ALU64 | BPF_SUB | BPF_K, RSP, stack_size);
> +
> + /*
> + * Align stack pointer appropriately for function calls.
> + * We do not have direct access to function stack alignment
> + * value (like gcc internal macro INCOMING_STACK_BOUNDARY),
> + * but hopefully `alignof(max_align_t)` will always be greater.
> + * Original stack pointer will be restored from rbp.
> + */
> + emit_alu_imm(st, EBPF_ALU64 | BPF_AND | BPF_K, RSP,
> + -(uint32_t)alignof(max_align_t));
> }
> }
>
> --
> 2.43.0
^ permalink raw reply [flat|nested] 7+ messages in thread

* [PATCH v2] bpf: fix x86 call stack alignment, add tests
2025-12-19 18:26 [PATCH] bpf: fix x86 call stack alignment, add tests Marat Khalili
2025-12-28 14:16 ` Konstantin Ananyev
@ 2026-01-05 16:09 ` Marat Khalili
2026-01-21 10:16 ` [PATCH v3] bpf: fix x86 call stack alignment for external calls Marat Khalili
2026-01-14 17:49 ` [PATCH] bpf: fix x86 call stack alignment, add tests Stephen Hemminger
2026-01-14 19:43 ` Stephen Hemminger
3 siblings, 1 reply; 7+ messages in thread
From: Marat Khalili @ 2026-01-05 16:09 UTC (permalink / raw)
To: dev, Konstantin Ananyev, Ferruh Yigit; +Cc: stable
Correctly align the stack pointer in the x86 JIT when external calls
are present.
According to the x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI,
section 3.2.2 The Stack Frame) the stack must be aligned to 16 (or
more) bytes immediately before the call instruction is executed. Once
control has been transferred to the function entry point it is
therefore off by 8 bytes. This means a JIT-compiled BPF function will
always have its stack misaligned for any nested call unless it adjusts
the stack; even if it does use the stack there is still a 50% chance of
it being misaligned, since the JIT allocates stack space in multiples
of 8 bytes.
To solve the issue, mark RBP as used whenever external function calls
are present, and align RSP with an AND instruction at the end of the
prolog. Marking RBP as used triggers saving of the stack pointer in the
prolog and its restoration in the epilog.
Add tests for external calls from BPF program demonstrating the problem:
* direct verification of a local variable alignment;
* operations with 128-bit integers;
* aligned and unaligned SSE2 instructions;
* memcpy and rte_memcpy (may use vector instructions in their code).
(Such variety is needed because not all of these tests are available or
reproduce the problem on all targets even when the problem exists.)
Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Tested-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
v2: updated commit message with background info and API reference
app/test/test_bpf.c | 386 ++++++++++++++++++++++++++++++++++++++++++
lib/bpf/bpf_jit_x86.c | 15 +-
2 files changed, 400 insertions(+), 1 deletion(-)
diff --git a/app/test/test_bpf.c b/app/test/test_bpf.c
index b7c94ba1c7..e3f6b21a8b 100644
--- a/app/test/test_bpf.c
+++ b/app/test/test_bpf.c
@@ -3280,6 +3280,392 @@ test_bpf(void)
REGISTER_FAST_TEST(bpf_autotest, true, true, test_bpf);
+/* Tests of BPF JIT stack alignment when calling external functions (xfuncs). */
+
+/* Function called from the BPF program in a test. */
+typedef uint64_t (*text_xfunc_t)(uint64_t argument);
+
+/* Call function from BPF program, verify that it incremented its argument. */
+static int
+call_from_bpf_test(text_xfunc_t xfunc)
+{
+ static const struct ebpf_insn ins[] = {
+ {
+ .code = (BPF_JMP | EBPF_CALL),
+ .imm = 0, /* xsym #0 */
+ },
+ {
+ .code = (BPF_JMP | EBPF_EXIT),
+ },
+ };
+ const struct rte_bpf_xsym xsym[] = {
+ {
+ .name = "xfunc",
+ .type = RTE_BPF_XTYPE_FUNC,
+ .func = {
+ .val = (void *)xfunc,
+ .nb_args = 1,
+ .args = {
+ {
+ .type = RTE_BPF_ARG_RAW,
+ .size = sizeof(uint64_t),
+ },
+ },
+ .ret = {
+ .type = RTE_BPF_ARG_RAW,
+ .size = sizeof(uint64_t),
+ },
+ },
+ },
+ };
+ const struct rte_bpf_prm prm = {
+ .ins = ins,
+ .nb_ins = RTE_DIM(ins),
+ .xsym = xsym,
+ .nb_xsym = RTE_DIM(xsym),
+ .prog_arg = {
+ .type = RTE_BPF_ARG_RAW,
+ .size = sizeof(uint64_t),
+ },
+ };
+
+ struct rte_bpf_jit jit;
+
+ struct rte_bpf *const bpf = rte_bpf_load(&prm);
+ RTE_TEST_ASSERT_NOT_EQUAL(bpf, NULL,
+ "expect rte_bpf_load() != NULL");
+
+ RTE_TEST_ASSERT_SUCCESS(rte_bpf_get_jit(bpf, &jit),
+ "expect rte_bpf_get_jit() to succeed");
+
+ const text_xfunc_t jit_function = (void *)jit.func;
+ if (jit_function == NULL) {
+ rte_bpf_destroy(bpf);
+ return TEST_SKIPPED;
+ }
+
+ const uint64_t argument = 42;
+ const uint64_t result = jit_function(argument);
+ rte_bpf_destroy(bpf);
+
+ RTE_TEST_ASSERT_EQUAL(result, argument + 1,
+ "expect result == %ju, found %ju",
+ (uintmax_t)(argument + 1), (uintmax_t)result);
+
+ return TEST_SUCCESS;
+}
+
+/*
+ * Test alignment of a local variable.
+ *
+ * NOTE: May produce false negatives with sanitizers if they replace the stack.
+ */
+
+/* Copy of the pointer to max_align stack variable, volatile to thwart optimization. */
+static volatile uintptr_t stack_alignment_test_pointer;
+
+static uint64_t
+stack_alignment_xfunc(uint64_t argument)
+{
+ max_align_t max_align;
+ stack_alignment_test_pointer = (uintptr_t)&max_align;
+ return argument + 1;
+}
+
+static int
+test_stack_alignment(void)
+{
+ const int test_rc = call_from_bpf_test(stack_alignment_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return TEST_SKIPPED;
+
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_alignment_xfunc) to succeed");
+
+ const uintptr_t test_offset = stack_alignment_test_pointer;
+ RTE_TEST_ASSERT_NOT_EQUAL(test_offset, 0, "expect test_pointer != 0");
+
+ const size_t test_alignment = test_offset % alignof(max_align_t);
+ RTE_TEST_ASSERT_EQUAL(test_alignment, 0,
+ "expect test_alignment == 0, found %zu", test_alignment);
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(bpf_stack_alignment_autotest, true, true, test_stack_alignment);
+
+/*
+ * Test copying `__uint128_t`.
+ *
+ * This operation is used by some variations of `rte_memcpy`;
+ * it can also be produced by vectorizer in the compiler.
+ */
+
+#if defined(__SIZEOF_INT128__)
+
+static uint64_t
+stack_copy_uint128_xfunc(uint64_t argument)
+{
+ /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
+ char alignas(16) src_buffer[16];
+ char alignas(16) dst_buffer[16];
+ void *const src = (char *volatile)src_buffer;
+ void *const dst = (char *volatile)dst_buffer;
+ const size_t size = 16;
+
+ memset(src, 0x2a, size);
+ memset(dst, 0x55, size);
+ const int initial_memcmp_rc = memcmp(dst, src, size);
+
+ const __uint128_t *const src128 = (const __uint128_t *)src;
+ __uint128_t *const dst128 = (__uint128_t *)dst;
+ *dst128 = *src128;
+ const int memcmp_rc = memcmp(dst, src, size);
+
+ return argument + 1 + !initial_memcmp_rc + memcmp_rc;
+}
+
+static int
+test_stack_copy_uint128(void)
+{
+ const int test_rc = call_from_bpf_test(stack_copy_uint128_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return TEST_SKIPPED;
+
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_copy_uint128_xfunc) to succeed");
+
+ return TEST_SUCCESS;
+}
+
+#else
+
+static int
+test_stack_copy_uint128(void)
+{
+ return TEST_SKIPPED;
+}
+
+#endif
+
+REGISTER_FAST_TEST(bpf_stack_copy_uint128_autotest, true, true, test_stack_copy_uint128);
+
+/*
+ * Test SSE2 load and store intrinsics.
+ *
+ * These intrinsics are used by e.g. lib/hash.
+ *
+ * Test both aligned and unaligned versions. Unaligned intrinsics may still fail
+ * when the stack is misaligned, since they only treat memory address as
+ * unaligned, not stack.
+ */
+
+#if defined(__SSE2__)
+
+static uint64_t
+stack_sse2_aligned_xfunc(uint64_t argument)
+{
+ /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
+ char alignas(16) src_buffer[16];
+ char alignas(16) dst_buffer[16];
+ void *const src = (char *volatile)src_buffer;
+ void *const dst = (char *volatile)dst_buffer;
+ const size_t size = 16;
+
+ memset(src, 0x2a, size);
+ memset(dst, 0x55, size);
+ const int initial_memcmp_rc = memcmp(dst, src, size);
+
+ const __m128i tmp = _mm_load_si128((const __m128i *)src);
+ _mm_store_si128((__m128i *)dst, tmp);
+ const int memcmp_rc = memcmp(dst, src, size);
+
+ return argument + 1 + !initial_memcmp_rc + memcmp_rc;
+}
+
+static uint64_t
+stack_sse2_unaligned_xfunc(uint64_t argument)
+{
+ /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
+ char alignas(16) src_buffer[17];
+ char alignas(16) dst_buffer[17];
+ void *const src = (char *volatile)src_buffer + 1;
+ void *const dst = (char *volatile)dst_buffer + 1;
+ const size_t size = 16;
+
+ memset(src, 0x2a, size);
+ memset(dst, 0x55, size);
+ const int initial_memcmp_rc = memcmp(dst, src, size);
+
+ const __m128i tmp = _mm_loadu_si128((const __m128i *)src);
+ _mm_storeu_si128((__m128i *)dst, tmp);
+ const int memcmp_rc = memcmp(dst, src, size);
+
+ return argument + 1 + !initial_memcmp_rc + memcmp_rc;
+}
+
+static int
+test_stack_sse2(void)
+{
+ int test_rc;
+
+ test_rc = call_from_bpf_test(stack_sse2_aligned_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_sse2_aligned_xfunc) to succeed");
+
+ test_rc = call_from_bpf_test(stack_sse2_unaligned_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_sse2_unaligned_xfunc) to succeed");
+
+ return TEST_SUCCESS;
+}
+
+#else
+
+static int
+test_stack_sse2(void)
+{
+ return TEST_SKIPPED;
+}
+
+#endif
+
+REGISTER_FAST_TEST(bpf_stack_sse2_autotest, true, true, test_stack_sse2);
+
+/*
+ * Run memcpy and rte_memcpy with various data sizes and offsets (unaligned and aligned).
+ *
+ * May produce false negatives even if BPF breaks stack alignment since
+ * compilers may realign the stack in the beginning of the function to use
+ * vector instructions with width larger than the default stack alignment.
+ * However, represents very important use case that was broken in practice.
+ *
+ * For the reason specified above test 16-byte fixed-width memcpy explicitly.
+ */
+
+static void *volatile stack_memcpy_dst;
+static const void *volatile stack_memcpy_src;
+static size_t volatile stack_memcpy_size;
+
+static uint64_t
+stack_memcpy16_xfunc(uint64_t argument)
+{
+ RTE_ASSERT(stack_memcpy_size == 16);
+ memcpy(stack_memcpy_dst, stack_memcpy_src, 16);
+ return argument + 1;
+}
+
+static uint64_t
+stack_rte_memcpy16_xfunc(uint64_t argument)
+{
+ RTE_ASSERT(stack_memcpy_size == 16);
+ rte_memcpy(stack_memcpy_dst, stack_memcpy_src, 16);
+ return argument + 1;
+}
+
+static uint64_t
+stack_memcpy_xfunc(uint64_t argument)
+{
+ memcpy(stack_memcpy_dst, stack_memcpy_src, stack_memcpy_size);
+ return argument + 1;
+}
+
+static uint64_t
+stack_rte_memcpy_xfunc(uint64_t argument)
+{
+ rte_memcpy(stack_memcpy_dst, stack_memcpy_src, stack_memcpy_size);
+ return argument + 1;
+}
+
+static int
+stack_memcpy_subtest(text_xfunc_t xfunc, size_t size, size_t src_offset, size_t dst_offset)
+{
+ stack_memcpy_size = size;
+
+ char *const src_buffer = malloc(size + src_offset);
+ memset(src_buffer + src_offset, 0x2a, size);
+ stack_memcpy_src = src_buffer + src_offset;
+
+ char *const dst_buffer = malloc(size + dst_offset);
+ memset(dst_buffer + dst_offset, 0x55, size);
+ stack_memcpy_dst = dst_buffer + dst_offset;
+
+ const int initial_memcmp_rc = memcmp(stack_memcpy_dst, stack_memcpy_src, size);
+ const int test_rc = call_from_bpf_test(xfunc);
+ const int memcmp_rc = memcmp(stack_memcpy_dst, stack_memcpy_src, size);
+
+ free(dst_buffer);
+ free(src_buffer);
+
+ if (test_rc == TEST_SKIPPED)
+ return TEST_SKIPPED;
+
+ RTE_TEST_ASSERT_FAIL(initial_memcmp_rc, "expect memcmp() to fail initially");
+ RTE_TEST_ASSERT_SUCCESS(test_rc, "expect call_from_bpf_test(xfunc) to succeed");
+ RTE_TEST_ASSERT_SUCCESS(memcmp_rc, "expect memcmp() to succeed");
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_stack_memcpy(void)
+{
+ for (int offsets = 0; offsets < 4; ++offsets) {
+ const bool src_offset = offsets & 1;
+ const bool dst_offset = offsets & 2;
+ int test_rc;
+
+ test_rc = stack_memcpy_subtest(stack_memcpy16_xfunc,
+ 16, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_memcpy16_xfunc, "
+ "16, %i, %i) to succeed",
+ src_offset, dst_offset);
+
+ test_rc = stack_memcpy_subtest(stack_rte_memcpy16_xfunc,
+ 16, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_rte_memcpy16_xfunc, "
+ "16, %i, %i) to succeed",
+ src_offset, dst_offset);
+
+ for (size_t size = 1; size <= 1024; size <<= 1) {
+ const bool src_offset = offsets & 1;
+ const bool dst_offset = offsets & 2;
+ int test_rc;
+
+ test_rc = stack_memcpy_subtest(stack_memcpy_xfunc,
+ size, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_memcpy_xfunc, "
+ "%zu, %i, %i) to succeed",
+ size, src_offset, dst_offset);
+
+ test_rc = stack_memcpy_subtest(stack_rte_memcpy_xfunc,
+ size, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_rte_memcpy_xfunc, "
+ "%zu, %i, %i) to succeed",
+ size, src_offset, dst_offset);
+ }
+ }
+ return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(bpf_stack_memcpy_autotest, true, true, test_stack_memcpy);
+
#ifdef TEST_BPF_ELF_LOAD
/*
diff --git a/lib/bpf/bpf_jit_x86.c b/lib/bpf/bpf_jit_x86.c
index 4d74e418f8..09865acabb 100644
--- a/lib/bpf/bpf_jit_x86.c
+++ b/lib/bpf/bpf_jit_x86.c
@@ -93,6 +93,8 @@ enum {
/*
* callee saved registers list.
* keep RBP as the last one.
+ * RBP is marked as used every time we have external calls
+ * since we need it to save RSP before stack realignment.
*/
static const uint32_t save_regs[] = {RBX, R12, R13, R14, R15, RBP};
@@ -685,6 +687,8 @@ emit_call(struct bpf_jit_state *st, uintptr_t trg)
const uint8_t ops = 0xFF;
const uint8_t mods = 2;
+ /* Mark RBP as used to trigger stack realignment in prolog. */
+ USED(st->reguse, RBP);
emit_ld_imm64(st, RAX, trg, trg >> 32);
emit_bytes(st, &ops, sizeof(ops));
emit_modregrm(st, MOD_DIRECT, mods, RAX);
@@ -1204,7 +1208,6 @@ emit_prolog(struct bpf_jit_state *st, int32_t stack_size)
if (spil == 0)
return;
-
emit_alu_imm(st, EBPF_ALU64 | BPF_SUB | BPF_K, RSP,
spil * sizeof(uint64_t));
@@ -1220,6 +1223,16 @@ emit_prolog(struct bpf_jit_state *st, int32_t stack_size)
if (INUSE(st->reguse, RBP) != 0) {
emit_mov_reg(st, EBPF_ALU64 | EBPF_MOV | BPF_X, RSP, RBP);
emit_alu_imm(st, EBPF_ALU64 | BPF_SUB | BPF_K, RSP, stack_size);
+
+ /*
+ * Align stack pointer appropriately for function calls.
+ * We do not have direct access to function stack alignment
+ * value (like gcc internal macro INCOMING_STACK_BOUNDARY),
+ * but hopefully `alignof(max_align_t)` will always be greater.
+ * Original stack pointer will be restored from rbp.
+ */
+ emit_alu_imm(st, EBPF_ALU64 | BPF_AND | BPF_K, RSP,
+ -(uint32_t)alignof(max_align_t));
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread

* [PATCH v3] bpf: fix x86 call stack alignment for external calls
2026-01-05 16:09 ` [PATCH v2] " Marat Khalili
@ 2026-01-21 10:16 ` Marat Khalili
2026-02-04 9:35 ` Thomas Monjalon
0 siblings, 1 reply; 7+ messages in thread
From: Marat Khalili @ 2026-01-21 10:16 UTC (permalink / raw)
To: dev, stephen, Konstantin Ananyev, Ferruh Yigit; +Cc: stable
Correctly align the stack pointer in the x86 JIT when external calls
are present.
According to the x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI,
section 3.2.2 The Stack Frame) the stack must be aligned to 16 (or
more) bytes immediately before the call instruction is executed. Once
control has been transferred to the function entry point it is
therefore off by 8 bytes. This means a JIT-compiled BPF function will
always have its stack misaligned for any nested call unless it adjusts
the stack; even if it does use the stack there is still a 50% chance of
it being misaligned, since the JIT allocates stack space in multiples
of 8 bytes.
To solve the issue, mark RBP as used whenever external function calls
are present, and align RSP with an AND instruction at the end of the
prolog. Marking RBP as used triggers saving of the stack pointer in the
prolog and its restoration in the epilog.
Add tests for external calls from BPF program demonstrating the problem:
* direct verification of a local variable alignment;
* operations with 128-bit integers;
* aligned and unaligned SSE2 instructions;
* memcpy and rte_memcpy (may use vector instructions in their code).
(Such variety is needed because not all of these tests are available or
reproduce the problem on all targets even when the problem exists.)
Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
Cc: stable@dpdk.org
Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Tested-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
v3:
* Remove forbidden punctuation (comma) from subject line.
* Ensure a blank line before Signed-off-by tag group.
* Restore unrelated blank line in lib/bpf/bpf_jit_x86.c.
* Fix variable shadowing in app/test/test_bpf.c.
* Add NULL checks for malloc() in app/test/test_bpf.c (ensuring no memory
leak if one allocation fails).
* Update REGISTER_FAST_TEST() calls in app/test/test_bpf.c to use new
constants (NOHUGE_OK, ASAN_OK) instead of old boolean values.
v2:
* Updated commit message with background info and API reference.
app/test/test_bpf.c | 389 ++++++++++++++++++++++++++++++++++++++++++
lib/bpf/bpf_jit_x86.c | 14 ++
2 files changed, 403 insertions(+)
diff --git a/app/test/test_bpf.c b/app/test/test_bpf.c
index a7d56f8d86..a85b241638 100644
--- a/app/test/test_bpf.c
+++ b/app/test/test_bpf.c
@@ -3280,6 +3280,395 @@ test_bpf(void)
REGISTER_FAST_TEST(bpf_autotest, NOHUGE_OK, ASAN_OK, test_bpf);
+/* Tests of BPF JIT stack alignment when calling external functions (xfuncs). */
+
+/* Function called from the BPF program in a test. */
+typedef uint64_t (*text_xfunc_t)(uint64_t argument);
+
+/* Call function from BPF program, verify that it incremented its argument. */
+static int
+call_from_bpf_test(text_xfunc_t xfunc)
+{
+ static const struct ebpf_insn ins[] = {
+ {
+ .code = (BPF_JMP | EBPF_CALL),
+ .imm = 0, /* xsym #0 */
+ },
+ {
+ .code = (BPF_JMP | EBPF_EXIT),
+ },
+ };
+ const struct rte_bpf_xsym xsym[] = {
+ {
+ .name = "xfunc",
+ .type = RTE_BPF_XTYPE_FUNC,
+ .func = {
+ .val = (void *)xfunc,
+ .nb_args = 1,
+ .args = {
+ {
+ .type = RTE_BPF_ARG_RAW,
+ .size = sizeof(uint64_t),
+ },
+ },
+ .ret = {
+ .type = RTE_BPF_ARG_RAW,
+ .size = sizeof(uint64_t),
+ },
+ },
+ },
+ };
+ const struct rte_bpf_prm prm = {
+ .ins = ins,
+ .nb_ins = RTE_DIM(ins),
+ .xsym = xsym,
+ .nb_xsym = RTE_DIM(xsym),
+ .prog_arg = {
+ .type = RTE_BPF_ARG_RAW,
+ .size = sizeof(uint64_t),
+ },
+ };
+
+ struct rte_bpf_jit jit;
+
+ struct rte_bpf *const bpf = rte_bpf_load(&prm);
+ RTE_TEST_ASSERT_NOT_EQUAL(bpf, NULL,
+ "expect rte_bpf_load() != NULL");
+
+ RTE_TEST_ASSERT_SUCCESS(rte_bpf_get_jit(bpf, &jit),
+ "expect rte_bpf_get_jit() to succeed");
+
+ const text_xfunc_t jit_function = (void *)jit.func;
+ if (jit_function == NULL) {
+ rte_bpf_destroy(bpf);
+ return TEST_SKIPPED;
+ }
+
+ const uint64_t argument = 42;
+ const uint64_t result = jit_function(argument);
+ rte_bpf_destroy(bpf);
+
+ RTE_TEST_ASSERT_EQUAL(result, argument + 1,
+ "expect result == %ju, found %ju",
+ (uintmax_t)(argument + 1), (uintmax_t)result);
+
+ return TEST_SUCCESS;
+}
+
+/*
+ * Test alignment of a local variable.
+ *
+ * NOTE: May produce false negatives with sanitizers if they replace the stack.
+ */
+
+/* Copy of the pointer to max_align stack variable, volatile to thwart optimization. */
+static volatile uintptr_t stack_alignment_test_pointer;
+
+static uint64_t
+stack_alignment_xfunc(uint64_t argument)
+{
+ max_align_t max_align;
+ stack_alignment_test_pointer = (uintptr_t)&max_align;
+ return argument + 1;
+}
+
+static int
+test_stack_alignment(void)
+{
+ const int test_rc = call_from_bpf_test(stack_alignment_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return TEST_SKIPPED;
+
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_alignment_xfunc) to succeed");
+
+ const uintptr_t test_offset = stack_alignment_test_pointer;
+ RTE_TEST_ASSERT_NOT_EQUAL(test_offset, 0, "expect test_pointer != 0");
+
+ const size_t test_alignment = test_offset % alignof(max_align_t);
+ RTE_TEST_ASSERT_EQUAL(test_alignment, 0,
+ "expect test_alignment == 0, found %zu", test_alignment);
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(bpf_stack_alignment_autotest, NOHUGE_OK, ASAN_OK, test_stack_alignment);
+
+/*
+ * Test copying `__uint128_t`.
+ *
+ * This operation is used by some variations of `rte_memcpy`;
+ * it can also be produced by vectorizer in the compiler.
+ */
+
+#if defined(__SIZEOF_INT128__)
+
+static uint64_t
+stack_copy_uint128_xfunc(uint64_t argument)
+{
+ /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
+ char alignas(16) src_buffer[16];
+ char alignas(16) dst_buffer[16];
+ void *const src = (char *volatile)src_buffer;
+ void *const dst = (char *volatile)dst_buffer;
+ const size_t size = 16;
+
+ memset(src, 0x2a, size);
+ memset(dst, 0x55, size);
+ const int initial_memcmp_rc = memcmp(dst, src, size);
+
+ const __uint128_t *const src128 = (const __uint128_t *)src;
+ __uint128_t *const dst128 = (__uint128_t *)dst;
+ *dst128 = *src128;
+ const int memcmp_rc = memcmp(dst, src, size);
+
+ return argument + 1 + !initial_memcmp_rc + memcmp_rc;
+}
+
+static int
+test_stack_copy_uint128(void)
+{
+ const int test_rc = call_from_bpf_test(stack_copy_uint128_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return TEST_SKIPPED;
+
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_copy_uint128_xfunc) to succeed");
+
+ return TEST_SUCCESS;
+}
+
+#else
+
+static int
+test_stack_copy_uint128(void)
+{
+ return TEST_SKIPPED;
+}
+
+#endif
+
+REGISTER_FAST_TEST(bpf_stack_copy_uint128_autotest, NOHUGE_OK, ASAN_OK, test_stack_copy_uint128);
+
+/*
+ * Test SSE2 load and store intrinsics.
+ *
+ * These intrinsics are used by e.g. lib/hash.
+ *
+ * Test both aligned and unaligned versions. Unaligned intrinsics may still fail
+ * when the stack is misaligned, since they only treat memory address as
+ * unaligned, not stack.
+ */
+
+#if defined(__SSE2__)
+
+static uint64_t
+stack_sse2_aligned_xfunc(uint64_t argument)
+{
+ /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
+ char alignas(16) src_buffer[16];
+ char alignas(16) dst_buffer[16];
+ void *const src = (char *volatile)src_buffer;
+ void *const dst = (char *volatile)dst_buffer;
+ const size_t size = 16;
+
+ memset(src, 0x2a, size);
+ memset(dst, 0x55, size);
+ const int initial_memcmp_rc = memcmp(dst, src, size);
+
+ const __m128i tmp = _mm_load_si128((const __m128i *)src);
+ _mm_store_si128((__m128i *)dst, tmp);
+ const int memcmp_rc = memcmp(dst, src, size);
+
+ return argument + 1 + !initial_memcmp_rc + memcmp_rc;
+}
+
+static uint64_t
+stack_sse2_unaligned_xfunc(uint64_t argument)
+{
+ /* Pass addresses through volatiles to prevent compiler from optimizing it all out. */
+ char alignas(16) src_buffer[17];
+ char alignas(16) dst_buffer[17];
+ void *const src = (char *volatile)src_buffer + 1;
+ void *const dst = (char *volatile)dst_buffer + 1;
+ const size_t size = 16;
+
+ memset(src, 0x2a, size);
+ memset(dst, 0x55, size);
+ const int initial_memcmp_rc = memcmp(dst, src, size);
+
+ const __m128i tmp = _mm_loadu_si128((const __m128i *)src);
+ _mm_storeu_si128((__m128i *)dst, tmp);
+ const int memcmp_rc = memcmp(dst, src, size);
+
+ return argument + 1 + !initial_memcmp_rc + memcmp_rc;
+}
+
+static int
+test_stack_sse2(void)
+{
+ int test_rc;
+
+ test_rc = call_from_bpf_test(stack_sse2_aligned_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_sse2_aligned_xfunc) to succeed");
+
+ test_rc = call_from_bpf_test(stack_sse2_unaligned_xfunc);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect call_from_bpf_test(stack_sse2_unaligned_xfunc) to succeed");
+
+ return TEST_SUCCESS;
+}
+
+#else
+
+static int
+test_stack_sse2(void)
+{
+ return TEST_SKIPPED;
+}
+
+#endif
+
+REGISTER_FAST_TEST(bpf_stack_sse2_autotest, NOHUGE_OK, ASAN_OK, test_stack_sse2);
+
+/*
+ * Run memcpy and rte_memcpy with various data sizes and offsets (unaligned and aligned).
+ *
+ * May produce false negatives even if BPF breaks stack alignment since
+ * compilers may realign the stack in the beginning of the function to use
+ * vector instructions with width larger than the default stack alignment.
+ * However, represents very important use case that was broken in practice.
+ *
+ * For the reason specified above test 16-byte fixed-width memcpy explicitly.
+ */
+
+static void *volatile stack_memcpy_dst;
+static const void *volatile stack_memcpy_src;
+static size_t volatile stack_memcpy_size;
+
+static uint64_t
+stack_memcpy16_xfunc(uint64_t argument)
+{
+ RTE_ASSERT(stack_memcpy_size == 16);
+ memcpy(stack_memcpy_dst, stack_memcpy_src, 16);
+ return argument + 1;
+}
+
+static uint64_t
+stack_rte_memcpy16_xfunc(uint64_t argument)
+{
+ RTE_ASSERT(stack_memcpy_size == 16);
+ rte_memcpy(stack_memcpy_dst, stack_memcpy_src, 16);
+ return argument + 1;
+}
+
+static uint64_t
+stack_memcpy_xfunc(uint64_t argument)
+{
+ memcpy(stack_memcpy_dst, stack_memcpy_src, stack_memcpy_size);
+ return argument + 1;
+}
+
+static uint64_t
+stack_rte_memcpy_xfunc(uint64_t argument)
+{
+ rte_memcpy(stack_memcpy_dst, stack_memcpy_src, stack_memcpy_size);
+ return argument + 1;
+}
+
+static int
+stack_memcpy_subtest(text_xfunc_t xfunc, size_t size, size_t src_offset, size_t dst_offset)
+{
+ stack_memcpy_size = size;
+
+ char *const src_buffer = malloc(size + src_offset);
+ char *const dst_buffer = malloc(size + dst_offset);
+
+ if (src_buffer == NULL || dst_buffer == NULL) {
+ free(dst_buffer);
+ free(src_buffer);
+ return TEST_FAILED;
+ }
+
+ memset(src_buffer + src_offset, 0x2a, size);
+ stack_memcpy_src = src_buffer + src_offset;
+
+ memset(dst_buffer + dst_offset, 0x55, size);
+ stack_memcpy_dst = dst_buffer + dst_offset;
+
+ const int initial_memcmp_rc = memcmp(stack_memcpy_dst, stack_memcpy_src, size);
+ const int test_rc = call_from_bpf_test(xfunc);
+ const int memcmp_rc = memcmp(stack_memcpy_dst, stack_memcpy_src, size);
+
+ free(dst_buffer);
+ free(src_buffer);
+
+ if (test_rc == TEST_SKIPPED)
+ return TEST_SKIPPED;
+
+ RTE_TEST_ASSERT_FAIL(initial_memcmp_rc, "expect memcmp() to fail initially");
+ RTE_TEST_ASSERT_SUCCESS(test_rc, "expect call_from_bpf_test(xfunc) to succeed");
+ RTE_TEST_ASSERT_SUCCESS(memcmp_rc, "expect memcmp() to succeed");
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_stack_memcpy(void)
+{
+ for (int offsets = 0; offsets < 4; ++offsets) {
+ const bool src_offset = offsets & 1;
+ const bool dst_offset = offsets & 2;
+ int test_rc;
+
+ test_rc = stack_memcpy_subtest(stack_memcpy16_xfunc,
+ 16, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_memcpy16_xfunc, "
+ "16, %i, %i) to succeed",
+ src_offset, dst_offset);
+
+ test_rc = stack_memcpy_subtest(stack_rte_memcpy16_xfunc,
+ 16, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_rte_memcpy16_xfunc, "
+ "16, %i, %i) to succeed",
+ src_offset, dst_offset);
+
+ for (size_t size = 1; size <= 1024; size <<= 1) {
+ test_rc = stack_memcpy_subtest(stack_memcpy_xfunc,
+ size, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_memcpy_xfunc, "
+ "%zu, %i, %i) to succeed",
+ size, src_offset, dst_offset);
+
+ test_rc = stack_memcpy_subtest(stack_rte_memcpy_xfunc,
+ size, src_offset, dst_offset);
+ if (test_rc == TEST_SKIPPED)
+ return test_rc;
+ RTE_TEST_ASSERT_SUCCESS(test_rc,
+ "expect stack_memcpy_subtest(stack_rte_memcpy_xfunc, "
+ "%zu, %i, %i) to succeed",
+ size, src_offset, dst_offset);
+ }
+ }
+ return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(bpf_stack_memcpy_autotest, NOHUGE_OK, ASAN_OK, test_stack_memcpy);
+
#ifdef TEST_BPF_ELF_LOAD
/*
diff --git a/lib/bpf/bpf_jit_x86.c b/lib/bpf/bpf_jit_x86.c
index 4d74e418f8..e05fd38b0e 100644
--- a/lib/bpf/bpf_jit_x86.c
+++ b/lib/bpf/bpf_jit_x86.c
@@ -93,6 +93,8 @@ enum {
/*
* callee saved registers list.
* keep RBP as the last one.
+ * RBP is marked as used every time we have external calls
+ * since we need it to save RSP before stack realignment.
*/
static const uint32_t save_regs[] = {RBX, R12, R13, R14, R15, RBP};
@@ -685,6 +687,8 @@ emit_call(struct bpf_jit_state *st, uintptr_t trg)
const uint8_t ops = 0xFF;
const uint8_t mods = 2;
+ /* Mark RBP as used to trigger stack realignment in prolog. */
+ USED(st->reguse, RBP);
emit_ld_imm64(st, RAX, trg, trg >> 32);
emit_bytes(st, &ops, sizeof(ops));
emit_modregrm(st, MOD_DIRECT, mods, RAX);
@@ -1220,6 +1224,16 @@ emit_prolog(struct bpf_jit_state *st, int32_t stack_size)
if (INUSE(st->reguse, RBP) != 0) {
emit_mov_reg(st, EBPF_ALU64 | EBPF_MOV | BPF_X, RSP, RBP);
emit_alu_imm(st, EBPF_ALU64 | BPF_SUB | BPF_K, RSP, stack_size);
+
+ /*
+ * Align stack pointer appropriately for function calls.
+ * We do not have direct access to function stack alignment
+ * value (like gcc internal macro INCOMING_STACK_BOUNDARY),
+ * but hopefully `alignof(max_align_t)` will always be greater.
+ * Original stack pointer will be restored from rbp.
+ */
+ emit_alu_imm(st, EBPF_ALU64 | BPF_AND | BPF_K, RSP,
+ -(uint32_t)alignof(max_align_t));
}
}
--
2.34.1
* Re: [PATCH v3] bpf: fix x86 call stack alignment for external calls
2026-01-21 10:16 ` [PATCH v3] bpf: fix x86 call stack alignment for external calls Marat Khalili
@ 2026-02-04 9:35 ` Thomas Monjalon
0 siblings, 0 replies; 7+ messages in thread
From: Thomas Monjalon @ 2026-02-04 9:35 UTC (permalink / raw)
To: Marat Khalili
Cc: dev, stephen, Konstantin Ananyev, Ferruh Yigit, stable
21/01/2026 11:16, Marat Khalili:
> Correctly align stack pointer on x86 JIT if external calls are present.
>
> According to x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI,
> section 3.2.2 The Stack Frame) stack needs to be 16 (or more) bytes
> aligned immediately before the call instruction is executed. Once
> control has been transferred to the function entry point it is always
> off by 8 bytes. It means that JIT-compiled BPF function will always have
> its stack misaligned for any nested call unless it performs operations
> with the stack; even if it does use stack there is still 50% chance of
> stack being misaligned since it uses it in multiples of 8.
>
> To solve the issue mark RBP as used whenever we have external function
> calls, and align RSP using AND instruction at the end of the prolog.
> Marking RBP as used triggers stack pointer saving in prolog and
> restoration in epilog.
>
> Add tests for external calls from BPF program demonstrating the problem:
> * direct verification of a local variable alignment;
> * operations with 128-bit integers;
> * aligned and unaligned SSE2 instructions;
> * memcpy and rte_memcpy (may use vector instructions in their code).
>
> (Such variety is needed because not all of these tests are available or
> reproduce the problem on all targets even when the problem exists.)
>
> Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
> Cc: stable@dpdk.org
>
> Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Tested-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Applied, thanks.
* Re: [PATCH] bpf: fix x86 call stack alignment, add tests
2025-12-19 18:26 [PATCH] bpf: fix x86 call stack alignment, add tests Marat Khalili
2025-12-28 14:16 ` Konstantin Ananyev
2026-01-05 16:09 ` [PATCH v2] " Marat Khalili
@ 2026-01-14 17:49 ` Stephen Hemminger
2026-01-14 19:43 ` Stephen Hemminger
3 siblings, 0 replies; 7+ messages in thread
From: Stephen Hemminger @ 2026-01-14 17:49 UTC (permalink / raw)
To: Marat Khalili; +Cc: dev, Konstantin Ananyev, Ferruh Yigit, stable
On Fri, 19 Dec 2025 18:26:23 +0000
Marat Khalili <marat.khalili@huawei.com> wrote:
> Correctly align stack pointer on x86 JIT if external calls are present.
>
> Add tests for external calls from BPF program demonstrating the problem:
> * direct verification of a local variable alignment;
> * operations with 128-bit integers;
> * aligned and unaligned SSE2 instructions;
> * memcpy and rte_memcpy (may use vector instructions in their code).
>
> (Such variety is needed because not all of these tests are available or
> reproduce the problem on all targets even when the problem exists.)
>
> Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
>
> Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
> ---
AI code review of the test spotted some minor stuff.
## DPDK Patch Review: BPF x86 Call Stack Alignment Fix
### Commit Message
| Check | Status | Notes |
|-------|--------|-------|
| Subject ≤60 chars | ✅ | 43 characters |
| Lowercase after colon | ✅ | Correct |
| Imperative mood | ✅ | "fix" is imperative |
| No trailing period | ✅ | Correct |
| **Forbidden punctuation** | ❌ **ERROR** | Subject contains comma (`,`): "fix x86 call stack alignment**,** add tests" |
| Body wrapped ≤75 chars | ✅ | All lines within limit |
| Body doesn't start with "It" | ✅ | Starts with "Correctly" |
| Signed-off-by present | ✅ | Real name and email |
| Fixes: tag format | ✅ | 12-char SHA, quoted subject |
| Cc: stable@dpdk.org | ✅ | Present in headers for backport |
| Tag order | ✅ | Blank line between Fixes: and Signed-off-by |
**Suggested subject fix:**
```
bpf: fix x86 call stack alignment and add tests
```
or split into two patches.
---
### Code Quality
**JIT Fix (`lib/bpf/bpf_jit_x86.c`):**
| Check | Status | Notes |
|-------|--------|-------|
| Lines ≤100 chars | ✅ | All within limit |
| Comment style | ✅ | Proper `/* */` multi-line format |
| Logic correctness | ✅ | Stack alignment via `AND RSP, -alignof(max_align_t)` is correct |
| Side effect: blank line removal | ⚠️ **INFO** | Line 1217 removes unrelated blank line - minor cleanup mixed with fix |
The alignment approach is sound: using `-(uint32_t)alignof(max_align_t)` creates mask `0xFFFFFFF0` for 16-byte alignment.
**Tests (`app/test/test_bpf.c`):**
| Check | Status | Notes |
|-------|--------|-------|
| Test registration macro | ✅ | Uses `REGISTER_FAST_TEST` (correct per guidelines) |
| NULL comparisons | ✅ | Uses `== NULL` explicitly |
| Format specifiers | ✅ | Uses `%ju`/`%zu` correctly with casts |
| **Missing NULL check** | ⚠️ **WARNING** | Lines 410, 414: `malloc()` return not checked |
| Conditional compilation | ✅ | `__SIZEOF_INT128__`, `__SSE2__` guards appropriate |
---
### Specific Code Observations
**Minor warning - malloc without NULL check:**
```c
char *const src_buffer = malloc(size + src_offset); // line 410
// ... no NULL check
char *const dst_buffer = malloc(size + dst_offset); // line 414
// ... no NULL check
```
While test code is more tolerant, adding a check would be cleaner:
```c
if (src_buffer == NULL || dst_buffer == NULL) {
	free(dst_buffer);
	free(src_buffer);
	return TEST_FAILED;
}
```
**Good practices observed:**
- Thorough test coverage: direct alignment verification, uint128, SSE2 aligned/unaligned, memcpy variants
- Proper volatile usage to prevent optimization interfering with tests
- Good inline documentation explaining the ABI requirement and fix rationale
- Appropriate use of `alignof(max_align_t)` rather than hardcoding 16
---
### Summary
| Severity | Count | Items |
|----------|-------|-------|
| **Error** | 1 | Subject contains forbidden comma |
| **Warning** | 1 | malloc() without NULL checks in test |
| **Info** | 1 | Unrelated blank line removal mixed with fix |
**Recommendation:** Request v3 with subject line fixed. The technical content is solid - the fix correctly addresses the x86-64 ABI stack alignment requirement for external calls, and the tests comprehensively validate the fix across multiple scenarios.
* Re: [PATCH] bpf: fix x86 call stack alignment, add tests
2025-12-19 18:26 [PATCH] bpf: fix x86 call stack alignment, add tests Marat Khalili
` (2 preceding siblings ...)
2026-01-14 17:49 ` [PATCH] bpf: fix x86 call stack alignment, add tests Stephen Hemminger
@ 2026-01-14 19:43 ` Stephen Hemminger
3 siblings, 0 replies; 7+ messages in thread
From: Stephen Hemminger @ 2026-01-14 19:43 UTC (permalink / raw)
To: Marat Khalili; +Cc: dev, Konstantin Ananyev, Ferruh Yigit, stable
On Fri, 19 Dec 2025 18:26:23 +0000
Marat Khalili <marat.khalili@huawei.com> wrote:
> Correctly align stack pointer on x86 JIT if external calls are present.
>
> Add tests for external calls from BPF program demonstrating the problem:
> * direct verification of a local variable alignment;
> * operations with 128-bit integers;
> * aligned and unaligned SSE2 instructions;
> * memcpy and rte_memcpy (may use vector instructions in their code).
>
> (Such variety is needed because not all of these tests are available or
> reproduce the problem on all targets even when the problem exists.)
>
> Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
>
> Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
> ---
AI code review identifies some shadow declarations in this patch.
Please fix and resubmit.
## DPDK Patch Review: BPF x86 Call Stack Alignment Fix
### Summary
This patch fixes a real x86-64 ABI compliance bug in the BPF JIT compiler where stack alignment wasn't guaranteed to be 16-byte aligned before external function calls. The fix is elegant and the test coverage is comprehensive.
---
### Commit Message Issues
| Severity | Issue | Details |
|----------|-------|---------|
| **Error** | Comma in subject line | Subject contains `,` which is a forbidden punctuation mark per `check-git-log.sh` |
| **Error** | Missing blank line before Signed-off-by | Per tag order rules, blank line required between `Fixes:`/`Cc:` group and `Signed-off-by:` group |
**Subject line:**
```
bpf: fix x86 call stack alignment, add tests
```
Should be split into two patches, or reworded:
```
bpf: fix x86 call stack alignment for external calls
```
(with tests mentioned in body, or as separate patch)
**Tag ordering should be:**
```
Fixes: cc752e43e079 ("bpf: add JIT compilation for x86_64 ISA")
Cc: stable@dpdk.org
Signed-off-by: Marat Khalili <marat.khalili@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Tested-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
```
---
### Code Review
**lib/bpf/bpf_jit_x86.c** — The fix is clean and well-documented:
```c
/* Mark RBP as used to trigger stack realignment in prolog. */
USED(st->reguse, RBP);
```
The alignment approach using `alignof(max_align_t)` is portable and correct. The comment block explaining the rationale is helpful.
**Minor observation:** Line 1219 removes an extra blank line — this is unrelated cleanup but acceptable.
**app/test/test_bpf.c** — Comprehensive test coverage:
| Severity | Issue | Location |
|----------|-------|----------|
| **Warning** | Variable shadowing | Lines 462-464: `src_offset`, `dst_offset`, `test_rc` redeclared in inner loop scope |
| **Info** | Style preference | Consider extracting the common pattern in memcpy subtests |
The shadowed variables at lines 462-464:
```c
for (size_t size = 1; size <= 1024; size <<= 1) {
const bool src_offset = offsets & 1; /* shadows outer scope */
const bool dst_offset = offsets & 2; /* shadows outer scope */
int test_rc; /* shadows outer scope */
```
These are unnecessary redeclarations since the outer loop variables can be reused.
---
### Technical Assessment
The fix correctly addresses the x86-64 ABI requirement (System V AMD64 ABI §3.2.2) that RSP must be 16-byte aligned before `CALL`. The approach of:
1. Marking RBP as used when external calls exist → triggers stack frame setup
2. Aligning RSP with `AND` instruction using `-alignof(max_align_t)`
3. Restoring original RSP from RBP in epilog
is sound and minimal in its impact on generated code for programs without external calls.
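In pseudo-assembly (illustrative only; exact encodings and register-save order follow the JIT's existing conventions), the emitted frame now looks roughly like:

```
prolog:  push rbx ... push rbp   ; callee-saved registers, RBP last
         mov  rbp, rsp           ; save the incoming stack pointer
         sub  rsp, stack_size
         and  rsp, -16           ; -(uint32_t)alignof(max_align_t), sign-extended
         ...                     ; RSP is now suitably aligned for any CALL
epilog:  mov  rsp, rbp           ; restore the original (possibly unaligned) RSP
         pop  rbp ... pop rbx
         ret
```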
The test coverage is thorough, testing:
- Direct stack variable alignment verification
- 128-bit integer operations
- SSE2 aligned/unaligned loads/stores
- memcpy/rte_memcpy with various sizes and alignments
---
### Verdict
**Recommended changes before merge:**
1. **Fix subject line** — Remove comma or split into two patches
2. **Add blank line** before `Signed-off-by:` block
3. **Consider** removing shadowed variable declarations in test (minor)
The technical fix itself is correct and well-implemented. With commit message fixes, this should be ready to merge.