Linux-ARM-Kernel Archive on lore.kernel.org
* [PATCH] arm64: clear_user: align __arch_clear_user() to 128B for I-cache efficiency
@ 2025-11-21  5:04 Luke Yang
  2025-11-24 13:38 ` Will Deacon
  0 siblings, 1 reply; 3+ messages in thread
From: Luke Yang @ 2025-11-21  5:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Jirka Hladky, linux-arm-kernel, linux-kernel, Joe Mario

On aarch64 kernels, a recent change (specifically the irqbypass patch
https://lore.kernel.org/all/20250516230734.2564775-6-seanjc@google.com/)
shifted __arch_clear_user() so that its tight zeroing loop straddles an
I-cache line boundary. This causes a measurable performance regression
when reading from /dev/zero.

Add `.p2align 6` (64-byte alignment) so the function starts on an
I-cache line boundary and the loop no longer straddles lines, restoring
the previous IPC and throughput.

Tested on bare-metal aarch64 systems:

  Good kernel: pread_z100k ~ 6.9 s
  Bad kernel:  pread_z100k ~ 9.0 s
  With patch:  pread_z100k ~ 6.9 s

Reproducer:

// gcc -O2 -Wall -Wextra -o pread_z100k pread_z100k.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (100 * 1024)
#define COUNT 1000000

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) { perror("open /dev/zero"); return 1; }

    char *buf = malloc(SIZE);
    if (!buf) { perror("malloc"); close(fd); return 1; }

    double t1 = now_sec();
    for (int i = 0; i < COUNT; i++) {
        ssize_t r = pread(fd, buf, SIZE, 0);
        if (r != SIZE) { perror("pread"); break; }
    }
    double t2 = now_sec();

    printf("%.6f\n", t2 - t1);

    close(fd);
    free(buf);
    return 0;
}

Signed-off-by: Luke Yang <luyang@redhat.com>
Signed-off-by: Jirka Hladky <jhladky@redhat.com>
---
 arch/arm64/lib/clear_user.S | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/lib/clear_user.S b/arch/arm64/lib/clear_user.S
index de9a303b6..91eee4a7c 100644
--- a/arch/arm64/lib/clear_user.S
+++ b/arch/arm64/lib/clear_user.S
@@ -17,6 +17,11 @@
  * Alignment fixed up by hardware.
  */

+/*
+ * Ensure __arch_clear_user() always starts on a clean I-cache boundary.
+ */
+    .p2align 6    // 2^6 = 64-byte alignment
+
 SYM_FUNC_START(__arch_clear_user)
     add    x2, x0, x1

-- 
2.51.1


* Re: [PATCH] arm64: clear_user: align __arch_clear_user() to 128B for I-cache efficiency
  2025-11-21  5:04 [PATCH] arm64: clear_user: align __arch_clear_user() to 128B for I-cache efficiency Luke Yang
@ 2025-11-24 13:38 ` Will Deacon
  2025-11-25  3:45   ` Luke Yang
  0 siblings, 1 reply; 3+ messages in thread
From: Will Deacon @ 2025-11-24 13:38 UTC (permalink / raw)
  To: Luke Yang
  Cc: Catalin Marinas, Jirka Hladky, linux-arm-kernel, linux-kernel,
	Joe Mario

On Fri, Nov 21, 2025 at 12:04:55AM -0500, Luke Yang wrote:
> On aarch64 kernels, a recent change (specifically the irqbypass patch
> https://lore.kernel.org/all/20250516230734.2564775-6-seanjc@google.com/)
> shifted __arch_clear_user() so that its tight zeroing loop straddles an
> I-cache line boundary. This causes a measurable performance regression
> when reading from /dev/zero.
> 
> Add `.p2align 6` (64-byte alignment) so the function starts on an
> I-cache line boundary and the loop no longer straddles lines, restoring
> the previous IPC and throughput.

Hmm, but what's special about __arch_clear_user()? If we make this change,
anybody could surely make similar arguments for other functions on their
hot paths?

Amusingly, there's CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B to play around
with that, and it sounds like the irqbypass change you cite falls into
the category of changes highlighted by the Kconfig text.
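
For reference, a sketch of flipping that knob in a configured kernel
source tree (untested here, paths assume the usual tree layout):

```shell
# Enable forced 64-byte function alignment and rebuild
./scripts/config --enable DEBUG_FORCE_FUNCTION_ALIGN_64B
make olddefconfig
make -j"$(nproc)" Image
```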

Will


* Re: [PATCH] arm64: clear_user: align __arch_clear_user() to 128B for I-cache efficiency
  2025-11-24 13:38 ` Will Deacon
@ 2025-11-25  3:45   ` Luke Yang
  0 siblings, 0 replies; 3+ messages in thread
From: Luke Yang @ 2025-11-25  3:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Jirka Hladky, linux-arm-kernel, linux-kernel,
	Joe Mario

On Mon, 24 Nov 2025 13:38:25 +0000, Will Deacon wrote:
> Hmm, but what's special about __arch_clear_user()? If we make this
> change, anybody could surely make similar arguments for other
> functions on their hot paths?

Hi Will,

Thanks for the feedback.

I agree that the precedent question matters. In this case, though, the
irqbypass change introduced roughly a 30% regression in /dev/zero read
throughput. That is a fundamental primitive that many workloads rely on,
and the regression stems from an unintended shift of __arch_clear_user()
that leaves its tight zeroing loop crossing an I-cache line boundary.
The regression has been deterministic in every kernel build we have
tested since it first appeared in 6.17.

The proposed ".p2align 6" does not add a new micro-optimisation; it
restores the instruction-cache locality the function had before the
irqbypass reshuffle. The cost is small (at most 64 bytes of padding in
one place), and while the bar for this kind of fix should be high, it
seems met here: an unrelated change caused a significant performance
regression in a widely used core primitive.

I am open to any solution that reliably restores the lost IPC, so please
let me know if you have something else in mind.

Thanks,
Luke

