public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] arm64: mte: Improve performance by explicitly disabling unwanted tag checking
@ 2026-01-15 23:07 Carl Worth
  2026-01-15 23:07 ` [PATCH v2 1/2] arm64: mte: Clarify kernel MTE policy and manipulation of TCO Carl Worth
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Carl Worth @ 2026-01-15 23:07 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: linux-arm-kernel, linux-kernel, Taehyun Noh, Carl Worth

[Thanks to Taehyun Noh from UT Austin for originally reporting this
bug. In this cover letter, "we" refers to a collaborative effort
between individuals at both Ampere Computing and UT Austin.]

We measured severe performance overhead (25-50%) when enabling
userspace MTE and running memcached on an AmpereOne machine (detailed
benchmark results are provided below).

We identified excessive tag checking taking place in the kernel (even
though only userspace tag checking was requested) as the culprit for
the performance slowdown. The existing kernel code assumes that if
tag check faults are not requested, the hardware will not perform tag
checking. We found (empirically) that this is not the case for at
least some implementations, and verified that there is no
architectural requirement that tag checking be disabled when tag
check faults are not requested.

This patch series addresses the slowdown by using TCMA1 to explicitly
disable unwanted tag checking.
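
The mechanism can be illustrated with a small sketch (plain Python,
not kernel code): with TCR_EL1.TCMA1 set, accesses made through TTBR1
whose logical address tag (bits [59:56]) is 0b1111 are Unchecked, and
ordinary kernel pointers carry exactly that tag, so only deliberately
tagged accesses remain checked.

```python
# Illustration (not kernel code) of why setting TCR_EL1.TCMA1
# suppresses unwanted kernel tag checking: accesses via TTBR1
# addresses with logical tag 0b1111 become Unchecked, and ordinary
# kernel pointers carry that tag.
TCR_TCMA1 = 1 << 58  # TCR_EL1.TCMA1 bit position

def logical_tag(addr: int) -> int:
    """MTE logical tag of a 64-bit virtual address (bits [59:56])."""
    return (addr >> 56) & 0xF

kernel_ptr = 0xFFFF_8000_0000_1000   # typical TTBR1 (kernel) address
print(hex(TCR_TCMA1), hex(logical_tag(kernel_ptr)))
```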

The effect of this patch series is most readily seen by using perf to
count tag-checked accesses in both kernel and userspace, for example
while running "perf bench futex hash" with MTE enabled.

Prior to the patch series, we see:

 # GLIBC_TUNABLES=glibc.mem.tagging=3 perf stat -e mem_access_checked_rd:u,mem_access_checked_wr:u,mem_access_checked_rd:k,mem_access_checked_wr:k perf bench futex hash
...
 Performance counter stats for 'perf bench futex hash':
     4,246,651,954      mem_access_checked_rd:u
        29,375,167      mem_access_checked_wr:u
   246,588,717,771      mem_access_checked_rd:k
    78,805,316,911      mem_access_checked_wr:k

And after the patch series we see (for the same command):

 Performance counter stats for 'perf bench futex hash':
     4,337,091,554      mem_access_checked_rd:u
            23,487      mem_access_checked_wr:u
     4,342,774,550      mem_access_checked_rd:k
               788      mem_access_checked_wr:k

As can be seen above, with roughly equivalent counts of userspace
tag-checked accesses, over 98% of the kernel-space tag-checked
accesses are eliminated.
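
The raw counts bear that figure out; summing the kernel-space read
and write counters from the two runs quoted above:

```python
# Arithmetic check of the reduction in kernel-space tag-checked
# accesses, using the mem_access_checked_*:k counts quoted above.
before_k = 246_588_717_771 + 78_805_316_911   # rd:k + wr:k, before
after_k  = 4_342_774_550 + 788                # rd:k + wr:k, after
reduction = (1 - after_k / before_k) * 100
print(f"{reduction:.1f}% of kernel tag-checked accesses eliminated")
# -> 98.7%
```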

As to performance, the patch series should have no behavioral impact
if the kernel is not compiled with MTE support. And the series has not
been observed to have any impact when the kernel includes MTE support
but the workloads have MTE disabled in userspace.

For workloads with MTE enabled, we measured the series giving a 2%
improvement for "perf bench futex hash" at 95% confidence.

Also, we used the Phoronix Test Suite pts/memcached benchmark with a
get-heavy workload (1:10 Set:Get ratio) which is where the slowdown
appears most clearly. The slowdown worsens with increased core count,
levelling out above 32 cores. The numbers below are based on averages
from 50 runs each, with 96 cores on each run. For "MTE on",
GLIBC_TUNABLES was set to "glibc.mem.tagging=3". For "MTE off",
GLIBC_TUNABLES was unset.

The numbers below are normalized ops./sec. (higher is better),
normalized to the baseline case (unpatched kernel, MTE off).

Before the patch series (upstream v6.19-rc5+):

	MTE off: 1.000
	MTE  on: 0.742

	MTE overhead: 25.8% +/- 1.6%

After applying this patch series:

	MTE off: 0.991
	MTE  on: 0.990

	MTE overhead: No difference proven at 95.0% confidence
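
For clarity, the overhead percentages follow directly from the
normalized throughput numbers above (higher ops./sec. is better, so
overhead is the fractional throughput loss relative to MTE off):

```python
# Deriving the MTE overhead figures from the normalized ops./sec.
# values quoted above.
def mte_overhead_pct(mte_off: float, mte_on: float) -> float:
    """Percent throughput lost when MTE is enabled."""
    return (mte_off - mte_on) / mte_off * 100.0

print(f"before: {mte_overhead_pct(1.000, 0.742):.1f}%")  # 25.8%
print(f"after:  {mte_overhead_pct(0.991, 0.990):.1f}%")  # ~0.1%
```

(The +/- 1.6% error bar and the 95% confidence statements come from
the distribution across the 50 runs, which the point values here do
not capture.)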

-Carl

---
Changes in v2:
- Fixed to correctly pass 'current' vs. 'next' in set_kernel_mte_policy
  (thanks to Will Deacon)
- Changed approach to use TCMA1 rather than toggling PSTATE.TCO
  (thanks to Catalin Marinas)
- Link to v1: https://lore.kernel.org/r/20251030-mte-tighten-tco-v1-0-88c92e7529d9@os.amperecomputing.com
---
Carl Worth (1):
      arm64: mte: Set TCMA1 whenever MTE is present in the kernel

Taehyun Noh (1):
      arm64: mte: Clarify kernel MTE policy and manipulation of TCO

 arch/arm64/include/asm/mte.h     | 40 +++++++++++++++++++++++++++++++++-------
 arch/arm64/kernel/entry-common.c |  4 ++--
 arch/arm64/kernel/mte.c          |  2 +-
 arch/arm64/mm/proc.S             | 10 +++++-----
 4 files changed, 41 insertions(+), 15 deletions(-)
---
base-commit: 944aacb68baf7624ab8d277d0ebf07f025ca137c




Thread overview: 9+ messages
2026-01-15 23:07 [PATCH v2 0/2] arm64: mte: Improve performance by explicitly disabling unwanted tag checking Carl Worth
2026-01-15 23:07 ` [PATCH v2 1/2] arm64: mte: Clarify kernel MTE policy and manipulation of TCO Carl Worth
2026-01-19 18:17   ` Catalin Marinas
2026-01-20 19:44     ` Taehyun Noh
2026-01-15 23:07 ` [PATCH v2 2/2] arm64: mte: Set TCMA1 whenever MTE is present in the kernel Carl Worth
2026-01-19 17:57   ` Catalin Marinas
2026-01-22 10:23   ` Usama Anjum
2026-01-22 11:49     ` Catalin Marinas
2026-01-27 11:39 ` [PATCH v2 0/2] arm64: mte: Improve performance by explicitly disabling unwanted tag checking Will Deacon
