linux-hardening.vger.kernel.org archive mirror
* [RFC PATCH v5 00/18] pkeys-based page table hardening
@ 2025-08-15  8:54 Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 01/18] mm: Introduce kpkeys Kevin Brodsky
                   ` (19 more replies)
  0 siblings, 20 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:54 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

This is a proposal to leverage protection keys (pkeys) to harden
critical kernel data, by making it mostly read-only. The series includes
a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
as well as a page table hardening feature based on that framework,
"kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
concept, but they are designed to be compatible with any architecture
that supports pkeys.

The proposed approach is a typical use of pkeys: the data to protect is
mapped with a given pkey P, and the pkey register is initially configured
to grant read-only access to P. Where the protected data needs to be
written to, the pkey register is temporarily switched to grant write
access to P on the current CPU.

The key fact this approach relies on is that the target data is
only written to via a limited and well-defined API. This makes it
possible to explicitly switch the pkey register where needed, without
introducing excessively invasive changes, and only for a small amount of
trusted code.

Page tables were chosen as they are a popular (and critical) target for
attacks, but there are of course many others - this is only a starting
point (see section "Further use-cases"). It has become increasingly
common for accesses to such target data to be mediated by a hypervisor
in vendor kernels; the hope is that kpkeys can provide much of that
protection in a simpler and cheaper manner. A rough performance
estimate has been obtained on a modern arm64 system, see section
"Performance".

This series has similarities with the "PKS write protected page tables"
series posted by Rick Edgecombe a few years ago [1], but it is not
specific to x86/PKS - the approach is meant to be generic.

kpkeys
======

The use of pkeys involves two separate mechanisms: assigning a pkey to
pages, and defining the pkeys -> permissions mapping via the pkey
register. This is implemented through the following interface:

- Pages in the linear mapping are assigned a pkey using set_memory_pkey().
  This is sufficient for this series, but of course higher-level
  interfaces can be introduced later to ask allocators to return pages
  marked with a given pkey. It should also be possible to extend this to
  vmalloc() if needed.

- The pkey register is configured based on a *kpkeys level*. kpkeys
  levels are simple integers, each corresponding to a given set of pkey
  permissions, for instance:

  KPKEYS_LVL_DEFAULT:
        RW access to KPKEYS_PKEY_DEFAULT
        RO access to any other KPKEYS_PKEY_*

  KPKEYS_LVL_<FEAT>:
        RW access to KPKEYS_PKEY_DEFAULT
        RW access to KPKEYS_PKEY_<FEAT>
        RO access to any other KPKEYS_PKEY_*

  Only pkeys that are managed by the kpkeys framework are impacted;
  permissions for other pkeys are left unchanged (this allows for other
  schemes using pkeys to be used in parallel, and arch-specific use of
  certain pkeys).

  The kpkeys level is changed by calling kpkeys_set_level(), setting the
  pkey register accordingly and returning the original value. A
  subsequent call to kpkeys_restore_pkey_reg() restores the kpkeys
  level. The numeric value of KPKEYS_LVL_* (kpkeys level) is purely
  symbolic and thus generic; each architecture is however free to define
  KPKEYS_PKEY_* (pkey value).
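
As an illustration, here is a minimal sketch of how a hardening feature
might use both mechanisms (the FEAT level/pkey and the protected
counter are hypothetical, introduced here for illustration only):

  static u64 *protected_counter;

  static int __init protected_counter_init(void)
  {
          protected_counter = (u64 *)get_zeroed_page(GFP_KERNEL);
          if (!protected_counter)
                  return -ENOMEM;

          /* Assign KPKEYS_PKEY_FEAT to the page in the linear mapping */
          return set_memory_pkey((unsigned long)protected_counter, 1,
                                 KPKEYS_PKEY_FEAT);
  }

  static void update_protected_counter(void)
  {
          /* Temporarily gain write access to KPKEYS_PKEY_FEAT */
          u64 pkey_reg = kpkeys_set_level(KPKEYS_LVL_FEAT);

          (*protected_counter)++;

          kpkeys_restore_pkey_reg(pkey_reg);
  }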

kpkeys_hardened_pgtables
========================

The kpkeys_hardened_pgtables feature uses the interface above to make
the (kernel and user) page tables read-only by default, enabling write
access only in helpers such as set_pte(). One complication is that those
helpers, as well as page table allocators, are used very early, before
kpkeys become available. Enabling kpkeys_hardened_pgtables, if and when
kpkeys do become available, is therefore done as follows:

1. A static key is turned on. This enables a transition to
   KPKEYS_LVL_PGTABLES in all helpers writing to page tables, and also
   impacts page table allocators (step 3).

2. swapper_pg_dir is walked to set all early page table pages to
   KPKEYS_PKEY_PGTABLES.

3. Page table allocators set the returned pages to KPKEYS_PKEY_PGTABLES
   (and the pkey is reset upon freeing). This ensures that all page
   tables are mapped with that privileged pkey.
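
For illustration, the gating in step 1 might look roughly as follows
(simplified sketch: the kpkeys_hardened_pgtables guard and the
kpkeys_hardened_pgtables_enabled() helper do exist in this series, but
the actual definitions and the real set_pte() implementation differ in
detail):

  KPKEYS_GUARD_COND(kpkeys_hardened_pgtables, KPKEYS_LVL_PGTABLES,
                    kpkeys_hardened_pgtables_enabled())

  static inline void set_pte(pte_t *ptep, pte_t pte)
  {
          /* Write access to KPKEYS_PKEY_PGTABLES for this scope only */
          guard(kpkeys_hardened_pgtables)();

          WRITE_ONCE(*ptep, pte);
  }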

This series
===========

The series is composed of two parts:

- The kpkeys framework (patch 1-9). The main API is introduced in
  <linux/kpkeys.h>, and it is implemented on arm64 using the POE
  (Permission Overlay Extension) feature.

- The kpkeys_hardened_pgtables feature (patch 10-18). <linux/kpkeys.h>
  is extended with an API to set page table pages to a given pkey and a
  guard object to switch kpkeys level accordingly, both gated on a
  static key. This is then used in generic and arm64 pgtable handling
  code as needed. Finally a simple KUnit-based test suite is added to
  demonstrate the page table protection.

pkey register management
========================

The kpkeys model relies on the kernel pkey register being set to a
specific value for the duration of a relatively small section of code,
and otherwise to the default value. Accordingly, the arm64
implementation based on POE handles its pkey register (POR_EL1) as
follows:

- POR_EL1 is saved and reset to its default value on exception entry,
  and restored on exception return. This ensures that exception handling
  code runs in a fixed kpkeys state.

- POR_EL1 is context-switched per-thread. This allows sections of code
  that run at a non-default kpkeys level to sleep (e.g. when locking a
  mutex). For kpkeys_hardened_pgtables, only involuntary preemption is
  relevant and the previous point already handles that; however sleeping
  is likely to occur in more advanced uses of kpkeys.

An important assumption is that all kpkeys levels allow RW access to the
default pkey (0). Otherwise, saving POR_EL1 before resetting it on
exception entry would be at best difficult, and so would
context-switching it.

Performance
===========

No arm64 hardware currently implements POE. To estimate the performance
impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has
been used, replacing accesses to the POR_EL1 register with accesses to
another system register that is otherwise unused (CONTEXTIDR_EL1), and
leaving everything else unchanged. Most of the kpkeys overhead is
expected to originate from the barrier (ISB) that is required after
writing to POR_EL1, and from setting the POIndex (pkey) in page tables;
both of these are done in exactly the same way in the mock
implementation.

The original implementation of kpkeys_hardened_pgtables is very
inefficient when many PTEs are changed at once, as the kpkeys level is
switched twice for every PTE (two ISBs per PTE). Patch 18 introduces
an optimisation that makes use of the lazy_mmu mode to batch those
switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
2. skip any kpkeys switch while in that section, and 3. restore the
kpkeys level on arch_leave_lazy_mmu_mode(). When that last function
already issues an ISB (i.e. when updating kernel page tables), a further
optimisation applies: the ISB normally required when restoring the
kpkeys level can be skipped.
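
A simplified sketch of that batching (the real implementation in patch
18 also handles nesting, interrupt context and the ISB elision
mentioned above; the thread_struct field used to save the pkey register
is hypothetical):

  static inline void arch_enter_lazy_mmu_mode(void)
  {
          if (kpkeys_hardened_pgtables_enabled()) {
                  /* One level switch (and ISB) for the whole section */
                  current->thread.lazy_mmu_pkey_reg =
                          kpkeys_set_level(KPKEYS_LVL_PGTABLES);
                  set_thread_flag(TIF_LAZY_MMU);
          }
  }

  /* pgtable helpers skip the kpkeys switch while TIF_LAZY_MMU is set */

  static inline void arch_leave_lazy_mmu_mode(void)
  {
          if (kpkeys_hardened_pgtables_enabled()) {
                  clear_thread_flag(TIF_LAZY_MMU);
                  kpkeys_restore_pkey_reg(current->thread.lazy_mmu_pkey_reg);
          }
  }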

Both implementations (without and with batching) were evaluated on an
Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that
involve heavy page table manipulations. The results shown below are
relative to the baseline for this series, which is 6.17-rc1. The
branches used for all three sets of results (baseline, with/without
batching) are available in a repository, see next section.

Caveat: these numbers should be seen as a lower bound for the overhead
of a real POE-based protection. The hardware checks added by POE are
however not expected to incur significant extra overhead.

Reading example: for the fix_size_alloc_test benchmark, using 1 page per
iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead
without batching, and 14.62% overhead with batching. Both results are
considered statistically significant (95% confidence interval),
indicated by "(R)".

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.30% |         0.11% |
|                   | system time                      |        (R) 3.97% |     (R) 2.17% |
|                   | user time                        |            0.12% |         0.02% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
|                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
|                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
+-------------------+----------------------------------+------------------+---------------+

Benchmarks:
- mmtests/kernbench: running kernbench (kernel build) [4].
- micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
  1 GB mapping is created and then fork/unmap is called. The mapping is
  created using either page-sized (h:0) or hugepage folios (h:1); in all
  cases the memory is PTE-mapped.
- micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
  (p:) and whether huge pages are used (h:).

On a "real-world" and fork-heavy workload like kernbench, the estimated
overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
overhead without batching, and about half that figure (2.2%) with
batching. The real time overhead is negligible.

Microbenchmarks show large overheads without batching, which increase
with the number of pages being manipulated. Batching drastically reduces
that overhead, almost negating it for micromm/fork. Because all PTEs in
the mapping are modified in the same lazy_mmu section, the kpkeys level
is changed just twice regardless of the mapping size; as a result the
relative overhead actually decreases as the size increases for
fix_size_alloc_test.

Note: the performance impact of set_memory_pkey() is likely to be
relatively low on arm64 because the linear mapping uses PTE-level
descriptors only. This means that set_memory_pkey() simply changes the
attributes of some PTE descriptors. However, some systems may be able to
use higher-level descriptors in the future [5], meaning that
set_memory_pkey() may have to split mappings. Allocating page tables
from a contiguous cache of pages could help minimise the overhead, as
proposed for x86 in [1].

Branches
========

To make reviewing and testing easier, this series is available here:

  https://gitlab.arm.com/linux-arm/linux-kb

The following branches are available:

- kpkeys/rfc-v5 - this series, as posted

- kpkeys/rfc-v5-base - the baseline for this series, that is 6.17-rc1

- kpkeys/rfc-v5-bench-batching - this series + patch for benchmarking on
  a regular arm64 system (see section above)

- kpkeys/rfc-v5-bench-no-batching - this series without patch 18
  (batching) + benchmarking patch

Threat model
============

The proposed scheme aims at mitigating data-only attacks (e.g.
use-after-free/cross-cache attacks). In other words, it is assumed that
control flow is not corrupted, and that the attacker does not achieve
arbitrary code execution. Nothing prevents the pkey register from being
set to its most permissive state - the assumption is that the register
is only modified on legitimate code paths.

A few related notes:

- Functions that set the pkey register are all implemented inline.
  Besides performance considerations, this is meant to avoid creating
  a function that can be used as a straightforward gadget to set the
  pkey register to an arbitrary value.

- kpkeys_set_level() only accepts a compile-time constant as argument,
  as a variable could be manipulated by an attacker. This could be
  relaxed but it seems unlikely that a variable kpkeys level would be
  needed in practice.

Further use-cases
=================

It should be possible to harden various targets using kpkeys, including:

- struct cred - kpkeys-based cred hardening is now available in a
  separate series [6]

- fixmap (occasionally used even after early boot, e.g.
  set_swapper_pgd() in arch/arm64/mm/mmu.c)

- eBPF programs (preventing direct access to core kernel code/data)

- SELinux state (e.g. struct selinux_state::initialized)

... and many others.

kpkeys could also be used to strengthen the confidentiality of secret
data by making it completely inaccessible by default, and granting
read-only or read-write access as needed. This requires such data to be
accessed rarely, or only via a limited interface. One example on arm64
is the pointer authentication keys in thread_struct, whose leakage to
userspace would lead to pointer authentication being easily defeated.

Open questions
==============

A few aspects in this RFC that are debatable and/or worth discussing:

- There is currently no restriction on how kpkeys levels map to pkeys
  permissions. A typical approach is to allocate one pkey per level and
  make it writable at that level only. As the number of levels
  increases, we may however run out of pkeys, especially on arm64 (just
  8 pkeys with POE). Depending on the use-cases, it may be acceptable to
  use the same pkey for the data associated with multiple levels.

  Another potential concern is that a given piece of code may require
  write access to multiple privileged pkeys. This could be addressed by
  introducing a notion of hierarchy in trust levels, where Tn is able to
  write to memory owned by Tm if n >= m, for instance (see the sketch
  after this list).

- kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
  the former takes a kpkeys level and returns a pkey register value, to
  be consumed by the latter. It would be more intuitive to manipulate
  kpkeys levels only. However this assumes that there is a 1:1 mapping
  between kpkeys levels and pkey register values, while in principle
  the mapping is 1:n (certain pkeys may be used outside the kpkeys
  framework).

- An architecture that supports kpkeys is expected to select
  CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is
  no CONFIG_KPKEYS to control this behaviour. Since this creates no
  significant overhead (at least on arm64), it seemed better to keep it
  simple. Each hardening feature does have its own option and arch
  opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES,
  CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).
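
As a sketch of the hierarchical idea mentioned in the first open
question above (purely hypothetical - neither the per-level pkey
mapping nor kpkeys_pkey_for_level() exist in this series), a variant of
por_set_kpkeys_level() from patch 5 could look like:

  /* Grant write access to the pkeys of all levels up to 'level' */
  static inline u64 por_set_kpkeys_level_hier(u64 por, int level)
  {
          int lvl;

          for (lvl = KPKEYS_LVL_MIN; lvl <= KPKEYS_LVL_MAX; lvl++) {
                  u64 perms = (lvl <= level) ? POE_RW : POE_R;

                  por = por_elx_set_pkey_perms(por,
                                               kpkeys_pkey_for_level(lvl),
                                               perms);
          }

          return por;
  }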


Any comment or feedback will be highly appreciated, be it on the
high-level approach or implementation choices!

- Kevin

---
Changelog

RFC v4..v5:

- Rebased on v6.17-rc1.

- Cover letter: re-ran benchmarks on top of v6.17-rc1, made various
  small improvements especially to the "Performance" section.

- Patch 18: disable batching while in interrupt, since POR_EL1 is reset
  on exception entry, making the TIF_LAZY_MMU flag meaningless. This
  fixes a crash that may occur when a page table page is freed while in
  interrupt context.

- Patch 17: ensure that the target kernel address is actually
  PTE-mapped. Certain mappings (e.g. code) may be PMD-mapped instead -
  this explains why the change made in v4 was required.


RFC v4: https://lore.kernel.org/linux-mm/20250411091631.954228-1-kevin.brodsky@arm.com/

RFC v3..v4:

- Added appropriate handling of the arm64 pkey register (POR_EL1):
  context-switching between threads and resetting on exception entry
  (patch 7 and 8). See section "pkey register management" above for more
  details. A new POR_EL1_INIT macro is introduced to make the default
  value available to assembly (where POR_EL1 is reset on exception
  entry); it is updated in each patch allocating new keys.

- Added patch 18 making use of the lazy_mmu mode to batch switches to
  KPKEYS_LVL_PGTABLES - just once per lazy_mmu section rather than on
  every pgtable write. See section "Performance" for details.

- Rebased on top of [2]. No direct impact on the patches, but it ensures that
  the ctor/dtor is always called for kernel pgtables. This is an
  important fix as kernel PTEs allocated after boot were not protected
  by kpkeys_hardened_pgtables in v3 - a new test was added to patch 17
  to ensure that pgtables created by vmalloc are protected too.

- Rebased on top of [3]. The batching of kpkeys level switches in patch
  18 relies on the last patch in [3].

- Moved kpkeys guard definitions out of <linux/kpkeys.h> and to a relevant
  header for each subsystem (e.g. <asm/pgtable.h> for the
  kpkeys_hardened_pgtables guard).

- Patch 1,5: marked kpkeys_{set_level,restore_pkey_reg} as
  __always_inline to ensure that no callable gadget is created.
  [Maxwell Bland's suggestion]

- Patch 5: added helper __kpkeys_set_pkey_reg_nosync().

- Patch 10: marked kernel_pgtables_set_pkey() and related helpers as
  __init. [Linus Walleij's suggestion]

- Patch 11: added helper kpkeys_hardened_pgtables_enabled(), renamed the
  static key to kpkeys_hardened_pgtables_key.

- Patch 17: followed the KUnit conventions more closely. [Kees Cook's
  suggestion]

- Patch 17: changed the address used in the write_linear_map_pte()
  test. It seems that the PTEs that map some functions are allocated in
  ZONE_DMA and read-only (unclear why exactly). This doesn't seem to
  occur for global variables.

- Various minor fixes/improvements.

- Rebased on v6.15-rc1. This includes [7], which renames a few POE
  symbols: s/POE_RXW/POE_RWX/ and
  s/por_set_pkey_perms/por_elx_set_pkey_perms/


RFC v3: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/

RFC v2..v3:

- Patch 1: kpkeys_set_level() may now return KPKEYS_PKEY_REG_INVAL to indicate
  that the pkey register wasn't written to, and as a result that
  kpkeys_restore_pkey_reg() should do nothing. This simplifies the conditional
  guard macro and also allows architectures to skip writes to the pkey
  register if the target value is the same as the current one.

- Patch 1: introduced additional KPKEYS_GUARD* macros to cover more use-cases
  and reduce duplication.

- Patch 6: reject pkey value above arch_max_pkey().

- Patch 13: added missing guard(kpkeys_hardened_pgtables) in
  __clear_young_dirty_pte().

- Rebased on v6.14-rc1.

RFC v2: https://lore.kernel.org/linux-hardening/20250108103250.3188419-1-kevin.brodsky@arm.com/

RFC v1..v2:

- A new approach is used to set the pkey of page table pages. Thanks to
  Qi Zheng's and my own series [8][9], pagetable_*_ctor is
  systematically called when a PTP is allocated at any level (PTE to
  PGD), and pagetable_*_dtor when it is freed, on all architectures.
  Patch 11 makes use of this to call kpkeys_{,un}protect_pgtable_memory
  from the common ctor/dtor helper. The arm64 patches from v1 (patch 12
  and 13) are dropped as they are no longer needed. Patch 10 is
  introduced to allow pagetable_*_ctor to fail at all levels, since
  kpkeys_protect_pgtable_memory may itself fail.
  [Original suggestion by Peter Zijlstra]

- Changed the prototype of kpkeys_{,un}protect_pgtable_memory in patch 9
  to take a struct folio * for more convenience, and implemented them
  out-of-line to avoid a circular dependency with <linux/mm.h>.

- Rebased on next-20250107, which includes [8] and [9].

- Added locking in patch 8. [Peter Zijlstra's suggestion]

RFC v1: https://lore.kernel.org/linux-hardening/20241206101110.1646108-1-kevin.brodsky@arm.com/
---
References

[1] https://lore.kernel.org/all/20210830235927.6443-1-rick.p.edgecombe@intel.com/
[2] https://lore.kernel.org/linux-mm/20250408095222.860601-1-kevin.brodsky@arm.com/
[3] https://lore.kernel.org/linux-mm/20250304150444.3788920-1-ryan.roberts@arm.com/
[4] https://github.com/gormanm/mmtests/blob/master/shellpack_src/src/kernbench/kernbench-bench
[5] https://lore.kernel.org/all/20250724221216.1998696-1-yang@os.amperecomputing.com/
[6] https://lore.kernel.org/linux-mm/?q=s%3Apkeys+s%3Acred+s%3A0
[7] https://lore.kernel.org/linux-arm-kernel/20250219164029.2309119-1-kevin.brodsky@arm.com/
[8] https://lore.kernel.org/linux-mm/cover.1736317725.git.zhengqi.arch@bytedance.com/
[9] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/
---
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maxwell Bland <mbland@motorola.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Langlois <pierre.langlois@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
---
Kevin Brodsky (18):
  mm: Introduce kpkeys
  set_memory: Introduce set_memory_pkey() stub
  arm64: mm: Enable overlays for all EL1 indirect permissions
  arm64: Introduce por_elx_set_pkey_perms() helper
  arm64: Implement asm/kpkeys.h using POE
  arm64: set_memory: Implement set_memory_pkey()
  arm64: Reset POR_EL1 on exception entry
  arm64: Context-switch POR_EL1
  arm64: Enable kpkeys
  mm: Introduce kernel_pgtables_set_pkey()
  mm: Introduce kpkeys_hardened_pgtables
  mm: Allow __pagetable_ctor() to fail
  mm: Map page tables with privileged pkey
  arm64: kpkeys: Support KPKEYS_LVL_PGTABLES
  arm64: mm: Guard page table writes with kpkeys
  arm64: Enable kpkeys_hardened_pgtables support
  mm: Add basic tests for kpkeys_hardened_pgtables
  arm64: mm: Batch kpkeys level switches

 arch/arm64/Kconfig                        |   2 +
 arch/arm64/include/asm/kpkeys.h           |  62 +++++++++
 arch/arm64/include/asm/pgtable-prot.h     |  16 +--
 arch/arm64/include/asm/pgtable.h          |  57 +++++++-
 arch/arm64/include/asm/por.h              |  11 ++
 arch/arm64/include/asm/processor.h        |   2 +
 arch/arm64/include/asm/ptrace.h           |   4 +
 arch/arm64/include/asm/set_memory.h       |   4 +
 arch/arm64/kernel/asm-offsets.c           |   3 +
 arch/arm64/kernel/cpufeature.c            |   5 +-
 arch/arm64/kernel/entry.S                 |  24 +++-
 arch/arm64/kernel/process.c               |   9 ++
 arch/arm64/kernel/smp.c                   |   2 +
 arch/arm64/mm/fault.c                     |   2 +
 arch/arm64/mm/mmu.c                       |  26 ++--
 arch/arm64/mm/pageattr.c                  |  25 ++++
 include/asm-generic/kpkeys.h              |  21 +++
 include/asm-generic/pgalloc.h             |  15 ++-
 include/linux/kpkeys.h                    | 157 ++++++++++++++++++++++
 include/linux/mm.h                        |  27 ++--
 include/linux/set_memory.h                |   7 +
 mm/Kconfig                                |   5 +
 mm/Makefile                               |   2 +
 mm/kpkeys_hardened_pgtables.c             |  44 ++++++
 mm/memory.c                               | 137 +++++++++++++++++++
 mm/tests/kpkeys_hardened_pgtables_kunit.c | 106 +++++++++++++++
 security/Kconfig.hardening                |  24 ++++
 27 files changed, 758 insertions(+), 41 deletions(-)
 create mode 100644 arch/arm64/include/asm/kpkeys.h
 create mode 100644 include/asm-generic/kpkeys.h
 create mode 100644 include/linux/kpkeys.h
 create mode 100644 mm/kpkeys_hardened_pgtables.c
 create mode 100644 mm/tests/kpkeys_hardened_pgtables_kunit.c


base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
-- 
2.47.0



* [RFC PATCH v5 01/18] mm: Introduce kpkeys
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
@ 2025-08-15  8:54 ` Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 02/18] set_memory: Introduce set_memory_pkey() stub Kevin Brodsky
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:54 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

kpkeys is a simple framework to enable the use of protection keys
(pkeys) to harden the kernel itself. This patch introduces the basic
API in <linux/kpkeys.h>: a couple of functions to set and restore
the pkey register and macros to define guard objects.

kpkeys introduces a new concept on top of pkeys: the kpkeys level.
Each level is associated with a set of permissions for the pkeys
managed by the kpkeys framework. kpkeys_set_level(lvl) sets those
permissions according to lvl, and returns the original pkey
register value, to be later restored by kpkeys_restore_pkey_reg(). To
start with, only KPKEYS_LVL_DEFAULT is available, which is meant
to grant RW access to KPKEYS_PKEY_DEFAULT (i.e. all memory since
this is the only available pkey for now).

Because each architecture implementing pkeys uses a different
representation for the pkey register, and may reserve certain pkeys
for specific uses, support for kpkeys must be explicitly indicated
by selecting ARCH_HAS_KPKEYS and defining the following functions in
<asm/kpkeys.h>, in addition to the macros provided in
<asm-generic/kpkeys.h>:

- arch_kpkeys_set_level()
- arch_kpkeys_restore_pkey_reg()
- arch_kpkeys_enabled()

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/asm-generic/kpkeys.h |  17 ++++++
 include/linux/kpkeys.h       | 113 +++++++++++++++++++++++++++++++++++
 mm/Kconfig                   |   2 +
 3 files changed, 132 insertions(+)
 create mode 100644 include/asm-generic/kpkeys.h
 create mode 100644 include/linux/kpkeys.h

diff --git a/include/asm-generic/kpkeys.h b/include/asm-generic/kpkeys.h
new file mode 100644
index 000000000000..ab819f157d6a
--- /dev/null
+++ b/include/asm-generic/kpkeys.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_KPKEYS_H
+#define __ASM_GENERIC_KPKEYS_H
+
+#ifndef KPKEYS_PKEY_DEFAULT
+#define KPKEYS_PKEY_DEFAULT	0
+#endif
+
+/*
+ * Represents a pkey register value that cannot be used, typically disabling
+ * access to all keys.
+ */
+#ifndef KPKEYS_PKEY_REG_INVAL
+#define KPKEYS_PKEY_REG_INVAL	0
+#endif
+
+#endif	/* __ASM_GENERIC_KPKEYS_H */
diff --git a/include/linux/kpkeys.h b/include/linux/kpkeys.h
new file mode 100644
index 000000000000..faa6e2615798
--- /dev/null
+++ b/include/linux/kpkeys.h
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_KPKEYS_H
+#define _LINUX_KPKEYS_H
+
+#include <linux/bug.h>
+#include <linux/cleanup.h>
+
+#define KPKEYS_LVL_DEFAULT	0
+
+#define KPKEYS_LVL_MIN		KPKEYS_LVL_DEFAULT
+#define KPKEYS_LVL_MAX		KPKEYS_LVL_DEFAULT
+
+#define __KPKEYS_GUARD(name, set_level, restore_pkey_reg, set_arg, ...)	\
+	__DEFINE_CLASS_IS_CONDITIONAL(name, false);			\
+	DEFINE_CLASS(name, u64,						\
+		     restore_pkey_reg, set_level, set_arg);		\
+	static inline void *class_##name##_lock_ptr(u64 *_T)		\
+	{ return _T; }
+
+/**
+ * KPKEYS_GUARD_NOOP() - define a guard type that does nothing
+ * @name: the name of the guard type
+ * @cond_arg: an argument specification (optional)
+ *
+ * Define a guard type that does nothing, useful to match a real guard type
+ * that is defined under an #ifdef. @cond_arg may optionally be passed to match
+ * a guard defined using KPKEYS_GUARD_COND().
+ */
+#define KPKEYS_GUARD_NOOP(name, ...)					\
+	__KPKEYS_GUARD(name, 0, (void)_T, ##__VA_ARGS__, void)
+
+#ifdef CONFIG_ARCH_HAS_KPKEYS
+
+#include <asm/kpkeys.h>
+
+/**
+ * KPKEYS_GUARD_COND() - define a guard type that conditionally switches to
+ *                       a given kpkeys level
+ * @name: the name of the guard type
+ * @level: the kpkeys level to switch to
+ * @cond: an expression that is evaluated as condition
+ * @cond_arg: an argument specification for the condition (optional)
+ *
+ * Define a guard type that switches to @level if @cond evaluates to true, and
+ * does nothing otherwise. @cond_arg may be specified to give access to a
+ * caller-defined argument to @cond.
+ */
+#define KPKEYS_GUARD_COND(name, level, cond, ...)			\
+	__KPKEYS_GUARD(name,						\
+		       cond ? kpkeys_set_level(level)			\
+			    : KPKEYS_PKEY_REG_INVAL,			\
+		       kpkeys_restore_pkey_reg(_T),			\
+		       ##__VA_ARGS__, void)
+
+/**
+ * KPKEYS_GUARD() - define a guard type that switches to a given kpkeys level
+ *                  if kpkeys are enabled
+ * @name: the name of the guard type
+ * @level: the kpkeys level to switch to
+ *
+ * Define a guard type that switches to @level if the system supports kpkeys.
+ */
+#define KPKEYS_GUARD(name, level)					\
+	KPKEYS_GUARD_COND(name, level, arch_kpkeys_enabled())
+
+/**
+ * kpkeys_set_level() - switch kpkeys level
+ * @level: the level to switch to
+ *
+ * Switches the kpkeys level to the specified value. @level must be a
+ * compile-time constant. The arch-specific pkey register will be updated
+ * accordingly, and the original value returned.
+ *
+ * Return: the original pkey register value if the register was written to, or
+ *         KPKEYS_PKEY_REG_INVAL otherwise (no write to the register was
+ *         required).
+ */
+static __always_inline u64 kpkeys_set_level(int level)
+{
+	BUILD_BUG_ON_MSG(!__builtin_constant_p(level),
+			 "kpkeys_set_level() only takes constant levels");
+	BUILD_BUG_ON_MSG(level < KPKEYS_LVL_MIN || level > KPKEYS_LVL_MAX,
+			 "Invalid level passed to kpkeys_set_level()");
+
+	return arch_kpkeys_set_level(level);
+}
+
+/**
+ * kpkeys_restore_pkey_reg() - restores a pkey register value
+ * @pkey_reg: the pkey register value to restore
+ *
+ * This function is meant to be passed the value returned by kpkeys_set_level(),
+ * in order to restore the pkey register to its original value (thus restoring
+ * the original kpkeys level).
+ */
+static __always_inline void kpkeys_restore_pkey_reg(u64 pkey_reg)
+{
+	if (pkey_reg != KPKEYS_PKEY_REG_INVAL)
+		arch_kpkeys_restore_pkey_reg(pkey_reg);
+}
+
+#else /* CONFIG_ARCH_HAS_KPKEYS */
+
+#include <asm-generic/kpkeys.h>
+
+static inline bool arch_kpkeys_enabled(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_ARCH_HAS_KPKEYS */
+
+#endif /* _LINUX_KPKEYS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..90f2e5c381a6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1173,6 +1173,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_KPKEYS
+	bool
 
 config ARCH_USES_PG_ARCH_2
 	bool
-- 
2.47.0



* [RFC PATCH v5 02/18] set_memory: Introduce set_memory_pkey() stub
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 01/18] mm: Introduce kpkeys Kevin Brodsky
@ 2025-08-15  8:54 ` Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 03/18] arm64: mm: Enable overlays for all EL1 indirect permissions Kevin Brodsky
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:54 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

Introduce a new function, set_memory_pkey(), which sets the
protection key (pkey) of pages in the specified linear mapping
range. Architectures implementing kernel pkeys (kpkeys) must
provide a suitable implementation; an empty stub is added as a
fallback.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/linux/set_memory.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 3030d9245f5a..7b3a8bfde3c6 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -84,4 +84,11 @@ static inline int set_memory_decrypted(unsigned long addr, int numpages)
 }
 #endif /* CONFIG_ARCH_HAS_MEM_ENCRYPT */
 
+#ifndef CONFIG_ARCH_HAS_KPKEYS
+static inline int set_memory_pkey(unsigned long addr, int numpages, int pkey)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_SET_MEMORY_H_ */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 03/18] arm64: mm: Enable overlays for all EL1 indirect permissions
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 01/18] mm: Introduce kpkeys Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 02/18] set_memory: Introduce set_memory_pkey() stub Kevin Brodsky
@ 2025-08-15  8:54 ` Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 04/18] arm64: Introduce por_elx_set_pkey_perms() helper Kevin Brodsky
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:54 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

In preparation for using POE inside the kernel, enable "Overlay
applied" for all stage 1 base permissions in PIR_EL1. This ensures
that the permissions set in POR_EL1 affect all kernel mappings.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/pgtable-prot.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 85dceb1c66f4..0ab4ac214b06 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -180,13 +180,13 @@ static inline bool __pure lpa2_is_enabled(void)
 	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_GCS),           PIE_NONE_O) | \
 	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_GCS_RO),        PIE_NONE_O) | \
 	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_EXECONLY),      PIE_NONE_O) | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_READONLY_EXEC), PIE_R)      | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_SHARED_EXEC),   PIE_RW)     | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_READONLY),      PIE_R)      | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_SHARED),        PIE_RW)     | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL_ROX),    PIE_RX)     | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL_EXEC),   PIE_RWX)    | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL_RO),     PIE_R)      | \
-	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL),        PIE_RW))
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_READONLY_EXEC), PIE_R_O)      | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_SHARED_EXEC),   PIE_RW_O)     | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_READONLY),      PIE_R_O)      | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_SHARED),        PIE_RW_O)     | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL_ROX),    PIE_RX_O)     | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL_EXEC),   PIE_RWX_O)    | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL_RO),     PIE_R_O)      | \
+	PIRx_ELx_PERM_PREP(pte_pi_index(_PAGE_KERNEL),        PIE_RW_O))
 
 #endif /* __ASM_PGTABLE_PROT_H */
-- 
2.47.0



* [RFC PATCH v5 04/18] arm64: Introduce por_elx_set_pkey_perms() helper
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (2 preceding siblings ...)
  2025-08-15  8:54 ` [RFC PATCH v5 03/18] arm64: mm: Enable overlays for all EL1 indirect permissions Kevin Brodsky
@ 2025-08-15  8:54 ` Kevin Brodsky
  2025-08-15  8:54 ` [RFC PATCH v5 05/18] arm64: Implement asm/kpkeys.h using POE Kevin Brodsky
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:54 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

Introduce a helper that sets the permissions of a given pkey
(POIndex) in the POR_ELx format, and make use of it in
arch_set_user_pkey_access().

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/por.h |  7 +++++++
 arch/arm64/mm/mmu.c          | 26 ++++++++++----------------
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/por.h b/arch/arm64/include/asm/por.h
index d913d5b529e4..bffb4d2b1246 100644
--- a/arch/arm64/include/asm/por.h
+++ b/arch/arm64/include/asm/por.h
@@ -31,4 +31,11 @@ static inline bool por_elx_allows_exec(u64 por, u8 pkey)
 	return perm & POE_X;
 }
 
+static inline u64 por_elx_set_pkey_perms(u64 por, u8 pkey, u64 perms)
+{
+	u64 shift = POR_ELx_PERM_SHIFT(pkey);
+
+	return (por & ~(POE_MASK << shift)) | (perms << shift);
+}
+
 #endif /* _ASM_ARM64_POR_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 34e5d78af076..e41ed9e0d799 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1597,8 +1597,8 @@ void __cpu_replace_ttbr1(pgd_t *pgdp, bool cnp)
 #ifdef CONFIG_ARCH_HAS_PKEYS
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val)
 {
-	u64 new_por;
-	u64 old_por;
+	u64 new_perms;
+	u64 por;
 
 	if (!system_supports_poe())
 		return -ENOSPC;
@@ -1612,25 +1612,19 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long i
 		return -EINVAL;
 
 	/* Set the bits we need in POR:  */
-	new_por = POE_RWX;
+	new_perms = POE_RWX;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_por &= ~POE_W;
+		new_perms &= ~POE_W;
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_por &= ~POE_RW;
+		new_perms &= ~POE_RW;
 	if (init_val & PKEY_DISABLE_READ)
-		new_por &= ~POE_R;
+		new_perms &= ~POE_R;
 	if (init_val & PKEY_DISABLE_EXECUTE)
-		new_por &= ~POE_X;
+		new_perms &= ~POE_X;
 
-	/* Shift the bits in to the correct place in POR for pkey: */
-	new_por = POR_ELx_PERM_PREP(pkey, new_por);
-
-	/* Get old POR and mask off any old bits in place: */
-	old_por = read_sysreg_s(SYS_POR_EL0);
-	old_por &= ~(POE_MASK << POR_ELx_PERM_SHIFT(pkey));
-
-	/* Write old part along with new part: */
-	write_sysreg_s(old_por | new_por, SYS_POR_EL0);
+	por = read_sysreg_s(SYS_POR_EL0);
+	por = por_elx_set_pkey_perms(por, pkey, new_perms);
+	write_sysreg_s(por, SYS_POR_EL0);
 
 	return 0;
 }
-- 
2.47.0



* [RFC PATCH v5 05/18] arm64: Implement asm/kpkeys.h using POE
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (3 preceding siblings ...)
  2025-08-15  8:54 ` [RFC PATCH v5 04/18] arm64: Introduce por_elx_set_pkey_perms() helper Kevin Brodsky
@ 2025-08-15  8:54 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 06/18] arm64: set_memory: Implement set_memory_pkey() Kevin Brodsky
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:54 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

Implement the kpkeys interface if CONFIG_ARM64_POE is enabled.
The permissions for KPKEYS_PKEY_DEFAULT (pkey 0) are set to RWX as
this pkey is also used for code mappings.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/kpkeys.h | 49 +++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 arch/arm64/include/asm/kpkeys.h

diff --git a/arch/arm64/include/asm/kpkeys.h b/arch/arm64/include/asm/kpkeys.h
new file mode 100644
index 000000000000..3b0ab5e7dd22
--- /dev/null
+++ b/arch/arm64/include/asm/kpkeys.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_KPKEYS_H
+#define __ASM_KPKEYS_H
+
+#include <asm/barrier.h>
+#include <asm/cpufeature.h>
+#include <asm/por.h>
+
+#include <asm-generic/kpkeys.h>
+
+static inline bool arch_kpkeys_enabled(void)
+{
+	return system_supports_poe();
+}
+
+#ifdef CONFIG_ARM64_POE
+
+static inline u64 por_set_kpkeys_level(u64 por, int level)
+{
+	por = por_elx_set_pkey_perms(por, KPKEYS_PKEY_DEFAULT, POE_RWX);
+
+	return por;
+}
+
+static __always_inline void __kpkeys_set_pkey_reg_nosync(u64 pkey_reg)
+{
+	write_sysreg_s(pkey_reg, SYS_POR_EL1);
+}
+
+static __always_inline int arch_kpkeys_set_level(int level)
+{
+	u64 prev_por = read_sysreg_s(SYS_POR_EL1);
+	u64 new_por = por_set_kpkeys_level(prev_por, level);
+
+	__kpkeys_set_pkey_reg_nosync(new_por);
+	isb();
+
+	return prev_por;
+}
+
+static __always_inline void arch_kpkeys_restore_pkey_reg(u64 pkey_reg)
+{
+	__kpkeys_set_pkey_reg_nosync(pkey_reg);
+	isb();
+}
+
+#endif /* CONFIG_ARM64_POE */
+
+#endif	/* __ASM_KPKEYS_H */
-- 
2.47.0



* [RFC PATCH v5 06/18] arm64: set_memory: Implement set_memory_pkey()
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (4 preceding siblings ...)
  2025-08-15  8:54 ` [RFC PATCH v5 05/18] arm64: Implement asm/kpkeys.h using POE Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 07/18] arm64: Reset POR_EL1 on exception entry Kevin Brodsky
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

Implement set_memory_pkey() using POE if supported.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/set_memory.h |  4 ++++
 arch/arm64/mm/pageattr.c            | 25 +++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
index 90f61b17275e..b6cd6de34abf 100644
--- a/arch/arm64/include/asm/set_memory.h
+++ b/arch/arm64/include/asm/set_memory.h
@@ -19,4 +19,8 @@ bool kernel_page_present(struct page *page);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
 
+#ifdef CONFIG_ARCH_HAS_KPKEYS
+int set_memory_pkey(unsigned long addr, int numpages, int pkey);
+#endif
+
 #endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 04d4a8f676db..41d87c2880fe 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -8,6 +8,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/sched.h>
 #include <linux/vmalloc.h>
+#include <linux/pkeys.h>
 
 #include <asm/cacheflush.h>
 #include <asm/pgtable-prot.h>
@@ -292,6 +293,30 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 	return set_memory_valid(addr, nr, valid);
 }
 
+#ifdef CONFIG_ARCH_HAS_KPKEYS
+int set_memory_pkey(unsigned long addr, int numpages, int pkey)
+{
+	unsigned long set_prot = 0;
+
+	if (!system_supports_poe())
+		return 0;
+
+	if (!__is_lm_address(addr))
+		return -EINVAL;
+
+	if (pkey >= arch_max_pkey())
+		return -EINVAL;
+
+	set_prot |= pkey & BIT(0) ? PTE_PO_IDX_0 : 0;
+	set_prot |= pkey & BIT(1) ? PTE_PO_IDX_1 : 0;
+	set_prot |= pkey & BIT(2) ? PTE_PO_IDX_2 : 0;
+
+	return __change_memory_common(addr, PAGE_SIZE * numpages,
+				      __pgprot(set_prot),
+				      __pgprot(PTE_PO_IDX_MASK));
+}
+#endif
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 /*
  * This is - apart from the return value - doing the same
-- 
2.47.0



* [RFC PATCH v5 07/18] arm64: Reset POR_EL1 on exception entry
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (5 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 06/18] arm64: set_memory: Implement set_memory_pkey() Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 08/18] arm64: Context-switch POR_EL1 Kevin Brodsky
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

POR_EL1 will be modified, through the kpkeys framework, in order to
grant temporary RW access to certain keys. If an exception occurs
in the middle of a "critical section" where POR_EL1 is set to a
privileged value, it is preferable to reset it to its default value
upon taking the exception to minimise the amount of code running at
higher kpkeys level.

This patch implements the reset of POR_EL1 on exception entry,
storing the original value in a new pt_regs field and restoring on
exception return. To avoid an expensive ISB, the register is only
reset if the interrupted value isn't the default. No check is made
on the return path as an ISB occurs anyway as part of ERET.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/kpkeys.h | 10 ++++++++++
 arch/arm64/include/asm/por.h    |  4 ++++
 arch/arm64/include/asm/ptrace.h |  4 ++++
 arch/arm64/kernel/asm-offsets.c |  3 +++
 arch/arm64/kernel/entry.S       | 24 +++++++++++++++++++++++-
 5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kpkeys.h b/arch/arm64/include/asm/kpkeys.h
index 3b0ab5e7dd22..79ae33388088 100644
--- a/arch/arm64/include/asm/kpkeys.h
+++ b/arch/arm64/include/asm/kpkeys.h
@@ -8,6 +8,14 @@
 
 #include <asm-generic/kpkeys.h>
 
+/*
+ * Equivalent to por_set_kpkeys_level(0, KPKEYS_LVL_DEFAULT), but can also be
+ * used in assembly.
+ */
+#define POR_EL1_INIT	POR_ELx_PERM_PREP(KPKEYS_PKEY_DEFAULT, POE_RWX)
+
+#ifndef __ASSEMBLY__
+
 static inline bool arch_kpkeys_enabled(void)
 {
 	return system_supports_poe();
@@ -46,4 +54,6 @@ static __always_inline void arch_kpkeys_restore_pkey_reg(u64 pkey_reg)
 
 #endif /* CONFIG_ARM64_POE */
 
+#endif	/* __ASSEMBLY__ */
+
 #endif	/* __ASM_KPKEYS_H */
diff --git a/arch/arm64/include/asm/por.h b/arch/arm64/include/asm/por.h
index bffb4d2b1246..58dce4b8021b 100644
--- a/arch/arm64/include/asm/por.h
+++ b/arch/arm64/include/asm/por.h
@@ -10,6 +10,8 @@
 
 #define POR_EL0_INIT	POR_ELx_PERM_PREP(0, POE_RWX)
 
+#ifndef __ASSEMBLY__
+
 static inline bool por_elx_allows_read(u64 por, u8 pkey)
 {
 	u8 perm = POR_ELx_PERM_GET(pkey, por);
@@ -38,4 +40,6 @@ static inline u64 por_elx_set_pkey_perms(u64 por, u8 pkey, u64 perms)
 	return (por & ~(POE_MASK << shift)) | (perms << shift);
 }
 
+#endif	/* __ASSEMBLY__ */
+
 #endif /* _ASM_ARM64_POR_H */
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index 47ff8654c5ec..e907df4225d4 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -166,6 +166,10 @@ struct pt_regs {
 	u64 orig_x0;
 	s32 syscallno;
 	u32 pmr;
+#ifdef CONFIG_ARM64_POE
+	u64 por_el1;
+	u64 __unused;
+#endif
 
 	u64 sdei_ttbr1;
 	struct frame_record_meta stackframe;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 30d4bbe68661..8ae5cc3c203b 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -75,6 +75,9 @@ int main(void)
   DEFINE(S_SYSCALLNO,		offsetof(struct pt_regs, syscallno));
   DEFINE(S_SDEI_TTBR1,		offsetof(struct pt_regs, sdei_ttbr1));
   DEFINE(S_PMR,			offsetof(struct pt_regs, pmr));
+#ifdef CONFIG_ARM64_POE
+  DEFINE(S_POR_EL1,		offsetof(struct pt_regs, por_el1));
+#endif
   DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
   DEFINE(S_STACKFRAME_TYPE,	offsetof(struct pt_regs, stackframe.type));
   DEFINE(PT_REGS_SIZE,		sizeof(struct pt_regs));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index f8018b5c1f9a..0dd6f7fbb669 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -20,6 +20,7 @@
 #include <asm/errno.h>
 #include <asm/esr.h>
 #include <asm/irq.h>
+#include <asm/kpkeys.h>
 #include <asm/memory.h>
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -277,6 +278,19 @@ alternative_else_nop_endif
 	.else
 	add	x21, sp, #PT_REGS_SIZE
 	get_current_task tsk
+#ifdef CONFIG_ARM64_POE
+alternative_if_not ARM64_HAS_S1POE
+	b	1f
+alternative_else_nop_endif
+	mrs_s	x0, SYS_POR_EL1
+	str	x0, [sp, #S_POR_EL1]
+	mov	x1, #POR_EL1_INIT
+	cmp	x0, x1
+	b.eq	1f
+	msr_s	SYS_POR_EL1, x1
+	isb
+1:
+#endif /* CONFIG_ARM64_POE */
 	.endif /* \el == 0 */
 	mrs	x22, elr_el1
 	mrs	x23, spsr_el1
@@ -407,7 +421,15 @@ alternative_else_nop_endif
 	mte_set_user_gcr tsk, x0, x1
 
 	apply_ssbd 0, x0, x1
-	.endif
+	.else
+#ifdef CONFIG_ARM64_POE
+alternative_if ARM64_HAS_S1POE
+	ldr	x0, [sp, #S_POR_EL1]
+	msr_s	SYS_POR_EL1, x0
+	/* No explicit ISB; we rely on ERET */
+alternative_else_nop_endif
+#endif /* CONFIG_ARM64_POE */
+	.endif /* \el == 0 */
 
 	msr	elr_el1, x21			// set up the return data
 	msr	spsr_el1, x22
-- 
2.47.0



* [RFC PATCH v5 08/18] arm64: Context-switch POR_EL1
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (6 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 07/18] arm64: Reset POR_EL1 on exception entry Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 09/18] arm64: Enable kpkeys Kevin Brodsky
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

POR_EL1 is about to be used by the kpkeys framework, which modifies
it for (typically small) sections of code. If an exception occurs
during that window and scheduling occurs, we must ensure that
POR_EL1 is context-switched as needed (saving the old value and
restoring the new one). An ISB is needed to ensure the write takes
effect, so we skip it if the new value is the same as the old, like
for POR_EL0.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/processor.h | 1 +
 arch/arm64/kernel/process.c        | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 61d62bfd5a7b..9340e94a27f6 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -187,6 +187,7 @@ struct thread_struct {
 	u64			svcr;
 	u64			tpidr2_el0;
 	u64			por_el0;
+	u64			por_el1;
 #ifdef CONFIG_ARM64_GCS
 	unsigned int		gcs_el0_mode;
 	unsigned int		gcs_el0_locked;
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 96482a1412c6..f698839f018f 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -428,6 +428,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 
 	ptrauth_thread_init_kernel(p);
 
+	if (system_supports_poe())
+		p->thread.por_el1 = read_sysreg_s(SYS_POR_EL1);
+
 	if (likely(!args->fn)) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
@@ -678,6 +681,12 @@ static void permission_overlay_switch(struct task_struct *next)
 		 * of POR_EL0.
 		 */
 	}
+
+	current->thread.por_el1 = read_sysreg_s(SYS_POR_EL1);
+	if (current->thread.por_el1 != next->thread.por_el1) {
+		write_sysreg_s(next->thread.por_el1, SYS_POR_EL1);
+		isb();
+	}
 }
 
 /*
-- 
2.47.0



* [RFC PATCH v5 09/18] arm64: Enable kpkeys
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (7 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 08/18] arm64: Context-switch POR_EL1 Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 10/18] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

This is the final step to enable kpkeys on arm64. We enable
POE at EL1 by setting TCR2_EL1.POE, and initialise POR_EL1 to the
default value, enabling access to the default pkey/POIndex (0).
An ISB is added so that POE restrictions are enforced immediately.

Having done this, we can now select ARCH_HAS_KPKEYS if ARM64_POE is
enabled.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/Kconfig             | 1 +
 arch/arm64/kernel/cpufeature.c | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e9bbfacc35a6..88b544244829 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2187,6 +2187,7 @@ config ARM64_POE
 	def_bool y
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_KPKEYS
 	help
 	  The Permission Overlay Extension is used to implement Memory
 	  Protection Keys. Memory Protection Keys provides a mechanism for
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9ad065f15f1d..4a631115341a 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -76,6 +76,7 @@
 #include <linux/kasan.h>
 #include <linux/percpu.h>
 #include <linux/sched/isolation.h>
+#include <linux/kpkeys.h>
 
 #include <asm/cpu.h>
 #include <asm/cpufeature.h>
@@ -2458,8 +2459,10 @@ static void cpu_enable_mops(const struct arm64_cpu_capabilities *__unused)
 #ifdef CONFIG_ARM64_POE
 static void cpu_enable_poe(const struct arm64_cpu_capabilities *__unused)
 {
-	sysreg_clear_set(REG_TCR2_EL1, 0, TCR2_EL1_E0POE);
+	write_sysreg_s(POR_EL1_INIT, SYS_POR_EL1);
+	sysreg_clear_set(REG_TCR2_EL1, 0, TCR2_EL1_E0POE | TCR2_EL1_POE);
 	sysreg_clear_set(CPACR_EL1, 0, CPACR_EL1_E0POE);
+	isb();
 }
 #endif
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 10/18] mm: Introduce kernel_pgtables_set_pkey()
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (8 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 09/18] arm64: Enable kpkeys Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 11/18] mm: Introduce kpkeys_hardened_pgtables Kevin Brodsky
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

kernel_pgtables_set_pkey() allows setting the pkey of all page table
pages (PTPs) in swapper_pg_dir, recursively. This will be needed by
kpkeys_hardened_pgtables, as it relies on all PTPs being mapped with
a non-default pkey. Those initial kernel page tables cannot
practically be assigned a non-default pkey right when they are
allocated, so mutating them during (early) boot is required.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/linux/mm.h |   2 +
 mm/memory.c        | 137 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..f4dd96f3db91 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4134,6 +4134,8 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
+int kernel_pgtables_set_pkey(int pkey);
+
 
 /*
  * mseal of userspace process's system mappings.
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..4f144abf5fc3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,8 @@
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
+#include <linux/kpkeys.h>
+#include <linux/set_memory.h>
 
 #include <trace/events/kmem.h>
 
@@ -7183,3 +7185,138 @@ void vma_pgtable_walk_end(struct vm_area_struct *vma)
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_vma_unlock_read(vma);
 }
+
+static int __init set_page_pkey(void *p, int pkey)
+{
+	unsigned long addr = (unsigned long)p;
+
+	/*
+	 * swapper_pg_dir itself will be made read-only by mark_rodata_ro()
+	 * so there is no point in changing its pkey.
+	 */
+	if (p == swapper_pg_dir)
+		return 0;
+
+	return set_memory_pkey(addr, 1, pkey);
+}
+
+static int __init set_pkey_pte(pmd_t *pmd, int pkey)
+{
+	pte_t *pte;
+	int err;
+
+	pte = pte_offset_kernel(pmd, 0);
+	err = set_page_pkey(pte, pkey);
+
+	return err;
+}
+
+static int __init set_pkey_pmd(pud_t *pud, int pkey)
+{
+	pmd_t *pmd;
+	int i, err = 0;
+
+	pmd = pmd_offset(pud, 0);
+
+	err = set_page_pkey(pmd, pkey);
+	if (err)
+		return err;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd_none(pmd[i]) || pmd_bad(pmd[i]) || pmd_leaf(pmd[i]))
+			continue;
+		err = set_pkey_pte(&pmd[i], pkey);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+static int __init set_pkey_pud(p4d_t *p4d, int pkey)
+{
+	pud_t *pud;
+	int i, err = 0;
+
+	if (mm_pmd_folded(&init_mm))
+		return set_pkey_pmd((pud_t *)p4d, pkey);
+
+	pud = pud_offset(p4d, 0);
+
+	err = set_page_pkey(pud, pkey);
+	if (err)
+		return err;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (pud_none(pud[i]) || pud_bad(pud[i]) || pud_leaf(pud[i]))
+			continue;
+		err = set_pkey_pmd(&pud[i], pkey);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+static int __init set_pkey_p4d(pgd_t *pgd, int pkey)
+{
+	p4d_t *p4d;
+	int i, err = 0;
+
+	if (mm_pud_folded(&init_mm))
+		return set_pkey_pud((p4d_t *)pgd, pkey);
+
+	p4d = p4d_offset(pgd, 0);
+
+	err = set_page_pkey(p4d, pkey);
+	if (err)
+		return err;
+
+	for (i = 0; i < PTRS_PER_P4D; i++) {
+		if (p4d_none(p4d[i]) || p4d_bad(p4d[i]) || p4d_leaf(p4d[i]))
+			continue;
+		err = set_pkey_pud(&p4d[i], pkey);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+/**
+ * kernel_pgtables_set_pkey - set pkey for all kernel page table pages
+ * @pkey: pkey to set the page table pages to
+ *
+ * Walks swapper_pg_dir setting the protection key of every page table page (at
+ * all levels) to @pkey. swapper_pg_dir itself is left untouched as it is
+ * expected to be mapped read-only by mark_rodata_ro().
+ *
+ * No-op if the architecture does not support kpkeys.
+ */
+int __init kernel_pgtables_set_pkey(int pkey)
+{
+	pgd_t *pgd = swapper_pg_dir;
+	int i, err = 0;
+
+	if (!arch_kpkeys_enabled())
+		return 0;
+
+	spin_lock(&init_mm.page_table_lock);
+
+	if (mm_p4d_folded(&init_mm)) {
+		err = set_pkey_p4d(pgd, pkey);
+		goto out;
+	}
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		if (pgd_none(pgd[i]) || pgd_bad(pgd[i]) || pgd_leaf(pgd[i]))
+			continue;
+		err = set_pkey_p4d(&pgd[i], pkey);
+		if (err)
+			break;
+	}
+
+out:
+	spin_unlock(&init_mm.page_table_lock);
+	return err;
+}
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 11/18] mm: Introduce kpkeys_hardened_pgtables
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (9 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 10/18] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 12/18] mm: Allow __pagetable_ctor() to fail Kevin Brodsky
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

kpkeys_hardened_pgtables is a hardening feature based on kpkeys. It
aims to prevent the corruption of page tables by: 1. mapping all
page table pages, both kernel and user, with a privileged pkey
(KPKEYS_PKEY_PGTABLES), and 2. granting write access to that pkey
only when running at a higher kpkeys level (KPKEYS_LVL_PGTABLES).

The feature is exposed as CONFIG_KPKEYS_HARDENED_PGTABLES; it
requires explicit architecture opt-in by selecting
ARCH_HAS_KPKEYS_HARDENED_PGTABLES, since much of the page table
handling is arch-specific.

This patch introduces an API to modify the PTPs' pkey. Because this
API is going to be called from low-level pgtable helpers, it must
be inactive on boot and explicitly switched on if and when kpkeys
become available. A static key is used for that purpose; it is the
responsibility of each architecture supporting
kpkeys_hardened_pgtables to call kpkeys_hardened_pgtables_enable()
as early as possible to switch on that static key. The initial
kernel page tables are also walked to set their pkey, since they
have already been allocated at that point.
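
As an illustration of the expected hook-up (a sketch based on the
arm64 wiring added later in this series, in patch 16):

	void __init smp_prepare_boot_cpu(void)
	{
		...
		/* kpkeys are usable once boot CPU features are set up */
		kpkeys_hardened_pgtables_enable();
	}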

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/asm-generic/kpkeys.h  |  4 +++
 include/linux/kpkeys.h        | 46 ++++++++++++++++++++++++++++++++++-
 mm/Kconfig                    |  3 +++
 mm/Makefile                   |  1 +
 mm/kpkeys_hardened_pgtables.c | 44 +++++++++++++++++++++++++++++++++
 security/Kconfig.hardening    | 12 +++++++++
 6 files changed, 109 insertions(+), 1 deletion(-)
 create mode 100644 mm/kpkeys_hardened_pgtables.c

diff --git a/include/asm-generic/kpkeys.h b/include/asm-generic/kpkeys.h
index ab819f157d6a..cec92334a9f3 100644
--- a/include/asm-generic/kpkeys.h
+++ b/include/asm-generic/kpkeys.h
@@ -2,6 +2,10 @@
 #ifndef __ASM_GENERIC_KPKEYS_H
 #define __ASM_GENERIC_KPKEYS_H
 
+#ifndef KPKEYS_PKEY_PGTABLES
+#define KPKEYS_PKEY_PGTABLES	1
+#endif
+
 #ifndef KPKEYS_PKEY_DEFAULT
 #define KPKEYS_PKEY_DEFAULT	0
 #endif
diff --git a/include/linux/kpkeys.h b/include/linux/kpkeys.h
index faa6e2615798..5f4b096374ba 100644
--- a/include/linux/kpkeys.h
+++ b/include/linux/kpkeys.h
@@ -4,11 +4,15 @@
 
 #include <linux/bug.h>
 #include <linux/cleanup.h>
+#include <linux/jump_label.h>
+
+struct folio;
 
 #define KPKEYS_LVL_DEFAULT	0
+#define KPKEYS_LVL_PGTABLES	1
 
 #define KPKEYS_LVL_MIN		KPKEYS_LVL_DEFAULT
-#define KPKEYS_LVL_MAX		KPKEYS_LVL_DEFAULT
+#define KPKEYS_LVL_MAX		KPKEYS_LVL_PGTABLES
 
 #define __KPKEYS_GUARD(name, set_level, restore_pkey_reg, set_arg, ...)	\
 	__DEFINE_CLASS_IS_CONDITIONAL(name, false);			\
@@ -110,4 +114,44 @@ static inline bool arch_kpkeys_enabled(void)
 
 #endif /* CONFIG_ARCH_HAS_KPKEYS */
 
+#ifdef CONFIG_KPKEYS_HARDENED_PGTABLES
+
+DECLARE_STATIC_KEY_FALSE(kpkeys_hardened_pgtables_key);
+
+static inline bool kpkeys_hardened_pgtables_enabled(void)
+{
+	return static_branch_unlikely(&kpkeys_hardened_pgtables_key);
+}
+
+int kpkeys_protect_pgtable_memory(struct folio *folio);
+int kpkeys_unprotect_pgtable_memory(struct folio *folio);
+
+/*
+ * Enables kpkeys_hardened_pgtables and switches existing kernel page tables to
+ * a privileged pkey (KPKEYS_PKEY_PGTABLES).
+ *
+ * Should be called as early as possible by architecture code, after (k)pkeys
+ * are initialised and before any user task is spawned.
+ */
+void kpkeys_hardened_pgtables_enable(void);
+
+#else /* CONFIG_KPKEYS_HARDENED_PGTABLES */
+
+static inline bool kpkeys_hardened_pgtables_enabled(void)
+{
+	return false;
+}
+
+static inline int kpkeys_protect_pgtable_memory(struct folio *folio)
+{
+	return 0;
+}
+static inline int kpkeys_unprotect_pgtable_memory(struct folio *folio)
+{
+	return 0;
+}
+static inline void kpkeys_hardened_pgtables_enable(void) {}
+
+#endif /* CONFIG_KPKEYS_HARDENED_PGTABLES */
+
 #endif /* _LINUX_KPKEYS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 90f2e5c381a6..e34edf5c41e7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1175,6 +1175,9 @@ config ARCH_HAS_PKEYS
 	bool
 config ARCH_HAS_KPKEYS
 	bool
+# ARCH_HAS_KPKEYS must be selected when selecting this option
+config ARCH_HAS_KPKEYS_HARDENED_PGTABLES
+	bool
 
 config ARCH_USES_PG_ARCH_2
 	bool
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..10848df0ca85 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_KPKEYS_HARDENED_PGTABLES) += kpkeys_hardened_pgtables.o
diff --git a/mm/kpkeys_hardened_pgtables.c b/mm/kpkeys_hardened_pgtables.c
new file mode 100644
index 000000000000..931fa97bc8a7
--- /dev/null
+++ b/mm/kpkeys_hardened_pgtables.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/mm.h>
+#include <linux/kpkeys.h>
+#include <linux/set_memory.h>
+
+DEFINE_STATIC_KEY_FALSE(kpkeys_hardened_pgtables_key);
+
+int kpkeys_protect_pgtable_memory(struct folio *folio)
+{
+	unsigned long addr = (unsigned long)folio_address(folio);
+	unsigned int order = folio_order(folio);
+	int ret = 0;
+
+	if (kpkeys_hardened_pgtables_enabled())
+		ret = set_memory_pkey(addr, 1 << order, KPKEYS_PKEY_PGTABLES);
+
+	WARN_ON(ret);
+	return ret;
+}
+
+int kpkeys_unprotect_pgtable_memory(struct folio *folio)
+{
+	unsigned long addr = (unsigned long)folio_address(folio);
+	unsigned int order = folio_order(folio);
+	int ret = 0;
+
+	if (kpkeys_hardened_pgtables_enabled())
+		ret = set_memory_pkey(addr, 1 << order, KPKEYS_PKEY_DEFAULT);
+
+	WARN_ON(ret);
+	return ret;
+}
+
+void __init kpkeys_hardened_pgtables_enable(void)
+{
+	int ret;
+
+	if (!arch_kpkeys_enabled())
+		return;
+
+	static_branch_enable(&kpkeys_hardened_pgtables_key);
+	ret = kernel_pgtables_set_pkey(KPKEYS_PKEY_PGTABLES);
+	WARN_ON(ret);
+}
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index b9a5bc3430aa..41b7530530b7 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -265,6 +265,18 @@ config BUG_ON_DATA_CORRUPTION
 
 	  If unsure, say N.
 
+config KPKEYS_HARDENED_PGTABLES
+	bool "Harden page tables using kernel pkeys"
+	depends on ARCH_HAS_KPKEYS_HARDENED_PGTABLES
+	help
+	  This option makes all page tables mostly read-only by
+	  allocating them with a non-default protection key (pkey) and
+	  only enabling write access to that pkey in routines that are
+	  expected to write to page table entries.
+
+	  This option has no effect if the system does not support
+	  kernel pkeys.
+
 endmenu
 
 config CC_HAS_RANDSTRUCT
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 12/18] mm: Allow __pagetable_ctor() to fail
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (10 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 11/18] mm: Introduce kpkeys_hardened_pgtables Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey Kevin Brodsky
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

In preparation for adding construction hooks (that may fail) to
__pagetable_ctor(), make __pagetable_ctor() return a bool,
propagate it to pagetable_*_ctor() and handle failure in
the generic {pud,p4d,pgd}_alloc.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/asm-generic/pgalloc.h | 15 ++++++++++++---
 include/linux/mm.h            | 21 ++++++++++-----------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..3e184f3ca37a 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -178,7 +178,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
 	if (!ptdesc)
 		return NULL;
 
-	pagetable_pud_ctor(ptdesc);
+	if (!pagetable_pud_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 	return ptdesc_address(ptdesc);
 }
 #define __pud_alloc_one(...)	alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -232,7 +235,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
 	if (!ptdesc)
 		return NULL;
 
-	pagetable_p4d_ctor(ptdesc);
+	if (!pagetable_p4d_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 	return ptdesc_address(ptdesc);
 }
 #define __p4d_alloc_one(...)	alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -276,7 +282,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
 	if (!ptdesc)
 		return NULL;
 
-	pagetable_pgd_ctor(ptdesc);
+	if (!pagetable_pgd_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 	return ptdesc_address(ptdesc);
 }
 #define __pgd_alloc(...)	alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f4dd96f3db91..d9371d992033 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2973,12 +2973,13 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */
 
-static inline void __pagetable_ctor(struct ptdesc *ptdesc)
+static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
 {
 	struct folio *folio = ptdesc_folio(ptdesc);
 
 	__folio_set_pgtable(folio);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
+	return true;
 }
 
 static inline void pagetable_dtor(struct ptdesc *ptdesc)
@@ -3001,8 +3002,7 @@ static inline bool pagetable_pte_ctor(struct mm_struct *mm,
 {
 	if (mm != &init_mm && !ptlock_init(ptdesc))
 		return false;
-	__pagetable_ctor(ptdesc);
-	return true;
+	return __pagetable_ctor(ptdesc);
 }
 
 pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
@@ -3109,8 +3109,7 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
 	if (mm != &init_mm && !pmd_ptlock_init(ptdesc))
 		return false;
 	ptdesc_pmd_pts_init(ptdesc);
-	__pagetable_ctor(ptdesc);
-	return true;
+	return __pagetable_ctor(ptdesc);
 }
 
 /*
@@ -3132,19 +3131,19 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
-static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pud_ctor(struct ptdesc *ptdesc)
 {
-	__pagetable_ctor(ptdesc);
+	return __pagetable_ctor(ptdesc);
 }
 
-static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_p4d_ctor(struct ptdesc *ptdesc)
 {
-	__pagetable_ctor(ptdesc);
+	return __pagetable_ctor(ptdesc);
 }
 
-static inline void pagetable_pgd_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pgd_ctor(struct ptdesc *ptdesc)
 {
-	__pagetable_ctor(ptdesc);
+	return __pagetable_ctor(ptdesc);
 }
 
 extern void __init pagecache_init(void);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (11 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 12/18] mm: Allow __pagetable_ctor() to fail Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15 16:37   ` Edgecombe, Rick P
  2025-08-15  8:55 ` [RFC PATCH v5 14/18] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES Kevin Brodsky
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
page tables can only be written under guard(kpkeys_hardened_pgtables).

This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
(default).
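
For illustration, a page table write then takes roughly the following
form (this is how the arm64 helpers are annotated later in this
series, in patch 15; the guard is a no-op when the feature is
disabled or kpkeys are unavailable):

	static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
	{
		guard(kpkeys_hardened_pgtables)();
		WRITE_ONCE(*ptep, pte);
	}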

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/linux/mm.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d9371d992033..4880cb7a4cb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -34,6 +34,7 @@
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
 #include <linux/rcuwait.h>
+#include <linux/kpkeys.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
 
 	__folio_set_pgtable(folio);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
+	if (kpkeys_protect_pgtable_memory(folio))
+		return false;
 	return true;
 }
 
@@ -2989,6 +2992,7 @@ static inline void pagetable_dtor(struct ptdesc *ptdesc)
 	ptlock_free(ptdesc);
 	__folio_clear_pgtable(folio);
 	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+	kpkeys_unprotect_pgtable_memory(folio);
 }
 
 static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 14/18] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (12 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 15/18] arm64: mm: Guard page table writes with kpkeys Kevin Brodsky
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

Enable RW access to KPKEYS_PKEY_PGTABLES (used to map page table
pages) if switching to KPKEYS_LVL_PGTABLES, otherwise only grant RO
access.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/kpkeys.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kpkeys.h b/arch/arm64/include/asm/kpkeys.h
index 79ae33388088..64d6e22740ec 100644
--- a/arch/arm64/include/asm/kpkeys.h
+++ b/arch/arm64/include/asm/kpkeys.h
@@ -12,7 +12,8 @@
  * Equivalent to por_set_kpkeys_level(0, KPKEYS_LVL_DEFAULT), but can also be
  * used in assembly.
  */
-#define POR_EL1_INIT	POR_ELx_PERM_PREP(KPKEYS_PKEY_DEFAULT, POE_RWX)
+#define POR_EL1_INIT	(POR_ELx_PERM_PREP(KPKEYS_PKEY_DEFAULT, POE_RWX) | \
+			 POR_ELx_PERM_PREP(KPKEYS_PKEY_PGTABLES, POE_R))
 
 #ifndef __ASSEMBLY__
 
@@ -26,6 +27,8 @@ static inline bool arch_kpkeys_enabled(void)
 static inline u64 por_set_kpkeys_level(u64 por, int level)
 {
 	por = por_elx_set_pkey_perms(por, KPKEYS_PKEY_DEFAULT, POE_RWX);
+	por = por_elx_set_pkey_perms(por, KPKEYS_PKEY_PGTABLES,
+				     level == KPKEYS_LVL_PGTABLES ? POE_RW : POE_R);
 
 	return por;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 15/18] arm64: mm: Guard page table writes with kpkeys
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (13 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 14/18] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 16/18] arm64: Enable kpkeys_hardened_pgtables support Kevin Brodsky
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

When CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, page tables (both
user and kernel) are mapped with a privileged pkey in the linear
mapping. As a result, they can only be written at an elevated kpkeys
level.

Introduce a kpkeys guard that sets POR_EL1 appropriately to allow
writing to page tables, and use this guard wherever necessary. The
scope is kept as small as possible, so that POR_EL1 is quickly reset
to its default value. Where atomics are involved, the guard's scope
encompasses the whole loop to avoid switching POR_EL1 unnecessarily.

This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
(default).

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 22 +++++++++++++++++++++-
 arch/arm64/mm/fault.c            |  2 ++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index abd2dee416b3..1694fb839854 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -39,6 +39,14 @@
 #include <linux/mm_types.h>
 #include <linux/sched.h>
 #include <linux/page_table_check.h>
+#include <linux/kpkeys.h>
+
+#ifdef CONFIG_KPKEYS_HARDENED_PGTABLES
+KPKEYS_GUARD_COND(kpkeys_hardened_pgtables, KPKEYS_LVL_PGTABLES,
+		  kpkeys_hardened_pgtables_enabled())
+#else
+KPKEYS_GUARD_NOOP(kpkeys_hardened_pgtables)
+#endif
 
 static inline void emit_pte_barriers(void)
 {
@@ -390,6 +398,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
 {
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*ptep, pte);
 }
 
@@ -858,6 +867,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 	}
 #endif /* __PAGETABLE_PMD_FOLDED */
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*pmdp, pmd);
 
 	if (pmd_valid(pmd))
@@ -918,6 +928,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
 		return;
 	}
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*pudp, pud);
 
 	if (pud_valid(pud))
@@ -999,6 +1010,7 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 		return;
 	}
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*p4dp, p4d);
 	queue_pte_barriers();
 }
@@ -1127,6 +1139,7 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 		return;
 	}
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*pgdp, pgd);
 	queue_pte_barriers();
 }
@@ -1316,6 +1329,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 {
 	pte_t old_pte, pte;
 
+	guard(kpkeys_hardened_pgtables)();
 	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
@@ -1363,7 +1377,10 @@ static inline pte_t __ptep_get_and_clear_anysz(struct mm_struct *mm,
 					       pte_t *ptep,
 					       unsigned long pgsize)
 {
-	pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
+	pte_t pte;
+
+	scoped_guard(kpkeys_hardened_pgtables)
+		pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
 
 	switch (pgsize) {
 	case PAGE_SIZE:
@@ -1436,6 +1453,7 @@ static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
 {
 	pte_t old_pte;
 
+	guard(kpkeys_hardened_pgtables)();
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -1469,6 +1487,7 @@ static inline void __clear_young_dirty_pte(struct vm_area_struct *vma,
 {
 	pte_t old_pte;
 
+	guard(kpkeys_hardened_pgtables)();
 	do {
 		old_pte = pte;
 
@@ -1516,6 +1535,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmdp, pmd_t pmd)
 {
 	page_table_check_pmd_set(vma->vm_mm, pmdp, pmd);
+	guard(kpkeys_hardened_pgtables)();
 	return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d816ff44faff..c4ab361bba72 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -214,6 +214,8 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 	if (pte_same(pte, entry))
 		return 0;
 
+	guard(kpkeys_hardened_pgtables)();
+
 	/* only preserve the access flags and write permission */
 	pte_val(entry) &= PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 16/18] arm64: Enable kpkeys_hardened_pgtables support
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (14 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 15/18] arm64: mm: Guard page table writes with kpkeys Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 17/18] mm: Add basic tests for kpkeys_hardened_pgtables Kevin Brodsky
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

kpkeys_hardened_pgtables should be enabled as early as possible (if
selected). It does, however, require kpkeys to be available, which on
arm64 means POE being detected and enabled. POE is a boot
feature, so calling kpkeys_hardened_pgtables_enable() just after
setup_boot_cpu_features() in smp_prepare_boot_cpu() is the best we
can do.

With that done, all the bits are in place and we can advertise
support for kpkeys_hardened_pgtables by selecting
ARCH_HAS_KPKEYS_HARDENED_PGTABLES if ARM64_POE is enabled.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/Kconfig      | 1 +
 arch/arm64/kernel/smp.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 88b544244829..6c77f446ab09 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2188,6 +2188,7 @@ config ARM64_POE
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
 	select ARCH_HAS_KPKEYS
+	select ARCH_HAS_KPKEYS_HARDENED_PGTABLES
 	help
 	  The Permission Overlay Extension is used to implement Memory
 	  Protection Keys. Memory Protection Keys provides a mechanism for
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 68cea3a4a35c..04ce6b2fd884 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <linux/kgdb.h>
 #include <linux/kvm_host.h>
 #include <linux/nmi.h>
+#include <linux/kpkeys.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -460,6 +461,7 @@ void __init smp_prepare_boot_cpu(void)
 	if (system_uses_irq_prio_masking())
 		init_gic_priority_masking();
 
+	kpkeys_hardened_pgtables_enable();
 	kasan_init_hw_tags();
 	/* Init percpu seeds for random tags after cpus are set up. */
 	kasan_init_sw_tags();
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 17/18] mm: Add basic tests for kpkeys_hardened_pgtables
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (15 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 16/18] arm64: Enable kpkeys_hardened_pgtables support Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-15  8:55 ` [RFC PATCH v5 18/18] arm64: mm: Batch kpkeys level switches Kevin Brodsky
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

Add basic tests for the kpkeys_hardened_pgtables feature: try to
perform direct writes to kernel and user page table entries and
ensure that they fail.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 mm/Makefile                               |   1 +
 mm/tests/kpkeys_hardened_pgtables_kunit.c | 106 ++++++++++++++++++++++
 security/Kconfig.hardening                |  12 +++
 3 files changed, 119 insertions(+)
 create mode 100644 mm/tests/kpkeys_hardened_pgtables_kunit.c

diff --git a/mm/Makefile b/mm/Makefile
index 10848df0ca85..b1e6cf7f753c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -148,3 +148,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
 obj-$(CONFIG_KPKEYS_HARDENED_PGTABLES) += kpkeys_hardened_pgtables.o
+obj-$(CONFIG_KPKEYS_HARDENED_PGTABLES_KUNIT_TEST) += tests/kpkeys_hardened_pgtables_kunit.o
diff --git a/mm/tests/kpkeys_hardened_pgtables_kunit.c b/mm/tests/kpkeys_hardened_pgtables_kunit.c
new file mode 100644
index 000000000000..3d916f0719d0
--- /dev/null
+++ b/mm/tests/kpkeys_hardened_pgtables_kunit.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <kunit/test.h>
+#include <linux/mman.h>
+#include <linux/pgtable.h>
+#include <linux/vmalloc.h>
+
+KUNIT_DEFINE_ACTION_WRAPPER(vfree_wrapper, vfree, const void *);
+
+static inline pte_t *get_kernel_pte(unsigned long addr)
+{
+	pmd_t *pmdp = pmd_off_k(addr);
+
+	if (!pmdp || pmd_leaf(*pmdp))
+		return NULL;
+
+	return pte_offset_kernel(pmdp, addr);
+}
+
+static void write_linear_map_pte(struct kunit *test)
+{
+	pte_t *ptep;
+	pte_t pte;
+	int ret;
+
+	if (!arch_kpkeys_enabled())
+		kunit_skip(test, "kpkeys are not supported");
+
+	/*
+	 * The choice of address is mostly arbitrary - we just need something
+	 * that is PTE-mapped, such as a global variable.
+	 */
+	ptep = get_kernel_pte((unsigned long)&init_mm);
+	KUNIT_ASSERT_NOT_NULL_MSG(test, ptep, "Failed to get PTE");
+
+	pte = ptep_get(ptep);
+	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
+	ret = copy_to_kernel_nofault(ptep, &pte, sizeof(pte));
+	KUNIT_EXPECT_EQ_MSG(test, ret, -EFAULT,
+			    "Direct PTE write wasn't prevented");
+}
+
+static void write_kernel_vmalloc_pte(struct kunit *test)
+{
+	void *mem;
+	pte_t *ptep;
+	pte_t pte;
+	int ret;
+
+	if (!arch_kpkeys_enabled())
+		kunit_skip(test, "kpkeys are not supported");
+
+	mem = vmalloc(PAGE_SIZE);
+	KUNIT_ASSERT_NOT_NULL(test, mem);
+	ret = kunit_add_action_or_reset(test, vfree_wrapper, mem);
+	KUNIT_ASSERT_EQ(test, ret, 0);
+
+	ptep = get_kernel_pte((unsigned long)mem);
+	KUNIT_ASSERT_NOT_NULL_MSG(test, ptep, "Failed to get PTE");
+
+	pte = ptep_get(ptep);
+	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
+	ret = copy_to_kernel_nofault(ptep, &pte, sizeof(pte));
+	KUNIT_EXPECT_EQ_MSG(test, ret, -EFAULT,
+			    "Direct PTE write wasn't prevented");
+}
+
+static void write_user_pmd(struct kunit *test)
+{
+	pmd_t *pmdp;
+	pmd_t pmd;
+	unsigned long uaddr;
+	int ret;
+
+	if (!arch_kpkeys_enabled())
+		kunit_skip(test, "kpkeys are not supported");
+
+	uaddr = kunit_vm_mmap(test, NULL, 0, PAGE_SIZE, PROT_READ,
+			      MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, 0);
+	KUNIT_ASSERT_NE_MSG(test, uaddr, 0, "Could not create userspace mm");
+
+	/* We passed MAP_POPULATE so a PMD should already be allocated */
+	pmdp = pmd_off(current->mm, uaddr);
+	KUNIT_ASSERT_NOT_NULL_MSG(test, pmdp, "Failed to get PMD");
+
+	pmd = pmdp_get(pmdp);
+	pmd = set_pmd_bit(pmd, __pgprot(PROT_SECT_NORMAL));
+	ret = copy_to_kernel_nofault(pmdp, &pmd, sizeof(pmd));
+	KUNIT_EXPECT_EQ_MSG(test, ret, -EFAULT,
+			    "Direct PMD write wasn't prevented");
+}
+
+static struct kunit_case kpkeys_hardened_pgtables_test_cases[] = {
+	KUNIT_CASE(write_linear_map_pte),
+	KUNIT_CASE(write_kernel_vmalloc_pte),
+	KUNIT_CASE(write_user_pmd),
+	{}
+};
+
+static struct kunit_suite kpkeys_hardened_pgtables_test_suite = {
+	.name = "Hardened pgtables using kpkeys",
+	.test_cases = kpkeys_hardened_pgtables_test_cases,
+};
+kunit_test_suite(kpkeys_hardened_pgtables_test_suite);
+
+MODULE_DESCRIPTION("Tests for the kpkeys_hardened_pgtables feature");
+MODULE_LICENSE("GPL");
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index 41b7530530b7..653663008096 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -277,6 +277,18 @@ config KPKEYS_HARDENED_PGTABLES
 	  This option has no effect if the system does not support
 	  kernel pkeys.
 
+config KPKEYS_HARDENED_PGTABLES_KUNIT_TEST
+	tristate "KUnit tests for kpkeys_hardened_pgtables" if !KUNIT_ALL_TESTS
+	depends on KPKEYS_HARDENED_PGTABLES
+	depends on KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable this option to check that the kpkeys_hardened_pgtables feature
+	  functions as intended, i.e. prevents arbitrary writes to user and
+	  kernel page tables.
+
+	  If unsure, say N.
+
 endmenu
 
 config CC_HAS_RANDSTRUCT
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v5 18/18] arm64: mm: Batch kpkeys level switches
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (16 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 17/18] mm: Add basic tests for kpkeys_hardened_pgtables Kevin Brodsky
@ 2025-08-15  8:55 ` Kevin Brodsky
  2025-08-20 15:53 ` [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
  2025-08-21 17:29 ` Yang Shi
  19 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-15  8:55 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Andy Lutomirski,
	Catalin Marinas, Dave Hansen, David Hildenbrand, Ira Weiny,
	Jann Horn, Jeff Xu, Joey Gouly, Kees Cook, Linus Walleij,
	Lorenzo Stoakes, Marc Zyngier, Mark Brown, Matthew Wilcox,
	Maxwell Bland, Mike Rapoport (IBM), Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Rick Edgecombe, Ryan Roberts,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
	linux-mm, x86

The kpkeys_hardened_pgtables feature currently switches the kpkeys
level in every helper that writes to page tables, such as set_pte().
With kpkeys implemented using POE, this entails a pair of ISBs
whenever such a helper is called.

A simple way to reduce this overhead is to make use of the lazy_mmu
mode, which has recently been adopted on arm64 to batch barriers
(DSB/ISB) when updating kernel pgtables [1]. Reusing the
TIF_LAZY_MMU flag introduced by this series, we amend the
kpkeys_hardened_pgtables guard so that no level switch (i.e. POR_EL1
update) is issued while that flag is set. Instead, we switch to
KPKEYS_LVL_PGTABLES when entering lazy_mmu mode, and restore the
previous level when exiting it. The optimisation is disabled while
in interrupt as POR_EL1 is reset on exception entry, i.e. switching
is not batched in that case.

Restoring the previous kpkeys level requires storing the original
value of POR_EL1 somewhere. This is a full 64-bit value so we cannot
simply use a TIF flag, but since lazy_mmu sections cannot nest, some
sort of thread-local variable would do the trick. There is no
straightforward way to reuse current->thread.por_el1 for that
purpose - this is where the current value of POR_EL1 is stored on a
context switch, i.e. the value corresponding to KPKEYS_LVL_PGTABLES
inside a lazy_mmu section. Instead, we add a new member to
thread_struct to hold that value temporarily. This isn't optimal as
that member is unused outside of lazy_mmu sections, but it is the
simplest option.

A further optimisation this patch makes is to merge the ISBs when
exiting lazy_mmu mode. That is, if an ISB is going to be issued by
emit_pte_barriers() because kernel pgtables were modified in the
lazy_mmu section, we skip the ISB after restoring POR_EL1. This is
done by checking TIF_LAZY_MMU_PENDING and ensuring that POR_EL1 is
restored before emit_pte_barriers() is called.

[1] https://lore.kernel.org/all/20250422081822.1836315-12-ryan.roberts@arm.com/

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---

Unfortunately lazy_mmu sections can in fact nest under certain
circumstances [2], which means that storing the original value of
POR_EL1 in thread_struct is not always safe.

I am working on modifying the lazy_mmu API to handle nesting gracefully,
which should also help with restoring POR_EL1 without using
thread_struct. See also the discussion in [3].

[2] https://lore.kernel.org/all/20250512150333.5589-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/t/#u

 arch/arm64/include/asm/pgtable.h   | 37 +++++++++++++++++++++++++++++-
 arch/arm64/include/asm/processor.h |  1 +
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1694fb839854..35d15b9722e4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -43,11 +43,40 @@
 
 #ifdef CONFIG_KPKEYS_HARDENED_PGTABLES
 KPKEYS_GUARD_COND(kpkeys_hardened_pgtables, KPKEYS_LVL_PGTABLES,
-		  kpkeys_hardened_pgtables_enabled())
+		  kpkeys_hardened_pgtables_enabled() &&
+		  (in_interrupt() || !test_thread_flag(TIF_LAZY_MMU)))
 #else
 KPKEYS_GUARD_NOOP(kpkeys_hardened_pgtables)
 #endif
 
+static void kpkeys_lazy_mmu_enter(void)
+{
+	if (!kpkeys_hardened_pgtables_enabled())
+		return;
+
+	current->thread.por_el1_lazy_mmu = kpkeys_set_level(KPKEYS_LVL_PGTABLES);
+}
+
+static void kpkeys_lazy_mmu_exit(void)
+{
+	u64 saved_por_el1;
+
+	if (!kpkeys_hardened_pgtables_enabled())
+		return;
+
+	saved_por_el1 = current->thread.por_el1_lazy_mmu;
+
+	/*
+	 * We skip any barrier if TIF_LAZY_MMU_PENDING is set:
+	 * emit_pte_barriers() will issue an ISB just after this function
+	 * returns.
+	 */
+	if (test_thread_flag(TIF_LAZY_MMU_PENDING))
+		__kpkeys_set_pkey_reg_nosync(saved_por_el1);
+	else
+		arch_kpkeys_restore_pkey_reg(saved_por_el1);
+}
+
 static inline void emit_pte_barriers(void)
 {
 	/*
@@ -107,6 +136,7 @@ static inline void arch_enter_lazy_mmu_mode(void)
 		return;
 
 	set_thread_flag(TIF_LAZY_MMU);
+	kpkeys_lazy_mmu_enter();
 }
 
 static inline void arch_flush_lazy_mmu_mode(void)
@@ -123,6 +153,11 @@ static inline void arch_leave_lazy_mmu_mode(void)
 	if (in_interrupt())
 		return;
 
+	/*
+	 * The ordering should be preserved to allow kpkeys_lazy_mmu_exit()
+	 * to skip any barrier when TIF_LAZY_MMU_PENDING is set.
+	 */
+	kpkeys_lazy_mmu_exit();
 	arch_flush_lazy_mmu_mode();
 	clear_thread_flag(TIF_LAZY_MMU);
 }
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 9340e94a27f6..7b20eedfe2fe 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -188,6 +188,7 @@ struct thread_struct {
 	u64			tpidr2_el0;
 	u64			por_el0;
 	u64			por_el1;
+	u64			por_el1_lazy_mmu;
 #ifdef CONFIG_ARM64_GCS
 	unsigned int		gcs_el0_mode;
 	unsigned int		gcs_el0_locked;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey
  2025-08-15  8:55 ` [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey Kevin Brodsky
@ 2025-08-15 16:37   ` Edgecombe, Rick P
  2025-08-18 16:02     ` Kevin Brodsky
  0 siblings, 1 reply; 32+ messages in thread
From: Edgecombe, Rick P @ 2025-08-15 16:37 UTC (permalink / raw)
  To: kevin.brodsky@arm.com, linux-hardening@vger.kernel.org
  Cc: maz@kernel.org, luto@kernel.org, willy@infradead.org,
	mbland@motorola.com, david@redhat.com,
	dave.hansen@linux.intel.com, rppt@kernel.org, joey.gouly@arm.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	catalin.marinas@arm.com, Weiny, Ira, vbabka@suse.cz,
	pierre.langlois@arm.com, jeffxu@chromium.org,
	linus.walleij@linaro.org, lorenzo.stoakes@oracle.com,
	kees@kernel.org, ryan.roberts@arm.com, tglx@linutronix.de,
	jannh@google.com, peterz@infradead.org,
	linux-arm-kernel@lists.infradead.org, will@kernel.org,
	qperret@google.com, linux-mm@kvack.org, broonie@kernel.org,
	x86@kernel.org

On Fri, 2025-08-15 at 09:55 +0100, Kevin Brodsky wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d9371d992033..4880cb7a4cb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -34,6 +34,7 @@
>  #include <linux/slab.h>
>  #include <linux/cacheinfo.h>
>  #include <linux/rcuwait.h>
> +#include <linux/kpkeys.h>
>  
>  struct mempolicy;
>  struct anon_vma;
> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>  
>  	__folio_set_pgtable(folio);
>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
> +	if (kpkeys_protect_pgtable_memory(folio))
> +		return false;
>  	return true;
>  }

It seems like this does a kernel range shootdown for every page table that gets
allocated? If so it throws a pretty big wrench into the carefully managed TLB
flush minimization logic in the kernel.

Obviously this is much more straightforward than the x86 series' page table
conversion batching stuff, but TBH I was worried that even that was going to
have a performance hit. I think how to efficiently do direct map permissions is
the key technical problem to solve for pkeys security usages. They can switch on
and off fast, but applying the key is just as much of a hit as any other kernel
memory permission. (I assume this works similarly to x86's?)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey
  2025-08-15 16:37   ` Edgecombe, Rick P
@ 2025-08-18 16:02     ` Kevin Brodsky
  2025-08-18 17:01       ` Edgecombe, Rick P
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-18 16:02 UTC (permalink / raw)
  To: Edgecombe, Rick P, linux-hardening@vger.kernel.org
  Cc: maz@kernel.org, luto@kernel.org, willy@infradead.org,
	mbland@motorola.com, david@redhat.com,
	dave.hansen@linux.intel.com, rppt@kernel.org, joey.gouly@arm.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	catalin.marinas@arm.com, Weiny, Ira, vbabka@suse.cz,
	pierre.langlois@arm.com, jeffxu@chromium.org,
	linus.walleij@linaro.org, lorenzo.stoakes@oracle.com,
	kees@kernel.org, ryan.roberts@arm.com, tglx@linutronix.de,
	jannh@google.com, peterz@infradead.org,
	linux-arm-kernel@lists.infradead.org, will@kernel.org,
	qperret@google.com, linux-mm@kvack.org, broonie@kernel.org,
	x86@kernel.org

On 15/08/2025 18:37, Edgecombe, Rick P wrote:
> On Fri, 2025-08-15 at 09:55 +0100, Kevin Brodsky wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d9371d992033..4880cb7a4cb9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -34,6 +34,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/cacheinfo.h>
>>  #include <linux/rcuwait.h>
>> +#include <linux/kpkeys.h>
>>  
>>  struct mempolicy;
>>  struct anon_vma;
>> @@ -2979,6 +2980,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
>>  
>>  	__folio_set_pgtable(folio);
>>  	lruvec_stat_add_folio(folio, NR_PAGETABLE);
>> +	if (kpkeys_protect_pgtable_memory(folio))
>> +		return false;
>>  	return true;
>>  }
> It seems like this does a kernel range shootdown for every page table that gets
> allocated? If so it throws a pretty big wrench into the carefully managed TLB
> flush minimization logic in the kernel.
>
> Obviously this is much more straightforward then the x86 series' page table
> conversion batching stuff, but TBH I was worried that even that was going to
> have a performance hit. I think how to efficiently do direct map permissions is
> the key technical problem to solve for pkeys security usages. They can switch on
> and off fast, but applying the key is just as much of a hit as any other kernel
> memory permission. (I assume this works the similarly to x86's?)

The benchmarking results (see cover letter) don't seem to point to a
major performance hit from setting the pkey on arm64 (worth noting that
the linear mapping is PTE-mapped on arm64 today so no splitting should
occur when setting the pkey). The overhead may well be substantially
higher on x86.

I agree this is worth looking into, though. I will check the overhead
added by set_memory_pkey() specifically (ignoring pkey register
switches), and maybe try to allocate page tables with a dedicated
kmem_cache instead, reusing this patch [1] from my other kpkeys series.
A kmem_cache won't be as optimal as a dedicated allocator, but batching
the page freeing may already improve things substantially.

- Kevin

[1]
https://lore.kernel.org/linux-hardening/20250815090000.2182450-4-kevin.brodsky@arm.com/


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey
  2025-08-18 16:02     ` Kevin Brodsky
@ 2025-08-18 17:01       ` Edgecombe, Rick P
  2025-08-19  9:35         ` Kevin Brodsky
  0 siblings, 1 reply; 32+ messages in thread
From: Edgecombe, Rick P @ 2025-08-18 17:01 UTC (permalink / raw)
  To: kevin.brodsky@arm.com, linux-hardening@vger.kernel.org
  Cc: x86@kernel.org, maz@kernel.org, luto@kernel.org,
	mbland@motorola.com, willy@infradead.org,
	dave.hansen@linux.intel.com, david@redhat.com, rppt@kernel.org,
	joey.gouly@arm.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, pierre.langlois@arm.com, Weiny, Ira,
	vbabka@suse.cz, catalin.marinas@arm.com, jeffxu@chromium.org,
	linus.walleij@linaro.org, lorenzo.stoakes@oracle.com,
	kees@kernel.org, ryan.roberts@arm.com, tglx@linutronix.de,
	jannh@google.com, peterz@infradead.org,
	linux-arm-kernel@lists.infradead.org, will@kernel.org,
	qperret@google.com, linux-mm@kvack.org, broonie@kernel.org

On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
> The benchmarking results (see cover letter) don't seem to point to a
> major performance hit from setting the pkey on arm64 (worth noting that
> the linear mapping is PTE-mapped on arm64 today so no splitting should
> occur when setting the pkey). The overhead may well be substantially
> higher on x86.

It's surprising to me. The batching seems to be about switching the pkey, not
the conversion of the direct map. And with batching you measured that a fork
benchmark actually sped up a tiny bit. Shouldn't it involve a pile of page table
allocations and so extra direct map work?

I don't know if it's possible that the mock implementation skipped some
set_memory() work somehow?

> 
> I agree this is worth looking into, though. I will check the overhead
> added by set_memory_pkey() specifically (ignoring pkey register
> switches), and maybe try to allocate page tables with a dedicated
> kmem_cache instead, reusing this patch [1] from my other kpkeys series.
> A kmem_cache won't be as optimal as a dedicated allocator, but batching
> the page freeing may already improve things substantially.

I actually never got to the benchmarking-on-real-HW stage either, but I'd be
surprised if this approach would have acceptable performance for x86. There are
so many optimizations around minimizing TLB flushes in Linux. Dunno. Maybe my
arm knowledge is too lacking.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey
  2025-08-18 17:01       ` Edgecombe, Rick P
@ 2025-08-19  9:35         ` Kevin Brodsky
  0 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-19  9:35 UTC (permalink / raw)
  To: Edgecombe, Rick P, linux-hardening@vger.kernel.org
  Cc: x86@kernel.org, maz@kernel.org, luto@kernel.org,
	mbland@motorola.com, willy@infradead.org,
	dave.hansen@linux.intel.com, david@redhat.com, rppt@kernel.org,
	joey.gouly@arm.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, pierre.langlois@arm.com, Weiny, Ira,
	vbabka@suse.cz, catalin.marinas@arm.com, jeffxu@chromium.org,
	linus.walleij@linaro.org, lorenzo.stoakes@oracle.com,
	kees@kernel.org, ryan.roberts@arm.com, tglx@linutronix.de,
	jannh@google.com, peterz@infradead.org,
	linux-arm-kernel@lists.infradead.org, will@kernel.org,
	qperret@google.com, linux-mm@kvack.org, broonie@kernel.org

On 18/08/2025 19:01, Edgecombe, Rick P wrote:
> On Mon, 2025-08-18 at 18:02 +0200, Kevin Brodsky wrote:
>> The benchmarking results (see cover letter) don't seem to point to a
>> major performance hit from setting the pkey on arm64 (worth noting that
>> the linear mapping is PTE-mapped on arm64 today so no splitting should
>> occur when setting the pkey). The overhead may well be substantially
>> higher on x86.
> It's surprising to me. The batching seems to be about switching the pkey, not
> the conversion of the direct map.

Correct, there is still a set_memory_pkey() for each PTP.

>  And with batching you measured a fork
> benchmark actually sped up a tiny bit. Shouldn't it involve a pile of page table
> allocations and so extra direct map work?

It should indeed...

> I don't know if it's possible the mock implementation skipped some set_memory()
> work somehow?

You're absolutely right: in the mock implementation I benchmarked,
set_memory_pkey() is in fact a no-op :( This is because
patch 6 gates set_memory_pkey() on system_supports_poe(), but the mock
implementation [1] only modifies arch_kpkeys_enabled(). In other words
the numbers in the cover letter correspond to the added pkey register
switches, without touching the page tables.
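
Concretely, the gating works roughly like this (a simplified sketch,
not the exact patch 6 code; update_linear_map_pkey() is a hypothetical
stand-in for the actual pgprot update):

	int set_memory_pkey(unsigned long addr, int numpages, int pkey)
	{
		/* the mock implementation never gets past this check */
		if (!system_supports_poe())
			return 0;

		/* hypothetical helper setting the POIndex in the linear map */
		return update_linear_map_pkey(addr, numpages, pkey);
	}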

I am now re-running the benchmarks with set_memory_pkey() actually
modifying the page tables. I'll reply to the cover letter with the
updated numbers.

- Kevin

[1]
https://gitlab.arm.com/linux-arm/linux-kb/-/commit/fd75b43abb354e84d06f3dfb05ce839e9fb13e08


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (17 preceding siblings ...)
  2025-08-15  8:55 ` [RFC PATCH v5 18/18] arm64: mm: Batch kpkeys level switches Kevin Brodsky
@ 2025-08-20 15:53 ` Kevin Brodsky
  2025-08-20 16:01   ` Kevin Brodsky
  2025-08-21 17:29 ` Yang Shi
  19 siblings, 1 reply; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-20 15:53 UTC (permalink / raw)
  To: linux-hardening, Rick Edgecombe
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Ryan Roberts, Thomas Gleixner, Vlastimil Babka,
	Will Deacon, linux-arm-kernel, linux-mm, x86

On 15/08/2025 10:54, Kevin Brodsky wrote:
> [...]
>
> Performance
> ===========
>
> No arm64 hardware currently implements POE. To estimate the performance
> impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has
> been used, replacing accesses to the POR_EL1 register with accesses to
> another system register that is otherwise unused (CONTEXTIDR_EL1), and
> leaving everything else unchanged. Most of the kpkeys overhead is
> expected to originate from the barrier (ISB) that is required after
> writing to POR_EL1, and from setting the POIndex (pkey) in page tables;
> both of these are done exactly in the same way in the mock
> implementation.

It turns out this wasn't the case regarding the pkey setting - because
patch 6 gates set_memory_pkey() on system_supports_poe() and not
arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey()
into a no-op. Many thanks to Rick Edgecombe for highlighting that the
overheads were suspiciously low for some benchmarks!

> The original implementation of kpkeys_hardened_pgtables is very
> inefficient when many PTEs are changed at once, as the kpkeys level is
> switched twice for every PTE (two ISBs per PTE). Patch 18 introduces
> an optimisation that makes use of the lazy_mmu mode to batch those
> switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
> 2. skip any kpkeys switch while in that section, and 3. restore the
> kpkeys level on arch_leave_lazy_mmu_mode(). When that last function
> already issues an ISB (when updating kernel page tables), we get a
> further optimisation as we can skip the ISB when restoring the kpkeys
> level.
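
[As an aside, a minimal sketch of how steps 1-3 above can be hooked into
the lazy_mmu entry/exit points; kpkeys_set_level() and
kpkeys_restore_pkey_reg() are illustrative names, not the actual API
from patch 18:]

  static u64 lazy_mmu_saved_por;  /* illustration only */

  void arch_enter_lazy_mmu_mode(void)
  {
          /* 1. switch to KPKEYS_LVL_PGTABLES once for the whole batch */
          lazy_mmu_saved_por = kpkeys_set_level(KPKEYS_LVL_PGTABLES);
  }

  /* 2. pgtable helpers skip their own level switch inside the section */

  void arch_leave_lazy_mmu_mode(void)
  {
          /*
           * 3. restore the previous configuration; the ISB can be folded
           * into the one already issued when kernel page tables are
           * updated.
           */
          kpkeys_restore_pkey_reg(lazy_mmu_saved_por);
  }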
>
> Both implementations (without and with batching) were evaluated on an
> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that
> involve heavy page table manipulations. The results shown below are
> relative to the baseline for this series, which is 6.17-rc1. The
> branches used for all three sets of results (baseline, with/without
> batching) are available in a repository, see next section.
>
> Caveat: these numbers should be seen as a lower bound for the overhead
> of a real POE-based protection. The hardware checks added by POE are
> however not expected to incur significant extra overhead.
>
> Reading example: for the fix_size_alloc_test benchmark, using 1 page per
> iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead
> without batching, and 14.62% overhead with batching. Both results are
> considered statistically significant (95% confidence interval),
> indicated by "(R)".
>
> +-------------------+----------------------------------+------------------+---------------+
> | Benchmark         | Result Class                     | Without batching | With batching |
> +===================+==================================+==================+===============+
> | mmtests/kernbench | real time                        |            0.30% |         0.11% |
> |                   | system time                      |        (R) 3.97% |     (R) 2.17% |
> |                   | user time                        |            0.12% |         0.02% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
> |                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
> |                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
> +-------------------+----------------------------------+------------------+---------------+

These numbers therefore correspond to set_memory_pkey() being a no-op;
in other words, they represent the overhead of switching the pkey
register only.

I have amended the mock implementation so that set_memory_pkey() is run
as it would on a real POE implementation (i.e. actually setting the PTE
bits). Here are the new results, representing the overhead of both pkey
register switching and setting the pkey of page table pages (PTPs) on
alloc/free:

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.32% |         0.35% |
|                   | system time                      |        (R) 4.18% |     (R) 3.18% |
|                   | user time                        |            0.08% |         0.20% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
|                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
|                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
+-------------------+----------------------------------+------------------+---------------+

Those results are overall very similar to the original ones.
micromm/fork is however clearly impacted - around 4% additional overhead
from set_memory_pkey(); it makes sense considering that forking requires
duplicating (and therefore allocating) a full set of page tables.
kernbench is also a fork-heavy workload and it takes an additional ~1%
hit in system time (with batching).

It seems fair to conclude that, on arm64, setting the pkey whenever a
PTP is allocated/freed is not particularly expensive. The situation may
well be different on x86 as Rick pointed out, and it may also change on
newer arm64 systems as I noted further down. Allocating/freeing PTPs in
bulk should help if setting the pkey in the pgtable ctor/dtor proves too
expensive.
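
For instance, something along these lines (purely illustrative; the
set_memory_pkey() signature and allocation site are assumptions, not
code from this series):

  static struct page *alloc_ptp_batch(unsigned int order)
  {
          struct page *pages = alloc_pages(GFP_PGTABLE_KERNEL, order);

          if (pages)
                  /* one pkey update for 2^order PTPs instead of one each */
                  set_memory_pkey((unsigned long)page_address(pages),
                                  1 << order, 1 /* pkey used for PTPs */);

          return pages;
  }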

- Kevin

> Benchmarks:
> - mmtests/kernbench: running kernbench (kernel build) [4].
> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
>   1 GB mapping is created and then fork/unmap is called. The mapping is
>   created using either page-sized (h:0) or hugepage folios (h:1); in all
>   cases the memory is PTE-mapped.
> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
>   (p:) and whether huge pages are used (h:).
>
> On a "real-world" and fork-heavy workload like kernbench, the estimated
> overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
> overhead without batching, and about half that figure (2.2%) with
> batching. The real time overhead is negligible.
>
> Microbenchmarks show large overheads without batching, which increase
> with the number of pages being manipulated. Batching drastically reduces
> that overhead, almost negating it for micromm/fork. Because all PTEs in
> the mapping are modified in the same lazy_mmu section, the kpkeys level
> is changed just twice regardless of the mapping size; as a result the
> relative overhead actually decreases as the size increases for
> fix_size_alloc_test.
>
> Note: the performance impact of set_memory_pkey() is likely to be
> relatively low on arm64 because the linear mapping uses PTE-level
> descriptors only. This means that set_memory_pkey() simply changes the
> attributes of some PTE descriptors. However, some systems may be able to
> use higher-level descriptors in the future [5], meaning that
> set_memory_pkey() may have to split mappings. Allocating page tables
> from a contiguous cache of pages could help minimise the overhead, as
> proposed for x86 in [1].
>
> [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-20 15:53 ` [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
@ 2025-08-20 16:01   ` Kevin Brodsky
  2025-08-20 16:18     ` Edgecombe, Rick P
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-20 16:01 UTC (permalink / raw)
  To: linux-hardening, Rick Edgecombe
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Ryan Roberts, Thomas Gleixner, Vlastimil Babka,
	Will Deacon, linux-arm-kernel, linux-mm, x86

On 20/08/2025 17:53, Kevin Brodsky wrote:
> On 15/08/2025 10:54, Kevin Brodsky wrote:
>> [...]
>>
>> Performance
>> ===========
>>
>> No arm64 hardware currently implements POE. To estimate the performance
>> impact of kpkeys_hardened_pgtables, a mock implementation of kpkeys has
>> been used, replacing accesses to the POR_EL1 register with accesses to
>> another system register that is otherwise unused (CONTEXTIDR_EL1), and
>> leaving everything else unchanged. Most of the kpkeys overhead is
>> expected to originate from the barrier (ISB) that is required after
>> writing to POR_EL1, and from setting the POIndex (pkey) in page tables;
>> both of these are done exactly in the same way in the mock
>> implementation.
> It turns out this wasn't the case regarding the pkey setting - because
> patch 6 gates set_memory_pkey() on system_supports_poe() and not
> arch_kpkeys_enabled(), the mock implementation turned set_memory_pkey()
> into a no-op. Many thanks to Rick Edgecombe for highlighting that the
> overheads were suspiciously low for some benchmarks!
>
>> The original implementation of kpkeys_hardened_pgtables is very
>> inefficient when many PTEs are changed at once, as the kpkeys level is
>> switched twice for every PTE (two ISBs per PTE). Patch 18 introduces
>> an optimisation that makes use of the lazy_mmu mode to batch those
>> switches: 1. switch to KPKEYS_LVL_PGTABLES on arch_enter_lazy_mmu_mode(),
>> 2. skip any kpkeys switch while in that section, and 3. restore the
>> kpkeys level on arch_leave_lazy_mmu_mode(). When that last function
>> already issues an ISB (when updating kernel page tables), we get a
>> further optimisation as we can skip the ISB when restoring the kpkeys
>> level.
>>
>> Both implementations (without and with batching) were evaluated on an
>> Amazon EC2 M7g instance (Graviton3), using a variety of benchmarks that
>> involve heavy page table manipulations. The results shown below are
>> relative to the baseline for this series, which is 6.17-rc1. The
>> branches used for all three sets of results (baseline, with/without
>> batching) are available in a repository, see next section.
>>
>> Caveat: these numbers should be seen as a lower bound for the overhead
>> of a real POE-based protection. The hardware checks added by POE are
>> however not expected to incur significant extra overhead.
>>
>> Reading example: for the fix_size_alloc_test benchmark, using 1 page per
>> iteration (no hugepage), kpkeys_hardened_pgtables incurs 17.35% overhead
>> without batching, and 14.62% overhead with batching. Both results are
>> considered statistically significant (95% confidence interval),
>> indicated by "(R)".
>>
>> +-------------------+----------------------------------+------------------+---------------+
>> | Benchmark         | Result Class                     | Without batching | With batching |
>> +===================+==================================+==================+===============+
>> | mmtests/kernbench | real time                        |            0.30% |         0.11% |
>> |                   | system time                      |        (R) 3.97% |     (R) 2.17% |
>> |                   | user time                        |            0.12% |         0.02% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/fork      | fork: h:0                        |      (R) 217.31% |        -0.97% |
>> |                   | fork: h:1                        |      (R) 275.25% |     (R) 2.25% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/munmap    | munmap: h:0                      |       (R) 15.57% |        -1.95% |
>> |                   | munmap: h:1                      |      (R) 169.53% |     (R) 6.53% |
>> +-------------------+----------------------------------+------------------+---------------+
>> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 17.35% |    (R) 14.62% |
>> |                   | fix_size_alloc_test: p:4, h:0    |       (R) 37.54% |     (R) 9.35% |
>> |                   | fix_size_alloc_test: p:16, h:0   |       (R) 66.08% |     (R) 3.15% |
>> |                   | fix_size_alloc_test: p:64, h:0   |       (R) 82.94% |        -0.39% |
>> |                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |        -1.67% |
>> |                   | fix_size_alloc_test: p:16, h:1   |       (R) 50.31% |         3.00% |
>> |                   | fix_size_alloc_test: p:64, h:1   |       (R) 59.73% |         2.23% |
>> |                   | fix_size_alloc_test: p:256, h:1  |       (R) 62.14% |         1.51% |
>> |                   | random_size_alloc_test: p:1, h:0 |       (R) 77.82% |        -0.21% |
>> |                   | vm_map_ram_test: p:1, h:0        |       (R) 30.66% |    (R) 27.30% |
>> +-------------------+----------------------------------+------------------+---------------+
> These numbers therefore correspond to set_memory_pkey() being a no-op,
> in other words they represent the overhead of switching the pkey
> register only.
>
> I have amended the mock implementation so that set_memory_pkey() is run
> as it would on a real POE implementation (i.e. actually setting the PTE
> bits). Here are the new results, representing the overhead of both pkey
> register switching and setting the pkey of page table pages (PTPs) on
> alloc/free:
>
> +-------------------+----------------------------------+------------------+---------------+
> | Benchmark         | Result Class                     | Without
> batching | With batching |
> +===================+==================================+==================+===============+
> | mmtests/kernbench | real time                        |           
> 0.32% |         0.35% |
> |                   | system time                      |        (R)
> 4.18% |     (R) 3.18% |
> |                   | user time                        |           
> 0.08% |         0.20% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/fork      | fork: h:0                        |      (R)
> 221.39% |     (R) 3.35% |
> |                   | fork: h:1                        |      (R)
> 282.89% |     (R) 6.99% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/munmap    | munmap: h:0                      |       (R)
> 17.37% |        -0.28% |
> |                   | munmap: h:1                      |      (R)
> 172.61% |     (R) 8.08% |
> +-------------------+----------------------------------+------------------+---------------+
> | micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R)
> 15.54% |    (R) 12.57% |
> |                   | fix_size_alloc_test: p:4, h:0    |       (R)
> 39.18% |     (R) 9.13% |
> |                   | fix_size_alloc_test: p:16, h:0   |       (R)
> 65.81% |         2.97% |
> |                   | fix_size_alloc_test: p:64, h:0   |       (R)
> 83.39% |        -0.49% |
> |                   | fix_size_alloc_test: p:256, h:0  |       (R)
> 87.85% |    (I) -2.04% |
> |                   | fix_size_alloc_test: p:16, h:1   |       (R)
> 51.21% |         3.77% |
> |                   | fix_size_alloc_test: p:64, h:1   |       (R)
> 60.02% |         0.99% |
> |                   | fix_size_alloc_test: p:256, h:1  |       (R)
> 63.82% |         1.16% |
> |                   | random_size_alloc_test: p:1, h:0 |       (R)
> 77.79% |        -0.51% |
> |                   | vm_map_ram_test: p:1, h:0        |       (R)
> 30.67% |    (R) 27.09% |
> +-------------------+----------------------------------+------------------+---------------+

Apologies, Thunderbird helpfully decided to wrap around that table...
Here's the unmangled table:

+-------------------+----------------------------------+------------------+---------------+
| Benchmark         | Result Class                     | Without batching | With batching |
+===================+==================================+==================+===============+
| mmtests/kernbench | real time                        |            0.32% |         0.35% |
|                   | system time                      |        (R) 4.18% |     (R) 3.18% |
|                   | user time                        |            0.08% |         0.20% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
|                   | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
|                   | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
+-------------------+----------------------------------+------------------+---------------+
| micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
|                   | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
|                   | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
|                   | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
|                   | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
|                   | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
|                   | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
|                   | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
|                   | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
|                   | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
+-------------------+----------------------------------+------------------+---------------+

> Those results are overall very similar to the original ones.
> micromm/fork is however clearly impacted - around 4% additional overhead
> from set_memory_pkey(); it makes sense considering that forking requires
> duplicating (and therefore allocating) a full set of page tables.
> kernbench is also a fork-heavy workload and it gets a 1% hit in system
> time (with batching).
>
> It seems fair to conclude that, on arm64, setting the pkey whenever a
> PTP is allocated/freed is not particularly expensive. The situation may
> well be different on x86 as Rick pointed out, and it may also change on
> newer arm64 systems as I noted further down. Allocating/freeing PTPs in
> bulk should help if setting the pkey in the pgtable ctor/dtor proves too
> expensive.
>
> - Kevin
>
>> Benchmarks:
>> - mmtests/kernbench: running kernbench (kernel build) [4].
>> - micromm/{fork,munmap}: from David Hildenbrand's benchmark suite. A
>>   1 GB mapping is created and then fork/unmap is called. The mapping is
>>   created using either page-sized (h:0) or hugepage folios (h:1); in all
>>   cases the memory is PTE-mapped.
>> - micromm/vmalloc: from test_vmalloc.ko, varying the number of pages
>>   (p:) and whether huge pages are used (h:).
>>
>> On a "real-world" and fork-heavy workload like kernbench, the estimated
>> overhead of kpkeys_hardened_pgtables is reasonable: 4% system time
>> overhead without batching, and about half that figure (2.2%) with
>> batching. The real time overhead is negligible.
>>
>> Microbenchmarks show large overheads without batching, which increase
>> with the number of pages being manipulated. Batching drastically reduces
>> that overhead, almost negating it for micromm/fork. Because all PTEs in
>> the mapping are modified in the same lazy_mmu section, the kpkeys level
>> is changed just twice regardless of the mapping size; as a result the
>> relative overhead actually decreases as the size increases for
>> fix_size_alloc_test.
>>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes the
>> attributes of some PTE descriptors. However, some systems may be able to
>> use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>>
>> [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-20 16:01   ` Kevin Brodsky
@ 2025-08-20 16:18     ` Edgecombe, Rick P
  2025-08-21  7:23       ` Kevin Brodsky
  0 siblings, 1 reply; 32+ messages in thread
From: Edgecombe, Rick P @ 2025-08-20 16:18 UTC (permalink / raw)
  To: kevin.brodsky@arm.com, linux-hardening@vger.kernel.org
  Cc: maz@kernel.org, luto@kernel.org, willy@infradead.org,
	mbland@motorola.com, david@redhat.com,
	dave.hansen@linux.intel.com, rppt@kernel.org, joey.gouly@arm.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	catalin.marinas@arm.com, Weiny, Ira, vbabka@suse.cz,
	pierre.langlois@arm.com, jeffxu@chromium.org,
	linus.walleij@linaro.org, lorenzo.stoakes@oracle.com,
	kees@kernel.org, ryan.roberts@arm.com, tglx@linutronix.de,
	jannh@google.com, peterz@infradead.org,
	linux-arm-kernel@lists.infradead.org, will@kernel.org,
	qperret@google.com, linux-mm@kvack.org, broonie@kernel.org,
	x86@kernel.org

On Wed, 2025-08-20 at 18:01 +0200, Kevin Brodsky wrote:
> Apologies, Thunderbird helpfully decided to wrap around that table...
> Here's the unmangled table:
> 
> +-------------------+----------------------------------+------------------+---------------+
> > Benchmark         | Result Class                     | Without batching | With batching |
> +===================+==================================+==================+===============+
> > mmtests/kernbench | real time                        |            0.32% |         0.35% |
> >                    | system time                      |        (R) 4.18% |     (R) 3.18% |
> >                    | user time                        |            0.08% |         0.20% |
> +-------------------+----------------------------------+------------------+---------------+
> > micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
> >                    | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
> +-------------------+----------------------------------+------------------+---------------+
> > micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
> >                    | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
> +-------------------+----------------------------------+------------------+---------------+
> > micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |

Both this and the previous one have the 95% confidence interval. So it saw a 16%
speed up with direct map modification. Possible?

> >                    | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
> >                    | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
> >                    | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
> >                    | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
> >                    | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
> >                    | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
> >                    | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
> >                    | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
> >                    | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
> +-------------------+----------------------------------+------------------+---------------+

Hmm, still surprisingly low to me, but ok. It would be good to have x86 and arm
work the same, but I don't think we have line of sight to x86 currently. And I
actually never did real benchmarks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-20 16:18     ` Edgecombe, Rick P
@ 2025-08-21  7:23       ` Kevin Brodsky
  0 siblings, 0 replies; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-21  7:23 UTC (permalink / raw)
  To: Edgecombe, Rick P, linux-hardening@vger.kernel.org
  Cc: maz@kernel.org, luto@kernel.org, willy@infradead.org,
	mbland@motorola.com, david@redhat.com,
	dave.hansen@linux.intel.com, rppt@kernel.org, joey.gouly@arm.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	catalin.marinas@arm.com, Weiny, Ira, vbabka@suse.cz,
	pierre.langlois@arm.com, jeffxu@chromium.org,
	linus.walleij@linaro.org, lorenzo.stoakes@oracle.com,
	kees@kernel.org, ryan.roberts@arm.com, tglx@linutronix.de,
	jannh@google.com, peterz@infradead.org,
	linux-arm-kernel@lists.infradead.org, will@kernel.org,
	qperret@google.com, linux-mm@kvack.org, broonie@kernel.org,
	x86@kernel.org

On 20/08/2025 18:18, Edgecombe, Rick P wrote:
> On Wed, 2025-08-20 at 18:01 +0200, Kevin Brodsky wrote:
>> Apologies, Thunderbird helpfully decided to wrap around that table...
>> Here's the unmangled table:
>>
>> +-------------------+----------------------------------+------------------+---------------+
>>> Benchmark         | Result Class                     | Without batching | With batching |
>> +===================+==================================+==================+===============+
>>> mmtests/kernbench | real time                        |            0.32% |         0.35% |
>>>                    | system time                      |        (R) 4.18% |     (R) 3.18% |
>>>                    | user time                        |            0.08% |         0.20% |
>> +-------------------+----------------------------------+------------------+---------------+
>>> micromm/fork      | fork: h:0                        |      (R) 221.39% |     (R) 3.35% |
>>>                    | fork: h:1                        |      (R) 282.89% |     (R) 6.99% |
>> +-------------------+----------------------------------+------------------+---------------+
>>> micromm/munmap    | munmap: h:0                      |       (R) 17.37% |        -0.28% |
>>>                    | munmap: h:1                      |      (R) 172.61% |     (R) 8.08% |
>> +-------------------+----------------------------------+------------------+---------------+
>>> micromm/vmalloc   | fix_size_alloc_test: p:1, h:0    |       (R) 15.54% |    (R) 12.57% |
> Both this and the previous one have the 95% confidence interval. So it saw a 16%
> speed up with direct map modification. Possible?

Positive numbers mean performance degradation ("(R)" actually stands for
regression), so in that case the protection is adding a 16%/13%
overhead. Here this is mainly due to the added pkey register switching
(+ barrier) happening on every call to vmalloc() and vfree(), which has
a large relative impact since only one page is being allocated/freed.

>>>                    | fix_size_alloc_test: p:4, h:0    |       (R) 39.18% |     (R) 9.13% |
>>>                    | fix_size_alloc_test: p:16, h:0   |       (R) 65.81% |         2.97% |
>>>                    | fix_size_alloc_test: p:64, h:0   |       (R) 83.39% |        -0.49% |
>>>                    | fix_size_alloc_test: p:256, h:0  |       (R) 87.85% |    (I) -2.04% |
>>>                    | fix_size_alloc_test: p:16, h:1   |       (R) 51.21% |         3.77% |
>>>                    | fix_size_alloc_test: p:64, h:1   |       (R) 60.02% |         0.99% |
>>>                    | fix_size_alloc_test: p:256, h:1  |       (R) 63.82% |         1.16% |
>>>                    | random_size_alloc_test: p:1, h:0 |       (R) 77.79% |        -0.51% |
>>>                    | vm_map_ram_test: p:1, h:0        |       (R) 30.67% |    (R) 27.09% |
>> +-------------------+----------------------------------+------------------+---------------+
> Hmm, still surprisingly low to me, but ok. It would be good have x86 and arm
> work the same, but I don't think we have line of sight to x86 currently. And I
> actually never did real benchmarks.

It would certainly be good to get numbers on x86 as well - I'm hoping
that someone with a better understanding of x86 than myself could
implement kpkeys on x86 at some point, so that we can run the same
benchmarks there.

- Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
                   ` (18 preceding siblings ...)
  2025-08-20 15:53 ` [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
@ 2025-08-21 17:29 ` Yang Shi
  2025-08-25  7:31   ` Kevin Brodsky
  19 siblings, 1 reply; 32+ messages in thread
From: Yang Shi @ 2025-08-21 17:29 UTC (permalink / raw)
  To: Kevin Brodsky, linux-hardening
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Rick Edgecombe, Ryan Roberts, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-mm, x86

Hi Kevin,

On 8/15/25 1:54 AM, Kevin Brodsky wrote:
> This is a proposal to leverage protection keys (pkeys) to harden
> critical kernel data, by making it mostly read-only. The series includes
> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
> as well as a page table hardening feature based on that framework,
> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
> concept, but they are designed to be compatible with any architecture
> that supports pkeys.

[...]

>
> Note: the performance impact of set_memory_pkey() is likely to be
> relatively low on arm64 because the linear mapping uses PTE-level
> descriptors only. This means that set_memory_pkey() simply changes the
> attributes of some PTE descriptors. However, some systems may be able to
> use higher-level descriptors in the future [5], meaning that
> set_memory_pkey() may have to split mappings. Allocating page tables

I suppose the page table hardening feature will be opt-in due to its
overhead? If so I think you can just keep the kernel linear mapping
PTE-mapped, just like debug page alloc.

> from a contiguous cache of pages could help minimise the overhead, as
> proposed for x86 in [1].

I'm a little bit confused about how this can work. The contiguous cache 
of pages should be some large page, for example, 2M. But the page table 
pages allocated from the cache may have different permissions if I 
understand correctly. The default permission is RO, but some of them may 
become R/W at some point, for example, when calling set_pte_at(). You
still need to split the linear mapping, right?

Regards,
Yang

>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-21 17:29 ` Yang Shi
@ 2025-08-25  7:31   ` Kevin Brodsky
  2025-08-26 19:18     ` Yang Shi
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-25  7:31 UTC (permalink / raw)
  To: Yang Shi, linux-hardening
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Rick Edgecombe, Ryan Roberts, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-mm, x86

On 21/08/2025 19:29, Yang Shi wrote:
> Hi Kevin,
>
> On 8/15/25 1:54 AM, Kevin Brodsky wrote:
>> This is a proposal to leverage protection keys (pkeys) to harden
>> critical kernel data, by making it mostly read-only. The series includes
>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel
>> use,
>> as well as a page table hardening feature based on that framework,
>> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
>> concept, but they are designed to be compatible with any architecture
>> that supports pkeys.
>
> [...]
>
>>
>> Note: the performance impact of set_memory_pkey() is likely to be
>> relatively low on arm64 because the linear mapping uses PTE-level
>> descriptors only. This means that set_memory_pkey() simply changes the
>> attributes of some PTE descriptors. However, some systems may be able to
>> use higher-level descriptors in the future [5], meaning that
>> set_memory_pkey() may have to split mappings. Allocating page tables
>
> I'm supposed the page table hardening feature will be opt-in due to
> its overhead? If so I think you can just keep kernel linear mapping
> using PTE, just like debug page alloc.

Indeed, I don't expect it to be turned on by default (in defconfig). If
the overhead proves too large when block mappings are used, it seems
reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.
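
One way to do that could look like the below (a sketch reusing the
existing arm64 can_set_direct_map() hook; the Kconfig symbol name is an
assumption):

  bool can_set_direct_map(void)
  {
          /*
           * Forcing PTE mappings whenever the hardening feature is built
           * in means set_memory_pkey() never has to split block mappings
           * (other existing conditions omitted for brevity).
           */
          return rodata_full || debug_pagealloc_enabled() ||
                 IS_ENABLED(CONFIG_KPKEYS_HARDENED_PGTABLES);
  }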

>
>> from a contiguous cache of pages could help minimise the overhead, as
>> proposed for x86 in [1].
>
> I'm a little bit confused about how this can work. The contiguous
> cache of pages should be some large page, for example, 2M. But the
> page table pages allocated from the cache may have different
> permissions if I understand correctly. The default permission is RO,
> but some of them may become R/W at sometime, for example, when calling
> set_pte_at(). You still need to split the linear mapping, right?

When such a helper is called, *all* PTPs become writeable - there is no
per-PTP permission switching.

PTPs remain mapped RW (i.e. the base permissions set at the PTE level
are RW). With this series, they are also all mapped with the same pkey
(1). By default, the pkey register is configured so that pkey 1 provides
RO access. The net result is that PTPs are RO by default, since the pkey
restricts the effective permissions.

When calling e.g. set_pte(), the pkey register is modified to enable RW
access to pkey 1, making it possible to write to any PTP. Its value is
restored when the function exits so that PTPs are once again RO.
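
In code, the idea looks roughly like this (a sketch; kpkeys_set_level()
and kpkeys_restore_pkey_reg() are illustrative names rather than the
series' actual API):

  static inline void set_pte(pte_t *ptep, pte_t pte)
  {
          /* grant RW access to pkey 1 on this CPU (pkey register write + ISB) */
          u64 prev = kpkeys_set_level(KPKEYS_LVL_PGTABLES);

          WRITE_ONCE(*ptep, pte);         /* the PTP is writeable here */

          /* back to the default configuration: pkey 1 is read-only again */
          kpkeys_restore_pkey_reg(prev);
  }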

- Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-25  7:31   ` Kevin Brodsky
@ 2025-08-26 19:18     ` Yang Shi
  2025-08-27 16:09       ` Kevin Brodsky
  0 siblings, 1 reply; 32+ messages in thread
From: Yang Shi @ 2025-08-26 19:18 UTC (permalink / raw)
  To: Kevin Brodsky, linux-hardening
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Rick Edgecombe, Ryan Roberts, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-mm, x86



On 8/25/25 12:31 AM, Kevin Brodsky wrote:
> On 21/08/2025 19:29, Yang Shi wrote:
>> Hi Kevin,
>>
>> On 8/15/25 1:54 AM, Kevin Brodsky wrote:
>>> This is a proposal to leverage protection keys (pkeys) to harden
>>> critical kernel data, by making it mostly read-only. The series includes
>>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel
>>> use,
>>> as well as a page table hardening feature based on that framework,
>>> "kpkeys_hardened_pgtables". Both are implemented on arm64 as a proof of
>>> concept, but they are designed to be compatible with any architecture
>>> that supports pkeys.
>> [...]
>>
>>> Note: the performance impact of set_memory_pkey() is likely to be
>>> relatively low on arm64 because the linear mapping uses PTE-level
>>> descriptors only. This means that set_memory_pkey() simply changes the
>>> attributes of some PTE descriptors. However, some systems may be able to
>>> use higher-level descriptors in the future [5], meaning that
>>> set_memory_pkey() may have to split mappings. Allocating page tables
>> I'm supposed the page table hardening feature will be opt-in due to
>> its overhead? If so I think you can just keep kernel linear mapping
>> using PTE, just like debug page alloc.
> Indeed, I don't expect it to be turned on by default (in defconfig). If
> the overhead proves too large when block mappings are used, it seems
> reasonable to force PTE mappings when kpkeys_hardened_pgtables is enabled.
>
>>> from a contiguous cache of pages could help minimise the overhead, as
>>> proposed for x86 in [1].
>> I'm a little bit confused about how this can work. The contiguous
>> cache of pages should be some large page, for example, 2M. But the
>> page table pages allocated from the cache may have different
>> permissions if I understand correctly. The default permission is RO,
>> but some of them may become R/W at sometime, for example, when calling
>> set_pte_at(). You still need to split the linear mapping, right?
> When such a helper is called, *all* PTPs become writeable - there is no
> per-PTP permission switching.

OK, so all PTPs in the same contiguous cache will become writeable even 
though the helper (i.e. set_pte_at()) is just called on one of the 
PTPs.  But doesn't it compromise the page table hardening somehow? The 
PTPs from the same cache may belong to different processes.

Thanks,
Yang

>
> PTPs remain mapped RW (i.e. the base permissions set at the PTE level
> are RW). With this series, they are also all mapped with the same pkey
> (1). By default, the pkey register is configured so that pkey 1 provides
> RO access. The net result is that PTPs are RO by default, since the pkey
> restricts the effective permissions.
>
> When calling e.g. set_pte(), the pkey register is modified to enable RW
> access to pkey 1, making it possible to write to any PTP. Its value is
> restored when the function exit so that PTPs are once again RO.
>
> - Kevin


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-26 19:18     ` Yang Shi
@ 2025-08-27 16:09       ` Kevin Brodsky
  2025-08-29 22:31         ` Yang Shi
  0 siblings, 1 reply; 32+ messages in thread
From: Kevin Brodsky @ 2025-08-27 16:09 UTC (permalink / raw)
  To: Yang Shi, linux-hardening
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Rick Edgecombe, Ryan Roberts, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-mm, x86

On 26/08/2025 21:18, Yang Shi wrote:
>>
>>>> from a contiguous cache of pages could help minimise the overhead, as
>>>> proposed for x86 in [1].
>>> I'm a little bit confused about how this can work. The contiguous
>>> cache of pages should be some large page, for example, 2M. But the
>>> page table pages allocated from the cache may have different
>>> permissions if I understand correctly. The default permission is RO,
>>> but some of them may become R/W at sometime, for example, when calling
>>> set_pte_at(). You still need to split the linear mapping, right?
>> When such a helper is called, *all* PTPs become writeable - there is no
>> per-PTP permission switching.
>
> OK, so all PTPs in the same contiguous cache will become writeable
> even though the helper (i.e. set_pte_at()) is just called on one of
> the PTPs.  But doesn't it compromise the page table hardening somehow?
> The PTPs from the same cache may belong to different processes. 

First just a note that this is true regardless of how the PTPs are
allocated (i.e. this is already the case in this version of the series).

Either way, yes you are right, this approach does not introduce any
isolation *between* page tables - pgtable helpers are able to write to
all page tables. In principle it should be possible to use a different
pkey for kernel and user page tables, but that would make the kpkeys
level switching in helpers quite a bit more complicated. Isolating
further is impractical as we have so few pkeys (just 8 on arm64).

That said, what kpkeys really tries to protect against is the direct
corruption of critical data by arbitrary (unprivileged) code. If the
attacker is able to manipulate calls to set_pte() and the like, kpkeys
cannot provide much protection - even if we restricted the writes to a
specific set of page tables, the attacker would still be able to insert
a translation to any arbitrary physical page.

- Kevin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH v5 00/18] pkeys-based page table hardening
  2025-08-27 16:09       ` Kevin Brodsky
@ 2025-08-29 22:31         ` Yang Shi
  0 siblings, 0 replies; 32+ messages in thread
From: Yang Shi @ 2025-08-29 22:31 UTC (permalink / raw)
  To: Kevin Brodsky, linux-hardening
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, David Hildenbrand, Ira Weiny, Jann Horn, Jeff Xu,
	Joey Gouly, Kees Cook, Linus Walleij, Lorenzo Stoakes,
	Marc Zyngier, Mark Brown, Matthew Wilcox, Maxwell Bland,
	Mike Rapoport (IBM), Peter Zijlstra, Pierre Langlois,
	Quentin Perret, Rick Edgecombe, Ryan Roberts, Thomas Gleixner,
	Vlastimil Babka, Will Deacon, linux-arm-kernel, linux-mm, x86



On 8/27/25 9:09 AM, Kevin Brodsky wrote:
> On 26/08/2025 21:18, Yang Shi wrote:
>>>>> from a contiguous cache of pages could help minimise the overhead, as
>>>>> proposed for x86 in [1].
>>>> I'm a little bit confused about how this can work. The contiguous
>>>> cache of pages should be some large page, for example, 2M. But the
>>>> page table pages allocated from the cache may have different
>>>> permissions if I understand correctly. The default permission is RO,
>>>> but some of them may become R/W at sometime, for example, when calling
>>>> set_pte_at(). You still need to split the linear mapping, right?
>>> When such a helper is called, *all* PTPs become writeable - there is no
>>> per-PTP permission switching.
>> OK, so all PTPs in the same contiguous cache will become writeable
>> even though the helper (i.e. set_pte_at()) is just called on one of
>> the PTPs.  But doesn't it compromise the page table hardening somehow?
>> The PTPs from the same cache may belong to different processes.
> First just a note that this is true regardless of how the PTPs are
> allocated (i.e. this is already the case in this version of the series).
>
> Either way, yes you are right, this approach does not introduce any
> isolation *between* page tables - pgtable helpers are able to write to
> all page tables. In principle it should be possible to use a different
> pkey for kernel and user page tables, but that would make the kpkeys
> level switching in helpers quite a bit more complicated. Isolating
> further is impractical as we have so few pkeys (just 8 on arm64).
>
> That said, what kpkeys really tries to protect against is the direct
> corruption of critical data by arbitrary (unprivileged) code. If the
> attacker is able to manipulate calls to set_pte() and the likes, kpkeys
> cannot provide much protection - even if we restricted the writes to a
> specific set of page tables, the attacker would still be able to insert
> a translation to any arbitrary physical page.

I see. Thanks for elaborating this.

Yang

>
> - Kevin


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-08-29 22:31 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-15  8:54 [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
2025-08-15  8:54 ` [RFC PATCH v5 01/18] mm: Introduce kpkeys Kevin Brodsky
2025-08-15  8:54 ` [RFC PATCH v5 02/18] set_memory: Introduce set_memory_pkey() stub Kevin Brodsky
2025-08-15  8:54 ` [RFC PATCH v5 03/18] arm64: mm: Enable overlays for all EL1 indirect permissions Kevin Brodsky
2025-08-15  8:54 ` [RFC PATCH v5 04/18] arm64: Introduce por_elx_set_pkey_perms() helper Kevin Brodsky
2025-08-15  8:54 ` [RFC PATCH v5 05/18] arm64: Implement asm/kpkeys.h using POE Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 06/18] arm64: set_memory: Implement set_memory_pkey() Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 07/18] arm64: Reset POR_EL1 on exception entry Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 08/18] arm64: Context-switch POR_EL1 Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 09/18] arm64: Enable kpkeys Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 10/18] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 11/18] mm: Introduce kpkeys_hardened_pgtables Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 12/18] mm: Allow __pagetable_ctor() to fail Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 13/18] mm: Map page tables with privileged pkey Kevin Brodsky
2025-08-15 16:37   ` Edgecombe, Rick P
2025-08-18 16:02     ` Kevin Brodsky
2025-08-18 17:01       ` Edgecombe, Rick P
2025-08-19  9:35         ` Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 14/18] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 15/18] arm64: mm: Guard page table writes with kpkeys Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 16/18] arm64: Enable kpkeys_hardened_pgtables support Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 17/18] mm: Add basic tests for kpkeys_hardened_pgtables Kevin Brodsky
2025-08-15  8:55 ` [RFC PATCH v5 18/18] arm64: mm: Batch kpkeys level switches Kevin Brodsky
2025-08-20 15:53 ` [RFC PATCH v5 00/18] pkeys-based page table hardening Kevin Brodsky
2025-08-20 16:01   ` Kevin Brodsky
2025-08-20 16:18     ` Edgecombe, Rick P
2025-08-21  7:23       ` Kevin Brodsky
2025-08-21 17:29 ` Yang Shi
2025-08-25  7:31   ` Kevin Brodsky
2025-08-26 19:18     ` Yang Shi
2025-08-27 16:09       ` Kevin Brodsky
2025-08-29 22:31         ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).