* [RFC PATCH v1 00/57] Boot-time page size selection for arm64
@ 2024-10-14 10:55 Ryan Roberts
From: Ryan Roberts @ 2024-10-14 10:55 UTC
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Hi All,
Patch bomb incoming... This covers many subsystems, so I've included a core set
of people on the full series and additionally included maintainers on relevant
patches. I haven't included those maintainers on this cover letter since the
numbers were far too big for it to work. But I've included a link to this cover
letter on each patch, so they can hopefully find their way here. For follow up
submissions I'll break it up by subsystem, but for now thought it was important
to show the full picture.
This RFC series implements support for boot-time page size selection within the
arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
size has been selected at compile-time, meaning the size is baked into a given
kernel image. As use of larger-than-4K page sizes becomes more prevalent, this
starts to present a problem for distributions. Boot-time page size selection
enables the creation of a single kernel image, which can be told which page size
to use on the kernel command line.
Why is having an image-per-page size problematic?
=================================================
Many traditional distros are now supporting both 4K and 64K. And this means
managing 2 kernel packages, along with drivers for each. For some, it means
multiple installer flavours and multiple ISOs. All of this adds up to a
less-than-ideal level of complexity. Additionally, Android now supports 4K and
16K kernels. I'm told having to explicitly manage their KABI for each kernel is
painful, and the extra flash space required for both kernel images and the
duplicated modules has been problematic. Boot-time page size selection solves
all of this.
Additionally, in starting to think about the longer term deployment story for
D128 page tables, which the Arm architecture now supports, a lot of the same
problems need to be solved, so this work sets us up nicely for that.
So what's the down side?
========================
Well, nothing's free. Various static allocations in the kernel image must be
sized for the worst case (largest supported page size), so the image size is in
line with that of a 64K compile-time image. So if you're interested in 4K or
16K, there is a slight increase to the image size. But I expect that problem
goes away if you're compressing the image - it's just some extra zeros. At
boot-time, I expect we could free the unused static storage once we know the
page size - although that would be a follow up enhancement.
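
To illustrate the worst-case sizing described above, here is a minimal
userspace sketch. The names boot_page_shift and scratch_page are hypothetical,
and the macro values are stand-ins rather than the definitions from the series:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; the real limits come from the series' headers. */
#define PAGE_SHIFT_MIN 12                       /* 4K */
#define PAGE_SHIFT_MAX 16                       /* 64K */
#define PAGE_SIZE_MAX  (1UL << PAGE_SHIFT_MAX)

/* Hypothetical: the page shift that would be chosen once, during early boot. */
static unsigned int boot_page_shift = PAGE_SHIFT_MIN;

/* A static page-sized buffer is sized and aligned for the largest page. */
static unsigned char scratch_page[PAGE_SIZE_MAX]
	__attribute__((aligned(PAGE_SIZE_MAX)));

/* Bytes of scratch_page actually needed at the selected page size. */
static size_t scratch_used(void)
{
	assert(boot_page_shift <= PAGE_SHIFT_MAX);
	return (size_t)1 << boot_page_shift;
}

/* Bytes that are dead weight at run time and could, in principle, be freed. */
static size_t scratch_slack(void)
{
	return sizeof(scratch_page) - scratch_used();
}
```

With a 4K selection, 60K of this one buffer is unused; that unused tail is the
kind of storage a follow up could hand back after boot.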
And then there is performance. Since PAGE_SIZE and friends are no longer
compile-time constants, we must look up their values and do arithmetic at
runtime instead of compile-time. My early perf testing suggests this is
imperceptible for real-world workloads, and has only a small impact on
microbenchmarks - more on this below.
Approach
========
The basic idea is to rid the source of any assumptions that PAGE_SIZE and
friends are compile-time constant, but in a way that allows the compiler to
perform the same optimizations as it did previously if they do turn out to be
compile-time constant. Where constants are required, we use limits:
PAGE_SIZE_MIN and PAGE_SIZE_MAX. See the commit log in patch 1 for a full
description of all the classes of problems to solve.
By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
which is an alternative to selecting a compile-time page size.
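
As a rough sketch of how the default and opt-in cases could look from C, under
the assumption of a hypothetical BOOT_TIME_PAGE_SIZE switch and a hypothetical
page_shift_var variable (neither is taken from the series):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef BOOT_TIME_PAGE_SIZE
/* Opt-in: PAGE_SHIFT is a variable, bounded by constant limits. */
static unsigned int page_shift_var = 12;
#define PAGE_SHIFT      page_shift_var
#define PAGE_SHIFT_MIN  12
#define PAGE_SHIFT_MAX  16
#else
/* Default: MIN == MAX == the compile-time value, so nothing changes. */
#define PAGE_SHIFT      12
#define PAGE_SHIFT_MIN  PAGE_SHIFT
#define PAGE_SHIFT_MAX  PAGE_SHIFT
#endif

#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define PAGE_SIZE_MIN   (1UL << PAGE_SHIFT_MIN)
#define PAGE_SIZE_MAX   (1UL << PAGE_SHIFT_MAX)

/* An array bound must be a constant, so it is sized for the limit. */
static uint64_t slots[PAGE_SIZE_MAX / sizeof(uint64_t)];

/* Number of slots that belong to one page at the current page size. */
static unsigned long slots_in_use(void)
{
	assert(PAGE_SIZE <= PAGE_SIZE_MAX);
	return PAGE_SIZE / sizeof(uint64_t);
}

/* Bounds-checked access; returns NULL past the in-use part of the array. */
static uint64_t *slot(unsigned long i)
{
	return i < slots_in_use() ? &slots[i] : NULL;
}
```

In the default build every expression above folds to a constant, which is the
property that lets the compiler keep its existing optimizations.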
When boot-time page size is active, the arch pgtable geometry macro definitions
resolve to something that can be configured at boot. The arm64 implementation in
this series mainly uses global, __ro_after_init variables. I've tried using
alternatives patching, but that performs worse than loading from memory; I think
due to code size bloat.
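
For illustration only, a variable-backed geometry could be sketched in
userspace C as below. The names ptg_page_shift and select_page_shift() are
hypothetical, and __ro_after_init is defined away here because the kernel's
section annotation is not available outside the kernel:

```c
/* Stand-in: in the kernel this marks data made read-only after init. */
#define __ro_after_init

/* Hypothetical variable holding the page shift chosen at boot. */
static unsigned int ptg_page_shift __ro_after_init = 12;

/* Geometry macros now resolve to loads plus arithmetic at run time. */
#define PAGE_SHIFT      ptg_page_shift
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define PAGE_MASK       (~(PAGE_SIZE - 1))
/* arm64 descriptors are 8 bytes, so a table page holds 2^(shift - 3). */
#define PTRS_PER_PTE    (1UL << (PAGE_SHIFT - 3))

/* Hypothetical early-boot hook: accept only the three arm64 page sizes. */
static int select_page_shift(unsigned int shift)
{
	if (shift != 12 && shift != 14 && shift != 16)
		return -1;
	ptg_page_shift = shift;
	return 0;
}
```

Each use of PAGE_SIZE then costs a load from memory, which is the text-size
and performance trade-off discussed in the sections below.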
Status
======
When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented enough
to compile the kernel image itself with defconfig (and a few other bits and
pieces). This is enough to build a kernel that can boot under QEMU or FVP. I'll
happily do the rest of the work to enable all the extra drivers, but wanted to
get feedback on the shape of this effort first. If anyone wants to do any
testing, and has a must-have config, let me know and I'll prioritize enabling it
first.
The series is arranged as follows:
- patch 1: Add macros required for converting non-arch code to support
boot-time page size selection
- patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
non-arch code
- patches 37-38: Some arm64 tidy ups
- patch 39: Add macros required for converting arm64 code to support
boot-time page size selection
- patches 40-56: arm64 changes to support boot-time page size selection
- patch 57: Add arm64 Kconfig option to enable boot-time page size
selection
Ideally, I'd like to get the basics merged (something like this series), then
incrementally improve it over a handful of kernel releases until we can
demonstrate that we have feature parity with the compile-time build and no
performance blockers. Once at that point, ideally the compile-time build options
would be removed and the code could be cleaned up further.
One of the bigger pieces that I'd propose to add as a follow up is to make
va-size boot-time selectable too. That will greatly simplify LPA2 fallback
handling.
Assuming people are amenable to the rough shape, how would I go about getting
the non-arch changes merged? Since they cover many subsystems, will each piece
need to go independently to each relevant maintainer or could it all be merged
together through the arm64 tree?
Image Size
==========
The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
kernel image on disk for base (before any changes applied), compile (with
changes, configured for compile-time page size) and boot (with changes,
configured for boot-time page size).
You can see that the compile-16k and compile-64k configs are actually slightly
smaller than the baselines; that's because the series optimizes some buffer
sizes which didn't need to depend on page size. The boot-time image is ~1%
bigger than the 64k compile-time image. I believe there is scope to improve this
to make it equal to compile-64k if required:
| config | size/KB | diff/KB | diff/% |
|-------------|---------|---------|---------|
| base-4k | 54895 | 0 | 0.0% |
| base-16k | 55161 | 266 | 0.5% |
| base-64k | 56775 | 1880 | 3.4% |
| compile-4k | 54895 | 0 | 0.0% |
| compile-16k | 55097 | 202 | 0.4% |
| compile-64k | 56391 | 1496 | 2.7% |
| boot-4K | 57045 | 2150 | 3.9% |
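
As a quick cross-check of the percentages in the table above, the relative
growth can be recomputed from the size/KB column. The helper name below is
hypothetical; it returns tenths of a percent so that integer math suffices:

```c
#include <assert.h>

/*
 * Growth of `size` over `base` in tenths of a percent, rounded to nearest.
 * Intended for size >= base, as in the boot-time rows of the table.
 */
static long growth_tenths_pct(long base, long size)
{
	assert(base > 0);
	return ((size - base) * 1000 + base / 2) / base;
}
```

Feeding in base-4k (54895) and the boot image (57045) gives 39, i.e. the 3.9%
shown in the table; feeding in compile-64k (56391) and the boot image gives 12,
i.e. about 1.2%, which is the "~1%" quoted above.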
And below shows the size of the image in memory at run-time, separated for text
and data costs. The boot image has ~1% text cost; most likely due to the fact
that PAGE_SIZE and friends are not compile-time constants so need instructions
to load the values and do arithmetic. I believe we could eventually get the data
cost to match the cost for the compile image for the chosen page size by freeing
the ends of the static buffers not needed for the selected page size:
| | text | text | text | data | data | data |
| config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
|-------------|---------|---------|---------|---------|---------|---------|
| base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
| base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
| base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
| compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
| compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
| compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
| boot-4K | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
Functional Testing
==================
I've build-tested defconfig for all arches supported by tuxmake (which is most)
without issue.
I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page sizes
and a few va-sizes, and additionally have run all the mm-selftests, with no
regressions observed vs the equivalent compile-time page size build (although
the mm-selftests have a few existing failures when run against 16K and 64K
kernels - those should really be investigated and fixed independently).
Test coverage is lacking for many of the drivers that I've touched, but in many
cases, I'm hoping the changes are simple enough that review might suffice?
Performance Testing
===================
I've run some limited performance benchmarks:
First, a real-world benchmark that causes a lot of page table manipulation (and
therefore we would expect to see regression here if we are going to see it
anywhere); kernel compilation. It barely registers a change. Values are times,
so smaller is better. All relative to base-4k:
| | kern | kern | user | user | real | real |
| config | mean | stdev | mean | stdev | mean | stdev |
|-------------|---------|---------|---------|---------|---------|---------|
| base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
| compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
| boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
The Speedometer JavaScript benchmark also shows no change. Values are runs per
min, so bigger is better. All relative to base-4k:
| config | mean | stdev |
|-------------|---------|---------|
| base-4k | 0.0% | 0.8% |
| compile-4k | 0.4% | 0.8% |
| boot-4k | 0.0% | 0.9% |
Finally, I've run some microbenchmarks known to stress page table manipulations
(originally from David Hildenbrand). The fork test maps/allocs 1G of anon
memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
memory then measures the cost of munmap()ing it. The fork test is known to be
extremely sensitive to any changes that cause instructions to be aligned
differently in cachelines. When using this test for other changes, I've seen
double digit regressions for the slightest thing, so a 12% regression on this test
is actually fairly good. This likely represents the extreme worst case for
regressions that will be observed across other microbenchmarks (famous last
words). Values are times, so smaller is better. All relative to base-4k:
| | fork | fork | munmap | munmap |
| config | mean | stdev | mean | stdev |
|-------------|---------|---------|---------|---------|
| base-4k | 0.0% | 1.3% | 0.0% | 0.3% |
| compile-4k | 0.1% | 1.3% | -0.9% | 0.1% |
| boot-4k | 12.8% | 1.2% | 3.8% | 1.0% |
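
For readers who want a feel for the fork test described above, here is a
minimal userspace sketch of that style of measurement. It is a reconstruction
under my own assumptions, not the microbenchmark that produced the numbers in
the table, and time_fork_ns() is a hypothetical name:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Map and touch `bytes` of anonymous memory, then time a single fork().
 * Returns the elapsed nanoseconds seen by the parent, or -1 on error.
 */
static long long time_fork_ns(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);	/* page size is a run-time value */
	struct timespec t0, t1;
	unsigned char *mem;
	pid_t pid;

	if (page <= 0)
		return -1;

	mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	/* Fault in every page so fork() has page tables to copy. */
	for (size_t off = 0; off < bytes; off += (size_t)page)
		mem[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pid = fork();
	if (pid == 0)
		_exit(0);			/* child does no work */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (pid > 0)
		waitpid(pid, NULL, 0);
	munmap(mem, bytes);
	if (pid < 0)
		return -1;

	return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_fork_ns((size_t)1 << 30) would match the 1G mapping mentioned
above; a real benchmark would repeat the call many times and report mean and
stdev, as the table does.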
NOTE: The series applies on top of v6.11.
Thanks,
Ryan
Ryan Roberts (57):
mm: Add macros ahead of supporting boot-time page size selection
vmlinux: Align to PAGE_SIZE_MAX
mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
mm/page_alloc: Make page_frag_cache boot-time page size compatible
mm: Avoid split pmd ptl if pmd level is run-time folded
mm: Remove PAGE_SIZE compile-time constant assumption
fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
fs: Remove PAGE_SIZE compile-time constant assumption
fs/nfs: Remove PAGE_SIZE compile-time constant assumption
fs/ext4: Remove PAGE_SIZE compile-time constant assumption
fork: Permit boot-time THREAD_SIZE determination
cgroup: Remove PAGE_SIZE compile-time constant assumption
bpf: Remove PAGE_SIZE compile-time constant assumption
pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
stackdepot: Remove PAGE_SIZE compile-time constant assumption
perf: Remove PAGE_SIZE compile-time constant assumption
kvm: Remove PAGE_SIZE compile-time constant assumption
trace: Remove PAGE_SIZE compile-time constant assumption
crash: Remove PAGE_SIZE compile-time constant assumption
crypto: Remove PAGE_SIZE compile-time constant assumption
sunrpc: Remove PAGE_SIZE compile-time constant assumption
sound: Remove PAGE_SIZE compile-time constant assumption
net: Remove PAGE_SIZE compile-time constant assumption
net: fec: Remove PAGE_SIZE compile-time constant assumption
net: marvell: Remove PAGE_SIZE compile-time constant assumption
net: hns3: Remove PAGE_SIZE compile-time constant assumption
net: e1000: Remove PAGE_SIZE compile-time constant assumption
net: igbvf: Remove PAGE_SIZE compile-time constant assumption
net: igb: Remove PAGE_SIZE compile-time constant assumption
drivers/base: Remove PAGE_SIZE compile-time constant assumption
edac: Remove PAGE_SIZE compile-time constant assumption
optee: Remove PAGE_SIZE compile-time constant assumption
random: Remove PAGE_SIZE compile-time constant assumption
sata_sil24: Remove PAGE_SIZE compile-time constant assumption
virtio: Remove PAGE_SIZE compile-time constant assumption
xen: Remove PAGE_SIZE compile-time constant assumption
arm64: Fix macros to work in C code in addition to the linker script
arm64: Track early pgtable allocation limit
arm64: Introduce macros required for boot-time page selection
arm64: Refactor early pgtable size calculation macros
arm64: Pass desired page size on command line
arm64: Divorce early init from PAGE_SIZE
arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
arm64: Align sections to PAGE_SIZE_MAX
arm64: Rework trampoline rodata mapping
arm64: Generalize fixmap for boot-time page size
arm64: Statically allocate and align for worst-case page size
arm64: Convert switch to if for non-const comparison values
arm64: Convert BUILD_BUG_ON to VM_BUG_ON
arm64: Remove PAGE_SZ asm-offset
arm64: Introduce cpu features for page sizes
arm64: Remove PAGE_SIZE from assembly code
arm64: Runtime-fold pmd level
arm64: Support runtime folding in idmap_kpti_install_ng_mappings
arm64: TRAMP_VALIAS is no longer compile-time constant
arm64: Determine THREAD_SIZE at boot-time
arm64: Enable boot-time page size selection
arch/alpha/include/asm/page.h | 1 +
arch/arc/include/asm/page.h | 1 +
arch/arm/include/asm/page.h | 1 +
arch/arm64/Kconfig | 26 ++-
arch/arm64/include/asm/assembler.h | 78 ++++++-
arch/arm64/include/asm/cpufeature.h | 44 +++-
arch/arm64/include/asm/efi.h | 2 +-
arch/arm64/include/asm/fixmap.h | 28 ++-
arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
arch/arm64/include/asm/kvm_arm.h | 21 +-
arch/arm64/include/asm/kvm_hyp.h | 11 +
arch/arm64/include/asm/kvm_pgtable.h | 6 +-
arch/arm64/include/asm/memory.h | 62 ++++--
arch/arm64/include/asm/page-def.h | 3 +-
arch/arm64/include/asm/pgalloc.h | 16 +-
arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
arch/arm64/include/asm/pgtable-prot.h | 2 +-
arch/arm64/include/asm/pgtable.h | 133 +++++++++---
arch/arm64/include/asm/processor.h | 10 +-
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/include/asm/smp.h | 1 +
arch/arm64/include/asm/sparsemem.h | 15 +-
arch/arm64/include/asm/sysreg.h | 54 +++--
arch/arm64/include/asm/tlb.h | 3 +
arch/arm64/kernel/asm-offsets.c | 4 +-
arch/arm64/kernel/cpufeature.c | 93 ++++++--
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/kernel/entry.S | 60 +++++-
arch/arm64/kernel/head.S | 46 +++-
arch/arm64/kernel/hibernate-asm.S | 6 +-
arch/arm64/kernel/image-vars.h | 14 ++
arch/arm64/kernel/image.h | 4 +
arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
arch/arm64/kernel/pi/pi.h | 63 +++++-
arch/arm64/kernel/relocate_kernel.S | 10 +-
arch/arm64/kernel/vdso-wrap.S | 4 +-
arch/arm64/kernel/vdso.c | 7 +-
arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
arch/arm64/kernel/vdso32-wrap.S | 4 +-
arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
arch/arm64/kernel/vmlinux.lds.S | 48 +++--
arch/arm64/kvm/arm.c | 10 +
arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
arch/arm64/kvm/mmu.c | 39 ++--
arch/arm64/lib/clear_page.S | 7 +-
arch/arm64/lib/copy_page.S | 33 ++-
arch/arm64/lib/mte.S | 27 ++-
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fixmap.c | 38 ++--
arch/arm64/mm/hugetlbpage.c | 40 +---
arch/arm64/mm/init.c | 26 +--
arch/arm64/mm/kasan_init.c | 8 +-
arch/arm64/mm/mmu.c | 53 +++--
arch/arm64/mm/pgd.c | 12 +-
arch/arm64/mm/pgtable-geometry.c | 24 +++
arch/arm64/mm/proc.S | 128 ++++++++---
arch/arm64/mm/ptdump.c | 3 +-
arch/arm64/tools/cpucaps | 3 +
arch/csky/include/asm/page.h | 3 +
arch/hexagon/include/asm/page.h | 2 +
arch/loongarch/include/asm/page.h | 2 +
arch/m68k/include/asm/page.h | 1 +
arch/microblaze/include/asm/page.h | 1 +
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 2 +
arch/openrisc/include/asm/page.h | 1 +
arch/parisc/include/asm/page.h | 1 +
arch/powerpc/include/asm/page.h | 2 +
arch/riscv/include/asm/page.h | 1 +
arch/s390/include/asm/page.h | 1 +
arch/sh/include/asm/page.h | 1 +
arch/sparc/include/asm/page.h | 3 +
arch/um/include/asm/page.h | 2 +
arch/x86/include/asm/page_types.h | 2 +
arch/xtensa/include/asm/page.h | 1 +
crypto/lskcipher.c | 4 +-
drivers/ata/sata_sil24.c | 46 ++--
drivers/base/node.c | 6 +-
drivers/base/topology.c | 32 +--
drivers/block/virtio_blk.c | 2 +-
drivers/char/random.c | 4 +-
drivers/edac/edac_mc.h | 13 +-
drivers/firmware/efi/libstub/arm64.c | 3 +-
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/mtd/mtdswap.c | 4 +-
drivers/net/ethernet/freescale/fec.h | 3 +-
drivers/net/ethernet/freescale/fec_main.c | 5 +-
.../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
drivers/net/ethernet/intel/igb/igb.h | 25 +--
drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
drivers/net/ethernet/marvell/mvneta.c | 9 +-
drivers/net/ethernet/marvell/sky2.h | 2 +-
drivers/tee/optee/call.c | 7 +-
drivers/tee/optee/smc_abi.c | 2 +-
drivers/virtio/virtio_balloon.c | 10 +-
drivers/xen/balloon.c | 11 +-
drivers/xen/biomerge.c | 12 +-
drivers/xen/privcmd.c | 2 +-
drivers/xen/xenbus/xenbus_client.c | 5 +-
drivers/xen/xlate_mmu.c | 6 +-
fs/binfmt_elf.c | 11 +-
fs/buffer.c | 2 +-
fs/coredump.c | 8 +-
fs/ext4/ext4.h | 36 ++--
fs/ext4/move_extent.c | 2 +-
fs/ext4/readpage.c | 2 +-
fs/fat/dir.c | 4 +-
fs/fat/fatent.c | 4 +-
fs/nfs/nfs42proc.c | 2 +-
fs/nfs/nfs42xattr.c | 2 +-
fs/nfs/nfs4proc.c | 2 +-
include/asm-generic/pgtable-geometry.h | 71 +++++++
include/asm-generic/vmlinux.lds.h | 38 ++--
include/linux/buffer_head.h | 1 +
include/linux/cpumask.h | 5 +
include/linux/linkage.h | 4 +-
include/linux/mm.h | 17 +-
include/linux/mm_types.h | 15 +-
include/linux/mm_types_task.h | 2 +-
include/linux/mmzone.h | 3 +-
include/linux/netlink.h | 6 +-
include/linux/percpu-defs.h | 4 +-
include/linux/perf_event.h | 2 +-
include/linux/sched.h | 4 +-
include/linux/slab.h | 7 +-
include/linux/stackdepot.h | 6 +-
include/linux/sunrpc/svc.h | 8 +-
include/linux/sunrpc/svc_rdma.h | 4 +-
include/linux/sunrpc/svcsock.h | 2 +-
include/linux/swap.h | 17 +-
include/linux/swapops.h | 6 +-
include/linux/thread_info.h | 10 +-
include/xen/page.h | 2 +
init/main.c | 7 +-
kernel/bpf/core.c | 9 +-
kernel/bpf/ringbuf.c | 54 ++---
kernel/cgroup/cgroup.c | 8 +-
kernel/crash_core.c | 2 +-
kernel/events/core.c | 2 +-
kernel/fork.c | 71 +++----
kernel/power/power.h | 2 +-
kernel/power/snapshot.c | 2 +-
kernel/power/swap.c | 129 +++++++++--
kernel/trace/fgraph.c | 2 +-
kernel/trace/trace.c | 2 +-
lib/stackdepot.c | 6 +-
mm/kasan/report.c | 3 +-
mm/memcontrol.c | 11 +-
mm/memory.c | 4 +-
mm/mmap.c | 2 +-
mm/page-writeback.c | 2 +-
mm/page_alloc.c | 31 +--
mm/slub.c | 2 +-
mm/sparse.c | 2 +-
mm/swapfile.c | 2 +-
mm/vmalloc.c | 7 +-
net/9p/trans_virtio.c | 4 +-
net/core/hotdata.c | 4 +-
net/core/skbuff.c | 4 +-
net/core/sysctl_net_core.c | 2 +-
net/sunrpc/cache.c | 3 +-
net/unix/af_unix.c | 2 +-
sound/soc/soc-utils.c | 4 +-
virt/kvm/kvm_main.c | 2 +-
172 files changed, 2185 insertions(+), 951 deletions(-)
create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
create mode 100644 arch/arm64/mm/pgtable-geometry.c
create mode 100644 include/asm-generic/pgtable-geometry.h
--
2.43.0
* [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
From: Ryan Roberts @ 2024-10-14 10:58 UTC
To: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86
Cc: Ryan Roberts, linux-alpha, linux-arch, linux-arm-kernel,
linux-csky, linux-hexagon, linux-kernel, linux-m68k, linux-mips,
linux-mm, linux-openrisc, linux-parisc, linux-riscv, linux-s390,
linux-sh, linux-snps-arc, linux-um, linuxppc-dev, loongarch,
sparclinux
arm64 can support multiple base page sizes. Instead of selecting a page
size at compile time, as is done today, we will make it possible to
select the desired page size on the command line.
In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
(as well as a number of other macros related to or derived from
PAGE_SHIFT, but I'm not worrying about those yet), are no longer
compile-time constants. So the code base needs to cope with that.
As a first step, introduce MIN and MAX variants of these macros, which
express the range of possible page sizes. These are always compile-time
constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
were previously used where a compile-time constant is required.
(Subsequent patches will do that conversion work). When the arch/build
doesn't support boot-time page size selection, the MIN and MAX variants
are equal and everything resolves as it did previously.
Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
global variable definitions so that for boot-time page size selection
builds, the variable being wrapped is initialized at boot-time, instead
of compile-time. This is done by defining a function to do the
assignment, which has the "constructor" attribute. Constructor is
preferred over initcall, because when compiling a module, the module is
limited to a single initcall but constructors are unlimited. For
built-in code, constructors are now called earlier to guarantee that
the variables are initialized by the time they are used. Any arch that
wants to enable boot-time page size selection will need to select
CONFIG_CONSTRUCTORS.
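
The constructor mechanism can be sketched in userspace C as follows. This is
not the definition of DEFINE_GLOBAL_PAGE_SIZE_VAR() from this patch; the macro
DEFINE_PAGE_SIZE_VAR(), the variable boot_page_size and the variable
max_io_bytes are hypothetical names used only to show the idea, and
__attribute__((constructor)) is a GCC/Clang extension:

```c
/* Hypothetical stand-in for a page size that is only known at boot. */
static unsigned long boot_page_size = 4096;
#define PAGE_SIZE boot_page_size

/*
 * Define a global and a constructor that assigns its initial value at
 * start-up, because the initializer is no longer a constant expression.
 */
#define DEFINE_PAGE_SIZE_VAR(type, name, expr)				\
	type name;							\
	static void __attribute__((constructor)) name##_ctor(void)	\
	{								\
		name = (expr);						\
	}

/* Would have been: unsigned long max_io_bytes = 16 * PAGE_SIZE; */
DEFINE_PAGE_SIZE_VAR(unsigned long, max_io_bytes, 16 * PAGE_SIZE)
```

In a hosted program the constructor runs before main(), so max_io_bytes
already holds 16 pages' worth of bytes by the time any ordinary code reads it.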
These new macros need to be available anywhere PAGE_SHIFT and friends
are available. Those are defined via asm/page.h (although some arches
have a sub-include that defines them). Unfortunately there is no
reliable asm-generic header we can easily piggy-back on, so let's define
a new one, pgtable-geometry.h, which we include near where each arch
defines PAGE_SHIFT. Ugh.
-------
Most of the problems that need to be solved over the next few patches
fall into these broad categories, which are all solved with the help of
these new macros:
1. Assignment of values derived from PAGE_SIZE in global variables
For boot-time page size builds, we must defer the initialization of
these variables until boot-time, when the page size is known. See
DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
2. Define static storage in units related to PAGE_SIZE
This static storage will be defined according to PAGE_SIZE_MAX.
3. Define size of struct so that it is related to PAGE_SIZE
The struct often contains an array that is sized to fill the page. In
this case, use a flexible array with dynamic allocation. In other
cases, the struct fits exactly over a page, which is a header (e.g.
swap file header). In this case, remove the padding, and manually
determine the struct pointer within the page.
4. BUILD_BUG_ON() with values derived from PAGE_SIZE
In most cases, we can change these to compare against the appropriate
limit (either MIN or MAX). In other cases, we must change these to
run-time BUG_ON().
5. Ensure page alignment of static data structures
Align instead to PAGE_SIZE_MAX.
6. #ifdeffery based on PAGE_SIZE
Often these can be changed to C code constructs. e.g. a macro that
returns a different value depending on page size can be changed to use
the ternary operator and the compiler will dead code strip it for the
compile-time constant case and runtime evaluate it for the non-const
case. Or #if/#else/#endif within a function can be converted to C
if/else blocks, which are also dead code stripped for the const case.
Sometimes we can change the C preprocessor logic to use the
appropriate MIN/MAX limit.
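
A few of the categories above (3, 4 and 6) can be sketched in userspace C as
below. All struct, function and variable names are hypothetical, the macro
values are stand-ins, and _Static_assert and assert() stand in for
BUILD_BUG_ON() and BUG_ON():

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the macros this patch introduces. */
#define PAGE_SHIFT_MIN 12
#define PAGE_SHIFT_MAX 16
#define PAGE_SIZE_MIN  (1UL << PAGE_SHIFT_MIN)
#define PAGE_SIZE_MAX  (1UL << PAGE_SHIFT_MAX)
static unsigned int page_shift = PAGE_SHIFT_MIN;	/* chosen at boot */
#define PAGE_SIZE      (1UL << page_shift)

/* Category 3: an array that filled the page becomes a flexible array. */
struct slot_table {
	size_t nr_slots;
	unsigned long slots[];
};

static struct slot_table *slot_table_alloc(void)
{
	struct slot_table *t = malloc(PAGE_SIZE);

	if (t)
		t->nr_slots = (PAGE_SIZE - sizeof(*t)) / sizeof(t->slots[0]);
	return t;
}

/* Category 4: compile-time check against the relevant limit... */
struct record {
	char bytes[64];
};
_Static_assert(sizeof(struct record) <= PAGE_SIZE_MIN,
	       "record must fit in the smallest page");

/* ...or a run-time check when no limit can express the condition. */
static void check_geometry(void)
{
	assert((PAGE_SIZE & (PAGE_SIZE - 1)) == 0);
}

/* Category 6: a ternary replaces #if/#else on the page size. */
static unsigned int batch_order(void)
{
	return PAGE_SIZE >= PAGE_SIZE_MAX ? 0 : 2;
}
```

When PAGE_SIZE is a compile-time constant, the ternary in batch_order() folds
to a single value, so the compile-time build pays nothing for the change.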
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/alpha/include/asm/page.h | 1 +
arch/arc/include/asm/page.h | 1 +
arch/arm/include/asm/page.h | 1 +
arch/arm64/include/asm/page-def.h | 2 +
arch/csky/include/asm/page.h | 3 ++
arch/hexagon/include/asm/page.h | 2 +
arch/loongarch/include/asm/page.h | 2 +
arch/m68k/include/asm/page.h | 1 +
arch/microblaze/include/asm/page.h | 1 +
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 2 +
arch/openrisc/include/asm/page.h | 1 +
arch/parisc/include/asm/page.h | 1 +
arch/powerpc/include/asm/page.h | 2 +
arch/riscv/include/asm/page.h | 1 +
arch/s390/include/asm/page.h | 1 +
arch/sh/include/asm/page.h | 1 +
arch/sparc/include/asm/page.h | 3 ++
arch/um/include/asm/page.h | 2 +
arch/x86/include/asm/page_types.h | 2 +
arch/xtensa/include/asm/page.h | 1 +
include/asm-generic/pgtable-geometry.h | 71 ++++++++++++++++++++++++++
init/main.c | 5 +-
23 files changed, 107 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/pgtable-geometry.h
diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index 70419e6be1a35..d0096fb5521b8 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* _ALPHA_PAGE_H */
diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
index def0dfb95b436..8d56549db7a33 100644
--- a/arch/arc/include/asm/page.h
+++ b/arch/arc/include/asm/page.h
@@ -6,6 +6,7 @@
#define __ASM_ARC_PAGE_H
#include <uapi/asm/page.h>
+#include <asm-generic/pgtable-geometry.h>
#ifdef CONFIG_ARC_HAS_PAE40
diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
index 62af9f7f9e963..417aa8533c718 100644
--- a/arch/arm/include/asm/page.h
+++ b/arch/arm/include/asm/page.h
@@ -191,5 +191,6 @@ extern int pfn_valid(unsigned long);
#include <asm-generic/getorder.h>
#include <asm-generic/memory_model.h>
+#include <asm-generic/pgtable-geometry.h>
#endif
diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index 792e9fe881dcf..d69971cf49cd2 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -15,4 +15,6 @@
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* __ASM_PAGE_DEF_H */
diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
index 0ca6c408c07f2..95173d57adc8b 100644
--- a/arch/csky/include/asm/page.h
+++ b/arch/csky/include/asm/page.h
@@ -92,4 +92,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
#include <asm-generic/getorder.h>
#endif /* !__ASSEMBLY__ */
+
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* __ASM_CSKY_PAGE_H */
diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
index 8a6af57274c2d..ba7ad5231695f 100644
--- a/arch/hexagon/include/asm/page.h
+++ b/arch/hexagon/include/asm/page.h
@@ -139,4 +139,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
#endif /* ifdef __ASSEMBLY__ */
#endif /* ifdef __KERNEL__ */
+#include <asm-generic/pgtable-geometry.h>
+
#endif
diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
index e85df33f11c77..9862e8fb047a6 100644
--- a/arch/loongarch/include/asm/page.h
+++ b/arch/loongarch/include/asm/page.h
@@ -123,4 +123,6 @@ extern int __virt_addr_valid(volatile void *kaddr);
#endif /* !__ASSEMBLY__ */
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* _ASM_PAGE_H */
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 8cfb84b499751..4df4681b02194 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -60,5 +60,6 @@ extern unsigned long _ramend;
#include <asm-generic/getorder.h>
#include <asm-generic/memory_model.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* _M68K_PAGE_H */
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 8810f4f1c3b02..abc23c3d743bd 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -142,5 +142,6 @@ static inline const void *pfn_to_virt(unsigned long pfn)
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* _ASM_MICROBLAZE_PAGE_H */
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 4609cb0326cf3..3d91021538f02 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -227,5 +227,6 @@ static inline unsigned long kaslr_offset(void)
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* _ASM_PAGE_H */
diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
index 0722f88e63cc7..2e5f93beb42b7 100644
--- a/arch/nios2/include/asm/page.h
+++ b/arch/nios2/include/asm/page.h
@@ -97,4 +97,6 @@ extern struct page *mem_map;
#endif /* !__ASSEMBLY__ */
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* _ASM_NIOS2_PAGE_H */
diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
index 1d5913f67c312..a0da2a9842241 100644
--- a/arch/openrisc/include/asm/page.h
+++ b/arch/openrisc/include/asm/page.h
@@ -88,5 +88,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* __ASM_OPENRISC_PAGE_H */
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 4bea2e95798f0..2a75496237c09 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -173,6 +173,7 @@ extern int npmem_ranges;
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#include <asm/pdc.h>
#define PAGE0 ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 83d0a4fc5f755..4601c115b6485 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -300,4 +300,6 @@ static inline unsigned long kaslr_offset(void)
#include <asm-generic/memory_model.h>
#endif /* __ASSEMBLY__ */
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 7ede2111c5917..e5af7579e45bf 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -204,5 +204,6 @@ static __always_inline void *pfn_to_kaddr(unsigned long pfn)
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* _ASM_RISCV_PAGE_H */
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 16e4caa931f1f..42157e7690a77 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -275,6 +275,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#define AMODE31_SIZE (3 * PAGE_SIZE)
diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
index f780b467e75d7..09533d46ef033 100644
--- a/arch/sh/include/asm/page.h
+++ b/arch/sh/include/asm/page.h
@@ -162,5 +162,6 @@ typedef struct page *pgtable_t;
#include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* __ASM_SH_PAGE_H */
diff --git a/arch/sparc/include/asm/page.h b/arch/sparc/include/asm/page.h
index 5e44cdf2a8f2b..4327fe2bfa010 100644
--- a/arch/sparc/include/asm/page.h
+++ b/arch/sparc/include/asm/page.h
@@ -9,4 +9,7 @@
#else
#include <asm/page_32.h>
#endif
+
+#include <asm-generic/pgtable-geometry.h>
+
#endif
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 9ef9a8aedfa66..f26011808f514 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -119,4 +119,6 @@ extern unsigned long uml_physmem;
#define __HAVE_ARCH_GATE_AREA 1
#endif
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* __UM_PAGE_H */
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 52f1b4ff0cc16..6d2381342047f 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -71,4 +71,6 @@ extern void initmem_init(void);
#endif /* !__ASSEMBLY__ */
+#include <asm-generic/pgtable-geometry.h>
+
#endif /* _ASM_X86_PAGE_DEFS_H */
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 4db56ef052d22..86952cb32af23 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
#endif /* __ASSEMBLY__ */
#include <asm-generic/memory_model.h>
+#include <asm-generic/pgtable-geometry.h>
#endif /* _XTENSA_PAGE_H */
diff --git a/include/asm-generic/pgtable-geometry.h b/include/asm-generic/pgtable-geometry.h
new file mode 100644
index 0000000000000..358e729a6ac37
--- /dev/null
+++ b/include/asm-generic/pgtable-geometry.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_GENERIC_PGTABLE_GEOMETRY_H
+#define ASM_GENERIC_PGTABLE_GEOMETRY_H
+
+#if defined(PAGE_SHIFT_MAX) && defined(PAGE_SIZE_MAX) && defined(PAGE_MASK_MAX) && \
+ defined(PAGE_SHIFT_MIN) && defined(PAGE_SIZE_MIN) && defined(PAGE_MASK_MIN)
+/* Arch supports boot-time page size selection. */
+#elif defined(PAGE_SHIFT_MAX) || defined(PAGE_SIZE_MAX) || defined(PAGE_MASK_MAX) || \
+ defined(PAGE_SHIFT_MIN) || defined(PAGE_SIZE_MIN) || defined(PAGE_MASK_MIN)
+#error Arch must define all or none of the boot-time page size macros
+#else
+/* Arch does not support boot-time page size selection. */
+#define PAGE_SHIFT_MIN PAGE_SHIFT
+#define PAGE_SIZE_MIN PAGE_SIZE
+#define PAGE_MASK_MIN PAGE_MASK
+#define PAGE_SHIFT_MAX PAGE_SHIFT
+#define PAGE_SIZE_MAX PAGE_SIZE
+#define PAGE_MASK_MAX PAGE_MASK
+#endif
+
+/*
+ * Define a global variable (scalar or struct), whose value is derived from
+ * PAGE_SIZE and friends. When PAGE_SIZE is a compile-time constant, the global
+ * variable is simply defined with the static value. When PAGE_SIZE is
+ * determined at boot-time, a pure initcall is registered and run during boot to
+ * initialize the variable.
+ *
+ * @type: Unqualified type. Do not include "const"; implied by macro variant.
+ * @name: Variable name.
+ * @...: Initialization value. May be scalar or initializer.
+ *
+ * "static" is declared by placing "static" before the macro.
+ *
+ * Example:
+ *
+ * struct my_struct {
+ * int a;
+ * char b;
+ * };
+ *
+ * static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct my_struct, my_variable, {
+ * .a = 10,
+ * .b = 'e',
+ * });
+ */
+#if PAGE_SIZE_MIN != PAGE_SIZE_MAX
+#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
+ type name attrib; \
+ static int __init __attribute__((constructor)) __##name##_init(void) \
+ { \
+ name = (type)__VA_ARGS__; \
+ return 0; \
+ }
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
+ __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
+ __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, __ro_after_init, __VA_ARGS__)
+#else /* PAGE_SIZE_MIN == PAGE_SIZE_MAX */
+#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
+ type name attrib = __VA_ARGS__; \
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
+ __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
+
+#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
+ __DEFINE_GLOBAL_PAGE_SIZE_VAR(const type, name, , __VA_ARGS__)
+#endif
+
+#endif /* ASM_GENERIC_PGTABLE_GEOMETRY_H */
diff --git a/init/main.c b/init/main.c
index 206acdde51f5a..ba1515eb20b9d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -899,6 +899,8 @@ static void __init early_numa_node_init(void)
#endif
}
+static __init void do_ctors(void);
+
asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
void start_kernel(void)
{
@@ -910,6 +912,8 @@ void start_kernel(void)
debug_objects_early_init();
init_vmlinux_build_id();
+ do_ctors();
+
cgroup_init_early();
local_irq_disable();
@@ -1360,7 +1364,6 @@ static void __init do_basic_setup(void)
cpuset_init_smp();
driver_init();
init_irq_proc();
- do_ctors();
do_initcalls();
}
--
2.43.0
^ permalink raw reply related [flat|nested] 196+ messages in thread
* [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 16:50 ` Christoph Lameter (Ampere)
2024-10-14 10:58 ` [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large Ryan Roberts
` (57 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Arnd Bergmann,
Catalin Marinas, Christoph Lameter, David Hildenbrand,
Dennis Zhou, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Tejun Heo, Will Deacon
Cc: Ryan Roberts, linux-arch, linux-arm-kernel, linux-kernel,
linux-mm
Increase alignment of structures requiring at least PAGE_SIZE alignment
to PAGE_SIZE_MAX. For compile-time PAGE_SIZE, PAGE_SIZE_MAX == PAGE_SIZE
so there is no change. For boot-time PAGE_SIZE, PAGE_SIZE_MAX is the
largest selectable page size.
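
To illustrate why aligning to the largest page size is sufficient, here is a minimal userspace sketch. It assumes power-of-two page sizes between 4K and 64K; the EX_* constants and ex_* helpers are hypothetical names for illustration only, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds mirroring arm64's 4K..64K range (assumption). */
#define EX_PAGE_SIZE_MIN (UINT64_C(1) << 12)
#define EX_PAGE_SIZE_MAX (UINT64_C(1) << 16)

/* Round addr up to a power-of-two boundary, like ALIGN() in a linker script. */
static inline uint64_t ex_align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* True if addr is a multiple of the power-of-two value align. */
static inline bool ex_is_aligned(uint64_t addr, uint64_t align)
{
	return (addr & (align - 1)) == 0;
}

/*
 * Align addr to the maximum page size, then check that the result is
 * also aligned for every smaller power-of-two page size.
 */
static inline bool ex_aligned_for_all_page_sizes(uint64_t addr)
{
	uint64_t aligned = ex_align_up(addr, EX_PAGE_SIZE_MAX);
	uint64_t sz;

	for (sz = EX_PAGE_SIZE_MIN; sz <= EX_PAGE_SIZE_MAX; sz <<= 1) {
		if (!ex_is_aligned(aligned, sz))
			return false;
	}
	return true;
}
```

The converse does not hold: an address aligned only to the minimum page size may be misaligned once a larger page size is selected at boot, which is why the alignment has to be raised rather than lowered.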
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
include/asm-generic/vmlinux.lds.h | 32 +++++++++++++++----------------
include/linux/linkage.h | 4 ++--
include/linux/percpu-defs.h | 4 ++--
3 files changed, 20 insertions(+), 20 deletions(-)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1ae44793132a8..5727f883001bb 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -13,7 +13,7 @@
* . = START;
* __init_begin = .;
* HEAD_TEXT_SECTION
- * INIT_TEXT_SECTION(PAGE_SIZE)
+ * INIT_TEXT_SECTION(PAGE_SIZE_MAX)
* INIT_DATA_SECTION(...)
* PERCPU_SECTION(CACHELINE_SIZE)
* __init_end = .;
@@ -23,7 +23,7 @@
* _etext = .;
*
* _sdata = .;
- * RO_DATA(PAGE_SIZE)
+ * RO_DATA(PAGE_SIZE_MAX)
* RW_DATA(...)
* _edata = .;
*
@@ -371,10 +371,10 @@
* Data section helpers
*/
#define NOSAVE_DATA \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__nosave_begin = .; \
*(.data..nosave) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__nosave_end = .;
#define PAGE_ALIGNED_DATA(page_align) \
@@ -733,9 +733,9 @@
. = ALIGN(bss_align); \
.bss : AT(ADDR(.bss) - LOAD_OFFSET) { \
BSS_FIRST_SECTIONS \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
*(.bss..page_aligned) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
*(.dynbss) \
*(BSS_MAIN) \
*(COMMON) \
@@ -950,9 +950,9 @@
*/
#ifdef CONFIG_AMD_MEM_ENCRYPT
#define PERCPU_DECRYPTED_SECTION \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
*(.data..percpu..decrypted) \
- . = ALIGN(PAGE_SIZE);
+ . = ALIGN(PAGE_SIZE_MAX);
#else
#define PERCPU_DECRYPTED_SECTION
#endif
@@ -1030,7 +1030,7 @@
#define PERCPU_INPUT(cacheline) \
__per_cpu_start = .; \
*(.data..percpu..first) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
*(.data..percpu..page_aligned) \
. = ALIGN(cacheline); \
*(.data..percpu..read_mostly) \
@@ -1075,16 +1075,16 @@
* PERCPU_SECTION - define output section for percpu area, simple version
* @cacheline: cacheline size
*
- * Align to PAGE_SIZE and outputs output section for percpu area. This
+ * Align to PAGE_SIZE_MAX and outputs output section for percpu area. This
* macro doesn't manipulate @vaddr or @phdr and __per_cpu_load and
* __per_cpu_start will be identical.
*
- * This macro is equivalent to ALIGN(PAGE_SIZE); PERCPU_VADDR(@cacheline,,)
+ * This macro is equivalent to ALIGN(PAGE_SIZE_MAX); PERCPU_VADDR(@cacheline,,)
* except that __per_cpu_load is defined as a relative symbol against
* .data..percpu which is required for relocatable x86_32 configuration.
*/
#define PERCPU_SECTION(cacheline) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
.data..percpu : AT(ADDR(.data..percpu) - LOAD_OFFSET) { \
__per_cpu_load = .; \
PERCPU_INPUT(cacheline) \
@@ -1102,15 +1102,15 @@
* All sections are combined in a single .data section.
* The sections following CONSTRUCTORS are arranged so their
* typical alignment matches.
- * A cacheline is typical/always less than a PAGE_SIZE so
+ * A cacheline is typical/always less than a PAGE_SIZE_MAX so
* the sections that has this restriction (or similar)
- * is located before the ones requiring PAGE_SIZE alignment.
- * NOSAVE_DATA starts and ends with a PAGE_SIZE alignment which
+ * is located before the ones requiring PAGE_SIZE_MAX alignment.
+ * NOSAVE_DATA starts and ends with a PAGE_SIZE_MAX alignment which
* matches the requirement of PAGE_ALIGNED_DATA.
*
* use 0 as page_align if page_aligned data is not used */
#define RW_DATA(cacheline, pagealigned, inittask) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
.data : AT(ADDR(.data) - LOAD_OFFSET) { \
INIT_TASK_DATA(inittask) \
NOSAVE_DATA \
diff --git a/include/linux/linkage.h b/include/linux/linkage.h
index 5c8865bb59d91..68aa9775fce51 100644
--- a/include/linux/linkage.h
+++ b/include/linux/linkage.h
@@ -36,8 +36,8 @@
__stringify(name))
#endif
-#define __page_aligned_data __section(".data..page_aligned") __aligned(PAGE_SIZE)
-#define __page_aligned_bss __section(".bss..page_aligned") __aligned(PAGE_SIZE)
+#define __page_aligned_data __section(".data..page_aligned") __aligned(PAGE_SIZE_MAX)
+#define __page_aligned_bss __section(".bss..page_aligned") __aligned(PAGE_SIZE_MAX)
/*
* For assembly routines.
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 8efce7414fad6..89c7f430015ba 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -156,11 +156,11 @@
*/
#define DECLARE_PER_CPU_PAGE_ALIGNED(type, name) \
DECLARE_PER_CPU_SECTION(type, name, "..page_aligned") \
- __aligned(PAGE_SIZE)
+ __aligned(PAGE_SIZE_MAX)
#define DEFINE_PER_CPU_PAGE_ALIGNED(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "..page_aligned") \
- __aligned(PAGE_SIZE)
+ __aligned(PAGE_SIZE_MAX)
/*
* Declaration/definition used for per-CPU variables that must be read mostly.
--
2.43.0
^ permalink raw reply related [flat|nested] 196+ messages in thread
* [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 13:00 ` Johannes Weiner
` (2 more replies)
2024-10-14 10:58 ` [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible Ryan Roberts
` (56 subsequent siblings)
58 siblings, 3 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Johannes Weiner,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Michal Hocko, Miroslav Benes, Roman Gushchin, Shakeel Butt,
Will Deacon
Cc: Ryan Roberts, cgroups, linux-arm-kernel, linux-kernel, linux-mm
Previously the seq_buf used for accumulating the memory.stat output was
sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
if 4K is enough on a 4K page system, then it should also be enough on a
64K page system, so we can save 60K from the static buffer used in
mem_cgroup_print_oom_meminfo(). Let's make it so.
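
The saving quoted above can be checked with a small sketch. The EX_* names and the helper are hypothetical; only the 4K buffer size and the 64K page size come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define EX_SZ_4K        ((size_t)4 * 1024)
#define EX_SZ_64K       ((size_t)64 * 1024)
#define EX_SEQ_BUF_SIZE EX_SZ_4K /* fixed buffer size, independent of page size */

/* Bytes saved by a fixed-size buffer compared with a PAGE_SIZE-sized one. */
static inline size_t ex_static_buf_saving(size_t page_size)
{
	return page_size > EX_SEQ_BUF_SIZE ? page_size - EX_SEQ_BUF_SIZE : 0;
}
```

With 4K pages nothing changes; with 64K pages the static buffer shrinks by 60K.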
This also has the beneficial side effect of removing a place in the code
that assumed PAGE_SIZE is a compile-time constant. So this helps our
quest towards supporting boot-time page size selection.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
mm/memcontrol.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d563fb515766b..c5f9195f76c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,7 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024
+#define SEQ_BUF_SIZE SZ_4K
static inline bool task_is_dying(void)
{
@@ -1519,7 +1520,7 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *
void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
{
/* Use static buffer, for the caller is holding oom_lock. */
- static char buf[PAGE_SIZE];
+ static char buf[SEQ_BUF_SIZE];
struct seq_buf s;
lockdep_assert_held(&oom_lock);
@@ -1545,7 +1546,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
pr_info("Memory cgroup stats for ");
pr_cont_cgroup_path(memcg->css.cgroup);
pr_cont(":");
- seq_buf_init(&s, buf, sizeof(buf));
+ seq_buf_init(&s, buf, SEQ_BUF_SIZE);
memory_stat_format(memcg, &s);
seq_buf_do_printk(&s, KERN_INFO);
}
@@ -4158,12 +4159,12 @@ static int memory_events_local_show(struct seq_file *m, void *v)
int memory_stat_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ char *buf = kmalloc(SEQ_BUF_SIZE, GFP_KERNEL);
struct seq_buf s;
if (!buf)
return -ENOMEM;
- seq_buf_init(&s, buf, PAGE_SIZE);
+ seq_buf_init(&s, buf, SEQ_BUF_SIZE);
memory_stat_format(memcg, &s);
seq_puts(m, buf);
kfree(buf);
--
2.43.0
^ permalink raw reply related [flat|nested] 196+ messages in thread
* [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-11-14 8:23 ` Vlastimil Babka
2024-10-14 10:58 ` [RFC PATCH v1 05/57] mm: Avoid split pmd ptl if pmd level is run-time folded Ryan Roberts
` (55 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
"struct page_frag_cache" has some optimizations that depend on page
size. Let's refactor it a bit so that those optimizations can be
determined at run-time for the case where page size is a boot-time
parameter. For compile-time page size, the compiler should dead code
strip and the result is very similar to before.
One wrinkle is that we don't know if we need the size member until
runtime. So remove the ifdeffery and always define offset as u32 (needed
if PAGE_SIZE is >= 64K) and size as u16 (only used when PAGE_SIZE <
32K). We move the members around a bit so that the overall size of the
struct remains the same: 24 bytes on 64-bit and 16 bytes on 32-bit.
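
As a rough check of the layout claim, here is a userspace mirror of the reordered struct. The ex_ prefix marks it as a hypothetical stand-in, and the sizes assume a typical ABI where unsigned int is 4 bytes and bool is 1 byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the reordered members (illustrative only). */
struct ex_page_frag_cache {
	void *va;
	unsigned int pagecnt_bias;
	uint32_t offset;	/* always 32 bits, whatever the page size */
	uint16_t size;		/* only meaningful for small page sizes */
	bool pfmemalloc;
};

/* 8 + 4 + 4 + 2 + 1 + 1 (padding) = 24 bytes with 8-byte pointers. */
static_assert(sizeof(void *) != 8 ||
	      sizeof(struct ex_page_frag_cache) == 24,
	      "expected 24 bytes with 64-bit pointers");

/* 4 + 4 + 4 + 2 + 1 + 1 (padding) = 16 bytes with 4-byte pointers. */
static_assert(sizeof(void *) != 4 ||
	      sizeof(struct ex_page_frag_cache) == 16,
	      "expected 16 bytes with 32-bit pointers");
```

Placing the two 32-bit members directly after the pointer lets the 16-bit size and the bool share the final word, so no extra padding is introduced.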
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
include/linux/mm_types.h | 13 ++++++-------
mm/page_alloc.c | 31 ++++++++++++++++++-------------
2 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4854249792545..0844ed7cfaa53 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -544,16 +544,15 @@ static inline void *folio_get_private(struct folio *folio)
struct page_frag_cache {
void * va;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- __u16 offset;
- __u16 size;
-#else
- __u32 offset;
-#endif
/* we maintain a pagecount bias, so that we dont dirty cache line
* containing page->_refcount every time we allocate a fragment.
*/
- unsigned int pagecnt_bias;
+ unsigned int pagecnt_bias;
+ __u32 offset;
+ /* size only used when PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE, in which
+ * case PAGE_FRAG_CACHE_MAX_SIZE is 32K and 16 bits is sufficient.
+ */
+ __u16 size;
bool pfmemalloc;
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91ace8ca97e21..8678103b1b396 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4822,13 +4822,18 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
struct page *page = NULL;
gfp_t gfp = gfp_mask;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
- __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
- page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
- PAGE_FRAG_CACHE_MAX_ORDER);
- nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-#endif
+ if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) {
+ gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+ __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+ page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
+ PAGE_FRAG_CACHE_MAX_ORDER);
+ /*
+ * Cast to silence warning due to 16-bit nc->size. Not real
+ * because PAGE_SIZE only less than PAGE_FRAG_CACHE_MAX_SIZE
+ * when PAGE_FRAG_CACHE_MAX_SIZE is 32K.
+ */
+ nc->size = (__u16)(page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE);
+ }
if (unlikely(!page))
page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
@@ -4870,10 +4875,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
if (!page)
return NULL;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
/* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
+ if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ size = nc->size;
+
/* Even if we own the page, we do not use atomic_set().
* This would break get_page_unless_zero() users.
*/
@@ -4897,10 +4902,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
goto refill;
}
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
/* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
+ if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ size = nc->size;
+
/* OK, page count is 0, we can safely set it */
set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
--
2.43.0
^ permalink raw reply related [flat|nested] 196+ messages in thread
* [RFC PATCH v1 05/57] mm: Avoid split pmd ptl if pmd level is run-time folded
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (2 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
` (54 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
If there are only 2 levels of translation, the first level (pgd) may not
be an entire page and so does not have a ptdesc backing it (this may be
true on arm64 depending on the VA size and page size). Even if it is an
entire page and does therefore have an entire ptdesc,
pagetable_pmd_ctor() won't be called for the ptdesc (since it's a pgd
not pmd table) and so the per-ptdesc ptl fields won't be initialised.
To date this has been fine; the arch knows at compile time if it needs
to fold the pmd level and in this case does not select
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK. However, if the number of levels
is not known at compile time (as is the case for boot-time page size
selection), we want to be able to choose at boot whether to use split
pmd ptls in the pmd's ptdesc or simply fall back to the lock in the
mm_struct.
So let's make that change; when CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK is
selected, determine if it should be used at run-time based on
mm_pmd_folded().
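
The run-time choice can be sketched in userspace as follows. All ex_* types and the pmd_folded flag are hypothetical stand-ins for mm_struct, ptdesc and mm_pmd_folded(); this shows the selection logic only, not the kernel's locking.

```c
#include <assert.h>
#include <stdbool.h>

struct ex_lock { int dummy; };

/* Stand-in for mm_struct: one mm-wide lock plus a run-time "folded" flag. */
struct ex_mm {
	struct ex_lock page_table_lock;
	bool pmd_folded;
};

/* Stand-in for the ptdesc backing a pmd table, with its own split lock. */
struct ex_ptdesc {
	struct ex_lock ptl;
};

/*
 * Pick the lock at run time: fall back to the mm-wide lock when the pmd
 * level is folded, because the per-table lock was never initialised.
 */
static inline struct ex_lock *ex_pmd_lockptr(struct ex_mm *mm,
					     struct ex_ptdesc *pmd_desc)
{
	if (mm->pmd_folded)
		return &mm->page_table_lock;
	return &pmd_desc->ptl;
}
```

When the folded state is a compile-time constant, the compiler can drop the unused branch, so compile-time page size builds should see little or no cost from the extra check.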
This sets us up for arm64 to support boot-time page size selection.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
include/linux/mm.h | 15 ++++++++++++++-
include/linux/mm_types.h | 2 +-
kernel/fork.c | 4 ++--
3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1470736017168..09a840517c23a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3037,6 +3037,8 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
+ if (mm_pmd_folded(mm))
+ return &mm->page_table_lock;
return ptlock_ptr(pmd_ptdesc(pmd));
}
@@ -3056,7 +3058,18 @@ static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
ptlock_free(ptdesc);
}
-#define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
+static inline pgtable_t *__pmd_huge_pte(struct mm_struct *mm, pmd_t *pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (mm_pmd_folded(mm))
+ return &mm->pmd_huge_pte;
+ return &pmd_ptdesc(pmd)->pmd_huge_pte;
+#else
+ return NULL;
+#endif
+}
+
+#define pmd_huge_pte(mm, pmd) (*__pmd_huge_pte(mm, pmd))
#else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0844ed7cfaa53..87dc6de7b7baf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -946,7 +946,7 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_subscriptions *notifier_subscriptions;
#endif
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f2012..ea472566d4fcc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -832,7 +832,7 @@ static void check_mm(struct mm_struct *mm)
pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
mm_pgtables_bytes(mm));
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
#endif
}
@@ -1276,7 +1276,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_subscriptions_init(mm);
init_tlb_flush_pending(mm);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
mm->pmd_huge_pte = NULL;
#endif
mm_init_uprobes_state(mm);
--
2.43.0
^ permalink raw reply related [flat|nested] 196+ messages in thread
* [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (3 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 05/57] mm: Avoid split pmd ptl if pmd level is run-time folded Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:37 ` Ryan Roberts
` (2 more replies)
2024-10-14 10:58 ` [RFC PATCH v1 07/57] fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing Ryan Roberts
` (53 subsequent siblings)
58 siblings, 3 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Christoph Lameter, David Hildenbrand, David Rientjes,
Greg Marsden, Ivan Ivanov, Johannes Weiner, Joonsoo Kim,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Michal Hocko, Miquel Raynal, Miroslav Benes, Pekka Enberg,
Richard Weinberger, Shakeel Butt, Vignesh Raghavendra,
Vlastimil Babka, Will Deacon
Cc: Ryan Roberts, cgroups, linux-arm-kernel, linux-fsdevel,
linux-kernel, linux-mm, linux-mtd
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Refactor "struct vmap_block" to use a flexible array for used_map since
VMAP_BBMAP_BITS is not a compile time constant for the boot-time page
size case.
Update various BUILD_BUG_ON() instances to check against the appropriate
page size limit.
Re-define "union swap_header" so that it's no longer exactly page-sized.
Instead define a flexible "magic" array with a define which tells the
offset to where the magic signature begins.
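
For illustration, locating the signature by offset rather than through a page-sized struct might look like the sketch below. The ex_* helpers are hypothetical; only the 10-byte "SWAPSPACE2" signature and the "page size minus 10" offset come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define EX_SWAP_MAGIC_LEN 10

/* Offset of the signature: the last 10 bytes of the first page. */
static inline size_t ex_swap_header_magic_offset(size_t page_size)
{
	return page_size - EX_SWAP_MAGIC_LEN;
}

/* Zero everything before the signature, then write the signature itself. */
static inline void ex_write_swap_magic(char *buf, size_t page_size)
{
	size_t off = ex_swap_header_magic_offset(page_size);

	memset(buf, 0, off);
	memcpy(buf + off, "SWAPSPACE2", EX_SWAP_MAGIC_LEN);
}

/* Check for the signature at the offset implied by page_size. */
static inline bool ex_has_swap_magic(const char *buf, size_t page_size)
{
	return memcmp(buf + ex_swap_header_magic_offset(page_size),
		      "SWAPSPACE2", EX_SWAP_MAGIC_LEN) == 0;
}
```

Because the offset depends on the page size in use, a header written for one page size is not found when probed with another, which matches the existing on-disk behaviour.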
Consider page size limit in some CPP conditionals.
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
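
A userspace analogue of the deferred-initialization idea, assuming GCC or Clang (which support __attribute__((constructor))). EX_DEFINE_PAGE_SIZE_VAR, ex_runtime_page_size() and the variables are hypothetical; this is not the kernel macro, only a sketch of the same pattern.

```c
#include <assert.h>

/* Stand-in for the page size chosen at boot; fixed at 16K (assumption). */
static unsigned long ex_runtime_page_size(void)
{
	return 1UL << 14;
}

/*
 * Define a global and register a constructor that assigns its value at
 * start-up, since the initializer is not a compile-time constant.
 */
#define EX_DEFINE_PAGE_SIZE_VAR(type, name, ...)			\
	extern type name;						\
	static void __attribute__((constructor)) ex_##name##_init(void)	\
	{								\
		name = (type)__VA_ARGS__;				\
	}								\
	type name

struct ex_limits {
	unsigned long min_bytes;
	unsigned long max_bytes;
};

/* Scalar: the value is computed when the constructor runs. */
EX_DEFINE_PAGE_SIZE_VAR(unsigned long, ex_readahead_bytes,
			32 * ex_runtime_page_size());

/* Struct: the initializer list becomes a compound literal. */
EX_DEFINE_PAGE_SIZE_VAR(struct ex_limits, ex_io_limits, {
	.min_bytes = ex_runtime_page_size(),
	.max_bytes = 4 * ex_runtime_page_size(),
});
```

The constructors run before main(), so by the time ordinary code reads the variables they already hold the page-size-derived values. In the kernel the equivalent ordering concern is that the constructors must run after the page size is known and before the variables are first used.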
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/mtd/mtdswap.c | 4 ++--
include/linux/mm.h | 2 +-
include/linux/mm_types_task.h | 2 +-
include/linux/mmzone.h | 3 ++-
include/linux/slab.h | 7 ++++---
include/linux/swap.h | 17 ++++++++++++-----
include/linux/swapops.h | 6 +++++-
mm/memcontrol.c | 2 +-
mm/memory.c | 4 ++--
mm/mmap.c | 2 +-
mm/page-writeback.c | 2 +-
mm/slub.c | 2 +-
mm/sparse.c | 2 +-
mm/swapfile.c | 2 +-
mm/vmalloc.c | 7 ++++---
15 files changed, 39 insertions(+), 25 deletions(-)
diff --git a/drivers/mtd/mtdswap.c b/drivers/mtd/mtdswap.c
index 680366616da24..7412a32708114 100644
--- a/drivers/mtd/mtdswap.c
+++ b/drivers/mtd/mtdswap.c
@@ -1062,13 +1062,13 @@ static int mtdswap_auto_header(struct mtdswap_dev *d, char *buf)
{
union swap_header *hd = (union swap_header *)(buf);
- memset(buf, 0, PAGE_SIZE - 10);
+ memset(buf, 0, SWAP_HEADER_MAGIC);
hd->info.version = 1;
hd->info.last_page = d->mbd_dev->size - 1;
hd->info.nr_badpages = 0;
- memcpy(buf + PAGE_SIZE - 10, "SWAPSPACE2", 10);
+ memcpy(buf + SWAP_HEADER_MAGIC, "SWAPSPACE2", 10);
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09a840517c23a..49c2078354e6e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2927,7 +2927,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
{
BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
- BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
+ BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE_MAX);
return ptlock_ptr(virt_to_ptdesc(pte));
}
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a2f6179b672b8..c356897d5f41c 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -37,7 +37,7 @@ struct page;
struct page_frag {
struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE_MAX >= 65536)
__u32 offset;
__u32 size;
#else
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1dc6248feb832..cd58034b82c81 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1744,6 +1744,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
*/
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
+#define PFN_SECTION_SHIFT_MIN (SECTION_SIZE_BITS - PAGE_SHIFT_MAX)
#define NR_MEM_SECTIONS (1UL << SECTIONS_SHIFT)
@@ -1753,7 +1754,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
#define SECTION_BLOCKFLAGS_BITS \
((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
-#if (MAX_PAGE_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS
+#if (MAX_PAGE_ORDER + PAGE_SHIFT_MAX) > SECTION_SIZE_BITS
#error Allocator MAX_PAGE_ORDER exceeds SECTION_SIZE
#endif
diff --git a/include/linux/slab.h b/include/linux/slab.h
index eb2bf46291576..11c6ff3a12579 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -347,7 +347,7 @@ static inline unsigned int arch_slab_minalign(void)
*/
#define __assume_kmalloc_alignment __assume_aligned(ARCH_KMALLOC_MINALIGN)
#define __assume_slab_alignment __assume_aligned(ARCH_SLAB_MINALIGN)
-#define __assume_page_alignment __assume_aligned(PAGE_SIZE)
+#define __assume_page_alignment __assume_aligned(PAGE_SIZE_MIN)
/*
* Kmalloc array related definitions
@@ -358,6 +358,7 @@ static inline unsigned int arch_slab_minalign(void)
* (PAGE_SIZE*2). Larger requests are passed to the page allocator.
*/
#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
+#define KMALLOC_SHIFT_HIGH_MAX (PAGE_SHIFT_MAX + 1)
#define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
@@ -426,7 +427,7 @@ enum kmalloc_cache_type {
NR_KMALLOC_TYPES
};
-typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH + 1];
+typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH_MAX + 1];
extern kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES];
@@ -524,7 +525,7 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
/* Will never be reached. Needed because the compiler may complain */
return -1;
}
-static_assert(PAGE_SHIFT <= 20);
+static_assert(PAGE_SHIFT_MAX <= 20);
#define kmalloc_index(s) __kmalloc_index(s, true)
#include <linux/alloc_tag.h>
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba7ea95d1c57a..e85df0332979f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -132,10 +132,17 @@ static inline int current_is_kswapd(void)
* bootbits...
*/
union swap_header {
- struct {
- char reserved[PAGE_SIZE - 10];
- char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
- } magic;
+ /*
+ * Exists conceptually, but since PAGE_SIZE may not be known at compile
+ * time, we must access through pointer arithmetic at run time.
+ *
+ * struct {
+ * char reserved[PAGE_SIZE - 10];
+ * char magic[10]; SWAP-SPACE or SWAPSPACE2
+ * } magic;
+ */
+#define SWAP_HEADER_MAGIC (PAGE_SIZE - 10)
+ char magic[1];
struct {
char bootbits[1024]; /* Space for disklabel etc. */
__u32 version;
@@ -201,7 +208,7 @@ struct swap_extent {
* Max bad pages in the new format..
*/
#define MAX_SWAP_BADPAGES \
- ((offsetof(union swap_header, magic.magic) - \
+ ((SWAP_HEADER_MAGIC - \
offsetof(union swap_header, info.badpages)) / sizeof(int))
enum {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cb468e418ea11..890fe6a3e6702 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -34,10 +34,14 @@
*/
#ifdef MAX_PHYSMEM_BITS
#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#define SWP_PFN_BITS_MAX (MAX_PHYSMEM_BITS - PAGE_SHIFT_MIN)
#else /* MAX_PHYSMEM_BITS */
#define SWP_PFN_BITS min_t(int, \
sizeof(phys_addr_t) * 8 - PAGE_SHIFT, \
SWP_TYPE_SHIFT)
+#define SWP_PFN_BITS_MAX min_t(int, \
+ sizeof(phys_addr_t) * 8 - PAGE_SHIFT_MIN, \
+ SWP_TYPE_SHIFT)
#endif /* MAX_PHYSMEM_BITS */
#define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)
@@ -519,7 +523,7 @@ static inline struct folio *pfn_swap_entry_folio(swp_entry_t entry)
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
/* Make sure the swp offset can always store the needed fields */
- BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+ BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS_MAX);
return is_migration_entry(entry) || is_device_private_entry(entry) ||
is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5f9195f76c65..4b17bec566fbd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4881,7 +4881,7 @@ static int __init mem_cgroup_init(void)
* to work fine, we should make sure that the overfill threshold can't
* exceed S32_MAX / PAGE_SIZE.
*/
- BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S32_MAX / PAGE_SIZE);
+ BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S32_MAX / PAGE_SIZE_MIN);
cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
memcg_hotplug_cpu_dead);
diff --git a/mm/memory.c b/mm/memory.c
index ebfc9768f801a..14b5ef6870486 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4949,8 +4949,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return ret;
}
-static unsigned long fault_around_pages __read_mostly =
- 65536 >> PAGE_SHIFT;
+static __DEFINE_GLOBAL_PAGE_SIZE_VAR(unsigned long, fault_around_pages,
+ __read_mostly, 65536 >> PAGE_SHIFT);
#ifdef CONFIG_DEBUG_FS
static int fault_around_bytes_get(void *data, u64 *val)
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209bb..d9642aba07ac4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2279,7 +2279,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
}
/* enforced gap between the expanding stack and other mappings. */
-unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
+DEFINE_GLOBAL_PAGE_SIZE_VAR(unsigned long, stack_guard_gap, 256UL<<PAGE_SHIFT);
static int __init cmdline_parse_stack_guard_gap(char *p)
{
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4430ac68e4c41..8fc9ac50749bd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2292,7 +2292,7 @@ static int page_writeback_cpu_online(unsigned int cpu)
#ifdef CONFIG_SYSCTL
/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
-static const unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
+static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(unsigned long, dirty_bytes_min, 2 * PAGE_SIZE);
static struct ctl_table vm_page_writeback_sysctls[] = {
{
diff --git a/mm/slub.c b/mm/slub.c
index a77f354f83251..82f6e98cf25bb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5001,7 +5001,7 @@ init_kmem_cache_node(struct kmem_cache_node *n)
static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
{
BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
- NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH *
+ NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH_MAX *
sizeof(struct kmem_cache_cpu));
/*
diff --git a/mm/sparse.c b/mm/sparse.c
index dc38539f85603..2491425930c4d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -277,7 +277,7 @@ static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long p
{
unsigned long coded_mem_map =
(unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
- BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT);
+ BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT_MIN);
BUG_ON(coded_mem_map & ~SECTION_MAP_MASK);
return coded_mem_map;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 38bdc439651ac..6311a1cc7e46b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2931,7 +2931,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
unsigned long swapfilepages;
unsigned long last_page;
- if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
+ if (memcmp("SWAPSPACE2", &swap_header->magic[SWAP_HEADER_MAGIC], 10)) {
pr_err("Unable to find swap-space signature\n");
return 0;
}
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a0df1e2e155a8..b4fbba204603c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2497,12 +2497,12 @@ struct vmap_block {
spinlock_t lock;
struct vmap_area *va;
unsigned long free, dirty;
- DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
unsigned long dirty_min, dirty_max; /*< dirty range */
struct list_head free_list;
struct rcu_head rcu_head;
struct list_head purge;
unsigned int cpu;
+ unsigned long used_map[];
};
/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
@@ -2600,11 +2600,12 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
unsigned long vb_idx;
int node, err;
void *vaddr;
+ size_t size;
node = numa_node_id();
- vb = kmalloc_node(sizeof(struct vmap_block),
- gfp_mask & GFP_RECLAIM_MASK, node);
+ size = struct_size(vb, used_map, BITS_TO_LONGS(VMAP_BBMAP_BITS));
+ vb = kmalloc_node(size, gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!vb))
return ERR_PTR(-ENOMEM);
--
2.43.0
* [RFC PATCH v1 07/57] fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (4 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 08/57] fs: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
` (52 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Christian Brauner, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, OGAWA Hirofumi, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-ext4, linux-fsdevel,
linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Code that sizes arrays with MAX_BUF_PER_PAGE will no longer build with
boot-time page size selection because PAGE_SIZE is not known at
compile-time. Introduce MAX_BUF_PER_PAGE_SIZE_MAX for sizing such
arrays; it is the worst-case buffer count, reached when PAGE_SIZE_MAX is
the selected page size.
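As a rough userspace sketch of this pattern (not the kernel code itself),
the array dimension uses the compile-time worst case while the loop stays
bounded by the run-time value. The page_size variable and the
max_buf_per_page() and collect_block_offsets() helpers are hypothetical
names used only for illustration; 4K and 64K are the arm64 extremes named
in the cover letter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limit for illustration: 64K is the largest arm64 page size. */
#define PAGE_SIZE_MAX			(1UL << 16)
#define MAX_BUF_PER_PAGE_SIZE_MAX	(PAGE_SIZE_MAX / 512)

/* Hypothetical stand-in for a page size that is only known after boot. */
unsigned long page_size = 1UL << 12;

/* Run-time equivalent of MAX_BUF_PER_PAGE; not usable as an array size. */
static unsigned long max_buf_per_page(void)
{
	return page_size / 512;
}

/*
 * Record the offset of every block in one page. The on-stack array is
 * dimensioned for the worst case, but only the first nr entries are used.
 * blocksize is assumed to be at least 512. Returns nr and stores the
 * offset of the final block in *last.
 */
size_t collect_block_offsets(unsigned long blocksize, unsigned long *last)
{
	unsigned long offs[MAX_BUF_PER_PAGE_SIZE_MAX];
	unsigned long off;
	size_t nr = 0;

	for (off = 0; off + blocksize <= page_size; off += blocksize) {
		assert(nr < max_buf_per_page());
		offs[nr++] = off;
	}

	*last = nr ? offs[nr - 1] : 0;
	return nr;
}
```

The cost of this approach is stack space: with a 4K page selected, the
array is still sized for 64K, which is the trade-off the patch accepts
for these call sites.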
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
fs/buffer.c | 2 +-
fs/ext4/move_extent.c | 2 +-
fs/ext4/readpage.c | 2 +-
fs/fat/dir.c | 4 ++--
fs/fat/fatent.c | 4 ++--
include/linux/buffer_head.h | 1 +
6 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index e55ad471c5306..f00542ad43a5c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2371,7 +2371,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE_SIZE_MAX];
size_t blocksize;
int nr, i;
int fully_mapped = 1;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 204f53b236229..68304426c6f45 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -172,7 +172,7 @@ mext_page_mkuptodate(struct folio *folio, unsigned from, unsigned to)
{
struct inode *inode = folio->mapping->host;
sector_t block;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE_SIZE_MAX];
unsigned int blocksize, block_start, block_end;
int i, err, nr = 0, partial = 0;
BUG_ON(!folio_test_locked(folio));
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 8494492582abe..5808d85096aeb 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -221,7 +221,7 @@ int ext4_mpage_readpages(struct inode *inode,
sector_t block_in_file;
sector_t last_block;
sector_t last_block_in_file;
- sector_t blocks[MAX_BUF_PER_PAGE];
+ sector_t blocks[MAX_BUF_PER_PAGE_SIZE_MAX];
unsigned page_block;
struct block_device *bdev = inode->i_sb->s_bdev;
int length;
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index acbec5bdd5210..f3e96ecf21c92 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -1146,7 +1146,7 @@ int fat_alloc_new_dir(struct inode *dir, struct timespec64 *ts)
{
struct super_block *sb = dir->i_sb;
struct msdos_sb_info *sbi = MSDOS_SB(sb);
- struct buffer_head *bhs[MAX_BUF_PER_PAGE];
+ struct buffer_head *bhs[MAX_BUF_PER_PAGE_SIZE_MAX];
struct msdos_dir_entry *de;
sector_t blknr;
__le16 date, time;
@@ -1213,7 +1213,7 @@ static int fat_add_new_entries(struct inode *dir, void *slots, int nr_slots,
{
struct super_block *sb = dir->i_sb;
struct msdos_sb_info *sbi = MSDOS_SB(sb);
- struct buffer_head *bhs[MAX_BUF_PER_PAGE];
+ struct buffer_head *bhs[MAX_BUF_PER_PAGE_SIZE_MAX];
sector_t blknr, start_blknr, last_blknr;
unsigned long size, copy;
int err, i, n, offset, cluster[2];
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index 1db348f8f887a..322cf5b8e5590 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -469,7 +469,7 @@ int fat_alloc_clusters(struct inode *inode, int *cluster, int nr_cluster)
struct msdos_sb_info *sbi = MSDOS_SB(sb);
const struct fatent_operations *ops = sbi->fatent_ops;
struct fat_entry fatent, prev_ent;
- struct buffer_head *bhs[MAX_BUF_PER_PAGE];
+ struct buffer_head *bhs[MAX_BUF_PER_PAGE_SIZE_MAX];
int i, count, err, nr_bhs, idx_clus;
BUG_ON(nr_cluster > (MAX_BUF_PER_PAGE / 2)); /* fixed limit */
@@ -557,7 +557,7 @@ int fat_free_clusters(struct inode *inode, int cluster)
struct msdos_sb_info *sbi = MSDOS_SB(sb);
const struct fatent_operations *ops = sbi->fatent_ops;
struct fat_entry fatent;
- struct buffer_head *bhs[MAX_BUF_PER_PAGE];
+ struct buffer_head *bhs[MAX_BUF_PER_PAGE_SIZE_MAX];
int i, err, nr_bhs;
int first_cl = cluster, dirty_fsinfo = 0;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 14acf1bbe0ce6..5dff4837b76cd 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -41,6 +41,7 @@ enum bh_state_bits {
};
#define MAX_BUF_PER_PAGE (PAGE_SIZE / 512)
+#define MAX_BUF_PER_PAGE_SIZE_MAX (PAGE_SIZE_MAX / 512)
struct page;
struct buffer_head;
--
2.43.0
* [RFC PATCH v1 08/57] fs: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (5 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 07/57] fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 09/57] fs/nfs: " Ryan Roberts
` (51 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Alexander Viro, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, Christian Brauner, David Hildenbrand,
Greg Marsden, Ivan Ivanov, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-fsdevel, linux-kernel,
linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
In binfmt_elf, convert CPP conditional to C ternary operator; this will
be folded to the same code by the compiler when in compile-time page
size mode, but will also work for runtime evaluation in boot-time page
size mode.
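A minimal userspace sketch of why the ternary form works in both modes.
The page_size variable is a hypothetical stand-in for a boot-time
PAGE_SIZE, the ELF_EXEC_PAGESIZE value is assumed for illustration, and
the two wrapper functions exist only to make the macros testable.

```c
#include <assert.h>

/* Assumed value for illustration; architectures define their own. */
#define ELF_EXEC_PAGESIZE	4096UL

/* Hypothetical stand-in for a PAGE_SIZE that is resolved at boot. */
unsigned long page_size = 4096UL;
#define PAGE_SIZE		page_size

/*
 * "#if ELF_EXEC_PAGESIZE > PAGE_SIZE" cannot see a run-time value: in
 * an #if expression, a plain identifier evaluates as 0. A C conditional
 * expression works whether PAGE_SIZE is a constant or a variable.
 */
#define ELF_MIN_ALIGN \
	(ELF_EXEC_PAGESIZE > PAGE_SIZE ? ELF_EXEC_PAGESIZE : PAGE_SIZE)

#define ELF_PAGEOFFSET(_v)	((_v) & (ELF_MIN_ALIGN - 1))
#define ELF_PAGEALIGN(_v)	(((_v) + ELF_MIN_ALIGN - 1) & ~(ELF_MIN_ALIGN - 1))

unsigned long elf_page_offset(unsigned long v)
{
	return ELF_PAGEOFFSET(v);
}

unsigned long elf_page_align(unsigned long v)
{
	assert(ELF_MIN_ALIGN >= ELF_EXEC_PAGESIZE);
	return ELF_PAGEALIGN(v);
}
```

When PAGE_SIZE is a literal constant, both operands of the comparison are
constants, so an optimizing compiler can be expected to fold the
expression away.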
In coredump, modify __dump_skip() to emit zeros in blocks of
PAGE_SIZE_MIN. For compile-time page size this resolves to PAGE_SIZE, as
before. For boot-time page size, PAGE_SIZE cannot size the static
zeroes[] buffer, so a compile-time bound is needed; PAGE_SIZE_MIN is
preferred over PAGE_SIZE_MAX to save memory.
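The chunked emission can be sketched in userspace as follows. struct sink
and emit() are hypothetical stand-ins for the coredump file and
__dump_emit(); only the loop shape is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed: the smallest supported page size, a compile-time constant. */
#define PAGE_SIZE_MIN	4096UL

/* Hypothetical sink standing in for the coredump file. */
struct sink {
	size_t bytes;	/* total bytes emitted */
	size_t calls;	/* number of emit() calls */
};

/* Stand-in for __dump_emit(): returns 1 on success, 0 on failure. */
static int emit(struct sink *s, const void *buf, size_t nr)
{
	(void)buf;
	s->bytes += nr;
	s->calls++;
	return 1;
}

/* Emit nr zero bytes using a buffer of only PAGE_SIZE_MIN bytes. */
int skip_with_zeroes(struct sink *s, size_t nr)
{
	static char zeroes[PAGE_SIZE_MIN];

	while (nr > PAGE_SIZE_MIN) {
		if (!emit(s, zeroes, PAGE_SIZE_MIN))
			return 0;
		nr -= PAGE_SIZE_MIN;
	}

	/* The tail always fits in the buffer. */
	assert(nr <= PAGE_SIZE_MIN);
	return emit(s, zeroes, nr);
}
```

Sizing the buffer by the minimum costs extra emit calls when a larger
page size is selected (16 calls instead of 1 to skip 64K), in exchange
for a 4K rather than 64K static buffer.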
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
fs/binfmt_elf.c | 11 ++++-------
fs/coredump.c | 8 ++++----
2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 19fa49cd9907f..e439d36c43c7e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -84,11 +84,8 @@ static int elf_core_dump(struct coredump_params *cprm);
#define elf_core_dump NULL
#endif
-#if ELF_EXEC_PAGESIZE > PAGE_SIZE
-#define ELF_MIN_ALIGN ELF_EXEC_PAGESIZE
-#else
-#define ELF_MIN_ALIGN PAGE_SIZE
-#endif
+#define ELF_MIN_ALIGN \
+ (ELF_EXEC_PAGESIZE > PAGE_SIZE ? ELF_EXEC_PAGESIZE : PAGE_SIZE)
#ifndef ELF_CORE_EFLAGS
#define ELF_CORE_EFLAGS 0
@@ -98,7 +95,7 @@ static int elf_core_dump(struct coredump_params *cprm);
#define ELF_PAGEOFFSET(_v) ((_v) & (ELF_MIN_ALIGN-1))
#define ELF_PAGEALIGN(_v) (((_v) + ELF_MIN_ALIGN - 1) & ~(ELF_MIN_ALIGN - 1))
-static struct linux_binfmt elf_format = {
+static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct linux_binfmt, elf_format, {
.module = THIS_MODULE,
.load_binary = load_elf_binary,
.load_shlib = load_elf_library,
@@ -106,7 +103,7 @@ static struct linux_binfmt elf_format = {
.core_dump = elf_core_dump,
.min_coredump = ELF_EXEC_PAGESIZE,
#endif
-};
+});
#define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
diff --git a/fs/coredump.c b/fs/coredump.c
index 7f12ff6ad1d3e..203f2a158246e 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -825,7 +825,7 @@ static int __dump_emit(struct coredump_params *cprm, const void *addr, int nr)
static int __dump_skip(struct coredump_params *cprm, size_t nr)
{
- static char zeroes[PAGE_SIZE];
+ static char zeroes[PAGE_SIZE_MIN];
struct file *file = cprm->file;
if (file->f_mode & FMODE_LSEEK) {
if (dump_interrupted() ||
@@ -834,10 +834,10 @@ static int __dump_skip(struct coredump_params *cprm, size_t nr)
cprm->pos += nr;
return 1;
} else {
- while (nr > PAGE_SIZE) {
- if (!__dump_emit(cprm, zeroes, PAGE_SIZE))
+ while (nr > PAGE_SIZE_MIN) {
+ if (!__dump_emit(cprm, zeroes, PAGE_SIZE_MIN))
return 0;
- nr -= PAGE_SIZE;
+ nr -= PAGE_SIZE_MIN;
}
return __dump_emit(cprm, zeroes, nr);
}
--
2.43.0
* [RFC PATCH v1 09/57] fs/nfs: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (6 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 08/57] fs: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 10/57] fs/ext4: " Ryan Roberts
` (50 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anna Schumaker, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Trond Myklebust, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, linux-nfs
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
The calculations of NFS4ACL_MAXPAGES and NFS4XATTR_MAXPAGES are modified
to give the maximum number of pages, which occurs when the page size is
at its minimum.
BUILD_BUG_ON() is modified to test against the min page size; if the
check passes there, it also passes for every larger page size.
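A small userspace sketch of both points. XATTR_SIZE_MAX and
XATTR_NAME_MAX carry their usual Linux values; PAGE_SIZE_MIN is assumed
to be 4K, and split_into_pages() is a hypothetical helper, not an NFS
function.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MIN	4096UL	/* assumed: smallest arm64 page size */
#define XATTR_SIZE_MAX	65536UL
#define XATTR_NAME_MAX	255
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Worst case: the smallest page size needs the most pages. */
#define NFS4XATTR_MAXPAGES	DIV_ROUND_UP(XATTR_SIZE_MAX, PAGE_SIZE_MIN)

/* A build-time check against the minimum also covers larger sizes. */
_Static_assert(XATTR_NAME_MAX + 1 <= PAGE_SIZE_MIN,
	       "name must fit in the smallest page");

/*
 * Split buflen bytes into page_size chunks, recording each chunk length
 * in an array sized for the worst case. Returns the number of pages
 * used, or 0 if the request is out of range.
 */
size_t split_into_pages(size_t buflen, size_t page_size, size_t *last_len)
{
	size_t lens[NFS4XATTR_MAXPAGES];
	size_t nr = 0;

	if (buflen > XATTR_SIZE_MAX || page_size < PAGE_SIZE_MIN)
		return 0;

	while (buflen) {
		size_t chunk = buflen < page_size ? buflen : page_size;

		assert(nr < NFS4XATTR_MAXPAGES);
		lens[nr++] = chunk;
		buflen -= chunk;
	}

	*last_len = nr ? lens[nr - 1] : 0;
	return nr;
}
```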
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
fs/nfs/nfs42proc.c | 2 +-
fs/nfs/nfs42xattr.c | 2 +-
fs/nfs/nfs4proc.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 28704f924612c..c600574105c63 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -1161,7 +1161,7 @@ int nfs42_proc_clone(struct file *src_f, struct file *dst_f,
return err;
}
-#define NFS4XATTR_MAXPAGES DIV_ROUND_UP(XATTR_SIZE_MAX, PAGE_SIZE)
+#define NFS4XATTR_MAXPAGES DIV_ROUND_UP(XATTR_SIZE_MAX, PAGE_SIZE_MIN)
static int _nfs42_proc_removexattr(struct inode *inode, const char *name)
{
diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c
index b6e3d8f77b910..734177eb44889 100644
--- a/fs/nfs/nfs42xattr.c
+++ b/fs/nfs/nfs42xattr.c
@@ -183,7 +183,7 @@ nfs4_xattr_alloc_entry(const char *name, const void *value,
uint32_t flags;
BUILD_BUG_ON(sizeof(struct nfs4_xattr_entry) +
- XATTR_NAME_MAX + 1 > PAGE_SIZE);
+ XATTR_NAME_MAX + 1 > PAGE_SIZE_MIN);
alloclen = sizeof(struct nfs4_xattr_entry);
if (name != NULL) {
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index b8ffbe52ba15a..3c3622f46d3e0 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5928,7 +5928,7 @@ static bool nfs4_server_supports_acls(const struct nfs_server *server,
* it's OK to put sizeof(void) * (XATTR_SIZE_MAX/PAGE_SIZE) bytes on
* the stack.
*/
-#define NFS4ACL_MAXPAGES DIV_ROUND_UP(XATTR_SIZE_MAX, PAGE_SIZE)
+#define NFS4ACL_MAXPAGES DIV_ROUND_UP(XATTR_SIZE_MAX, PAGE_SIZE_MIN)
int nfs4_buf_to_pages_noslab(const void *buf, size_t buflen,
struct page **pages)
--
2.43.0
* [RFC PATCH v1 10/57] fs/ext4: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (7 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 09/57] fs/nfs: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 11/57] fork: Permit boot-time THREAD_SIZE determination Ryan Roberts
` (49 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger, Andrew Morton,
Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-ext4, linux-kernel,
linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Convert CPP PAGE_SIZE conditionals to C if/else. For compile-time page
size, the compiler will strip the dead part, and for boot-time page
size, the condition will be evaluated at run time.
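For illustration, here is a userspace adaptation of the two helpers with
the run-time check. page_size is a hypothetical stand-in for a boot-time
PAGE_SIZE, uint16_t replaces __le16, and endianness conversion is left
out, so this is a sketch of the control flow rather than the ext4 code.

```c
#include <assert.h>
#include <stdint.h>

#define EXT4_MAX_REC_LEN	((1 << 16) - 1)

/* Hypothetical stand-in for a PAGE_SIZE that is resolved at boot. */
unsigned long page_size = 4096;
#define PAGE_SIZE		page_size

/* Decode an on-disk record length (endianness handling omitted). */
unsigned int rec_len_from_disk(uint16_t dlen, unsigned int blocksize)
{
	unsigned int len = dlen;

	if (PAGE_SIZE >= 65536) {
		if (len == EXT4_MAX_REC_LEN || len == 0)
			return blocksize;
		return (len & 65532) | ((len & 3) << 16);
	} else {
		return len;
	}
}

/* Encode a record length into its 16-bit on-disk form. */
uint16_t rec_len_to_disk(unsigned int len, unsigned int blocksize)
{
	assert(len <= blocksize && blocksize <= (1 << 18) && !(len & 3));

	if (PAGE_SIZE >= 65536) {
		if (len < 65536)
			return (uint16_t)len;
		if (len == blocksize)
			return blocksize == 65536 ? EXT4_MAX_REC_LEN : 0;
		return (uint16_t)((len & 65532) | ((len >> 16) & 3));
	} else {
		return (uint16_t)len;
	}
}
```

With a constant PAGE_SIZE the condition is a constant expression and the
unused branch can be dropped by the compiler; with a variable, both
branches are kept and selected at run time.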
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
fs/ext4/ext4.h | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 08acd152261ed..1a6dbd925024a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2415,31 +2415,31 @@ ext4_rec_len_from_disk(__le16 dlen, unsigned blocksize)
{
unsigned len = le16_to_cpu(dlen);
-#if (PAGE_SIZE >= 65536)
- if (len == EXT4_MAX_REC_LEN || len == 0)
- return blocksize;
- return (len & 65532) | ((len & 3) << 16);
-#else
- return len;
-#endif
+ if (PAGE_SIZE >= 65536) {
+ if (len == EXT4_MAX_REC_LEN || len == 0)
+ return blocksize;
+ return (len & 65532) | ((len & 3) << 16);
+ } else {
+ return len;
+ }
}
static inline __le16 ext4_rec_len_to_disk(unsigned len, unsigned blocksize)
{
BUG_ON((len > blocksize) || (blocksize > (1 << 18)) || (len & 3));
-#if (PAGE_SIZE >= 65536)
- if (len < 65536)
+ if (PAGE_SIZE >= 65536) {
+ if (len < 65536)
+ return cpu_to_le16(len);
+ if (len == blocksize) {
+ if (blocksize == 65536)
+ return cpu_to_le16(EXT4_MAX_REC_LEN);
+ else
+ return cpu_to_le16(0);
+ }
+ return cpu_to_le16((len & 65532) | ((len >> 16) & 3));
+ } else {
return cpu_to_le16(len);
- if (len == blocksize) {
- if (blocksize == 65536)
- return cpu_to_le16(EXT4_MAX_REC_LEN);
- else
- return cpu_to_le16(0);
}
- return cpu_to_le16((len & 65532) | ((len >> 16) & 3));
-#else
- return cpu_to_le16(len);
-#endif
}
/*
--
2.43.0
* [RFC PATCH v1 11/57] fork: Permit boot-time THREAD_SIZE determination
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (8 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 10/57] fs/ext4: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-11-14 10:42 ` Vlastimil Babka
2024-10-14 10:58 ` [RFC PATCH v1 12/57] cgroup: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
` (48 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin, Anshuman Khandual, Ard Biesheuvel,
Arnd Bergmann, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ingo Molnar, Ivan Ivanov, Juri Lelli, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Peter Zijlstra,
Vincent Guittot, Will Deacon
Cc: Ryan Roberts, kasan-dev, linux-arch, linux-arm-kernel,
linux-kernel, linux-mm
THREAD_SIZE defines the size of a kernel thread stack. To date, it has
been set at compile-time. However, when using vmap stacks, the size must
be a multiple of PAGE_SIZE, and since PAGE_SIZE is becoming a boot-time
value, THREAD_SIZE must be allowed to become a boot-time value too.
The alternative would be to define THREAD_SIZE for the largest supported
page size, but this would waste memory when using a smaller page size.
For example, arm64 requires THREAD_SIZE to be 16K, but when using 64K
pages and a vmap stack, we must increase the size to 64K. If we required
64K when 4K or 16K page size was in use, we would waste 48K per kernel
thread.
So let's refactor to allow THREAD_SIZE to not be a compile-time
constant. THREAD_SIZE_MAX (and THREAD_ALIGN_MAX) are introduced to
manage the limits, as is done for PAGE_SIZE.
When THREAD_SIZE is a compile-time constant, behaviour and code size
should be equivalent.
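The arithmetic behind the 48K figure can be sketched as below. This
assumes THREAD_SIZE is the larger of 16K and the page size, which matches
the arm64 example given above; THREAD_SIZE_BASE, thread_size() and
wasted_if_sized_for_max() are hypothetical names.

```c
#include <assert.h>

#define PAGE_SIZE_MAX		(64UL * 1024)

/* Hypothetical name for arm64's baseline 16K stack size. */
#define THREAD_SIZE_BASE	(16UL * 1024)

/* Compile-time upper bound, usable for static objects. */
#define THREAD_SIZE_MAX \
	(THREAD_SIZE_BASE > PAGE_SIZE_MAX ? THREAD_SIZE_BASE : PAGE_SIZE_MAX)

/* Statically allocated stack, sized for the worst case. */
unsigned long init_stack[THREAD_SIZE_MAX / sizeof(unsigned long)];

/* With vmap stacks the stack must be a whole number of pages. */
unsigned long thread_size(unsigned long page_size)
{
	unsigned long size;

	size = page_size > THREAD_SIZE_BASE ? page_size : THREAD_SIZE_BASE;
	assert(size % page_size == 0);
	return size;
}

/* Bytes lost per thread if every stack were sized for PAGE_SIZE_MAX. */
unsigned long wasted_if_sized_for_max(unsigned long page_size)
{
	return THREAD_SIZE_MAX - thread_size(page_size);
}
```

Only objects that must be sized at build time, such as init_stack, pay
the worst-case cost; dynamically allocated stacks use the run-time
value.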
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
include/asm-generic/vmlinux.lds.h | 6 ++-
include/linux/sched.h | 4 +-
include/linux/thread_info.h | 10 ++++-
init/main.c | 2 +-
kernel/fork.c | 67 +++++++++++--------------------
mm/kasan/report.c | 3 +-
6 files changed, 42 insertions(+), 50 deletions(-)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5727f883001bb..f19bab7a2e8f9 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -56,6 +56,10 @@
#define LOAD_OFFSET 0
#endif
+#ifndef THREAD_SIZE_MAX
+#define THREAD_SIZE_MAX THREAD_SIZE
+#endif
+
/*
* Only some architectures want to have the .notes segment visible in
* a separate PT_NOTE ELF Program Header. When this happens, it needs
@@ -398,7 +402,7 @@
init_stack = .; \
KEEP(*(.data..init_task)) \
KEEP(*(.data..init_thread_info)) \
- . = __start_init_stack + THREAD_SIZE; \
+ . = __start_init_stack + THREAD_SIZE_MAX; \
__end_init_stack = .;
#define JUMP_TABLE_DATA \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8d150343d42d..3de4f655ee492 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1863,14 +1863,14 @@ union thread_union {
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info thread_info;
#endif
- unsigned long stack[THREAD_SIZE/sizeof(long)];
+ unsigned long stack[THREAD_SIZE_MAX/sizeof(long)];
};
#ifndef CONFIG_THREAD_INFO_IN_TASK
extern struct thread_info init_thread_info;
#endif
-extern unsigned long init_stack[THREAD_SIZE / sizeof(unsigned long)];
+extern unsigned long init_stack[THREAD_SIZE_MAX / sizeof(unsigned long)];
#ifdef CONFIG_THREAD_INFO_IN_TASK
# define task_thread_info(task) (&(task)->thread_info)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f49..a7ccc448cd298 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -74,7 +74,15 @@ static inline long set_restart_fn(struct restart_block *restart,
}
#ifndef THREAD_ALIGN
-#define THREAD_ALIGN THREAD_SIZE
+#define THREAD_ALIGN THREAD_SIZE
+#endif
+
+#ifndef THREAD_SIZE_MAX
+#define THREAD_SIZE_MAX THREAD_SIZE
+#endif
+
+#ifndef THREAD_ALIGN_MAX
+#define THREAD_ALIGN_MAX max(THREAD_ALIGN, THREAD_SIZE_MAX)
#endif
#define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)
diff --git a/init/main.c b/init/main.c
index ba1515eb20b9d..4dc28115fdf57 100644
--- a/init/main.c
+++ b/init/main.c
@@ -797,7 +797,7 @@ void __init __weak smp_prepare_boot_cpu(void)
{
}
-# if THREAD_SIZE >= PAGE_SIZE
+#ifdef CONFIG_VMAP_STACK
void __init __weak thread_stack_cache_init(void)
{
}
diff --git a/kernel/fork.c b/kernel/fork.c
index ea472566d4fcc..cbc3e73f9b501 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -184,13 +184,7 @@ static inline void free_task_struct(struct task_struct *tsk)
kmem_cache_free(task_struct_cachep, tsk);
}
-/*
- * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
- * kmemcache based allocator.
- */
-# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
-
-# ifdef CONFIG_VMAP_STACK
+#ifdef CONFIG_VMAP_STACK
/*
* vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
* flush. Try to minimize the number of calls by caching stacks.
@@ -343,46 +337,21 @@ static void free_thread_stack(struct task_struct *tsk)
tsk->stack_vm_area = NULL;
}
-# else /* !CONFIG_VMAP_STACK */
+#else /* !CONFIG_VMAP_STACK */
-static void thread_stack_free_rcu(struct rcu_head *rh)
-{
- __free_pages(virt_to_page(rh), THREAD_SIZE_ORDER);
-}
-
-static void thread_stack_delayed_free(struct task_struct *tsk)
-{
- struct rcu_head *rh = tsk->stack;
-
- call_rcu(rh, thread_stack_free_rcu);
-}
-
-static int alloc_thread_stack_node(struct task_struct *tsk, int node)
-{
- struct page *page = alloc_pages_node(node, THREADINFO_GFP,
- THREAD_SIZE_ORDER);
-
- if (likely(page)) {
- tsk->stack = kasan_reset_tag(page_address(page));
- return 0;
- }
- return -ENOMEM;
-}
-
-static void free_thread_stack(struct task_struct *tsk)
-{
- thread_stack_delayed_free(tsk);
- tsk->stack = NULL;
-}
-
-# endif /* CONFIG_VMAP_STACK */
-# else /* !(THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)) */
+/*
+ * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
+ * kmemcache based allocator.
+ */
static struct kmem_cache *thread_stack_cache;
static void thread_stack_free_rcu(struct rcu_head *rh)
{
- kmem_cache_free(thread_stack_cache, rh);
+ if (THREAD_SIZE >= PAGE_SIZE)
+ __free_pages(virt_to_page(rh), THREAD_SIZE_ORDER);
+ else
+ kmem_cache_free(thread_stack_cache, rh);
}
static void thread_stack_delayed_free(struct task_struct *tsk)
@@ -395,7 +364,16 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
static int alloc_thread_stack_node(struct task_struct *tsk, int node)
{
unsigned long *stack;
- stack = kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
+ struct page *page;
+
+ if (THREAD_SIZE >= PAGE_SIZE) {
+ page = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER);
+ stack = likely(page) ? page_address(page) : NULL;
+ } else {
+ stack = kmem_cache_alloc_node(thread_stack_cache,
+ THREADINFO_GFP, node);
+ }
+
stack = kasan_reset_tag(stack);
tsk->stack = stack;
return stack ? 0 : -ENOMEM;
@@ -409,13 +387,16 @@ static void free_thread_stack(struct task_struct *tsk)
void thread_stack_cache_init(void)
{
+ if (THREAD_SIZE >= PAGE_SIZE)
+ return;
+
thread_stack_cache = kmem_cache_create_usercopy("thread_stack",
THREAD_SIZE, THREAD_SIZE, 0, 0,
THREAD_SIZE, NULL);
BUG_ON(thread_stack_cache == NULL);
}
-# endif /* THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) */
+#endif /* CONFIG_VMAP_STACK */
/* SLAB cache for signal_struct structures (tsk->signal) */
static struct kmem_cache *signal_cachep;
diff --git a/mm/kasan/report.c b/mm/kasan/report.c
index b48c768acc84d..57c877852dbc6 100644
--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -365,8 +365,7 @@ static inline bool kernel_or_module_addr(const void *addr)
static inline bool init_task_stack_addr(const void *addr)
{
return addr >= (void *)&init_thread_union.stack &&
- (addr <= (void *)&init_thread_union.stack +
- sizeof(init_thread_union.stack));
+ (addr <= (void *)&init_thread_union.stack + THREAD_SIZE);
}
static void print_address_description(void *addr, u8 tag,
--
2.43.0
* [RFC PATCH v1 12/57] cgroup: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (9 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 11/57] fork: Permit boot-time THREAD_SIZE determination Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 13/57] bpf: " Ryan Roberts
` (47 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Michal Koutný, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ivan Ivanov, Johannes Weiner, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Tejun Heo,
Will Deacon, Zefan Li
Cc: Ryan Roberts, cgroups, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
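The definition of DEFINE_GLOBAL_PAGE_SIZE_VAR() is not part of this
patch, so the following is only a guess at the general idea of deferred
initialization, written for userspace. DEFINE_DEFERRED_VAR(),
page_size_vars_init(), struct demo_ops and page_shift are all
hypothetical names and should not be read as the series' actual
mechanism.

```c
#include <assert.h>

/* Hypothetical stand-in for a PAGE_SHIFT that is resolved at boot. */
unsigned int page_shift;
#define PAGE_SHIFT	page_shift
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/*
 * Hypothetical macro, NOT the series' DEFINE_GLOBAL_PAGE_SIZE_VAR():
 * define the variable zero-initialized and emit a name##_init()
 * function that assigns the real value once PAGE_SIZE is known.
 */
#define DEFINE_DEFERRED_VAR(type, name, ...)			\
	extern type name;					\
	void name##_init(void) { name = (type)__VA_ARGS__; }	\
	type name

DEFINE_DEFERRED_VAR(unsigned long, stack_guard_gap, 256UL << PAGE_SHIFT);

struct demo_ops {
	unsigned long atomic_write_len;
	int (*show)(void);
};

static int demo_show(void)
{
	return 0;
}

DEFINE_DEFERRED_VAR(struct demo_ops, demo_kf_ops, {
	.atomic_write_len = PAGE_SIZE,
	.show = demo_show,
});

/* Run every deferred initializer after the page size has been chosen. */
void page_size_vars_init(unsigned int shift)
{
	assert(shift >= 12 && shift <= 16);
	page_shift = shift;
	stack_guard_gap_init();
	demo_kf_ops_init();
}
```

In a compile-time page size build, such a macro could simply expand to
an ordinary initialized definition, which would keep the generated code
unchanged.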
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
kernel/cgroup/cgroup.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c8e4b62b436a4..1e9c96210821d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4176,16 +4176,16 @@ static int cgroup_seqfile_show(struct seq_file *m, void *arg)
return 0;
}
-static struct kernfs_ops cgroup_kf_single_ops = {
+static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct kernfs_ops, cgroup_kf_single_ops, {
.atomic_write_len = PAGE_SIZE,
.open = cgroup_file_open,
.release = cgroup_file_release,
.write = cgroup_file_write,
.poll = cgroup_file_poll,
.seq_show = cgroup_seqfile_show,
-};
+});
-static struct kernfs_ops cgroup_kf_ops = {
+static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct kernfs_ops, cgroup_kf_ops, {
.atomic_write_len = PAGE_SIZE,
.open = cgroup_file_open,
.release = cgroup_file_release,
@@ -4195,7 +4195,7 @@ static struct kernfs_ops cgroup_kf_ops = {
.seq_next = cgroup_seqfile_next,
.seq_stop = cgroup_seqfile_stop,
.seq_show = cgroup_seqfile_show,
-};
+});
static void cgroup_file_notify_timer(struct timer_list *timer)
{
--
2.43.0
* [RFC PATCH v1 13/57] bpf: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (10 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 12/57] cgroup: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:38 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 14/57] pm/hibernate: " Ryan Roberts
` (46 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Alexei Starovoitov, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Daniel Borkmann,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, bpf, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Refactor "struct bpf_ringbuf" so that consumer_pos, producer_pos,
pending_pos and data are no longer embedded at (static) page offsets
within the struct. This can't work for boot-time page size because the
page size isn't known at compile-time. Instead, only define the
metadata in the struct, along with pointers to those values. At "struct
bpf_ringbuf" allocation time, the extra pages are allocated at the end
and the pointers are initialized to point to the correct locations.
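The layout change can be sketched in userspace C as follows. This is a simplified illustration, not the kernel code: struct ring, page_align() and ring_alloc() are hypothetical names, aligned_alloc() stands in for the page allocation and vmap() call, and the double mapping of the data pages is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for "struct bpf_ringbuf": pointers, not page-aligned members. */
struct ring {
	unsigned long *consumer_pos;
	unsigned long *producer_pos;
	unsigned long *pending_pos;
	char *data;
};

/* Round n up to a multiple of page_size (page_size must be a power of two). */
static size_t page_align(size_t n, size_t page_size)
{
	return (n + page_size - 1) & ~(page_size - 1);
}

/* Allocate the struct, two position pages and the data area in one block. */
static struct ring *ring_alloc(size_t page_size, size_t data_sz)
{
	size_t pgoff, nr_meta_pages, total;
	char *base;
	struct ring *rb;

	assert(page_size && !(page_size & (page_size - 1)));

	/* Pages occupied by the struct itself, computed at runtime. */
	pgoff = page_align(sizeof(struct ring), page_size) / page_size;
	nr_meta_pages = pgoff + 2;	/* plus consumer and producer pages */
	total = nr_meta_pages * page_size + page_align(data_sz, page_size);

	base = aligned_alloc(page_size, total);
	if (!base)
		return NULL;
	memset(base, 0, total);

	rb = (struct ring *)base;
	rb->consumer_pos = (unsigned long *)(base + page_size * pgoff);
	rb->producer_pos = (unsigned long *)(base + page_size * (pgoff + 1));
	rb->pending_pos = rb->producer_pos + 1;	/* shares the producer page */
	rb->data = base + page_size * nr_meta_pages;
	return rb;
}
```

Because every offset is derived from the page_size argument, the same code works for 4K, 16K and 64K pages; the cost is one extra pointer dereference on each access to the position counters.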
Additionally, only expose the __PAGE_SIZE enum to BTF for compile-time
page size builds. We don't know the page size at compile-time for
boot-time builds. NOTE: This may need some extra thought; perhaps
__PAGE_SIZE should be exposed as 0 in this case? And/or perhaps
__PAGE_SIZE_MIN/__PAGE_SIZE_MAX should be exposed? And there would need
to be a runtime mechanism for querying the page size (e.g.
getpagesize()).
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
kernel/bpf/core.c | 9 ++++++--
kernel/bpf/ringbuf.c | 54 ++++++++++++++++++++++++--------------------
2 files changed, 37 insertions(+), 26 deletions(-)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7ee62e38faf0e..485875aa78e63 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -89,10 +89,15 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns
return NULL;
}
-/* tell bpf programs that include vmlinux.h kernel's PAGE_SIZE */
+/*
+ * tell bpf programs that include vmlinux.h kernel's PAGE_SIZE. We can only do
+ * this for compile-time PAGE_SIZE builds.
+ */
+#if PAGE_SIZE_MIN == PAGE_SIZE_MAX
enum page_size_enum {
__PAGE_SIZE = PAGE_SIZE
};
+#endif
struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags)
{
@@ -100,7 +105,7 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
struct bpf_prog_aux *aux;
struct bpf_prog *fp;
- size = round_up(size, __PAGE_SIZE);
+ size = round_up(size, PAGE_SIZE);
fp = __vmalloc(size, gfp_flags);
if (fp == NULL)
return NULL;
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index e20b90c361316..8e4093ddbc638 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -14,9 +14,9 @@
#define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
-/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
+/* non-mmap()'able part of bpf_ringbuf (everything defined in struct) */
#define RINGBUF_PGOFF \
- (offsetof(struct bpf_ringbuf, consumer_pos) >> PAGE_SHIFT)
+ (PAGE_ALIGN(sizeof(struct bpf_ringbuf)) >> PAGE_SHIFT)
/* consumer page and producer page */
#define RINGBUF_POS_PAGES 2
#define RINGBUF_NR_META_PAGES (RINGBUF_PGOFF + RINGBUF_POS_PAGES)
@@ -69,10 +69,10 @@ struct bpf_ringbuf {
* validate each sample to ensure that they're correctly formatted, and
* fully contained within the ring buffer.
*/
- unsigned long consumer_pos __aligned(PAGE_SIZE);
- unsigned long producer_pos __aligned(PAGE_SIZE);
- unsigned long pending_pos;
- char data[] __aligned(PAGE_SIZE);
+ unsigned long *consumer_pos;
+ unsigned long *producer_pos;
+ unsigned long *pending_pos;
+ char *data;
};
struct bpf_ringbuf_map {
@@ -134,9 +134,15 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
VM_MAP | VM_USERMAP, PAGE_KERNEL);
if (rb) {
+ void *base = rb;
+
kmemleak_not_leak(pages);
rb->pages = pages;
rb->nr_pages = nr_pages;
+ rb->consumer_pos = (unsigned long *)(base + PAGE_SIZE * RINGBUF_PGOFF);
+ rb->producer_pos = (unsigned long *)(base + PAGE_SIZE * (RINGBUF_PGOFF + 1));
+ rb->pending_pos = rb->producer_pos + 1;
+ rb->data = base + PAGE_SIZE * nr_meta_pages;
return rb;
}
@@ -179,9 +185,9 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
init_irq_work(&rb->work, bpf_ringbuf_notify);
rb->mask = data_sz - 1;
- rb->consumer_pos = 0;
- rb->producer_pos = 0;
- rb->pending_pos = 0;
+ *rb->consumer_pos = 0;
+ *rb->producer_pos = 0;
+ *rb->pending_pos = 0;
return rb;
}
@@ -300,8 +306,8 @@ static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
{
unsigned long cons_pos, prod_pos;
- cons_pos = smp_load_acquire(&rb->consumer_pos);
- prod_pos = smp_load_acquire(&rb->producer_pos);
+ cons_pos = smp_load_acquire(rb->consumer_pos);
+ prod_pos = smp_load_acquire(rb->producer_pos);
return prod_pos - cons_pos;
}
@@ -418,7 +424,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
if (len > ringbuf_total_data_sz(rb))
return NULL;
- cons_pos = smp_load_acquire(&rb->consumer_pos);
+ cons_pos = smp_load_acquire(rb->consumer_pos);
if (in_nmi()) {
if (!spin_trylock_irqsave(&rb->spinlock, flags))
@@ -427,8 +433,8 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
spin_lock_irqsave(&rb->spinlock, flags);
}
- pend_pos = rb->pending_pos;
- prod_pos = rb->producer_pos;
+ pend_pos = *rb->pending_pos;
+ prod_pos = *rb->producer_pos;
new_prod_pos = prod_pos + len;
while (pend_pos < prod_pos) {
@@ -440,7 +446,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
pend_pos += tmp_size;
}
- rb->pending_pos = pend_pos;
+ *rb->pending_pos = pend_pos;
/* check for out of ringbuf space:
* - by ensuring producer position doesn't advance more than
@@ -460,7 +466,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
hdr->pg_off = pg_off;
/* pairs with consumer's smp_load_acquire() */
- smp_store_release(&rb->producer_pos, new_prod_pos);
+ smp_store_release(rb->producer_pos, new_prod_pos);
spin_unlock_irqrestore(&rb->spinlock, flags);
@@ -506,7 +512,7 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
* new data availability
*/
rec_pos = (void *)hdr - (void *)rb->data;
- cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
+ cons_pos = smp_load_acquire(rb->consumer_pos) & rb->mask;
if (flags & BPF_RB_FORCE_WAKEUP)
irq_work_queue(&rb->work);
@@ -580,9 +586,9 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags)
case BPF_RB_RING_SIZE:
return ringbuf_total_data_sz(rb);
case BPF_RB_CONS_POS:
- return smp_load_acquire(&rb->consumer_pos);
+ return smp_load_acquire(rb->consumer_pos);
case BPF_RB_PROD_POS:
- return smp_load_acquire(&rb->producer_pos);
+ return smp_load_acquire(rb->producer_pos);
default:
return 0;
}
@@ -680,12 +686,12 @@ static int __bpf_user_ringbuf_peek(struct bpf_ringbuf *rb, void **sample, u32 *s
u64 cons_pos, prod_pos;
/* Synchronizes with smp_store_release() in user-space producer. */
- prod_pos = smp_load_acquire(&rb->producer_pos);
+ prod_pos = smp_load_acquire(rb->producer_pos);
if (prod_pos % 8)
return -EINVAL;
/* Synchronizes with smp_store_release() in __bpf_user_ringbuf_sample_release() */
- cons_pos = smp_load_acquire(&rb->consumer_pos);
+ cons_pos = smp_load_acquire(rb->consumer_pos);
if (cons_pos >= prod_pos)
return -ENODATA;
@@ -715,7 +721,7 @@ static int __bpf_user_ringbuf_peek(struct bpf_ringbuf *rb, void **sample, u32 *s
* Update the consumer pos, and return -EAGAIN so the caller
* knows to skip this sample and try to read the next one.
*/
- smp_store_release(&rb->consumer_pos, cons_pos + total_len);
+ smp_store_release(rb->consumer_pos, cons_pos + total_len);
return -EAGAIN;
}
@@ -737,9 +743,9 @@ static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t siz
* prevents another task from writing to consumer_pos after it was read
* by this task with smp_load_acquire() in __bpf_user_ringbuf_peek().
*/
- consumer_pos = rb->consumer_pos;
+ consumer_pos = *rb->consumer_pos;
/* Synchronizes with smp_load_acquire() in user-space producer. */
- smp_store_release(&rb->consumer_pos, consumer_pos + rounded_size);
+ smp_store_release(rb->consumer_pos, consumer_pos + rounded_size);
}
BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
--
2.43.0
* [RFC PATCH v1 14/57] pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (11 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 13/57] bpf: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:39 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 15/57] stackdepot: " Ryan Roberts
` (45 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, linux-pm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
"struct linked_page", "struct swap_map_page" and "struct swsusp_header"
were all previously sized to be exactly PAGE_SIZE. Refactor those
structures to remove the padding, then superimpose them on a page at
runtime.
"struct cmp_data" and "struct dec_data" previously contained embedded
"unc" and "cmp" arrays, whose sizes were derived from PAGE_SIZE. We
can't use the flexible array approach here since there are 2 arrays in
the structure, so convert to pointers and define an allocator and
deallocator for each struct.
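As a rough userspace illustration of superimposing an unpadded structure on a page at runtime: the struct below mirrors the fields of "struct swsusp_header" without the reserved padding, but struct tail_header, header_in_page() and next_swap_index() are hypothetical names, and an 8-byte sector_t is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Unpadded header; it is placed at the tail of a page at runtime. */
struct tail_header {
	uint32_t hw_sig;
	uint32_t crc32;
	uint64_t image;		/* assumption: sector_t is 64 bits wide */
	uint32_t flags;
	char orig_sig[10];
	char sig[10];
} __attribute__((packed));

/* Locate the header so that it ends exactly at the end of the page. */
static struct tail_header *header_in_page(char *page, size_t page_size)
{
	assert(page_size >= sizeof(struct tail_header));
	return (struct tail_header *)(page + page_size -
				      sizeof(struct tail_header));
}

/* Index of the "next" link: the last sector-sized slot in a map page. */
static size_t next_swap_index(size_t page_size)
{
	return page_size / sizeof(uint64_t) - 1;
}
```

With these assumptions the header is 40 bytes, and the next-link index is 511 for a 4K page and 8191 for a 64K page. I/O is still issued on the whole page, which is why the patch keeps a separate page pointer alongside the header pointer.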
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
kernel/power/power.h | 2 +-
kernel/power/snapshot.c | 2 +-
kernel/power/swap.c | 129 +++++++++++++++++++++++++++++++++-------
3 files changed, 108 insertions(+), 25 deletions(-)
diff --git a/kernel/power/power.h b/kernel/power/power.h
index de0e6b1077f23..74af2eb8d48a4 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -16,7 +16,7 @@ struct swsusp_info {
unsigned long image_pages;
unsigned long pages;
unsigned long size;
-} __aligned(PAGE_SIZE);
+} __aligned(PAGE_SIZE_MAX);
#ifdef CONFIG_HIBERNATION
/* kernel/power/snapshot.c */
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 405eddbda4fc5..144e92f786e35 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -155,7 +155,7 @@ struct pbe *restore_pblist;
struct linked_page {
struct linked_page *next;
- char data[LINKED_PAGE_DATA_SIZE];
+ char data[];
} __packed;
/*
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 82b884b67152f..ffd4c864acfa2 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -59,6 +59,7 @@ static bool clean_pages_on_decompress;
*/
#define MAP_PAGE_ENTRIES (PAGE_SIZE / sizeof(sector_t) - 1)
+#define NEXT_SWAP_INDEX MAP_PAGE_ENTRIES
/*
* Number of free pages that are not high.
@@ -78,8 +79,11 @@ static inline unsigned long reqd_free_pages(void)
}
struct swap_map_page {
- sector_t entries[MAP_PAGE_ENTRIES];
- sector_t next_swap;
+ /*
+ * A PAGE_SIZE structure with (PAGE_SIZE / sizeof(sector_t)) entries.
+ * The last entry, [NEXT_SWAP_INDEX], is `.next_swap`.
+ */
+ sector_t entries[1];
};
struct swap_map_page_list {
@@ -103,8 +107,6 @@ struct swap_map_handle {
};
struct swsusp_header {
- char reserved[PAGE_SIZE - 20 - sizeof(sector_t) - sizeof(int) -
- sizeof(u32) - sizeof(u32)];
u32 hw_sig;
u32 crc32;
sector_t image;
@@ -113,6 +115,7 @@ struct swsusp_header {
char sig[10];
} __packed;
+static char *swsusp_header_pg;
static struct swsusp_header *swsusp_header;
/*
@@ -315,7 +318,7 @@ static int mark_swapfiles(struct swap_map_handle *handle, unsigned int flags)
{
int error;
- hib_submit_io(REQ_OP_READ, swsusp_resume_block, swsusp_header, NULL);
+ hib_submit_io(REQ_OP_READ, swsusp_resume_block, swsusp_header_pg, NULL);
if (!memcmp("SWAP-SPACE",swsusp_header->sig, 10) ||
!memcmp("SWAPSPACE2",swsusp_header->sig, 10)) {
memcpy(swsusp_header->orig_sig,swsusp_header->sig, 10);
@@ -329,7 +332,7 @@ static int mark_swapfiles(struct swap_map_handle *handle, unsigned int flags)
if (flags & SF_CRC32_MODE)
swsusp_header->crc32 = handle->crc32;
error = hib_submit_io(REQ_OP_WRITE | REQ_SYNC,
- swsusp_resume_block, swsusp_header, NULL);
+ swsusp_resume_block, swsusp_header_pg, NULL);
} else {
pr_err("Swap header not found!\n");
error = -ENODEV;
@@ -466,7 +469,7 @@ static int swap_write_page(struct swap_map_handle *handle, void *buf,
offset = alloc_swapdev_block(root_swap);
if (!offset)
return -ENOSPC;
- handle->cur->next_swap = offset;
+ handle->cur->entries[NEXT_SWAP_INDEX] = offset;
error = write_page(handle->cur, handle->cur_swap, hb);
if (error)
goto out;
@@ -643,8 +646,8 @@ struct cmp_data {
wait_queue_head_t done; /* compression done */
size_t unc_len; /* uncompressed length */
size_t cmp_len; /* compressed length */
- unsigned char unc[UNC_SIZE]; /* uncompressed buffer */
- unsigned char cmp[CMP_SIZE]; /* compressed buffer */
+ unsigned char *unc; /* uncompressed buffer */
+ unsigned char *cmp; /* compressed buffer */
};
/* Indicates the image size after compression */
@@ -683,6 +686,45 @@ static int compress_threadfn(void *data)
return 0;
}
+static void free_cmp_data(struct cmp_data *data, unsigned nr_threads)
+{
+ int i;
+
+ if (!data)
+ return;
+
+ for (i = 0; i < nr_threads; i++) {
+ vfree(data[i].unc);
+ vfree(data[i].cmp);
+ }
+
+ vfree(data);
+}
+
+static struct cmp_data *alloc_cmp_data(unsigned nr_threads)
+{
+ struct cmp_data *data = NULL;
+ int i = -1;
+
+ data = vzalloc(array_size(nr_threads, sizeof(*data)));
+ if (!data)
+ goto fail;
+
+ for (i = 0; i < nr_threads; i++) {
+ data[i].unc = vzalloc(UNC_SIZE);
+ if (!data[i].unc)
+ goto fail;
+ data[i].cmp = vzalloc(CMP_SIZE);
+ if (!data[i].cmp)
+ goto fail;
+ }
+
+ return data;
+fail:
+ free_cmp_data(data, nr_threads);
+ return NULL;
+}
+
/**
* save_compressed_image - Save the suspend image data after compression.
* @handle: Swap map handle to use for saving the image.
@@ -724,7 +766,7 @@ static int save_compressed_image(struct swap_map_handle *handle,
goto out_clean;
}
- data = vzalloc(array_size(nr_threads, sizeof(*data)));
+ data = alloc_cmp_data(nr_threads);
if (!data) {
pr_err("Failed to allocate %s data\n", hib_comp_algo);
ret = -ENOMEM;
@@ -902,7 +944,7 @@ static int save_compressed_image(struct swap_map_handle *handle,
if (data[thr].cc)
crypto_free_comp(data[thr].cc);
}
- vfree(data);
+ free_cmp_data(data, nr_threads);
}
if (page) free_page((unsigned long)page);
@@ -1036,7 +1078,7 @@ static int get_swap_reader(struct swap_map_handle *handle,
release_swap_reader(handle);
return error;
}
- offset = tmp->map->next_swap;
+ offset = tmp->map->entries[NEXT_SWAP_INDEX];
}
handle->k = 0;
handle->cur = handle->maps->map;
@@ -1150,8 +1192,8 @@ struct dec_data {
wait_queue_head_t done; /* decompression done */
size_t unc_len; /* uncompressed length */
size_t cmp_len; /* compressed length */
- unsigned char unc[UNC_SIZE]; /* uncompressed buffer */
- unsigned char cmp[CMP_SIZE]; /* compressed buffer */
+ unsigned char *unc; /* uncompressed buffer */
+ unsigned char *cmp; /* compressed buffer */
};
/*
@@ -1189,6 +1231,45 @@ static int decompress_threadfn(void *data)
return 0;
}
+static void free_dec_data(struct dec_data *data, unsigned nr_threads)
+{
+ int i;
+
+ if (!data)
+ return;
+
+ for (i = 0; i < nr_threads; i++) {
+ vfree(data[i].unc);
+ vfree(data[i].cmp);
+ }
+
+ vfree(data);
+}
+
+static struct dec_data *alloc_dec_data(unsigned nr_threads)
+{
+ struct dec_data *data = NULL;
+ int i = -1;
+
+ data = vzalloc(array_size(nr_threads, sizeof(*data)));
+ if (!data)
+ goto fail;
+
+ for (i = 0; i < nr_threads; i++) {
+ data[i].unc = vzalloc(UNC_SIZE);
+ if (!data[i].unc)
+ goto fail;
+ data[i].cmp = vzalloc(CMP_SIZE);
+ if (!data[i].cmp)
+ goto fail;
+ }
+
+ return data;
+fail:
+ free_dec_data(data, nr_threads);
+ return NULL;
+}
+
/**
* load_compressed_image - Load compressed image data and decompress it.
* @handle: Swap map handle to use for loading data.
@@ -1231,7 +1312,7 @@ static int load_compressed_image(struct swap_map_handle *handle,
goto out_clean;
}
- data = vzalloc(array_size(nr_threads, sizeof(*data)));
+ data = alloc_dec_data(nr_threads);
if (!data) {
pr_err("Failed to allocate %s data\n", hib_comp_algo);
ret = -ENOMEM;
@@ -1510,7 +1591,7 @@ static int load_compressed_image(struct swap_map_handle *handle,
if (data[thr].cc)
crypto_free_comp(data[thr].cc);
}
- vfree(data);
+ free_dec_data(data, nr_threads);
}
vfree(page);
@@ -1569,9 +1650,9 @@ int swsusp_check(bool exclusive)
hib_resume_bdev_file = bdev_file_open_by_dev(swsusp_resume_device,
BLK_OPEN_READ, holder, NULL);
if (!IS_ERR(hib_resume_bdev_file)) {
- clear_page(swsusp_header);
+ clear_page(swsusp_header_pg);
error = hib_submit_io(REQ_OP_READ, swsusp_resume_block,
- swsusp_header, NULL);
+ swsusp_header_pg, NULL);
if (error)
goto put;
@@ -1581,7 +1662,7 @@ int swsusp_check(bool exclusive)
/* Reset swap signature now */
error = hib_submit_io(REQ_OP_WRITE | REQ_SYNC,
swsusp_resume_block,
- swsusp_header, NULL);
+ swsusp_header_pg, NULL);
} else {
error = -EINVAL;
}
@@ -1631,12 +1712,12 @@ int swsusp_unmark(void)
int error;
hib_submit_io(REQ_OP_READ, swsusp_resume_block,
- swsusp_header, NULL);
+ swsusp_header_pg, NULL);
if (!memcmp(HIBERNATE_SIG,swsusp_header->sig, 10)) {
memcpy(swsusp_header->sig,swsusp_header->orig_sig, 10);
error = hib_submit_io(REQ_OP_WRITE | REQ_SYNC,
swsusp_resume_block,
- swsusp_header, NULL);
+ swsusp_header_pg, NULL);
} else {
pr_err("Cannot find swsusp signature!\n");
error = -ENODEV;
@@ -1653,9 +1734,11 @@ int swsusp_unmark(void)
static int __init swsusp_header_init(void)
{
- swsusp_header = (struct swsusp_header*) __get_free_page(GFP_KERNEL);
- if (!swsusp_header)
+ swsusp_header_pg = (char *)__get_free_page(GFP_KERNEL);
+ if (!swsusp_header_pg)
panic("Could not allocate memory for swsusp_header\n");
+ swsusp_header = (struct swsusp_header *)(swsusp_header_pg +
+ PAGE_SIZE - sizeof(struct swsusp_header));
return 0;
}
--
2.43.0
* [RFC PATCH v1 15/57] stackdepot: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (12 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 14/57] pm/hibernate: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-11-14 11:15 ` Vlastimil Babka
2024-10-14 10:58 ` [RFC PATCH v1 16/57] perf: " Ryan Roberts
` (44 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
"union handle_parts" previously calculated the number of bits required
for its pool index and offset members based on PAGE_SHIFT. This is
problematic for boot-time page size builds because the actual page size
isn't known until boot-time.
We could use PAGE_SHIFT_MAX in calculating the worst case offset bits,
but bits would be wasted that could be used for pool index when
PAGE_SIZE is set smaller than MAX, the end result being that stack depot
can address less memory than it should.
To avoid needing to dynamically define the offset and index bit widths,
let's instead fix the pool size and derive the order at runtime based on
the PAGE_SIZE. This means that the fields' widths can remain static,
with the downside being a slightly increased risk of failing to allocate
the large folio.
This only affects boot-time page size builds. Compile-time page size
builds will still always allocate order-2 folios.
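To make the order arithmetic concrete, here is a small sketch, assuming PAGE_SHIFT_MAX is 16 (64K pages) for an arm64 boot-time build. order_for() is a hypothetical userspace stand-in for the kernel's get_order().

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT_MAX		16	/* assumption: 64K is the largest page */
#define DEPOT_POOL_ORDER	2
#define DEPOT_POOL_SIZE	(1ULL << (PAGE_SHIFT_MAX + DEPOT_POOL_ORDER))

/* Smallest order such that (1 << (page_shift + order)) covers size. */
static unsigned int order_for(unsigned long long size, unsigned int page_shift)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((1ULL << (page_shift + order)) < size)
		order++;
	return order;
}
```

Under that assumption the pool is fixed at 256K, so order_for(DEPOT_POOL_SIZE, 12) is 6 with 4K pages, order_for(DEPOT_POOL_SIZE, 14) is 4 with 16K pages, and order_for(DEPOT_POOL_SIZE, 16) is 2 with 64K pages. The order-6 case is where the increased allocation-failure risk mentioned above comes from.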
Additionally, wrap global variables that are initialized with PAGE_SIZE
derived values using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their
initialization can be deferred for boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
include/linux/stackdepot.h | 6 +++---
lib/stackdepot.c | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index e9ec32fb97d4a..ac877a4e90406 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -32,10 +32,10 @@ typedef u32 depot_stack_handle_t;
#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
-#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
-#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
+#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages of PAGE_SIZE_MAX */
+#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT_MAX + DEPOT_POOL_ORDER))
#define DEPOT_STACK_ALIGN 4
-#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
+#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT_MAX - DEPOT_STACK_ALIGN)
#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
STACK_DEPOT_EXTRA_BITS)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 5ed34cc963fc3..974351f0e9e3c 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -68,7 +68,7 @@ static void *new_pool;
/* Number of pools in stack_pools. */
static int pools_num;
/* Offset to the unused space in the currently used pool. */
-static size_t pool_offset = DEPOT_POOL_SIZE;
+static DEFINE_GLOBAL_PAGE_SIZE_VAR(size_t, pool_offset, DEPOT_POOL_SIZE);
/* Freelist of stack records within stack_pools. */
static LIST_HEAD(free_stacks);
/* The lock must be held when performing pool or freelist modifications. */
@@ -625,7 +625,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
*/
if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
page = alloc_pages(gfp_nested_mask(alloc_flags),
- DEPOT_POOL_ORDER);
+ get_order(DEPOT_POOL_SIZE));
if (page)
prealloc = page_address(page);
}
@@ -663,7 +663,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
exit:
if (prealloc) {
/* Stack depot didn't use this memory, free it. */
- free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
+ free_pages((unsigned long)prealloc, get_order(DEPOT_POOL_SIZE));
}
if (found)
handle = found->handle.handle;
--
2.43.0
* [RFC PATCH v1 16/57] perf: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (13 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 15/57] stackdepot: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:40 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 17/57] kvm: " Ryan Roberts
` (43 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm,
linux-perf-users
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Refactor a BUILD_BUG_ON() so that we test against the limit; _format is
invariant to page size, so testing that it is smaller than the minimum
supported page size is sufficient.
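The reasoning can be checked with a short sketch, assuming limits of 4K and 64K. PAGE_SIZE_MIN and PAGE_SIZE_MAX here are local illustrative definitions, and demo_format and fits_every_page_size() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limits for an arm64 boot-time page size build. */
#define PAGE_SIZE_MIN	4096UL
#define PAGE_SIZE_MAX	65536UL

/* Compile-time check against the limit rather than the actual page size. */
static const char demo_format[] = "event=0x%02x";
static_assert(sizeof(demo_format) < PAGE_SIZE_MIN,
	      "format string must fit in the smallest page");

/* Runtime cross-check: 1 if size is below every selectable page size. */
static int fits_every_page_size(size_t size)
{
	size_t page_size;

	/* Step through 4K, 16K and 64K. */
	for (page_size = PAGE_SIZE_MIN; page_size <= PAGE_SIZE_MAX;
	     page_size <<= 2)
		if (size >= page_size)
			return 0;
	return 1;
}
```

Since the object's size does not change with the page size, passing the check for the smallest page implies passing it for every larger one, so a single compile-time test against the minimum is enough.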
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
include/linux/perf_event.h | 2 +-
kernel/events/core.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a8942277ddad..b7972155f93eb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1872,7 +1872,7 @@ _name##_show(struct device *dev, \
struct device_attribute *attr, \
char *page) \
{ \
- BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
+ BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE_MIN); \
return sprintf(page, _format "\n"); \
} \
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8a6c6bbcd658a..81149663ab7d8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -419,7 +419,7 @@ static struct kmem_cache *perf_event_cache;
int sysctl_perf_event_paranoid __read_mostly = 2;
/* Minimum for 512 kiB + 1 user control page */
-int sysctl_perf_event_mlock __read_mostly = 512 + (PAGE_SIZE / 1024); /* 'free' kiB per user */
+__DEFINE_GLOBAL_PAGE_SIZE_VAR(int, sysctl_perf_event_mlock, __read_mostly, 512 + (PAGE_SIZE / 1024)); /* 'free' kiB per user */
/*
* max perf event sample rate
--
2.43.0
* [RFC PATCH v1 17/57] kvm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (14 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 16/57] perf: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 21:37 ` Sean Christopherson
2024-10-16 14:41 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 18/57] trace: " Ryan Roberts
` (42 subsequent siblings)
58 siblings, 2 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, kvm, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Modify BUILD_BUG_ON() to compare against the minimum page size limit.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb2b78e92910f..6c862bc41a672 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4244,7 +4244,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
goto vcpu_decrement;
}
- BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
+ BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE_MIN);
page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page) {
r = -ENOMEM;
--
2.43.0
* [RFC PATCH v1 18/57] trace: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (15 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 17/57] kvm: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 16:46 ` Steven Rostedt
2024-10-14 10:58 ` [RFC PATCH v1 19/57] crash: " Ryan Roberts
` (41 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Masami Hiramatsu, Matthias Brugger,
Miroslav Benes, Steven Rostedt, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm,
linux-trace-kernel
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Convert BUILD_BUG_ON() to BUG_ON() since the argument depends on
PAGE_SIZE and it's not trivial to test against a page size limit.
Redefine FTRACE_KSTACK_ENTRIES so that "struct ftrace_stacks" is always
sized at 32K for 64-bit and 16K for 32-bit. It was previously defined in
terms of PAGE_SIZE (and worked out at the quoted sizes for a 4K page
size). But for 64K pages, the size expanded to 512K. Given the ftrace
stacks should be invariant to page size, this seemed like a waste. As a
side effect, it removes the PAGE_SIZE compile-time constant assumption
from this code.
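The sizes quoted above can be reproduced with a short calculation. This sketch assumes "struct ftrace_stacks" holds FTRACE_KSTACK_NESTING stacks of FTRACE_KSTACK_ENTRIES unsigned longs each; ftrace_stacks_bytes() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define FTRACE_KSTACK_NESTING	4

/*
 * Total bytes used by the nested stacks when the entry count is derived
 * from `basis` (previously PAGE_SIZE, now SZ_4K) and each entry is
 * `word_size` bytes (8 on 64-bit, 4 on 32-bit).
 */
static size_t ftrace_stacks_bytes(size_t basis, size_t word_size)
{
	size_t entries;

	assert(basis % FTRACE_KSTACK_NESTING == 0);
	entries = basis / FTRACE_KSTACK_NESTING;
	return entries * word_size * FTRACE_KSTACK_NESTING;
}
```

A basis of 4096 gives 32768 bytes (32K) on 64-bit and 16384 bytes (16K) on 32-bit, while a basis of 65536 gives 524288 bytes (512K) on 64-bit, matching the figures in the commit message.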
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
kernel/trace/fgraph.c | 2 +-
kernel/trace/trace.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index d7d4fb403f6f0..47aa5c8d8090e 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -534,7 +534,7 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func,
if (!current->ret_stack)
return -EBUSY;
- BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
+ BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
/* Set val to "reserved" with the delta to the new fgraph frame */
val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c3b2c7dfadef1..0f2ec3d30579f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2887,7 +2887,7 @@ trace_function(struct trace_array *tr, unsigned long ip, unsigned long
/* Allow 4 levels of nesting: normal, softirq, irq, NMI */
#define FTRACE_KSTACK_NESTING 4
-#define FTRACE_KSTACK_ENTRIES (PAGE_SIZE / FTRACE_KSTACK_NESTING)
+#define FTRACE_KSTACK_ENTRIES (SZ_4K / FTRACE_KSTACK_NESTING)
struct ftrace_stack {
unsigned long calls[FTRACE_KSTACK_ENTRIES];
--
2.43.0
* [RFC PATCH v1 19/57] crash: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (16 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 18/57] trace: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-15 3:47 ` Baoquan He
2024-10-14 10:58 ` [RFC PATCH v1 20/57] crypto: " Ryan Roberts
` (40 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Baoquan He,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: Ryan Roberts, kexec, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Updated BUILD_BUG_ON() to test against the PAGE_SIZE_MIN limit, so the
check remains a compile-time one.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
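A minimal userspace sketch of the "test against limit" idea follows. It
assumes an image that can boot with 4K, 16K or 64K pages, so the values
given to PAGE_SIZE_MIN and PAGE_SIZE_MAX here are assumptions, and
struct note_buf_model is an illustrative stand-in rather than the real
crash note type.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limits for an image that can boot with 4K, 16K or 64K pages. */
#define PAGE_SIZE_MIN 4096UL
#define PAGE_SIZE_MAX 65536UL

/* Stand-in for the per-CPU crash note buffer; the size is illustrative. */
struct note_buf_model {
	unsigned int words[128];
};

/*
 * Compile-time check against the lower limit. If this holds, the object
 * fits in a single page for every page size the image may select.
 */
static_assert(sizeof(struct note_buf_model) <= PAGE_SIZE_MIN,
	      "note buffer must fit in the smallest page");

/* Returns 1 if the note buffer fits within one page of the given size. */
static int note_buf_fits(unsigned long page_size)
{
	assert(page_size >= PAGE_SIZE_MIN && page_size <= PAGE_SIZE_MAX);
	return sizeof(struct note_buf_model) <= page_size;
}
```

Because the bound is checked against the smallest page size, no run-time
check is needed once the actual page size is known.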
kernel/crash_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 63cf89393c6eb..978c600a47ac8 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -465,7 +465,7 @@ static int __init crash_notes_memory_init(void)
* Break compile if size is bigger than PAGE_SIZE since crash_notes
* definitely will be in 2 pages with that.
*/
- BUILD_BUG_ON(size > PAGE_SIZE);
+ BUILD_BUG_ON(size > PAGE_SIZE_MIN);
crash_notes = __alloc_percpu(size, align);
if (!crash_notes) {
--
2.43.0
* [RFC PATCH v1 20/57] crypto: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (17 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 19/57] crash: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-26 6:54 ` Herbert Xu
2024-10-14 10:58 ` [RFC PATCH v1 21/57] sunrpc: " Ryan Roberts
` (39 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Herbert Xu,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-crypto, linux-kernel,
linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
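Updated BUILD_BUG_ON() to test MAX_CIPHER_BLOCKSIZE and
MAX_CIPHER_ALIGNMASK against the PAGE_SIZE_MIN limit, so the check
remains a compile-time one.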
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
crypto/lskcipher.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/crypto/lskcipher.c b/crypto/lskcipher.c
index cdb4897c63e6f..2b84cefba7cd1 100644
--- a/crypto/lskcipher.c
+++ b/crypto/lskcipher.c
@@ -79,8 +79,8 @@ static int crypto_lskcipher_crypt_unaligned(
u8 *tiv;
u8 *p;
- BUILD_BUG_ON(MAX_CIPHER_BLOCKSIZE > PAGE_SIZE ||
- MAX_CIPHER_ALIGNMASK >= PAGE_SIZE);
+ BUILD_BUG_ON(MAX_CIPHER_BLOCKSIZE > PAGE_SIZE_MIN ||
+ MAX_CIPHER_ALIGNMASK >= PAGE_SIZE_MIN);
tiv = kmalloc(PAGE_SIZE, GFP_ATOMIC);
if (!tiv)
--
2.43.0
* [RFC PATCH v1 21/57] sunrpc: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (18 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 20/57] crypto: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:42 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 22/57] sound: " Ryan Roberts
` (38 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anna Schumaker, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Trond Myklebust, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, linux-nfs
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Updated array sizes in various structs to contain enough entries for the
smallest supported page size, which is the case that needs the most
pages per payload.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
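To illustrate why the arrays are sized from PAGE_SIZE_MIN, the sketch
below evaluates the RPCSVC_MAXPAGES arithmetic for a given page size. It
assumes RPCSVC_MAXPAYLOAD is 1 MiB; that value is not shown in this
patch, so treat it as an assumption. rpcsvc_maxpages() is a hypothetical
helper.

```c
#include <assert.h>

/* Assumed value; not taken from this patch. */
#define RPCSVC_MAXPAYLOAD (1024UL * 1024UL)

/* Same arithmetic as RPCSVC_MAXPAGES, with the page size as a parameter. */
static unsigned long rpcsvc_maxpages(unsigned long page_size)
{
	assert(page_size != 0);
	return (RPCSVC_MAXPAYLOAD + page_size - 1) / page_size + 2 + 1;
}
```

Under that assumption the result is 259 entries for 4K pages, 67 for 16K
and 19 for 64K, so sizing for the smallest page size covers every other
choice.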
include/linux/sunrpc/svc.h | 8 +++++---
include/linux/sunrpc/svc_rdma.h | 4 ++--
include/linux/sunrpc/svcsock.h | 2 +-
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index a7d0406b9ef59..dda44018b8f36 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -160,6 +160,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
*/
#define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
+ 2 + 1)
+#define RPCSVC_MAXPAGES_MAX ((RPCSVC_MAXPAYLOAD+PAGE_SIZE_MIN-1)/PAGE_SIZE_MIN \
+ + 2 + 1)
/*
* The context of a single thread, including the request currently being
@@ -190,14 +192,14 @@ struct svc_rqst {
struct xdr_stream rq_res_stream;
struct page *rq_scratch_page;
struct xdr_buf rq_res;
- struct page *rq_pages[RPCSVC_MAXPAGES + 1];
+ struct page *rq_pages[RPCSVC_MAXPAGES_MAX + 1];
struct page * *rq_respages; /* points into rq_pages */
struct page * *rq_next_page; /* next reply page to use */
struct page * *rq_page_end; /* one past the last page */
struct folio_batch rq_fbatch;
- struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
- struct bio_vec rq_bvec[RPCSVC_MAXPAGES];
+ struct kvec rq_vec[RPCSVC_MAXPAGES_MAX]; /* generally useful.. */
+ struct bio_vec rq_bvec[RPCSVC_MAXPAGES_MAX];
__be32 rq_xid; /* transmission id */
u32 rq_prog; /* program number */
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index d33bab33099ab..7c6441e8d6f7a 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -200,7 +200,7 @@ struct svc_rdma_recv_ctxt {
struct svc_rdma_pcl rc_reply_pcl;
unsigned int rc_page_count;
- struct page *rc_pages[RPCSVC_MAXPAGES];
+ struct page *rc_pages[RPCSVC_MAXPAGES_MAX];
};
/*
@@ -242,7 +242,7 @@ struct svc_rdma_send_ctxt {
void *sc_xprt_buf;
int sc_page_count;
int sc_cur_sge_no;
- struct page *sc_pages[RPCSVC_MAXPAGES];
+ struct page *sc_pages[RPCSVC_MAXPAGES_MAX];
struct ib_sge sc_sges[];
};
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 7c78ec6356b92..6c6bcc82685a3 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -40,7 +40,7 @@ struct svc_sock {
struct completion sk_handshake_done;
- struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
+ struct page * sk_pages[RPCSVC_MAXPAGES_MAX]; /* received data */
};
static inline u32 svc_sock_reclen(struct svc_sock *svsk)
--
2.43.0
* [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (19 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 21/57] sunrpc: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 11:38 ` Mark Brown
2024-10-14 10:58 ` [RFC PATCH v1 23/57] net: " Ryan Roberts
` (37 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Jaroslav Kysela,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Takashi Iwai, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm,
linux-sound
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
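As a rough illustration of what deferring the initialization means, here
is a userspace sketch. It does not reproduce the macro from this series;
struct pcm_hw_model and dummy_hw_model_init() are hypothetical names, and
the explicit init call stands in for whatever mechanism the macro uses.

```c
#include <assert.h>

struct pcm_hw_model {
	unsigned long period_bytes_max;
	unsigned int periods_min;
	unsigned int periods_max;
};

/* Not initialised statically: the page size is unknown at build time. */
static struct pcm_hw_model dummy_hw_model;

/* Hypothetical hook, called once after the page size has been selected. */
static void dummy_hw_model_init(unsigned long page_size)
{
	assert(page_size != 0);
	dummy_hw_model = (struct pcm_hw_model){
		.period_bytes_max = page_size * 2,
		.periods_min = 2,
		.periods_max = 128,
	};
}
```

The point of the sketch is only that the page-size-derived field gets
its value at run time rather than from a static initializer.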
sound/soc/soc-utils.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/sound/soc/soc-utils.c b/sound/soc/soc-utils.c
index 303823dc45d7a..74e1bc1087de4 100644
--- a/sound/soc/soc-utils.c
+++ b/sound/soc/soc-utils.c
@@ -98,7 +98,7 @@ int snd_soc_tdm_params_to_bclk(const struct snd_pcm_hw_params *params,
}
EXPORT_SYMBOL_GPL(snd_soc_tdm_params_to_bclk);
-static const struct snd_pcm_hardware dummy_dma_hardware = {
+static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct snd_pcm_hardware, dummy_dma_hardware, {
/* Random values to keep userspace happy when checking constraints */
.info = SNDRV_PCM_INFO_INTERLEAVED |
SNDRV_PCM_INFO_BLOCK_TRANSFER,
@@ -107,7 +107,7 @@ static const struct snd_pcm_hardware dummy_dma_hardware = {
.period_bytes_max = PAGE_SIZE*2,
.periods_min = 2,
.periods_max = 128,
-};
+});
static const struct snd_soc_component_driver dummy_platform;
--
2.43.0
* [RFC PATCH v1 23/57] net: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (20 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 22/57] sound: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:43 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 24/57] net: fec: " Ryan Roberts
` (36 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anna Schumaker, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Eric Dumazet,
Greg Marsden, Ivan Ivanov, Jakub Kicinski, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Paolo Abeni, Trond Myklebust, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, linux-nfs,
netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Define NLMSG_GOODSIZE using min() instead of ifdeffery. This will now
evaluate to a compile-time constant for compile-time page size, but
evaluate at run-time when using boot-time page size.
Rework NAPI small page frag infrastructure so that, for boot-time page
size, it is compiled in if 4K page size is in the possible range, with
the decision to use it deferred to run time, when the page size is
known. No change for compile-time page size case.
Resize cache_defer_hash[] array for PAGE_SIZE_MAX.
Convert a complex BUILD_BUG_ON() to runtime BUG_ON().
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
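The NLMSG_GOODSIZE change can be modelled in userspace as below. The
sketch takes the page size as a parameter and caps it at 8192 before
subtracting a per-skb overhead. SKB_OVERHEAD_MODEL is a made-up
placeholder, not the real SKB_WITH_OVERHEAD() cost, and
nlmsg_goodsize() is a hypothetical helper.

```c
#include <assert.h>

/* Hypothetical stand-in for the skb overhead; not the real value. */
#define SKB_OVERHEAD_MODEL 320UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Models SKB_WITH_OVERHEAD(min(PAGE_SIZE, 8192UL)) for a given page size. */
static unsigned long nlmsg_goodsize(unsigned long page_size)
{
	unsigned long capped = min_ul(page_size, 8192UL);

	assert(capped > SKB_OVERHEAD_MODEL);
	return capped - SKB_OVERHEAD_MODEL;
}
```

For 16K and 64K pages the result is identical, because both are capped
at 8192; only the 4K case yields a smaller buffer. This matches the
behaviour of the removed #if/#else.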
include/linux/netlink.h | 6 +-----
net/core/hotdata.c | 4 ++--
net/core/skbuff.c | 4 ++--
net/core/sysctl_net_core.c | 2 +-
net/sunrpc/cache.c | 3 ++-
net/unix/af_unix.c | 2 +-
6 files changed, 9 insertions(+), 12 deletions(-)
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index b332c2048c755..ffa1e94111f89 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -267,11 +267,7 @@ netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
* use enormous buffer sizes on recvmsg() calls just to avoid
* MSG_TRUNC when PAGE_SIZE is very large.
*/
-#if PAGE_SIZE < 8192UL
-#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(PAGE_SIZE)
-#else
-#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(8192UL)
-#endif
+#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(min(PAGE_SIZE, 8192UL))
#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
diff --git a/net/core/hotdata.c b/net/core/hotdata.c
index d0aaaaa556f22..e1f30e87ba6e9 100644
--- a/net/core/hotdata.c
+++ b/net/core/hotdata.c
@@ -5,7 +5,7 @@
#include <net/hotdata.h>
#include <net/proto_memory.h>
-struct net_hotdata net_hotdata __cacheline_aligned = {
+__DEFINE_GLOBAL_PAGE_SIZE_VAR(struct net_hotdata, net_hotdata, __cacheline_aligned, {
.offload_base = LIST_HEAD_INIT(net_hotdata.offload_base),
.ptype_all = LIST_HEAD_INIT(net_hotdata.ptype_all),
.gro_normal_batch = 8,
@@ -21,5 +21,5 @@ struct net_hotdata net_hotdata __cacheline_aligned = {
.sysctl_max_skb_frags = MAX_SKB_FRAGS,
.sysctl_skb_defer_max = 64,
.sysctl_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE
-};
+});
EXPORT_SYMBOL(net_hotdata);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 83f8cd8aa2d16..b6c8eee0cc74b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -219,9 +219,9 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
#define NAPI_SKB_CACHE_BULK 16
#define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2)
-#if PAGE_SIZE == SZ_4K
+#if PAGE_SIZE_MIN <= SZ_4K && SZ_4K <= PAGE_SIZE_MAX
-#define NAPI_HAS_SMALL_PAGE_FRAG 1
+#define NAPI_HAS_SMALL_PAGE_FRAG (PAGE_SIZE == SZ_4K)
#define NAPI_SMALL_PAGE_PFMEMALLOC(nc) ((nc).pfmemalloc)
/* specialized page frag allocator using a single order 0 page
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 86a2476678c48..a7a2eb7581bd1 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -33,7 +33,7 @@ static int int_3600 = 3600;
static int min_sndbuf = SOCK_MIN_SNDBUF;
static int min_rcvbuf = SOCK_MIN_RCVBUF;
static int max_skb_frags = MAX_SKB_FRAGS;
-static int min_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE;
+static DEFINE_GLOBAL_PAGE_SIZE_VAR(int, min_mem_pcpu_rsv, SK_MEMORY_PCPU_RESERVE);
static int net_msg_warn; /* Unused, but still a sysctl */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 95ff747061046..4e682c0cd7586 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -573,13 +573,14 @@ EXPORT_SYMBOL_GPL(cache_purge);
*/
#define DFR_HASHSIZE (PAGE_SIZE/sizeof(struct list_head))
+#define DFR_HASHSIZE_MAX (PAGE_SIZE_MAX/sizeof(struct list_head))
#define DFR_HASH(item) ((((long)item)>>4 ^ (((long)item)>>13)) % DFR_HASHSIZE)
#define DFR_MAX 300 /* ??? */
static DEFINE_SPINLOCK(cache_defer_lock);
static LIST_HEAD(cache_defer_list);
-static struct hlist_head cache_defer_hash[DFR_HASHSIZE];
+static struct hlist_head cache_defer_hash[DFR_HASHSIZE_MAX];
static int cache_defer_cnt;
static void __unhash_deferred_req(struct cache_deferred_req *dreq)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0be0dcb07f7b6..1cf9f583358af 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2024,7 +2024,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
MAX_SKB_FRAGS * PAGE_SIZE);
data_len = PAGE_ALIGN(data_len);
- BUILD_BUG_ON(SKB_MAX_ALLOC < PAGE_SIZE);
+ BUG_ON(SKB_MAX_ALLOC < PAGE_SIZE);
}
skb = sock_alloc_send_pskb(sk, len - data_len, data_len,
--
2.43.0
* [RFC PATCH v1 24/57] net: fec: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (21 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 23/57] net: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 25/57] net: marvell: " Ryan Roberts
` (35 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Wei Fang, Will Deacon
Cc: Ryan Roberts, imx, linux-arm-kernel, linux-kernel, linux-mm,
netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Refactored "struct fec_enet_priv_rx_q" to use a flexible array member
for "rx_skb_info", since its length depends on PAGE_SIZE.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
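The flexible array member pattern used here can be sketched in userspace
as follows. The struct and function names are hypothetical, calloc()
stands in for kzalloc(), and the explicit overflow guard is only in the
same spirit as struct_size() (which saturates instead of failing early).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct rx_info_model {
	void *page;
	int offset;
};

struct rx_queue_model {
	unsigned char id;
	size_t nr_entries;
	/* Flexible array member: must be last, sized at allocation time. */
	struct rx_info_model rx_skb_info[];
};

/* Userspace analogue of kzalloc(struct_size(q, rx_skb_info, n), ...). */
static struct rx_queue_model *rx_queue_alloc(size_t n)
{
	struct rx_queue_model *q;

	assert(n > 0); /* a ring needs at least one entry */

	/* Refuse sizes whose byte count would overflow size_t. */
	if (n > (SIZE_MAX - sizeof(*q)) / sizeof(q->rx_skb_info[0]))
		return NULL;

	q = calloc(1, sizeof(*q) + n * sizeof(q->rx_skb_info[0]));
	if (q)
		q->nr_entries = n;
	return q;
}
```

Since the array length is supplied at allocation time, it no longer has
to be a compile-time constant; the caller frees the queue with free().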
drivers/net/ethernet/freescale/fec.h | 3 ++-
drivers/net/ethernet/freescale/fec_main.c | 5 +++--
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index a19cb2a786fd2..afc8b3f360555 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -571,7 +571,6 @@ struct fec_enet_priv_tx_q {
struct fec_enet_priv_rx_q {
struct bufdesc_prop bd;
- struct fec_enet_priv_txrx_info rx_skb_info[RX_RING_SIZE];
/* page_pool */
struct page_pool *page_pool;
@@ -580,6 +579,8 @@ struct fec_enet_priv_rx_q {
/* rx queue number, in the range 0-7 */
u8 id;
+
+ struct fec_enet_priv_txrx_info rx_skb_info[];
};
struct fec_stop_mode_gpr {
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index a923cb95cdc62..b9214c12d537e 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3339,6 +3339,8 @@ static int fec_enet_alloc_queue(struct net_device *ndev)
int i;
int ret = 0;
struct fec_enet_priv_tx_q *txq;
+ size_t rxq_sz = struct_size(fep->rx_queue[0], rx_skb_info, RX_RING_SIZE);
+
for (i = 0; i < fep->num_tx_queues; i++) {
txq = kzalloc(sizeof(*txq), GFP_KERNEL);
@@ -3364,8 +3366,7 @@ static int fec_enet_alloc_queue(struct net_device *ndev)
}
for (i = 0; i < fep->num_rx_queues; i++) {
- fep->rx_queue[i] = kzalloc(sizeof(*fep->rx_queue[i]),
- GFP_KERNEL);
+ fep->rx_queue[i] = kzalloc(rxq_sz, GFP_KERNEL);
if (!fep->rx_queue[i]) {
ret = -ENOMEM;
goto alloc_failed;
--
2.43.0
* [RFC PATCH v1 25/57] net: marvell: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (22 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 24/57] net: fec: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 26/57] net: hns3: " Ryan Roberts
` (34 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Marcin Wojtas, Mark Rutland, Matthias Brugger, Miroslav Benes,
Paolo Abeni, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Updated sky2 "struct rx_ring_info" member frag_addr[] to contain enough
entries for the smallest supported page size.
Updated mvneta "struct mvneta_tx_queue" members tso_hdrs[] and
tso_hdrs_phys[] to contain enough entries for the smallest supported
page size.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
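The array bounds touched by this patch can be worked through with the
sketch below. The three *_MODEL constants are assumed values for
MVNETA_MAX_TXD, TSO_HEADER_SIZE and ETH_JUMBO_MTU; none of them appear
in this diff, so check the driver sources before relying on them. Both
helpers are hypothetical.

```c
#include <assert.h>

/* Assumed driver constants; not taken from this patch. */
#define MVNETA_MAX_TXD_MODEL   1024UL
#define TSO_HEADER_SIZE_MODEL  256UL
#define ETH_JUMBO_MTU_MODEL    9000UL

/* TSO header pages needed to cover every Tx descriptor. */
static unsigned long mvneta_max_tso_pages(unsigned long page_size)
{
	unsigned long hdrs_per_page = (2 * page_size) / TSO_HEADER_SIZE_MODEL;

	assert(hdrs_per_page != 0);
	return MVNETA_MAX_TXD_MODEL / hdrs_per_page;
}

/* sky2 fragment slots: ETH_JUMBO_MTU >> PAGE_SHIFT, but never fewer than 1. */
static unsigned long sky2_frag_slots(unsigned int page_shift)
{
	unsigned long n = ETH_JUMBO_MTU_MODEL >> page_shift;

	return n ? n : 1;
}
```

With these assumed constants, 4K pages need 32 TSO header pages and 2
fragment slots, while 64K pages need 2 and 1 respectively. The smallest
page size therefore gives the largest array bounds.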
drivers/net/ethernet/marvell/mvneta.c | 9 ++++++---
drivers/net/ethernet/marvell/sky2.h | 2 +-
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 41894834fb53c..f3ac371d8f3a7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -346,12 +346,15 @@
/* The size of a TSO header page */
#define MVNETA_TSO_PAGE_SIZE (2 * PAGE_SIZE)
+#define MVNETA_TSO_PAGE_SIZE_MIN (2 * PAGE_SIZE_MIN)
/* Number of TSO headers per page. This should be a power of 2 */
#define MVNETA_TSO_PER_PAGE (MVNETA_TSO_PAGE_SIZE / TSO_HEADER_SIZE)
+#define MVNETA_TSO_PER_PAGE_MIN (MVNETA_TSO_PAGE_SIZE_MIN / TSO_HEADER_SIZE)
/* Maximum number of TSO header pages */
#define MVNETA_MAX_TSO_PAGES (MVNETA_MAX_TXD / MVNETA_TSO_PER_PAGE)
+#define MVNETA_MAX_TSO_PAGES_MAX (MVNETA_MAX_TXD / MVNETA_TSO_PER_PAGE_MIN)
/* descriptor aligned size */
#define MVNETA_DESC_ALIGNED_SIZE 32
@@ -696,10 +699,10 @@ struct mvneta_tx_queue {
int next_desc_to_proc;
/* DMA buffers for TSO headers */
- char *tso_hdrs[MVNETA_MAX_TSO_PAGES];
+ char *tso_hdrs[MVNETA_MAX_TSO_PAGES_MAX];
/* DMA address of TSO headers */
- dma_addr_t tso_hdrs_phys[MVNETA_MAX_TSO_PAGES];
+ dma_addr_t tso_hdrs_phys[MVNETA_MAX_TSO_PAGES_MAX];
/* Affinity mask for CPUs*/
cpumask_t affinity_mask;
@@ -5895,7 +5898,7 @@ static int __init mvneta_driver_init(void)
{
int ret;
- BUILD_BUG_ON_NOT_POWER_OF_2(MVNETA_TSO_PER_PAGE);
+ BUILD_BUG_ON_NOT_POWER_OF_2(MVNETA_TSO_PER_PAGE_MIN);
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "net/mvneta:online",
mvneta_cpu_online,
diff --git a/drivers/net/ethernet/marvell/sky2.h b/drivers/net/ethernet/marvell/sky2.h
index 8d0bacf4e49cc..8ee73ae087dfc 100644
--- a/drivers/net/ethernet/marvell/sky2.h
+++ b/drivers/net/ethernet/marvell/sky2.h
@@ -2195,7 +2195,7 @@ struct rx_ring_info {
struct sk_buff *skb;
dma_addr_t data_addr;
DEFINE_DMA_UNMAP_LEN(data_size);
- dma_addr_t frag_addr[ETH_JUMBO_MTU >> PAGE_SHIFT ?: 1];
+ dma_addr_t frag_addr[ETH_JUMBO_MTU >> PAGE_SHIFT_MIN ?: 1];
};
enum flow_control {
--
2.43.0
* [RFC PATCH v1 26/57] net: hns3: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (23 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 25/57] net: marvell: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 27/57] net: e1000: " Ryan Roberts
` (33 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Jijie Shao, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Paolo Abeni, Salil Mehta, Will Deacon, Yisen Zhuang
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Convert a CPP conditional to a C conditional. The compiler will dead
code strip when doing a compile-time page size build, for the same end
effect. But this will also work with boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
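A userspace model of the converted function is sketched below, with the
page size passed in as a parameter so that both kinds of build can be
exercised. page_order_model() is a hypothetical name.

```c
#include <assert.h>

/*
 * Model of hns3_page_order() with the page size as a parameter. When the
 * page size is a compile-time constant the compiler can fold the first
 * comparison and drop the dead branch; otherwise it is tested at run time.
 */
static unsigned int page_order_model(unsigned long page_size,
				     unsigned int buf_size)
{
	assert(page_size != 0);
	if (page_size < 8192 && buf_size > page_size / 2)
		return 1;
	return 0;
}
```

With 4K pages a 4096-byte buffer selects order 1, while any page size of
8192 or more always selects order 0, as with the original #if.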
drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index d36c4ed16d8dd..5e675721b7364 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -681,10 +681,8 @@ static inline bool hns3_nic_resetting(struct net_device *netdev)
static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
{
-#if (PAGE_SIZE < 8192)
- if (ring->buf_size > (PAGE_SIZE / 2))
+ if (PAGE_SIZE < 8192 && ring->buf_size > (PAGE_SIZE / 2))
return 1;
-#endif
return 0;
}
--
2.43.0
* [RFC PATCH v1 27/57] net: e1000: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (24 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 26/57] net: hns3: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:43 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 28/57] net: igbvf: " Ryan Roberts
` (32 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Will Deacon
Cc: Ryan Roberts, intel-wired-lan, linux-arm-kernel, linux-kernel,
linux-mm, netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Convert CPP conditionals to C conditionals. The compiler will dead code
strip when doing a compile-time page size build, for the same end
effect. But this will also work with boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/net/ethernet/intel/e1000/e1000_main.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index ab7ae418d2948..cc14788f5bb04 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -3553,12 +3553,10 @@ static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
if (max_frame <= E1000_RXBUFFER_2048)
adapter->rx_buffer_len = E1000_RXBUFFER_2048;
- else
-#if (PAGE_SIZE >= E1000_RXBUFFER_16384)
+ else if (PAGE_SIZE >= E1000_RXBUFFER_16384)
adapter->rx_buffer_len = E1000_RXBUFFER_16384;
-#elif (PAGE_SIZE >= E1000_RXBUFFER_4096)
+ else if (PAGE_SIZE >= E1000_RXBUFFER_4096)
adapter->rx_buffer_len = PAGE_SIZE;
-#endif
/* adjust allocation if LPE protects us, and we aren't using SBP */
if (!hw->tbi_compatibility_on &&
--
2.43.0
* [RFC PATCH v1 28/57] net: igbvf: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (25 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 27/57] net: e1000: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:44 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 29/57] net: igb: " Ryan Roberts
` (31 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Will Deacon
Cc: Ryan Roberts, intel-wired-lan, linux-arm-kernel, linux-kernel,
linux-mm, netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being compile-time constant. Code
intended to be equivalent when compile-time page size is active.
Convert CPP conditionals to C conditionals. The compiler will dead code
strip when doing a compile-time page size build, for the same end
effect. But this will also work with boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
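As a worked example of the converted tail of the if/else chain, the
sketch below returns the Rx buffer length chosen for frames larger than
2048 bytes. It models only the two branches changed in this patch, and
igbvf_large_rx_buffer_len() is a hypothetical name.

```c
#include <assert.h>

/* Buffer length for frames larger than 2048 bytes, per the chain's tail. */
static unsigned long igbvf_large_rx_buffer_len(unsigned long page_size)
{
	assert(page_size != 0);
	if (page_size / 2 > 16384)
		return 16384;
	return page_size / 2;
}
```

This yields 2048 for 4K pages, 8192 for 16K pages and 16384 for 64K
pages, the same values the preprocessor version selected at build time.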
drivers/net/ethernet/intel/igbvf/netdev.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 925d7286a8ee4..2e11d999168de 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -2419,12 +2419,10 @@ static int igbvf_change_mtu(struct net_device *netdev, int new_mtu)
adapter->rx_buffer_len = 1024;
else if (max_frame <= 2048)
adapter->rx_buffer_len = 2048;
- else
-#if (PAGE_SIZE / 2) > 16384
+ else if ((PAGE_SIZE / 2) > 16384)
adapter->rx_buffer_len = 16384;
-#else
+ else
adapter->rx_buffer_len = PAGE_SIZE / 2;
-#endif
/* adjust allocation if LPE protects us, and we aren't using SBP */
if ((max_frame == ETH_FRAME_LEN + ETH_FCS_LEN) ||
--
2.43.0
* [RFC PATCH v1 29/57] net: igb: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (26 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 28/57] net: igbvf: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:45 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 30/57] drivers/base: " Ryan Roberts
` (30 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Will Deacon
Cc: Ryan Roberts, bpf, intel-wired-lan, linux-arm-kernel,
linux-kernel, linux-mm, netdev
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Convert CPP conditionals to C conditionals. In a compile-time page size
build the compiler strips the dead branches, so the end result is the
same, while the C conditionals also work with boot-time page size builds.
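As a rough illustration of the conversion (a userspace sketch, not driver
code): demo_page_size and DEMO_PAGE_SIZE are hypothetical stand-ins for a
PAGE_SIZE that may only be known at boot, and demo_next_offset() mirrors
the page_offset update pattern touched by this patch.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for PAGE_SIZE. In a boot-time page size build
 * the real macro is assumed not to be a preprocessor constant, so
 * "#if (PAGE_SIZE < 8192)" cannot be evaluated by CPP.
 */
static size_t demo_page_size = 4096;
#define DEMO_PAGE_SIZE demo_page_size

/*
 * Advance an Rx buffer offset: flip between the two halves of the
 * buffer for small pages, otherwise walk forward through the page.
 * If DEMO_PAGE_SIZE were a constant, the compiler could drop the
 * branch that can never be taken.
 */
static unsigned int demo_next_offset(unsigned int offset,
				     unsigned int truesize)
{
	if (DEMO_PAGE_SIZE < 8192)
		return offset ^ truesize;

	return offset + truesize;
}
```

The same source then serves both build types; only the definition behind
the PAGE_SIZE stand-in changes.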
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/net/ethernet/intel/igb/igb.h | 25 ++--
drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++++++-----------
2 files changed, 82 insertions(+), 92 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 3c2dc7bdebb50..04aeebcd363b3 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -158,7 +158,6 @@ struct vf_mac_filter {
* up negative. In these cases we should fall back to the 3K
* buffers.
*/
-#if (PAGE_SIZE < 8192)
#define IGB_MAX_FRAME_BUILD_SKB (IGB_RXBUFFER_1536 - NET_IP_ALIGN)
#define IGB_2K_TOO_SMALL_WITH_PADDING \
((NET_SKB_PAD + IGB_TS_HDR_LEN + IGB_RXBUFFER_1536) > SKB_WITH_OVERHEAD(IGB_RXBUFFER_2048))
@@ -177,6 +176,9 @@ static inline int igb_skb_pad(void)
{
int rx_buf_len;
+ if (PAGE_SIZE >= 8192)
+ return NET_SKB_PAD + NET_IP_ALIGN;
+
/* If a 2K buffer cannot handle a standard Ethernet frame then
* optimize padding for a 3K buffer instead of a 1.5K buffer.
*
@@ -196,9 +198,6 @@ static inline int igb_skb_pad(void)
}
#define IGB_SKB_PAD igb_skb_pad()
-#else
-#define IGB_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
-#endif
/* How many Rx Buffers do we bundle into one write to the hardware ? */
#define IGB_RX_BUFFER_WRITE 16 /* Must be power of 2 */
@@ -280,7 +279,7 @@ struct igb_tx_buffer {
struct igb_rx_buffer {
dma_addr_t dma;
struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE_MAX >= 65536)
__u32 page_offset;
#else
__u16 page_offset;
@@ -403,22 +402,20 @@ enum e1000_ring_flags_t {
static inline unsigned int igb_rx_bufsz(struct igb_ring *ring)
{
-#if (PAGE_SIZE < 8192)
- if (ring_uses_large_buffer(ring))
- return IGB_RXBUFFER_3072;
+ if (PAGE_SIZE < 8192) {
+ if (ring_uses_large_buffer(ring))
+ return IGB_RXBUFFER_3072;
- if (ring_uses_build_skb(ring))
- return IGB_MAX_FRAME_BUILD_SKB;
-#endif
+ if (ring_uses_build_skb(ring))
+ return IGB_MAX_FRAME_BUILD_SKB;
+ }
return IGB_RXBUFFER_2048;
}
static inline unsigned int igb_rx_pg_order(struct igb_ring *ring)
{
-#if (PAGE_SIZE < 8192)
- if (ring_uses_large_buffer(ring))
+ if (PAGE_SIZE < 8192 && ring_uses_large_buffer(ring))
return 1;
-#endif
return 0;
}
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 1ef4cb871452a..4f2c53dece1a2 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -4797,9 +4797,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
static void igb_set_rx_buffer_len(struct igb_adapter *adapter,
struct igb_ring *rx_ring)
{
-#if (PAGE_SIZE < 8192)
struct e1000_hw *hw = &adapter->hw;
-#endif
/* set build_skb and buffer size flags */
clear_ring_build_skb_enabled(rx_ring);
@@ -4810,12 +4808,11 @@ static void igb_set_rx_buffer_len(struct igb_adapter *adapter,
set_ring_build_skb_enabled(rx_ring);
-#if (PAGE_SIZE < 8192)
- if (adapter->max_frame_size > IGB_MAX_FRAME_BUILD_SKB ||
+ if (PAGE_SIZE < 8192 &&
+ (adapter->max_frame_size > IGB_MAX_FRAME_BUILD_SKB ||
IGB_2K_TOO_SMALL_WITH_PADDING ||
- rd32(E1000_RCTL) & E1000_RCTL_SBP)
+ rd32(E1000_RCTL) & E1000_RCTL_SBP))
set_ring_uses_large_buffer(rx_ring);
-#endif
}
/**
@@ -5314,12 +5311,10 @@ static void igb_set_rx_mode(struct net_device *netdev)
E1000_RCTL_VFE);
wr32(E1000_RCTL, rctl);
-#if (PAGE_SIZE < 8192)
- if (!adapter->vfs_allocated_count) {
+ if (PAGE_SIZE < 8192 && !adapter->vfs_allocated_count) {
if (adapter->max_frame_size <= IGB_MAX_FRAME_BUILD_SKB)
rlpml = IGB_MAX_FRAME_BUILD_SKB;
}
-#endif
wr32(E1000_RLPML, rlpml);
/* In order to support SR-IOV and eventually VMDq it is necessary to set
@@ -5338,11 +5333,10 @@ static void igb_set_rx_mode(struct net_device *netdev)
/* enable Rx jumbo frames, restrict as needed to support build_skb */
vmolr &= ~E1000_VMOLR_RLPML_MASK;
-#if (PAGE_SIZE < 8192)
- if (adapter->max_frame_size <= IGB_MAX_FRAME_BUILD_SKB)
+ if (PAGE_SIZE < 8192 &&
+ adapter->max_frame_size <= IGB_MAX_FRAME_BUILD_SKB)
vmolr |= IGB_MAX_FRAME_BUILD_SKB;
else
-#endif
vmolr |= MAX_JUMBO_FRAME_SIZE;
vmolr |= E1000_VMOLR_LPE;
@@ -8435,17 +8429,17 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
if (!dev_page_is_reusable(page))
return false;
-#if (PAGE_SIZE < 8192)
- /* if we are only owner of page we can reuse it */
- if (unlikely((rx_buf_pgcnt - pagecnt_bias) > 1))
- return false;
-#else
+ if (PAGE_SIZE < 8192) {
+ /* if we are only owner of page we can reuse it */
+ if (unlikely((rx_buf_pgcnt - pagecnt_bias) > 1))
+ return false;
+ } else {
#define IGB_LAST_OFFSET \
(SKB_WITH_OVERHEAD(PAGE_SIZE) - IGB_RXBUFFER_2048)
- if (rx_buffer->page_offset > IGB_LAST_OFFSET)
- return false;
-#endif
+ if (rx_buffer->page_offset > IGB_LAST_OFFSET)
+ return false;
+ }
/* If we have drained the page fragment pool we need to update
* the pagecnt_bias and page count so that we fully restock the
@@ -8473,20 +8467,22 @@ static void igb_add_rx_frag(struct igb_ring *rx_ring,
struct sk_buff *skb,
unsigned int size)
{
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = igb_rx_pg_size(rx_ring) / 2;
-#else
- unsigned int truesize = ring_uses_build_skb(rx_ring) ?
+ unsigned int truesize;
+
+ if (PAGE_SIZE < 8192)
+ truesize = igb_rx_pg_size(rx_ring) / 2;
+ else
+ truesize = ring_uses_build_skb(rx_ring) ?
SKB_DATA_ALIGN(IGB_SKB_PAD + size) :
SKB_DATA_ALIGN(size);
-#endif
+
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
rx_buffer->page_offset, size, truesize);
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
+
+ if (PAGE_SIZE < 8192)
+ rx_buffer->page_offset ^= truesize;
+ else
+ rx_buffer->page_offset += truesize;
}
static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
@@ -8494,16 +8490,16 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
struct xdp_buff *xdp,
ktime_t timestamp)
{
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = igb_rx_pg_size(rx_ring) / 2;
-#else
- unsigned int truesize = SKB_DATA_ALIGN(xdp->data_end -
- xdp->data_hard_start);
-#endif
unsigned int size = xdp->data_end - xdp->data;
+ unsigned int truesize;
unsigned int headlen;
struct sk_buff *skb;
+ if (PAGE_SIZE < 8192)
+ truesize = igb_rx_pg_size(rx_ring) / 2;
+ else
+ truesize = SKB_DATA_ALIGN(xdp->data_end - xdp->data_hard_start);
+
/* prefetch first cache line of first page */
net_prefetch(xdp->data);
@@ -8529,11 +8525,10 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
skb_add_rx_frag(skb, 0, rx_buffer->page,
(xdp->data + headlen) - page_address(rx_buffer->page),
size, truesize);
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
+ if (PAGE_SIZE < 8192)
+ rx_buffer->page_offset ^= truesize;
+ else
+ rx_buffer->page_offset += truesize;
} else {
rx_buffer->pagecnt_bias++;
}
@@ -8546,16 +8541,17 @@ static struct sk_buff *igb_build_skb(struct igb_ring *rx_ring,
struct xdp_buff *xdp,
ktime_t timestamp)
{
-#if (PAGE_SIZE < 8192)
- unsigned int truesize = igb_rx_pg_size(rx_ring) / 2;
-#else
- unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
- SKB_DATA_ALIGN(xdp->data_end -
- xdp->data_hard_start);
-#endif
unsigned int metasize = xdp->data - xdp->data_meta;
+ unsigned int truesize;
struct sk_buff *skb;
+ if (PAGE_SIZE < 8192)
+ truesize = igb_rx_pg_size(rx_ring) / 2;
+ else
+ truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
+ SKB_DATA_ALIGN(xdp->data_end -
+ xdp->data_hard_start);
+
/* prefetch first cache line of first page */
net_prefetch(xdp->data_meta);
@@ -8575,11 +8571,10 @@ static struct sk_buff *igb_build_skb(struct igb_ring *rx_ring,
skb_hwtstamps(skb)->hwtstamp = timestamp;
/* update buffer offset */
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
+ if (PAGE_SIZE < 8192)
+ rx_buffer->page_offset ^= truesize;
+ else
+ rx_buffer->page_offset += truesize;
return skb;
}
@@ -8634,14 +8629,14 @@ static unsigned int igb_rx_frame_truesize(struct igb_ring *rx_ring,
{
unsigned int truesize;
-#if (PAGE_SIZE < 8192)
- truesize = igb_rx_pg_size(rx_ring) / 2; /* Must be power-of-2 */
-#else
- truesize = ring_uses_build_skb(rx_ring) ?
- SKB_DATA_ALIGN(IGB_SKB_PAD + size) +
- SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
- SKB_DATA_ALIGN(size);
-#endif
+ if (PAGE_SIZE < 8192)
+ truesize = igb_rx_pg_size(rx_ring) / 2; /* Must be power-of-2 */
+ else
+ truesize = ring_uses_build_skb(rx_ring) ?
+ SKB_DATA_ALIGN(IGB_SKB_PAD + size) +
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
+ SKB_DATA_ALIGN(size);
+
return truesize;
}
@@ -8650,11 +8645,11 @@ static void igb_rx_buffer_flip(struct igb_ring *rx_ring,
unsigned int size)
{
unsigned int truesize = igb_rx_frame_truesize(rx_ring, size);
-#if (PAGE_SIZE < 8192)
- rx_buffer->page_offset ^= truesize;
-#else
- rx_buffer->page_offset += truesize;
-#endif
+
+ if (PAGE_SIZE < 8192)
+ rx_buffer->page_offset ^= truesize;
+ else
+ rx_buffer->page_offset += truesize;
}
static inline void igb_rx_checksum(struct igb_ring *ring,
@@ -8825,12 +8820,12 @@ static struct igb_rx_buffer *igb_get_rx_buffer(struct igb_ring *rx_ring,
struct igb_rx_buffer *rx_buffer;
rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
- *rx_buf_pgcnt =
-#if (PAGE_SIZE < 8192)
- page_count(rx_buffer->page);
-#else
- 0;
-#endif
+
+ if (PAGE_SIZE < 8192)
+ *rx_buf_pgcnt = page_count(rx_buffer->page);
+ else
+ *rx_buf_pgcnt = 0;
+
prefetchw(rx_buffer->page);
/* we are reusing so sync this buffer for CPU use */
@@ -8881,9 +8876,8 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
int rx_buf_pgcnt;
/* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
-#if (PAGE_SIZE < 8192)
- frame_sz = igb_rx_frame_truesize(rx_ring, 0);
-#endif
+ if (PAGE_SIZE < 8192)
+ frame_sz = igb_rx_frame_truesize(rx_ring, 0);
xdp_init_buff(&xdp, frame_sz, &rx_ring->xdp_rxq);
while (likely(total_packets < budget)) {
@@ -8932,10 +8926,9 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
xdp_prepare_buff(&xdp, hard_start, offset, size, true);
xdp_buff_clear_frags_flag(&xdp);
-#if (PAGE_SIZE > 4096)
/* At larger PAGE_SIZE, frame_sz depend on len size */
- xdp.frame_sz = igb_rx_frame_truesize(rx_ring, size);
-#endif
+ if (PAGE_SIZE > 4096)
+ xdp.frame_sz = igb_rx_frame_truesize(rx_ring, size);
skb = igb_run_xdp(adapter, rx_ring, &xdp);
}
--
2.43.0
* [RFC PATCH v1 30/57] drivers/base: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (27 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 29/57] net: igb: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:45 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 31/57] edac: " Ryan Roberts
` (29 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Yury Norov
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Update BUILD_BUG_ON() to test against page size limits.
CPUMAP_FILE_MAX_BYTES and CPULIST_FILE_MAX_BYTES are both defined
relative to PAGE_SIZE, so when these values are assigned to global
variables via BIN_ATTR_RO(), let's wrap them with
DEFINE_GLOBAL_PAGE_SIZE_VAR() so that their assignment can be deferred
until boot-time.
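The general idea of deferring such an assignment can be sketched in
userspace as follows. This is only an illustration under assumptions: how
DEFINE_GLOBAL_PAGE_SIZE_VAR() achieves the deferral is defined earlier in
the series and is not reproduced here, and demo_page_size,
DEMO_NR_CPUS, struct demo_bin_attr and demo_boot_init() are all
hypothetical names.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 4096	/* hypothetical CPU count */

/* Hypothetical stand-in for a PAGE_SIZE chosen at boot. */
static size_t demo_page_size;

/* Same shape as CPULIST_FILE_MAX_BYTES, using the stand-in page size. */
#define DEMO_CPULIST_FILE_MAX_BYTES \
	(((DEMO_NR_CPUS * 7) / 2 > demo_page_size) ? \
	 (DEMO_NR_CPUS * 7) / 2 : demo_page_size)

/* Simplified stand-in for struct bin_attribute. */
struct demo_bin_attr {
	const char *name;
	size_t size;
};

/*
 * A static initializer must be a constant expression, so
 * ".size = DEMO_CPULIST_FILE_MAX_BYTES" would not compile here.
 * Only the constant part is initialized statically.
 */
static struct demo_bin_attr demo_cpulist_attr = { .name = "cpulist" };

/* The page-size dependent part is assigned once the size is known. */
static void demo_boot_init(size_t page_size)
{
	demo_page_size = page_size;
	demo_cpulist_attr.size = DEMO_CPULIST_FILE_MAX_BYTES;
}
```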
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/base/node.c | 6 +++---
drivers/base/topology.c | 32 ++++++++++++++++----------------
include/linux/cpumask.h | 5 +++++
3 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e62..30e6549e4c438 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
return n;
}
-static BIN_ATTR_RO(cpumap, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(cpumap, CPUMAP_FILE_MAX_BYTES);
static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
struct bin_attribute *attr, char *buf,
@@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
return n;
}
-static BIN_ATTR_RO(cpulist, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(cpulist, CPULIST_FILE_MAX_BYTES);
/**
* struct node_access_nodes - Access class device to hold user visible
@@ -558,7 +558,7 @@ static ssize_t node_read_distance(struct device *dev,
* buf is currently PAGE_SIZE in length and each node needs 4 chars
* at the most (distance + space or newline).
*/
- BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE);
+ BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE_MIN);
for_each_online_node(i) {
len += sysfs_emit_at(buf, len, "%s%d",
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 89f98be5c5b99..bdbdbefd95b15 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -62,47 +62,47 @@ define_id_show_func(ppin, "0x%llx");
static DEVICE_ATTR_ADMIN_RO(ppin);
define_siblings_read_func(thread_siblings, sibling_cpumask);
-static BIN_ATTR_RO(thread_siblings, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(thread_siblings_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(thread_siblings, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(thread_siblings_list, CPULIST_FILE_MAX_BYTES);
define_siblings_read_func(core_cpus, sibling_cpumask);
-static BIN_ATTR_RO(core_cpus, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(core_cpus_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(core_cpus, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(core_cpus_list, CPULIST_FILE_MAX_BYTES);
define_siblings_read_func(core_siblings, core_cpumask);
-static BIN_ATTR_RO(core_siblings, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(core_siblings_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(core_siblings, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(core_siblings_list, CPULIST_FILE_MAX_BYTES);
#ifdef TOPOLOGY_CLUSTER_SYSFS
define_siblings_read_func(cluster_cpus, cluster_cpumask);
-static BIN_ATTR_RO(cluster_cpus, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(cluster_cpus_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(cluster_cpus, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(cluster_cpus_list, CPULIST_FILE_MAX_BYTES);
#endif
#ifdef TOPOLOGY_DIE_SYSFS
define_siblings_read_func(die_cpus, die_cpumask);
-static BIN_ATTR_RO(die_cpus, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(die_cpus_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(die_cpus, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(die_cpus_list, CPULIST_FILE_MAX_BYTES);
#endif
define_siblings_read_func(package_cpus, core_cpumask);
-static BIN_ATTR_RO(package_cpus, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(package_cpus_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(package_cpus, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(package_cpus_list, CPULIST_FILE_MAX_BYTES);
#ifdef TOPOLOGY_BOOK_SYSFS
define_id_show_func(book_id, "%d");
static DEVICE_ATTR_RO(book_id);
define_siblings_read_func(book_siblings, book_cpumask);
-static BIN_ATTR_RO(book_siblings, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(book_siblings_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(book_siblings, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(book_siblings_list, CPULIST_FILE_MAX_BYTES);
#endif
#ifdef TOPOLOGY_DRAWER_SYSFS
define_id_show_func(drawer_id, "%d");
static DEVICE_ATTR_RO(drawer_id);
define_siblings_read_func(drawer_siblings, drawer_cpumask);
-static BIN_ATTR_RO(drawer_siblings, CPUMAP_FILE_MAX_BYTES);
-static BIN_ATTR_RO(drawer_siblings_list, CPULIST_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(drawer_siblings, CPUMAP_FILE_MAX_BYTES);
+static CPU_FILE_BIN_ATTR_RO(drawer_siblings_list, CPULIST_FILE_MAX_BYTES);
#endif
static struct bin_attribute *bin_attrs[] = {
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 53158de44b837..f654b4198abc2 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -1292,4 +1292,9 @@ cpumap_print_list_to_buf(char *buf, const struct cpumask *mask,
? (NR_CPUS * 9)/32 - 1 : PAGE_SIZE)
#define CPULIST_FILE_MAX_BYTES (((NR_CPUS * 7)/2 > PAGE_SIZE) ? (NR_CPUS * 7)/2 : PAGE_SIZE)
+#define CPU_FILE_BIN_ATTR_RO(_name, _size) \
+ DEFINE_GLOBAL_PAGE_SIZE_VAR(struct bin_attribute, \
+ bin_attr_##_name, \
+ __BIN_ATTR_RO(_name, _size))
+
#endif /* __LINUX_CPUMASK_H */
--
2.43.0
* [RFC PATCH v1 31/57] edac: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (28 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 30/57] drivers/base: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:46 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 32/57] optee: " Ryan Roberts
` (28 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-edac, linux-kernel,
linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Convert PAGES_TO_MiB() and MiB_TO_PAGES() to use the ternary operator so
that they continue to work with boot-time page size. Boot-time page size
can't be used with CPP because its value is not known at compile time.
For compile-time page size builds, the compiler will strip the dead
branch for the same result.
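A small userspace sketch of the ternary form, assuming a page shift that
is an ordinary variable (demo_page_shift and the DEMO_ prefixed names are
hypothetical; the macro bodies follow the ones in this patch):

```c
/* Hypothetical stand-in for a PAGE_SHIFT that is only known at boot. */
static unsigned int demo_page_shift = 12;
#define DEMO_PAGE_SHIFT demo_page_shift

/* Only the selected arm of the ternary is evaluated at run time. */
#define DEMO_PAGES_TO_MiB(pages) (DEMO_PAGE_SHIFT < 20 ?		\
				  ((pages) >> (20 - DEMO_PAGE_SHIFT)) :	\
				  ((pages) << (DEMO_PAGE_SHIFT - 20)))
#define DEMO_MiB_TO_PAGES(mb)	 (DEMO_PAGE_SHIFT < 20 ?		\
				  ((mb) << (20 - DEMO_PAGE_SHIFT)) :	\
				  ((mb) >> (DEMO_PAGE_SHIFT - 20)))

static unsigned long demo_pages_to_mib(unsigned long pages)
{
	return DEMO_PAGES_TO_MiB(pages);
}

static unsigned long demo_mib_to_pages(unsigned long mb)
{
	return DEMO_MiB_TO_PAGES(mb);
}
```

With a 4K page (shift 12), 512 pages are 2 MiB; with a 64K page (shift
16), the same 512 pages are 32 MiB.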
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/edac/edac_mc.h | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 881b00eadf7a5..22132ee86e953 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -37,13 +37,12 @@
#include <linux/workqueue.h>
#include <linux/edac.h>
-#if PAGE_SHIFT < 20
-#define PAGES_TO_MiB(pages) ((pages) >> (20 - PAGE_SHIFT))
-#define MiB_TO_PAGES(mb) ((mb) << (20 - PAGE_SHIFT))
-#else /* PAGE_SHIFT > 20 */
-#define PAGES_TO_MiB(pages) ((pages) << (PAGE_SHIFT - 20))
-#define MiB_TO_PAGES(mb) ((mb) >> (PAGE_SHIFT - 20))
-#endif
+#define PAGES_TO_MiB(pages) (PAGE_SHIFT < 20 ? \
+ ((pages) >> (20 - PAGE_SHIFT)) :\
+ ((pages) << (PAGE_SHIFT - 20)))
+#define MiB_TO_PAGES(mb) (PAGE_SHIFT < 20 ? \
+ ((mb) << (20 - PAGE_SHIFT)) : \
+ ((mb) >> (PAGE_SHIFT - 20)))
#define edac_printk(level, prefix, fmt, arg...) \
printk(level "EDAC " prefix ": " fmt, ##arg)
--
2.43.0
* [RFC PATCH v1 32/57] optee: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (29 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 31/57] edac: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 33/57] random: " Ryan Roberts
` (27 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Jens Wiklander,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, op-tee
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Update BUILD_BUG_ON() to test against the page size limit.
Refactor "struct optee_shm_arg_entry" to use a flexible array member
for "map", since its length depends on PAGE_SIZE.
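The flexible array member pattern can be sketched in userspace like
this. It is a simplified illustration: struct demo_arg_entry and the
DEMO_ macros are hypothetical, calloc() stands in for kzalloc(), and the
overflow checking that the kernel's struct_size() provides is omitted.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n)	\
	(((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

struct demo_arg_entry {
	void *shm;
	unsigned long map[];	/* length decided at allocation time */
};

/* Header plus enough longs to hold an nbits-wide bitmap. */
static size_t demo_entry_size(size_t nbits)
{
	return sizeof(struct demo_arg_entry) +
	       DEMO_BITS_TO_LONGS(nbits) * sizeof(unsigned long);
}

/* Zeroed allocation; the caller releases it with free(). */
static struct demo_arg_entry *demo_entry_alloc(size_t nbits)
{
	return calloc(1, demo_entry_size(nbits));
}
```

Because the bitmap length is computed when the entry is allocated, it
can follow a page size that is only known at run time.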
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/tee/optee/call.c | 7 +++++--
drivers/tee/optee/smc_abi.c | 2 +-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
index 16eb953e14bb6..41bd7ace6606e 100644
--- a/drivers/tee/optee/call.c
+++ b/drivers/tee/optee/call.c
@@ -36,7 +36,7 @@
struct optee_shm_arg_entry {
struct list_head list_node;
struct tee_shm *shm;
- DECLARE_BITMAP(map, MAX_ARG_COUNT_PER_ENTRY);
+ unsigned long map[];
};
void optee_cq_init(struct optee_call_queue *cq, int thread_count)
@@ -271,6 +271,7 @@ struct optee_msg_arg *optee_get_msg_arg(struct tee_context *ctx,
struct optee_shm_arg_entry *entry;
struct optee_msg_arg *ma;
size_t args_per_entry;
+ size_t entry_sz;
u_long bit;
u_int offs;
void *res;
@@ -293,7 +294,9 @@ struct optee_msg_arg *optee_get_msg_arg(struct tee_context *ctx,
/*
* No entry was found, let's allocate a new.
*/
- entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ entry_sz = struct_size(entry, map,
+ BITS_TO_LONGS(MAX_ARG_COUNT_PER_ENTRY));
+ entry = kzalloc(entry_sz, GFP_KERNEL);
if (!entry) {
res = ERR_PTR(-ENOMEM);
goto out;
diff --git a/drivers/tee/optee/smc_abi.c b/drivers/tee/optee/smc_abi.c
index 844285d4f03c1..005689380d848 100644
--- a/drivers/tee/optee/smc_abi.c
+++ b/drivers/tee/optee/smc_abi.c
@@ -418,7 +418,7 @@ static void optee_fill_pages_list(u64 *dst, struct page **pages, int num_pages,
* code heavily relies on this assumption, so it is better be
* safe than sorry.
*/
- BUILD_BUG_ON(PAGE_SIZE < OPTEE_MSG_NONCONTIG_PAGE_SIZE);
+ BUILD_BUG_ON(PAGE_SIZE_MIN < OPTEE_MSG_NONCONTIG_PAGE_SIZE);
pages_data = (void *)dst;
/*
--
2.43.0
* [RFC PATCH v1 33/57] random: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (30 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 32/57] optee: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 34/57] sata_sil24: " Ryan Roberts
` (26 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Jason A. Donenfeld, Theodore Ts'o, Andrew Morton,
Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Update BUILD_BUG_ON()s to test against page size limits.
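One way to see why checking the limit is sufficient for a divisibility
test: page sizes are powers of two, so anything that divides the
smallest supported page size also divides every larger one. A userspace
sketch, assuming limits of 4K and 64K as on arm64 and a hypothetical
64-byte block (the DEMO_ names are illustrative, and _Static_assert
stands in for BUILD_BUG_ON()):

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE_MIN	4096UL
#define DEMO_PAGE_SIZE_MAX	65536UL
#define DEMO_BLOCK_SIZE		64UL	/* hypothetical block size */

/* Compile-time check against the limit only. */
_Static_assert(DEMO_PAGE_SIZE_MIN % DEMO_BLOCK_SIZE == 0,
	       "block size must divide the smallest page size");

/*
 * Run-time cross-check over every power of two between the limits
 * (a superset of the page sizes arm64 supports). Returns 1 if block
 * divides all of them, 0 otherwise.
 */
static int demo_divides_all_page_sizes(size_t block)
{
	size_t page_size;

	for (page_size = DEMO_PAGE_SIZE_MIN;
	     page_size <= DEMO_PAGE_SIZE_MAX; page_size <<= 1) {
		if (page_size % block != 0)
			return 0;
	}

	return 1;
}
```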
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/char/random.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 87fe61295ea1f..49d6c4ef16df4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -466,7 +466,7 @@ static ssize_t get_random_bytes_user(struct iov_iter *iter)
if (!iov_iter_count(iter) || copied != sizeof(block))
break;
- BUILD_BUG_ON(PAGE_SIZE % sizeof(block) != 0);
+ BUILD_BUG_ON(PAGE_SIZE_MIN % sizeof(block) != 0);
if (ret % PAGE_SIZE == 0) {
if (signal_pending(current))
break;
@@ -1428,7 +1428,7 @@ static ssize_t write_pool_user(struct iov_iter *iter)
if (!iov_iter_count(iter) || copied != sizeof(block))
break;
- BUILD_BUG_ON(PAGE_SIZE % sizeof(block) != 0);
+ BUILD_BUG_ON(PAGE_SIZE_MIN % sizeof(block) != 0);
if (ret % PAGE_SIZE == 0) {
if (signal_pending(current))
break;
--
2.43.0
* [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (31 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 33/57] random: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-17 9:09 ` Niklas Cassel
2024-10-14 10:58 ` [RFC PATCH v1 35/57] virtio: " Ryan Roberts
` (25 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Niklas Cassel, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-ide, linux-kernel, linux-mm
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Convert "struct sil24_ata_block" and "struct sil24_atapi_block" to use a
flexible array member for their sge[] array. The previous static size of
SIL24_MAX_SGE depends on PAGE_SIZE so doesn't work for boot-time page
size.
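The sizing arithmetic can be checked with a userspace sketch. The struct
layouts below are assumptions made for illustration (a 32-byte PRB and a
16-byte SGE), not copies of the driver's definitions, and all demo_
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts: 32-byte PRB, 16-byte SGE. */
struct demo_prb {
	uint16_t ctrl;
	uint16_t prot;
	uint32_t rx_cnt;
	uint8_t fis[24];
};

struct demo_sge {
	uint64_t addr;
	uint32_t cnt;
	uint32_t flags;
};

struct demo_ata_block {
	struct demo_prb prb;
	struct demo_sge sge[];
};

struct demo_atapi_block {
	struct demo_prb prb;
	uint8_t cdb[16];
	struct demo_sge sge[];
};

/* Same formula as SIL24_MAX_SGE, with the page size as a parameter. */
static size_t demo_max_sge(size_t page_size)
{
	size_t prb_sz = sizeof(struct demo_prb) + 2 * sizeof(struct demo_sge);
	size_t max_sgt = (page_size - prb_sz) / (4 * sizeof(struct demo_sge));

	return 4 * max_sgt + 1;
}

/* Total size of each block once sge[] holds demo_max_sge() entries. */
static size_t demo_atapi_block_size(size_t page_size)
{
	return sizeof(struct demo_atapi_block) +
	       demo_max_sge(page_size) * sizeof(struct demo_sge);
}

static size_t demo_ata_block_size(size_t page_size)
{
	return sizeof(struct demo_ata_block) +
	       demo_max_sge(page_size) * sizeof(struct demo_sge);
}
```

Under these assumed layouts the ATAPI block fills the page exactly and
the ATA block is one SGE smaller, which is the shape of the two BUG_ON()
checks in sil24_init_one().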
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/ata/sata_sil24.c | 46 +++++++++++++++++++---------------------
1 file changed, 22 insertions(+), 24 deletions(-)
diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index 72c03cbdaff43..85c6382976626 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -42,26 +42,25 @@ struct sil24_sge {
__le32 flags;
};
+/*
+ * sil24 fetches in chunks of 64bytes. The first block
+ * contains the PRB and two SGEs. From the second block, it's
+ * consisted of four SGEs and called SGT. Calculate the
+ * number of SGTs that fit into one page.
+ */
+#define SIL24_PRB_SZ (sizeof(struct sil24_prb) + 2 * sizeof(struct sil24_sge))
+#define SIL24_MAX_SGT ((PAGE_SIZE - SIL24_PRB_SZ) / (4 * sizeof(struct sil24_sge)))
+
+/*
+ * This will give us one unused SGEs for ATA. This extra SGE
+ * will be used to store CDB for ATAPI devices.
+ */
+#define SIL24_MAX_SGE (4 * SIL24_MAX_SGT + 1)
enum {
SIL24_HOST_BAR = 0,
SIL24_PORT_BAR = 2,
- /* sil24 fetches in chunks of 64bytes. The first block
- * contains the PRB and two SGEs. From the second block, it's
- * consisted of four SGEs and called SGT. Calculate the
- * number of SGTs that fit into one page.
- */
- SIL24_PRB_SZ = sizeof(struct sil24_prb)
- + 2 * sizeof(struct sil24_sge),
- SIL24_MAX_SGT = (PAGE_SIZE - SIL24_PRB_SZ)
- / (4 * sizeof(struct sil24_sge)),
-
- /* This will give us one unused SGEs for ATA. This extra SGE
- * will be used to store CDB for ATAPI devices.
- */
- SIL24_MAX_SGE = 4 * SIL24_MAX_SGT + 1,
-
/*
* Global controller registers (128 bytes @ BAR0)
*/
@@ -244,13 +243,13 @@ enum {
struct sil24_ata_block {
struct sil24_prb prb;
- struct sil24_sge sge[SIL24_MAX_SGE];
+ struct sil24_sge sge[];
};
struct sil24_atapi_block {
struct sil24_prb prb;
u8 cdb[16];
- struct sil24_sge sge[SIL24_MAX_SGE];
+ struct sil24_sge sge[];
};
union sil24_cmd_block {
@@ -373,7 +372,7 @@ static struct pci_driver sil24_pci_driver = {
#endif
};
-static const struct scsi_host_template sil24_sht = {
+static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct scsi_host_template, sil24_sht, {
__ATA_BASE_SHT(DRV_NAME),
.can_queue = SIL24_MAX_CMDS,
.sg_tablesize = SIL24_MAX_SGE,
@@ -382,7 +381,7 @@ static const struct scsi_host_template sil24_sht = {
.sdev_groups = ata_ncq_sdev_groups,
.change_queue_depth = ata_scsi_change_queue_depth,
.device_configure = ata_scsi_device_configure
-};
+});
static struct ata_port_operations sil24_ops = {
.inherits = &sata_pmp_port_ops,
@@ -1193,7 +1192,7 @@ static int sil24_port_start(struct ata_port *ap)
struct device *dev = ap->host->dev;
struct sil24_port_priv *pp;
union sil24_cmd_block *cb;
- size_t cb_size = sizeof(*cb) * SIL24_MAX_CMDS;
+ size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
dma_addr_t cb_dma;
pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
@@ -1258,7 +1257,6 @@ static void sil24_init_controller(struct ata_host *host)
static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
{
- extern int __MARKER__sil24_cmd_block_is_sized_wrongly;
struct ata_port_info pi = sil24_port_info[ent->driver_data];
const struct ata_port_info *ppi[] = { &pi, NULL };
void __iomem * const *iomap;
@@ -1266,9 +1264,9 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
int rc;
u32 tmp;
- /* cause link error if sil24_cmd_block is sized wrongly */
- if (sizeof(union sil24_cmd_block) != PAGE_SIZE)
- __MARKER__sil24_cmd_block_is_sized_wrongly = 1;
+ /* union sil24_cmd_block must be PAGE_SIZE */
+ BUG_ON(struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE) != PAGE_SIZE);
+ BUG_ON(struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE) > PAGE_SIZE);
ata_print_version_once(&pdev->dev, DRV_VERSION);
--
2.43.0
* [RFC PATCH v1 35/57] virtio: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (32 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 34/57] sata_sil24: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 36/57] xen: " Ryan Roberts
` (24 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Michael S. Tsirkin, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand,
Dominique Martinet, Eric Van Hensbergen, Greg Marsden,
Ivan Ivanov, Jason Wang, Jens Axboe, Kalesh Singh,
Latchesar Ionkov, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-block, linux-kernel,
linux-mm, v9fs, virtualization
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Update multiple BUILD_BUG_ON() instances to test against the minimum or
maximum supported page size, as appropriate.
Wrap global variables that are initialized with PAGE_SIZE derived values
using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
deferred for boot-time page size builds.
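As a rough userspace illustration of the BUILD_BUG_ON() change (a sketch,
not code from this patch): once the page size is only known at boot, a
compile-time assertion has to be written against a compile-time bound.
PAGE_SIZE_MIN and PAGE_SIZE_MAX are hard-coded here to the arm64 4K and 64K
limits, and page_size, ID_BYTES, PFN_SHIFT and ARRAY_PFNS_MAX are stand-in
names and values assumed for the example.

```c
#include <assert.h>

/* Assumed bounds for illustration: arm64 supports 4K, 16K and 64K pages. */
#define PAGE_SIZE_MIN	4096UL
#define PAGE_SIZE_MAX	65536UL

/* Stand-ins for driver constants; the values are illustrative only. */
#define ID_BYTES	20UL
#define PFN_SHIFT	12
#define ARRAY_PFNS_MAX	256UL

/* With boot-time selection the page size is a variable, not a constant. */
unsigned long page_size = PAGE_SIZE_MIN;

/* A lower bound on the page size must hold for the smallest page size. */
static_assert(PAGE_SIZE_MIN >= ID_BYTES,
	      "buffer too small at minimum page size");

/* An upper bound on a derived count must hold for the largest page size. */
static_assert((PAGE_SIZE_MAX >> PFN_SHIFT) <= ARRAY_PFNS_MAX,
	      "too many pfns per page at maximum page size");

/* The run-time value is still computed from the selected page size. */
unsigned int pages_per_page(void)
{
	return (unsigned int)(page_size >> PFN_SHIFT);
}
```

The point of the split is that the assertions stay free at run time, while
only the derived value depends on the page size chosen at boot.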
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/block/virtio_blk.c | 2 +-
drivers/virtio/virtio_balloon.c | 10 ++++++----
net/9p/trans_virtio.c | 4 ++--
3 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 194417abc1053..8a8960b609bc9 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -899,7 +899,7 @@ static ssize_t serial_show(struct device *dev,
int err;
/* sysfs gives us a PAGE_SIZE buffer */
- BUILD_BUG_ON(PAGE_SIZE < VIRTIO_BLK_ID_BYTES);
+ BUILD_BUG_ON(PAGE_SIZE_MIN < VIRTIO_BLK_ID_BYTES);
buf[VIRTIO_BLK_ID_BYTES] = '\0';
err = virtblk_get_id(disk, buf);
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 54469277ca303..3818d894bd212 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -25,6 +25,7 @@
* page units.
*/
#define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned int)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
+#define VIRTIO_BALLOON_PAGES_PER_PAGE_MAX (unsigned int)(PAGE_SIZE_MAX >> VIRTIO_BALLOON_PFN_SHIFT)
#define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
/* Maximum number of (4k) pages to deflate on OOM notifications. */
#define VIRTIO_BALLOON_OOM_NR_PAGES 256
@@ -138,7 +139,7 @@ static u32 page_to_balloon_pfn(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
- BUILD_BUG_ON(PAGE_SHIFT < VIRTIO_BALLOON_PFN_SHIFT);
+ BUILD_BUG_ON(PAGE_SHIFT_MIN < VIRTIO_BALLOON_PFN_SHIFT);
/* Convert pfn from Linux page size to balloon page size. */
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
}
@@ -228,7 +229,7 @@ static void set_page_pfns(struct virtio_balloon *vb,
{
unsigned int i;
- BUILD_BUG_ON(VIRTIO_BALLOON_PAGES_PER_PAGE > VIRTIO_BALLOON_ARRAY_PFNS_MAX);
+ BUILD_BUG_ON(VIRTIO_BALLOON_PAGES_PER_PAGE_MAX > VIRTIO_BALLOON_ARRAY_PFNS_MAX);
/*
* Set balloon pfns pointing at this page.
@@ -1042,8 +1043,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
* host's base page size. However, it needs more work to report
* that value. The hard-coded order would be fine currently.
*/
-#if defined(CONFIG_ARM64) && defined(CONFIG_ARM64_64K_PAGES)
- vb->pr_dev_info.order = 5;
+#if defined(CONFIG_ARM64)
+ if (PAGE_SIZE == SZ_64K)
+ vb->pr_dev_info.order = 5;
#endif
err = page_reporting_register(&vb->pr_dev_info);
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 0b8086f58ad55..25b8253011cec 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -786,7 +786,7 @@ static struct virtio_driver p9_virtio_drv = {
.remove = p9_virtio_remove,
};
-static struct p9_trans_module p9_virtio_trans = {
+static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct p9_trans_module, p9_virtio_trans, {
.name = "virtio",
.create = p9_virtio_create,
.close = p9_virtio_close,
@@ -804,7 +804,7 @@ static struct p9_trans_module p9_virtio_trans = {
.pooled_rbuffers = false,
.def = 1,
.owner = THIS_MODULE,
-};
+});
/* The standard init function */
static int __init p9_virtio_init(void)
--
2.43.0
* [RFC PATCH v1 36/57] xen: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (33 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 35/57] virtio: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-16 14:46 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 37/57] arm64: Fix macros to work in C code in addition to the linker script Ryan Roberts
` (23 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, xen-devel
To prepare for supporting boot-time page size selection, refactor code
to remove assumptions about PAGE_SIZE being a compile-time constant. The
code is intended to be equivalent when compile-time page size is active.
Allocate enough "frame_list" static storage in the balloon driver for
the maximum supported page size. However, continue to use only the first
PAGE_SIZE bytes of the buffer at run time to maintain existing behaviour.
Refactor xen_biovec_phys_mergeable() to convert ifdeffery to C if/else.
For compile-time page size, the compiler will choose one branch and
strip the dead one. For boot-time, it can be evaluated at run time.
Refactor a BUILD_BUG_ON to evaluate the limit (when the minimum
supported page size is selected at boot-time).
Reserve enough storage for max page size in "struct remap_data" and
"struct xenbus_map_node".
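The two patterns described above can be sketched in plain userspace C. This
is an illustration under assumptions, not the driver code: page_size is an
ordinary variable standing in for the boot-time selected PAGE_SIZE, and the
helper names (frame_list_capacity, frame_list_nr_entries, clamp_batch,
sizes_match) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_MAX	65536UL
#define XEN_PAGE_SIZE	4096UL

typedef unsigned long xen_pfn_t;

/* Selected at boot in the real series; a plain variable in this sketch. */
unsigned long page_size = 4096UL;

/* Static storage is reserved for the largest supported page size... */
static xen_pfn_t frame_list[PAGE_SIZE_MAX / sizeof(xen_pfn_t)];

size_t frame_list_capacity(void)
{
	return sizeof(frame_list) / sizeof(frame_list[0]);
}

/* ...but only the first page_size bytes' worth of entries are used. */
size_t frame_list_nr_entries(void)
{
	return page_size / sizeof(xen_pfn_t);
}

/* Clamp a batch to the usable part of frame_list. */
unsigned long clamp_batch(unsigned long nr_pages)
{
	size_t max = frame_list_nr_entries();

	return nr_pages > max ? max : nr_pages;
}

/*
 * An "#if XEN_PAGE_SIZE == PAGE_SIZE" test becomes a C branch. With a
 * constant page size the compiler can drop the dead arm; with a variable
 * page size the comparison is made at run time.
 */
bool sizes_match(void)
{
	if (XEN_PAGE_SIZE == page_size)
		return true;

	return false;
}
```

Clamping to the run-time entry count rather than to the array capacity is
what keeps batch sizes identical to a kernel built for that page size.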
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
drivers/xen/balloon.c | 11 ++++++-----
drivers/xen/biomerge.c | 12 ++++++------
drivers/xen/privcmd.c | 2 +-
drivers/xen/xenbus/xenbus_client.c | 5 +++--
drivers/xen/xlate_mmu.c | 6 +++---
include/xen/page.h | 2 ++
6 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 528395133b4f8..0ed5f6453af0e 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -131,7 +131,8 @@ struct balloon_stats balloon_stats;
EXPORT_SYMBOL_GPL(balloon_stats);
/* We increase/decrease in batches which fit in a page */
-static xen_pfn_t frame_list[PAGE_SIZE / sizeof(xen_pfn_t)];
+static xen_pfn_t frame_list[PAGE_SIZE_MAX / sizeof(xen_pfn_t)];
+#define FRAME_LIST_NR_ENTRIES (PAGE_SIZE / sizeof(xen_pfn_t))
/* List of ballooned pages, threaded through the mem_map array. */
@@ -389,8 +390,8 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
unsigned long i;
struct page *page;
- if (nr_pages > ARRAY_SIZE(frame_list))
- nr_pages = ARRAY_SIZE(frame_list);
+ if (nr_pages > FRAME_LIST_NR_ENTRIES)
+ nr_pages = FRAME_LIST_NR_ENTRIES;
page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
for (i = 0; i < nr_pages; i++) {
@@ -434,8 +435,8 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
int ret;
LIST_HEAD(pages);
- if (nr_pages > ARRAY_SIZE(frame_list))
- nr_pages = ARRAY_SIZE(frame_list);
+ if (nr_pages > FRAME_LIST_NR_ENTRIES)
+ nr_pages = FRAME_LIST_NR_ENTRIES;
for (i = 0; i < nr_pages; i++) {
page = alloc_page(gfp);
diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
index 05a286d24f148..28f0887e40026 100644
--- a/drivers/xen/biomerge.c
+++ b/drivers/xen/biomerge.c
@@ -8,16 +8,16 @@
bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
const struct page *page)
{
-#if XEN_PAGE_SIZE == PAGE_SIZE
- unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
- unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
+ if (XEN_PAGE_SIZE == PAGE_SIZE) {
+ unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
+ unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
+
+ return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
+ }
- return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
-#else
/*
* XXX: Add support for merging bio_vec when using different page
* size in Xen and Linux.
*/
return false;
-#endif
}
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 9563650dfbafc..847f7b806caf7 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -557,7 +557,7 @@ static long privcmd_ioctl_mmap_batch(
state.global_error = 0;
state.version = version;
- BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
+ BUILD_BUG_ON(((PAGE_SIZE_MIN / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE_MAX) != 0);
/* mmap_batch_fn guarantees ret == 0 */
BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
&pagelist, mmap_batch_fn, &state));
diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index 51b3124b0d56c..99bde836c10c4 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -49,9 +49,10 @@
#include "xenbus.h"
-#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
+#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
+#define XENBUS_PAGES_MAX(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE_MIN))
-#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES(XENBUS_MAX_RING_GRANTS))
+#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES_MAX(XENBUS_MAX_RING_GRANTS))
struct xenbus_map_node {
struct list_head next;
diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
index f17c4c03db30c..a757c801a7542 100644
--- a/drivers/xen/xlate_mmu.c
+++ b/drivers/xen/xlate_mmu.c
@@ -74,9 +74,9 @@ struct remap_data {
int mapped;
/* Hypercall parameters */
- int h_errs[XEN_PFN_PER_PAGE];
- xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
- xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
+ int h_errs[XEN_PFN_PER_PAGE_MAX];
+ xen_ulong_t h_idxs[XEN_PFN_PER_PAGE_MAX];
+ xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE_MAX];
int h_iter; /* Iterator */
};
diff --git a/include/xen/page.h b/include/xen/page.h
index 285677b42943a..86683a30038a3 100644
--- a/include/xen/page.h
+++ b/include/xen/page.h
@@ -21,6 +21,8 @@
((page_to_pfn(page)) << (PAGE_SHIFT - XEN_PAGE_SHIFT))
#define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
+#define XEN_PFN_PER_PAGE_MIN (PAGE_SIZE_MIN / XEN_PAGE_SIZE)
+#define XEN_PFN_PER_PAGE_MAX (PAGE_SIZE_MAX / XEN_PAGE_SIZE)
#define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
#define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
--
2.43.0
* [RFC PATCH v1 37/57] arm64: Fix macros to work in C code in addition to the linker script
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (34 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 36/57] xen: " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 38/57] arm64: Track early pgtable allocation limit Ryan Roberts
` (22 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Previously INIT_DIR_SIZE and INIT_IDMAP_DIR_SIZE used _end for the end
address of the kernel image. In the linker script context, this resolves
to an integer that refers to the link va of the end of the kernel image.
But in C code it resolves to a pointer to the end of the image as placed
in memory. So there are 2 problems: because it's a pointer, we can't do
arithmetic on it. And because the image may be in a different location
in memory than the one it was linked at, it is not correct to find the
image size by subtracting KIMAGE_VADDR.
So introduce KIMAGE_VADDR_END, which always represents the link va of
the end of the kernel image as an integer.
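A small userspace sketch of the arithmetic may help. It assumes a made-up
link base (KIMAGE_VADDR_EXAMPLE) and passes the run-time addresses of the
start and end of the image as plain integers; the function names are
hypothetical and only mirror the shape of the new macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical link-time base; the real KIMAGE_VADDR is config dependent. */
#define KIMAGE_VADDR_EXAMPLE	UINT64_C(0xffff800080000000)

/*
 * Link-time VA of the end of the image, given the run-time addresses of
 * its start and end. The image size (end - text) does not change under
 * relocation, so adding it to the link base gives the same answer
 * wherever the image was placed.
 */
uint64_t kimage_vaddr_end(uint64_t text, uint64_t end)
{
	return end - text + KIMAGE_VADDR_EXAMPLE;
}

/* Subtracting the link base from a run-time address breaks if relocated. */
uint64_t naive_image_size(uint64_t end)
{
	return end - KIMAGE_VADDR_EXAMPLE;
}
```

With an image of size S loaded at the link address plus some offset,
kimage_vaddr_end() still returns the link base plus S, whereas
naive_image_size() is off by that offset.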
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/kernel-pgtable.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index bf05a77873a49..1722b9217d47d 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -35,6 +35,8 @@
#define IDMAP_LEVELS ARM64_HW_PGTABLE_LEVELS(IDMAP_VA_BITS)
#define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)
+#define KIMAGE_VADDR_END (_AT(u64, _end) - _AT(u64, _text) + KIMAGE_VADDR)
+
/*
* A relocatable kernel may execute from an address that differs from the one at
* which it was linked. In the worst case, its runtime placement may intersect
@@ -56,10 +58,10 @@
+ EARLY_LEVEL(3, (lvls), (vstart), (vend), add) /* each entry needs a next level page table */ \
+ EARLY_LEVEL(2, (lvls), (vstart), (vend), add) /* each entry needs a next level page table */ \
+ EARLY_LEVEL(1, (lvls), (vstart), (vend), add))/* each entry needs a next level page table */
-#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(SWAPPER_PGTABLE_LEVELS, KIMAGE_VADDR, _end, EXTRA_PAGE) \
+#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(SWAPPER_PGTABLE_LEVELS, KIMAGE_VADDR, KIMAGE_VADDR_END, EXTRA_PAGE) \
+ EARLY_SEGMENT_EXTRA_PAGES))
-#define INIT_IDMAP_DIR_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, KIMAGE_VADDR, _end, 1))
+#define INIT_IDMAP_DIR_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, KIMAGE_VADDR, KIMAGE_VADDR_END, 1))
#define INIT_IDMAP_DIR_SIZE ((INIT_IDMAP_DIR_PAGES + EARLY_IDMAP_EXTRA_PAGES) * PAGE_SIZE)
#define INIT_IDMAP_FDT_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, 0UL, UL(MAX_FDT_SIZE), 1) - 1)
--
2.43.0
* [RFC PATCH v1 38/57] arm64: Track early pgtable allocation limit
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (35 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 37/57] arm64: Fix macros to work in C code in addition to the linker script Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 39/57] arm64: Introduce macros required for boot-time page selection Ryan Roberts
` (21 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Early pgtables (e.g. init_idmap_pg_dir, init_pg_dir, etc) are allocated
from statically defined memory blocks within the kernel image that are
sized for the calculated worst case requirements. Let's make the
allocator aware of the block's limit so that it can detect any overflow.
This boils down to passing the limit of the memory block to map_range()
so let's add it as a parameter. If an overflow is detected, report the
error to __early_cpu_boot_status and park the cpu.
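The check amounts to a bump allocator that knows where its block ends. A
minimal sketch, assuming a fixed 4K table size and a hypothetical
alloc_table() helper that returns false on overflow (the patch instead
reports CPU_STUCK_REASON_NO_PGTABLE_MEM and parks the cpu):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE	4096u	/* one page-table page in this sketch */

/*
 * Take one table from the block [*next, limit). Refuse to hand out
 * memory that would extend past the limit.
 */
bool alloc_table(uint64_t *next, uint64_t limit, uint64_t *table)
{
	if (*next + TABLE_SIZE > limit)
		return false;

	*table = *next;
	*next += TABLE_SIZE;
	return true;
}
```

Note that the limit is the address one past the end of the block, so a
table that ends exactly at the limit is still allowed.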
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/smp.h | 1 +
arch/arm64/kernel/head.S | 3 +-
arch/arm64/kernel/image-vars.h | 3 ++
arch/arm64/kernel/pi/map_kernel.c | 35 +++++++++++---------
arch/arm64/kernel/pi/map_range.c | 54 +++++++++++++++++++++++++------
arch/arm64/kernel/pi/pi.h | 4 +--
arch/arm64/mm/mmu.c | 14 +++++---
7 files changed, 81 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index 2510eec026f7e..86edc5f8c9673 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -22,6 +22,7 @@
#define CPU_STUCK_REASON_52_BIT_VA (UL(1) << CPU_STUCK_REASON_SHIFT)
#define CPU_STUCK_REASON_NO_GRAN (UL(2) << CPU_STUCK_REASON_SHIFT)
+#define CPU_STUCK_REASON_NO_PGTABLE_MEM (UL(3) << CPU_STUCK_REASON_SHIFT)
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index cb68adcabe078..7e17a71fd9e4b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -89,7 +89,8 @@ SYM_CODE_START(primary_entry)
mov sp, x1
mov x29, xzr
adrp x0, init_idmap_pg_dir
- mov x1, xzr
+ adrp x1, init_idmap_pg_end
+ mov x2, xzr
bl __pi_create_init_idmap
/*
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 8f5422ed1b758..a168f3337446f 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -52,6 +52,7 @@ PROVIDE(__pi_cavium_erratum_27456_cpus = cavium_erratum_27456_cpus);
#endif
PROVIDE(__pi__ctype = _ctype);
PROVIDE(__pi_memstart_offset_seed = memstart_offset_seed);
+PROVIDE(__pi___early_cpu_boot_status = __early_cpu_boot_status);
PROVIDE(__pi_init_idmap_pg_dir = init_idmap_pg_dir);
PROVIDE(__pi_init_idmap_pg_end = init_idmap_pg_end);
@@ -68,6 +69,8 @@ PROVIDE(__pi___inittext_end = __inittext_end);
PROVIDE(__pi___initdata_begin = __initdata_begin);
PROVIDE(__pi___initdata_end = __initdata_end);
PROVIDE(__pi__data = _data);
+PROVIDE(__pi___mmuoff_data_start = __mmuoff_data_start);
+PROVIDE(__pi___mmuoff_data_end = __mmuoff_data_end);
PROVIDE(__pi___bss_start = __bss_start);
PROVIDE(__pi__end = _end);
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index f374a3e5a5fe1..dcf9233ccfff2 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -20,11 +20,11 @@ extern const u8 __eh_frame_start[], __eh_frame_end[];
extern void idmap_cpu_replace_ttbr1(void *pgdir);
-static void __init map_segment(pgd_t *pg_dir, u64 *pgd, u64 va_offset,
- void *start, void *end, pgprot_t prot,
- bool may_use_cont, int root_level)
+static void __init map_segment(pgd_t *pg_dir, u64 *pgd, u64 limit,
+ u64 va_offset, void *start, void *end,
+ pgprot_t prot, bool may_use_cont, int root_level)
{
- map_range(pgd, ((u64)start + va_offset) & ~PAGE_OFFSET,
+ map_range(pgd, limit, ((u64)start + va_offset) & ~PAGE_OFFSET,
((u64)end + va_offset) & ~PAGE_OFFSET, (u64)start,
prot, root_level, (pte_t *)pg_dir, may_use_cont, 0);
}
@@ -32,7 +32,7 @@ static void __init map_segment(pgd_t *pg_dir, u64 *pgd, u64 va_offset,
static void __init unmap_segment(pgd_t *pg_dir, u64 va_offset, void *start,
void *end, int root_level)
{
- map_segment(pg_dir, NULL, va_offset, start, end, __pgprot(0),
+ map_segment(pg_dir, NULL, 0, va_offset, start, end, __pgprot(0),
false, root_level);
}
@@ -41,6 +41,7 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
bool enable_scs = IS_ENABLED(CONFIG_UNWIND_PATCH_PAC_INTO_SCS);
bool twopass = IS_ENABLED(CONFIG_RELOCATABLE);
u64 pgdp = (u64)init_pg_dir + PAGE_SIZE;
+ u64 limit = (u64)init_pg_end;
pgprot_t text_prot = PAGE_KERNEL_ROX;
pgprot_t data_prot = PAGE_KERNEL;
pgprot_t prot;
@@ -78,16 +79,16 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
twopass |= enable_scs;
prot = twopass ? data_prot : text_prot;
- map_segment(init_pg_dir, &pgdp, va_offset, _stext, _etext, prot,
+ map_segment(init_pg_dir, &pgdp, limit, va_offset, _stext, _etext, prot,
!twopass, root_level);
- map_segment(init_pg_dir, &pgdp, va_offset, __start_rodata,
+ map_segment(init_pg_dir, &pgdp, limit, va_offset, __start_rodata,
__inittext_begin, data_prot, false, root_level);
- map_segment(init_pg_dir, &pgdp, va_offset, __inittext_begin,
+ map_segment(init_pg_dir, &pgdp, limit, va_offset, __inittext_begin,
__inittext_end, prot, false, root_level);
- map_segment(init_pg_dir, &pgdp, va_offset, __initdata_begin,
+ map_segment(init_pg_dir, &pgdp, limit, va_offset, __initdata_begin,
__initdata_end, data_prot, false, root_level);
- map_segment(init_pg_dir, &pgdp, va_offset, _data, _end, data_prot,
- true, root_level);
+ map_segment(init_pg_dir, &pgdp, limit, va_offset, _data, _end,
+ data_prot, true, root_level);
dsb(ishst);
idmap_cpu_replace_ttbr1(init_pg_dir);
@@ -120,9 +121,9 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
* Remap these segments with different permissions
* No new page table allocations should be needed
*/
- map_segment(init_pg_dir, NULL, va_offset, _stext, _etext,
+ map_segment(init_pg_dir, NULL, 0, va_offset, _stext, _etext,
text_prot, true, root_level);
- map_segment(init_pg_dir, NULL, va_offset, __inittext_begin,
+ map_segment(init_pg_dir, NULL, 0, va_offset, __inittext_begin,
__inittext_end, text_prot, false, root_level);
}
@@ -164,7 +165,7 @@ static void __init remap_idmap_for_lpa2(void)
* LPA2 compatible fashion, and update the initial ID map while running
* from that.
*/
- create_init_idmap(init_pg_dir, mask);
+ create_init_idmap(init_pg_dir, init_pg_end, mask);
dsb(ishst);
set_ttbr0_for_lpa2((u64)init_pg_dir);
@@ -175,7 +176,7 @@ static void __init remap_idmap_for_lpa2(void)
memset(init_idmap_pg_dir, 0,
(u64)init_idmap_pg_end - (u64)init_idmap_pg_dir);
- create_init_idmap(init_idmap_pg_dir, mask);
+ create_init_idmap(init_idmap_pg_dir, init_idmap_pg_end, mask);
dsb(ishst);
/* switch back to the updated initial ID map */
@@ -188,6 +189,7 @@ static void __init remap_idmap_for_lpa2(void)
static void __init map_fdt(u64 fdt)
{
static u8 ptes[INIT_IDMAP_FDT_SIZE] __initdata __aligned(PAGE_SIZE);
+ u64 limit = (u64)&ptes[INIT_IDMAP_FDT_SIZE];
u64 efdt = fdt + MAX_FDT_SIZE;
u64 ptep = (u64)ptes;
@@ -195,7 +197,8 @@ static void __init map_fdt(u64 fdt)
* Map up to MAX_FDT_SIZE bytes, but avoid overlap with
* the kernel image.
*/
- map_range(&ptep, fdt, (u64)_text > fdt ? min((u64)_text, efdt) : efdt,
+ map_range(&ptep, limit, fdt,
+ (u64)_text > fdt ? min((u64)_text, efdt) : efdt,
fdt, PAGE_KERNEL, IDMAP_ROOT_LEVEL,
(pte_t *)init_idmap_pg_dir, false, 0);
dsb(ishst);
diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
index 5410b2cac5907..f0024d9b1d921 100644
--- a/arch/arm64/kernel/pi/map_range.c
+++ b/arch/arm64/kernel/pi/map_range.c
@@ -11,11 +11,36 @@
#include "pi.h"
+static void __init mmuoff_data_clean(void)
+{
+ bool cache_ena = !!(read_sysreg(sctlr_el1) & SCTLR_ELx_C);
+
+ if (cache_ena)
+ dcache_clean_poc((unsigned long)__mmuoff_data_start,
+ (unsigned long)__mmuoff_data_end);
+ else
+ dcache_inval_poc((unsigned long)__mmuoff_data_start,
+ (unsigned long)__mmuoff_data_end);
+}
+
+static void __init report_cpu_stuck(long val)
+{
+ val |= CPU_STUCK_IN_KERNEL;
+ WRITE_ONCE(__early_cpu_boot_status, val);
+
+ /* Ensure the visibility of the status update */
+ dsb(ishst);
+ mmuoff_data_clean();
+
+ cpu_park_loop();
+}
+
/**
* map_range - Map a contiguous range of physical pages into virtual memory
*
* @pte: Address of physical pointer to array of pages to
* allocate page tables from
+ * @limit: Physical address of end of page allocation array
* @start: Virtual address of the start of the range
* @end: Virtual address of the end of the range (exclusive)
* @pa: Physical address of the start of the range
@@ -26,8 +51,9 @@
* @va_offset: Offset between a physical page and its current mapping
* in the VA space
*/
-void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
- int level, pte_t *tbl, bool may_use_cont, u64 va_offset)
+void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
+ pgprot_t prot, int level, pte_t *tbl, bool may_use_cont,
+ u64 va_offset)
{
u64 cmask = (level == 3) ? CONT_PTE_SIZE - 1 : U64_MAX;
u64 protval = pgprot_val(prot) & ~PTE_TYPE_MASK;
@@ -56,11 +82,18 @@ void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
* table mapping if necessary and recurse.
*/
if (pte_none(*tbl)) {
+ u64 size = PTRS_PER_PTE * sizeof(pte_t);
+
+ if (*pte + size > limit) {
+ report_cpu_stuck(
+ CPU_STUCK_REASON_NO_PGTABLE_MEM);
+ }
+
*tbl = __pte(__phys_to_pte_val(*pte) |
PMD_TYPE_TABLE | PMD_TABLE_UXN);
- *pte += PTRS_PER_PTE * sizeof(pte_t);
+ *pte += size;
}
- map_range(pte, start, next, pa, prot, level + 1,
+ map_range(pte, limit, start, next, pa, prot, level + 1,
(pte_t *)(__pte_to_phys(*tbl) + va_offset),
may_use_cont, va_offset);
} else {
@@ -87,7 +120,8 @@ void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
}
}
-asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pteval_t clrmask)
+asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pgd_t *pg_end,
+ pteval_t clrmask)
{
u64 ptep = (u64)pg_dir + PAGE_SIZE;
pgprot_t text_prot = PAGE_KERNEL_ROX;
@@ -96,10 +130,12 @@ asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pteval_t clrmask)
pgprot_val(text_prot) &= ~clrmask;
pgprot_val(data_prot) &= ~clrmask;
- map_range(&ptep, (u64)_stext, (u64)__initdata_begin, (u64)_stext,
- text_prot, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
- map_range(&ptep, (u64)__initdata_begin, (u64)_end, (u64)__initdata_begin,
- data_prot, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
+ map_range(&ptep, (u64)pg_end, (u64)_stext, (u64)__initdata_begin,
+ (u64)_stext, text_prot, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir,
+ false, 0);
+ map_range(&ptep, (u64)pg_end, (u64)__initdata_begin, (u64)_end,
+ (u64)__initdata_begin, data_prot, IDMAP_ROOT_LEVEL,
+ (pte_t *)pg_dir, false, 0);
return ptep;
}
diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h
index c91e5e965cd39..20fe0941cb8ee 100644
--- a/arch/arm64/kernel/pi/pi.h
+++ b/arch/arm64/kernel/pi/pi.h
@@ -28,9 +28,9 @@ u64 kaslr_early_init(void *fdt, int chosen);
void relocate_kernel(u64 offset);
int scs_patch(const u8 eh_frame[], int size);
-void map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
+void map_range(u64 *pgd, u64 limit, u64 start, u64 end, u64 pa, pgprot_t prot,
int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
asmlinkage void early_map_kernel(u64 boot_status, void *fdt);
-asmlinkage u64 create_init_idmap(pgd_t *pgd, pteval_t clrmask);
+asmlinkage u64 create_init_idmap(pgd_t *pgd, pgd_t *pg_end, pteval_t clrmask);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 353ea5dc32b85..969348a2e93c9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -773,8 +773,9 @@ static void __init declare_kernel_vmas(void)
declare_vma(&vmlinux_seg[4], _data, _end, 0);
}
-void __pi_map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
- int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
+void __pi_map_range(u64 *pgd, u64 limit, u64 start, u64 end, u64 pa,
+ pgprot_t prot, int level, pte_t *tbl, bool may_use_cont,
+ u64 va_offset);
static u8 idmap_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init,
kpti_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init;
@@ -784,8 +785,9 @@ static void __init create_idmap(void)
u64 start = __pa_symbol(__idmap_text_start);
u64 end = __pa_symbol(__idmap_text_end);
u64 ptep = __pa_symbol(idmap_ptes);
+ u64 limit = __pa_symbol(&idmap_ptes[IDMAP_LEVELS - 1][0]);
- __pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
+ __pi_map_range(&ptep, limit, start, end, start, PAGE_KERNEL_ROX,
IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
__phys_to_virt(ptep) - ptep);
@@ -798,8 +800,10 @@ static void __init create_idmap(void)
* of its synchronization flag in the ID map.
*/
ptep = __pa_symbol(kpti_ptes);
- __pi_map_range(&ptep, pa, pa + sizeof(u32), pa, PAGE_KERNEL,
- IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
+ limit = __pa_symbol(&kpti_ptes[IDMAP_LEVELS - 1][0]);
+ __pi_map_range(&ptep, limit, pa, pa + sizeof(u32), pa,
+ PAGE_KERNEL, IDMAP_ROOT_LEVEL,
+ (pte_t *)idmap_pg_dir, false,
__phys_to_virt(ptep) - ptep);
}
}
--
2.43.0
* [RFC PATCH v1 39/57] arm64: Introduce macros required for boot-time page selection
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (36 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 38/57] arm64: Track early pgtable allocation limit Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 40/57] arm64: Refactor early pgtable size calculation macros Ryan Roberts
` (20 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
This minimal set of macros will allow boot-time page selection support to
be added to the arm64 arch code incrementally over the following set of
patches.
The definitions in pgtable-geometry.h are for compile-time page size
currently, but they will be modified in future to support boot-time page
size.
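To show why these shifts can all be derived from the page shift alone, here
is a userspace sketch of the existing ARM64_HW_PGTABLE_LEVEL_SHIFT(n)
formula with the page shift passed as a run-time argument. The function
names are hypothetical; this is not how the series itself implements the
ptg_* values.

```c
#include <assert.h>

/*
 * Same formula as ARM64_HW_PGTABLE_LEVEL_SHIFT(n), with PAGE_SHIFT passed
 * in as a value instead of being a compile-time constant.
 */
unsigned int level_shift(unsigned int page_shift, unsigned int level)
{
	return (page_shift - 3) * (4 - level) + 3;
}

/* Entries per table: a table fills one page and each entry is 8 bytes. */
unsigned long ptrs_per_table(unsigned int page_shift)
{
	return 1UL << (page_shift - 3);
}
```

For example, with 4K pages (shift 12) a level 2 entry maps 2^21 bytes, and
with 64K pages (shift 16) it maps 2^29 bytes.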
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/page-def.h | 5 ++--
arch/arm64/include/asm/pgtable-geometry.h | 28 +++++++++++++++++++++++
arch/arm64/include/asm/pgtable-hwdef.h | 16 ++++++++-----
3 files changed, 40 insertions(+), 9 deletions(-)
create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index d69971cf49cd2..b99dee0112463 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -9,12 +9,11 @@
#define __ASM_PAGE_DEF_H
#include <linux/const.h>
+#include <asm/pgtable-geometry.h>
/* PAGE_SHIFT determines the page size */
-#define PAGE_SHIFT CONFIG_PAGE_SHIFT
+#define PAGE_SHIFT ptg_page_shift
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
-#include <asm-generic/pgtable-geometry.h>
-
#endif /* __ASM_PAGE_DEF_H */
diff --git a/arch/arm64/include/asm/pgtable-geometry.h b/arch/arm64/include/asm/pgtable-geometry.h
new file mode 100644
index 0000000000000..62fe125909c08
--- /dev/null
+++ b/arch/arm64/include/asm/pgtable-geometry.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef ASM_PGTABLE_GEOMETRY_H
+#define ASM_PGTABLE_GEOMETRY_H
+
+#define ARM64_PAGE_SHIFT_4K 12
+#define ARM64_PAGE_SHIFT_16K 14
+#define ARM64_PAGE_SHIFT_64K 16
+
+#define PAGE_SHIFT_MIN CONFIG_PAGE_SHIFT
+#define PAGE_SIZE_MIN (_AC(1, UL) << PAGE_SHIFT_MIN)
+#define PAGE_MASK_MIN (~(PAGE_SIZE_MIN-1))
+
+#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
+#define PAGE_SIZE_MAX (_AC(1, UL) << PAGE_SHIFT_MAX)
+#define PAGE_MASK_MAX (~(PAGE_SIZE_MAX-1))
+
+#include <asm-generic/pgtable-geometry.h>
+
+#define ptg_page_shift CONFIG_PAGE_SHIFT
+#define ptg_pmd_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
+#define ptg_pud_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
+#define ptg_p4d_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(0)
+#define ptg_pgdir_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - CONFIG_PGTABLE_LEVELS)
+#define ptg_cont_pte_shift (CONFIG_ARM64_CONT_PTE_SHIFT + PAGE_SHIFT)
+#define ptg_cont_pmd_shift (CONFIG_ARM64_CONT_PMD_SHIFT + PMD_SHIFT)
+#define ptg_pgtable_levels CONFIG_PGTABLE_LEVELS
+
+#endif /* ASM_PGTABLE_GEOMETRY_H */
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index 1f60aa1bc750c..54a9153f56bc5 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -41,39 +41,43 @@
#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3)
#define PTRS_PER_PTE (1 << (PAGE_SHIFT - 3))
+#define MAX_PTRS_PER_PTE (1 << (PAGE_SHIFT_MAX - 3))
/*
* PMD_SHIFT determines the size a level 2 page table entry can map.
*/
#if CONFIG_PGTABLE_LEVELS > 2
-#define PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
+#define PMD_SHIFT ptg_pmd_shift
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE-1))
#define PTRS_PER_PMD (1 << (PAGE_SHIFT - 3))
+#define MAX_PTRS_PER_PMD (1 << (PAGE_SHIFT_MAX - 3))
#endif
/*
* PUD_SHIFT determines the size a level 1 page table entry can map.
*/
#if CONFIG_PGTABLE_LEVELS > 3
-#define PUD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
+#define PUD_SHIFT ptg_pud_shift
#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE-1))
#define PTRS_PER_PUD (1 << (PAGE_SHIFT - 3))
+#define MAX_PTRS_PER_PUD (1 << (PAGE_SHIFT_MAX - 3))
#endif
#if CONFIG_PGTABLE_LEVELS > 4
-#define P4D_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(0)
+#define P4D_SHIFT ptg_p4d_shift
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE-1))
#define PTRS_PER_P4D (1 << (PAGE_SHIFT - 3))
+#define MAX_PTRS_PER_P4D (1 << (PAGE_SHIFT_MAX - 3))
#endif
/*
* PGDIR_SHIFT determines the size a top-level page table entry can map
* (depending on the configuration, this level can be -1, 0, 1 or 2).
*/
-#define PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - CONFIG_PGTABLE_LEVELS)
+#define PGDIR_SHIFT ptg_pgdir_shift
#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE-1))
#define PTRS_PER_PGD (1 << (VA_BITS - PGDIR_SHIFT))
@@ -81,12 +85,12 @@
/*
* Contiguous page definitions.
*/
-#define CONT_PTE_SHIFT (CONFIG_ARM64_CONT_PTE_SHIFT + PAGE_SHIFT)
+#define CONT_PTE_SHIFT ptg_cont_pte_shift
#define CONT_PTES (1 << (CONT_PTE_SHIFT - PAGE_SHIFT))
#define CONT_PTE_SIZE (CONT_PTES * PAGE_SIZE)
#define CONT_PTE_MASK (~(CONT_PTE_SIZE - 1))
-#define CONT_PMD_SHIFT (CONFIG_ARM64_CONT_PMD_SHIFT + PMD_SHIFT)
+#define CONT_PMD_SHIFT ptg_cont_pmd_shift
#define CONT_PMDS (1 << (CONT_PMD_SHIFT - PMD_SHIFT))
#define CONT_PMD_SIZE (CONT_PMDS * PMD_SIZE)
#define CONT_PMD_MASK (~(CONT_PMD_SIZE - 1))
--
2.43.0
* [RFC PATCH v1 40/57] arm64: Refactor early pgtable size calculation macros
From: Ryan Roberts @ 2024-10-14 10:58 UTC
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
The various early idmaps and init/swapper pgtables are constructed using
static storage, the size of which is calculated at compile-time based
on the selected page size. But in the near future, boot-time page size
builds will need to statically allocate enough storage for the worst
case across all the page sizes that could be selected at boot.
Therefore, refactor the macros that determine the storage requirement to
take a page_shift parameter, then perform the calculation for each page
size we are compiling with support for and take the max. For
compile-time page size builds, the end result is exactly the same
because there is only 1 page size we support. For boot-time page size
builds we end up with the worst case required size.
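To illustrate the "take the max" step, here is a minimal standalone sketch. It is not code from this patch: the DEMO_* names and the 4K..64K build range are assumptions made for illustration, and only the level arithmetic mirrors __ARM64_HW_PGTABLE_LEVELS().

```c
#include <assert.h>

/* Assumed build range for illustration: 4K, 16K and 64K all enabled. */
#define DEMO_PAGE_SHIFT_MIN	12
#define DEMO_PAGE_SHIFT_MAX	16

#define DEMO_MAX(a, b)		((a) > (b) ? (a) : (b))

/* Contribute val only if page_shift is inside the compiled-in range. */
#define DEMO_VAL_IF_HAVE_PGSZ(val, page_shift)				\
	((page_shift) >= DEMO_PAGE_SHIFT_MIN &&				\
	 (page_shift) <= DEMO_PAGE_SHIFT_MAX ? (val) : 0)

/* Worst case across the three architectural page sizes. */
#define DEMO_MAX_IF_HAVE_PGSZ(val4k, val16k, val64k)			\
	DEMO_MAX(DEMO_VAL_IF_HAVE_PGSZ((val4k), 12), DEMO_MAX(		\
		 DEMO_VAL_IF_HAVE_PGSZ((val16k), 14),			\
		 DEMO_VAL_IF_HAVE_PGSZ((val64k), 16)))

/* Same arithmetic as __ARM64_HW_PGTABLE_LEVELS(page_shift, va_bits). */
#define DEMO_LEVELS(page_shift, va_bits)				\
	(((va_bits) - 4) / ((page_shift) - 3))

/* Number of levels to reserve static storage for, with 48 VA bits. */
static inline int demo_idmap_levels_max(void)
{
	return DEMO_MAX_IF_HAVE_PGSZ(DEMO_LEVELS(12, 48),
				     DEMO_LEVELS(14, 48),
				     DEMO_LEVELS(16, 48));
}
```

With 48 VA bits the per-size results are 4, 4 and 3 levels for 4K, 16K and 64K respectively, so the sketch reserves for 4. If the assumed range is narrowed to a single page size, the other two terms collapse to 0 and the result is the single-size value.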
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/kernel-pgtable.h | 148 +++++++++++++++++-------
arch/arm64/include/asm/pgtable-hwdef.h | 6 +-
arch/arm64/kernel/pi/map_kernel.c | 6 +-
arch/arm64/kernel/pi/map_range.c | 8 +-
arch/arm64/kernel/vmlinux.lds.S | 4 +-
arch/arm64/mm/mmu.c | 13 +--
6 files changed, 124 insertions(+), 61 deletions(-)
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 1722b9217d47d..facdf273d4cda 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -12,28 +12,38 @@
#include <asm/pgtable-hwdef.h>
#include <asm/sparsemem.h>
+#define PGTABLE_LEVELS(page_shift, va_bits) \
+ __ARM64_HW_PGTABLE_LEVELS(page_shift, va_bits)
+#define PGTABLE_LEVEL_SHIFT(page_shift, n) \
+ __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, n)
+#define PGTABLE_LEVEL_SIZE(page_shift, n) \
+ (UL(1) << PGTABLE_LEVEL_SHIFT(page_shift, n))
+
/*
* The physical and virtual addresses of the start of the kernel image are
* equal modulo 2 MiB (per the arm64 booting.txt requirements). Hence we can
* use section mapping with 4K (section size = 2M) but not with 16K (section
* size = 32M) or 64K (section size = 512M).
*/
-#if defined(PMD_SIZE) && PMD_SIZE <= MIN_KIMG_ALIGN
-#define SWAPPER_BLOCK_SHIFT PMD_SHIFT
-#define SWAPPER_SKIP_LEVEL 1
-#else
-#define SWAPPER_BLOCK_SHIFT PAGE_SHIFT
-#define SWAPPER_SKIP_LEVEL 0
-#endif
-#define SWAPPER_BLOCK_SIZE (UL(1) << SWAPPER_BLOCK_SHIFT)
-#define SWAPPER_TABLE_SHIFT (SWAPPER_BLOCK_SHIFT + PAGE_SHIFT - 3)
-
-#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - SWAPPER_SKIP_LEVEL)
-#define INIT_IDMAP_PGTABLE_LEVELS (IDMAP_LEVELS - SWAPPER_SKIP_LEVEL)
-
-#define IDMAP_VA_BITS 48
-#define IDMAP_LEVELS ARM64_HW_PGTABLE_LEVELS(IDMAP_VA_BITS)
-#define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)
+#define SWAPPER_BLOCK_SHIFT(page_shift) \
+ ((PGTABLE_LEVEL_SIZE(page_shift, 2) <= MIN_KIMG_ALIGN) ? \
+ PGTABLE_LEVEL_SHIFT(page_shift, 2) : (page_shift))
+
+#define SWAPPER_SKIP_LEVEL(page_shift) \
+ ((PGTABLE_LEVEL_SIZE(page_shift, 2) <= MIN_KIMG_ALIGN) ? 1 : 0)
+
+#define SWAPPER_BLOCK_SIZE(page_shift) \
+ (UL(1) << SWAPPER_BLOCK_SHIFT(page_shift))
+
+#define SWAPPER_PGTABLE_LEVELS(page_shift) \
+ (PGTABLE_LEVELS(page_shift, VA_BITS) - SWAPPER_SKIP_LEVEL(page_shift))
+
+#define INIT_IDMAP_PGTABLE_LEVELS(page_shift) \
+ (IDMAP_LEVELS(page_shift) - SWAPPER_SKIP_LEVEL(page_shift))
+
+#define IDMAP_VA_BITS 48
+#define IDMAP_LEVELS(page_shift) PGTABLE_LEVELS(page_shift, IDMAP_VA_BITS)
+#define IDMAP_ROOT_LEVEL(page_shift) (4 - IDMAP_LEVELS(page_shift))
#define KIMAGE_VADDR_END (_AT(u64, _end) - _AT(u64, _text) + KIMAGE_VADDR)
@@ -43,47 +53,99 @@
* with two adjacent PGDIR entries, which means that an additional page table
* may be needed at each subordinate level.
*/
-#define EXTRA_PAGE __is_defined(CONFIG_RELOCATABLE)
+#define EXTRA_PAGE __is_defined(CONFIG_RELOCATABLE)
-#define SPAN_NR_ENTRIES(vstart, vend, shift) \
+#define SPAN_NR_ENTRIES(vstart, vend, shift) \
((((vend) - 1) >> (shift)) - ((vstart) >> (shift)) + 1)
-#define EARLY_ENTRIES(vstart, vend, shift, add) \
+#define EARLY_ENTRIES(vstart, vend, shift, add) \
(SPAN_NR_ENTRIES(vstart, vend, shift) + (add))
-#define EARLY_LEVEL(lvl, lvls, vstart, vend, add) \
- (lvls > lvl ? EARLY_ENTRIES(vstart, vend, SWAPPER_BLOCK_SHIFT + lvl * (PAGE_SHIFT - 3), add) : 0)
-
-#define EARLY_PAGES(lvls, vstart, vend, add) (1 /* PGDIR page */ \
- + EARLY_LEVEL(3, (lvls), (vstart), (vend), add) /* each entry needs a next level page table */ \
- + EARLY_LEVEL(2, (lvls), (vstart), (vend), add) /* each entry needs a next level page table */ \
- + EARLY_LEVEL(1, (lvls), (vstart), (vend), add))/* each entry needs a next level page table */
-#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(SWAPPER_PGTABLE_LEVELS, KIMAGE_VADDR, KIMAGE_VADDR_END, EXTRA_PAGE) \
- + EARLY_SEGMENT_EXTRA_PAGES))
-
-#define INIT_IDMAP_DIR_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, KIMAGE_VADDR, KIMAGE_VADDR_END, 1))
-#define INIT_IDMAP_DIR_SIZE ((INIT_IDMAP_DIR_PAGES + EARLY_IDMAP_EXTRA_PAGES) * PAGE_SIZE)
+#define EARLY_LEVEL(page_shift, lvl, lvls, vstart, vend, add) \
+ (lvls > lvl ? EARLY_ENTRIES(vstart, vend, \
+ SWAPPER_BLOCK_SHIFT(page_shift) + lvl * ((page_shift) - 3), \
+ add) : 0)
-#define INIT_IDMAP_FDT_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, 0UL, UL(MAX_FDT_SIZE), 1) - 1)
-#define INIT_IDMAP_FDT_SIZE ((INIT_IDMAP_FDT_PAGES + EARLY_IDMAP_EXTRA_FDT_PAGES) * PAGE_SIZE)
+#define EARLY_PAGES(page_shift, lvls, vstart, vend, add) (1 /* PGDIR */ \
+ + EARLY_LEVEL((page_shift), 3, (lvls), (vstart), (vend), add) \
+ + EARLY_LEVEL((page_shift), 2, (lvls), (vstart), (vend), add) \
+ + EARLY_LEVEL((page_shift), 1, (lvls), (vstart), (vend), add))
/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
-#define KERNEL_SEGMENT_COUNT 5
+#define KERNEL_SEGMENT_COUNT 5
-#if SWAPPER_BLOCK_SIZE > SEGMENT_ALIGN
-#define EARLY_SEGMENT_EXTRA_PAGES (KERNEL_SEGMENT_COUNT + 1)
/*
* The initial ID map consists of the kernel image, mapped as two separate
* segments, and may appear misaligned wrt the swapper block size. This means
* we need 3 additional pages. The DT could straddle a swapper block boundary,
* so it may need 2.
*/
-#define EARLY_IDMAP_EXTRA_PAGES 3
-#define EARLY_IDMAP_EXTRA_FDT_PAGES 2
-#else
-#define EARLY_SEGMENT_EXTRA_PAGES 0
-#define EARLY_IDMAP_EXTRA_PAGES 0
-#define EARLY_IDMAP_EXTRA_FDT_PAGES 0
-#endif
+#define EARLY_SEGMENT_EXTRA_PAGES(page_shift) \
+ ((SWAPPER_BLOCK_SIZE(page_shift) > SEGMENT_ALIGN) ? \
+ (KERNEL_SEGMENT_COUNT + 1) : 0)
+
+#define EARLY_IDMAP_EXTRA_PAGES(page_shift) \
+ ((SWAPPER_BLOCK_SIZE(page_shift) > SEGMENT_ALIGN) ? 3 : 0)
+
+#define EARLY_IDMAP_EXTRA_FDT_PAGES(page_shift) \
+ ((SWAPPER_BLOCK_SIZE(page_shift) > SEGMENT_ALIGN) ? 2 : 0)
+
+#define INIT_DIR_PAGES(page_shift) \
+ (EARLY_PAGES((page_shift), SWAPPER_PGTABLE_LEVELS(page_shift), \
+ KIMAGE_VADDR, KIMAGE_VADDR_END, EXTRA_PAGE))
+
+#define INIT_DIR_SIZE(page_shift) \
+ ((INIT_DIR_PAGES(page_shift) + \
+ EARLY_SEGMENT_EXTRA_PAGES(page_shift)) * (UL(1) << (page_shift)))
+
+#define INIT_IDMAP_DIR_PAGES(page_shift) \
+ (EARLY_PAGES((page_shift), \
+ INIT_IDMAP_PGTABLE_LEVELS(page_shift), \
+ KIMAGE_VADDR, KIMAGE_VADDR_END, 1))
+
+#define INIT_IDMAP_DIR_SIZE(page_shift) \
+ ((INIT_IDMAP_DIR_PAGES(page_shift) + \
+ EARLY_IDMAP_EXTRA_PAGES(page_shift)) * (UL(1) << (page_shift)))
+
+#define INIT_IDMAP_FDT_PAGES(page_shift) \
+ (EARLY_PAGES((page_shift), \
+ INIT_IDMAP_PGTABLE_LEVELS(page_shift), \
+ UL(0), UL(MAX_FDT_SIZE), 1) - 1)
+
+#define INIT_IDMAP_FDT_SIZE(page_shift) \
+ ((INIT_IDMAP_FDT_PAGES(page_shift) + \
+ EARLY_IDMAP_EXTRA_FDT_PAGES(page_shift)) * (UL(1) << (page_shift)))
+
+#define VAL_IF_HAVE_PGSZ(val, page_shift) \
+ ((page_shift) >= PAGE_SHIFT_MIN && \
+ (page_shift) <= PAGE_SHIFT_MAX ? (val) : 0)
+
+#define MAX_IF_HAVE_PGSZ(val4k, val16k, val64k) \
+ MAX(VAL_IF_HAVE_PGSZ((val4k), ARM64_PAGE_SHIFT_4K), MAX( \
+ VAL_IF_HAVE_PGSZ((val16k), ARM64_PAGE_SHIFT_16K), \
+ VAL_IF_HAVE_PGSZ((val64k), ARM64_PAGE_SHIFT_64K)))
+
+#define IDMAP_LEVELS_MAX \
+ MAX_IF_HAVE_PGSZ(IDMAP_LEVELS(ARM64_PAGE_SHIFT_4K), \
+ IDMAP_LEVELS(ARM64_PAGE_SHIFT_16K), \
+ IDMAP_LEVELS(ARM64_PAGE_SHIFT_64K))
+
+#define __INIT_DIR_SIZE_MAX \
+ MAX_IF_HAVE_PGSZ(INIT_DIR_SIZE(ARM64_PAGE_SHIFT_4K), \
+ INIT_DIR_SIZE(ARM64_PAGE_SHIFT_16K), \
+ INIT_DIR_SIZE(ARM64_PAGE_SHIFT_64K))
+
+#define INIT_DIR_SIZE_MAX \
+ MAX(__INIT_DIR_SIZE_MAX, INIT_IDMAP_DIR_SIZE_MAX)
+
+#define INIT_IDMAP_DIR_SIZE_MAX \
+ MAX_IF_HAVE_PGSZ(INIT_IDMAP_DIR_SIZE(ARM64_PAGE_SHIFT_4K), \
+ INIT_IDMAP_DIR_SIZE(ARM64_PAGE_SHIFT_16K), \
+ INIT_IDMAP_DIR_SIZE(ARM64_PAGE_SHIFT_64K))
+
+#define INIT_IDMAP_FDT_SIZE_MAX \
+ MAX_IF_HAVE_PGSZ(INIT_IDMAP_FDT_SIZE(ARM64_PAGE_SHIFT_4K), \
+ INIT_IDMAP_FDT_SIZE(ARM64_PAGE_SHIFT_16K), \
+ INIT_IDMAP_FDT_SIZE(ARM64_PAGE_SHIFT_64K))
#endif /* __ASM_KERNEL_PGTABLE_H */
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index 54a9153f56bc5..ca8bcbc1fe220 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -23,7 +23,8 @@
*
* which gets simplified as :
*/
-#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))
+#define __ARM64_HW_PGTABLE_LEVELS(page_shift, va_bits) (((va_bits) - 4) / ((page_shift) - 3))
+#define ARM64_HW_PGTABLE_LEVELS(va_bits) __ARM64_HW_PGTABLE_LEVELS(PAGE_SHIFT, va_bits)
/*
* Size mapped by an entry at level n ( -1 <= n <= 3)
@@ -38,7 +39,8 @@
* Rearranging it a bit we get :
* (4 - n) * (PAGE_SHIFT - 3) + 3
*/
-#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3)
+#define __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, n) (((page_shift) - 3) * (4 - (n)) + 3)
+#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) __ARM64_HW_PGTABLE_LEVEL_SHIFT(PAGE_SHIFT, n)
#define PTRS_PER_PTE (1 << (PAGE_SHIFT - 3))
#define MAX_PTRS_PER_PTE (1 << (PAGE_SHIFT_MAX - 3))
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index dcf9233ccfff2..a53fc225d2d0d 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -188,8 +188,8 @@ static void __init remap_idmap_for_lpa2(void)
static void __init map_fdt(u64 fdt)
{
- static u8 ptes[INIT_IDMAP_FDT_SIZE] __initdata __aligned(PAGE_SIZE);
- u64 limit = (u64)&ptes[INIT_IDMAP_FDT_SIZE];
+ static u8 ptes[INIT_IDMAP_FDT_SIZE_MAX] __initdata __aligned(PAGE_SIZE);
+ u64 limit = (u64)&ptes[INIT_IDMAP_FDT_SIZE_MAX];
u64 efdt = fdt + MAX_FDT_SIZE;
u64 ptep = (u64)ptes;
@@ -199,7 +199,7 @@ static void __init map_fdt(u64 fdt)
*/
map_range(&ptep, limit, fdt,
(u64)_text > fdt ? min((u64)_text, efdt) : efdt,
- fdt, PAGE_KERNEL, IDMAP_ROOT_LEVEL,
+ fdt, PAGE_KERNEL, IDMAP_ROOT_LEVEL(PAGE_SHIFT),
(pte_t *)init_idmap_pg_dir, false, 0);
dsb(ishst);
}
diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
index f0024d9b1d921..b62d2e3135f81 100644
--- a/arch/arm64/kernel/pi/map_range.c
+++ b/arch/arm64/kernel/pi/map_range.c
@@ -131,11 +131,11 @@ asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pgd_t *pg_end,
pgprot_val(data_prot) &= ~clrmask;
map_range(&ptep, (u64)pg_end, (u64)_stext, (u64)__initdata_begin,
- (u64)_stext, text_prot, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir,
- false, 0);
+ (u64)_stext, text_prot,
+ IDMAP_ROOT_LEVEL(PAGE_SHIFT), (pte_t *)pg_dir, false, 0);
map_range(&ptep, (u64)pg_end, (u64)__initdata_begin, (u64)_end,
- (u64)__initdata_begin, data_prot, IDMAP_ROOT_LEVEL,
- (pte_t *)pg_dir, false, 0);
+ (u64)__initdata_begin, data_prot,
+ IDMAP_ROOT_LEVEL(PAGE_SHIFT), (pte_t *)pg_dir, false, 0);
return ptep;
}
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 55a8e310ea12c..7f3f6d709ae73 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -249,7 +249,7 @@ SECTIONS
__initdata_begin = .;
init_idmap_pg_dir = .;
- . += INIT_IDMAP_DIR_SIZE;
+ . += INIT_IDMAP_DIR_SIZE_MAX;
init_idmap_pg_end = .;
.init.data : {
@@ -319,7 +319,7 @@ SECTIONS
. = ALIGN(PAGE_SIZE);
init_pg_dir = .;
- . += INIT_DIR_SIZE;
+ . += INIT_DIR_SIZE_MAX;
init_pg_end = .;
/* end of zero-init region */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 969348a2e93c9..d4d30eaefb4cd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -777,19 +777,19 @@ void __pi_map_range(u64 *pgd, u64 limit, u64 start, u64 end, u64 pa,
pgprot_t prot, int level, pte_t *tbl, bool may_use_cont,
u64 va_offset);
-static u8 idmap_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init,
- kpti_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init;
+static u8 idmap_ptes[IDMAP_LEVELS_MAX - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init,
+ kpti_ptes[IDMAP_LEVELS_MAX - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init;
static void __init create_idmap(void)
{
u64 start = __pa_symbol(__idmap_text_start);
u64 end = __pa_symbol(__idmap_text_end);
u64 ptep = __pa_symbol(idmap_ptes);
- u64 limit = __pa_symbol(&idmap_ptes[IDMAP_LEVELS - 1][0]);
+ u64 limit = __pa_symbol(&idmap_ptes[IDMAP_LEVELS_MAX - 1][0]);
__pi_map_range(&ptep, limit, start, end, start, PAGE_KERNEL_ROX,
- IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
- __phys_to_virt(ptep) - ptep);
+ IDMAP_ROOT_LEVEL(PAGE_SHIFT), (pte_t *)idmap_pg_dir,
+ false, __phys_to_virt(ptep) - ptep);
if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0) && !arm64_use_ng_mappings) {
extern u32 __idmap_kpti_flag;
@@ -800,9 +800,8 @@ static void __init create_idmap(void)
* of its synchronization flag in the ID map.
*/
ptep = __pa_symbol(kpti_ptes);
- limit = __pa_symbol(&kpti_ptes[IDMAP_LEVELS - 1][0]);
__pi_map_range(&ptep, limit, pa, pa + sizeof(u32), pa,
- PAGE_KERNEL, IDMAP_ROOT_LEVEL,
+ PAGE_KERNEL, IDMAP_ROOT_LEVEL(PAGE_SHIFT),
(pte_t *)idmap_pg_dir, false,
__phys_to_virt(ptep) - ptep);
}
--
2.43.0
* [RFC PATCH v1 41/57] arm64: Pass desired page size on command line
From: Ryan Roberts @ 2024-10-14 10:58 UTC
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Allow the user to pass the desired page size via the command line as
either "arm64.pagesize=4k", "arm64.pagesize=16k", or
"arm64.pagesize=64k". The specified value is stored in the SW_FEATURE
register as an encoded page shift in a 4-bit field.
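As a rough illustration of the encoding, the sketch below decodes the 4-bit field the same way arm64_pageshift_cmdline() does in this patch (val * 2 + 10, with 0 meaning "no override"). The demo_* function names are hypothetical, and the encode helper is an assumed inverse added only for the example.

```c
#include <assert.h>

/* Decode the 4-bit field; 0 means no override was accepted. */
static inline int demo_decode_page_shift(unsigned int field)
{
	return field ? (int)field * 2 + 10 : 0;
}

/* Hypothetical inverse: page shift back to the stored field value. */
static inline unsigned int demo_encode_page_shift(int page_shift)
{
	return page_shift ? (unsigned int)(page_shift - 10) / 2 : 0;
}
```

So the aliases "arm64_sw.pageshift=1", "=2" and "=3" correspond to page shifts 12, 14 and 16, i.e. 4K, 16K and 64K.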
We only allow setting the page size override if the requested size is
supported by the HW and is within the compile-time [PAGE_SIZE_MIN,
PAGE_SIZE_MAX] range. This second condition means that overrides get
ignored when we have a compile-time page size (because PAGE_SIZE_MIN ==
PAGE_SIZE_MAX).
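The two acceptance conditions can be sketched as below. This is a simplified model, not the patch's pageshift_filter(): the hardware check is reduced to a hypothetical bitmask (DEMO_HW_*), whereas the real filter compares the TGRAN fields of ID_AA64MMFR0_EL1 against their supported ranges.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bitmask of granules the hardware reports as supported. */
#define DEMO_HW_4K	(1u << 0)
#define DEMO_HW_16K	(1u << 1)
#define DEMO_HW_64K	(1u << 2)

/* Accept an override only if it is in the build range and HW-supported. */
static inline bool demo_pageshift_allowed(int shift, int shift_min,
					  int shift_max, unsigned int hw)
{
	if (shift < shift_min || shift > shift_max)
		return false;

	switch (shift) {
	case 12:
		return (hw & DEMO_HW_4K) != 0;
	case 14:
		return (hw & DEMO_HW_16K) != 0;
	case 16:
		return (hw & DEMO_HW_64K) != 0;
	default:
		return false;
	}
}
```

With shift_min == shift_max == 12 (a compile-time 4K build), a request for 64K is rejected by the range check before the hardware is consulted, which is why overrides are ignored on such builds.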
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/cpufeature.h | 11 ++++++++
arch/arm64/kernel/pi/idreg-override.c | 36 +++++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 5584342672715..4edbb586810d7 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -18,6 +18,7 @@
#define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
#define ARM64_SW_FEATURE_OVERRIDE_HVHE 4
#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 8
+#define ARM64_SW_FEATURE_OVERRIDE_PAGESHIFT 12
#ifndef __ASSEMBLY__
@@ -963,6 +964,16 @@ static inline bool arm64_test_sw_feature_override(int feat)
&arm64_sw_feature_override);
}
+static inline int arm64_pageshift_cmdline(void)
+{
+ int val;
+
+ val = arm64_apply_feature_override(0,
+ ARM64_SW_FEATURE_OVERRIDE_PAGESHIFT,
+ 4, &arm64_sw_feature_override);
+ return val ? val * 2 + 10 : 0;
+}
+
static inline bool kaslr_disabled_cmdline(void)
{
return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_NOKASLR);
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index 29d4b6244a6f6..5a38bdb231bc8 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -183,6 +183,38 @@ static bool __init hvhe_filter(u64 val)
ID_AA64MMFR1_EL1_VH_SHIFT));
}
+static bool __init pageshift_filter(u64 val)
+{
+ u64 mmfr0 = read_sysreg_s(SYS_ID_AA64MMFR0_EL1);
+ u32 tgran64 = SYS_FIELD_GET(ID_AA64MMFR0_EL1, TGRAN64, mmfr0);
+ u32 tgran16 = SYS_FIELD_GET(ID_AA64MMFR0_EL1, TGRAN16, mmfr0);
+ u32 tgran4 = SYS_FIELD_GET(ID_AA64MMFR0_EL1, TGRAN4, mmfr0);
+
+ /* pageshift is stored compressed in 4 bit field. */
+ if (val)
+ val = val * 2 + 10;
+
+ if (val < PAGE_SHIFT_MIN || val > PAGE_SHIFT_MAX)
+ return false;
+
+ if (val == ARM64_PAGE_SHIFT_64K &&
+ tgran64 >= ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MIN &&
+ tgran64 <= ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MAX)
+ return true;
+
+ if (val == ARM64_PAGE_SHIFT_16K &&
+ tgran16 >= ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MIN &&
+ tgran16 <= ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MAX)
+ return true;
+
+ if (val == ARM64_PAGE_SHIFT_4K &&
+ tgran4 >= ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN &&
+ tgran4 <= ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MAX)
+ return true;
+
+ return false;
+}
+
static const struct ftr_set_desc sw_features __prel64_initconst = {
.name = "arm64_sw",
.override = &arm64_sw_feature_override,
@@ -190,6 +222,7 @@ static const struct ftr_set_desc sw_features __prel64_initconst = {
FIELD("nokaslr", ARM64_SW_FEATURE_OVERRIDE_NOKASLR, NULL),
FIELD("hvhe", ARM64_SW_FEATURE_OVERRIDE_HVHE, hvhe_filter),
FIELD("rodataoff", ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF, NULL),
+ FIELD("pageshift", ARM64_SW_FEATURE_OVERRIDE_PAGESHIFT, pageshift_filter),
{}
},
};
@@ -225,6 +258,9 @@ static const struct {
{ "rodata=off", "arm64_sw.rodataoff=1" },
{ "arm64.nolva", "id_aa64mmfr2.varange=0" },
{ "arm64.no32bit_el0", "id_aa64pfr0.el0=1" },
+ { "arm64.pagesize=4k", "arm64_sw.pageshift=1" },
+ { "arm64.pagesize=16k", "arm64_sw.pageshift=2" },
+ { "arm64.pagesize=64k", "arm64_sw.pageshift=3" },
};
static int __init parse_hexdigit(const char *p, u64 *v)
--
2.43.0
* [RFC PATCH v1 42/57] arm64: Divorce early init from PAGE_SIZE
From: Ryan Roberts @ 2024-10-14 10:58 UTC
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Refactor all the early code between entrypoint and start_kernel() so
that it does not rely on compile-time configuration of page size. For
now, that code chooses to use the compile-time page size as its
boot-time selected page size, but this will change in the near future to
allow the page size to be specified on the command line.
An initial page size is selected by probe_init_idmap_page_shift(), which
is used for the init_idmap_pg_dir. This selects the largest page size
that the HW supports and which is also in the range of page sizes
allowed by the compile-time config. For now, the allowed range only
covers the compile-time selected page size, but in future boot-time page
size builds, the range will be expanded.
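The selection policy described above (largest supported size within the allowed range) could be modelled roughly as follows. This is a sketch only: the body of probe_init_idmap_page_shift() is not shown in this excerpt, so the demo_* names, the bitmask representation of hardware support and the 0 return for "nothing usable" are all assumptions.

```c
#include <assert.h>

/* Hypothetical bitmask of granules the hardware reports as supported. */
#define DEMO_HW_4K	(1u << 0)
#define DEMO_HW_16K	(1u << 1)
#define DEMO_HW_64K	(1u << 2)

/* Return the largest usable page shift, or 0 if none qualifies. */
static inline int demo_probe_page_shift(int shift_min, int shift_max,
					unsigned int hw)
{
	/* Candidates ordered from largest to smallest page size. */
	static const int shifts[] = { 16, 14, 12 };
	static const unsigned int bits[] = {
		DEMO_HW_64K, DEMO_HW_16K, DEMO_HW_4K
	};
	int i;

	for (i = 0; i < 3; i++) {
		if (shifts[i] < shift_min || shifts[i] > shift_max)
			continue;
		if (hw & bits[i])
			return shifts[i];
	}

	return 0;
}
```

For a compile-time build the range contains a single shift, so the loop can only ever return that one value, matching the "for now" behaviour described above.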
Once the MMU is enabled, we access the command line in the device tree
as before, which allows us to determine the page size requested by the
user, filtered by the same allowed compile-time range of page sizes -
still just the compile-time selected page size for now. If no acceptable
page size was specified on the command line, we fall back to the page
size selected in probe_init_idmap_page_shift(). We then do a dance to
repaint init_idmap_pg_dir for the final page size and for LPA2, if in
use. Finally with that installed, we can continue booting the kernel.
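The fallback rule can be summarised in a few lines. This sketch assumes, as arm64_pageshift_cmdline() does in the previous patch, that a value of 0 means no acceptable override was given; demo_final_page_shift() is a hypothetical name.

```c
#include <assert.h>

/* Prefer the filtered command-line shift; else keep the probed one. */
static inline int demo_final_page_shift(int cmdline_shift, int probed_shift)
{
	return cmdline_shift ? cmdline_shift : probed_shift;
}
```

Because the command-line value has already been filtered against the allowed range, the result always stays within that range.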
For all of this to work, we must replace previous compile-time decisions
with run-time decisions. For the most part, we can do this by looking at
tcr_el1.tg0 to determine the installed page size. These run-time
decisions are not in hot code paths, so I don't anticipate any
performance regressions as a result. I therefore prefer the simplicity
of using the run-time approach even for builds that specify a
compile-time page size.
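For reference, a C rendering of the "look at tcr_el1.tg0" decision might look like the sketch below. The DEMO_* constants restate the architectural TG0 encoding (bits [15:14]: 0b00 is 4K, 0b01 is 64K, 0b10 is 16K) rather than using the kernel's own TCR_TG0_* definitions, and demo_tcr_to_page_shift() is a hypothetical helper, not one added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* TCR_EL1.TG0 occupies bits [15:14]. */
#define DEMO_TCR_TG0_SHIFT	14
#define DEMO_TCR_TG0_MASK	(UINT64_C(3) << DEMO_TCR_TG0_SHIFT)
#define DEMO_TCR_TG0_4K		(UINT64_C(0) << DEMO_TCR_TG0_SHIFT)
#define DEMO_TCR_TG0_64K	(UINT64_C(1) << DEMO_TCR_TG0_SHIFT)
#define DEMO_TCR_TG0_16K	(UINT64_C(2) << DEMO_TCR_TG0_SHIFT)

/* Map a TCR_EL1 value to the installed page shift; 0 if reserved. */
static inline int demo_tcr_to_page_shift(uint64_t tcr)
{
	switch (tcr & DEMO_TCR_TG0_MASK) {
	case DEMO_TCR_TG0_64K:
		return 16;
	case DEMO_TCR_TG0_16K:
		return 14;
	case DEMO_TCR_TG0_4K:
		return 12;
	default:
		return 0;	/* reserved TG0 encoding */
	}
}
```

The assembly macro value_for_page_size in this patch follows the same shape: compare against the 64K and 16K encodings and fall through to 4K.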
Of course, to be able to actually change the early page size, the static
storage for the page tables needs to be sized for the maximum
requirement. Currently it is still sized for the compile-time page
size.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/assembler.h | 50 +++++++-
arch/arm64/include/asm/cpufeature.h | 33 +++--
arch/arm64/include/asm/pgtable-prot.h | 2 +-
arch/arm64/kernel/head.S | 40 ++++--
arch/arm64/kernel/pi/idreg-override.c | 32 +++--
arch/arm64/kernel/pi/map_kernel.c | 87 +++++++++----
arch/arm64/kernel/pi/map_range.c | 171 ++++++++++++++++++++++----
arch/arm64/kernel/pi/pi.h | 61 ++++++++-
arch/arm64/mm/proc.S | 21 ++--
9 files changed, 405 insertions(+), 92 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index bc0b0d75acef7..77c2d707adb1a 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -568,6 +568,14 @@ alternative_endif
mrs \rd, sp_el0
.endm
+/*
+ * Retrieve and return tcr_el1.tg0 in \tg0.
+ */
+ .macro get_tg0, tg0
+ mrs \tg0, tcr_el1
+ and \tg0, \tg0, #TCR_TG0_MASK
+ .endm
+
/*
* If the kernel is built for 52-bit virtual addressing but the hardware only
* supports 48 bits, we cannot program the pgdir address into TTBR1 directly,
@@ -584,12 +592,16 @@ alternative_endif
* ttbr: Value of ttbr to set, modified.
*/
.macro offset_ttbr1, ttbr, tmp
-#if defined(CONFIG_ARM64_VA_BITS_52) && !defined(CONFIG_ARM64_LPA2)
+#if defined(CONFIG_ARM64_VA_BITS_52)
+ get_tg0 \tmp
+ cmp \tmp, #TCR_TG0_64K
+ b.ne .Ldone\@
mrs \tmp, tcr_el1
and \tmp, \tmp, #TCR_T1SZ_MASK
cmp \tmp, #TCR_T1SZ(VA_BITS_MIN)
orr \tmp, \ttbr, #TTBR1_BADDR_4852_OFFSET
csel \ttbr, \tmp, \ttbr, eq
+.Ldone\@:
#endif
.endm
@@ -863,4 +875,40 @@ alternative_cb ARM64_ALWAYS_SYSTEM, spectre_bhb_patch_clearbhb
alternative_cb_end
#endif /* CONFIG_MITIGATE_SPECTRE_BRANCH_HISTORY */
.endm
+
+/*
+ * Given \tg0, populates \val with one of the 3 passed in values, corresponding
+ * to the page size advertised by \tg0.
+ */
+ .macro value_for_page_size, val, tg0, val4k, val16k, val64k
+.Lsz_64k\@:
+ cmp \tg0, #TCR_TG0_64K
+ b.ne .Lsz_16k\@
+ mov \val, #\val64k
+ b .Ldone\@
+.Lsz_16k\@:
+ cmp \tg0, #TCR_TG0_16K
+ b.ne .Lsz_4k\@
+ mov \val, #\val16k
+ b .Ldone\@
+.Lsz_4k\@:
+ mov \val, #\val4k
+.Ldone\@:
+ .endm
+
+ .macro tgran_shift, val, tg0
+ value_for_page_size \val, \tg0, ID_AA64MMFR0_EL1_TGRAN4_SHIFT, ID_AA64MMFR0_EL1_TGRAN16_SHIFT, ID_AA64MMFR0_EL1_TGRAN64_SHIFT
+ .endm
+
+ .macro tgran_min, val, tg0
+ value_for_page_size \val, \tg0, ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN, ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MIN, ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MIN
+ .endm
+
+ .macro tgran_max, val, tg0
+ value_for_page_size \val, \tg0, ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MAX, ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MAX, ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MAX
+ .endm
+
+ .macro tgran_lpa2, val, tg0
+ value_for_page_size \val, \tg0, ID_AA64MMFR0_EL1_TGRAN4_52_BIT, ID_AA64MMFR0_EL1_TGRAN16_52_BIT, -1
+ .endm
#endif /* __ASM_ASSEMBLER_H */
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 4edbb586810d7..2c22cfdc04bc7 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -26,6 +26,7 @@
#include <linux/jump_label.h>
#include <linux/kernel.h>
#include <linux/cpumask.h>
+#include <linux/sizes.h>
/*
* CPU feature register tracking
@@ -1014,10 +1015,13 @@ static inline bool cpu_has_pac(void)
&id_aa64isar2_override);
}
-static inline bool cpu_has_lva(void)
+static inline bool cpu_has_lva(u64 page_size)
{
u64 mmfr2;
+ if (page_size != SZ_64K)
+ return false;
+
mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
mmfr2 &= ~id_aa64mmfr2_override.mask;
mmfr2 |= id_aa64mmfr2_override.val;
@@ -1025,22 +1029,31 @@ static inline bool cpu_has_lva(void)
ID_AA64MMFR2_EL1_VARange_SHIFT);
}
-static inline bool cpu_has_lpa2(void)
+static inline bool cpu_has_lpa2(u64 page_size)
{
-#ifdef CONFIG_ARM64_LPA2
u64 mmfr0;
int feat;
+ int shift;
+ int minval;
+
+ switch (page_size) {
+ case SZ_4K:
+ shift = ID_AA64MMFR0_EL1_TGRAN4_SHIFT;
+ minval = ID_AA64MMFR0_EL1_TGRAN4_52_BIT;
+ break;
+ case SZ_16K:
+ shift = ID_AA64MMFR0_EL1_TGRAN16_SHIFT;
+ minval = ID_AA64MMFR0_EL1_TGRAN16_52_BIT;
+ break;
+ default:
+ return false;
+ }
mmfr0 = read_sysreg(id_aa64mmfr0_el1);
mmfr0 &= ~id_aa64mmfr0_override.mask;
mmfr0 |= id_aa64mmfr0_override.val;
- feat = cpuid_feature_extract_signed_field(mmfr0,
- ID_AA64MMFR0_EL1_TGRAN_SHIFT);
-
- return feat >= ID_AA64MMFR0_EL1_TGRAN_LPA2;
-#else
- return false;
-#endif
+ feat = cpuid_feature_extract_signed_field(mmfr0, shift);
+ return feat >= minval;
}
#endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index b11cfb9fdd379..f8ebf424ca016 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -74,7 +74,7 @@ extern bool arm64_use_ng_mappings;
#define PTE_MAYBE_NG (arm64_use_ng_mappings ? PTE_NG : 0)
#define PMD_MAYBE_NG (arm64_use_ng_mappings ? PMD_SECT_NG : 0)
-#ifndef CONFIG_ARM64_LPA2
+#ifndef CONFIG_ARM64_VA_BITS_52
#define lpa2_is_enabled() false
#define PTE_MAYBE_SHARED PTE_SHARED
#define PMD_MAYBE_SHARED PMD_SECT_S
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 7e17a71fd9e4b..761b7f5633e15 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -88,6 +88,8 @@ SYM_CODE_START(primary_entry)
adrp x1, early_init_stack
mov sp, x1
mov x29, xzr
+ bl __pi_probe_init_idmap_page_shift
+ mov x3, x0
adrp x0, init_idmap_pg_dir
adrp x1, init_idmap_pg_end
mov x2, xzr
@@ -471,11 +473,16 @@ SYM_FUNC_END(set_cpu_boot_mode_flag)
*/
.section ".idmap.text","a"
SYM_FUNC_START(__enable_mmu)
+ get_tg0 x3
+ tgran_shift x4, x3
+ tgran_min x5, x3
+ tgran_max x6, x3
mrs x3, ID_AA64MMFR0_EL1
- ubfx x3, x3, #ID_AA64MMFR0_EL1_TGRAN_SHIFT, 4
- cmp x3, #ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN
+ lsr x3, x3, x4
+ ubfx x3, x3, #0, 4
+ cmp x3, x5
b.lt __no_granule_support
- cmp x3, #ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX
+ cmp x3, x6
b.gt __no_granule_support
phys_to_ttbr x2, x2
msr ttbr0_el1, x2 // load TTBR0
@@ -488,17 +495,32 @@ SYM_FUNC_END(__enable_mmu)
#ifdef CONFIG_ARM64_VA_BITS_52
SYM_FUNC_START(__cpu_secondary_check52bitva)
-#ifndef CONFIG_ARM64_LPA2
+ /*
+ * tcr_el1 is not yet loaded (there is a chicken-and-egg problem) so we
+ * can't figure out LPA2 vs LVA from that. But this is only called for
+ * secondary cpus so tcr_boot_fields has already been populated by the
+ * primary cpu. So grab that and rely on the DS bit to tell us if we are
+ * LPA2 or LVA.
+ */
+ adr_l x1, __pi_tcr_boot_fields
+ ldr x0, [x1]
+ and x1, x0, #TCR_DS
+ cbnz x1, .Llpa2
+.Llva:
mrs_s x0, SYS_ID_AA64MMFR2_EL1
and x0, x0, ID_AA64MMFR2_EL1_VARange_MASK
cbnz x0, 2f
-#else
+ b .Lfail
+.Llpa2:
+ and x0, x0, #TCR_TG0_MASK
+ tgran_shift x1, x0
+ tgran_lpa2 x2, x0
mrs x0, id_aa64mmfr0_el1
- sbfx x0, x0, #ID_AA64MMFR0_EL1_TGRAN_SHIFT, 4
- cmp x0, #ID_AA64MMFR0_EL1_TGRAN_LPA2
+ lsr x0, x0, x1
+ sbfx x0, x0, #0, 4
+ cmp x0, x2
b.ge 2f
-#endif
-
+.Lfail:
update_early_cpu_boot_status \
CPU_STUCK_IN_KERNEL | CPU_STUCK_REASON_52_BIT_VA, x0, x1
1: wfe
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index 5a38bdb231bc8..0685c0a3255e2 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -62,20 +62,34 @@ static const struct ftr_set_desc mmfr1 __prel64_initconst = {
static bool __init mmfr2_varange_filter(u64 val)
{
- int __maybe_unused feat;
+ u64 mmfr0;
+ int feat;
+ int shift;
+ int minval;
if (val)
return false;
-#ifdef CONFIG_ARM64_LPA2
- feat = cpuid_feature_extract_signed_field(read_sysreg(id_aa64mmfr0_el1),
- ID_AA64MMFR0_EL1_TGRAN_SHIFT);
- if (feat >= ID_AA64MMFR0_EL1_TGRAN_LPA2) {
- id_aa64mmfr0_override.val |=
- (ID_AA64MMFR0_EL1_TGRAN_LPA2 - 1) << ID_AA64MMFR0_EL1_TGRAN_SHIFT;
- id_aa64mmfr0_override.mask |= 0xfU << ID_AA64MMFR0_EL1_TGRAN_SHIFT;
+ mmfr0 = read_sysreg(id_aa64mmfr0_el1);
+
+ /* Remove LPA2 support for 4K granule. */
+ shift = ID_AA64MMFR0_EL1_TGRAN4_SHIFT;
+ minval = ID_AA64MMFR0_EL1_TGRAN4_52_BIT;
+ feat = cpuid_feature_extract_signed_field(mmfr0, shift);
+ if (feat >= minval) {
+ id_aa64mmfr0_override.val |= (minval - 1) << shift;
+ id_aa64mmfr0_override.mask |= 0xfU << shift;
+ }
+
+ /* Remove LPA2 support for 16K granule. */
+ shift = ID_AA64MMFR0_EL1_TGRAN16_SHIFT;
+ minval = ID_AA64MMFR0_EL1_TGRAN16_52_BIT;
+ feat = cpuid_feature_extract_signed_field(mmfr0, shift);
+ if (feat >= minval) {
+ id_aa64mmfr0_override.val |= (minval - 1) << shift;
+ id_aa64mmfr0_override.mask |= 0xfU << shift;
}
-#endif
+
return true;
}
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index a53fc225d2d0d..7a62d4238449d 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -133,10 +133,18 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
idmap_cpu_replace_ttbr1(swapper_pg_dir);
}
-static void noinline __section(".idmap.text") set_ttbr0_for_lpa2(u64 ttbr)
+static void noinline __section(".idmap.text") set_ttbr0(u64 ttbr, bool use_lpa2,
+ int page_shift)
{
u64 sctlr = read_sysreg(sctlr_el1);
- u64 tcr = read_sysreg(tcr_el1) | TCR_DS;
+ u64 tcr = read_sysreg(tcr_el1);
+ u64 boot_fields = early_page_shift_to_tcr_tgx(page_shift);
+
+ if (use_lpa2)
+ boot_fields |= TCR_DS;
+
+ tcr &= ~(TCR_TG0_MASK | TCR_TG1_MASK | TCR_DS);
+ tcr |= boot_fields;
asm(" msr sctlr_el1, %0 ;"
" isb ;"
@@ -149,57 +157,66 @@ static void noinline __section(".idmap.text") set_ttbr0_for_lpa2(u64 ttbr)
" msr sctlr_el1, %3 ;"
" isb ;"
:: "r"(sctlr & ~SCTLR_ELx_M), "r"(ttbr), "r"(tcr), "r"(sctlr));
+
+ /* Stash this for __cpu_setup to configure secondary cpus' tcr_el1. */
+ set_tcr_boot_fields(boot_fields);
}
-static void __init remap_idmap_for_lpa2(void)
+static void __init remap_idmap(bool use_lpa2, int page_shift)
{
/* clear the bits that change meaning once LPA2 is turned on */
- pteval_t mask = PTE_SHARED;
+ pteval_t mask = use_lpa2 ? PTE_SHARED : 0;
/*
- * We have to clear bits [9:8] in all block or page descriptors in the
- * initial ID map, as otherwise they will be (mis)interpreted as
+ * For LPA2, we have to clear bits [9:8] in all block or page descriptors
+ * in the initial ID map, as otherwise they will be (mis)interpreted as
* physical address bits once we flick the LPA2 switch (TCR.DS). Since
* we cannot manipulate live descriptors in that way without creating
* potential TLB conflicts, let's create another temporary ID map in a
* LPA2 compatible fashion, and update the initial ID map while running
* from that.
*/
- create_init_idmap(init_pg_dir, init_pg_end, mask);
+ create_init_idmap(init_pg_dir, init_pg_end, mask, page_shift);
dsb(ishst);
- set_ttbr0_for_lpa2((u64)init_pg_dir);
+ set_ttbr0((u64)init_pg_dir, use_lpa2, page_shift);
/*
- * Recreate the initial ID map with the same granularity as before.
- * Don't bother with the FDT, we no longer need it after this.
+ * Recreate the initial ID map with new page size and, if LPA2 is in
+ * use, bits [9:8] cleared.
*/
memset(init_idmap_pg_dir, 0,
(u64)init_idmap_pg_end - (u64)init_idmap_pg_dir);
- create_init_idmap(init_idmap_pg_dir, init_idmap_pg_end, mask);
+ create_init_idmap(init_idmap_pg_dir, init_idmap_pg_end, mask, page_shift);
dsb(ishst);
/* switch back to the updated initial ID map */
- set_ttbr0_for_lpa2((u64)init_idmap_pg_dir);
+ set_ttbr0((u64)init_idmap_pg_dir, use_lpa2, page_shift);
/* wipe the temporary ID map from memory */
memset(init_pg_dir, 0, (u64)init_pg_end - (u64)init_pg_dir);
}
-static void __init map_fdt(u64 fdt)
+static void __init map_fdt(u64 fdt, int page_shift)
{
static u8 ptes[INIT_IDMAP_FDT_SIZE_MAX] __initdata __aligned(PAGE_SIZE);
+ static bool first_time __initdata = true;
u64 limit = (u64)&ptes[INIT_IDMAP_FDT_SIZE_MAX];
u64 efdt = fdt + MAX_FDT_SIZE;
u64 ptep = (u64)ptes;
+ /* ptes is zero-initialised, so only repeat calls need to wipe it. */
+ if (!first_time)
+ memset(ptes, 0, sizeof(ptes));
+ first_time = false;
+
/*
* Map up to MAX_FDT_SIZE bytes, but avoid overlap with
* the kernel image.
*/
map_range(&ptep, limit, fdt,
(u64)_text > fdt ? min((u64)_text, efdt) : efdt,
- fdt, PAGE_KERNEL, IDMAP_ROOT_LEVEL(PAGE_SHIFT),
+ fdt, PAGE_KERNEL, IDMAP_ROOT_LEVEL(page_shift),
(pte_t *)init_idmap_pg_dir, false, 0);
dsb(ishst);
}
@@ -207,13 +224,16 @@ static void __init map_fdt(u64 fdt)
asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
{
static char const chosen_str[] __initconst = "/chosen";
+ u64 early_page_shift = early_tcr_tg0_to_page_shift();
u64 va_base, pa_base = (u64)&_text;
u64 kaslr_offset = pa_base % MIN_KIMG_ALIGN;
- int root_level = 4 - CONFIG_PGTABLE_LEVELS;
int va_bits = VA_BITS;
+ bool use_lpa2 = false;
+ int root_level;
+ int page_shift;
int chosen;
- map_fdt((u64)fdt);
+ map_fdt((u64)fdt, early_page_shift);
/* Clear BSS and the initial page tables */
memset(__bss_start, 0, (u64)init_pg_end - (u64)__bss_start);
@@ -222,16 +242,37 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
chosen = fdt_path_offset(fdt, chosen_str);
init_feature_override(boot_status, fdt, chosen);
- if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && !cpu_has_lva()) {
- va_bits = VA_BITS_MIN;
- } else if (IS_ENABLED(CONFIG_ARM64_LPA2) && !cpu_has_lpa2()) {
- va_bits = VA_BITS_MIN;
- root_level++;
+ /* Get page_shift from cmdline, falling back to early_page_shift. */
+ page_shift = arm64_pageshift_cmdline();
+ if (!page_shift)
+ page_shift = early_page_shift;
+
+ if (va_bits > 48) {
+ u64 page_size = early_page_size(page_shift);
+
+ if (page_size == SZ_64K) {
+ if (!cpu_has_lva(page_size))
+ va_bits = VA_BITS_MIN;
+ } else {
+ use_lpa2 = cpu_has_lpa2(page_size);
+ if (!use_lpa2)
+ va_bits = VA_BITS_MIN;
+ }
}
if (va_bits > VA_BITS_MIN)
sysreg_clear_set(tcr_el1, TCR_T1SZ_MASK, TCR_T1SZ(va_bits));
+ /*
+ * This will update tg0/tg1 in tcr for the final page size. After this,
+ * PAGE_SIZE and friends can be used safely. kaslr_early_init(), below,
+ * is the first such user.
+ */
+ if (use_lpa2 || page_shift != early_page_shift) {
+ remap_idmap(use_lpa2, page_shift);
+ map_fdt((u64)fdt, page_shift);
+ }
+
/*
* The virtual KASLR displacement modulo 2MiB is decided by the
* physical placement of the image, as otherwise, we might not be able
@@ -248,9 +289,7 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
kaslr_offset |= kaslr_seed & ~(MIN_KIMG_ALIGN - 1);
}
- if (IS_ENABLED(CONFIG_ARM64_LPA2) && va_bits > VA_BITS_MIN)
- remap_idmap_for_lpa2();
-
va_base = KIMAGE_VADDR + kaslr_offset;
+ root_level = 4 - PGTABLE_LEVELS(page_shift, va_bits);
map_kernel(kaslr_offset, va_base - pa_base, root_level);
}
diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
index b62d2e3135f81..be5470a969a47 100644
--- a/arch/arm64/kernel/pi/map_range.c
+++ b/arch/arm64/kernel/pi/map_range.c
@@ -11,6 +11,34 @@
#include "pi.h"
+static inline u64 __init pte_get_oa(pte_t pte, int page_shift, bool oa52bit)
+{
+ pteval_t pv = pte_val(pte);
+
+#ifdef CONFIG_ARM64_PA_BITS_52
+ if (oa52bit) {
+ if (early_page_size(page_shift) == SZ_64K)
+ return (pv & GENMASK(47, 16)) |
+ ((pv & GENMASK(15, 12)) << 36);
+ return (pv & GENMASK(49, 12)) | ((pv & GENMASK(9, 8)) << 42);
+ }
+#endif
+ return pv & GENMASK(47, 12);
+}
+
+static inline u64 __init pte_prep_oa(u64 oa, int page_shift, bool oa52bit)
+{
+#ifdef CONFIG_ARM64_PA_BITS_52
+ if (oa52bit) {
+ if (early_page_size(page_shift) == SZ_64K)
+ return (oa & GENMASK(47, 16)) |
+ ((oa >> 36) & GENMASK(15, 12));
+ return (oa & GENMASK(49, 12)) | ((oa >> 42) & GENMASK(9, 8));
+ }
+#endif
+ return oa;
+}
+
static void __init mmuoff_data_clean(void)
{
bool cache_ena = !!(read_sysreg(sctlr_el1) & SCTLR_ELx_C);
@@ -35,8 +63,19 @@ static void __init report_cpu_stuck(long val)
cpu_park_loop();
}
+u64 __section(".mmuoff.data.read") tcr_boot_fields;
+
+void __init set_tcr_boot_fields(u64 val)
+{
+ WRITE_ONCE(tcr_boot_fields, val);
+
+ /* Ensure the visibility of the new value */
+ dsb(ishst);
+ mmuoff_data_clean();
+}
+
/**
- * map_range - Map a contiguous range of physical pages into virtual memory
+ * __map_range - Map a contiguous range of physical pages into virtual memory
*
* @pte: Address of physical pointer to array of pages to
* allocate page tables from
@@ -50,21 +89,28 @@ static void __init report_cpu_stuck(long val)
* @may_use_cont: Whether the use of the contiguous attribute is allowed
* @va_offset: Offset between a physical page and its current mapping
* in the VA space
+ * @page_shift: Page size (as a shift) to create page table for
+ * @oa52bit: Whether to store output addresses in 52-bit format
*/
-void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
- pgprot_t prot, int level, pte_t *tbl, bool may_use_cont,
- u64 va_offset)
+static void __init __map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
+ pgprot_t prot, int level, pte_t *tbl,
+ bool may_use_cont, u64 va_offset, int page_shift,
+ bool oa52bit)
{
- u64 cmask = (level == 3) ? CONT_PTE_SIZE - 1 : U64_MAX;
+ const u64 page_size = early_page_size(page_shift);
+ const u64 page_mask = early_page_mask(page_shift);
+ const u64 cont_pte_size = early_cont_pte_size(page_shift);
+ const u64 ptrs_per_pte = early_ptrs_per_pte(page_shift);
+ u64 cmask = (level == 3) ? cont_pte_size - 1 : U64_MAX;
u64 protval = pgprot_val(prot) & ~PTE_TYPE_MASK;
- int lshift = (3 - level) * (PAGE_SHIFT - 3);
- u64 lmask = (PAGE_SIZE << lshift) - 1;
+ int lshift = (3 - level) * (page_shift - 3);
+ u64 lmask = (page_size << lshift) - 1;
- start &= PAGE_MASK;
- pa &= PAGE_MASK;
+ start &= page_mask;
+ pa &= page_mask;
/* Advance tbl to the entry that covers start */
- tbl += (start >> (lshift + PAGE_SHIFT)) % PTRS_PER_PTE;
+ tbl += (start >> (lshift + page_shift)) % ptrs_per_pte;
/*
* Set the right block/page bits for this level unless we are
@@ -74,7 +120,7 @@ void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
protval |= (level < 3) ? PMD_TYPE_SECT : PTE_TYPE_PAGE;
while (start < end) {
- u64 next = min((start | lmask) + 1, PAGE_ALIGN(end));
+ u64 next = min((start | lmask) + 1, ALIGN(end, page_size));
if (level < 3 && (start | next | pa) & lmask) {
/*
@@ -82,20 +128,20 @@ void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
* table mapping if necessary and recurse.
*/
if (pte_none(*tbl)) {
- u64 size = PTRS_PER_PTE * sizeof(pte_t);
+ u64 size = ptrs_per_pte * sizeof(pte_t);
if (*pte + size > limit) {
report_cpu_stuck(
CPU_STUCK_REASON_NO_PGTABLE_MEM);
}
- *tbl = __pte(__phys_to_pte_val(*pte) |
+ *tbl = __pte(pte_prep_oa(*pte, page_shift, oa52bit) |
PMD_TYPE_TABLE | PMD_TABLE_UXN);
*pte += size;
}
- map_range(pte, limit, start, next, pa, prot, level + 1,
- (pte_t *)(__pte_to_phys(*tbl) + va_offset),
- may_use_cont, va_offset);
+ __map_range(pte, limit, start, next, pa, prot, level + 1,
+ (pte_t *)(pte_get_oa(*tbl, page_shift, oa52bit) + va_offset),
+ may_use_cont, va_offset, page_shift, oa52bit);
} else {
/*
* Start a contiguous range if start and pa are
@@ -112,7 +158,8 @@ void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
protval &= ~PTE_CONT;
/* Put down a block or page mapping */
- *tbl = __pte(__phys_to_pte_val(pa) | protval);
+ *tbl = __pte(pte_prep_oa(pa, page_shift, oa52bit) |
+ protval);
}
pa += next - start;
start = next;
@@ -120,22 +167,96 @@ void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
}
}
+/**
+ * map_range - Map a contiguous range of physical pages into virtual memory
+ *
+ * As per __map_range(), except it uses the page_shift and oa52bit of the
+ * currently tcr-installed granule size instead of passing explicitly.
+ */
+void __init map_range(u64 *pte, u64 limit, u64 start, u64 end, u64 pa,
+ pgprot_t prot, int level, pte_t *tbl, bool may_use_cont,
+ u64 va_offset)
+{
+ int page_shift = early_tcr_tg0_to_page_shift();
+ bool oa52bit = false;
+
+#ifdef CONFIG_ARM64_PA_BITS_52
+ /*
+ * We can safely assume 52bit for 64K pages because if it turns out to
+ * be 48bit, it's still safe to treat [51:48] as address bits because
+ * they are 0.
+ */
+ if (early_page_size(page_shift) == SZ_64K)
+ oa52bit = true;
+ /*
+ * For 4K and 16K, on the other hand, those bits are used for something
+ * else when LPA2 is not explicitly enabled. Deliberately not using
+ * read_tcr() since it is marked pure, and at this point, the tcr is not
+ * yet stable.
+ */
+ else if (read_sysreg(tcr_el1) & TCR_DS)
+ oa52bit = true;
+#endif
+
+ __map_range(pte, limit, start, end, pa, prot, level, tbl, may_use_cont,
+ va_offset, page_shift, oa52bit);
+}
+
+asmlinkage u64 __init probe_init_idmap_page_shift(void)
+{
+ u64 mmfr0 = read_sysreg_s(SYS_ID_AA64MMFR0_EL1);
+ u32 tgran64 = SYS_FIELD_GET(ID_AA64MMFR0_EL1, TGRAN64, mmfr0);
+ u32 tgran16 = SYS_FIELD_GET(ID_AA64MMFR0_EL1, TGRAN16, mmfr0);
+ u32 tgran4 = SYS_FIELD_GET(ID_AA64MMFR0_EL1, TGRAN4, mmfr0);
+ u64 page_shift;
+
+ /*
+ * Select the largest page size supported by the HW, which is also
+ * allowed by the compilation config.
+ */
+ if (ARM64_PAGE_SHIFT_64K >= PAGE_SHIFT_MIN &&
+ ARM64_PAGE_SHIFT_64K <= PAGE_SHIFT_MAX &&
+ tgran64 >= ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MIN &&
+ tgran64 <= ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MAX)
+ page_shift = ARM64_PAGE_SHIFT_64K;
+ else if (ARM64_PAGE_SHIFT_16K >= PAGE_SHIFT_MIN &&
+ ARM64_PAGE_SHIFT_16K <= PAGE_SHIFT_MAX &&
+ tgran16 >= ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MIN &&
+ tgran16 <= ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MAX)
+ page_shift = ARM64_PAGE_SHIFT_16K;
+ else if (ARM64_PAGE_SHIFT_4K >= PAGE_SHIFT_MIN &&
+ ARM64_PAGE_SHIFT_4K <= PAGE_SHIFT_MAX &&
+ tgran4 >= ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN &&
+ tgran4 <= ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MAX)
+ page_shift = ARM64_PAGE_SHIFT_4K;
+ else
+ report_cpu_stuck(CPU_STUCK_REASON_NO_GRAN);
+
+ /* Stash this for __cpu_setup to configure primary cpu's tcr_el1. */
+ set_tcr_boot_fields(early_page_shift_to_tcr_tgx(page_shift));
+
+ return page_shift;
+}
+
asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pgd_t *pg_end,
- pteval_t clrmask)
+ pteval_t clrmask, int page_shift)
{
- u64 ptep = (u64)pg_dir + PAGE_SIZE;
+ const u64 page_size = early_page_size(page_shift);
+ u64 ptep = (u64)pg_dir + page_size;
pgprot_t text_prot = PAGE_KERNEL_ROX;
pgprot_t data_prot = PAGE_KERNEL;
pgprot_val(text_prot) &= ~clrmask;
pgprot_val(data_prot) &= ~clrmask;
- map_range(&ptep, (u64)pg_end, (u64)_stext, (u64)__initdata_begin,
- (u64)_stext, text_prot,
- IDMAP_ROOT_LEVEL(PAGE_SHIFT), (pte_t *)pg_dir, false, 0);
- map_range(&ptep, (u64)pg_end, (u64)__initdata_begin, (u64)_end,
- (u64)__initdata_begin, data_prot,
- IDMAP_ROOT_LEVEL(PAGE_SHIFT), (pte_t *)pg_dir, false, 0);
+ __map_range(&ptep, (u64)pg_end, (u64)_stext, (u64)__initdata_begin,
+ (u64)_stext, text_prot,
+ IDMAP_ROOT_LEVEL(page_shift), (pte_t *)pg_dir, false, 0,
+ page_shift, false);
+ __map_range(&ptep, (u64)pg_end, (u64)__initdata_begin, (u64)_end,
+ (u64)__initdata_begin, data_prot,
+ IDMAP_ROOT_LEVEL(page_shift), (pte_t *)pg_dir, false, 0,
+ page_shift, false);
return ptep;
}
diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h
index 20fe0941cb8ee..15c14d0aa6c63 100644
--- a/arch/arm64/kernel/pi/pi.h
+++ b/arch/arm64/kernel/pi/pi.h
@@ -23,6 +23,62 @@ extern bool dynamic_scs_is_enabled;
extern pgd_t init_idmap_pg_dir[], init_idmap_pg_end[];
+static inline u64 early_page_size(int page_shift)
+{
+ return 1UL << page_shift;
+}
+
+static inline u64 early_page_mask(int page_shift)
+{
+ return ~(early_page_size(page_shift) - 1);
+}
+
+static inline u64 early_cont_pte_size(int page_shift)
+{
+ switch (page_shift) {
+ case 16: /* 64K */
+ case 14: /* 16K */
+ return SZ_2M;
+ default: /* 12 4K */
+ return SZ_64K;
+ }
+}
+
+static inline u64 early_ptrs_per_pte(int page_shift)
+{
+ return 1UL << (page_shift - 3);
+}
+
+static inline int early_tcr_tg0_to_page_shift(void)
+{
+ /*
+ * Deliberately not using read_tcr() since it is marked pure, and at
+ * this point, the tcr is not yet stable.
+ */
+ u64 tg0 = read_sysreg(tcr_el1) & TCR_TG0_MASK;
+
+ switch (tg0) {
+ case TCR_TG0_64K:
+ return 16;
+ case TCR_TG0_16K:
+ return 14;
+ default: /* TCR_TG0_4K */
+ return 12;
+ }
+}
+
+static inline u64 early_page_shift_to_tcr_tgx(int page_shift)
+{
+ switch (early_page_size(page_shift)) {
+ case SZ_64K:
+ return TCR_TG0_64K | TCR_TG1_64K;
+ case SZ_16K:
+ return TCR_TG0_16K | TCR_TG1_16K;
+ default:
+ return TCR_TG0_4K | TCR_TG1_4K;
+ }
+}
+
void init_feature_override(u64 boot_status, const void *fdt, int chosen);
u64 kaslr_early_init(void *fdt, int chosen);
void relocate_kernel(u64 offset);
@@ -33,4 +89,7 @@ void map_range(u64 *pgd, u64 limit, u64 start, u64 end, u64 pa, pgprot_t prot,
asmlinkage void early_map_kernel(u64 boot_status, void *fdt);
-asmlinkage u64 create_init_idmap(pgd_t *pgd, pgd_t *pg_end, pteval_t clrmask);
+void set_tcr_boot_fields(u64 val);
+asmlinkage u64 probe_init_idmap_page_shift(void);
+asmlinkage u64 create_init_idmap(pgd_t *pgd, pgd_t *pg_end, pteval_t clrmask,
+ int page_shift);
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index f4bc6c5bac062..ab5aa84923524 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -22,14 +22,6 @@
#include <asm/smp.h>
#include <asm/sysreg.h>
-#ifdef CONFIG_ARM64_64K_PAGES
-#define TCR_TG_FLAGS TCR_TG0_64K | TCR_TG1_64K
-#elif defined(CONFIG_ARM64_16K_PAGES)
-#define TCR_TG_FLAGS TCR_TG0_16K | TCR_TG1_16K
-#else /* CONFIG_ARM64_4K_PAGES */
-#define TCR_TG_FLAGS TCR_TG0_4K | TCR_TG1_4K
-#endif
-
#ifdef CONFIG_RANDOMIZE_BASE
#define TCR_KASLR_FLAGS TCR_NFD1
#else
@@ -469,18 +461,23 @@ SYM_FUNC_START(__cpu_setup)
tcr .req x16
mov_q mair, MAIR_EL1_SET
mov_q tcr, TCR_T0SZ(IDMAP_VA_BITS) | TCR_T1SZ(VA_BITS_MIN) | TCR_CACHE_FLAGS | \
- TCR_SMP_FLAGS | TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
+ TCR_SMP_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
+ /*
+ * Insert the boot-time determined fields (TG0, TG1 and DS), which are
+ * cached in the tcr_boot_fields variable.
+ */
+ adr_l x2, __pi_tcr_boot_fields
+ ldr x3, [x2]
+ orr tcr, tcr, x3
+
tcr_clear_errata_bits tcr, x9, x5
#ifdef CONFIG_ARM64_VA_BITS_52
mov x9, #64 - VA_BITS
alternative_if ARM64_HAS_VA52
tcr_set_t1sz tcr, x9
-#ifdef CONFIG_ARM64_LPA2
- orr tcr, tcr, #TCR_DS
-#endif
alternative_else_nop_endif
#endif
--
2.43.0
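
For readers following the map_range.c hunk above, here is a minimal userspace
sketch of the output-address packing that pte_prep_oa() and pte_get_oa()
perform. It is not the kernel code: the demo_* names and the GENMASK64() macro
are hypothetical stand-ins invented for illustration, and the
CONFIG_ARM64_PA_BITS_52 guard is left out. The bit positions are the ones used
in the patch: with 64K pages, OA[51:48] is stored in descriptor bits [15:12];
with 4K/16K pages and LPA2, OA[51:50] is stored in bits [9:8].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(): bits h..l set. */
#define GENMASK64(h, l) \
	((~UINT64_C(0) >> (63 - (h))) & (~UINT64_C(0) << (l)))

/* Pack a physical address into descriptor layout (mirrors pte_prep_oa()). */
static uint64_t demo_prep_oa(uint64_t oa, int page_shift, bool oa52bit)
{
	if (!oa52bit)
		return oa;
	if (page_shift == 16)	/* 64K: OA[51:48] moves to bits [15:12] */
		return (oa & GENMASK64(47, 16)) |
		       ((oa >> 36) & GENMASK64(15, 12));
	/* 4K/16K with LPA2: OA[51:50] moves to bits [9:8] */
	return (oa & GENMASK64(49, 12)) | ((oa >> 42) & GENMASK64(9, 8));
}

/* Recover the physical address from a descriptor (mirrors pte_get_oa()). */
static uint64_t demo_get_oa(uint64_t pv, int page_shift, bool oa52bit)
{
	if (!oa52bit)
		return pv & GENMASK64(47, 12);
	if (page_shift == 16)
		return (pv & GENMASK64(47, 16)) |
		       ((pv & GENMASK64(15, 12)) << 36);
	return (pv & GENMASK64(49, 12)) | ((pv & GENMASK64(9, 8)) << 42);
}

/* Round-trip one 52-bit address through each encoding. */
static void demo_oa_selfcheck(void)
{
	const uint64_t pa_64k = UINT64_C(0x000F123456780000);
	const uint64_t pa_4k = UINT64_C(0x000C123456789000);

	assert(demo_prep_oa(pa_64k, 16, true) == UINT64_C(0x000012345678F000));
	assert(demo_get_oa(demo_prep_oa(pa_64k, 16, true), 16, true) == pa_64k);
	assert(demo_prep_oa(pa_4k, 12, true) == UINT64_C(0x0000123456789300));
	assert(demo_get_oa(demo_prep_oa(pa_4k, 12, true), 12, true) == pa_4k);
}
```

Calling demo_oa_selfcheck() from a test harness exercises both layouts. The
sketch also shows why create_init_idmap() can pass oa52bit=false: for
addresses below 2^48 the non-52-bit path stores the address unchanged.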
^ permalink raw reply related [flat|nested] 196+ messages in thread
* [RFC PATCH v1 43/57] arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (40 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 42/57] arm64: Divorce early init from PAGE_SIZE Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX Ryan Roberts
` (16 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Oliver Upton, Thomas Gleixner, Will Deacon
Cc: Ryan Roberts, kasan-dev, kvmarm, linux-arm-kernel, linux-kernel,
linux-mm
There are a number of places that define macros conditionally depending
on which of the CONFIG_ARM64_*K_PAGES macros are defined. But in
preparation for supporting boot-time page size selection, we will no
longer be able to make these decisions at compile time.
So let's refactor the code to check the size of PAGE_SIZE using the
ternary operator. This approach will still resolve to compile-time
constants when configured for a compile-time page size, but it will also
work when we turn PAGE_SIZE into a run-time value. Additionally,
IS_ENABLED(CONFIG_ARM64_*K_PAGES) instances are converted to test the
size of PAGE_SIZE.
Also modify ARM64_HAS_VA52 capability detection to use a custom match
function, which chooses which feature register and field to check based
on PAGE_SIZE. The compiler will eliminate the checks for the other page
sizes when building for a compile-time page size, and the function also
copes with the page size being set at boot time.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
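
As a rough illustration of the ternary pattern described in the commit
message, the userspace sketch below copies three of the converted macros from
this patch and evaluates them against a variable page size. The
demo_page_size variable and demo_check_page_size_macros() are hypothetical
names invented for the example; in the kernel, PAGE_SIZE is either a
compile-time constant or (later in the series) a boot-time value. One
consequence visible in the diff: such macros can no longer be tested with
preprocessor #if, which is why ARM64_MEMSTART_ALIGN also becomes a ternary.

```c
#include <assert.h>

#define SZ_4K	0x00001000UL
#define SZ_16K	0x00004000UL
#define SZ_64K	0x00010000UL

/*
 * Hypothetical stand-in for a boot-time PAGE_SIZE. With a constant here
 * instead, the compiler would fold each ternary below to a constant.
 */
static unsigned long demo_page_size = SZ_4K;
#undef PAGE_SIZE
#define PAGE_SIZE demo_page_size

/* Definitions as converted by this patch. */
#define SECTION_SIZE_BITS (PAGE_SIZE == SZ_64K ? 29 : 27)
#define KVM_PGTABLE_MIN_BLOCK_LEVEL (PAGE_SIZE == SZ_4K ? 1 : 2)
#define VTCR_EL2_TGRAN_SL0_BASE \
	(PAGE_SIZE == SZ_64K ? 3UL : (PAGE_SIZE == SZ_16K ? 3UL : 2UL))

/* Evaluate the macros for each supported page size. */
static void demo_check_page_size_macros(void)
{
	demo_page_size = SZ_4K;
	assert(SECTION_SIZE_BITS == 27);
	assert(KVM_PGTABLE_MIN_BLOCK_LEVEL == 1);
	assert(VTCR_EL2_TGRAN_SL0_BASE == 2UL);

	demo_page_size = SZ_16K;
	assert(SECTION_SIZE_BITS == 27);
	assert(KVM_PGTABLE_MIN_BLOCK_LEVEL == 2);
	assert(VTCR_EL2_TGRAN_SL0_BASE == 3UL);

	demo_page_size = SZ_64K;
	assert(SECTION_SIZE_BITS == 29);
	assert(KVM_PGTABLE_MIN_BLOCK_LEVEL == 2);
	assert(VTCR_EL2_TGRAN_SL0_BASE == 3UL);
}
```

The same macro text therefore serves both build modes; only the definition of
PAGE_SIZE changes.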
arch/arm64/include/asm/kvm_arm.h | 21 ++++-------
arch/arm64/include/asm/kvm_pgtable.h | 6 +---
arch/arm64/include/asm/memory.h | 7 ++--
arch/arm64/include/asm/processor.h | 10 +++---
arch/arm64/include/asm/sparsemem.h | 11 ++----
arch/arm64/include/asm/sysreg.h | 54 ++++++++++++++++++----------
arch/arm64/kernel/cpufeature.c | 43 +++++++++++++---------
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/init.c | 20 +++++------
arch/arm64/mm/kasan_init.c | 8 ++---
arch/arm64/mm/mmu.c | 2 +-
drivers/irqchip/irq-gic-v3-its.c | 2 +-
12 files changed, 94 insertions(+), 92 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index d81cc746e0ebd..08155dc17ad17 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -189,22 +189,13 @@
* Entry_Level = 4 - Number_of_levels.
*
*/
-#ifdef CONFIG_ARM64_64K_PAGES
+#define VTCR_EL2_TGRAN \
+ (PAGE_SIZE == SZ_64K ? \
+ VTCR_EL2_TG0_64K : \
+ (PAGE_SIZE == SZ_16K ? VTCR_EL2_TG0_16K : VTCR_EL2_TG0_4K))
-#define VTCR_EL2_TGRAN VTCR_EL2_TG0_64K
-#define VTCR_EL2_TGRAN_SL0_BASE 3UL
-
-#elif defined(CONFIG_ARM64_16K_PAGES)
-
-#define VTCR_EL2_TGRAN VTCR_EL2_TG0_16K
-#define VTCR_EL2_TGRAN_SL0_BASE 3UL
-
-#else /* 4K */
-
-#define VTCR_EL2_TGRAN VTCR_EL2_TG0_4K
-#define VTCR_EL2_TGRAN_SL0_BASE 2UL
-
-#endif
+#define VTCR_EL2_TGRAN_SL0_BASE \
+ (PAGE_SIZE == SZ_64K ? 3UL : (PAGE_SIZE == SZ_16K ? 3UL : 2UL))
#define VTCR_EL2_LVLS_TO_SL0(levels) \
((VTCR_EL2_TGRAN_SL0_BASE - (4 - (levels))) << VTCR_EL2_SL0_SHIFT)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 19278dfe79782..796614bf59e78 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -20,11 +20,7 @@
* - 16K (level 2): 32MB
* - 64K (level 2): 512MB
*/
-#ifdef CONFIG_ARM64_4K_PAGES
-#define KVM_PGTABLE_MIN_BLOCK_LEVEL 1
-#else
-#define KVM_PGTABLE_MIN_BLOCK_LEVEL 2
-#endif
+#define KVM_PGTABLE_MIN_BLOCK_LEVEL (PAGE_SIZE == SZ_4K ? 1 : 2)
#define kvm_lpa2_is_enabled() system_supports_lpa2()
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 54fb014eba058..6aa97fa22dc30 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -188,11 +188,8 @@
#define MT_S2_FWB_NORMAL_NC 5
#define MT_S2_FWB_DEVICE_nGnRE 1
-#ifdef CONFIG_ARM64_4K_PAGES
-#define IOREMAP_MAX_ORDER (PUD_SHIFT)
-#else
-#define IOREMAP_MAX_ORDER (PMD_SHIFT)
-#endif
+#define IOREMAP_MAX_ORDER \
+ (PAGE_SIZE == SZ_4K ? PUD_SHIFT : PMD_SHIFT)
/*
* Open-coded (swapper_pg_dir - reserved_pg_dir) as this cannot be calculated
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f77371232d8c6..444694a4e6733 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -55,15 +55,15 @@
#define TASK_SIZE_MAX (UL(1) << VA_BITS)
#ifdef CONFIG_COMPAT
-#if defined(CONFIG_ARM64_64K_PAGES) && defined(CONFIG_KUSER_HELPERS)
+#if defined(CONFIG_KUSER_HELPERS)
/*
- * With CONFIG_ARM64_64K_PAGES enabled, the last page is occupied
- * by the compat vectors page.
+ * With 64K pages in use, the last page is occupied by the compat vectors page.
*/
-#define TASK_SIZE_32 UL(0x100000000)
+#define TASK_SIZE_32 \
+ (PAGE_SIZE == SZ_64K ? UL(0x100000000) : (UL(0x100000000) - PAGE_SIZE))
#else
#define TASK_SIZE_32 (UL(0x100000000) - PAGE_SIZE)
-#endif /* CONFIG_ARM64_64K_PAGES */
+#endif /* CONFIG_KUSER_HELPERS */
#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
TASK_SIZE_32 : TASK_SIZE_64)
#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index 8a8acc220371c..a05fdd54014f7 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -11,19 +11,12 @@
* Section size must be at least 512MB for 64K base
* page size config. Otherwise it will be less than
* MAX_PAGE_ORDER and the build process will fail.
- */
-#ifdef CONFIG_ARM64_64K_PAGES
-#define SECTION_SIZE_BITS 29
-
-#else
-
-/*
+ *
* Section size must be at least 128MB for 4K base
* page size config. Otherwise PMD based huge page
* entries could not be created for vmemmap mappings.
* 16K follows 4K for simplicity.
*/
-#define SECTION_SIZE_BITS 27
-#endif /* CONFIG_ARM64_64K_PAGES */
+#define SECTION_SIZE_BITS (PAGE_SIZE == SZ_64K ? 29 : 27)
#endif
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 4a9ea103817e8..cbcf861bbf2a6 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -10,10 +10,12 @@
#define __ASM_SYSREG_H
#include <linux/bits.h>
+#include <linux/sizes.h>
#include <linux/stringify.h>
#include <linux/kasan-tags.h>
#include <asm/gpr-num.h>
+#include <asm/page-def.h>
/*
* ARMv8 ARM reserves the following encoding for system registers:
@@ -913,24 +915,40 @@
#define ID_AA64MMFR0_EL1_PARANGE_MAX ID_AA64MMFR0_EL1_PARANGE_48
#endif
-#if defined(CONFIG_ARM64_4K_PAGES)
-#define ID_AA64MMFR0_EL1_TGRAN_SHIFT ID_AA64MMFR0_EL1_TGRAN4_SHIFT
-#define ID_AA64MMFR0_EL1_TGRAN_LPA2 ID_AA64MMFR0_EL1_TGRAN4_52_BIT
-#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN
-#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MAX
-#define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT ID_AA64MMFR0_EL1_TGRAN4_2_SHIFT
-#elif defined(CONFIG_ARM64_16K_PAGES)
-#define ID_AA64MMFR0_EL1_TGRAN_SHIFT ID_AA64MMFR0_EL1_TGRAN16_SHIFT
-#define ID_AA64MMFR0_EL1_TGRAN_LPA2 ID_AA64MMFR0_EL1_TGRAN16_52_BIT
-#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MIN
-#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MAX
-#define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT ID_AA64MMFR0_EL1_TGRAN16_2_SHIFT
-#elif defined(CONFIG_ARM64_64K_PAGES)
-#define ID_AA64MMFR0_EL1_TGRAN_SHIFT ID_AA64MMFR0_EL1_TGRAN64_SHIFT
-#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MIN
-#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MAX
-#define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT ID_AA64MMFR0_EL1_TGRAN64_2_SHIFT
-#endif
+#define ID_AA64MMFR0_EL1_TGRAN_SHIFT \
+ (PAGE_SIZE == SZ_4K ? \
+ ID_AA64MMFR0_EL1_TGRAN4_SHIFT : \
+ (PAGE_SIZE == SZ_16K ? \
+ ID_AA64MMFR0_EL1_TGRAN16_SHIFT : \
+ ID_AA64MMFR0_EL1_TGRAN64_SHIFT))
+
+#define ID_AA64MMFR0_EL1_TGRAN_LPA2 \
+ (PAGE_SIZE == SZ_4K ? \
+ ID_AA64MMFR0_EL1_TGRAN4_52_BIT : \
+ (PAGE_SIZE == SZ_16K ? \
+ ID_AA64MMFR0_EL1_TGRAN16_52_BIT : \
+ -1))
+
+#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN \
+ (PAGE_SIZE == SZ_4K ? \
+ ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN : \
+ (PAGE_SIZE == SZ_16K ? \
+ ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MIN : \
+ ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MIN))
+
+#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX \
+ (PAGE_SIZE == SZ_4K ? \
+ ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MAX : \
+ (PAGE_SIZE == SZ_16K ? \
+ ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MAX : \
+ ID_AA64MMFR0_EL1_TGRAN64_SUPPORTED_MAX))
+
+#define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT \
+ (PAGE_SIZE == SZ_4K ? \
+ ID_AA64MMFR0_EL1_TGRAN4_2_SHIFT : \
+ (PAGE_SIZE == SZ_16K ? \
+ ID_AA64MMFR0_EL1_TGRAN16_2_SHIFT : \
+ ID_AA64MMFR0_EL1_TGRAN64_2_SHIFT))
#define CPACR_EL1_FPEN_EL1EN (BIT(20)) /* enable EL1 access */
#define CPACR_EL1_FPEN_EL0EN (BIT(21)) /* enable EL0 access, if EL1EN set */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 646ecd3069fdd..7705c9c0e7142 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1831,11 +1831,13 @@ static bool has_nv1(const struct arm64_cpu_capabilities *entry, int scope)
is_midr_in_range_list(read_cpuid_id(), nv1_ni_list)));
}
-#if defined(ID_AA64MMFR0_EL1_TGRAN_LPA2) && defined(ID_AA64MMFR0_EL1_TGRAN_2_SUPPORTED_LPA2)
static bool has_lpa2_at_stage1(u64 mmfr0)
{
unsigned int tgran;
+ if (PAGE_SIZE == SZ_64K)
+ return false;
+
tgran = cpuid_feature_extract_unsigned_field(mmfr0,
ID_AA64MMFR0_EL1_TGRAN_SHIFT);
return tgran == ID_AA64MMFR0_EL1_TGRAN_LPA2;
@@ -1845,6 +1847,9 @@ static bool has_lpa2_at_stage2(u64 mmfr0)
{
unsigned int tgran;
+ if (PAGE_SIZE == SZ_64K)
+ return false;
+
tgran = cpuid_feature_extract_unsigned_field(mmfr0,
ID_AA64MMFR0_EL1_TGRAN_2_SHIFT);
return tgran == ID_AA64MMFR0_EL1_TGRAN_2_SUPPORTED_LPA2;
@@ -1857,10 +1862,26 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
return has_lpa2_at_stage1(mmfr0) && has_lpa2_at_stage2(mmfr0);
}
-#else
-static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
+
+#ifdef CONFIG_ARM64_VA_BITS_52
+static bool has_va52(const struct arm64_cpu_capabilities *entry, int scope)
{
- return false;
+ const struct arm64_cpu_capabilities entry_64k = {
+ ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, VARange, 52)
+ };
+ const struct arm64_cpu_capabilities entry_16k = {
+ ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, TGRAN16, 52_BIT)
+ };
+ const struct arm64_cpu_capabilities entry_4k = {
+ ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, TGRAN4, 52_BIT)
+ };
+
+ if (PAGE_SIZE == SZ_64K)
+ return has_cpuid_feature(&entry_64k, scope);
+ else if (PAGE_SIZE == SZ_16K)
+ return has_cpuid_feature(&entry_16k, scope);
+ else
+ return has_cpuid_feature(&entry_4k, scope);
}
#endif
@@ -2847,20 +2868,10 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
},
#ifdef CONFIG_ARM64_VA_BITS_52
{
+ .desc = "52-bit Virtual Addressing",
.capability = ARM64_HAS_VA52,
.type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
- .matches = has_cpuid_feature,
-#ifdef CONFIG_ARM64_64K_PAGES
- .desc = "52-bit Virtual Addressing (LVA)",
- ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, VARange, 52)
-#else
- .desc = "52-bit Virtual Addressing (LPA2)",
-#ifdef CONFIG_ARM64_4K_PAGES
- ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, TGRAN4, 52_BIT)
-#else
- ARM64_CPUID_FIELDS(ID_AA64MMFR0_EL1, TGRAN16, 52_BIT)
-#endif
-#endif
+ .matches = has_va52,
},
#endif
{
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index de1e09d986ad2..15ce3253ad359 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -82,7 +82,7 @@ static void __init early_fixmap_init_pud(p4d_t *p4dp, unsigned long addr,
* share the top level pgd entry, which should only happen on
* 16k/4 levels configurations.
*/
- BUG_ON(!IS_ENABLED(CONFIG_ARM64_16K_PAGES));
+ BUG_ON(PAGE_SIZE != SZ_16K);
}
if (p4d_none(p4d))
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 9b5ab6818f7f3..42eb246949072 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -73,13 +73,10 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
* (64k granule), or a multiple that can be mapped using contiguous bits
* in the page tables: 32 * PMD_SIZE (16k granule)
*/
-#if defined(CONFIG_ARM64_4K_PAGES)
-#define ARM64_MEMSTART_SHIFT PUD_SHIFT
-#elif defined(CONFIG_ARM64_16K_PAGES)
-#define ARM64_MEMSTART_SHIFT CONT_PMD_SHIFT
-#else
-#define ARM64_MEMSTART_SHIFT PMD_SHIFT
-#endif
+#define ARM64_MEMSTART_SHIFT \
+ (PAGE_SIZE == SZ_4K ? \
+ PUD_SHIFT : \
+ (PAGE_SIZE == SZ_16K ? CONT_PMD_SHIFT : PMD_SHIFT))
/*
* sparsemem vmemmap imposes an additional requirement on the alignment of
@@ -87,11 +84,10 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
* has a direct correspondence, and needs to appear sufficiently aligned
* in the virtual address space.
*/
-#if ARM64_MEMSTART_SHIFT < SECTION_SIZE_BITS
-#define ARM64_MEMSTART_ALIGN (1UL << SECTION_SIZE_BITS)
-#else
-#define ARM64_MEMSTART_ALIGN (1UL << ARM64_MEMSTART_SHIFT)
-#endif
+#define ARM64_MEMSTART_ALIGN \
+ (ARM64_MEMSTART_SHIFT < SECTION_SIZE_BITS ? \
+ (1UL << SECTION_SIZE_BITS) : \
+ (1UL << ARM64_MEMSTART_SHIFT))
static void __init arch_reserve_crashkernel(void)
{
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b65a29440a0c9..9af897fb3c432 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -178,10 +178,10 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
} while (pgdp++, addr = next, addr != end);
}
-#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS > 4
+#if CONFIG_PGTABLE_LEVELS > 4
#define SHADOW_ALIGN P4D_SIZE
#else
-#define SHADOW_ALIGN PUD_SIZE
+#define SHADOW_ALIGN (PAGE_SIZE == SZ_64K ? P4D_SIZE : PUD_SIZE)
#endif
/*
@@ -243,8 +243,8 @@ static int __init root_level_idx(u64 addr)
* not implemented. This means we need to index the table as usual,
* instead of masking off bits based on vabits_actual.
*/
- u64 vabits = IS_ENABLED(CONFIG_ARM64_64K_PAGES) ? VA_BITS
- : vabits_actual;
+ u64 vabits = PAGE_SIZE == SZ_64K ? VA_BITS
+ : vabits_actual;
int shift = (ARM64_HW_PGTABLE_LEVELS(vabits) - 1) * (PAGE_SHIFT - 3);
return (addr & ~_PAGE_OFFSET(vabits)) >> (shift + PAGE_SHIFT);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d4d30eaefb4cd..a528787c1e550 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1179,7 +1179,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
{
WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
- if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+ if (PAGE_SIZE != SZ_4K)
return vmemmap_populate_basepages(start, end, node, altmap);
else
return vmemmap_populate_hugepages(start, end, node, altmap);
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index fdec478ba5e70..b745579b4b9f3 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2323,7 +2323,7 @@ static int its_setup_baser(struct its_node *its, struct its_baser *baser,
baser_phys = virt_to_phys(base);
/* Check if the physical address of the memory is above 48bits */
- if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48)) {
+ if (PAGE_SIZE == SZ_64K && (baser_phys >> 48)) {
/* 52bit PA is supported only when PageSize=64K */
if (psz != SZ_64K) {
--
2.43.0
* [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (41 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 43/57] arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-19 14:16 ` Thomas Weißschuh
2024-10-14 10:58 ` [RFC PATCH v1 45/57] arm64: Rework trampoline rodata mapping Ryan Roberts
` (15 subsequent siblings)
58 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon
Cc: Ryan Roberts, kvmarm, linux-arm-kernel, linux-kernel, linux-mm
Increase alignment of sections in nvhe hyp, vdso and final vmlinux image
from PAGE_SIZE to PAGE_SIZE_MAX. For compile-time PAGE_SIZE,
PAGE_SIZE_MAX == PAGE_SIZE so there is no change. For boot-time
PAGE_SIZE, PAGE_SIZE_MAX is the largest selectable page size.
For a boot-time page size build, image size is comparable to a 64K page
size compile-time build. In future, it may be desirable to optimize
run-time memory consumption by freeing unused padding pages when the
boot-time selected page size is less than PAGE_SIZE_MAX.
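As a rough illustration of why aligning to the largest granule is safe for every selectable page size, and what it costs in padding, here is a minimal sketch. The EX_* constants and ex_* helpers are hypothetical stand-ins invented for this example; they are not kernel or linker symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three arm64 granule sizes. */
#define EX_SZ_4K   UINT64_C(0x1000)
#define EX_SZ_16K  UINT64_C(0x4000)
#define EX_SZ_64K  UINT64_C(0x10000)

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t ex_align_up(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Padding bytes inserted before a section that must start at align. */
static uint64_t ex_padding(uint64_t addr, uint64_t align)
{
    return ex_align_up(addr, align) - addr;
}
```

Any address returned by ex_align_up(addr, EX_SZ_64K) is also a multiple of 4K and 16K, so a section placed there is page aligned whichever size is chosen at boot. The price is the larger value of ex_padding() in front of each such section.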
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/memory.h | 4 +--
arch/arm64/kernel/vdso-wrap.S | 4 +--
arch/arm64/kernel/vdso.c | 7 +++---
arch/arm64/kernel/vdso/vdso.lds.S | 4 +--
arch/arm64/kernel/vdso32-wrap.S | 4 +--
arch/arm64/kernel/vdso32/vdso.lds.S | 4 +--
arch/arm64/kernel/vmlinux.lds.S | 38 ++++++++++++++---------------
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 2 +-
8 files changed, 34 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 6aa97fa22dc30..5393a859183f7 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -195,13 +195,13 @@
* Open-coded (swapper_pg_dir - reserved_pg_dir) as this cannot be calculated
* until link time.
*/
-#define RESERVED_SWAPPER_OFFSET (PAGE_SIZE)
+#define RESERVED_SWAPPER_OFFSET (PAGE_SIZE_MAX)
/*
* Open-coded (swapper_pg_dir - tramp_pg_dir) as this cannot be calculated
* until link time.
*/
-#define TRAMP_SWAPPER_OFFSET (2 * PAGE_SIZE)
+#define TRAMP_SWAPPER_OFFSET (2 * PAGE_SIZE_MAX)
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/vdso-wrap.S b/arch/arm64/kernel/vdso-wrap.S
index c4b1990bf2be0..79fa77628199b 100644
--- a/arch/arm64/kernel/vdso-wrap.S
+++ b/arch/arm64/kernel/vdso-wrap.S
@@ -13,10 +13,10 @@
.globl vdso_start, vdso_end
.section .rodata
- .balign PAGE_SIZE
+ .balign PAGE_SIZE_MAX
vdso_start:
.incbin "arch/arm64/kernel/vdso/vdso.so"
- .balign PAGE_SIZE
+ .balign PAGE_SIZE_MAX
vdso_end:
.previous
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 89b6e78400023..1efe98909a2e0 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -195,7 +195,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
vdso_text_len = vdso_info[abi].vdso_pages << PAGE_SHIFT;
/* Be sure to map the data page */
- vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE;
+ vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE_MAX;
vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
if (IS_ERR_VALUE(vdso_base)) {
@@ -203,7 +203,8 @@ static int __setup_additional_pages(enum vdso_abi abi,
goto up_fail;
}
- ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
+ ret = _install_special_mapping(mm, vdso_base,
+ VVAR_NR_PAGES * PAGE_SIZE_MAX,
VM_READ|VM_MAYREAD|VM_PFNMAP,
vdso_info[abi].dm);
if (IS_ERR(ret))
@@ -212,7 +213,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
if (system_supports_bti_kernel())
gp_flags = VM_ARM64_BTI;
- vdso_base += VVAR_NR_PAGES * PAGE_SIZE;
+ vdso_base += VVAR_NR_PAGES * PAGE_SIZE_MAX;
mm->context.vdso = (void *)vdso_base;
ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
VM_READ|VM_EXEC|gp_flags|
diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
index 45354f2ddf706..f7d1537a689e8 100644
--- a/arch/arm64/kernel/vdso/vdso.lds.S
+++ b/arch/arm64/kernel/vdso/vdso.lds.S
@@ -18,9 +18,9 @@ OUTPUT_ARCH(aarch64)
SECTIONS
{
- PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
+ PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE_MAX);
#ifdef CONFIG_TIME_NS
- PROVIDE(_timens_data = _vdso_data + PAGE_SIZE);
+ PROVIDE(_timens_data = _vdso_data + PAGE_SIZE_MAX);
#endif
. = VDSO_LBASE + SIZEOF_HEADERS;
diff --git a/arch/arm64/kernel/vdso32-wrap.S b/arch/arm64/kernel/vdso32-wrap.S
index e72ac7bc4c04f..1c6069d6c457e 100644
--- a/arch/arm64/kernel/vdso32-wrap.S
+++ b/arch/arm64/kernel/vdso32-wrap.S
@@ -10,10 +10,10 @@
.globl vdso32_start, vdso32_end
.section .rodata
- .balign PAGE_SIZE
+ .balign PAGE_SIZE_MAX
vdso32_start:
.incbin "arch/arm64/kernel/vdso32/vdso.so"
- .balign PAGE_SIZE
+ .balign PAGE_SIZE_MAX
vdso32_end:
.previous
diff --git a/arch/arm64/kernel/vdso32/vdso.lds.S b/arch/arm64/kernel/vdso32/vdso.lds.S
index 8d95d7d35057d..c46d18a69d1ce 100644
--- a/arch/arm64/kernel/vdso32/vdso.lds.S
+++ b/arch/arm64/kernel/vdso32/vdso.lds.S
@@ -18,9 +18,9 @@ OUTPUT_ARCH(arm)
SECTIONS
{
- PROVIDE_HIDDEN(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
+ PROVIDE_HIDDEN(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE_MAX);
#ifdef CONFIG_TIME_NS
- PROVIDE_HIDDEN(_timens_data = _vdso_data + PAGE_SIZE);
+ PROVIDE_HIDDEN(_timens_data = _vdso_data + PAGE_SIZE_MAX);
#endif
. = VDSO_LBASE + SIZEOF_HEADERS;
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 7f3f6d709ae73..1ef6dea13b57c 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -15,16 +15,16 @@
#define HYPERVISOR_DATA_SECTIONS \
HYP_SECTION_NAME(.rodata) : { \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__hyp_rodata_start = .; \
*(HYP_SECTION_NAME(.data..ro_after_init)) \
*(HYP_SECTION_NAME(.rodata)) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__hyp_rodata_end = .; \
}
#define HYPERVISOR_PERCPU_SECTION \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
HYP_SECTION_NAME(.data..percpu) : { \
*(HYP_SECTION_NAME(.data..percpu)) \
}
@@ -39,7 +39,7 @@
#define BSS_FIRST_SECTIONS \
__hyp_bss_start = .; \
*(HYP_SECTION_NAME(.bss)) \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__hyp_bss_end = .;
/*
@@ -48,7 +48,7 @@
* between them, which can in some cases cause the linker to misalign them. To
* work around the issue, force a page alignment for __bss_start.
*/
-#define SBSS_ALIGN PAGE_SIZE
+#define SBSS_ALIGN PAGE_SIZE_MAX
#else /* CONFIG_KVM */
#define HYPERVISOR_EXTABLE
#define HYPERVISOR_DATA_SECTIONS
@@ -75,14 +75,14 @@ ENTRY(_text)
jiffies = jiffies_64;
#define HYPERVISOR_TEXT \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__hyp_idmap_text_start = .; \
*(.hyp.idmap.text) \
__hyp_idmap_text_end = .; \
__hyp_text_start = .; \
*(.hyp.text) \
HYPERVISOR_EXTABLE \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__hyp_text_end = .;
#define IDMAP_TEXT \
@@ -113,11 +113,11 @@ jiffies = jiffies_64;
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
#define TRAMP_TEXT \
- . = ALIGN(PAGE_SIZE); \
+ . = ALIGN(PAGE_SIZE_MAX); \
__entry_tramp_text_start = .; \
*(.entry.tramp.text) \
- . = ALIGN(PAGE_SIZE); \
__entry_tramp_text_end = .; \
+ . = ALIGN(PAGE_SIZE_MAX); \
*(.entry.tramp.rodata)
#else
#define TRAMP_TEXT
@@ -187,7 +187,7 @@ SECTIONS
_etext = .; /* End of text section */
/* everything from this point to __init_begin will be marked RO NX */
- RO_DATA(PAGE_SIZE)
+ RO_DATA(PAGE_SIZE_MAX)
HYPERVISOR_DATA_SECTIONS
@@ -206,22 +206,22 @@ SECTIONS
HIBERNATE_TEXT
KEXEC_TEXT
IDMAP_TEXT
- . = ALIGN(PAGE_SIZE);
+ . = ALIGN(PAGE_SIZE_MAX);
}
idmap_pg_dir = .;
- . += PAGE_SIZE;
+ . += PAGE_SIZE_MAX;
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
tramp_pg_dir = .;
- . += PAGE_SIZE;
+ . += PAGE_SIZE_MAX;
#endif
reserved_pg_dir = .;
- . += PAGE_SIZE;
+ . += PAGE_SIZE_MAX;
swapper_pg_dir = .;
- . += PAGE_SIZE;
+ . += PAGE_SIZE_MAX;
. = ALIGN(SEGMENT_ALIGN);
__init_begin = .;
@@ -290,7 +290,7 @@ SECTIONS
_data = .;
_sdata = .;
- RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
+ RW_DATA(L1_CACHE_BYTES, PAGE_SIZE_MAX, THREAD_ALIGN)
/*
* Data written with the MMU off but read with the MMU on requires
@@ -317,7 +317,7 @@ SECTIONS
/* start of zero-init region */
BSS_SECTION(SBSS_ALIGN, 0, 0)
- . = ALIGN(PAGE_SIZE);
+ . = ALIGN(PAGE_SIZE_MAX);
init_pg_dir = .;
. += INIT_DIR_SIZE_MAX;
init_pg_end = .;
@@ -356,7 +356,7 @@ SECTIONS
* former is page-aligned, but the latter may not be with 16K or 64K pages, so
* it should also not cross a page boundary.
*/
-ASSERT(__hyp_idmap_text_end - __hyp_idmap_text_start <= PAGE_SIZE,
+ASSERT(__hyp_idmap_text_end - __hyp_idmap_text_start <= SZ_4K,
"HYP init code too big")
ASSERT(__idmap_text_end - (__idmap_text_start & ~(SZ_4K - 1)) <= SZ_4K,
"ID map text too big or misaligned")
@@ -367,7 +367,7 @@ ASSERT(__hibernate_exit_text_start == swsusp_arch_suspend_exit,
"Hibernate exit text does not start with swsusp_arch_suspend_exit")
#endif
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-ASSERT((__entry_tramp_text_end - __entry_tramp_text_start) <= 3*PAGE_SIZE,
+ASSERT((__entry_tramp_text_end - __entry_tramp_text_start) <= 3 * SZ_4K,
"Entry trampoline text too big")
#endif
#ifdef CONFIG_KVM
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
index f4562f417d3fc..74c7c21626270 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
@@ -21,7 +21,7 @@ SECTIONS {
* .hyp..data..percpu needs to be page aligned to maintain the same
* alignment for when linking into vmlinux.
*/
- . = ALIGN(PAGE_SIZE);
+ . = ALIGN(PAGE_SIZE_MAX);
BEGIN_HYP_SECTION(.data..percpu)
PERCPU_INPUT(L1_CACHE_BYTES)
END_HYP_SECTION
--
2.43.0
* [RFC PATCH v1 45/57] arm64: Rework trampoline rodata mapping
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (42 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 46/57] arm64: Generalize fixmap for boot-time page size Ryan Roberts
` (14 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Now that the trampoline rodata is aligned to the next PAGE_SIZE_MAX
boundary after the end of the trampoline text, the code that maps it in
the fixmap is incorrect, because it still assumes the rodata is in the
next page immediately after the text. That assumption still holds for
compile-time page size, but with boot-time page size it fails whenever
the selected page size is less than PAGE_SIZE_MAX.
So let's fix that by allocating sufficient fixmap slots to cover the
extra alignment padding in the worst case (PAGE_SIZE == PAGE_SIZE_MIN)
and explicitly mapping the rodata to the slot offset correctly from the
text.
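The slot arithmetic can be sketched in isolation. This is a simplified model with hypothetical ex_* names, assuming the trampoline text starts on a PAGE_SIZE_MAX boundary (physical offset 0 here) and the rodata starts on the next such boundary; it mirrors the shape of the calculation, not the exact kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Worst-case slot count: three text pages, one data page and the
 * alignment padding, all measured in units of the smallest page size.
 */
static uint64_t ex_tramp_slots(uint64_t page_min, uint64_t page_max)
{
    uint64_t text = 3 * page_min;
    uint64_t data = page_min;
    uint64_t pad = page_max - page_min;

    assert(page_min != 0 && page_min <= page_max);
    return EX_DIV_ROUND_UP(text + data + pad, page_min);
}

/*
 * Number of slots below the first text slot at which the rodata page
 * lands, for a given run-time page size.
 */
static uint64_t ex_rodata_slot_offset(uint64_t text_size, uint64_t page_size,
                                      uint64_t page_max)
{
    uint64_t pa_text = 0;
    uint64_t pa_data = EX_DIV_ROUND_UP(text_size, page_max) * page_max;
    uint64_t offset = 0;
    uint64_t i;

    assert(page_size != 0 && page_size <= page_max);

    /* One slot per text page, walking downwards. */
    for (i = 0; i < EX_DIV_ROUND_UP(text_size, page_size); i++) {
        pa_text += page_size;
        offset++;
    }

    /* Skip the slots that correspond to the alignment padding. */
    offset += (pa_data - pa_text) / page_size;
    return offset;
}
```

With a 4K minimum and a 64K maximum, ex_tramp_slots() gives 19 slots, and a 12K text section places the rodata 16 slots down with 4K pages, 4 slots down with 16K pages and 1 slot down with 64K pages, all within the reserved range.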
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/fixmap.h | 16 +++++++++++-----
arch/arm64/include/asm/sections.h | 1 +
arch/arm64/kernel/vmlinux.lds.S | 4 +++-
arch/arm64/mm/mmu.c | 22 ++++++++++++++--------
4 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 87e307804b99c..9a496d54dfe6e 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -59,13 +59,19 @@ enum fixed_addresses {
#endif /* CONFIG_ACPI_APEI_GHES */
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+#define TRAMP_TEXT_SIZE (PAGE_SIZE_MIN * 3)
#ifdef CONFIG_RELOCATABLE
- FIX_ENTRY_TRAMP_TEXT4, /* one extra slot for the data page */
+#define TRAMP_DATA_SIZE PAGE_SIZE_MIN
+#define TRAMP_PAD_SIZE (PAGE_SIZE_MAX - PAGE_SIZE_MIN)
+#else
+#define TRAMP_DATA_SIZE 0
+#define TRAMP_PAD_SIZE 0
#endif
- FIX_ENTRY_TRAMP_TEXT3,
- FIX_ENTRY_TRAMP_TEXT2,
- FIX_ENTRY_TRAMP_TEXT1,
-#define TRAMP_VALIAS (__fix_to_virt(FIX_ENTRY_TRAMP_TEXT1))
+#define TRAMP_SIZE (TRAMP_TEXT_SIZE + TRAMP_DATA_SIZE + TRAMP_PAD_SIZE)
+ FIX_ENTRY_TRAMP_END,
+ FIX_ENTRY_TRAMP_BEGIN = FIX_ENTRY_TRAMP_END +
+ DIV_ROUND_UP(TRAMP_SIZE, PAGE_SIZE_MIN) - 1,
+#define TRAMP_VALIAS (__fix_to_virt(FIX_ENTRY_TRAMP_BEGIN))
#endif /* CONFIG_UNMAP_KERNEL_AT_EL0 */
__end_of_permanent_fixed_addresses,
diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
index 40971ac1303f9..252ec58963093 100644
--- a/arch/arm64/include/asm/sections.h
+++ b/arch/arm64/include/asm/sections.h
@@ -21,6 +21,7 @@ extern char __exittext_begin[], __exittext_end[];
extern char __irqentry_text_start[], __irqentry_text_end[];
extern char __mmuoff_data_start[], __mmuoff_data_end[];
extern char __entry_tramp_text_start[], __entry_tramp_text_end[];
+extern char __entry_tramp_rodata_start[], __entry_tramp_rodata_end[];
extern char __relocate_new_kernel_start[], __relocate_new_kernel_end[];
static inline size_t entry_tramp_text_size(void)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 1ef6dea13b57c..09fcc234c0f77 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -118,7 +118,9 @@ jiffies = jiffies_64;
*(.entry.tramp.text) \
__entry_tramp_text_end = .; \
. = ALIGN(PAGE_SIZE_MAX); \
- *(.entry.tramp.rodata)
+ __entry_tramp_rodata_start = .; \
+ *(.entry.tramp.rodata) \
+ __entry_tramp_rodata_end = .;
#else
#define TRAMP_TEXT
#endif
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a528787c1e550..84df9f278d24d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -734,25 +734,31 @@ static int __init map_entry_trampoline(void)
return 0;
pgprot_t prot = kernel_exec_prot();
- phys_addr_t pa_start = __pa_symbol(__entry_tramp_text_start);
+ phys_addr_t pa_text = __pa_symbol(__entry_tramp_text_start);
+ phys_addr_t pa_data = __pa_symbol(__entry_tramp_rodata_start);
+ int slot = FIX_ENTRY_TRAMP_BEGIN;
/* The trampoline is always mapped and can therefore be global */
pgprot_val(prot) &= ~PTE_NG;
/* Map only the text into the trampoline page table */
memset(tramp_pg_dir, 0, PGD_SIZE);
- __create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
+ __create_pgd_mapping(tramp_pg_dir, pa_text, TRAMP_VALIAS,
entry_tramp_text_size(), prot,
__pgd_pgtable_alloc, NO_BLOCK_MAPPINGS);
/* Map both the text and data into the kernel page table */
- for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++)
- __set_fixmap(FIX_ENTRY_TRAMP_TEXT1 - i,
- pa_start + i * PAGE_SIZE, prot);
+ for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++) {
+ __set_fixmap(slot, pa_text, prot);
+ pa_text += PAGE_SIZE;
+ slot--;
+ }
- if (IS_ENABLED(CONFIG_RELOCATABLE))
- __set_fixmap(FIX_ENTRY_TRAMP_TEXT1 - i,
- pa_start + i * PAGE_SIZE, PAGE_KERNEL_RO);
+ if (IS_ENABLED(CONFIG_RELOCATABLE)) {
+ slot -= (pa_data - pa_text) / PAGE_SIZE;
+ VM_BUG_ON(slot < FIX_ENTRY_TRAMP_END);
+ __set_fixmap(slot, pa_data, PAGE_KERNEL_RO);
+ }
return 0;
}
--
2.43.0
* [RFC PATCH v1 46/57] arm64: Generalize fixmap for boot-time page size
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (43 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 45/57] arm64: Rework trampoline rodata mapping Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 47/57] arm64: Statically allocate and align for worst-case " Ryan Roberts
` (13 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Some fixmap fixed address slots previously depended on PAGE_SIZE (e.g.
to determine how many slots were required to cover a given size). Since
the fixed address slots must be compile-time constants, let's work out
the worst-case number of required slots, which occurs when page size is
PAGE_SIZE_MIN.
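To see why the smallest page size is the worst case for slot counts, consider this small sketch. The ex_* names are hypothetical, and the 2M and 256K sizes used in the note below are only example inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* The three granule sizes a boot-time build could select between. */
static const uint64_t ex_page_sizes[] = { 4096, 16384, 65536 };

/* Slots needed to cover size bytes with one slot per page. */
static uint64_t ex_slots_for(uint64_t size, uint64_t page_size)
{
    assert(page_size != 0);
    return EX_DIV_ROUND_UP(size, page_size);
}

/* Largest slot count over all selectable page sizes. */
static uint64_t ex_worst_case_slots(uint64_t size)
{
    uint64_t worst = 0;
    size_t i;

    for (i = 0; i < sizeof(ex_page_sizes) / sizeof(ex_page_sizes[0]); i++) {
        uint64_t n = ex_slots_for(size, ex_page_sizes[i]);

        if (n > worst)
            worst = n;
    }
    return worst;
}
```

For a 2M region the counts are 512, 128 and 32 slots for 4K, 16K and 64K pages respectively, so reserving for the 4K case covers the others. A 256K region needs 64 slots at 4K but only 4 at 64K.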
Additionally, let's determine the worst-case number of PTE tables we
require and statically allocate enough memory.
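Once the number of entries per table is only known at run time, a two-dimensional array type can no longer be declared, so the storage has to become a flat pool that is indexed by hand. A minimal sketch of that indexing, with a hypothetical pool size and ex_* names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical worst-case pool: room for two 64K tables of 8-byte entries. */
#define EX_POOL_BYTES   (2 * 65536)
#define EX_POOL_ENTRIES (EX_POOL_BYTES / sizeof(uint64_t))

static uint64_t ex_pool[EX_POOL_ENTRIES];

/*
 * Equivalent of tables[table][idx] when the row length
 * (ptrs_per_table) is chosen at boot rather than at compile time.
 */
static uint64_t *ex_entry(unsigned int table, unsigned int idx,
                          unsigned int ptrs_per_table)
{
    size_t off = (size_t)table * ptrs_per_table + idx;

    assert(idx < ptrs_per_table);
    assert(off < EX_POOL_ENTRIES);
    return &ex_pool[off];
}
```

The row length would be 512, 2048 or 8192 entries for 4K, 16K or 64K pages, so the second table starts at a different offset in the pool for each page size.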
For compile-time page size builds, the end result is the same as it was
previously.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/fixmap.h | 12 ++++++++----
arch/arm64/mm/fixmap.c | 34 ++++++++++++++++++++++-----------
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 9a496d54dfe6e..c73fd3c1334ff 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -43,7 +43,7 @@ enum fixed_addresses {
* whether it crosses any page boundary.
*/
FIX_FDT_END,
- FIX_FDT = FIX_FDT_END + DIV_ROUND_UP(MAX_FDT_SIZE, PAGE_SIZE) + 1,
+ FIX_FDT = FIX_FDT_END + DIV_ROUND_UP(MAX_FDT_SIZE, PAGE_SIZE_MIN) + 1,
FIX_EARLYCON_MEM_BASE,
FIX_TEXT_POKE0,
@@ -79,7 +79,7 @@ enum fixed_addresses {
* Temporary boot-time mappings, used by early_ioremap(),
* before ioremap() is functional.
*/
-#define NR_FIX_BTMAPS (SZ_256K / PAGE_SIZE)
+#define NR_FIX_BTMAPS (SZ_256K / PAGE_SIZE_MIN)
#define FIX_BTMAPS_SLOTS 7
#define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
@@ -101,8 +101,12 @@ enum fixed_addresses {
#define FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
-#define FIXADDR_TOT_SIZE (__end_of_fixed_addresses << PAGE_SHIFT)
-#define FIXADDR_TOT_START (FIXADDR_TOP - FIXADDR_TOT_SIZE)
+#define __FIXADDR_TOT_SIZE(page_shift) \
+ (__end_of_fixed_addresses << (page_shift))
+#define __FIXADDR_TOT_START(page_shift) \
+ (FIXADDR_TOP - __FIXADDR_TOT_SIZE(page_shift))
+#define FIXADDR_TOT_SIZE __FIXADDR_TOT_SIZE(PAGE_SHIFT)
+#define FIXADDR_TOT_START __FIXADDR_TOT_START(PAGE_SHIFT)
#define FIXMAP_PAGE_IO __pgprot(PROT_DEVICE_nGnRE)
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 15ce3253ad359..a0dcf2375ccb4 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -17,27 +17,39 @@
#include <asm/tlbflush.h>
/* ensure that the fixmap region does not grow down into the PCI I/O region */
-static_assert(FIXADDR_TOT_START > PCI_IO_END);
+static_assert(__FIXADDR_TOT_START(PAGE_SHIFT_MAX) > PCI_IO_END);
-#define NR_BM_PTE_TABLES \
- SPAN_NR_ENTRIES(FIXADDR_TOT_START, FIXADDR_TOP, PMD_SHIFT)
-#define NR_BM_PMD_TABLES \
- SPAN_NR_ENTRIES(FIXADDR_TOT_START, FIXADDR_TOP, PUD_SHIFT)
+#define FIXMAP_LEVEL(page_shift, lvl, vstart, vend) \
+ SPAN_NR_ENTRIES(vstart, vend, PGTABLE_LEVEL_SHIFT(page_shift, lvl))
-static_assert(NR_BM_PMD_TABLES == 1);
+#define FIXMAP_PAGES(page_shift, level) \
+ FIXMAP_LEVEL(page_shift, level, \
+ __FIXADDR_TOT_START(page_shift), FIXADDR_TOP)
+
+#define FIXMAP_SIZE(page_shift, level) \
+ (FIXMAP_PAGES(page_shift, level) * (UL(1) << (page_shift)))
+
+#define FIXMAP_PTE_SIZE_MAX \
+ MAX_IF_HAVE_PGSZ(FIXMAP_SIZE(ARM64_PAGE_SHIFT_4K, 2), \
+ FIXMAP_SIZE(ARM64_PAGE_SHIFT_16K, 2), \
+ FIXMAP_SIZE(ARM64_PAGE_SHIFT_64K, 2))
+
+static_assert(FIXMAP_PAGES(ARM64_PAGE_SHIFT_4K, 1) == 1);
+static_assert(FIXMAP_PAGES(ARM64_PAGE_SHIFT_16K, 1) == 1);
+static_assert(FIXMAP_PAGES(ARM64_PAGE_SHIFT_64K, 1) == 1);
#define __BM_TABLE_IDX(addr, shift) \
(((addr) >> (shift)) - (FIXADDR_TOT_START >> (shift)))
#define BM_PTE_TABLE_IDX(addr) __BM_TABLE_IDX(addr, PMD_SHIFT)
-static pte_t bm_pte[NR_BM_PTE_TABLES][PTRS_PER_PTE] __page_aligned_bss;
-static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss __maybe_unused;
-static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
+static pte_t bm_pte[FIXMAP_PTE_SIZE_MAX / sizeof(pte_t)] __page_aligned_bss;
+static pmd_t bm_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss __maybe_unused;
+static pud_t bm_pud[MAX_PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
static inline pte_t *fixmap_pte(unsigned long addr)
{
- return &bm_pte[BM_PTE_TABLE_IDX(addr)][pte_index(addr)];
+ return &bm_pte[BM_PTE_TABLE_IDX(addr) * PTRS_PER_PTE + pte_index(addr)];
}
static void __init early_fixmap_init_pte(pmd_t *pmdp, unsigned long addr)
@@ -46,7 +58,7 @@ static void __init early_fixmap_init_pte(pmd_t *pmdp, unsigned long addr)
pte_t *ptep;
if (pmd_none(pmd)) {
- ptep = bm_pte[BM_PTE_TABLE_IDX(addr)];
+ ptep = &bm_pte[BM_PTE_TABLE_IDX(addr) * PTRS_PER_PTE];
__pmd_populate(pmdp, __pa_symbol(ptep), PMD_TYPE_TABLE);
}
}
--
2.43.0
* [RFC PATCH v1 47/57] arm64: Statically allocate and align for worst-case page size
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (44 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 46/57] arm64: Generalize fixmap for boot-time page size Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 48/57] arm64: Convert switch to if for non-const comparison values Ryan Roberts
` (12 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Increase the size and alignment of the zero page and various static
buffers used for page tables to PAGE_SIZE_MAX. This resolves to the same
thing for compile-time page size builds.
For boot-time page size builds, we may in future consider freeing unused
pages at runtime when the selected page size is less than PAGE_SIZE_MAX.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/pgtable.h | 2 +-
arch/arm64/kernel/pi/map_kernel.c | 2 +-
arch/arm64/mm/mmu.c | 6 +++---
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f7..fd47f70a42396 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -61,7 +61,7 @@
* ZERO_PAGE is a global shared page that is always zero: used
* for zero-mapped memory areas etc..
*/
-extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
+extern unsigned long empty_zero_page[PAGE_SIZE_MAX / sizeof(unsigned long)];
#define ZERO_PAGE(vaddr) phys_to_page(__pa_symbol(empty_zero_page))
#define pte_ERROR(e) \
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index 7a62d4238449d..deb8cd50b0b0c 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -199,7 +199,7 @@ static void __init remap_idmap(bool use_lpa2, int page_shift)
static void __init map_fdt(u64 fdt, int page_shift)
{
- static u8 ptes[INIT_IDMAP_FDT_SIZE_MAX] __initdata __aligned(PAGE_SIZE);
+ static u8 ptes[INIT_IDMAP_FDT_SIZE_MAX] __initdata __aligned(PAGE_SIZE_MAX);
static bool first_time __initdata = true;
u64 limit = (u64)&ptes[INIT_IDMAP_FDT_SIZE_MAX];
u64 efdt = fdt + MAX_FDT_SIZE;
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 84df9f278d24d..b4cd3b6a73c22 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -62,7 +62,7 @@ long __section(".mmuoff.data.write") __early_cpu_boot_status;
* Empty_zero_page is a special page that is used for zero-initialized data
* and COW.
*/
-unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_bss;
+unsigned long empty_zero_page[PAGE_SIZE_MAX / sizeof(unsigned long)] __page_aligned_bss;
EXPORT_SYMBOL(empty_zero_page);
static DEFINE_SPINLOCK(swapper_pgdir_lock);
@@ -783,8 +783,8 @@ void __pi_map_range(u64 *pgd, u64 limit, u64 start, u64 end, u64 pa,
pgprot_t prot, int level, pte_t *tbl, bool may_use_cont,
u64 va_offset);
-static u8 idmap_ptes[IDMAP_LEVELS_MAX - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init,
- kpti_ptes[IDMAP_LEVELS_MAX - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init;
+static u8 idmap_ptes[IDMAP_LEVELS_MAX - 1][PAGE_SIZE_MAX] __aligned(PAGE_SIZE_MAX) __ro_after_init,
+ kpti_ptes[IDMAP_LEVELS_MAX - 1][PAGE_SIZE_MAX] __aligned(PAGE_SIZE_MAX) __ro_after_init;
static void __init create_idmap(void)
{
--
2.43.0
* [RFC PATCH v1 48/57] arm64: Convert switch to if for non-const comparison values
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (45 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 47/57] arm64: Statically allocate and align for worst-case " Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 49/57] arm64: Convert BUILD_BUG_ON to VM_BUG_ON Ryan Roberts
` (11 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon
Cc: Ryan Roberts, kvmarm, linux-arm-kernel, linux-kernel, linux-mm
When we enable boot-time page size, some macros are no longer
compile-time constants. Where these macros are used as cases in switch
statements, the switch statements no longer compile.
Let's convert these to if/else blocks, which can handle the runtime
values.
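One thing to watch in this kind of conversion is that a case which falls through also runs the statements of the cases below it, so the if chain has to carry those side effects along. The following is a generic sketch with hypothetical names (struct ex_shifts, ex_pick_shift), not the code from this patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shift values that are only known at run time. */
struct ex_shifts {
    int pud, cont_pmd, pmd, cont_pte, page;
};

/*
 * if-chain replacement for a fall-through switch. Each test looks at
 * the possibly updated shift, so a failed check cascades to the next
 * smaller size just as "fallthrough" did.
 */
static int ex_pick_shift(const struct ex_shifts *s, int shift,
                         bool pud_ok, bool pmd_ok, bool *force_pte)
{
    assert(s && force_pte);

    if (shift == s->pud && !pud_ok)
        shift = s->pmd;
    if (shift == s->cont_pmd)
        shift = s->pmd;
    if (shift == s->pmd && !pmd_ok) {
        /* The switch fell into the next case here, so do its work too. */
        shift = s->page;
        *force_pte = true;
    }
    if (shift == s->cont_pte) {
        shift = s->page;
        *force_pte = true;
    }
    return shift;
}
```

Because the comparison values are ordinary struct members, this compiles whether or not they are constant expressions, which a case label would not allow.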
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/kvm/mmu.c | 32 +++++++++++++++-----------------
arch/arm64/mm/hugetlbpage.c | 34 +++++++++++-----------------------
2 files changed, 26 insertions(+), 40 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a509b63bd4dd5..248a2d7ad6dbb 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1487,29 +1487,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vma_shift = get_vma_page_shift(vma, hva);
}
- switch (vma_shift) {
#ifndef __PAGETABLE_PMD_FOLDED
- case PUD_SHIFT:
- if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
- break;
- fallthrough;
+ if (vma_shift == PUD_SHIFT) {
+ if (!fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+ vma_shift = PMD_SHIFT;
+ }
#endif
- case CONT_PMD_SHIFT:
+ if (vma_shift == CONT_PMD_SHIFT) {
vma_shift = PMD_SHIFT;
- fallthrough;
- case PMD_SHIFT:
- if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
- break;
- fallthrough;
- case CONT_PTE_SHIFT:
+ }
+ if (vma_shift == PMD_SHIFT) {
+ if (!fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+ vma_shift = PAGE_SHIFT;
+ }
+ if (vma_shift == CONT_PTE_SHIFT) {
vma_shift = PAGE_SHIFT;
force_pte = true;
- fallthrough;
- case PAGE_SHIFT:
- break;
- default:
- WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
}
+ if (vma_shift != PUD_SHIFT &&
+ vma_shift != PMD_SHIFT &&
+ vma_shift != PAGE_SHIFT)
+ WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
vma_pagesize = 1UL << vma_shift;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 5f1e2103888b7..bc98c20655bba 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -51,16 +51,12 @@ void __init arm64_hugetlb_cma_reserve(void)
static bool __hugetlb_valid_size(unsigned long size)
{
- switch (size) {
#ifndef __PAGETABLE_PMD_FOLDED
- case PUD_SIZE:
+ if (size == PUD_SIZE)
return pud_sect_supported();
#endif
- case CONT_PMD_SIZE:
- case PMD_SIZE:
- case CONT_PTE_SIZE:
+ if (size == CONT_PMD_SIZE || size == PMD_SIZE || size == CONT_PTE_SIZE)
return true;
- }
return false;
}
@@ -104,24 +100,20 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
*pgsize = size;
- switch (size) {
#ifndef __PAGETABLE_PMD_FOLDED
- case PUD_SIZE:
+ if (size == PUD_SIZE) {
if (pud_sect_supported())
contig_ptes = 1;
- break;
+ } else
#endif
- case PMD_SIZE:
+ if (size == PMD_SIZE) {
contig_ptes = 1;
- break;
- case CONT_PMD_SIZE:
+ } else if (size == CONT_PMD_SIZE) {
*pgsize = PMD_SIZE;
contig_ptes = CONT_PMDS;
- break;
- case CONT_PTE_SIZE:
+ } else if (size == CONT_PTE_SIZE) {
*pgsize = PAGE_SIZE;
contig_ptes = CONT_PTES;
- break;
}
return contig_ptes;
@@ -339,20 +331,16 @@ unsigned long hugetlb_mask_last_page(struct hstate *h)
{
unsigned long hp_size = huge_page_size(h);
- switch (hp_size) {
#ifndef __PAGETABLE_PMD_FOLDED
- case PUD_SIZE:
+ if (hp_size == PUD_SIZE)
return PGDIR_SIZE - PUD_SIZE;
#endif
- case CONT_PMD_SIZE:
+ if (hp_size == CONT_PMD_SIZE)
return PUD_SIZE - CONT_PMD_SIZE;
- case PMD_SIZE:
+ if (hp_size == PMD_SIZE)
return PUD_SIZE - PMD_SIZE;
- case CONT_PTE_SIZE:
+ if (hp_size == CONT_PTE_SIZE)
return PMD_SIZE - CONT_PTE_SIZE;
- default:
- break;
- }
return 0UL;
}
--
2.43.0
* [RFC PATCH v1 49/57] arm64: Convert BUILD_BUG_ON to VM_BUG_ON
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (46 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 48/57] arm64: Convert switch to if for non-const comparison values Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 50/57] arm64: Remove PAGE_SZ asm-offset Ryan Roberts
` (10 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
There are some build bug checks that will no longer compile for
boot-time page size because the values they are testing are no longer
compile-time constants. Resolve these by converting them to VM_BUG_ON,
which performs the check at runtime when CONFIG_DEBUG_VM is enabled.
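For readers less familiar with the distinction, here is a minimal userspace sketch in C (not kernel code) of why a build-time assertion stops compiling once its operand becomes a boot-time value. The names boot_page_shift, page_shift_valid() and select_page_shift() are hypothetical and exist only for this illustration; assert() stands in for VM_BUG_ON. Like VM_BUG_ON, which only generates a runtime check when CONFIG_DEBUG_VM is enabled, assert() disappears when NDEBUG is defined.

```c
#include <assert.h>
#include <stdbool.h>

/* Build-time world: the value is a constant expression. */
#define STATIC_PAGE_SHIFT 12
_Static_assert((1UL << STATIC_PAGE_SHIFT) >= 4096, "granule too small");

/* Boot-time world: hypothetical variable, set once during early boot. */
static unsigned int boot_page_shift;

/* Returns true if the shift names one of the three arm64 granules. */
static bool page_shift_valid(unsigned int shift)
{
	return shift == 12 || shift == 14 || shift == 16;
}

static void select_page_shift(unsigned int shift)
{
	/*
	 * _Static_assert(page_shift_valid(shift), "...") cannot compile
	 * here because shift is not a constant expression. The check has
	 * to happen at run time instead, which is what assert() does.
	 */
	assert(page_shift_valid(shift));
	boot_page_shift = shift;
}
```

The trade-off is the usual one: a violated condition is now found when the code runs on a debug kernel rather than when the image is built.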
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/mm/init.c | 6 +++---
arch/arm64/mm/mmu.c | 4 ++--
arch/arm64/mm/pgd.c | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 42eb246949072..4d24034418b39 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -388,15 +388,15 @@ void __init mem_init(void)
* detected at build time already.
*/
#ifdef CONFIG_COMPAT
- BUILD_BUG_ON(TASK_SIZE_32 > DEFAULT_MAP_WINDOW_64);
+ VM_BUG_ON(TASK_SIZE_32 > DEFAULT_MAP_WINDOW_64);
#endif
/*
* Selected page table levels should match when derived from
* scratch using the virtual address range and page size.
*/
- BUILD_BUG_ON(ARM64_HW_PGTABLE_LEVELS(CONFIG_ARM64_VA_BITS) !=
- CONFIG_PGTABLE_LEVELS);
+ VM_BUG_ON(ARM64_HW_PGTABLE_LEVELS(CONFIG_ARM64_VA_BITS) !=
+ CONFIG_PGTABLE_LEVELS);
if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
extern int sysctl_overcommit_memory;
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4cd3b6a73c22..ad7fd3fda705a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -639,8 +639,8 @@ static void __init map_mem(pgd_t *pgdp)
* entire reduced VA space is covered by a single pgd_t which will have
* been populated without the PXNTable attribute by the time we get here.)
*/
- BUILD_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end) &&
- pgd_index(_PAGE_OFFSET(VA_BITS_MIN)) != PTRS_PER_PGD - 1);
+ VM_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end) &&
+ pgd_index(_PAGE_OFFSET(VA_BITS_MIN)) != PTRS_PER_PGD - 1);
early_kfence_pool = arm64_kfence_alloc_pool();
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 0c501cabc2384..4b106510358b1 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -56,7 +56,7 @@ void __init pgtable_cache_init(void)
* With 52-bit physical addresses, the architecture requires the
* top-level table to be aligned to at least 64 bytes.
*/
- BUILD_BUG_ON(PGD_SIZE < 64);
+ VM_BUG_ON(PGD_SIZE < 64);
#endif
/*
--
2.43.0
* [RFC PATCH v1 50/57] arm64: Remove PAGE_SZ asm-offset
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (47 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 49/57] arm64: Convert BUILD_BUG_ON to VM_BUG_ON Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 51/57] arm64: Introduce cpu features for page sizes Ryan Roberts
` (9 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
PAGE_SZ is not used anywhere and for boot-time builds, where it is not a
compile-time constant, we don't know the value anyway. So let's remove
it.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/kernel/asm-offsets.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 27de1dddb0abe..f32b8d7f00b2a 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -111,8 +111,6 @@ int main(void)
BLANK();
DEFINE(VM_EXEC, VM_EXEC);
BLANK();
- DEFINE(PAGE_SZ, PAGE_SIZE);
- BLANK();
DEFINE(DMA_TO_DEVICE, DMA_TO_DEVICE);
DEFINE(DMA_FROM_DEVICE, DMA_FROM_DEVICE);
BLANK();
--
2.43.0
* [RFC PATCH v1 51/57] arm64: Introduce cpu features for page sizes
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (48 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 50/57] arm64: Remove PAGE_SZ asm-offset Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 52/57] arm64: Remove PAGE_SIZE from assembly code Ryan Roberts
` (8 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
Introduce one boot cpu feature per page size, ARM64_USE_PAGE_SIZE_*K.
These will enable use of the alternatives framework to patch code based
on the page size selected at boot.
Additionally, since detected cpu features are reported in the kernel log,
this provides a neat way to find out which page size was selected at boot.
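As a rough illustration of the idea, the following userspace C sketch models a table with one capability entry per page size, all sharing a single match callback. It is a simplified stand-in for the kernel's struct arm64_cpu_capabilities, not a copy of it: struct capability, boot_page_size and detected_page_size_desc() are hypothetical names, and the real callback also takes a scope argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cap { USE_PAGE_SIZE_4K, USE_PAGE_SIZE_16K, USE_PAGE_SIZE_64K };

struct capability {
	const char *desc;
	enum cap id;
	bool (*matches)(const struct capability *entry);
};

/* Hypothetical boot-time value, standing in for PAGE_SIZE. */
static unsigned long boot_page_size = 4096;

/* One callback serves all three entries; at most one entry can match. */
static bool use_page_size(const struct capability *entry)
{
	return (entry->id == USE_PAGE_SIZE_4K && boot_page_size == 4096) ||
	       (entry->id == USE_PAGE_SIZE_16K && boot_page_size == 16384) ||
	       (entry->id == USE_PAGE_SIZE_64K && boot_page_size == 65536);
}

static const struct capability caps[] = {
	{ "4K page size", USE_PAGE_SIZE_4K, use_page_size },
	{ "16K page size", USE_PAGE_SIZE_16K, use_page_size },
	{ "64K page size", USE_PAGE_SIZE_64K, use_page_size },
};

/* Return the description of the matching capability, or NULL if none. */
static const char *detected_page_size_desc(void)
{
	for (size_t i = 0; i < sizeof(caps) / sizeof(caps[0]); i++)
		if (caps[i].matches(&caps[i]))
			return caps[i].desc;
	return NULL;
}
```

Because the three conditions are mutually exclusive, exactly one description is reported for any valid page size, which is what makes the log line usable as a page size indicator.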
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/kernel/cpufeature.c | 25 +++++++++++++++++++++++++
arch/arm64/tools/cpucaps | 3 +++
2 files changed, 28 insertions(+)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 7705c9c0e7142..e5618423bb99d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1831,6 +1831,13 @@ static bool has_nv1(const struct arm64_cpu_capabilities *entry, int scope)
is_midr_in_range_list(read_cpuid_id(), nv1_ni_list)));
}
+static bool use_page_size(const struct arm64_cpu_capabilities *entry, int scope)
+{
+ return (entry->capability == ARM64_USE_PAGE_SIZE_4K && PAGE_SIZE == SZ_4K) ||
+ (entry->capability == ARM64_USE_PAGE_SIZE_16K && PAGE_SIZE == SZ_16K) ||
+ (entry->capability == ARM64_USE_PAGE_SIZE_64K && PAGE_SIZE == SZ_64K);
+}
+
static bool has_lpa2_at_stage1(u64 mmfr0)
{
unsigned int tgran;
@@ -2881,6 +2888,24 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.matches = has_nv1,
ARM64_CPUID_FIELDS_NEG(ID_AA64MMFR4_EL1, E2H0, NI_NV1)
},
+ {
+ .desc = "4K page size",
+ .capability = ARM64_USE_PAGE_SIZE_4K,
+ .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+ .matches = use_page_size,
+ },
+ {
+ .desc = "16K page size",
+ .capability = ARM64_USE_PAGE_SIZE_16K,
+ .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+ .matches = use_page_size,
+ },
+ {
+ .desc = "64K page size",
+ .capability = ARM64_USE_PAGE_SIZE_64K,
+ .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+ .matches = use_page_size,
+ },
{},
};
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index ac3429d892b9a..5cb75675303c6 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -71,6 +71,9 @@ SPECTRE_BHB
SSBS
SVE
UNMAP_KERNEL_AT_EL0
+USE_PAGE_SIZE_4K
+USE_PAGE_SIZE_16K
+USE_PAGE_SIZE_64K
WORKAROUND_834220
WORKAROUND_843419
WORKAROUND_845719
--
2.43.0
* [RFC PATCH v1 52/57] arm64: Remove PAGE_SIZE from assembly code
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (49 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 51/57] arm64: Introduce cpu features for page sizes Ryan Roberts
@ 2024-10-14 10:58 ` Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 53/57] arm64: Runtime-fold pmd level Ryan Roberts
` (7 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:58 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon
Cc: Ryan Roberts, kvmarm, linux-arm-kernel, linux-kernel, linux-mm
Remove usage of PAGE_SHIFT, PAGE_SIZE and PAGE_MASK macros from assembly
code since these are no longer compile-time constants when boot-time
page size is in use.
For the most part, they are replaced with run-time lookups based on the
value of TG0. This is done outside of loops so while there is a cost of
a few extra instructions, performance should not be impacted.
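To make the "lookup once, outside the loop" pattern concrete, here is a userspace C sketch. It assumes the architectural TCR_EL1.TG0 encodings (0 for 4K, 1 for 64K, 2 for 16K); page_size_from_tg0() and clear_page_sketch() are hypothetical names, and the get_tg0 and value_for_page_size assembler macros used by the patch are defined elsewhere in the series and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* TCR_EL1.TG0 granule encodings (2-bit field values). */
enum { TG0_4K = 0, TG0_64K = 1, TG0_16K = 2 };

/* Map a TG0 field value to a page size in bytes; 0 for the reserved value. */
static unsigned long page_size_from_tg0(unsigned int tg0)
{
	switch (tg0) {
	case TG0_4K:
		return 4096;
	case TG0_16K:
		return 16384;
	case TG0_64K:
		return 65536;
	default:
		return 0;
	}
}

/*
 * Zero one page in 64-byte steps. The mask is computed once before the
 * loop (compare "get_page_size x3; sub x3, x3, #1"), and the loop ends
 * when the advancing pointer is page aligned again (compare
 * "tst x0, x3; b.ne 1b"). Returns the number of bytes cleared.
 */
static size_t clear_page_sketch(unsigned char *dest, unsigned int tg0)
{
	unsigned long size = page_size_from_tg0(tg0);
	unsigned char *p = dest;
	uintptr_t mask;

	assert(size != 0);
	mask = size - 1;
	assert(((uintptr_t)dest & mask) == 0);	/* dest must be page aligned */

	do {
		memset(p, 0, 64);
		p += 64;
	} while ((uintptr_t)p & mask);

	return (size_t)(p - dest);
}
```

The only per-iteration change relative to the constant-PAGE_SIZE version is that the mask comes from a register rather than an immediate, which is why the extra cost is confined to the few instructions before the loop.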
However, invalid_host_el2_vect requires that the page shift be an
immediate since it has no registers to spare. So for this, let's use
alternatives patching. This code is guaranteed not to run until after
patching is complete.
__pi_copy_page has no registers to spare to hold the page size, and we
want to avoid having to reload it on every iteration of the loop. Since
I couldn't provably conclude that the function is not called prior to
alternatives patching, I opted to make a copy of the function for each
page size and branch to the right one at the start.
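The "one copy per page size, branch once at the start" approach can be sketched in userspace C as follows. This only illustrates the shape of the technique, not the assembly in the patch: DEFINE_COPY_PAGE, boot_page_size and copy_page_sketch() are hypothetical names, and memcpy() stands in for the ldp/stnp sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stamp out one copy routine per page size. Inside each copy the size is
 * a compile-time constant, so the loop bound can be an immediate and no
 * register has to be reserved to hold it.
 */
#define DEFINE_COPY_PAGE(name, page_size) \
static void name(unsigned char *dest, const unsigned char *src) \
{ \
	for (size_t off = 0; off < (page_size); off += 64) \
		memcpy(dest + off, src + off, 64); \
}

DEFINE_COPY_PAGE(copy_page_4k, 4096)
DEFINE_COPY_PAGE(copy_page_16k, 16384)
DEFINE_COPY_PAGE(copy_page_64k, 65536)

/* Hypothetical boot-time value, standing in for the TG0 lookup. */
static unsigned long boot_page_size = 4096;

/* Single entry point: test the size once, then run the matching copy. */
static void copy_page_sketch(unsigned char *dest, const unsigned char *src)
{
	if (boot_page_size == 65536)
		copy_page_64k(dest, src);
	else if (boot_page_size == 16384)
		copy_page_16k(dest, src);
	else
		copy_page_4k(dest, src);
}
```

The cost is code size (three loop bodies instead of one) in exchange for not depending on when alternatives patching has completed.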
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/assembler.h | 18 +++++++++++++---
arch/arm64/kernel/hibernate-asm.S | 6 ++++--
arch/arm64/kernel/relocate_kernel.S | 10 ++++++---
arch/arm64/kvm/hyp/nvhe/host.S | 10 ++++++++-
arch/arm64/lib/clear_page.S | 7 ++++--
arch/arm64/lib/copy_page.S | 33 +++++++++++++++++++++--------
arch/arm64/lib/mte.S | 27 +++++++++++++++++------
7 files changed, 85 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 77c2d707adb1a..6424fd6be1cbe 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -495,9 +495,11 @@ alternative_endif
.Lskip_\@:
.endm
/*
- * copy_page - copy src to dest using temp registers t1-t8
+ * copy_page - copy src to dest using temp registers t1-t9
*/
- .macro copy_page dest:req src:req t1:req t2:req t3:req t4:req t5:req t6:req t7:req t8:req
+ .macro copy_page dest:req src:req t1:req t2:req t3:req t4:req t5:req t6:req t7:req t8:req t9:req
+ get_page_size \t9
+ sub \t9, \t9, #1 // (PAGE_SIZE - 1) in \t9
9998: ldp \t1, \t2, [\src]
ldp \t3, \t4, [\src, #16]
ldp \t5, \t6, [\src, #32]
@@ -508,7 +510,7 @@ alternative_endif
stnp \t5, \t6, [\dest, #32]
stnp \t7, \t8, [\dest, #48]
add \dest, \dest, #64
- tst \src, #(PAGE_SIZE - 1)
+ tst \src, \t9
b.ne 9998b
.endm
@@ -911,4 +913,14 @@ alternative_cb_end
.macro tgran_lpa2, val, tg0
value_for_page_size \val, \tg0, ID_AA64MMFR0_EL1_TGRAN4_52_BIT, ID_AA64MMFR0_EL1_TGRAN16_52_BIT, -1
.endm
+
+ .macro get_page_size, val
+ get_tg0 \val
+ value_for_page_size \val, \val, SZ_4K, SZ_16K, SZ_64K
+ .endm
+
+ .macro get_page_mask, val
+ get_tg0 \val
+ value_for_page_size \val, \val, (~(SZ_4K-1)), (~(SZ_16K-1)), (~(SZ_64K-1))
+ .endm
#endif /* __ASM_ASSEMBLER_H */
diff --git a/arch/arm64/kernel/hibernate-asm.S b/arch/arm64/kernel/hibernate-asm.S
index 0e1d9c3c6a933..375b2fcf82e84 100644
--- a/arch/arm64/kernel/hibernate-asm.S
+++ b/arch/arm64/kernel/hibernate-asm.S
@@ -57,6 +57,8 @@ SYM_CODE_START(swsusp_arch_suspend_exit)
mov x24, x4
mov x25, x5
+ get_page_size x12
+
/* walk the restore_pblist and use copy_page() to over-write memory */
mov x19, x3
@@ -64,9 +66,9 @@ SYM_CODE_START(swsusp_arch_suspend_exit)
mov x0, x10
ldr x1, [x19, #HIBERN_PBE_ADDR]
- copy_page x0, x1, x2, x3, x4, x5, x6, x7, x8, x9
+ copy_page x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x11
- add x1, x10, #PAGE_SIZE
+ add x1, x10, x12
/* Clean the copied page to PoU - based on caches_clean_inval_pou() */
raw_dcache_line_size x2, x3
sub x3, x2, #1
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index 413f899e4ac63..bc4f37fba6c74 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -46,6 +46,10 @@ SYM_CODE_START(arm64_relocate_new_kernel)
ldr x27, [x0, #KIMAGE_ARCH_EL2_VECTORS]
ldr x26, [x0, #KIMAGE_ARCH_DTB_MEM]
+ /* Grab page size values. */
+ get_page_size x10 /* x10 = PAGE_SIZE */
+ get_page_mask x11 /* x11 = PAGE_MASK */
+
/* Setup the list loop variables. */
ldr x18, [x0, #KIMAGE_ARCH_ZERO_PAGE] /* x18 = zero page for BBM */
ldr x17, [x0, #KIMAGE_ARCH_TTBR1] /* x17 = linear map copy */
@@ -54,7 +58,7 @@ SYM_CODE_START(arm64_relocate_new_kernel)
raw_dcache_line_size x15, x1 /* x15 = dcache line size */
break_before_make_ttbr_switch x18, x17, x1, x2 /* set linear map */
.Lloop:
- and x12, x16, PAGE_MASK /* x12 = addr */
+ and x12, x16, x11 /* x12 = addr */
sub x12, x12, x22 /* Convert x12 to virt */
/* Test the entry flags. */
.Ltest_source:
@@ -62,8 +66,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
/* Invalidate dest page to PoC. */
mov x19, x13
- copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
- add x1, x19, #PAGE_SIZE
+ copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8, x9
+ add x1, x19, x10
dcache_by_myline_op civac, sy, x19, x1, x15, x20
b .Lnext
.Ltest_indirection:
diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S
index 3d610fc51f4d3..2b0d583fcf1af 100644
--- a/arch/arm64/kvm/hyp/nvhe/host.S
+++ b/arch/arm64/kvm/hyp/nvhe/host.S
@@ -193,7 +193,15 @@ SYM_FUNC_END(__host_hvc)
*/
add sp, sp, x0 // sp' = sp + x0
sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp
- tbz x0, #PAGE_SHIFT, .L__hyp_sp_overflow\@
+alternative_if ARM64_USE_PAGE_SIZE_4K
+ tbz x0, #ARM64_PAGE_SHIFT_4K, .L__hyp_sp_overflow\@
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_16K
+ tbz x0, #ARM64_PAGE_SHIFT_16K, .L__hyp_sp_overflow\@
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_64K
+ tbz x0, #ARM64_PAGE_SHIFT_64K, .L__hyp_sp_overflow\@
+alternative_else_nop_endif
sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_page.S
index ebde40e7fa2b2..b6f2cb8d704cc 100644
--- a/arch/arm64/lib/clear_page.S
+++ b/arch/arm64/lib/clear_page.S
@@ -15,6 +15,9 @@
* x0 - dest
*/
SYM_FUNC_START(__pi_clear_page)
+ get_page_size x3
+ sub x3, x3, #1 /* (PAGE_SIZE - 1) in x3 */
+
mrs x1, dczid_el0
tbnz x1, #4, 2f /* Branch if DC ZVA is prohibited */
and w1, w1, #0xf
@@ -23,7 +26,7 @@ SYM_FUNC_START(__pi_clear_page)
1: dc zva, x0
add x0, x0, x1
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 1b
ret
@@ -32,7 +35,7 @@ SYM_FUNC_START(__pi_clear_page)
stnp xzr, xzr, [x0, #32]
stnp xzr, xzr, [x0, #48]
add x0, x0, #64
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 2b
ret
SYM_FUNC_END(__pi_clear_page)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 6a56d7cf309da..6c19b03ab4d69 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -10,14 +10,7 @@
#include <asm/cpufeature.h>
#include <asm/alternative.h>
-/*
- * Copy a page from src to dest (both are page aligned)
- *
- * Parameters:
- * x0 - dest
- * x1 - src
- */
-SYM_FUNC_START(__pi_copy_page)
+ .macro copy_page_body, page_size
ldp x2, x3, [x1]
ldp x4, x5, [x1, #16]
ldp x6, x7, [x1, #32]
@@ -30,7 +23,7 @@ SYM_FUNC_START(__pi_copy_page)
add x0, x0, #256
add x1, x1, #128
1:
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, #(\page_size - 1)
stnp x2, x3, [x0, #-256]
ldp x2, x3, [x1]
@@ -62,7 +55,29 @@ SYM_FUNC_START(__pi_copy_page)
stnp x12, x13, [x0, #80 - 256]
stnp x14, x15, [x0, #96 - 256]
stnp x16, x17, [x0, #112 - 256]
+ .endm
+/*
+ * Copy a page from src to dest (both are page aligned)
+ *
+ * Parameters:
+ * x0 - dest
+ * x1 - src
+ */
+SYM_FUNC_START(__pi_copy_page)
+ get_tg0 x2
+.Lsz_64k:
+ cmp x2, #TCR_TG0_64K
+ b.ne .Lsz_16k
+ copy_page_body SZ_64K
+ ret
+.Lsz_16k:
+ cmp x2, #TCR_TG0_16K
+ b.ne .Lsz_4k
+ copy_page_body SZ_16K
+ ret
+.Lsz_4k:
+ copy_page_body SZ_4K
ret
SYM_FUNC_END(__pi_copy_page)
SYM_FUNC_ALIAS(copy_page, __pi_copy_page)
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf3..b4f6f5be0ec79 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -28,10 +28,13 @@
* x0 - address of the page to be cleared
*/
SYM_FUNC_START(mte_clear_page_tags)
+ get_page_size x3
+ sub x3, x3, #1 // (PAGE_SIZE - 1) in x3
+
multitag_transfer_size x1, x2
1: stgm xzr, [x0]
add x0, x0, x1
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 1b
ret
SYM_FUNC_END(mte_clear_page_tags)
@@ -43,6 +46,9 @@ SYM_FUNC_END(mte_clear_page_tags)
* x0 - address to the beginning of the page
*/
SYM_FUNC_START(mte_zero_clear_page_tags)
+ get_page_size x3
+ sub x3, x3, #1 // (PAGE_SIZE - 1) in x3
+
and x0, x0, #(1 << MTE_TAG_SHIFT) - 1 // clear the tag
mrs x1, dczid_el0
tbnz x1, #4, 2f // Branch if DC GZVA is prohibited
@@ -52,12 +58,12 @@ SYM_FUNC_START(mte_zero_clear_page_tags)
1: dc gzva, x0
add x0, x0, x1
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 1b
ret
2: stz2g x0, [x0], #(MTE_GRANULE_SIZE * 2)
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 2b
ret
SYM_FUNC_END(mte_zero_clear_page_tags)
@@ -68,6 +74,9 @@ SYM_FUNC_END(mte_zero_clear_page_tags)
* x1 - address of the source page
*/
SYM_FUNC_START(mte_copy_page_tags)
+ get_page_size x7
+ sub x7, x7, #1 // (PAGE_SIZE - 1) in x7
+
mov x2, x0
mov x3, x1
multitag_transfer_size x5, x6
@@ -75,7 +84,7 @@ SYM_FUNC_START(mte_copy_page_tags)
stgm x4, [x2]
add x2, x2, x5
add x3, x3, x5
- tst x2, #(PAGE_SIZE - 1)
+ tst x2, x7
b.ne 1b
ret
SYM_FUNC_END(mte_copy_page_tags)
@@ -137,6 +146,9 @@ SYM_FUNC_END(mte_copy_tags_to_user)
* x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
*/
SYM_FUNC_START(mte_save_page_tags)
+ get_page_size x3
+ sub x3, x3, #1 // (PAGE_SIZE - 1) in x3
+
multitag_transfer_size x7, x5
1:
mov x2, #0
@@ -149,7 +161,7 @@ SYM_FUNC_START(mte_save_page_tags)
str x2, [x1], #8
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 1b
ret
@@ -161,6 +173,9 @@ SYM_FUNC_END(mte_save_page_tags)
* x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
*/
SYM_FUNC_START(mte_restore_page_tags)
+ get_page_size x3
+ sub x3, x3, #1 // (PAGE_SIZE - 1) in x3
+
multitag_transfer_size x7, x5
1:
ldr x2, [x1], #8
@@ -170,7 +185,7 @@ SYM_FUNC_START(mte_restore_page_tags)
tst x0, #0xFF
b.ne 2b
- tst x0, #(PAGE_SIZE - 1)
+ tst x0, x3
b.ne 1b
ret
--
2.43.0
* [RFC PATCH v1 53/57] arm64: Runtime-fold pmd level
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (50 preceding siblings ...)
2024-10-14 10:58 ` [RFC PATCH v1 52/57] arm64: Remove PAGE_SIZE from assembly code Ryan Roberts
@ 2024-10-14 10:59 ` Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 54/57] arm64: Support runtime folding in idmap_kpti_install_ng_mappings Ryan Roberts
` (6 subsequent siblings)
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 10:59 UTC (permalink / raw)
To: Aneesh Kumar K.V, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Nick Piggin, Oliver Upton,
Peter Zijlstra, Will Deacon
Cc: Ryan Roberts, kvmarm, linux-arch, linux-arm-kernel, linux-kernel,
linux-mm
For a given VA size, the number of levels of lookup depends on the page
size. With boot-time page size selection, we therefore don't know how
many levels of lookup we require until boot time. So we need to
runtime-fold some levels of lookup.
We already have code to runtime-fold p4d and pud levels; that exists for
LPA2 fallback paths and can be repurposed for our needs. But pmd level
also needs to support runtime folding; for example, 16K/36-bit and
64K/42-bit configs require only 2 levels.
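The arithmetic behind those examples can be checked with a small C sketch. It assumes each table is one page of 8-byte descriptors, so every level resolves (page_shift - 3) bits of the virtual address; pgtable_levels() is a hypothetical helper written for this note, not a function from the series.

```c
#include <assert.h>

/*
 * Number of translation levels needed to resolve va_bits of virtual
 * address for a given page shift. Each level resolves (page_shift - 3)
 * bits because a table is one page of 8-byte descriptors.
 */
static unsigned int pgtable_levels(unsigned int va_bits, unsigned int page_shift)
{
	unsigned int bits_per_level = page_shift - 3;
	unsigned int bits_to_resolve = va_bits - page_shift;

	/* Round up: a partially used top level still counts as a level. */
	return (bits_to_resolve + bits_per_level - 1) / bits_per_level;
}
```

For 16K pages and 36 VA bits this gives (36 - 14) / 11 = 2, and for 64K pages and 42 VA bits it gives (42 - 16) / 13 = 2, matching the two configs named above; 4K pages with 48 VA bits need 4 levels.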
So let's add the required code. However, note that until we actually add
the boot-time page size config, pgtable_l3_enabled() simply returns the
compile-time determined answer.
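What runtime folding means for a lookup helper can be sketched in userspace C as below. This is a simplified model under stated assumptions: l3_enabled is a hypothetical variable standing in for pgtable_l3_enabled(), PMD_SHIFT and PTRS_PER_PMD use the 4K-page values as fixed examples, and the next-level table is passed in directly rather than derived from the physical address in the pud entry as the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t val; } pud_t;
typedef struct { uint64_t val; } pmd_t;

/* Example values for 4K pages; these vary with page size in the series. */
#define PMD_SHIFT 21u
#define PTRS_PER_PMD 512u

/* Hypothetical stand-in for pgtable_l3_enabled(). */
static bool l3_enabled = true;

static unsigned int pmd_index(uint64_t addr)
{
	return (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

/*
 * Return the pmd entry for addr. When the pmd level is folded, there is
 * no separate pmd table: the pud entry itself is treated as the pmd
 * entry, so the walk simply reuses the pointer it already has.
 */
static pmd_t *pmd_offset_sketch(pud_t *pudp, pmd_t *pmd_table, uint64_t addr)
{
	if (!l3_enabled)
		return (pmd_t *)pudp;
	return pmd_table + pmd_index(addr);
}
```

Generic page table walkers can then keep calling the pmd-level helpers unconditionally; whether a real table level sits behind them is decided by the check inside the helper.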
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/pgalloc.h | 16 +++-
arch/arm64/include/asm/pgtable.h | 123 +++++++++++++++++++++++--------
arch/arm64/include/asm/tlb.h | 3 +
arch/arm64/kernel/cpufeature.c | 4 +-
arch/arm64/kvm/mmu.c | 9 +--
arch/arm64/mm/fixmap.c | 2 +-
arch/arm64/mm/hugetlbpage.c | 16 ++--
arch/arm64/mm/init.c | 2 +-
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/ptdump.c | 3 +-
10 files changed, 126 insertions(+), 54 deletions(-)
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 8ff5f2a2579e4..51cc2f32931d2 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -15,6 +15,7 @@
#define __HAVE_ARCH_PGD_FREE
#define __HAVE_ARCH_PUD_FREE
+#define __HAVE_ARCH_PMD_FREE
#include <asm-generic/pgalloc.h>
#define PGD_SIZE (PTRS_PER_PGD * sizeof(pgd_t))
@@ -23,7 +24,8 @@
static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
{
- set_pud(pudp, __pud(__phys_to_pud_val(pmdp) | prot));
+ if (pgtable_l3_enabled())
+ set_pud(pudp, __pud(__phys_to_pud_val(pmdp) | prot));
}
static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
@@ -33,6 +35,18 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
pudval |= (mm == &init_mm) ? PUD_TABLE_UXN : PUD_TABLE_PXN;
__pud_populate(pudp, __pa(pmdp), pudval);
}
+
+static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
+{
+ struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
+
+ if (!pgtable_l3_enabled())
+ return;
+
+ BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+ pagetable_pmd_dtor(ptdesc);
+ pagetable_free(ptdesc);
+}
#else
static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
{
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fd47f70a42396..8ead41da715b0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -672,15 +672,21 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
#define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
#define pte_leaf_size(pte) (pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
-#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
-static inline bool pud_sect(pud_t pud) { return false; }
-static inline bool pud_table(pud_t pud) { return true; }
-#else
-#define pud_sect(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \
- PUD_TYPE_SECT)
-#define pud_table(pud) ((pud_val(pud) & PUD_TYPE_MASK) == \
- PUD_TYPE_TABLE)
-#endif
+static inline bool pgtable_l3_enabled(void);
+
+static inline bool pud_sect(pud_t pud)
+{
+ if (PAGE_SIZE == SZ_64K || !pgtable_l3_enabled())
+ return false;
+ return (pud_val(pud) & PUD_TYPE_MASK) == PUD_TYPE_SECT;
+}
+
+static inline bool pud_table(pud_t pud)
+{
+ if (PAGE_SIZE == SZ_64K || !pgtable_l3_enabled())
+ return true;
+ return (pud_val(pud) & PUD_TYPE_MASK) == PUD_TYPE_TABLE;
+}
extern pgd_t init_pg_dir[];
extern pgd_t init_pg_end[];
@@ -699,12 +705,10 @@ static inline bool in_swapper_pgdir(void *addr)
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
-#ifdef __PAGETABLE_PMD_FOLDED
- if (in_swapper_pgdir(pmdp)) {
+ if (!pgtable_l3_enabled() && in_swapper_pgdir(pmdp)) {
set_swapper_pgd((pgd_t *)pmdp, __pgd(pmd_val(pmd)));
return;
}
-#endif /* __PAGETABLE_PMD_FOLDED */
WRITE_ONCE(*pmdp, pmd);
@@ -749,20 +753,27 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#if CONFIG_PGTABLE_LEVELS > 2
+static __always_inline bool pgtable_l3_enabled(void)
+{
+ return true;
+}
+
+static inline bool mm_pmd_folded(const struct mm_struct *mm)
+{
+ return !pgtable_l3_enabled();
+}
+#define mm_pmd_folded mm_pmd_folded
+
#define pmd_ERROR(e) \
pr_err("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
-#define pud_none(pud) (!pud_val(pud))
-#define pud_bad(pud) (!pud_table(pud))
-#define pud_present(pud) pte_present(pud_pte(pud))
-#ifndef __PAGETABLE_PMD_FOLDED
-#define pud_leaf(pud) (pud_present(pud) && !pud_table(pud))
-#else
-#define pud_leaf(pud) false
-#endif
-#define pud_valid(pud) pte_valid(pud_pte(pud))
-#define pud_user(pud) pte_user(pud_pte(pud))
-#define pud_user_exec(pud) pte_user_exec(pud_pte(pud))
+#define pud_none(pud) (pgtable_l3_enabled() && !pud_val(pud))
+#define pud_bad(pud) (pgtable_l3_enabled() && !pud_table(pud))
+#define pud_present(pud) (!pgtable_l3_enabled() || pte_present(pud_pte(pud)))
+#define pud_leaf(pud) (pgtable_l3_enabled() && pte_present(pud_pte(pud)) && !pud_table(pud))
+#define pud_valid(pud) (pgtable_l3_enabled() && pte_valid(pud_pte(pud)))
+#define pud_user(pud) (pgtable_l3_enabled() && pte_user(pud_pte(pud)))
+#define pud_user_exec(pud) (pgtable_l3_enabled() && pte_user_exec(pud_pte(pud)))
static inline bool pgtable_l4_enabled(void);
@@ -783,7 +794,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
static inline void pud_clear(pud_t *pudp)
{
- set_pud(pudp, __pud(0));
+ if (pgtable_l3_enabled())
+ set_pud(pudp, __pud(0));
}
static inline phys_addr_t pud_page_paddr(pud_t pud)
@@ -791,25 +803,74 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
return __pud_to_phys(pud);
}
+#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
+
+static inline pmd_t *pud_to_folded_pmd(pud_t *pudp, unsigned long addr)
+{
+ return (pmd_t *)pudp;
+}
+
static inline pmd_t *pud_pgtable(pud_t pud)
{
return (pmd_t *)__va(pud_page_paddr(pud));
}
-/* Find an entry in the second-level page table. */
-#define pmd_offset_phys(dir, addr) (pud_page_paddr(READ_ONCE(*(dir))) + pmd_index(addr) * sizeof(pmd_t))
+static inline phys_addr_t pmd_offset_phys(pud_t *pudp, unsigned long addr)
+{
+ BUG_ON(!pgtable_l3_enabled());
+
+ return pud_page_paddr(READ_ONCE(*pudp)) + pmd_index(addr) * sizeof(pmd_t);
+}
+
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud,
+ unsigned long addr)
+{
+ if (!pgtable_l3_enabled())
+ return pud_to_folded_pmd(pudp, addr);
+ return (pmd_t *)__va(pud_page_paddr(pud)) + pmd_index(addr);
+}
+#define pmd_offset_lockless pmd_offset_lockless
-#define pmd_set_fixmap(addr) ((pmd_t *)set_fixmap_offset(FIX_PMD, addr))
-#define pmd_set_fixmap_offset(pud, addr) pmd_set_fixmap(pmd_offset_phys(pud, addr))
-#define pmd_clear_fixmap() clear_fixmap(FIX_PMD)
+static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long addr)
+{
+ return pmd_offset_lockless(pudp, READ_ONCE(*pudp), addr);
+}
+#define pmd_offset pmd_offset
-#define pud_page(pud) phys_to_page(__pud_to_phys(pud))
+static inline pmd_t *pmd_set_fixmap(unsigned long addr)
+{
+ if (!pgtable_l3_enabled())
+ return NULL;
+ return (pmd_t *)set_fixmap_offset(FIX_PMD, addr);
+}
+
+static inline pmd_t *pmd_set_fixmap_offset(pud_t *pudp, unsigned long addr)
+{
+ if (!pgtable_l3_enabled())
+ return pud_to_folded_pmd(pudp, addr);
+ return pmd_set_fixmap(pmd_offset_phys(pudp, addr));
+}
+
+static inline void pmd_clear_fixmap(void)
+{
+ if (pgtable_l3_enabled())
+ clear_fixmap(FIX_PMD);
+}
/* use ONLY for statically allocated translation tables */
-#define pmd_offset_kimg(dir,addr) ((pmd_t *)__phys_to_kimg(pmd_offset_phys((dir), (addr))))
+static inline pmd_t *pmd_offset_kimg(pud_t *pudp, u64 addr)
+{
+ if (!pgtable_l3_enabled())
+ return pud_to_folded_pmd(pudp, addr);
+ return (pmd_t *)__phys_to_kimg(pmd_offset_phys(pudp, addr));
+}
+
+#define pud_page(pud) phys_to_page(__pud_to_phys(pud))
#else
+static inline bool pgtable_l3_enabled(void) { return false; }
+
#define pud_valid(pud) false
#define pud_page_paddr(pud) ({ BUILD_BUG(); 0; })
#define pud_user_exec(pud) pud_user(pud) /* Always 0 with folding */
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a947c6e784ed2..527630f0803c6 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -92,6 +92,9 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
{
struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
+ if (!pgtable_l3_enabled())
+ return;
+
pagetable_pmd_dtor(ptdesc);
tlb_remove_ptdesc(tlb, ptdesc);
}
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index e5618423bb99d..663cc76569a27 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1923,8 +1923,10 @@ static int __init __kpti_install_ng_mappings(void *__unused)
if (levels == 5 && !pgtable_l5_enabled())
levels = 4;
- else if (levels == 4 && !pgtable_l4_enabled())
+ if (levels == 4 && !pgtable_l4_enabled())
levels = 3;
+ if (levels == 3 && !pgtable_l3_enabled())
+ levels = 2;
remap_fn = (void *)__pa_symbol(idmap_kpti_install_ng_mappings);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 248a2d7ad6dbb..146ecdaaaf647 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1370,12 +1370,11 @@ static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
-#ifndef __PAGETABLE_PMD_FOLDED
- if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
+ if (pgtable_l3_enabled() &&
+ (hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
ALIGN(hva, PUD_SIZE) <= vma->vm_end)
return PUD_SHIFT;
-#endif
if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
@@ -1487,12 +1486,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
vma_shift = get_vma_page_shift(vma, hva);
}
-#ifndef __PAGETABLE_PMD_FOLDED
- if (vma_shift == PUD_SHIFT) {
+ if (pgtable_l3_enabled() && vma_shift == PUD_SHIFT) {
if (!fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
vma_shift = PMD_SHIFT;
}
-#endif
if (vma_shift == CONT_PMD_SHIFT) {
vma_shift = PMD_SHIFT;
}
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index a0dcf2375ccb4..f2c6678046a96 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -87,7 +87,7 @@ static void __init early_fixmap_init_pud(p4d_t *p4dp, unsigned long addr,
p4d_t p4d = READ_ONCE(*p4dp);
pud_t *pudp;
- if (CONFIG_PGTABLE_LEVELS > 3 && !p4d_none(p4d) &&
+ if (ptg_pgtable_levels > 3 && !p4d_none(p4d) &&
p4d_page_paddr(p4d) != __pa_symbol(bm_pud)) {
/*
* We only end up here if the kernel mapping and the fixmap
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index bc98c20655bba..2add0839179e3 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -51,10 +51,9 @@ void __init arm64_hugetlb_cma_reserve(void)
static bool __hugetlb_valid_size(unsigned long size)
{
-#ifndef __PAGETABLE_PMD_FOLDED
- if (size == PUD_SIZE)
+ if (pgtable_l3_enabled() && size == PUD_SIZE)
return pud_sect_supported();
-#endif
+
if (size == CONT_PMD_SIZE || size == PMD_SIZE || size == CONT_PTE_SIZE)
return true;
@@ -100,13 +99,10 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
*pgsize = size;
-#ifndef __PAGETABLE_PMD_FOLDED
- if (size == PUD_SIZE) {
+ if (pgtable_l3_enabled() && size == PUD_SIZE) {
if (pud_sect_supported())
contig_ptes = 1;
- } else
-#endif
- if (size == PMD_SIZE) {
+ } else if (size == PMD_SIZE) {
contig_ptes = 1;
} else if (size == CONT_PMD_SIZE) {
*pgsize = PMD_SIZE;
@@ -331,10 +327,8 @@ unsigned long hugetlb_mask_last_page(struct hstate *h)
{
unsigned long hp_size = huge_page_size(h);
-#ifndef __PAGETABLE_PMD_FOLDED
- if (hp_size == PUD_SIZE)
+ if (pgtable_l3_enabled() && hp_size == PUD_SIZE)
return PGDIR_SIZE - PUD_SIZE;
-#endif
if (hp_size == CONT_PMD_SIZE)
return PUD_SIZE - CONT_PMD_SIZE;
if (hp_size == PMD_SIZE)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 4d24034418b39..62587104f30d8 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -396,7 +396,7 @@ void __init mem_init(void)
* scratch using the virtual address range and page size.
*/
VM_BUG_ON(ARM64_HW_PGTABLE_LEVELS(CONFIG_ARM64_VA_BITS) !=
- CONFIG_PGTABLE_LEVELS);
+ ptg_pgtable_levels);
if (PAGE_SIZE >= 16384 && get_num_physpages() <= 128) {
extern int sysctl_overcommit_memory;
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ad7fd3fda705a..b78a341cd9e70 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1046,7 +1046,7 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
free_empty_pte_table(pmdp, addr, next, floor, ceiling);
} while (addr = next, addr < end);
- if (CONFIG_PGTABLE_LEVELS <= 2)
+ if (!pgtable_l3_enabled())
return;
if (!pgtable_range_aligned(start, end, floor, ceiling, PUD_MASK))
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 6986827e0d645..045a4188afc10 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -230,7 +230,8 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
/* check if the current level has been folded dynamically */
if ((level == 1 && mm_p4d_folded(st->mm)) ||
- (level == 2 && mm_pud_folded(st->mm)))
+ (level == 2 && mm_pud_folded(st->mm)) ||
+ (level == 3 && mm_pmd_folded(st->mm)))
level = 0;
if (level >= 0)
--
2.43.0
* [RFC PATCH v1 54/57] arm64: Support runtime folding in idmap_kpti_install_ng_mappings
From: Ryan Roberts @ 2024-10-14 10:59 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
TODO:
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
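For readers following the proc.S hunks below, here is a minimal C sketch of two
calculations that the assembly now performs at runtime: the pte_addr_low mask
(mirroring the "Precalculate pte_addr_low mask" sequence) and the fold test
against a ptrs_per_* value of 1. The struct layout mirrors
install_ng_pgtable_geometry from this patch; the helper names
pte_addr_low_mask() and effective_levels() are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct install_ng_pgtable_geometry from the patch. */
struct pgtable_geometry {
	unsigned long ptrs_per_pte;
	unsigned long ptrs_per_pmd;
	unsigned long ptrs_per_pud;
	unsigned long ptrs_per_p4d;
	unsigned long ptrs_per_pgd;
};

/*
 * Hypothetical helper: same arithmetic as the asm sequence,
 * ((1 << (50 - page_shift)) - 1) << page_shift,
 * which sets bits [49:page_shift].
 */
static uint64_t pte_addr_low_mask(unsigned int page_shift)
{
	assert(page_shift == 12 || page_shift == 14 || page_shift == 16);
	return ((UINT64_C(1) << (50 - page_shift)) - 1) << page_shift;
}

/*
 * Hypothetical helper: a level with a single entry is treated as folded,
 * matching the "cmp ptrs_per_<level>, #1 / b.eq .Lfold_<level>" checks.
 * The pgd and pte levels are always walked, so the count starts at 2.
 */
static int effective_levels(const struct pgtable_geometry *g)
{
	int levels = 2;

	levels += g->ptrs_per_p4d > 1;
	levels += g->ptrs_per_pud > 1;
	levels += g->ptrs_per_pmd > 1;
	return levels;
}
```

As an example under these assumptions, a 64K/48-bit VA geometry with 8192 pte
entries, 8192 pmd entries, folded pud and p4d, and 64 pgd entries counts as 3
effective levels.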
arch/arm64/include/asm/assembler.h | 5 ++
arch/arm64/kernel/cpufeature.c | 21 +++++-
arch/arm64/mm/proc.S | 107 ++++++++++++++++++++++-------
3 files changed, 108 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 6424fd6be1cbe..0cfa7c3efd214 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -919,6 +919,11 @@ alternative_cb_end
value_for_page_size \val, \val, SZ_4K, SZ_16K, SZ_64K
.endm
+ .macro get_page_shift, val
+ get_tg0 \val
+ value_for_page_size \val, \val, ARM64_PAGE_SHIFT_4K, ARM64_PAGE_SHIFT_16K, ARM64_PAGE_SHIFT_64K
+ .endm
+
.macro get_page_mask, val
get_tg0 \val
value_for_page_size \val, \val, (~(SZ_4K-1)), (~(SZ_16K-1)), (~(SZ_64K-1))
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 663cc76569a27..ee94de556d3f0 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1908,11 +1908,27 @@ static phys_addr_t __init kpti_ng_pgd_alloc(int shift)
return kpti_ng_temp_alloc;
}
+struct install_ng_pgtable_geometry {
+ unsigned long ptrs_per_pte;
+ unsigned long ptrs_per_pmd;
+ unsigned long ptrs_per_pud;
+ unsigned long ptrs_per_p4d;
+ unsigned long ptrs_per_pgd;
+};
+
static int __init __kpti_install_ng_mappings(void *__unused)
{
- typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
+ typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long,
+ struct install_ng_pgtable_geometry *);
extern kpti_remap_fn idmap_kpti_install_ng_mappings;
kpti_remap_fn *remap_fn;
+ struct install_ng_pgtable_geometry geometry = {
+ .ptrs_per_pte = PTRS_PER_PTE,
+ .ptrs_per_pmd = PTRS_PER_PMD,
+ .ptrs_per_pud = PTRS_PER_PUD,
+ .ptrs_per_p4d = PTRS_PER_P4D,
+ .ptrs_per_pgd = PTRS_PER_PGD,
+ };
int cpu = smp_processor_id();
int levels = CONFIG_PGTABLE_LEVELS;
@@ -1957,7 +1973,8 @@ static int __init __kpti_install_ng_mappings(void *__unused)
}
cpu_install_idmap();
- remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
+ remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA,
+ &geometry);
cpu_uninstall_idmap();
if (!cpu) {
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index ab5aa84923524..11bf6ba6dac33 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -190,7 +190,7 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
.pushsection ".idmap.text", "a"
.macro pte_to_phys, phys, pte
- and \phys, \pte, #PTE_ADDR_LOW
+ and \phys, \pte, pte_addr_low
#ifdef CONFIG_ARM64_PA_BITS_52
and \pte, \pte, #PTE_ADDR_HIGH
orr \phys, \phys, \pte, lsl #PTE_ADDR_HIGH_SHIFT
@@ -198,7 +198,8 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
.endm
.macro kpti_mk_tbl_ng, type, num_entries
- add end_\type\()p, cur_\type\()p, #\num_entries * 8
+ lsl scratch, \num_entries, #3
+ add end_\type\()p, cur_\type\()p, scratch
.Ldo_\type:
ldr \type, [cur_\type\()p], #8 // Load the entry and advance
tbz \type, #0, .Lnext_\type // Skip invalid and
@@ -220,14 +221,18 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
.macro kpti_map_pgtbl, type, level
str xzr, [temp_pte, #8 * (\level + 2)] // break before make
dsb nshst
- add pte, temp_pte, #PAGE_SIZE * (\level + 2)
+ mov scratch, #(\level + 2)
+ mul scratch, scratch, page_size
+ add pte, temp_pte, scratch
lsr pte, pte, #12
tlbi vaae1, pte
dsb nsh
isb
phys_to_pte pte, cur_\type\()p
- add cur_\type\()p, temp_pte, #PAGE_SIZE * (\level + 2)
+ mov scratch, #(\level + 2)
+ mul scratch, scratch, page_size
+ add cur_\type\()p, temp_pte, scratch
orr pte, pte, pte_flags
str pte, [temp_pte, #8 * (\level + 2)]
dsb nshst
@@ -235,7 +240,8 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
/*
* void __kpti_install_ng_mappings(int cpu, int num_secondaries, phys_addr_t temp_pgd,
- * unsigned long temp_pte_va)
+ * unsigned long temp_pte_va,
+ * struct install_ng_pgtable_geometry *geometry)
*
* Called exactly once from stop_machine context by each CPU found during boot.
*/
@@ -251,6 +257,8 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
temp_pgd_phys .req x2
swapper_ttb .req x3
flag_ptr .req x4
+ geometry .req x4
+ scratch .req x4
cur_pgdp .req x5
end_pgdp .req x6
pgd .req x7
@@ -264,18 +272,45 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
valid .req x17
cur_p4dp .req x19
end_p4dp .req x20
-
- mov x5, x3 // preserve temp_pte arg
- mrs swapper_ttb, ttbr1_el1
- adr_l flag_ptr, __idmap_kpti_flag
+ page_size .req x21
+ ptrs_per_pte .req x22
+ ptrs_per_pmd .req x23
+ ptrs_per_pud .req x24
+ ptrs_per_p4d .req x25
+ ptrs_per_pgd .req x26
+ pte_addr_low .req x27
cbnz cpu, __idmap_kpti_secondary
-#if CONFIG_PGTABLE_LEVELS > 4
- stp x29, x30, [sp, #-32]!
+ /* Preserve callee-saved registers */
+ stp x19, x20, [sp, #-96]!
+ stp x21, x22, [sp, #80]
+ stp x23, x24, [sp, #64]
+ stp x25, x26, [sp, #48]
+ stp x27, x28, [sp, #32]
+ stp x29, x30, [sp, #16]
mov x29, sp
- stp x19, x20, [sp, #16]
-#endif
+
+ /* Load pgtable geometry parameters */
+ get_page_size page_size
+ ldr ptrs_per_pte, [geometry, #0]
+ ldr ptrs_per_pmd, [geometry, #8]
+ ldr ptrs_per_pud, [geometry, #16]
+ ldr ptrs_per_p4d, [geometry, #24]
+ ldr ptrs_per_pgd, [geometry, #32]
+
+ /* Precalculate pte_addr_low mask */
+ get_page_shift x0
+ mov pte_addr_low, #50
+ sub pte_addr_low, pte_addr_low, x0
+ mov scratch, #1
+ lsl pte_addr_low, scratch, pte_addr_low
+ sub pte_addr_low, pte_addr_low, #1
+ lsl pte_addr_low, pte_addr_low, x0
+
+ mov temp_pte, x3
+ mrs swapper_ttb, ttbr1_el1
+ adr_l flag_ptr, __idmap_kpti_flag
/* We're the boot CPU. Wait for the others to catch up */
sevl
@@ -290,7 +325,6 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
msr ttbr1_el1, temp_pgd_phys
isb
- mov temp_pte, x5
mov_q pte_flags, KPTI_NG_PTE_FLAGS
/* Everybody is enjoying the idmap, so we can rewrite swapper. */
@@ -320,7 +354,7 @@ alternative_else_nop_endif
/* PGD */
adrp cur_pgdp, swapper_pg_dir
kpti_map_pgtbl pgd, -1
- kpti_mk_tbl_ng pgd, PTRS_PER_PGD
+ kpti_mk_tbl_ng pgd, ptrs_per_pgd
/* Ensure all the updated entries are visible to secondary CPUs */
dsb ishst
@@ -331,21 +365,33 @@ alternative_else_nop_endif
isb
/* Set the flag to zero to indicate that we're all done */
+ adr_l flag_ptr, __idmap_kpti_flag
str wzr, [flag_ptr]
-#if CONFIG_PGTABLE_LEVELS > 4
- ldp x19, x20, [sp, #16]
- ldp x29, x30, [sp], #32
-#endif
+
+ /* Restore callee-saved registers */
+ ldp x29, x30, [sp, #16]
+ ldp x27, x28, [sp, #32]
+ ldp x25, x26, [sp, #48]
+ ldp x23, x24, [sp, #64]
+ ldp x21, x22, [sp, #80]
+ ldp x19, x20, [sp], #96
+
ret
.Lderef_pgd:
/* P4D */
.if CONFIG_PGTABLE_LEVELS > 4
p4d .req x30
+ cmp ptrs_per_p4d, #1
+ b.eq .Lfold_p4d
pte_to_phys cur_p4dp, pgd
kpti_map_pgtbl p4d, 0
- kpti_mk_tbl_ng p4d, PTRS_PER_P4D
+ kpti_mk_tbl_ng p4d, ptrs_per_p4d
b .Lnext_pgd
+.Lfold_p4d:
+ mov p4d, pgd // fold to next level
+ mov cur_p4dp, end_p4dp // must be equal to terminate loop
+ b .Lderef_p4d
.else /* CONFIG_PGTABLE_LEVELS <= 4 */
p4d .req pgd
.set .Lnext_p4d, .Lnext_pgd
@@ -355,10 +401,16 @@ alternative_else_nop_endif
/* PUD */
.if CONFIG_PGTABLE_LEVELS > 3
pud .req x10
+ cmp ptrs_per_pud, #1
+ b.eq .Lfold_pud
pte_to_phys cur_pudp, p4d
kpti_map_pgtbl pud, 1
- kpti_mk_tbl_ng pud, PTRS_PER_PUD
+ kpti_mk_tbl_ng pud, ptrs_per_pud
b .Lnext_p4d
+.Lfold_pud:
+ mov pud, p4d // fold to next level
+ mov cur_pudp, end_pudp // must be equal to terminate loop
+ b .Lderef_pud
.else /* CONFIG_PGTABLE_LEVELS <= 3 */
pud .req pgd
.set .Lnext_pud, .Lnext_pgd
@@ -368,10 +420,16 @@ alternative_else_nop_endif
/* PMD */
.if CONFIG_PGTABLE_LEVELS > 2
pmd .req x13
+ cmp ptrs_per_pmd, #1
+ b.eq .Lfold_pmd
pte_to_phys cur_pmdp, pud
kpti_map_pgtbl pmd, 2
- kpti_mk_tbl_ng pmd, PTRS_PER_PMD
+ kpti_mk_tbl_ng pmd, ptrs_per_pmd
b .Lnext_pud
+.Lfold_pmd:
+ mov pmd, pud // fold to next level
+ mov cur_pmdp, end_pmdp // must be equal to terminate loop
+ b .Lderef_pmd
.else /* CONFIG_PGTABLE_LEVELS <= 2 */
pmd .req pgd
.set .Lnext_pmd, .Lnext_pgd
@@ -381,7 +439,7 @@ alternative_else_nop_endif
/* PTE */
pte_to_phys cur_ptep, pmd
kpti_map_pgtbl pte, 3
- kpti_mk_tbl_ng pte, PTRS_PER_PTE
+ kpti_mk_tbl_ng pte, ptrs_per_pte
b .Lnext_pmd
.unreq cpu
@@ -408,6 +466,9 @@ alternative_else_nop_endif
/* Secondary CPUs end up here */
__idmap_kpti_secondary:
+ mrs swapper_ttb, ttbr1_el1
+ adr_l flag_ptr, __idmap_kpti_flag
+
/* Uninstall swapper before surgery begins */
__idmap_cpu_set_reserved_ttbr1 x16, x17
--
2.43.0
* [RFC PATCH v1 55/57] arm64: TRAMP_VALIAS is no longer compile-time constant
From: Ryan Roberts @ 2024-10-14 10:59 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm
When boot-time page size is in operation, TRAMP_VALIAS is no longer a
compile-time constant, because the VA of a fixmap slot depends upon
PAGE_SIZE.
Let's handle this by instead exporting the slot index,
FIX_ENTRY_TRAMP_BEGIN, to assembly, then doing the TRAMP_VALIAS
calculation per page size and using alternatives to decide which variant
to activate.
Note that for the tramp_map_kernel case, we are one instruction short of
space in the vector to have NOPs for all 3 page size variants. So we do
if/else for 16K/64K and branch around it for the 4K case. This saves 2
instructions.
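
To illustrate why the alias can no longer be a single constant, here is a small
C sketch of the per-page-size calculation performed by the new
TRAMP_VALIAS(page_shift) macro in entry.S. The function name tramp_valias() is
hypothetical, and the FIXADDR_TOP value and slot index are passed as parameters
because their real values are build-specific; any numbers a caller supplies are
illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define ARM64_PAGE_SHIFT_4K	12
#define ARM64_PAGE_SHIFT_16K	14
#define ARM64_PAGE_SHIFT_64K	16

/*
 * Same formula as the TRAMP_VALIAS(page_shift) macro in the patch:
 *   FIXADDR_TOP - (FIX_ENTRY_TRAMP_BEGIN << page_shift)
 * fixaddr_top and slot_idx stand in for FIXADDR_TOP and
 * FIX_ENTRY_TRAMP_BEGIN respectively.
 */
static uint64_t tramp_valias(uint64_t fixaddr_top, uint64_t slot_idx,
			     unsigned int page_shift)
{
	assert(page_shift >= ARM64_PAGE_SHIFT_4K &&
	       page_shift <= ARM64_PAGE_SHIFT_64K);
	return fixaddr_top - (slot_idx << page_shift);
}
```

For any non-zero slot index the three page shifts give three different
addresses, which is why the patch emits one movz/movk sequence per page size
and lets the alternatives framework activate exactly one of them.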
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/kernel/asm-offsets.c | 2 +-
arch/arm64/kernel/entry.S | 50 ++++++++++++++++++++++++++-------
2 files changed, 41 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index f32b8d7f00b2a..c45fa3e281884 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -172,7 +172,7 @@ int main(void)
DEFINE(ARM64_FTR_SYSVAL, offsetof(struct arm64_ftr_reg, sys_val));
BLANK();
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
- DEFINE(TRAMP_VALIAS, TRAMP_VALIAS);
+ DEFINE(FIX_ENTRY_TRAMP_BEGIN, FIX_ENTRY_TRAMP_BEGIN);
#endif
#ifdef CONFIG_ARM_SDE_INTERFACE
DEFINE(SDEI_EVENT_INTREGS, offsetof(struct sdei_registered_event, interrupted_regs));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 7ef0e127b149f..ba47dc8672c04 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -101,11 +101,27 @@
.org .Lventry_start\@ + 128 // Did we overflow the ventry slot?
.endm
+#define TRAMP_VALIAS(page_shift) (FIXADDR_TOP - (FIX_ENTRY_TRAMP_BEGIN << (page_shift)))
+
.macro tramp_alias, dst, sym
- .set .Lalias\@, TRAMP_VALIAS + \sym - .entry.tramp.text
- movz \dst, :abs_g2_s:.Lalias\@
- movk \dst, :abs_g1_nc:.Lalias\@
- movk \dst, :abs_g0_nc:.Lalias\@
+alternative_if ARM64_USE_PAGE_SIZE_4K
+ .set .Lalias4k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) + \sym - .entry.tramp.text
+ movz \dst, :abs_g2_s:.Lalias4k\@
+ movk \dst, :abs_g1_nc:.Lalias4k\@
+ movk \dst, :abs_g0_nc:.Lalias4k\@
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_16K
+ .set .Lalias16k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) + \sym - .entry.tramp.text
+ movz \dst, :abs_g2_s:.Lalias16k\@
+ movk \dst, :abs_g1_nc:.Lalias16k\@
+ movk \dst, :abs_g0_nc:.Lalias16k\@
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_64K
+ .set .Lalias64k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) + \sym - .entry.tramp.text
+ movz \dst, :abs_g2_s:.Lalias64k\@
+ movk \dst, :abs_g1_nc:.Lalias64k\@
+ movk \dst, :abs_g0_nc:.Lalias64k\@
+alternative_else_nop_endif
.endm
/*
@@ -627,16 +643,30 @@ SYM_CODE_END(ret_to_user)
bic \tmp, \tmp, #USER_ASID_FLAG
msr ttbr1_el1, \tmp
#ifdef CONFIG_QCOM_FALKOR_ERRATUM_1003
-alternative_if ARM64_WORKAROUND_QCOM_FALKOR_E1003
+alternative_if_not ARM64_WORKAROUND_QCOM_FALKOR_E1003
+ b .Lskip_falkor_e1003\@
+alternative_else_nop_endif
/* ASID already in \tmp[63:48] */
- movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS >> 12)
- movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS >> 12)
- /* 2MB boundary containing the vectors, so we nobble the walk cache */
- movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS & ~(SZ_2M - 1)) >> 12)
+alternative_if ARM64_USE_PAGE_SIZE_4K
+ movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) >> 12)
+ movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) >> 12)
+ movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) & ~(SZ_2M - 1)) >> 12)
+ b .Lfinish_falkor_e1003\@
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_16K
+ movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) >> 12)
+ movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) >> 12)
+ movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) & ~(SZ_2M - 1)) >> 12)
+alternative_else /* ARM64_USE_PAGE_SIZE_64K */
+ movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) >> 12)
+ movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) >> 12)
+ movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) & ~(SZ_2M - 1)) >> 12)
+alternative_endif
+.Lfinish_falkor_e1003\@:
isb
tlbi vae1, \tmp
dsb nsh
-alternative_else_nop_endif
+.Lskip_falkor_e1003\@:
#endif /* CONFIG_QCOM_FALKOR_ERRATUM_1003 */
.endm
--
2.43.0
* [RFC PATCH v1 56/57] arm64: Determine THREAD_SIZE at boot-time
From: Ryan Roberts @ 2024-10-14 10:59 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon
Cc: Ryan Roberts, kvmarm, linux-arm-kernel, linux-efi, linux-kernel,
linux-mm
Since THREAD_SIZE depends on PAGE_SIZE when stacks are vmapped, we must
defer the decision on THREAD_SIZE until we have selected PAGE_SIZE at
boot.
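
As a rough illustration of that dependency, the following C sketch evaluates
the THREAD_SHIFT and THREAD_SIZE_ORDER expressions from the memory.h hunk for
each page size, for the CONFIG_VMAP_STACK case. It assumes
KASAN_THREAD_SHIFT is 0, so IDEAL_THREAD_SHIFT is 14; the function names are
hypothetical.

```c
#include <assert.h>

#define IDEAL_THREAD_SHIFT	14	/* assumption: KASAN_THREAD_SHIFT == 0 */

/* THREAD_SHIFT with VMAP'd stacks: max(IDEAL_THREAD_SHIFT, PAGE_SHIFT). */
static int thread_shift(int page_shift)
{
	assert(page_shift == 12 || page_shift == 14 || page_shift == 16);
	return IDEAL_THREAD_SHIFT < page_shift ? page_shift
					       : IDEAL_THREAD_SHIFT;
}

/* THREAD_SIZE_ORDER: log2 of the number of pages backing one stack. */
static int thread_size_order(int page_shift)
{
	int ts = thread_shift(page_shift);

	return page_shift < ts ? ts - page_shift : 0;
}
```

Under these assumptions, 4K and 16K pages both end up with a 16K stack (order
2 and order 0 respectively), while 64K pages force a 64K stack at order 0.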
The one wrinkle is entry.S's requirement to have THREAD_SHIFT as an
immediate in order to check that the stack has not overflowed without
clobbering any registers, early in the exception handler. Solve this by
patching alternatives. During early boot, all 3 options are NOPs until
the alternative is patched in. So we forgo overflow checking until
boot-cpu patching is complete.
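
The single-bit test works because VMAP'd stacks are aligned to 2 * THREAD_SIZE
(THREAD_ALIGN in memory.h): every address inside the stack has bit
THREAD_SHIFT clear, and stepping below the stack base sets it. A minimal C
sketch of the check that each patched-in tbnz performs; the function name
stack_overflowed() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical C equivalent of "tbnz x0, #THREAD_SHIFT_<size>, 0f":
 * returns true when bit thread_shift of sp is set. For a stack aligned
 * to 2 * THREAD_SIZE this means sp has dropped below the stack base.
 */
static bool stack_overflowed(uint64_t sp, unsigned int thread_shift)
{
	assert(thread_shift < 64);
	return (sp >> thread_shift) & 1;
}
```

Because the bit position is an instruction immediate, it cannot be read from a
variable without clobbering a register, hence one tbnz variant per page size.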
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/include/asm/assembler.h | 5 +++
arch/arm64/include/asm/efi.h | 2 +-
arch/arm64/include/asm/memory.h | 51 +++++++++++++++++++++++++-----
arch/arm64/kernel/efi.c | 2 +-
arch/arm64/kernel/entry.S | 10 +++++-
arch/arm64/kernel/head.S | 3 +-
arch/arm64/kernel/vmlinux.lds.S | 4 +--
arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 2 +-
8 files changed, 64 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 0cfa7c3efd214..745328e7768b7 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -928,4 +928,9 @@ alternative_cb_end
get_tg0 \val
value_for_page_size \val, \val, (~(SZ_4K-1)), (~(SZ_16K-1)), (~(SZ_64K-1))
.endm
+
+ .macro get_task_size, val
+ get_tg0 \val
+ value_for_page_size \val, \val, (1 << THREAD_SHIFT_4K), (1 << THREAD_SHIFT_16K), (1 << THREAD_SHIFT_64K)
+ .endm
#endif /* __ASM_ASSEMBLER_H */
diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h
index bcd5622aa0968..913f599c14e40 100644
--- a/arch/arm64/include/asm/efi.h
+++ b/arch/arm64/include/asm/efi.h
@@ -68,7 +68,7 @@ void arch_efi_call_virt_teardown(void);
* kernel need greater alignment than we require the segments to be padded to.
*/
#define EFI_KIMG_ALIGN \
- (SEGMENT_ALIGN > THREAD_ALIGN ? SEGMENT_ALIGN : THREAD_ALIGN)
+ (SEGMENT_ALIGN > THREAD_ALIGN_MAX ? SEGMENT_ALIGN : THREAD_ALIGN_MAX)
/*
* On arm64, we have to ensure that the initrd ends up in the linear region,
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 5393a859183f7..e28f5700ef022 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -110,23 +110,56 @@
#define PAGE_END (_PAGE_END(VA_BITS_MIN))
#endif /* CONFIG_KASAN */
-#define MIN_THREAD_SHIFT (14 + KASAN_THREAD_SHIFT)
+#define IDEAL_THREAD_SHIFT (14 + KASAN_THREAD_SHIFT)
/*
* VMAP'd stacks are allocated at page granularity, so we must ensure that such
* stacks are a multiple of page size.
*/
-#if defined(CONFIG_VMAP_STACK) && (MIN_THREAD_SHIFT < PAGE_SHIFT)
-#define THREAD_SHIFT PAGE_SHIFT
+
+#if defined(CONFIG_VMAP_STACK)
+#define THREAD_SHIFT \
+ (IDEAL_THREAD_SHIFT < PAGE_SHIFT ? PAGE_SHIFT : IDEAL_THREAD_SHIFT)
+#if (IDEAL_THREAD_SHIFT < PAGE_SHIFT_MIN)
+#define THREAD_SHIFT_MIN PAGE_SHIFT_MIN
#else
-#define THREAD_SHIFT MIN_THREAD_SHIFT
+#define THREAD_SHIFT_MIN IDEAL_THREAD_SHIFT
#endif
-
-#if THREAD_SHIFT >= PAGE_SHIFT
-#define THREAD_SIZE_ORDER (THREAD_SHIFT - PAGE_SHIFT)
+#if (IDEAL_THREAD_SHIFT < PAGE_SHIFT_MAX)
+#define THREAD_SHIFT_MAX PAGE_SHIFT_MAX
+#else
+#define THREAD_SHIFT_MAX IDEAL_THREAD_SHIFT
+#endif
+#if (IDEAL_THREAD_SHIFT < ARM64_PAGE_SHIFT_4K)
+#define THREAD_SHIFT_4K ARM64_PAGE_SHIFT_4K
+#else
+#define THREAD_SHIFT_4K IDEAL_THREAD_SHIFT
+#endif
+#if (IDEAL_THREAD_SHIFT < ARM64_PAGE_SHIFT_16K)
+#define THREAD_SHIFT_16K ARM64_PAGE_SHIFT_16K
+#else
+#define THREAD_SHIFT_16K IDEAL_THREAD_SHIFT
+#endif
+#if (IDEAL_THREAD_SHIFT < ARM64_PAGE_SHIFT_64K)
+#define THREAD_SHIFT_64K ARM64_PAGE_SHIFT_64K
+#else
+#define THREAD_SHIFT_64K IDEAL_THREAD_SHIFT
#endif
+#else
+#define THREAD_SHIFT IDEAL_THREAD_SHIFT
+#define THREAD_SHIFT_MIN IDEAL_THREAD_SHIFT
+#define THREAD_SHIFT_MAX IDEAL_THREAD_SHIFT
+#define THREAD_SHIFT_4K IDEAL_THREAD_SHIFT
+#define THREAD_SHIFT_16K IDEAL_THREAD_SHIFT
+#define THREAD_SHIFT_64K IDEAL_THREAD_SHIFT
+#endif
+
+#define THREAD_SIZE_ORDER \
+ (PAGE_SHIFT < THREAD_SHIFT ? THREAD_SHIFT - PAGE_SHIFT : 0)
#define THREAD_SIZE (UL(1) << THREAD_SHIFT)
+#define THREAD_SIZE_MIN (UL(1) << THREAD_SHIFT_MIN)
+#define THREAD_SIZE_MAX (UL(1) << THREAD_SHIFT_MAX)
/*
* By aligning VMAP'd stacks to 2 * THREAD_SIZE, we can detect overflow by
@@ -135,11 +168,13 @@
*/
#ifdef CONFIG_VMAP_STACK
#define THREAD_ALIGN (2 * THREAD_SIZE)
+#define THREAD_ALIGN_MAX (2 * THREAD_SIZE_MAX)
#else
#define THREAD_ALIGN THREAD_SIZE
+#define THREAD_ALIGN_MAX THREAD_SIZE_MAX
#endif
-#define IRQ_STACK_SIZE THREAD_SIZE
+#define IRQ_STACK_SIZE THREAD_SIZE_MIN
#define OVERFLOW_STACK_SIZE SZ_4K
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 712718aed5dd9..ebc44b7e83199 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -197,7 +197,7 @@ bool efi_runtime_fixup_exception(struct pt_regs *regs, const char *msg)
}
/* EFI requires 8 KiB of stack space for runtime services */
-static_assert(THREAD_SIZE >= SZ_8K);
+static_assert(THREAD_SIZE_MIN >= SZ_8K);
static int __init arm64_efi_rt_init(void)
{
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index ba47dc8672c04..1ab65e406b62e 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -62,7 +62,15 @@
*/
add sp, sp, x0 // sp' = sp + x0
sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp
- tbnz x0, #THREAD_SHIFT, 0f
+alternative_if ARM64_USE_PAGE_SIZE_4K
+ tbnz x0, #THREAD_SHIFT_4K, 0f
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_16K
+ tbnz x0, #THREAD_SHIFT_16K, 0f
+alternative_else_nop_endif
+alternative_if ARM64_USE_PAGE_SIZE_64K
+ tbnz x0, #THREAD_SHIFT_64K, 0f
+alternative_else_nop_endif
sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
b el\el\ht\()_\regsize\()_\label
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 761b7f5633e15..2530ee5cee548 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -198,7 +198,8 @@ SYM_CODE_END(preserve_boot_args)
msr sp_el0, \tsk
ldr \tmp1, [\tsk, #TSK_STACK]
- add sp, \tmp1, #THREAD_SIZE
+ get_task_size \tmp2
+ add sp, \tmp1, \tmp2
sub sp, sp, #PT_REGS_SIZE
stp xzr, xzr, [sp, #S_STACKFRAME]
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 09fcc234c0f77..937900a458a89 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -60,11 +60,11 @@
#define RO_EXCEPTION_TABLE_ALIGN 4
#define RUNTIME_DISCARD_EXIT
+#include <asm/memory.h>
#include <asm-generic/vmlinux.lds.h>
#include <asm/cache.h>
#include <asm/kernel-pgtable.h>
#include <asm/kexec.h>
-#include <asm/memory.h>
#include <asm/page.h>
#include "image.h"
@@ -292,7 +292,7 @@ SECTIONS
_data = .;
_sdata = .;
- RW_DATA(L1_CACHE_BYTES, PAGE_SIZE_MAX, THREAD_ALIGN)
+ RW_DATA(L1_CACHE_BYTES, PAGE_SIZE_MAX, THREAD_ALIGN_MAX)
/*
* Data written with the MMU off but read with the MMU on requires
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
index 74c7c21626270..fe1fbfa8f8f05 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
@@ -7,9 +7,9 @@
*/
#include <asm/hyp_image.h>
+#include <asm/memory.h>
#include <asm-generic/vmlinux.lds.h>
#include <asm/cache.h>
-#include <asm/memory.h>
SECTIONS {
HYP_SECTION(.idmap.text)
--
2.43.0
* [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
From: Ryan Roberts @ 2024-10-14 10:59 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon
Cc: Ryan Roberts, kvmarm, linux-arm-kernel, linux-efi, linux-kernel,
linux-mm
Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
selected instead of a page size. When selected, the resulting kernel's
page size can be configured at boot via the command line.
For now, boot-time page size kernels are limited to 48-bit VA, since
more work is required to support LPA2. Additionally, MMAP_RND_BITS and
SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
work could make these configurable at boot time to get optimal
page-size-specific values.
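
For orientation only, here is a hypothetical C sketch of how the value of the
"arm64.pagesize=" option (4k, 16k or 64k, per the Kconfig help text) could map
to a page shift. It is not the parsing code from this patch; the function name
pagesize_opt_to_shift() and the -1 return for unrecognised values are
assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define ARM64_PAGE_SHIFT_4K	12
#define ARM64_PAGE_SHIFT_16K	14
#define ARM64_PAGE_SHIFT_64K	16

/*
 * Hypothetical helper: map the value following "arm64.pagesize=" to a
 * page shift. Returns -1 for anything other than the three documented
 * values (an assumption of this sketch, not kernel behaviour).
 */
static int pagesize_opt_to_shift(const char *val)
{
	assert(val != NULL);

	if (!strcmp(val, "4k"))
		return ARM64_PAGE_SHIFT_4K;
	if (!strcmp(val, "16k"))
		return ARM64_PAGE_SHIFT_16K;
	if (!strcmp(val, "64k"))
		return ARM64_PAGE_SHIFT_64K;
	return -1;
}
```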
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
***NOTE***
Any confused maintainers may want to read the cover note here for context:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
arch/arm64/Kconfig | 26 ++++++++++---
arch/arm64/include/asm/kvm_hyp.h | 11 ++++++
arch/arm64/include/asm/pgtable-geometry.h | 22 ++++++++++-
arch/arm64/include/asm/pgtable-hwdef.h | 6 +--
arch/arm64/include/asm/pgtable.h | 10 ++++-
arch/arm64/include/asm/sparsemem.h | 4 ++
arch/arm64/kernel/image-vars.h | 11 ++++++
arch/arm64/kernel/image.h | 4 ++
arch/arm64/kernel/pi/map_kernel.c | 45 ++++++++++++++++++++++
arch/arm64/kvm/arm.c | 10 +++++
arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++++++++
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/pgd.c | 10 +++--
arch/arm64/mm/pgtable-geometry.c | 24 ++++++++++++
drivers/firmware/efi/libstub/arm64.c | 3 +-
16 files changed, 187 insertions(+), 17 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
create mode 100644 arch/arm64/mm/pgtable-geometry.c
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca67..573d308741169 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -121,6 +121,7 @@ config ARM64
select BUILDTIME_TABLE_SORT
select CLONE_BACKWARDS
select COMMON_CLK
+ select CONSTRUCTORS if ARM64_BOOT_TIME_PAGE_SIZE
select CPU_PM if (SUSPEND || CPU_IDLE)
select CPUMASK_OFFSTACK if NR_CPUS > 256
select CRC32
@@ -284,18 +285,20 @@ config MMU
config ARM64_CONT_PTE_SHIFT
int
+ depends on !ARM64_BOOT_TIME_PAGE_SIZE
default 5 if PAGE_SIZE_64KB
default 7 if PAGE_SIZE_16KB
default 4
config ARM64_CONT_PMD_SHIFT
int
+ depends on !ARM64_BOOT_TIME_PAGE_SIZE
default 5 if PAGE_SIZE_64KB
default 5 if PAGE_SIZE_16KB
default 4
config ARCH_MMAP_RND_BITS_MIN
- default 14 if PAGE_SIZE_64KB
+ default 14 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
default 16 if PAGE_SIZE_16KB
default 18
@@ -306,15 +309,15 @@ config ARCH_MMAP_RND_BITS_MAX
default 24 if ARM64_VA_BITS=39
default 27 if ARM64_VA_BITS=42
default 30 if ARM64_VA_BITS=47
- default 29 if ARM64_VA_BITS=48 && ARM64_64K_PAGES
+ default 29 if ARM64_VA_BITS=48 && (ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE)
default 31 if ARM64_VA_BITS=48 && ARM64_16K_PAGES
default 33 if ARM64_VA_BITS=48
- default 14 if ARM64_64K_PAGES
+ default 14 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
default 16 if ARM64_16K_PAGES
default 18
config ARCH_MMAP_RND_COMPAT_BITS_MIN
- default 7 if ARM64_64K_PAGES
+ default 7 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
default 9 if ARM64_16K_PAGES
default 11
@@ -362,6 +365,7 @@ config FIX_EARLYCON_MEM
config PGTABLE_LEVELS
int
+ default 4 if ARM64_BOOT_TIME_PAGE_SIZE # Advertise max supported levels
default 2 if ARM64_16K_PAGES && ARM64_VA_BITS_36
default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
default 3 if ARM64_64K_PAGES && (ARM64_VA_BITS_48 || ARM64_VA_BITS_52)
@@ -1316,6 +1320,14 @@ config ARM64_64K_PAGES
look-up. AArch32 emulation requires applications compiled
with 64K aligned segments.
+config ARM64_BOOT_TIME_PAGE_SIZE
+ bool "Boot-time selection"
+ select HAVE_PAGE_SIZE_64KB # Advertise largest page size to core
+ help
+ Select desired page size (4KB, 16KB or 64KB) at boot-time via the
+ kernel command line option "arm64.pagesize=4k", "arm64.pagesize=16k"
+ or "arm64.pagesize=64k".
+
endchoice
choice
@@ -1348,6 +1360,7 @@ config ARM64_VA_BITS_48
config ARM64_VA_BITS_52
bool "52-bit"
depends on ARM64_PAN || !ARM64_SW_TTBR0_PAN
+ depends on !ARM64_BOOT_TIME_PAGE_SIZE
help
Enable 52-bit virtual addressing for userspace when explicitly
requested via a hint to mmap(). The kernel will also use 52-bit
@@ -1588,9 +1601,10 @@ config XEN
# 4K | 27 | 12 | 15 | 10 |
# 16K | 27 | 14 | 13 | 11 |
# 64K | 29 | 16 | 13 | 13 |
+# BOOT| 29 | 16 (max) | 13 | 13 |
config ARCH_FORCE_MAX_ORDER
int
- default "13" if ARM64_64K_PAGES
+ default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
default "11" if ARM64_16K_PAGES
default "10"
help
@@ -1663,7 +1677,7 @@ config ARM64_TAGGED_ADDR_ABI
menuconfig COMPAT
bool "Kernel support for 32-bit EL0"
- depends on ARM64_4K_PAGES || EXPERT
+ depends on ARM64_4K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE || EXPERT
select HAVE_UID16
select OLD_SIGSUSPEND3
select COMPAT_OLD_SIGACTION
diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index c838309e4ec47..9397a14642afa 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -145,4 +145,15 @@ extern unsigned long kvm_nvhe_sym(__icache_flags);
extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits);
extern unsigned int kvm_nvhe_sym(kvm_host_sve_max_vl);
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+extern int kvm_nvhe_sym(ptg_page_shift);
+extern int kvm_nvhe_sym(ptg_pmd_shift);
+extern int kvm_nvhe_sym(ptg_pud_shift);
+extern int kvm_nvhe_sym(ptg_p4d_shift);
+extern int kvm_nvhe_sym(ptg_pgdir_shift);
+extern int kvm_nvhe_sym(ptg_cont_pte_shift);
+extern int kvm_nvhe_sym(ptg_cont_pmd_shift);
+extern int kvm_nvhe_sym(ptg_pgtable_levels);
+#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
+
#endif /* __ARM64_KVM_HYP_H__ */
diff --git a/arch/arm64/include/asm/pgtable-geometry.h b/arch/arm64/include/asm/pgtable-geometry.h
index 62fe125909c08..18a8c8d499ecc 100644
--- a/arch/arm64/include/asm/pgtable-geometry.h
+++ b/arch/arm64/include/asm/pgtable-geometry.h
@@ -6,16 +6,33 @@
#define ARM64_PAGE_SHIFT_16K 14
#define ARM64_PAGE_SHIFT_64K 16
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+#define PAGE_SHIFT_MIN ARM64_PAGE_SHIFT_4K
+#define PAGE_SHIFT_MAX ARM64_PAGE_SHIFT_64K
+#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
#define PAGE_SHIFT_MIN CONFIG_PAGE_SHIFT
+#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
+#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
+
#define PAGE_SIZE_MIN (_AC(1, UL) << PAGE_SHIFT_MIN)
#define PAGE_MASK_MIN (~(PAGE_SIZE_MIN-1))
-
-#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
#define PAGE_SIZE_MAX (_AC(1, UL) << PAGE_SHIFT_MAX)
#define PAGE_MASK_MAX (~(PAGE_SIZE_MAX-1))
#include <asm-generic/pgtable-geometry.h>
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+#ifndef __ASSEMBLY__
+extern int ptg_page_shift;
+extern int ptg_pmd_shift;
+extern int ptg_pud_shift;
+extern int ptg_p4d_shift;
+extern int ptg_pgdir_shift;
+extern int ptg_cont_pte_shift;
+extern int ptg_cont_pmd_shift;
+extern int ptg_pgtable_levels;
+#endif /* __ASSEMBLY__ */
+#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
#define ptg_page_shift CONFIG_PAGE_SHIFT
#define ptg_pmd_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
#define ptg_pud_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
@@ -24,5 +41,6 @@
#define ptg_cont_pte_shift (CONFIG_ARM64_CONT_PTE_SHIFT + PAGE_SHIFT)
#define ptg_cont_pmd_shift (CONFIG_ARM64_CONT_PMD_SHIFT + PMD_SHIFT)
#define ptg_pgtable_levels CONFIG_PGTABLE_LEVELS
+#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
#endif /* ASM_PGTABLE_GEOMETRY_H */
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index ca8bcbc1fe220..da5404617acbf 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -52,7 +52,7 @@
#define PMD_SHIFT ptg_pmd_shift
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE-1))
-#define PTRS_PER_PMD (1 << (PAGE_SHIFT - 3))
+#define PTRS_PER_PMD (ptg_pgtable_levels > 2 ? (1 << (PAGE_SHIFT - 3)) : 1)
#define MAX_PTRS_PER_PMD (1 << (PAGE_SHIFT_MAX - 3))
#endif
@@ -63,7 +63,7 @@
#define PUD_SHIFT ptg_pud_shift
#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE-1))
-#define PTRS_PER_PUD (1 << (PAGE_SHIFT - 3))
+#define PTRS_PER_PUD (ptg_pgtable_levels > 3 ? (1 << (PAGE_SHIFT - 3)) : 1)
#define MAX_PTRS_PER_PUD (1 << (PAGE_SHIFT_MAX - 3))
#endif
@@ -71,7 +71,7 @@
#define P4D_SHIFT ptg_p4d_shift
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE-1))
-#define PTRS_PER_P4D (1 << (PAGE_SHIFT - 3))
+#define PTRS_PER_P4D (ptg_pgtable_levels > 4 ? (1 << (PAGE_SHIFT - 3)) : 1)
#define MAX_PTRS_PER_P4D (1 << (PAGE_SHIFT_MAX - 3))
#endif
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8ead41da715b0..ad9f75f5cc29a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -755,7 +755,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
static __always_inline bool pgtable_l3_enabled(void)
{
- return true;
+ return ptg_pgtable_levels > 2;
}
static inline bool mm_pmd_folded(const struct mm_struct *mm)
@@ -888,6 +888,8 @@ static inline bool pgtable_l3_enabled(void) { return false; }
static __always_inline bool pgtable_l4_enabled(void)
{
+ if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
+ return ptg_pgtable_levels > 3;
if (CONFIG_PGTABLE_LEVELS > 4 || !IS_ENABLED(CONFIG_ARM64_LPA2))
return true;
if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
@@ -935,6 +937,8 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
{
+ if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
+ return (pud_t *)p4dp;
return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
}
@@ -1014,6 +1018,8 @@ static inline bool pgtable_l4_enabled(void) { return false; }
static __always_inline bool pgtable_l5_enabled(void)
{
+ if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
+ return ptg_pgtable_levels > 4;
if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
return vabits_actual == VA_BITS;
return alternative_has_cap_unlikely(ARM64_HAS_VA52);
@@ -1059,6 +1065,8 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
static inline p4d_t *pgd_to_folded_p4d(pgd_t *pgdp, unsigned long addr)
{
+ if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
+ return (p4d_t *)pgdp;
return (p4d_t *)PTR_ALIGN_DOWN(pgdp, PAGE_SIZE) + p4d_index(addr);
}
diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index a05fdd54014f7..2daf1263ba638 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -17,6 +17,10 @@
* entries could not be created for vmemmap mappings.
* 16K follows 4K for simplicity.
*/
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+#define SECTION_SIZE_BITS 29
+#else
#define SECTION_SIZE_BITS (PAGE_SIZE == SZ_64K ? 29 : 27)
+#endif
#endif
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index a168f3337446f..9968320f83bc4 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -36,6 +36,17 @@ PROVIDE(__pi___memcpy = __pi_memcpy);
PROVIDE(__pi___memmove = __pi_memmove);
PROVIDE(__pi___memset = __pi_memset);
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+PROVIDE(__pi_ptg_page_shift = ptg_page_shift);
+PROVIDE(__pi_ptg_pmd_shift = ptg_pmd_shift);
+PROVIDE(__pi_ptg_pud_shift = ptg_pud_shift);
+PROVIDE(__pi_ptg_p4d_shift = ptg_p4d_shift);
+PROVIDE(__pi_ptg_pgdir_shift = ptg_pgdir_shift);
+PROVIDE(__pi_ptg_cont_pte_shift = ptg_cont_pte_shift);
+PROVIDE(__pi_ptg_cont_pmd_shift = ptg_cont_pmd_shift);
+PROVIDE(__pi_ptg_pgtable_levels = ptg_pgtable_levels);
+#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
+
PROVIDE(__pi_id_aa64isar1_override = id_aa64isar1_override);
PROVIDE(__pi_id_aa64isar2_override = id_aa64isar2_override);
PROVIDE(__pi_id_aa64mmfr0_override = id_aa64mmfr0_override);
diff --git a/arch/arm64/kernel/image.h b/arch/arm64/kernel/image.h
index 7bc3ba8979019..01502fc3b891b 100644
--- a/arch/arm64/kernel/image.h
+++ b/arch/arm64/kernel/image.h
@@ -47,7 +47,11 @@
#define __HEAD_FLAG_BE ARM64_IMAGE_FLAG_LE
#endif
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+#define __HEAD_FLAG_PAGE_SIZE 0
+#else
#define __HEAD_FLAG_PAGE_SIZE ((PAGE_SHIFT - 10) / 2)
+#endif
#define __HEAD_FLAG_PHYS_BASE 1
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index deb8cd50b0b0c..22b3c70e04f9c 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -221,6 +221,49 @@ static void __init map_fdt(u64 fdt, int page_shift)
dsb(ishst);
}
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+static void __init ptg_init(int page_shift)
+{
+ ptg_pgtable_levels =
+ __ARM64_HW_PGTABLE_LEVELS(page_shift, CONFIG_ARM64_VA_BITS);
+
+ ptg_pgdir_shift = __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift,
+ 4 - ptg_pgtable_levels);
+
+ ptg_p4d_shift = ptg_pgtable_levels >= 5 ?
+ __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 0) :
+ ptg_pgdir_shift;
+
+ ptg_pud_shift = ptg_pgtable_levels >= 4 ?
+ __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 1) :
+ ptg_pgdir_shift;
+
+ ptg_pmd_shift = ptg_pgtable_levels >= 3 ?
+ __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 2) :
+ ptg_pgdir_shift;
+
+ ptg_page_shift = page_shift;
+
+ switch (page_shift) {
+ case ARM64_PAGE_SHIFT_64K:
+ ptg_cont_pte_shift = ptg_page_shift + 5;
+ ptg_cont_pmd_shift = ptg_pmd_shift + 5;
+ break;
+ case ARM64_PAGE_SHIFT_16K:
+ ptg_cont_pte_shift = ptg_page_shift + 7;
+ ptg_cont_pmd_shift = ptg_pmd_shift + 5;
+ break;
+ default: /* ARM64_PAGE_SHIFT_4K */
+ ptg_cont_pte_shift = ptg_page_shift + 4;
+ ptg_cont_pmd_shift = ptg_pmd_shift + 4;
+ }
+}
+#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
+static inline void ptg_init(int page_shift)
+{
+}
+#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
+
asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
{
static char const chosen_str[] __initconst = "/chosen";
@@ -247,6 +290,8 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
if (!page_shift)
page_shift = early_page_shift;
+ ptg_init(page_shift);
+
if (va_bits > 48) {
u64 page_size = early_page_size(page_shift);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9bef7638342ef..c835a50b8b768 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2424,6 +2424,16 @@ static void kvm_hyp_init_symbols(void)
kvm_nvhe_sym(id_aa64smfr0_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64SMFR0_EL1);
kvm_nvhe_sym(__icache_flags) = __icache_flags;
kvm_nvhe_sym(kvm_arm_vmid_bits) = kvm_arm_vmid_bits;
+#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
+ kvm_nvhe_sym(ptg_page_shift) = ptg_page_shift;
+ kvm_nvhe_sym(ptg_pmd_shift) = ptg_pmd_shift;
+ kvm_nvhe_sym(ptg_pud_shift) = ptg_pud_shift;
+ kvm_nvhe_sym(ptg_p4d_shift) = ptg_p4d_shift;
+ kvm_nvhe_sym(ptg_pgdir_shift) = ptg_pgdir_shift;
+ kvm_nvhe_sym(ptg_cont_pte_shift) = ptg_cont_pte_shift;
+ kvm_nvhe_sym(ptg_cont_pmd_shift) = ptg_cont_pmd_shift;
+ kvm_nvhe_sym(ptg_pgtable_levels) = ptg_pgtable_levels;
+#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
}
static int __init kvm_hyp_init_protection(u32 hyp_va_bits)
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index b43426a493df5..a8fcbb84c7996 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -27,6 +27,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o
cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
+hyp-obj-$(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) += pgtable-geometry.o
hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
hyp-obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c b/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
new file mode 100644
index 0000000000000..17f807450a31a
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 ARM Ltd.
+ */
+
+#include <linux/cache.h>
+#include <asm/pgtable-geometry.h>
+
+int ptg_page_shift __ro_after_init;
+int ptg_pmd_shift __ro_after_init;
+int ptg_pud_shift __ro_after_init;
+int ptg_p4d_shift __ro_after_init;
+int ptg_pgdir_shift __ro_after_init;
+int ptg_cont_pte_shift __ro_after_init;
+int ptg_cont_pmd_shift __ro_after_init;
+int ptg_pgtable_levels __ro_after_init;
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 60454256945b8..2ba30d06b35fe 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
cache.o copypage.o flush.o \
ioremap.o mmap.o pgd.o mmu.o \
context.o proc.o pageattr.o fixmap.o
+obj-$(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) += pgtable-geometry.o
obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 4b106510358b1..c052d0dcb0c69 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -21,10 +21,12 @@ static bool pgdir_is_page_size(void)
{
if (PGD_SIZE == PAGE_SIZE)
return true;
- if (CONFIG_PGTABLE_LEVELS == 4)
- return !pgtable_l4_enabled();
- if (CONFIG_PGTABLE_LEVELS == 5)
- return !pgtable_l5_enabled();
+ if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE)) {
+ if (CONFIG_PGTABLE_LEVELS == 4)
+ return !pgtable_l4_enabled();
+ if (CONFIG_PGTABLE_LEVELS == 5)
+ return !pgtable_l5_enabled();
+ }
return false;
}
diff --git a/arch/arm64/mm/pgtable-geometry.c b/arch/arm64/mm/pgtable-geometry.c
new file mode 100644
index 0000000000000..ba50637f1e9d0
--- /dev/null
+++ b/arch/arm64/mm/pgtable-geometry.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 ARM Ltd.
+ */
+
+#include <linux/cache.h>
+#include <asm/pgtable-geometry.h>
+
+/*
+ * TODO: These should be __ro_after_init, but we need to write to them from the
+ * pi code where they are mapped in the early page table as read-only.
+ * __ro_after_init doesn't become writable until later when the swapper pgtable
+ * is fully set up. We should update the early page table to map __ro_after_init
+ * as read-write.
+ */
+
+int ptg_page_shift __read_mostly;
+int ptg_pmd_shift __read_mostly;
+int ptg_pud_shift __read_mostly;
+int ptg_p4d_shift __read_mostly;
+int ptg_pgdir_shift __read_mostly;
+int ptg_cont_pte_shift __read_mostly;
+int ptg_cont_pmd_shift __read_mostly;
+int ptg_pgtable_levels __read_mostly;
diff --git a/drivers/firmware/efi/libstub/arm64.c b/drivers/firmware/efi/libstub/arm64.c
index e57cd3de0a00f..8db9dba7d5423 100644
--- a/drivers/firmware/efi/libstub/arm64.c
+++ b/drivers/firmware/efi/libstub/arm64.c
@@ -68,7 +68,8 @@ efi_status_t check_platform_features(void)
efi_novamap = true;
/* UEFI mandates support for 4 KB granularity, no need to check */
- if (IS_ENABLED(CONFIG_ARM64_4K_PAGES))
+ if (IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
+ IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
return EFI_SUCCESS;
tg = (read_cpuid(ID_AA64MMFR0_EL1) >> ID_AA64MMFR0_EL1_TGRAN_SHIFT) & 0xf;
--
2.43.0
* Re: [RFC PATCH v1 55/57] arm64: TRAMP_VALIAS is no longer compile-time constant
2024-10-14 10:59 ` [RFC PATCH v1 55/57] arm64: TRAMP_VALIAS is no longer compile-time constant Ryan Roberts
@ 2024-10-14 11:21 ` Ard Biesheuvel
2024-10-14 11:28 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Ard Biesheuvel @ 2024-10-14 11:21 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
Hi Ryan,
On Mon, 14 Oct 2024 at 13:02, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> When boot-time page size is in operation, TRAMP_VALIAS is no longer a
> compile-time constant, because the VA of a fixmap slot depends upon
> PAGE_SIZE.
>
> Let's handle this by instead exporting the slot index,
> FIX_ENTRY_TRAMP_BEGIN, to assembly, then do the TRAMP_VALIAS calculation
> per page size and use alternatives to decide which variant to activate.
>
> Note that for the tramp_map_kernel case, we are one instruction short of
> space in the vector to have NOPs for all 3 page size variants. So we do
> if/else for 16K/64K and branch around it for the 4K case. This saves 2
> instructions.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> arch/arm64/kernel/asm-offsets.c | 2 +-
> arch/arm64/kernel/entry.S | 50 ++++++++++++++++++++++++++-------
> 2 files changed, 41 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index f32b8d7f00b2a..c45fa3e281884 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -172,7 +172,7 @@ int main(void)
> DEFINE(ARM64_FTR_SYSVAL, offsetof(struct arm64_ftr_reg, sys_val));
> BLANK();
> #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
> - DEFINE(TRAMP_VALIAS, TRAMP_VALIAS);
> + DEFINE(FIX_ENTRY_TRAMP_BEGIN, FIX_ENTRY_TRAMP_BEGIN);
> #endif
> #ifdef CONFIG_ARM_SDE_INTERFACE
> DEFINE(SDEI_EVENT_INTREGS, offsetof(struct sdei_registered_event, interrupted_regs));
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 7ef0e127b149f..ba47dc8672c04 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -101,11 +101,27 @@
> .org .Lventry_start\@ + 128 // Did we overflow the ventry slot?
> .endm
>
> +#define TRAMP_VALIAS(page_shift) (FIXADDR_TOP - (FIX_ENTRY_TRAMP_BEGIN << (page_shift)))
> +
> .macro tramp_alias, dst, sym
> - .set .Lalias\@, TRAMP_VALIAS + \sym - .entry.tramp.text
> - movz \dst, :abs_g2_s:.Lalias\@
> - movk \dst, :abs_g1_nc:.Lalias\@
> - movk \dst, :abs_g0_nc:.Lalias\@
> +alternative_if ARM64_USE_PAGE_SIZE_4K
> + .set .Lalias4k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) + \sym - .entry.tramp.text
> + movz \dst, :abs_g2_s:.Lalias4k\@
> + movk \dst, :abs_g1_nc:.Lalias4k\@
> + movk \dst, :abs_g0_nc:.Lalias4k\@
> +alternative_else_nop_endif
> +alternative_if ARM64_USE_PAGE_SIZE_16K
> + .set .Lalias16k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) + \sym - .entry.tramp.text
> + movz \dst, :abs_g2_s:.Lalias16k\@
> + movk \dst, :abs_g1_nc:.Lalias16k\@
> + movk \dst, :abs_g0_nc:.Lalias16k\@
> +alternative_else_nop_endif
> +alternative_if ARM64_USE_PAGE_SIZE_64K
> + .set .Lalias64k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) + \sym - .entry.tramp.text
> + movz \dst, :abs_g2_s:.Lalias64k\@
> + movk \dst, :abs_g1_nc:.Lalias64k\@
> + movk \dst, :abs_g0_nc:.Lalias64k\@
> +alternative_else_nop_endif
Since you're changing these, might as well drop the middle movk as the
fixmap is now always in the top 2 GiB of the VA space.
However, wouldn't it be better to reuse the existing callback
alternative stuff that Marc added for KVM?
Same applies below, I reckon.
> .endm
>
> /*
> @@ -627,16 +643,30 @@ SYM_CODE_END(ret_to_user)
> bic \tmp, \tmp, #USER_ASID_FLAG
> msr ttbr1_el1, \tmp
> #ifdef CONFIG_QCOM_FALKOR_ERRATUM_1003
> -alternative_if ARM64_WORKAROUND_QCOM_FALKOR_E1003
> +alternative_if_not ARM64_WORKAROUND_QCOM_FALKOR_E1003
> + b .Lskip_falkor_e1003\@
> +alternative_else_nop_endif
> /* ASID already in \tmp[63:48] */
> - movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS >> 12)
> - movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS >> 12)
> - /* 2MB boundary containing the vectors, so we nobble the walk cache */
> - movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS & ~(SZ_2M - 1)) >> 12)
> +alternative_if ARM64_USE_PAGE_SIZE_4K
> + movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) >> 12)
> + movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) >> 12)
> + movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) & ~(SZ_2M - 1)) >> 12)
> + b .Lfinish_falkor_e1003\@
> +alternative_else_nop_endif
> +alternative_if ARM64_USE_PAGE_SIZE_16K
> + movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) >> 12)
> + movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) >> 12)
> + movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) & ~(SZ_2M - 1)) >> 12)
> +alternative_else /* ARM64_USE_PAGE_SIZE_64K */
> + movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) >> 12)
> + movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) >> 12)
> + movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) & ~(SZ_2M - 1)) >> 12)
> +alternative_endif
> +.Lfinish_falkor_e1003\@:
> isb
> tlbi vae1, \tmp
> dsb nsh
> -alternative_else_nop_endif
> +.Lskip_falkor_e1003\@:
> #endif /* CONFIG_QCOM_FALKOR_ERRATUM_1003 */
> .endm
>
> --
> 2.43.0
>
* Re: [RFC PATCH v1 55/57] arm64: TRAMP_VALIAS is no longer compile-time constant
2024-10-14 11:21 ` Ard Biesheuvel
@ 2024-10-14 11:28 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 11:28 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andrew Morton, Anshuman Khandual, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 14/10/2024 12:21, Ard Biesheuvel wrote:
> Hi Ryan,
>
> On Mon, 14 Oct 2024 at 13:02, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> When boot-time page size is in operation, TRAMP_VALIAS is no longer a
>> compile-time constant, because the VA of a fixmap slot depends upon
>> PAGE_SIZE.
>>
>> Let's handle this by instead exporting the slot index,
>> FIX_ENTRY_TRAMP_BEGIN, to assembly, then do the TRAMP_VALIAS calculation
>> per page size and use alternatives to decide which variant to activate.
>>
>> Note that for the tramp_map_kernel case, we are one instruction short of
>> space in the vector to have NOPs for all 3 page size variants. So we do
>> if/else for 16K/64K and branch around it for the 4K case. This saves 2
>> instructions.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> arch/arm64/kernel/asm-offsets.c | 2 +-
>> arch/arm64/kernel/entry.S | 50 ++++++++++++++++++++++++++-------
>> 2 files changed, 41 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
>> index f32b8d7f00b2a..c45fa3e281884 100644
>> --- a/arch/arm64/kernel/asm-offsets.c
>> +++ b/arch/arm64/kernel/asm-offsets.c
>> @@ -172,7 +172,7 @@ int main(void)
>> DEFINE(ARM64_FTR_SYSVAL, offsetof(struct arm64_ftr_reg, sys_val));
>> BLANK();
>> #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>> - DEFINE(TRAMP_VALIAS, TRAMP_VALIAS);
>> + DEFINE(FIX_ENTRY_TRAMP_BEGIN, FIX_ENTRY_TRAMP_BEGIN);
>> #endif
>> #ifdef CONFIG_ARM_SDE_INTERFACE
>> DEFINE(SDEI_EVENT_INTREGS, offsetof(struct sdei_registered_event, interrupted_regs));
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index 7ef0e127b149f..ba47dc8672c04 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -101,11 +101,27 @@
>> .org .Lventry_start\@ + 128 // Did we overflow the ventry slot?
>> .endm
>>
>> +#define TRAMP_VALIAS(page_shift) (FIXADDR_TOP - (FIX_ENTRY_TRAMP_BEGIN << (page_shift)))
>> +
>> .macro tramp_alias, dst, sym
>> - .set .Lalias\@, TRAMP_VALIAS + \sym - .entry.tramp.text
>> - movz \dst, :abs_g2_s:.Lalias\@
>> - movk \dst, :abs_g1_nc:.Lalias\@
>> - movk \dst, :abs_g0_nc:.Lalias\@
>> +alternative_if ARM64_USE_PAGE_SIZE_4K
>> + .set .Lalias4k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) + \sym - .entry.tramp.text
>> + movz \dst, :abs_g2_s:.Lalias4k\@
>> + movk \dst, :abs_g1_nc:.Lalias4k\@
>> + movk \dst, :abs_g0_nc:.Lalias4k\@
>> +alternative_else_nop_endif
>> +alternative_if ARM64_USE_PAGE_SIZE_16K
>> + .set .Lalias16k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) + \sym - .entry.tramp.text
>> + movz \dst, :abs_g2_s:.Lalias16k\@
>> + movk \dst, :abs_g1_nc:.Lalias16k\@
>> + movk \dst, :abs_g0_nc:.Lalias16k\@
>> +alternative_else_nop_endif
>> +alternative_if ARM64_USE_PAGE_SIZE_64K
>> + .set .Lalias64k\@, TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) + \sym - .entry.tramp.text
>> + movz \dst, :abs_g2_s:.Lalias64k\@
>> + movk \dst, :abs_g1_nc:.Lalias64k\@
>> + movk \dst, :abs_g0_nc:.Lalias64k\@
>> +alternative_else_nop_endif
>
> Since you're changing these, might as well drop the middle movk as the
> fixmap is now always in the top 2 GiB of the VA space.
>
> However, wouldn't it be better to reuse the existing callback
> alternative stuff that Marc added for KVM?
Yes, I agree. Mark suggested the same thing when we were talking the other day
too. I'll definitely use the callbacks for next version, but I didn't want to
hold up the RFC any further - I'd already spent way too much time polishing.
>
> Same applies below, I reckon.
>
>> .endm
>>
>> /*
>> @@ -627,16 +643,30 @@ SYM_CODE_END(ret_to_user)
>> bic \tmp, \tmp, #USER_ASID_FLAG
>> msr ttbr1_el1, \tmp
>> #ifdef CONFIG_QCOM_FALKOR_ERRATUM_1003
>> -alternative_if ARM64_WORKAROUND_QCOM_FALKOR_E1003
>> +alternative_if_not ARM64_WORKAROUND_QCOM_FALKOR_E1003
>> + b .Lskip_falkor_e1003\@
>> +alternative_else_nop_endif
>> /* ASID already in \tmp[63:48] */
>> - movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS >> 12)
>> - movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS >> 12)
>> - /* 2MB boundary containing the vectors, so we nobble the walk cache */
>> - movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS & ~(SZ_2M - 1)) >> 12)
>> +alternative_if ARM64_USE_PAGE_SIZE_4K
>> + movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) >> 12)
>> + movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) >> 12)
>> + movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_4K) & ~(SZ_2M - 1)) >> 12)
>> + b .Lfinish_falkor_e1003\@
>> +alternative_else_nop_endif
>> +alternative_if ARM64_USE_PAGE_SIZE_16K
>> + movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) >> 12)
>> + movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) >> 12)
>> + movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_16K) & ~(SZ_2M - 1)) >> 12)
>> +alternative_else /* ARM64_USE_PAGE_SIZE_64K */
>> + movk \tmp, #:abs_g2_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) >> 12)
>> + movk \tmp, #:abs_g1_nc:(TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) >> 12)
>> + movk \tmp, #:abs_g0_nc:((TRAMP_VALIAS(ARM64_PAGE_SHIFT_64K) & ~(SZ_2M - 1)) >> 12)
>> +alternative_endif
>> +.Lfinish_falkor_e1003\@:
>> isb
>> tlbi vae1, \tmp
>> dsb nsh
>> -alternative_else_nop_endif
>> +.Lskip_falkor_e1003\@:
>> #endif /* CONFIG_QCOM_FALKOR_ERRATUM_1003 */
>> .endm
>>
>> --
>> 2.43.0
>>
* Re: [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 22/57] sound: " Ryan Roberts
@ 2024-10-14 11:38 ` Mark Brown
2024-10-14 12:24 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Mark Brown @ 2024-10-14 11:38 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Jaroslav Kysela,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Takashi Iwai, Will Deacon, linux-arm-kernel,
linux-kernel, linux-mm, linux-sound
On Mon, Oct 14, 2024 at 11:58:29AM +0100, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
Please submit patches using subject lines reflecting the style for the
subsystem, this makes it easier for people to identify relevant patches.
Look at what existing commits in the area you're changing are doing and
make sure your subject lines visually resemble what they're doing.
There's no need to resubmit to fix this alone.
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
As documented in submitting-patches.rst please send patches to the
maintainers for the code you would like to change. The normal kernel
workflow is that people apply patches from their inboxes, if they aren't
copied they are likely to not see the patch at all and it is much more
difficult to apply patches.
> -static const struct snd_pcm_hardware dummy_dma_hardware = {
> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct snd_pcm_hardware, dummy_dma_hardware, {
> /* Random values to keep userspace happy when checking constraints */
> .info = SNDRV_PCM_INFO_INTERLEAVED |
> SNDRV_PCM_INFO_BLOCK_TRANSFER,
> @@ -107,7 +107,7 @@ static const struct snd_pcm_hardware dummy_dma_hardware = {
> .period_bytes_max = PAGE_SIZE*2,
> .periods_min = 2,
> .periods_max = 128,
> -};
> +});
It's probably better to just use PAGE_SIZE_MAX here and avoid the
deferred patching; as the comment says, we don't particularly care what
the value actually is here, given that it's a dummy.
* Re: [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 11:38 ` Mark Brown
@ 2024-10-14 12:24 ` Ryan Roberts
2024-10-14 12:41 ` Takashi Iwai
2024-10-14 16:01 ` Mark Brown
0 siblings, 2 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 12:24 UTC (permalink / raw)
To: Mark Brown
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Jaroslav Kysela,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Takashi Iwai, Will Deacon, linux-arm-kernel,
linux-kernel, linux-mm, linux-sound
On 14/10/2024 12:38, Mark Brown wrote:
> On Mon, Oct 14, 2024 at 11:58:29AM +0100, Ryan Roberts wrote:
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>
> Please submit patches using subject lines reflecting the style for the
> subsystem, this makes it easier for people to identify relevant patches.
> Look at what existing commits in the area you're changing are doing and
> make sure your subject lines visually resemble what they're doing.
> There's no need to resubmit to fix this alone.
No problem, will fix this in the next round (where I anticipate sending more
targeted series to maintainers).
>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> As documented in submitting-patches.rst please send patches to the
> maintainers for the code you would like to change. The normal kernel
> workflow is that people apply patches from their inboxes, if they aren't
> copied they are likely to not see the patch at all and it is much more
> difficult to apply patches.
Sure. I think you're implying that you would have liked to be in To: for this
patch? I went to quite a lot of trouble to ensure all maintainers were at least
in the To: field for patches touching their code. But get_maintainer.pl lists
you as a supporter, not a maintainer when I ran this patch through. Could you
clarify what would have been the correct thing to do? I could include all
reviewers and supporters as well as maintainers but then I'd be banging up
against the limits for some of the patches.
Or perhaps I've misunderstood the point you're making here.
>
>> -static const struct snd_pcm_hardware dummy_dma_hardware = {
>> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct snd_pcm_hardware, dummy_dma_hardware, {
>> /* Random values to keep userspace happy when checking constraints */
>> .info = SNDRV_PCM_INFO_INTERLEAVED |
>> SNDRV_PCM_INFO_BLOCK_TRANSFER,
>> @@ -107,7 +107,7 @@ static const struct snd_pcm_hardware dummy_dma_hardware = {
>> .period_bytes_max = PAGE_SIZE*2,
>> .periods_min = 2,
>> .periods_max = 128,
>> -};
>> +});
>
> It's probably better to just use PAGE_SIZE_MAX here and avoid the
> deferred patching, like the comment says we don't particularly care what
> the value actually is here given that it's a dummy.
OK, so would that be:
.buffer_bytes_max = 128*1024,
.period_bytes_min = PAGE_SIZE_MAX, <<<<<
.period_bytes_max = PAGE_SIZE_MAX*2, <<<<<
.periods_min = 2,
.periods_max = 128,
?
It's not really clear to me how all the parameters interact; the buffer size is
128K, which, if PAGE_SIZE_MAX is 64K, would hold 1 period of the maximum size.
But periods_min is 2. So I'm not sure that works? Or perhaps I'm trying to apply
too much meaning to the param names...
Thanks,
Ryan
* Re: [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 12:24 ` Ryan Roberts
@ 2024-10-14 12:41 ` Takashi Iwai
2024-10-14 12:52 ` Ryan Roberts
2024-10-14 16:01 ` Mark Brown
1 sibling, 1 reply; 196+ messages in thread
From: Takashi Iwai @ 2024-10-14 12:41 UTC (permalink / raw)
To: Ryan Roberts
Cc: Mark Brown, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Jaroslav Kysela, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Takashi Iwai, Will Deacon,
linux-arm-kernel, linux-kernel, linux-mm, linux-sound
On Mon, 14 Oct 2024 14:24:02 +0200,
Ryan Roberts wrote:
>
> On 14/10/2024 12:38, Mark Brown wrote:
> > On Mon, Oct 14, 2024 at 11:58:29AM +0100, Ryan Roberts wrote:
> >> -static const struct snd_pcm_hardware dummy_dma_hardware = {
> >> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct snd_pcm_hardware, dummy_dma_hardware, {
> >> /* Random values to keep userspace happy when checking constraints */
> >> .info = SNDRV_PCM_INFO_INTERLEAVED |
> >> SNDRV_PCM_INFO_BLOCK_TRANSFER,
> >> @@ -107,7 +107,7 @@ static const struct snd_pcm_hardware dummy_dma_hardware = {
> >> .period_bytes_max = PAGE_SIZE*2,
> >> .periods_min = 2,
> >> .periods_max = 128,
> >> -};
> >> +});
> >
> > It's probably better to just use PAGE_SIZE_MAX here and avoid the
> > deferred patching, like the comment says we don't particularly care what
> > the value actually is here given that it's a dummy.
>
> OK, so would that be:
>
> .buffer_bytes_max = 128*1024,
> .period_bytes_min = PAGE_SIZE_MAX, <<<<<
> .period_bytes_max = PAGE_SIZE_MAX*2, <<<<<
> .periods_min = 2,
> .periods_max = 128,
>
> ?
>
> It's not really clear to me how all the parameters interact; the buffer size is
> 128K, which, if PAGE_SIZE_MAX is 64K, would hold 1 period of the maximum size.
> But periods_min is 2. So not sure that works? Or perhaps I'm trying to apply too
> much meaning to the param names...
Right, when PAGE_SIZE_MAX is 64k, 128k won't be used because of the
constraint of periods_min=2.
As Mark mentioned, here the actual size itself doesn't matter much.
So I suppose it'd be even simpler to define just 4096 and 4096 * 2 for
period_bytes_min and *_max instead of sticking with PAGE_SIZE. Then
it would become platform-agnostic, too.
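To make the arithmetic above concrete, here is a minimal user-space sketch. It
is not ALSA code: struct hw_limits and both helpers are hypothetical, and they
only mirror the field names quoted above while ignoring the rest of the real
constraint refinement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified mirror of the snd_pcm_hardware fields discussed above. */
struct hw_limits {
	size_t buffer_bytes_max;
	size_t period_bytes_min;
	size_t period_bytes_max;
	unsigned int periods_min;
	unsigned int periods_max;
};

/* Largest period size that still lets periods_min periods fit in the buffer. */
static size_t largest_usable_period(const struct hw_limits *hw)
{
	size_t cap;

	assert(hw->periods_min > 0);
	cap = hw->buffer_bytes_max / hw->periods_min;
	return cap < hw->period_bytes_max ? cap : hw->period_bytes_max;
}

/* Non-zero if at least one period size satisfies all of the limits. */
static int limits_are_satisfiable(const struct hw_limits *hw)
{
	return largest_usable_period(hw) >= hw->period_bytes_min;
}
```

With a 128K buffer, periods_min=2 and period bounds of 64K and 128K, the
largest usable period comes out at 64K, so the 128K upper bound is never
reached. With 4096 and 4096 * 2 the whole period range stays usable, whatever
the page size is.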
thanks,
Takashi
* Re: [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 12:41 ` Takashi Iwai
@ 2024-10-14 12:52 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 12:52 UTC (permalink / raw)
To: Takashi Iwai
Cc: Mark Brown, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Jaroslav Kysela, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Takashi Iwai, Will Deacon,
linux-arm-kernel, linux-kernel, linux-mm, linux-sound
On 14/10/2024 13:41, Takashi Iwai wrote:
> On Mon, 14 Oct 2024 14:24:02 +0200,
> Ryan Roberts wrote:
>>
>> On 14/10/2024 12:38, Mark Brown wrote:
>>> On Mon, Oct 14, 2024 at 11:58:29AM +0100, Ryan Roberts wrote:
>>>> -static const struct snd_pcm_hardware dummy_dma_hardware = {
>>>> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct snd_pcm_hardware, dummy_dma_hardware, {
>>>> /* Random values to keep userspace happy when checking constraints */
>>>> .info = SNDRV_PCM_INFO_INTERLEAVED |
>>>> SNDRV_PCM_INFO_BLOCK_TRANSFER,
>>>> @@ -107,7 +107,7 @@ static const struct snd_pcm_hardware dummy_dma_hardware = {
>>>> .period_bytes_max = PAGE_SIZE*2,
>>>> .periods_min = 2,
>>>> .periods_max = 128,
>>>> -};
>>>> +});
>>>
>>> It's probably better to just use PAGE_SIZE_MAX here and avoid the
>>> deferred patching, like the comment says we don't particularly care what
>>> the value actually is here given that it's a dummy.
>>
>> OK, so would that be:
>>
>> .buffer_bytes_max = 128*1024,
>> .period_bytes_min = PAGE_SIZE_MAX, <<<<<
>> .period_bytes_max = PAGE_SIZE_MAX*2, <<<<<
>> .periods_min = 2,
>> .periods_max = 128,
>>
>> ?
>>
>> It's not really clear to me how all the parameters interact; the buffer size is
>> 128K, which, if PAGE_SIZE_MAX is 64K, would hold 1 period of the maximum size.
>> But periods_min is 2. So not sure that works? Or perhaps I'm trying to apply too
>> much meaning to the param names...
>
> Right, when PAGE_SIZE_MAX is 64k, 128k won't be used because of the
> constraint of periods_min=2.
>
> As Mark mentioned, here the actual size itself doesn't matter much.
> So I suppose it'd be even simpler to define just 4096 and 4096 * 2 for
> period_bytes_min and *_max instead of sticking with PAGE_SIZE. Then
> it would become platform-agnostic, too.
OK, great. I'll set these to 4096 and 4096*2 for the next version.
Thanks for the feedback!
>
>
> thanks,
>
> Takashi
* Re: [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
2024-10-14 10:58 ` [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large Ryan Roberts
@ 2024-10-14 13:00 ` Johannes Weiner
2024-10-14 19:59 ` Shakeel Butt
2024-10-17 16:09 ` Roman Gushchin
2 siblings, 0 replies; 196+ messages in thread
From: Johannes Weiner @ 2024-10-14 13:00 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Michal Hocko,
Miroslav Benes, Roman Gushchin, Shakeel Butt, Will Deacon,
cgroups, linux-arm-kernel, linux-kernel, linux-mm
On Mon, Oct 14, 2024 at 11:58:10AM +0100, Ryan Roberts wrote:
> Previously the seq_buf used for accumulating the memory.stat output was
> sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
> If 4K is enough on a 4K page system, then it should also be enough on a
> 64K page system, so we can save 60K on the static buffer used in
> mem_cgroup_print_oom_meminfo(). Let's make it so.
>
> This also has the beneficial side effect of removing a place in the code
> that assumed PAGE_SIZE is a compile-time constant. So this helps our
> quest towards supporting boot-time page size selection.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (55 preceding siblings ...)
2024-10-14 10:59 ` [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection Ryan Roberts
@ 2024-10-14 13:54 ` Pingfan Liu
2024-10-14 14:07 ` Ryan Roberts
2024-10-16 14:36 ` Ryan Roberts
2024-10-30 8:45 ` Ryan Roberts
58 siblings, 1 reply; 196+ messages in thread
From: Pingfan Liu @ 2024-10-14 13:54 UTC (permalink / raw)
To: Ryan Roberts
Cc: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86, linux-alpha,
linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
linux-kernel, linux-m68k, linux-mips, linux-mm, linux-openrisc,
linux-parisc, linux-riscv, linux-s390, linux-sh, linux-snps-arc,
linux-um, linuxppc-dev, loongarch, sparclinux
Hello Ryan,
On Mon, Oct 14, 2024 at 11:58:08AM +0100, Ryan Roberts wrote:
> arm64 can support multiple base page sizes. Instead of selecting a page
> size at compile time, as is done today, we will make it possible to
> select the desired page size on the command line.
>
> In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
> (as well as a number of other macros related to or derived from
> PAGE_SHIFT, but I'm not worrying about those yet), are no longer
> compile-time constants. So the code base needs to cope with that.
>
> As a first step, introduce MIN and MAX variants of these macros, which
> express the range of possible page sizes. These are always compile-time
> constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
> were previously used where a compile-time constant is required.
> (Subsequent patches will do that conversion work). When the arch/build
> doesn't support boot-time page size selection, the MIN and MAX variants
> are equal and everything resolves as it did previously.
>
MIN and MAX appear to construct a boundary, but that may not be enough.
Please see the following comment inline.
> Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
> global variable definitions so that for boot-time page size selection
> builds, the variable being wrapped is initialized at boot-time, instead
> of compile-time. This is done by defining a function to do the
> assignment, which has the "constructor" attribute. Constructor is
> preferred over initcall, because when compiling a module, the module is
> limited to a single initcall but constructors are unlimited. For
> built-in code, constructors are now called earlier to guarantee that
> the variables are initialized by the time they are used. Any arch that
> wants to enable boot-time page size selection will need to select
> CONFIG_CONSTRUCTORS.
>
> These new macros need to be available anywhere PAGE_SHIFT and friends
> are available. Those are defined via asm/page.h (although some arches
> have a sub-include that defines them). Unfortunately there is no
> reliable asm-generic header we can easily piggy-back on, so let's define
> a new one, pgtable-geometry.h, which we include near where each arch
> defines PAGE_SHIFT. Ugh.
>
> -------
>
> Most of the problems that need to be solved over the next few patches
> fall into these broad categories, which are all solved with the help of
> these new macros:
>
> 1. Assignment of values derived from PAGE_SIZE in global variables
>
> For boot-time page size builds, we must defer the initialization of
> these variables until boot-time, when the page size is known. See
> DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
>
> 2. Define static storage in units related to PAGE_SIZE
>
> This static storage will be defined according to PAGE_SIZE_MAX.
>
> 3. Define size of struct so that it is related to PAGE_SIZE
>
> The struct often contains an array that is sized to fill the page. In
> this case, use a flexible array with dynamic allocation. In other
> cases, the struct fits exactly over a page, which is a header (e.g.
> swap file header). In this case, remove the padding, and manually
> determine the struct pointer within the page.
>
About two years ago, I tried to do something similar to your series, but ran
into a problem at this point, or maybe not exactly at the point you list
here. I consider this the most challenging part.
The scenario is
struct X {
a[size_a];
b[size_b];
c;
};
Where size_a = f(PAGE_SHIFT), size_b=g(PAGE_SHIFT). One of f() and g()
is proportional to PAGE_SHIFT, the other is inversely proportional.
How can you fix the reference of X.a and X.b?
Thanks,
Pingfan
> 4. BUILD_BUG_ON() with values derived from PAGE_SIZE
>
> In most cases, we can change these to compare against the appropriate
> limit (either MIN or MAX). In other cases, we must change these to
> run-time BUG_ON().
>
> 5. Ensure page alignment of static data structures
>
> Align instead to PAGE_SIZE_MAX.
>
> 6. #ifdeffery based on PAGE_SIZE
>
> Often these can be changed to c code constructs. e.g. a macro that
> returns a different value depending on page size can be changed to use
> the ternary operator and the compiler will dead code strip it for the
> compile-time constant case and runtime evaluate it for the non-const
> case. Or #if/#else/#endif within a function can be converted to c
> if/else blocks, which are also dead code stripped for the const case.
> Sometimes we can change the c-preprocessor logic to use the
> appropriate MIN/MAX limit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> arch/alpha/include/asm/page.h | 1 +
> arch/arc/include/asm/page.h | 1 +
> arch/arm/include/asm/page.h | 1 +
> arch/arm64/include/asm/page-def.h | 2 +
> arch/csky/include/asm/page.h | 3 ++
> arch/hexagon/include/asm/page.h | 2 +
> arch/loongarch/include/asm/page.h | 2 +
> arch/m68k/include/asm/page.h | 1 +
> arch/microblaze/include/asm/page.h | 1 +
> arch/mips/include/asm/page.h | 1 +
> arch/nios2/include/asm/page.h | 2 +
> arch/openrisc/include/asm/page.h | 1 +
> arch/parisc/include/asm/page.h | 1 +
> arch/powerpc/include/asm/page.h | 2 +
> arch/riscv/include/asm/page.h | 1 +
> arch/s390/include/asm/page.h | 1 +
> arch/sh/include/asm/page.h | 1 +
> arch/sparc/include/asm/page.h | 3 ++
> arch/um/include/asm/page.h | 2 +
> arch/x86/include/asm/page_types.h | 2 +
> arch/xtensa/include/asm/page.h | 1 +
> include/asm-generic/pgtable-geometry.h | 71 ++++++++++++++++++++++++++
> init/main.c | 5 +-
> 23 files changed, 107 insertions(+), 1 deletion(-)
> create mode 100644 include/asm-generic/pgtable-geometry.h
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 70419e6be1a35..d0096fb5521b8 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ALPHA_PAGE_H */
> diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
> index def0dfb95b436..8d56549db7a33 100644
> --- a/arch/arc/include/asm/page.h
> +++ b/arch/arc/include/asm/page.h
> @@ -6,6 +6,7 @@
> #define __ASM_ARC_PAGE_H
>
> #include <uapi/asm/page.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #ifdef CONFIG_ARC_HAS_PAE40
>
> diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
> index 62af9f7f9e963..417aa8533c718 100644
> --- a/arch/arm/include/asm/page.h
> +++ b/arch/arm/include/asm/page.h
> @@ -191,5 +191,6 @@ extern int pfn_valid(unsigned long);
>
> #include <asm-generic/getorder.h>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif
> diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
> index 792e9fe881dcf..d69971cf49cd2 100644
> --- a/arch/arm64/include/asm/page-def.h
> +++ b/arch/arm64/include/asm/page-def.h
> @@ -15,4 +15,6 @@
> #define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> #define PAGE_MASK (~(PAGE_SIZE-1))
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __ASM_PAGE_DEF_H */
> diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
> index 0ca6c408c07f2..95173d57adc8b 100644
> --- a/arch/csky/include/asm/page.h
> +++ b/arch/csky/include/asm/page.h
> @@ -92,4 +92,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
> #include <asm-generic/getorder.h>
>
> #endif /* !__ASSEMBLY__ */
> +
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __ASM_CSKY_PAGE_H */
> diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
> index 8a6af57274c2d..ba7ad5231695f 100644
> --- a/arch/hexagon/include/asm/page.h
> +++ b/arch/hexagon/include/asm/page.h
> @@ -139,4 +139,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
> #endif /* ifdef __ASSEMBLY__ */
> #endif /* ifdef __KERNEL__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif
> diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
> index e85df33f11c77..9862e8fb047a6 100644
> --- a/arch/loongarch/include/asm/page.h
> +++ b/arch/loongarch/include/asm/page.h
> @@ -123,4 +123,6 @@ extern int __virt_addr_valid(volatile void *kaddr);
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_PAGE_H */
> diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
> index 8cfb84b499751..4df4681b02194 100644
> --- a/arch/m68k/include/asm/page.h
> +++ b/arch/m68k/include/asm/page.h
> @@ -60,5 +60,6 @@ extern unsigned long _ramend;
>
> #include <asm-generic/getorder.h>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _M68K_PAGE_H */
> diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
> index 8810f4f1c3b02..abc23c3d743bd 100644
> --- a/arch/microblaze/include/asm/page.h
> +++ b/arch/microblaze/include/asm/page.h
> @@ -142,5 +142,6 @@ static inline const void *pfn_to_virt(unsigned long pfn)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_MICROBLAZE_PAGE_H */
> diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
> index 4609cb0326cf3..3d91021538f02 100644
> --- a/arch/mips/include/asm/page.h
> +++ b/arch/mips/include/asm/page.h
> @@ -227,5 +227,6 @@ static inline unsigned long kaslr_offset(void)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_PAGE_H */
> diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
> index 0722f88e63cc7..2e5f93beb42b7 100644
> --- a/arch/nios2/include/asm/page.h
> +++ b/arch/nios2/include/asm/page.h
> @@ -97,4 +97,6 @@ extern struct page *mem_map;
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_NIOS2_PAGE_H */
> diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
> index 1d5913f67c312..a0da2a9842241 100644
> --- a/arch/openrisc/include/asm/page.h
> +++ b/arch/openrisc/include/asm/page.h
> @@ -88,5 +88,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* __ASM_OPENRISC_PAGE_H */
> diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
> index 4bea2e95798f0..2a75496237c09 100644
> --- a/arch/parisc/include/asm/page.h
> +++ b/arch/parisc/include/asm/page.h
> @@ -173,6 +173,7 @@ extern int npmem_ranges;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
> #include <asm/pdc.h>
>
> #define PAGE0 ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 83d0a4fc5f755..4601c115b6485 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -300,4 +300,6 @@ static inline unsigned long kaslr_offset(void)
> #include <asm-generic/memory_model.h>
> #endif /* __ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_POWERPC_PAGE_H */
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 7ede2111c5917..e5af7579e45bf 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -204,5 +204,6 @@ static __always_inline void *pfn_to_kaddr(unsigned long pfn)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_RISCV_PAGE_H */
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index 16e4caa931f1f..42157e7690a77 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -275,6 +275,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #define AMODE31_SIZE (3 * PAGE_SIZE)
>
> diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
> index f780b467e75d7..09533d46ef033 100644
> --- a/arch/sh/include/asm/page.h
> +++ b/arch/sh/include/asm/page.h
> @@ -162,5 +162,6 @@ typedef struct page *pgtable_t;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* __ASM_SH_PAGE_H */
> diff --git a/arch/sparc/include/asm/page.h b/arch/sparc/include/asm/page.h
> index 5e44cdf2a8f2b..4327fe2bfa010 100644
> --- a/arch/sparc/include/asm/page.h
> +++ b/arch/sparc/include/asm/page.h
> @@ -9,4 +9,7 @@
> #else
> #include <asm/page_32.h>
> #endif
> +
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif
> diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
> index 9ef9a8aedfa66..f26011808f514 100644
> --- a/arch/um/include/asm/page.h
> +++ b/arch/um/include/asm/page.h
> @@ -119,4 +119,6 @@ extern unsigned long uml_physmem;
> #define __HAVE_ARCH_GATE_AREA 1
> #endif
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __UM_PAGE_H */
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index 52f1b4ff0cc16..6d2381342047f 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -71,4 +71,6 @@ extern void initmem_init(void);
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_X86_PAGE_DEFS_H */
> diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
> index 4db56ef052d22..86952cb32af23 100644
> --- a/arch/xtensa/include/asm/page.h
> +++ b/arch/xtensa/include/asm/page.h
> @@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
> #endif /* __ASSEMBLY__ */
>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
> #endif /* _XTENSA_PAGE_H */
> diff --git a/include/asm-generic/pgtable-geometry.h b/include/asm-generic/pgtable-geometry.h
> new file mode 100644
> index 0000000000000..358e729a6ac37
> --- /dev/null
> +++ b/include/asm-generic/pgtable-geometry.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef ASM_GENERIC_PGTABLE_GEOMETRY_H
> +#define ASM_GENERIC_PGTABLE_GEOMETRY_H
> +
> +#if defined(PAGE_SHIFT_MAX) && defined(PAGE_SIZE_MAX) && defined(PAGE_MASK_MAX) && \
> + defined(PAGE_SHIFT_MIN) && defined(PAGE_SIZE_MIN) && defined(PAGE_MASK_MIN)
> +/* Arch supports boot-time page size selection. */
> +#elif defined(PAGE_SHIFT_MAX) || defined(PAGE_SIZE_MAX) || defined(PAGE_MASK_MAX) || \
> + defined(PAGE_SHIFT_MIN) || defined(PAGE_SIZE_MIN) || defined(PAGE_MASK_MIN)
> +#error Arch must define all or none of the boot-time page size macros
> +#else
> +/* Arch does not support boot-time page size selection. */
> +#define PAGE_SHIFT_MIN PAGE_SHIFT
> +#define PAGE_SIZE_MIN PAGE_SIZE
> +#define PAGE_MASK_MIN PAGE_MASK
> +#define PAGE_SHIFT_MAX PAGE_SHIFT
> +#define PAGE_SIZE_MAX PAGE_SIZE
> +#define PAGE_MASK_MAX PAGE_MASK
> +#endif
> +
> +/*
> + * Define a global variable (scalar or struct), whose value is derived from
> + * PAGE_SIZE and friends. When PAGE_SIZE is a compile-time constant, the global
> + * variable is simply defined with the static value. When PAGE_SIZE is
> + * determined at boot-time, a pure initcall is registered and run during boot to
> + * initialize the variable.
> + *
> + * @type: Unqualified type. Do not include "const"; implied by macro variant.
> + * @name: Variable name.
> + * @...: Initialization value. May be scalar or initializer.
> + *
> + * "static" is declared by placing "static" before the macro.
> + *
> + * Example:
> + *
> + * struct my_struct {
> + * int a;
> + * char b;
> + * };
> + *
> + * static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct my_struct, my_variable, {
> + * .a = 10,
> + * .b = 'e',
> + * });
> + */
> +#if PAGE_SIZE_MIN != PAGE_SIZE_MAX
> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
> + type name attrib; \
> + static int __init __attribute__((constructor)) __##name##_init(void) \
> + { \
> + name = (type)__VA_ARGS__; \
> + return 0; \
> + }
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, __ro_after_init, __VA_ARGS__)
> +#else /* PAGE_SIZE_MIN == PAGE_SIZE_MAX */
> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
> + type name attrib = __VA_ARGS__; \
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(const type, name, , __VA_ARGS__)
> +#endif
> +
> +#endif /* ASM_GENERIC_PGTABLE_GEOMETRY_H */
> diff --git a/init/main.c b/init/main.c
> index 206acdde51f5a..ba1515eb20b9d 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -899,6 +899,8 @@ static void __init early_numa_node_init(void)
> #endif
> }
>
> +static __init void do_ctors(void);
> +
> asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
> void start_kernel(void)
> {
> @@ -910,6 +912,8 @@ void start_kernel(void)
> debug_objects_early_init();
> init_vmlinux_build_id();
>
> + do_ctors();
> +
> cgroup_init_early();
>
> local_irq_disable();
> @@ -1360,7 +1364,6 @@ static void __init do_basic_setup(void)
> cpuset_init_smp();
> driver_init();
> init_irq_proc();
> - do_ctors();
> do_initcalls();
> }
>
> --
> 2.43.0
>
* Re: [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
2024-10-14 13:54 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting " Pingfan Liu
@ 2024-10-14 14:07 ` Ryan Roberts
2024-10-15 3:04 ` Pingfan Liu
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-14 14:07 UTC (permalink / raw)
To: Pingfan Liu
Cc: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86, linux-alpha,
linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
linux-kernel, linux-m68k, linux-mips, linux-mm, linux-openrisc,
linux-parisc, linux-riscv, linux-s390, linux-sh, linux-snps-arc,
linux-um, linuxppc-dev, loongarch, sparclinux
On 14/10/2024 14:54, Pingfan Liu wrote:
> Hello Ryan,
>
> On Mon, Oct 14, 2024 at 11:58:08AM +0100, Ryan Roberts wrote:
>> arm64 can support multiple base page sizes. Instead of selecting a page
>> size at compile time, as is done today, we will make it possible to
>> select the desired page size on the command line.
>>
>> In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
>> (as well as a number of other macros related to or derived from
>> PAGE_SHIFT, but I'm not worrying about those yet), are no longer
>> compile-time constants. So the code base needs to cope with that.
>>
>> As a first step, introduce MIN and MAX variants of these macros, which
>> express the range of possible page sizes. These are always compile-time
>> constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
>> were previously used where a compile-time constant is required.
>> (Subsequent patches will do that conversion work). When the arch/build
>> doesn't support boot-time page size selection, the MIN and MAX variants
>> are equal and everything resolves as it did previously.
>>
>
> MIN and MAX appear to construct a boundary, but that may not be enough.
> Please see the following comment inline.
>
>> Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
>> global variable definitions so that for boot-time page size selection
>> builds, the variable being wrapped is initialized at boot-time, instead
>> of compile-time. This is done by defining a function to do the
>> assignment, which has the "constructor" attribute. Constructor is
>> preferred over initcall, because when compiling a module, the module is
>> limited to a single initcall but constructors are unlimited. For
>> built-in code, constructors are now called earlier to guarantee that
>> the variables are initialized by the time they are used. Any arch that
>> wants to enable boot-time page size selection will need to select
>> CONFIG_CONSTRUCTORS.
>>
>> These new macros need to be available anywhere PAGE_SHIFT and friends
>> are available. Those are defined via asm/page.h (although some arches
>> have a sub-include that defines them). Unfortunately there is no
>> reliable asm-generic header we can easily piggy-back on, so let's define
>> a new one, pgtable-geometry.h, which we include near where each arch
>> defines PAGE_SHIFT. Ugh.
>>
>> -------
>>
>> Most of the problems that need to be solved over the next few patches
>> fall into these broad categories, which are all solved with the help of
>> these new macros:
>>
>> 1. Assignment of values derived from PAGE_SIZE in global variables
>>
>> For boot-time page size builds, we must defer the initialization of
>> these variables until boot-time, when the page size is known. See
>> DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
>>
>> 2. Define static storage in units related to PAGE_SIZE
>>
>> This static storage will be defined according to PAGE_SIZE_MAX.
>>
>> 3. Define size of struct so that it is related to PAGE_SIZE
>>
>> The struct often contains an array that is sized to fill the page. In
>> this case, use a flexible array with dynamic allocation. In other
>> cases, the struct fits exactly over a page, which is a header (e.g.
>> swap file header). In this case, remove the padding, and manually
>> determine the struct pointer within the page.
>>
>
> About two years ago, I tried to do something similar to your series, but ran
> into a problem at this point, or maybe not exactly at the point you list
> here. I consider this the most challenging part.
>
> The scenario is
> struct X {
> a[size_a];
> b[size_b];
> c;
> };
>
> Where size_a = f(PAGE_SHIFT), size_b=g(PAGE_SHIFT). One of f() and g()
> is proportional to PAGE_SHIFT, the other is inversely proportional.
>
> How can you fix the reference of X.a and X.b?
If you need to allocate static memory, then in this scenario, assuming f() is
proportional and g() is inversely proportional, I guess you need
size_a=f(PAGE_SIZE_MAX) and size_b=g(PAGE_SIZE_MIN). Or if you can allocate the
memory dynamically, then make a and b pointers to dynamically allocated buffers.
Is there a specific place in the source where this pattern is used today? It
might be easier to discuss in the context of the code if so.
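As a rough sketch of the two options, assuming a 4K to 64K range and made-up
f() and g() (the F() and G() macros, both struct names and the helper functions
below are purely illustrative and not taken from any existing kernel code):

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed range for illustration: 4K (shift 12) to 64K (shift 16). */
#define PAGE_SHIFT_MIN 12
#define PAGE_SHIFT_MAX 16

/* Hypothetical f() and g(): F grows with the page shift, G shrinks. */
#define F(shift) (1u << ((shift) - PAGE_SHIFT_MIN))
#define G(shift) (1u << (PAGE_SHIFT_MAX - (shift)))

static_assert(F(PAGE_SHIFT_MAX) >= F(PAGE_SHIFT_MIN), "F() must peak at MAX");
static_assert(G(PAGE_SHIFT_MIN) >= G(PAGE_SHIFT_MAX), "G() must peak at MIN");

/* Static variant: each array is sized for its own worst case. */
struct x_static {
	int a[F(PAGE_SHIFT_MAX)];	/* F() peaks at the largest page size */
	int b[G(PAGE_SHIFT_MIN)];	/* G() peaks at the smallest page size */
	int c;
};

/* Dynamic variant: sizes are computed once the page shift is known. */
struct x_dynamic {
	int *a;
	int *b;
	unsigned int nr_a;
	unsigned int nr_b;
	int c;
};

/* Returns 0 on success, -1 on a bad page shift or allocation failure. */
static int x_dynamic_init(struct x_dynamic *x, unsigned int page_shift)
{
	if (page_shift < PAGE_SHIFT_MIN || page_shift > PAGE_SHIFT_MAX)
		return -1;

	x->nr_a = F(page_shift);
	x->nr_b = G(page_shift);
	x->a = calloc(x->nr_a, sizeof(*x->a));
	x->b = calloc(x->nr_b, sizeof(*x->b));
	x->c = 0;
	if (!x->a || !x->b) {
		free(x->a);
		free(x->b);
		return -1;
	}
	return 0;
}

static void x_dynamic_free(struct x_dynamic *x)
{
	free(x->a);
	free(x->b);
}
```

The static variant wastes space whenever the selected page size is not the
worst case for a given array, but X.a and X.b stay at fixed offsets. The
dynamic variant trades that for an extra pointer dereference.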
Thanks,
Ryan
>
> Thanks,
>
> Pingfan
>
>
>> 4. BUILD_BUG_ON() with values derived from PAGE_SIZE
>>
>> In most cases, we can change these to compare against the appropriate
>> limit (either MIN or MAX). In other cases, we must change these to
>> run-time BUG_ON().
>>
>> 5. Ensure page alignment of static data structures
>>
>> Align instead to PAGE_SIZE_MAX.
>>
>> 6. #ifdeffery based on PAGE_SIZE
>>
>> Often these can be changed to C code constructs. e.g. a macro that
>> returns a different value depending on page size can be changed to use
>> the ternary operator and the compiler will dead code strip it for the
>> compile-time constant case and runtime evaluate it for the non-const
>> case. Or #if/#else/#endif within a function can be converted to C
>> if/else blocks, which are also dead code stripped for the const case.
>> Sometimes we can change the C preprocessor logic to use the
>> appropriate MIN/MAX limit.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> arch/alpha/include/asm/page.h | 1 +
>> arch/arc/include/asm/page.h | 1 +
>> arch/arm/include/asm/page.h | 1 +
>> arch/arm64/include/asm/page-def.h | 2 +
>> arch/csky/include/asm/page.h | 3 ++
>> arch/hexagon/include/asm/page.h | 2 +
>> arch/loongarch/include/asm/page.h | 2 +
>> arch/m68k/include/asm/page.h | 1 +
>> arch/microblaze/include/asm/page.h | 1 +
>> arch/mips/include/asm/page.h | 1 +
>> arch/nios2/include/asm/page.h | 2 +
>> arch/openrisc/include/asm/page.h | 1 +
>> arch/parisc/include/asm/page.h | 1 +
>> arch/powerpc/include/asm/page.h | 2 +
>> arch/riscv/include/asm/page.h | 1 +
>> arch/s390/include/asm/page.h | 1 +
>> arch/sh/include/asm/page.h | 1 +
>> arch/sparc/include/asm/page.h | 3 ++
>> arch/um/include/asm/page.h | 2 +
>> arch/x86/include/asm/page_types.h | 2 +
>> arch/xtensa/include/asm/page.h | 1 +
>> include/asm-generic/pgtable-geometry.h | 71 ++++++++++++++++++++++++++
>> init/main.c | 5 +-
>> 23 files changed, 107 insertions(+), 1 deletion(-)
>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>
>> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
>> index 70419e6be1a35..d0096fb5521b8 100644
>> --- a/arch/alpha/include/asm/page.h
>> +++ b/arch/alpha/include/asm/page.h
>> @@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* _ALPHA_PAGE_H */
>> diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
>> index def0dfb95b436..8d56549db7a33 100644
>> --- a/arch/arc/include/asm/page.h
>> +++ b/arch/arc/include/asm/page.h
>> @@ -6,6 +6,7 @@
>> #define __ASM_ARC_PAGE_H
>>
>> #include <uapi/asm/page.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #ifdef CONFIG_ARC_HAS_PAE40
>>
>> diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
>> index 62af9f7f9e963..417aa8533c718 100644
>> --- a/arch/arm/include/asm/page.h
>> +++ b/arch/arm/include/asm/page.h
>> @@ -191,5 +191,6 @@ extern int pfn_valid(unsigned long);
>>
>> #include <asm-generic/getorder.h>
>> #include <asm-generic/memory_model.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif
>> diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
>> index 792e9fe881dcf..d69971cf49cd2 100644
>> --- a/arch/arm64/include/asm/page-def.h
>> +++ b/arch/arm64/include/asm/page-def.h
>> @@ -15,4 +15,6 @@
>> #define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
>> #define PAGE_MASK (~(PAGE_SIZE-1))
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* __ASM_PAGE_DEF_H */
>> diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
>> index 0ca6c408c07f2..95173d57adc8b 100644
>> --- a/arch/csky/include/asm/page.h
>> +++ b/arch/csky/include/asm/page.h
>> @@ -92,4 +92,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>> #include <asm-generic/getorder.h>
>>
>> #endif /* !__ASSEMBLY__ */
>> +
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* __ASM_CSKY_PAGE_H */
>> diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
>> index 8a6af57274c2d..ba7ad5231695f 100644
>> --- a/arch/hexagon/include/asm/page.h
>> +++ b/arch/hexagon/include/asm/page.h
>> @@ -139,4 +139,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>> #endif /* ifdef __ASSEMBLY__ */
>> #endif /* ifdef __KERNEL__ */
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif
>> diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
>> index e85df33f11c77..9862e8fb047a6 100644
>> --- a/arch/loongarch/include/asm/page.h
>> +++ b/arch/loongarch/include/asm/page.h
>> @@ -123,4 +123,6 @@ extern int __virt_addr_valid(volatile void *kaddr);
>>
>> #endif /* !__ASSEMBLY__ */
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* _ASM_PAGE_H */
>> diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
>> index 8cfb84b499751..4df4681b02194 100644
>> --- a/arch/m68k/include/asm/page.h
>> +++ b/arch/m68k/include/asm/page.h
>> @@ -60,5 +60,6 @@ extern unsigned long _ramend;
>>
>> #include <asm-generic/getorder.h>
>> #include <asm-generic/memory_model.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* _M68K_PAGE_H */
>> diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
>> index 8810f4f1c3b02..abc23c3d743bd 100644
>> --- a/arch/microblaze/include/asm/page.h
>> +++ b/arch/microblaze/include/asm/page.h
>> @@ -142,5 +142,6 @@ static inline const void *pfn_to_virt(unsigned long pfn)
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* _ASM_MICROBLAZE_PAGE_H */
>> diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
>> index 4609cb0326cf3..3d91021538f02 100644
>> --- a/arch/mips/include/asm/page.h
>> +++ b/arch/mips/include/asm/page.h
>> @@ -227,5 +227,6 @@ static inline unsigned long kaslr_offset(void)
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* _ASM_PAGE_H */
>> diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
>> index 0722f88e63cc7..2e5f93beb42b7 100644
>> --- a/arch/nios2/include/asm/page.h
>> +++ b/arch/nios2/include/asm/page.h
>> @@ -97,4 +97,6 @@ extern struct page *mem_map;
>>
>> #endif /* !__ASSEMBLY__ */
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* _ASM_NIOS2_PAGE_H */
>> diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
>> index 1d5913f67c312..a0da2a9842241 100644
>> --- a/arch/openrisc/include/asm/page.h
>> +++ b/arch/openrisc/include/asm/page.h
>> @@ -88,5 +88,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* __ASM_OPENRISC_PAGE_H */
>> diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
>> index 4bea2e95798f0..2a75496237c09 100644
>> --- a/arch/parisc/include/asm/page.h
>> +++ b/arch/parisc/include/asm/page.h
>> @@ -173,6 +173,7 @@ extern int npmem_ranges;
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>> #include <asm/pdc.h>
>>
>> #define PAGE0 ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
>> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
>> index 83d0a4fc5f755..4601c115b6485 100644
>> --- a/arch/powerpc/include/asm/page.h
>> +++ b/arch/powerpc/include/asm/page.h
>> @@ -300,4 +300,6 @@ static inline unsigned long kaslr_offset(void)
>> #include <asm-generic/memory_model.h>
>> #endif /* __ASSEMBLY__ */
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* _ASM_POWERPC_PAGE_H */
>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>> index 7ede2111c5917..e5af7579e45bf 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -204,5 +204,6 @@ static __always_inline void *pfn_to_kaddr(unsigned long pfn)
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* _ASM_RISCV_PAGE_H */
>> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
>> index 16e4caa931f1f..42157e7690a77 100644
>> --- a/arch/s390/include/asm/page.h
>> +++ b/arch/s390/include/asm/page.h
>> @@ -275,6 +275,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #define AMODE31_SIZE (3 * PAGE_SIZE)
>>
>> diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
>> index f780b467e75d7..09533d46ef033 100644
>> --- a/arch/sh/include/asm/page.h
>> +++ b/arch/sh/include/asm/page.h
>> @@ -162,5 +162,6 @@ typedef struct page *pgtable_t;
>>
>> #include <asm-generic/memory_model.h>
>> #include <asm-generic/getorder.h>
>> +#include <asm-generic/pgtable-geometry.h>
>>
>> #endif /* __ASM_SH_PAGE_H */
>> diff --git a/arch/sparc/include/asm/page.h b/arch/sparc/include/asm/page.h
>> index 5e44cdf2a8f2b..4327fe2bfa010 100644
>> --- a/arch/sparc/include/asm/page.h
>> +++ b/arch/sparc/include/asm/page.h
>> @@ -9,4 +9,7 @@
>> #else
>> #include <asm/page_32.h>
>> #endif
>> +
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif
>> diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
>> index 9ef9a8aedfa66..f26011808f514 100644
>> --- a/arch/um/include/asm/page.h
>> +++ b/arch/um/include/asm/page.h
>> @@ -119,4 +119,6 @@ extern unsigned long uml_physmem;
>> #define __HAVE_ARCH_GATE_AREA 1
>> #endif
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* __UM_PAGE_H */
>> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
>> index 52f1b4ff0cc16..6d2381342047f 100644
>> --- a/arch/x86/include/asm/page_types.h
>> +++ b/arch/x86/include/asm/page_types.h
>> @@ -71,4 +71,6 @@ extern void initmem_init(void);
>>
>> #endif /* !__ASSEMBLY__ */
>>
>> +#include <asm-generic/pgtable-geometry.h>
>> +
>> #endif /* _ASM_X86_PAGE_DEFS_H */
>> diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
>> index 4db56ef052d22..86952cb32af23 100644
>> --- a/arch/xtensa/include/asm/page.h
>> +++ b/arch/xtensa/include/asm/page.h
>> @@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
>> #endif /* __ASSEMBLY__ */
>>
>> #include <asm-generic/memory_model.h>
>> +#include <asm-generic/pgtable-geometry.h>
>> #endif /* _XTENSA_PAGE_H */
>> diff --git a/include/asm-generic/pgtable-geometry.h b/include/asm-generic/pgtable-geometry.h
>> new file mode 100644
>> index 0000000000000..358e729a6ac37
>> --- /dev/null
>> +++ b/include/asm-generic/pgtable-geometry.h
>> @@ -0,0 +1,71 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef ASM_GENERIC_PGTABLE_GEOMETRY_H
>> +#define ASM_GENERIC_PGTABLE_GEOMETRY_H
>> +
>> +#if defined(PAGE_SHIFT_MAX) && defined(PAGE_SIZE_MAX) && defined(PAGE_MASK_MAX) && \
>> + defined(PAGE_SHIFT_MIN) && defined(PAGE_SIZE_MIN) && defined(PAGE_MASK_MIN)
>> +/* Arch supports boot-time page size selection. */
>> +#elif defined(PAGE_SHIFT_MAX) || defined(PAGE_SIZE_MAX) || defined(PAGE_MASK_MAX) || \
>> + defined(PAGE_SHIFT_MIN) || defined(PAGE_SIZE_MIN) || defined(PAGE_MASK_MIN)
>> +#error Arch must define all or none of the boot-time page size macros
>> +#else
>> +/* Arch does not support boot-time page size selection. */
>> +#define PAGE_SHIFT_MIN PAGE_SHIFT
>> +#define PAGE_SIZE_MIN PAGE_SIZE
>> +#define PAGE_MASK_MIN PAGE_MASK
>> +#define PAGE_SHIFT_MAX PAGE_SHIFT
>> +#define PAGE_SIZE_MAX PAGE_SIZE
>> +#define PAGE_MASK_MAX PAGE_MASK
>> +#endif
>> +
>> +/*
>> + * Define a global variable (scalar or struct), whose value is derived from
>> + * PAGE_SIZE and friends. When PAGE_SIZE is a compile-time constant, the global
>> + * variable is simply defined with the static value. When PAGE_SIZE is
>> + * determined at boot-time, a pure initcall is registered and run during boot to
>> + * initialize the variable.
>> + *
>> + * @type: Unqualified type. Do not include "const"; implied by macro variant.
>> + * @name: Variable name.
>> + * @...: Initialization value. May be scalar or initializer.
>> + *
>> + * "static" is declared by placing "static" before the macro.
>> + *
>> + * Example:
>> + *
>> + * struct my_struct {
>> + * int a;
>> + * char b;
>> + * };
>> + *
>> + * static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct my_struct, my_variable, {
>> + * .a = 10,
>> + * .b = 'e',
>> + * });
>> + */
>> +#if PAGE_SIZE_MIN != PAGE_SIZE_MAX
>> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
>> + type name attrib; \
>> + static int __init __attribute__((constructor)) __##name##_init(void) \
>> + { \
>> + name = (type)__VA_ARGS__; \
>> + return 0; \
>> + }
>> +
>> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
>> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
>> +
>> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
>> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, __ro_after_init, __VA_ARGS__)
>> +#else /* PAGE_SIZE_MIN == PAGE_SIZE_MAX */
>> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
>> + type name attrib = __VA_ARGS__; \
>> +
>> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
>> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
>> +
>> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
>> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(const type, name, , __VA_ARGS__)
>> +#endif
>> +
>> +#endif /* ASM_GENERIC_PGTABLE_GEOMETRY_H */
>> diff --git a/init/main.c b/init/main.c
>> index 206acdde51f5a..ba1515eb20b9d 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -899,6 +899,8 @@ static void __init early_numa_node_init(void)
>> #endif
>> }
>>
>> +static __init void do_ctors(void);
>> +
>> asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
>> void start_kernel(void)
>> {
>> @@ -910,6 +912,8 @@ void start_kernel(void)
>> debug_objects_early_init();
>> init_vmlinux_build_id();
>>
>> + do_ctors();
>> +
>> cgroup_init_early();
>>
>> local_irq_disable();
>> @@ -1360,7 +1364,6 @@ static void __init do_basic_setup(void)
>> cpuset_init_smp();
>> driver_init();
>> init_irq_proc();
>> - do_ctors();
>> do_initcalls();
>> }
>>
>> --
>> 2.43.0
>>
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 12:24 ` Ryan Roberts
2024-10-14 12:41 ` Takashi Iwai
@ 2024-10-14 16:01 ` Mark Brown
2024-10-15 11:35 ` Ryan Roberts
1 sibling, 1 reply; 196+ messages in thread
From: Mark Brown @ 2024-10-14 16:01 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Jaroslav Kysela,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Takashi Iwai, Will Deacon, linux-arm-kernel,
linux-kernel, linux-mm, linux-sound
On Mon, Oct 14, 2024 at 01:24:02PM +0100, Ryan Roberts wrote:
> On 14/10/2024 12:38, Mark Brown wrote:
> > On Mon, Oct 14, 2024 at 11:58:29AM +0100, Ryan Roberts wrote:
> >> ***NOTE***
> >> Any confused maintainers may want to read the cover note here for context:
> >> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> > As documented in submitting-patches.rst please send patches to the
> > maintainers for the code you would like to change. The normal kernel
> > workflow is that people apply patches from their inboxes, if they aren't
> > copied they are likely to not see the patch at all and it is much more
> > difficult to apply patches.
> Sure. I think you're implying that you would have liked to be in To: for this
> patch? I went to quite a lot of trouble to ensure all maintainers were at least
> in the To: field for patches touching their code. But get_maintainer.pl lists
> you as a supporter, not a maintainer when I ran this patch through. Could you
> clarify what would have been the correct thing to do? I could include all
> reviewers and supporters as well as maintainers but then I'd be banging up
> against the limits for some of the patches.
The entry in MAINTAINERS for me is a M:, supporter is just the usual
get_maintainers noise. Supported is exactly equivalent to a maintainer.
Generally if you're going to filter people you should be filtering less
specific matches out rather than more and if you're looking to filter
very aggressively look at who actually commits changes to whatever
you're trying to change, less specific maintainers will generally
delegate down to the more specific ones.
> > It's probably better to just use PAGE_SIZE_MAX here and avoid the
> > deferred patching, like the comment says we don't particularly care what
> > the value actually is here given that it's a dummy.
> OK, so would that be:
> .buffer_bytes_max = 128*1024,
> .period_bytes_min = PAGE_SIZE_MAX, <<<<<
> .period_bytes_max = PAGE_SIZE_MAX*2, <<<<<
> .periods_min = 2,
> .periods_max = 128,
> It's not really clear to me how all the parameters interact; the buffer size is
> 128K, which, if PAGE_SIZE_MAX is 64K, would hold 1 period of the maximum size.
> But periods_min is 2. So not sure that works? Or perhaps I'm trying to apply too
> much meaning to the param names...
Like Takashi says just using absolute numbers here is probably just as
sensible, the numbers are there to stop userspace tripping over itself
but like I say it shouldn't ever get as far as actually using them for
anything. So long as we end up with some numbers that don't need any
late init patching the specifics aren't super important, the use of
PAGE_SIZE was kind of random.
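
For reference, the arithmetic behind the question quoted above can be worked
through with a small sketch. The struct and helper names are hypothetical and
only mirror the field names from the quoted snippet; this checks the numbers
and says nothing about how the ALSA core actually resolves such constraints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the limits in the quoted snippet. */
struct ex_hw_limits {
	size_t buffer_bytes_max;
	size_t period_bytes_min;
	size_t period_bytes_max;
	unsigned int periods_min;
	unsigned int periods_max;
};

/* Build the limits as proposed, for a given maximum page size. */
static struct ex_hw_limits ex_limits_for(size_t page_size_max)
{
	struct ex_hw_limits l = {
		.buffer_bytes_max = 128 * 1024,
		.period_bytes_min = page_size_max,
		.period_bytes_max = page_size_max * 2,
		.periods_min = 2,
		.periods_max = 128,
	};

	return l;
}

/* True if periods_min periods of period_bytes fit in the largest buffer. */
static bool ex_min_periods_fit(const struct ex_hw_limits *l, size_t period_bytes)
{
	return (size_t)l->periods_min * period_bytes <= l->buffer_bytes_max;
}
```

With a 64K maximum page size, two periods of the minimum size (2 x 64K) just
fit in the 128K buffer, while two periods of the maximum size (2 x 128K) do
not. With 4K pages both cases fit, and absolute numbers sidestep the question
entirely.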
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 18/57] trace: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 18/57] trace: " Ryan Roberts
@ 2024-10-14 16:46 ` Steven Rostedt
2024-10-15 11:09 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Steven Rostedt @ 2024-10-14 16:46 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Masami Hiramatsu, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-kernel,
linux-mm, linux-trace-kernel
On Mon, 14 Oct 2024 11:58:25 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Convert BUILD_BUG_ON() to BUG_ON() since the argument depends on PAGE_SIZE
> and it's not trivial to test against a page size limit.
>
> Redefine FTRACE_KSTACK_ENTRIES so that "struct ftrace_stacks" is always
> sized at 32K for 64-bit and 16K for 32-bit. It was previously defined in
> terms of PAGE_SIZE (and worked out at the quoted sizes for a 4K page
> size). But for 64K pages, the size expanded to 512K. Given the ftrace
> stacks should be invariant to page size, this seemed like a waste. As a
> side effect, it removes the PAGE_SIZE compile-time constant assumption
> from this code.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> kernel/trace/fgraph.c | 2 +-
> kernel/trace/trace.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index d7d4fb403f6f0..47aa5c8d8090e 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -534,7 +534,7 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func,
> if (!current->ret_stack)
> return -EBUSY;
>
> - BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
> + BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
Absolutely not!
BUG_ON() is in no way a substitute for any BUILD_BUG_ON(). BUILD_BUG_ON()
is a non-intrusive way to see if something isn't lined up correctly, so you
can fix it before you execute any code. BUG_ON() is the most intrusive way
to say something is wrong and you crash the system.
Not to mention, when function graph tracing is enabled, this gets triggered
for *every* function call! So I do not want any runtime test done. Every
nanosecond counts in this code path.
If anything, this needs to be moved to initialization and checked once; if
it fails, it should give a WARN_ON() and disable function graph tracing.
-- Steve
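
The check-once-at-init pattern suggested above might look roughly like the
following userspace sketch. All names are hypothetical, and fprintf() stands
in for WARN_ON(); this is not the fgraph code, only an illustration of
validating the invariant once and leaving the per-call path with nothing more
than a flag test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Set once at init; read by the hot path. */
static bool ex_tracing_enabled;
static unsigned long ex_depth;

/*
 * Hypothetical init hook: validate the stack-size invariant a single time.
 * On failure, warn and leave tracing disabled rather than crashing.
 */
static bool ex_tracer_init(unsigned long stack_size)
{
	if (stack_size % sizeof(long)) {
		fprintf(stderr, "ex_tracer: stack size %lu is not long-aligned, tracing disabled\n",
			stack_size);
		ex_tracing_enabled = false;
		return false;
	}

	ex_tracing_enabled = true;
	return true;
}

/* Hypothetical hot path: no size check here, only the enabled flag. */
static int ex_push_return_trace(unsigned long ret)
{
	if (!ex_tracing_enabled)
		return -1;

	(void)ret;
	ex_depth++;
	return 0;
}
```

A failed check here costs one warning at startup and a disabled feature,
whereas the per-call path does no extra arithmetic at all.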
>
> /* Set val to "reserved" with the delta to the new fgraph frame */
> val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index c3b2c7dfadef1..0f2ec3d30579f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2887,7 +2887,7 @@ trace_function(struct trace_array *tr, unsigned long ip, unsigned long
> /* Allow 4 levels of nesting: normal, softirq, irq, NMI */
> #define FTRACE_KSTACK_NESTING 4
>
> -#define FTRACE_KSTACK_ENTRIES (PAGE_SIZE / FTRACE_KSTACK_NESTING)
> +#define FTRACE_KSTACK_ENTRIES (SZ_4K / FTRACE_KSTACK_NESTING)
>
> struct ftrace_stack {
> unsigned long calls[FTRACE_KSTACK_ENTRIES];
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX
2024-10-14 10:58 ` [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX Ryan Roberts
@ 2024-10-14 16:50 ` Christoph Lameter (Ampere)
2024-10-15 10:53 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-14 16:50 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Arnd Bergmann,
Catalin Marinas, David Hildenbrand, Dennis Zhou, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Tejun Heo, Will Deacon,
linux-arch, linux-arm-kernel, linux-kernel, linux-mm
On Mon, 14 Oct 2024, Ryan Roberts wrote:
> Increase alignment of structures requiring at least PAGE_SIZE alignment
> to PAGE_SIZE_MAX. For compile-time PAGE_SIZE, PAGE_SIZE_MAX == PAGE_SIZE
> so there is no change. For boot-time PAGE_SIZE, PAGE_SIZE_MAX is the
> largest selectable page size.
Can you verify that this works with the arch specific portions? This may
also allow us to reduce some of the arch dependent stuff.
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
@ 2024-10-14 17:32 ` Florian Fainelli
2024-10-15 11:48 ` Ryan Roberts
2024-10-15 18:38 ` Michael Kelley
` (6 subsequent siblings)
8 siblings, 1 reply; 196+ messages in thread
From: Florian Fainelli @ 2024-10-14 17:32 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 03:55, Ryan Roberts wrote:
> Hi All,
>
> Patch bomb incoming... This covers many subsystems, so I've included a core set
> of people on the full series and additionally included maintainers on relevant
> patches. I haven't included those maintainers on this cover letter since the
> numbers were far too big for it to work. But I've included a link to this cover
> letter on each patch, so they can hopefully find their way here. For follow up
> submissions I'll break it up by subsystem, but for now thought it was important
> to show the full picture.
>
> This RFC series implements support for boot-time page size selection within the
> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
> size has been selected at compile-time, meaning the size is baked into a given
> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
> starts to present a problem for distributions. Boot-time page size selection
> enables the creation of a single kernel image, which can be told which page size
> to use on the kernel command line.
>
> Why is having an image-per-page size problematic?
> =================================================
>
> Many traditional distros are now supporting both 4K and 64K. And this means
> managing 2 kernel packages, along with drivers for each. For some, it means
> multiple installer flavours and multiple ISOs. All of this adds up to a
> less-than-ideal level of complexity. Additionally, Android now supports 4K and
> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
> painful, and the extra flash space required for both kernel images and the
> duplicated modules has been problematic. Boot-time page size selection solves
> all of this.
>
> Additionally, in starting to think about the longer term deployment story for
> D128 page tables, which Arm architecture now supports, a lot of the same
> problems need to be solved, so this work sets us up nicely for that.
>
> So what's the down side?
> ========================
>
> Well nothing's free; Various static allocations in the kernel image must be
> sized for the worst case (largest supported page size), so image size is in line
> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
> is a slight increase to the image size. But I expect that problem goes away if
> you're compressing the image - it's just some extra zeros. At boot-time, I expect
> we could free the unused static storage once we know the page size - although
> that would be a follow up enhancement.
>
> And then there is performance. Since PAGE_SIZE and friends are no longer
> compile-time constants, we must look up their values and do arithmetic at
> runtime instead of compile-time. My early perf testing suggests this is
> imperceptible for real-world workloads, and only has a small impact on
> microbenchmarks - more on this below.
>
> Approach
> ========
>
> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> friends are compile-time constant, but in a way that allows the compiler to
> perform the same optimizations as was previously being done if they do turn out
> to be compile-time constant. Where constants are required, we use limits;
> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
> of all the classes of problems to solve.
>
> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
> which is an alternative to selecting a compile-time page size.
>
> When boot-time page size is active, the arch pgtable geometry macro definitions
> resolve to something that can be configured at boot. The arm64 implementation in
> this series mainly uses global, __ro_after_init variables. I've tried using
> alternatives patching, but that performs worse than loading from memory; I think
> due to code size bloat.
FWIW, this paragraph was not entirely clear to me until I looked at
patch 57 to see that the compile time page size selection had been
retained, and could continue to be used as-is. It was somewhat implicit,
but not IMHO explicit enough, not a big deal though.
Great work, thanks for doing that! This makes me wonder if we could
leverage any of that to have a single kernel supporting both LPAE and
!LPAE on ARM 32-bit, but that still seems somewhat more difficult,
largely due to the difference in the page table descriptor format (long
vs. short).
--
Florian
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
2024-10-14 10:58 ` [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large Ryan Roberts
2024-10-14 13:00 ` Johannes Weiner
@ 2024-10-14 19:59 ` Shakeel Butt
2024-10-15 10:55 ` Ryan Roberts
2024-10-17 16:09 ` Roman Gushchin
2 siblings, 1 reply; 196+ messages in thread
From: Shakeel Butt @ 2024-10-14 19:59 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Johannes Weiner,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Michal Hocko, Miroslav Benes, Roman Gushchin, Will Deacon,
cgroups, linux-arm-kernel, linux-kernel, linux-mm
On Mon, Oct 14, 2024 at 11:58:10AM GMT, Ryan Roberts wrote:
> Previously the seq_buf used for accumulating the memory.stat output was
> sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
> If 4K is enough on a 4K page system, then it should also be enough on a
> 64K page system, so we can save 60K on the static buffer used in
> mem_cgroup_print_oom_meminfo(). Let's make it so.
>
> This also has the beneficial side effect of removing a place in the code
> that assumed PAGE_SIZE is a compile-time constant. So this helps our
> quest towards supporting boot-time page size selection.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 17/57] kvm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 17/57] kvm: " Ryan Roberts
@ 2024-10-14 21:37 ` Sean Christopherson
2024-10-15 10:57 ` Ryan Roberts
2024-10-16 14:41 ` Ryan Roberts
1 sibling, 1 reply; 196+ messages in thread
From: Sean Christopherson @ 2024-10-14 21:37 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, kvm, linux-arm-kernel, linux-kernel, linux-mm
Nit, "KVM:" for the scope.
On Mon, Oct 14, 2024, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Modify BUILD_BUG_ON() to compare with page size limit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
The patch should still stand on its own. Most people can probably suss out what
PAGE_SIZE_MIN is, but at the same time, it's quite easy to provide a more verbose
changelog that's tailored to the actual patch. E.g.
To prepare for supporting boot-time page size selection, refactor KVM's
check on the size of the kvm_run structure to assert that the size is less
than the smallest possible page size, i.e. that kvm_run won't overflow its
page regardless of what page size is chosen at boot time.
With something like the above,
Reviewed-by: Sean Christopherson <seanjc@google.com>
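
To make the suggested changelog concrete, here is a minimal sketch of a
compile-time check against the smallest selectable page size. The names and
sizes are hypothetical (ex_run is not kvm_run, and the EX_ limits only stand
in for PAGE_SIZE_MIN and PAGE_SIZE_MAX), and it assumes the check takes the
form "structure size must not exceed the page size".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PAGE_SIZE_MIN / PAGE_SIZE_MAX. */
#define EX_PAGE_SIZE_MIN 4096UL
#define EX_PAGE_SIZE_MAX 65536UL

/* Hypothetical structure that must live entirely within one page. */
struct ex_run {
	unsigned char payload[2048];
};

/*
 * The page size is unknown at build time, so compare against the smallest
 * selectable size: if the struct fits there, it fits for any boot choice.
 */
_Static_assert(sizeof(struct ex_run) <= EX_PAGE_SIZE_MIN,
	       "ex_run must fit in the smallest possible page");

/* Run-time equivalent of the same comparison, for illustration only. */
static int ex_fits_in_page(size_t obj_size, unsigned long page_size)
{
	return obj_size <= page_size;
}
```

Because the bound is checked against the minimum, the result holds for every
page size up to the maximum without any run-time test.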
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
2024-10-14 14:07 ` Ryan Roberts
@ 2024-10-15 3:04 ` Pingfan Liu
2024-10-15 11:16 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Pingfan Liu @ 2024-10-15 3:04 UTC (permalink / raw)
To: Ryan Roberts
Cc: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86, linux-alpha,
linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
linux-kernel, linux-m68k, linux-mips, linux-mm, linux-openrisc,
linux-parisc, linux-riscv, linux-s390, linux-sh, linux-snps-arc,
linux-um, linuxppc-dev, loongarch, sparclinux
On Mon, Oct 14, 2024 at 10:07 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 14/10/2024 14:54, Pingfan Liu wrote:
> > Hello Ryan,
> >
> > On Mon, Oct 14, 2024 at 11:58:08AM +0100, Ryan Roberts wrote:
> >> arm64 can support multiple base page sizes. Instead of selecting a page
> >> size at compile time, as is done today, we will make it possible to
> >> select the desired page size on the command line.
> >>
> >> In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
> >> (as well as a number of other macros related to or derived from
> >> PAGE_SHIFT, but I'm not worrying about those yet), are no longer
> >> compile-time constants. So the code base needs to cope with that.
> >>
> >> As a first step, introduce MIN and MAX variants of these macros, which
> >> express the range of possible page sizes. These are always compile-time
> >> constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
> >> were previously used where a compile-time constant is required.
> >> (Subsequent patches will do that conversion work). When the arch/build
> >> doesn't support boot-time page size selection, the MIN and MAX variants
> >> are equal and everything resolves as it did previously.
> >>
> >
> > MIN and MAX appear to construct a boundary, but it may not be enough.
> > Please see the following comment inline.
> >
> >> Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
> >> global variable definitions so that for boot-time page size selection
> >> builds, the variable being wrapped is initialized at boot-time, instead
> >> of compile-time. This is done by defining a function to do the
> >> assignment, which has the "constructor" attribute. Constructor is
> >> preferred over initcall, because when compiling a module, the module is
> >> limited to a single initcall but constructors are unlimited. For
> >> built-in code, constructors are now called earlier to guarantee that
> >> the variables are initialized by the time they are used. Any arch that
> >> wants to enable boot-time page size selection will need to select
> >> CONFIG_CONSTRUCTORS.
> >>
> >> These new macros need to be available anywhere PAGE_SHIFT and friends
> >> are available. Those are defined via asm/page.h (although some arches
> >> have a sub-include that defines them). Unfortunately there is no
> >> reliable asm-generic header we can easily piggy-back on, so let's define
> >> a new one, pgtable-geometry.h, which we include near where each arch
> >> defines PAGE_SHIFT. Ugh.
> >>
> >> -------
> >>
> >> Most of the problems that need to be solved over the next few patches
> >> fall into these broad categories, which are all solved with the help of
> >> these new macros:
> >>
> >> 1. Assignment of values derived from PAGE_SIZE in global variables
> >>
> >> For boot-time page size builds, we must defer the initialization of
> >> these variables until boot-time, when the page size is known. See
> >> DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
> >>
> >> 2. Define static storage in units related to PAGE_SIZE
> >>
> >> This static storage will be defined according to PAGE_SIZE_MAX.
> >>
> >> 3. Define size of struct so that it is related to PAGE_SIZE
> >>
> >> The struct often contains an array that is sized to fill the page. In
> >> this case, use a flexible array with dynamic allocation. In other
> >> cases, the struct fits exactly over a page, which is a header (e.g.
> >> swap file header). In this case, remove the padding, and manually
> >> determine the struct pointer within the page.
> >>
> >
> > About two years ago, I tried to do something similar to your series, but ran
> > into a problem at this point, or maybe not exactly at the point you list
> > here. I consider this the most challenging part.
> >
> > The scenario is
> > struct X {
> > a[size_a];
> > b[size_b];
> > c;
> > };
> >
> > Where size_a = f(PAGE_SHIFT), size_b=g(PAGE_SHIFT). One of f() and g()
> > is proportional to PAGE_SHIFT, the other is inversely proportional.
> >
> > How can you fix the reference of X.a and X.b?
>
> If you need to allocate static memory, then in this scenario, assuming f() is
> proportional and g() is inversely-proportional, then I guess you need
> size_a=f(PAGE_SIZE_MAX) and size_b=g(PAGE_SIZE_MIN). Or if you can allocate the
My point is that such stuff can not be handled by scripts
automatically and needs manual intervention.
> memory dynamically, then make a and b pointers to dynamically allocated buffers.
>
This seems a better way out.
> Is there a specific place in the source where this pattern is used today? It
> might be easier to discuss in the context of the code if so.
>
No such code at hand. I just wanted to raise the potential issue, which
frustrated me before, and I am curious how it can be handled.
I hope people can reach an agreement on it and turn this useful series
into reality.
Thanks,
Pingfan
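The two options discussed above can be sketched as follows. This is only an
illustration, not code from the series: the page shift limits (12 and 16), the
sizing macros F() and G(), and both struct names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical limits for a build that may boot with 4K..64K pages. */
#define PAGE_SHIFT_MIN 12
#define PAGE_SHIFT_MAX 16

/* Hypothetical sizing rules: F() grows with the page shift, G() shrinks. */
#define F(shift) ((size_t)1 << ((shift) - 10))
#define G(shift) ((size_t)1 << (20 - (shift)))

/* Option 1: static storage, each array sized for its own worst case. */
struct x_static {
	unsigned long a[F(PAGE_SHIFT_MAX)];	/* largest when pages are largest */
	unsigned long b[G(PAGE_SHIFT_MIN)];	/* largest when pages are smallest */
	int c;
};

/* Option 2: pointers to buffers sized once the page shift is known. */
struct x_dynamic {
	unsigned long *a;
	unsigned long *b;
	size_t nr_a;
	size_t nr_b;
	int c;
};

/* Allocate both arrays for the page shift selected at boot. */
static int x_dynamic_init(struct x_dynamic *x, unsigned int page_shift)
{
	if (page_shift < PAGE_SHIFT_MIN || page_shift > PAGE_SHIFT_MAX)
		return -1;

	x->nr_a = F(page_shift);
	x->nr_b = G(page_shift);
	x->a = calloc(x->nr_a, sizeof(*x->a));
	x->b = calloc(x->nr_b, sizeof(*x->b));
	if (!x->a || !x->b) {
		free(x->a);
		free(x->b);
		x->a = NULL;
		x->b = NULL;
		return -1;
	}
	x->c = 0;
	return 0;
}

/* Release the buffers allocated by x_dynamic_init(). */
static void x_dynamic_free(struct x_dynamic *x)
{
	free(x->a);
	free(x->b);
	x->a = NULL;
	x->b = NULL;
}
```

With static storage every access to X.a and X.b keeps working unchanged, at the
cost of worst-case memory for both arrays. With the dynamic variant the
references become pointer dereferences and the element counts have to be
carried alongside them.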
* Re: [RFC PATCH v1 19/57] crash: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 19/57] crash: " Ryan Roberts
@ 2024-10-15 3:47 ` Baoquan He
2024-10-15 11:13 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Baoquan He @ 2024-10-15 3:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, kexec, linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 at 11:58am, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Updated BUILD_BUG_ON() to test against limit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> kernel/crash_core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 63cf89393c6eb..978c600a47ac8 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -465,7 +465,7 @@ static int __init crash_notes_memory_init(void)
> * Break compile if size is bigger than PAGE_SIZE since crash_notes
> * definitely will be in 2 pages with that.
> */
> - BUILD_BUG_ON(size > PAGE_SIZE);
> + BUILD_BUG_ON(size > PAGE_SIZE_MIN);
This should be OK. One thing that could happen, though: if the selected size
is 64K and PAGE_SIZE_MIN is 4K, this will issue a false-positive warning when
compiling, while actually it's not a problem at run time. Not sure if
that could happen on arm64. Anyway, we can check the crash_notes to see
why it's so big if that really happens. So,
Acked-by: Baoquan He <bhe@redhat.com>
>
> crash_notes = __alloc_percpu(size, align);
> if (!crash_notes) {
> --
> 2.43.0
>
>
* Re: [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX
2024-10-14 16:50 ` Christoph Lameter (Ampere)
@ 2024-10-15 10:53 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 10:53 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Arnd Bergmann,
Catalin Marinas, David Hildenbrand, Dennis Zhou, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Tejun Heo, Will Deacon,
linux-arch, linux-arm-kernel, linux-kernel, linux-mm
On 14/10/2024 17:50, Christoph Lameter (Ampere) wrote:
> On Mon, 14 Oct 2024, Ryan Roberts wrote:
>
>> Increase alignment of structures requiring at least PAGE_SIZE alignment
>> to PAGE_SIZE_MAX. For compile-time PAGE_SIZE, PAGE_SIZE_MAX == PAGE_SIZE
>> so there is no change. For boot-time PAGE_SIZE, PAGE_SIZE_MAX is the
>> largest selectable page size.
>
> Can you verify that this works with the arch specific portions? This may
> also allow us to reduce some of the arch dependent stuff.
Sorry, Christoph, I'm not exactly sure what you mean here by "arch specific
portions" and "reduce some of the arch dependent stuff"? Could you elaborate?
I can certainly verify that this change works for all the test scenarios I've
listed on the cover letter.
Thanks,
Ryan
* Re: [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
2024-10-14 19:59 ` Shakeel Butt
@ 2024-10-15 10:55 ` Ryan Roberts
2024-10-17 12:21 ` Michal Hocko
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 10:55 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Johannes Weiner,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Michal Hocko, Miroslav Benes, Roman Gushchin, Will Deacon,
cgroups, linux-arm-kernel, linux-kernel, linux-mm
On 14/10/2024 20:59, Shakeel Butt wrote:
> On Mon, Oct 14, 2024 at 11:58:10AM GMT, Ryan Roberts wrote:
>> Previously the seq_buf used for accumulating the memory.stat output was
>> sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
>> If 4K is enough on a 4K page system, then it should also be enough on a
>> 64K page system, so we can save 60K on the static buffer used in
>> mem_cgroup_print_oom_meminfo(). Let's make it so.
>>
>> This also has the beneficial side effect of removing a place in the code
>> that assumed PAGE_SIZE is a compile-time constant. So this helps our
>> quest towards supporting boot-time page size selection.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Thanks Shakeel and Johannes, for the acks. Given this patch is totally
independent, I'll plan to resubmit it on its own and hopefully we can get it in
independently of the rest of the series.
Thanks,
Ryan
* Re: [RFC PATCH v1 17/57] kvm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 21:37 ` Sean Christopherson
@ 2024-10-15 10:57 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 10:57 UTC (permalink / raw)
To: Sean Christopherson
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, kvm, linux-arm-kernel, linux-kernel, linux-mm
On 14/10/2024 22:37, Sean Christopherson wrote:
> Nit, "KVM:" for the scope.
Thanks, will fix.
>
> On Mon, Oct 14, 2024, Ryan Roberts wrote:
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>>
>> Modify BUILD_BUG_ON() to compare with page size limit.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> The patch should still stand on its own. Most people can probably suss out what
> PAGE_SIZE_MIN is, but at the same time, it's quite easy to provide a more verbose
> changelog that's tailored to the actual patch. E.g.
>
> To prepare for supporting boot-time page size selection, refactor KVM's
> check on the size of the kvm_run structure to assert that the size is less
> than the smallest possible page size, i.e. that kvm_run won't overflow its
> page regardless of what page size is chosen at boot time.
>
> With something like the above,
>
> Reviewed-by: Sean Christopherson <seanjc@google.com>
Thanks! I'll update this for the next version.
* Re: [RFC PATCH v1 18/57] trace: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 16:46 ` Steven Rostedt
@ 2024-10-15 11:09 ` Ryan Roberts
2024-10-18 15:24 ` Steven Rostedt
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 11:09 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Masami Hiramatsu, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-kernel,
linux-mm, linux-trace-kernel
On 14/10/2024 17:46, Steven Rostedt wrote:
> On Mon, 14 Oct 2024 11:58:25 +0100
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>>
>> Convert BUILD_BUG_ON() to BUG_ON() since the argument depends on PAGE_SIZE
>> and it's not trivial to test against a page size limit.
>>
>> Redefine FTRACE_KSTACK_ENTRIES so that "struct ftrace_stacks" is always
>> sized at 32K for 64-bit and 16K for 32-bit. It was previously defined in
>> terms of PAGE_SIZE (and worked out at the quoted sizes for a 4K page
>> size). But for 64K pages, the size expanded to 512K. Given the ftrace
>> stacks should be invariant to page size, this seemed like a waste. As a
>> side effect, it removes the PAGE_SIZE compile-time constant assumption
>> from this code.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> kernel/trace/fgraph.c | 2 +-
>> kernel/trace/trace.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
>> index d7d4fb403f6f0..47aa5c8d8090e 100644
>> --- a/kernel/trace/fgraph.c
>> +++ b/kernel/trace/fgraph.c
>> @@ -534,7 +534,7 @@ ftrace_push_return_trace(unsigned long ret, unsigned long func,
>> if (!current->ret_stack)
>> return -EBUSY;
>>
>> - BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
>> + BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
>
> Absolutely not!
>
> BUG_ON() is in no way a substitute for any BUILD_BUG_ON(). BUILD_BUG_ON()
> is a non-intrusive way to see if something isn't lined up correctly, and
> can fix it before you execute any code. BUG_ON() is the most intrusive way
> to say something is wrong and you crash the system.
Yep, totally agree. I'm afraid this was me being lazy, and there are a couple of
other instances where I have done this in other patches that I'll need to fix.
Most of the time, I've been able to keep BUILD_BUG_ON() and simply compare
against a page size limit.
Looking at this again, perhaps the better solution is to define
SHADOW_STACK_SIZE as PAGE_SIZE_MIN? Then it remains a compile-time constant. Is
there any need for SHADOW_STACK_SIZE to increase with page size?
>
> Not to mention, when function graph tracing is enabled, this gets triggered
> for *every* function call! So I do not want any runtime test done. Every
> nanosecond counts in this code path.
>
> If anything, this needs to be moved to initialization and checked once, if
> it fails, gives a WARN_ON() and disables function graph tracing.
I'm hoping my suggestion above to decouple SHADOW_STACK_SIZE from PAGE_SIZE is
acceptable and simpler? If not, happy to do as you suggest here.
Thanks,
Ryan
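A minimal sketch of the two options on the table, written as userspace C.
PAGE_SIZE_MIN is assumed to be 4K here, and shadow_stack_init() and
fgraph_enabled are hypothetical names, not taken from kernel/trace/fgraph.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed smallest selectable page size; always a compile-time constant. */
#define PAGE_SIZE_MIN 4096UL

/* Option 1: decouple the shadow stack size from the boot-time page size. */
#define SHADOW_STACK_SIZE PAGE_SIZE_MIN

/* The check stays at build time, so the hot path pays nothing for it. */
static_assert(SHADOW_STACK_SIZE % sizeof(long) == 0,
	      "shadow stack size must be a multiple of sizeof(long)");

/*
 * Option 2: if the size has to follow the runtime page size, check it once
 * during initialization and disable the feature on failure rather than
 * crashing the system.
 */
static bool fgraph_enabled;

static bool shadow_stack_init(size_t runtime_stack_size)
{
	fgraph_enabled = (runtime_stack_size % sizeof(long)) == 0;
	return fgraph_enabled;
}
```

Option 1 keeps the existing build-time guarantee and adds no code to the
per-function-call path. Option 2 costs one check at init time and needs a
clean way to turn tracing off when the check fails.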
>
> -- Steve
>
>
>>
>> /* Set val to "reserved" with the delta to the new fgraph frame */
>> val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index c3b2c7dfadef1..0f2ec3d30579f 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -2887,7 +2887,7 @@ trace_function(struct trace_array *tr, unsigned long ip, unsigned long
>> /* Allow 4 levels of nesting: normal, softirq, irq, NMI */
>> #define FTRACE_KSTACK_NESTING 4
>>
>> -#define FTRACE_KSTACK_ENTRIES (PAGE_SIZE / FTRACE_KSTACK_NESTING)
>> +#define FTRACE_KSTACK_ENTRIES (SZ_4K / FTRACE_KSTACK_NESTING)
>>
>> struct ftrace_stack {
>> unsigned long calls[FTRACE_KSTACK_ENTRIES];
>
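The sizes quoted in the commit log can be reproduced with a little arithmetic.
This sketch assumes one stack of FTRACE_KSTACK_ENTRIES words per nesting level;
the helper name ftrace_stacks_bytes() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FTRACE_KSTACK_NESTING 4

/*
 * Total bytes used by all nesting levels.
 * base_size: the value the entry count is derived from (PAGE_SIZE before the
 *            patch, SZ_4K after it).
 * word_size: sizeof(unsigned long) on the target (8 for 64-bit, 4 for 32-bit).
 */
static size_t ftrace_stacks_bytes(size_t base_size, size_t word_size)
{
	size_t entries = base_size / FTRACE_KSTACK_NESTING;

	return entries * word_size * FTRACE_KSTACK_NESTING;
}
```

With a 4K base this gives 32K on 64-bit and 16K on 32-bit. With a 64K base on
64-bit it gives 512K, which is the growth the patch avoids.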
* Re: [RFC PATCH v1 19/57] crash: Remove PAGE_SIZE compile-time constant assumption
2024-10-15 3:47 ` Baoquan He
@ 2024-10-15 11:13 ` Ryan Roberts
2024-10-18 3:00 ` Baoquan He
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 11:13 UTC (permalink / raw)
To: Baoquan He
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, kexec, linux-arm-kernel, linux-kernel, linux-mm
On 15/10/2024 04:47, Baoquan He wrote:
> On 10/14/24 at 11:58am, Ryan Roberts wrote:
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>>
>> Updated BUILD_BUG_ON() to test against limit.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> kernel/crash_core.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 63cf89393c6eb..978c600a47ac8 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -465,7 +465,7 @@ static int __init crash_notes_memory_init(void)
>> * Break compile if size is bigger than PAGE_SIZE since crash_notes
>> * definitely will be in 2 pages with that.
>> */
>> - BUILD_BUG_ON(size > PAGE_SIZE);
>> + BUILD_BUG_ON(size > PAGE_SIZE_MIN);
>
> This should be OK. One thing that could happen, though: if the selected size
> is 64K and PAGE_SIZE_MIN is 4K, this will issue a false-positive warning when
> compiling, while actually it's not a problem at run time.
PAGE_SIZE can only ever be bigger than PAGE_SIZE_MIN if compiling a "boot-time
page size" build. And in this case, you need to know that size is small enough
to work with any of the boot-time selectable page sizes. Since size
(=sizeof(note_buf_t)) is invariant to PAGE_SIZE, we can do this by checking
against PAGE_SIZE_MIN.
So I don't think this could ever lead to a false-positive.
> Not sure if
> that could happen on arm64. Anyway, we can check the crash_notes to see
> why it's so big if that really happens. So,
>
> Acked-by: Baoquan He <bhe@redhat.com>
Thanks!
>
>>
>> crash_notes = __alloc_percpu(size, align);
>> if (!crash_notes) {
>> --
>> 2.43.0
>>
>>
>
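The argument above for why the check cannot produce a false positive can be
sketched with a build-time assertion. The page size limits and struct
note_buf_example are hypothetical stand-ins; in the kernel the checked size
derives from sizeof(note_buf_t).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed limits for a boot-time page size build (4K..64K). */
#define PAGE_SIZE_MIN 4096UL
#define PAGE_SIZE_MAX 65536UL

/* Stand-in for a structure whose size does not depend on the page size. */
struct note_buf_example {
	unsigned int words[256];
};

/* Build-time check against the smallest page the kernel may boot with. */
static_assert(sizeof(struct note_buf_example) <= PAGE_SIZE_MIN,
	      "object must fit in the smallest selectable page");

/* Runtime view of the same property, for any page size chosen at boot. */
static bool fits_in_page(size_t size, size_t page_size)
{
	return size <= page_size;
}
```

Because every selectable page size is at least PAGE_SIZE_MIN, passing the
build-time check implies the object fits for each of them. For a compile-time
page size build, PAGE_SIZE_MIN equals PAGE_SIZE and the check is unchanged.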
* Re: [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
2024-10-15 3:04 ` Pingfan Liu
@ 2024-10-15 11:16 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 11:16 UTC (permalink / raw)
To: Pingfan Liu
Cc: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86, linux-alpha,
linux-arch, linux-arm-kernel, linux-csky, linux-hexagon,
linux-kernel, linux-m68k, linux-mips, linux-mm, linux-openrisc,
linux-parisc, linux-riscv, linux-s390, linux-sh, linux-snps-arc,
linux-um, linuxppc-dev, loongarch, sparclinux
On 15/10/2024 04:04, Pingfan Liu wrote:
> On Mon, Oct 14, 2024 at 10:07 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 14/10/2024 14:54, Pingfan Liu wrote:
>>> Hello Ryan,
>>>
>>> On Mon, Oct 14, 2024 at 11:58:08AM +0100, Ryan Roberts wrote:
>>>> arm64 can support multiple base page sizes. Instead of selecting a page
>>>> size at compile time, as is done today, we will make it possible to
>>>> select the desired page size on the command line.
>>>>
>>>> In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
>>>> (as well as a number of other macros related to or derived from
>>>> PAGE_SHIFT, but I'm not worrying about those yet), are no longer
>>>> compile-time constants. So the code base needs to cope with that.
>>>>
>>>> As a first step, introduce MIN and MAX variants of these macros, which
>>>> express the range of possible page sizes. These are always compile-time
>>>> constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
>>>> were previously used where a compile-time constant is required.
>>>> (Subsequent patches will do that conversion work). When the arch/build
>>>> doesn't support boot-time page size selection, the MIN and MAX variants
>>>> are equal and everything resolves as it did previously.
>>>>
>>>
>>> MIN and MAX appear to construct a boundary, but it may not be enough.
>>> Please see the following comment inline.
>>>
>>>> Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
>>>> global variable definitions so that for boot-time page size selection
>>>> builds, the variable being wrapped is initialized at boot-time, instead
>>>> of compile-time. This is done by defining a function to do the
>>>> assignment, which has the "constructor" attribute. Constructor is
>>>> preferred over initcall, because when compiling a module, the module is
>>>> limited to a single initcall but constructors are unlimited. For
>>>> built-in code, constructors are now called earlier to guarantee that
>>>> the variables are initialized by the time they are used. Any arch that
>>>> wants to enable boot-time page size selection will need to select
>>>> CONFIG_CONSTRUCTORS.
>>>>
>>>> These new macros need to be available anywhere PAGE_SHIFT and friends
>>>> are available. Those are defined via asm/page.h (although some arches
>>>> have a sub-include that defines them). Unfortunately there is no
>>>> reliable asm-generic header we can easily piggy-back on, so let's define
>>>> a new one, pgtable-geometry.h, which we include near where each arch
>>>> defines PAGE_SHIFT. Ugh.
>>>>
>>>> -------
>>>>
>>>> Most of the problems that need to be solved over the next few patches
>>>> fall into these broad categories, which are all solved with the help of
>>>> these new macros:
>>>>
>>>> 1. Assignment of values derived from PAGE_SIZE in global variables
>>>>
>>>> For boot-time page size builds, we must defer the initialization of
>>>> these variables until boot-time, when the page size is known. See
>>>> DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
>>>>
>>>> 2. Define static storage in units related to PAGE_SIZE
>>>>
>>>> This static storage will be defined according to PAGE_SIZE_MAX.
>>>>
>>>> 3. Define size of struct so that it is related to PAGE_SIZE
>>>>
>>>> The struct often contains an array that is sized to fill the page. In
>>>> this case, use a flexible array with dynamic allocation. In other
>>>> cases, the struct fits exactly over a page, which is a header (e.g.
>>>> swap file header). In this case, remove the padding, and manually
>>>> determine the struct pointer within the page.
>>>>
>>>
>>> About two years ago, I tried to do something similar to your series, but ran
>>> into a problem at this point, or maybe not exactly at the point you list
>>> here. I consider this the most challenging part.
>>>
>>> The scenario is
>>> struct X {
>>> a[size_a];
>>> b[size_b];
>>> c;
>>> };
>>>
>>> Where size_a = f(PAGE_SHIFT), size_b=g(PAGE_SHIFT). One of f() and g()
>>> is proportional to PAGE_SHIFT, the other is inversely proportional.
>>>
>>> How can you fix the reference of X.a and X.b?
>>
>> If you need to allocate static memory, then in this scenario, assuming f() is
>> proportional and g() is inversely-proportional, then I guess you need
>> size_a=f(PAGE_SIZE_MAX) and size_b=g(PAGE_SIZE_MIN). Or if you can allocate the
>
> My point is that such stuff can not be handled by scripts
> automatically and needs manual intervention.
Yes agreed. I spent some time thinking about how much of this could be automated
(i.e. with Coccinelle or otherwise), but concluded that it's very difficult. As
a result, all of the patches in this series are manually created.
>
>> memory dynamically, then make a and b pointers to dynamically allocated buffers.
>>
>
> This seems a better way out.
>
>> Is there a specific place in the source where this pattern is used today? It
>> might be easier to discuss in the context of the code if so.
>>
>
> No such code at hand. I just wanted to raise the potential issue, which
> frustrated me before, and I am curious how it can be handled.
> I hope people can reach an agreement on it and turn this useful series
> into reality.
Yes, hope so!
>
> Thanks,
>
> Pingfan
>
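The constructor mechanism described in the quoted commit log can be illustrated
in userspace with the GCC/Clang "constructor" function attribute. This is a
sketch of the idea only, not the kernel's DEFINE_GLOBAL_PAGE_SIZE_VAR() macro:
boot_page_shift, runtime_page_size(), nr_slots and the fixed shift of 14 are
all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the page shift chosen at boot; fixed here for illustration. */
static unsigned int boot_page_shift = 14;

/* Not a compile-time constant, so it cannot appear in a static initializer. */
static inline size_t runtime_page_size(void)
{
	return (size_t)1 << boot_page_shift;
}

/* The global is defined without an initializer... */
static size_t nr_slots;

/* ...and assigned by a constructor, which runs before main(). */
__attribute__((constructor))
static void nr_slots_init(void)
{
	nr_slots = runtime_page_size() / sizeof(long);
}
```

A translation unit may define any number of constructors, which matches the
reason given in the commit log for preferring them over initcalls in modules.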
* Re: [RFC PATCH v1 22/57] sound: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 16:01 ` Mark Brown
@ 2024-10-15 11:35 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 11:35 UTC (permalink / raw)
To: Mark Brown
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Jaroslav Kysela,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Takashi Iwai, Will Deacon, linux-arm-kernel,
linux-kernel, linux-mm, linux-sound
On 14/10/2024 17:01, Mark Brown wrote:
> On Mon, Oct 14, 2024 at 01:24:02PM +0100, Ryan Roberts wrote:
>> On 14/10/2024 12:38, Mark Brown wrote:
>>> On Mon, Oct 14, 2024 at 11:58:29AM +0100, Ryan Roberts wrote:
>
>>>> ***NOTE***
>>>> Any confused maintainers may want to read the cover note here for context:
>>>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
>>> As documented in submitting-patches.rst please send patches to the
>>> maintainers for the code you would like to change. The normal kernel
>>> workflow is that people apply patches from their inboxes, if they aren't
>>> copied they are likely to not see the patch at all and it is much more
>>> difficult to apply patches.
>
>> Sure. I think you're implying that you would have liked to be in To: for this
>> patch? I went to quite a lot of trouble to ensure all maintainers were at least
>> in the To: field for patches touching their code. But get_maintainer.pl lists
>> you as a supporter, not a maintainer when I ran this patch through. Could you
>> clarify what would have been the correct thing to do? I could include all
>> reviewers and supporters as well as maintainers but then I'd be banging up
>> against the limits for some of the patches.
>
> The entry in MAINTAINERS for me is a M:, supporter is just the usual
> get_maintainers noise. Supported is exactly equivalent to a maintainer.
Ugh, in my head I always thought "supporter" was somebody who engaged with the
subsystem but did not have an official role (like a football supporter). But now
that I've gone and read the MAINTAINERS file, I see it's actually referring to
status (supported vs maintained). Sorry about this. Due to this buggy filtering,
I've missed a few others off other patches in this series. I'll fix that by
forwarding to them.
> Generally if you're going to filter people you should be filtering less
> specific matches out rather than more and if you're looking to filter
> very aggressively look at who actually commits changes to whatever
> you're trying to change, less specific maintainers will generally
> delegate down to the more specific ones.
>
>>> It's probably better to just use PAGE_SIZE_MAX here and avoid the
>>> deferred patching, like the comment says we don't particularly care what
>>> the value actually is here given that it's a dummy.
>
>> OK, so would that be:
>
>> .buffer_bytes_max = 128*1024,
>> .period_bytes_min = PAGE_SIZE_MAX, <<<<<
>> .period_bytes_max = PAGE_SIZE_MAX*2, <<<<<
>> .periods_min = 2,
>> .periods_max = 128,
>
>> It's not really clear to me how all the parameters interact; the buffer size is
>> 128K, which, if PAGE_SIZE_MAX is 64K, would hold 1 period of the maximum size.
>> But periods_min is 2. So not sure that works? Or perhaps I'm trying to apply too
>> much meaning to the param names...
>
> Like Takashi says just using absolute numbers here is probably just as
> sensible, the numbers are there to stop userspace tripping over itself
> but like I say it shouldn't ever get as far as actually using them for
> anything. So long as we end up with some numbers that don't need any
> late init patching the specifics aren't super important, the use of
> PAGE_SIZE was kind of random.
OK, I'll post a respin of this patch independently of the rest of the series,
given it no longer has a dependency.
Thanks,
Ryan
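The parameter interaction questioned in the quoted text comes down to simple
division. This sketch assumes PAGE_SIZE_MAX is 64K and uses a hypothetical
helper, periods_that_fit(); it says nothing about how ALSA itself validates
these fields.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed largest selectable page size. */
#define PAGE_SIZE_MAX (64 * 1024)

/* Number of whole periods of period_bytes that fit in buffer_bytes. */
static size_t periods_that_fit(size_t buffer_bytes, size_t period_bytes)
{
	return buffer_bytes / period_bytes;
}
```

With buffer_bytes_max at 128K, two periods of PAGE_SIZE_MAX fit, which
satisfies periods_min = 2, but only one period of PAGE_SIZE_MAX * 2 fits.
Absolute numbers, as suggested in the thread, avoid both this question and the
late initialization.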
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 17:32 ` [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Florian Fainelli
@ 2024-10-15 11:48 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-15 11:48 UTC (permalink / raw)
To: Florian Fainelli, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 14/10/2024 18:32, Florian Fainelli wrote:
> On 10/14/24 03:55, Ryan Roberts wrote:
>> Hi All,
>>
>> Patch bomb incoming... This covers many subsystems, so I've included a core set
>> of people on the full series and additionally included maintainers on relevant
>> patches. I haven't included those maintainers on this cover letter since the
>> numbers were far too big for it to work. But I've included a link to this cover
>> letter on each patch, so they can hopefully find their way here. For follow up
>> submissions I'll break it up by subsystem, but for now thought it was important
>> to show the full picture.
>>
>> This RFC series implements support for boot-time page size selection within the
>> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
>> size has been selected at compile-time, meaning the size is baked into a given
>> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
>> starts to present a problem for distributions. Boot-time page size selection
>> enables the creation of a single kernel image, which can be told which page size
>> to use on the kernel command line.
>>
>> Why is having an image-per-page size problematic?
>> =================================================
>>
>> Many traditional distros are now supporting both 4K and 64K. And this means
>> managing 2 kernel packages, along with drivers for each. For some, it means
>> multiple installer flavours and multiple ISOs. All of this adds up to a
>> less-than-ideal level of complexity. Additionally, Android now supports 4K and
>> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
>> painful, and the extra flash space required for both kernel images and the
>> duplicated modules has been problematic. Boot-time page size selection solves
>> all of this.
>>
>> Additionally, in starting to think about the longer term deployment story for
>> D128 page tables, which Arm architecture now supports, a lot of the same
>> problems need to be solved, so this work sets us up nicely for that.
>>
>> So what's the down side?
>> ========================
>>
>> Well nothing's free; Various static allocations in the kernel image must be
>> sized for the worst case (largest supported page size), so image size is in line
>> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
>> is a slight increase to the image size. But I expect that problem goes away if
>> you're compressing the image - it's just some extra zeros. At boot-time, I expect
>> we could free the unused static storage once we know the page size - although
>> that would be a follow up enhancement.
>>
>> And then there is performance. Since PAGE_SIZE and friends are no longer
>> compile-time constants, we must look up their values and do arithmetic at
>> runtime instead of compile-time. My early perf testing suggests this is
>> imperceptible for real-world workloads, and only has a small impact on
>> microbenchmarks - more on this below.
>>
>> Approach
>> ========
>>
>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>> friends are compile-time constant, but in a way that allows the compiler to
>> perform the same optimizations as was previously being done if they do turn out
>> to be compile-time constant. Where constants are required, we use limits;
>> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
>> of all the classes of problems to solve.
>>
>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
>> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
>> which is an alternative to selecting a compile-time page size.
>>
>> When boot-time page size is active, the arch pgtable geometry macro definitions
>> resolve to something that can be configured at boot. The arm64 implementation in
>> this series mainly uses global, __ro_after_init variables. I've tried using
>> alternatives patching, but that performs worse than loading from memory; I think
>> due to code size bloat.
>
> FWIW, this paragraph was not entirely clear to me until I looked at patch 57 to
> see that the compile time page size selection had been retained, and could
> continue to be used as-is. It was somewhat implicit, but not IMHO explicit
> enough, not a big deal though.
I intended to make that bit clear with the above sentence "arm64 does this if
the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig, which is an
alternative to selecting a compile-time page size.", but appreciate there is a
lot going on here.
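As a rough illustration of that opt-in, here is a standalone userspace sketch. It is not the series' actual headers: BOOT_TIME_PAGE_SIZE stands in for the Kconfig option, and page_shift_var, scratch_page, scratch_capacity() and pages_needed() are made-up names. When the option is not defined, MIN, MAX and PAGE_SIZE all collapse to the same compile-time constant, so the compiler can fold the arithmetic as before.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT_4K  12
#define PAGE_SHIFT_64K 16

#ifdef BOOT_TIME_PAGE_SIZE              /* stand-in for the Kconfig option */
#define PAGE_SHIFT_MIN PAGE_SHIFT_4K
#define PAGE_SHIFT_MAX PAGE_SHIFT_64K
static int page_shift_var = PAGE_SHIFT_4K;  /* would be set once in early boot */
#define PAGE_SHIFT     page_shift_var       /* runtime value */
#else
#ifndef CONFIG_PAGE_SHIFT
#define CONFIG_PAGE_SHIFT PAGE_SHIFT_4K     /* assumed default for this sketch */
#endif
#define PAGE_SHIFT_MIN CONFIG_PAGE_SHIFT
#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
#define PAGE_SHIFT     CONFIG_PAGE_SHIFT    /* compile-time constant */
#endif

#define PAGE_SIZE      (1UL << PAGE_SHIFT)
#define PAGE_SIZE_MIN  (1UL << PAGE_SHIFT_MIN)
#define PAGE_SIZE_MAX  (1UL << PAGE_SHIFT_MAX)

/* The limits are always compile-time constants, so they can be checked statically. */
static_assert(PAGE_SIZE_MIN <= PAGE_SIZE_MAX, "page size limits out of order");

/* Static storage cannot depend on a runtime value, so size it for the worst case. */
static unsigned char scratch_page[PAGE_SIZE_MAX];

static size_t scratch_capacity(void)
{
	return sizeof(scratch_page);
}

/* Ordinary arithmetic uses PAGE_SIZE, which may or may not be constant. */
static size_t pages_needed(size_t bytes)
{
	return (bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

Building the same file with -DBOOT_TIME_PAGE_SIZE switches PAGE_SHIFT to a variable and grows scratch_page to the 64K worst case, which mirrors the image size trade-off described in the cover letter.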
>
> Great work, thanks for doing that! This makes me wonder if we could leverage any
> of that to have a single kernel supporting both LPAE and !LPAE on ARM 32-bit,
> but that still seems somewhat more difficult, largely due to the difference
> in the page table descriptor format (long vs. short).
We will eventually have the exact same problem with FEAT_D128 on arm64. This
introduces page tables with 128 bit PTEs. Ideally we would like to support both
in a single image, although, we have much more thinking to do on that. But my
current view is that this series solves a bunch of problems that makes it easier
(PTRS_PER_Pxx and Pxx_SHIFT all become boot-time values, for example, so we can
easily represent the different geometries).
Yes, we still need to solve the PTE size difference (in our case 64-bit vs
128-bit). I have a couple of proposals for how to do that; the "gold-plated"
approach would be to create and use a handle type to represent a PTE/PxD slot in
a table. Then increments/decrements would be enforced via explicit helpers that
know the size, and direct dereferencing would be impossible. When accessing via
helpers we would pass around pte_t/pxd_t values that are the larger size, then
narrow them when writing back.
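To make the handle idea concrete, here is a minimal userspace sketch. It is not code from this series or from [1]: pte_val_t, pte_slot_t, pte_entry_size and the slot_*() helpers are hypothetical names, and a real implementation would need single-copy-atomic accesses rather than memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: a PTE value wide enough for either descriptor format. */
typedef struct { uint64_t lo, hi; } pte_val_t;

/* Hypothetical: opaque handle to one slot in a table; callers never dereference it. */
typedef struct { unsigned char *addr; } pte_slot_t;

/* Entry size in bytes, chosen once at "boot": 8 (64-bit) or 16 (128-bit). */
static size_t pte_entry_size = 8;

/* Handle for entry 'index' of 'table'; the stride is the runtime entry size. */
static pte_slot_t slot_at(void *table, size_t index)
{
	pte_slot_t s = { (unsigned char *)table + index * pte_entry_size };
	return s;
}

/* Increment goes through a helper that knows the entry size. */
static pte_slot_t slot_next(pte_slot_t s)
{
	s.addr += pte_entry_size;
	return s;
}

/* Read widens: for 64-bit entries only 'lo' is filled and 'hi' stays 0. */
static pte_val_t slot_read(pte_slot_t s)
{
	pte_val_t v = { 0, 0 };

	assert(pte_entry_size == 8 || pte_entry_size == 16);
	memcpy(&v, s.addr, pte_entry_size);
	return v;
}

/* Write narrows: for 64-bit entries 'hi' is dropped. */
static void slot_write(pte_slot_t s, pte_val_t v)
{
	assert(pte_entry_size == 8 || pte_entry_size == 16);
	memcpy(s.addr, &v, pte_entry_size);
}
```

Because the handle is a struct rather than a pointer to pte_t, pointer arithmetic and direct dereferences fail to compile, which is the enforcement property described above.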
Anshuman has a series [1] that starts to move in that direction.
If you have any other ideas, it would be good to talk!
[1] https://lore.kernel.org/linux-mm/20240917073117.1531207-1-anshuman.khandual@arm.com/
Thanks,
Ryan
* Re: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-14 10:59 ` [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection Ryan Roberts
@ 2024-10-15 17:42 ` Zi Yan
2024-10-16 8:14 ` Ryan Roberts
2024-10-15 17:52 ` Michael Kelley
1 sibling, 1 reply; 196+ messages in thread
From: Zi Yan @ 2024-10-15 17:42 UTC
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon, kvmarm, linux-arm-kernel, linux-efi,
linux-kernel, linux-mm
On 14 Oct 2024, at 6:59, Ryan Roberts wrote:
> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
> selected instead of a page size. When selected, the resulting kernel's
> page size can be configured at boot via the command line.
>
> For now, boot-time page size kernels are limited to 48-bit VA, since
> more work is required to support LPA2. Additionally MMAP_RND_BITS and
> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
> work could be implemented to be able to configure these at boot time for
> optimal page size-specific values.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
<snip>
>
> @@ -1588,9 +1601,10 @@ config XEN
> # 4K | 27 | 12 | 15 | 10 |
> # 16K | 27 | 14 | 13 | 11 |
> # 64K | 29 | 16 | 13 | 13 |
> +# BOOT| 29 | 16 (max) | 13 | 13 |
> config ARCH_FORCE_MAX_ORDER
> int
> - default "13" if ARM64_64K_PAGES
> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
> default "11" if ARM64_16K_PAGES
> default "10"
> help
So a boot-time page size kernel always has the highest MAX_PAGE_ORDER, which
means the section size increases for 4KB and 16KB page sizes. Is there any
downside to this?
Is there any plan (not in this patchset) to support a boot-time MAX_PAGE_ORDER
to keep the section size the same?
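For reference, the arithmetic behind the table can be sketched as below. This assumes the table columns are SECTION_SIZE_BITS, PAGE_SHIFT, the upper limit on MAX_PAGE_ORDER and the default MAX_PAGE_ORDER, with the limit being SECTION_SIZE_BITS - PAGE_SHIFT (which matches every row shown). The helper names are made up for illustration.

```c
#include <assert.h>

/* Upper limit on MAX_PAGE_ORDER: a max-order block must fit within one section. */
static int max_order_limit(int section_size_bits, int page_shift)
{
	return section_size_bits - page_shift;
}

/* Size in bytes of one sparsemem section. */
static unsigned long long section_bytes(int section_size_bits)
{
	return 1ULL << section_size_bits;
}

/* Size in bytes of a buddy block of 2^order pages. */
static unsigned long long block_bytes(int page_shift, int order)
{
	return 1ULL << (page_shift + order);
}
```

With these, a 4K compile-time kernel has 128MB sections (2^27) and a 4MB largest block (order 10), while a boot-time kernel running with 4K pages has 512MB sections (2^29) and a 32MB largest block (order 13).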
Best Regards,
Yan, Zi
* RE: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-14 10:59 ` [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection Ryan Roberts
2024-10-15 17:42 ` Zi Yan
@ 2024-10-15 17:52 ` Michael Kelley
2024-10-16 8:17 ` Ryan Roberts
1 sibling, 1 reply; 196+ messages in thread
From: Michael Kelley @ 2024-10-15 17:52 UTC
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Oliver Upton, Will Deacon
Cc: kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
linux-efi@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
From: Ryan Roberts <ryan.roberts@arm.com> Sent: Monday, October 14, 2024 3:59 AM
>
> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
> selected instead of a page size. When selected, the resulting kernel's
> page size can be configured at boot via the command line.
>
> For now, boot-time page size kernels are limited to 48-bit VA, since
> more work is required to support LPA2. Additionally MMAP_RND_BITS and
> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
> work could be implemented to be able to configure these at boot time for
> optimal page size-specific values.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> arch/arm64/Kconfig | 26 ++++++++++---
> arch/arm64/include/asm/kvm_hyp.h | 11 ++++++
> arch/arm64/include/asm/pgtable-geometry.h | 22 ++++++++++-
> arch/arm64/include/asm/pgtable-hwdef.h | 6 +--
> arch/arm64/include/asm/pgtable.h | 10 ++++-
> arch/arm64/include/asm/sparsemem.h | 4 ++
> arch/arm64/kernel/image-vars.h | 11 ++++++
> arch/arm64/kernel/image.h | 4 ++
> arch/arm64/kernel/pi/map_kernel.c | 45 ++++++++++++++++++++++
> arch/arm64/kvm/arm.c | 10 +++++
> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++++++++
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/pgd.c | 10 +++--
> arch/arm64/mm/pgtable-geometry.c | 24 ++++++++++++
> drivers/firmware/efi/libstub/arm64.c | 3 +-
> 16 files changed, 187 insertions(+), 17 deletions(-)
> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca67..573d308741169 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -121,6 +121,7 @@ config ARM64
> select BUILDTIME_TABLE_SORT
> select CLONE_BACKWARDS
> select COMMON_CLK
> + select CONSTRUCTORS if ARM64_BOOT_TIME_PAGE_SIZE
> select CPU_PM if (SUSPEND || CPU_IDLE)
> select CPUMASK_OFFSTACK if NR_CPUS > 256
> select CRC32
> @@ -284,18 +285,20 @@ config MMU
>
> config ARM64_CONT_PTE_SHIFT
> int
> + depends on !ARM64_BOOT_TIME_PAGE_SIZE
> default 5 if PAGE_SIZE_64KB
> default 7 if PAGE_SIZE_16KB
> default 4
>
> config ARM64_CONT_PMD_SHIFT
> int
> + depends on !ARM64_BOOT_TIME_PAGE_SIZE
> default 5 if PAGE_SIZE_64KB
> default 5 if PAGE_SIZE_16KB
> default 4
>
> config ARCH_MMAP_RND_BITS_MIN
> - default 14 if PAGE_SIZE_64KB
> + default 14 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
> default 16 if PAGE_SIZE_16KB
> default 18
>
> @@ -306,15 +309,15 @@ config ARCH_MMAP_RND_BITS_MAX
> default 24 if ARM64_VA_BITS=39
> default 27 if ARM64_VA_BITS=42
> default 30 if ARM64_VA_BITS=47
> - default 29 if ARM64_VA_BITS=48 && ARM64_64K_PAGES
> + default 29 if ARM64_VA_BITS=48 && (ARM64_64K_PAGES ||
> ARM64_BOOT_TIME_PAGE_SIZE)
> default 31 if ARM64_VA_BITS=48 && ARM64_16K_PAGES
> default 33 if ARM64_VA_BITS=48
> - default 14 if ARM64_64K_PAGES
> + default 14 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
> default 16 if ARM64_16K_PAGES
> default 18
>
> config ARCH_MMAP_RND_COMPAT_BITS_MIN
> - default 7 if ARM64_64K_PAGES
> + default 7 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
> default 9 if ARM64_16K_PAGES
> default 11
>
> @@ -362,6 +365,7 @@ config FIX_EARLYCON_MEM
>
> config PGTABLE_LEVELS
> int
> + default 4 if ARM64_BOOT_TIME_PAGE_SIZE # Advertise max supported levels
> default 2 if ARM64_16K_PAGES && ARM64_VA_BITS_36
> default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
> default 3 if ARM64_64K_PAGES && (ARM64_VA_BITS_48 ||
> ARM64_VA_BITS_52)
> @@ -1316,6 +1320,14 @@ config ARM64_64K_PAGES
> look-up. AArch32 emulation requires applications compiled
> with 64K aligned segments.
>
> +config ARM64_BOOT_TIME_PAGE_SIZE
> + bool "Boot-time selection"
> + select HAVE_PAGE_SIZE_64KB # Advertise largest page size to core
> + help
> + Select desired page size (4KB, 16KB or 64KB) at boot-time via the
> + kernel command line option "arm64.pagesize=4k", "arm64.pagesize=16k"
> + or "arm64.pagesize=64k".
> +
> endchoice
>
> choice
> @@ -1348,6 +1360,7 @@ config ARM64_VA_BITS_48
> config ARM64_VA_BITS_52
> bool "52-bit"
> depends on ARM64_PAN || !ARM64_SW_TTBR0_PAN
> + depends on !ARM64_BOOT_TIME_PAGE_SIZE
> help
> Enable 52-bit virtual addressing for userspace when explicitly
> requested via a hint to mmap(). The kernel will also use 52-bit
> @@ -1588,9 +1601,10 @@ config XEN
> # 4K | 27 | 12 | 15 | 10 |
> # 16K | 27 | 14 | 13 | 11 |
> # 64K | 29 | 16 | 13 | 13 |
> +# BOOT| 29 | 16 (max) | 13 | 13 |
> config ARCH_FORCE_MAX_ORDER
> int
> - default "13" if ARM64_64K_PAGES
> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
> default "11" if ARM64_16K_PAGES
> default "10"
> help
> @@ -1663,7 +1677,7 @@ config ARM64_TAGGED_ADDR_ABI
>
> menuconfig COMPAT
> bool "Kernel support for 32-bit EL0"
> - depends on ARM64_4K_PAGES || EXPERT
> + depends on ARM64_4K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE || EXPERT
> select HAVE_UID16
> select OLD_SIGSUSPEND3
> select COMPAT_OLD_SIGACTION
> diff --git a/arch/arm64/include/asm/kvm_hyp.h
> b/arch/arm64/include/asm/kvm_hyp.h
> index c838309e4ec47..9397a14642afa 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -145,4 +145,15 @@ extern unsigned long kvm_nvhe_sym(__icache_flags);
> extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits);
> extern unsigned int kvm_nvhe_sym(kvm_host_sve_max_vl);
>
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +extern int kvm_nvhe_sym(ptg_page_shift);
> +extern int kvm_nvhe_sym(ptg_pmd_shift);
> +extern int kvm_nvhe_sym(ptg_pud_shift);
> +extern int kvm_nvhe_sym(ptg_p4d_shift);
> +extern int kvm_nvhe_sym(ptg_pgdir_shift);
> +extern int kvm_nvhe_sym(ptg_cont_pte_shift);
> +extern int kvm_nvhe_sym(ptg_cont_pmd_shift);
> +extern int kvm_nvhe_sym(ptg_pgtable_levels);
> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> +
> #endif /* __ARM64_KVM_HYP_H__ */
> diff --git a/arch/arm64/include/asm/pgtable-geometry.h
> b/arch/arm64/include/asm/pgtable-geometry.h
> index 62fe125909c08..18a8c8d499ecc 100644
> --- a/arch/arm64/include/asm/pgtable-geometry.h
> +++ b/arch/arm64/include/asm/pgtable-geometry.h
> @@ -6,16 +6,33 @@
> #define ARM64_PAGE_SHIFT_16K 14
> #define ARM64_PAGE_SHIFT_64K 16
>
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +#define PAGE_SHIFT_MIN ARM64_PAGE_SHIFT_4K
> +#define PAGE_SHIFT_MAX ARM64_PAGE_SHIFT_64K
> +#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> #define PAGE_SHIFT_MIN CONFIG_PAGE_SHIFT
> +#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> +
> #define PAGE_SIZE_MIN (_AC(1, UL) << PAGE_SHIFT_MIN)
> #define PAGE_MASK_MIN (~(PAGE_SIZE_MIN-1))
> -
> -#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
> #define PAGE_SIZE_MAX (_AC(1, UL) << PAGE_SHIFT_MAX)
> #define PAGE_MASK_MAX (~(PAGE_SIZE_MAX-1))
>
> #include <asm-generic/pgtable-geometry.h>
>
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +#ifndef __ASSEMBLY__
> +extern int ptg_page_shift;
> +extern int ptg_pmd_shift;
> +extern int ptg_pud_shift;
> +extern int ptg_p4d_shift;
> +extern int ptg_pgdir_shift;
> +extern int ptg_cont_pte_shift;
> +extern int ptg_cont_pmd_shift;
> +extern int ptg_pgtable_levels;
> +#endif /* __ASSEMBLY__ */
> +#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> #define ptg_page_shift CONFIG_PAGE_SHIFT
> #define ptg_pmd_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
> #define ptg_pud_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
> @@ -24,5 +41,6 @@
> #define ptg_cont_pte_shift (CONFIG_ARM64_CONT_PTE_SHIFT + PAGE_SHIFT)
> #define ptg_cont_pmd_shift (CONFIG_ARM64_CONT_PMD_SHIFT + PMD_SHIFT)
> #define ptg_pgtable_levels CONFIG_PGTABLE_LEVELS
> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>
> #endif /* ASM_PGTABLE_GEOMETRY_H */
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index ca8bcbc1fe220..da5404617acbf 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -52,7 +52,7 @@
> #define PMD_SHIFT ptg_pmd_shift
> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> #define PMD_MASK (~(PMD_SIZE-1))
> -#define PTRS_PER_PMD (1 << (PAGE_SHIFT - 3))
> +#define PTRS_PER_PMD (ptg_pgtable_levels > 2 ? (1 << (PAGE_SHIFT -
> 3)) : 1)
> #define MAX_PTRS_PER_PMD (1 << (PAGE_SHIFT_MAX - 3))
> #endif
>
> @@ -63,7 +63,7 @@
> #define PUD_SHIFT ptg_pud_shift
> #define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> #define PUD_MASK (~(PUD_SIZE-1))
> -#define PTRS_PER_PUD (1 << (PAGE_SHIFT - 3))
> +#define PTRS_PER_PUD (ptg_pgtable_levels > 3 ? (1 << (PAGE_SHIFT -
> 3)) : 1)
> #define MAX_PTRS_PER_PUD (1 << (PAGE_SHIFT_MAX - 3))
> #endif
>
> @@ -71,7 +71,7 @@
> #define P4D_SHIFT ptg_p4d_shift
> #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
> #define P4D_MASK (~(P4D_SIZE-1))
> -#define PTRS_PER_P4D (1 << (PAGE_SHIFT - 3))
> +#define PTRS_PER_P4D (ptg_pgtable_levels > 4 ? (1 << (PAGE_SHIFT -
> 3)) : 1)
> #define MAX_PTRS_PER_P4D (1 << (PAGE_SHIFT_MAX - 3))
> #endif
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 8ead41da715b0..ad9f75f5cc29a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -755,7 +755,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>
> static __always_inline bool pgtable_l3_enabled(void)
> {
> - return true;
> + return ptg_pgtable_levels > 2;
> }
>
> static inline bool mm_pmd_folded(const struct mm_struct *mm)
> @@ -888,6 +888,8 @@ static inline bool pgtable_l3_enabled(void) { return false; }
>
> static __always_inline bool pgtable_l4_enabled(void)
> {
> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
> + return ptg_pgtable_levels > 3;
> if (CONFIG_PGTABLE_LEVELS > 4 || !IS_ENABLED(CONFIG_ARM64_LPA2))
> return true;
> if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
> @@ -935,6 +937,8 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
>
> static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
> {
> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
> + return (pud_t *)p4dp;
> return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
> }
>
> @@ -1014,6 +1018,8 @@ static inline bool pgtable_l4_enabled(void) { return false; }
>
> static __always_inline bool pgtable_l5_enabled(void)
> {
> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
> + return ptg_pgtable_levels > 4;
> if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
> return vabits_actual == VA_BITS;
> return alternative_has_cap_unlikely(ARM64_HAS_VA52);
> @@ -1059,6 +1065,8 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
>
> static inline p4d_t *pgd_to_folded_p4d(pgd_t *pgdp, unsigned long addr)
> {
> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
> + return (p4d_t *)pgdp;
> return (p4d_t *)PTR_ALIGN_DOWN(pgdp, PAGE_SIZE) + p4d_index(addr);
> }
>
> diff --git a/arch/arm64/include/asm/sparsemem.h
> b/arch/arm64/include/asm/sparsemem.h
> index a05fdd54014f7..2daf1263ba638 100644
> --- a/arch/arm64/include/asm/sparsemem.h
> +++ b/arch/arm64/include/asm/sparsemem.h
> @@ -17,6 +17,10 @@
> * entries could not be created for vmemmap mappings.
> * 16K follows 4K for simplicity.
> */
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +#define SECTION_SIZE_BITS 29
> +#else
> #define SECTION_SIZE_BITS (PAGE_SIZE == SZ_64K ? 29 : 27)
> +#endif
>
> #endif
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index a168f3337446f..9968320f83bc4 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -36,6 +36,17 @@ PROVIDE(__pi___memcpy =
> __pi_memcpy);
> PROVIDE(__pi___memmove = __pi_memmove);
> PROVIDE(__pi___memset = __pi_memset);
>
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +PROVIDE(__pi_ptg_page_shift = ptg_page_shift);
> +PROVIDE(__pi_ptg_pmd_shift = ptg_pmd_shift);
> +PROVIDE(__pi_ptg_pud_shift = ptg_pud_shift);
> +PROVIDE(__pi_ptg_p4d_shift = ptg_p4d_shift);
> +PROVIDE(__pi_ptg_pgdir_shift = ptg_pgdir_shift);
> +PROVIDE(__pi_ptg_cont_pte_shift = ptg_cont_pte_shift);
> +PROVIDE(__pi_ptg_cont_pmd_shift = ptg_cont_pmd_shift);
> +PROVIDE(__pi_ptg_pgtable_levels = ptg_pgtable_levels);
> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> +
> PROVIDE(__pi_id_aa64isar1_override = id_aa64isar1_override);
> PROVIDE(__pi_id_aa64isar2_override = id_aa64isar2_override);
> PROVIDE(__pi_id_aa64mmfr0_override = id_aa64mmfr0_override);
> diff --git a/arch/arm64/kernel/image.h b/arch/arm64/kernel/image.h
> index 7bc3ba8979019..01502fc3b891b 100644
> --- a/arch/arm64/kernel/image.h
> +++ b/arch/arm64/kernel/image.h
> @@ -47,7 +47,11 @@
> #define __HEAD_FLAG_BE ARM64_IMAGE_FLAG_LE
> #endif
>
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +#define __HEAD_FLAG_PAGE_SIZE 0
> +#else
> #define __HEAD_FLAG_PAGE_SIZE ((PAGE_SHIFT - 10) / 2)
> +#endif
>
> #define __HEAD_FLAG_PHYS_BASE 1
>
> diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
> index deb8cd50b0b0c..22b3c70e04f9c 100644
> --- a/arch/arm64/kernel/pi/map_kernel.c
> +++ b/arch/arm64/kernel/pi/map_kernel.c
> @@ -221,6 +221,49 @@ static void __init map_fdt(u64 fdt, int page_shift)
> dsb(ishst);
> }
>
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> +static void __init ptg_init(int page_shift)
> +{
> + ptg_pgtable_levels =
> + __ARM64_HW_PGTABLE_LEVELS(page_shift,
> CONFIG_ARM64_VA_BITS);
> +
> + ptg_pgdir_shift = __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift,
> + 4 - ptg_pgtable_levels);
> +
> + ptg_p4d_shift = ptg_pgtable_levels >= 5 ?
> + __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 0) :
> + ptg_pgdir_shift;
> +
> + ptg_pud_shift = ptg_pgtable_levels >= 4 ?
> + __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 1) :
> + ptg_pgdir_shift;
> +
> + ptg_pmd_shift = ptg_pgtable_levels >= 3 ?
> + __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 2) :
> + ptg_pgdir_shift;
> +
> + ptg_page_shift = page_shift;
> +
> + switch (page_shift) {
> + case ARM64_PAGE_SHIFT_64K:
> + ptg_cont_pte_shift = ptg_page_shift + 5;
> + ptg_cont_pmd_shift = ptg_pmd_shift + 5;
> + break;
> + case ARM64_PAGE_SHIFT_16K:
> + ptg_cont_pte_shift = ptg_page_shift + 7;
> + ptg_cont_pmd_shift = ptg_pmd_shift + 5;
> + break;
> + default: /* ARM64_PAGE_SHIFT_4K */
> + ptg_cont_pte_shift = ptg_page_shift + 4;
> + ptg_cont_pmd_shift = ptg_pmd_shift + 4;
> + }
> +}
> +#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> +static inline void ptg_init(int page_shift)
> +{
> +}
> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> +
> asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
> {
> static char const chosen_str[] __initconst = "/chosen";
> @@ -247,6 +290,8 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void
> *fdt)
> if (!page_shift)
> page_shift = early_page_shift;
>
> + ptg_init(page_shift);
> +
> if (va_bits > 48) {
> u64 page_size = early_page_size(page_shift);
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9bef7638342ef..c835a50b8b768 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2424,6 +2424,16 @@ static void kvm_hyp_init_symbols(void)
> kvm_nvhe_sym(id_aa64smfr0_el1_sys_val) =
> read_sanitised_ftr_reg(SYS_ID_AA64SMFR0_EL1);
> kvm_nvhe_sym(__icache_flags) = __icache_flags;
> kvm_nvhe_sym(kvm_arm_vmid_bits) = kvm_arm_vmid_bits;
> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> + kvm_nvhe_sym(ptg_page_shift) = ptg_page_shift;
> + kvm_nvhe_sym(ptg_pmd_shift) = ptg_pmd_shift;
> + kvm_nvhe_sym(ptg_pud_shift) = ptg_pud_shift;
> + kvm_nvhe_sym(ptg_p4d_shift) = ptg_p4d_shift;
> + kvm_nvhe_sym(ptg_pgdir_shift) = ptg_pgdir_shift;
> + kvm_nvhe_sym(ptg_cont_pte_shift) = ptg_cont_pte_shift;
> + kvm_nvhe_sym(ptg_cont_pmd_shift) = ptg_cont_pmd_shift;
> + kvm_nvhe_sym(ptg_pgtable_levels) = ptg_pgtable_levels;
> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
> }
>
> static int __init kvm_hyp_init_protection(u32 hyp_va_bits)
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile
> b/arch/arm64/kvm/hyp/nvhe/Makefile
> index b43426a493df5..a8fcbb84c7996 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -27,6 +27,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-
> init.o host.o
> cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
> hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
> +hyp-obj-$(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) += pgtable-geometry.o
> hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
> hyp-obj-y += $(lib-objs)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> b/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> new file mode 100644
> index 0000000000000..17f807450a31a
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> @@ -0,0 +1,16 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 ARM Ltd.
> + */
> +
> +#include <linux/cache.h>
> +#include <asm/pgtable-geometry.h>
> +
> +int ptg_page_shift __ro_after_init;
> +int ptg_pmd_shift __ro_after_init;
> +int ptg_pud_shift __ro_after_init;
> +int ptg_p4d_shift __ro_after_init;
> +int ptg_pgdir_shift __ro_after_init;
> +int ptg_cont_pte_shift __ro_after_init;
> +int ptg_cont_pmd_shift __ro_after_init;
> +int ptg_pgtable_levels __ro_after_init;
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index 60454256945b8..2ba30d06b35fe 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o
> fault.o init.o \
> cache.o copypage.o flush.o \
> ioremap.o mmap.o pgd.o mmu.o \
> context.o proc.o pageattr.o fixmap.o
> +obj-$(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) += pgtable-geometry.o
> obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> index 4b106510358b1..c052d0dcb0c69 100644
> --- a/arch/arm64/mm/pgd.c
> +++ b/arch/arm64/mm/pgd.c
> @@ -21,10 +21,12 @@ static bool pgdir_is_page_size(void)
> {
> if (PGD_SIZE == PAGE_SIZE)
> return true;
> - if (CONFIG_PGTABLE_LEVELS == 4)
> - return !pgtable_l4_enabled();
> - if (CONFIG_PGTABLE_LEVELS == 5)
> - return !pgtable_l5_enabled();
> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE)) {
> + if (CONFIG_PGTABLE_LEVELS == 4)
> + return !pgtable_l4_enabled();
> + if (CONFIG_PGTABLE_LEVELS == 5)
> + return !pgtable_l5_enabled();
> + }
> return false;
> }
>
> diff --git a/arch/arm64/mm/pgtable-geometry.c b/arch/arm64/mm/pgtable-
> geometry.c
> new file mode 100644
> index 0000000000000..ba50637f1e9d0
> --- /dev/null
> +++ b/arch/arm64/mm/pgtable-geometry.c
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2024 ARM Ltd.
> + */
> +
> +#include <linux/cache.h>
> +#include <asm/pgtable-geometry.h>
> +
> +/*
> + * TODO: These should be __ro_after_init, but we need to write to them from the
> + * pi code where they are mapped in the early page table as read-only.
> + * __ro_after_init doesn't become writable until later when the swapper pgtable
> + * is fully set up. We should update the early page table to map __ro_after_init
> + * as read-write.
> + */
> +
> +int ptg_page_shift __read_mostly;
> +int ptg_pmd_shift __read_mostly;
I found that ptg_page_shift and ptg_pmd_shift need
EXPORT_SYMBOL_GPL for cases where code compiled
as a module is using PAGE_SIZE/PAGE_SHIFT or
PMD_SIZE/PMD_SHIFT. Some of the others below
might also need EXPORT_SYMBOL_GPL.
Michael
> +int ptg_pud_shift __read_mostly;
> +int ptg_p4d_shift __read_mostly;
> +int ptg_pgdir_shift __read_mostly;
> +int ptg_cont_pte_shift __read_mostly;
> +int ptg_cont_pmd_shift __read_mostly;
> +int ptg_pgtable_levels __read_mostly;
> diff --git a/drivers/firmware/efi/libstub/arm64.c
> b/drivers/firmware/efi/libstub/arm64.c
> index e57cd3de0a00f..8db9dba7d5423 100644
> --- a/drivers/firmware/efi/libstub/arm64.c
> +++ b/drivers/firmware/efi/libstub/arm64.c
> @@ -68,7 +68,8 @@ efi_status_t check_platform_features(void)
> efi_novamap = true;
>
> /* UEFI mandates support for 4 KB granularity, no need to check */
> - if (IS_ENABLED(CONFIG_ARM64_4K_PAGES))
> + if (IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
> + IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
> return EFI_SUCCESS;
>
> tg = (read_cpuid(ID_AA64MMFR0_EL1) >>
> ID_AA64MMFR0_EL1_TGRAN_SHIFT) & 0xf;
> --
> 2.43.0
>
* RE: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
2024-10-14 17:32 ` [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Florian Fainelli
@ 2024-10-15 18:38 ` Michael Kelley
2024-10-16 8:23 ` Ryan Roberts
2024-10-16 15:16 ` David Hildenbrand
` (5 subsequent siblings)
8 siblings, 1 reply; 196+ messages in thread
From: Michael Kelley @ 2024-10-15 18:38 UTC
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Dexuan Cui, Boqun Feng
Cc: linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
From: Ryan Roberts <ryan.roberts@arm.com> Sent: Monday, October 14, 2024 3:55 AM
>
> Hi All,
>
> Patch bomb incoming... This covers many subsystems, so I've included a core set
> of people on the full series and additionally included maintainers on relevant
> patches. I haven't included those maintainers on this cover letter since the
> numbers were far too big for it to work. But I've included a link to this cover
> letter on each patch, so they can hopefully find their way here. For follow up
> submissions I'll break it up by subsystem, but for now thought it was important
> to show the full picture.
>
> This RFC series implements support for boot-time page size selection within the
> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
> size has been selected at compile-time, meaning the size is baked into a given
> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
> starts to present a problem for distributions. Boot-time page size selection
> enables the creation of a single kernel image, which can be told which page size
> to use on the kernel command line.
>
> Why is having an image-per-page size problematic?
> =================================================
>
> Many traditional distros are now supporting both 4K and 64K. And this means
> managing 2 kernel packages, along with drivers for each. For some, it means
> multiple installer flavours and multiple ISOs. All of this adds up to a
> less-than-ideal level of complexity. Additionally, Android now supports 4K and
> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
> painful, and the extra flash space required for both kernel images and the
> duplicated modules has been problematic. Boot-time page size selection solves
> all of this.
>
> Additionally, in starting to think about the longer term deployment story for
> D128 page tables, which Arm architecture now supports, a lot of the same
> problems need to be solved, so this work sets us up nicely for that.
>
> So what's the down side?
> ========================
>
> Well nothing's free; Various static allocations in the kernel image must be
> sized for the worst case (largest supported page size), so image size is in line
> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
> is a slight increase to the image size. But I expect that problem goes away if
> you're compressing the image - it's just some extra zeros. At boot-time, I expect
> we could free the unused static storage once we know the page size - although
> that would be a follow up enhancement.
>
> And then there is performance. Since PAGE_SIZE and friends are no longer
> compile-time constants, we must look up their values and do arithmetic at
> runtime instead of compile-time. My early perf testing suggests this is
> imperceptible for real-world workloads, and only has a small impact on
> microbenchmarks - more on this below.
[snip]
This is pretty cool. :-) FWIW, I've built a kernel with this patch set, and
have it running in a RHEL 8.7 guest on Hyper-V in the Azure public cloud.
Ran with 4K, 16K, and 64K page sizes, and the basic smoke tests work.
The Hyper-V specific code in the Linux kernel needed a few tweaks to
deal with PAGE_SIZE and friends no longer being constant, but it's nothing
significant. Getting the kernel built in the first place was a little harder
because my .config file is fairly generic with a lot of device drivers and file
system code that aren't really needed for Hyper-V guests. I had to
weed out the ones that won't build. My RHEL 8.7 install uses LVM, so I
hacked the 'dm' code to make it compile and run.
As this work moves forward, I can supply the necessary patches for
the Hyper-V support. Let me know if you want to include them in the
main patch set.
I've added a couple of Microsoft's Linux people to this email's addressee
list so they are aware of what's going on.
Michael Kelley
* Re: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-15 17:42 ` Zi Yan
@ 2024-10-16 8:14 ` Ryan Roberts
2024-10-16 14:21 ` Zi Yan
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 8:14 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon, kvmarm, linux-arm-kernel, linux-efi,
linux-kernel, linux-mm
On 15/10/2024 18:42, Zi Yan wrote:
> On 14 Oct 2024, at 6:59, Ryan Roberts wrote:
>
>> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
>> selected instead of a page size. When selected, the resulting kernel's
>> page size can be configured at boot via the command line.
>>
>> For now, boot-time page size kernels are limited to 48-bit VA, since
>> more work is required to support LPA2. Additionally MMAP_RND_BITS and
>> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
>> work could be implemented to be able to configure these at boot time for
>> optimal page size-specific values.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>
> <snip>
>
>>
>> @@ -1588,9 +1601,10 @@ config XEN
>> # 4K | 27 | 12 | 15 | 10 |
>> # 16K | 27 | 14 | 13 | 11 |
>> # 64K | 29 | 16 | 13 | 13 |
>> +# BOOT| 29 | 16 (max) | 13 | 13 |
>> config ARCH_FORCE_MAX_ORDER
>> int
>> - default "13" if ARM64_64K_PAGES
>> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>> default "11" if ARM64_16K_PAGES
>> default "10"
>> help
>
> So boot-time page size kernel always has the highest MAX_PAGE_ORDER, which
> means the section size increases for 4KB and 16KB page sizes. Any downside
> for this?
I guess there is some cost to the buddy when MAX_PAGE_ORDER is larger than it
needs to be - I expect you can explain those details much better than I can. I'm
just setting it to the worst case for now as it was the easiest solution for the
initial series.
>
> Is there any plan (not in this patchset) to support boot-time MAX_PAGE_ORDER
> to keep section size the same?
Yes absolutely. I should have documented MAX_PAGE_ORDER in the commit log along
with the comments for MMAP_RND_BITS and SECTION_SIZE_BITS - that was an
oversight and I'll fix it in the next version. I plan to look at making all 3
values boot-time configurable in future (although I have no idea at this point
how involved that will be).
Thanks,
Ryan
>
> Best Regards,
> Yan, Zi
* Re: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-15 17:52 ` Michael Kelley
@ 2024-10-16 8:17 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 8:17 UTC (permalink / raw)
To: Michael Kelley, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Oliver Upton, Will Deacon
Cc: kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
linux-efi@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
On 15/10/2024 18:52, Michael Kelley wrote:
> From: Ryan Roberts <ryan.roberts@arm.com> Sent: Monday, October 14, 2024 3:59 AM
>>
>> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
>> selected instead of a page size. When selected, the resulting kernel's
>> page size can be configured at boot via the command line.
>>
>> For now, boot-time page size kernels are limited to 48-bit VA, since
>> more work is required to support LPA2. Additionally MMAP_RND_BITS and
>> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
>> work could be implemented to be able to configure these at boot time for
>> optimal page size-specific values.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> arch/arm64/Kconfig | 26 ++++++++++---
>> arch/arm64/include/asm/kvm_hyp.h | 11 ++++++
>> arch/arm64/include/asm/pgtable-geometry.h | 22 ++++++++++-
>> arch/arm64/include/asm/pgtable-hwdef.h | 6 +--
>> arch/arm64/include/asm/pgtable.h | 10 ++++-
>> arch/arm64/include/asm/sparsemem.h | 4 ++
>> arch/arm64/kernel/image-vars.h | 11 ++++++
>> arch/arm64/kernel/image.h | 4 ++
>> arch/arm64/kernel/pi/map_kernel.c | 45 ++++++++++++++++++++++
>> arch/arm64/kvm/arm.c | 10 +++++
>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++++++++
>> arch/arm64/mm/Makefile | 1 +
>> arch/arm64/mm/pgd.c | 10 +++--
>> arch/arm64/mm/pgtable-geometry.c | 24 ++++++++++++
>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>> 16 files changed, 187 insertions(+), 17 deletions(-)
>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index a2f8ff354ca67..573d308741169 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -121,6 +121,7 @@ config ARM64
>> select BUILDTIME_TABLE_SORT
>> select CLONE_BACKWARDS
>> select COMMON_CLK
>> + select CONSTRUCTORS if ARM64_BOOT_TIME_PAGE_SIZE
>> select CPU_PM if (SUSPEND || CPU_IDLE)
>> select CPUMASK_OFFSTACK if NR_CPUS > 256
>> select CRC32
>> @@ -284,18 +285,20 @@ config MMU
>>
>> config ARM64_CONT_PTE_SHIFT
>> int
>> + depends on !ARM64_BOOT_TIME_PAGE_SIZE
>> default 5 if PAGE_SIZE_64KB
>> default 7 if PAGE_SIZE_16KB
>> default 4
>>
>> config ARM64_CONT_PMD_SHIFT
>> int
>> + depends on !ARM64_BOOT_TIME_PAGE_SIZE
>> default 5 if PAGE_SIZE_64KB
>> default 5 if PAGE_SIZE_16KB
>> default 4
>>
>> config ARCH_MMAP_RND_BITS_MIN
>> - default 14 if PAGE_SIZE_64KB
>> + default 14 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>> default 16 if PAGE_SIZE_16KB
>> default 18
>>
>> @@ -306,15 +309,15 @@ config ARCH_MMAP_RND_BITS_MAX
>> default 24 if ARM64_VA_BITS=39
>> default 27 if ARM64_VA_BITS=42
>> default 30 if ARM64_VA_BITS=47
>> - default 29 if ARM64_VA_BITS=48 && ARM64_64K_PAGES
>> + default 29 if ARM64_VA_BITS=48 && (ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE)
>> default 31 if ARM64_VA_BITS=48 && ARM64_16K_PAGES
>> default 33 if ARM64_VA_BITS=48
>> - default 14 if ARM64_64K_PAGES
>> + default 14 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>> default 16 if ARM64_16K_PAGES
>> default 18
>>
>> config ARCH_MMAP_RND_COMPAT_BITS_MIN
>> - default 7 if ARM64_64K_PAGES
>> + default 7 if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>> default 9 if ARM64_16K_PAGES
>> default 11
>>
>> @@ -362,6 +365,7 @@ config FIX_EARLYCON_MEM
>>
>> config PGTABLE_LEVELS
>> int
>> + default 4 if ARM64_BOOT_TIME_PAGE_SIZE # Advertise max supported levels
>> default 2 if ARM64_16K_PAGES && ARM64_VA_BITS_36
>> default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
>> default 3 if ARM64_64K_PAGES && (ARM64_VA_BITS_48 || ARM64_VA_BITS_52)
>> @@ -1316,6 +1320,14 @@ config ARM64_64K_PAGES
>> look-up. AArch32 emulation requires applications compiled
>> with 64K aligned segments.
>>
>> +config ARM64_BOOT_TIME_PAGE_SIZE
>> + bool "Boot-time selection"
>> + select HAVE_PAGE_SIZE_64KB # Advertise largest page size to core
>> + help
>> + Select desired page size (4KB, 16KB or 64KB) at boot-time via the
>> + kernel command line option "arm64.pagesize=4k", "arm64.pagesize=16k"
>> + or "arm64.pagesize=64k".
>> +
>> endchoice
>>
>> choice
>> @@ -1348,6 +1360,7 @@ config ARM64_VA_BITS_48
>> config ARM64_VA_BITS_52
>> bool "52-bit"
>> depends on ARM64_PAN || !ARM64_SW_TTBR0_PAN
>> + depends on !ARM64_BOOT_TIME_PAGE_SIZE
>> help
>> Enable 52-bit virtual addressing for userspace when explicitly
>> requested via a hint to mmap(). The kernel will also use 52-bit
>> @@ -1588,9 +1601,10 @@ config XEN
>> # 4K | 27 | 12 | 15 | 10 |
>> # 16K | 27 | 14 | 13 | 11 |
>> # 64K | 29 | 16 | 13 | 13 |
>> +# BOOT| 29 | 16 (max) | 13 | 13 |
>> config ARCH_FORCE_MAX_ORDER
>> int
>> - default "13" if ARM64_64K_PAGES
>> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>> default "11" if ARM64_16K_PAGES
>> default "10"
>> help
>> @@ -1663,7 +1677,7 @@ config ARM64_TAGGED_ADDR_ABI
>>
>> menuconfig COMPAT
>> bool "Kernel support for 32-bit EL0"
>> - depends on ARM64_4K_PAGES || EXPERT
>> + depends on ARM64_4K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE || EXPERT
>> select HAVE_UID16
>> select OLD_SIGSUSPEND3
>> select COMPAT_OLD_SIGACTION
>> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
>> index c838309e4ec47..9397a14642afa 100644
>> --- a/arch/arm64/include/asm/kvm_hyp.h
>> +++ b/arch/arm64/include/asm/kvm_hyp.h
>> @@ -145,4 +145,15 @@ extern unsigned long kvm_nvhe_sym(__icache_flags);
>> extern unsigned int kvm_nvhe_sym(kvm_arm_vmid_bits);
>> extern unsigned int kvm_nvhe_sym(kvm_host_sve_max_vl);
>>
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +extern int kvm_nvhe_sym(ptg_page_shift);
>> +extern int kvm_nvhe_sym(ptg_pmd_shift);
>> +extern int kvm_nvhe_sym(ptg_pud_shift);
>> +extern int kvm_nvhe_sym(ptg_p4d_shift);
>> +extern int kvm_nvhe_sym(ptg_pgdir_shift);
>> +extern int kvm_nvhe_sym(ptg_cont_pte_shift);
>> +extern int kvm_nvhe_sym(ptg_cont_pmd_shift);
>> +extern int kvm_nvhe_sym(ptg_pgtable_levels);
>> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> +
>> #endif /* __ARM64_KVM_HYP_H__ */
>> diff --git a/arch/arm64/include/asm/pgtable-geometry.h b/arch/arm64/include/asm/pgtable-geometry.h
>> index 62fe125909c08..18a8c8d499ecc 100644
>> --- a/arch/arm64/include/asm/pgtable-geometry.h
>> +++ b/arch/arm64/include/asm/pgtable-geometry.h
>> @@ -6,16 +6,33 @@
>> #define ARM64_PAGE_SHIFT_16K 14
>> #define ARM64_PAGE_SHIFT_64K 16
>>
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +#define PAGE_SHIFT_MIN ARM64_PAGE_SHIFT_4K
>> +#define PAGE_SHIFT_MAX ARM64_PAGE_SHIFT_64K
>> +#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> #define PAGE_SHIFT_MIN CONFIG_PAGE_SHIFT
>> +#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
>> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> +
>> #define PAGE_SIZE_MIN (_AC(1, UL) << PAGE_SHIFT_MIN)
>> #define PAGE_MASK_MIN (~(PAGE_SIZE_MIN-1))
>> -
>> -#define PAGE_SHIFT_MAX CONFIG_PAGE_SHIFT
>> #define PAGE_SIZE_MAX (_AC(1, UL) << PAGE_SHIFT_MAX)
>> #define PAGE_MASK_MAX (~(PAGE_SIZE_MAX-1))
>>
>> #include <asm-generic/pgtable-geometry.h>
>>
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +#ifndef __ASSEMBLY__
>> +extern int ptg_page_shift;
>> +extern int ptg_pmd_shift;
>> +extern int ptg_pud_shift;
>> +extern int ptg_p4d_shift;
>> +extern int ptg_pgdir_shift;
>> +extern int ptg_cont_pte_shift;
>> +extern int ptg_cont_pmd_shift;
>> +extern int ptg_pgtable_levels;
>> +#endif /* __ASSEMBLY__ */
>> +#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> #define ptg_page_shift CONFIG_PAGE_SHIFT
>> #define ptg_pmd_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
>> #define ptg_pud_shift ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
>> @@ -24,5 +41,6 @@
>> #define ptg_cont_pte_shift (CONFIG_ARM64_CONT_PTE_SHIFT + PAGE_SHIFT)
>> #define ptg_cont_pmd_shift (CONFIG_ARM64_CONT_PMD_SHIFT + PMD_SHIFT)
>> #define ptg_pgtable_levels CONFIG_PGTABLE_LEVELS
>> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>>
>> #endif /* ASM_PGTABLE_GEOMETRY_H */
>> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
>> index ca8bcbc1fe220..da5404617acbf 100644
>> --- a/arch/arm64/include/asm/pgtable-hwdef.h
>> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
>> @@ -52,7 +52,7 @@
>> #define PMD_SHIFT ptg_pmd_shift
>> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
>> #define PMD_MASK (~(PMD_SIZE-1))
>> -#define PTRS_PER_PMD (1 << (PAGE_SHIFT - 3))
>> +#define PTRS_PER_PMD (ptg_pgtable_levels > 2 ? (1 << (PAGE_SHIFT - 3)) : 1)
>> #define MAX_PTRS_PER_PMD (1 << (PAGE_SHIFT_MAX - 3))
>> #endif
>>
>> @@ -63,7 +63,7 @@
>> #define PUD_SHIFT ptg_pud_shift
>> #define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
>> #define PUD_MASK (~(PUD_SIZE-1))
>> -#define PTRS_PER_PUD (1 << (PAGE_SHIFT - 3))
>> +#define PTRS_PER_PUD (ptg_pgtable_levels > 3 ? (1 << (PAGE_SHIFT - 3)) : 1)
>> #define MAX_PTRS_PER_PUD (1 << (PAGE_SHIFT_MAX - 3))
>> #endif
>>
>> @@ -71,7 +71,7 @@
>> #define P4D_SHIFT ptg_p4d_shift
>> #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
>> #define P4D_MASK (~(P4D_SIZE-1))
>> -#define PTRS_PER_P4D (1 << (PAGE_SHIFT - 3))
>> +#define PTRS_PER_P4D (ptg_pgtable_levels > 4 ? (1 << (PAGE_SHIFT - 3)) : 1)
>> #define MAX_PTRS_PER_P4D (1 << (PAGE_SHIFT_MAX - 3))
>> #endif
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 8ead41da715b0..ad9f75f5cc29a 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -755,7 +755,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>>
>> static __always_inline bool pgtable_l3_enabled(void)
>> {
>> - return true;
>> + return ptg_pgtable_levels > 2;
>> }
>>
>> static inline bool mm_pmd_folded(const struct mm_struct *mm)
>> @@ -888,6 +888,8 @@ static inline bool pgtable_l3_enabled(void) { return false; }
>>
>> static __always_inline bool pgtable_l4_enabled(void)
>> {
>> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
>> + return ptg_pgtable_levels > 3;
>> if (CONFIG_PGTABLE_LEVELS > 4 || !IS_ENABLED(CONFIG_ARM64_LPA2))
>> return true;
>> if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
>> @@ -935,6 +937,8 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
>>
>> static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
>> {
>> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
>> + return (pud_t *)p4dp;
>> return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
>> }
>>
>> @@ -1014,6 +1018,8 @@ static inline bool pgtable_l4_enabled(void) { return false; }
>>
>> static __always_inline bool pgtable_l5_enabled(void)
>> {
>> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
>> + return ptg_pgtable_levels > 4;
>> if (!alternative_has_cap_likely(ARM64_ALWAYS_BOOT))
>> return vabits_actual == VA_BITS;
>> return alternative_has_cap_unlikely(ARM64_HAS_VA52);
>> @@ -1059,6 +1065,8 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
>>
>> static inline p4d_t *pgd_to_folded_p4d(pgd_t *pgdp, unsigned long addr)
>> {
>> + if (IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
>> + return (p4d_t *)pgdp;
>> return (p4d_t *)PTR_ALIGN_DOWN(pgdp, PAGE_SIZE) + p4d_index(addr);
>> }
>>
>> diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
>> index a05fdd54014f7..2daf1263ba638 100644
>> --- a/arch/arm64/include/asm/sparsemem.h
>> +++ b/arch/arm64/include/asm/sparsemem.h
>> @@ -17,6 +17,10 @@
>> * entries could not be created for vmemmap mappings.
>> * 16K follows 4K for simplicity.
>> */
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +#define SECTION_SIZE_BITS 29
>> +#else
>> #define SECTION_SIZE_BITS (PAGE_SIZE == SZ_64K ? 29 : 27)
>> +#endif
>>
>> #endif
>> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
>> index a168f3337446f..9968320f83bc4 100644
>> --- a/arch/arm64/kernel/image-vars.h
>> +++ b/arch/arm64/kernel/image-vars.h
>> @@ -36,6 +36,17 @@ PROVIDE(__pi___memcpy = __pi_memcpy);
>> PROVIDE(__pi___memmove = __pi_memmove);
>> PROVIDE(__pi___memset = __pi_memset);
>>
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +PROVIDE(__pi_ptg_page_shift = ptg_page_shift);
>> +PROVIDE(__pi_ptg_pmd_shift = ptg_pmd_shift);
>> +PROVIDE(__pi_ptg_pud_shift = ptg_pud_shift);
>> +PROVIDE(__pi_ptg_p4d_shift = ptg_p4d_shift);
>> +PROVIDE(__pi_ptg_pgdir_shift = ptg_pgdir_shift);
>> +PROVIDE(__pi_ptg_cont_pte_shift = ptg_cont_pte_shift);
>> +PROVIDE(__pi_ptg_cont_pmd_shift = ptg_cont_pmd_shift);
>> +PROVIDE(__pi_ptg_pgtable_levels = ptg_pgtable_levels);
>> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> +
>> PROVIDE(__pi_id_aa64isar1_override = id_aa64isar1_override);
>> PROVIDE(__pi_id_aa64isar2_override = id_aa64isar2_override);
>> PROVIDE(__pi_id_aa64mmfr0_override = id_aa64mmfr0_override);
>> diff --git a/arch/arm64/kernel/image.h b/arch/arm64/kernel/image.h
>> index 7bc3ba8979019..01502fc3b891b 100644
>> --- a/arch/arm64/kernel/image.h
>> +++ b/arch/arm64/kernel/image.h
>> @@ -47,7 +47,11 @@
>> #define __HEAD_FLAG_BE ARM64_IMAGE_FLAG_LE
>> #endif
>>
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +#define __HEAD_FLAG_PAGE_SIZE 0
>> +#else
>> #define __HEAD_FLAG_PAGE_SIZE ((PAGE_SHIFT - 10) / 2)
>> +#endif
>>
>> #define __HEAD_FLAG_PHYS_BASE 1
>>
>> diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
>> index deb8cd50b0b0c..22b3c70e04f9c 100644
>> --- a/arch/arm64/kernel/pi/map_kernel.c
>> +++ b/arch/arm64/kernel/pi/map_kernel.c
>> @@ -221,6 +221,49 @@ static void __init map_fdt(u64 fdt, int page_shift)
>> dsb(ishst);
>> }
>>
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> +static void __init ptg_init(int page_shift)
>> +{
>> + ptg_pgtable_levels =
>> + __ARM64_HW_PGTABLE_LEVELS(page_shift, CONFIG_ARM64_VA_BITS);
>> +
>> + ptg_pgdir_shift = __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift,
>> + 4 - ptg_pgtable_levels);
>> +
>> + ptg_p4d_shift = ptg_pgtable_levels >= 5 ?
>> + __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 0) :
>> + ptg_pgdir_shift;
>> +
>> + ptg_pud_shift = ptg_pgtable_levels >= 4 ?
>> + __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 1) :
>> + ptg_pgdir_shift;
>> +
>> + ptg_pmd_shift = ptg_pgtable_levels >= 3 ?
>> + __ARM64_HW_PGTABLE_LEVEL_SHIFT(page_shift, 2) :
>> + ptg_pgdir_shift;
>> +
>> + ptg_page_shift = page_shift;
>> +
>> + switch (page_shift) {
>> + case ARM64_PAGE_SHIFT_64K:
>> + ptg_cont_pte_shift = ptg_page_shift + 5;
>> + ptg_cont_pmd_shift = ptg_pmd_shift + 5;
>> + break;
>> + case ARM64_PAGE_SHIFT_16K:
>> + ptg_cont_pte_shift = ptg_page_shift + 7;
>> + ptg_cont_pmd_shift = ptg_pmd_shift + 5;
>> + break;
>> + default: /* ARM64_PAGE_SHIFT_4K */
>> + ptg_cont_pte_shift = ptg_page_shift + 4;
>> + ptg_cont_pmd_shift = ptg_pmd_shift + 4;
>> + }
>> +}
>> +#else /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> +static inline void ptg_init(int page_shift)
>> +{
>> +}
>> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> +
>> asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
>> {
>> static char const chosen_str[] __initconst = "/chosen";
>> @@ -247,6 +290,8 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
>> if (!page_shift)
>> page_shift = early_page_shift;
>>
>> + ptg_init(page_shift);
>> +
>> if (va_bits > 48) {
>> u64 page_size = early_page_size(page_shift);
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 9bef7638342ef..c835a50b8b768 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -2424,6 +2424,16 @@ static void kvm_hyp_init_symbols(void)
>> kvm_nvhe_sym(id_aa64smfr0_el1_sys_val) =
>> read_sanitised_ftr_reg(SYS_ID_AA64SMFR0_EL1);
>> kvm_nvhe_sym(__icache_flags) = __icache_flags;
>> kvm_nvhe_sym(kvm_arm_vmid_bits) = kvm_arm_vmid_bits;
>> +#ifdef CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> + kvm_nvhe_sym(ptg_page_shift) = ptg_page_shift;
>> + kvm_nvhe_sym(ptg_pmd_shift) = ptg_pmd_shift;
>> + kvm_nvhe_sym(ptg_pud_shift) = ptg_pud_shift;
>> + kvm_nvhe_sym(ptg_p4d_shift) = ptg_p4d_shift;
>> + kvm_nvhe_sym(ptg_pgdir_shift) = ptg_pgdir_shift;
>> + kvm_nvhe_sym(ptg_cont_pte_shift) = ptg_cont_pte_shift;
>> + kvm_nvhe_sym(ptg_cont_pmd_shift) = ptg_cont_pmd_shift;
>> + kvm_nvhe_sym(ptg_pgtable_levels) = ptg_pgtable_levels;
>> +#endif /* CONFIG_ARM64_BOOT_TIME_PAGE_SIZE */
>> }
>>
>> static int __init kvm_hyp_init_protection(u32 hyp_va_bits)
>> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
>> index b43426a493df5..a8fcbb84c7996 100644
>> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
>> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
>> @@ -27,6 +27,7 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o
>> cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o
>> hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
>> ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
>> +hyp-obj-$(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) += pgtable-geometry.o
>> hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o
>> hyp-obj-y += $(lib-objs)
>>
>> diff --git a/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c b/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>> new file mode 100644
>> index 0000000000000..17f807450a31a
>> --- /dev/null
>> +++ b/arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>> @@ -0,0 +1,16 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2024 ARM Ltd.
>> + */
>> +
>> +#include <linux/cache.h>
>> +#include <asm/pgtable-geometry.h>
>> +
>> +int ptg_page_shift __ro_after_init;
>> +int ptg_pmd_shift __ro_after_init;
>> +int ptg_pud_shift __ro_after_init;
>> +int ptg_p4d_shift __ro_after_init;
>> +int ptg_pgdir_shift __ro_after_init;
>> +int ptg_cont_pte_shift __ro_after_init;
>> +int ptg_cont_pmd_shift __ro_after_init;
>> +int ptg_pgtable_levels __ro_after_init;
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index 60454256945b8..2ba30d06b35fe 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \
>> cache.o copypage.o flush.o \
>> ioremap.o mmap.o pgd.o mmu.o \
>> context.o proc.o pageattr.o fixmap.o
>> +obj-$(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) += pgtable-geometry.o
>> obj-$(CONFIG_ARM64_CONTPTE) += contpte.o
>> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
>> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
>> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
>> index 4b106510358b1..c052d0dcb0c69 100644
>> --- a/arch/arm64/mm/pgd.c
>> +++ b/arch/arm64/mm/pgd.c
>> @@ -21,10 +21,12 @@ static bool pgdir_is_page_size(void)
>> {
>> if (PGD_SIZE == PAGE_SIZE)
>> return true;
>> - if (CONFIG_PGTABLE_LEVELS == 4)
>> - return !pgtable_l4_enabled();
>> - if (CONFIG_PGTABLE_LEVELS == 5)
>> - return !pgtable_l5_enabled();
>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE)) {
>> + if (CONFIG_PGTABLE_LEVELS == 4)
>> + return !pgtable_l4_enabled();
>> + if (CONFIG_PGTABLE_LEVELS == 5)
>> + return !pgtable_l5_enabled();
>> + }
>> return false;
>> }
>>
>> diff --git a/arch/arm64/mm/pgtable-geometry.c b/arch/arm64/mm/pgtable-geometry.c
>> new file mode 100644
>> index 0000000000000..ba50637f1e9d0
>> --- /dev/null
>> +++ b/arch/arm64/mm/pgtable-geometry.c
>> @@ -0,0 +1,24 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2024 ARM Ltd.
>> + */
>> +
>> +#include <linux/cache.h>
>> +#include <asm/pgtable-geometry.h>
>> +
>> +/*
>> + * TODO: These should be __ro_after_init, but we need to write to them from the
>> + * pi code where they are mapped in the early page table as read-only.
>> + * __ro_after_init doesn't become writable until later when the swapper pgtable
>> + * is fully set up. We should update the early page table to map __ro_after_init
>> + * as read-write.
>> + */
>> +
>> +int ptg_page_shift __read_mostly;
>> +int ptg_pmd_shift __read_mostly;
>
> I found that ptg_page_shift and ptg_pmd_shift need
> EXPORT_SYMBOL_GPL for cases where code compiled
> as a module is using PAGE_SIZE/PAGE_SHIFT or
> PMD_SIZE/PMD_SHIFT. Some of the others below
> might also need EXPORT_SYMBOL_GPL.
Ahh good spot - thanks! I'll fix this in the next version.
I wonder if these should really be EXPORT_SYMBOL() and not limited to GPL? I
guess having access to PAGE_SIZE is a pretty fundamental thing? Anybody know the
policy here?
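
To illustrate why the export is needed at all, here is a minimal user-space
sketch, not kernel code. It assumes (as the patch suggests) that PAGE_SHIFT
expands to the ptg_page_shift variable when boot-time selection is enabled.
The local definition of ptg_page_shift and the demo_pages_needed() helper are
hypothetical stand-ins for illustration only.

```c
#include <assert.h>

/*
 * Stand-in for the kernel variable. In the series it is defined in
 * arch/arm64/mm/pgtable-geometry.c and written by early boot code;
 * here it is a plain local so the sketch builds in user space.
 */
static int ptg_page_shift = 12;

/* Assumed expansion: these become a memory load rather than a literal. */
#define PAGE_SHIFT	ptg_page_shift
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Hypothetical "module" helper: number of pages needed to hold len bytes. */
static unsigned long demo_pages_needed(unsigned long len)
{
	return (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

Every use of PAGE_SIZE in such a helper references the variable, so a module
built this way carries an undefined reference to ptg_page_shift that the
module loader can only resolve if the symbol is exported, whichever of
EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() ends up being chosen.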
Thanks,
Ryan
>
> Michael
>
>> +int ptg_pud_shift __read_mostly;
>> +int ptg_p4d_shift __read_mostly;
>> +int ptg_pgdir_shift __read_mostly;
>> +int ptg_cont_pte_shift __read_mostly;
>> +int ptg_cont_pmd_shift __read_mostly;
>> +int ptg_pgtable_levels __read_mostly;
>> diff --git a/drivers/firmware/efi/libstub/arm64.c b/drivers/firmware/efi/libstub/arm64.c
>> index e57cd3de0a00f..8db9dba7d5423 100644
>> --- a/drivers/firmware/efi/libstub/arm64.c
>> +++ b/drivers/firmware/efi/libstub/arm64.c
>> @@ -68,7 +68,8 @@ efi_status_t check_platform_features(void)
>> efi_novamap = true;
>>
>> /* UEFI mandates support for 4 KB granularity, no need to check */
>> - if (IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>> + if (IS_ENABLED(CONFIG_ARM64_4K_PAGES) ||
>> + IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE))
>> return EFI_SUCCESS;
>>
>> tg = (read_cpuid(ID_AA64MMFR0_EL1) >> ID_AA64MMFR0_EL1_TGRAN_SHIFT) & 0xf;
>> --
>> 2.43.0
>>
>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-15 18:38 ` Michael Kelley
@ 2024-10-16 8:23 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 8:23 UTC (permalink / raw)
To: Michael Kelley, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Dexuan Cui, Boqun Feng
Cc: linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
On 15/10/2024 19:38, Michael Kelley wrote:
> From: Ryan Roberts <ryan.roberts@arm.com> Sent: Monday, October 14, 2024 3:55 AM
>>
>> Hi All,
>>
>> Patch bomb incoming... This covers many subsystems, so I've included a core set
>> of people on the full series and additionally included maintainers on relevant
>> patches. I haven't included those maintainers on this cover letter since the
>> numbers were far too big for it to work. But I've included a link to this cover
>> letter on each patch, so they can hopefully find their way here. For follow up
>> submissions I'll break it up by subsystem, but for now thought it was important
>> to show the full picture.
>>
>> This RFC series implements support for boot-time page size selection within the
>> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
>> size has been selected at compile-time, meaning the size is baked into a given
>> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
>> starts to present a problem for distributions. Boot-time page size selection
>> enables the creation of a single kernel image, which can be told which page size
>> to use on the kernel command line.
>>
>> Why is having an image-per-page size problematic?
>> =================================================
>>
>> Many traditional distros are now supporting both 4K and 64K. And this means
>> managing 2 kernel packages, along with drivers for each. For some, it means
>> multiple installer flavours and multiple ISOs. All of this adds up to a
>> less-than-ideal level of complexity. Additionally, Android now supports 4K and
>> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
>> painful, and the extra flash space required for both kernel images and the
>> duplicated modules has been problematic. Boot-time page size selection solves
>> all of this.
>>
>> Additionally, in starting to think about the longer term deployment story for
>> D128 page tables, which Arm architecture now supports, a lot of the same
>> problems need to be solved, so this work sets us up nicely for that.
>>
>> So what's the down side?
>> ========================
>>
>> Well nothing's free; Various static allocations in the kernel image must be
>> sized for the worst case (largest supported page size), so image size is in line
>> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
>> is a slight increase to the image size. But I expect that problem goes away if
>> you're compressing the image - it's just some extra zeros. At boot-time, I expect
>> we could free the unused static storage once we know the page size - although
>> that would be a follow up enhancement.
>>
>> And then there is performance. Since PAGE_SIZE and friends are no longer
>> compile-time constants, we must look up their values and do arithmetic at
>> runtime instead of compile-time. My early perf testing suggests this is
>> imperceptible for real-world workloads, and only has small impact on
>> microbenchmarks - more on this below.
>
> [snip]
>
> This is pretty cool. :-) FWIW, I've built a kernel with this patch set, and
> have it running in a RHEL 8.7 guest on Hyper-V in the Azure public cloud.
> Ran with 4K, 16K, and 64K page sizes, and the basic smoke tests work.
That's great to hear - thanks for taking the time to test!
>
> The Hyper-V specific code in the Linux kernel needed a few tweaks to
> deal with PAGE_SIZE and friends no longer being constant, but it's nothing
> significant. Getting the kernel built in the first place was a little harder
> because my .config file is fairly generic with a lot of device drivers and file
> system code that aren't really needed for Hyper-V guests. I had to
> weed out the ones that won't build. My RHEL 8.7 install uses LVM, so I
> hacked the 'dm' code to make it compile and run.
Yeah, getting all this sorted is going to be the long tail. I feel I've had
enough positive response to this RFC that I should probably just get on and
start that work to get a real feel for how much of it there is going to be.
>
> As this work moves forward, I can supply the necessary patches for
> the Hyper-V support. Let me know if you want to include them in the
> main patch set.
Great! If you are happy to forward them to me, I'll include them in future
versions of the series (or more likely, serieses).
Thanks,
Ryan
>
> I've added a couple of Microsoft's Linux people to this email's addressee
> list so they are aware of what's going on.
>
> Michael Kelley
* Re: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-16 8:14 ` Ryan Roberts
@ 2024-10-16 14:21 ` Zi Yan
2024-10-16 14:31 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Zi Yan @ 2024-10-16 14:21 UTC (permalink / raw)
To: Ryan Roberts, David Hildenbrand
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Greg Marsden, Ivan Ivanov, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Oliver Upton,
Will Deacon, kvmarm, linux-arm-kernel, linux-efi, linux-kernel,
linux-mm
On 16 Oct 2024, at 4:14, Ryan Roberts wrote:
> On 15/10/2024 18:42, Zi Yan wrote:
>> On 14 Oct 2024, at 6:59, Ryan Roberts wrote:
>>
>>> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
>>> selected instead of a page size. When selected, the resulting kernel's
>>> page size can be configured at boot via the command line.
>>>
>>> For now, boot-time page size kernels are limited to 48-bit VA, since
>>> more work is required to support LPA2. Additionally MMAP_RND_BITS and
>>> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
>>> work could be implemented to be able to configure these at boot time for
>>> optimal page size-specific values.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>
>> <snip>
>>
>>>
>>> @@ -1588,9 +1601,10 @@ config XEN
>>> # 4K | 27 | 12 | 15 | 10 |
>>> # 16K | 27 | 14 | 13 | 11 |
>>> # 64K | 29 | 16 | 13 | 13 |
>>> +# BOOT| 29 | 16 (max) | 13 | 13 |
>>> config ARCH_FORCE_MAX_ORDER
>>> int
>>> - default "13" if ARM64_64K_PAGES
>>> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>>> default "11" if ARM64_16K_PAGES
>>> default "10"
>>> help
>>
>> So boot-time page size kernel always has the highest MAX_PAGE_ORDER, which
>> means the section size increases for 4KB and 16KB page sizes. Any downside
>> for this?
>
> I guess there is some cost to the buddy when MAX_PAGE_ORDER is larger than it
> needs to be - I expect you can explain those details much better than I can. I'm
> just setting it to the worst case for now as it was the easiest solution for the
> initial series.
From my past experience (around 5.19), increasing MAX_PAGE_ORDER has a very
small perf impact when measured with vm-scalability [1] (I made MAX_PAGE_ORDER
a boot time variable and increased it to 20 for my 1GB THP experiments).
Larger MAX_PAGE_ORDER means larger section size and larger mem_block size,
so the granularity of memory hotplug also increases. In this case:
1. ARM64 4KB: mem_block size increases from 4MB to 32MB,
2. ARM64 16KB: mem_block size increases from 32MB to 128MB,
3. ARM64 64KB: mem_block size keeps the same, 512MB.
DavidH was concerned about large mem_block size before. He might have some
opinion on this.
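For reference, the sizes in the list above are consistent with
PAGE_SIZE << MAX_PAGE_ORDER. Below is a minimal sketch of that arithmetic.
The helper names are illustrative only and are not kernel APIs, and it
assumes the block size really does track the largest buddy order as
described above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Size in bytes of the largest buddy block: one page shifted left by the
 * maximum order. page_shift is log2(page size), e.g. 12 for 4K pages.
 */
static inline uint64_t max_buddy_block_bytes(unsigned int page_shift,
					     unsigned int max_page_order)
{
	assert(page_shift + max_page_order < 64);
	return UINT64_C(1) << (page_shift + max_page_order);
}

/* Convenience helper: convert a byte count to MiB. */
static inline uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes >> 20;
}
```

With the Kconfig defaults quoted earlier (order 10 for 4K, 11 for 16K, 13 for
64K), forcing order 13 everywhere gives the 4MB to 32MB and 32MB to 128MB
jumps, and leaves the 64K case unchanged.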
>
>>
>> Is there any plan (not in this patchset) to support boot-time MAX_PAGE_ORDER
>> to keep section size the same?
>
> Yes absolutely. I should have documented MAX_PAGE_ORDER in the commit log along
> with the comments for MMAP_RND_BITS and SECTION_SIZE_BITS - that was an
> oversight and I'll fix it in the next version. I plan to look at making all 3
> values boot-time configurable in future (although I have no idea at this point
> how involved that will be).
In [1], I tried to make MAX_PAGE_ORDER a boot time variable,
but for a different purpose, allocating 1GB THP. I needed some additional
changes in my patchset, since I assumed MAX_PAGE_ORDER can go beyond
section size, which makes things a little bit complicated. For your case,
I assume you are not planning to make MAX_PAGE_ORDER bigger than section
size, then I should be able to revive my patchset with fewer changes.
In terms of SECTION_SIZE_BITS, why do you want to make it a boot time variable?
Since it decides the minimum memory hotplug size, I assume we should keep
it unchanged or as small as possible to make virtual machine memory usage
efficient.
[1] https://lore.kernel.org/linux-mm/20220811231643.1012912-1-zi.yan@sent.com/
Best Regards,
Yan, Zi
* Re: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-16 14:21 ` Zi Yan
@ 2024-10-16 14:31 ` Ryan Roberts
2024-10-16 14:35 ` Zi Yan
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:31 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Greg Marsden, Ivan Ivanov, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Oliver Upton,
Will Deacon, kvmarm, linux-arm-kernel, linux-efi, linux-kernel,
linux-mm
On 16/10/2024 15:21, Zi Yan wrote:
> On 16 Oct 2024, at 4:14, Ryan Roberts wrote:
>
>> On 15/10/2024 18:42, Zi Yan wrote:
>>> On 14 Oct 2024, at 6:59, Ryan Roberts wrote:
>>>
>>>> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
>>>> selected instead of a page size. When selected, the resulting kernel's
>>>> page size can be configured at boot via the command line.
>>>>
>>>> For now, boot-time page size kernels are limited to 48-bit VA, since
>>>> more work is required to support LPA2. Additionally MMAP_RND_BITS and
>>>> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
>>>> work could be implemented to be able to configure these at boot time for
>>>> optimal page size-specific values.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>
>>> <snip>
>>>
>>>>
>>>> @@ -1588,9 +1601,10 @@ config XEN
>>>> # 4K | 27 | 12 | 15 | 10 |
>>>> # 16K | 27 | 14 | 13 | 11 |
>>>> # 64K | 29 | 16 | 13 | 13 |
>>>> +# BOOT| 29 | 16 (max) | 13 | 13 |
>>>> config ARCH_FORCE_MAX_ORDER
>>>> int
>>>> - default "13" if ARM64_64K_PAGES
>>>> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>>>> default "11" if ARM64_16K_PAGES
>>>> default "10"
>>>> help
>>>
>>> So boot-time page size kernel always has the highest MAX_PAGE_ORDER, which
>>> means the section size increases for 4KB and 16KB page sizes. Any downside
>>> for this?
>>
>> I guess there is some cost to the buddy when MAX_PAGE_ORDER is larger than it
>> needs to be - I expect you can explain those details much better than I can. I'm
>> just setting it to the worst case for now as it was the easiest solution for the
>> initial series.
>
> From my past experience (around 5.19), increasing MAX_PAGE_ORDER has a very
> small perf impact when measured with vm-scalability [1] (I made MAX_PAGE_ORDER
> a boot time variable and increased it to 20 for my 1GB THP experiments).
>
> Larger MAX_PAGE_ORDER means larger section size and larger mem_block size,
> so the granularity of memory hotplug also increases. In this case:
> 1. ARM64 4KB: mem_block size increases from 4MB to 32MB,
> 2. ARM64 16KB: mem_block size increases from 32MB to 128MB,
> 3. ARM64 64KB: mem_block size keeps the same, 512MB.
>
> DavidH was concerned about large mem_block size before. He might have some
> opinion on this.
>
>
>>
>>>
>>> Is there any plan (not in this patchset) to support boot-time MAX_PAGE_ORDER
>>> to keep section size the same?
>>
>> Yes absolutely. I should have documented MAX_PAGE_ORDER in the commit log along
>> with the comments for MMAP_RND_BITS and SECTION_SIZE_BITS - that was an
>> oversight and I'll fix it in the next version. I plan to look at making all 3
>> values boot-time configurable in future (although I have no idea at this point
>> how involved that will be).
>
> In [1], I tried to make MAX_PAGE_ORDER a boot time variable,
> but for a different purpose, allocating 1GB THP. I needed some additional
> changes in my patchset, since I assumed MAX_PAGE_ORDER can go beyond
> section size, which makes things a little bit complicated. For your case,
> I assume you are not planning to make MAX_PAGE_ORDER bigger than section
> size, then I should be able to revive my patchset with fewer changes.
Yes correct; no need to make it bigger than section size. Thanks for the patch,
I'll certainly use it as a base when I get there or if you're interested in
doing it then even better ;-)
But I don't think this is urgent. For now, boot-time page size is a new Kconfig
for arm64. It still supports the compile-time page size options. So having a
larger MAX_PAGE_ORDER than strictly necessary doesn't represent a regression,
just a limitation of boot-time page size config - something we can optimize later.
>
> In terms of SECTION_SIZE_BITS, why do you want to make it a boot time variable?
> Since it decides the minimum memory hotplug size, I assume we should keep
> it unchanged or as small as possible to make virtual machine memory usage
> efficient.
When I say "boot-time variable" I just mean something that the arch can
configure at boot based on the selected page size. I'm not proposing to allow
the user to set it via the command line. That means we need to rid the code of
any assumptions that it is a compile-time constant (e.g. C preprocessor usage of
the value, etc). The same goes for MAX_PAGE_ORDER and the MMAP_RND_BITS stuff.
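To illustrate what "configured by the arch at boot" could look like, here is a
rough sketch only. The struct and function names are hypothetical, and the
table rows are taken from the Kconfig comment quoted above on the assumption
that its columns are section size bits, page shift and default max order; the
header row was snipped, so that reading is a guess.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container; the kernel has no struct of this name. */
struct geometry {
	unsigned int page_shift;
	unsigned int section_size_bits;
	unsigned int max_page_order;
};

/* One row per supported page size (column meaning assumed, see above). */
static const struct geometry geometry_table[] = {
	{ 12, 27, 10 },	/* 4K  */
	{ 14, 27, 11 },	/* 16K */
	{ 16, 29, 13 },	/* 64K */
};

/* Values the rest of the code would read instead of compile-time macros. */
static struct geometry boot_geometry;

/* Pick the row for the page shift chosen at boot; -1 if it is unsupported. */
static int select_geometry(unsigned int page_shift)
{
	size_t n = sizeof(geometry_table) / sizeof(geometry_table[0]);

	for (size_t i = 0; i < n; i++) {
		if (geometry_table[i].page_shift == page_shift) {
			boot_geometry = geometry_table[i];
			return 0;
		}
	}
	return -1;
}
```

The point is that consumers read boot_geometry at run time, so nothing may
depend on these values in #if expressions or static initializers.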
>
>
> [1] https://lore.kernel.org/linux-mm/20220811231643.1012912-1-zi.yan@sent.com/
>
>
> Best Regards,
> Yan, Zi
* Re: [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection
2024-10-16 14:31 ` Ryan Roberts
@ 2024-10-16 14:35 ` Zi Yan
0 siblings, 0 replies; 196+ messages in thread
From: Zi Yan @ 2024-10-16 14:35 UTC (permalink / raw)
To: Ryan Roberts
Cc: David Hildenbrand, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Oliver Upton, Will Deacon, kvmarm,
linux-arm-kernel, linux-efi, linux-kernel, linux-mm
On 16 Oct 2024, at 10:31, Ryan Roberts wrote:
> On 16/10/2024 15:21, Zi Yan wrote:
>> On 16 Oct 2024, at 4:14, Ryan Roberts wrote:
>>
>>> On 15/10/2024 18:42, Zi Yan wrote:
>>>> On 14 Oct 2024, at 6:59, Ryan Roberts wrote:
>>>>
>>>>> Introduce a new Kconfig, ARM64_BOOT_TIME_PAGE_SIZE, which can be
>>>>> selected instead of a page size. When selected, the resulting kernel's
>>>>> page size can be configured at boot via the command line.
>>>>>
>>>>> For now, boot-time page size kernels are limited to 48-bit VA, since
>>>>> more work is required to support LPA2. Additionally MMAP_RND_BITS and
>>>>> SECTION_SIZE_BITS are configured for the worst case (64K pages). Future
>>>>> work could be implemented to be able to configure these at boot time for
>>>>> optimal page size-specific values.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>
>>>> <snip>
>>>>
>>>>>
>>>>> @@ -1588,9 +1601,10 @@ config XEN
>>>>> # 4K | 27 | 12 | 15 | 10 |
>>>>> # 16K | 27 | 14 | 13 | 11 |
>>>>> # 64K | 29 | 16 | 13 | 13 |
>>>>> +# BOOT| 29 | 16 (max) | 13 | 13 |
>>>>> config ARCH_FORCE_MAX_ORDER
>>>>> int
>>>>> - default "13" if ARM64_64K_PAGES
>>>>> + default "13" if ARM64_64K_PAGES || ARM64_BOOT_TIME_PAGE_SIZE
>>>>> default "11" if ARM64_16K_PAGES
>>>>> default "10"
>>>>> help
>>>>
>>>> So boot-time page size kernel always has the highest MAX_PAGE_ORDER, which
>>>> means the section size increases for 4KB and 16KB page sizes. Any downside
>>>> for this?
>>>
>>> I guess there is some cost to the buddy when MAX_PAGE_ORDER is larger than it
>>> needs to be - I expect you can explain those details much better than I can. I'm
>>> just setting it to the worst case for now as it was the easiest solution for the
>>> initial series.
>>
>> From my past experience (around 5.19), increasing MAX_PAGE_ORDER has a very
>> small perf impact when measured with vm-scalability [1] (I made MAX_PAGE_ORDER
>> a boot time variable and increased it to 20 for my 1GB THP experiments).
>>
>> Larger MAX_PAGE_ORDER means larger section size and larger mem_block size,
>> so the granularity of memory hotplug also increases. In this case:
>> 1. ARM64 4KB: mem_block size increases from 4MB to 32MB,
>> 2. ARM64 16KB: mem_block size increases from 32MB to 128MB,
>> 3. ARM64 64KB: mem_block size keeps the same, 512MB.
>>
>> DavidH was concerned about large mem_block size before. He might have some
>> opinion on this.
>>
>>
>>>
>>>>
>>>> Is there any plan (not in this patchset) to support boot-time MAX_PAGE_ORDER
>>>> to keep section size the same?
>>>
>>> Yes absolutely. I should have documented MAX_PAGE_ORDER in the commit log along
>>> with the comments for MMAP_RND_BITS and SECTION_SIZE_BITS - that was an
>>> oversight and I'll fix it in the next version. I plan to look at making all 3
>>> values boot-time configurable in future (although I have no idea at this point
>>> how involved that will be).
>>
>> In [1], I tried to make MAX_PAGE_ORDER a boot time variable,
>> but for a different purpose, allocating 1GB THP. I needed some additional
>> changes in my patchset, since I assumed MAX_PAGE_ORDER can go beyond
>> section size, which makes things a little bit complicated. For your case,
>> I assume you are not planning to make MAX_PAGE_ORDER bigger than section
>> size, then I should be able to revive my patchset with fewer changes.
>
> Yes correct; no need to make it bigger than section size. Thanks for the patch,
> I'll certainly use it as a base when I get there or if you're interested in
> doing it then even better ;-)
>
> But I don't think this is urgent. For now, boot-time page size is a new Kconfig
> for arm64. It still supports the compile-time page size options. So having a
> larger MAX_PAGE_ORDER than strictly necessary doesn't represent a regression,
> just a limitation of boot-time page size config - something we can optimize later.
Sure. I will revisit my boot time MAX_PAGE_ORDER patchset when this patchset
settles. Glad to help. :)
>
>>
>> In terms of SECTION_SIZE_BITS, why do you want to make it a boot time variable?
>> Since it decides the minimum memory hotplug size, I assume we should keep
>> it unchanged or as small as possible to make virtual machine memory usage
>> efficient.
>
> When I say "boot-time variable" I just mean something that the arch can
> configure at boot based on the selected page size. I'm not proposing to allow
> the user to set it via the command line. That means we need to rid the code of
> any assumptions that it is a compile-time constant (e.g. C preprocessor usage of
> the value, etc). The same goes for MAX_PAGE_ORDER and the MMAP_RND_BITS stuff.
>
Got it.
>> [1] https://lore.kernel.org/linux-mm/20220811231643.1012912-1-zi.yan@sent.com/
Best Regards,
Yan, Zi
* Re: [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (56 preceding siblings ...)
2024-10-14 13:54 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting " Pingfan Liu
@ 2024-10-16 14:36 ` Ryan Roberts
2024-10-30 8:45 ` Ryan Roberts
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:36 UTC (permalink / raw)
To: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86, Albert Ou,
Alexander Gordeev, Brian Cain, Guo Ren, Heiko Carstens,
Michael Ellerman, Michal Simek, Palmer Dabbelt, Paul Walmsley,
Vasily Gorbik, Vineet Gupta
Cc: linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
linux-openrisc, linux-parisc, linux-riscv, linux-s390, linux-sh,
linux-snps-arc, linux-um, linuxppc-dev, loongarch, sparclinux
+ Albert Ou, Alexander Gordeev, Brian Cain, Guo Ren, Heiko Carstens, Michael
Ellerman, Michal Simek, Palmer Dabbelt, Paul Walmsley, Vasily Gorbik, Vineet Gupta.
It was rather tricky to get the recipients right for this series, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> arm64 can support multiple base page sizes. Instead of selecting a page
> size at compile time, as is done today, we will make it possible to
> select the desired page size on the command line.
>
> In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
> (as well as a number of other macros related to or derived from
> PAGE_SHIFT, but I'm not worrying about those yet), are no longer
> compile-time constants. So the code base needs to cope with that.
>
> As a first step, introduce MIN and MAX variants of these macros, which
> express the range of possible page sizes. These are always compile-time
> constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
> were previously used where a compile-time constant is required.
> (Subsequent patches will do that conversion work). When the arch/build
> doesn't support boot-time page size selection, the MIN and MAX variants
> are equal and everything resolves as it did previously.
>
> Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
> global variable definitions so that for boot-time page size selection
> builds, the variable being wrapped is initialized at boot-time, instead
> of compile-time. This is done by defining a function to do the
> assignment, which has the "constructor" attribute. Constructor is
> preferred over initcall, because when compiling a module, the module is
> limited to a single initcall but constructors are unlimited. For
> built-in code, constructors are now called earlier to guarantee that
> the variables are initialized by the time they are used. Any arch that
> wants to enable boot-time page size selection will need to select
> CONFIG_CONSTRUCTORS.
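As a userspace illustration of the constructor mechanism described above (a
sketch, not the kernel macro): GCC and Clang run functions marked with the
"constructor" attribute before main(), which is how a global can be given a
value that is only known at startup. runtime_page_size() and cache_bytes are
made-up names, and the 16K value is an arbitrary assumption.

```c
#include <assert.h>

/* Stand-in for a page size that is only known at startup (assumed 16K). */
static unsigned long runtime_page_size(void)
{
	return 1UL << 14;
}

/*
 * A global whose value derives from the page size. It cannot use a static
 * initializer because the value is not a compile-time constant.
 */
static unsigned long cache_bytes;

/*
 * Runs before main() on GCC/Clang, playing the role of the generated
 * __<name>_init() function in the macro described above.
 */
static void __attribute__((constructor)) cache_bytes_init(void)
{
	cache_bytes = 4 * runtime_page_size();
}
```

In the kernel, built-in constructors do not run automatically; they are
invoked by do_ctors(), which is why the patch moves that call earlier in
start_kernel().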
>
> These new macros need to be available anywhere PAGE_SHIFT and friends
> are available. Those are defined via asm/page.h (although some arches
> have a sub-include that defines them). Unfortunately there is no
> reliable asm-generic header we can easily piggy-back on, so let's define
> a new one, pgtable-geometry.h, which we include near where each arch
> defines PAGE_SHIFT. Ugh.
>
> -------
>
> Most of the problems that need to be solved over the next few patches
> fall into these broad categories, which are all solved with the help of
> these new macros:
>
> 1. Assignment of values derived from PAGE_SIZE in global variables
>
> For boot-time page size builds, we must defer the initialization of
> these variables until boot-time, when the page size is known. See
> DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
>
> 2. Define static storage in units related to PAGE_SIZE
>
> This static storage will be defined according to PAGE_SIZE_MAX.
>
> 3. Define size of struct so that it is related to PAGE_SIZE
>
> The struct often contains an array that is sized to fill the page. In
> this case, use a flexible array with dynamic allocation. In other
> cases, the struct fits exactly over a page, which is a header (e.g.
> swap file header). In this case, remove the padding, and manually
> determine the struct pointer within the page.
>
> 4. BUILD_BUG_ON() with values derived from PAGE_SIZE
>
> In most cases, we can change these to compare against the appropriate
> limit (either MIN or MAX). In other cases, we must change these to
> run-time BUG_ON().
>
> 5. Ensure page alignment of static data structures
>
> Align instead to PAGE_SIZE_MAX.
>
> 6. #ifdeffery based on PAGE_SIZE
>
> Often these can be changed to C code constructs, e.g. a macro that
> returns a different value depending on page size can be changed to use
> the ternary operator and the compiler will dead code strip it for the
> compile-time constant case and runtime evaluate it for the non-const
> case. Or #if/#else/#endif within a function can be converted to C
> if/else blocks, which are also dead code stripped for the const case.
> Sometimes we can change the C preprocessor logic to use the
> appropriate MIN/MAX limit.
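A few of the categories above can be sketched in plain C as follows. This is
an illustration under stated assumptions rather than code from the series: the
MIN/MAX values assume the arm64 4K to 64K range, page_shift stands in for the
boot-selected value, and demo_header, DEMO_BATCH and scratch_page() are
invented names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed arm64 range: 4K (shift 12) up to 64K (shift 16). */
#define PAGE_SHIFT_MIN	12
#define PAGE_SHIFT_MAX	16
#define PAGE_SIZE_MIN	(1UL << PAGE_SHIFT_MIN)
#define PAGE_SIZE_MAX	(1UL << PAGE_SHIFT_MAX)

/* Stand-in for the boot-selected value; no longer a compile-time constant. */
static unsigned int page_shift = PAGE_SHIFT_MIN;
#define PAGE_SIZE	(1UL << page_shift)

/* Category 2: static storage is sized for the worst case. */
static unsigned char scratch[PAGE_SIZE_MAX];

/* Category 4: keep the check at compile time by comparing against a limit. */
struct demo_header {
	char magic[16];
};
_Static_assert(sizeof(struct demo_header) <= PAGE_SIZE_MIN,
	       "header must fit in the smallest page");

/*
 * Category 6: formerly "#if PAGE_SHIFT == 16 ... #else ... #endif".
 * With a constant page_shift the compiler can still fold this away.
 */
#define DEMO_BATCH	(page_shift == PAGE_SHIFT_MAX ? 1U : 4U)

/* Hand out the scratch page; only the first PAGE_SIZE bytes are in use. */
static unsigned char *scratch_page(size_t *len)
{
	/* Category 4 fallback: a run-time check where no limit applies. */
	assert(PAGE_SIZE <= sizeof(scratch));
	*len = PAGE_SIZE;
	return scratch;
}
```

The cost of sizing for the worst case is visible here: with 4K pages selected,
60K of the scratch buffer goes unused.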
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> arch/alpha/include/asm/page.h | 1 +
> arch/arc/include/asm/page.h | 1 +
> arch/arm/include/asm/page.h | 1 +
> arch/arm64/include/asm/page-def.h | 2 +
> arch/csky/include/asm/page.h | 3 ++
> arch/hexagon/include/asm/page.h | 2 +
> arch/loongarch/include/asm/page.h | 2 +
> arch/m68k/include/asm/page.h | 1 +
> arch/microblaze/include/asm/page.h | 1 +
> arch/mips/include/asm/page.h | 1 +
> arch/nios2/include/asm/page.h | 2 +
> arch/openrisc/include/asm/page.h | 1 +
> arch/parisc/include/asm/page.h | 1 +
> arch/powerpc/include/asm/page.h | 2 +
> arch/riscv/include/asm/page.h | 1 +
> arch/s390/include/asm/page.h | 1 +
> arch/sh/include/asm/page.h | 1 +
> arch/sparc/include/asm/page.h | 3 ++
> arch/um/include/asm/page.h | 2 +
> arch/x86/include/asm/page_types.h | 2 +
> arch/xtensa/include/asm/page.h | 1 +
> include/asm-generic/pgtable-geometry.h | 71 ++++++++++++++++++++++++++
> init/main.c | 5 +-
> 23 files changed, 107 insertions(+), 1 deletion(-)
> create mode 100644 include/asm-generic/pgtable-geometry.h
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 70419e6be1a35..d0096fb5521b8 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ALPHA_PAGE_H */
> diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
> index def0dfb95b436..8d56549db7a33 100644
> --- a/arch/arc/include/asm/page.h
> +++ b/arch/arc/include/asm/page.h
> @@ -6,6 +6,7 @@
> #define __ASM_ARC_PAGE_H
>
> #include <uapi/asm/page.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #ifdef CONFIG_ARC_HAS_PAE40
>
> diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
> index 62af9f7f9e963..417aa8533c718 100644
> --- a/arch/arm/include/asm/page.h
> +++ b/arch/arm/include/asm/page.h
> @@ -191,5 +191,6 @@ extern int pfn_valid(unsigned long);
>
> #include <asm-generic/getorder.h>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif
> diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
> index 792e9fe881dcf..d69971cf49cd2 100644
> --- a/arch/arm64/include/asm/page-def.h
> +++ b/arch/arm64/include/asm/page-def.h
> @@ -15,4 +15,6 @@
> #define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> #define PAGE_MASK (~(PAGE_SIZE-1))
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __ASM_PAGE_DEF_H */
> diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
> index 0ca6c408c07f2..95173d57adc8b 100644
> --- a/arch/csky/include/asm/page.h
> +++ b/arch/csky/include/asm/page.h
> @@ -92,4 +92,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
> #include <asm-generic/getorder.h>
>
> #endif /* !__ASSEMBLY__ */
> +
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __ASM_CSKY_PAGE_H */
> diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
> index 8a6af57274c2d..ba7ad5231695f 100644
> --- a/arch/hexagon/include/asm/page.h
> +++ b/arch/hexagon/include/asm/page.h
> @@ -139,4 +139,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
> #endif /* ifdef __ASSEMBLY__ */
> #endif /* ifdef __KERNEL__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif
> diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
> index e85df33f11c77..9862e8fb047a6 100644
> --- a/arch/loongarch/include/asm/page.h
> +++ b/arch/loongarch/include/asm/page.h
> @@ -123,4 +123,6 @@ extern int __virt_addr_valid(volatile void *kaddr);
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_PAGE_H */
> diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
> index 8cfb84b499751..4df4681b02194 100644
> --- a/arch/m68k/include/asm/page.h
> +++ b/arch/m68k/include/asm/page.h
> @@ -60,5 +60,6 @@ extern unsigned long _ramend;
>
> #include <asm-generic/getorder.h>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _M68K_PAGE_H */
> diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
> index 8810f4f1c3b02..abc23c3d743bd 100644
> --- a/arch/microblaze/include/asm/page.h
> +++ b/arch/microblaze/include/asm/page.h
> @@ -142,5 +142,6 @@ static inline const void *pfn_to_virt(unsigned long pfn)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_MICROBLAZE_PAGE_H */
> diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
> index 4609cb0326cf3..3d91021538f02 100644
> --- a/arch/mips/include/asm/page.h
> +++ b/arch/mips/include/asm/page.h
> @@ -227,5 +227,6 @@ static inline unsigned long kaslr_offset(void)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_PAGE_H */
> diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
> index 0722f88e63cc7..2e5f93beb42b7 100644
> --- a/arch/nios2/include/asm/page.h
> +++ b/arch/nios2/include/asm/page.h
> @@ -97,4 +97,6 @@ extern struct page *mem_map;
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_NIOS2_PAGE_H */
> diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
> index 1d5913f67c312..a0da2a9842241 100644
> --- a/arch/openrisc/include/asm/page.h
> +++ b/arch/openrisc/include/asm/page.h
> @@ -88,5 +88,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* __ASM_OPENRISC_PAGE_H */
> diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
> index 4bea2e95798f0..2a75496237c09 100644
> --- a/arch/parisc/include/asm/page.h
> +++ b/arch/parisc/include/asm/page.h
> @@ -173,6 +173,7 @@ extern int npmem_ranges;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
> #include <asm/pdc.h>
>
> #define PAGE0 ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 83d0a4fc5f755..4601c115b6485 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -300,4 +300,6 @@ static inline unsigned long kaslr_offset(void)
> #include <asm-generic/memory_model.h>
> #endif /* __ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_POWERPC_PAGE_H */
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 7ede2111c5917..e5af7579e45bf 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -204,5 +204,6 @@ static __always_inline void *pfn_to_kaddr(unsigned long pfn)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_RISCV_PAGE_H */
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index 16e4caa931f1f..42157e7690a77 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -275,6 +275,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #define AMODE31_SIZE (3 * PAGE_SIZE)
>
> diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
> index f780b467e75d7..09533d46ef033 100644
> --- a/arch/sh/include/asm/page.h
> +++ b/arch/sh/include/asm/page.h
> @@ -162,5 +162,6 @@ typedef struct page *pgtable_t;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* __ASM_SH_PAGE_H */
> diff --git a/arch/sparc/include/asm/page.h b/arch/sparc/include/asm/page.h
> index 5e44cdf2a8f2b..4327fe2bfa010 100644
> --- a/arch/sparc/include/asm/page.h
> +++ b/arch/sparc/include/asm/page.h
> @@ -9,4 +9,7 @@
> #else
> #include <asm/page_32.h>
> #endif
> +
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif
> diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
> index 9ef9a8aedfa66..f26011808f514 100644
> --- a/arch/um/include/asm/page.h
> +++ b/arch/um/include/asm/page.h
> @@ -119,4 +119,6 @@ extern unsigned long uml_physmem;
> #define __HAVE_ARCH_GATE_AREA 1
> #endif
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __UM_PAGE_H */
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index 52f1b4ff0cc16..6d2381342047f 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -71,4 +71,6 @@ extern void initmem_init(void);
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_X86_PAGE_DEFS_H */
> diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
> index 4db56ef052d22..86952cb32af23 100644
> --- a/arch/xtensa/include/asm/page.h
> +++ b/arch/xtensa/include/asm/page.h
> @@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
> #endif /* __ASSEMBLY__ */
>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
> #endif /* _XTENSA_PAGE_H */
> diff --git a/include/asm-generic/pgtable-geometry.h b/include/asm-generic/pgtable-geometry.h
> new file mode 100644
> index 0000000000000..358e729a6ac37
> --- /dev/null
> +++ b/include/asm-generic/pgtable-geometry.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef ASM_GENERIC_PGTABLE_GEOMETRY_H
> +#define ASM_GENERIC_PGTABLE_GEOMETRY_H
> +
> +#if defined(PAGE_SHIFT_MAX) && defined(PAGE_SIZE_MAX) && defined(PAGE_MASK_MAX) && \
> + defined(PAGE_SHIFT_MIN) && defined(PAGE_SIZE_MIN) && defined(PAGE_MASK_MIN)
> +/* Arch supports boot-time page size selection. */
> +#elif defined(PAGE_SHIFT_MAX) || defined(PAGE_SIZE_MAX) || defined(PAGE_MASK_MAX) || \
> + defined(PAGE_SHIFT_MIN) || defined(PAGE_SIZE_MIN) || defined(PAGE_MASK_MIN)
> +#error Arch must define all or none of the boot-time page size macros
> +#else
> +/* Arch does not support boot-time page size selection. */
> +#define PAGE_SHIFT_MIN PAGE_SHIFT
> +#define PAGE_SIZE_MIN PAGE_SIZE
> +#define PAGE_MASK_MIN PAGE_MASK
> +#define PAGE_SHIFT_MAX PAGE_SHIFT
> +#define PAGE_SIZE_MAX PAGE_SIZE
> +#define PAGE_MASK_MAX PAGE_MASK
> +#endif
> +
> +/*
> + * Define a global variable (scalar or struct), whose value is derived from
> + * PAGE_SIZE and friends. When PAGE_SIZE is a compile-time constant, the global
> + * variable is simply defined with the static value. When PAGE_SIZE is
> + * determined at boot-time, a pure initcall is registered and run during boot to
> + * initialize the variable.
> + *
> + * @type: Unqualified type. Do not include "const"; implied by macro variant.
> + * @name: Variable name.
> + * @...: Initialization value. May be scalar or initializer.
> + *
> + * "static" is declared by placing "static" before the macro.
> + *
> + * Example:
> + *
> + * struct my_struct {
> + * int a;
> + * char b;
> + * };
> + *
> + * static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct my_struct, my_variable, {
> + * .a = 10,
> + * .b = 'e',
> + * });
> + */
> +#if PAGE_SIZE_MIN != PAGE_SIZE_MAX
> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
> + type name attrib; \
> + static int __init __attribute__((constructor)) __##name##_init(void) \
> + { \
> + name = (type)__VA_ARGS__; \
> + return 0; \
> + }
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, __ro_after_init, __VA_ARGS__)
> +#else /* PAGE_SIZE_MIN == PAGE_SIZE_MAX */
> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
> + type name attrib = __VA_ARGS__; \
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(const type, name, , __VA_ARGS__)
> +#endif
> +
> +#endif /* ASM_GENERIC_PGTABLE_GEOMETRY_H */
> diff --git a/init/main.c b/init/main.c
> index 206acdde51f5a..ba1515eb20b9d 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -899,6 +899,8 @@ static void __init early_numa_node_init(void)
> #endif
> }
>
> +static __init void do_ctors(void);
> +
> asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
> void start_kernel(void)
> {
> @@ -910,6 +912,8 @@ void start_kernel(void)
> debug_objects_early_init();
> init_vmlinux_build_id();
>
> + do_ctors();
> +
> cgroup_init_early();
>
> local_irq_disable();
> @@ -1360,7 +1364,6 @@ static void __init do_basic_setup(void)
> cpuset_init_smp();
> driver_init();
> init_irq_proc();
> - do_ctors();
> do_initcalls();
> }
>
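For anyone skimming, a user-space sketch of the boot-time variant of the
macro quoted above may help. This is not the kernel code: sysconf() stands in
for a PAGE_SIZE that is only known at run time, the constructor is called by
the C runtime before main() rather than from do_ctors(), and every DEMO_/demo_
name is made up for illustration. It assumes GCC or Clang for
__attribute__((constructor)).

```c
#include <assert.h>
#include <unistd.h>

/* Assumption: stand-in for a PAGE_SIZE that is only known at run time. */
#define DEMO_PAGE_SIZE ((unsigned long)sysconf(_SC_PAGESIZE))

/*
 * Hypothetical analogue of the boot-time variant: define the variable
 * uninitialized, then assign it from a constructor. The trailing "extern"
 * redeclaration only exists to absorb the caller's semicolon.
 */
#define DEMO_DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
	type name; \
	static void __attribute__((constructor)) name##_init(void) \
	{ \
		name = (type)__VA_ARGS__; \
	} \
	extern type name

struct demo_struct {
	int a;
	char b;
};

/* Scalar: the cast binds to the first operand only, which is harmless here. */
DEMO_DEFINE_GLOBAL_PAGE_SIZE_VAR(unsigned long, demo_guard_gap,
				 256UL * DEMO_PAGE_SIZE);

/* Struct: "(type){ ... }" expands to a compound literal. */
static DEMO_DEFINE_GLOBAL_PAGE_SIZE_VAR(struct demo_struct, demo_variable, {
	.a = 10,
	.b = 'e',
});
```

By the time main() runs, demo_guard_gap holds 256 pages' worth of bytes for
whatever page size the system reports, which mirrors why do_ctors() has to
run early in start_kernel() in the quoted patch.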
* Re: [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
@ 2024-10-16 14:37 ` Ryan Roberts
2024-11-01 20:16 ` [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant Dave Kleikamp
2024-11-14 10:17 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Vlastimil Babka
2 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:37 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Christoph Lameter, David Hildenbrand, David Rientjes,
Greg Marsden, Ivan Ivanov, Johannes Weiner, Joonsoo Kim,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Michal Hocko, Miquel Raynal, Miroslav Benes, Pekka Enberg,
Richard Weinberger, Shakeel Butt, Vignesh Raghavendra,
Vlastimil Babka, Will Deacon, Matthew Wilcox
Cc: cgroups, linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
linux-mtd
+ Matthew Wilcox
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being a compile-time constant. The
> code is intended to be equivalent when compile-time page size is active.
>
> Refactor "struct vmap_block" to use a flexible array for used_map since
> VMAP_BBMAP_BITS is not a compile-time constant for the boot-time page
> size case.
>
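A user-space sketch of the flexible-array pattern described above, assuming
calloc() in place of kmalloc_node() and a hand-rolled size calculation in
place of struct_size() (which additionally saturates on overflow). The struct
and helper names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) \
	(((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

struct demo_block {
	unsigned long free, dirty;
	unsigned int cpu;
	unsigned long used_map[];	/* flexible array member: must be last */
};

/* Bytes needed for the struct plus a bitmap of nbits bits. */
static size_t demo_block_size(size_t nbits)
{
	return sizeof(struct demo_block) +
	       DEMO_BITS_TO_LONGS(nbits) * sizeof(unsigned long);
}

/* Allocate a zeroed block whose bitmap length is only known at run time. */
static struct demo_block *demo_block_alloc(size_t nbits)
{
	assert(nbits > 0);
	return calloc(1, demo_block_size(nbits));
}
```

The bitmap moves to the end of the struct because a flexible array member
must be the last member, which matches the reordering in the diff below.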
> Update various BUILD_BUG_ON() instances to check against appropriate
> page size limit.
>
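To illustrate what "appropriate limit" means here, a small sketch with
made-up constants (DEMO_PHYSMEM_BITS and DEMO_TYPE_SHIFT are placeholders,
not the arm64 values). The idea is that a build-time check has to hold for
every page size the image can boot with, so it is written against whichever
limit makes the checked quantity largest.

```c
#include <assert.h>

/* Hypothetical limits for an image that can boot with 4K, 16K or 64K pages. */
#define DEMO_PAGE_SHIFT_MIN 12
#define DEMO_PAGE_SHIFT_MAX 16

/* Placeholder field widths, chosen only for this example. */
#define DEMO_PHYSMEM_BITS 52
#define DEMO_TYPE_SHIFT 58

/* PFN bits grow as the page shift shrinks, so bound them with the MIN shift. */
#define DEMO_PFN_BITS_MAX (DEMO_PHYSMEM_BITS - DEMO_PAGE_SHIFT_MIN)
static_assert(DEMO_TYPE_SHIFT >= DEMO_PFN_BITS_MAX,
	      "PFN must fit below the type field for every page size");

/* Arrays indexed up to (page shift + 1) must be sized for the MAX shift. */
#define DEMO_BUCKET_SHIFT_HIGH_MAX (DEMO_PAGE_SHIFT_MAX + 1)
static void *demo_buckets[DEMO_BUCKET_SHIFT_HIGH_MAX + 1];

/* Run-time value for the page shift actually selected at boot. */
static unsigned int demo_pfn_bits(unsigned int page_shift)
{
	assert(page_shift >= DEMO_PAGE_SHIFT_MIN &&
	       page_shift <= DEMO_PAGE_SHIFT_MAX);
	return DEMO_PHYSMEM_BITS - page_shift;
}
```

Whether MIN or MAX is the right bound depends on the direction of each
inequality, so each BUILD_BUG_ON() has to be considered individually.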
> Re-define "union swap_header" so that it's no longer exactly page-sized.
> Instead define a flexible "magic" array with a define which tells the
> offset to where the magic signature begins.
>
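A user-space sketch of locating the signature by offset rather than by struct
member, assuming the page size is passed in at run time. The helpers are
hypothetical and use a plain byte pointer; they are not the kernel's
accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAGIC "SWAPSPACE2"
#define DEMO_MAGIC_LEN 10

/* Offset of the signature: the last 10 bytes of the first page. */
static size_t demo_magic_offset(size_t page_size)
{
	assert(page_size >= DEMO_MAGIC_LEN);
	return page_size - DEMO_MAGIC_LEN;
}

/* Zero everything before the signature, then write the signature itself. */
static void demo_write_header(unsigned char *page, size_t page_size)
{
	memset(page, 0, demo_magic_offset(page_size));
	memcpy(page + demo_magic_offset(page_size), DEMO_MAGIC, DEMO_MAGIC_LEN);
}

/* Return non-zero if the page carries the signature for this page size. */
static int demo_has_magic(const unsigned char *page, size_t page_size)
{
	return memcmp(DEMO_MAGIC, page + demo_magic_offset(page_size),
		      DEMO_MAGIC_LEN) == 0;
}
```

One consequence worth noting: a header written with one page size is not
recognized when read back with another, because the offset moves.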
> Consider page size limit in some CPP conditionals.
>
> Wrap global variables that are initialized with PAGE_SIZE derived values
> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
> deferred for boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/mtd/mtdswap.c | 4 ++--
> include/linux/mm.h | 2 +-
> include/linux/mm_types_task.h | 2 +-
> include/linux/mmzone.h | 3 ++-
> include/linux/slab.h | 7 ++++---
> include/linux/swap.h | 17 ++++++++++++-----
> include/linux/swapops.h | 6 +++++-
> mm/memcontrol.c | 2 +-
> mm/memory.c | 4 ++--
> mm/mmap.c | 2 +-
> mm/page-writeback.c | 2 +-
> mm/slub.c | 2 +-
> mm/sparse.c | 2 +-
> mm/swapfile.c | 2 +-
> mm/vmalloc.c | 7 ++++---
> 15 files changed, 39 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/mtd/mtdswap.c b/drivers/mtd/mtdswap.c
> index 680366616da24..7412a32708114 100644
> --- a/drivers/mtd/mtdswap.c
> +++ b/drivers/mtd/mtdswap.c
> @@ -1062,13 +1062,13 @@ static int mtdswap_auto_header(struct mtdswap_dev *d, char *buf)
> {
> union swap_header *hd = (union swap_header *)(buf);
>
> - memset(buf, 0, PAGE_SIZE - 10);
> + memset(buf, 0, SWAP_HEADER_MAGIC);
>
> hd->info.version = 1;
> hd->info.last_page = d->mbd_dev->size - 1;
> hd->info.nr_badpages = 0;
>
> - memcpy(buf + PAGE_SIZE - 10, "SWAPSPACE2", 10);
> + memcpy(buf + SWAP_HEADER_MAGIC, "SWAPSPACE2", 10);
>
> return 0;
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 09a840517c23a..49c2078354e6e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2927,7 +2927,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
> static inline spinlock_t *ptep_lockptr(struct mm_struct *mm, pte_t *pte)
> {
> BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHPTE));
> - BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
> + BUILD_BUG_ON(MAX_PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE_MAX);
> return ptlock_ptr(virt_to_ptdesc(pte));
> }
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a2f6179b672b8..c356897d5f41c 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -37,7 +37,7 @@ struct page;
>
> struct page_frag {
> struct page *page;
> -#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> +#if (BITS_PER_LONG > 32) || (PAGE_SIZE_MAX >= 65536)
> __u32 offset;
> __u32 size;
> #else
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1dc6248feb832..cd58034b82c81 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1744,6 +1744,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
> */
> #define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
> #define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
> +#define PFN_SECTION_SHIFT_MIN (SECTION_SIZE_BITS - PAGE_SHIFT_MAX)
>
> #define NR_MEM_SECTIONS (1UL << SECTIONS_SHIFT)
>
> @@ -1753,7 +1754,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
> #define SECTION_BLOCKFLAGS_BITS \
> ((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
>
> -#if (MAX_PAGE_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS
> +#if (MAX_PAGE_ORDER + PAGE_SHIFT_MAX) > SECTION_SIZE_BITS
> #error Allocator MAX_PAGE_ORDER exceeds SECTION_SIZE
> #endif
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index eb2bf46291576..11c6ff3a12579 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -347,7 +347,7 @@ static inline unsigned int arch_slab_minalign(void)
> */
> #define __assume_kmalloc_alignment __assume_aligned(ARCH_KMALLOC_MINALIGN)
> #define __assume_slab_alignment __assume_aligned(ARCH_SLAB_MINALIGN)
> -#define __assume_page_alignment __assume_aligned(PAGE_SIZE)
> +#define __assume_page_alignment __assume_aligned(PAGE_SIZE_MIN)
>
> /*
> * Kmalloc array related definitions
> @@ -358,6 +358,7 @@ static inline unsigned int arch_slab_minalign(void)
> * (PAGE_SIZE*2). Larger requests are passed to the page allocator.
> */
> #define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
> +#define KMALLOC_SHIFT_HIGH_MAX (PAGE_SHIFT_MAX + 1)
> #define KMALLOC_SHIFT_MAX (MAX_PAGE_ORDER + PAGE_SHIFT)
> #ifndef KMALLOC_SHIFT_LOW
> #define KMALLOC_SHIFT_LOW 3
> @@ -426,7 +427,7 @@ enum kmalloc_cache_type {
> NR_KMALLOC_TYPES
> };
>
> -typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH + 1];
> +typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH_MAX + 1];
>
> extern kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES];
>
> @@ -524,7 +525,7 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
> /* Will never be reached. Needed because the compiler may complain */
> return -1;
> }
> -static_assert(PAGE_SHIFT <= 20);
> +static_assert(PAGE_SHIFT_MAX <= 20);
> #define kmalloc_index(s) __kmalloc_index(s, true)
>
> #include <linux/alloc_tag.h>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ba7ea95d1c57a..e85df0332979f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -132,10 +132,17 @@ static inline int current_is_kswapd(void)
> * bootbits...
> */
> union swap_header {
> - struct {
> - char reserved[PAGE_SIZE - 10];
> - char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
> - } magic;
> + /*
> + * Exists conceptually, but since PAGE_SIZE may not be known at compile
> + * time, we must access through pointer arithmetic at run time.
> + *
> + * struct {
> + * char reserved[PAGE_SIZE - 10];
> + * char magic[10]; SWAP-SPACE or SWAPSPACE2
> + * } magic;
> + */
> +#define SWAP_HEADER_MAGIC (PAGE_SIZE - 10)
> + char magic[1];
> struct {
> char bootbits[1024]; /* Space for disklabel etc. */
> __u32 version;
> @@ -201,7 +208,7 @@ struct swap_extent {
> * Max bad pages in the new format..
> */
> #define MAX_SWAP_BADPAGES \
> - ((offsetof(union swap_header, magic.magic) - \
> + ((SWAP_HEADER_MAGIC - \
> offsetof(union swap_header, info.badpages)) / sizeof(int))
>
> enum {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index cb468e418ea11..890fe6a3e6702 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -34,10 +34,14 @@
> */
> #ifdef MAX_PHYSMEM_BITS
> #define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +#define SWP_PFN_BITS_MAX (MAX_PHYSMEM_BITS - PAGE_SHIFT_MIN)
> #else /* MAX_PHYSMEM_BITS */
> #define SWP_PFN_BITS min_t(int, \
> sizeof(phys_addr_t) * 8 - PAGE_SHIFT, \
> SWP_TYPE_SHIFT)
> +#define SWP_PFN_BITS_MAX min_t(int, \
> + sizeof(phys_addr_t) * 8 - PAGE_SHIFT_MIN, \
> + SWP_TYPE_SHIFT)
> #endif /* MAX_PHYSMEM_BITS */
> #define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)
>
> @@ -519,7 +523,7 @@ static inline struct folio *pfn_swap_entry_folio(swp_entry_t entry)
> static inline bool is_pfn_swap_entry(swp_entry_t entry)
> {
> /* Make sure the swp offset can always store the needed fields */
> - BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
> + BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS_MAX);
>
> return is_migration_entry(entry) || is_device_private_entry(entry) ||
> is_device_exclusive_entry(entry) || is_hwpoison_entry(entry);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c5f9195f76c65..4b17bec566fbd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4881,7 +4881,7 @@ static int __init mem_cgroup_init(void)
> * to work fine, we should make sure that the overfill threshold can't
> * exceed S32_MAX / PAGE_SIZE.
> */
> - BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S32_MAX / PAGE_SIZE);
> + BUILD_BUG_ON(MEMCG_CHARGE_BATCH > S32_MAX / PAGE_SIZE_MIN);
>
> cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
> memcg_hotplug_cpu_dead);
> diff --git a/mm/memory.c b/mm/memory.c
> index ebfc9768f801a..14b5ef6870486 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4949,8 +4949,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> return ret;
> }
>
> -static unsigned long fault_around_pages __read_mostly =
> - 65536 >> PAGE_SHIFT;
> +static __DEFINE_GLOBAL_PAGE_SIZE_VAR(unsigned long, fault_around_pages,
> + __read_mostly, 65536 >> PAGE_SHIFT);
>
> #ifdef CONFIG_DEBUG_FS
> static int fault_around_bytes_get(void *data, u64 *val)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d0dfc85b209bb..d9642aba07ac4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2279,7 +2279,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> }
>
> /* enforced gap between the expanding stack and other mappings. */
> -unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
> +DEFINE_GLOBAL_PAGE_SIZE_VAR(unsigned long, stack_guard_gap, 256UL<<PAGE_SHIFT);
>
> static int __init cmdline_parse_stack_guard_gap(char *p)
> {
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4430ac68e4c41..8fc9ac50749bd 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2292,7 +2292,7 @@ static int page_writeback_cpu_online(unsigned int cpu)
> #ifdef CONFIG_SYSCTL
>
> /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
> -static const unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(unsigned long, dirty_bytes_min, 2 * PAGE_SIZE);
>
> static struct ctl_table vm_page_writeback_sysctls[] = {
> {
> diff --git a/mm/slub.c b/mm/slub.c
> index a77f354f83251..82f6e98cf25bb 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5001,7 +5001,7 @@ init_kmem_cache_node(struct kmem_cache_node *n)
> static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
> {
> BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
> - NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH *
> + NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH_MAX *
> sizeof(struct kmem_cache_cpu));
>
> /*
> diff --git a/mm/sparse.c b/mm/sparse.c
> index dc38539f85603..2491425930c4d 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -277,7 +277,7 @@ static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long p
> {
> unsigned long coded_mem_map =
> (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
> - BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT);
> + BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT_MIN);
> BUG_ON(coded_mem_map & ~SECTION_MAP_MASK);
> return coded_mem_map;
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 38bdc439651ac..6311a1cc7e46b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2931,7 +2931,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
> unsigned long swapfilepages;
> unsigned long last_page;
>
> - if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
> + if (memcmp("SWAPSPACE2", &swap_header->magic[SWAP_HEADER_MAGIC], 10)) {
> pr_err("Unable to find swap-space signature\n");
> return 0;
> }
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a0df1e2e155a8..b4fbba204603c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2497,12 +2497,12 @@ struct vmap_block {
> spinlock_t lock;
> struct vmap_area *va;
> unsigned long free, dirty;
> - DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
> unsigned long dirty_min, dirty_max; /*< dirty range */
> struct list_head free_list;
> struct rcu_head rcu_head;
> struct list_head purge;
> unsigned int cpu;
> + unsigned long used_map[];
> };
>
> /* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
> @@ -2600,11 +2600,12 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> unsigned long vb_idx;
> int node, err;
> void *vaddr;
> + size_t size;
>
> node = numa_node_id();
>
> - vb = kmalloc_node(sizeof(struct vmap_block),
> - gfp_mask & GFP_RECLAIM_MASK, node);
> + size = struct_size(vb, used_map, BITS_TO_LONGS(VMAP_BBMAP_BITS));
> + vb = kmalloc_node(size, gfp_mask & GFP_RECLAIM_MASK, node);
> if (unlikely(!vb))
> return ERR_PTR(-ENOMEM);
>
* Re: [RFC PATCH v1 13/57] bpf: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 13/57] bpf: " Ryan Roberts
@ 2024-10-16 14:38 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:38 UTC (permalink / raw)
To: Alexei Starovoitov, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Daniel Borkmann,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Andrii Nakryiko
Cc: bpf, linux-arm-kernel, linux-kernel, linux-mm
+ Andrii Nakryiko
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being a compile-time constant. The
> code is intended to be equivalent when compile-time page size is active.
>
> Refactor "struct bpf_ringbuf" so that consumer_pos, producer_pos,
> pending_pos and data are no longer embedded at (static) page offsets
> within the struct. This can't work for boot-time page size because the
> page size isn't known at compile-time. Instead, only define the
> metadata in the struct, along with pointers to those values. At "struct
> bpf_ringbuf" allocation time, the extra pages are allocated at the end
> and the pointers are initialized to point to the correct locations.
>
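A user-space sketch of the layout described above, assuming one contiguous
page-aligned allocation instead of the kernel's vmap() of individual pages
(and ignoring the double mapping of the data pages). All names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_ringbuf {
	size_t mask;
	unsigned long *consumer_pos;	/* first word of the consumer page */
	unsigned long *producer_pos;	/* first word of the producer page */
	unsigned long *pending_pos;	/* second word of the producer page */
	char *data;			/* first data page */
};

/* Round n up to a multiple of page_size, which must be a power of two. */
static size_t demo_page_align(size_t n, size_t page_size)
{
	assert(page_size && (page_size & (page_size - 1)) == 0);
	return (n + page_size - 1) & ~(page_size - 1);
}

/* data_sz is assumed to be a power-of-two multiple of page_size. */
static struct demo_ringbuf *demo_ringbuf_alloc(size_t data_sz, size_t page_size)
{
	size_t meta = demo_page_align(sizeof(struct demo_ringbuf), page_size);
	size_t total = meta + 2 * page_size + data_sz;
	char *base = aligned_alloc(page_size, total);
	struct demo_ringbuf *rb;

	if (!base)
		return NULL;
	memset(base, 0, total);

	rb = (struct demo_ringbuf *)base;
	rb->mask = data_sz - 1;
	rb->consumer_pos = (unsigned long *)(base + meta);
	rb->producer_pos = (unsigned long *)(base + meta + page_size);
	rb->pending_pos = rb->producer_pos + 1;
	rb->data = base + meta + 2 * page_size;
	return rb;
}
```

Every access that used to name a struct field now goes through a pointer
dereference, which accounts for most of the mechanical churn in the diff.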
> Additionally, only expose the __PAGE_SIZE enum to BTF for compile-time
> page size builds. We don't know the page size at compile-time for
> boot-time builds. NOTE: This may need some extra thought; perhaps
> __PAGE_SIZE should be exposed as 0 in this case? And/or perhaps
> __PAGE_SIZE_MIN/__PAGE_SIZE_MAX should be exposed? And there would need
> to be a runtime mechanism for querying the page size (e.g.
> getpagesize()).
>
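For reference, this is roughly what a run-time query looks like on the
user-space side today, using POSIX sysconf(). It is only a sketch of the
existing user-space mechanism; it says nothing about how a BPF program would
obtain the value, which is the open question above. The helper names are
hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Page size reported by the running system, or 0 if it cannot be queried. */
static unsigned long demo_runtime_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	return sz > 0 ? (unsigned long)sz : 0;
}

/* Round n up to a whole number of pages using the run-time page size. */
static unsigned long demo_round_up_to_page(unsigned long n)
{
	unsigned long ps = demo_runtime_page_size();

	assert(ps != 0);
	return (n + ps - 1) / ps * ps;
}
```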
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> kernel/bpf/core.c | 9 ++++++--
> kernel/bpf/ringbuf.c | 54 ++++++++++++++++++++++++--------------------
> 2 files changed, 37 insertions(+), 26 deletions(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 7ee62e38faf0e..485875aa78e63 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -89,10 +89,15 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns
> return NULL;
> }
>
> -/* tell bpf programs that include vmlinux.h kernel's PAGE_SIZE */
> +/*
> + * tell bpf programs that include vmlinux.h kernel's PAGE_SIZE. We can only do
> + * this for compile-time PAGE_SIZE builds.
> + */
> +#if PAGE_SIZE_MIN == PAGE_SIZE_MAX
> enum page_size_enum {
> __PAGE_SIZE = PAGE_SIZE
> };
> +#endif
>
> struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags)
> {
> @@ -100,7 +105,7 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
> struct bpf_prog_aux *aux;
> struct bpf_prog *fp;
>
> - size = round_up(size, __PAGE_SIZE);
> + size = round_up(size, PAGE_SIZE);
> fp = __vmalloc(size, gfp_flags);
> if (fp == NULL)
> return NULL;
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> index e20b90c361316..8e4093ddbc638 100644
> --- a/kernel/bpf/ringbuf.c
> +++ b/kernel/bpf/ringbuf.c
> @@ -14,9 +14,9 @@
>
> #define RINGBUF_CREATE_FLAG_MASK (BPF_F_NUMA_NODE)
>
> -/* non-mmap()'able part of bpf_ringbuf (everything up to consumer page) */
> +/* non-mmap()'able part of bpf_ringbuf (everything defined in struct) */
> #define RINGBUF_PGOFF \
> - (offsetof(struct bpf_ringbuf, consumer_pos) >> PAGE_SHIFT)
> + (PAGE_ALIGN(sizeof(struct bpf_ringbuf)) >> PAGE_SHIFT)
> /* consumer page and producer page */
> #define RINGBUF_POS_PAGES 2
> #define RINGBUF_NR_META_PAGES (RINGBUF_PGOFF + RINGBUF_POS_PAGES)
> @@ -69,10 +69,10 @@ struct bpf_ringbuf {
> * validate each sample to ensure that they're correctly formatted, and
> * fully contained within the ring buffer.
> */
> - unsigned long consumer_pos __aligned(PAGE_SIZE);
> - unsigned long producer_pos __aligned(PAGE_SIZE);
> - unsigned long pending_pos;
> - char data[] __aligned(PAGE_SIZE);
> + unsigned long *consumer_pos;
> + unsigned long *producer_pos;
> + unsigned long *pending_pos;
> + char *data;
> };
>
> struct bpf_ringbuf_map {
> @@ -134,9 +134,15 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
> rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
> VM_MAP | VM_USERMAP, PAGE_KERNEL);
> if (rb) {
> + void *base = rb;
> +
> kmemleak_not_leak(pages);
> rb->pages = pages;
> rb->nr_pages = nr_pages;
> + rb->consumer_pos = (unsigned long *)(base + PAGE_SIZE * RINGBUF_PGOFF);
> + rb->producer_pos = (unsigned long *)(base + PAGE_SIZE * (RINGBUF_PGOFF + 1));
> + rb->pending_pos = rb->producer_pos + 1;
> + rb->data = base + PAGE_SIZE * nr_meta_pages;
> return rb;
> }
>
> @@ -179,9 +185,9 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
> init_irq_work(&rb->work, bpf_ringbuf_notify);
>
> rb->mask = data_sz - 1;
> - rb->consumer_pos = 0;
> - rb->producer_pos = 0;
> - rb->pending_pos = 0;
> + *rb->consumer_pos = 0;
> + *rb->producer_pos = 0;
> + *rb->pending_pos = 0;
>
> return rb;
> }
> @@ -300,8 +306,8 @@ static unsigned long ringbuf_avail_data_sz(struct bpf_ringbuf *rb)
> {
> unsigned long cons_pos, prod_pos;
>
> - cons_pos = smp_load_acquire(&rb->consumer_pos);
> - prod_pos = smp_load_acquire(&rb->producer_pos);
> + cons_pos = smp_load_acquire(rb->consumer_pos);
> + prod_pos = smp_load_acquire(rb->producer_pos);
> return prod_pos - cons_pos;
> }
>
> @@ -418,7 +424,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> if (len > ringbuf_total_data_sz(rb))
> return NULL;
>
> - cons_pos = smp_load_acquire(&rb->consumer_pos);
> + cons_pos = smp_load_acquire(rb->consumer_pos);
>
> if (in_nmi()) {
> if (!spin_trylock_irqsave(&rb->spinlock, flags))
> @@ -427,8 +433,8 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> spin_lock_irqsave(&rb->spinlock, flags);
> }
>
> - pend_pos = rb->pending_pos;
> - prod_pos = rb->producer_pos;
> + pend_pos = *rb->pending_pos;
> + prod_pos = *rb->producer_pos;
> new_prod_pos = prod_pos + len;
>
> while (pend_pos < prod_pos) {
> @@ -440,7 +446,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
> pend_pos += tmp_size;
> }
> - rb->pending_pos = pend_pos;
> + *rb->pending_pos = pend_pos;
>
> /* check for out of ringbuf space:
> * - by ensuring producer position doesn't advance more than
> @@ -460,7 +466,7 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> hdr->pg_off = pg_off;
>
> /* pairs with consumer's smp_load_acquire() */
> - smp_store_release(&rb->producer_pos, new_prod_pos);
> + smp_store_release(rb->producer_pos, new_prod_pos);
>
> spin_unlock_irqrestore(&rb->spinlock, flags);
>
> @@ -506,7 +512,7 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
> * new data availability
> */
> rec_pos = (void *)hdr - (void *)rb->data;
> - cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> + cons_pos = smp_load_acquire(rb->consumer_pos) & rb->mask;
>
> if (flags & BPF_RB_FORCE_WAKEUP)
> irq_work_queue(&rb->work);
> @@ -580,9 +586,9 @@ BPF_CALL_2(bpf_ringbuf_query, struct bpf_map *, map, u64, flags)
> case BPF_RB_RING_SIZE:
> return ringbuf_total_data_sz(rb);
> case BPF_RB_CONS_POS:
> - return smp_load_acquire(&rb->consumer_pos);
> + return smp_load_acquire(rb->consumer_pos);
> case BPF_RB_PROD_POS:
> - return smp_load_acquire(&rb->producer_pos);
> + return smp_load_acquire(rb->producer_pos);
> default:
> return 0;
> }
> @@ -680,12 +686,12 @@ static int __bpf_user_ringbuf_peek(struct bpf_ringbuf *rb, void **sample, u32 *s
> u64 cons_pos, prod_pos;
>
> /* Synchronizes with smp_store_release() in user-space producer. */
> - prod_pos = smp_load_acquire(&rb->producer_pos);
> + prod_pos = smp_load_acquire(rb->producer_pos);
> if (prod_pos % 8)
> return -EINVAL;
>
> /* Synchronizes with smp_store_release() in __bpf_user_ringbuf_sample_release() */
> - cons_pos = smp_load_acquire(&rb->consumer_pos);
> + cons_pos = smp_load_acquire(rb->consumer_pos);
> if (cons_pos >= prod_pos)
> return -ENODATA;
>
> @@ -715,7 +721,7 @@ static int __bpf_user_ringbuf_peek(struct bpf_ringbuf *rb, void **sample, u32 *s
> * Update the consumer pos, and return -EAGAIN so the caller
> * knows to skip this sample and try to read the next one.
> */
> - smp_store_release(&rb->consumer_pos, cons_pos + total_len);
> + smp_store_release(rb->consumer_pos, cons_pos + total_len);
> return -EAGAIN;
> }
>
> @@ -737,9 +743,9 @@ static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t siz
> * prevents another task from writing to consumer_pos after it was read
> * by this task with smp_load_acquire() in __bpf_user_ringbuf_peek().
> */
> - consumer_pos = rb->consumer_pos;
> + consumer_pos = *rb->consumer_pos;
> /* Synchronizes with smp_load_acquire() in user-space producer. */
> - smp_store_release(&rb->consumer_pos, consumer_pos + rounded_size);
> + smp_store_release(rb->consumer_pos, consumer_pos + rounded_size);
> }
>
> BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
* Re: [RFC PATCH v1 14/57] pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 14/57] pm/hibernate: " Ryan Roberts
@ 2024-10-16 14:39 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:39 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Rafael J. Wysocki, Len Brown, Pavel Machek
Cc: linux-arm-kernel, linux-kernel, linux-mm, linux-pm
+ Rafael J. Wysocki, Len Brown, Pavel Machek
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being a compile-time constant. The
> code is intended to be equivalent when compile-time page size is active.
>
> "struct linked_page", "struct swap_map_page" and "struct swsusp_header"
> were all previously sized to be exactly PAGE_SIZE. Refactor those
> structures to remove the padding, then superimpose them on a page at
> runtime.
>
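A user-space sketch of "superimposing" the unpadded structures on a page.
It assumes the header sits at the tail of the page; that is inferred from the
removed leading "reserved" padding, since the hunk that sets the header
pointer is not quoted here. demo_sector_t and the helpers are hypothetical,
and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t demo_sector_t;

/* Header without leading padding; it is placed at the end of a page. */
struct demo_header {
	uint32_t hw_sig;
	uint32_t crc32;
	demo_sector_t image;
	unsigned int flags;
	char orig_sig[10];
	char sig[10];
} __attribute__((packed));

/* Locate the header in the last sizeof(struct demo_header) bytes of a page. */
static struct demo_header *demo_header_in_page(char *page, size_t page_size)
{
	assert(page_size >= sizeof(struct demo_header));
	return (struct demo_header *)(page + page_size -
				      sizeof(struct demo_header));
}

/* Sector entries in a map page, excluding the trailing "next" link. */
static size_t demo_map_page_entries(size_t page_size)
{
	return page_size / sizeof(demo_sector_t) - 1;
}

/* The "next" link is the last sector-sized slot of the page. */
static demo_sector_t *demo_next_swap(demo_sector_t *map, size_t page_size)
{
	return &map[demo_map_page_entries(page_size)];
}
```

The I/O still has to be issued against the page itself, which is presumably
why the diff passes swsusp_header_pg rather than swsusp_header to
hib_submit_io().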
> "struct cmp_data" and "struct dec_data" previously contained embedded
> "unc" and "cmp" arrays, whose sizes were derived from PAGE_SIZE. We
> can't use the flexible array approach here since there are 2 arrays in
> the structure, so convert to pointers and define an allocator and
> deallocator for each struct.
>
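The allocator/deallocator pairing can be sketched in user space as follows,
assuming calloc()/free() in place of vzalloc()/vfree() and buffer sizes
passed as parameters instead of UNC_SIZE and CMP_SIZE. The names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_cmp_data {
	size_t unc_len;		/* uncompressed length */
	size_t cmp_len;		/* compressed length */
	unsigned char *unc;	/* uncompressed buffer */
	unsigned char *cmp;	/* compressed buffer */
};

/* Safe on a partially built array: unset pointers are NULL. */
static void demo_free_data(struct demo_cmp_data *data, size_t nr)
{
	size_t i;

	if (!data)
		return;
	for (i = 0; i < nr; i++) {
		free(data[i].unc);
		free(data[i].cmp);
	}
	free(data);
}

static struct demo_cmp_data *demo_alloc_data(size_t nr, size_t unc_size,
					     size_t cmp_size)
{
	struct demo_cmp_data *data;
	size_t i;

	assert(nr > 0);
	data = calloc(nr, sizeof(*data));	/* zeroed, so pointers start NULL */
	if (!data)
		return NULL;

	for (i = 0; i < nr; i++) {
		data[i].unc = calloc(1, unc_size);
		data[i].cmp = calloc(1, cmp_size);
		if (!data[i].unc || !data[i].cmp) {
			demo_free_data(data, nr);
			return NULL;
		}
	}
	return data;
}
```

The cleanup path depends on the array being zero-initialized and on freeing
a NULL pointer being a no-op, which holds for free() here and for vfree() in
the kernel.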
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> kernel/power/power.h | 2 +-
> kernel/power/snapshot.c | 2 +-
> kernel/power/swap.c | 129 +++++++++++++++++++++++++++++++++-------
> 3 files changed, 108 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/power/power.h b/kernel/power/power.h
> index de0e6b1077f23..74af2eb8d48a4 100644
> --- a/kernel/power/power.h
> +++ b/kernel/power/power.h
> @@ -16,7 +16,7 @@ struct swsusp_info {
> unsigned long image_pages;
> unsigned long pages;
> unsigned long size;
> -} __aligned(PAGE_SIZE);
> +} __aligned(PAGE_SIZE_MAX);
>
> #ifdef CONFIG_HIBERNATION
> /* kernel/power/snapshot.c */
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 405eddbda4fc5..144e92f786e35 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -155,7 +155,7 @@ struct pbe *restore_pblist;
>
> struct linked_page {
> struct linked_page *next;
> - char data[LINKED_PAGE_DATA_SIZE];
> + char data[];
> } __packed;
>
> /*
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 82b884b67152f..ffd4c864acfa2 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -59,6 +59,7 @@ static bool clean_pages_on_decompress;
> */
>
> #define MAP_PAGE_ENTRIES (PAGE_SIZE / sizeof(sector_t) - 1)
> +#define NEXT_SWAP_INDEX MAP_PAGE_ENTRIES
>
> /*
> * Number of free pages that are not high.
> @@ -78,8 +79,11 @@ static inline unsigned long reqd_free_pages(void)
> }
>
> struct swap_map_page {
> - sector_t entries[MAP_PAGE_ENTRIES];
> - sector_t next_swap;
> + /*
> + * A PAGE_SIZE structure with (PAGE_SIZE / sizeof(sector_t)) entries.
> + * The last entry, [NEXT_SWAP_INDEX], is `.next_swap`.
> + */
> + sector_t entries[1];
> };
>
> struct swap_map_page_list {
> @@ -103,8 +107,6 @@ struct swap_map_handle {
> };
>
> struct swsusp_header {
> - char reserved[PAGE_SIZE - 20 - sizeof(sector_t) - sizeof(int) -
> - sizeof(u32) - sizeof(u32)];
> u32 hw_sig;
> u32 crc32;
> sector_t image;
> @@ -113,6 +115,7 @@ struct swsusp_header {
> char sig[10];
> } __packed;
>
> +static char *swsusp_header_pg;
> static struct swsusp_header *swsusp_header;
>
> /*
> @@ -315,7 +318,7 @@ static int mark_swapfiles(struct swap_map_handle *handle, unsigned int flags)
> {
> int error;
>
> - hib_submit_io(REQ_OP_READ, swsusp_resume_block, swsusp_header, NULL);
> + hib_submit_io(REQ_OP_READ, swsusp_resume_block, swsusp_header_pg, NULL);
> if (!memcmp("SWAP-SPACE",swsusp_header->sig, 10) ||
> !memcmp("SWAPSPACE2",swsusp_header->sig, 10)) {
> memcpy(swsusp_header->orig_sig,swsusp_header->sig, 10);
> @@ -329,7 +332,7 @@ static int mark_swapfiles(struct swap_map_handle *handle, unsigned int flags)
> if (flags & SF_CRC32_MODE)
> swsusp_header->crc32 = handle->crc32;
> error = hib_submit_io(REQ_OP_WRITE | REQ_SYNC,
> - swsusp_resume_block, swsusp_header, NULL);
> + swsusp_resume_block, swsusp_header_pg, NULL);
> } else {
> pr_err("Swap header not found!\n");
> error = -ENODEV;
> @@ -466,7 +469,7 @@ static int swap_write_page(struct swap_map_handle *handle, void *buf,
> offset = alloc_swapdev_block(root_swap);
> if (!offset)
> return -ENOSPC;
> - handle->cur->next_swap = offset;
> + handle->cur->entries[NEXT_SWAP_INDEX] = offset;
> error = write_page(handle->cur, handle->cur_swap, hb);
> if (error)
> goto out;
> @@ -643,8 +646,8 @@ struct cmp_data {
> wait_queue_head_t done; /* compression done */
> size_t unc_len; /* uncompressed length */
> size_t cmp_len; /* compressed length */
> - unsigned char unc[UNC_SIZE]; /* uncompressed buffer */
> - unsigned char cmp[CMP_SIZE]; /* compressed buffer */
> + unsigned char *unc; /* uncompressed buffer */
> + unsigned char *cmp; /* compressed buffer */
> };
>
> /* Indicates the image size after compression */
> @@ -683,6 +686,45 @@ static int compress_threadfn(void *data)
> return 0;
> }
>
> +static void free_cmp_data(struct cmp_data *data, unsigned nr_threads)
> +{
> + int i;
> +
> + if (!data)
> + return;
> +
> + for (i = 0; i < nr_threads; i++) {
> + vfree(data[i].unc);
> + vfree(data[i].cmp);
> + }
> +
> + vfree(data);
> +}
> +
> +static struct cmp_data *alloc_cmp_data(unsigned nr_threads)
> +{
> + struct cmp_data *data = NULL;
> + int i = -1;
> +
> + data = vzalloc(array_size(nr_threads, sizeof(*data)));
> + if (!data)
> + goto fail;
> +
> + for (i = 0; i < nr_threads; i++) {
> + data[i].unc = vzalloc(UNC_SIZE);
> + if (!data[i].unc)
> + goto fail;
> + data[i].cmp = vzalloc(CMP_SIZE);
> + if (!data[i].cmp)
> + goto fail;
> + }
> +
> + return data;
> +fail:
> + free_cmp_data(data, nr_threads);
> + return NULL;
> +}
> +
> /**
> * save_compressed_image - Save the suspend image data after compression.
> * @handle: Swap map handle to use for saving the image.
> @@ -724,7 +766,7 @@ static int save_compressed_image(struct swap_map_handle *handle,
> goto out_clean;
> }
>
> - data = vzalloc(array_size(nr_threads, sizeof(*data)));
> + data = alloc_cmp_data(nr_threads);
> if (!data) {
> pr_err("Failed to allocate %s data\n", hib_comp_algo);
> ret = -ENOMEM;
> @@ -902,7 +944,7 @@ static int save_compressed_image(struct swap_map_handle *handle,
> if (data[thr].cc)
> crypto_free_comp(data[thr].cc);
> }
> - vfree(data);
> + free_cmp_data(data, nr_threads);
> }
> if (page) free_page((unsigned long)page);
>
> @@ -1036,7 +1078,7 @@ static int get_swap_reader(struct swap_map_handle *handle,
> release_swap_reader(handle);
> return error;
> }
> - offset = tmp->map->next_swap;
> + offset = tmp->map->entries[NEXT_SWAP_INDEX];
> }
> handle->k = 0;
> handle->cur = handle->maps->map;
> @@ -1150,8 +1192,8 @@ struct dec_data {
> wait_queue_head_t done; /* decompression done */
> size_t unc_len; /* uncompressed length */
> size_t cmp_len; /* compressed length */
> - unsigned char unc[UNC_SIZE]; /* uncompressed buffer */
> - unsigned char cmp[CMP_SIZE]; /* compressed buffer */
> + unsigned char *unc; /* uncompressed buffer */
> + unsigned char *cmp; /* compressed buffer */
> };
>
> /*
> @@ -1189,6 +1231,45 @@ static int decompress_threadfn(void *data)
> return 0;
> }
>
> +static void free_dec_data(struct dec_data *data, unsigned nr_threads)
> +{
> + int i;
> +
> + if (!data)
> + return;
> +
> + for (i = 0; i < nr_threads; i++) {
> + vfree(data[i].unc);
> + vfree(data[i].cmp);
> + }
> +
> + vfree(data);
> +}
> +
> +static struct dec_data *alloc_dec_data(unsigned nr_threads)
> +{
> + struct dec_data *data = NULL;
> + int i = -1;
> +
> + data = vzalloc(array_size(nr_threads, sizeof(*data)));
> + if (!data)
> + goto fail;
> +
> + for (i = 0; i < nr_threads; i++) {
> + data[i].unc = vzalloc(UNC_SIZE);
> + if (!data[i].unc)
> + goto fail;
> + data[i].cmp = vzalloc(CMP_SIZE);
> + if (!data[i].cmp)
> + goto fail;
> + }
> +
> + return data;
> +fail:
> + free_dec_data(data, nr_threads);
> + return NULL;
> +}
> +
> /**
> * load_compressed_image - Load compressed image data and decompress it.
> * @handle: Swap map handle to use for loading data.
> @@ -1231,7 +1312,7 @@ static int load_compressed_image(struct swap_map_handle *handle,
> goto out_clean;
> }
>
> - data = vzalloc(array_size(nr_threads, sizeof(*data)));
> + data = alloc_dec_data(nr_threads);
> if (!data) {
> pr_err("Failed to allocate %s data\n", hib_comp_algo);
> ret = -ENOMEM;
> @@ -1510,7 +1591,7 @@ static int load_compressed_image(struct swap_map_handle *handle,
> if (data[thr].cc)
> crypto_free_comp(data[thr].cc);
> }
> - vfree(data);
> + free_dec_data(data, nr_threads);
> }
> vfree(page);
>
> @@ -1569,9 +1650,9 @@ int swsusp_check(bool exclusive)
> hib_resume_bdev_file = bdev_file_open_by_dev(swsusp_resume_device,
> BLK_OPEN_READ, holder, NULL);
> if (!IS_ERR(hib_resume_bdev_file)) {
> - clear_page(swsusp_header);
> + clear_page(swsusp_header_pg);
> error = hib_submit_io(REQ_OP_READ, swsusp_resume_block,
> - swsusp_header, NULL);
> + swsusp_header_pg, NULL);
> if (error)
> goto put;
>
> @@ -1581,7 +1662,7 @@ int swsusp_check(bool exclusive)
> /* Reset swap signature now */
> error = hib_submit_io(REQ_OP_WRITE | REQ_SYNC,
> swsusp_resume_block,
> - swsusp_header, NULL);
> + swsusp_header_pg, NULL);
> } else {
> error = -EINVAL;
> }
> @@ -1631,12 +1712,12 @@ int swsusp_unmark(void)
> int error;
>
> hib_submit_io(REQ_OP_READ, swsusp_resume_block,
> - swsusp_header, NULL);
> + swsusp_header_pg, NULL);
> if (!memcmp(HIBERNATE_SIG,swsusp_header->sig, 10)) {
> memcpy(swsusp_header->sig,swsusp_header->orig_sig, 10);
> error = hib_submit_io(REQ_OP_WRITE | REQ_SYNC,
> swsusp_resume_block,
> - swsusp_header, NULL);
> + swsusp_header_pg, NULL);
> } else {
> pr_err("Cannot find swsusp signature!\n");
> error = -ENODEV;
> @@ -1653,9 +1734,11 @@ int swsusp_unmark(void)
>
> static int __init swsusp_header_init(void)
> {
> - swsusp_header = (struct swsusp_header*) __get_free_page(GFP_KERNEL);
> - if (!swsusp_header)
> + swsusp_header_pg = (char *)__get_free_page(GFP_KERNEL);
> + if (!swsusp_header_pg)
> panic("Could not allocate memory for swsusp_header\n");
> + swsusp_header = (struct swsusp_header *)(swsusp_header_pg +
> + PAGE_SIZE - sizeof(struct swsusp_header));
> return 0;
> }
>
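(Note for readers of the archive, not part of the patch.) The alloc/free helper pair quoted above relies on the array being zeroed and on freeing a NULL pointer being a no-op, so one cleanup path can handle a partially built array. A minimal userspace sketch of that pattern follows, assuming calloc()/free() as stand-ins for vzalloc()/vfree(); the *_sketch names and buffer sizes are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sizes; the kernel's UNC_SIZE/CMP_SIZE are page-size derived. */
#define UNC_SIZE_SKETCH 4096
#define CMP_SIZE_SKETCH 8192

struct cmp_data_sketch {
	unsigned char *unc;	/* uncompressed buffer */
	unsigned char *cmp;	/* compressed buffer */
};

/* Safe on a partially built array: free(NULL) is a no-op. */
static void free_cmp_data_sketch(struct cmp_data_sketch *data, unsigned nr)
{
	unsigned i;

	if (!data)
		return;

	for (i = 0; i < nr; i++) {
		free(data[i].unc);
		free(data[i].cmp);
	}

	free(data);
}

static struct cmp_data_sketch *alloc_cmp_data_sketch(unsigned nr)
{
	struct cmp_data_sketch *data;
	unsigned i;

	assert(nr > 0);

	/* calloc() zeroes the array, so untouched slots hold NULL pointers. */
	data = calloc(nr, sizeof(*data));
	if (!data)
		return NULL;

	for (i = 0; i < nr; i++) {
		data[i].unc = calloc(1, UNC_SIZE_SKETCH);
		data[i].cmp = calloc(1, CMP_SIZE_SKETCH);
		if (!data[i].unc || !data[i].cmp) {
			/* Single cleanup path covers every failure point. */
			free_cmp_data_sketch(data, nr);
			return NULL;
		}
	}

	return data;
}
```

A caller would pair alloc_cmp_data_sketch(nr) with free_cmp_data_sketch(data, nr), passing the same count to both, as the patch does with nr_threads.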
* Re: [RFC PATCH v1 16/57] perf: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 16/57] perf: " Ryan Roberts
@ 2024-10-16 14:40 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:40 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Arnaldo Carvalho de Melo, Ingo Molnar, Namhyung Kim,
Peter Zijlstra
Cc: linux-arm-kernel, linux-kernel, linux-mm, linux-perf-users
+ Arnaldo Carvalho de Melo, Ingo Molnar, Namhyung Kim, Peter Zijlstra
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Refactor a BUILD_BUG_ON() so that we test against the limit; _format is
> invariant to page size so testing it is no bigger than the minimum
> supported size is sufficient.
>
> Wrap global variables that are initialized with PAGE_SIZE derived values
> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
> deferred for boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> include/linux/perf_event.h | 2 +-
> kernel/events/core.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1a8942277ddad..b7972155f93eb 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1872,7 +1872,7 @@ _name##_show(struct device *dev, \
> struct device_attribute *attr, \
> char *page) \
> { \
> - BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
> + BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE_MIN); \
> return sprintf(page, _format "\n"); \
> } \
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8a6c6bbcd658a..81149663ab7d8 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -419,7 +419,7 @@ static struct kmem_cache *perf_event_cache;
> int sysctl_perf_event_paranoid __read_mostly = 2;
>
> /* Minimum for 512 kiB + 1 user control page */
> -int sysctl_perf_event_mlock __read_mostly = 512 + (PAGE_SIZE / 1024); /* 'free' kiB per user */
> +__DEFINE_GLOBAL_PAGE_SIZE_VAR(int, sysctl_perf_event_mlock, __read_mostly, 512 + (PAGE_SIZE / 1024)); /* 'free' kiB per user */
>
> /*
> * max perf event sample rate
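(Note for readers of the archive, not part of the patch.) The BUILD_BUG_ON() change above works because a bound that holds for the smallest supported page size also holds for every larger one. A minimal userspace sketch, assuming C11 _Static_assert as a stand-in for BUILD_BUG_ON(); the *_SKETCH names, the format string and the 16K run-time value are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed lower bound for illustration: the smallest arm64 page size. */
#define PAGE_SIZE_MIN_SKETCH 4096UL

/* Stand-in for a boot-time page size; not a compile-time constant. */
static unsigned long page_size_sketch = 16384UL;

#define FORMAT_SKETCH "event=0x11"

/*
 * The format is invariant to page size, so checking it against the
 * minimum at build time is sufficient for any page size chosen at boot.
 */
_Static_assert(sizeof(FORMAT_SKETCH) < PAGE_SIZE_MIN_SKETCH,
	       "format must fit in the smallest supported page");

/* Writes the format plus a newline into a page-sized buffer. */
static int format_show_sketch(char *page)
{
	assert(page_size_sketch >= PAGE_SIZE_MIN_SKETCH);
	return snprintf(page, page_size_sketch, "%s\n", FORMAT_SKETCH);
}
```

The build-time check never mentions page_size_sketch, which is the point: it stays a constant expression even when the real page size is only known at run time.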
* Re: [RFC PATCH v1 17/57] kvm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 17/57] kvm: " Ryan Roberts
2024-10-14 21:37 ` Sean Christopherson
@ 2024-10-16 14:41 ` Ryan Roberts
1 sibling, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:41 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Paolo Bonzini
Cc: kvm, linux-arm-kernel, linux-kernel, linux-mm
+ Paolo Bonzini
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Modify BUILD_BUG_ON() to compare with page size limit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> virt/kvm/kvm_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cb2b78e92910f..6c862bc41a672 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4244,7 +4244,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
> goto vcpu_decrement;
> }
>
> - BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
> + BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE_MIN);
> page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> if (!page) {
> r = -ENOMEM;
* Re: [RFC PATCH v1 21/57] sunrpc: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 21/57] sunrpc: " Ryan Roberts
@ 2024-10-16 14:42 ` Ryan Roberts
2024-10-16 14:47 ` Chuck Lever
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:42 UTC (permalink / raw)
To: Andrew Morton, Anna Schumaker, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Trond Myklebust, Will Deacon, Chuck Lever,
Jeff Layton
Cc: linux-arm-kernel, linux-kernel, linux-mm, linux-nfs
+ Chuck Lever, Jeff Layton
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Updated array sizes in various structs to contain enough entries for the
> smallest supported page size.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> include/linux/sunrpc/svc.h | 8 +++++---
> include/linux/sunrpc/svc_rdma.h | 4 ++--
> include/linux/sunrpc/svcsock.h | 2 +-
> 3 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index a7d0406b9ef59..dda44018b8f36 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -160,6 +160,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> */
> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> + 2 + 1)
> +#define RPCSVC_MAXPAGES_MAX ((RPCSVC_MAXPAYLOAD+PAGE_SIZE_MIN-1)/PAGE_SIZE_MIN \
> + + 2 + 1)
>
> /*
> * The context of a single thread, including the request currently being
> @@ -190,14 +192,14 @@ struct svc_rqst {
> struct xdr_stream rq_res_stream;
> struct page *rq_scratch_page;
> struct xdr_buf rq_res;
> - struct page *rq_pages[RPCSVC_MAXPAGES + 1];
> + struct page *rq_pages[RPCSVC_MAXPAGES_MAX + 1];
> struct page * *rq_respages; /* points into rq_pages */
> struct page * *rq_next_page; /* next reply page to use */
> struct page * *rq_page_end; /* one past the last page */
>
> struct folio_batch rq_fbatch;
> - struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
> - struct bio_vec rq_bvec[RPCSVC_MAXPAGES];
> + struct kvec rq_vec[RPCSVC_MAXPAGES_MAX]; /* generally useful.. */
> + struct bio_vec rq_bvec[RPCSVC_MAXPAGES_MAX];
>
> __be32 rq_xid; /* transmission id */
> u32 rq_prog; /* program number */
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index d33bab33099ab..7c6441e8d6f7a 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -200,7 +200,7 @@ struct svc_rdma_recv_ctxt {
> struct svc_rdma_pcl rc_reply_pcl;
>
> unsigned int rc_page_count;
> - struct page *rc_pages[RPCSVC_MAXPAGES];
> + struct page *rc_pages[RPCSVC_MAXPAGES_MAX];
> };
>
> /*
> @@ -242,7 +242,7 @@ struct svc_rdma_send_ctxt {
> void *sc_xprt_buf;
> int sc_page_count;
> int sc_cur_sge_no;
> - struct page *sc_pages[RPCSVC_MAXPAGES];
> + struct page *sc_pages[RPCSVC_MAXPAGES_MAX];
> struct ib_sge sc_sges[];
> };
>
> diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
> index 7c78ec6356b92..6c6bcc82685a3 100644
> --- a/include/linux/sunrpc/svcsock.h
> +++ b/include/linux/sunrpc/svcsock.h
> @@ -40,7 +40,7 @@ struct svc_sock {
>
> struct completion sk_handshake_done;
>
> - struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
> + struct page * sk_pages[RPCSVC_MAXPAGES_MAX]; /* received data */
> };
>
> static inline u32 svc_sock_reclen(struct svc_sock *svsk)
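(Note for readers of the archive, not part of the patch.) The array resizing above follows from the page count per payload being largest when pages are smallest, so sizing the arrays with PAGE_SIZE_MIN covers every boot-time choice. A minimal userspace sketch of the arithmetic, assuming a 1 MiB payload and a 4K minimum page size purely for illustration; all *_sketch names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MIN_SKETCH	4096UL
#define MAXPAYLOAD_SKETCH	(1UL * 1024 * 1024)	/* illustrative 1 MiB */

/* Worst case: the smallest page size needs the most pages per payload. */
#define MAXPAGES_MAX_SKETCH \
	((MAXPAYLOAD_SKETCH + PAGE_SIZE_MIN_SKETCH - 1) / PAGE_SIZE_MIN_SKETCH \
	 + 2 + 1)

/* Page count actually needed once the page size is known at run time. */
static size_t maxpages_sketch(unsigned long page_size)
{
	assert(page_size >= PAGE_SIZE_MIN_SKETCH);
	return (MAXPAYLOAD_SKETCH + page_size - 1) / page_size + 2 + 1;
}

/* Statically sized for the worst case, as rq_pages[] is in the patch. */
struct rqst_sketch {
	void *pages[MAXPAGES_MAX_SKETCH + 1];
};
```

Under these assumptions the array holds 260 pointers, while a 64K boot only ever uses 19 + 1 of them; the cost of the approach is the unused tail on larger page sizes.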
* Re: [RFC PATCH v1 23/57] net: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 23/57] net: " Ryan Roberts
@ 2024-10-16 14:43 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:43 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anna Schumaker, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Eric Dumazet,
Greg Marsden, Ivan Ivanov, Jakub Kicinski, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Paolo Abeni, Trond Myklebust, Will Deacon, Chuck Lever,
Jeff Layton
Cc: linux-arm-kernel, linux-kernel, linux-mm, linux-nfs, netdev
+ Chuck Lever, Jeff Layton
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Define NLMSG_GOODSIZE using min() instead of ifdeffery. This will now
> evaluate to a compile-time constant for compile-time page size, but
> evaluate at run-time when using boot-time page size.
>
> Rework NAPI small page frag infrastructure so that for boot-time page
> size it is compiled in if 4K page size is in the possible range, but
> defer deciding to use it to run time when the page size is known. No
> change for compile-time page size case.
>
> Resize cache_defer_hash[] array for PAGE_SIZE_MAX.
>
> Convert a complex BUILD_BUG_ON() to runtime BUG_ON().
>
> Wrap global variables that are initialized with PAGE_SIZE derived values
> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
> deferred for boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> include/linux/netlink.h | 6 +-----
> net/core/hotdata.c | 4 ++--
> net/core/skbuff.c | 4 ++--
> net/core/sysctl_net_core.c | 2 +-
> net/sunrpc/cache.c | 3 ++-
> net/unix/af_unix.c | 2 +-
> 6 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/netlink.h b/include/linux/netlink.h
> index b332c2048c755..ffa1e94111f89 100644
> --- a/include/linux/netlink.h
> +++ b/include/linux/netlink.h
> @@ -267,11 +267,7 @@ netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
> * use enormous buffer sizes on recvmsg() calls just to avoid
> * MSG_TRUNC when PAGE_SIZE is very large.
> */
> -#if PAGE_SIZE < 8192UL
> -#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(PAGE_SIZE)
> -#else
> -#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(8192UL)
> -#endif
> +#define NLMSG_GOODSIZE SKB_WITH_OVERHEAD(min(PAGE_SIZE, 8192UL))
>
> #define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
>
> diff --git a/net/core/hotdata.c b/net/core/hotdata.c
> index d0aaaaa556f22..e1f30e87ba6e9 100644
> --- a/net/core/hotdata.c
> +++ b/net/core/hotdata.c
> @@ -5,7 +5,7 @@
> #include <net/hotdata.h>
> #include <net/proto_memory.h>
>
> -struct net_hotdata net_hotdata __cacheline_aligned = {
> +__DEFINE_GLOBAL_PAGE_SIZE_VAR(struct net_hotdata, net_hotdata, __cacheline_aligned, {
> .offload_base = LIST_HEAD_INIT(net_hotdata.offload_base),
> .ptype_all = LIST_HEAD_INIT(net_hotdata.ptype_all),
> .gro_normal_batch = 8,
> @@ -21,5 +21,5 @@ struct net_hotdata net_hotdata __cacheline_aligned = {
> .sysctl_max_skb_frags = MAX_SKB_FRAGS,
> .sysctl_skb_defer_max = 64,
> .sysctl_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE
> -};
> +});
> EXPORT_SYMBOL(net_hotdata);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 83f8cd8aa2d16..b6c8eee0cc74b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -219,9 +219,9 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
> #define NAPI_SKB_CACHE_BULK 16
> #define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2)
>
> -#if PAGE_SIZE == SZ_4K
> +#if PAGE_SIZE_MIN <= SZ_4K && SZ_4K <= PAGE_SIZE_MAX
>
> -#define NAPI_HAS_SMALL_PAGE_FRAG 1
> +#define NAPI_HAS_SMALL_PAGE_FRAG (PAGE_SIZE == SZ_4K)
> #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) ((nc).pfmemalloc)
>
> /* specialized page frag allocator using a single order 0 page
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 86a2476678c48..a7a2eb7581bd1 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -33,7 +33,7 @@ static int int_3600 = 3600;
> static int min_sndbuf = SOCK_MIN_SNDBUF;
> static int min_rcvbuf = SOCK_MIN_RCVBUF;
> static int max_skb_frags = MAX_SKB_FRAGS;
> -static int min_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE;
> +static DEFINE_GLOBAL_PAGE_SIZE_VAR(int, min_mem_pcpu_rsv, SK_MEMORY_PCPU_RESERVE);
>
> static int net_msg_warn; /* Unused, but still a sysctl */
>
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index 95ff747061046..4e682c0cd7586 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -573,13 +573,14 @@ EXPORT_SYMBOL_GPL(cache_purge);
> */
>
> #define DFR_HASHSIZE (PAGE_SIZE/sizeof(struct list_head))
> +#define DFR_HASHSIZE_MAX (PAGE_SIZE_MAX/sizeof(struct list_head))
> #define DFR_HASH(item) ((((long)item)>>4 ^ (((long)item)>>13)) % DFR_HASHSIZE)
>
> #define DFR_MAX 300 /* ??? */
>
> static DEFINE_SPINLOCK(cache_defer_lock);
> static LIST_HEAD(cache_defer_list);
> -static struct hlist_head cache_defer_hash[DFR_HASHSIZE];
> +static struct hlist_head cache_defer_hash[DFR_HASHSIZE_MAX];
> static int cache_defer_cnt;
>
> static void __unhash_deferred_req(struct cache_deferred_req *dreq)
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 0be0dcb07f7b6..1cf9f583358af 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2024,7 +2024,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
> MAX_SKB_FRAGS * PAGE_SIZE);
> data_len = PAGE_ALIGN(data_len);
>
> - BUILD_BUG_ON(SKB_MAX_ALLOC < PAGE_SIZE);
> + BUG_ON(SKB_MAX_ALLOC < PAGE_SIZE);
> }
>
> skb = sock_alloc_send_pskb(sk, len - data_len, data_len,
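(Note for readers of the archive, not part of the patch.) The NAPI small page frag rework above splits one question into two: the preprocessor asks whether 4K is within the possible range (compile the code in or not), and a run-time expression asks whether 4K is the size actually in use. A minimal userspace sketch, assuming a 4K..64K range; the *_sketch names and the 1024-byte threshold are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_4K_SKETCH		4096UL
#define PAGE_SIZE_MIN_SKETCH	4096UL
#define PAGE_SIZE_MAX_SKETCH	65536UL

/* Stand-in for a boot-time page size; not a compile-time constant. */
static unsigned long page_size_sketch = 4096UL;

#if PAGE_SIZE_MIN_SKETCH <= SZ_4K_SKETCH && SZ_4K_SKETCH <= PAGE_SIZE_MAX_SKETCH
/* 4K is possible, so compile the path in and decide at run time. */
#define HAS_SMALL_PAGE_FRAG_SKETCH	(page_size_sketch == SZ_4K_SKETCH)
#else
/* 4K can never be selected, so the path compiles away entirely. */
#define HAS_SMALL_PAGE_FRAG_SKETCH	0
#endif

/* Hypothetical policy: only small requests use the small-frag path. */
static bool use_small_frag_sketch(unsigned int len)
{
	return HAS_SMALL_PAGE_FRAG_SKETCH && len <= 1024;
}
```

With a compile-time page size the range collapses to a single value, so the macro folds to a constant and the behaviour matches the original #if.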
* Re: [RFC PATCH v1 27/57] net: e1000: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 27/57] net: e1000: " Ryan Roberts
@ 2024-10-16 14:43 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:43 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Will Deacon, Przemek Kitszel, Tony Nguyen
Cc: intel-wired-lan, linux-arm-kernel, linux-kernel, linux-mm, netdev
+ Przemek Kitszel, Tony Nguyen
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Convert CPP conditionals to C conditionals. The compiler will dead code
> strip when doing a compile-time page size build, for the same end
> effect. But this will also work with boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index ab7ae418d2948..cc14788f5bb04 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -3553,12 +3553,10 @@ static int e1000_change_mtu(struct net_device *netdev, int new_mtu)
>
> if (max_frame <= E1000_RXBUFFER_2048)
> adapter->rx_buffer_len = E1000_RXBUFFER_2048;
> - else
> -#if (PAGE_SIZE >= E1000_RXBUFFER_16384)
> + else if (PAGE_SIZE >= E1000_RXBUFFER_16384)
> adapter->rx_buffer_len = E1000_RXBUFFER_16384;
> -#elif (PAGE_SIZE >= E1000_RXBUFFER_4096)
> + else if (PAGE_SIZE >= E1000_RXBUFFER_4096)
> adapter->rx_buffer_len = PAGE_SIZE;
> -#endif
>
> /* adjust allocation if LPE protects us, and we aren't using SBP */
> if (!hw->tbi_compatibility_on &&
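(Note for readers of the archive, not part of the patch.) The CPP-to-C conversion above can be checked by writing the selection as a plain function of the page size: with a constant argument the compiler can fold the untaken branches away, and with a run-time argument the same code still picks the right buffer length. A minimal userspace sketch; the *_sketch names are hypothetical, and current_len models the field being left untouched when no branch matches.

```c
#include <assert.h>

#define RXBUFFER_2048_SKETCH	2048U
#define RXBUFFER_4096_SKETCH	4096U
#define RXBUFFER_16384_SKETCH	16384U

/* Mirrors the if/else-if chain in the converted e1000_change_mtu() hunk. */
static unsigned int rx_buffer_len_sketch(unsigned int max_frame,
					 unsigned int page_size,
					 unsigned int current_len)
{
	if (max_frame <= RXBUFFER_2048_SKETCH)
		return RXBUFFER_2048_SKETCH;
	else if (page_size >= RXBUFFER_16384_SKETCH)
		return RXBUFFER_16384_SKETCH;
	else if (page_size >= RXBUFFER_4096_SKETCH)
		return page_size;

	/* No branch matched: keep the existing value. */
	return current_len;
}
```

For a 9000-byte frame this yields 4096 on a 4K boot and 16384 on 16K and 64K boots, the same values the original #if/#elif chain would have baked in per build.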
* Re: [RFC PATCH v1 28/57] net: igbvf: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 28/57] net: igbvf: " Ryan Roberts
@ 2024-10-16 14:44 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:44 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Will Deacon, Przemek Kitszel, Tony Nguyen
Cc: intel-wired-lan, linux-arm-kernel, linux-kernel, linux-mm, netdev
+ Przemek Kitszel, Tony Nguyen
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Convert CPP conditionals to C conditionals. The compiler will dead code
> strip when doing a compile-time page size build, for the same end
> effect. But this will also work with boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/net/ethernet/intel/igbvf/netdev.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
> index 925d7286a8ee4..2e11d999168de 100644
> --- a/drivers/net/ethernet/intel/igbvf/netdev.c
> +++ b/drivers/net/ethernet/intel/igbvf/netdev.c
> @@ -2419,12 +2419,10 @@ static int igbvf_change_mtu(struct net_device *netdev, int new_mtu)
> adapter->rx_buffer_len = 1024;
> else if (max_frame <= 2048)
> adapter->rx_buffer_len = 2048;
> - else
> -#if (PAGE_SIZE / 2) > 16384
> + else if ((PAGE_SIZE / 2) > 16384)
> adapter->rx_buffer_len = 16384;
> -#else
> + else
> adapter->rx_buffer_len = PAGE_SIZE / 2;
> -#endif
>
> /* adjust allocation if LPE protects us, and we aren't using SBP */
> if ((max_frame == ETH_FRAME_LEN + ETH_FCS_LEN) ||
* Re: [RFC PATCH v1 29/57] net: igb: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 29/57] net: igb: " Ryan Roberts
@ 2024-10-16 14:45 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:45 UTC (permalink / raw)
To: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Eric Dumazet, Greg Marsden,
Ivan Ivanov, Jakub Kicinski, Kalesh Singh, Marc Zyngier,
Mark Rutland, Matthias Brugger, Miroslav Benes, Paolo Abeni,
Will Deacon, Alexei Starovoitov, Daniel Borkmann,
Jesper Dangaard Brouer, John Fastabend, Przemek Kitszel,
Tony Nguyen
Cc: bpf, intel-wired-lan, linux-arm-kernel, linux-kernel, linux-mm,
netdev
+ Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Przemek Kitszel, Tony Nguyen
This was a rather tricky series to get the recipients right for, and my script
did not realize that "supporter" was a synonym for "maintainer", so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Convert CPP conditionals to C conditionals. The compiler will dead code
> strip when doing a compile-time page size build, for the same end
> effect. But this will also work with boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/net/ethernet/intel/igb/igb.h | 25 ++--
> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++++++-----------
> 2 files changed, 82 insertions(+), 92 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
> index 3c2dc7bdebb50..04aeebcd363b3 100644
> --- a/drivers/net/ethernet/intel/igb/igb.h
> +++ b/drivers/net/ethernet/intel/igb/igb.h
> @@ -158,7 +158,6 @@ struct vf_mac_filter {
> * up negative. In these cases we should fall back to the 3K
> * buffers.
> */
> -#if (PAGE_SIZE < 8192)
> #define IGB_MAX_FRAME_BUILD_SKB (IGB_RXBUFFER_1536 - NET_IP_ALIGN)
> #define IGB_2K_TOO_SMALL_WITH_PADDING \
> ((NET_SKB_PAD + IGB_TS_HDR_LEN + IGB_RXBUFFER_1536) > SKB_WITH_OVERHEAD(IGB_RXBUFFER_2048))
> @@ -177,6 +176,9 @@ static inline int igb_skb_pad(void)
> {
> int rx_buf_len;
>
> + if (PAGE_SIZE >= 8192)
> + return NET_SKB_PAD + NET_IP_ALIGN;
> +
> /* If a 2K buffer cannot handle a standard Ethernet frame then
> * optimize padding for a 3K buffer instead of a 1.5K buffer.
> *
> @@ -196,9 +198,6 @@ static inline int igb_skb_pad(void)
> }
>
> #define IGB_SKB_PAD igb_skb_pad()
> -#else
> -#define IGB_SKB_PAD (NET_SKB_PAD + NET_IP_ALIGN)
> -#endif
>
> /* How many Rx Buffers do we bundle into one write to the hardware ? */
> #define IGB_RX_BUFFER_WRITE 16 /* Must be power of 2 */
> @@ -280,7 +279,7 @@ struct igb_tx_buffer {
> struct igb_rx_buffer {
> dma_addr_t dma;
> struct page *page;
> -#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
> +#if (BITS_PER_LONG > 32) || (PAGE_SIZE_MAX >= 65536)
> __u32 page_offset;
> #else
> __u16 page_offset;
> @@ -403,22 +402,20 @@ enum e1000_ring_flags_t {
>
> static inline unsigned int igb_rx_bufsz(struct igb_ring *ring)
> {
> -#if (PAGE_SIZE < 8192)
> - if (ring_uses_large_buffer(ring))
> - return IGB_RXBUFFER_3072;
> + if (PAGE_SIZE < 8192) {
> + if (ring_uses_large_buffer(ring))
> + return IGB_RXBUFFER_3072;
>
> - if (ring_uses_build_skb(ring))
> - return IGB_MAX_FRAME_BUILD_SKB;
> -#endif
> + if (ring_uses_build_skb(ring))
> + return IGB_MAX_FRAME_BUILD_SKB;
> + }
> return IGB_RXBUFFER_2048;
> }
>
> static inline unsigned int igb_rx_pg_order(struct igb_ring *ring)
> {
> -#if (PAGE_SIZE < 8192)
> - if (ring_uses_large_buffer(ring))
> + if (PAGE_SIZE < 8192 && ring_uses_large_buffer(ring))
> return 1;
> -#endif
> return 0;
> }
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 1ef4cb871452a..4f2c53dece1a2 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -4797,9 +4797,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
> static void igb_set_rx_buffer_len(struct igb_adapter *adapter,
> struct igb_ring *rx_ring)
> {
> -#if (PAGE_SIZE < 8192)
> struct e1000_hw *hw = &adapter->hw;
> -#endif
>
> /* set build_skb and buffer size flags */
> clear_ring_build_skb_enabled(rx_ring);
> @@ -4810,12 +4808,11 @@ static void igb_set_rx_buffer_len(struct igb_adapter *adapter,
>
> set_ring_build_skb_enabled(rx_ring);
>
> -#if (PAGE_SIZE < 8192)
> - if (adapter->max_frame_size > IGB_MAX_FRAME_BUILD_SKB ||
> + if (PAGE_SIZE < 8192 &&
> + (adapter->max_frame_size > IGB_MAX_FRAME_BUILD_SKB ||
> IGB_2K_TOO_SMALL_WITH_PADDING ||
> - rd32(E1000_RCTL) & E1000_RCTL_SBP)
> + rd32(E1000_RCTL) & E1000_RCTL_SBP))
> set_ring_uses_large_buffer(rx_ring);
> -#endif
> }
>
> /**
> @@ -5314,12 +5311,10 @@ static void igb_set_rx_mode(struct net_device *netdev)
> E1000_RCTL_VFE);
> wr32(E1000_RCTL, rctl);
>
> -#if (PAGE_SIZE < 8192)
> - if (!adapter->vfs_allocated_count) {
> + if (PAGE_SIZE < 8192 && !adapter->vfs_allocated_count) {
> if (adapter->max_frame_size <= IGB_MAX_FRAME_BUILD_SKB)
> rlpml = IGB_MAX_FRAME_BUILD_SKB;
> }
> -#endif
> wr32(E1000_RLPML, rlpml);
>
> /* In order to support SR-IOV and eventually VMDq it is necessary to set
> @@ -5338,11 +5333,10 @@ static void igb_set_rx_mode(struct net_device *netdev)
>
> /* enable Rx jumbo frames, restrict as needed to support build_skb */
> vmolr &= ~E1000_VMOLR_RLPML_MASK;
> -#if (PAGE_SIZE < 8192)
> - if (adapter->max_frame_size <= IGB_MAX_FRAME_BUILD_SKB)
> + if (PAGE_SIZE < 8192 &&
> + adapter->max_frame_size <= IGB_MAX_FRAME_BUILD_SKB)
> vmolr |= IGB_MAX_FRAME_BUILD_SKB;
> else
> -#endif
> vmolr |= MAX_JUMBO_FRAME_SIZE;
> vmolr |= E1000_VMOLR_LPE;
>
> @@ -8435,17 +8429,17 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
> if (!dev_page_is_reusable(page))
> return false;
>
> -#if (PAGE_SIZE < 8192)
> - /* if we are only owner of page we can reuse it */
> - if (unlikely((rx_buf_pgcnt - pagecnt_bias) > 1))
> - return false;
> -#else
> + if (PAGE_SIZE < 8192) {
> + /* if we are only owner of page we can reuse it */
> + if (unlikely((rx_buf_pgcnt - pagecnt_bias) > 1))
> + return false;
> + } else {
> #define IGB_LAST_OFFSET \
> (SKB_WITH_OVERHEAD(PAGE_SIZE) - IGB_RXBUFFER_2048)
>
> - if (rx_buffer->page_offset > IGB_LAST_OFFSET)
> - return false;
> -#endif
> + if (rx_buffer->page_offset > IGB_LAST_OFFSET)
> + return false;
> + }
>
> /* If we have drained the page fragment pool we need to update
> * the pagecnt_bias and page count so that we fully restock the
> @@ -8473,20 +8467,22 @@ static void igb_add_rx_frag(struct igb_ring *rx_ring,
> struct sk_buff *skb,
> unsigned int size)
> {
> -#if (PAGE_SIZE < 8192)
> - unsigned int truesize = igb_rx_pg_size(rx_ring) / 2;
> -#else
> - unsigned int truesize = ring_uses_build_skb(rx_ring) ?
> + unsigned int truesize;
> +
> + if (PAGE_SIZE < 8192)
> + truesize = igb_rx_pg_size(rx_ring) / 2;
> + else
> + truesize = ring_uses_build_skb(rx_ring) ?
> SKB_DATA_ALIGN(IGB_SKB_PAD + size) :
> SKB_DATA_ALIGN(size);
> -#endif
> +
> skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
> rx_buffer->page_offset, size, truesize);
> -#if (PAGE_SIZE < 8192)
> - rx_buffer->page_offset ^= truesize;
> -#else
> - rx_buffer->page_offset += truesize;
> -#endif
> +
> + if (PAGE_SIZE < 8192)
> + rx_buffer->page_offset ^= truesize;
> + else
> + rx_buffer->page_offset += truesize;
> }
>
> static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
> @@ -8494,16 +8490,16 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
> struct xdp_buff *xdp,
> ktime_t timestamp)
> {
> -#if (PAGE_SIZE < 8192)
> - unsigned int truesize = igb_rx_pg_size(rx_ring) / 2;
> -#else
> - unsigned int truesize = SKB_DATA_ALIGN(xdp->data_end -
> - xdp->data_hard_start);
> -#endif
> unsigned int size = xdp->data_end - xdp->data;
> + unsigned int truesize;
> unsigned int headlen;
> struct sk_buff *skb;
>
> + if (PAGE_SIZE < 8192)
> + truesize = igb_rx_pg_size(rx_ring) / 2;
> + else
> + truesize = SKB_DATA_ALIGN(xdp->data_end - xdp->data_hard_start);
> +
> /* prefetch first cache line of first page */
> net_prefetch(xdp->data);
>
> @@ -8529,11 +8525,10 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
> skb_add_rx_frag(skb, 0, rx_buffer->page,
> (xdp->data + headlen) - page_address(rx_buffer->page),
> size, truesize);
> -#if (PAGE_SIZE < 8192)
> - rx_buffer->page_offset ^= truesize;
> -#else
> - rx_buffer->page_offset += truesize;
> -#endif
> + if (PAGE_SIZE < 8192)
> + rx_buffer->page_offset ^= truesize;
> + else
> + rx_buffer->page_offset += truesize;
> } else {
> rx_buffer->pagecnt_bias++;
> }
> @@ -8546,16 +8541,17 @@ static struct sk_buff *igb_build_skb(struct igb_ring *rx_ring,
> struct xdp_buff *xdp,
> ktime_t timestamp)
> {
> -#if (PAGE_SIZE < 8192)
> - unsigned int truesize = igb_rx_pg_size(rx_ring) / 2;
> -#else
> - unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> - SKB_DATA_ALIGN(xdp->data_end -
> - xdp->data_hard_start);
> -#endif
> unsigned int metasize = xdp->data - xdp->data_meta;
> + unsigned int truesize;
> struct sk_buff *skb;
>
> + if (PAGE_SIZE < 8192)
> + truesize = igb_rx_pg_size(rx_ring) / 2;
> + else
> + truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
> + SKB_DATA_ALIGN(xdp->data_end -
> + xdp->data_hard_start);
> +
> /* prefetch first cache line of first page */
> net_prefetch(xdp->data_meta);
>
> @@ -8575,11 +8571,10 @@ static struct sk_buff *igb_build_skb(struct igb_ring *rx_ring,
> skb_hwtstamps(skb)->hwtstamp = timestamp;
>
> /* update buffer offset */
> -#if (PAGE_SIZE < 8192)
> - rx_buffer->page_offset ^= truesize;
> -#else
> - rx_buffer->page_offset += truesize;
> -#endif
> + if (PAGE_SIZE < 8192)
> + rx_buffer->page_offset ^= truesize;
> + else
> + rx_buffer->page_offset += truesize;
>
> return skb;
> }
> @@ -8634,14 +8629,14 @@ static unsigned int igb_rx_frame_truesize(struct igb_ring *rx_ring,
> {
> unsigned int truesize;
>
> -#if (PAGE_SIZE < 8192)
> - truesize = igb_rx_pg_size(rx_ring) / 2; /* Must be power-of-2 */
> -#else
> - truesize = ring_uses_build_skb(rx_ring) ?
> - SKB_DATA_ALIGN(IGB_SKB_PAD + size) +
> - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
> - SKB_DATA_ALIGN(size);
> -#endif
> + if (PAGE_SIZE < 8192)
> + truesize = igb_rx_pg_size(rx_ring) / 2; /* Must be power-of-2 */
> + else
> + truesize = ring_uses_build_skb(rx_ring) ?
> + SKB_DATA_ALIGN(IGB_SKB_PAD + size) +
> + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) :
> + SKB_DATA_ALIGN(size);
> +
> return truesize;
> }
>
> @@ -8650,11 +8645,11 @@ static void igb_rx_buffer_flip(struct igb_ring *rx_ring,
> unsigned int size)
> {
> unsigned int truesize = igb_rx_frame_truesize(rx_ring, size);
> -#if (PAGE_SIZE < 8192)
> - rx_buffer->page_offset ^= truesize;
> -#else
> - rx_buffer->page_offset += truesize;
> -#endif
> +
> + if (PAGE_SIZE < 8192)
> + rx_buffer->page_offset ^= truesize;
> + else
> + rx_buffer->page_offset += truesize;
> }
>
> static inline void igb_rx_checksum(struct igb_ring *ring,
> @@ -8825,12 +8820,12 @@ static struct igb_rx_buffer *igb_get_rx_buffer(struct igb_ring *rx_ring,
> struct igb_rx_buffer *rx_buffer;
>
> rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
> - *rx_buf_pgcnt =
> -#if (PAGE_SIZE < 8192)
> - page_count(rx_buffer->page);
> -#else
> - 0;
> -#endif
> +
> + if (PAGE_SIZE < 8192)
> + *rx_buf_pgcnt = page_count(rx_buffer->page);
> + else
> + *rx_buf_pgcnt = 0;
> +
> prefetchw(rx_buffer->page);
>
> /* we are reusing so sync this buffer for CPU use */
> @@ -8881,9 +8876,8 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
> int rx_buf_pgcnt;
>
> /* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
> -#if (PAGE_SIZE < 8192)
> - frame_sz = igb_rx_frame_truesize(rx_ring, 0);
> -#endif
> + if (PAGE_SIZE < 8192)
> + frame_sz = igb_rx_frame_truesize(rx_ring, 0);
> xdp_init_buff(&xdp, frame_sz, &rx_ring->xdp_rxq);
>
> while (likely(total_packets < budget)) {
> @@ -8932,10 +8926,9 @@ static int igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
>
> xdp_prepare_buff(&xdp, hard_start, offset, size, true);
> xdp_buff_clear_frags_flag(&xdp);
> -#if (PAGE_SIZE > 4096)
> /* At larger PAGE_SIZE, frame_sz depend on len size */
> - xdp.frame_sz = igb_rx_frame_truesize(rx_ring, size);
> -#endif
> + if (PAGE_SIZE > 4096)
> + xdp.frame_sz = igb_rx_frame_truesize(rx_ring, size);
> skb = igb_run_xdp(adapter, rx_ring, &xdp);
> }
>
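To illustrate the pattern used throughout this patch, here is a minimal
user-space sketch of replacing "#if (PAGE_SIZE < 8192)" with a plain C
conditional. The helper name demo_next_rx_offset() and its page_size
parameter are hypothetical stand-ins and are not part of the igb driver. When
the page size is a compile-time constant the compiler can drop the dead
branch; when it is only known at boot, the test is evaluated at run time.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the igb driver: advance an Rx buffer
 * offset. page_size stands in for a PAGE_SIZE that may only be known at boot.
 */
static unsigned int demo_next_rx_offset(unsigned long page_size,
					unsigned int offset,
					unsigned int truesize)
{
	assert(truesize != 0);

	if (page_size < 8192)
		return offset ^ truesize;	/* flip between the two page halves */

	return offset + truesize;		/* walk forward through a large page */
}
```

With a 4K page and a truesize of 2048 the offset toggles between 0 and 2048;
with a 64K page it simply advances by truesize on each call.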
* Re: [RFC PATCH v1 30/57] drivers/base: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 30/57] drivers/base: " Ryan Roberts
@ 2024-10-16 14:45 ` Ryan Roberts
2024-10-16 15:04 ` Greg Kroah-Hartman
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:45 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Yury Norov, Greg Kroah-Hartman
Cc: linux-arm-kernel, linux-kernel, linux-mm
+ Greg Kroah-Hartman
This was a rather tricky series to get the recipients correct for and my script
did not realize that "supporter" was a synonym for "maintainer" so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Update BUILD_BUG_ON() to test against page size limits.
>
> CPUMAP_FILE_MAX_BYTES and CPULIST_FILE_MAX_BYTES are both defined
> relative to PAGE_SIZE, so when these values are assigned to global
> variables via BIN_ATTR_RO(), let's wrap them with
> DEFINE_GLOBAL_PAGE_SIZE_VAR() so that their assignment can be deferred
> until boot-time.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/base/node.c | 6 +++---
> drivers/base/topology.c | 32 ++++++++++++++++----------------
> include/linux/cpumask.h | 5 +++++
> 3 files changed, 24 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb72580288e62..30e6549e4c438 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -45,7 +45,7 @@ static inline ssize_t cpumap_read(struct file *file, struct kobject *kobj,
> return n;
> }
>
> -static BIN_ATTR_RO(cpumap, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(cpumap, CPUMAP_FILE_MAX_BYTES);
>
> static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> struct bin_attribute *attr, char *buf,
> @@ -66,7 +66,7 @@ static inline ssize_t cpulist_read(struct file *file, struct kobject *kobj,
> return n;
> }
>
> -static BIN_ATTR_RO(cpulist, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(cpulist, CPULIST_FILE_MAX_BYTES);
>
> /**
> * struct node_access_nodes - Access class device to hold user visible
> @@ -558,7 +558,7 @@ static ssize_t node_read_distance(struct device *dev,
> * buf is currently PAGE_SIZE in length and each node needs 4 chars
> * at the most (distance + space or newline).
> */
> - BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE);
> + BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE_MIN);
>
> for_each_online_node(i) {
> len += sysfs_emit_at(buf, len, "%s%d",
> diff --git a/drivers/base/topology.c b/drivers/base/topology.c
> index 89f98be5c5b99..bdbdbefd95b15 100644
> --- a/drivers/base/topology.c
> +++ b/drivers/base/topology.c
> @@ -62,47 +62,47 @@ define_id_show_func(ppin, "0x%llx");
> static DEVICE_ATTR_ADMIN_RO(ppin);
>
> define_siblings_read_func(thread_siblings, sibling_cpumask);
> -static BIN_ATTR_RO(thread_siblings, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(thread_siblings_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(thread_siblings, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(thread_siblings_list, CPULIST_FILE_MAX_BYTES);
>
> define_siblings_read_func(core_cpus, sibling_cpumask);
> -static BIN_ATTR_RO(core_cpus, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(core_cpus_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(core_cpus, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(core_cpus_list, CPULIST_FILE_MAX_BYTES);
>
> define_siblings_read_func(core_siblings, core_cpumask);
> -static BIN_ATTR_RO(core_siblings, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(core_siblings_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(core_siblings, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(core_siblings_list, CPULIST_FILE_MAX_BYTES);
>
> #ifdef TOPOLOGY_CLUSTER_SYSFS
> define_siblings_read_func(cluster_cpus, cluster_cpumask);
> -static BIN_ATTR_RO(cluster_cpus, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(cluster_cpus_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(cluster_cpus, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(cluster_cpus_list, CPULIST_FILE_MAX_BYTES);
> #endif
>
> #ifdef TOPOLOGY_DIE_SYSFS
> define_siblings_read_func(die_cpus, die_cpumask);
> -static BIN_ATTR_RO(die_cpus, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(die_cpus_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(die_cpus, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(die_cpus_list, CPULIST_FILE_MAX_BYTES);
> #endif
>
> define_siblings_read_func(package_cpus, core_cpumask);
> -static BIN_ATTR_RO(package_cpus, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(package_cpus_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(package_cpus, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(package_cpus_list, CPULIST_FILE_MAX_BYTES);
>
> #ifdef TOPOLOGY_BOOK_SYSFS
> define_id_show_func(book_id, "%d");
> static DEVICE_ATTR_RO(book_id);
> define_siblings_read_func(book_siblings, book_cpumask);
> -static BIN_ATTR_RO(book_siblings, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(book_siblings_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(book_siblings, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(book_siblings_list, CPULIST_FILE_MAX_BYTES);
> #endif
>
> #ifdef TOPOLOGY_DRAWER_SYSFS
> define_id_show_func(drawer_id, "%d");
> static DEVICE_ATTR_RO(drawer_id);
> define_siblings_read_func(drawer_siblings, drawer_cpumask);
> -static BIN_ATTR_RO(drawer_siblings, CPUMAP_FILE_MAX_BYTES);
> -static BIN_ATTR_RO(drawer_siblings_list, CPULIST_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(drawer_siblings, CPUMAP_FILE_MAX_BYTES);
> +static CPU_FILE_BIN_ATTR_RO(drawer_siblings_list, CPULIST_FILE_MAX_BYTES);
> #endif
>
> static struct bin_attribute *bin_attrs[] = {
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 53158de44b837..f654b4198abc2 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -1292,4 +1292,9 @@ cpumap_print_list_to_buf(char *buf, const struct cpumask *mask,
> ? (NR_CPUS * 9)/32 - 1 : PAGE_SIZE)
> #define CPULIST_FILE_MAX_BYTES (((NR_CPUS * 7)/2 > PAGE_SIZE) ? (NR_CPUS * 7)/2 : PAGE_SIZE)
>
> +#define CPU_FILE_BIN_ATTR_RO(_name, _size) \
> + DEFINE_GLOBAL_PAGE_SIZE_VAR(struct bin_attribute, \
> + bin_attr_##_name, \
> + __BIN_ATTR_RO(_name, _size))
> +
> #endif /* __LINUX_CPUMASK_H */
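The BUILD_BUG_ON() change above can be sketched in user space as follows.
This is only an illustration under stated assumptions: DEMO_PAGE_SIZE_MIN and
DEMO_MAX_NUMNODES are made-up values, and _Static_assert stands in for the
kernel's BUILD_BUG_ON(). The idea is that a limit which holds for the smallest
supported page size must also hold for whichever page size is picked at boot.

```c
#include <assert.h>

/* Illustrative values only; neither is taken from a real kernel config. */
#define DEMO_PAGE_SIZE_MIN	4096UL
#define DEMO_MAX_NUMNODES	1024UL

/* Compile-time check against the smallest page size that may be selected. */
_Static_assert(DEMO_MAX_NUMNODES * 4 <= DEMO_PAGE_SIZE_MIN,
	       "node distance output must fit in the smallest page");

/* Run-time view: any page size >= the minimum satisfies the same bound. */
static int demo_distance_buf_fits(unsigned long page_size)
{
	assert(page_size >= DEMO_PAGE_SIZE_MIN);
	return DEMO_MAX_NUMNODES * 4 <= page_size;
}
```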
* Re: [RFC PATCH v1 31/57] edac: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 31/57] edac: " Ryan Roberts
@ 2024-10-16 14:46 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:46 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon
Cc: linux-arm-kernel, linux-edac, linux-kernel, linux-mm
+ Borislav Petkov, Tony Luck
This was a rather tricky series to get the recipients correct for and my script
did not realize that "supporter" was a synonym for "maintainer" so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Convert PAGES_TO_MiB() and MiB_TO_PAGES() to use the ternary operator so
> that they continue to work with boot-time page size. Boot-time page size
> can't be used with CPP because its value is not known at compile time.
> For compile-time page size builds, the compiler will strip the dead
> branch for the same result.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/edac/edac_mc.h | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
> index 881b00eadf7a5..22132ee86e953 100644
> --- a/drivers/edac/edac_mc.h
> +++ b/drivers/edac/edac_mc.h
> @@ -37,13 +37,12 @@
> #include <linux/workqueue.h>
> #include <linux/edac.h>
>
> -#if PAGE_SHIFT < 20
> -#define PAGES_TO_MiB(pages) ((pages) >> (20 - PAGE_SHIFT))
> -#define MiB_TO_PAGES(mb) ((mb) << (20 - PAGE_SHIFT))
> -#else /* PAGE_SHIFT > 20 */
> -#define PAGES_TO_MiB(pages) ((pages) << (PAGE_SHIFT - 20))
> -#define MiB_TO_PAGES(mb) ((mb) >> (PAGE_SHIFT - 20))
> -#endif
> +#define PAGES_TO_MiB(pages) (PAGE_SHIFT < 20 ? \
> + ((pages) >> (20 - PAGE_SHIFT)) :\
> + ((pages) << (PAGE_SHIFT - 20)))
> +#define MiB_TO_PAGES(mb) (PAGE_SHIFT < 20 ? \
> + ((mb) << (20 - PAGE_SHIFT)) : \
> + ((mb) >> (PAGE_SHIFT - 20)))
>
> #define edac_printk(level, prefix, fmt, arg...) \
> printk(level "EDAC " prefix ": " fmt, ##arg)
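As a rough user-space sketch of the ternary form above, the macros below use
a run-time variable in place of PAGE_SHIFT. The names demo_page_shift,
DEMO_PAGES_TO_MiB() and DEMO_MiB_TO_PAGES() are hypothetical and exist only
for this example. Only the selected arm of the ternary is evaluated, so the
arm whose shift count would be out of range is never executed.

```c
#include <assert.h>

/* Hypothetical stand-in for a page shift that is only known at boot. */
static unsigned int demo_page_shift = 12;

#define DEMO_PAGES_TO_MiB(pages) (demo_page_shift < 20 ?	\
		((pages) >> (20 - demo_page_shift)) :		\
		((pages) << (demo_page_shift - 20)))
#define DEMO_MiB_TO_PAGES(mb) (demo_page_shift < 20 ?		\
		((mb) << (20 - demo_page_shift)) :		\
		((mb) >> (demo_page_shift - 20)))

/* Select a page shift; the range covers 4K (12) up to 64K (16) pages. */
static void demo_set_page_shift(unsigned int shift)
{
	assert(shift >= 12 && shift <= 16);
	demo_page_shift = shift;
}

static unsigned long demo_pages_to_mib(unsigned long pages)
{
	return DEMO_PAGES_TO_MiB(pages);
}

static unsigned long demo_mib_to_pages(unsigned long mb)
{
	return DEMO_MiB_TO_PAGES(mb);
}
```

For example, 256 pages of 4K and 16 pages of 64K both come out as 1 MiB.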
* Re: [RFC PATCH v1 36/57] xen: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 36/57] xen: " Ryan Roberts
@ 2024-10-16 14:46 ` Ryan Roberts
2024-10-23 1:23 ` Stefano Stabellini
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 14:46 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Juergen Gross, Stefano Stabellini
Cc: linux-arm-kernel, linux-kernel, linux-mm, xen-devel
+ Juergen Gross, Stefano Stabellini
This was a rather tricky series to get the recipients correct for and my script
did not realize that "supporter" was a synonym for "maintainer" so you were
missed off the original post. Apologies!
More context in cover letter:
https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
On 14/10/2024 11:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Allocate enough "frame_list" static storage in the balloon driver for
> the maximum supported page size, but continue to use only the first
> PAGE_SIZE bytes of the buffer at run-time to maintain existing behaviour.
>
> Refactor xen_biovec_phys_mergeable() to convert ifdeffery to C if/else.
> For compile-time page size, the compiler will choose one branch and
> strip the dead one. For boot-time, it can be evaluated at run time.
>
> Refactor a BUILD_BUG_ON to evaluate the limit (when the minimum
> supported page size is selected at boot-time).
>
> Reserve enough storage for max page size in "struct remap_data" and
> "struct xenbus_map_node".
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/xen/balloon.c | 11 ++++++-----
> drivers/xen/biomerge.c | 12 ++++++------
> drivers/xen/privcmd.c | 2 +-
> drivers/xen/xenbus/xenbus_client.c | 5 +++--
> drivers/xen/xlate_mmu.c | 6 +++---
> include/xen/page.h | 2 ++
> 6 files changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index 528395133b4f8..0ed5f6453af0e 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -131,7 +131,8 @@ struct balloon_stats balloon_stats;
> EXPORT_SYMBOL_GPL(balloon_stats);
>
> /* We increase/decrease in batches which fit in a page */
> -static xen_pfn_t frame_list[PAGE_SIZE / sizeof(xen_pfn_t)];
> +static xen_pfn_t frame_list[PAGE_SIZE_MAX / sizeof(xen_pfn_t)];
> +#define FRAME_LIST_NR_ENTRIES (PAGE_SIZE / sizeof(xen_pfn_t))
>
>
> /* List of ballooned pages, threaded through the mem_map array. */
> @@ -389,8 +390,8 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
> unsigned long i;
> struct page *page;
>
> - if (nr_pages > ARRAY_SIZE(frame_list))
> - nr_pages = ARRAY_SIZE(frame_list);
> + if (nr_pages > FRAME_LIST_NR_ENTRIES)
> + nr_pages = FRAME_LIST_NR_ENTRIES;
>
> page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
> for (i = 0; i < nr_pages; i++) {
> @@ -434,8 +435,8 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> int ret;
> LIST_HEAD(pages);
>
> - if (nr_pages > ARRAY_SIZE(frame_list))
> - nr_pages = ARRAY_SIZE(frame_list);
> + if (nr_pages > FRAME_LIST_NR_ENTRIES)
> + nr_pages = FRAME_LIST_NR_ENTRIES;
>
> for (i = 0; i < nr_pages; i++) {
> page = alloc_page(gfp);
> diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
> index 05a286d24f148..28f0887e40026 100644
> --- a/drivers/xen/biomerge.c
> +++ b/drivers/xen/biomerge.c
> @@ -8,16 +8,16 @@
> bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
> const struct page *page)
> {
> -#if XEN_PAGE_SIZE == PAGE_SIZE
> - unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
> - unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
> + if (XEN_PAGE_SIZE == PAGE_SIZE) {
> + unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
> + unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
> +
> + return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
> + }
>
> - return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
> -#else
> /*
> * XXX: Add support for merging bio_vec when using different page
> * size in Xen and Linux.
> */
> return false;
> -#endif
> }
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 9563650dfbafc..847f7b806caf7 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -557,7 +557,7 @@ static long privcmd_ioctl_mmap_batch(
> state.global_error = 0;
> state.version = version;
>
> - BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
> + BUILD_BUG_ON(((PAGE_SIZE_MIN / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE_MAX) != 0);
> /* mmap_batch_fn guarantees ret == 0 */
> BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
> &pagelist, mmap_batch_fn, &state));
> diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
> index 51b3124b0d56c..99bde836c10c4 100644
> --- a/drivers/xen/xenbus/xenbus_client.c
> +++ b/drivers/xen/xenbus/xenbus_client.c
> @@ -49,9 +49,10 @@
>
> #include "xenbus.h"
>
> -#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
> +#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
> +#define XENBUS_PAGES_MAX(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE_MIN))
>
> -#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES(XENBUS_MAX_RING_GRANTS))
> +#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES_MAX(XENBUS_MAX_RING_GRANTS))
>
> struct xenbus_map_node {
> struct list_head next;
> diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
> index f17c4c03db30c..a757c801a7542 100644
> --- a/drivers/xen/xlate_mmu.c
> +++ b/drivers/xen/xlate_mmu.c
> @@ -74,9 +74,9 @@ struct remap_data {
> int mapped;
>
> /* Hypercall parameters */
> - int h_errs[XEN_PFN_PER_PAGE];
> - xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
> - xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
> + int h_errs[XEN_PFN_PER_PAGE_MAX];
> + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE_MAX];
> + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE_MAX];
>
> int h_iter; /* Iterator */
> };
> diff --git a/include/xen/page.h b/include/xen/page.h
> index 285677b42943a..86683a30038a3 100644
> --- a/include/xen/page.h
> +++ b/include/xen/page.h
> @@ -21,6 +21,8 @@
> ((page_to_pfn(page)) << (PAGE_SHIFT - XEN_PAGE_SHIFT))
>
> #define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
> +#define XEN_PFN_PER_PAGE_MIN (PAGE_SIZE_MIN / XEN_PAGE_SIZE)
> +#define XEN_PFN_PER_PAGE_MAX (PAGE_SIZE_MAX / XEN_PAGE_SIZE)
>
> #define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
> #define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
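The balloon driver change above follows a "reserve for the maximum, use the
current" pattern, which can be sketched in user space as below. All names
(demo_page_size, DEMO_PAGE_SIZE_MAX, demo_pfn_t and so on) are hypothetical,
and the 64K maximum and 64-bit frame number type are assumptions made for the
example rather than facts about the Xen code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE_MAX	65536UL		/* assumed largest page size */

/* Hypothetical stand-in for a page size chosen at boot. */
static unsigned long demo_page_size = 4096UL;

typedef uint64_t demo_pfn_t;			/* assumed frame number type */

/* Static storage is sized for the largest page size... */
static demo_pfn_t demo_frame_list[DEMO_PAGE_SIZE_MAX / sizeof(demo_pfn_t)];
/* ...but only one current page worth of entries is used per batch. */
#define DEMO_FRAME_LIST_NR_ENTRIES (demo_page_size / sizeof(demo_pfn_t))

static size_t demo_frame_list_capacity(void)
{
	return sizeof(demo_frame_list) / sizeof(demo_frame_list[0]);
}

/* Clamp a request to the number of entries usable at the current page size. */
static unsigned long demo_clamp_batch(unsigned long nr_pages)
{
	assert(DEMO_FRAME_LIST_NR_ENTRIES <= demo_frame_list_capacity());

	if (nr_pages > DEMO_FRAME_LIST_NR_ENTRIES)
		nr_pages = DEMO_FRAME_LIST_NR_ENTRIES;
	return nr_pages;
}
```

ARRAY_SIZE() can no longer serve as the batch limit because it would report
the capacity for the maximum page size, not the page size actually in use.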
* Re: [RFC PATCH v1 21/57] sunrpc: Remove PAGE_SIZE compile-time constant assumption
2024-10-16 14:42 ` Ryan Roberts
@ 2024-10-16 14:47 ` Chuck Lever
2024-10-16 14:54 ` Jeff Layton
0 siblings, 1 reply; 196+ messages in thread
From: Chuck Lever @ 2024-10-16 14:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anna Schumaker, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Trond Myklebust, Will Deacon, Jeff Layton,
linux-arm-kernel, linux-kernel, linux-mm, linux-nfs
On Wed, Oct 16, 2024 at 03:42:12PM +0100, Ryan Roberts wrote:
> + Chuck Lever, Jeff Layton
>
> This was a rather tricky series to get the recipients correct for and my script
> did not realize that "supporter" was a synonym for "maintainer" so you were
> missed off the original post. Apologies!
>
> More context in cover letter:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
>
> On 14/10/2024 11:58, Ryan Roberts wrote:
> > To prepare for supporting boot-time page size selection, refactor code
> > to remove assumptions about PAGE_SIZE being compile-time constant. Code
> > intended to be equivalent when compile-time page size is active.
> >
> > Updated array sizes in various structs to contain enough entries for the
> > smallest supported page size.
> >
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >
> > ***NOTE***
> > Any confused maintainers may want to read the cover note here for context:
> > https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> >
> > include/linux/sunrpc/svc.h | 8 +++++---
> > include/linux/sunrpc/svc_rdma.h | 4 ++--
> > include/linux/sunrpc/svcsock.h | 2 +-
> > 3 files changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index a7d0406b9ef59..dda44018b8f36 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -160,6 +160,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > */
> > #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > + 2 + 1)
> > +#define RPCSVC_MAXPAGES_MAX ((RPCSVC_MAXPAYLOAD+PAGE_SIZE_MIN-1)/PAGE_SIZE_MIN \
> > + + 2 + 1)
There is already a "MAX" in the name, so adding this new macro seems
superfluous to me. Can we get away with simply updating the
"RPCSVC_MAXPAGES" macro, instead of adding this new one?
> > /*
> > * The context of a single thread, including the request currently being
> > @@ -190,14 +192,14 @@ struct svc_rqst {
> > struct xdr_stream rq_res_stream;
> > struct page *rq_scratch_page;
> > struct xdr_buf rq_res;
> > - struct page *rq_pages[RPCSVC_MAXPAGES + 1];
> > + struct page *rq_pages[RPCSVC_MAXPAGES_MAX + 1];
> > struct page * *rq_respages; /* points into rq_pages */
> > struct page * *rq_next_page; /* next reply page to use */
> > struct page * *rq_page_end; /* one past the last page */
> >
> > struct folio_batch rq_fbatch;
> > - struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
> > - struct bio_vec rq_bvec[RPCSVC_MAXPAGES];
> > + struct kvec rq_vec[RPCSVC_MAXPAGES_MAX]; /* generally useful.. */
> > + struct bio_vec rq_bvec[RPCSVC_MAXPAGES_MAX];
> >
> > __be32 rq_xid; /* transmission id */
> > u32 rq_prog; /* program number */
> > diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> > index d33bab33099ab..7c6441e8d6f7a 100644
> > --- a/include/linux/sunrpc/svc_rdma.h
> > +++ b/include/linux/sunrpc/svc_rdma.h
> > @@ -200,7 +200,7 @@ struct svc_rdma_recv_ctxt {
> > struct svc_rdma_pcl rc_reply_pcl;
> >
> > unsigned int rc_page_count;
> > - struct page *rc_pages[RPCSVC_MAXPAGES];
> > + struct page *rc_pages[RPCSVC_MAXPAGES_MAX];
> > };
> >
> > /*
> > @@ -242,7 +242,7 @@ struct svc_rdma_send_ctxt {
> > void *sc_xprt_buf;
> > int sc_page_count;
> > int sc_cur_sge_no;
> > - struct page *sc_pages[RPCSVC_MAXPAGES];
> > + struct page *sc_pages[RPCSVC_MAXPAGES_MAX];
> > struct ib_sge sc_sges[];
> > };
> >
> > diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
> > index 7c78ec6356b92..6c6bcc82685a3 100644
> > --- a/include/linux/sunrpc/svcsock.h
> > +++ b/include/linux/sunrpc/svcsock.h
> > @@ -40,7 +40,7 @@ struct svc_sock {
> >
> > struct completion sk_handshake_done;
> >
> > - struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
> > + struct page * sk_pages[RPCSVC_MAXPAGES_MAX]; /* received data */
> > };
> >
> > static inline u32 svc_sock_reclen(struct svc_sock *svsk)
>
--
Chuck Lever
* Re: [RFC PATCH v1 21/57] sunrpc: Remove PAGE_SIZE compile-time constant assumption
2024-10-16 14:47 ` Chuck Lever
@ 2024-10-16 14:54 ` Jeff Layton
2024-10-16 15:09 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Jeff Layton @ 2024-10-16 14:54 UTC (permalink / raw)
To: Chuck Lever, Ryan Roberts
Cc: Andrew Morton, Anna Schumaker, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Trond Myklebust, Will Deacon, linux-arm-kernel,
linux-kernel, linux-mm, linux-nfs
On Wed, 2024-10-16 at 10:47 -0400, Chuck Lever wrote:
> On Wed, Oct 16, 2024 at 03:42:12PM +0100, Ryan Roberts wrote:
> > + Chuck Lever, Jeff Layton
> >
> > This was a rather tricky series to get the recipients correct for and my script
> > did not realize that "supporter" was a synonym for "maintainer" so you were
> > missed off the original post. Apologies!
> >
> > More context in cover letter:
> > https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> >
> >
> > On 14/10/2024 11:58, Ryan Roberts wrote:
> > > To prepare for supporting boot-time page size selection, refactor code
> > > to remove assumptions about PAGE_SIZE being compile-time constant. Code
> > > intended to be equivalent when compile-time page size is active.
> > >
> > > Updated array sizes in various structs to contain enough entries for the
> > > smallest supported page size.
> > >
> > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > > ---
> > >
> > > ***NOTE***
> > > Any confused maintainers may want to read the cover note here for context:
> > > https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> > >
> > > include/linux/sunrpc/svc.h | 8 +++++---
> > > include/linux/sunrpc/svc_rdma.h | 4 ++--
> > > include/linux/sunrpc/svcsock.h | 2 +-
> > > 3 files changed, 8 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index a7d0406b9ef59..dda44018b8f36 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -160,6 +160,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > > */
> > > #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > > + 2 + 1)
> > > +#define RPCSVC_MAXPAGES_MAX ((RPCSVC_MAXPAYLOAD+PAGE_SIZE_MIN-1)/PAGE_SIZE_MIN \
> > > + + 2 + 1)
>
> There is already a "MAX" in the name, so adding this new macro seems
> superfluous to me. Can we get away with simply updating the
> "RPCSVC_MAXPAGES" macro, instead of adding this new one?
>
+1 that was my thinking too. This is mostly just used to size arrays,
so we might as well just change the existing macro.
With 64k pages we probably wouldn't need arrays as long as these will
be. Making those array sizes settable at runtime is not a trivial
project, though.
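For a rough sense of the sizes involved, here is a minimal sketch of the
RPCSVC_MAXPAGES arithmetic as a function of page size. It assumes a maximum
payload of 1 MiB, which is an assumption made for illustration and is not
stated in this thread; demo_rpcsvc_maxpages() and DEMO_MAXPAYLOAD are
hypothetical names.

```c
#include <assert.h>

/* Assumed payload limit for illustration only. */
#define DEMO_MAXPAYLOAD	(1024UL * 1024UL)

/* Same shape as the macro in the patch: round up to pages, then add 2 + 1. */
static unsigned long demo_rpcsvc_maxpages(unsigned long page_size)
{
	assert(page_size != 0);
	return (DEMO_MAXPAYLOAD + page_size - 1) / page_size + 2 + 1;
}
```

Under that assumption, sizing for 4K pages gives 259 entries per array while
64K pages would need only 19, which is the overhead a 64K boot pays when the
arrays are sized for the smallest supported page size.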
>
> > > /*
> > > * The context of a single thread, including the request currently being
> > > @@ -190,14 +192,14 @@ struct svc_rqst {
> > > struct xdr_stream rq_res_stream;
> > > struct page *rq_scratch_page;
> > > struct xdr_buf rq_res;
> > > - struct page *rq_pages[RPCSVC_MAXPAGES + 1];
> > > + struct page *rq_pages[RPCSVC_MAXPAGES_MAX + 1];
> > > struct page * *rq_respages; /* points into rq_pages */
> > > struct page * *rq_next_page; /* next reply page to use */
> > > struct page * *rq_page_end; /* one past the last page */
> > >
> > > struct folio_batch rq_fbatch;
> > > - struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
> > > - struct bio_vec rq_bvec[RPCSVC_MAXPAGES];
> > > + struct kvec rq_vec[RPCSVC_MAXPAGES_MAX]; /* generally useful.. */
> > > + struct bio_vec rq_bvec[RPCSVC_MAXPAGES_MAX];
> > >
> > > __be32 rq_xid; /* transmission id */
> > > u32 rq_prog; /* program number */
> > > diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> > > index d33bab33099ab..7c6441e8d6f7a 100644
> > > --- a/include/linux/sunrpc/svc_rdma.h
> > > +++ b/include/linux/sunrpc/svc_rdma.h
> > > @@ -200,7 +200,7 @@ struct svc_rdma_recv_ctxt {
> > > struct svc_rdma_pcl rc_reply_pcl;
> > >
> > > unsigned int rc_page_count;
> > > - struct page *rc_pages[RPCSVC_MAXPAGES];
> > > + struct page *rc_pages[RPCSVC_MAXPAGES_MAX];
> > > };
> > >
> > > /*
> > > @@ -242,7 +242,7 @@ struct svc_rdma_send_ctxt {
> > > void *sc_xprt_buf;
> > > int sc_page_count;
> > > int sc_cur_sge_no;
> > > - struct page *sc_pages[RPCSVC_MAXPAGES];
> > > + struct page *sc_pages[RPCSVC_MAXPAGES_MAX];
> > > struct ib_sge sc_sges[];
> > > };
> > >
> > > diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
> > > index 7c78ec6356b92..6c6bcc82685a3 100644
> > > --- a/include/linux/sunrpc/svcsock.h
> > > +++ b/include/linux/sunrpc/svcsock.h
> > > @@ -40,7 +40,7 @@ struct svc_sock {
> > >
> > > struct completion sk_handshake_done;
> > >
> > > - struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
> > > + struct page * sk_pages[RPCSVC_MAXPAGES_MAX]; /* received data */
> > > };
> > >
> > > static inline u32 svc_sock_reclen(struct svc_sock *svsk)
> >
>
--
Jeff Layton <jlayton@kernel.org>
* Re: [RFC PATCH v1 30/57] drivers/base: Remove PAGE_SIZE compile-time constant assumption
2024-10-16 14:45 ` Ryan Roberts
@ 2024-10-16 15:04 ` Greg Kroah-Hartman
2024-10-16 15:12 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Greg Kroah-Hartman @ 2024-10-16 15:04 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Yury Norov, linux-arm-kernel, linux-kernel, linux-mm
On Wed, Oct 16, 2024 at 03:45:48PM +0100, Ryan Roberts wrote:
> + Greg Kroah-Hartman
>
> This was a rather tricky series to get the recipients correct for and my script
> did not realize that "supporter" was a pseudonym for "maintainer" so you were
> missed off the original post. Apologies!
"supporter" is actually a much stronger "signal" than "maintainer"
according to the MAINTAINERS file:
Supported: Someone is actually paid to look after this.
Maintained: Someone actually looks after it.
> More context in cover letter:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
Ick, good luck!
greg k-h
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 21/57] sunrpc: Remove PAGE_SIZE compile-time constant assumption
2024-10-16 14:54 ` Jeff Layton
@ 2024-10-16 15:09 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 15:09 UTC (permalink / raw)
To: Jeff Layton, Chuck Lever
Cc: Andrew Morton, Anna Schumaker, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Trond Myklebust, Will Deacon, linux-arm-kernel,
linux-kernel, linux-mm, linux-nfs
On 16/10/2024 15:54, Jeff Layton wrote:
> On Wed, 2024-10-16 at 10:47 -0400, Chuck Lever wrote:
>> On Wed, Oct 16, 2024 at 03:42:12PM +0100, Ryan Roberts wrote:
>>> + Chuck Lever, Jeff Layton
>>>
>>> This was a rather tricky series to get the recipients correct for and my script
>>> did not realize that "supporter" was a pseudonym for "maintainer" so you were
>>> missed off the original post. Apologies!
>>>
>>> More context in cover letter:
>>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>>
>>>
>>> On 14/10/2024 11:58, Ryan Roberts wrote:
>>>> To prepare for supporting boot-time page size selection, refactor code
>>>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>>>> intended to be equivalent when compile-time page size is active.
>>>>
>>>> Updated array sizes in various structs to contain enough entries for the
>>>> smallest supported page size.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>
>>>> ***NOTE***
>>>> Any confused maintainers may want to read the cover note here for context:
>>>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>>>
>>>> include/linux/sunrpc/svc.h | 8 +++++---
>>>> include/linux/sunrpc/svc_rdma.h | 4 ++--
>>>> include/linux/sunrpc/svcsock.h | 2 +-
>>>> 3 files changed, 8 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>> index a7d0406b9ef59..dda44018b8f36 100644
>>>> --- a/include/linux/sunrpc/svc.h
>>>> +++ b/include/linux/sunrpc/svc.h
>>>> @@ -160,6 +160,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>> */
>>>> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
>>>> + 2 + 1)
>>>> +#define RPCSVC_MAXPAGES_MAX ((RPCSVC_MAXPAYLOAD+PAGE_SIZE_MIN-1)/PAGE_SIZE_MIN \
>>>> + + 2 + 1)
>>
>> There is already a "MAX" in the name, so adding this new macro seems
>> superfluous to me. Can we get away with simply updating the
>> "RPCSVC_MAXPAGES" macro, instead of adding this new one?
>>
>
> +1 that was my thinking too. This is mostly just used to size arrays,
> so we might as well just change the existing macro.
I agree, it's not the prettiest. I was (incorrectly) assuming you would want to
continue to limit the number of actual pages at runtime based on the in-use page
size. That said, looking again at the code, RPCSVC_MAXPAGES never actually gets
used to dynamically allocate any memory. So I propose to just do the following:
#define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE_MIN-1)/PAGE_SIZE_MIN \
				+ 2 + 1)
That will be 259 in practice (assuming PAGE_SIZE_MIN=4K).
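For anyone wanting to check that figure, here is a small userspace sketch of the
same formula. It assumes RPCSVC_MAXPAYLOAD is 1 MiB (my reading of svc.h, not
something stated in this thread) and takes the page size as a parameter instead
of using the kernel's PAGE_SIZE_MIN; rpcsvc_maxpages() is a hypothetical helper
name used only for illustration.

```c
/* Assumption: mirrors a 1 MiB RPCSVC_MAXPAYLOAD; not taken from this thread. */
#define RPCSVC_MAXPAYLOAD (1u * 1024 * 1024)

/*
 * Same shape as the RPCSVC_MAXPAGES macro: round the payload up to whole
 * pages, then add the "+ 2 + 1" extra pages from the original definition.
 * The page size is a parameter here so different sizes can be compared.
 */
static unsigned int rpcsvc_maxpages(unsigned int page_size)
{
	return (RPCSVC_MAXPAYLOAD + page_size - 1) / page_size + 2 + 1;
}
```

Under that assumption, rpcsvc_maxpages(4096) evaluates to 259, while 16K and 64K
pages would only need 67 and 19 entries respectively, which is the over-sizing
Jeff refers to below.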
>
> With 64k pages we probably wouldn't need arrays as long as these will
> be. Fixing those array sizes to be settable at runtime is not a
> trivial project though.
Indeed. Hopefully the above is sufficient.
Thanks for the review!
Ryan
>
>>
>>>> /*
>>>> * The context of a single thread, including the request currently being
>>>> @@ -190,14 +192,14 @@ struct svc_rqst {
>>>> struct xdr_stream rq_res_stream;
>>>> struct page *rq_scratch_page;
>>>> struct xdr_buf rq_res;
>>>> - struct page *rq_pages[RPCSVC_MAXPAGES + 1];
>>>> + struct page *rq_pages[RPCSVC_MAXPAGES_MAX + 1];
>>>> struct page * *rq_respages; /* points into rq_pages */
>>>> struct page * *rq_next_page; /* next reply page to use */
>>>> struct page * *rq_page_end; /* one past the last page */
>>>>
>>>> struct folio_batch rq_fbatch;
>>>> - struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
>>>> - struct bio_vec rq_bvec[RPCSVC_MAXPAGES];
>>>> + struct kvec rq_vec[RPCSVC_MAXPAGES_MAX]; /* generally useful.. */
>>>> + struct bio_vec rq_bvec[RPCSVC_MAXPAGES_MAX];
>>>>
>>>> __be32 rq_xid; /* transmission id */
>>>> u32 rq_prog; /* program number */
>>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>>>> index d33bab33099ab..7c6441e8d6f7a 100644
>>>> --- a/include/linux/sunrpc/svc_rdma.h
>>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>>> @@ -200,7 +200,7 @@ struct svc_rdma_recv_ctxt {
>>>> struct svc_rdma_pcl rc_reply_pcl;
>>>>
>>>> unsigned int rc_page_count;
>>>> - struct page *rc_pages[RPCSVC_MAXPAGES];
>>>> + struct page *rc_pages[RPCSVC_MAXPAGES_MAX];
>>>> };
>>>>
>>>> /*
>>>> @@ -242,7 +242,7 @@ struct svc_rdma_send_ctxt {
>>>> void *sc_xprt_buf;
>>>> int sc_page_count;
>>>> int sc_cur_sge_no;
>>>> - struct page *sc_pages[RPCSVC_MAXPAGES];
>>>> + struct page *sc_pages[RPCSVC_MAXPAGES_MAX];
>>>> struct ib_sge sc_sges[];
>>>> };
>>>>
>>>> diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
>>>> index 7c78ec6356b92..6c6bcc82685a3 100644
>>>> --- a/include/linux/sunrpc/svcsock.h
>>>> +++ b/include/linux/sunrpc/svcsock.h
>>>> @@ -40,7 +40,7 @@ struct svc_sock {
>>>>
>>>> struct completion sk_handshake_done;
>>>>
>>>> - struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
>>>> + struct page * sk_pages[RPCSVC_MAXPAGES_MAX]; /* received data */
>>>> };
>>>>
>>>> static inline u32 svc_sock_reclen(struct svc_sock *svsk)
>>>
>>
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 30/57] drivers/base: Remove PAGE_SIZE compile-time constant assumption
2024-10-16 15:04 ` Greg Kroah-Hartman
@ 2024-10-16 15:12 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 15:12 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Yury Norov, linux-arm-kernel, linux-kernel, linux-mm
On 16/10/2024 16:04, Greg Kroah-Hartman wrote:
> On Wed, Oct 16, 2024 at 03:45:48PM +0100, Ryan Roberts wrote:
>> + Greg Kroah-Hartman
>>
>> This was a rather tricky series to get the recipients correct for and my script
>> did not realize that "supporter" was a pseudonym for "maintainer" so you were
>> missed off the original post. Apologies!
>
> "supporter" is actually a much stronger "signal" than "maintainer"
> according to the MAINTAINERS file:
> Supported: Someone is actually paid to look after this.
> Maintained: Someone actually looks after it.
Yes, consider me educated now. For some reason my brain always thought
"supporter" was someone who was active in the subsystem but without an official
affiliation.
>
>> More context in cover letter:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> Ick, good luck!
>
> greg k-h
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
` (2 preceding siblings ...)
2024-10-15 18:38 ` Michael Kelley
@ 2024-10-16 15:16 ` David Hildenbrand
2024-10-16 16:08 ` Ryan Roberts
2024-10-17 12:27 ` Petr Tesarik
` (4 subsequent siblings)
8 siblings, 1 reply; 196+ messages in thread
From: David Hildenbrand @ 2024-10-16 15:16 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Donald Dutile
Cc: linux-arm-kernel, linux-kernel, linux-mm
> Performance Testing
> ===================
>
> I've run some limited performance benchmarks:
>
> First, a real-world benchmark that causes a lot of page table manipulation (and
> therefore we would expect to see regression here if we are going to see it
> anywhere); kernel compilation. It barely registers a change. Values are times,
> so smaller is better. All relative to base-4k:
>
> | | kern | kern | user | user | real | real |
> | config | mean | stdev | mean | stdev | mean | stdev |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>
> The Speedometer JavaScript benchmark also shows no change. Values are runs per
> min, so bigger is better. All relative to base-4k:
>
> | config | mean | stdev |
> |-------------|---------|---------|
> | base-4k | 0.0% | 0.8% |
> | compile-4k | 0.4% | 0.8% |
> | boot-4k | 0.0% | 0.9% |
>
> Finally, I've run some microbenchmarks known to stress page table manipulations
> (originally from David Hildenbrand). The fork test maps/allocs 1G of anon
> memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
> memory then measures the cost of munmap()ing it. The fork test is known to be
> extremely sensitive to any changes that cause instructions to be aligned
> differently in cachelines. When using this test for other changes, I've seen
> double digit regressions for the slightest thing, so 12% regression on this test
> is actually fairly good. This likely represents the extreme worst case for
> regressions that will be observed across other microbenchmarks (famous last
> words). Values are times, so smaller is better. All relative to base-4k:
>
... and here I am, worrying about much smaller degradation in these
micro-benchmarks ;) You're right, these are pure micro-benchmarks, and
while 12% does sound like "much", even stupid compiler code movement can
result in such changes in the fork() micro benchmark.
So I think this is just fine, and actually "surprisingly" small. And,
there is even a way to statically compile a page size and not worry
about that at all.
As discussed ahead of time, I consider this change very valuable. In
RHEL, the biggest issue is actually the test matrix, that cannot really
be reduced significantly ... but it will make shipping/packaging easier.
CCing Don, who did the separate 64k RHEL flavor kernel.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-16 15:16 ` David Hildenbrand
@ 2024-10-16 16:08 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-16 16:08 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Donald Dutile
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 16/10/2024 16:16, David Hildenbrand wrote:
>> Performance Testing
>> ===================
>>
>> I've run some limited performance benchmarks:
>>
>> First, a real-world benchmark that causes a lot of page table manipulation (and
>> therefore we would expect to see regression here if we are going to see it
>> anywhere); kernel compilation. It barely registers a change. Values are times,
>> so smaller is better. All relative to base-4k:
>>
>> | | kern | kern | user | user | real | real |
>> | config | mean | stdev | mean | stdev | mean | stdev |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
>> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
>> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>>
>> The Speedometer JavaScript benchmark also shows no change. Values are runs per
>> min, so bigger is better. All relative to base-4k:
>>
>> | config | mean | stdev |
>> |-------------|---------|---------|
>> | base-4k | 0.0% | 0.8% |
>> | compile-4k | 0.4% | 0.8% |
>> | boot-4k | 0.0% | 0.9% |
>>
>> Finally, I've run some microbenchmarks known to stress page table manipulations
>> (originally from David Hildenbrand). The fork test maps/allocs 1G of anon
>> memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
>> memory then measures the cost of munmap()ing it. The fork test is known to be
>> extremely sensitive to any changes that cause instructions to be aligned
>> differently in cachelines. When using this test for other changes, I've seen
>> double digit regressions for the slightest thing, so 12% regression on this test
>> is actually fairly good. This likely represents the extreme worst case for
>> regressions that will be observed across other microbenchmarks (famous last
>> words). Values are times, so smaller is better. All relative to base-4k:
>>
>
> ... and here I am, worrying about much smaller degradation in these
> micro-benchmarks ;) You're right, these are pure micro-benchmarks, and while
> 12% does sound like "much", even stupid compiler code movement can result in
> such changes in the fork() micro benchmark.
>
> So I think this is just fine, and actually "surprisingly" small. And, there is
> even a way to statically compile a page size and not worry about that at all.
>
> As discussed ahead of time, I consider this change very valuable. In RHEL, the
> biggest issue is actually the test matrix, that cannot really be reduced
> significantly ... but it will make shipping/packaging easier.
>
> CCing Don, who did the separate 64k RHEL flavor kernel.
>
Thanks, David! I'm planning to investigate and see if I can improve even on that
12%. I have a couple of ideas. But like you say, I don't think this should be a
blocker to moving forwards.
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 34/57] sata_sil24: " Ryan Roberts
@ 2024-10-17 9:09 ` Niklas Cassel
2024-10-17 12:42 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Niklas Cassel @ 2024-10-17 9:09 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook
Hello Ryan,
While I realize that this has not always been consistent,
please prefix the subject with "ata: ", so that it becomes
"ata: sata_sil24: ".
On Mon, Oct 14, 2024 at 11:58:41AM +0100, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Convert "struct sil24_ata_block" and "struct sil24_atapi_block" to use a
> flexible array member for their sge[] array. The previous static size of
> SIL24_MAX_SGE depends on PAGE_SIZE so doesn't work for boot-time page
> size.
>
> Wrap global variables that are initialized with PAGE_SIZE derived values
> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
> deferred for boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/ata/sata_sil24.c | 46 +++++++++++++++++++---------------------
> 1 file changed, 22 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
> index 72c03cbdaff43..85c6382976626 100644
> --- a/drivers/ata/sata_sil24.c
> +++ b/drivers/ata/sata_sil24.c
> @@ -42,26 +42,25 @@ struct sil24_sge {
> __le32 flags;
> };
>
> +/*
> + * sil24 fetches in chunks of 64bytes. The first block
> + * contains the PRB and two SGEs. From the second block, it's
> + * consisted of four SGEs and called SGT. Calculate the
> + * number of SGTs that fit into one page.
> + */
> +#define SIL24_PRB_SZ (sizeof(struct sil24_prb) + 2 * sizeof(struct sil24_sge))
> +#define SIL24_MAX_SGT ((PAGE_SIZE - SIL24_PRB_SZ) / (4 * sizeof(struct sil24_sge)))
> +
> +/*
> + * This will give us one unused SGEs for ATA. This extra SGE
> + * will be used to store CDB for ATAPI devices.
> + */
> +#define SIL24_MAX_SGE (4 * SIL24_MAX_SGT + 1)
>
> enum {
> SIL24_HOST_BAR = 0,
> SIL24_PORT_BAR = 2,
>
> - /* sil24 fetches in chunks of 64bytes. The first block
> - * contains the PRB and two SGEs. From the second block, it's
> - * consisted of four SGEs and called SGT. Calculate the
> - * number of SGTs that fit into one page.
> - */
> - SIL24_PRB_SZ = sizeof(struct sil24_prb)
> - + 2 * sizeof(struct sil24_sge),
> - SIL24_MAX_SGT = (PAGE_SIZE - SIL24_PRB_SZ)
> - / (4 * sizeof(struct sil24_sge)),
> -
> - /* This will give us one unused SGEs for ATA. This extra SGE
> - * will be used to store CDB for ATAPI devices.
> - */
> - SIL24_MAX_SGE = 4 * SIL24_MAX_SGT + 1,
> -
> /*
> * Global controller registers (128 bytes @ BAR0)
> */
> @@ -244,13 +243,13 @@ enum {
>
> struct sil24_ata_block {
> struct sil24_prb prb;
> - struct sil24_sge sge[SIL24_MAX_SGE];
> + struct sil24_sge sge[];
> };
>
> struct sil24_atapi_block {
> struct sil24_prb prb;
> u8 cdb[16];
> - struct sil24_sge sge[SIL24_MAX_SGE];
> + struct sil24_sge sge[];
> };
>
> union sil24_cmd_block {
> @@ -373,7 +372,7 @@ static struct pci_driver sil24_pci_driver = {
> #endif
> };
>
> -static const struct scsi_host_template sil24_sht = {
> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct scsi_host_template, sil24_sht, {
> __ATA_BASE_SHT(DRV_NAME),
> .can_queue = SIL24_MAX_CMDS,
> .sg_tablesize = SIL24_MAX_SGE,
> @@ -382,7 +381,7 @@ static const struct scsi_host_template sil24_sht = {
> .sdev_groups = ata_ncq_sdev_groups,
> .change_queue_depth = ata_scsi_change_queue_depth,
> .device_configure = ata_scsi_device_configure
> -};
> +});
>
> static struct ata_port_operations sil24_ops = {
> .inherits = &sata_pmp_port_ops,
> @@ -1193,7 +1192,7 @@ static int sil24_port_start(struct ata_port *ap)
> struct device *dev = ap->host->dev;
> struct sil24_port_priv *pp;
> union sil24_cmd_block *cb;
> - size_t cb_size = sizeof(*cb) * SIL24_MAX_CMDS;
> + size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
> dma_addr_t cb_dma;
>
> pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
> @@ -1258,7 +1257,6 @@ static void sil24_init_controller(struct ata_host *host)
>
> static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> {
> - extern int __MARKER__sil24_cmd_block_is_sized_wrongly;
> struct ata_port_info pi = sil24_port_info[ent->driver_data];
> const struct ata_port_info *ppi[] = { &pi, NULL };
> void __iomem * const *iomap;
> @@ -1266,9 +1264,9 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> int rc;
> u32 tmp;
>
> - /* cause link error if sil24_cmd_block is sized wrongly */
> - if (sizeof(union sil24_cmd_block) != PAGE_SIZE)
> - __MARKER__sil24_cmd_block_is_sized_wrongly = 1;
> + /* union sil24_cmd_block must be PAGE_SIZE */
> + BUG_ON(struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE) != PAGE_SIZE);
> + BUG_ON(struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE) > PAGE_SIZE);
>
> ata_print_version_once(&pdev->dev, DRV_VERSION);
>
> --
> 2.43.0
>
As you might know, there is an effort to annotate all flexible array
members with their run-time size information, see commit:
dd06e72e68bc ("Compiler Attributes: Add __counted_by macro")
I haven't looked at the DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST macro, but since
sge[] now becomes a flexible array member, I think it would be nice if it
would be possible to somehow use the __counted_by macro.
Other than that, this looks good to me.
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
2024-10-15 10:55 ` Ryan Roberts
@ 2024-10-17 12:21 ` Michal Hocko
0 siblings, 0 replies; 196+ messages in thread
From: Michal Hocko @ 2024-10-17 12:21 UTC (permalink / raw)
To: Ryan Roberts
Cc: Shakeel Butt, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Johannes Weiner, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Roman Gushchin, Will Deacon,
cgroups, linux-arm-kernel, linux-kernel, linux-mm
On Tue 15-10-24 11:55:26, Ryan Roberts wrote:
> On 14/10/2024 20:59, Shakeel Butt wrote:
> > On Mon, Oct 14, 2024 at 11:58:10AM GMT, Ryan Roberts wrote:
> >> Previously the seq_buf used for accumulating the memory.stat output was
> >> sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
> >> If 4K is enough on a 4K page system, then it should also be enough on a
> >> 64K page system, so we can save 60K on the static buffer used in
> >> mem_cgroup_print_oom_meminfo(). Let's make it so.
> >>
> >> This also has the beneficial side effect of removing a place in the code
> >> that assumed PAGE_SIZE is a compile-time constant. So this helps our
> >> quest towards supporting boot-time page size selection.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>
> Thanks Shakeel and Johannes, for the acks. Given this patch is totally
> independent, I'll plan to resubmit it on its own and hopefully we can get it in
> independently of the rest of the series.
Yes, this makes sense independent of the whole series.
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
` (3 preceding siblings ...)
2024-10-16 15:16 ` David Hildenbrand
@ 2024-10-17 12:27 ` Petr Tesarik
2024-10-17 12:32 ` Ryan Roberts
2024-10-17 22:05 ` Dave Kleikamp
` (3 subsequent siblings)
8 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-10-17 12:27 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Mon, 14 Oct 2024 11:55:11 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
>[...]
> The series is arranged as follows:
>
> - patch 1: Add macros required for converting non-arch code to support
> boot-time page size selection
> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> non-arch code
I have just tried to recompile the openSUSE kernel with these patches
applied, and I'm running into this:
CC arch/arm64/hyperv/hv_core.o
In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
u8 reserved2[PAGE_SIZE - 68];
^~~~~~~~~
It looks like one more place which needs a patch, right?
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-17 12:27 ` Petr Tesarik
@ 2024-10-17 12:32 ` Ryan Roberts
2024-10-18 12:56 ` Petr Tesarik
` (3 more replies)
0 siblings, 4 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-17 12:32 UTC (permalink / raw)
To: Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 17/10/2024 13:27, Petr Tesarik wrote:
> On Mon, 14 Oct 2024 11:55:11 +0100
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> [...]
>> The series is arranged as follows:
>>
>> - patch 1: Add macros required for converting non-arch code to support
>> boot-time page size selection
>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>> non-arch code
>
> I have just tried to recompile the openSUSE kernel with these patches
> applied, and I'm running into this:
>
> CC arch/arm64/hyperv/hv_core.o
> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
> u8 reserved2[PAGE_SIZE - 68];
> ^~~~~~~~~
>
> It looks like one more place which needs a patch, right?
As mentioned in the cover letter, so far I've only converted enough to get the
defconfig *image* building (i.e. no modules). If you are compiling a different
config or compiling the modules for defconfig, you will likely run into these
types of issues.
That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
enough to send me.
I understand that SUSE might be able to help with wider performance testing - if
that's the reason you are trying to compile, you could send me your config and
I'll start working on fixing up other drivers?
Thanks,
Ryan
>
> Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-17 9:09 ` Niklas Cassel
@ 2024-10-17 12:42 ` Ryan Roberts
2024-10-17 12:51 ` Niklas Cassel
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-17 12:42 UTC (permalink / raw)
To: Niklas Cassel
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook
On 17/10/2024 10:09, Niklas Cassel wrote:
> Hello Ryan,
>
> While I realize that this has not always been consistent,
> please prefix the subject with "ata: ", so that it becomes
> "ata: sata_sil24: ".
Noted; I'll fix this in the next version.
>
> On Mon, Oct 14, 2024 at 11:58:41AM +0100, Ryan Roberts wrote:
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>>
>> Convert "struct sil24_ata_block" and "struct sil24_atapi_block" to use a
>> flexible array member for their sge[] array. The previous static size of
>> SIL24_MAX_SGE depends on PAGE_SIZE so doesn't work for boot-time page
>> size.
>>
>> Wrap global variables that are initialized with PAGE_SIZE derived values
>> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
>> deferred for boot-time page size builds.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> drivers/ata/sata_sil24.c | 46 +++++++++++++++++++---------------------
>> 1 file changed, 22 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
>> index 72c03cbdaff43..85c6382976626 100644
>> --- a/drivers/ata/sata_sil24.c
>> +++ b/drivers/ata/sata_sil24.c
>> @@ -42,26 +42,25 @@ struct sil24_sge {
>> __le32 flags;
>> };
>>
>> +/*
>> + * sil24 fetches in chunks of 64bytes. The first block
>> + * contains the PRB and two SGEs. From the second block, it's
>> + * consisted of four SGEs and called SGT. Calculate the
>> + * number of SGTs that fit into one page.
>> + */
>> +#define SIL24_PRB_SZ (sizeof(struct sil24_prb) + 2 * sizeof(struct sil24_sge))
>> +#define SIL24_MAX_SGT ((PAGE_SIZE - SIL24_PRB_SZ) / (4 * sizeof(struct sil24_sge)))
>> +
>> +/*
>> + * This will give us one unused SGEs for ATA. This extra SGE
>> + * will be used to store CDB for ATAPI devices.
>> + */
>> +#define SIL24_MAX_SGE (4 * SIL24_MAX_SGT + 1)
>>
>> enum {
>> SIL24_HOST_BAR = 0,
>> SIL24_PORT_BAR = 2,
>>
>> - /* sil24 fetches in chunks of 64bytes. The first block
>> - * contains the PRB and two SGEs. From the second block, it's
>> - * consisted of four SGEs and called SGT. Calculate the
>> - * number of SGTs that fit into one page.
>> - */
>> - SIL24_PRB_SZ = sizeof(struct sil24_prb)
>> - + 2 * sizeof(struct sil24_sge),
>> - SIL24_MAX_SGT = (PAGE_SIZE - SIL24_PRB_SZ)
>> - / (4 * sizeof(struct sil24_sge)),
>> -
>> - /* This will give us one unused SGEs for ATA. This extra SGE
>> - * will be used to store CDB for ATAPI devices.
>> - */
>> - SIL24_MAX_SGE = 4 * SIL24_MAX_SGT + 1,
>> -
>> /*
>> * Global controller registers (128 bytes @ BAR0)
>> */
>> @@ -244,13 +243,13 @@ enum {
>>
>> struct sil24_ata_block {
>> struct sil24_prb prb;
>> - struct sil24_sge sge[SIL24_MAX_SGE];
>> + struct sil24_sge sge[];
>> };
>>
>> struct sil24_atapi_block {
>> struct sil24_prb prb;
>> u8 cdb[16];
>> - struct sil24_sge sge[SIL24_MAX_SGE];
>> + struct sil24_sge sge[];
>> };
>>
>> union sil24_cmd_block {
>> @@ -373,7 +372,7 @@ static struct pci_driver sil24_pci_driver = {
>> #endif
>> };
>>
>> -static const struct scsi_host_template sil24_sht = {
>> +static DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(struct scsi_host_template, sil24_sht, {
>> __ATA_BASE_SHT(DRV_NAME),
>> .can_queue = SIL24_MAX_CMDS,
>> .sg_tablesize = SIL24_MAX_SGE,
>> @@ -382,7 +381,7 @@ static const struct scsi_host_template sil24_sht = {
>> .sdev_groups = ata_ncq_sdev_groups,
>> .change_queue_depth = ata_scsi_change_queue_depth,
>> .device_configure = ata_scsi_device_configure
>> -};
>> +});
>>
>> static struct ata_port_operations sil24_ops = {
>> .inherits = &sata_pmp_port_ops,
>> @@ -1193,7 +1192,7 @@ static int sil24_port_start(struct ata_port *ap)
>> struct device *dev = ap->host->dev;
>> struct sil24_port_priv *pp;
>> union sil24_cmd_block *cb;
>> - size_t cb_size = sizeof(*cb) * SIL24_MAX_CMDS;
>> + size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
>> dma_addr_t cb_dma;
>>
>> pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
>> @@ -1258,7 +1257,6 @@ static void sil24_init_controller(struct ata_host *host)
>>
>> static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>> {
>> - extern int __MARKER__sil24_cmd_block_is_sized_wrongly;
>> struct ata_port_info pi = sil24_port_info[ent->driver_data];
>> const struct ata_port_info *ppi[] = { &pi, NULL };
>> void __iomem * const *iomap;
>> @@ -1266,9 +1264,9 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>> int rc;
>> u32 tmp;
>>
>> - /* cause link error if sil24_cmd_block is sized wrongly */
>> - if (sizeof(union sil24_cmd_block) != PAGE_SIZE)
>> - __MARKER__sil24_cmd_block_is_sized_wrongly = 1;
>> + /* union sil24_cmd_block must be PAGE_SIZE */
>> + BUG_ON(struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE) != PAGE_SIZE);
>> + BUG_ON(struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE) > PAGE_SIZE);
>>
>> ata_print_version_once(&pdev->dev, DRV_VERSION);
>>
>> --
>> 2.43.0
>>
>
> As you might know, there is an effort to annotate all flexible array
> members with their run-time size information, see commit:
> dd06e72e68bc ("Compiler Attributes: Add __counted_by macro")
I'm vaguely aware of it. But as I understand it, __counted_by() nominates
another member in the struct which keeps the count? In this case, there is no
such member, its size is implicit based on the value of PAGE_SIZE. So I'm not
sure if it's practical to use it here?
>
> I haven't looked at the DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST macro, but since
DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(), when doing a boot-time page size build,
defers the initialization of the global variable to kernel init time, when
PAGE_SIZE is known. Because SIL24_MAX_SGE is defined in terms of PAGE_SIZE, this
deferral is required.
> sge[] now becomes a flexible array member, I think it would be nice if it
> would be possible to somehow use the __counted_by macro.
>
> Other than that, this looks good to me.
Thanks for the review!
>
>
> Kind regards,
> Niklas
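To illustrate the size checks in the quoted patch, here is a minimal userspace
sketch, assuming stand-in types. struct sge, struct prb, BLOCK_SIZE(),
max_sge() and cmd_block_fits() are hypothetical names for illustration only,
and the field sizes are not taken from the driver. BLOCK_SIZE() plays the role
that struct_size_t() plays in the patch: once sge[] is a flexible array member,
sizeof() no longer covers it, so the total size has to be computed and checked
at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's types; sizes are illustrative. */
struct sge { uint64_t addr; uint32_t cnt; uint32_t flags; };	/* 16 bytes */
struct prb { uint8_t raw[64]; };

struct ata_block {
	struct prb prb;
	struct sge sge[];	/* flexible array member: not counted by sizeof */
};

struct atapi_block {
	struct prb prb;
	uint8_t cdb[16];
	struct sge sge[];
};

/* Userspace analogue of struct_size_t(): header plus n trailing elements. */
#define BLOCK_SIZE(type, n) \
	(offsetof(type, sge) + (size_t)(n) * sizeof(struct sge))

/* Number of SGEs that fill one page for the larger (ATAPI) layout. */
static size_t max_sge(size_t page_size)
{
	assert(page_size > offsetof(struct atapi_block, sge));
	return (page_size - offsetof(struct atapi_block, sge)) /
	       sizeof(struct sge);
}

/* Returns 1 if both layouts satisfy the constraints the BUG_ON()s express. */
static int cmd_block_fits(size_t page_size)
{
	size_t n = max_sge(page_size);

	return BLOCK_SIZE(struct atapi_block, n) == page_size &&
	       BLOCK_SIZE(struct ata_block, n) <= page_size;
}
```

With these stand-in sizes, cmd_block_fits() holds for 4K, 16K and 64K pages,
which is the property a boot-time page size build can only verify once
PAGE_SIZE is known.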
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-17 12:42 ` Ryan Roberts
@ 2024-10-17 12:51 ` Niklas Cassel
2024-10-21 9:24 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Niklas Cassel @ 2024-10-17 12:51 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook, Gustavo A. R. Silva
On Thu, Oct 17, 2024 at 01:42:22PM +0100, Ryan Roberts wrote:
> On 17/10/2024 10:09, Niklas Cassel wrote:
(snip)
> > As you might know, there is an effort to annotate all flexible array
> > members with their run-time size information, see commit:
> > dd06e72e68bc ("Compiler Attributes: Add __counted_by macro")
>
> I'm vaguely aware of it. But as I understand it, __counted_by() nominates
> another member in the struct which keeps the count? In this case, there is no
> such member, its size is implicit based on the value of PAGE_SIZE. So I'm not
> sure if it's practical to use it here?
Neither am I :)
Perhaps some of the flexible array member experts like
Kees Cook or Gustavo A. R. Silva could help us out here.
Would it make sense to add another struct member and simply initialize
it to PAGE_SIZE, in order to be able to use the __counted_by macro?
>
> >
> > I haven't looked at the DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST macro, but since
>
> DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(), when doing a boot-time page size build,
> defers the initialization of the global variable to kernel init time, when
> PAGE_SIZE is known. Because SIL24_MAX_SGE is defined in terms of PAGE_SIZE, this
> deferral is required.
>
> > sge[] now becomes a flexible array member, I think it would be nice if it
> > would be possible to somehow use the __counted_by macro.
> >
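For reference, a minimal userspace sketch of what an explicit count member
could look like. Everything here is an assumption for illustration:
struct sge_table, nr_sge, sge_table_alloc() and the COUNTED_BY() wrapper are
hypothetical, and the wrapper expands to nothing on compilers that lack the
counted_by attribute. Note that the attribute counts array elements, so the
member would hold the number of SGEs rather than PAGE_SIZE in bytes. Also,
since the patch asserts that the command block is exactly PAGE_SIZE, an extra
member inside that block may not be free to add; the sketch therefore uses a
separate software-side structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fall back to a no-op where the compiler lacks the counted_by attribute. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(m) __attribute__((counted_by(m)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(m)
#endif

/* Hypothetical stand-in for the driver's scatter/gather entry. */
struct sge { uint64_t addr; uint32_t cnt; uint32_t flags; };

/* Hypothetical wrapper: nr_sge is NOT part of any hardware-visible layout. */
struct sge_table {
	size_t nr_sge;				/* number of elements in sge[] */
	struct sge sge[] COUNTED_BY(nr_sge);
};

/* Allocate a zeroed table with room for nr_sge entries; NULL on failure. */
static struct sge_table *sge_table_alloc(size_t nr_sge)
{
	struct sge_table *t;

	assert(nr_sge > 0);
	t = calloc(1, sizeof(*t) + nr_sge * sizeof(t->sge[0]));
	if (t)
		t->nr_sge = nr_sge;	/* set the count before touching sge[] */
	return t;
}
```

The count is assigned immediately after allocation so that any
bounds-checking instrumentation sees a valid length before the first access
to sge[].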
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
2024-10-14 10:58 ` [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large Ryan Roberts
2024-10-14 13:00 ` Johannes Weiner
2024-10-14 19:59 ` Shakeel Butt
@ 2024-10-17 16:09 ` Roman Gushchin
2 siblings, 0 replies; 196+ messages in thread
From: Roman Gushchin @ 2024-10-17 16:09 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Johannes Weiner,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Michal Hocko, Miroslav Benes, Shakeel Butt, Will Deacon, cgroups,
linux-arm-kernel, linux-kernel, linux-mm
On Mon, Oct 14, 2024 at 11:58:10AM +0100, Ryan Roberts wrote:
> Previously the seq_buf used for accumulating the memory.stat output was
> sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
> If 4K is enough on a 4K page system, then it should also be enough on a
> 64K page system, so we can save 60K on the static buffer used in
> mem_cgroup_print_oom_meminfo(). Let's make it so.
>
> This also has the beneficial side effect of removing a place in the code
> that assumed PAGE_SIZE is a compile-time constant. So this helps our
> quest towards supporting boot-time page size selection.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Thanks!
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
` (4 preceding siblings ...)
2024-10-17 12:27 ` Petr Tesarik
@ 2024-10-17 22:05 ` Dave Kleikamp
2024-10-21 11:49 ` Ryan Roberts
2024-10-18 18:15 ` Joseph Salisbury
` (2 subsequent siblings)
8 siblings, 1 reply; 196+ messages in thread
From: Dave Kleikamp @ 2024-10-17 22:05 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 5:55AM, Ryan Roberts wrote:
> Hi All,
>
> Patch bomb incoming... This covers many subsystems, so I've included a core set
> of people on the full series and additionally included maintainers on relevant
> patches. I haven't included those maintainers on this cover letter since the
> numbers were far too big for it to work. But I've included a link to this cover
> letter on each patch, so they can hopefully find their way here. For follow up
> submissions I'll break it up by subsystem, but for now thought it was important
> to show the full picture.
>
> This RFC series implements support for boot-time page size selection within the
> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
> size has been selected at compile-time, meaning the size is baked into a given
> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
> starts to present a problem for distributions. Boot-time page size selection
> enables the creation of a single kernel image, which can be told which page size
> to use on the kernel command line.
This looks really promising. Building and maintaining separate kernels
is costly. Being able to build one kernel for three potential page
sizes would not only cut down on the overhead of producing kernel
packages and images, but also ease benchmarking and testing different
page sizes without the need to build and install multiple kernels.
I'm also impressed that the patches are less intrusive than I would have
expected. I'm looking forward to seeing this project move forward.
Thanks,
Shaggy
>
> Why is having an image-per-page size problematic?
> =================================================
>
> Many traditional distros are now supporting both 4K and 64K. And this means
> managing 2 kernel packages, along with drivers for each. For some, it means
> multiple installer flavours and multiple ISOs. All of this adds up to a
> less-than-ideal level of complexity. Additionally, Android now supports 4K and
> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
> painful, and the extra flash space required for both kernel images and the
> duplicated modules has been problematic. Boot-time page size selection solves
> all of this.
>
> Additionally, in starting to think about the longer term deployment story for
> D128 page tables, which Arm architecture now supports, a lot of the same
> problems need to be solved, so this work sets us up nicely for that.
>
> So what's the down side?
> ========================
>
> Well nothing's free; Various static allocations in the kernel image must be
> sized for the worst case (largest supported page size), so image size is in line
> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
> is a slight increase to the image size. But I expect that problem goes away if
> you're compressing the image - it's just some extra zeros. At boot-time, I expect
> we could free the unused static storage once we know the page size - although
> that would be a follow up enhancement.
>
> And then there is performance. Since PAGE_SIZE and friends are no longer
> compile-time constants, we must look up their values and do arithmetic at
> runtime instead of compile-time. My early perf testing suggests this is
> imperceptible for real-world workloads, and only has a small impact on
> microbenchmarks - more on this below.
>
> Approach
> ========
>
> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> friends are compile-time constant, but in a way that allows the compiler to
> perform the same optimizations as were previously being done if they do turn out
> to be compile-time constant. Where constants are required, we use limits;
> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
> of all the classes of problems to solve.
>
> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
> which is an alternative to selecting a compile-time page size.
>
> When boot-time page size is active, the arch pgtable geometry macro definitions
> resolve to something that can be configured at boot. The arm64 implementation in
> this series mainly uses global, __ro_after_init variables. I've tried using
> alternatives patching, but that performs worse than loading from memory; I think
> due to code size bloat.
>
> Status
> ======
>
> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented enough
> to compile the kernel image itself with defconfig (and a few other bits and
> pieces). This is enough to build a kernel that can boot under QEMU or FVP. I'll
> happily do the rest of the work to enable all the extra drivers, but wanted to
> get feedback on the shape of this effort first. If anyone wants to do any
> testing, and has a must-have config, let me know and I'll prioritize enabling it
> first.
>
> The series is arranged as follows:
>
> - patch 1: Add macros required for converting non-arch code to support
> boot-time page size selection
> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> non-arch code
> - patches 37-38: Some arm64 tidy ups
> - patch 39: Add macros required for converting arm64 code to support
> boot-time page size selection
> - patches 40-56: arm64 changes to support boot-time page size selection
> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> selection
>
> Ideally, I'd like to get the basics merged (something like this series), then
> incrementally improve it over a handful of kernel releases until we can
> demonstrate that we have feature parity with the compile-time build and no
> performance blockers. Once at that point, ideally the compile-time build options
> would be removed and the code could be cleaned up further.
>
> One of the bigger pieces that I'd propose to add as a follow up, is to make
> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> handling.
>
> Assuming people are amenable to the rough shape, how would I go about getting
> the non-arch changes merged? Since they cover many subsystems, will each piece
> need to go independently to each relevant maintainer or could it all be merged
> together through the arm64 tree?
>
> Image Size
> ==========
>
> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> kernel image on disk for base (before any changes applied), compile (with
> changes, configured for compile-time page size) and boot (with changes,
> configured for boot-time page size).
>
> You can see that the compile-16k and 64k configs are actually slightly smaller
> than the baselines; that's due to optimizing some buffer sizes which didn't need
> to depend on page size during the series. The boot-time image is ~1% bigger than
> the 64k compile-time image. I believe there is scope to improve this to make it
> equal to compile-64k if required:
>
> | config | size/KB | diff/KB | diff/% |
> |-------------|---------|---------|---------|
> | base-4k | 54895 | 0 | 0.0% |
> | base-16k | 55161 | 266 | 0.5% |
> | base-64k | 56775 | 1880 | 3.4% |
> | compile-4k | 54895 | 0 | 0.0% |
> | compile-16k | 55097 | 202 | 0.4% |
> | compile-64k | 56391 | 1496 | 2.7% |
> | boot-4k | 57045 | 2150 | 3.9% |
>
> And below shows the size of the image in memory at run-time, separated for text
> and data costs. The boot image has ~1% text cost; most likely due to the fact
> that PAGE_SIZE and friends are not compile-time constants so need instructions
> to load the values and do arithmetic. I believe we could eventually get the data
> cost to match the cost for the compile image for the chosen page size by freeing
> the ends of the static buffers not needed for the selected page size:
>
> | | text | text | text | data | data | data |
> | config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
> | base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
> | base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
> | compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
> | compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
> | compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
> | boot-4k | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
>
> Functional Testing
> ==================
>
> I've build-tested defconfig for all arches supported by tuxmake (which is most)
> without issue.
>
> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page sizes
> and a few va-sizes, and additionally have run all the mm-selftests, with no
> regressions observed vs the equivalent compile-time page size build (although
> the mm-selftests have a few existing failures when run against 16K and 64K
> kernels - those should really be investigated and fixed independently).
>
> Test coverage is lacking for many of the drivers that I've touched, but in many
> cases, I'm hoping the changes are simple enough that review might suffice?
>
> Performance Testing
> ===================
>
> I've run some limited performance benchmarks:
>
> First, a real-world benchmark that causes a lot of page table manipulation (and
> therefore we would expect to see regression here if we are going to see it
> anywhere); kernel compilation. It barely registers a change. Values are times,
> so smaller is better. All relative to base-4k:
>
> | | kern | kern | user | user | real | real |
> | config | mean | stdev | mean | stdev | mean | stdev |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>
> The Speedometer JavaScript benchmark also shows no change. Values are runs per
> min, so bigger is better. All relative to base-4k:
>
> | config | mean | stdev |
> |-------------|---------|---------|
> | base-4k | 0.0% | 0.8% |
> | compile-4k | 0.4% | 0.8% |
> | boot-4k | 0.0% | 0.9% |
>
> Finally, I've run some microbenchmarks known to stress page table manipulations
> (originally from David Hildenbrand). The fork test maps/allocs 1G of anon
> memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
> memory then measures the cost of munmap()ing it. The fork test is known to be
> extremely sensitive to any changes that cause instructions to be aligned
> differently in cachelines. When using this test for other changes, I've seen
> double digit regressions for the slightest thing, so 12% regression on this test
> is actually fairly good. This likely represents the extreme worst case for
> regressions that will be observed across other microbenchmarks (famous last
> words). Values are times, so smaller is better. All relative to base-4k:
>
> | | fork | fork | munmap | munmap |
> | config | mean | stdev | mean | stdev |
> |-------------|---------|---------|---------|---------|
> | base-4k | 0.0% | 1.3% | 0.0% | 0.3% |
> | compile-4k | 0.1% | 1.3% | -0.9% | 0.1% |
> | boot-4k | 12.8% | 1.2% | 3.8% | 1.0% |
>
> NOTE: The series applies on top of v6.11.
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (57):
> mm: Add macros ahead of supporting boot-time page size selection
> vmlinux: Align to PAGE_SIZE_MAX
> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> mm: Avoid split pmd ptl if pmd level is run-time folded
> mm: Remove PAGE_SIZE compile-time constant assumption
> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> fs: Remove PAGE_SIZE compile-time constant assumption
> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> fork: Permit boot-time THREAD_SIZE determination
> cgroup: Remove PAGE_SIZE compile-time constant assumption
> bpf: Remove PAGE_SIZE compile-time constant assumption
> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> perf: Remove PAGE_SIZE compile-time constant assumption
> kvm: Remove PAGE_SIZE compile-time constant assumption
> trace: Remove PAGE_SIZE compile-time constant assumption
> crash: Remove PAGE_SIZE compile-time constant assumption
> crypto: Remove PAGE_SIZE compile-time constant assumption
> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> sound: Remove PAGE_SIZE compile-time constant assumption
> net: Remove PAGE_SIZE compile-time constant assumption
> net: fec: Remove PAGE_SIZE compile-time constant assumption
> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> net: igb: Remove PAGE_SIZE compile-time constant assumption
> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> edac: Remove PAGE_SIZE compile-time constant assumption
> optee: Remove PAGE_SIZE compile-time constant assumption
> random: Remove PAGE_SIZE compile-time constant assumption
> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> virtio: Remove PAGE_SIZE compile-time constant assumption
> xen: Remove PAGE_SIZE compile-time constant assumption
> arm64: Fix macros to work in C code in addition to the linker script
> arm64: Track early pgtable allocation limit
> arm64: Introduce macros required for boot-time page selection
> arm64: Refactor early pgtable size calculation macros
> arm64: Pass desired page size on command line
> arm64: Divorce early init from PAGE_SIZE
> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> arm64: Align sections to PAGE_SIZE_MAX
> arm64: Rework trampoline rodata mapping
> arm64: Generalize fixmap for boot-time page size
> arm64: Statically allocate and align for worst-case page size
> arm64: Convert switch to if for non-const comparison values
> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> arm64: Remove PAGE_SZ asm-offset
> arm64: Introduce cpu features for page sizes
> arm64: Remove PAGE_SIZE from assembly code
> arm64: Runtime-fold pmd level
> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> arm64: TRAMP_VALIAS is no longer compile-time constant
> arm64: Determine THREAD_SIZE at boot-time
> arm64: Enable boot-time page size selection
>
> arch/alpha/include/asm/page.h | 1 +
> arch/arc/include/asm/page.h | 1 +
> arch/arm/include/asm/page.h | 1 +
> arch/arm64/Kconfig | 26 ++-
> arch/arm64/include/asm/assembler.h | 78 ++++++-
> arch/arm64/include/asm/cpufeature.h | 44 +++-
> arch/arm64/include/asm/efi.h | 2 +-
> arch/arm64/include/asm/fixmap.h | 28 ++-
> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> arch/arm64/include/asm/kvm_arm.h | 21 +-
> arch/arm64/include/asm/kvm_hyp.h | 11 +
> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> arch/arm64/include/asm/memory.h | 62 ++++--
> arch/arm64/include/asm/page-def.h | 3 +-
> arch/arm64/include/asm/pgalloc.h | 16 +-
> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> arch/arm64/include/asm/processor.h | 10 +-
> arch/arm64/include/asm/sections.h | 1 +
> arch/arm64/include/asm/smp.h | 1 +
> arch/arm64/include/asm/sparsemem.h | 15 +-
> arch/arm64/include/asm/sysreg.h | 54 +++--
> arch/arm64/include/asm/tlb.h | 3 +
> arch/arm64/kernel/asm-offsets.c | 4 +-
> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> arch/arm64/kernel/efi.c | 2 +-
> arch/arm64/kernel/entry.S | 60 +++++-
> arch/arm64/kernel/head.S | 46 +++-
> arch/arm64/kernel/hibernate-asm.S | 6 +-
> arch/arm64/kernel/image-vars.h | 14 ++
> arch/arm64/kernel/image.h | 4 +
> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> arch/arm64/kernel/pi/pi.h | 63 +++++-
> arch/arm64/kernel/relocate_kernel.S | 10 +-
> arch/arm64/kernel/vdso-wrap.S | 4 +-
> arch/arm64/kernel/vdso.c | 7 +-
> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> arch/arm64/kvm/arm.c | 10 +
> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> arch/arm64/kvm/mmu.c | 39 ++--
> arch/arm64/lib/clear_page.S | 7 +-
> arch/arm64/lib/copy_page.S | 33 ++-
> arch/arm64/lib/mte.S | 27 ++-
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/fixmap.c | 38 ++--
> arch/arm64/mm/hugetlbpage.c | 40 +---
> arch/arm64/mm/init.c | 26 +--
> arch/arm64/mm/kasan_init.c | 8 +-
> arch/arm64/mm/mmu.c | 53 +++--
> arch/arm64/mm/pgd.c | 12 +-
> arch/arm64/mm/pgtable-geometry.c | 24 +++
> arch/arm64/mm/proc.S | 128 ++++++++---
> arch/arm64/mm/ptdump.c | 3 +-
> arch/arm64/tools/cpucaps | 3 +
> arch/csky/include/asm/page.h | 3 +
> arch/hexagon/include/asm/page.h | 2 +
> arch/loongarch/include/asm/page.h | 2 +
> arch/m68k/include/asm/page.h | 1 +
> arch/microblaze/include/asm/page.h | 1 +
> arch/mips/include/asm/page.h | 1 +
> arch/nios2/include/asm/page.h | 2 +
> arch/openrisc/include/asm/page.h | 1 +
> arch/parisc/include/asm/page.h | 1 +
> arch/powerpc/include/asm/page.h | 2 +
> arch/riscv/include/asm/page.h | 1 +
> arch/s390/include/asm/page.h | 1 +
> arch/sh/include/asm/page.h | 1 +
> arch/sparc/include/asm/page.h | 3 +
> arch/um/include/asm/page.h | 2 +
> arch/x86/include/asm/page_types.h | 2 +
> arch/xtensa/include/asm/page.h | 1 +
> crypto/lskcipher.c | 4 +-
> drivers/ata/sata_sil24.c | 46 ++--
> drivers/base/node.c | 6 +-
> drivers/base/topology.c | 32 +--
> drivers/block/virtio_blk.c | 2 +-
> drivers/char/random.c | 4 +-
> drivers/edac/edac_mc.h | 13 +-
> drivers/firmware/efi/libstub/arm64.c | 3 +-
> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> drivers/mtd/mtdswap.c | 4 +-
> drivers/net/ethernet/freescale/fec.h | 3 +-
> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> drivers/net/ethernet/marvell/sky2.h | 2 +-
> drivers/tee/optee/call.c | 7 +-
> drivers/tee/optee/smc_abi.c | 2 +-
> drivers/virtio/virtio_balloon.c | 10 +-
> drivers/xen/balloon.c | 11 +-
> drivers/xen/biomerge.c | 12 +-
> drivers/xen/privcmd.c | 2 +-
> drivers/xen/xenbus/xenbus_client.c | 5 +-
> drivers/xen/xlate_mmu.c | 6 +-
> fs/binfmt_elf.c | 11 +-
> fs/buffer.c | 2 +-
> fs/coredump.c | 8 +-
> fs/ext4/ext4.h | 36 ++--
> fs/ext4/move_extent.c | 2 +-
> fs/ext4/readpage.c | 2 +-
> fs/fat/dir.c | 4 +-
> fs/fat/fatent.c | 4 +-
> fs/nfs/nfs42proc.c | 2 +-
> fs/nfs/nfs42xattr.c | 2 +-
> fs/nfs/nfs4proc.c | 2 +-
> include/asm-generic/pgtable-geometry.h | 71 +++++++
> include/asm-generic/vmlinux.lds.h | 38 ++--
> include/linux/buffer_head.h | 1 +
> include/linux/cpumask.h | 5 +
> include/linux/linkage.h | 4 +-
> include/linux/mm.h | 17 +-
> include/linux/mm_types.h | 15 +-
> include/linux/mm_types_task.h | 2 +-
> include/linux/mmzone.h | 3 +-
> include/linux/netlink.h | 6 +-
> include/linux/percpu-defs.h | 4 +-
> include/linux/perf_event.h | 2 +-
> include/linux/sched.h | 4 +-
> include/linux/slab.h | 7 +-
> include/linux/stackdepot.h | 6 +-
> include/linux/sunrpc/svc.h | 8 +-
> include/linux/sunrpc/svc_rdma.h | 4 +-
> include/linux/sunrpc/svcsock.h | 2 +-
> include/linux/swap.h | 17 +-
> include/linux/swapops.h | 6 +-
> include/linux/thread_info.h | 10 +-
> include/xen/page.h | 2 +
> init/main.c | 7 +-
> kernel/bpf/core.c | 9 +-
> kernel/bpf/ringbuf.c | 54 ++---
> kernel/cgroup/cgroup.c | 8 +-
> kernel/crash_core.c | 2 +-
> kernel/events/core.c | 2 +-
> kernel/fork.c | 71 +++----
> kernel/power/power.h | 2 +-
> kernel/power/snapshot.c | 2 +-
> kernel/power/swap.c | 129 +++++++++--
> kernel/trace/fgraph.c | 2 +-
> kernel/trace/trace.c | 2 +-
> lib/stackdepot.c | 6 +-
> mm/kasan/report.c | 3 +-
> mm/memcontrol.c | 11 +-
> mm/memory.c | 4 +-
> mm/mmap.c | 2 +-
> mm/page-writeback.c | 2 +-
> mm/page_alloc.c | 31 +--
> mm/slub.c | 2 +-
> mm/sparse.c | 2 +-
> mm/swapfile.c | 2 +-
> mm/vmalloc.c | 7 +-
> net/9p/trans_virtio.c | 4 +-
> net/core/hotdata.c | 4 +-
> net/core/skbuff.c | 4 +-
> net/core/sysctl_net_core.c | 2 +-
> net/sunrpc/cache.c | 3 +-
> net/unix/af_unix.c | 2 +-
> sound/soc/soc-utils.c | 4 +-
> virt/kvm/kvm_main.c | 2 +-
> 172 files changed, 2185 insertions(+), 951 deletions(-)
> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> create mode 100644 include/asm-generic/pgtable-geometry.h
>
> --
> 2.43.0
>
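The approach described in the quoted cover letter can be sketched in userspace
C, under stated assumptions: boot_page_shift, scratch, select_page_shift() and
scratch_slack() are hypothetical names, and the plain static variable merely
stands in for the __ro_after_init globals the cover letter mentions. The point
is that PAGE_SIZE becomes a run-time load while anything that needs a constant
(here, a static array) is sized using the PAGE_SIZE_MAX limit.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT_MIN	12	/* 4K */
#define PAGE_SHIFT_MAX	16	/* 64K */
#define PAGE_SIZE_MIN	(1UL << PAGE_SHIFT_MIN)
#define PAGE_SIZE_MAX	(1UL << PAGE_SHIFT_MAX)

/* Hypothetical stand-in for a variable written once during early boot. */
static unsigned int boot_page_shift = PAGE_SHIFT_MIN;

/* No longer a constant expression: evaluated from memory at run time. */
#define PAGE_SIZE	(1UL << boot_page_shift)

/* Static storage cannot depend on PAGE_SIZE, so size it for the worst case. */
static unsigned char scratch[PAGE_SIZE_MAX];

/* Select the page size; returns 0 on success, -1 for an unsupported shift. */
static int select_page_shift(unsigned int shift)
{
	if (shift != 12 && shift != 14 && shift != 16)
		return -1;
	boot_page_shift = shift;
	return 0;
}

/* Bytes of the worst-case buffer left unused for the selected page size. */
static size_t scratch_slack(void)
{
	assert(PAGE_SIZE <= sizeof(scratch));
	return sizeof(scratch) - PAGE_SIZE;
}
```

The slack returned by scratch_slack() corresponds to the "ends of the static
buffers" that the cover letter suggests could eventually be freed once the
page size is known.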
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 19/57] crash: Remove PAGE_SIZE compile-time constant assumption
2024-10-15 11:13 ` Ryan Roberts
@ 2024-10-18 3:00 ` Baoquan He
0 siblings, 0 replies; 196+ messages in thread
From: Baoquan He @ 2024-10-18 3:00 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, kexec, linux-arm-kernel, linux-kernel, linux-mm
On 10/15/24 at 12:13pm, Ryan Roberts wrote:
> On 15/10/2024 04:47, Baoquan He wrote:
> > On 10/14/24 at 11:58am, Ryan Roberts wrote:
> >> To prepare for supporting boot-time page size selection, refactor code
> >> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> >> intended to be equivalent when compile-time page size is active.
> >>
> >> Updated BUILD_BUG_ON() to test against limit.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>
> >> ***NOTE***
> >> Any confused maintainers may want to read the cover note here for context:
> >> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> >>
> >> kernel/crash_core.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> >> index 63cf89393c6eb..978c600a47ac8 100644
> >> --- a/kernel/crash_core.c
> >> +++ b/kernel/crash_core.c
> >> @@ -465,7 +465,7 @@ static int __init crash_notes_memory_init(void)
> >> * Break compile if size is bigger than PAGE_SIZE since crash_notes
> >> * definitely will be in 2 pages with that.
> >> */
> >> - BUILD_BUG_ON(size > PAGE_SIZE);
> >> + BUILD_BUG_ON(size > PAGE_SIZE_MIN);
> >
> > This should be OK. One thing that could happen, though: if the selected size
> > is 64K and PAGE_SIZE_MIN is 4K, it will issue a false-positive warning when
> > compiling, while it's actually not a problem at run time.
>
> PAGE_SIZE can only ever be bigger than PAGE_SIZE_MIN if compiling a "boot-time
> page size" build. And in this case, you need to know that size is small enough
> to work with any of the boot-time selectable page sizes. Since size
> (=sizeof(note_buf_t)) is invariant to PAGE_SIZE, we can do this by checking
> against PAGE_SIZE_MIN.
>
> So I don't think this could ever lead to a false-positive.
Makes sense, thanks for your explanation.
>
>
> Not sure if
> > that could happen on arm64. Anyway, we can check the crash_notes to get
> > why it's so big when it really happens. So,
> >
> > Acked-by: Baoquan He <bhe@redhat.com>
>
> Thanks!
>
> >
> >>
> >> crash_notes = __alloc_percpu(size, align);
> >> if (!crash_notes) {
> >> --
> >> 2.43.0
> >>
> >>
> >
>
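The reasoning above can be shown with a small userspace sketch. note_buf is a
hypothetical stand-in for note_buf_t, the limits are hard-coded assumptions,
and fits_in_page() exists only for illustration. Because the object's size
does not depend on the page size, a compile-time check against the smallest
selectable page size is sufficient for every larger one.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MIN	4096UL
#define PAGE_SIZE_MAX	65536UL

/* Hypothetical stand-in for note_buf_t; its size is page-size invariant. */
typedef struct { unsigned long words[64]; } note_buf;

/* Compile-time check against the smallest selectable page size. */
static_assert(sizeof(note_buf) <= PAGE_SIZE_MIN,
	      "note_buf must fit in the smallest supported page");

/* Returns 1 if an object of obj_size fits in one page of page_size bytes. */
static int fits_in_page(size_t obj_size, size_t page_size)
{
	return obj_size <= page_size;
}
```

If the static assertion passes, fits_in_page() is necessarily true for 4K,
16K and 64K, so the check cannot reject a size that would have worked at run
time.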
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-17 12:32 ` Ryan Roberts
@ 2024-10-18 12:56 ` Petr Tesarik
2024-10-18 14:41 ` Petr Tesarik
2024-10-23 21:00 ` Thomas Tai
` (2 subsequent siblings)
3 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-10-18 12:56 UTC (permalink / raw)
To: Ryan Roberts, Michael Kelley
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Thu, 17 Oct 2024 13:32:43 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 17/10/2024 13:27, Petr Tesarik wrote:
> > On Mon, 14 Oct 2024 11:55:11 +0100
> > Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> >> [...]
> >> The series is arranged as follows:
> >>
> >> - patch 1: Add macros required for converting non-arch code to support
> >> boot-time page size selection
> >> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> >> non-arch code
> >
> > I have just tried to recompile the openSUSE kernel with these patches
> > applied, and I'm running into this:
> >
> > CC arch/arm64/hyperv/hv_core.o
> > In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> > ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
> > u8 reserved2[PAGE_SIZE - 68];
> > ^~~~~~~~~
> >
> > It looks like one more place which needs a patch, right?
>
> As mentioned in the cover letter, so far I've only converted enough to get the
> defconfig *image* building (i.e. no modules). If you are compiling a different
> config or compiling the modules for defconfig, you will likely run into these
> types of issues.
>
> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> enough to send me.
>
> I understand that Suse might be able to help with wider performance testing - if
> that's the reason you are trying to compile, you could send me your config and
> I'll start working on fixing up other drivers?
You're right, performance testing is my goal.
Heh, the openSUSE master config is cranked up to max. ;-) That would be
a lot of work, and we don't need all those options for running our test
suite. Let me disable the conflicting options instead.
For reference, here's a long (yet incomplete) list of kernel options
that conflict with this v1 patch series:
# already handled by Michael
CONFIG_HYPERV
# sorry, Windows
CONFIG_CIFS
CONFIG_NTFS3_FS
# no, not even with ntfs-3g
CONFIG_FUSE_FS
# bye-bye ZSWAP
CONFIG_ZBUD
CONFIG_Z3FOLD
CONFIG_ZSMALLOC # ah, also bye-bye ZRAM
# who needs redundancy?
CONFIG_DM_RAID
CONFIG_MD_RAID1
CONFIG_MD_RAID456
CONFIG_MD_RAID10
# who needs security?
CONFIG_SECURITY_SELINUX
# or integrity?
CONFIG_IMA
CONFIG_DM_INTEGRITY
# or even crypto (this disables A LOT of stuff)...
CONFIG_CRYPTO_MANAGER2
# meh...
CONFIG_ARM_SMMU_V3_SVA
CONFIG_ACPI_NFIT
CONFIG_DEV_DAX_PMEM
CONFIG_NVDIMM
CONFIG_MTD_SWAP
CONFIG_MLXBF_PMC
CONFIG_THUNDERX2_PMU
CONFIG_LKDTM
CONFIG_VMWARE_VMCI
CONFIG_HT16K33
CONFIG_FB_TFT_HX8340BN
CONFIG_FB_TFT_ILI9341
CONFIG_DVB_FIREDTV
CONFIG_DVB_PT3
CONFIG_VIDEO_ET8EK8
CONFIG_VIDEO_IVTV
CONFIG_VIDEO_SAA7164
CONFIG_DRM_AMDGPU
CONFIG_DRM_POWERVR
CONFIG_DRM_QXL
CONFIG_DRM_RADEON
CONFIG_DRM_VMWGFX
CONFIG_FIREWIRE_OHCI
CONFIG_SND_SEQ_MIDI
CONFIG_SND_DARLA20
CONFIG_SND_GINA20
CONFIG_SND_LAYLA20
CONFIG_SND_DARLA24
CONFIG_SND_GINA24
CONFIG_SND_MONA
CONFIG_SND_MIA
CONFIG_SND_ECHO3G
CONFIG_SND_INDIGO
CONFIG_SND_INDIGOIO
CONFIG_SND_INDIGODJ
CONFIG_SND_INDIGOIOX
CONFIG_SND_INDIGODJX
CONFIG_SND_BCM63XX_I2S_WHISTLER
CONFIG_SND_SOC_SOF
CONFIG_SND_SOC_SPRD
CONFIG_SND_SOC_STM32_SAI
CONFIG_SND_SOC_STM32_I2S
CONFIG_SND_SOC_STM32_SPDIFRX
CONFIG_SND_SOC_STM32_DFSDM
CONFIG_SND_SOC_TEGRA
CONFIG_SND_SOC_CROS_EC_CODEC
CONFIG_SND_SOC_RT5514_SPI
CONFIG_SND_USB_UA101
CONFIG_USB_F_PHONET
CONFIG_USB_F_TCM
CONFIG_SPI_LOOPBACK_TEST
CONFIG_W1
CONFIG_RDS
CONFIG_TIPC
CONFIG_TCP_SIGPOOL
CONFIG_OPENVSWITCH
CONFIG_NIU
CONFIG_QED_SRIOV
CONFIG_SFC
CONFIG_SFC_FALCON
CONFIG_SFC_SIENA
CONFIG_TSNEP
CONFIG_LIBERTAS
CONFIG_LOOPBACK_TARGET
CONFIG_SUNRPC_XPRT_RDMA
CONFIG_INFINIBAND_HNS
CONFIG_INFINIBAND_IPOIB
CONFIG_INFINIBAND_EFA
CONFIG_INFINIBAND_MTHCA
CONFIG_MLX4_CORE
CONFIG_MLX4_INFINIBAND
CONFIG_MLX5_CORE
CONFIG_MLX5_INFINIBAND
CONFIG_MLX5_VDPA_NET
CONFIG_MLX5_VFIO_PCI
CONFIG_ISCSI_TCP
CONFIG_SCSI_CXGB3_ISCSI
CONFIG_SCSI_CXGB4_ISCSI
CONFIG_SCSI_DC395x
CONFIG_SCSI_DMX3191D
CONFIG_SCSI_FDOMAIN
CONFIG_SCSI_MVUMI
CONFIG_SCSI_STEX
CONFIG_SCSI_SYM53C8XX_2
CONFIG_CDROM_PKTCDVD
CONFIG_AFS_FS
CONFIG_BCACHE
CONFIG_BCACHEFS_FS
CONFIG_CEPH_FS
CONFIG_DLM
CONFIG_BLK_DEV_NULL_BLK
CONFIG_BLK_DEV_DRBD
CONFIG_BLK_DEV_RBD
CONFIG_OCFS2_FS
CONFIG_CRAMFS
CONFIG_EROFS_FS
CONFIG_ECRYPT_FS
CONFIG_F2FS_FS
CONFIG_ZISOFS
CONFIG_NFS_V3_ACL
# would be nice to have...
CONFIG_NFSD_V4
CONFIG_SUNRPC_BACKCHANNEL # required by CONFIG_NFS_V4_1
CONFIG_MMC
CONFIG_NVME_CORE
CONFIG_NVMEM # required by CONFIG_USB4
CONFIG_USB_UAS
CONFIG_BLK_DEV_DM
# ...but this is kind of really necessary
CONFIG_BTRFS_FS
After disabling all the above and exporting ptg_page_shift, the
tumbleweed kernel builds. TBH I expected more broken things. Great
success! ;-)
I'll see if I can do something about btrfs. Then I can try to boot the
kernel...
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 12:56 ` Petr Tesarik
@ 2024-10-18 14:41 ` Petr Tesarik
2024-10-21 11:47 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-10-18 14:41 UTC (permalink / raw)
To: Ryan Roberts, Michael Kelley
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Fri, 18 Oct 2024 14:56:00 +0200
Petr Tesarik <ptesarik@suse.com> wrote:
> On Thu, 17 Oct 2024 13:32:43 +0100
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> > On 17/10/2024 13:27, Petr Tesarik wrote:
> > > On Mon, 14 Oct 2024 11:55:11 +0100
> > > Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > >> [...]
> > >> The series is arranged as follows:
> > >>
> > >> - patch 1: Add macros required for converting non-arch code to support
> > >> boot-time page size selection
> > >> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> > >> non-arch code
> > >
> > > I have just tried to recompile the openSUSE kernel with these patches
> > > applied, and I'm running into this:
> > >
> > > CC arch/arm64/hyperv/hv_core.o
> > > In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> > > ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
> > > u8 reserved2[PAGE_SIZE - 68];
> > > ^~~~~~~~~
> > >
> > > It looks like one more place which needs a patch, right?
> >
> > As mentioned in the cover letter, so far I've only converted enough to get the
> > defconfig *image* building (i.e. no modules). If you are compiling a different
> > config or compiling the modules for defconfig, you will likely run into these
> > types of issues.
> >
> > That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> > enough to send me.
> >
> > I understand that Suse might be able to help with wider performance testing - if
> > that's the reason you are trying to compile, you could send me your config and
> > I'll start working on fixing up other drivers?
>
> You're right, performance testing is my goal.
>
> Heh, the openSUSE master config is cranked up to max. ;-) That would be
> a lot of work, and we don't need all those options for running our test
> suite. Let me disable the conflicting options instead.
>[...]
> I'll see if I can do something about btrfs. Then I can try to boot the
> kernel...
FWIW the kernel builds and _boots_ after applying this patch:
fs/btrfs/compression.h | 2 +-
fs/btrfs/defrag.c | 2 +-
fs/btrfs/extent_io.h | 2 +-
fs/btrfs/scrub.c | 2 +-
include/linux/raid/pq.h | 4 ++--
lib/raid6/algos.c | 2 +-
6 files changed, 7 insertions(+), 7 deletions(-)
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -33,7 +33,7 @@ struct btrfs_bio;
/* Maximum length of compressed data stored on disk */
#define BTRFS_MAX_COMPRESSED (SZ_128K)
#define BTRFS_MAX_COMPRESSED_PAGES (BTRFS_MAX_COMPRESSED / PAGE_SIZE)
-static_assert((BTRFS_MAX_COMPRESSED % PAGE_SIZE) == 0);
+static_assert((BTRFS_MAX_COMPRESSED % PAGE_SIZE_MAX) == 0);
/* Maximum size of data before compression */
#define BTRFS_MAX_UNCOMPRESSED (SZ_128K)
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -1144,7 +1144,7 @@ next:
}
#define CLUSTER_SIZE (SZ_256K)
-static_assert(PAGE_ALIGNED(CLUSTER_SIZE));
+static_assert(IS_ALIGNED(CLUSTER_SIZE, PAGE_SIZE_MAX));
/*
* Defrag one contiguous target range.
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -89,7 +89,7 @@ enum {
int __init extent_buffer_init_cachep(void);
void __cold extent_buffer_free_cachep(void);
-#define INLINE_EXTENT_BUFFER_PAGES (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE)
+#define INLINE_EXTENT_BUFFER_PAGES (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE_MIN)
struct extent_buffer {
u64 start;
u32 len;
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -100,7 +100,7 @@ enum scrub_stripe_flags {
SCRUB_STRIPE_FLAG_NO_REPORT,
};
-#define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE)
+#define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE_MIN)
/*
* Represent one contiguous range with a length of BTRFS_STRIPE_LEN.
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -12,7 +12,7 @@
#include <linux/blkdev.h>
-extern const char raid6_empty_zero_page[PAGE_SIZE];
+extern const char raid6_empty_zero_page[PAGE_SIZE_MAX];
#else /* ! __KERNEL__ */
/* Used for testing in user space */
@@ -39,7 +39,7 @@ typedef uint64_t u64;
#ifndef PAGE_SHIFT
# define PAGE_SHIFT 12
#endif
-extern const char raid6_empty_zero_page[PAGE_SIZE];
+extern const char raid6_empty_zero_page[PAGE_SIZE_MAX];
#define __init
#define __exit
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -19,7 +19,7 @@
#include <linux/module.h>
#include <linux/gfp.h>
/* In .bss so it's zeroed */
-const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
+const char raid6_empty_zero_page[PAGE_SIZE_MAX] __attribute__((aligned(256)));
EXPORT_SYMBOL(raid6_empty_zero_page);
#endif
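For anyone following along, here is a small userspace sketch of the pattern used in the patch above. It is only an illustration, assuming page sizes are powers of two between 4K and 64K. The names page_shift, empty_zero_page, MAX_COMPRESSED, compressed_pages() and zero_page_fits() are hypothetical stand-ins, not symbols from the series. Static storage is sized with PAGE_SIZE_MAX, worst-case element counts are derived from PAGE_SIZE_MIN, and alignment is asserted against PAGE_SIZE_MAX, which implies alignment for every smaller power-of-two page size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel size macros. */
#define SZ_4K   0x1000UL
#define SZ_64K  0x10000UL
#define SZ_128K 0x20000UL

/* Compile-time bounds on the page size (assumed 4K..64K, as on arm64). */
#define PAGE_SIZE_MIN SZ_4K
#define PAGE_SIZE_MAX SZ_64K

/* Chosen once at "boot"; PAGE_SIZE is therefore no longer a constant. */
static unsigned int page_shift = 12;
#undef PAGE_SIZE /* in case a libc header already defines it */
#define PAGE_SIZE (1UL << page_shift)

#define MAX_COMPRESSED SZ_128K

/* Aligned to the largest page size, hence to every smaller one too. */
static_assert((MAX_COMPRESSED % PAGE_SIZE_MAX) == 0,
	      "MAX_COMPRESSED must be page aligned for every page size");

/* The smallest page size yields the largest page count. */
#define MAX_COMPRESSED_PAGES_WORST (MAX_COMPRESSED / PAGE_SIZE_MIN)

/* File-scope arrays need a constant bound, so use the largest page. */
static const char empty_zero_page[PAGE_SIZE_MAX];

/* Number of pages actually needed for the page size chosen at boot. */
static size_t compressed_pages(void)
{
	return MAX_COMPRESSED / PAGE_SIZE;
}

/* The static buffer must cover one page of the chosen size. */
static int zero_page_fits(void)
{
	return PAGE_SIZE <= sizeof(empty_zero_page);
}
```

With page_shift set to 12, 14 or 16, compressed_pages() returns 32, 8 or 2, and never exceeds MAX_COMPRESSED_PAGES_WORST. The cost of this approach is that arrays sized for the worst case waste space when a larger page size is selected.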
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 18/57] trace: Remove PAGE_SIZE compile-time constant assumption
2024-10-15 11:09 ` Ryan Roberts
@ 2024-10-18 15:24 ` Steven Rostedt
0 siblings, 0 replies; 196+ messages in thread
From: Steven Rostedt @ 2024-10-18 15:24 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Masami Hiramatsu, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-kernel,
linux-mm, linux-trace-kernel
On Tue, 15 Oct 2024 12:09:38 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > Not to mention, when function graph tracing is enabled, this gets triggered
> > for *every* function call! So I do not want any runtime test done. Every
> > nanosecond counts in this code path.
> >
> > If anything, this needs to be moved to initialization and checked once, if
> > it fails, gives a WARN_ON() and disables function graph tracing.
>
> I'm hoping my suggestion above to decouple SHADOW_STACK_SIZE from PAGE_SIZE is
> acceptable and simpler? If not, happy to do as you suggest here.
Yeah, I think we can do that. In fact, I'm thinking it should turn into a
kmem_cache item that doesn't have to be a power of two (but must be evenly
divisible by the size of long).
I'll write up a patch.
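As a rough userspace sketch of the constraint described above (this is not the actual patch, and it does not use the kmem_cache API): the size becomes a fixed constant independent of PAGE_SIZE, and the only requirement is that it divides evenly by sizeof(long). The value 4000 and the helper shadow_stack_size_ok() are hypothetical, picked only to show that a non-power-of-two size is acceptable.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical fixed size, decoupled from PAGE_SIZE. It does not have
 * to be a power of two, only a whole number of longs.
 */
#define SHADOW_STACK_SIZE 4000UL

/* Compile-time check, so nothing is tested in the tracing hot path. */
static_assert(SHADOW_STACK_SIZE % sizeof(long) == 0,
	      "shadow stack must hold a whole number of longs");

#define SHADOW_STACK_ENTRIES (SHADOW_STACK_SIZE / sizeof(long))

/*
 * Hypothetical one-time check for a size that is only known at init.
 * A caller would warn and disable the tracer if this returns 0,
 * instead of testing on every function call.
 */
static int shadow_stack_size_ok(size_t size)
{
	return size != 0 && size % sizeof(long) == 0;
}
```

If the size stays a compile-time constant, the static_assert alone is enough and shadow_stack_size_ok() is not needed at all.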
-- Steve
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
` (5 preceding siblings ...)
2024-10-17 22:05 ` Dave Kleikamp
@ 2024-10-18 18:15 ` Joseph Salisbury
2024-10-18 18:27 ` David Hildenbrand
2024-10-19 15:47 ` Neal Gompa
2024-10-31 21:07 ` Catalin Marinas
8 siblings, 1 reply; 196+ messages in thread
From: Joseph Salisbury @ 2024-10-18 18:15 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 06:55, Ryan Roberts wrote:
> Hi All,
>
> Patch bomb incoming... This covers many subsystems, so I've included a core set
> of people on the full series and additionally included maintainers on relevant
> patches. I haven't included those maintainers on this cover letter since the
> numbers were far too big for it to work. But I've included a link to this cover
> letter on each patch, so they can hopefully find their way here. For follow up
> submissions I'll break it up by subsystem, but for now thought it was important
> to show the full picture.
>
> This RFC series implements support for boot-time page size selection within the
> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
> size has been selected at compile-time, meaning the size is baked into a given
> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
> starts to present a problem for distributions. Boot-time page size selection
> enables the creation of a single kernel image, which can be told which page size
> to use on the kernel command line.
>
> Why is having an image-per-page size problematic?
> =================================================
>
> Many traditional distros are now supporting both 4K and 64K. And this means
> managing 2 kernel packages, along with drivers for each. For some, it means
> multiple installer flavours and multiple ISOs. All of this adds up to a
> less-than-ideal level of complexity. Additionally, Android now supports 4K and
> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
> painful, and the extra flash space required for both kernel images and the
> duplicated modules has been problematic. Boot-time page size selection solves
> all of this.
>
> Additionally, in starting to think about the longer term deployment story for
> D128 page tables, which Arm architecture now supports, a lot of the same
> problems need to be solved, so this work sets us up nicely for that.
>
> So what's the down side?
> ========================
>
> Well nothing's free; various static allocations in the kernel image must be
> sized for the worst case (largest supported page size), so image size is in line
> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
> is a slight increase to the image size. But I expect that problem goes away if
> you're compressing the image - it's just some extra zeros. At boot-time, I expect
> we could free the unused static storage once we know the page size - although
> that would be a follow up enhancement.
>
> And then there is performance. Since PAGE_SIZE and friends are no longer
> compile-time constants, we must look up their values and do arithmetic at
> runtime instead of compile-time. My early perf testing suggests this is
> imperceptible for real-world workloads, and only has a small impact on
> microbenchmarks - more on this below.
>
> Approach
> ========
>
> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> friends are compile-time constant, but in a way that allows the compiler to
> perform the same optimizations as was previously being done if they do turn out
> to be compile-time constant. Where constants are required, we use limits;
> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
> of all the classes of problems to solve.
>
> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
> which is an alternative to selecting a compile-time page size.
>
> When boot-time page size is active, the arch pgtable geometry macro definitions
> resolve to something that can be configured at boot. The arm64 implementation in
> this series mainly uses global, __ro_after_init variables. I've tried using
> alternatives patching, but that performs worse than loading from memory; I think
> due to code size bloat.
>
> Status
> ======
>
> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented enough
> to compile the kernel image itself with defconfig (and a few other bits and
> pieces). This is enough to build a kernel that can boot under QEMU or FVP. I'll
> happily do the rest of the work to enable all the extra drivers, but wanted to
> get feedback on the shape of this effort first. If anyone wants to do any
> testing, and has a must-have config, let me know and I'll prioritize enabling it
> first.
>
> The series is arranged as follows:
>
> - patch 1: Add macros required for converting non-arch code to support
> boot-time page size selection
> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> non-arch code
> - patches 37-38: Some arm64 tidy ups
> - patch 39: Add macros required for converting arm64 code to support
> boot-time page size selection
> - patches 40-56: arm64 changes to support boot-time page size selection
> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> selection
>
> Ideally, I'd like to get the basics merged (something like this series), then
> incrementally improve it over a handful of kernel releases until we can
> demonstrate that we have feature parity with the compile-time build and no
> performance blockers. Once at that point, ideally the compile-time build options
> would be removed and the code could be cleaned up further.
>
> One of the bigger pieces that I'd propose to add as a follow up, is to make
> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> handling.
>
> Assuming people are amenable to the rough shape, how would I go about getting
> the non-arch changes merged? Since they cover many subsystems, will each piece
> need to go independently to each relevant maintainer or could it all be merged
> together through the arm64 tree?
>
> Image Size
> ==========
>
> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> kernel image on disk for base (before any changes applied), compile (with
> changes, configured for compile-time page size) and boot (with changes,
> configured for boot-time page size).
>
> You can see that the compile-16k and 64k configs are actually slightly smaller
> than the baselines; that's due to optimizing some buffer sizes which didn't need
> to depend on page size during the series. The boot-time image is ~1% bigger than
> the 64k compile-time image. I believe there is scope to improve this to make it
> equal to compile-64k if required:
>
> | config | size/KB | diff/KB | diff/% |
> |-------------|---------|---------|---------|
> | base-4k | 54895 | 0 | 0.0% |
> | base-16k | 55161 | 266 | 0.5% |
> | base-64k | 56775 | 1880 | 3.4% |
> | compile-4k | 54895 | 0 | 0.0% |
> | compile-16k | 55097 | 202 | 0.4% |
> | compile-64k | 56391 | 1496 | 2.7% |
> | boot-4K | 57045 | 2150 | 3.9% |
>
> And below shows the size of the image in memory at run-time, separated for text
> and data costs. The boot image has ~1% text cost; most likely due to the fact
> that PAGE_SIZE and friends are not compile-time constants so need instructions
> to load the values and do arithmetic. I believe we could eventually get the data
> cost to match the cost for the compile image for the chosen page size by freeing
> the ends of the static buffers not needed for the selected page size:
>
> | | text | text | text | data | data | data |
> | config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
> | base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
> | base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
> | compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
> | compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
> | compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
> | boot-4K | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
>
> Functional Testing
> ==================
>
> I've build-tested defconfig for all arches supported by tuxmake (which is most)
> without issue.
>
> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page sizes
> and a few va-sizes, and additionally have run all the mm-selftests, with no
> regressions observed vs the equivalent compile-time page size build (although
> the mm-selftests have a few existing failures when run against 16K and 64K
> kernels - those should really be investigated and fixed independently).
>
> Test coverage is lacking for many of the drivers that I've touched, but in many
> cases, I'm hoping the changes are simple enough that review might suffice?
>
> Performance Testing
> ===================
>
> I've run some limited performance benchmarks:
>
> First, a real-world benchmark that causes a lot of page table manipulation (and
> therefore we would expect to see regression here if we are going to see it
> anywhere); kernel compilation. It barely registers a change. Values are times,
> so smaller is better. All relative to base-4k:
>
> | | kern | kern | user | user | real | real |
> | config | mean | stdev | mean | stdev | mean | stdev |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>
> The Speedometer JavaScript benchmark also shows no change. Values are runs per
> min, so bigger is better. All relative to base-4k:
>
> | config | mean | stdev |
> |-------------|---------|---------|
> | base-4k | 0.0% | 0.8% |
> | compile-4k | 0.4% | 0.8% |
> | boot-4k | 0.0% | 0.9% |
>
> Finally, I've run some microbenchmarks known to stress page table manipulations
> (originally from David Hildenbrand). The fork test maps/allocs 1G of anon
> memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
> memory then measures the cost of munmap()ing it. The fork test is known to be
> extremely sensitive to any changes that cause instructions to be aligned
> differently in cachelines. When using this test for other changes, I've seen
> double digit regressions for the slightest thing, so 12% regression on this test
> is actually fairly good. This likely represents the extreme worst case for
> regressions that will be observed across other microbenchmarks (famous last
> words). Values are times, so smaller is better. All relative to base-4k:
>
> | | fork | fork | munmap | munmap |
> | config | mean | stdev | mean | stdev |
> |-------------|---------|---------|---------|---------|
> | base-4k | 0.0% | 1.3% | 0.0% | 0.3% |
> | compile-4k | 0.1% | 1.3% | -0.9% | 0.1% |
> | boot-4k | 12.8% | 1.2% | 3.8% | 1.0% |
>
> NOTE: The series applies on top of v6.11.
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (57):
> mm: Add macros ahead of supporting boot-time page size selection
> vmlinux: Align to PAGE_SIZE_MAX
> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> mm: Avoid split pmd ptl if pmd level is run-time folded
> mm: Remove PAGE_SIZE compile-time constant assumption
> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> fs: Remove PAGE_SIZE compile-time constant assumption
> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> fork: Permit boot-time THREAD_SIZE determination
> cgroup: Remove PAGE_SIZE compile-time constant assumption
> bpf: Remove PAGE_SIZE compile-time constant assumption
> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> perf: Remove PAGE_SIZE compile-time constant assumption
> kvm: Remove PAGE_SIZE compile-time constant assumption
> trace: Remove PAGE_SIZE compile-time constant assumption
> crash: Remove PAGE_SIZE compile-time constant assumption
> crypto: Remove PAGE_SIZE compile-time constant assumption
> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> sound: Remove PAGE_SIZE compile-time constant assumption
> net: Remove PAGE_SIZE compile-time constant assumption
> net: fec: Remove PAGE_SIZE compile-time constant assumption
> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> net: igb: Remove PAGE_SIZE compile-time constant assumption
> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> edac: Remove PAGE_SIZE compile-time constant assumption
> optee: Remove PAGE_SIZE compile-time constant assumption
> random: Remove PAGE_SIZE compile-time constant assumption
> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> virtio: Remove PAGE_SIZE compile-time constant assumption
> xen: Remove PAGE_SIZE compile-time constant assumption
> arm64: Fix macros to work in C code in addition to the linker script
> arm64: Track early pgtable allocation limit
> arm64: Introduce macros required for boot-time page selection
> arm64: Refactor early pgtable size calculation macros
> arm64: Pass desired page size on command line
> arm64: Divorce early init from PAGE_SIZE
> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> arm64: Align sections to PAGE_SIZE_MAX
> arm64: Rework trampoline rodata mapping
> arm64: Generalize fixmap for boot-time page size
> arm64: Statically allocate and align for worst-case page size
> arm64: Convert switch to if for non-const comparison values
> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> arm64: Remove PAGE_SZ asm-offset
> arm64: Introduce cpu features for page sizes
> arm64: Remove PAGE_SIZE from assembly code
> arm64: Runtime-fold pmd level
> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> arm64: TRAMP_VALIAS is no longer compile-time constant
> arm64: Determine THREAD_SIZE at boot-time
> arm64: Enable boot-time page size selection
>
> arch/alpha/include/asm/page.h | 1 +
> arch/arc/include/asm/page.h | 1 +
> arch/arm/include/asm/page.h | 1 +
> arch/arm64/Kconfig | 26 ++-
> arch/arm64/include/asm/assembler.h | 78 ++++++-
> arch/arm64/include/asm/cpufeature.h | 44 +++-
> arch/arm64/include/asm/efi.h | 2 +-
> arch/arm64/include/asm/fixmap.h | 28 ++-
> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> arch/arm64/include/asm/kvm_arm.h | 21 +-
> arch/arm64/include/asm/kvm_hyp.h | 11 +
> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> arch/arm64/include/asm/memory.h | 62 ++++--
> arch/arm64/include/asm/page-def.h | 3 +-
> arch/arm64/include/asm/pgalloc.h | 16 +-
> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> arch/arm64/include/asm/processor.h | 10 +-
> arch/arm64/include/asm/sections.h | 1 +
> arch/arm64/include/asm/smp.h | 1 +
> arch/arm64/include/asm/sparsemem.h | 15 +-
> arch/arm64/include/asm/sysreg.h | 54 +++--
> arch/arm64/include/asm/tlb.h | 3 +
> arch/arm64/kernel/asm-offsets.c | 4 +-
> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> arch/arm64/kernel/efi.c | 2 +-
> arch/arm64/kernel/entry.S | 60 +++++-
> arch/arm64/kernel/head.S | 46 +++-
> arch/arm64/kernel/hibernate-asm.S | 6 +-
> arch/arm64/kernel/image-vars.h | 14 ++
> arch/arm64/kernel/image.h | 4 +
> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> arch/arm64/kernel/pi/pi.h | 63 +++++-
> arch/arm64/kernel/relocate_kernel.S | 10 +-
> arch/arm64/kernel/vdso-wrap.S | 4 +-
> arch/arm64/kernel/vdso.c | 7 +-
> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> arch/arm64/kvm/arm.c | 10 +
> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> arch/arm64/kvm/mmu.c | 39 ++--
> arch/arm64/lib/clear_page.S | 7 +-
> arch/arm64/lib/copy_page.S | 33 ++-
> arch/arm64/lib/mte.S | 27 ++-
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/fixmap.c | 38 ++--
> arch/arm64/mm/hugetlbpage.c | 40 +---
> arch/arm64/mm/init.c | 26 +--
> arch/arm64/mm/kasan_init.c | 8 +-
> arch/arm64/mm/mmu.c | 53 +++--
> arch/arm64/mm/pgd.c | 12 +-
> arch/arm64/mm/pgtable-geometry.c | 24 +++
> arch/arm64/mm/proc.S | 128 ++++++++---
> arch/arm64/mm/ptdump.c | 3 +-
> arch/arm64/tools/cpucaps | 3 +
> arch/csky/include/asm/page.h | 3 +
> arch/hexagon/include/asm/page.h | 2 +
> arch/loongarch/include/asm/page.h | 2 +
> arch/m68k/include/asm/page.h | 1 +
> arch/microblaze/include/asm/page.h | 1 +
> arch/mips/include/asm/page.h | 1 +
> arch/nios2/include/asm/page.h | 2 +
> arch/openrisc/include/asm/page.h | 1 +
> arch/parisc/include/asm/page.h | 1 +
> arch/powerpc/include/asm/page.h | 2 +
> arch/riscv/include/asm/page.h | 1 +
> arch/s390/include/asm/page.h | 1 +
> arch/sh/include/asm/page.h | 1 +
> arch/sparc/include/asm/page.h | 3 +
> arch/um/include/asm/page.h | 2 +
> arch/x86/include/asm/page_types.h | 2 +
> arch/xtensa/include/asm/page.h | 1 +
> crypto/lskcipher.c | 4 +-
> drivers/ata/sata_sil24.c | 46 ++--
> drivers/base/node.c | 6 +-
> drivers/base/topology.c | 32 +--
> drivers/block/virtio_blk.c | 2 +-
> drivers/char/random.c | 4 +-
> drivers/edac/edac_mc.h | 13 +-
> drivers/firmware/efi/libstub/arm64.c | 3 +-
> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> drivers/mtd/mtdswap.c | 4 +-
> drivers/net/ethernet/freescale/fec.h | 3 +-
> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> drivers/net/ethernet/marvell/sky2.h | 2 +-
> drivers/tee/optee/call.c | 7 +-
> drivers/tee/optee/smc_abi.c | 2 +-
> drivers/virtio/virtio_balloon.c | 10 +-
> drivers/xen/balloon.c | 11 +-
> drivers/xen/biomerge.c | 12 +-
> drivers/xen/privcmd.c | 2 +-
> drivers/xen/xenbus/xenbus_client.c | 5 +-
> drivers/xen/xlate_mmu.c | 6 +-
> fs/binfmt_elf.c | 11 +-
> fs/buffer.c | 2 +-
> fs/coredump.c | 8 +-
> fs/ext4/ext4.h | 36 ++--
> fs/ext4/move_extent.c | 2 +-
> fs/ext4/readpage.c | 2 +-
> fs/fat/dir.c | 4 +-
> fs/fat/fatent.c | 4 +-
> fs/nfs/nfs42proc.c | 2 +-
> fs/nfs/nfs42xattr.c | 2 +-
> fs/nfs/nfs4proc.c | 2 +-
> include/asm-generic/pgtable-geometry.h | 71 +++++++
> include/asm-generic/vmlinux.lds.h | 38 ++--
> include/linux/buffer_head.h | 1 +
> include/linux/cpumask.h | 5 +
> include/linux/linkage.h | 4 +-
> include/linux/mm.h | 17 +-
> include/linux/mm_types.h | 15 +-
> include/linux/mm_types_task.h | 2 +-
> include/linux/mmzone.h | 3 +-
> include/linux/netlink.h | 6 +-
> include/linux/percpu-defs.h | 4 +-
> include/linux/perf_event.h | 2 +-
> include/linux/sched.h | 4 +-
> include/linux/slab.h | 7 +-
> include/linux/stackdepot.h | 6 +-
> include/linux/sunrpc/svc.h | 8 +-
> include/linux/sunrpc/svc_rdma.h | 4 +-
> include/linux/sunrpc/svcsock.h | 2 +-
> include/linux/swap.h | 17 +-
> include/linux/swapops.h | 6 +-
> include/linux/thread_info.h | 10 +-
> include/xen/page.h | 2 +
> init/main.c | 7 +-
> kernel/bpf/core.c | 9 +-
> kernel/bpf/ringbuf.c | 54 ++---
> kernel/cgroup/cgroup.c | 8 +-
> kernel/crash_core.c | 2 +-
> kernel/events/core.c | 2 +-
> kernel/fork.c | 71 +++----
> kernel/power/power.h | 2 +-
> kernel/power/snapshot.c | 2 +-
> kernel/power/swap.c | 129 +++++++++--
> kernel/trace/fgraph.c | 2 +-
> kernel/trace/trace.c | 2 +-
> lib/stackdepot.c | 6 +-
> mm/kasan/report.c | 3 +-
> mm/memcontrol.c | 11 +-
> mm/memory.c | 4 +-
> mm/mmap.c | 2 +-
> mm/page-writeback.c | 2 +-
> mm/page_alloc.c | 31 +--
> mm/slub.c | 2 +-
> mm/sparse.c | 2 +-
> mm/swapfile.c | 2 +-
> mm/vmalloc.c | 7 +-
> net/9p/trans_virtio.c | 4 +-
> net/core/hotdata.c | 4 +-
> net/core/skbuff.c | 4 +-
> net/core/sysctl_net_core.c | 2 +-
> net/sunrpc/cache.c | 3 +-
> net/unix/af_unix.c | 2 +-
> sound/soc/soc-utils.c | 4 +-
> virt/kvm/kvm_main.c | 2 +-
> 172 files changed, 2185 insertions(+), 951 deletions(-)
> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> create mode 100644 include/asm-generic/pgtable-geometry.h
>
> --
> 2.43.0
>
>
Hi Ryan,
First off, this is excellent work! Your cover page was very detailed
and made the patch set easier to understand.
Some questions/comments:
Once a kernel is booted with a certain page size, could there be issues
if it is booted later with a different page size? How about if this is
done frequently?
A random example of this: Let's say a retailer, doctor's office or a
similar OLTP environment prefers a small page size during the day for
performance reasons, then in the off-hours prefers a large page size for
DSS-type workloads like running reports or batch jobs.
I'm thinking how this might be used for cost savings. The best approach
would be to have multiple systems/VMs/cloud instances for the different
workload types. However, an end user might only have one system type
and change the page size regularly as in that example.
Also, the performance impact does look very minimal. It will be
interesting to see if there are any effects on the larger industry
standard benchmarks like TPC and SPEC.
Thanks,
Joe
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 18:15 ` Joseph Salisbury
@ 2024-10-18 18:27 ` David Hildenbrand
2024-10-18 19:19 ` [External] : " Joseph Salisbury
0 siblings, 1 reply; 196+ messages in thread
From: David Hildenbrand @ 2024-10-18 18:27 UTC (permalink / raw)
To: Joseph Salisbury, Ryan Roberts, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 18.10.24 20:15, Joseph Salisbury wrote:
>
>
>
> On 10/14/24 06:55, Ryan Roberts wrote:
>> Hi All,
>>
>> Patch bomb incoming... This covers many subsystems, so I've included a core set
>> of people on the full series and additionally included maintainers on relevant
>> patches. I haven't included those maintainers on this cover letter since the
>> numbers were far too big for it to work. But I've included a link to this cover
>> letter on each patch, so they can hopefully find their way here. For follow up
>> submissions I'll break it up by subsystem, but for now thought it was important
>> to show the full picture.
>>
>> This RFC series implements support for boot-time page size selection within the
>> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
>> size has been selected at compile-time, meaning the size is baked into a given
>> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
>> starts to present a problem for distributions. Boot-time page size selection
>> enables the creation of a single kernel image, which can be told which page size
>> to use on the kernel command line.
>>
>> Why is having an image-per-page size problematic?
>> =================================================
>>
>> Many traditional distros are now supporting both 4K and 64K. And this means
>> managing 2 kernel packages, along with drivers for each. For some, it means
>> multiple installer flavours and multiple ISOs. All of this adds up to a
>> less-than-ideal level of complexity. Additionally, Android now supports 4K and
>> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
>> painful, and the extra flash space required for both kernel images and the
>> duplicated modules has been problematic. Boot-time page size selection solves
>> all of this.
>>
>> Additionally, in starting to think about the longer term deployment story for
>> D128 page tables, which Arm architecture now supports, a lot of the same
>> problems need to be solved, so this work sets us up nicely for that.
>>
>> So what's the down side?
>> ========================
>>
>> Well nothing's free; Various static allocations in the kernel image must be
>> sized for the worst case (largest supported page size), so image size is in line
>> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
>> is a slight increase to the image size. But I expect that problem goes away if
>> you're compressing the image - it's just some extra zeros. At boot-time, I expect
>> we could free the unused static storage once we know the page size - although
>> that would be a follow up enhancement.
>>
>> And then there is performance. Since PAGE_SIZE and friends are no longer
>> compile-time constants, we must look up their values and do arithmetic at
>> runtime instead of compile-time. My early perf testing suggests this is
>> imperceptible for real-world workloads, and only has small impact on
>> microbenchmarks - more on this below.
>>
>> Approach
>> ========
>>
>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>> friends are compile-time constant, but in a way that allows the compiler to
>> perform the same optimizations as was previously being done if they do turn out
>> to be compile-time constant. Where constants are required, we use limits;
>> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
>> of all the classes of problems to solve.
>>
>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
>> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
>> which is an alternative to selecting a compile-time page size.
>>
>> When boot-time page size is active, the arch pgtable geometry macro definitions
>> resolve to something that can be configured at boot. The arm64 implementation in
>> this series mainly uses global, __ro_after_init variables. I've tried using
>> alternatives patching, but that performs worse than loading from memory; I think
>> due to code size bloat.
>>
>> Status
>> ======
>>
>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented enough
>> to compile the kernel image itself with defconfig (and a few other bits and
>> pieces). This is enough to build a kernel that can boot under QEMU or FVP. I'll
>> happily do the rest of the work to enable all the extra drivers, but wanted to
>> get feedback on the shape of this effort first. If anyone wants to do any
>> testing, and has a must-have config, let me know and I'll prioritize enabling it
>> first.
>>
>> The series is arranged as follows:
>>
>> - patch 1: Add macros required for converting non-arch code to support
>> boot-time page size selection
>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>> non-arch code
>> - patches 37-38: Some arm64 tidy ups
>> - patch 39: Add macros required for converting arm64 code to support
>> boot-time page size selection
>> - patches 40-56: arm64 changes to support boot-time page size selection
>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>> selection
>>
>> Ideally, I'd like to get the basics merged (something like this series), then
>> incrementally improve it over a handful of kernel releases until we can
>> demonstrate that we have feature parity with the compile-time build and no
>> performance blockers. Once at that point, ideally the compile-time build options
>> would be removed and the code could be cleaned up further.
>>
>> One of the bigger pieces that I'd propose to add as a follow up, is to make
>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>> handling.
>>
>> Assuming people are amenable to the rough shape, how would I go about getting
>> the non-arch changes merged? Since they cover many subsystems, will each piece
>> need to go independently to each relevant maintainer or could it all be merged
>> together through the arm64 tree?
>>
>> Image Size
>> ==========
>>
>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>> kernel image on disk for base (before any changes applied), compile (with
>> changes, configured for compile-time page size) and boot (with changes,
>> configured for boot-time page size).
>>
>> You can see that the compile-16k and 64k configs are actually slightly smaller
>> than the baselines; that's due to optimizing some buffer sizes which didn't need
>> to depend on page size during the series. The boot-time image is ~1% bigger than
>> the 64k compile-time image. I believe there is scope to improve this to make it
>> equal to compile-64k if required:
>>
>> | config | size/KB | diff/KB | diff/% |
>> |-------------|---------|---------|---------|
>> | base-4k | 54895 | 0 | 0.0% |
>> | base-16k | 55161 | 266 | 0.5% |
>> | base-64k | 56775 | 1880 | 3.4% |
>> | compile-4k | 54895 | 0 | 0.0% |
>> | compile-16k | 55097 | 202 | 0.4% |
>> | compile-64k | 56391 | 1496 | 2.7% |
>> | boot-4K | 57045 | 2150 | 3.9% |
>>
>> And below shows the size of the image in memory at run-time, separated for text
>> and data costs. The boot image has ~1% text cost; most likely due to the fact
>> that PAGE_SIZE and friends are not compile-time constants so need instructions
>> to load the values and do arithmetic. I believe we could eventually get the data
>> cost to match the cost for the compile image for the chosen page size by freeing
>> the ends of the static buffers not needed for the selected page size:
>>
>> | | text | text | text | data | data | data |
>> | config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
>> | base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
>> | base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
>> | compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
>> | compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
>> | compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
>> | boot-4K | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
>>
>> Functional Testing
>> ==================
>>
>> I've build-tested defconfig for all arches supported by tuxmake (which is most)
>> without issue.
>>
>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page sizes
>> and a few va-sizes, and additionally have run all the mm-selftests, with no
>> regressions observed vs the equivalent compile-time page size build (although
>> the mm-selftests have a few existing failures when run against 16K and 64K
>> kernels - those should really be investigated and fixed independently).
>>
>> Test coverage is lacking for many of the drivers that I've touched, but in many
>> cases, I'm hoping the changes are simple enough that review might suffice?
>>
>> Performance Testing
>> ===================
>>
>> I've run some limited performance benchmarks:
>>
>> First, a real-world benchmark that causes a lot of page table manipulation (and
>> therefore we would expect to see regression here if we are going to see it
>> anywhere); kernel compilation. It barely registers a change. Values are times,
>> so smaller is better. All relative to base-4k:
>>
>> | | kern | kern | user | user | real | real |
>> | config | mean | stdev | mean | stdev | mean | stdev |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
>> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
>> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>>
>> The Speedometer JavaScript benchmark also shows no change. Values are runs per
>> min, so bigger is better. All relative to base-4k:
>>
>> | config | mean | stdev |
>> |-------------|---------|---------|
>> | base-4k | 0.0% | 0.8% |
>> | compile-4k | 0.4% | 0.8% |
>> | boot-4k | 0.0% | 0.9% |
>>
>> Finally, I've run some microbenchmarks known to stress page table manipulations
>> (originally from David Hildenbrand). The fork test maps/allocs 1G of anon
>> memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
>> memory then measures the cost of munmap()ing it. The fork test is known to be
>> extremely sensitive to any changes that cause instructions to be aligned
>> differently in cachelines. When using this test for other changes, I've seen
>> double digit regressions for the slightest thing, so 12% regression on this test
>> is actually fairly good. This likely represents the extreme worst case for
>> regressions that will be observed across other microbenchmarks (famous last
>> words). Values are times, so smaller is better. All relative to base-4k:
>>
>> | | fork | fork | munmap | munmap |
>> | config | mean | stdev | mean | stdev |
>> |-------------|---------|---------|---------|---------|
>> | base-4k | 0.0% | 1.3% | 0.0% | 0.3% |
>> | compile-4k | 0.1% | 1.3% | -0.9% | 0.1% |
>> | boot-4k | 12.8% | 1.2% | 3.8% | 1.0% |
>>
>> NOTE: The series applies on top of v6.11.
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (57):
>> mm: Add macros ahead of supporting boot-time page size selection
>> vmlinux: Align to PAGE_SIZE_MAX
>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>> mm: Avoid split pmd ptl if pmd level is run-time folded
>> mm: Remove PAGE_SIZE compile-time constant assumption
>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>> fs: Remove PAGE_SIZE compile-time constant assumption
>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>> fork: Permit boot-time THREAD_SIZE determination
>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>> bpf: Remove PAGE_SIZE compile-time constant assumption
>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>> perf: Remove PAGE_SIZE compile-time constant assumption
>> kvm: Remove PAGE_SIZE compile-time constant assumption
>> trace: Remove PAGE_SIZE compile-time constant assumption
>> crash: Remove PAGE_SIZE compile-time constant assumption
>> crypto: Remove PAGE_SIZE compile-time constant assumption
>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>> sound: Remove PAGE_SIZE compile-time constant assumption
>> net: Remove PAGE_SIZE compile-time constant assumption
>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>> edac: Remove PAGE_SIZE compile-time constant assumption
>> optee: Remove PAGE_SIZE compile-time constant assumption
>> random: Remove PAGE_SIZE compile-time constant assumption
>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>> virtio: Remove PAGE_SIZE compile-time constant assumption
>> xen: Remove PAGE_SIZE compile-time constant assumption
>> arm64: Fix macros to work in C code in addition to the linker script
>> arm64: Track early pgtable allocation limit
>> arm64: Introduce macros required for boot-time page selection
>> arm64: Refactor early pgtable size calculation macros
>> arm64: Pass desired page size on command line
>> arm64: Divorce early init from PAGE_SIZE
>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>> arm64: Align sections to PAGE_SIZE_MAX
>> arm64: Rework trampoline rodata mapping
>> arm64: Generalize fixmap for boot-time page size
>> arm64: Statically allocate and align for worst-case page size
>> arm64: Convert switch to if for non-const comparison values
>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>> arm64: Remove PAGE_SZ asm-offset
>> arm64: Introduce cpu features for page sizes
>> arm64: Remove PAGE_SIZE from assembly code
>> arm64: Runtime-fold pmd level
>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>> arm64: TRAMP_VALIAS is no longer compile-time constant
>> arm64: Determine THREAD_SIZE at boot-time
>> arm64: Enable boot-time page size selection
>>
>> arch/alpha/include/asm/page.h | 1 +
>> arch/arc/include/asm/page.h | 1 +
>> arch/arm/include/asm/page.h | 1 +
>> arch/arm64/Kconfig | 26 ++-
>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>> arch/arm64/include/asm/efi.h | 2 +-
>> arch/arm64/include/asm/fixmap.h | 28 ++-
>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>> arch/arm64/include/asm/memory.h | 62 ++++--
>> arch/arm64/include/asm/page-def.h | 3 +-
>> arch/arm64/include/asm/pgalloc.h | 16 +-
>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>> arch/arm64/include/asm/processor.h | 10 +-
>> arch/arm64/include/asm/sections.h | 1 +
>> arch/arm64/include/asm/smp.h | 1 +
>> arch/arm64/include/asm/sparsemem.h | 15 +-
>> arch/arm64/include/asm/sysreg.h | 54 +++--
>> arch/arm64/include/asm/tlb.h | 3 +
>> arch/arm64/kernel/asm-offsets.c | 4 +-
>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>> arch/arm64/kernel/efi.c | 2 +-
>> arch/arm64/kernel/entry.S | 60 +++++-
>> arch/arm64/kernel/head.S | 46 +++-
>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>> arch/arm64/kernel/image-vars.h | 14 ++
>> arch/arm64/kernel/image.h | 4 +
>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>> arch/arm64/kernel/vdso.c | 7 +-
>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>> arch/arm64/kvm/arm.c | 10 +
>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>> arch/arm64/kvm/mmu.c | 39 ++--
>> arch/arm64/lib/clear_page.S | 7 +-
>> arch/arm64/lib/copy_page.S | 33 ++-
>> arch/arm64/lib/mte.S | 27 ++-
>> arch/arm64/mm/Makefile | 1 +
>> arch/arm64/mm/fixmap.c | 38 ++--
>> arch/arm64/mm/hugetlbpage.c | 40 +---
>> arch/arm64/mm/init.c | 26 +--
>> arch/arm64/mm/kasan_init.c | 8 +-
>> arch/arm64/mm/mmu.c | 53 +++--
>> arch/arm64/mm/pgd.c | 12 +-
>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>> arch/arm64/mm/proc.S | 128 ++++++++---
>> arch/arm64/mm/ptdump.c | 3 +-
>> arch/arm64/tools/cpucaps | 3 +
>> arch/csky/include/asm/page.h | 3 +
>> arch/hexagon/include/asm/page.h | 2 +
>> arch/loongarch/include/asm/page.h | 2 +
>> arch/m68k/include/asm/page.h | 1 +
>> arch/microblaze/include/asm/page.h | 1 +
>> arch/mips/include/asm/page.h | 1 +
>> arch/nios2/include/asm/page.h | 2 +
>> arch/openrisc/include/asm/page.h | 1 +
>> arch/parisc/include/asm/page.h | 1 +
>> arch/powerpc/include/asm/page.h | 2 +
>> arch/riscv/include/asm/page.h | 1 +
>> arch/s390/include/asm/page.h | 1 +
>> arch/sh/include/asm/page.h | 1 +
>> arch/sparc/include/asm/page.h | 3 +
>> arch/um/include/asm/page.h | 2 +
>> arch/x86/include/asm/page_types.h | 2 +
>> arch/xtensa/include/asm/page.h | 1 +
>> crypto/lskcipher.c | 4 +-
>> drivers/ata/sata_sil24.c | 46 ++--
>> drivers/base/node.c | 6 +-
>> drivers/base/topology.c | 32 +--
>> drivers/block/virtio_blk.c | 2 +-
>> drivers/char/random.c | 4 +-
>> drivers/edac/edac_mc.h | 13 +-
>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>> drivers/mtd/mtdswap.c | 4 +-
>> drivers/net/ethernet/freescale/fec.h | 3 +-
>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>> drivers/tee/optee/call.c | 7 +-
>> drivers/tee/optee/smc_abi.c | 2 +-
>> drivers/virtio/virtio_balloon.c | 10 +-
>> drivers/xen/balloon.c | 11 +-
>> drivers/xen/biomerge.c | 12 +-
>> drivers/xen/privcmd.c | 2 +-
>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>> drivers/xen/xlate_mmu.c | 6 +-
>> fs/binfmt_elf.c | 11 +-
>> fs/buffer.c | 2 +-
>> fs/coredump.c | 8 +-
>> fs/ext4/ext4.h | 36 ++--
>> fs/ext4/move_extent.c | 2 +-
>> fs/ext4/readpage.c | 2 +-
>> fs/fat/dir.c | 4 +-
>> fs/fat/fatent.c | 4 +-
>> fs/nfs/nfs42proc.c | 2 +-
>> fs/nfs/nfs42xattr.c | 2 +-
>> fs/nfs/nfs4proc.c | 2 +-
>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>> include/asm-generic/vmlinux.lds.h | 38 ++--
>> include/linux/buffer_head.h | 1 +
>> include/linux/cpumask.h | 5 +
>> include/linux/linkage.h | 4 +-
>> include/linux/mm.h | 17 +-
>> include/linux/mm_types.h | 15 +-
>> include/linux/mm_types_task.h | 2 +-
>> include/linux/mmzone.h | 3 +-
>> include/linux/netlink.h | 6 +-
>> include/linux/percpu-defs.h | 4 +-
>> include/linux/perf_event.h | 2 +-
>> include/linux/sched.h | 4 +-
>> include/linux/slab.h | 7 +-
>> include/linux/stackdepot.h | 6 +-
>> include/linux/sunrpc/svc.h | 8 +-
>> include/linux/sunrpc/svc_rdma.h | 4 +-
>> include/linux/sunrpc/svcsock.h | 2 +-
>> include/linux/swap.h | 17 +-
>> include/linux/swapops.h | 6 +-
>> include/linux/thread_info.h | 10 +-
>> include/xen/page.h | 2 +
>> init/main.c | 7 +-
>> kernel/bpf/core.c | 9 +-
>> kernel/bpf/ringbuf.c | 54 ++---
>> kernel/cgroup/cgroup.c | 8 +-
>> kernel/crash_core.c | 2 +-
>> kernel/events/core.c | 2 +-
>> kernel/fork.c | 71 +++----
>> kernel/power/power.h | 2 +-
>> kernel/power/snapshot.c | 2 +-
>> kernel/power/swap.c | 129 +++++++++--
>> kernel/trace/fgraph.c | 2 +-
>> kernel/trace/trace.c | 2 +-
>> lib/stackdepot.c | 6 +-
>> mm/kasan/report.c | 3 +-
>> mm/memcontrol.c | 11 +-
>> mm/memory.c | 4 +-
>> mm/mmap.c | 2 +-
>> mm/page-writeback.c | 2 +-
>> mm/page_alloc.c | 31 +--
>> mm/slub.c | 2 +-
>> mm/sparse.c | 2 +-
>> mm/swapfile.c | 2 +-
>> mm/vmalloc.c | 7 +-
>> net/9p/trans_virtio.c | 4 +-
>> net/core/hotdata.c | 4 +-
>> net/core/skbuff.c | 4 +-
>> net/core/sysctl_net_core.c | 2 +-
>> net/sunrpc/cache.c | 3 +-
>> net/unix/af_unix.c | 2 +-
>> sound/soc/soc-utils.c | 4 +-
>> virt/kvm/kvm_main.c | 2 +-
>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>
>> --
>> 2.43.0
>>
>>
>
> Hi Ryan,
>
> First off, this is excellent work! Your cover page was very detailed
> and made the patch set easier to understand.
>
> Some questions/comments:
>
> Once a kernel is booted with a certain page size, could there be issues
> if it is booted later with a different page size? How about if this is
> done frequently?
I think that is the reason why you are only given the option in RHEL to
select the kernel (4K vs. 64K) to use at install time.
Software can easily use a different data format for persistence based on
the base page size. I would suspect DBs might be the usual suspects.
One example, I think, is swap space: the base page size in effect when
the device was formatted determines its layout, so it cannot be used
with a different page size without reformatting it.
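
To make the swap example concrete, here is a minimal sketch (not from the
series) of how a swap area ties itself to a page size. It relies on the fact
that mkswap places the 10-byte "SWAPSPACE2" signature in the last 10 bytes of
the first page, so the signature's offset depends on the page size used at
format time. The helper names swap_page_size() and make_fake_header() are
made up for illustration only.

```python
SWAP_MAGIC = b"SWAPSPACE2"                    # 10-byte signature written by mkswap
CANDIDATE_PAGE_SIZES = (4096, 16384, 65536)   # arm64 base page sizes


def swap_page_size(header: bytes):
    """Return the page size a swap area was formatted with, or None.

    `header` is the start of the swap device or file (at least one page).
    The signature occupies the last 10 bytes of the first page, so we probe
    the end of each candidate page size in turn.
    """
    for size in CANDIDATE_PAGE_SIZES:
        start = size - len(SWAP_MAGIC)
        if header[start:size] == SWAP_MAGIC:
            return size
    return None


def make_fake_header(page_size: int) -> bytes:
    """Build a zero-filled first page carrying only the signature (test data)."""
    return bytes(page_size - len(SWAP_MAGIC)) + SWAP_MAGIC
```

Fed the first 64K of a real swap device or file, swap_page_size() would
report which page size it was formatted for. A kernel booted with a different
page size looks for the signature at a different offset, so it would be
expected to reject the area at swapon time until it is reformatted.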
So ... one has to be a bit careful ...
--
Cheers,
David / dhildenb
* Re: [External] : Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 18:27 ` David Hildenbrand
@ 2024-10-18 19:19 ` Joseph Salisbury
2024-10-18 19:27 ` David Hildenbrand
0 siblings, 1 reply; 196+ messages in thread
From: Joseph Salisbury @ 2024-10-18 19:19 UTC (permalink / raw)
To: David Hildenbrand, Ryan Roberts, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/18/24 14:27, David Hildenbrand wrote:
> On 18.10.24 20:15, Joseph Salisbury wrote:
>>
>>
>>
>> On 10/14/24 06:55, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Patch bomb incoming... This covers many subsystems, so I've included a core set
>>> of people on the full series and additionally included maintainers on relevant
>>> patches. I haven't included those maintainers on this cover letter since the
>>> numbers were far too big for it to work. But I've included a link to this cover
>>> letter on each patch, so they can hopefully find their way here. For follow up
>>> submissions I'll break it up by subsystem, but for now thought it was important
>>> to show the full picture.
>>>
>>> This RFC series implements support for boot-time page size selection within the
>>> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
>>> size has been selected at compile-time, meaning the size is baked into a given
>>> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
>>> starts to present a problem for distributions. Boot-time page size selection
>>> enables the creation of a single kernel image, which can be told which page size
>>> to use on the kernel command line.
>>>
>>> Why is having an image-per-page size problematic?
>>> =================================================
>>>
>>> Many traditional distros are now supporting both 4K and 64K. And this means
>>> managing 2 kernel packages, along with drivers for each. For some, it means
>>> multiple installer flavours and multiple ISOs. All of this adds up to a
>>> less-than-ideal level of complexity. Additionally, Android now supports 4K and
>>> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
>>> painful, and the extra flash space required for both kernel images and the
>>> duplicated modules has been problematic. Boot-time page size selection solves
>>> all of this.
>>>
>>> Additionally, in starting to think about the longer term deployment story for
>>> D128 page tables, which Arm architecture now supports, a lot of the same
>>> problems need to be solved, so this work sets us up nicely for that.
>>>
>>> So what's the down side?
>>> ========================
>>>
>>> Well nothing's free; Various static allocations in the kernel image must be
>>> sized for the worst case (largest supported page size), so image size is in line
>>> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
>>> is a slight increase to the image size. But I expect that problem goes away if
>>> you're compressing the image - it's just some extra zeros. At boot-time, I expect
>>> we could free the unused static storage once we know the page size - although
>>> that would be a follow up enhancement.
>>>
>>> And then there is performance. Since PAGE_SIZE and friends are no longer
>>> compile-time constants, we must look up their values and do arithmetic at
>>> runtime instead of compile-time. My early perf testing suggests this is
>>> imperceptible for real-world workloads, and only has small impact on
>>> microbenchmarks - more on this below.
>>>
>>> Approach
>>> ========
>>>
>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>>> friends are compile-time constant, but in a way that allows the compiler to
>>> perform the same optimizations as was previously being done if they do turn out
>>> to be compile-time constant. Where constants are required, we use limits;
>>> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
>>> of all the classes of problems to solve.
>>>
>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
>>> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
>>> which is an alternative to selecting a compile-time page size.
>>>
>>> When boot-time page size is active, the arch pgtable geometry macro definitions
>>> resolve to something that can be configured at boot. The arm64 implementation in
>>> this series mainly uses global, __ro_after_init variables. I've tried using
>>> alternatives patching, but that performs worse than loading from memory; I think
>>> due to code size bloat.
>>>
>>> Status
>>> ======
>>>
>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented enough
>>> to compile the kernel image itself with defconfig (and a few other bits and
>>> pieces). This is enough to build a kernel that can boot under QEMU or FVP. I'll
>>> happily do the rest of the work to enable all the extra drivers, but wanted to
>>> get feedback on the shape of this effort first. If anyone wants to do any
>>> testing, and has a must-have config, let me know and I'll prioritize enabling it
>>> first.
>>>
>>> The series is arranged as follows:
>>>
>>> - patch 1: Add macros required for converting non-arch code to support
>>> boot-time page size selection
>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>> non-arch code
>>> - patches 37-38: Some arm64 tidy ups
>>> - patch 39: Add macros required for converting arm64 code to support
>>> boot-time page size selection
>>> - patches 40-56: arm64 changes to support boot-time page size selection
>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>>> selection
>>>
>>> Ideally, I'd like to get the basics merged (something like this series), then
>>> incrementally improve it over a handful of kernel releases until we can
>>> demonstrate that we have feature parity with the compile-time build and no
>>> performance blockers. Once at that point, ideally the compile-time build options
>>> would be removed and the code could be cleaned up further.
>>>
>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>>> handling.
>>>
>>> Assuming people are amenable to the rough shape, how would I go about getting
>>> the non-arch changes merged? Since they cover many subsystems, will each piece
>>> need to go independently to each relevant maintainer or could it all be merged
>>> together through the arm64 tree?
>>>
>>> Image Size
>>> ==========
>>>
>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>>> kernel image on disk for base (before any changes applied), compile (with
>>> changes, configured for compile-time page size) and boot (with changes,
>>> configured for boot-time page size).
>>>
>>> You can see that the compile-16k and 64k configs are actually slightly smaller
>>> than the baselines; that's due to optimizing some buffer sizes which didn't need
>>> to depend on page size during the series. The boot-time image is ~1% bigger than
>>> the 64k compile-time image. I believe there is scope to improve this to make it
>>> equal to compile-64k if required:
>>>
>>> | config | size/KB | diff/KB | diff/% |
>>> |-------------|---------|---------|---------|
>>> | base-4k | 54895 | 0 | 0.0% |
>>> | base-16k | 55161 | 266 | 0.5% |
>>> | base-64k | 56775 | 1880 | 3.4% |
>>> | compile-4k | 54895 | 0 | 0.0% |
>>> | compile-16k | 55097 | 202 | 0.4% |
>>> | compile-64k | 56391 | 1496 | 2.7% |
>>> | boot-4K | 57045 | 2150 | 3.9% |
>>>
>>> And below shows the size of the image in memory at run-time, separated for text
>>> and data costs. The boot image has ~1% text cost; most likely due to the fact
>>> that PAGE_SIZE and friends are not compile-time constants so need instructions
>>> to load the values and do arithmetic. I believe we could eventually get the data
>>> cost to match the cost for the compile image for the chosen page size by freeing
>>> the ends of the static buffers not needed for the selected page size:
>>>
>>> | | text | text | text | data | data | data |
>>> | config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
>>> |-------------|---------|---------|---------|---------|---------|---------|
>>> | base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
>>> | base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
>>> | base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
>>> | compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
>>> | compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
>>> | compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
>>> | boot-4K | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
>>>
>>> Functional Testing
>>> ==================
>>>
>>> I've build-tested defconfig for all arches supported by tuxmake (which
>>> is most) without issue.
>>>
>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all
>>> page sizes and a few va-sizes, and additionally have run all the
>>> mm-selftests, with no regressions observed vs the equivalent
>>> compile-time page size build (although the mm-selftests have a few
>>> existing failures when run against 16K and 64K kernels - those should
>>> really be investigated and fixed independently).
>>>
>>> Test coverage is lacking for many of the drivers that I've touched, but
>>> in many cases, I'm hoping the changes are simple enough that review
>>> might suffice?
>>>
>>> Performance Testing
>>> ===================
>>>
>>> I've run some limited performance benchmarks:
>>>
>>> First, a real-world benchmark that causes a lot of page table
>>> manipulation (and therefore we would expect to see regression here if we
>>> are going to see it anywhere); kernel compilation. It barely registers a
>>> change. Values are times, so smaller is better. All relative to base-4k:
>>>
>>> | | kern | kern | user | user | real | real |
>>> | config | mean | stdev | mean | stdev | mean | stdev |
>>> |-------------|---------|---------|---------|---------|---------|---------|
>>> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
>>> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
>>> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>>>
>>> The Speedometer JavaScript benchmark also shows no change. Values are
>>> runs per min, so bigger is better. All relative to base-4k:
>>>
>>> | config | mean | stdev |
>>> |-------------|---------|---------|
>>> | base-4k | 0.0% | 0.8% |
>>> | compile-4k | 0.4% | 0.8% |
>>> | boot-4k | 0.0% | 0.9% |
>>>
>>> Finally, I've run some microbenchmarks known to stress page table
>>> manipulations (originally from David Hildenbrand). The fork test
>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The
>>> munmap test maps/allocs 1G of anon memory then measures the cost of
>>> munmap()ing it. The fork test is known to be extremely sensitive to any
>>> changes that cause instructions to be aligned differently in cachelines.
>>> When using this test for other changes, I've seen double digit
>>> regressions for the slightest thing, so 12% regression on this test is
>>> actually fairly good. This likely represents the extreme worst case for
>>> regressions that will be observed across other microbenchmarks (famous
>>> last words). Values are times, so smaller is better. All relative to
>>> base-4k:
>>>
>>> | | fork | fork | munmap | munmap |
>>> | config | mean | stdev | mean | stdev |
>>> |-------------|---------|---------|---------|---------|
>>> | base-4k | 0.0% | 1.3% | 0.0% | 0.3% |
>>> | compile-4k | 0.1% | 1.3% | -0.9% | 0.1% |
>>> | boot-4k | 12.8% | 1.2% | 3.8% | 1.0% |
>>>
>>> NOTE: The series applies on top of v6.11.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>> Ryan Roberts (57):
>>> mm: Add macros ahead of supporting boot-time page size selection
>>> vmlinux: Align to PAGE_SIZE_MAX
>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>>> mm: Avoid split pmd ptl if pmd level is run-time folded
>>> mm: Remove PAGE_SIZE compile-time constant assumption
>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>>> fs: Remove PAGE_SIZE compile-time constant assumption
>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>>> fork: Permit boot-time THREAD_SIZE determination
>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>>> bpf: Remove PAGE_SIZE compile-time constant assumption
>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>>> perf: Remove PAGE_SIZE compile-time constant assumption
>>> kvm: Remove PAGE_SIZE compile-time constant assumption
>>> trace: Remove PAGE_SIZE compile-time constant assumption
>>> crash: Remove PAGE_SIZE compile-time constant assumption
>>> crypto: Remove PAGE_SIZE compile-time constant assumption
>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>>> sound: Remove PAGE_SIZE compile-time constant assumption
>>> net: Remove PAGE_SIZE compile-time constant assumption
>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>>> edac: Remove PAGE_SIZE compile-time constant assumption
>>> optee: Remove PAGE_SIZE compile-time constant assumption
>>> random: Remove PAGE_SIZE compile-time constant assumption
>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>>> virtio: Remove PAGE_SIZE compile-time constant assumption
>>> xen: Remove PAGE_SIZE compile-time constant assumption
>>> arm64: Fix macros to work in C code in addition to the linker script
>>> arm64: Track early pgtable allocation limit
>>> arm64: Introduce macros required for boot-time page selection
>>> arm64: Refactor early pgtable size calculation macros
>>> arm64: Pass desired page size on command line
>>> arm64: Divorce early init from PAGE_SIZE
>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>>> arm64: Align sections to PAGE_SIZE_MAX
>>> arm64: Rework trampoline rodata mapping
>>> arm64: Generalize fixmap for boot-time page size
>>> arm64: Statically allocate and align for worst-case page size
>>> arm64: Convert switch to if for non-const comparison values
>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>>> arm64: Remove PAGE_SZ asm-offset
>>> arm64: Introduce cpu features for page sizes
>>> arm64: Remove PAGE_SIZE from assembly code
>>> arm64: Runtime-fold pmd level
>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>>> arm64: TRAMP_VALIAS is no longer compile-time constant
>>> arm64: Determine THREAD_SIZE at boot-time
>>> arm64: Enable boot-time page size selection
>>>
>>> arch/alpha/include/asm/page.h | 1 +
>>> arch/arc/include/asm/page.h | 1 +
>>> arch/arm/include/asm/page.h | 1 +
>>> arch/arm64/Kconfig | 26 ++-
>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>>> arch/arm64/include/asm/efi.h | 2 +-
>>> arch/arm64/include/asm/fixmap.h | 28 ++-
>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>>> arch/arm64/include/asm/memory.h | 62 ++++--
>>> arch/arm64/include/asm/page-def.h | 3 +-
>>> arch/arm64/include/asm/pgalloc.h | 16 +-
>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>>> arch/arm64/include/asm/processor.h | 10 +-
>>> arch/arm64/include/asm/sections.h | 1 +
>>> arch/arm64/include/asm/smp.h | 1 +
>>> arch/arm64/include/asm/sparsemem.h | 15 +-
>>> arch/arm64/include/asm/sysreg.h | 54 +++--
>>> arch/arm64/include/asm/tlb.h | 3 +
>>> arch/arm64/kernel/asm-offsets.c | 4 +-
>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>>> arch/arm64/kernel/efi.c | 2 +-
>>> arch/arm64/kernel/entry.S | 60 +++++-
>>> arch/arm64/kernel/head.S | 46 +++-
>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>>> arch/arm64/kernel/image-vars.h | 14 ++
>>> arch/arm64/kernel/image.h | 4 +
>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>>> arch/arm64/kernel/vdso.c | 7 +-
>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>>> arch/arm64/kvm/arm.c | 10 +
>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>>> arch/arm64/kvm/mmu.c | 39 ++--
>>> arch/arm64/lib/clear_page.S | 7 +-
>>> arch/arm64/lib/copy_page.S | 33 ++-
>>> arch/arm64/lib/mte.S | 27 ++-
>>> arch/arm64/mm/Makefile | 1 +
>>> arch/arm64/mm/fixmap.c | 38 ++--
>>> arch/arm64/mm/hugetlbpage.c | 40 +---
>>> arch/arm64/mm/init.c | 26 +--
>>> arch/arm64/mm/kasan_init.c | 8 +-
>>> arch/arm64/mm/mmu.c | 53 +++--
>>> arch/arm64/mm/pgd.c | 12 +-
>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>>> arch/arm64/mm/proc.S | 128 ++++++++---
>>> arch/arm64/mm/ptdump.c | 3 +-
>>> arch/arm64/tools/cpucaps | 3 +
>>> arch/csky/include/asm/page.h | 3 +
>>> arch/hexagon/include/asm/page.h | 2 +
>>> arch/loongarch/include/asm/page.h | 2 +
>>> arch/m68k/include/asm/page.h | 1 +
>>> arch/microblaze/include/asm/page.h | 1 +
>>> arch/mips/include/asm/page.h | 1 +
>>> arch/nios2/include/asm/page.h | 2 +
>>> arch/openrisc/include/asm/page.h | 1 +
>>> arch/parisc/include/asm/page.h | 1 +
>>> arch/powerpc/include/asm/page.h | 2 +
>>> arch/riscv/include/asm/page.h | 1 +
>>> arch/s390/include/asm/page.h | 1 +
>>> arch/sh/include/asm/page.h | 1 +
>>> arch/sparc/include/asm/page.h | 3 +
>>> arch/um/include/asm/page.h | 2 +
>>> arch/x86/include/asm/page_types.h | 2 +
>>> arch/xtensa/include/asm/page.h | 1 +
>>> crypto/lskcipher.c | 4 +-
>>> drivers/ata/sata_sil24.c | 46 ++--
>>> drivers/base/node.c | 6 +-
>>> drivers/base/topology.c | 32 +--
>>> drivers/block/virtio_blk.c | 2 +-
>>> drivers/char/random.c | 4 +-
>>> drivers/edac/edac_mc.h | 13 +-
>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>>> drivers/mtd/mtdswap.c | 4 +-
>>> drivers/net/ethernet/freescale/fec.h | 3 +-
>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>>> drivers/tee/optee/call.c | 7 +-
>>> drivers/tee/optee/smc_abi.c | 2 +-
>>> drivers/virtio/virtio_balloon.c | 10 +-
>>> drivers/xen/balloon.c | 11 +-
>>> drivers/xen/biomerge.c | 12 +-
>>> drivers/xen/privcmd.c | 2 +-
>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>>> drivers/xen/xlate_mmu.c | 6 +-
>>> fs/binfmt_elf.c | 11 +-
>>> fs/buffer.c | 2 +-
>>> fs/coredump.c | 8 +-
>>> fs/ext4/ext4.h | 36 ++--
>>> fs/ext4/move_extent.c | 2 +-
>>> fs/ext4/readpage.c | 2 +-
>>> fs/fat/dir.c | 4 +-
>>> fs/fat/fatent.c | 4 +-
>>> fs/nfs/nfs42proc.c | 2 +-
>>> fs/nfs/nfs42xattr.c | 2 +-
>>> fs/nfs/nfs4proc.c | 2 +-
>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>>> include/asm-generic/vmlinux.lds.h | 38 ++--
>>> include/linux/buffer_head.h | 1 +
>>> include/linux/cpumask.h | 5 +
>>> include/linux/linkage.h | 4 +-
>>> include/linux/mm.h | 17 +-
>>> include/linux/mm_types.h | 15 +-
>>> include/linux/mm_types_task.h | 2 +-
>>> include/linux/mmzone.h | 3 +-
>>> include/linux/netlink.h | 6 +-
>>> include/linux/percpu-defs.h | 4 +-
>>> include/linux/perf_event.h | 2 +-
>>> include/linux/sched.h | 4 +-
>>> include/linux/slab.h | 7 +-
>>> include/linux/stackdepot.h | 6 +-
>>> include/linux/sunrpc/svc.h | 8 +-
>>> include/linux/sunrpc/svc_rdma.h | 4 +-
>>> include/linux/sunrpc/svcsock.h | 2 +-
>>> include/linux/swap.h | 17 +-
>>> include/linux/swapops.h | 6 +-
>>> include/linux/thread_info.h | 10 +-
>>> include/xen/page.h | 2 +
>>> init/main.c | 7 +-
>>> kernel/bpf/core.c | 9 +-
>>> kernel/bpf/ringbuf.c | 54 ++---
>>> kernel/cgroup/cgroup.c | 8 +-
>>> kernel/crash_core.c | 2 +-
>>> kernel/events/core.c | 2 +-
>>> kernel/fork.c | 71 +++----
>>> kernel/power/power.h | 2 +-
>>> kernel/power/snapshot.c | 2 +-
>>> kernel/power/swap.c | 129 +++++++++--
>>> kernel/trace/fgraph.c | 2 +-
>>> kernel/trace/trace.c | 2 +-
>>> lib/stackdepot.c | 6 +-
>>> mm/kasan/report.c | 3 +-
>>> mm/memcontrol.c | 11 +-
>>> mm/memory.c | 4 +-
>>> mm/mmap.c | 2 +-
>>> mm/page-writeback.c | 2 +-
>>> mm/page_alloc.c | 31 +--
>>> mm/slub.c | 2 +-
>>> mm/sparse.c | 2 +-
>>> mm/swapfile.c | 2 +-
>>> mm/vmalloc.c | 7 +-
>>> net/9p/trans_virtio.c | 4 +-
>>> net/core/hotdata.c | 4 +-
>>> net/core/skbuff.c | 4 +-
>>> net/core/sysctl_net_core.c | 2 +-
>>> net/sunrpc/cache.c | 3 +-
>>> net/unix/af_unix.c | 2 +-
>>> sound/soc/soc-utils.c | 4 +-
>>> virt/kvm/kvm_main.c | 2 +-
>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>>
>>> --
>>> 2.43.0
>>>
>>>
>>
>> Hi Ryan,
>>
>> First off, this is excellent work! Your cover page was very detailed
>> and made the patch set easier to understand.
>>
>> Some questions/comments:
>>
>> Once a kernel is booted with a certain page size, could there be issues
>> if it is booted later with a different page size? How about if this is
>> done frequently?
>
> I think that is the reason why you are only given the option in RHEL
> to select the kernel (4K vs. 64K) to use at install time.
>
> Software can easily use a different data format for persistence based
> on the base page size. I would suspect DBs might be the usual suspects.
>
> One example is swap space I think, where the base page size used when
> formatting the device is used, and it cannot be used with a different
> page size unless reformatting it.
>
> So ... one has to be a bit careful ...
>
Yes, that is what I was thinking. Once a userspace process does an I/O
and if it is based on PAGE_SIZE things can go south. I think this is
not an issue with THP, so maybe it's possible with boot-time page selection?
* Re: [External] : Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 19:19 ` [External] : " Joseph Salisbury
@ 2024-10-18 19:27 ` David Hildenbrand
2024-10-18 20:06 ` Joseph Salisbury
0 siblings, 1 reply; 196+ messages in thread
From: David Hildenbrand @ 2024-10-18 19:27 UTC (permalink / raw)
To: Joseph Salisbury, Ryan Roberts, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
>>> Hi Ryan,
>>>
>>> First off, this is excellent work! Your cover page was very detailed
>>> and made the patch set easier to understand.
>>>
>>> Some questions/comments:
>>>
>>> Once a kernel is booted with a certain page size, could there be issues
>>> if it is booted later with a different page size? How about if this is
>>> done frequently?
>>
>> I think that is the reason why you are only given the option in RHEL
>> to select the kernel (4K vs. 64K) to use at install time.
>>
>> Software can easily use a different data format for persistence based
>> on the base page size. I would suspect DBs might be the usual suspects.
>>
>> One example is swap space I think, where the base page size used when
>> formatting the device is used, and it cannot be used with a different
>> page size unless reformatting it.
>>
>> So ... one has to be a bit careful ...
>>
> Yes, that is what I was thinking. Once a userspace process does an I/O
> and if it is based on PAGE_SIZE things can go south. I think this is
> not an issue with THP, so maybe it's possible with boot-time page selection?
THP is a different beast and has different semantics: the base page size
doesn't change: the result of getpagesize() is unmodified ("transparent").
One would have to emulate for a given user space process a different
page size ... and Ryan can likely tell some stories about that.
Not that I consider it reasonable to have dynamic page sizes in the
kernel and then try emulating a different one for all user space.
--
Cheers,
David / dhildenb
* Re: [External] : Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 19:27 ` David Hildenbrand
@ 2024-10-18 20:06 ` Joseph Salisbury
2024-10-21 9:55 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Joseph Salisbury @ 2024-10-18 20:06 UTC (permalink / raw)
To: David Hildenbrand, Ryan Roberts, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/18/24 15:27, David Hildenbrand wrote:
>
>>>> Hi Ryan,
>>>>
>>>> First off, this is excellent work! Your cover page was very detailed
>>>> and made the patch set easier to understand.
>>>>
>>>> Some questions/comments:
>>>>
>>>> Once a kernel is booted with a certain page size, could there be issues
>>>> if it is booted later with a different page size? How about if this is
>>>> done frequently?
>>>
>>> I think that is the reason why you are only given the option in RHEL
>>> to select the kernel (4K vs. 64K) to use at install time.
>>>
>>> Software can easily use a different data format for persistence based
>>> on the base page size. I would suspect DBs might be the usual suspects.
>>>
>>> One example is swap space I think, where the base page size used when
>>> formatting the device is used, and it cannot be used with a different
>>> page size unless reformatting it.
>>>
>>> So ... one has to be a bit careful ...
>>>
>> Yes, that is what I was thinking. Once a userspace process does an I/O
>> and if it is based on PAGE_SIZE things can go south. I think this is
>> not an issue with THP, so maybe it's possible with boot-time page
>> selection?
>
> THP is a different beast and has different semantics: the base page
> size doesn't change: the result of getpagesize() is unmodified
> ("transparent").
>
> One would have to emulate for a given user space process a different
> page size ... and Ryan can likely tell some stories about that.
>
> Not that I consider it reasonable to have dynamic page sizes in the
> kernel and then try emulating a different one for all user space.
This is probably a case of ensuring proper documentation from the
distro or application vendor.
Or maybe some type of "Safety gate" could be implemented outside of the
kernel. Some check for the prior use of different page sizes, in the
cases where it could cause problems.
* Re: [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX
2024-10-14 10:58 ` [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX Ryan Roberts
@ 2024-10-19 14:16 ` Thomas Weißschuh
2024-10-21 11:20 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Thomas Weißschuh @ 2024-10-19 14:16 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon, kvmarm, linux-arm-kernel, linux-kernel,
linux-mm
On 2024-10-14 11:58:51+0100, Ryan Roberts wrote:
> Increase alignment of sections in nvhe hyp, vdso and final vmlinux image
> from PAGE_SIZE to PAGE_SIZE_MAX. For compile-time PAGE_SIZE,
> PAGE_SIZE_MAX == PAGE_SIZE so there is no change. For boot-time
> PAGE_SIZE, PAGE_SIZE_MAX is the largest selectable page size.
>
> For a boot-time page size build, image size is comparable to a 64K page
> size compile-time build. In future, it may be desirable to optimize
> run-time memory consumption by freeing unused padding pages when the
> boot-time selected page size is less than PAGE_SIZE_MAX.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> arch/arm64/include/asm/memory.h | 4 +--
> arch/arm64/kernel/vdso-wrap.S | 4 +--
> arch/arm64/kernel/vdso.c | 7 +++---
> arch/arm64/kernel/vdso/vdso.lds.S | 4 +--
> arch/arm64/kernel/vdso32-wrap.S | 4 +--
> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +--
> arch/arm64/kernel/vmlinux.lds.S | 38 ++++++++++++++---------------
> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 2 +-
> 8 files changed, 34 insertions(+), 33 deletions(-)
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index 89b6e78400023..1efe98909a2e0 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -195,7 +195,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
>
> vdso_text_len = vdso_info[abi].vdso_pages << PAGE_SHIFT;
> /* Be sure to map the data page */
> - vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE;
> + vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE_MAX;
>
> vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
> if (IS_ERR_VALUE(vdso_base)) {
> @@ -203,7 +203,8 @@ static int __setup_additional_pages(enum vdso_abi abi,
> goto up_fail;
> }
>
> - ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
> + ret = _install_special_mapping(mm, vdso_base,
> + VVAR_NR_PAGES * PAGE_SIZE_MAX,
> VM_READ|VM_MAYREAD|VM_PFNMAP,
> vdso_info[abi].dm);
> if (IS_ERR(ret))
> @@ -212,7 +213,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
> if (system_supports_bti_kernel())
> gp_flags = VM_ARM64_BTI;
>
> - vdso_base += VVAR_NR_PAGES * PAGE_SIZE;
> + vdso_base += VVAR_NR_PAGES * PAGE_SIZE_MAX;
> mm->context.vdso = (void *)vdso_base;
> ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
> VM_READ|VM_EXEC|gp_flags|
> diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
> index 45354f2ddf706..f7d1537a689e8 100644
> --- a/arch/arm64/kernel/vdso/vdso.lds.S
> +++ b/arch/arm64/kernel/vdso/vdso.lds.S
> @@ -18,9 +18,9 @@ OUTPUT_ARCH(aarch64)
>
> SECTIONS
> {
> - PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
> + PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE_MAX);
> #ifdef CONFIG_TIME_NS
> - PROVIDE(_timens_data = _vdso_data + PAGE_SIZE);
> + PROVIDE(_timens_data = _vdso_data + PAGE_SIZE_MAX);
This looks like it also needs a change to vvar_fault() in vdso.c.
The symbols are now always PAGE_SIZE_MAX apart, while vvar_fault() works
in page offsets (vmf->pgoff) that are based on the runtime PAGE_SIZE and
it expects hardcoded offsets.
As a test you can use tools/testing/selftests/timens/timens.
(I can't test this right now, so it's only a suspicion)
> #endif
> . = VDSO_LBASE + SIZEOF_HEADERS;
> diff --git a/arch/arm64/kernel/vdso32/vdso.lds.S b/arch/arm64/kernel/vdso32/vdso.lds.S
> index 8d95d7d35057d..c46d18a69d1ce 100644
> --- a/arch/arm64/kernel/vdso32/vdso.lds.S
> +++ b/arch/arm64/kernel/vdso32/vdso.lds.S
> @@ -18,9 +18,9 @@ OUTPUT_ARCH(arm)
>
> SECTIONS
> {
> - PROVIDE_HIDDEN(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
> + PROVIDE_HIDDEN(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE_MAX);
> #ifdef CONFIG_TIME_NS
> - PROVIDE_HIDDEN(_timens_data = _vdso_data + PAGE_SIZE);
> + PROVIDE_HIDDEN(_timens_data = _vdso_data + PAGE_SIZE_MAX);
> #endif
> . = VDSO_LBASE + SIZEOF_HEADERS;
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
` (6 preceding siblings ...)
2024-10-18 18:15 ` Joseph Salisbury
@ 2024-10-19 15:47 ` Neal Gompa
2024-10-21 11:02 ` Ryan Roberts
2024-10-31 21:07 ` Catalin Marinas
8 siblings, 1 reply; 196+ messages in thread
From: Neal Gompa @ 2024-10-19 15:47 UTC (permalink / raw)
To: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Ryan Roberts, Hector Martin
Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm, asahi
On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
> Hi All,
>
> Patch bomb incoming... This covers many subsystems, so I've included a core
> set of people on the full series and additionally included maintainers on
> relevant patches. I haven't included those maintainers on this cover letter
> since the numbers were far too big for it to work. But I've included a link
> to this cover letter on each patch, so they can hopefully find their way
> here. For follow up submissions I'll break it up by subsystem, but for now
> thought it was important to show the full picture.
>
> This RFC series implements support for boot-time page size selection within
> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
> date, page size has been selected at compile-time, meaning the size is
> baked into a given kernel image. As use of larger-than-4K page sizes becomes
> more prevalent this starts to present a problem for distributions.
> Boot-time page size selection enables the creation of a single kernel
> image, which can be told which page size to use on the kernel command line.
>
> Why is having an image-per-page size problematic?
> =================================================
>
> Many traditional distros are now supporting both 4K and 64K. And this means
> managing 2 kernel packages, along with drivers for each. For some, it means
> multiple installer flavours and multiple ISOs. All of this adds up to a
> less-than-ideal level of complexity. Additionally, Android now supports 4K
> and 16K kernels. I'm told having to explicitly manage their KABI for each
> kernel is painful, and the extra flash space required for both kernel
> images and the duplicated modules has been problematic. Boot-time page size
> selection solves all of this.
>
> Additionally, in starting to think about the longer term deployment story
> for D128 page tables, which Arm architecture now supports, a lot of the
> same problems need to be solved, so this work sets us up nicely for that.
>
> So what's the down side?
> ========================
>
> Well nothing's free; Various static allocations in the kernel image must be
> sized for the worst case (largest supported page size), so image size is in
> line with size of 64K compile-time image. So if you're interested in 4K or
> 16K, there is a slight increase to the image size. But I expect that
> problem goes away if you're compressing the image - it's just some extra
> zeros. At boot-time, I expect we could free the unused static storage once
> we know the page size - although that would be a follow up enhancement.
>
> And then there is performance. Since PAGE_SIZE and friends are no longer
> compile-time constants, we must look up their values and do arithmetic at
> runtime instead of compile-time. My early perf testing suggests this is
> imperceptible for real-world workloads, and only has small impact on
> microbenchmarks - more on this below.
>
> Approach
> ========
>
> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> friends are compile-time constant, but in a way that allows the compiler to
> perform the same optimizations as was previously being done if they do turn
> out to be compile-time constant. Where constants are required, we use
> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
> description of all the classes of problems to solve.
>
> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> Kconfig, which is an alternative to selecting a compile-time page size.
>
> When boot-time page size is active, the arch pgtable geometry macro
> definitions resolve to something that can be configured at boot. The arm64
> implementation in this series mainly uses global, __ro_after_init
> variables. I've tried using alternatives patching, but that performs worse
> than loading from memory; I think due to code size bloat.
>
> Status
> ======
>
> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
> enough to compile the kernel image itself with defconfig (and a few other
> bits and pieces). This is enough to build a kernel that can boot under QEMU
> or FVP. I'll happily do the rest of the work to enable all the extra
> drivers, but wanted to get feedback on the shape of this effort first. If
> anyone wants to do any testing, and has a must-have config, let me know and
> I'll prioritize enabling it first.
>
> The series is arranged as follows:
>
> - patch 1: Add macros required for converting non-arch code to support
> boot-time page size selection
> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
> all non-arch code
> - patches 37-38: Some arm64 tidy ups
> - patch 39: Add macros required for converting arm64 code to support
> boot-time page size selection
> - patches 40-56: arm64 changes to support boot-time page size selection
> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> selection
>
> Ideally, I'd like to get the basics merged (something like this series),
> then incrementally improve it over a handful of kernel releases until we
> can demonstrate that we have feature parity with the compile-time build and
> no performance blockers. Once at that point, ideally the compile-time build
> options would be removed and the code could be cleaned up further.
>
> One of the bigger pieces that I'd propose to add as a follow up, is to make
> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> handling.
>
> Assuming people are amenable to the rough shape, how would I go about
> getting the non-arch changes merged? Since they cover many subsystems, will
> each piece need to go independently to each relevant maintainer or could it
> all be merged together through the arm64 tree?
>
> Image Size
> ==========
>
> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> kernel image on disk for base (before any changes applied), compile (with
> changes, configured for compile-time page size) and boot (with changes,
> configured for boot-time page size).
>
> You can see that the compile-16k and 64k configs are actually slightly
> smaller than the baselines; that's due to optimizing some buffer sizes
> which didn't need to depend on page size during the series. The boot-time
> image is ~1% bigger than the 64k compile-time image. I believe there is
> scope to improve this to make it equal to compile-64k if required:
>
> | config | size/KB | diff/KB | diff/% |
> |-------------|---------|---------|---------|
> | base-4k | 54895 | 0 | 0.0% |
> | base-16k | 55161 | 266 | 0.5% |
> | base-64k | 56775 | 1880 | 3.4% |
> | compile-4k | 54895 | 0 | 0.0% |
> | compile-16k | 55097 | 202 | 0.4% |
> | compile-64k | 56391 | 1496 | 2.7% |
> | boot-4K | 57045 | 2150 | 3.9% |
>
> And below shows the size of the image in memory at run-time, separated for
> text and data costs. The boot image has ~1% text cost; most likely due to
> the fact that PAGE_SIZE and friends are not compile-time constants so need
> instructions to load the values and do arithmetic. I believe we could
> eventually get the data cost to match the cost for the compile image for
> the chosen page size by freeing the ends of the static buffers not needed
> for the selected page size:
>
> |             |    text |    text |    text |    data |    data |    data |
> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
>
> Functional Testing
> ==================
>
> I've build-tested defconfig for all arches supported by tuxmake (which is
> most) without issue.
>
> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
> sizes and a few va-sizes, and additionally have run all the mm-selftests,
> with no regressions observed vs the equivalent compile-time page size build
> (although the mm-selftests have a few existing failures when run against
> 16K and 64K kernels - those should really be investigated and fixed
> independently).
>
> Test coverage is lacking for many of the drivers that I've touched, but in
> many cases, I'm hoping the changes are simple enough that review might
> suffice?
>
> Performance Testing
> ===================
>
> I've run some limited performance benchmarks:
>
> First, a real-world benchmark that causes a lot of page table manipulation
> (and therefore we would expect to see regression here if we are going to
> see it anywhere); kernel compilation. It barely registers a change. Values
> are times, so smaller is better. All relative to base-4k:
>
> |             |    kern |    kern |    user |    user |    real |    real |
> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
> |-------------|---------|---------|---------|---------|---------|---------|
> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
>
> The Speedometer JavaScript benchmark also shows no change. Values are runs
> per min, so bigger is better. All relative to base-4k:
>
> | config      |    mean |   stdev |
> |-------------|---------|---------|
> | base-4k     |    0.0% |    0.8% |
> | compile-4k  |    0.4% |    0.8% |
> | boot-4k     |    0.0% |    0.9% |
>
> Finally, I've run some microbenchmarks known to stress page table
> manipulations (originally from David Hildenbrand). The fork test
> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
> it. The fork test is known to be extremely sensitive to any changes that
> cause instructions to be aligned differently in cachelines. When using this
> test for other changes, I've seen double digit regressions for the
> slightest thing, so 12% regression on this test is actually fairly good.
> This likely represents the extreme worst case for regressions that will be
> observed across other microbenchmarks (famous last words). Values are
> times, so smaller is better. All relative to base-4k:
>
> |             |    fork |    fork |  munmap |  munmap |
> | config      |    mean |   stdev |    mean |   stdev |
> |-------------|---------|---------|---------|---------|
> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
>
> NOTE: The series applies on top of v6.11.
>
> Thanks,
> Ryan
>
>
> Ryan Roberts (57):
> mm: Add macros ahead of supporting boot-time page size selection
> vmlinux: Align to PAGE_SIZE_MAX
> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> mm: Avoid split pmd ptl if pmd level is run-time folded
> mm: Remove PAGE_SIZE compile-time constant assumption
> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> fs: Remove PAGE_SIZE compile-time constant assumption
> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> fork: Permit boot-time THREAD_SIZE determination
> cgroup: Remove PAGE_SIZE compile-time constant assumption
> bpf: Remove PAGE_SIZE compile-time constant assumption
> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> perf: Remove PAGE_SIZE compile-time constant assumption
> kvm: Remove PAGE_SIZE compile-time constant assumption
> trace: Remove PAGE_SIZE compile-time constant assumption
> crash: Remove PAGE_SIZE compile-time constant assumption
> crypto: Remove PAGE_SIZE compile-time constant assumption
> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> sound: Remove PAGE_SIZE compile-time constant assumption
> net: Remove PAGE_SIZE compile-time constant assumption
> net: fec: Remove PAGE_SIZE compile-time constant assumption
> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> net: igb: Remove PAGE_SIZE compile-time constant assumption
> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> edac: Remove PAGE_SIZE compile-time constant assumption
> optee: Remove PAGE_SIZE compile-time constant assumption
> random: Remove PAGE_SIZE compile-time constant assumption
> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> virtio: Remove PAGE_SIZE compile-time constant assumption
> xen: Remove PAGE_SIZE compile-time constant assumption
> arm64: Fix macros to work in C code in addition to the linker script
> arm64: Track early pgtable allocation limit
> arm64: Introduce macros required for boot-time page selection
> arm64: Refactor early pgtable size calculation macros
> arm64: Pass desired page size on command line
> arm64: Divorce early init from PAGE_SIZE
> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> arm64: Align sections to PAGE_SIZE_MAX
> arm64: Rework trampoline rodata mapping
> arm64: Generalize fixmap for boot-time page size
> arm64: Statically allocate and align for worst-case page size
> arm64: Convert switch to if for non-const comparison values
> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> arm64: Remove PAGE_SZ asm-offset
> arm64: Introduce cpu features for page sizes
> arm64: Remove PAGE_SIZE from assembly code
> arm64: Runtime-fold pmd level
> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> arm64: TRAMP_VALIAS is no longer compile-time constant
> arm64: Determine THREAD_SIZE at boot-time
> arm64: Enable boot-time page size selection
>
> arch/alpha/include/asm/page.h | 1 +
> arch/arc/include/asm/page.h | 1 +
> arch/arm/include/asm/page.h | 1 +
> arch/arm64/Kconfig | 26 ++-
> arch/arm64/include/asm/assembler.h | 78 ++++++-
> arch/arm64/include/asm/cpufeature.h | 44 +++-
> arch/arm64/include/asm/efi.h | 2 +-
> arch/arm64/include/asm/fixmap.h | 28 ++-
> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> arch/arm64/include/asm/kvm_arm.h | 21 +-
> arch/arm64/include/asm/kvm_hyp.h | 11 +
> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> arch/arm64/include/asm/memory.h | 62 ++++--
> arch/arm64/include/asm/page-def.h | 3 +-
> arch/arm64/include/asm/pgalloc.h | 16 +-
> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> arch/arm64/include/asm/processor.h | 10 +-
> arch/arm64/include/asm/sections.h | 1 +
> arch/arm64/include/asm/smp.h | 1 +
> arch/arm64/include/asm/sparsemem.h | 15 +-
> arch/arm64/include/asm/sysreg.h | 54 +++--
> arch/arm64/include/asm/tlb.h | 3 +
> arch/arm64/kernel/asm-offsets.c | 4 +-
> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> arch/arm64/kernel/efi.c | 2 +-
> arch/arm64/kernel/entry.S | 60 +++++-
> arch/arm64/kernel/head.S | 46 +++-
> arch/arm64/kernel/hibernate-asm.S | 6 +-
> arch/arm64/kernel/image-vars.h | 14 ++
> arch/arm64/kernel/image.h | 4 +
> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> arch/arm64/kernel/pi/pi.h | 63 +++++-
> arch/arm64/kernel/relocate_kernel.S | 10 +-
> arch/arm64/kernel/vdso-wrap.S | 4 +-
> arch/arm64/kernel/vdso.c | 7 +-
> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> arch/arm64/kvm/arm.c | 10 +
> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> arch/arm64/kvm/mmu.c | 39 ++--
> arch/arm64/lib/clear_page.S | 7 +-
> arch/arm64/lib/copy_page.S | 33 ++-
> arch/arm64/lib/mte.S | 27 ++-
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/fixmap.c | 38 ++--
> arch/arm64/mm/hugetlbpage.c | 40 +---
> arch/arm64/mm/init.c | 26 +--
> arch/arm64/mm/kasan_init.c | 8 +-
> arch/arm64/mm/mmu.c | 53 +++--
> arch/arm64/mm/pgd.c | 12 +-
> arch/arm64/mm/pgtable-geometry.c | 24 +++
> arch/arm64/mm/proc.S | 128 ++++++++---
> arch/arm64/mm/ptdump.c | 3 +-
> arch/arm64/tools/cpucaps | 3 +
> arch/csky/include/asm/page.h | 3 +
> arch/hexagon/include/asm/page.h | 2 +
> arch/loongarch/include/asm/page.h | 2 +
> arch/m68k/include/asm/page.h | 1 +
> arch/microblaze/include/asm/page.h | 1 +
> arch/mips/include/asm/page.h | 1 +
> arch/nios2/include/asm/page.h | 2 +
> arch/openrisc/include/asm/page.h | 1 +
> arch/parisc/include/asm/page.h | 1 +
> arch/powerpc/include/asm/page.h | 2 +
> arch/riscv/include/asm/page.h | 1 +
> arch/s390/include/asm/page.h | 1 +
> arch/sh/include/asm/page.h | 1 +
> arch/sparc/include/asm/page.h | 3 +
> arch/um/include/asm/page.h | 2 +
> arch/x86/include/asm/page_types.h | 2 +
> arch/xtensa/include/asm/page.h | 1 +
> crypto/lskcipher.c | 4 +-
> drivers/ata/sata_sil24.c | 46 ++--
> drivers/base/node.c | 6 +-
> drivers/base/topology.c | 32 +--
> drivers/block/virtio_blk.c | 2 +-
> drivers/char/random.c | 4 +-
> drivers/edac/edac_mc.h | 13 +-
> drivers/firmware/efi/libstub/arm64.c | 3 +-
> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> drivers/mtd/mtdswap.c | 4 +-
> drivers/net/ethernet/freescale/fec.h | 3 +-
> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> drivers/net/ethernet/marvell/sky2.h | 2 +-
> drivers/tee/optee/call.c | 7 +-
> drivers/tee/optee/smc_abi.c | 2 +-
> drivers/virtio/virtio_balloon.c | 10 +-
> drivers/xen/balloon.c | 11 +-
> drivers/xen/biomerge.c | 12 +-
> drivers/xen/privcmd.c | 2 +-
> drivers/xen/xenbus/xenbus_client.c | 5 +-
> drivers/xen/xlate_mmu.c | 6 +-
> fs/binfmt_elf.c | 11 +-
> fs/buffer.c | 2 +-
> fs/coredump.c | 8 +-
> fs/ext4/ext4.h | 36 ++--
> fs/ext4/move_extent.c | 2 +-
> fs/ext4/readpage.c | 2 +-
> fs/fat/dir.c | 4 +-
> fs/fat/fatent.c | 4 +-
> fs/nfs/nfs42proc.c | 2 +-
> fs/nfs/nfs42xattr.c | 2 +-
> fs/nfs/nfs4proc.c | 2 +-
> include/asm-generic/pgtable-geometry.h | 71 +++++++
> include/asm-generic/vmlinux.lds.h | 38 ++--
> include/linux/buffer_head.h | 1 +
> include/linux/cpumask.h | 5 +
> include/linux/linkage.h | 4 +-
> include/linux/mm.h | 17 +-
> include/linux/mm_types.h | 15 +-
> include/linux/mm_types_task.h | 2 +-
> include/linux/mmzone.h | 3 +-
> include/linux/netlink.h | 6 +-
> include/linux/percpu-defs.h | 4 +-
> include/linux/perf_event.h | 2 +-
> include/linux/sched.h | 4 +-
> include/linux/slab.h | 7 +-
> include/linux/stackdepot.h | 6 +-
> include/linux/sunrpc/svc.h | 8 +-
> include/linux/sunrpc/svc_rdma.h | 4 +-
> include/linux/sunrpc/svcsock.h | 2 +-
> include/linux/swap.h | 17 +-
> include/linux/swapops.h | 6 +-
> include/linux/thread_info.h | 10 +-
> include/xen/page.h | 2 +
> init/main.c | 7 +-
> kernel/bpf/core.c | 9 +-
> kernel/bpf/ringbuf.c | 54 ++---
> kernel/cgroup/cgroup.c | 8 +-
> kernel/crash_core.c | 2 +-
> kernel/events/core.c | 2 +-
> kernel/fork.c | 71 +++----
> kernel/power/power.h | 2 +-
> kernel/power/snapshot.c | 2 +-
> kernel/power/swap.c | 129 +++++++++--
> kernel/trace/fgraph.c | 2 +-
> kernel/trace/trace.c | 2 +-
> lib/stackdepot.c | 6 +-
> mm/kasan/report.c | 3 +-
> mm/memcontrol.c | 11 +-
> mm/memory.c | 4 +-
> mm/mmap.c | 2 +-
> mm/page-writeback.c | 2 +-
> mm/page_alloc.c | 31 +--
> mm/slub.c | 2 +-
> mm/sparse.c | 2 +-
> mm/swapfile.c | 2 +-
> mm/vmalloc.c | 7 +-
> net/9p/trans_virtio.c | 4 +-
> net/core/hotdata.c | 4 +-
> net/core/skbuff.c | 4 +-
> net/core/sysctl_net_core.c | 2 +-
> net/sunrpc/cache.c | 3 +-
> net/unix/af_unix.c | 2 +-
> sound/soc/soc-utils.c | 4 +-
> virt/kvm/kvm_main.c | 2 +-
> 172 files changed, 2185 insertions(+), 951 deletions(-)
> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> create mode 100644 include/asm-generic/pgtable-geometry.h
>
> --
> 2.43.0
This is a generally very exciting patch set! I'm looking forward to seeing it
land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
That said, I have a couple of questions:
* Going forward, how would we handle drivers/modules that require a particular
page size? For example, the Apple Silicon IOMMU driver code requires the
kernel to operate in 16k page size mode, and it would need to be disabled in
other page sizes.
* How would we handle an invalid selection at boot? Can we program in a
fallback when the "wrong" mode is selected for a chip or something similar?
Thanks again and best regards!
(P.S.: Please add the asahi@ mailing list to the CC for future iterations of
this patch set and tag both Hector and myself in as well. Thanks!)
--
真実はいつも一つ!/ Always, there's only one truth!
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-17 12:51 ` Niklas Cassel
@ 2024-10-21 9:24 ` Ryan Roberts
2024-10-21 11:04 ` Niklas Cassel
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 9:24 UTC (permalink / raw)
To: Niklas Cassel
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook, Gustavo A. R. Silva
On 17/10/2024 13:51, Niklas Cassel wrote:
> On Thu, Oct 17, 2024 at 01:42:22PM +0100, Ryan Roberts wrote:
>> On 17/10/2024 10:09, Niklas Cassel wrote:
>
> (snip)
>
>>> As you might know, there is an effort to annotate all flexible array
>>> members with their run-time size information, see commit:
>>> dd06e72e68bc ("Compiler Attributes: Add __counted_by macro")
>>
>> I'm vaguely aware of it. But as I understand it, __counted_by() nominates
>> another member in the struct which keeps the count? In this case, there is no
>> such member, its size is implicit based on the value of PAGE_SIZE. So I'm not
>> sure if it's practical to use it here?
>
> Neither am I :)
>
> Perhaps some of the flexible array member experts like
> Kees Cook or Gustavo A. R. Silva could help us out here.
The GCC feature request is clear that it is explicitly to mark a member as the count variable: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896
But, yes, would be good to hear from Kees or Gustavo if there is an alternative mechanism for what we are doing here.
>
> Would it make sense to add another struct member and simply initialize
> it to PAGE_SIZE, in order to be able to use the __counted_by macro?
I guess that _could_ be done. But the way the driver is currently structured
takes the sge array pointer and passes that around for DMA, so I think the
value of this tag within the struct would be lost anyway. It would also require
reducing the number of sge entries to make space for the count, and given I'm
not really familiar with the driver or HW, I'd be concerned that this could
cause a performance regression. Overall, my preference is to leave it as is.

That said, while investigating this, I've spotted a bug in my change. The paddr
calculation in sil24_qc_issue() is incorrect since sizeof(*pp->cmd_block) is no
longer PAGE_SIZE. Based on feedback on another patch, I'm also converting the
BUG_ONs to WARN_ON_ONCEs.

Additional proposed change, which I plan to include in the next version:
---8<---
diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
index 85c6382976626..c402bf998c4ee 100644
--- a/drivers/ata/sata_sil24.c
+++ b/drivers/ata/sata_sil24.c
@@ -257,6 +257,10 @@ union sil24_cmd_block {
 	struct sil24_atapi_block atapi;
 };
 
+#define SIL24_ATA_BLOCK_SIZE struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE)
+#define SIL24_ATAPI_BLOCK_SIZE struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE)
+#define SIL24_CMD_BLOCK_SIZE max(SIL24_ATA_BLOCK_SIZE, SIL24_ATAPI_BLOCK_SIZE)
+
 static const struct sil24_cerr_info {
 	unsigned int err_mask, action;
 	const char *desc;
@@ -886,7 +890,7 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
 	dma_addr_t paddr;
 	void __iomem *activate;
 
-	paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
+	paddr = pp->cmd_block_dma + tag * SIL24_CMD_BLOCK_SIZE;
 	activate = port + PORT_CMD_ACTIVATE + tag * 8;
 
 	/*
@@ -1192,7 +1196,7 @@ static int sil24_port_start(struct ata_port *ap)
 	struct device *dev = ap->host->dev;
 	struct sil24_port_priv *pp;
 	union sil24_cmd_block *cb;
-	size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
+	size_t cb_size = SIL24_CMD_BLOCK_SIZE * SIL24_MAX_CMDS;
 	dma_addr_t cb_dma;
 
 	pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
@@ -1265,8 +1269,8 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	u32 tmp;
 
 	/* union sil24_cmd_block must be PAGE_SIZE */
-	BUG_ON(struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE) != PAGE_SIZE);
-	BUG_ON(struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE) > PAGE_SIZE);
+	WARN_ON_ONCE(SIL24_ATAPI_BLOCK_SIZE != PAGE_SIZE);
+	WARN_ON_ONCE(SIL24_ATA_BLOCK_SIZE != PAGE_SIZE - 16);
 
 	ata_print_version_once(&pdev->dev, DRV_VERSION);
 
---8<---
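
To illustrate the paddr problem described above: once sge[] is a flexible
array member, sizeof() of the union only covers the fixed header, so it can no
longer serve as the per-tag stride. The following is a rough user-space sketch
only. The struct layouts and the helper names (max_sge, cmd_block_stride,
cmd_block_offset) are assumptions made for illustration and are not taken from
the driver; the size arithmetic is a plain offsetof-based approximation of
what struct_size_t() computes, without its overflow checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's structures; layout is illustrative. */
struct sge {
	uint64_t addr;
	uint32_t cnt;
	uint32_t flags;
};

struct prb {
	uint16_t ctrl;
	uint16_t prot;
	uint32_t rx_cnt;
	uint8_t fis[24];
};

struct ata_block {
	struct prb prb;
	struct sge sge[];	/* flexible array: not counted by sizeof */
};

struct atapi_block {
	struct prb prb;
	uint8_t cdb[16];
	struct sge sge[];	/* flexible array: not counted by sizeof */
};

union cmd_block {
	struct ata_block ata;
	struct atapi_block atapi;
};

/* Number of sge entries that fit in one page after the larger header. */
static size_t max_sge(size_t page_size)
{
	assert(page_size > offsetof(struct atapi_block, sge));
	return (page_size - offsetof(struct atapi_block, sge)) / sizeof(struct sge);
}

/* Bytes occupied by one command block holding max_sge(page_size) entries. */
static size_t cmd_block_stride(size_t page_size)
{
	size_t n = max_sge(page_size);
	size_t ata = offsetof(struct ata_block, sge) + n * sizeof(struct sge);
	size_t atapi = offsetof(struct atapi_block, sge) + n * sizeof(struct sge);

	return ata > atapi ? ata : atapi;
}

/* Byte offset of the command block for a given tag within the DMA buffer. */
static size_t cmd_block_offset(size_t page_size, unsigned int tag)
{
	return (size_t)tag * cmd_block_stride(page_size);
}
```

With these illustrative layouts, sizeof(union cmd_block) is just the header
size, whereas cmd_block_stride() gives the full per-tag size, which is the
distinction the fix to sil24_qc_issue() relies on.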
>
>
>>
>>>
>>> I haven't looked at the DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST macro, but since
>>
>> DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(), when doing a boot-time page size build,
>> defers the initialization of the global variable to kernel init time, when
>> PAGE_SIZE is known. Because SIL24_MAX_SGE is defined in terms of PAGE_SIZE, this
>> deferral is required.
>>
>>> sge[] now becomes a flexible array member, I think it would be nice if it
>>> would be possible to somehow use the __counted_by macro.
>>>
* Re: [External] : Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 20:06 ` Joseph Salisbury
@ 2024-10-21 9:55 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 9:55 UTC (permalink / raw)
To: Joseph Salisbury, David Hildenbrand, Andrew Morton,
Anshuman Khandual, Ard Biesheuvel, Catalin Marinas, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 18/10/2024 21:06, Joseph Salisbury wrote:
>
>
>
> On 10/18/24 15:27, David Hildenbrand wrote:
>>
>>>>> Hi Ryan,
>>>>>
>>>>> First off, this is excellent work! Your cover page was very detailed
>>>>> and made the patch set easier to understand.
Thanks!
>>>>>
>>>>> Some questions/comments:
>>>>>
>>>>> Once a kernel is booted with a certain page size, could there be issues
>>>>> if it is booted later with a different page size? How about if this is
>>>>> done frequently?
>>>>
>>>> I think that is the reason why you are only given the option in RHEL
>>>> to select the kernel (4K vs. 64K) to use at install time.
>>>>
>>>> Software can easily use a different data format for persistence based
>>>> on the base page size. I would suspect DBs might be the usual suspects.
>>>>
>>>> One example is swap space I think, where the base page size used when
>>>> formatting the device is used, and it cannot be used with a different
>>>> page size unless reformatting it.
>>>>
>>>> So ... one has to be a bit careful ...
>>>>
>>> Yes, that is what I was thinking. Once a userspace process does an I/O
>>> and if it is based on PAGE_SIZE things can go south. I think this is
>>> not an issue with THP, so maybe it's possible with boot-time page selection?
>>
>> THP is a different beast and has different semantics: the base page size
>> doesn't change: the result of getpagesize() is unmodified ("transparent").
>>
>> One would have to emulate for a given user space process a different page
>> size ... and Ryan can likely tell some stories about that.
>>
>> Not that I consider it reasonable to have dynamic page sizes in the kernel and
>> then try emulating a different one for all user space.
>
> This is probably a case of ensuring proper documentation from the distro or
> application vendor.
>
> Or maybe some type of "Safety gate" could be implemented outside of the kernel.
> Some check for the prior use of different page sizes, in the cases where it
> could cause problems.
I agree there are likely to be problems in some corner cases if switching page
size between boots, if persisted data makes assumptions about the page size. I
would argue that any problems that are observed should really be considered bugs
in the user space SW though.
But I don't think this is really any different from today; with Ubuntu, for
example, you can install both 4K and 64K kernels concurrently, then choose which
one to boot via Grub. So the issue exists there already. This proposed boot-time
page size selection series doesn't make that any worse; it just simplifies the
distribution model, given the reality that distros are now having to support
multiple page sizes.
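
For what it's worth, the kind of user space check being discussed could look
something like the sketch below: the application records the page size in its
own on-disk header and refuses (or converts) when the running kernel reports a
different one. This is only an assumption about how an application might do
it; struct store_header, STORE_MAGIC and the helper names are hypothetical.
The only real interface used is sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

#define STORE_MAGIC 0x53544f52u	/* hypothetical on-disk magic */

/* Hypothetical header an application persists next to its data. */
struct store_header {
	uint32_t magic;
	uint32_t page_size;	/* page size in use when the data was written */
};

/* Page size reported by the running kernel, or 0 on failure. */
static uint32_t current_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	return sz > 0 ? (uint32_t)sz : 0;
}

/* Fill in a header for newly created persistent data. */
static void store_header_init(struct store_header *h, uint32_t page_size)
{
	assert(page_size != 0);
	h->magic = STORE_MAGIC;
	h->page_size = page_size;
}

/* Returns 0 if the data can be used with page_size, -1 otherwise. */
static int store_header_check(const struct store_header *h, uint32_t page_size)
{
	if (h->magic != STORE_MAGIC)
		return -1;
	return h->page_size == page_size ? 0 : -1;
}
```

An application would call store_header_check(&hdr, current_page_size()) when
opening existing data, and either bail out with a clear error or convert the
data if the check fails.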
Thanks,
Ryan
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-19 15:47 ` Neal Gompa
@ 2024-10-21 11:02 ` Ryan Roberts
2024-10-21 11:32 ` Eric Curtin
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 11:02 UTC (permalink / raw)
To: Neal Gompa, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin
Cc: linux-arm-kernel, linux-kernel, linux-mm, asahi
On 19/10/2024 16:47, Neal Gompa wrote:
> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
>> Hi All,
>>
>> Patch bomb incoming... This covers many subsystems, so I've included a core
>> set of people on the full series and additionally included maintainers on
>> relevant patches. I haven't included those maintainers on this cover letter
>> since the numbers were far too big for it to work. But I've included a link
>> to this cover letter on each patch, so they can hopefully find their way
>> here. For follow up submissions I'll break it up by subsystem, but for now
>> thought it was important to show the full picture.
>>
>> This RFC series implements support for boot-time page size selection within
>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
>> date, page size has been selected at compile-time, meaning the size is
>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
>> more prevalent this starts to present a problem for distributions.
>> Boot-time page size selection enables the creation of a single kernel
>> image, which can be told which page size to use on the kernel command line.
>>
>> Why is having an image-per-page size problematic?
>> =================================================
>>
>> Many traditional distros are now supporting both 4K and 64K. And this means
>> managing 2 kernel packages, along with drivers for each. For some, it means
>> multiple installer flavours and multiple ISOs. All of this adds up to a
>> less-than-ideal level of complexity. Additionally, Android now supports 4K
>> and 16K kernels. I'm told having to explicitly manage their KABI for each
>> kernel is painful, and the extra flash space required for both kernel
>> images and the duplicated modules has been problematic. Boot-time page size
>> selection solves all of this.
>>
>> Additionally, in starting to think about the longer term deployment story
>> for D128 page tables, which Arm architecture now supports, a lot of the
>> same problems need to be solved, so this work sets us up nicely for that.
>>
>> So what's the down side?
>> ========================
>>
>> Well, nothing's free: various static allocations in the kernel image must be
>> sized for the worst case (largest supported page size), so image size is in
>> line with size of 64K compile-time image. So if you're interested in 4K or
>> 16K, there is a slight increase to the image size. But I expect that
>> problem goes away if you're compressing the image - it's just some extra
>> zeros. At boot-time, I expect we could free the unused static storage once
>> we know the page size - although that would be a follow up enhancement.
>>
>> And then there is performance. Since PAGE_SIZE and friends are no longer
>> compile-time constants, we must look up their values and do arithmetic at
>> runtime instead of compile-time. My early perf testing suggests this is
>> imperceptible for real-world workloads, and only has small impact on
>> microbenchmarks - more on this below.
>>
>> Approach
>> ========
>>
>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>> friends are compile-time constant, but in a way that allows the compiler to
>> perform the same optimizations as was previously being done if they do turn
>> out to be compile-time constant. Where constants are required, we use
>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
>> description of all the classes of problems to solve.
>>
>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> Kconfig, which is an alternative to selecting a compile-time page size.
>>
>> When boot-time page size is active, the arch pgtable geometry macro
>> definitions resolve to something that can be configured at boot. The arm64
>> implementation in this series mainly uses global, __ro_after_init
>> variables. I've tried using alternatives patching, but that performs worse
>> than loading from memory; I think due to code size bloat.
>>
>> Status
>> ======
>>
>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
>> enough to compile the kernel image itself with defconfig (and a few other
>> bits and pieces). This is enough to build a kernel that can boot under QEMU
>> or FVP. I'll happily do the rest of the work to enable all the extra
>> drivers, but wanted to get feedback on the shape of this effort first. If
>> anyone wants to do any testing, and has a must-have config, let me know and
>> I'll prioritize enabling it first.
>>
>> The series is arranged as follows:
>>
>> - patch 1: Add macros required for converting non-arch code to support
>> boot-time page size selection
>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
>> all non-arch code
>> - patches 37-38: Some arm64 tidy ups
>> - patch 39: Add macros required for converting arm64 code to support
>> boot-time page size selection
>> - patches 40-56: arm64 changes to support boot-time page size selection
>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>> selection
>>
>> Ideally, I'd like to get the basics merged (something like this series),
>> then incrementally improve it over a handful of kernel releases until we
>> can demonstrate that we have feature parity with the compile-time build and
>> no performance blockers. Once at that point, ideally the compile-time build
>> options would be removed and the code could be cleaned up further.
>>
>> One of the bigger pieces that I'd propose to add as a follow up is to make
>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>> handling.
>>
>> Assuming people are amenable to the rough shape, how would I go about
>> getting the non-arch changes merged? Since they cover many subsystems, will
>> each piece need to go independently to each relevant maintainer or could it
>> all be merged together through the arm64 tree?
>>
>> Image Size
>> ==========
>>
>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>> kernel image on disk for base (before any changes applied), compile (with
>> changes, configured for compile-time page size) and boot (with changes,
>> configured for boot-time page size).
>>
>> You can see that the compile-16k and 64k configs are actually slightly
>> smaller than the baselines; that's due to optimizing some buffer sizes
>> which didn't need to depend on page size during the series. The boot-time
>> image is ~1% bigger than the 64k compile-time image. I believe there is
>> scope to improve this to make it equal to compile-64k if required:
>>
>> | config      | size/KB | diff/KB |  diff/% |
>> |-------------|---------|---------|---------|
>> | base-4k     |   54895 |       0 |    0.0% |
>> | base-16k    |   55161 |     266 |    0.5% |
>> | base-64k    |   56775 |    1880 |    3.4% |
>> | compile-4k  |   54895 |       0 |    0.0% |
>> | compile-16k |   55097 |     202 |    0.4% |
>> | compile-64k |   56391 |    1496 |    2.7% |
>> | boot-4K     |   57045 |    2150 |    3.9% |
>>
>> And below shows the size of the image in memory at run-time, separated for
>> text and data costs. The boot image has ~1% text cost; most likely due to
>> the fact that PAGE_SIZE and friends are not compile-time constants so need
>> instructions to load the values and do arithmetic. I believe we could
>> eventually get the data cost to match the cost for the compile image for
>> the chosen page size by freeing the ends of the static buffers not needed
>> for the selected page size:
>>
>> |             |    text |    text |    text |    data |    data |    data |
>> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
>>
>> Functional Testing
>> ==================
>>
>> I've build-tested defconfig for all arches supported by tuxmake (which is
>> most) without issue.
>>
>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
>> with no regressions observed vs the equivalent compile-time page size build
>> (although the mm-selftests have a few existing failures when run against
>> 16K and 64K kernels - those should really be investigated and fixed
>> independently).
>>
>> Test coverage is lacking for many of the drivers that I've touched, but in
>> many cases, I'm hoping the changes are simple enough that review might
>> suffice?
>>
>> Performance Testing
>> ===================
>>
>> I've run some limited performance benchmarks:
>>
>> First, a real-world benchmark that causes a lot of page table manipulation
>> (and therefore we would expect to see regression here if we are going to
>> see it anywhere); kernel compilation. It barely registers a change. Values
>> are times, so smaller is better. All relative to base-4k:
>>
>> |             |    kern |    kern |    user |    user |    real |    real |
>> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
>> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
>> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
>>
>> The Speedometer JavaScript benchmark also shows no change. Values are runs
>> per min, so bigger is better. All relative to base-4k:
>>
>> | config      |    mean |   stdev |
>> |-------------|---------|---------|
>> | base-4k     |    0.0% |    0.8% |
>> | compile-4k  |    0.4% |    0.8% |
>> | boot-4k     |    0.0% |    0.9% |
>>
>> Finally, I've run some microbenchmarks known to stress page table
>> manipulations (originally from David Hildenbrand). The fork test
>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
>> it. The fork test is known to be extremely sensitive to any changes that
>> cause instructions to be aligned differently in cachelines. When using this
>> test for other changes, I've seen double digit regressions for the
>> slightest thing, so 12% regression on this test is actually fairly good.
>> This likely represents the extreme worst case for regressions that will be
>> observed across other microbenchmarks (famous last words). Values are
>> times, so smaller is better. All relative to base-4k:
>>
>> |             |    fork |    fork |  munmap |  munmap |
>> | config      |    mean |   stdev |    mean |   stdev |
>> |-------------|---------|---------|---------|---------|
>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
>>
>> NOTE: The series applies on top of v6.11.
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (57):
>> mm: Add macros ahead of supporting boot-time page size selection
>> vmlinux: Align to PAGE_SIZE_MAX
>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>> mm: Avoid split pmd ptl if pmd level is run-time folded
>> mm: Remove PAGE_SIZE compile-time constant assumption
>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>> fs: Remove PAGE_SIZE compile-time constant assumption
>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>> fork: Permit boot-time THREAD_SIZE determination
>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>> bpf: Remove PAGE_SIZE compile-time constant assumption
>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>> perf: Remove PAGE_SIZE compile-time constant assumption
>> kvm: Remove PAGE_SIZE compile-time constant assumption
>> trace: Remove PAGE_SIZE compile-time constant assumption
>> crash: Remove PAGE_SIZE compile-time constant assumption
>> crypto: Remove PAGE_SIZE compile-time constant assumption
>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>> sound: Remove PAGE_SIZE compile-time constant assumption
>> net: Remove PAGE_SIZE compile-time constant assumption
>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>> edac: Remove PAGE_SIZE compile-time constant assumption
>> optee: Remove PAGE_SIZE compile-time constant assumption
>> random: Remove PAGE_SIZE compile-time constant assumption
>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>> virtio: Remove PAGE_SIZE compile-time constant assumption
>> xen: Remove PAGE_SIZE compile-time constant assumption
>> arm64: Fix macros to work in C code in addition to the linker script
>> arm64: Track early pgtable allocation limit
>> arm64: Introduce macros required for boot-time page selection
>> arm64: Refactor early pgtable size calculation macros
>> arm64: Pass desired page size on command line
>> arm64: Divorce early init from PAGE_SIZE
>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>> arm64: Align sections to PAGE_SIZE_MAX
>> arm64: Rework trampoline rodata mapping
>> arm64: Generalize fixmap for boot-time page size
>> arm64: Statically allocate and align for worst-case page size
>> arm64: Convert switch to if for non-const comparison values
>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>> arm64: Remove PAGE_SZ asm-offset
>> arm64: Introduce cpu features for page sizes
>> arm64: Remove PAGE_SIZE from assembly code
>> arm64: Runtime-fold pmd level
>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>> arm64: TRAMP_VALIAS is no longer compile-time constant
>> arm64: Determine THREAD_SIZE at boot-time
>> arm64: Enable boot-time page size selection
>>
>> arch/alpha/include/asm/page.h | 1 +
>> arch/arc/include/asm/page.h | 1 +
>> arch/arm/include/asm/page.h | 1 +
>> arch/arm64/Kconfig | 26 ++-
>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>> arch/arm64/include/asm/efi.h | 2 +-
>> arch/arm64/include/asm/fixmap.h | 28 ++-
>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>> arch/arm64/include/asm/memory.h | 62 ++++--
>> arch/arm64/include/asm/page-def.h | 3 +-
>> arch/arm64/include/asm/pgalloc.h | 16 +-
>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>> arch/arm64/include/asm/processor.h | 10 +-
>> arch/arm64/include/asm/sections.h | 1 +
>> arch/arm64/include/asm/smp.h | 1 +
>> arch/arm64/include/asm/sparsemem.h | 15 +-
>> arch/arm64/include/asm/sysreg.h | 54 +++--
>> arch/arm64/include/asm/tlb.h | 3 +
>> arch/arm64/kernel/asm-offsets.c | 4 +-
>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>> arch/arm64/kernel/efi.c | 2 +-
>> arch/arm64/kernel/entry.S | 60 +++++-
>> arch/arm64/kernel/head.S | 46 +++-
>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>> arch/arm64/kernel/image-vars.h | 14 ++
>> arch/arm64/kernel/image.h | 4 +
>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>> arch/arm64/kernel/vdso.c | 7 +-
>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>> arch/arm64/kvm/arm.c | 10 +
>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>> arch/arm64/kvm/mmu.c | 39 ++--
>> arch/arm64/lib/clear_page.S | 7 +-
>> arch/arm64/lib/copy_page.S | 33 ++-
>> arch/arm64/lib/mte.S | 27 ++-
>> arch/arm64/mm/Makefile | 1 +
>> arch/arm64/mm/fixmap.c | 38 ++--
>> arch/arm64/mm/hugetlbpage.c | 40 +---
>> arch/arm64/mm/init.c | 26 +--
>> arch/arm64/mm/kasan_init.c | 8 +-
>> arch/arm64/mm/mmu.c | 53 +++--
>> arch/arm64/mm/pgd.c | 12 +-
>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>> arch/arm64/mm/proc.S | 128 ++++++++---
>> arch/arm64/mm/ptdump.c | 3 +-
>> arch/arm64/tools/cpucaps | 3 +
>> arch/csky/include/asm/page.h | 3 +
>> arch/hexagon/include/asm/page.h | 2 +
>> arch/loongarch/include/asm/page.h | 2 +
>> arch/m68k/include/asm/page.h | 1 +
>> arch/microblaze/include/asm/page.h | 1 +
>> arch/mips/include/asm/page.h | 1 +
>> arch/nios2/include/asm/page.h | 2 +
>> arch/openrisc/include/asm/page.h | 1 +
>> arch/parisc/include/asm/page.h | 1 +
>> arch/powerpc/include/asm/page.h | 2 +
>> arch/riscv/include/asm/page.h | 1 +
>> arch/s390/include/asm/page.h | 1 +
>> arch/sh/include/asm/page.h | 1 +
>> arch/sparc/include/asm/page.h | 3 +
>> arch/um/include/asm/page.h | 2 +
>> arch/x86/include/asm/page_types.h | 2 +
>> arch/xtensa/include/asm/page.h | 1 +
>> crypto/lskcipher.c | 4 +-
>> drivers/ata/sata_sil24.c | 46 ++--
>> drivers/base/node.c | 6 +-
>> drivers/base/topology.c | 32 +--
>> drivers/block/virtio_blk.c | 2 +-
>> drivers/char/random.c | 4 +-
>> drivers/edac/edac_mc.h | 13 +-
>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>> drivers/mtd/mtdswap.c | 4 +-
>> drivers/net/ethernet/freescale/fec.h | 3 +-
>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>> drivers/tee/optee/call.c | 7 +-
>> drivers/tee/optee/smc_abi.c | 2 +-
>> drivers/virtio/virtio_balloon.c | 10 +-
>> drivers/xen/balloon.c | 11 +-
>> drivers/xen/biomerge.c | 12 +-
>> drivers/xen/privcmd.c | 2 +-
>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>> drivers/xen/xlate_mmu.c | 6 +-
>> fs/binfmt_elf.c | 11 +-
>> fs/buffer.c | 2 +-
>> fs/coredump.c | 8 +-
>> fs/ext4/ext4.h | 36 ++--
>> fs/ext4/move_extent.c | 2 +-
>> fs/ext4/readpage.c | 2 +-
>> fs/fat/dir.c | 4 +-
>> fs/fat/fatent.c | 4 +-
>> fs/nfs/nfs42proc.c | 2 +-
>> fs/nfs/nfs42xattr.c | 2 +-
>> fs/nfs/nfs4proc.c | 2 +-
>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>> include/asm-generic/vmlinux.lds.h | 38 ++--
>> include/linux/buffer_head.h | 1 +
>> include/linux/cpumask.h | 5 +
>> include/linux/linkage.h | 4 +-
>> include/linux/mm.h | 17 +-
>> include/linux/mm_types.h | 15 +-
>> include/linux/mm_types_task.h | 2 +-
>> include/linux/mmzone.h | 3 +-
>> include/linux/netlink.h | 6 +-
>> include/linux/percpu-defs.h | 4 +-
>> include/linux/perf_event.h | 2 +-
>> include/linux/sched.h | 4 +-
>> include/linux/slab.h | 7 +-
>> include/linux/stackdepot.h | 6 +-
>> include/linux/sunrpc/svc.h | 8 +-
>> include/linux/sunrpc/svc_rdma.h | 4 +-
>> include/linux/sunrpc/svcsock.h | 2 +-
>> include/linux/swap.h | 17 +-
>> include/linux/swapops.h | 6 +-
>> include/linux/thread_info.h | 10 +-
>> include/xen/page.h | 2 +
>> init/main.c | 7 +-
>> kernel/bpf/core.c | 9 +-
>> kernel/bpf/ringbuf.c | 54 ++---
>> kernel/cgroup/cgroup.c | 8 +-
>> kernel/crash_core.c | 2 +-
>> kernel/events/core.c | 2 +-
>> kernel/fork.c | 71 +++----
>> kernel/power/power.h | 2 +-
>> kernel/power/snapshot.c | 2 +-
>> kernel/power/swap.c | 129 +++++++++--
>> kernel/trace/fgraph.c | 2 +-
>> kernel/trace/trace.c | 2 +-
>> lib/stackdepot.c | 6 +-
>> mm/kasan/report.c | 3 +-
>> mm/memcontrol.c | 11 +-
>> mm/memory.c | 4 +-
>> mm/mmap.c | 2 +-
>> mm/page-writeback.c | 2 +-
>> mm/page_alloc.c | 31 +--
>> mm/slub.c | 2 +-
>> mm/sparse.c | 2 +-
>> mm/swapfile.c | 2 +-
>> mm/vmalloc.c | 7 +-
>> net/9p/trans_virtio.c | 4 +-
>> net/core/hotdata.c | 4 +-
>> net/core/skbuff.c | 4 +-
>> net/core/sysctl_net_core.c | 2 +-
>> net/sunrpc/cache.c | 3 +-
>> net/unix/af_unix.c | 2 +-
>> sound/soc/soc-utils.c | 4 +-
>> virt/kvm/kvm_main.c | 2 +-
>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>
>> --
>> 2.43.0
>
> This is a generally very exciting patch set! I'm looking forward to seeing it
> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>
> That said, I have a couple of questions:
>
> * Going forward, how would we handle drivers/modules that require a particular
> page size? For example, the Apple Silicon IOMMU driver code requires the
> kernel to operate in 16k page size mode, and it would need to be disabled in
> other page sizes.
I think these drivers would want to check PAGE_SIZE at probe time and fail if an
unsupported page size is in use. Do you see any issue with that?
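A rough userspace sketch of what such a probe-time check could look like. Everything here is hypothetical and only illustrates the idea: `demo_page_size` stands in for the kernel's run-time PAGE_SIZE, and `demo_probe_check_page_size()` is not an existing kernel helper.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's run-time PAGE_SIZE value. */
static unsigned long demo_page_size = 4096;

/*
 * Hypothetical probe helper: returns 0 if the running page size is one
 * of the sizes the driver supports, or -ENODEV so that probe fails
 * cleanly on any other page size.
 */
static int demo_probe_check_page_size(const unsigned long *supported,
				      size_t nr_supported)
{
	for (size_t i = 0; i < nr_supported; i++) {
		if (supported[i] == demo_page_size)
			return 0;
	}

	return -ENODEV;
}
```

A driver that only works with 16k pages would pass a one-entry table and bail out of probe on a non-zero return.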
>
> * How would we handle an invalid selection at boot?
What do you mean by invalid here? The current policy validates that the
requested page size is supported by the HW by checking mmfr0. If no page size is
passed on the command line, or the passed value is not supported by the HW, then
we default to the largest page size supported by the HW (so for Apple
Silicon that would be 16k since the HW doesn't support 64k). Although I think it
may be better to change that policy to use the smallest page size in this case;
4k is the safer bet for compat and will waste much less memory than 64k.
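To make the two fallback policies concrete, here is a minimal userspace sketch. It is not the code from the series: the `DEMO_HW_*` mask bits and `demo_select_page_size()` are made up for illustration, and the mask merely stands in for whatever the mmfr0 check reports.

```c
#include <stdbool.h>

/* Hypothetical bitmask of page sizes the hardware reports as supported. */
enum {
	DEMO_HW_4K  = 1u << 0,
	DEMO_HW_16K = 1u << 1,
	DEMO_HW_64K = 1u << 2,
};

/* Map a page size in bytes to its mask bit; 0 for anything else. */
static unsigned int demo_size_to_bit(unsigned long size)
{
	switch (size) {
	case 4096:
		return DEMO_HW_4K;
	case 16384:
		return DEMO_HW_16K;
	case 65536:
		return DEMO_HW_64K;
	default:
		return 0;
	}
}

/*
 * requested == 0 models "nothing passed on the command line".
 * A supported request is honoured; otherwise fall back to the largest
 * supported size (the policy described above) or, if prefer_smallest is
 * set, to the smallest one (the alternative being considered).
 * Returns 0 if the mask contains no usable page size.
 */
static unsigned long demo_select_page_size(unsigned long requested,
					   unsigned int hw_mask,
					   bool prefer_smallest)
{
	static const unsigned long sizes[] = { 4096, 16384, 65536 };

	if (demo_size_to_bit(requested) & hw_mask)
		return requested;

	for (int n = 0; n < 3; n++) {
		int i = prefer_smallest ? n : 2 - n;

		if (demo_size_to_bit(sizes[i]) & hw_mask)
			return sizes[i];
	}

	return 0;
}
```

With a mask that lacks 64k, an unsupported or missing request resolves to 16k under the largest-size policy and to 4k under the smallest-size one.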
> Can we program in a
> fallback when the "wrong" mode is selected for a chip or something similar?
Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
Silicon? The trouble is that we need to select the page size very early in
boot, before start_kernel() is called, so we really only have generic arch code
and the command line with which to make the decision.
>
> Thanks again and best regards!
>
> (P.S.: Please add the asahi@ mailing list to the CC for future iterations of
> this patch set and tag both Hector and myself in as well. Thanks!)
Will do!
>
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-21 9:24 ` Ryan Roberts
@ 2024-10-21 11:04 ` Niklas Cassel
2024-10-21 11:26 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Niklas Cassel @ 2024-10-21 11:04 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook, Gustavo A. R. Silva
On Mon, Oct 21, 2024 at 10:24:37AM +0100, Ryan Roberts wrote:
> On 17/10/2024 13:51, Niklas Cassel wrote:
> > On Thu, Oct 17, 2024 at 01:42:22PM +0100, Ryan Roberts wrote:
(snip)
> That said, while investigating this, I've spotted a bug in my change. paddr calculation in sil24_qc_issue() is incorrect since sizeof(*pp->cmd_block) is no longer PAGE_SIZE. Based on feedback in another patch, I'm also converting the BUG_ONs to WARN_ON_ONCEs.
Side note: Please wrap your lines to 80 characters max.
>
> Additional proposed change, which I'll plan to include in the next version:
>
> ---8<---
> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
> index 85c6382976626..c402bf998c4ee 100644
> --- a/drivers/ata/sata_sil24.c
> +++ b/drivers/ata/sata_sil24.c
> @@ -257,6 +257,10 @@ union sil24_cmd_block {
> struct sil24_atapi_block atapi;
> };
>
> +#define SIL24_ATA_BLOCK_SIZE struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE)
> +#define SIL24_ATAPI_BLOCK_SIZE struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE)
> +#define SIL24_CMD_BLOCK_SIZE max(SIL24_ATA_BLOCK_SIZE, SIL24_ATAPI_BLOCK_SIZE)
> +
> static const struct sil24_cerr_info {
> unsigned int err_mask, action;
> const char *desc;
> @@ -886,7 +890,7 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
> dma_addr_t paddr;
> void __iomem *activate;
>
> - paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
> + paddr = pp->cmd_block_dma + tag * SIL24_CMD_BLOCK_SIZE;
> activate = port + PORT_CMD_ACTIVATE + tag * 8;
>
> /*
> @@ -1192,7 +1196,7 @@ static int sil24_port_start(struct ata_port *ap)
> struct device *dev = ap->host->dev;
> struct sil24_port_priv *pp;
> union sil24_cmd_block *cb;
> - size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
> + size_t cb_size = SIL24_CMD_BLOCK_SIZE * SIL24_MAX_CMDS;
> dma_addr_t cb_dma;
>
> pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
> @@ -1265,8 +1269,8 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> u32 tmp;
>
> /* union sil24_cmd_block must be PAGE_SIZE */
This comment should probably be rephrased to be more clear then, since like
you said sizeof(union sil24_cmd_block) will no longer be PAGE_SIZE.
> - BUG_ON(struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE) != PAGE_SIZE);
> - BUG_ON(struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE) > PAGE_SIZE);
> + WARN_ON_ONCE(SIL24_ATAPI_BLOCK_SIZE != PAGE_SIZE);
> + WARN_ON_ONCE(SIL24_ATA_BLOCK_SIZE != PAGE_SIZE - 16);
>
> ata_print_version_once(&pdev->dev, DRV_VERSION);
> ---8<---
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX
2024-10-19 14:16 ` Thomas Weißschuh
@ 2024-10-21 11:20 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 11:20 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Oliver Upton, Will Deacon, kvmarm, linux-arm-kernel, linux-kernel,
linux-mm
On 19/10/2024 15:16, Thomas Weißschuh wrote:
> On 2024-10-14 11:58:51+0100, Ryan Roberts wrote:
>> Increase alignment of sections in nvhe hyp, vdso and final vmlinux image
>> from PAGE_SIZE to PAGE_SIZE_MAX. For compile-time PAGE_SIZE,
>> PAGE_SIZE_MAX == PAGE_SIZE so there is no change. For boot-time
>> PAGE_SIZE, PAGE_SIZE_MAX is the largest selectable page size.
>>
>> For a boot-time page size build, image size is comparable to a 64K page
>> size compile-time build. In future, it may be desirable to optimize
>> run-time memory consumption by freeing unused padding pages when the
>> boot-time selected page size is less than PAGE_SIZE_MAX.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> arch/arm64/include/asm/memory.h | 4 +--
>> arch/arm64/kernel/vdso-wrap.S | 4 +--
>> arch/arm64/kernel/vdso.c | 7 +++---
>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +--
>> arch/arm64/kernel/vdso32-wrap.S | 4 +--
>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +--
>> arch/arm64/kernel/vmlinux.lds.S | 38 ++++++++++++++---------------
>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 2 +-
>> 8 files changed, 34 insertions(+), 33 deletions(-)
>
>> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
>> index 89b6e78400023..1efe98909a2e0 100644
>> --- a/arch/arm64/kernel/vdso.c
>> +++ b/arch/arm64/kernel/vdso.c
>> @@ -195,7 +195,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
>>
>> vdso_text_len = vdso_info[abi].vdso_pages << PAGE_SHIFT;
>> /* Be sure to map the data page */
>> - vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE;
>> + vdso_mapping_len = vdso_text_len + VVAR_NR_PAGES * PAGE_SIZE_MAX;
>>
>> vdso_base = get_unmapped_area(NULL, 0, vdso_mapping_len, 0, 0);
>> if (IS_ERR_VALUE(vdso_base)) {
>> @@ -203,7 +203,8 @@ static int __setup_additional_pages(enum vdso_abi abi,
>> goto up_fail;
>> }
>>
>> - ret = _install_special_mapping(mm, vdso_base, VVAR_NR_PAGES * PAGE_SIZE,
>> + ret = _install_special_mapping(mm, vdso_base,
>> + VVAR_NR_PAGES * PAGE_SIZE_MAX,
>> VM_READ|VM_MAYREAD|VM_PFNMAP,
>> vdso_info[abi].dm);
>> if (IS_ERR(ret))
>> @@ -212,7 +213,7 @@ static int __setup_additional_pages(enum vdso_abi abi,
>> if (system_supports_bti_kernel())
>> gp_flags = VM_ARM64_BTI;
>>
>> - vdso_base += VVAR_NR_PAGES * PAGE_SIZE;
>> + vdso_base += VVAR_NR_PAGES * PAGE_SIZE_MAX;
>> mm->context.vdso = (void *)vdso_base;
>> ret = _install_special_mapping(mm, vdso_base, vdso_text_len,
>> VM_READ|VM_EXEC|gp_flags|
>
>> diff --git a/arch/arm64/kernel/vdso/vdso.lds.S b/arch/arm64/kernel/vdso/vdso.lds.S
>> index 45354f2ddf706..f7d1537a689e8 100644
>> --- a/arch/arm64/kernel/vdso/vdso.lds.S
>> +++ b/arch/arm64/kernel/vdso/vdso.lds.S
>> @@ -18,9 +18,9 @@ OUTPUT_ARCH(aarch64)
>>
>> SECTIONS
>> {
>> - PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
>> + PROVIDE(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE_MAX);
>> #ifdef CONFIG_TIME_NS
>> - PROVIDE(_timens_data = _vdso_data + PAGE_SIZE);
>> + PROVIDE(_timens_data = _vdso_data + PAGE_SIZE_MAX);
>
> This looks like it also needs a change to vvar_fault() in vdso.c.
> The symbols are now always PAGE_SIZE_MAX apart, while vvar_fault() works
> in page offsets (vmf->pgoff) that are based on the runtime PAGE_SIZE and
> it expects hardcoded offsets.
>
> As test you can use tools/testing/selftests/timens/timens.
>
> (I can't test this right now, so it's only a suspicion)
Ahh good spot - that test does in fact fail.
This fixes the problem:
---8<---
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 1efe98909a2e0..d2049ba6b19f5 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -151,10 +151,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
+ pgoff_t pgmaxoff = vmf->pgoff >> (PAGE_SHIFT_MAX - PAGE_SHIFT);
struct page *timens_page = find_timens_vvar_page(vma);
unsigned long pfn;
- switch (vmf->pgoff) {
+ switch (pgmaxoff) {
case VVAR_DATA_PAGE_OFFSET:
if (timens_page)
pfn = page_to_pfn(timens_page);
---8<---
I'll include it in the next version.
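For anyone following along, the index conversion in that hunk can be checked in isolation with a small userspace sketch. It assumes PAGE_SHIFT_MAX is 16 (64K as the largest selectable page size); `DEMO_PAGE_SHIFT_MAX` and `demo_pgmaxoff()` are illustrative names only.

```c
/* Assumed for this sketch: the largest selectable page size is 64K. */
#define DEMO_PAGE_SHIFT_MAX 16

/*
 * The vvar pages are laid out PAGE_SIZE_MAX apart, but the fault handler
 * sees an offset counted in run-time pages. Shifting by the difference
 * of the two shifts turns that offset into an index of PAGE_SIZE_MAX
 * sized slots.
 */
static unsigned long demo_pgmaxoff(unsigned long pgoff,
				   unsigned int page_shift)
{
	return pgoff >> (DEMO_PAGE_SHIFT_MAX - page_shift);
}
```

With 4K pages, offsets 0 to 15 all land in slot 0 and offset 16 is the first page of slot 1; with 64K pages the shift is zero and the offset is used unchanged.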
Thanks,
Ryan
>
>> #endif
>> . = VDSO_LBASE + SIZEOF_HEADERS;
>
>> diff --git a/arch/arm64/kernel/vdso32/vdso.lds.S b/arch/arm64/kernel/vdso32/vdso.lds.S
>> index 8d95d7d35057d..c46d18a69d1ce 100644
>> --- a/arch/arm64/kernel/vdso32/vdso.lds.S
>> +++ b/arch/arm64/kernel/vdso32/vdso.lds.S
>> @@ -18,9 +18,9 @@ OUTPUT_ARCH(arm)
>>
>> SECTIONS
>> {
>> - PROVIDE_HIDDEN(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE);
>> + PROVIDE_HIDDEN(_vdso_data = . - __VVAR_PAGES * PAGE_SIZE_MAX);
>> #ifdef CONFIG_TIME_NS
>> - PROVIDE_HIDDEN(_timens_data = _vdso_data + PAGE_SIZE);
>> + PROVIDE_HIDDEN(_timens_data = _vdso_data + PAGE_SIZE_MAX);
>> #endif
>> . = VDSO_LBASE + SIZEOF_HEADERS;
^ permalink raw reply related [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-21 11:04 ` Niklas Cassel
@ 2024-10-21 11:26 ` Ryan Roberts
2024-10-21 11:43 ` Niklas Cassel
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 11:26 UTC (permalink / raw)
To: Niklas Cassel
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook, Gustavo A. R. Silva
On 21/10/2024 12:04, Niklas Cassel wrote:
> On Mon, Oct 21, 2024 at 10:24:37AM +0100, Ryan Roberts wrote:
>> On 17/10/2024 13:51, Niklas Cassel wrote:
>>> On Thu, Oct 17, 2024 at 01:42:22PM +0100, Ryan Roberts wrote:
>
> (snip)
>
>> That said, while investigating this, I've spotted a bug in my change. paddr calculation in sil24_qc_issue() is incorrect since sizeof(*pp->cmd_block) is no longer PAGE_SIZE. Based on feedback in another patch, I'm also converting the BUG_ONs to WARN_ON_ONCEs.
>
> Side note: Please wrap your lines to 80 characters max.
Yes sorry, I turned off line wrapping for that last mail because I didn't want
it to wrap the copy/pasted patch. I'll figure out how to mix and match in
future.
>
>
>>
>> Additional proposed change, which I'll plan to include in the next version:
>>
>> ---8<---
>> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
>> index 85c6382976626..c402bf998c4ee 100644
>> --- a/drivers/ata/sata_sil24.c
>> +++ b/drivers/ata/sata_sil24.c
>> @@ -257,6 +257,10 @@ union sil24_cmd_block {
>> struct sil24_atapi_block atapi;
>> };
>>
>> +#define SIL24_ATA_BLOCK_SIZE struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE)
>> +#define SIL24_ATAPI_BLOCK_SIZE struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE)
>> +#define SIL24_CMD_BLOCK_SIZE max(SIL24_ATA_BLOCK_SIZE, SIL24_ATAPI_BLOCK_SIZE)
>> +
>> static const struct sil24_cerr_info {
>> unsigned int err_mask, action;
>> const char *desc;
>> @@ -886,7 +890,7 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
>> dma_addr_t paddr;
>> void __iomem *activate;
>>
>> - paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
>> + paddr = pp->cmd_block_dma + tag * SIL24_CMD_BLOCK_SIZE;
>> activate = port + PORT_CMD_ACTIVATE + tag * 8;
>>
>> /*
>> @@ -1192,7 +1196,7 @@ static int sil24_port_start(struct ata_port *ap)
>> struct device *dev = ap->host->dev;
>> struct sil24_port_priv *pp;
>> union sil24_cmd_block *cb;
>> - size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
>> + size_t cb_size = SIL24_CMD_BLOCK_SIZE * SIL24_MAX_CMDS;
>> dma_addr_t cb_dma;
>>
>> pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
>> @@ -1265,8 +1269,8 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>> u32 tmp;
>>
>> /* union sil24_cmd_block must be PAGE_SIZE */
>
> This comment should probably be rephrased to be more clear then, since like
> you said sizeof(union sil24_cmd_block) will no longer be PAGE_SIZE.
How about:
/*
* union sil24_cmd_block must be PAGE_SIZE once taking into account the 'sge'
* flexible array members in struct sil24_atapi_block and struct sil24_ata_block
*/
>
>
>> - BUG_ON(struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE) != PAGE_SIZE);
>> - BUG_ON(struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE) > PAGE_SIZE);
>> + WARN_ON_ONCE(SIL24_ATAPI_BLOCK_SIZE != PAGE_SIZE);
>> + WARN_ON_ONCE(SIL24_ATA_BLOCK_SIZE != PAGE_SIZE - 16);
>>
>> ata_print_version_once(&pdev->dev, DRV_VERSION);
>> ---8<---
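A standalone sketch of the size arithmetic behind those macros, for illustration only. The `demo_*` struct layouts and `DEMO_MAX_SGE` are assumptions chosen so that the larger block comes to 4096 bytes; they are not copied from the driver, and `offsetof()` plus a multiply stands in for the kernel's struct_size_t().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, loosely modelled on the blocks discussed above. */
struct demo_sge { uint64_t addr; uint32_t cnt; uint32_t flags; };
struct demo_prb { uint16_t ctrl; uint16_t prot; uint32_t rx_cnt; uint8_t fis[24]; };

struct demo_ata_block {
	struct demo_prb prb;
	struct demo_sge sge[];	/* flexible array member */
};

struct demo_atapi_block {
	struct demo_prb prb;
	uint8_t cdb[16];
	struct demo_sge sge[];	/* flexible array member */
};

/* Assumed SGE count for this sketch only. */
#define DEMO_MAX_SGE 253

/* Header size plus DEMO_MAX_SGE trailing elements. */
static size_t demo_ata_block_size(void)
{
	return offsetof(struct demo_ata_block, sge) +
	       DEMO_MAX_SGE * sizeof(struct demo_sge);
}

static size_t demo_atapi_block_size(void)
{
	return offsetof(struct demo_atapi_block, sge) +
	       DEMO_MAX_SGE * sizeof(struct demo_sge);
}

/* A command block slot must be able to hold the larger of the two. */
static size_t demo_cmd_block_size(void)
{
	size_t ata = demo_ata_block_size();
	size_t atapi = demo_atapi_block_size();

	return ata > atapi ? ata : atapi;
}

/* Address of the slot for a given tag, stepping by the fixed slot size. */
static uint64_t demo_cmd_block_addr(uint64_t base, unsigned int tag)
{
	return base + (uint64_t)tag * demo_cmd_block_size();
}
```

The point is that both the allocation size and the per-tag offset derive from the same fixed slot size, so neither depends on the run-time page size.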
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-21 11:02 ` Ryan Roberts
@ 2024-10-21 11:32 ` Eric Curtin
2024-10-21 11:51 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Eric Curtin @ 2024-10-21 11:32 UTC (permalink / raw)
To: Ryan Roberts
Cc: Neal Gompa, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/10/2024 16:47, Neal Gompa wrote:
> > On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
> >> Hi All,
> >>
> >> Patch bomb incoming... This covers many subsystems, so I've included a core
> >> set of people on the full series and additionally included maintainers on
> >> relevant patches. I haven't included those maintainers on this cover letter
> >> since the numbers were far too big for it to work. But I've included a link
> >> to this cover letter on each patch, so they can hopefully find their way
> >> here. For follow up submissions I'll break it up by subsystem, but for now
> >> thought it was important to show the full picture.
> >>
> >> This RFC series implements support for boot-time page size selection within
> >> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
> >> date, page size has been selected at compile-time, meaning the size is
> >> baked into a given kernel image. As use of larger-than-4K page sizes become
> >> more prevalent this starts to present a problem for distributions.
> >> Boot-time page size selection enables the creation of a single kernel
> >> image, which can be told which page size to use on the kernel command line.
> >>
> >> Why is having an image-per-page size problematic?
> >> =================================================
> >>
> >> Many traditional distros are now supporting both 4K and 64K. And this means
> >> managing 2 kernel packages, along with drivers for each. For some, it means
> >> multiple installer flavours and multiple ISOs. All of this adds up to a
> >> less-than-ideal level of complexity. Additionally, Android now supports 4K
> >> and 16K kernels. I'm told having to explicitly manage their KABI for each
> >> kernel is painful, and the extra flash space required for both kernel
> >> images and the duplicated modules has been problematic. Boot-time page size
> >> selection solves all of this.
> >>
> >> Additionally, in starting to think about the longer term deployment story
> >> for D128 page tables, which Arm architecture now supports, a lot of the
> >> same problems need to be solved, so this work sets us up nicely for that.
> >>
> >> So what's the down side?
> >> ========================
> >>
> >> Well nothing's free; Various static allocations in the kernel image must be
> >> sized for the worst case (largest supported page size), so image size is in
> >> line with size of 64K compile-time image. So if you're interested in 4K or
> >> 16K, there is a slight increase to the image size. But I expect that
> >> problem goes away if you're compressing the image - it's just some extra
> >> zeros. At boot-time, I expect we could free the unused static storage once
> >> we know the page size - although that would be a follow up enhancement.
> >>
> >> And then there is performance. Since PAGE_SIZE and friends are no longer
> >> compile-time constants, we must look up their values and do arithmetic at
> >> runtime instead of compile-time. My early perf testing suggests this is
> >> imperceptible for real-world workloads, and only has small impact on
> >> microbenchmarks - more on this below.
> >>
> >> Approach
> >> ========
> >>
> >> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> >> friends are compile-time constant, but in a way that allows the compiler to
> >> perform the same optimizations as was previously being done if they do turn
> >> out to be compile-time constant. Where constants are required, we use
> >> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
> >> description of all the classes of problems to solve.
> >>
> >> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> >> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
> >> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> >> Kconfig, which is an alternative to selecting a compile-time page size.
> >>
> >> When boot-time page size is active, the arch pgtable geometry macro
> >> definitions resolve to something that can be configured at boot. The arm64
> >> implementation in this series mainly uses global, __ro_after_init
> >> variables. I've tried using alternatives patching, but that performs worse
> >> than loading from memory; I think due to code size bloat.
> >>
> >> Status
> >> ======
> >>
> >> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
> >> enough to compile the kernel image itself with defconfig (and a few other
> >> bits and pieces). This is enough to build a kernel that can boot under QEMU
> >> or FVP. I'll happily do the rest of the work to enable all the extra
> >> drivers, but wanted to get feedback on the shape of this effort first. If
> >> anyone wants to do any testing, and has a must-have config, let me know and
> >> I'll prioritize enabling it first.
> >>
> >> The series is arranged as follows:
> >>
> >> - patch 1: Add macros required for converting non-arch code to support
> >> boot-time page size selection
> >> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
> >> all non-arch code
> >> - patches 37-38: Some arm64 tidy ups
> >> - patch 39: Add macros required for converting arm64 code to support
> >> boot-time page size selection
> >> - patches 40-56: arm64 changes to support boot-time page size selection
> >> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> >> selection
> >>
> >> Ideally, I'd like to get the basics merged (something like this series),
> >> then incrementally improve it over a handful of kernel releases until we
> >> can demonstrate that we have feature parity with the compile-time build and
> >> no performance blockers. Once at that point, ideally the compile-time build
> >> options would be removed and the code could be cleaned up further.
> >>
> >> One of the bigger pieces that I'd propose to add as a follow up, is to make
> >> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> >> handling.
> >>
> >> Assuming people are amenable to the rough shape, how would I go about
> >> getting the non-arch changes merged? Since they cover many subsystems, will
> >> each piece need to go independently to each relevant maintainer or could it
> >> all be merged together through the arm64 tree?
> >>
> >> Image Size
> >> ==========
> >>
> >> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> >> kernel image on disk for base (before any changes applied), compile (with
> >> changes, configured for compile-time page size) and boot (with changes,
> >> configured for boot-time page size).
> >>
> >> You can see that the compile-16k and 64k configs are actually slightly
> >> smaller than the baselines; that's due to optimizing some buffer sizes
> >> which didn't need to depend on page size during the series. The boot-time
> >> image is ~1% bigger than the 64k compile-time image. I believe there is
> >> scope to improve this to make it equal to compile-64k if required:
> >>
> >> | config      | size/KB | diff/KB |  diff/% |
> >> |-------------|---------|---------|---------|
> >> | base-4k     |   54895 |       0 |    0.0% |
> >> | base-16k    |   55161 |     266 |    0.5% |
> >> | base-64k    |   56775 |    1880 |    3.4% |
> >> | compile-4k  |   54895 |       0 |    0.0% |
> >> | compile-16k |   55097 |     202 |    0.4% |
> >> | compile-64k |   56391 |    1496 |    2.7% |
> >> | boot-4K     |   57045 |    2150 |    3.9% |
> >>
> >> And below shows the size of the image in memory at run-time, separated for
> >> text and data costs. The boot image has ~1% text cost; most likely due to
> >> the fact that PAGE_SIZE and friends are not compile-time constants so need
> >> instructions to load the values and do arithmetic. I believe we could
> >> eventually get the data cost to match the cost for the compile image for
> >> the chosen page size by freeing the ends of the static buffers not needed
> >> for the selected page size:
> >>
> >> |             |    text |    text |    text |    data |    data |    data |
> >> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
> >> |-------------|---------|---------|---------|---------|---------|---------|
> >> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
> >> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
> >> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
> >> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
> >> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
> >> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
> >> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
> >>
> >> Functional Testing
> >> ==================
> >>
> >> I've build-tested defconfig for all arches supported by tuxmake (which is
> >> most) without issue.
> >>
> >> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
> >> sizes and a few va-sizes, and additionally have run all the mm-selftests,
> >> with no regressions observed vs the equivalent compile-time page size build
> >> (although the mm-selftests have a few existing failures when run against
> >> 16K and 64K kernels - those should really be investigated and fixed
> >> independently).
> >>
> >> Test coverage is lacking for many of the drivers that I've touched, but in
> >> many cases, I'm hoping the changes are simple enough that review might
> >> suffice?
> >>
> >> Performance Testing
> >> ===================
> >>
> >> I've run some limited performance benchmarks:
> >>
> >> First, a real-world benchmark that causes a lot of page table manipulation
> >> (and therefore we would expect to see regression here if we are going to
> >> see it anywhere); kernel compilation. It barely registers a change. Values
> >> are times, so smaller is better. All relative to base-4k:
> >>
> >> |             |    kern |    kern |    user |    user |    real |    real |
> >> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
> >> |-------------|---------|---------|---------|---------|---------|---------|
> >> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
> >> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
> >> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
> >>
> >> The Speedometer JavaScript benchmark also shows no change. Values are runs
> >> per min, so bigger is better. All relative to base-4k:
> >>
> >> | config      |    mean |   stdev |
> >> |-------------|---------|---------|
> >> | base-4k     |    0.0% |    0.8% |
> >> | compile-4k  |    0.4% |    0.8% |
> >> | boot-4k     |    0.0% |    0.9% |
> >>
> >> Finally, I've run some microbenchmarks known to stress page table
> >> manipulations (originally from David Hildenbrand). The fork test
> >> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
> >> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
> >> it. The fork test is known to be extremely sensitive to any changes that
> >> cause instructions to be aligned differently in cachelines. When using this
> >> test for other changes, I've seen double digit regressions for the
> >> slightest thing, so 12% regression on this test is actually fairly good.
> >> This likely represents the extreme worst case for regressions that will be
> >> observed across other microbenchmarks (famous last words). Values are
> >> times, so smaller is better. All relative to base-4k:
> >>
> >> |             |    fork |    fork |  munmap |  munmap |
> >> | config      |    mean |   stdev |    mean |   stdev |
> >> |-------------|---------|---------|---------|---------|
> >> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
> >> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
> >> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
> >>
> >> NOTE: The series applies on top of v6.11.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >> Ryan Roberts (57):
> >> mm: Add macros ahead of supporting boot-time page size selection
> >> vmlinux: Align to PAGE_SIZE_MAX
> >> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> >> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> >> mm: Avoid split pmd ptl if pmd level is run-time folded
> >> mm: Remove PAGE_SIZE compile-time constant assumption
> >> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> >> fs: Remove PAGE_SIZE compile-time constant assumption
> >> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> >> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> >> fork: Permit boot-time THREAD_SIZE determination
> >> cgroup: Remove PAGE_SIZE compile-time constant assumption
> >> bpf: Remove PAGE_SIZE compile-time constant assumption
> >> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> >> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> >> perf: Remove PAGE_SIZE compile-time constant assumption
> >> kvm: Remove PAGE_SIZE compile-time constant assumption
> >> trace: Remove PAGE_SIZE compile-time constant assumption
> >> crash: Remove PAGE_SIZE compile-time constant assumption
> >> crypto: Remove PAGE_SIZE compile-time constant assumption
> >> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> >> sound: Remove PAGE_SIZE compile-time constant assumption
> >> net: Remove PAGE_SIZE compile-time constant assumption
> >> net: fec: Remove PAGE_SIZE compile-time constant assumption
> >> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> >> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> >> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> >> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> >> net: igb: Remove PAGE_SIZE compile-time constant assumption
> >> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> >> edac: Remove PAGE_SIZE compile-time constant assumption
> >> optee: Remove PAGE_SIZE compile-time constant assumption
> >> random: Remove PAGE_SIZE compile-time constant assumption
> >> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> >> virtio: Remove PAGE_SIZE compile-time constant assumption
> >> xen: Remove PAGE_SIZE compile-time constant assumption
> >> arm64: Fix macros to work in C code in addition to the linker script
> >> arm64: Track early pgtable allocation limit
> >> arm64: Introduce macros required for boot-time page selection
> >> arm64: Refactor early pgtable size calculation macros
> >> arm64: Pass desired page size on command line
> >> arm64: Divorce early init from PAGE_SIZE
> >> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> >> arm64: Align sections to PAGE_SIZE_MAX
> >> arm64: Rework trampoline rodata mapping
> >> arm64: Generalize fixmap for boot-time page size
> >> arm64: Statically allocate and align for worst-case page size
> >> arm64: Convert switch to if for non-const comparison values
> >> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> >> arm64: Remove PAGE_SZ asm-offset
> >> arm64: Introduce cpu features for page sizes
> >> arm64: Remove PAGE_SIZE from assembly code
> >> arm64: Runtime-fold pmd level
> >> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> >> arm64: TRAMP_VALIAS is no longer compile-time constant
> >> arm64: Determine THREAD_SIZE at boot-time
> >> arm64: Enable boot-time page size selection
> >>
> >> arch/alpha/include/asm/page.h | 1 +
> >> arch/arc/include/asm/page.h | 1 +
> >> arch/arm/include/asm/page.h | 1 +
> >> arch/arm64/Kconfig | 26 ++-
> >> arch/arm64/include/asm/assembler.h | 78 ++++++-
> >> arch/arm64/include/asm/cpufeature.h | 44 +++-
> >> arch/arm64/include/asm/efi.h | 2 +-
> >> arch/arm64/include/asm/fixmap.h | 28 ++-
> >> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> >> arch/arm64/include/asm/kvm_arm.h | 21 +-
> >> arch/arm64/include/asm/kvm_hyp.h | 11 +
> >> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> >> arch/arm64/include/asm/memory.h | 62 ++++--
> >> arch/arm64/include/asm/page-def.h | 3 +-
> >> arch/arm64/include/asm/pgalloc.h | 16 +-
> >> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> >> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> >> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> >> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> >> arch/arm64/include/asm/processor.h | 10 +-
> >> arch/arm64/include/asm/sections.h | 1 +
> >> arch/arm64/include/asm/smp.h | 1 +
> >> arch/arm64/include/asm/sparsemem.h | 15 +-
> >> arch/arm64/include/asm/sysreg.h | 54 +++--
> >> arch/arm64/include/asm/tlb.h | 3 +
> >> arch/arm64/kernel/asm-offsets.c | 4 +-
> >> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> >> arch/arm64/kernel/efi.c | 2 +-
> >> arch/arm64/kernel/entry.S | 60 +++++-
> >> arch/arm64/kernel/head.S | 46 +++-
> >> arch/arm64/kernel/hibernate-asm.S | 6 +-
> >> arch/arm64/kernel/image-vars.h | 14 ++
> >> arch/arm64/kernel/image.h | 4 +
> >> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> >> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> >> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> >> arch/arm64/kernel/pi/pi.h | 63 +++++-
> >> arch/arm64/kernel/relocate_kernel.S | 10 +-
> >> arch/arm64/kernel/vdso-wrap.S | 4 +-
> >> arch/arm64/kernel/vdso.c | 7 +-
> >> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> >> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> >> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> >> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> >> arch/arm64/kvm/arm.c | 10 +
> >> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> >> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> >> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> >> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> >> arch/arm64/kvm/mmu.c | 39 ++--
> >> arch/arm64/lib/clear_page.S | 7 +-
> >> arch/arm64/lib/copy_page.S | 33 ++-
> >> arch/arm64/lib/mte.S | 27 ++-
> >> arch/arm64/mm/Makefile | 1 +
> >> arch/arm64/mm/fixmap.c | 38 ++--
> >> arch/arm64/mm/hugetlbpage.c | 40 +---
> >> arch/arm64/mm/init.c | 26 +--
> >> arch/arm64/mm/kasan_init.c | 8 +-
> >> arch/arm64/mm/mmu.c | 53 +++--
> >> arch/arm64/mm/pgd.c | 12 +-
> >> arch/arm64/mm/pgtable-geometry.c | 24 +++
> >> arch/arm64/mm/proc.S | 128 ++++++++---
> >> arch/arm64/mm/ptdump.c | 3 +-
> >> arch/arm64/tools/cpucaps | 3 +
> >> arch/csky/include/asm/page.h | 3 +
> >> arch/hexagon/include/asm/page.h | 2 +
> >> arch/loongarch/include/asm/page.h | 2 +
> >> arch/m68k/include/asm/page.h | 1 +
> >> arch/microblaze/include/asm/page.h | 1 +
> >> arch/mips/include/asm/page.h | 1 +
> >> arch/nios2/include/asm/page.h | 2 +
> >> arch/openrisc/include/asm/page.h | 1 +
> >> arch/parisc/include/asm/page.h | 1 +
> >> arch/powerpc/include/asm/page.h | 2 +
> >> arch/riscv/include/asm/page.h | 1 +
> >> arch/s390/include/asm/page.h | 1 +
> >> arch/sh/include/asm/page.h | 1 +
> >> arch/sparc/include/asm/page.h | 3 +
> >> arch/um/include/asm/page.h | 2 +
> >> arch/x86/include/asm/page_types.h | 2 +
> >> arch/xtensa/include/asm/page.h | 1 +
> >> crypto/lskcipher.c | 4 +-
> >> drivers/ata/sata_sil24.c | 46 ++--
> >> drivers/base/node.c | 6 +-
> >> drivers/base/topology.c | 32 +--
> >> drivers/block/virtio_blk.c | 2 +-
> >> drivers/char/random.c | 4 +-
> >> drivers/edac/edac_mc.h | 13 +-
> >> drivers/firmware/efi/libstub/arm64.c | 3 +-
> >> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> >> drivers/mtd/mtdswap.c | 4 +-
> >> drivers/net/ethernet/freescale/fec.h | 3 +-
> >> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> >> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> >> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> >> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> >> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> >> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> >> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> >> drivers/net/ethernet/marvell/sky2.h | 2 +-
> >> drivers/tee/optee/call.c | 7 +-
> >> drivers/tee/optee/smc_abi.c | 2 +-
> >> drivers/virtio/virtio_balloon.c | 10 +-
> >> drivers/xen/balloon.c | 11 +-
> >> drivers/xen/biomerge.c | 12 +-
> >> drivers/xen/privcmd.c | 2 +-
> >> drivers/xen/xenbus/xenbus_client.c | 5 +-
> >> drivers/xen/xlate_mmu.c | 6 +-
> >> fs/binfmt_elf.c | 11 +-
> >> fs/buffer.c | 2 +-
> >> fs/coredump.c | 8 +-
> >> fs/ext4/ext4.h | 36 ++--
> >> fs/ext4/move_extent.c | 2 +-
> >> fs/ext4/readpage.c | 2 +-
> >> fs/fat/dir.c | 4 +-
> >> fs/fat/fatent.c | 4 +-
> >> fs/nfs/nfs42proc.c | 2 +-
> >> fs/nfs/nfs42xattr.c | 2 +-
> >> fs/nfs/nfs4proc.c | 2 +-
> >> include/asm-generic/pgtable-geometry.h | 71 +++++++
> >> include/asm-generic/vmlinux.lds.h | 38 ++--
> >> include/linux/buffer_head.h | 1 +
> >> include/linux/cpumask.h | 5 +
> >> include/linux/linkage.h | 4 +-
> >> include/linux/mm.h | 17 +-
> >> include/linux/mm_types.h | 15 +-
> >> include/linux/mm_types_task.h | 2 +-
> >> include/linux/mmzone.h | 3 +-
> >> include/linux/netlink.h | 6 +-
> >> include/linux/percpu-defs.h | 4 +-
> >> include/linux/perf_event.h | 2 +-
> >> include/linux/sched.h | 4 +-
> >> include/linux/slab.h | 7 +-
> >> include/linux/stackdepot.h | 6 +-
> >> include/linux/sunrpc/svc.h | 8 +-
> >> include/linux/sunrpc/svc_rdma.h | 4 +-
> >> include/linux/sunrpc/svcsock.h | 2 +-
> >> include/linux/swap.h | 17 +-
> >> include/linux/swapops.h | 6 +-
> >> include/linux/thread_info.h | 10 +-
> >> include/xen/page.h | 2 +
> >> init/main.c | 7 +-
> >> kernel/bpf/core.c | 9 +-
> >> kernel/bpf/ringbuf.c | 54 ++---
> >> kernel/cgroup/cgroup.c | 8 +-
> >> kernel/crash_core.c | 2 +-
> >> kernel/events/core.c | 2 +-
> >> kernel/fork.c | 71 +++----
> >> kernel/power/power.h | 2 +-
> >> kernel/power/snapshot.c | 2 +-
> >> kernel/power/swap.c | 129 +++++++++--
> >> kernel/trace/fgraph.c | 2 +-
> >> kernel/trace/trace.c | 2 +-
> >> lib/stackdepot.c | 6 +-
> >> mm/kasan/report.c | 3 +-
> >> mm/memcontrol.c | 11 +-
> >> mm/memory.c | 4 +-
> >> mm/mmap.c | 2 +-
> >> mm/page-writeback.c | 2 +-
> >> mm/page_alloc.c | 31 +--
> >> mm/slub.c | 2 +-
> >> mm/sparse.c | 2 +-
> >> mm/swapfile.c | 2 +-
> >> mm/vmalloc.c | 7 +-
> >> net/9p/trans_virtio.c | 4 +-
> >> net/core/hotdata.c | 4 +-
> >> net/core/skbuff.c | 4 +-
> >> net/core/sysctl_net_core.c | 2 +-
> >> net/sunrpc/cache.c | 3 +-
> >> net/unix/af_unix.c | 2 +-
> >> sound/soc/soc-utils.c | 4 +-
> >> virt/kvm/kvm_main.c | 2 +-
> >> 172 files changed, 2185 insertions(+), 951 deletions(-)
> >> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> >> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> >> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> >> create mode 100644 include/asm-generic/pgtable-geometry.h
> >>
> >> --
> >> 2.43.0
> >
> > This is a generally very exciting patch set! I'm looking forward to seeing it
> > land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
> >
> > That said, I have a couple of questions:
> >
> > * Going forward, how would we handle drivers/modules that require a particular
> > page size? For example, the Apple Silicon IOMMU driver code requires the
> > kernel to operate in 16k page size mode, and it would need to be disabled in
> > other page sizes.
>
> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
> unsupported page size is in use. Do you see any issue with that?
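
For illustration only, the probe-time check suggested above could look
roughly like the following. This is a userspace sketch, not code from the
series or from any real driver: demo_page_size stands in for the kernel's
PAGE_SIZE, and demo_probe() and DEMO_REQUIRED_PAGE_SIZE are made-up names.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical requirement: this device only works with 16K pages. */
#define DEMO_REQUIRED_PAGE_SIZE 16384UL

static_assert((DEMO_REQUIRED_PAGE_SIZE & (DEMO_REQUIRED_PAGE_SIZE - 1)) == 0,
	      "required page size must be a power of two");

/*
 * Stand-in for PAGE_SIZE (assumption): with boot-time selection it is a
 * value fixed during early boot rather than a compile-time constant.
 */
unsigned long demo_page_size = 4096UL;

/* Sketch of a probe routine that refuses to bind on other page sizes. */
int demo_probe(void)
{
	if (demo_page_size != DEMO_REQUIRED_PAGE_SIZE)
		return -ENODEV;	/* unsupported page size: do not bind */

	return 0;
}
```

The point is only that the check moves from Kconfig time to probe time; the
error code a real driver should return is a separate question.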
>
> >
> > * How would we handle an invalid selection at boot?
>
> What do you mean by invalid here? The current policy validates that the
> requested page size is supported by the HW by checking mmfr0. If no page size is
> passed on the command line, or the passed value is not supported by the HW, then
> we default to the largest page size supported by the HW (so for Apple
> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
> may be better to change that policy to use the smallest page size in this case;
> 4k is the safer bet for compat and will waste much less memory than 64k.
>
> > Can we program in a
> > fallback when the "wrong" mode is selected for a chip or something similar?
>
> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
> Silicon? The trouble is that we need to select the page size, very early in
> boot, before start_kernel() is called, so we really only have generic arch code
> and the command line with which to make the decision.
Yes... I think a build-time CONFIG for default page size, which can be
overridden by a karg makes sense... Even on platforms like Apple
Silicon you may want to test very specific things in 4k by overriding
with a karg.
In downstream kernels such as Fedora/RHEL, I would expect the default to
be 4k, but you could override it with 16k or 64k via a karg.
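
To make the policy being discussed concrete, here is a rough userspace
sketch: honour the karg if the HW supports it, otherwise use the build-time
default, otherwise fall back to the smallest supported size. It is not the
code from the series. The DEMO_ names, the bitmask encoding and
demo_select_page_size() are all assumptions; the real code reads the
supported sizes from mmfr0 very early in boot.

```c
#include <assert.h>

#define DEMO_SZ_4K	0x1000UL
#define DEMO_SZ_16K	0x4000UL
#define DEMO_SZ_64K	0x10000UL

/* Hypothetical bitmask of the page sizes the CPU reports as supported. */
enum {
	DEMO_HW_4K  = 1 << 0,
	DEMO_HW_16K = 1 << 1,
	DEMO_HW_64K = 1 << 2,
};

/* Return non-zero if 'size' is one of the sizes set in 'hw_mask'. */
int demo_hw_supports(unsigned int hw_mask, unsigned long size)
{
	switch (size) {
	case DEMO_SZ_4K:
		return !!(hw_mask & DEMO_HW_4K);
	case DEMO_SZ_16K:
		return !!(hw_mask & DEMO_HW_16K);
	case DEMO_SZ_64K:
		return !!(hw_mask & DEMO_HW_64K);
	default:
		return 0;
	}
}

/* Smallest supported page size, or 0 if the mask is empty. */
unsigned long demo_smallest_supported(unsigned int hw_mask)
{
	if (hw_mask & DEMO_HW_4K)
		return DEMO_SZ_4K;
	if (hw_mask & DEMO_HW_16K)
		return DEMO_SZ_16K;
	if (hw_mask & DEMO_HW_64K)
		return DEMO_SZ_64K;
	return 0;
}

/*
 * 'requested' is the karg value (0 if none was given) and 'build_default'
 * is the value a build-time CONFIG option would provide.
 */
unsigned long demo_select_page_size(unsigned int hw_mask,
				    unsigned long requested,
				    unsigned long build_default)
{
	if (requested && demo_hw_supports(hw_mask, requested))
		return requested;
	if (demo_hw_supports(hw_mask, build_default))
		return build_default;
	return demo_smallest_supported(hw_mask);
}
```

With a mask of DEMO_HW_4K | DEMO_HW_16K (no 64k, as described for Apple
Silicon above), a 64k request falls through to the build-time default.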
>
> > Thanks again and best regards!
> >
> > (P.S.: Please add the asahi@ mailing list to the CC for future iterations of
> > this patch set and tag both Hector and myself in as well. Thanks!)
>
> Will do!
>
> >
> >
>
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 34/57] sata_sil24: Remove PAGE_SIZE compile-time constant assumption
2024-10-21 11:26 ` Ryan Roberts
@ 2024-10-21 11:43 ` Niklas Cassel
0 siblings, 0 replies; 196+ messages in thread
From: Niklas Cassel @ 2024-10-21 11:43 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
Damien Le Moal, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-ide,
linux-kernel, linux-mm, Kees Cook, Gustavo A. R. Silva
On Mon, Oct 21, 2024 at 12:26:15PM +0100, Ryan Roberts wrote:
> On 21/10/2024 12:04, Niklas Cassel wrote:
> > On Mon, Oct 21, 2024 at 10:24:37AM +0100, Ryan Roberts wrote:
> >> On 17/10/2024 13:51, Niklas Cassel wrote:
> >>> On Thu, Oct 17, 2024 at 01:42:22PM +0100, Ryan Roberts wrote:
> >
> > (snip)
> >
> >> That said, while investigating this, I've spotted a bug in my change. paddr calculation in sil24_qc_issue() is incorrect since sizeof(*pp->cmd_block) is no longer PAGE_SIZE. Based on feedback in another patch, I'm also converting the BUG_ONs to WARN_ON_ONCEs.
> >
> > Side note: Please wrap your lines to 80 characters max.
>
> Yes sorry, I turned off line wrapping for that last mail because I didn't want
> it to wrap the copy/pasted patch. I'll figure out how to mix and match for future.
>
> >
> >
> >>
> >> Additional proposed change, which I'll plan to include in the next version:
> >>
> >> ---8<---
> >> diff --git a/drivers/ata/sata_sil24.c b/drivers/ata/sata_sil24.c
> >> index 85c6382976626..c402bf998c4ee 100644
> >> --- a/drivers/ata/sata_sil24.c
> >> +++ b/drivers/ata/sata_sil24.c
> >> @@ -257,6 +257,10 @@ union sil24_cmd_block {
> >> struct sil24_atapi_block atapi;
> >> };
> >>
> >> +#define SIL24_ATA_BLOCK_SIZE struct_size_t(struct sil24_ata_block, sge, SIL24_MAX_SGE)
> >> +#define SIL24_ATAPI_BLOCK_SIZE struct_size_t(struct sil24_atapi_block, sge, SIL24_MAX_SGE)
> >> +#define SIL24_CMD_BLOCK_SIZE max(SIL24_ATA_BLOCK_SIZE, SIL24_ATAPI_BLOCK_SIZE)
> >> +
> >> static const struct sil24_cerr_info {
> >> unsigned int err_mask, action;
> >> const char *desc;
> >> @@ -886,7 +890,7 @@ static unsigned int sil24_qc_issue(struct ata_queued_cmd *qc)
> >> dma_addr_t paddr;
> >> void __iomem *activate;
> >>
> >> - paddr = pp->cmd_block_dma + tag * sizeof(*pp->cmd_block);
> >> + paddr = pp->cmd_block_dma + tag * SIL24_CMD_BLOCK_SIZE;
> >> activate = port + PORT_CMD_ACTIVATE + tag * 8;
> >>
> >> /*
> >> @@ -1192,7 +1196,7 @@ static int sil24_port_start(struct ata_port *ap)
> >> struct device *dev = ap->host->dev;
> >> struct sil24_port_priv *pp;
> >> union sil24_cmd_block *cb;
> >> - size_t cb_size = PAGE_SIZE * SIL24_MAX_CMDS;
> >> + size_t cb_size = SIL24_CMD_BLOCK_SIZE * SIL24_MAX_CMDS;
> >> dma_addr_t cb_dma;
> >>
> >> pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
> >> @@ -1265,8 +1269,8 @@ static int sil24_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> >> u32 tmp;
> >>
> >> /* union sil24_cmd_block must be PAGE_SIZE */
> >
> > This comment should probably be rephrased to be more clear then, since like
> > you said sizeof(union sil24_cmd_block) will no longer be PAGE_SIZE.
>
> How about:
>
> /*
> * union sil24_cmd_block must be PAGE_SIZE once taking into account the 'sge'
> * flexible array members in struct sil24_atapi_block and struct sil24_ata_block
> */
Sounds good to me!
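
For readers following along, the size calculation in the proposed change
can be sketched in plain userspace C as below. This is only an
illustration under stated assumptions: the demo_ structs are simplified
stand-ins rather than the real sata_sil24 layouts, DEMO_MAX_SGE is an
arbitrary value, and DEMO_STRUCT_SIZE() mimics the idea of the kernel's
struct_size_t() without its overflow checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the driver's structures. */
struct demo_sge {
	uint64_t addr;
	uint32_t cnt;
	uint32_t flags;
};

struct demo_prb {
	uint16_t ctrl;
	uint16_t prot;
	uint32_t rx_cnt;
	uint8_t fis[24];
};

struct demo_ata_block {
	struct demo_prb prb;
	struct demo_sge sge[];		/* flexible array member */
};

struct demo_atapi_block {
	struct demo_prb prb;
	uint8_t cdb[16];
	struct demo_sge sge[];		/* flexible array member */
};

#define DEMO_MAX_SGE 8			/* arbitrary value for the sketch */

/* Header size plus n trailing elements; no overflow checking here. */
#define DEMO_STRUCT_SIZE(type, member, n) \
	(offsetof(type, member) + (n) * sizeof(((type *)0)->member[0]))

#define DEMO_MAX(a, b) ((a) > (b) ? (a) : (b))

#define DEMO_ATA_BLOCK_SIZE \
	DEMO_STRUCT_SIZE(struct demo_ata_block, sge, DEMO_MAX_SGE)
#define DEMO_ATAPI_BLOCK_SIZE \
	DEMO_STRUCT_SIZE(struct demo_atapi_block, sge, DEMO_MAX_SGE)
#define DEMO_CMD_BLOCK_SIZE \
	DEMO_MAX(DEMO_ATA_BLOCK_SIZE, DEMO_ATAPI_BLOCK_SIZE)

/* Byte offset of the command block for 'tag', as in the paddr fix. */
size_t demo_cmd_block_offset(unsigned int tag)
{
	return (size_t)tag * DEMO_CMD_BLOCK_SIZE;
}
```

Because the stride comes from the explicit block size and not from
sizeof() of the union, it stays correct once the sge arrays become
flexible array members.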
Kind regards,
Niklas
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-18 14:41 ` Petr Tesarik
@ 2024-10-21 11:47 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 11:47 UTC (permalink / raw)
To: Petr Tesarik, Michael Kelley
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 18/10/2024 15:41, Petr Tesarik wrote:
> On Fri, 18 Oct 2024 14:56:00 +0200
> Petr Tesarik <ptesarik@suse.com> wrote:
>
>> On Thu, 17 Oct 2024 13:32:43 +0100
>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>> On 17/10/2024 13:27, Petr Tesarik wrote:
>>>> On Mon, 14 Oct 2024 11:55:11 +0100
>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>>> [...]
>>>>> The series is arranged as follows:
>>>>>
>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>> boot-time page size selection
>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>>>> non-arch code
>>>>
>>>> I have just tried to recompile the openSUSE kernel with these patches
>>>> applied, and I'm running into this:
>>>>
>>>> CC arch/arm64/hyperv/hv_core.o
>>>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
>>>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
>>>> u8 reserved2[PAGE_SIZE - 68];
>>>> ^~~~~~~~~
>>>>
>>>> It looks like one more place which needs a patch, right?
>>>
>>> As mentioned in the cover letter, so far I've only converted enough to get the
>>> defconfig *image* building (i.e. no modules). If you are compiling a different
>>> config or compiling the modules for defconfig, you will likely run into these
>>> types of issues.
>>>
>>> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
>>> enough to send me.
>>>
>>> I understand that Suse might be able to help with wider performance testing - if
>>> that's the reason you are trying to compile, you could send me your config and
>>> I'll start working on fixing up other drivers?
>>
>> You're right, performance testing is my goal.
>>
>> Heh, the openSUSE master config is cranked up to max. ;-) That would be
>> a lot of work, and we don't need all those options for running our test
>> suite. Let me disable the conflicting options instead.
>> [...]
>> I'll see if I can do something about btrfs. Then I can try to boot the
>> kernel...
>
> FWIW the kernel builds and _boots_ after applying this patch:
Amazing - thanks for doing this!
>
> fs/btrfs/compression.h | 2 +-
> fs/btrfs/defrag.c | 2 +-
> fs/btrfs/extent_io.h | 2 +-
> fs/btrfs/scrub.c | 2 +-
> include/linux/raid/pq.h | 4 ++--
> lib/raid6/algos.c | 2 +-
> 6 files changed, 7 insertions(+), 7 deletions(-)
>
> --- a/fs/btrfs/compression.h
> +++ b/fs/btrfs/compression.h
> @@ -33,7 +33,7 @@ struct btrfs_bio;
> /* Maximum length of compressed data stored on disk */
> #define BTRFS_MAX_COMPRESSED (SZ_128K)
> #define BTRFS_MAX_COMPRESSED_PAGES (BTRFS_MAX_COMPRESSED / PAGE_SIZE)
> -static_assert((BTRFS_MAX_COMPRESSED % PAGE_SIZE) == 0);
> +static_assert((BTRFS_MAX_COMPRESSED % PAGE_SIZE_MAX) == 0);
>
> /* Maximum size of data before compression */
> #define BTRFS_MAX_UNCOMPRESSED (SZ_128K)
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -1144,7 +1144,7 @@ next:
> }
>
> #define CLUSTER_SIZE (SZ_256K)
> -static_assert(PAGE_ALIGNED(CLUSTER_SIZE));
> +static_assert(IS_ALIGNED(CLUSTER_SIZE, PAGE_SIZE_MAX));
>
> /*
> * Defrag one contiguous target range.
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -89,7 +89,7 @@ enum {
> int __init extent_buffer_init_cachep(void);
> void __cold extent_buffer_free_cachep(void);
>
> -#define INLINE_EXTENT_BUFFER_PAGES (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE)
> +#define INLINE_EXTENT_BUFFER_PAGES (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE_MIN)
While this works, I'm not sure if you would want to have 2 separate macros; 1
for worst-case static allocation, and 1 for dynamic allocation and iterating. I
could imagine if you allocate PAGE_SIZE_MAX pages into the worst case number of
slots that would increase memory. I'm not familiar with the code so don't know
if this is a problem in practice. Certainly what you have done is much simpler
if acceptable.
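
A rough sketch of the two-macro split I have in mind, written as userspace
C rather than btrfs code. All DEMO_/demo_ names and values are made up for
illustration; demo_page_size stands in for the boot-time PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits standing in for PAGE_SIZE_MIN and PAGE_SIZE_MAX. */
#define DEMO_PAGE_SIZE_MIN		4096UL
#define DEMO_PAGE_SIZE_MAX		65536UL
#define DEMO_MAX_METADATA_BLOCKSIZE	65536UL

/* Macro 1: worst-case slot count, used only to size static arrays. */
#define DEMO_MAX_BUFFER_PAGES \
	(DEMO_MAX_METADATA_BLOCKSIZE / DEMO_PAGE_SIZE_MIN)

/* Stand-in for the boot-time PAGE_SIZE (assumption): set once at boot. */
unsigned long demo_page_size = DEMO_PAGE_SIZE_MIN;

struct demo_buffer {
	unsigned long len;
	void *pages[DEMO_MAX_BUFFER_PAGES];	/* sized for the worst case */
};

/*
 * "Macro 2", written as a function here: the number of slots actually in
 * use for the page size selected at boot. Allocation and iteration would
 * use this, never DEMO_MAX_BUFFER_PAGES.
 */
size_t demo_num_pages(const struct demo_buffer *buf)
{
	return (buf->len + demo_page_size - 1) / demo_page_size;
}
```

The array costs a few unused pointers at larger page sizes, but only the
slots reported by demo_num_pages() would ever be allocated or walked.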
> struct extent_buffer {
> u64 start;
> u32 len;
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -100,7 +100,7 @@ enum scrub_stripe_flags {
> SCRUB_STRIPE_FLAG_NO_REPORT,
> };
>
> -#define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE)
> +#define SCRUB_STRIPE_PAGES (BTRFS_STRIPE_LEN / PAGE_SIZE_MIN)
Same comment.
Thanks,
Ryan
>
> /*
> * Represent one contiguous range with a length of BTRFS_STRIPE_LEN.
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -12,7 +12,7 @@
>
> #include <linux/blkdev.h>
>
> -extern const char raid6_empty_zero_page[PAGE_SIZE];
> +extern const char raid6_empty_zero_page[PAGE_SIZE_MAX];
>
> #else /* ! __KERNEL__ */
> /* Used for testing in user space */
> @@ -39,7 +39,7 @@ typedef uint64_t u64;
> #ifndef PAGE_SHIFT
> # define PAGE_SHIFT 12
> #endif
> -extern const char raid6_empty_zero_page[PAGE_SIZE];
> +extern const char raid6_empty_zero_page[PAGE_SIZE_MAX];
>
> #define __init
> #define __exit
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -19,7 +19,7 @@
> #include <linux/module.h>
> #include <linux/gfp.h>
> /* In .bss so it's zeroed */
> -const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
> +const char raid6_empty_zero_page[PAGE_SIZE_MAX] __attribute__((aligned(256)));
> EXPORT_SYMBOL(raid6_empty_zero_page);
> #endif
>
>
> Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-17 22:05 ` Dave Kleikamp
@ 2024-10-21 11:49 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 11:49 UTC (permalink / raw)
To: Dave Kleikamp, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 17/10/2024 23:05, Dave Kleikamp wrote:
> On 10/14/24 5:55AM, Ryan Roberts wrote:
>> Hi All,
>>
>> Patch bomb incoming... This covers many subsystems, so I've included a core set
>> of people on the full series and additionally included maintainers on relevant
>> patches. I haven't included those maintainers on this cover letter since the
>> numbers were far too big for it to work. But I've included a link to this cover
>> letter on each patch, so they can hopefully find their way here. For follow up
>> submissions I'll break it up by subsystem, but for now thought it was important
>> to show the full picture.
>>
>> This RFC series implements support for boot-time page size selection within the
>> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
>> size has been selected at compile-time, meaning the size is baked into a given
>> kernel image. As use of larger-than-4K page sizes becomes more prevalent this
>> starts to present a problem for distributions. Boot-time page size selection
>> enables the creation of a single kernel image, which can be told which page size
>> to use on the kernel command line.
>
> This looks really promising. Building and maintaining separate kernels is
> costly. Being able to build one kernel for three potential page sizes would not
> only cut down on the overhead of producing kernel packages and images, but also
> eases benchmarking and testing different page sizes without the need to build
> and install multiple kernels.
>
> I'm also impressed that the patches are less intrusive than I would have
> expected. I'm looking forward to seeing this project move forward.
Thanks for the feedback! I'm sure any review/test capacity that Oracle has would
be greatly appreciated :)
Thanks,
Ryan
>
> Thanks,
> Shaggy
>
>>
>> Why is having an image-per-page size problematic?
>> =================================================
>>
>> Many traditional distros are now supporting both 4K and 64K. And this means
>> managing 2 kernel packages, along with drivers for each. For some, it means
>> multiple installer flavours and multiple ISOs. All of this adds up to a
>> less-than-ideal level of complexity. Additionally, Android now supports 4K and
>> 16K kernels. I'm told having to explicitly manage their KABI for each kernel is
>> painful, and the extra flash space required for both kernel images and the
>> duplicated modules has been problematic. Boot-time page size selection solves
>> all of this.
>>
>> Additionally, in starting to think about the longer term deployment story for
>> D128 page tables, which Arm architecture now supports, a lot of the same
>> problems need to be solved, so this work sets us up nicely for that.
>>
>> So what's the down side?
>> ========================
>>
>> Well nothing's free; Various static allocations in the kernel image must be
>> sized for the worst case (largest supported page size), so image size is in line
>> with size of 64K compile-time image. So if you're interested in 4K or 16K, there
>> is a slight increase to the image size. But I expect that problem goes away if
>> you're compressing the image - it's just some extra zeros. At boot-time, I expect
>> we could free the unused static storage once we know the page size - although
>> that would be a follow up enhancement.
>>
>> And then there is performance. Since PAGE_SIZE and friends are no longer
>> compile-time constants, we must look up their values and do arithmetic at
>> runtime instead of compile-time. My early perf testing suggests this is
>> imperceptible for real-world workloads, and only has small impact on
>> microbenchmarks - more on this below.
>>
>> Approach
>> ========
>>
>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>> friends are compile-time constant, but in a way that allows the compiler to
>> perform the same optimizations as was previously being done if they do turn out
>> to be compile-time constant. Where constants are required, we use limits;
>> PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full description
>> of all the classes of problems to solve.
>>
>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX. arm64
>> does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE Kconfig,
>> which is an alternative to selecting a compile-time page size.
>>
>> When boot-time page size is active, the arch pgtable geometry macro definitions
>> resolve to something that can be configured at boot. The arm64 implementation in
>> this series mainly uses global, __ro_after_init variables. I've tried using
>> alternatives patching, but that performs worse than loading from memory; I think
>> due to code size bloat.
>>
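A minimal userspace sketch of the approach described above, not taken from the series itself. It assumes a 4K minimum and 64K maximum, and models the `__ro_after_init` global as an ordinary static variable. The names `page_shift`, `set_page_shift()`, `pages_needed()`, `unused_scratch_bytes()` and `scratch` are hypothetical and exist only for illustration; only PAGE_SIZE, PAGE_SIZE_MIN and PAGE_SIZE_MAX come from the cover letter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed compile-time bounds: smallest and largest supported page sizes. */
#define PAGE_SHIFT_MIN 12                      /* 4K  */
#define PAGE_SHIFT_MAX 16                      /* 64K */
#define PAGE_SIZE_MIN  (1UL << PAGE_SHIFT_MIN)
#define PAGE_SIZE_MAX  (1UL << PAGE_SHIFT_MAX)

/*
 * Stand-in for a __ro_after_init global: written once during "boot",
 * only read afterwards. PAGE_SIZE is now a runtime expression.
 */
static unsigned int page_shift = PAGE_SHIFT_MIN;
#define PAGE_SIZE (1UL << page_shift)

/*
 * Static storage cannot depend on a runtime value, so it is sized for
 * the worst case (largest page size).
 */
static unsigned char scratch[PAGE_SIZE_MAX];

/* Hypothetical helper: pick the page size, as a boot parameter might. */
static int set_page_shift(unsigned int shift)
{
	if (shift != 12 && shift != 14 && shift != 16)
		return -1;	/* not a supported granule in this sketch */
	page_shift = shift;
	return 0;
}

/* Arithmetic that used to fold at compile time now runs at runtime. */
static size_t pages_needed(size_t bytes)
{
	return (bytes + PAGE_SIZE - 1) >> page_shift;
}

/* Portion of the worst-case buffer left unused for the selected size. */
static size_t unused_scratch_bytes(void)
{
	assert(PAGE_SIZE <= sizeof(scratch));
	return sizeof(scratch) - PAGE_SIZE;
}
```

With 16K selected in this sketch, `pages_needed(65536)` evaluates to 4 and 48K of `scratch` goes unused, which is the kind of static storage the cover letter suggests could be freed once the page size is known.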
>> Status
>> ======
>>
>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented enough
>> to compile the kernel image itself with defconfig (and a few other bits and
>> pieces). This is enough to build a kernel that can boot under QEMU or FVP. I'll
>> happily do the rest of the work to enable all the extra drivers, but wanted to
>> get feedback on the shape of this effort first. If anyone wants to do any
>> testing, and has a must-have config, let me know and I'll prioritize enabling it
>> first.
>>
>> The series is arranged as follows:
>>
>> - patch 1: Add macros required for converting non-arch code to support
>> boot-time page size selection
>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>> non-arch code
>> - patches 37-38: Some arm64 tidy ups
>> - patch 39: Add macros required for converting arm64 code to support
>> boot-time page size selection
>> - patches 40-56: arm64 changes to support boot-time page size selection
>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>> selection
>>
>> Ideally, I'd like to get the basics merged (something like this series), then
>> incrementally improve it over a handful of kernel releases until we can
>> demonstrate that we have feature parity with the compile-time build and no
>> performance blockers. Once at that point, ideally the compile-time build options
>> would be removed and the code could be cleaned up further.
>>
>> One of the bigger pieces that I'd propose to add as a follow up, is to make
>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>> handling.
>>
>> Assuming people are amenable to the rough shape, how would I go about getting
>> the non-arch changes merged? Since they cover many subsystems, will each piece
>> need to go independently to each relevant maintainer or could it all be merged
>> together through the arm64 tree?
>>
>> Image Size
>> ==========
>>
>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>> kernel image on disk for base (before any changes applied), compile (with
>> changes, configured for compile-time page size) and boot (with changes,
>> configured for boot-time page size).
>>
>> You can see that the compile-16k and 64k configs are actually slightly smaller
>> than the baselines; that's due to optimizing some buffer sizes which didn't need
>> to depend on page size during the series. The boot-time image is ~1% bigger than
>> the 64k compile-time image. I believe there is scope to improve this to make it
>> equal to compile-64k if required:
>>
>> | config | size/KB | diff/KB | diff/% |
>> |-------------|---------|---------|---------|
>> | base-4k | 54895 | 0 | 0.0% |
>> | base-16k | 55161 | 266 | 0.5% |
>> | base-64k | 56775 | 1880 | 3.4% |
>> | compile-4k | 54895 | 0 | 0.0% |
>> | compile-16k | 55097 | 202 | 0.4% |
>> | compile-64k | 56391 | 1496 | 2.7% |
>> | boot-4K | 57045 | 2150 | 3.9% |
>>
>> And below shows the size of the image in memory at run-time, separated for text
>> and data costs. The boot image has ~1% text cost; most likely due to the fact
>> that PAGE_SIZE and friends are not compile-time constants so need instructions
>> to load the values and do arithmetic. I believe we could eventually get the data
>> cost to match the cost for the compile image for the chosen page size by freeing
>> the ends of the static buffers not needed for the selected page size:
>>
>> | | text | text | text | data | data | data |
>> | config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
>> | base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
>> | base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
>> | compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
>> | compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
>> | compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
>> | boot-4K | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
>>
>> Functional Testing
>> ==================
>>
>> I've build-tested defconfig for all arches supported by tuxmake (which is most)
>> without issue.
>>
>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page sizes
>> and a few va-sizes, and additionally have run all the mm-selftests, with no
>> regressions observed vs the equivalent compile-time page size build (although
>> the mm-selftests have a few existing failures when run against 16K and 64K
>> kernels - those should really be investigated and fixed independently).
>>
>> Test coverage is lacking for many of the drivers that I've touched, but in many
>> cases, I'm hoping the changes are simple enough that review might suffice?
>>
>> Performance Testing
>> ===================
>>
>> I've run some limited performance benchmarks:
>>
>> First, a real-world benchmark that causes a lot of page table manipulation (and
>> therefore we would expect to see regression here if we are going to see it
>> anywhere); kernel compilation. It barely registers a change. Values are times,
>> so smaller is better. All relative to base-4k:
>>
>> | | kern | kern | user | user | real | real |
>> | config | mean | stdev | mean | stdev | mean | stdev |
>> |-------------|---------|---------|---------|---------|---------|---------|
>> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
>> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
>> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>>
>> The Speedometer JavaScript benchmark also shows no change. Values are runs per
>> min, so bigger is better. All relative to base-4k:
>>
>> | config | mean | stdev |
>> |-------------|---------|---------|
>> | base-4k | 0.0% | 0.8% |
>> | compile-4k | 0.4% | 0.8% |
>> | boot-4k | 0.0% | 0.9% |
>>
>> Finally, I've run some microbenchmarks known to stress page table manipulations
>> (originally from David Hildenbrand). The fork test maps/allocs 1G of anon
>> memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
>> memory then measures the cost of munmap()ing it. The fork test is known to be
>> extremely sensitive to any changes that cause instructions to be aligned
>> differently in cachelines. When using this test for other changes, I've seen
>> double digit regressions for the slightest thing, so 12% regression on this test
>> is actually fairly good. This likely represents the extreme worst case for
>> regressions that will be observed across other microbenchmarks (famous last
>> words). Values are times, so smaller is better. All relative to base-4k:
>>
>> | | fork | fork | munmap | munmap |
>> | config | mean | stdev | mean | stdev |
>> |-------------|---------|---------|---------|---------|
>> | base-4k | 0.0% | 1.3% | 0.0% | 0.3% |
>> | compile-4k | 0.1% | 1.3% | -0.9% | 0.1% |
>> | boot-4k | 12.8% | 1.2% | 3.8% | 1.0% |
>>
>> NOTE: The series applies on top of v6.11.
>>
>> Thanks,
>> Ryan
>>
>>
>> Ryan Roberts (57):
>> mm: Add macros ahead of supporting boot-time page size selection
>> vmlinux: Align to PAGE_SIZE_MAX
>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>> mm: Avoid split pmd ptl if pmd level is run-time folded
>> mm: Remove PAGE_SIZE compile-time constant assumption
>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>> fs: Remove PAGE_SIZE compile-time constant assumption
>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>> fork: Permit boot-time THREAD_SIZE determination
>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>> bpf: Remove PAGE_SIZE compile-time constant assumption
>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>> perf: Remove PAGE_SIZE compile-time constant assumption
>> kvm: Remove PAGE_SIZE compile-time constant assumption
>> trace: Remove PAGE_SIZE compile-time constant assumption
>> crash: Remove PAGE_SIZE compile-time constant assumption
>> crypto: Remove PAGE_SIZE compile-time constant assumption
>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>> sound: Remove PAGE_SIZE compile-time constant assumption
>> net: Remove PAGE_SIZE compile-time constant assumption
>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>> edac: Remove PAGE_SIZE compile-time constant assumption
>> optee: Remove PAGE_SIZE compile-time constant assumption
>> random: Remove PAGE_SIZE compile-time constant assumption
>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>> virtio: Remove PAGE_SIZE compile-time constant assumption
>> xen: Remove PAGE_SIZE compile-time constant assumption
>> arm64: Fix macros to work in C code in addition to the linker script
>> arm64: Track early pgtable allocation limit
>> arm64: Introduce macros required for boot-time page selection
>> arm64: Refactor early pgtable size calculation macros
>> arm64: Pass desired page size on command line
>> arm64: Divorce early init from PAGE_SIZE
>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>> arm64: Align sections to PAGE_SIZE_MAX
>> arm64: Rework trampoline rodata mapping
>> arm64: Generalize fixmap for boot-time page size
>> arm64: Statically allocate and align for worst-case page size
>> arm64: Convert switch to if for non-const comparison values
>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>> arm64: Remove PAGE_SZ asm-offset
>> arm64: Introduce cpu features for page sizes
>> arm64: Remove PAGE_SIZE from assembly code
>> arm64: Runtime-fold pmd level
>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>> arm64: TRAMP_VALIAS is no longer compile-time constant
>> arm64: Determine THREAD_SIZE at boot-time
>> arm64: Enable boot-time page size selection
>>
>> arch/alpha/include/asm/page.h | 1 +
>> arch/arc/include/asm/page.h | 1 +
>> arch/arm/include/asm/page.h | 1 +
>> arch/arm64/Kconfig | 26 ++-
>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>> arch/arm64/include/asm/efi.h | 2 +-
>> arch/arm64/include/asm/fixmap.h | 28 ++-
>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>> arch/arm64/include/asm/memory.h | 62 ++++--
>> arch/arm64/include/asm/page-def.h | 3 +-
>> arch/arm64/include/asm/pgalloc.h | 16 +-
>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>> arch/arm64/include/asm/processor.h | 10 +-
>> arch/arm64/include/asm/sections.h | 1 +
>> arch/arm64/include/asm/smp.h | 1 +
>> arch/arm64/include/asm/sparsemem.h | 15 +-
>> arch/arm64/include/asm/sysreg.h | 54 +++--
>> arch/arm64/include/asm/tlb.h | 3 +
>> arch/arm64/kernel/asm-offsets.c | 4 +-
>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>> arch/arm64/kernel/efi.c | 2 +-
>> arch/arm64/kernel/entry.S | 60 +++++-
>> arch/arm64/kernel/head.S | 46 +++-
>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>> arch/arm64/kernel/image-vars.h | 14 ++
>> arch/arm64/kernel/image.h | 4 +
>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>> arch/arm64/kernel/vdso.c | 7 +-
>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>> arch/arm64/kvm/arm.c | 10 +
>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>> arch/arm64/kvm/mmu.c | 39 ++--
>> arch/arm64/lib/clear_page.S | 7 +-
>> arch/arm64/lib/copy_page.S | 33 ++-
>> arch/arm64/lib/mte.S | 27 ++-
>> arch/arm64/mm/Makefile | 1 +
>> arch/arm64/mm/fixmap.c | 38 ++--
>> arch/arm64/mm/hugetlbpage.c | 40 +---
>> arch/arm64/mm/init.c | 26 +--
>> arch/arm64/mm/kasan_init.c | 8 +-
>> arch/arm64/mm/mmu.c | 53 +++--
>> arch/arm64/mm/pgd.c | 12 +-
>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>> arch/arm64/mm/proc.S | 128 ++++++++---
>> arch/arm64/mm/ptdump.c | 3 +-
>> arch/arm64/tools/cpucaps | 3 +
>> arch/csky/include/asm/page.h | 3 +
>> arch/hexagon/include/asm/page.h | 2 +
>> arch/loongarch/include/asm/page.h | 2 +
>> arch/m68k/include/asm/page.h | 1 +
>> arch/microblaze/include/asm/page.h | 1 +
>> arch/mips/include/asm/page.h | 1 +
>> arch/nios2/include/asm/page.h | 2 +
>> arch/openrisc/include/asm/page.h | 1 +
>> arch/parisc/include/asm/page.h | 1 +
>> arch/powerpc/include/asm/page.h | 2 +
>> arch/riscv/include/asm/page.h | 1 +
>> arch/s390/include/asm/page.h | 1 +
>> arch/sh/include/asm/page.h | 1 +
>> arch/sparc/include/asm/page.h | 3 +
>> arch/um/include/asm/page.h | 2 +
>> arch/x86/include/asm/page_types.h | 2 +
>> arch/xtensa/include/asm/page.h | 1 +
>> crypto/lskcipher.c | 4 +-
>> drivers/ata/sata_sil24.c | 46 ++--
>> drivers/base/node.c | 6 +-
>> drivers/base/topology.c | 32 +--
>> drivers/block/virtio_blk.c | 2 +-
>> drivers/char/random.c | 4 +-
>> drivers/edac/edac_mc.h | 13 +-
>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>> drivers/mtd/mtdswap.c | 4 +-
>> drivers/net/ethernet/freescale/fec.h | 3 +-
>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>> drivers/tee/optee/call.c | 7 +-
>> drivers/tee/optee/smc_abi.c | 2 +-
>> drivers/virtio/virtio_balloon.c | 10 +-
>> drivers/xen/balloon.c | 11 +-
>> drivers/xen/biomerge.c | 12 +-
>> drivers/xen/privcmd.c | 2 +-
>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>> drivers/xen/xlate_mmu.c | 6 +-
>> fs/binfmt_elf.c | 11 +-
>> fs/buffer.c | 2 +-
>> fs/coredump.c | 8 +-
>> fs/ext4/ext4.h | 36 ++--
>> fs/ext4/move_extent.c | 2 +-
>> fs/ext4/readpage.c | 2 +-
>> fs/fat/dir.c | 4 +-
>> fs/fat/fatent.c | 4 +-
>> fs/nfs/nfs42proc.c | 2 +-
>> fs/nfs/nfs42xattr.c | 2 +-
>> fs/nfs/nfs4proc.c | 2 +-
>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>> include/asm-generic/vmlinux.lds.h | 38 ++--
>> include/linux/buffer_head.h | 1 +
>> include/linux/cpumask.h | 5 +
>> include/linux/linkage.h | 4 +-
>> include/linux/mm.h | 17 +-
>> include/linux/mm_types.h | 15 +-
>> include/linux/mm_types_task.h | 2 +-
>> include/linux/mmzone.h | 3 +-
>> include/linux/netlink.h | 6 +-
>> include/linux/percpu-defs.h | 4 +-
>> include/linux/perf_event.h | 2 +-
>> include/linux/sched.h | 4 +-
>> include/linux/slab.h | 7 +-
>> include/linux/stackdepot.h | 6 +-
>> include/linux/sunrpc/svc.h | 8 +-
>> include/linux/sunrpc/svc_rdma.h | 4 +-
>> include/linux/sunrpc/svcsock.h | 2 +-
>> include/linux/swap.h | 17 +-
>> include/linux/swapops.h | 6 +-
>> include/linux/thread_info.h | 10 +-
>> include/xen/page.h | 2 +
>> init/main.c | 7 +-
>> kernel/bpf/core.c | 9 +-
>> kernel/bpf/ringbuf.c | 54 ++---
>> kernel/cgroup/cgroup.c | 8 +-
>> kernel/crash_core.c | 2 +-
>> kernel/events/core.c | 2 +-
>> kernel/fork.c | 71 +++----
>> kernel/power/power.h | 2 +-
>> kernel/power/snapshot.c | 2 +-
>> kernel/power/swap.c | 129 +++++++++--
>> kernel/trace/fgraph.c | 2 +-
>> kernel/trace/trace.c | 2 +-
>> lib/stackdepot.c | 6 +-
>> mm/kasan/report.c | 3 +-
>> mm/memcontrol.c | 11 +-
>> mm/memory.c | 4 +-
>> mm/mmap.c | 2 +-
>> mm/page-writeback.c | 2 +-
>> mm/page_alloc.c | 31 +--
>> mm/slub.c | 2 +-
>> mm/sparse.c | 2 +-
>> mm/swapfile.c | 2 +-
>> mm/vmalloc.c | 7 +-
>> net/9p/trans_virtio.c | 4 +-
>> net/core/hotdata.c | 4 +-
>> net/core/skbuff.c | 4 +-
>> net/core/sysctl_net_core.c | 2 +-
>> net/sunrpc/cache.c | 3 +-
>> net/unix/af_unix.c | 2 +-
>> sound/soc/soc-utils.c | 4 +-
>> virt/kvm/kvm_main.c | 2 +-
>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>
>> --
>> 2.43.0
>>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-21 11:32 ` Eric Curtin
@ 2024-10-21 11:51 ` Ryan Roberts
2024-10-21 13:49 ` Neal Gompa
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 11:51 UTC (permalink / raw)
To: Eric Curtin
Cc: Neal Gompa, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On 21/10/2024 12:32, Eric Curtin wrote:
> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 19/10/2024 16:47, Neal Gompa wrote:
>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
>>>> set of people on the full series and additionally included maintainers on
>>>> relevant patches. I haven't included those maintainers on this cover letter
>>>> since the numbers were far too big for it to work. But I've included a link
>>>> to this cover letter on each patch, so they can hopefully find their way
>>>> here. For follow up submissions I'll break it up by subsystem, but for now
>>>> thought it was important to show the full picture.
>>>>
>>>> This RFC series implements support for boot-time page size selection within
>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
>>>> date, page size has been selected at compile-time, meaning the size is
>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
>>>> more prevalent this starts to present a problem for distributions.
>>>> Boot-time page size selection enables the creation of a single kernel
>>>> image, which can be told which page size to use on the kernel command line.
>>>>
>>>> Why is having an image-per-page size problematic?
>>>> =================================================
>>>>
>>>> Many traditional distros are now supporting both 4K and 64K. And this means
>>>> managing 2 kernel packages, along with drivers for each. For some, it means
>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
>>>> kernel is painful, and the extra flash space required for both kernel
>>>> images and the duplicated modules has been problematic. Boot-time page size
>>>> selection solves all of this.
>>>>
>>>> Additionally, in starting to think about the longer term deployment story
>>>> for D128 page tables, which Arm architecture now supports, a lot of the
>>>> same problems need to be solved, so this work sets us up nicely for that.
>>>>
>>>> So what's the down side?
>>>> ========================
>>>>
>>>> Well nothing's free; Various static allocations in the kernel image must be
>>>> sized for the worst case (largest supported page size), so image size is in
>>>> line with size of 64K compile-time image. So if you're interested in 4K or
>>>> 16K, there is a slight increase to the image size. But I expect that
>>>> problem goes away if you're compressing the image - it's just some extra
>>>> zeros. At boot-time, I expect we could free the unused static storage once
>>>> we know the page size - although that would be a follow up enhancement.
>>>>
>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
>>>> compile-time constants, we must look up their values and do arithmetic at
>>>> runtime instead of compile-time. My early perf testing suggests this is
>>>> imperceptible for real-world workloads, and only has small impact on
>>>> microbenchmarks - more on this below.
>>>>
>>>> Approach
>>>> ========
>>>>
>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>>>> friends are compile-time constant, but in a way that allows the compiler to
>>>> perform the same optimizations as was previously being done if they do turn
>>>> out to be compile-time constant. Where constants are required, we use
>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
>>>> description of all the classes of problems to solve.
>>>>
>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>>>> Kconfig, which is an alternative to selecting a compile-time page size.
>>>>
>>>> When boot-time page size is active, the arch pgtable geometry macro
>>>> definitions resolve to something that can be configured at boot. The arm64
>>>> implementation in this series mainly uses global, __ro_after_init
>>>> variables. I've tried using alternatives patching, but that performs worse
>>>> than loading from memory; I think due to code size bloat.
>>>>
>>>> Status
>>>> ======
>>>>
>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
>>>> enough to compile the kernel image itself with defconfig (and a few other
>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
>>>> or FVP. I'll happily do the rest of the work to enable all the extra
>>>> drivers, but wanted to get feedback on the shape of this effort first. If
>>>> anyone wants to do any testing, and has a must-have config, let me know and
>>>> I'll prioritize enabling it first.
>>>>
>>>> The series is arranged as follows:
>>>>
>>>> - patch 1: Add macros required for converting non-arch code to support
>>>> boot-time page size selection
>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
>>>> all non-arch code
>>>> - patches 37-38: Some arm64 tidy ups
>>>> - patch 39: Add macros required for converting arm64 code to support
>>>> boot-time page size selection
>>>> - patches 40-56: arm64 changes to support boot-time page size selection
>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>>>> selection
>>>>
>>>> Ideally, I'd like to get the basics merged (something like this series),
>>>> then incrementally improve it over a handful of kernel releases until we
>>>> can demonstrate that we have feature parity with the compile-time build and
>>>> no performance blockers. Once at that point, ideally the compile-time build
>>>> options would be removed and the code could be cleaned up further.
>>>>
>>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>>>> handling.
>>>>
>>>> Assuming people are amenable to the rough shape, how would I go about
>>>> getting the non-arch changes merged? Since they cover many subsystems, will
>>>> each piece need to go independently to each relevant maintainer or could it
>>>> all be merged together through the arm64 tree?
>>>>
>>>> Image Size
>>>> ==========
>>>>
>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>>>> kernel image on disk for base (before any changes applied), compile (with
>>>> changes, configured for compile-time page size) and boot (with changes,
>>>> configured for boot-time page size).
>>>>
>>>> You can see that the compile-16k and 64k configs are actually slightly
>>>> smaller than the baselines; that's due to optimizing some buffer sizes
>>>> which didn't need to depend on page size during the series. The boot-time
>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
>>>> scope to improve this to make it equal to compile-64k if required:
>>>>
>>>> | config | size/KB | diff/KB | diff/% |
>>>> |-------------|---------|---------|---------|
>>>> | base-4k | 54895 | 0 | 0.0% |
>>>> | base-16k | 55161 | 266 | 0.5% |
>>>> | base-64k | 56775 | 1880 | 3.4% |
>>>> | compile-4k | 54895 | 0 | 0.0% |
>>>> | compile-16k | 55097 | 202 | 0.4% |
>>>> | compile-64k | 56391 | 1496 | 2.7% |
>>>> | boot-4K | 57045 | 2150 | 3.9% |
>>>>
>>>> And below shows the size of the image in memory at run-time, separated for
>>>> text and data costs. The boot image has ~1% text cost; most likely due to
>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
>>>> instructions to load the values and do arithmetic. I believe we could
>>>> eventually get the data cost to match the cost for the compile image for
>>>> the chosen page size by freeing the ends of the static buffers not needed
>>>> for the selected page size:
>>>>
>>>> | | text | text | text | data | data | data |
>>>> | config | size/KB | diff/KB | diff/% | size/KB | diff/KB | diff/% |
>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>> | base-4k | 20561 | 0 | 0.0% | 14314 | 0 | 0.0% |
>>>> | base-16k | 20439 | -122 | -0.6% | 14625 | 311 | 2.2% |
>>>> | base-64k | 20435 | -126 | -0.6% | 15673 | 1359 | 9.5% |
>>>> | compile-4k | 20565 | 4 | 0.0% | 14315 | 1 | 0.0% |
>>>> | compile-16k | 20443 | -118 | -0.6% | 14517 | 204 | 1.4% |
>>>> | compile-64k | 20439 | -122 | -0.6% | 15134 | 820 | 5.7% |
>>>> | boot-4K | 20811 | 250 | 1.2% | 15287 | 973 | 6.8% |
>>>>
>>>> Functional Testing
>>>> ==================
>>>>
>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
>>>> most) without issue.
>>>>
>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
>>>> with no regressions observed vs the equivalent compile-time page size build
>>>> (although the mm-selftests have a few existing failures when run against
>>>> 16K and 64K kernels - those should really be investigated and fixed
>>>> independently).
>>>>
>>>> Test coverage is lacking for many of the drivers that I've touched, but in
>>>> many cases, I'm hoping the changes are simple enough that review might
>>>> suffice?
>>>>
>>>> Performance Testing
>>>> ===================
>>>>
>>>> I've run some limited performance benchmarks:
>>>>
>>>> First, a real-world benchmark that causes a lot of page table manipulation
>>>> (and therefore we would expect to see regression here if we are going to
>>>> see it anywhere); kernel compilation. It barely registers a change. Values
>>>> are times, so smaller is better. All relative to base-4k:
>>>>
>>>> | | kern | kern | user | user | real | real |
>>>> | config | mean | stdev | mean | stdev | mean | stdev |
>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0% | 0.3% |
>>>> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1% | 0.3% |
>>>> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2% | 0.2% |
>>>>
>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
>>>> per min, so bigger is better. All relative to base-4k:
>>>>
>>>> | config | mean | stdev |
>>>> |-------------|---------|---------|
>>>> | base-4k | 0.0% | 0.8% |
>>>> | compile-4k | 0.4% | 0.8% |
>>>> | boot-4k | 0.0% | 0.9% |
>>>>
>>>> Finally, I've run some microbenchmarks known to stress page table
>>>> manipulations (originally from David Hildenbrand). The fork test
>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
>>>> it. The fork test is known to be extremely sensitive to any changes that
>>>> cause instructions to be aligned differently in cachelines. When using this
>>>> test for other changes, I've seen double digit regressions for the
>>>> slightest thing, so 12% regression on this test is actually fairly good.
>>>> This likely represents the extreme worst case for regressions that will be
>>>> observed across other microbenchmarks (famous last words). Values are times,
>>>> so smaller is better. All relative to base-4k:
>>>>
>>>> |             |    fork |    fork |  munmap |  munmap |
>>>> | config      |    mean |   stdev |    mean |   stdev |
>>>> |-------------|---------|---------|---------|---------|
>>>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
>>>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
>>>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
>>>>
>>>> NOTE: The series applies on top of v6.11.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>> Ryan Roberts (57):
>>>> mm: Add macros ahead of supporting boot-time page size selection
>>>> vmlinux: Align to PAGE_SIZE_MAX
>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>>>> mm: Avoid split pmd ptl if pmd level is run-time folded
>>>> mm: Remove PAGE_SIZE compile-time constant assumption
>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>>>> fs: Remove PAGE_SIZE compile-time constant assumption
>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>>>> fork: Permit boot-time THREAD_SIZE determination
>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>>>> bpf: Remove PAGE_SIZE compile-time constant assumption
>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>>>> perf: Remove PAGE_SIZE compile-time constant assumption
>>>> kvm: Remove PAGE_SIZE compile-time constant assumption
>>>> trace: Remove PAGE_SIZE compile-time constant assumption
>>>> crash: Remove PAGE_SIZE compile-time constant assumption
>>>> crypto: Remove PAGE_SIZE compile-time constant assumption
>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>>>> sound: Remove PAGE_SIZE compile-time constant assumption
>>>> net: Remove PAGE_SIZE compile-time constant assumption
>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>>>> edac: Remove PAGE_SIZE compile-time constant assumption
>>>> optee: Remove PAGE_SIZE compile-time constant assumption
>>>> random: Remove PAGE_SIZE compile-time constant assumption
>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>>>> virtio: Remove PAGE_SIZE compile-time constant assumption
>>>> xen: Remove PAGE_SIZE compile-time constant assumption
>>>> arm64: Fix macros to work in C code in addition to the linker script
>>>> arm64: Track early pgtable allocation limit
>>>> arm64: Introduce macros required for boot-time page selection
>>>> arm64: Refactor early pgtable size calculation macros
>>>> arm64: Pass desired page size on command line
>>>> arm64: Divorce early init from PAGE_SIZE
>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>>>> arm64: Align sections to PAGE_SIZE_MAX
>>>> arm64: Rework trampoline rodata mapping
>>>> arm64: Generalize fixmap for boot-time page size
>>>> arm64: Statically allocate and align for worst-case page size
>>>> arm64: Convert switch to if for non-const comparison values
>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>>>> arm64: Remove PAGE_SZ asm-offset
>>>> arm64: Introduce cpu features for page sizes
>>>> arm64: Remove PAGE_SIZE from assembly code
>>>> arm64: Runtime-fold pmd level
>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>>>> arm64: TRAMP_VALIAS is no longer compile-time constant
>>>> arm64: Determine THREAD_SIZE at boot-time
>>>> arm64: Enable boot-time page size selection
>>>>
>>>> arch/alpha/include/asm/page.h | 1 +
>>>> arch/arc/include/asm/page.h | 1 +
>>>> arch/arm/include/asm/page.h | 1 +
>>>> arch/arm64/Kconfig | 26 ++-
>>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>>>> arch/arm64/include/asm/efi.h | 2 +-
>>>> arch/arm64/include/asm/fixmap.h | 28 ++-
>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>>>> arch/arm64/include/asm/memory.h | 62 ++++--
>>>> arch/arm64/include/asm/page-def.h | 3 +-
>>>> arch/arm64/include/asm/pgalloc.h | 16 +-
>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>>>> arch/arm64/include/asm/processor.h | 10 +-
>>>> arch/arm64/include/asm/sections.h | 1 +
>>>> arch/arm64/include/asm/smp.h | 1 +
>>>> arch/arm64/include/asm/sparsemem.h | 15 +-
>>>> arch/arm64/include/asm/sysreg.h | 54 +++--
>>>> arch/arm64/include/asm/tlb.h | 3 +
>>>> arch/arm64/kernel/asm-offsets.c | 4 +-
>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>>>> arch/arm64/kernel/efi.c | 2 +-
>>>> arch/arm64/kernel/entry.S | 60 +++++-
>>>> arch/arm64/kernel/head.S | 46 +++-
>>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>>>> arch/arm64/kernel/image-vars.h | 14 ++
>>>> arch/arm64/kernel/image.h | 4 +
>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>>>> arch/arm64/kernel/vdso.c | 7 +-
>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>>>> arch/arm64/kvm/arm.c | 10 +
>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>>>> arch/arm64/kvm/mmu.c | 39 ++--
>>>> arch/arm64/lib/clear_page.S | 7 +-
>>>> arch/arm64/lib/copy_page.S | 33 ++-
>>>> arch/arm64/lib/mte.S | 27 ++-
>>>> arch/arm64/mm/Makefile | 1 +
>>>> arch/arm64/mm/fixmap.c | 38 ++--
>>>> arch/arm64/mm/hugetlbpage.c | 40 +---
>>>> arch/arm64/mm/init.c | 26 +--
>>>> arch/arm64/mm/kasan_init.c | 8 +-
>>>> arch/arm64/mm/mmu.c | 53 +++--
>>>> arch/arm64/mm/pgd.c | 12 +-
>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>>>> arch/arm64/mm/proc.S | 128 ++++++++---
>>>> arch/arm64/mm/ptdump.c | 3 +-
>>>> arch/arm64/tools/cpucaps | 3 +
>>>> arch/csky/include/asm/page.h | 3 +
>>>> arch/hexagon/include/asm/page.h | 2 +
>>>> arch/loongarch/include/asm/page.h | 2 +
>>>> arch/m68k/include/asm/page.h | 1 +
>>>> arch/microblaze/include/asm/page.h | 1 +
>>>> arch/mips/include/asm/page.h | 1 +
>>>> arch/nios2/include/asm/page.h | 2 +
>>>> arch/openrisc/include/asm/page.h | 1 +
>>>> arch/parisc/include/asm/page.h | 1 +
>>>> arch/powerpc/include/asm/page.h | 2 +
>>>> arch/riscv/include/asm/page.h | 1 +
>>>> arch/s390/include/asm/page.h | 1 +
>>>> arch/sh/include/asm/page.h | 1 +
>>>> arch/sparc/include/asm/page.h | 3 +
>>>> arch/um/include/asm/page.h | 2 +
>>>> arch/x86/include/asm/page_types.h | 2 +
>>>> arch/xtensa/include/asm/page.h | 1 +
>>>> crypto/lskcipher.c | 4 +-
>>>> drivers/ata/sata_sil24.c | 46 ++--
>>>> drivers/base/node.c | 6 +-
>>>> drivers/base/topology.c | 32 +--
>>>> drivers/block/virtio_blk.c | 2 +-
>>>> drivers/char/random.c | 4 +-
>>>> drivers/edac/edac_mc.h | 13 +-
>>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>>>> drivers/mtd/mtdswap.c | 4 +-
>>>> drivers/net/ethernet/freescale/fec.h | 3 +-
>>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>>>> drivers/tee/optee/call.c | 7 +-
>>>> drivers/tee/optee/smc_abi.c | 2 +-
>>>> drivers/virtio/virtio_balloon.c | 10 +-
>>>> drivers/xen/balloon.c | 11 +-
>>>> drivers/xen/biomerge.c | 12 +-
>>>> drivers/xen/privcmd.c | 2 +-
>>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>>>> drivers/xen/xlate_mmu.c | 6 +-
>>>> fs/binfmt_elf.c | 11 +-
>>>> fs/buffer.c | 2 +-
>>>> fs/coredump.c | 8 +-
>>>> fs/ext4/ext4.h | 36 ++--
>>>> fs/ext4/move_extent.c | 2 +-
>>>> fs/ext4/readpage.c | 2 +-
>>>> fs/fat/dir.c | 4 +-
>>>> fs/fat/fatent.c | 4 +-
>>>> fs/nfs/nfs42proc.c | 2 +-
>>>> fs/nfs/nfs42xattr.c | 2 +-
>>>> fs/nfs/nfs4proc.c | 2 +-
>>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>>>> include/asm-generic/vmlinux.lds.h | 38 ++--
>>>> include/linux/buffer_head.h | 1 +
>>>> include/linux/cpumask.h | 5 +
>>>> include/linux/linkage.h | 4 +-
>>>> include/linux/mm.h | 17 +-
>>>> include/linux/mm_types.h | 15 +-
>>>> include/linux/mm_types_task.h | 2 +-
>>>> include/linux/mmzone.h | 3 +-
>>>> include/linux/netlink.h | 6 +-
>>>> include/linux/percpu-defs.h | 4 +-
>>>> include/linux/perf_event.h | 2 +-
>>>> include/linux/sched.h | 4 +-
>>>> include/linux/slab.h | 7 +-
>>>> include/linux/stackdepot.h | 6 +-
>>>> include/linux/sunrpc/svc.h | 8 +-
>>>> include/linux/sunrpc/svc_rdma.h | 4 +-
>>>> include/linux/sunrpc/svcsock.h | 2 +-
>>>> include/linux/swap.h | 17 +-
>>>> include/linux/swapops.h | 6 +-
>>>> include/linux/thread_info.h | 10 +-
>>>> include/xen/page.h | 2 +
>>>> init/main.c | 7 +-
>>>> kernel/bpf/core.c | 9 +-
>>>> kernel/bpf/ringbuf.c | 54 ++---
>>>> kernel/cgroup/cgroup.c | 8 +-
>>>> kernel/crash_core.c | 2 +-
>>>> kernel/events/core.c | 2 +-
>>>> kernel/fork.c | 71 +++----
>>>> kernel/power/power.h | 2 +-
>>>> kernel/power/snapshot.c | 2 +-
>>>> kernel/power/swap.c | 129 +++++++++--
>>>> kernel/trace/fgraph.c | 2 +-
>>>> kernel/trace/trace.c | 2 +-
>>>> lib/stackdepot.c | 6 +-
>>>> mm/kasan/report.c | 3 +-
>>>> mm/memcontrol.c | 11 +-
>>>> mm/memory.c | 4 +-
>>>> mm/mmap.c | 2 +-
>>>> mm/page-writeback.c | 2 +-
>>>> mm/page_alloc.c | 31 +--
>>>> mm/slub.c | 2 +-
>>>> mm/sparse.c | 2 +-
>>>> mm/swapfile.c | 2 +-
>>>> mm/vmalloc.c | 7 +-
>>>> net/9p/trans_virtio.c | 4 +-
>>>> net/core/hotdata.c | 4 +-
>>>> net/core/skbuff.c | 4 +-
>>>> net/core/sysctl_net_core.c | 2 +-
>>>> net/sunrpc/cache.c | 3 +-
>>>> net/unix/af_unix.c | 2 +-
>>>> sound/soc/soc-utils.c | 4 +-
>>>> virt/kvm/kvm_main.c | 2 +-
>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>>>
>>>> --
>>>> 2.43.0
>>>
>>> This is a generally very exciting patch set! I'm looking forward to seeing it
>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>>>
>>> That said, I have a couple of questions:
>>>
>>> * Going forward, how would we handle drivers/modules that require a particular
>>> page size? For example, the Apple Silicon IOMMU driver code requires the
>>> kernel to operate in 16k page size mode, and it would need to be disabled in
>>> other page sizes.
>>
>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
>> unsupported page size is in use. Do you see any issue with that?
>>
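A probe-time check of the kind described above might look roughly like the following. This is only a userspace-style sketch, not code from the series: `example_probe()`, `EXAMPLE_REQUIRED_PAGE_SIZE` and the `page_size` parameter are hypothetical names, and a real driver would read the kernel's PAGE_SIZE inside its own probe callback instead of taking it as an argument.

```c
#include <errno.h>

/* Hypothetical: the only page size this imaginary device can work with. */
#define EXAMPLE_REQUIRED_PAGE_SIZE (16UL * 1024)

/*
 * Hypothetical probe routine. In a real driver, page_size would be the
 * kernel's PAGE_SIZE, which is only known at boot with this series.
 * Returns 0 on success, or -ENODEV if the page size is unsupported.
 */
int example_probe(unsigned long page_size)
{
	if (page_size != EXAMPLE_REQUIRED_PAGE_SIZE)
		return -ENODEV;

	/* ... device setup would go here ... */
	return 0;
}
```

With this shape, a kernel booted with 4K or 64K pages simply leaves the device unbound instead of failing at build time.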
>>>
>>> * How would we handle an invalid selection at boot?
>>
>> What do you mean by invalid here? The current policy validates that the
>> requested page size is supported by the HW by checking mmfr0. If no page size is
>> passed on the command line, or the passed value is not supported by the HW, then
>> we default to the largest page size supported by the HW (so for Apple
>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
>> may be better to change that policy to use the smallest page size in this case;
>> 4k is the safer bet for compat and will waste much less memory than 64k.
>>
>>> Can we program in a
>>> fallback when the "wrong" mode is selected for a chip or something similar?
>>
>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
>> Silicon? The trouble is that we need to select the page size very early in
>> boot, before start_kernel() is called, so we really only have generic arch code
>> and the command line with which to make the decision.
>
> Yes... I think a build-time CONFIG for default page size, which can be
> overridden by a karg makes sense... Even on platforms like Apple
> Silicon you may want to test very specific things in 4k by overriding
> with a karg.
Ahh, yes, that would certainly work. I'll work it into the next version.
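
One possible shape for such a policy, as a standalone sketch under stated assumptions. None of these names exist in the series: the `EX_HW_*` mask stands in for whatever the mmfr0 check reports, `karg` is the size parsed from the command line (0 if absent), and `build_default` is the proposed build-time CONFIG value. The final fallback to the smallest supported size follows the suggestion made earlier in this thread and is also an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bitmask describing which page sizes the HW supports. */
#define EX_HW_4K   (1u << 0)
#define EX_HW_16K  (1u << 1)
#define EX_HW_64K  (1u << 2)

/* Candidate sizes in ascending order, paired with their mask bits. */
static const struct {
	unsigned long size;
	unsigned int bit;
} ex_sizes[] = {
	{ 4096UL,  EX_HW_4K  },
	{ 16384UL, EX_HW_16K },
	{ 65536UL, EX_HW_64K },
};

#define EX_NR_SIZES (sizeof(ex_sizes) / sizeof(ex_sizes[0]))

/* True if 'size' is a known page size that 'hw_mask' marks as supported. */
static bool ex_hw_supports(unsigned int hw_mask, unsigned long size)
{
	for (size_t i = 0; i < EX_NR_SIZES; i++)
		if (ex_sizes[i].size == size)
			return (hw_mask & ex_sizes[i].bit) != 0;
	return false;
}

/*
 * Pick a page size in priority order:
 *   1. the command-line value, if given and supported by the HW;
 *   2. the build-time default, if supported by the HW;
 *   3. the smallest size the HW supports.
 * Returns 0 if the mask contains no known page size.
 */
unsigned long ex_select_page_size(unsigned int hw_mask,
				  unsigned long karg,
				  unsigned long build_default)
{
	if (karg && ex_hw_supports(hw_mask, karg))
		return karg;
	if (ex_hw_supports(hw_mask, build_default))
		return build_default;
	for (size_t i = 0; i < EX_NR_SIZES; i++)
		if (hw_mask & ex_sizes[i].bit)
			return ex_sizes[i].size;
	return 0;
}
```

For HW that supports only 4K and 16K, a 64K request with a 16K build-time default would resolve to 16K, while an explicit 4K request would still be honoured for testing.
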
>
> Like in downstream kernels like Fedora/RHEL/etc. I would expect the
> default would be 4k, but you could override with 16k, 64k, etc. with a
> karg.
>
>>
>>>> Thanks again and best regards!
>>>
>>> (P.S.: Please add the asahi@ mailing list to the CC for future iterations of
>>> this patch set and tag both Hector and myself in as well. Thanks!)
>>
>> Will do!
>>
>>>
>>>
>>
>>
>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-21 11:51 ` Ryan Roberts
@ 2024-10-21 13:49 ` Neal Gompa
2024-10-21 15:01 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Neal Gompa @ 2024-10-21 13:49 UTC (permalink / raw)
To: Ryan Roberts
Cc: Eric Curtin, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 21/10/2024 12:32, Eric Curtin wrote:
> > On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 19/10/2024 16:47, Neal Gompa wrote:
> >>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
> >>>> Hi All,
> >>>>
> >>>> Patch bomb incoming... This covers many subsystems, so I've included a core
> >>>> set of people on the full series and additionally included maintainers on
> >>>> relevant patches. I haven't included those maintainers on this cover letter
> >>>> since the numbers were far too big for it to work. But I've included a link
> >>>> to this cover letter on each patch, so they can hopefully find their way
> >>>> here. For follow up submissions I'll break it up by subsystem, but for now
> >>>> thought it was important to show the full picture.
> >>>>
> >>>> This RFC series implements support for boot-time page size selection within
> >>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
> >>>> date, page size has been selected at compile-time, meaning the size is
> >>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
> >>>> more prevalent this starts to present a problem for distributions.
> >>>> Boot-time page size selection enables the creation of a single kernel
> >>>> image, which can be told which page size to use on the kernel command line.
> >>>>
> >>>> Why is having an image-per-page size problematic?
> >>>> =================================================
> >>>>
> >>>> Many traditional distros are now supporting both 4K and 64K. And this means
> >>>> managing 2 kernel packages, along with drivers for each. For some, it means
> >>>> multiple installer flavours and multiple ISOs. All of this adds up to a
> >>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
> >>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
> >>>> kernel is painful, and the extra flash space required for both kernel
> >>>> images and the duplicated modules has been problematic. Boot-time page size
> >>>> selection solves all of this.
> >>>>
> >>>> Additionally, in starting to think about the longer term deployment story
> >>>> for D128 page tables, which Arm architecture now supports, a lot of the
> >>>> same problems need to be solved, so this work sets us up nicely for that.
> >>>>
> >>>> So what's the down side?
> >>>> ========================
> >>>>
> >>>> Well, nothing's free: various static allocations in the kernel image must be
> >>>> sized for the worst case (largest supported page size), so image size is in
> >>>> line with the size of a 64K compile-time image. So if you're interested in 4K
> >>>> or 16K, there is a slight increase to the image size. But I expect that
> >>>> problem goes away if you're compressing the image - it's just some extra
> >>>> zeros. At boot-time, I expect we could free the unused static storage once
> >>>> we know the page size - although that would be a follow up enhancement.
> >>>>
> >>>> And then there is performance. Since PAGE_SIZE and friends are no longer
> >>>> compile-time constants, we must look up their values and do arithmetic at
> >>>> runtime instead of compile-time. My early perf testing suggests this is
> >>>> imperceptible for real-world workloads, and has only a small impact on
> >>>> microbenchmarks - more on this below.
> >>>>
> >>>> Approach
> >>>> ========
> >>>>
> >>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> >>>> friends are compile-time constant, but in a way that allows the compiler to
> >>>> perform the same optimizations as was previously being done if they do turn
> >>>> out to be compile-time constant. Where constants are required, we use
> >>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
> >>>> description of all the classes of problems to solve.
> >>>>
> >>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> >>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
> >>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> >>>> Kconfig, which is an alternative to selecting a compile-time page size.
> >>>>
> >>>> When boot-time page size is active, the arch pgtable geometry macro
> >>>> definitions resolve to something that can be configured at boot. The arm64
> >>>> implementation in this series mainly uses global, __ro_after_init
> >>>> variables. I've tried using alternatives patching, but that performs worse
> >>>> than loading from memory; I think due to code size bloat.
> >>>>
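The pattern described in the Approach section above can be illustrated with a small standalone sketch. It is not taken from the patches: every `EX_`/`ex_` name is hypothetical, and the plain static variable only stands in for the `__ro_after_init` globals the cover letter mentions. Static storage is sized by the MAX limit, while alignment arithmetic uses the value chosen at boot.

```c
#include <stddef.h>

/* Hypothetical compile-time limits, mirroring PAGE_SIZE_MIN/PAGE_SIZE_MAX. */
#define EX_PAGE_SHIFT_MIN 12
#define EX_PAGE_SHIFT_MAX 16
#define EX_PAGE_SIZE_MIN  (1UL << EX_PAGE_SHIFT_MIN)
#define EX_PAGE_SIZE_MAX  (1UL << EX_PAGE_SHIFT_MAX)

/* Runtime value, set once early in boot; defaults to 4K here. */
static unsigned int ex_page_shift = EX_PAGE_SHIFT_MIN;
#define EX_PAGE_SIZE (1UL << ex_page_shift)

/* Static storage cannot depend on a runtime value, so size it for 64K. */
static unsigned char ex_buf[EX_PAGE_SIZE_MAX];

/* Select 4K, 16K or 64K pages; returns -1 for any other shift. */
int ex_set_page_shift(unsigned int shift)
{
	if (shift != 12 && shift != 14 && shift != 16)
		return -1;
	ex_page_shift = shift;
	return 0;
}

/* Hand out the buffer; only the first EX_PAGE_SIZE bytes are in use. */
unsigned char *ex_get_buf(size_t *usable)
{
	*usable = EX_PAGE_SIZE;
	return ex_buf;
}

/* Round addr up to the next page boundary, computed at runtime. */
unsigned long ex_page_align(unsigned long addr)
{
	return (addr + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1);
}
```

When an arch does not opt in, the MIN and MAX limits and the page size would all be the same constant, so the compiler can fold the arithmetic as before.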
> >>>> Status
> >>>> ======
> >>>>
> >>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
> >>>> enough to compile the kernel image itself with defconfig (and a few other
> >>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
> >>>> or FVP. I'll happily do the rest of the work to enable all the extra
> >>>> drivers, but wanted to get feedback on the shape of this effort first. If
> >>>> anyone wants to do any testing, and has a must-have config, let me know and
> >>>> I'll prioritize enabling it first.
> >>>>
> >>>> The series is arranged as follows:
> >>>>
> >>>> - patch 1: Add macros required for converting non-arch code to support
> >>>>   boot-time page size selection
> >>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
> >>>>   all non-arch code
> >>>> - patches 37-38: Some arm64 tidy ups
> >>>> - patch 39: Add macros required for converting arm64 code to support
> >>>>   boot-time page size selection
> >>>> - patches 40-56: arm64 changes to support boot-time page size selection
> >>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> >>>>   selection
> >>>>
> >>>> Ideally, I'd like to get the basics merged (something like this series),
> >>>> then incrementally improve it over a handful of kernel releases until we
> >>>> can demonstrate that we have feature parity with the compile-time build and
> >>>> no performance blockers. Once at that point, ideally the compile-time build
> >>>> options would be removed and the code could be cleaned up further.
> >>>>
> >>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
> >>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> >>>> handling.
> >>>>
> >>>> Assuming people are amenable to the rough shape, how would I go about
> >>>> getting the non-arch changes merged? Since they cover many subsystems, will
> >>>> each piece need to go independently to each relevant maintainer or could it
> >>>> all be merged together through the arm64 tree?
> >>>>
> >>>> Image Size
> >>>> ==========
> >>>>
> >>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> >>>> kernel image on disk for base (before any changes applied), compile (with
> >>>> changes, configured for compile-time page size) and boot (with changes,
> >>>> configured for boot-time page size).
> >>>>
> >>>> You can see that the compile-16k and 64k configs are actually slightly
> >>>> smaller than the baselines; that's due to optimizing some buffer sizes
> >>>> which didn't need to depend on page size during the series. The boot-time
> >>>> image is ~1% bigger than the 64k compile-time image. I believe there is
> >>>> scope to improve this to make it equal to compile-64k if required:
> >>>>
> >>>> | config      | size/KB | diff/KB |  diff/% |
> >>>> |-------------|---------|---------|---------|
> >>>> | base-4k     |   54895 |       0 |    0.0% |
> >>>> | base-16k    |   55161 |     266 |    0.5% |
> >>>> | base-64k    |   56775 |    1880 |    3.4% |
> >>>> | compile-4k  |   54895 |       0 |    0.0% |
> >>>> | compile-16k |   55097 |     202 |    0.4% |
> >>>> | compile-64k |   56391 |    1496 |    2.7% |
> >>>> | boot-4K     |   57045 |    2150 |    3.9% |
> >>>> And below shows the size of the image in memory at run-time, separated for
> >>>> text and data costs. The boot image has ~1% text cost; most likely due to
> >>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
> >>>> instructions to load the values and do arithmetic. I believe we could
> >>>> eventually get the data cost to match the cost for the compile image for
> >>>> the chosen page size by freeing the ends of the static buffers not needed
> >>>> for the selected page size:
> >>>>
> >>>> |             |    text |    text |    text |    data |    data |    data |
> >>>> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
> >>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
> >>>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
> >>>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
> >>>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
> >>>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
> >>>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
> >>>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
> >>>>
> >>>> Functional Testing
> >>>> ==================
> >>>>
> >>>> I've build-tested defconfig for all arches supported by tuxmake (which is
> >>>> most) without issue.
> >>>>
> >>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
> >>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
> >>>> with no regressions observed vs the equivalent compile-time page size build
> >>>> (although the mm-selftests have a few existing failures when run against
> >>>> 16K and 64K kernels - those should really be investigated and fixed
> >>>> independently).
> >>>>
> >>>> Test coverage is lacking for many of the drivers that I've touched, but in
> >>>> many cases, I'm hoping the changes are simple enough that review might
> >>>> suffice?
> >>>>
> >>>> Performance Testing
> >>>> ===================
> >>>>
> >>>> I've run some limited performance benchmarks:
> >>>>
> >>>> First, a real-world benchmark that causes a lot of page table manipulation
> >>>> (and therefore we would expect to see regression here if we are going to
> >>>> see it anywhere); kernel compilation. It barely registers a change. Values
> >>>> are times, so smaller is better. All relative to base-4k:
> >>>>
> >>>> |             |    kern |    kern |    user |    user |    real |    real |
> >>>> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
> >>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
> >>>> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
> >>>> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
> >>>>
> >>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
> >>>> per min, so bigger is better. All relative to base-4k:
> >>>>
> >>>> | config      |    mean |   stdev |
> >>>> |-------------|---------|---------|
> >>>> | base-4k     |    0.0% |    0.8% |
> >>>> | compile-4k  |    0.4% |    0.8% |
> >>>> | boot-4k     |    0.0% |    0.9% |
> >>>>
> >>>> Finally, I've run some microbenchmarks known to stress page table
> >>>> manipulations (originally from David Hildenbrand). The fork test
> >>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
> >>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
> >>>> it. The fork test is known to be extremely sensitive to any changes that
> >>>> cause instructions to be aligned differently in cachelines. When using this
> >>>> test for other changes, I've seen double digit regressions for the
> >>>> slightest thing, so 12% regression on this test is actually fairly good.
> >>>> This likely represents the extreme worst case for regressions that will be
> >>>> observed across other microbenchmarks (famous last words). Values are times,
> >>>> so smaller is better. All relative to base-4k:
> >>>>
> >>>> |             |    fork |    fork |  munmap |  munmap |
> >>>> | config      |    mean |   stdev |    mean |   stdev |
> >>>> |-------------|---------|---------|---------|---------|
> >>>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
> >>>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
> >>>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
> >>>>
> >>>> NOTE: The series applies on top of v6.11.
> >>>>
> >>>> Thanks,
> >>>> Ryan
> >>>>
> >>>>
> >>>> Ryan Roberts (57):
> >>>> mm: Add macros ahead of supporting boot-time page size selection
> >>>> vmlinux: Align to PAGE_SIZE_MAX
> >>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> >>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> >>>> mm: Avoid split pmd ptl if pmd level is run-time folded
> >>>> mm: Remove PAGE_SIZE compile-time constant assumption
> >>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> >>>> fs: Remove PAGE_SIZE compile-time constant assumption
> >>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> >>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> >>>> fork: Permit boot-time THREAD_SIZE determination
> >>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
> >>>> bpf: Remove PAGE_SIZE compile-time constant assumption
> >>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> >>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> >>>> perf: Remove PAGE_SIZE compile-time constant assumption
> >>>> kvm: Remove PAGE_SIZE compile-time constant assumption
> >>>> trace: Remove PAGE_SIZE compile-time constant assumption
> >>>> crash: Remove PAGE_SIZE compile-time constant assumption
> >>>> crypto: Remove PAGE_SIZE compile-time constant assumption
> >>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> >>>> sound: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> >>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
> >>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> >>>> edac: Remove PAGE_SIZE compile-time constant assumption
> >>>> optee: Remove PAGE_SIZE compile-time constant assumption
> >>>> random: Remove PAGE_SIZE compile-time constant assumption
> >>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> >>>> virtio: Remove PAGE_SIZE compile-time constant assumption
> >>>> xen: Remove PAGE_SIZE compile-time constant assumption
> >>>> arm64: Fix macros to work in C code in addition to the linker script
> >>>> arm64: Track early pgtable allocation limit
> >>>> arm64: Introduce macros required for boot-time page selection
> >>>> arm64: Refactor early pgtable size calculation macros
> >>>> arm64: Pass desired page size on command line
> >>>> arm64: Divorce early init from PAGE_SIZE
> >>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> >>>> arm64: Align sections to PAGE_SIZE_MAX
> >>>> arm64: Rework trampoline rodata mapping
> >>>> arm64: Generalize fixmap for boot-time page size
> >>>> arm64: Statically allocate and align for worst-case page size
> >>>> arm64: Convert switch to if for non-const comparison values
> >>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> >>>> arm64: Remove PAGE_SZ asm-offset
> >>>> arm64: Introduce cpu features for page sizes
> >>>> arm64: Remove PAGE_SIZE from assembly code
> >>>> arm64: Runtime-fold pmd level
> >>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> >>>> arm64: TRAMP_VALIAS is no longer compile-time constant
> >>>> arm64: Determine THREAD_SIZE at boot-time
> >>>> arm64: Enable boot-time page size selection
> >>>>
> >>>> arch/alpha/include/asm/page.h | 1 +
> >>>> arch/arc/include/asm/page.h | 1 +
> >>>> arch/arm/include/asm/page.h | 1 +
> >>>> arch/arm64/Kconfig | 26 ++-
> >>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
> >>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
> >>>> arch/arm64/include/asm/efi.h | 2 +-
> >>>> arch/arm64/include/asm/fixmap.h | 28 ++-
> >>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> >>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
> >>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
> >>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> >>>> arch/arm64/include/asm/memory.h | 62 ++++--
> >>>> arch/arm64/include/asm/page-def.h | 3 +-
> >>>> arch/arm64/include/asm/pgalloc.h | 16 +-
> >>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> >>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> >>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> >>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> >>>> arch/arm64/include/asm/processor.h | 10 +-
> >>>> arch/arm64/include/asm/sections.h | 1 +
> >>>> arch/arm64/include/asm/smp.h | 1 +
> >>>> arch/arm64/include/asm/sparsemem.h | 15 +-
> >>>> arch/arm64/include/asm/sysreg.h | 54 +++--
> >>>> arch/arm64/include/asm/tlb.h | 3 +
> >>>> arch/arm64/kernel/asm-offsets.c | 4 +-
> >>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> >>>> arch/arm64/kernel/efi.c | 2 +-
> >>>> arch/arm64/kernel/entry.S | 60 +++++-
> >>>> arch/arm64/kernel/head.S | 46 +++-
> >>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
> >>>> arch/arm64/kernel/image-vars.h | 14 ++
> >>>> arch/arm64/kernel/image.h | 4 +
> >>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> >>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> >>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> >>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
> >>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
> >>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
> >>>> arch/arm64/kernel/vdso.c | 7 +-
> >>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> >>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> >>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> >>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> >>>> arch/arm64/kvm/arm.c | 10 +
> >>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> >>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> >>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> >>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> >>>> arch/arm64/kvm/mmu.c | 39 ++--
> >>>> arch/arm64/lib/clear_page.S | 7 +-
> >>>> arch/arm64/lib/copy_page.S | 33 ++-
> >>>> arch/arm64/lib/mte.S | 27 ++-
> >>>> arch/arm64/mm/Makefile | 1 +
> >>>> arch/arm64/mm/fixmap.c | 38 ++--
> >>>> arch/arm64/mm/hugetlbpage.c | 40 +---
> >>>> arch/arm64/mm/init.c | 26 +--
> >>>> arch/arm64/mm/kasan_init.c | 8 +-
> >>>> arch/arm64/mm/mmu.c | 53 +++--
> >>>> arch/arm64/mm/pgd.c | 12 +-
> >>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
> >>>> arch/arm64/mm/proc.S | 128 ++++++++---
> >>>> arch/arm64/mm/ptdump.c | 3 +-
> >>>> arch/arm64/tools/cpucaps | 3 +
> >>>> arch/csky/include/asm/page.h | 3 +
> >>>> arch/hexagon/include/asm/page.h | 2 +
> >>>> arch/loongarch/include/asm/page.h | 2 +
> >>>> arch/m68k/include/asm/page.h | 1 +
> >>>> arch/microblaze/include/asm/page.h | 1 +
> >>>> arch/mips/include/asm/page.h | 1 +
> >>>> arch/nios2/include/asm/page.h | 2 +
> >>>> arch/openrisc/include/asm/page.h | 1 +
> >>>> arch/parisc/include/asm/page.h | 1 +
> >>>> arch/powerpc/include/asm/page.h | 2 +
> >>>> arch/riscv/include/asm/page.h | 1 +
> >>>> arch/s390/include/asm/page.h | 1 +
> >>>> arch/sh/include/asm/page.h | 1 +
> >>>> arch/sparc/include/asm/page.h | 3 +
> >>>> arch/um/include/asm/page.h | 2 +
> >>>> arch/x86/include/asm/page_types.h | 2 +
> >>>> arch/xtensa/include/asm/page.h | 1 +
> >>>> crypto/lskcipher.c | 4 +-
> >>>> drivers/ata/sata_sil24.c | 46 ++--
> >>>> drivers/base/node.c | 6 +-
> >>>> drivers/base/topology.c | 32 +--
> >>>> drivers/block/virtio_blk.c | 2 +-
> >>>> drivers/char/random.c | 4 +-
> >>>> drivers/edac/edac_mc.h | 13 +-
> >>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
> >>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> >>>> drivers/mtd/mtdswap.c | 4 +-
> >>>> drivers/net/ethernet/freescale/fec.h | 3 +-
> >>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> >>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> >>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> >>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> >>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> >>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> >>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> >>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
> >>>> drivers/tee/optee/call.c | 7 +-
> >>>> drivers/tee/optee/smc_abi.c | 2 +-
> >>>> drivers/virtio/virtio_balloon.c | 10 +-
> >>>> drivers/xen/balloon.c | 11 +-
> >>>> drivers/xen/biomerge.c | 12 +-
> >>>> drivers/xen/privcmd.c | 2 +-
> >>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
> >>>> drivers/xen/xlate_mmu.c | 6 +-
> >>>> fs/binfmt_elf.c | 11 +-
> >>>> fs/buffer.c | 2 +-
> >>>> fs/coredump.c | 8 +-
> >>>> fs/ext4/ext4.h | 36 ++--
> >>>> fs/ext4/move_extent.c | 2 +-
> >>>> fs/ext4/readpage.c | 2 +-
> >>>> fs/fat/dir.c | 4 +-
> >>>> fs/fat/fatent.c | 4 +-
> >>>> fs/nfs/nfs42proc.c | 2 +-
> >>>> fs/nfs/nfs42xattr.c | 2 +-
> >>>> fs/nfs/nfs4proc.c | 2 +-
> >>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
> >>>> include/asm-generic/vmlinux.lds.h | 38 ++--
> >>>> include/linux/buffer_head.h | 1 +
> >>>> include/linux/cpumask.h | 5 +
> >>>> include/linux/linkage.h | 4 +-
> >>>> include/linux/mm.h | 17 +-
> >>>> include/linux/mm_types.h | 15 +-
> >>>> include/linux/mm_types_task.h | 2 +-
> >>>> include/linux/mmzone.h | 3 +-
> >>>> include/linux/netlink.h | 6 +-
> >>>> include/linux/percpu-defs.h | 4 +-
> >>>> include/linux/perf_event.h | 2 +-
> >>>> include/linux/sched.h | 4 +-
> >>>> include/linux/slab.h | 7 +-
> >>>> include/linux/stackdepot.h | 6 +-
> >>>> include/linux/sunrpc/svc.h | 8 +-
> >>>> include/linux/sunrpc/svc_rdma.h | 4 +-
> >>>> include/linux/sunrpc/svcsock.h | 2 +-
> >>>> include/linux/swap.h | 17 +-
> >>>> include/linux/swapops.h | 6 +-
> >>>> include/linux/thread_info.h | 10 +-
> >>>> include/xen/page.h | 2 +
> >>>> init/main.c | 7 +-
> >>>> kernel/bpf/core.c | 9 +-
> >>>> kernel/bpf/ringbuf.c | 54 ++---
> >>>> kernel/cgroup/cgroup.c | 8 +-
> >>>> kernel/crash_core.c | 2 +-
> >>>> kernel/events/core.c | 2 +-
> >>>> kernel/fork.c | 71 +++----
> >>>> kernel/power/power.h | 2 +-
> >>>> kernel/power/snapshot.c | 2 +-
> >>>> kernel/power/swap.c | 129 +++++++++--
> >>>> kernel/trace/fgraph.c | 2 +-
> >>>> kernel/trace/trace.c | 2 +-
> >>>> lib/stackdepot.c | 6 +-
> >>>> mm/kasan/report.c | 3 +-
> >>>> mm/memcontrol.c | 11 +-
> >>>> mm/memory.c | 4 +-
> >>>> mm/mmap.c | 2 +-
> >>>> mm/page-writeback.c | 2 +-
> >>>> mm/page_alloc.c | 31 +--
> >>>> mm/slub.c | 2 +-
> >>>> mm/sparse.c | 2 +-
> >>>> mm/swapfile.c | 2 +-
> >>>> mm/vmalloc.c | 7 +-
> >>>> net/9p/trans_virtio.c | 4 +-
> >>>> net/core/hotdata.c | 4 +-
> >>>> net/core/skbuff.c | 4 +-
> >>>> net/core/sysctl_net_core.c | 2 +-
> >>>> net/sunrpc/cache.c | 3 +-
> >>>> net/unix/af_unix.c | 2 +-
> >>>> sound/soc/soc-utils.c | 4 +-
> >>>> virt/kvm/kvm_main.c | 2 +-
> >>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
> >>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> >>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> >>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> >>>> create mode 100644 include/asm-generic/pgtable-geometry.h
> >>>>
> >>>> --
> >>>> 2.43.0
> >>>
> >>> This is a generally very exciting patch set! I'm looking forward to seeing it
> >>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
> >>>
> >>> That said, I have a couple of questions:
> >>>
> >>> * Going forward, how would we handle drivers/modules that require a particular
> >>> page size? For example, the Apple Silicon IOMMU driver code requires the
> >>> kernel to operate in 16k page size mode, and it would need to be disabled in
> >>> other page sizes.
> >>
> >> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
> >> unsupported page size is in use. Do you see any issue with that?
> >>
> >>>
> >>> * How would we handle an invalid selection at boot?
> >>
> >> What do you mean by invalid here? The current policy validates that the
> >> requested page size is supported by the HW by checking mmfr0. If no page size is
> >> passed on the command line, or the passed value is not supported by the HW, then
> >> we default to the largest page size supported by the HW (so for Apple
> >> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
> >> may be better to change that policy to use the smallest page size in this case;
> >> 4k is the safer bet for compat and will waste much less memory than 64k.
> >>
> >>> Can we program in a
> >>> fallback when the "wrong" mode is selected for a chip or something similar?
> >>
> >> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
> >> Silicon? The trouble is that we need to select the page size very early in
> >> boot, before start_kernel() is called, so we really only have generic arch code
> >> and the command line with which to make the decision.
> >
> > Yes... I think a build-time CONFIG for default page size, which can be
> > overridden by a karg makes sense... Even on platforms like Apple
> > Silicon you may want to test very specific things in 4k by overriding
> > with a karg.
>
> Ahh, yes, that would certainly work. I'll work it into the next version.
>
Could we maybe extend this with a table of SoC IDs that records which
modes are disabled (e.g. 64k on Apple Silicon) and which mode is
preferred when no arg is set (16k for Apple Silicon)? That way it'd
work something like this:
1. Table identification of 4/16/64 depending on identified SoC
2. Unidentified ones follow build-time default
3. karg forces a mode regardless
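
A rough sketch of that precedence, written as plain user-space C rather
than kernel code. Everything named here is hypothetical: the SoC ID
string, the table layout, pick_page_mode() and BUILD_DEFAULT_MODE (a
stand-in for a build-time CONFIG default) are made up for illustration.
It leaves out the disabled-mode list and the check that the HW actually
supports the chosen size.

```c
#include <stddef.h>
#include <string.h>

/* Page-size modes in KiB; MODE_UNSET means "nothing specified". */
enum page_mode { MODE_UNSET = 0, MODE_4K = 4, MODE_16K = 16, MODE_64K = 64 };

/* Hypothetical table entry: the preferred mode for one identified SoC. */
struct soc_page_pref {
	const char *soc_id;        /* made-up identifier, not a real compatible */
	enum page_mode preferred;  /* used when no karg is given */
};

static const struct soc_page_pref soc_table[] = {
	{ "example,apple-silicon", MODE_16K },
};

/* Stand-in for a build-time CONFIG default (assumption). */
#define BUILD_DEFAULT_MODE MODE_4K

static enum page_mode pick_page_mode(const char *soc_id, enum page_mode karg)
{
	size_t i;

	/* 3. A karg forces a mode regardless of the table. */
	if (karg != MODE_UNSET)
		return karg;

	/* 1. An identified SoC gets its preferred mode from the table. */
	for (i = 0; soc_id && i < sizeof(soc_table) / sizeof(soc_table[0]); i++) {
		if (strcmp(soc_table[i].soc_id, soc_id) == 0)
			return soc_table[i].preferred;
	}

	/* 2. Unidentified SoCs follow the build-time default. */
	return BUILD_DEFAULT_MODE;
}
```

With this ordering, the table only matters when nothing was passed on
the command line, so a 4k test boot on Apple Silicon would still be
possible via the karg.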
--
真実はいつも一つ!/ Always, there's only one truth!
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-21 13:49 ` Neal Gompa
@ 2024-10-21 15:01 ` Ryan Roberts
2024-10-22 9:33 ` Neal Gompa
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-21 15:01 UTC (permalink / raw)
To: Neal Gompa
Cc: Eric Curtin, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On 21/10/2024 14:49, Neal Gompa wrote:
> On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 21/10/2024 12:32, Eric Curtin wrote:
>>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 19/10/2024 16:47, Neal Gompa wrote:
>>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
>>>>>> set of people on the full series and additionally included maintainers on
>>>>>> relevant patches. I haven't included those maintainers on this cover letter
>>>>>> since the numbers were far too big for it to work. But I've included a link
>>>>>> to this cover letter on each patch, so they can hopefully find their way
>>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
>>>>>> thought it was important to show the full picture.
>>>>>>
>>>>>> This RFC series implements support for boot-time page size selection within
>>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
>>>>>> date, page size has been selected at compile-time, meaning the size is
>>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
>>>>>> more prevalent this starts to present a problem for distributions.
>>>>>> Boot-time page size selection enables the creation of a single kernel
>>>>>> image, which can be told which page size to use on the kernel command line.
>>>>>>
>>>>>> Why is having an image-per-page size problematic?
>>>>>> =================================================
>>>>>>
>>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
>>>>>> managing 2 kernel packages, along with drivers for each. For some, it means
>>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
>>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
>>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
>>>>>> kernel is painful, and the extra flash space required for both kernel
>>>>>> images and the duplicated modules has been problematic. Boot-time page size
>>>>>> selection solves all of this.
>>>>>>
>>>>>> Additionally, in starting to think about the longer term deployment story
>>>>>> for D128 page tables, which Arm architecture now supports, a lot of the
>>>>>> same problems need to be solved, so this work sets us up nicely for that.
>>>>>>
>>>>>> So what's the down side?
>>>>>> ========================
>>>>>>
>>>>>> Well, nothing's free; various static allocations in the kernel image must be
>>>>>> sized for the worst case (largest supported page size), so the image size is
>>>>>> in line with that of a 64K compile-time image. So if you're interested in 4K
>>>>>> or 16K, there is a slight increase to the image size. But I expect that
>>>>>> problem goes away if you're compressing the image - it's just some extra
>>>>>> zeros. At boot-time, I expect we could free the unused static storage once
>>>>>> we know the page size - although that would be a follow up enhancement.
>>>>>>
>>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
>>>>>> compile-time constants, we must look up their values and do arithmetic at
>>>>>> runtime instead of compile-time. My early perf testing suggests this is
>>>>>> imperceptible for real-world workloads, and has only a small impact on
>>>>>> microbenchmarks - more on this below.
>>>>>>
>>>>>> Approach
>>>>>> ========
>>>>>>
>>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>>>>>> friends are compile-time constant, but in a way that allows the compiler to
>>>>>> perform the same optimizations as were previously done if they do turn
>>>>>> out to be compile-time constant. Where constants are required, we use
>>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
>>>>>> description of all the classes of problems to solve.
>>>>>>
>>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
>>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
>>>>>>
>>>>>> When boot-time page size is active, the arch pgtable geometry macro
>>>>>> definitions resolve to something that can be configured at boot. The arm64
>>>>>> implementation in this series mainly uses global, __ro_after_init
>>>>>> variables. I've tried using alternatives patching, but that performs worse
>>>>>> than loading from memory; I think due to code size bloat.
>>>>>>
>>>>>> Status
>>>>>> ======
>>>>>>
>>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
>>>>>> enough to compile the kernel image itself with defconfig (and a few other
>>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
>>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
>>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
>>>>>> anyone wants to do any testing, and has a must-have config, let me know and
>>>>>> I'll prioritize enabling it first.
>>>>>>
>>>>>> The series is arranged as follows:
>>>>>>
>>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>>>   boot-time page size selection
>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
>>>>>>   all non-arch code
>>>>>> - patches 37-38: Some arm64 tidy ups
>>>>>> - patch 39: Add macros required for converting arm64 code to support
>>>>>>   boot-time page size selection
>>>>>> - patches 40-56: arm64 changes to support boot-time page size selection
>>>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>>>>>>   selection
>>>>>>
>>>>>> Ideally, I'd like to get the basics merged (something like this series),
>>>>>> then incrementally improve it over a handful of kernel releases until we
>>>>>> can demonstrate that we have feature parity with the compile-time build and
>>>>>> no performance blockers. Once at that point, ideally the compile-time build
>>>>>> options would be removed and the code could be cleaned up further.
>>>>>>
>>>>>> One of the bigger pieces that I'd propose to add as a follow up is to make
>>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>>>>>> handling.
>>>>>>
>>>>>> Assuming people are amenable to the rough shape, how would I go about
>>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
>>>>>> each piece need to go independently to each relevant maintainer or could it
>>>>>> all be merged together through the arm64 tree?
>>>>>>
>>>>>> Image Size
>>>>>> ==========
>>>>>>
>>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>>>>>> kernel image on disk for base (before any changes applied), compile (with
>>>>>> changes, configured for compile-time page size) and boot (with changes,
>>>>>> configured for boot-time page size).
>>>>>>
>>>>>> You can see that the compile-16k and compile-64k configs are actually slightly
>>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
>>>>>> which didn't need to depend on page size during the series. The boot-time
>>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
>>>>>> scope to improve this to make it equal to compile-64k if required:
>>>>>>
>>>>>> | config      | size/KB | diff/KB |  diff/% |
>>>>>> |-------------|---------|---------|---------|
>>>>>> | base-4k     |   54895 |       0 |    0.0% |
>>>>>> | base-16k    |   55161 |     266 |    0.5% |
>>>>>> | base-64k    |   56775 |    1880 |    3.4% |
>>>>>> | compile-4k  |   54895 |       0 |    0.0% |
>>>>>> | compile-16k |   55097 |     202 |    0.4% |
>>>>>> | compile-64k |   56391 |    1496 |    2.7% |
>>>>>> | boot-4K     |   57045 |    2150 |    3.9% |
>>>>>>
>>>>>> And below shows the size of the image in memory at run-time, separated for
>>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
>>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
>>>>>> instructions to load the values and do arithmetic. I believe we could
>>>>>> eventually get the data cost to match the cost for the compile image for
>>>>>> the chosen page size by freeing the ends of the static buffers not needed
>>>>>> for the selected page size:
>>>>>>
>>>>>> |             |    text |    text |    text |    data |    data |    data |
>>>>>> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
>>>>>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
>>>>>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
>>>>>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
>>>>>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
>>>>>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
>>>>>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
>>>>>>
>>>>>> Functional Testing
>>>>>> ==================
>>>>>>
>>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
>>>>>> most) without issue.
>>>>>>
>>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
>>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
>>>>>> with no regressions observed vs the equivalent compile-time page size build
>>>>>> (although the mm-selftests have a few existing failures when run against
>>>>>> 16K and 64K kernels - those should really be investigated and fixed
>>>>>> independently).
>>>>>>
>>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
>>>>>> many cases, I'm hoping the changes are simple enough that review might
>>>>>> suffice?
>>>>>>
>>>>>> Performance Testing
>>>>>> ===================
>>>>>>
>>>>>> I've run some limited performance benchmarks:
>>>>>>
>>>>>> First, a real-world benchmark that causes a lot of page table manipulation
>>>>>> (and therefore we would expect to see regression here if we are going to
>>>>>> see it anywhere); kernel compilation. It barely registers a change. Values
>>>>>> are times, so smaller is better. All relative to base-4k:
>>>>>>
>>>>>> |             |    kern |    kern |    user |    user |    real |    real |
>>>>>> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
>>>>>> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
>>>>>> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
>>>>>>
>>>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
>>>>>> per min, so bigger is better. All relative to base-4k:
>>>>>>
>>>>>> | config      |    mean |   stdev |
>>>>>> |-------------|---------|---------|
>>>>>> | base-4k     |    0.0% |    0.8% |
>>>>>> | compile-4k  |    0.4% |    0.8% |
>>>>>> | boot-4k     |    0.0% |    0.9% |
>>>>>>
>>>>>> Finally, I've run some microbenchmarks known to stress page table
>>>>>> manipulations (originally from David Hildenbrand). The fork test
>>>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
>>>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
>>>>>> it. The fork test is known to be extremely sensitive to any changes that
>>>>>> cause instructions to be aligned differently in cachelines. When using this
>>>>>> test for other changes, I've seen double digit regressions for the
>>>>>> slightest thing, so 12% regression on this test is actually fairly good.
>>>>>> This likely represents the extreme worst case for regressions that will be
>>>>>> observed across other microbenchmarks (famous last words). Values are
>>>>>> times, so smaller is better. All relative to base-4k:
>>>>>>
>>>>>> |             |    fork |    fork |  munmap |  munmap |
>>>>>> | config      |    mean |   stdev |    mean |   stdev |
>>>>>> |-------------|---------|---------|---------|---------|
>>>>>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
>>>>>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
>>>>>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
>>>>>>
>>>>>> NOTE: The series applies on top of v6.11.
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>> Ryan Roberts (57):
>>>>>> mm: Add macros ahead of supporting boot-time page size selection
>>>>>> vmlinux: Align to PAGE_SIZE_MAX
>>>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>>>>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>>>>>> mm: Avoid split pmd ptl if pmd level is run-time folded
>>>>>> mm: Remove PAGE_SIZE compile-time constant assumption
>>>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>>>>>> fs: Remove PAGE_SIZE compile-time constant assumption
>>>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>>>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>>>>>> fork: Permit boot-time THREAD_SIZE determination
>>>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>>>>>> bpf: Remove PAGE_SIZE compile-time constant assumption
>>>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>>>>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>>>>>> perf: Remove PAGE_SIZE compile-time constant assumption
>>>>>> kvm: Remove PAGE_SIZE compile-time constant assumption
>>>>>> trace: Remove PAGE_SIZE compile-time constant assumption
>>>>>> crash: Remove PAGE_SIZE compile-time constant assumption
>>>>>> crypto: Remove PAGE_SIZE compile-time constant assumption
>>>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>>>>>> sound: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>>>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>>>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>>>>>> edac: Remove PAGE_SIZE compile-time constant assumption
>>>>>> optee: Remove PAGE_SIZE compile-time constant assumption
>>>>>> random: Remove PAGE_SIZE compile-time constant assumption
>>>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>>>>>> virtio: Remove PAGE_SIZE compile-time constant assumption
>>>>>> xen: Remove PAGE_SIZE compile-time constant assumption
>>>>>> arm64: Fix macros to work in C code in addition to the linker script
>>>>>> arm64: Track early pgtable allocation limit
>>>>>> arm64: Introduce macros required for boot-time page selection
>>>>>> arm64: Refactor early pgtable size calculation macros
>>>>>> arm64: Pass desired page size on command line
>>>>>> arm64: Divorce early init from PAGE_SIZE
>>>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>>>>>> arm64: Align sections to PAGE_SIZE_MAX
>>>>>> arm64: Rework trampoline rodata mapping
>>>>>> arm64: Generalize fixmap for boot-time page size
>>>>>> arm64: Statically allocate and align for worst-case page size
>>>>>> arm64: Convert switch to if for non-const comparison values
>>>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>>>>>> arm64: Remove PAGE_SZ asm-offset
>>>>>> arm64: Introduce cpu features for page sizes
>>>>>> arm64: Remove PAGE_SIZE from assembly code
>>>>>> arm64: Runtime-fold pmd level
>>>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>>>>>> arm64: TRAMP_VALIAS is no longer compile-time constant
>>>>>> arm64: Determine THREAD_SIZE at boot-time
>>>>>> arm64: Enable boot-time page size selection
>>>>>>
>>>>>> arch/alpha/include/asm/page.h | 1 +
>>>>>> arch/arc/include/asm/page.h | 1 +
>>>>>> arch/arm/include/asm/page.h | 1 +
>>>>>> arch/arm64/Kconfig | 26 ++-
>>>>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>>>>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>>>>>> arch/arm64/include/asm/efi.h | 2 +-
>>>>>> arch/arm64/include/asm/fixmap.h | 28 ++-
>>>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>>>>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>>>>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>>>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>>>>>> arch/arm64/include/asm/memory.h | 62 ++++--
>>>>>> arch/arm64/include/asm/page-def.h | 3 +-
>>>>>> arch/arm64/include/asm/pgalloc.h | 16 +-
>>>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>>>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>>>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>>>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>>>>>> arch/arm64/include/asm/processor.h | 10 +-
>>>>>> arch/arm64/include/asm/sections.h | 1 +
>>>>>> arch/arm64/include/asm/smp.h | 1 +
>>>>>> arch/arm64/include/asm/sparsemem.h | 15 +-
>>>>>> arch/arm64/include/asm/sysreg.h | 54 +++--
>>>>>> arch/arm64/include/asm/tlb.h | 3 +
>>>>>> arch/arm64/kernel/asm-offsets.c | 4 +-
>>>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>>>>>> arch/arm64/kernel/efi.c | 2 +-
>>>>>> arch/arm64/kernel/entry.S | 60 +++++-
>>>>>> arch/arm64/kernel/head.S | 46 +++-
>>>>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>>>>>> arch/arm64/kernel/image-vars.h | 14 ++
>>>>>> arch/arm64/kernel/image.h | 4 +
>>>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>>>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>>>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>>>>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>>>>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>>>>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>>>>>> arch/arm64/kernel/vdso.c | 7 +-
>>>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>>>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>>>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>>>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>>>>>> arch/arm64/kvm/arm.c | 10 +
>>>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>>>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>>>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>>>>>> arch/arm64/kvm/mmu.c | 39 ++--
>>>>>> arch/arm64/lib/clear_page.S | 7 +-
>>>>>> arch/arm64/lib/copy_page.S | 33 ++-
>>>>>> arch/arm64/lib/mte.S | 27 ++-
>>>>>> arch/arm64/mm/Makefile | 1 +
>>>>>> arch/arm64/mm/fixmap.c | 38 ++--
>>>>>> arch/arm64/mm/hugetlbpage.c | 40 +---
>>>>>> arch/arm64/mm/init.c | 26 +--
>>>>>> arch/arm64/mm/kasan_init.c | 8 +-
>>>>>> arch/arm64/mm/mmu.c | 53 +++--
>>>>>> arch/arm64/mm/pgd.c | 12 +-
>>>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>>>>>> arch/arm64/mm/proc.S | 128 ++++++++---
>>>>>> arch/arm64/mm/ptdump.c | 3 +-
>>>>>> arch/arm64/tools/cpucaps | 3 +
>>>>>> arch/csky/include/asm/page.h | 3 +
>>>>>> arch/hexagon/include/asm/page.h | 2 +
>>>>>> arch/loongarch/include/asm/page.h | 2 +
>>>>>> arch/m68k/include/asm/page.h | 1 +
>>>>>> arch/microblaze/include/asm/page.h | 1 +
>>>>>> arch/mips/include/asm/page.h | 1 +
>>>>>> arch/nios2/include/asm/page.h | 2 +
>>>>>> arch/openrisc/include/asm/page.h | 1 +
>>>>>> arch/parisc/include/asm/page.h | 1 +
>>>>>> arch/powerpc/include/asm/page.h | 2 +
>>>>>> arch/riscv/include/asm/page.h | 1 +
>>>>>> arch/s390/include/asm/page.h | 1 +
>>>>>> arch/sh/include/asm/page.h | 1 +
>>>>>> arch/sparc/include/asm/page.h | 3 +
>>>>>> arch/um/include/asm/page.h | 2 +
>>>>>> arch/x86/include/asm/page_types.h | 2 +
>>>>>> arch/xtensa/include/asm/page.h | 1 +
>>>>>> crypto/lskcipher.c | 4 +-
>>>>>> drivers/ata/sata_sil24.c | 46 ++--
>>>>>> drivers/base/node.c | 6 +-
>>>>>> drivers/base/topology.c | 32 +--
>>>>>> drivers/block/virtio_blk.c | 2 +-
>>>>>> drivers/char/random.c | 4 +-
>>>>>> drivers/edac/edac_mc.h | 13 +-
>>>>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>>>>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>>>>>> drivers/mtd/mtdswap.c | 4 +-
>>>>>> drivers/net/ethernet/freescale/fec.h | 3 +-
>>>>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>>>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>>>>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>>>>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>>>>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>>>>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>>>>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>>>>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>>>>>> drivers/tee/optee/call.c | 7 +-
>>>>>> drivers/tee/optee/smc_abi.c | 2 +-
>>>>>> drivers/virtio/virtio_balloon.c | 10 +-
>>>>>> drivers/xen/balloon.c | 11 +-
>>>>>> drivers/xen/biomerge.c | 12 +-
>>>>>> drivers/xen/privcmd.c | 2 +-
>>>>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>>>>>> drivers/xen/xlate_mmu.c | 6 +-
>>>>>> fs/binfmt_elf.c | 11 +-
>>>>>> fs/buffer.c | 2 +-
>>>>>> fs/coredump.c | 8 +-
>>>>>> fs/ext4/ext4.h | 36 ++--
>>>>>> fs/ext4/move_extent.c | 2 +-
>>>>>> fs/ext4/readpage.c | 2 +-
>>>>>> fs/fat/dir.c | 4 +-
>>>>>> fs/fat/fatent.c | 4 +-
>>>>>> fs/nfs/nfs42proc.c | 2 +-
>>>>>> fs/nfs/nfs42xattr.c | 2 +-
>>>>>> fs/nfs/nfs4proc.c | 2 +-
>>>>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>>>>>> include/asm-generic/vmlinux.lds.h | 38 ++--
>>>>>> include/linux/buffer_head.h | 1 +
>>>>>> include/linux/cpumask.h | 5 +
>>>>>> include/linux/linkage.h | 4 +-
>>>>>> include/linux/mm.h | 17 +-
>>>>>> include/linux/mm_types.h | 15 +-
>>>>>> include/linux/mm_types_task.h | 2 +-
>>>>>> include/linux/mmzone.h | 3 +-
>>>>>> include/linux/netlink.h | 6 +-
>>>>>> include/linux/percpu-defs.h | 4 +-
>>>>>> include/linux/perf_event.h | 2 +-
>>>>>> include/linux/sched.h | 4 +-
>>>>>> include/linux/slab.h | 7 +-
>>>>>> include/linux/stackdepot.h | 6 +-
>>>>>> include/linux/sunrpc/svc.h | 8 +-
>>>>>> include/linux/sunrpc/svc_rdma.h | 4 +-
>>>>>> include/linux/sunrpc/svcsock.h | 2 +-
>>>>>> include/linux/swap.h | 17 +-
>>>>>> include/linux/swapops.h | 6 +-
>>>>>> include/linux/thread_info.h | 10 +-
>>>>>> include/xen/page.h | 2 +
>>>>>> init/main.c | 7 +-
>>>>>> kernel/bpf/core.c | 9 +-
>>>>>> kernel/bpf/ringbuf.c | 54 ++---
>>>>>> kernel/cgroup/cgroup.c | 8 +-
>>>>>> kernel/crash_core.c | 2 +-
>>>>>> kernel/events/core.c | 2 +-
>>>>>> kernel/fork.c | 71 +++----
>>>>>> kernel/power/power.h | 2 +-
>>>>>> kernel/power/snapshot.c | 2 +-
>>>>>> kernel/power/swap.c | 129 +++++++++--
>>>>>> kernel/trace/fgraph.c | 2 +-
>>>>>> kernel/trace/trace.c | 2 +-
>>>>>> lib/stackdepot.c | 6 +-
>>>>>> mm/kasan/report.c | 3 +-
>>>>>> mm/memcontrol.c | 11 +-
>>>>>> mm/memory.c | 4 +-
>>>>>> mm/mmap.c | 2 +-
>>>>>> mm/page-writeback.c | 2 +-
>>>>>> mm/page_alloc.c | 31 +--
>>>>>> mm/slub.c | 2 +-
>>>>>> mm/sparse.c | 2 +-
>>>>>> mm/swapfile.c | 2 +-
>>>>>> mm/vmalloc.c | 7 +-
>>>>>> net/9p/trans_virtio.c | 4 +-
>>>>>> net/core/hotdata.c | 4 +-
>>>>>> net/core/skbuff.c | 4 +-
>>>>>> net/core/sysctl_net_core.c | 2 +-
>>>>>> net/sunrpc/cache.c | 3 +-
>>>>>> net/unix/af_unix.c | 2 +-
>>>>>> sound/soc/soc-utils.c | 4 +-
>>>>>> virt/kvm/kvm_main.c | 2 +-
>>>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>>>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>>>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>>>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>>>>>
>>>>>> --
>>>>>> 2.43.0
>>>>>
>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>>>>>
>>>>> That said, I have a couple of questions:
>>>>>
>>>>> * Going forward, how would we handle drivers/modules that require a particular
>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
>>>>> other page sizes.
>>>>
>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
>>>> unsupported page size is in use. Do you see any issue with that?
>>>>
>>>>>
>>>>> * How would we handle an invalid selection at boot?
>>>>
>>>> What do you mean by invalid here? The current policy validates that the
>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
>>>> passed on the command line, or the passed value is not supported by the HW, then
>>>> we default to the largest page size supported by the HW (so for Apple
>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
>>>> may be better to change that policy to use the smallest page size in this case;
>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
>>>>
>>>>> Can we program in a
>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
>>>>
>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
>>>> Silicon? The trouble is that we need to select the page size very early in
>>>> boot, before start_kernel() is called, so we really only have generic arch code
>>>> and the command line with which to make the decision.
>>>
>>> Yes... I think a build-time CONFIG for default page size, which can be
>>> overridden by a karg makes sense... Even on platforms like Apple
>>> Silicon you may want to test very specific things in 4k by overriding
>>> with a karg.
>>
>> Ahh, yes, that would certainly work. I'll work it into the next version.
>>
>
> Could we maybe extend to have some kind of way to include a table of
> SoC IDs that certain modes are disabled (e.g. 64k on Apple Silicon)

64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
supported.

> and preferred modes when no arg is set (16k for Apple Silicon)? That

And it's not obvious that we should hard-code a page size preference to a SoC
ID. If the CPU can support multiple page sizes, it should be up to the SW stack
to decide, not the SoC.

I'm guessing your desire is to have a single kernel build that will boot 16k by
default on Apple Silicon and 4k by default on other systems, all without needing
to modify the command line? Personally I think it's cleaner to just require
setting the page size on the command line in these cases.

> way it'd work something like this:
>
> 1. Table identification of 4/16/64 depending on identified SoC

So I'd prefer not to have this

> 2. Unidentified ones follow build-time default
> 3. karg forces a mode regardless

But keep these 2.
>
>
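
For illustration, the selection order that seems to have consensus in this
exchange (a karg overrides a build-time default, and the set of sizes the HW
reports as supported gates both) could be sketched roughly as below. This is a
user-space sketch under assumptions, not code from the series: the PGSZ_*
bitmask, select_page_size() and the final "fall back to the smallest supported
size" step are all hypothetical names and choices. In the kernel, the supported
set would be derived from mmfr0.

```c
#include <stddef.h>

/* Hypothetical bitmask of base page sizes; one bit per size. */
enum {
	PGSZ_4K  = 1u << 0,
	PGSZ_16K = 1u << 1,
	PGSZ_64K = 1u << 2,
};

/*
 * Pick a page size bit (illustrative policy only):
 *   1. the karg value, if one was given and the HW supports it;
 *   2. otherwise the build-time default, if the HW supports it;
 *   3. otherwise the smallest size the HW supports (assumed fallback).
 *
 * hw_mask:       sizes the HW reports as supported (OR of PGSZ_* bits)
 * karg:          size requested on the command line, or 0 if none
 * build_default: size chosen at build time, or 0 if none
 *
 * Returns a single PGSZ_* bit, or 0 if hw_mask is empty.
 */
static unsigned int select_page_size(unsigned int hw_mask,
				     unsigned int karg,
				     unsigned int build_default)
{
	if (karg && (hw_mask & karg))
		return karg;

	if (build_default && (hw_mask & build_default))
		return build_default;

	/* Isolate the lowest set bit, i.e. the smallest supported size. */
	return hw_mask & (~hw_mask + 1u);
}
```

With a mask of PGSZ_4K | PGSZ_16K (roughly the Apple Silicon case described
above), a 64k karg is rejected and the build-time default is used instead, while
a 4k karg still wins over a 16k default, which matches the "test very specific
things in 4k" use case.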
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-21 15:01 ` Ryan Roberts
@ 2024-10-22 9:33 ` Neal Gompa
2024-10-22 15:03 ` Nick Chan
0 siblings, 1 reply; 196+ messages in thread
From: Neal Gompa @ 2024-10-22 9:33 UTC (permalink / raw)
To: Ryan Roberts
Cc: Eric Curtin, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On Mon, Oct 21, 2024 at 11:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 21/10/2024 14:49, Neal Gompa wrote:
> > On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 21/10/2024 12:32, Eric Curtin wrote:
> >>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 19/10/2024 16:47, Neal Gompa wrote:
> >>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
> >>>>>> Hi All,
> >>>>>>
> >>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
> >>>>>> set of people on the full series and additionally included maintainers on
> >>>>>> relevant patches. I haven't included those maintainers on this cover letter
> >>>>>> since the numbers were far too big for it to work. But I've included a link
> >>>>>> to this cover letter on each patch, so they can hopefully find their way
> >>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
> >>>>>> thought it was important to show the full picture.
> >>>>>>
> >>>>>> This RFC series implements support for boot-time page size selection within
> >>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
> >>>>>> date, page size has been selected at compile-time, meaning the size is
> >>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
> >>>>>> more prevalent this starts to present a problem for distributions.
> >>>>>> Boot-time page size selection enables the creation of a single kernel
> >>>>>> image, which can be told which page size to use on the kernel command line.
> >>>>>>
> >>>>>> Why is having an image-per-page size problematic?
> >>>>>> =================================================
> >>>>>>
> >>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
> >>>>>> managing 2 kernel packages, along with drivers for each. For some, it means
> >>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
> >>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
> >>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
> >>>>>> kernel is painful, and the extra flash space required for both kernel
> >>>>>> images and the duplicated modules has been problematic. Boot-time page size
> >>>>>> selection solves all of this.
> >>>>>>
> >>>>>> Additionally, in starting to think about the longer term deployment story
> >>>>>> for D128 page tables, which Arm architecture now supports, a lot of the
> >>>>>> same problems need to be solved, so this work sets us up nicely for that.
> >>>>>>
> >>>>>> So what's the down side?
> >>>>>> ========================
> >>>>>>
> >>>>>> Well nothing's free; Various static allocations in the kernel image must be
> >>>>>> sized for the worst case (largest supported page size), so image size is in
> >>>>>> line with size of 64K compile-time image. So if you're interested in 4K or
> >>>>>> 16K, there is a slight increase to the image size. But I expect that
> >>>>>> problem goes away if you're compressing the image - it's just some extra
> >>>>>> zeros. At boot-time, I expect we could free the unused static storage once
> >>>>>> we know the page size - although that would be a follow up enhancement.
> >>>>>>
> >>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
> >>>>>> compile-time constants, we must look up their values and do arithmetic at
> >>>>>> runtime instead of compile-time. My early perf testing suggests this is
> >>>>>> imperceptible for real-world workloads, and only has small impact on
> >>>>>> microbenchmarks - more on this below.
> >>>>>>
> >>>>>> Approach
> >>>>>> ========
> >>>>>>
> >>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> >>>>>> friends are compile-time constant, but in a way that allows the compiler to
> >>>>>> perform the same optimizations as was previously being done if they do turn
> >>>>>> out to be compile-time constant. Where constants are required, we use
> >>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
> >>>>>> description of all the classes of problems to solve.
> >>>>>>
> >>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> >>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
> >>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> >>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
> >>>>>>
> >>>>>> When boot-time page size is active, the arch pgtable geometry macro
> >>>>>> definitions resolve to something that can be configured at boot. The arm64
> >>>>>> implementation in this series mainly uses global, __ro_after_init
> >>>>>> variables. I've tried using alternatives patching, but that performs worse
> >>>>>> than loading from memory; I think due to code size bloat.
> >>>>>>
> >>>>>> Status
> >>>>>> ======
> >>>>>>
> >>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
> >>>>>> enough to compile the kernel image itself with defconfig (and a few other
> >>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
> >>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
> >>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
> >>>>>> anyone wants to do any testing, and has a must-have config, let me know and
> >>>>>> I'll prioritize enabling it first.
> >>>>>>
> >>>>>> The series is arranged as follows:
> >>>>>>
> >>>>>> - patch 1: Add macros required for converting non-arch code to support
> >>>>>> boot-time page size selection
> >>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
> >>>>>> all non-arch code
> >>>>>> - patches 37-38: Some arm64 tidy ups
> >>>>>> - patch 39: Add macros required for converting arm64 code to support
> >>>>>> boot-time page size selection
> >>>>>> - patches 40-56: arm64 changes to support boot-time page size selection
> >>>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> >>>>>> selection
> >>>>>>
> >>>>>> Ideally, I'd like to get the basics merged (something like this series),
> >>>>>> then incrementally improve it over a handful of kernel releases until we
> >>>>>> can demonstrate that we have feature parity with the compile-time build and
> >>>>>> no performance blockers. Once at that point, ideally the compile-time build
> >>>>>> options would be removed and the code could be cleaned up further.
> >>>>>>
> >>>>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
> >>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> >>>>>> handling.
> >>>>>>
> >>>>>> Assuming people are amenable to the rough shape, how would I go about
> >>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
> >>>>>> each piece need to go independently to each relevant maintainer or could it
> >>>>>> all be merged together through the arm64 tree?
> >>>>>>
> >>>>>> Image Size
> >>>>>> ==========
> >>>>>>
> >>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> >>>>>> kernel image on disk for base (before any changes applied), compile (with
> >>>>>> changes, configured for compile-time page size) and boot (with changes,
> >>>>>> configured for boot-time page size).
> >>>>>>
> >>>>>> You can see that the compile-16k and 64k configs are actually slightly
> >>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
> >>>>>> which didn't need to depend on page size during the series. The boot-time
> >>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
> >>>>>> scope to improve this to make it equal to compile-64k if required:
> >>>>>>
> >>>>>> | config      | size/KB | diff/KB |  diff/% |
> >>>>>> |-------------|---------|---------|---------|
> >>>>>> | base-4k     |   54895 |       0 |    0.0% |
> >>>>>> | base-16k    |   55161 |     266 |    0.5% |
> >>>>>> | base-64k    |   56775 |    1880 |    3.4% |
> >>>>>> | compile-4k  |   54895 |       0 |    0.0% |
> >>>>>> | compile-16k |   55097 |     202 |    0.4% |
> >>>>>> | compile-64k |   56391 |    1496 |    2.7% |
> >>>>>> | boot-4K     |   57045 |    2150 |    3.9% |
> >>>>>>
> >>>>>> And below shows the size of the image in memory at run-time, separated for
> >>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
> >>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
> >>>>>> instructions to load the values and do arithmetic. I believe we could
> >>>>>> eventually get the data cost to match the cost for the compile image for
> >>>>>> the chosen page size by freeing the ends of the static buffers not needed
> >>>>>> for the selected page size:
> >>>>>>
> >>>>>> |             |    text |    text |    text |    data |    data |    data |
> >>>>>> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
> >>>>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>>>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
> >>>>>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
> >>>>>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
> >>>>>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
> >>>>>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
> >>>>>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
> >>>>>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
> >>>>>>
> >>>>>> Functional Testing
> >>>>>> ==================
> >>>>>>
> >>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
> >>>>>> most) without issue.
> >>>>>>
> >>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
> >>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
> >>>>>> with no regressions observed vs the equivalent compile-time page size build
> >>>>>> (although the mm-selftests have a few existing failures when run against
> >>>>>> 16K and 64K kernels - those should really be investigated and fixed
> >>>>>> independently).
> >>>>>>
> >>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
> >>>>>> many cases, I'm hoping the changes are simple enough that review might
> >>>>>> suffice?
> >>>>>>
> >>>>>> Performance Testing
> >>>>>> ===================
> >>>>>>
> >>>>>> I've run some limited performance benchmarks:
> >>>>>>
> >>>>>> First, a real-world benchmark that causes a lot of page table manipulation
> >>>>>> (and therefore we would expect to see regression here if we are going to
> >>>>>> see it anywhere); kernel compilation. It barely registers a change. Values
> >>>>>> are times, so smaller is better. All relative to base-4k:
> >>>>>>
> >>>>>> |             |    kern |    kern |    user |    user |    real |    real |
> >>>>>> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
> >>>>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>>>> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
> >>>>>> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
> >>>>>> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
> >>>>>>
> >>>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
> >>>>>> per min, so bigger is better. All relative to base-4k:
> >>>>>>
> >>>>>> | config      |    mean |   stdev |
> >>>>>> |-------------|---------|---------|
> >>>>>> | base-4k     |    0.0% |    0.8% |
> >>>>>> | compile-4k  |    0.4% |    0.8% |
> >>>>>> | boot-4k     |    0.0% |    0.9% |
> >>>>>>
> >>>>>> Finally, I've run some microbenchmarks known to stress page table
> >>>>>> manipulations (originally from David Hildenbrand). The fork test
> >>>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
> >>>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
> >>>>>> it. The fork test is known to be extremely sensitive to any changes that
> >>>>>> cause instructions to be aligned differently in cachelines. When using this
> >>>>>> test for other changes, I've seen double digit regressions for the
> >>>>>> slightest thing, so 12% regression on this test is actually fairly good.
> >>>>>> This likely represents the extreme worst case for regressions that will be
> >>>>>> observed across other microbenchmarks (famous last words). Values are
> >>>>>> times, so smaller is better. All relative to base-4k:
> >>>>>>
> >>>>>> |             |    fork |    fork |  munmap |  munmap |
> >>>>>> | config      |    mean |   stdev |    mean |   stdev |
> >>>>>> |-------------|---------|---------|---------|---------|
> >>>>>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
> >>>>>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
> >>>>>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
> >>>>>>
> >>>>>> NOTE: The series applies on top of v6.11.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ryan
> >>>>>>
> >>>>>>
> >>>>>> Ryan Roberts (57):
> >>>>>> mm: Add macros ahead of supporting boot-time page size selection
> >>>>>> vmlinux: Align to PAGE_SIZE_MAX
> >>>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> >>>>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> >>>>>> mm: Avoid split pmd ptl if pmd level is run-time folded
> >>>>>> mm: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> >>>>>> fs: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> fork: Permit boot-time THREAD_SIZE determination
> >>>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> bpf: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> perf: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> kvm: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> trace: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> crash: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> crypto: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> sound: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> edac: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> optee: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> random: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> virtio: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> xen: Remove PAGE_SIZE compile-time constant assumption
> >>>>>> arm64: Fix macros to work in C code in addition to the linker script
> >>>>>> arm64: Track early pgtable allocation limit
> >>>>>> arm64: Introduce macros required for boot-time page selection
> >>>>>> arm64: Refactor early pgtable size calculation macros
> >>>>>> arm64: Pass desired page size on command line
> >>>>>> arm64: Divorce early init from PAGE_SIZE
> >>>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> >>>>>> arm64: Align sections to PAGE_SIZE_MAX
> >>>>>> arm64: Rework trampoline rodata mapping
> >>>>>> arm64: Generalize fixmap for boot-time page size
> >>>>>> arm64: Statically allocate and align for worst-case page size
> >>>>>> arm64: Convert switch to if for non-const comparison values
> >>>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> >>>>>> arm64: Remove PAGE_SZ asm-offset
> >>>>>> arm64: Introduce cpu features for page sizes
> >>>>>> arm64: Remove PAGE_SIZE from assembly code
> >>>>>> arm64: Runtime-fold pmd level
> >>>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> >>>>>> arm64: TRAMP_VALIAS is no longer compile-time constant
> >>>>>> arm64: Determine THREAD_SIZE at boot-time
> >>>>>> arm64: Enable boot-time page size selection
> >>>>>>
> >>>>>> arch/alpha/include/asm/page.h | 1 +
> >>>>>> arch/arc/include/asm/page.h | 1 +
> >>>>>> arch/arm/include/asm/page.h | 1 +
> >>>>>> arch/arm64/Kconfig | 26 ++-
> >>>>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
> >>>>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
> >>>>>> arch/arm64/include/asm/efi.h | 2 +-
> >>>>>> arch/arm64/include/asm/fixmap.h | 28 ++-
> >>>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> >>>>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
> >>>>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
> >>>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> >>>>>> arch/arm64/include/asm/memory.h | 62 ++++--
> >>>>>> arch/arm64/include/asm/page-def.h | 3 +-
> >>>>>> arch/arm64/include/asm/pgalloc.h | 16 +-
> >>>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> >>>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> >>>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> >>>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> >>>>>> arch/arm64/include/asm/processor.h | 10 +-
> >>>>>> arch/arm64/include/asm/sections.h | 1 +
> >>>>>> arch/arm64/include/asm/smp.h | 1 +
> >>>>>> arch/arm64/include/asm/sparsemem.h | 15 +-
> >>>>>> arch/arm64/include/asm/sysreg.h | 54 +++--
> >>>>>> arch/arm64/include/asm/tlb.h | 3 +
> >>>>>> arch/arm64/kernel/asm-offsets.c | 4 +-
> >>>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> >>>>>> arch/arm64/kernel/efi.c | 2 +-
> >>>>>> arch/arm64/kernel/entry.S | 60 +++++-
> >>>>>> arch/arm64/kernel/head.S | 46 +++-
> >>>>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
> >>>>>> arch/arm64/kernel/image-vars.h | 14 ++
> >>>>>> arch/arm64/kernel/image.h | 4 +
> >>>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> >>>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> >>>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> >>>>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
> >>>>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
> >>>>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
> >>>>>> arch/arm64/kernel/vdso.c | 7 +-
> >>>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> >>>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> >>>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> >>>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> >>>>>> arch/arm64/kvm/arm.c | 10 +
> >>>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> >>>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> >>>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> >>>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> >>>>>> arch/arm64/kvm/mmu.c | 39 ++--
> >>>>>> arch/arm64/lib/clear_page.S | 7 +-
> >>>>>> arch/arm64/lib/copy_page.S | 33 ++-
> >>>>>> arch/arm64/lib/mte.S | 27 ++-
> >>>>>> arch/arm64/mm/Makefile | 1 +
> >>>>>> arch/arm64/mm/fixmap.c | 38 ++--
> >>>>>> arch/arm64/mm/hugetlbpage.c | 40 +---
> >>>>>> arch/arm64/mm/init.c | 26 +--
> >>>>>> arch/arm64/mm/kasan_init.c | 8 +-
> >>>>>> arch/arm64/mm/mmu.c | 53 +++--
> >>>>>> arch/arm64/mm/pgd.c | 12 +-
> >>>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
> >>>>>> arch/arm64/mm/proc.S | 128 ++++++++---
> >>>>>> arch/arm64/mm/ptdump.c | 3 +-
> >>>>>> arch/arm64/tools/cpucaps | 3 +
> >>>>>> arch/csky/include/asm/page.h | 3 +
> >>>>>> arch/hexagon/include/asm/page.h | 2 +
> >>>>>> arch/loongarch/include/asm/page.h | 2 +
> >>>>>> arch/m68k/include/asm/page.h | 1 +
> >>>>>> arch/microblaze/include/asm/page.h | 1 +
> >>>>>> arch/mips/include/asm/page.h | 1 +
> >>>>>> arch/nios2/include/asm/page.h | 2 +
> >>>>>> arch/openrisc/include/asm/page.h | 1 +
> >>>>>> arch/parisc/include/asm/page.h | 1 +
> >>>>>> arch/powerpc/include/asm/page.h | 2 +
> >>>>>> arch/riscv/include/asm/page.h | 1 +
> >>>>>> arch/s390/include/asm/page.h | 1 +
> >>>>>> arch/sh/include/asm/page.h | 1 +
> >>>>>> arch/sparc/include/asm/page.h | 3 +
> >>>>>> arch/um/include/asm/page.h | 2 +
> >>>>>> arch/x86/include/asm/page_types.h | 2 +
> >>>>>> arch/xtensa/include/asm/page.h | 1 +
> >>>>>> crypto/lskcipher.c | 4 +-
> >>>>>> drivers/ata/sata_sil24.c | 46 ++--
> >>>>>> drivers/base/node.c | 6 +-
> >>>>>> drivers/base/topology.c | 32 +--
> >>>>>> drivers/block/virtio_blk.c | 2 +-
> >>>>>> drivers/char/random.c | 4 +-
> >>>>>> drivers/edac/edac_mc.h | 13 +-
> >>>>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
> >>>>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> >>>>>> drivers/mtd/mtdswap.c | 4 +-
> >>>>>> drivers/net/ethernet/freescale/fec.h | 3 +-
> >>>>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> >>>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> >>>>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> >>>>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> >>>>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> >>>>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> >>>>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> >>>>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
> >>>>>> drivers/tee/optee/call.c | 7 +-
> >>>>>> drivers/tee/optee/smc_abi.c | 2 +-
> >>>>>> drivers/virtio/virtio_balloon.c | 10 +-
> >>>>>> drivers/xen/balloon.c | 11 +-
> >>>>>> drivers/xen/biomerge.c | 12 +-
> >>>>>> drivers/xen/privcmd.c | 2 +-
> >>>>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
> >>>>>> drivers/xen/xlate_mmu.c | 6 +-
> >>>>>> fs/binfmt_elf.c | 11 +-
> >>>>>> fs/buffer.c | 2 +-
> >>>>>> fs/coredump.c | 8 +-
> >>>>>> fs/ext4/ext4.h | 36 ++--
> >>>>>> fs/ext4/move_extent.c | 2 +-
> >>>>>> fs/ext4/readpage.c | 2 +-
> >>>>>> fs/fat/dir.c | 4 +-
> >>>>>> fs/fat/fatent.c | 4 +-
> >>>>>> fs/nfs/nfs42proc.c | 2 +-
> >>>>>> fs/nfs/nfs42xattr.c | 2 +-
> >>>>>> fs/nfs/nfs4proc.c | 2 +-
> >>>>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
> >>>>>> include/asm-generic/vmlinux.lds.h | 38 ++--
> >>>>>> include/linux/buffer_head.h | 1 +
> >>>>>> include/linux/cpumask.h | 5 +
> >>>>>> include/linux/linkage.h | 4 +-
> >>>>>> include/linux/mm.h | 17 +-
> >>>>>> include/linux/mm_types.h | 15 +-
> >>>>>> include/linux/mm_types_task.h | 2 +-
> >>>>>> include/linux/mmzone.h | 3 +-
> >>>>>> include/linux/netlink.h | 6 +-
> >>>>>> include/linux/percpu-defs.h | 4 +-
> >>>>>> include/linux/perf_event.h | 2 +-
> >>>>>> include/linux/sched.h | 4 +-
> >>>>>> include/linux/slab.h | 7 +-
> >>>>>> include/linux/stackdepot.h | 6 +-
> >>>>>> include/linux/sunrpc/svc.h | 8 +-
> >>>>>> include/linux/sunrpc/svc_rdma.h | 4 +-
> >>>>>> include/linux/sunrpc/svcsock.h | 2 +-
> >>>>>> include/linux/swap.h | 17 +-
> >>>>>> include/linux/swapops.h | 6 +-
> >>>>>> include/linux/thread_info.h | 10 +-
> >>>>>> include/xen/page.h | 2 +
> >>>>>> init/main.c | 7 +-
> >>>>>> kernel/bpf/core.c | 9 +-
> >>>>>> kernel/bpf/ringbuf.c | 54 ++---
> >>>>>> kernel/cgroup/cgroup.c | 8 +-
> >>>>>> kernel/crash_core.c | 2 +-
> >>>>>> kernel/events/core.c | 2 +-
> >>>>>> kernel/fork.c | 71 +++----
> >>>>>> kernel/power/power.h | 2 +-
> >>>>>> kernel/power/snapshot.c | 2 +-
> >>>>>> kernel/power/swap.c | 129 +++++++++--
> >>>>>> kernel/trace/fgraph.c | 2 +-
> >>>>>> kernel/trace/trace.c | 2 +-
> >>>>>> lib/stackdepot.c | 6 +-
> >>>>>> mm/kasan/report.c | 3 +-
> >>>>>> mm/memcontrol.c | 11 +-
> >>>>>> mm/memory.c | 4 +-
> >>>>>> mm/mmap.c | 2 +-
> >>>>>> mm/page-writeback.c | 2 +-
> >>>>>> mm/page_alloc.c | 31 +--
> >>>>>> mm/slub.c | 2 +-
> >>>>>> mm/sparse.c | 2 +-
> >>>>>> mm/swapfile.c | 2 +-
> >>>>>> mm/vmalloc.c | 7 +-
> >>>>>> net/9p/trans_virtio.c | 4 +-
> >>>>>> net/core/hotdata.c | 4 +-
> >>>>>> net/core/skbuff.c | 4 +-
> >>>>>> net/core/sysctl_net_core.c | 2 +-
> >>>>>> net/sunrpc/cache.c | 3 +-
> >>>>>> net/unix/af_unix.c | 2 +-
> >>>>>> sound/soc/soc-utils.c | 4 +-
> >>>>>> virt/kvm/kvm_main.c | 2 +-
> >>>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
> >>>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> >>>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> >>>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> >>>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
> >>>>>>
> >>>>>> --
> >>>>>> 2.43.0
> >>>>>
> >>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
> >>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
> >>>>>
> >>>>> That said, I have a couple of questions:
> >>>>>
> >>>>> * Going forward, how would we handle drivers/modules that require a particular
> >>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
> >>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
> >>>>> other page sizes.
> >>>>
> >>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
> >>>> unsupported page size is in use. Do you see any issue with that?
> >>>>
> >>>>>
> >>>>> * How would we handle an invalid selection at boot?
> >>>>
> >>>> What do you mean by invalid here? The current policy validates that the
> >>>> requested page size is supported by the HW by checking mmfr0. If no page size is
> >>>> passed on the command line, or the passed value is not supported by the HW, then
> > >>>> we default to the largest page size supported by the HW (so for Apple
> >>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
> >>>> may be better to change that policy to use the smallest page size in this case;
> >>>> 4k is the safer bet for compat and will waste much less memory than 64k.
> >>>>
> >>>>> Can we program in a
> >>>>> fallback when the "wrong" mode is selected for a chip or something similar?
> >>>>
> > >>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
> > >>>> Silicon? The trouble is that we need to select the page size very early in
> >>>> boot, before start_kernel() is called, so we really only have generic arch code
> >>>> and the command line with which to make the decision.
> >>>
> >>> Yes... I think a build-time CONFIG for default page size, which can be
> >>> overridden by a karg makes sense... Even on platforms like Apple
> >>> Silicon you may want to test very specific things in 4k by overriding
> >>> with a karg.
> >>
> >> Ahh, yes, that would certainly work. I'll work it into the next version.
> >>
> >
> > Could we maybe extend to have some kind of way to include a table of
> > SoC IDs that certain modes are disabled (e.g. 64k on Apple Silicon)
>
> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
> supported.
>
> > and preferred modes when no arg is set (16k for Apple Silicon)? That
>
> And it's not obvious that we should hard-code a page size preference to a SoC
> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
> to decide, not the SoC.
>
> I'm guessing your desire is to have a single kernel build that will boot 16k by
> default on Apple Silicon and 4k by default on other systems, all without needing
> to modify the command line? Personally I think it's cleaner to just require
> setting the page size on the command line in these cases.
>
> > way it'd work something like this:
> >
> > 1. Table identification of 4/16/64 depending on identified SoC
> So I'd prefer not to have this
>
> > 2. Unidentified ones follow build-time default
> > 3. karg forces a mode regardless
> But keep these 2.
>
I think it makes sense to have it, because it's not just Apple Silicon
where such a preference/requirement may be necessary. Apple Silicon
technically works at 4k, but is completely broken at 4k because Linux
cannot do 16k IOMMU with 4k everything else, so being able to at least
prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
Hopper platform prefer 64k over other options (though I am unaware of
a gross incompatibility that effectively requires it like Apple
Silicon has).
When we're trying to get to "single generic image that works
everywhere", stuff like this matters and I would really like you to
consider it from the lens of "we want things to work as automagic as
they do on x86".
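
To make the three-step proposal quoted above concrete (SoC table, then
build-time default, with a karg override on top), here is a minimal userspace
sketch in C. It is only an illustration under stated assumptions: the
`soc_prefs` table, the `example,...` identifiers, the `PS_*` flags and
`resolve_page_size()` are hypothetical names and not part of the series, and
the sketch assumes that a choice the hardware does not support is skipped
rather than forced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit per base page size; a set of sizes is an OR of these flags. */
#define PS_4K   (1u << 0)
#define PS_16K  (1u << 1)
#define PS_64K  (1u << 2)
#define PS_ALL  (PS_4K | PS_16K | PS_64K)

/* Hypothetical table entry mapping a SoC identifier to a preferred size. */
struct soc_pref {
	const char *soc_id;
	unsigned int preferred;	/* exactly one PS_* flag */
};

/* Illustrative entries only; these identifiers are invented for the sketch. */
static const struct soc_pref soc_prefs[] = {
	{ "example,apple-silicon", PS_16K },
	{ "example,grace-hopper",  PS_64K },
};

/*
 * Pick a page size. karg and build_default are single PS_* flags
 * (karg == 0 means "not given"); hw_mask is the set the CPU supports.
 */
unsigned int resolve_page_size(const char *soc_id, unsigned int karg,
			       unsigned int build_default,
			       unsigned int hw_mask)
{
	size_t i;

	/* Hardware must implement at least one base page size. */
	assert(hw_mask & PS_ALL);

	/* Step 3: an explicit command line choice wins if the HW has it. */
	if (karg & hw_mask)
		return karg;

	/* Step 1: otherwise use the table preference for an identified SoC. */
	for (i = 0; soc_id && i < sizeof(soc_prefs) / sizeof(soc_prefs[0]); i++) {
		if (strcmp(soc_id, soc_prefs[i].soc_id) == 0 &&
		    (soc_prefs[i].preferred & hw_mask))
			return soc_prefs[i].preferred;
	}

	/* Step 2: unidentified SoCs follow the build-time default. */
	if (build_default & hw_mask)
		return build_default;

	/* Last resort (an assumption): smallest size the HW supports. */
	if (hw_mask & PS_4K)
		return PS_4K;
	if (hw_mask & PS_16K)
		return PS_16K;
	return PS_64K;
}
```

With a 4K build-time default, `resolve_page_size("example,apple-silicon", 0,
PS_4K, PS_4K | PS_16K)` picks 16K from the table, while passing `PS_4K` as the
karg on the same machine still yields 4K for testing.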
--
真実はいつも一つ!/ Always, there's only one truth!
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-22 9:33 ` Neal Gompa
@ 2024-10-22 15:03 ` Nick Chan
2024-10-22 15:12 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Nick Chan @ 2024-10-22 15:03 UTC (permalink / raw)
To: Neal Gompa, Ryan Roberts
Cc: Eric Curtin, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On 2024/10/22 at 5:33 PM, Neal Gompa wrote:
> On Mon, Oct 21, 2024 at 11:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 21/10/2024 14:49, Neal Gompa wrote:
>>> On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 21/10/2024 12:32, Eric Curtin wrote:
>>>>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 19/10/2024 16:47, Neal Gompa wrote:
>>>>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
>>>>>>>> set of people on the full series and additionally included maintainers on
>>>>>>>> relevant patches. I haven't included those maintainers on this cover letter
>>>>>>>> since the numbers were far too big for it to work. But I've included a link
>>>>>>>> to this cover letter on each patch, so they can hopefully find their way
>>>>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
>>>>>>>> thought it was important to show the full picture.
>>>>>>>>
>>>>>>>> This RFC series implements support for boot-time page size selection within
>>>>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
>>>>>>>> date, page size has been selected at compile-time, meaning the size is
>>>>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
>>>>>>>> more prevalent this starts to present a problem for distributions.
>>>>>>>> Boot-time page size selection enables the creation of a single kernel
>>>>>>>> image, which can be told which page size to use on the kernel command line.
>>>>>>>>
>>>>>>>> Why is having an image-per-page size problematic?
>>>>>>>> =================================================
>>>>>>>>
>>>>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
>>>>>>>> managing 2 kernel packages, along with drivers for each. For some, it means
>>>>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
>>>>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
>>>>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
>>>>>>>> kernel is painful, and the extra flash space required for both kernel
>>>>>>>> images and the duplicated modules has been problematic. Boot-time page size
>>>>>>>> selection solves all of this.
>>>>>>>>
>>>>>>>> Additionally, in starting to think about the longer term deployment story
>>>>>>>> for D128 page tables, which Arm architecture now supports, a lot of the
>>>>>>>> same problems need to be solved, so this work sets us up nicely for that.
>>>>>>>>
>>>>>>>> So what's the down side?
>>>>>>>> ========================
>>>>>>>>
>>>>>>>> Well nothing's free; Various static allocations in the kernel image must be
>>>>>>>> sized for the worst case (largest supported page size), so image size is in
>>>>>>>> line with size of 64K compile-time image. So if you're interested in 4K or
>>>>>>>> 16K, there is a slight increase to the image size. But I expect that
>>>>>>>> problem goes away if you're compressing the image - it's just some extra
>>>>>>>> zeros. At boot-time, I expect we could free the unused static storage once
>>>>>>>> we know the page size - although that would be a follow up enhancement.
>>>>>>>>
>>>>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
>>>>>>>> compile-time constants, we must look up their values and do arithmetic at
>>>>>>>> runtime instead of compile-time. My early perf testing suggests this is
>>>>>>>> imperceptible for real-world workloads, and only has small impact on
>>>>>>>> microbenchmarks - more on this below.
>>>>>>>>
>>>>>>>> Approach
>>>>>>>> ========
>>>>>>>>
>>>>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>>>>>>>> friends are compile-time constant, but in a way that allows the compiler to
>>>>>>>> perform the same optimizations as was previously being done if they do turn
>>>>>>>> out to be compile-time constant. Where constants are required, we use
>>>>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
>>>>>>>> description of all the classes of problems to solve.
>>>>>>>>
>>>>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>>>>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
>>>>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>>>>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
>>>>>>>>
>>>>>>>> When boot-time page size is active, the arch pgtable geometry macro
>>>>>>>> definitions resolve to something that can be configured at boot. The arm64
>>>>>>>> implementation in this series mainly uses global, __ro_after_init
>>>>>>>> variables. I've tried using alternatives patching, but that performs worse
>>>>>>>> than loading from memory; I think due to code size bloat.
>>>>>>>>
>>>>>>>> Status
>>>>>>>> ======
>>>>>>>>
>>>>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
>>>>>>>> enough to compile the kernel image itself with defconfig (and a few other
>>>>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
>>>>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
>>>>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
>>>>>>>> anyone wants to do any testing, and has a must-have config, let me know and
>>>>>>>> I'll prioritize enabling it first.
>>>>>>>>
>>>>>>>> The series is arranged as follows:
>>>>>>>>
>>>>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>>>>> boot-time page size selection
>>>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
>>>>>>>> all non-arch code
>>>>>>>> - patches 37-38: Some arm64 tidy ups
>>>>>>>> - patch 39: Add macros required for converting arm64 code to support
>>>>>>>> boot-time page size selection
>>>>>>>> - patches 40-56: arm64 changes to support boot-time page size selection
>>>>>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>>>>>>>> selection
>>>>>>>>
>>>>>>>> Ideally, I'd like to get the basics merged (something like this series),
>>>>>>>> then incrementally improve it over a handful of kernel releases until we
>>>>>>>> can demonstrate that we have feature parity with the compile-time build and
>>>>>>>> no performance blockers. Once at that point, ideally the compile-time build
>>>>>>>> options would be removed and the code could be cleaned up further.
>>>>>>>>
>>>>>>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
>>>>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>>>>>>>> handling.
>>>>>>>>
>>>>>>>> Assuming people are amenable to the rough shape, how would I go about
>>>>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
>>>>>>>> each piece need to go independently to each relevant maintainer or could it
>>>>>>>> all be merged together through the arm64 tree?
>>>>>>>>
>>>>>>>> Image Size
>>>>>>>> ==========
>>>>>>>>
>>>>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>>>>>>>> kernel image on disk for base (before any changes applied), compile (with
>>>>>>>> changes, configured for compile-time page size) and boot (with changes,
>>>>>>>> configured for boot-time page size).
>>>>>>>>
>>>>>>>> You can see that the compile-16k and 64k configs are actually slightly
>>>>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
>>>>>>>> which didn't need to depend on page size during the series. The boot-time
>>>>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
>>>>>>>> scope to improve this to make it equal to compile-64k if required:
>>>>>>>>
>>>>>>>> | config      | size/KB | diff/KB |  diff/% |
>>>>>>>> |-------------|---------|---------|---------|
>>>>>>>> | base-4k     |   54895 |       0 |    0.0% |
>>>>>>>> | base-16k    |   55161 |     266 |    0.5% |
>>>>>>>> | base-64k    |   56775 |    1880 |    3.4% |
>>>>>>>> | compile-4k  |   54895 |       0 |    0.0% |
>>>>>>>> | compile-16k |   55097 |     202 |    0.4% |
>>>>>>>> | compile-64k |   56391 |    1496 |    2.7% |
>>>>>>>> | boot-4K     |   57045 |    2150 |    3.9% |
>>>>>>>>
>>>>>>>> And below shows the size of the image in memory at run-time, separated for
>>>>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
>>>>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
>>>>>>>> instructions to load the values and do arithmetic. I believe we could
>>>>>>>> eventually get the data cost to match the cost for the compile image for
>>>>>>>> the chosen page size by freeing the ends of the static buffers not needed
>>>>>>>> for the selected page size:
>>>>>>>>
>>>>>>>> |             |    text |    text |    text |    data |    data |    data |
>>>>>>>> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>>>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
>>>>>>>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
>>>>>>>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
>>>>>>>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
>>>>>>>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
>>>>>>>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
>>>>>>>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
>>>>>>>>
>>>>>>>> Functional Testing
>>>>>>>> ==================
>>>>>>>>
>>>>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
>>>>>>>> most) without issue.
>>>>>>>>
>>>>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
>>>>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
>>>>>>>> with no regressions observed vs the equivalent compile-time page size build
>>>>>>>> (although the mm-selftests have a few existing failures when run against
>>>>>>>> 16K and 64K kernels - those should really be investigated and fixed
>>>>>>>> independently).
>>>>>>>>
>>>>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
>>>>>>>> many cases, I'm hoping the changes are simple enough that review might
>>>>>>>> suffice?
>>>>>>>>
>>>>>>>> Performance Testing
>>>>>>>> ===================
>>>>>>>>
>>>>>>>> I've run some limited performance benchmarks:
>>>>>>>>
>>>>>>>> First, a real-world benchmark that causes a lot of page table manipulation
>>>>>>>> (and therefore we would expect to see regression here if we are going to
>>>>>>>> see it anywhere); kernel compilation. It barely registers a change. Values
>>>>>>>> are times, so smaller is better. All relative to base-4k:
>>>>>>>>
>>>>>>>> |             |    kern |    kern |    user |    user |    real |    real |
>>>>>>>> | config      |    mean |   stdev |    mean |   stdev |    mean |   stdev |
>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>>>> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
>>>>>>>> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
>>>>>>>> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
>>>>>>>>
>>>>>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
>>>>>>>> per min, so bigger is better. All relative to base-4k:
>>>>>>>>
>>>>>>>> | config      |    mean |   stdev |
>>>>>>>> |-------------|---------|---------|
>>>>>>>> | base-4k     |    0.0% |    0.8% |
>>>>>>>> | compile-4k  |    0.4% |    0.8% |
>>>>>>>> | boot-4k     |    0.0% |    0.9% |
>>>>>>>>
>>>>>>>> Finally, I've run some microbenchmarks known to stress page table
>>>>>>>> manipulations (originally from David Hildenbrand). The fork test
>>>>>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
>>>>>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
>>>>>>>> it. The fork test is known to be extremely sensitive to any changes that
>>>>>>>> cause instructions to be aligned differently in cachelines. When using this
>>>>>>>> test for other changes, I've seen double digit regressions for the
>>>>>>>> slightest thing, so 12% regression on this test is actually fairly good.
>>>>>>>> This likely represents the extreme worst case for regressions that will be
>>>>>>>> observed across other microbenchmarks (famous last
>>>>>>>> words). Values are times, so smaller is better. All relative to base-4k:
>>>>>>>>
>>>>>>>> |             |    fork |    fork |  munmap |  munmap |
>>>>>>>> | config      |    mean |   stdev |    mean |   stdev |
>>>>>>>> |-------------|---------|---------|---------|---------|
>>>>>>>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
>>>>>>>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
>>>>>>>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
>>>>>>>>
>>>>>>>> NOTE: The series applies on top of v6.11.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>> Ryan Roberts (57):
>>>>>>>> mm: Add macros ahead of supporting boot-time page size selection
>>>>>>>> vmlinux: Align to PAGE_SIZE_MAX
>>>>>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>>>>>>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>>>>>>>> mm: Avoid split pmd ptl if pmd level is run-time folded
>>>>>>>> mm: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>>>>>>>> fs: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> fork: Permit boot-time THREAD_SIZE determination
>>>>>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> bpf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> perf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> kvm: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> trace: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> crash: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> crypto: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> sound: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> edac: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> optee: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> random: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> virtio: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> xen: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>> arm64: Fix macros to work in C code in addition to the linker script
>>>>>>>> arm64: Track early pgtable allocation limit
>>>>>>>> arm64: Introduce macros required for boot-time page selection
>>>>>>>> arm64: Refactor early pgtable size calculation macros
>>>>>>>> arm64: Pass desired page size on command line
>>>>>>>> arm64: Divorce early init from PAGE_SIZE
>>>>>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>>>>>>>> arm64: Align sections to PAGE_SIZE_MAX
>>>>>>>> arm64: Rework trampoline rodata mapping
>>>>>>>> arm64: Generalize fixmap for boot-time page size
>>>>>>>> arm64: Statically allocate and align for worst-case page size
>>>>>>>> arm64: Convert switch to if for non-const comparison values
>>>>>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>>>>>>>> arm64: Remove PAGE_SZ asm-offset
>>>>>>>> arm64: Introduce cpu features for page sizes
>>>>>>>> arm64: Remove PAGE_SIZE from assembly code
>>>>>>>> arm64: Runtime-fold pmd level
>>>>>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>>>>>>>> arm64: TRAMP_VALIAS is no longer compile-time constant
>>>>>>>> arm64: Determine THREAD_SIZE at boot-time
>>>>>>>> arm64: Enable boot-time page size selection
>>>>>>>>
>>>>>>>> arch/alpha/include/asm/page.h | 1 +
>>>>>>>> arch/arc/include/asm/page.h | 1 +
>>>>>>>> arch/arm/include/asm/page.h | 1 +
>>>>>>>> arch/arm64/Kconfig | 26 ++-
>>>>>>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>>>>>>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>>>>>>>> arch/arm64/include/asm/efi.h | 2 +-
>>>>>>>> arch/arm64/include/asm/fixmap.h | 28 ++-
>>>>>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>>>>>>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>>>>>>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>>>>>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>>>>>>>> arch/arm64/include/asm/memory.h | 62 ++++--
>>>>>>>> arch/arm64/include/asm/page-def.h | 3 +-
>>>>>>>> arch/arm64/include/asm/pgalloc.h | 16 +-
>>>>>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>>>>>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>>>>>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>>>>>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>>>>>>>> arch/arm64/include/asm/processor.h | 10 +-
>>>>>>>> arch/arm64/include/asm/sections.h | 1 +
>>>>>>>> arch/arm64/include/asm/smp.h | 1 +
>>>>>>>> arch/arm64/include/asm/sparsemem.h | 15 +-
>>>>>>>> arch/arm64/include/asm/sysreg.h | 54 +++--
>>>>>>>> arch/arm64/include/asm/tlb.h | 3 +
>>>>>>>> arch/arm64/kernel/asm-offsets.c | 4 +-
>>>>>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>>>>>>>> arch/arm64/kernel/efi.c | 2 +-
>>>>>>>> arch/arm64/kernel/entry.S | 60 +++++-
>>>>>>>> arch/arm64/kernel/head.S | 46 +++-
>>>>>>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>>>>>>>> arch/arm64/kernel/image-vars.h | 14 ++
>>>>>>>> arch/arm64/kernel/image.h | 4 +
>>>>>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>>>>>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>>>>>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>>>>>>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>>>>>>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>>>>>>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>>>>>>>> arch/arm64/kernel/vdso.c | 7 +-
>>>>>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>>>>>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>>>>>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>>>>>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>>>>>>>> arch/arm64/kvm/arm.c | 10 +
>>>>>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>>>>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>>>>>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>>>>>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>>>>>>>> arch/arm64/kvm/mmu.c | 39 ++--
>>>>>>>> arch/arm64/lib/clear_page.S | 7 +-
>>>>>>>> arch/arm64/lib/copy_page.S | 33 ++-
>>>>>>>> arch/arm64/lib/mte.S | 27 ++-
>>>>>>>> arch/arm64/mm/Makefile | 1 +
>>>>>>>> arch/arm64/mm/fixmap.c | 38 ++--
>>>>>>>> arch/arm64/mm/hugetlbpage.c | 40 +---
>>>>>>>> arch/arm64/mm/init.c | 26 +--
>>>>>>>> arch/arm64/mm/kasan_init.c | 8 +-
>>>>>>>> arch/arm64/mm/mmu.c | 53 +++--
>>>>>>>> arch/arm64/mm/pgd.c | 12 +-
>>>>>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>>>>>>>> arch/arm64/mm/proc.S | 128 ++++++++---
>>>>>>>> arch/arm64/mm/ptdump.c | 3 +-
>>>>>>>> arch/arm64/tools/cpucaps | 3 +
>>>>>>>> arch/csky/include/asm/page.h | 3 +
>>>>>>>> arch/hexagon/include/asm/page.h | 2 +
>>>>>>>> arch/loongarch/include/asm/page.h | 2 +
>>>>>>>> arch/m68k/include/asm/page.h | 1 +
>>>>>>>> arch/microblaze/include/asm/page.h | 1 +
>>>>>>>> arch/mips/include/asm/page.h | 1 +
>>>>>>>> arch/nios2/include/asm/page.h | 2 +
>>>>>>>> arch/openrisc/include/asm/page.h | 1 +
>>>>>>>> arch/parisc/include/asm/page.h | 1 +
>>>>>>>> arch/powerpc/include/asm/page.h | 2 +
>>>>>>>> arch/riscv/include/asm/page.h | 1 +
>>>>>>>> arch/s390/include/asm/page.h | 1 +
>>>>>>>> arch/sh/include/asm/page.h | 1 +
>>>>>>>> arch/sparc/include/asm/page.h | 3 +
>>>>>>>> arch/um/include/asm/page.h | 2 +
>>>>>>>> arch/x86/include/asm/page_types.h | 2 +
>>>>>>>> arch/xtensa/include/asm/page.h | 1 +
>>>>>>>> crypto/lskcipher.c | 4 +-
>>>>>>>> drivers/ata/sata_sil24.c | 46 ++--
>>>>>>>> drivers/base/node.c | 6 +-
>>>>>>>> drivers/base/topology.c | 32 +--
>>>>>>>> drivers/block/virtio_blk.c | 2 +-
>>>>>>>> drivers/char/random.c | 4 +-
>>>>>>>> drivers/edac/edac_mc.h | 13 +-
>>>>>>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>>>>>>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>>>>>>>> drivers/mtd/mtdswap.c | 4 +-
>>>>>>>> drivers/net/ethernet/freescale/fec.h | 3 +-
>>>>>>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>>>>>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>>>>>>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>>>>>>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>>>>>>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>>>>>>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>>>>>>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>>>>>>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>>>>>>>> drivers/tee/optee/call.c | 7 +-
>>>>>>>> drivers/tee/optee/smc_abi.c | 2 +-
>>>>>>>> drivers/virtio/virtio_balloon.c | 10 +-
>>>>>>>> drivers/xen/balloon.c | 11 +-
>>>>>>>> drivers/xen/biomerge.c | 12 +-
>>>>>>>> drivers/xen/privcmd.c | 2 +-
>>>>>>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>>>>>>>> drivers/xen/xlate_mmu.c | 6 +-
>>>>>>>> fs/binfmt_elf.c | 11 +-
>>>>>>>> fs/buffer.c | 2 +-
>>>>>>>> fs/coredump.c | 8 +-
>>>>>>>> fs/ext4/ext4.h | 36 ++--
>>>>>>>> fs/ext4/move_extent.c | 2 +-
>>>>>>>> fs/ext4/readpage.c | 2 +-
>>>>>>>> fs/fat/dir.c | 4 +-
>>>>>>>> fs/fat/fatent.c | 4 +-
>>>>>>>> fs/nfs/nfs42proc.c | 2 +-
>>>>>>>> fs/nfs/nfs42xattr.c | 2 +-
>>>>>>>> fs/nfs/nfs4proc.c | 2 +-
>>>>>>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>>>>>>>> include/asm-generic/vmlinux.lds.h | 38 ++--
>>>>>>>> include/linux/buffer_head.h | 1 +
>>>>>>>> include/linux/cpumask.h | 5 +
>>>>>>>> include/linux/linkage.h | 4 +-
>>>>>>>> include/linux/mm.h | 17 +-
>>>>>>>> include/linux/mm_types.h | 15 +-
>>>>>>>> include/linux/mm_types_task.h | 2 +-
>>>>>>>> include/linux/mmzone.h | 3 +-
>>>>>>>> include/linux/netlink.h | 6 +-
>>>>>>>> include/linux/percpu-defs.h | 4 +-
>>>>>>>> include/linux/perf_event.h | 2 +-
>>>>>>>> include/linux/sched.h | 4 +-
>>>>>>>> include/linux/slab.h | 7 +-
>>>>>>>> include/linux/stackdepot.h | 6 +-
>>>>>>>> include/linux/sunrpc/svc.h | 8 +-
>>>>>>>> include/linux/sunrpc/svc_rdma.h | 4 +-
>>>>>>>> include/linux/sunrpc/svcsock.h | 2 +-
>>>>>>>> include/linux/swap.h | 17 +-
>>>>>>>> include/linux/swapops.h | 6 +-
>>>>>>>> include/linux/thread_info.h | 10 +-
>>>>>>>> include/xen/page.h | 2 +
>>>>>>>> init/main.c | 7 +-
>>>>>>>> kernel/bpf/core.c | 9 +-
>>>>>>>> kernel/bpf/ringbuf.c | 54 ++---
>>>>>>>> kernel/cgroup/cgroup.c | 8 +-
>>>>>>>> kernel/crash_core.c | 2 +-
>>>>>>>> kernel/events/core.c | 2 +-
>>>>>>>> kernel/fork.c | 71 +++----
>>>>>>>> kernel/power/power.h | 2 +-
>>>>>>>> kernel/power/snapshot.c | 2 +-
>>>>>>>> kernel/power/swap.c | 129 +++++++++--
>>>>>>>> kernel/trace/fgraph.c | 2 +-
>>>>>>>> kernel/trace/trace.c | 2 +-
>>>>>>>> lib/stackdepot.c | 6 +-
>>>>>>>> mm/kasan/report.c | 3 +-
>>>>>>>> mm/memcontrol.c | 11 +-
>>>>>>>> mm/memory.c | 4 +-
>>>>>>>> mm/mmap.c | 2 +-
>>>>>>>> mm/page-writeback.c | 2 +-
>>>>>>>> mm/page_alloc.c | 31 +--
>>>>>>>> mm/slub.c | 2 +-
>>>>>>>> mm/sparse.c | 2 +-
>>>>>>>> mm/swapfile.c | 2 +-
>>>>>>>> mm/vmalloc.c | 7 +-
>>>>>>>> net/9p/trans_virtio.c | 4 +-
>>>>>>>> net/core/hotdata.c | 4 +-
>>>>>>>> net/core/skbuff.c | 4 +-
>>>>>>>> net/core/sysctl_net_core.c | 2 +-
>>>>>>>> net/sunrpc/cache.c | 3 +-
>>>>>>>> net/unix/af_unix.c | 2 +-
>>>>>>>> sound/soc/soc-utils.c | 4 +-
>>>>>>>> virt/kvm/kvm_main.c | 2 +-
>>>>>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>>>>>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>>>>>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>>>>>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>>>>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.43.0
>>>>>>>
>>>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
>>>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>>>>>>>
>>>>>>> That said, I have a couple of questions:
>>>>>>>
>>>>>>> * Going forward, how would we handle drivers/modules that require a particular
>>>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
>>>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
>>>>>>> other page sizes.
>>>>>>
>>>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
>>>>>> unsupported page size is in use. Do you see any issue with that?
>>>>>>
>>>>>>>
>>>>>>> * How would we handle an invalid selection at boot?
>>>>>>
>>>>>> What do you mean by invalid here? The current policy validates that the
>>>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
>>>>>> passed on the command line, or the passed value is not supported by the HW, then
>>>>>> we default to the largest page size supported by the HW (so for Apple
>>>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
>>>>>> may be better to change that policy to use the smallest page size in this case;
>>>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
>>>>>>
>>>>>>> Can we program in a
>>>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
>>>>>>
>>>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
>>>>>> Silicon? The trouble is that we need to select the page size, very early in
>>>>>> boot, before start_kernel() is called, so we really only have generic arch code
>>>>>> and the command line with which to make the decision.
>>>>>
>>>>> Yes... I think a build-time CONFIG for default page size, which can be
>>>>> overridden by a karg makes sense... Even on platforms like Apple
>>>>> Silicon you may want to test very specific things in 4k by overriding
>>>>> with a karg.
>>>>
>>>> Ahh, yes, that would certainly work. I'll work it into the next version.
>>>>
>>>
>>> Could we maybe extend to have some kind of way to include a table of
>>> SoC IDs that certain modes are disabled (e.g. 64k on Apple Silicon)
>>
>> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
>> supported.
>>
>>> and preferred modes when no arg is set (16k for Apple Silicon)? That
>>
>> And it's not obvious that we should hard-code a page size preference to a SoC
>> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
>> to decide, not the SoC.
>>
>> I'm guessing your desire is to have a single kernel build that will boot 16k by
>> default on Apple Silicon and 4k by default on other systems, all without needing
>> to modify the command line? Personally I think it's cleaner to just require
>> setting the page size on the command line in these cases.
>>
>>> way it'd work something like this:
>>>
>>> 1. Table identification of 4/16/64 depending on identified SoC
>> So I'd prefer not to have this
>>
>>> 2. Unidentified ones follow build-time default
>>> 3. karg forces a mode regardless
>> But keep these 2.
>>
>
Since we are talking about Apple Silicon and page size, I would like to
add that on the Apple Silicon SoCs I am working on, the situation is like
this:
Apple A7 (s5l8960x), A8 (T7000), A8X (T7001): CPU MMU supports 4K and 64K
page sizes.
Apple A9 (s8000/s8003), A9X (s8001), A10 (t8010), A10X (t8011), A11 (t8015):
CPU MMU supports 16K and 64K page sizes.
However, all of them have 4K page DART IOMMUs.
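
As a worked example of what the two fallback policies discussed earlier in
the thread (largest supported size today, smallest supported size as the
possible alternative) would pick on these parts, here is a small C sketch.
The supported-size masks are taken from the list above; the `PS_*` flags,
the `APPLE_*_SIZES` names and both helper functions are made up for
illustration, and real code would derive the mask from ID_AA64MMFR0_EL1
rather than from constants.

```c
#include <assert.h>

/* One bit per base page size; a set of sizes is an OR of these flags. */
#define PS_4K   (1u << 0)
#define PS_16K  (1u << 1)
#define PS_64K  (1u << 2)
#define PS_ALL  (PS_4K | PS_16K | PS_64K)

/* Supported-size masks as reported in the list above. */
#define APPLE_A7_A8X_SIZES  (PS_4K | PS_64K)   /* A7, A8, A8X */
#define APPLE_A9_A11_SIZES  (PS_16K | PS_64K)  /* A9, A9X, A10, A10X, A11 */

/* Fallback used by the RFC as posted: largest supported size, in KiB. */
unsigned int largest_supported_kib(unsigned int mask)
{
	/* Hardware must implement at least one base page size. */
	assert(mask & PS_ALL);

	if (mask & PS_64K)
		return 64;
	if (mask & PS_16K)
		return 16;
	return 4;
}

/* Alternative fallback floated in the thread: smallest supported size. */
unsigned int smallest_supported_kib(unsigned int mask)
{
	assert(mask & PS_ALL);

	if (mask & PS_4K)
		return 4;
	if (mask & PS_16K)
		return 16;
	return 64;
}
```

Under these assumptions, the "largest" policy selects 64K on both groups,
while the "smallest" policy selects 4K on A7/A8/A8X and 16K on A9 through A11.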
> I think it makes sense to have it, because it's not just Apple Silicon
> where such a preference/requirement may be necessary. Apple Silicon
> technically works at 4k, but is completely broken at 4k because Linux
> cannot do 16k IOMMU with 4k everything else, so being able to at least
> prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
> Hopper platform prefer 64k over other options (though I am unaware of
> a gross incompatibility that effectively requires it like Apple
> Silicon has).
>
> When we're trying to get to "single generic image that works
> everywhere", stuff like this matters and I would really like you to
> consider it from the lens of "we want things to work as automagic as
> they do on x86".
For me, in order to get to this level of automagic, there does need to be
a table of which SoC should use which page size.
>
>
Nick Chan
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-22 15:03 ` Nick Chan
@ 2024-10-22 15:12 ` Ryan Roberts
2024-10-22 17:30 ` Neal Gompa
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-22 15:12 UTC (permalink / raw)
To: Nick Chan, Neal Gompa
Cc: Eric Curtin, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, Hector Martin, linux-arm-kernel,
linux-kernel, linux-mm, asahi
On 22/10/2024 16:03, Nick Chan wrote:
>
>
> On 2024/10/22 at 5:33 PM, Neal Gompa wrote:
>> On Mon, Oct 21, 2024 at 11:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 21/10/2024 14:49, Neal Gompa wrote:
>>>> On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 21/10/2024 12:32, Eric Curtin wrote:
>>>>>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 19/10/2024 16:47, Neal Gompa wrote:
>>>>>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
>>>>>>>>> set of people on the full series and additionally included maintainers on
>>>>>>>>> relevant patches. I haven't included those maintainers on this cover letter
>>>>>>>>> since the numbers were far too big for it to work. But I've included a link
>>>>>>>>> to this cover letter on each patch, so they can hopefully find their way
>>>>>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
>>>>>>>>> thought it was important to show the full picture.
>>>>>>>>>
>>>>>>>>> This RFC series implements support for boot-time page size selection within
>>>>>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
>>>>>>>>> date, page size has been selected at compile-time, meaning the size is
>>>>>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
>>>>>>>>> more prevalent this starts to present a problem for distributions.
>>>>>>>>> Boot-time page size selection enables the creation of a single kernel
>>>>>>>>> image, which can be told which page size to use on the kernel command line.
>>>>>>>>>
>>>>>>>>> Why is having an image-per-page size problematic?
>>>>>>>>> =================================================
>>>>>>>>>
>>>>>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
>>>>>>>>> managing 2 kernel packages, along with drivers for each. For some, it means
>>>>>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
>>>>>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
>>>>>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
>>>>>>>>> kernel is painful, and the extra flash space required for both kernel
>>>>>>>>> images and the duplicated modules has been problematic. Boot-time page size
>>>>>>>>> selection solves all of this.
>>>>>>>>>
>>>>>>>>> Additionally, in starting to think about the longer term deployment story
>>>>>>>>> for D128 page tables, which Arm architecture now supports, a lot of the
>>>>>>>>> same problems need to be solved, so this work sets us up nicely for that.
>>>>>>>>>
>>>>>>>>> So what's the down side?
>>>>>>>>> ========================
>>>>>>>>>
>>>>>>>>> Well nothing's free; Various static allocations in the kernel image must be
>>>>>>>>> sized for the worst case (largest supported page size), so image size is in
>>>>>>>>> line with size of 64K compile-time image. So if you're interested in 4K or
>>>>>>>>> 16K, there is a slight increase to the image size. But I expect that
>>>>>>>>> problem goes away if you're compressing the image - it's just some extra
>>>>>>>>> zeros. At boot-time, I expect we could free the unused static storage once
>>>>>>>>> we know the page size - although that would be a follow up enhancement.
>>>>>>>>>
>>>>>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
>>>>>>>>> compile-time constants, we must look up their values and do arithmetic at
>>>>>>>>> runtime instead of compile-time. My early perf testing suggests this is
>>>>>>>>> imperceptible for real-world workloads, and only has small impact on
>>>>>>>>> microbenchmarks - more on this below.
>>>>>>>>>
>>>>>>>>> Approach
>>>>>>>>> ========
>>>>>>>>>
>>>>>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>>>>>>>>> friends are compile-time constant, but in a way that allows the compiler to
>>>>>>>>> perform the same optimizations as was previously being done if they do turn
>>>>>>>>> out to be compile-time constant. Where constants are required, we use
>>>>>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
>>>>>>>>> description of all the classes of problems to solve.
>>>>>>>>>
>>>>>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>>>>>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
>>>>>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>>>>>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
>>>>>>>>>
>>>>>>>>> When boot-time page size is active, the arch pgtable geometry macro
>>>>>>>>> definitions resolve to something that can be configured at boot. The arm64
>>>>>>>>> implementation in this series mainly uses global, __ro_after_init
>>>>>>>>> variables. I've tried using alternatives patching, but that performs worse
>>>>>>>>> than loading from memory; I think due to code size bloat.
>>>>>>>>>
>>>>>>>>> Status
>>>>>>>>> ======
>>>>>>>>>
>>>>>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
>>>>>>>>> enough to compile the kernel image itself with defconfig (and a few other
>>>>>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
>>>>>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
>>>>>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
>>>>>>>>> anyone wants to do any testing, and has a must-have config, let me know and
>>>>>>>>> I'll prioritize enabling it first.
>>>>>>>>>
>>>>>>>>> The series is arranged as follows:
>>>>>>>>>
>>>>>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>>>>>> boot-time page size selection
>>>>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
>>>>>>>>> all non-arch code
>>>>>>>>> - patches 37-38: Some arm64 tidy ups
>>>>>>>>> - patch 39: Add macros required for converting arm64 code to support
>>>>>>>>> boot-time page size selection
>>>>>>>>> - patches 40-56: arm64 changes to support boot-time page size selection
>>>>>>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>>>>>>>>> selection
>>>>>>>>>
>>>>>>>>> Ideally, I'd like to get the basics merged (something like this series),
>>>>>>>>> then incrementally improve it over a handful of kernel releases until we
>>>>>>>>> can demonstrate that we have feature parity with the compile-time build and
>>>>>>>>> no performance blockers. Once at that point, ideally the compile-time build
>>>>>>>>> options would be removed and the code could be cleaned up further.
>>>>>>>>>
>>>>>>>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
>>>>>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Assuming people are amenable to the rough shape, how would I go about
>>>>>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
>>>>>>>>> each piece need to go independently to each relevant maintainer or could it
>>>>>>>>> all be merged together through the arm64 tree?
>>>>>>>>>
>>>>>>>>> Image Size
>>>>>>>>> ==========
>>>>>>>>>
>>>>>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>>>>>>>>> kernel image on disk for base (before any changes applied), compile (with
>>>>>>>>> changes, configured for compile-time page size) and boot (with changes,
>>>>>>>>> configured for boot-time page size).
>>>>>>>>>
>>>>>>>>> You can see that the compile-16k and 64k configs are actually slightly
>>>>>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
>>>>>>>>> which didn't need to depend on page size during the series. The boot-time
>>>>>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
>>>>>>>>> scope to improve this to make it equal to compile-64k if required:
>>>>>>>>>
>>>>>>>>> | config      | size/KB | diff/KB | diff/%  |
>>>>>>>>> |-------------|---------|---------|---------|
>>>>>>>>> | base-4k     | 54895   | 0       | 0.0%    |
>>>>>>>>> | base-16k    | 55161   | 266     | 0.5%    |
>>>>>>>>> | base-64k    | 56775   | 1880    | 3.4%    |
>>>>>>>>> | compile-4k  | 54895   | 0       | 0.0%    |
>>>>>>>>> | compile-16k | 55097   | 202     | 0.4%    |
>>>>>>>>> | compile-64k | 56391   | 1496    | 2.7%    |
>>>>>>>>> | boot-4K     | 57045   | 2150    | 3.9%    |
>>>>>>>>>
>>>>>>>>> And below shows the size of the image in memory at run-time, separated for
>>>>>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
>>>>>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
>>>>>>>>> instructions to load the values and do arithmetic. I believe we could
>>>>>>>>> eventually get the data cost to match the cost for the compile image for
>>>>>>>>> the chosen page size by freeing the ends of the static buffers not needed
>>>>>>>>> for the selected page size:
>>>>>>>>>
>>>>>>>>> |             | text    | text    | text    | data    | data    | data    |
>>>>>>>>> | config      | size/KB | diff/KB | diff/%  | size/KB | diff/KB | diff/%  |
>>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>>>>> | base-4k     | 20561   | 0       | 0.0%    | 14314   | 0       | 0.0%    |
>>>>>>>>> | base-16k    | 20439   | -122    | -0.6%   | 14625   | 311     | 2.2%    |
>>>>>>>>> | base-64k    | 20435   | -126    | -0.6%   | 15673   | 1359    | 9.5%    |
>>>>>>>>> | compile-4k  | 20565   | 4       | 0.0%    | 14315   | 1       | 0.0%    |
>>>>>>>>> | compile-16k | 20443   | -118    | -0.6%   | 14517   | 204     | 1.4%    |
>>>>>>>>> | compile-64k | 20439   | -122    | -0.6%   | 15134   | 820     | 5.7%    |
>>>>>>>>> | boot-4K     | 20811   | 250     | 1.2%    | 15287   | 973     | 6.8%    |
>>>>>>>>>
>>>>>>>>> Functional Testing
>>>>>>>>> ==================
>>>>>>>>>
>>>>>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
>>>>>>>>> most) without issue.
>>>>>>>>>
>>>>>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
>>>>>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
>>>>>>>>> with no regressions observed vs the equivalent compile-time page size build
>>>>>>>>> (although the mm-selftests have a few existing failures when run against
>>>>>>>>> 16K and 64K kernels - those should really be investigated and fixed
>>>>>>>>> independently).
>>>>>>>>>
>>>>>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
>>>>>>>>> many cases, I'm hoping the changes are simple enough that review might
>>>>>>>>> suffice?
>>>>>>>>>
>>>>>>>>> Performance Testing
>>>>>>>>> ===================
>>>>>>>>>
>>>>>>>>> I've run some limited performance benchmarks:
>>>>>>>>>
>>>>>>>>> First, a real-world benchmark that causes a lot of page table manipulation
>>>>>>>>> (and therefore we would expect to see regression here if we are going to
>>>>>>>>> see it anywhere); kernel compilation. It barely registers a change. Values
>>>>>>>>> are times, so smaller is better. All relative to base-4k:
>>>>>>>>>
>>>>>>>>> |             | kern    | kern    | user    | user    | real    | real    |
>>>>>>>>> | config      | mean    | stdev   | mean    | stdev   | mean    | stdev   |
>>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>>>>> | base-4k     | 0.0%    | 1.1%    | 0.0%    | 0.3%    | 0.0%    | 0.3%    |
>>>>>>>>> | compile-4k  | -0.2%   | 1.1%    | -0.2%   | 0.3%    | -0.1%   | 0.3%    |
>>>>>>>>> | boot-4k     | 0.1%    | 1.0%    | -0.3%   | 0.2%    | -0.2%   | 0.2%    |
>>>>>>>>>
>>>>>>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
>>>>>>>>> per min, so bigger is better. All relative to base-4k:
>>>>>>>>>
>>>>>>>>> | config      | mean    | stdev   |
>>>>>>>>> |-------------|---------|---------|
>>>>>>>>> | base-4k     | 0.0%    | 0.8%    |
>>>>>>>>> | compile-4k  | 0.4%    | 0.8%    |
>>>>>>>>> | boot-4k     | 0.0%    | 0.9%    |
>>>>>>>>>
>>>>>>>>> Finally, I've run some microbenchmarks known to stress page table
>>>>>>>>> manipulations (originally from David Hildenbrand). The fork test
>>>>>>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
>>>>>>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
>>>>>>>>> it. The fork test is known to be extremely sensitive to any changes that
>>>>>>>>> cause instructions to be aligned differently in cachelines. When using this
>>>>>>>>> test for other changes, I've seen double digit regressions for the
>>>>>>>>> slightest thing, so 12% regression on this test is actually fairly good.
>>>>>>>>> This likely represents the extreme worst case for regressions that will be
>>>>>>>>> observed across other microbenchmarks (famous last words). Values are
>>>>>>>>> times, so smaller is better. All relative to base-4k:
>>>>>>>>>
>>>>>>>>> |             | fork    | fork    | munmap  | munmap  |
>>>>>>>>> | config      | mean    | stdev   | mean    | stdev   |
>>>>>>>>> |-------------|---------|---------|---------|---------|
>>>>>>>>> | base-4k     | 0.0%    | 1.3%    | 0.0%    | 0.3%    |
>>>>>>>>> | compile-4k  | 0.1%    | 1.3%    | -0.9%   | 0.1%    |
>>>>>>>>> | boot-4k     | 12.8%   | 1.2%    | 3.8%    | 1.0%    |
>>>>>>>>>
>>>>>>>>> NOTE: The series applies on top of v6.11.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ryan Roberts (57):
>>>>>>>>> mm: Add macros ahead of supporting boot-time page size selection
>>>>>>>>> vmlinux: Align to PAGE_SIZE_MAX
>>>>>>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>>>>>>>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
>>>>>>>>> mm: Avoid split pmd ptl if pmd level is run-time folded
>>>>>>>>> mm: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>>>>>>>>> fs: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> fork: Permit boot-time THREAD_SIZE determination
>>>>>>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> bpf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> perf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> kvm: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> trace: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> crash: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> crypto: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> sound: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> edac: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> optee: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> random: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> virtio: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> xen: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>> arm64: Fix macros to work in C code in addition to the linker script
>>>>>>>>> arm64: Track early pgtable allocation limit
>>>>>>>>> arm64: Introduce macros required for boot-time page selection
>>>>>>>>> arm64: Refactor early pgtable size calculation macros
>>>>>>>>> arm64: Pass desired page size on command line
>>>>>>>>> arm64: Divorce early init from PAGE_SIZE
>>>>>>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>>>>>>>>> arm64: Align sections to PAGE_SIZE_MAX
>>>>>>>>> arm64: Rework trampoline rodata mapping
>>>>>>>>> arm64: Generalize fixmap for boot-time page size
>>>>>>>>> arm64: Statically allocate and align for worst-case page size
>>>>>>>>> arm64: Convert switch to if for non-const comparison values
>>>>>>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>>>>>>>>> arm64: Remove PAGE_SZ asm-offset
>>>>>>>>> arm64: Introduce cpu features for page sizes
>>>>>>>>> arm64: Remove PAGE_SIZE from assembly code
>>>>>>>>> arm64: Runtime-fold pmd level
>>>>>>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>>>>>>>>> arm64: TRAMP_VALIAS is no longer compile-time constant
>>>>>>>>> arm64: Determine THREAD_SIZE at boot-time
>>>>>>>>> arm64: Enable boot-time page size selection
>>>>>>>>>
>>>>>>>>> arch/alpha/include/asm/page.h | 1 +
>>>>>>>>> arch/arc/include/asm/page.h | 1 +
>>>>>>>>> arch/arm/include/asm/page.h | 1 +
>>>>>>>>> arch/arm64/Kconfig | 26 ++-
>>>>>>>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
>>>>>>>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
>>>>>>>>> arch/arm64/include/asm/efi.h | 2 +-
>>>>>>>>> arch/arm64/include/asm/fixmap.h | 28 ++-
>>>>>>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>>>>>>>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
>>>>>>>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
>>>>>>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>>>>>>>>> arch/arm64/include/asm/memory.h | 62 ++++--
>>>>>>>>> arch/arm64/include/asm/page-def.h | 3 +-
>>>>>>>>> arch/arm64/include/asm/pgalloc.h | 16 +-
>>>>>>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>>>>>>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>>>>>>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
>>>>>>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>>>>>>>>> arch/arm64/include/asm/processor.h | 10 +-
>>>>>>>>> arch/arm64/include/asm/sections.h | 1 +
>>>>>>>>> arch/arm64/include/asm/smp.h | 1 +
>>>>>>>>> arch/arm64/include/asm/sparsemem.h | 15 +-
>>>>>>>>> arch/arm64/include/asm/sysreg.h | 54 +++--
>>>>>>>>> arch/arm64/include/asm/tlb.h | 3 +
>>>>>>>>> arch/arm64/kernel/asm-offsets.c | 4 +-
>>>>>>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
>>>>>>>>> arch/arm64/kernel/efi.c | 2 +-
>>>>>>>>> arch/arm64/kernel/entry.S | 60 +++++-
>>>>>>>>> arch/arm64/kernel/head.S | 46 +++-
>>>>>>>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
>>>>>>>>> arch/arm64/kernel/image-vars.h | 14 ++
>>>>>>>>> arch/arm64/kernel/image.h | 4 +
>>>>>>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>>>>>>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>>>>>>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>>>>>>>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
>>>>>>>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
>>>>>>>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
>>>>>>>>> arch/arm64/kernel/vdso.c | 7 +-
>>>>>>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>>>>>>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
>>>>>>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>>>>>>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>>>>>>>>> arch/arm64/kvm/arm.c | 10 +
>>>>>>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>>>>>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>>>>>>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>>>>>>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>>>>>>>>> arch/arm64/kvm/mmu.c | 39 ++--
>>>>>>>>> arch/arm64/lib/clear_page.S | 7 +-
>>>>>>>>> arch/arm64/lib/copy_page.S | 33 ++-
>>>>>>>>> arch/arm64/lib/mte.S | 27 ++-
>>>>>>>>> arch/arm64/mm/Makefile | 1 +
>>>>>>>>> arch/arm64/mm/fixmap.c | 38 ++--
>>>>>>>>> arch/arm64/mm/hugetlbpage.c | 40 +---
>>>>>>>>> arch/arm64/mm/init.c | 26 +--
>>>>>>>>> arch/arm64/mm/kasan_init.c | 8 +-
>>>>>>>>> arch/arm64/mm/mmu.c | 53 +++--
>>>>>>>>> arch/arm64/mm/pgd.c | 12 +-
>>>>>>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
>>>>>>>>> arch/arm64/mm/proc.S | 128 ++++++++---
>>>>>>>>> arch/arm64/mm/ptdump.c | 3 +-
>>>>>>>>> arch/arm64/tools/cpucaps | 3 +
>>>>>>>>> arch/csky/include/asm/page.h | 3 +
>>>>>>>>> arch/hexagon/include/asm/page.h | 2 +
>>>>>>>>> arch/loongarch/include/asm/page.h | 2 +
>>>>>>>>> arch/m68k/include/asm/page.h | 1 +
>>>>>>>>> arch/microblaze/include/asm/page.h | 1 +
>>>>>>>>> arch/mips/include/asm/page.h | 1 +
>>>>>>>>> arch/nios2/include/asm/page.h | 2 +
>>>>>>>>> arch/openrisc/include/asm/page.h | 1 +
>>>>>>>>> arch/parisc/include/asm/page.h | 1 +
>>>>>>>>> arch/powerpc/include/asm/page.h | 2 +
>>>>>>>>> arch/riscv/include/asm/page.h | 1 +
>>>>>>>>> arch/s390/include/asm/page.h | 1 +
>>>>>>>>> arch/sh/include/asm/page.h | 1 +
>>>>>>>>> arch/sparc/include/asm/page.h | 3 +
>>>>>>>>> arch/um/include/asm/page.h | 2 +
>>>>>>>>> arch/x86/include/asm/page_types.h | 2 +
>>>>>>>>> arch/xtensa/include/asm/page.h | 1 +
>>>>>>>>> crypto/lskcipher.c | 4 +-
>>>>>>>>> drivers/ata/sata_sil24.c | 46 ++--
>>>>>>>>> drivers/base/node.c | 6 +-
>>>>>>>>> drivers/base/topology.c | 32 +--
>>>>>>>>> drivers/block/virtio_blk.c | 2 +-
>>>>>>>>> drivers/char/random.c | 4 +-
>>>>>>>>> drivers/edac/edac_mc.h | 13 +-
>>>>>>>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
>>>>>>>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
>>>>>>>>> drivers/mtd/mtdswap.c | 4 +-
>>>>>>>>> drivers/net/ethernet/freescale/fec.h | 3 +-
>>>>>>>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
>>>>>>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>>>>>>>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>>>>>>>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
>>>>>>>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>>>>>>>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>>>>>>>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
>>>>>>>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
>>>>>>>>> drivers/tee/optee/call.c | 7 +-
>>>>>>>>> drivers/tee/optee/smc_abi.c | 2 +-
>>>>>>>>> drivers/virtio/virtio_balloon.c | 10 +-
>>>>>>>>> drivers/xen/balloon.c | 11 +-
>>>>>>>>> drivers/xen/biomerge.c | 12 +-
>>>>>>>>> drivers/xen/privcmd.c | 2 +-
>>>>>>>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
>>>>>>>>> drivers/xen/xlate_mmu.c | 6 +-
>>>>>>>>> fs/binfmt_elf.c | 11 +-
>>>>>>>>> fs/buffer.c | 2 +-
>>>>>>>>> fs/coredump.c | 8 +-
>>>>>>>>> fs/ext4/ext4.h | 36 ++--
>>>>>>>>> fs/ext4/move_extent.c | 2 +-
>>>>>>>>> fs/ext4/readpage.c | 2 +-
>>>>>>>>> fs/fat/dir.c | 4 +-
>>>>>>>>> fs/fat/fatent.c | 4 +-
>>>>>>>>> fs/nfs/nfs42proc.c | 2 +-
>>>>>>>>> fs/nfs/nfs42xattr.c | 2 +-
>>>>>>>>> fs/nfs/nfs4proc.c | 2 +-
>>>>>>>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
>>>>>>>>> include/asm-generic/vmlinux.lds.h | 38 ++--
>>>>>>>>> include/linux/buffer_head.h | 1 +
>>>>>>>>> include/linux/cpumask.h | 5 +
>>>>>>>>> include/linux/linkage.h | 4 +-
>>>>>>>>> include/linux/mm.h | 17 +-
>>>>>>>>> include/linux/mm_types.h | 15 +-
>>>>>>>>> include/linux/mm_types_task.h | 2 +-
>>>>>>>>> include/linux/mmzone.h | 3 +-
>>>>>>>>> include/linux/netlink.h | 6 +-
>>>>>>>>> include/linux/percpu-defs.h | 4 +-
>>>>>>>>> include/linux/perf_event.h | 2 +-
>>>>>>>>> include/linux/sched.h | 4 +-
>>>>>>>>> include/linux/slab.h | 7 +-
>>>>>>>>> include/linux/stackdepot.h | 6 +-
>>>>>>>>> include/linux/sunrpc/svc.h | 8 +-
>>>>>>>>> include/linux/sunrpc/svc_rdma.h | 4 +-
>>>>>>>>> include/linux/sunrpc/svcsock.h | 2 +-
>>>>>>>>> include/linux/swap.h | 17 +-
>>>>>>>>> include/linux/swapops.h | 6 +-
>>>>>>>>> include/linux/thread_info.h | 10 +-
>>>>>>>>> include/xen/page.h | 2 +
>>>>>>>>> init/main.c | 7 +-
>>>>>>>>> kernel/bpf/core.c | 9 +-
>>>>>>>>> kernel/bpf/ringbuf.c | 54 ++---
>>>>>>>>> kernel/cgroup/cgroup.c | 8 +-
>>>>>>>>> kernel/crash_core.c | 2 +-
>>>>>>>>> kernel/events/core.c | 2 +-
>>>>>>>>> kernel/fork.c | 71 +++----
>>>>>>>>> kernel/power/power.h | 2 +-
>>>>>>>>> kernel/power/snapshot.c | 2 +-
>>>>>>>>> kernel/power/swap.c | 129 +++++++++--
>>>>>>>>> kernel/trace/fgraph.c | 2 +-
>>>>>>>>> kernel/trace/trace.c | 2 +-
>>>>>>>>> lib/stackdepot.c | 6 +-
>>>>>>>>> mm/kasan/report.c | 3 +-
>>>>>>>>> mm/memcontrol.c | 11 +-
>>>>>>>>> mm/memory.c | 4 +-
>>>>>>>>> mm/mmap.c | 2 +-
>>>>>>>>> mm/page-writeback.c | 2 +-
>>>>>>>>> mm/page_alloc.c | 31 +--
>>>>>>>>> mm/slub.c | 2 +-
>>>>>>>>> mm/sparse.c | 2 +-
>>>>>>>>> mm/swapfile.c | 2 +-
>>>>>>>>> mm/vmalloc.c | 7 +-
>>>>>>>>> net/9p/trans_virtio.c | 4 +-
>>>>>>>>> net/core/hotdata.c | 4 +-
>>>>>>>>> net/core/skbuff.c | 4 +-
>>>>>>>>> net/core/sysctl_net_core.c | 2 +-
>>>>>>>>> net/sunrpc/cache.c | 3 +-
>>>>>>>>> net/unix/af_unix.c | 2 +-
>>>>>>>>> sound/soc/soc-utils.c | 4 +-
>>>>>>>>> virt/kvm/kvm_main.c | 2 +-
>>>>>>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
>>>>>>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>>>>>>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>>>>>>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>>>>>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.43.0
>>>>>>>>
>>>>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
>>>>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>>>>>>>>
>>>>>>>> That said, I have a couple of questions:
>>>>>>>>
>>>>>>>> * Going forward, how would we handle drivers/modules that require a particular
>>>>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
>>>>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
>>>>>>>> other page sizes.
>>>>>>>
>>>>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
>>>>>>> unsupported page size is in use. Do you see any issue with that?
>>>>>>>
>>>>>>>>
>>>>>>>> * How would we handle an invalid selection at boot?
>>>>>>>
>>>>>>> What do you mean by invalid here? The current policy validates that the
>>>>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
>>>>>>> passed on the command line, or the passed value is not supported by the HW, then
>>>>>>> we default to the largest page size supported by the HW (so for Apple
>>>>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
>>>>>>> may be better to change that policy to use the smallest page size in this case;
>>>>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
>>>>>>>
>>>>>>>> Can we program in a
>>>>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
>>>>>>>
>>>>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
>>>>>>> Silicon? The trouble is that we need to select the page size very early in
>>>>>>> boot, before start_kernel() is called, so we really only have generic arch code
>>>>>>> and the command line with which to make the decision.
>>>>>>
>>>>>> Yes... I think a build-time CONFIG for default page size, which can be
>>>>>> overridden by a karg makes sense... Even on platforms like Apple
>>>>>> Silicon you may want to test very specific things in 4k by overriding
>>>>>> with a karg.
>>>>>
>>>>> Ahh, yes, that would certainly work. I'll work it into the next version.
>>>>>
>>>>
>>>> Could we maybe extend to have some kind of way to include a table of
>>>> SoC IDs for which certain modes are disabled (e.g. 64k on Apple Silicon)
>>>
>>> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
>>> supported.
>>>
>>>> and preferred modes when no arg is set (16k for Apple Silicon)? That
>>>
>>> And it's not obvious that we should hard-code a page size preference to a SoC
>>> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
>>> to decide, not the SoC.
>>>
>>> I'm guessing your desire is to have a single kernel build that will boot 16k by
>>> default on Apple Silicon and 4k by default on other systems, all without needing
>>> to modify the command line? Personally I think it's cleaner to just require
>>> setting the page size on the command line in these cases.
>>>
>>>> way it'd work something like this:
>>>>
>>>> 1. Table identification of 4/16/64 depending on identified SoC
>>> So I'd prefer not to have this
>>>
>>>> 2. Unidentified ones follow build-time default
>>>> 3. karg forces a mode regardless
>>> But keep these 2.
>>>
>>
> Since we are talking about Apple Silicon and page size, I would like to
> add that on the Apple Silicon SoCs I am working on, the situation is like
> this:
>
> Apple A7 (s5l8960x), A8 (T7000), A8X (T7001): CPU MMU supports 4K and 64K
> page sizes.
>
> Apple A9 (s8000/s8003), A9X (s8001), A10 (t8010), A10X (t8011), A11 (t8015):
> CPU MMU supports 16K and 64K page sizes.
>
> However, all of them have 4K page DART IOMMUs.
>
>> I think it makes sense to have it, because it's not just Apple Silicon
>> where such a preference/requirement may be necessary. Apple Silicon
>> technically works at 4k, but is completely broken at 4k because Linux
>> cannot do 16k IOMMU with 4k everything else, so being able to at least
>> prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
>> Hopper platform prefer 64k over other options (though I am unaware of
>> a gross incompatibility that effectively requires it like Apple
>> Silicon has).
>>
>> When we're trying to get to "single generic image that works
>> everywhere", stuff like this matters and I would really like you to
>> consider it from the lens of "we want things to work as automagic as
>> they do on x86".
> For me, in order to get to this level of automagic, there does need to be
> a table of which SoC should use which page size.
OK, but it's not clear to me that this table needs to be in the kernel. Could it
not be something in user space (e.g. during installation) that configures the
kernel command line?
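To make the user-space idea concrete, here is a minimal sketch of what an
installer-side lookup might look like. It is not code from this series: the
table entries and compatible strings are hypothetical placeholders, and the
"arm64.pagesize=" parameter name is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical preference table kept in user space (e.g. by an installer). */
struct pagesize_pref {
	const char *compatible;	/* placeholder SoC identifier, not a real one */
	const char *pagesize;	/* "4k", "16k" or "64k" */
};

static const struct pagesize_pref prefs[] = {
	{ "vendor-a,soc-16k", "16k" },	/* e.g. an SoC whose IOMMU needs 16k */
	{ "vendor-b,soc-64k", "64k" },	/* e.g. an SoC that prefers 64k */
};

/* Return the preferred page size for a SoC, or NULL if there is none. */
const char *preferred_pagesize(const char *compatible)
{
	for (size_t i = 0; i < sizeof(prefs) / sizeof(prefs[0]); i++)
		if (strcmp(prefs[i].compatible, compatible) == 0)
			return prefs[i].pagesize;
	return NULL;
}

/*
 * Write the command line fragment for a SoC into buf.
 * Returns the fragment length, 0 if no preference exists (buf is emptied),
 * or -1 if buf is too small. The parameter name is assumed.
 */
int format_pagesize_karg(const char *compatible, char *buf, size_t len)
{
	const char *ps = preferred_pagesize(compatible);
	int n;

	if (!ps) {
		if (len)
			buf[0] = '\0';
		return 0;
	}
	n = snprintf(buf, len, "arm64.pagesize=%s", ps);
	return (n > 0 && (size_t)n < len) ? n : -1;
}
```

An unknown SoC yields an empty fragment, so the kernel's own default would
still apply in that case.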
Regardless, the hard work here is getting the boot-time page size selection
mechanism in place. Once that's there, follow up patches can add the desired
policy. I'd rather leave it out for now to avoid anything slowing down the core
work.
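For reference, the ordering discussed up-thread (the karg wins, then a
build-time default, each validated against what the HW reports) could be
sketched roughly as below. This is only an illustration, not the code in the
series: the flag names and the function are made up, and the final fallback
picks the smallest supported size, which is the alternative policy floated
above rather than the current behaviour.

```c
#include <stddef.h>

/* Hypothetical one-bit-per-size flags; not identifiers from the series. */
enum {
	PS_4K  = 1u << 0,
	PS_16K = 1u << 1,
	PS_64K = 1u << 2,
};

/*
 * hw_mask:       sizes the CPU reports as supported (as read from mmfr0)
 * requested:     size passed on the command line, or 0 if none
 * build_default: size chosen at build time, or 0 if none
 *
 * Returns the selected size flag, or 0 if hw_mask is empty.
 */
unsigned int select_page_size(unsigned int hw_mask, unsigned int requested,
			      unsigned int build_default)
{
	/* 1. The command line wins if the HW supports the request. */
	if (requested & hw_mask)
		return requested;

	/* 2. Otherwise use the build-time default if the HW supports it. */
	if (build_default & hw_mask)
		return build_default;

	/* 3. Assumed fallback: the smallest size the HW supports. */
	for (unsigned int ps = PS_4K; ps <= PS_64K; ps <<= 1)
		if (ps & hw_mask)
			return ps;

	return 0;
}
```

With a mask of 4k and 16k, a request for 64k and a build default of 16k, this
returns 16k; with the same mask and a request for 4k it returns 4k.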
Thanks,
Ryan
>
>>
>>
>
> Nick Chan
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-22 15:12 ` Ryan Roberts
@ 2024-10-22 17:30 ` Neal Gompa
2024-10-24 10:34 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Neal Gompa @ 2024-10-22 17:30 UTC (permalink / raw)
To: Ryan Roberts
Cc: Nick Chan, Eric Curtin, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Will Deacon, Hector Martin,
linux-arm-kernel, linux-kernel, linux-mm, asahi
On Tue, Oct 22, 2024 at 11:12 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 22/10/2024 16:03, Nick Chan wrote:
> >
> >
> > On 2024/10/22 at 5:33 PM, Neal Gompa wrote:
> >> On Mon, Oct 21, 2024 at 11:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> On 21/10/2024 14:49, Neal Gompa wrote:
> >>>> On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> On 21/10/2024 12:32, Eric Curtin wrote:
> >>>>>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> On 19/10/2024 16:47, Neal Gompa wrote:
> >>>>>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
> >>>>>>>>> Hi All,
> >>>>>>>>>
> >>>>>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
> >>>>>>>>> set of people on the full series and additionally included maintainers on
> >>>>>>>>> relevant patches. I haven't included those maintainers on this cover letter
> >>>>>>>>> since the numbers were far too big for it to work. But I've included a link
> >>>>>>>>> to this cover letter on each patch, so they can hopefully find their way
> >>>>>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
> >>>>>>>>> thought it was important to show the full picture.
> >>>>>>>>>
> >>>>>>>>> This RFC series implements support for boot-time page size selection within
> >>>>>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
> >>>>>>>>> date, page size has been selected at compile-time, meaning the size is
> >>>>>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
> >>>>>>>>> more prevalent this starts to present a problem for distributions.
> >>>>>>>>> Boot-time page size selection enables the creation of a single kernel
> >>>>>>>>> image, which can be told which page size to use on the kernel command line.
> >>>>>>>>>
> >>>>>>>>> Why is having an image-per-page size problematic?
> >>>>>>>>> =================================================
> >>>>>>>>>
> >>>>>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
> >>>>>>>>> managing 2 kernel packages, along with drivers for each. For some, it means
> >>>>>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
> >>>>>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
> >>>>>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
> >>>>>>>>> kernel is painful, and the extra flash space required for both kernel
> >>>>>>>>> images and the duplicated modules has been problematic. Boot-time page size
> >>>>>>>>> selection solves all of this.
> >>>>>>>>>
> >>>>>>>>> Additionally, in starting to think about the longer term deployment story
> >>>>>>>>> for D128 page tables, which Arm architecture now supports, a lot of the
> >>>>>>>>> same problems need to be solved, so this work sets us up nicely for that.
> >>>>>>>>>
> >>>>>>>>> So what's the down side?
> >>>>>>>>> ========================
> >>>>>>>>>
> >>>>>>>>> Well nothing's free; Various static allocations in the kernel image must be
> >>>>>>>>> sized for the worst case (largest supported page size), so image size is in
> >>>>>>>>> line with size of 64K compile-time image. So if you're interested in 4K or
> >>>>>>>>> 16K, there is a slight increase to the image size. But I expect that
> >>>>>>>>> problem goes away if you're compressing the image - it's just some extra
> >>>>>>>>> zeros. At boot-time, I expect we could free the unused static storage once
> >>>>>>>>> we know the page size - although that would be a follow up enhancement.
> >>>>>>>>>
> >>>>>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
> >>>>>>>>> compile-time constants, we must look up their values and do arithmetic at
> >>>>>>>>> runtime instead of compile-time. My early perf testing suggests this is
> >>>>>>>>> imperceptible for real-world workloads, and only has small impact on
> >>>>>>>>> microbenchmarks - more on this below.
> >>>>>>>>>
> >>>>>>>>> Approach
> >>>>>>>>> ========
> >>>>>>>>>
> >>>>>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> >>>>>>>>> friends are compile-time constant, but in a way that allows the compiler to
> >>>>>>>>> perform the same optimizations as were previously being done if they do turn
> >>>>>>>>> out to be compile-time constant. Where constants are required, we use
> >>>>>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
> >>>>>>>>> description of all the classes of problems to solve.
> >>>>>>>>>
> >>>>>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> >>>>>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
> >>>>>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> >>>>>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
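A minimal user-space sketch of the pattern described above, assuming an arch that has opted in: the limits stay compile-time constants, while PAGE_SIZE resolves to a variable set once at boot. The names demo_page_size, demo_buf and the demo_*() helpers are hypothetical and are not taken from the series; only PAGE_SIZE, PAGE_SIZE_MIN and PAGE_SIZE_MAX come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical arch opt-in: the limits are constants, the live value is not. */
#define PAGE_SIZE_MIN (4UL * 1024)
#define PAGE_SIZE_MAX (64UL * 1024)

/* Stand-in for a __ro_after_init global that early boot code writes once. */
static unsigned long demo_page_size = PAGE_SIZE_MIN;
#define PAGE_SIZE demo_page_size

/* Static storage needs a constant, so it is sized for the worst case. */
static unsigned char demo_buf[PAGE_SIZE_MAX];

void demo_set_page_size(unsigned long size)
{
	assert(size >= PAGE_SIZE_MIN && size <= PAGE_SIZE_MAX);
	demo_page_size = size;
}

size_t demo_buf_capacity(void)
{
	return sizeof(demo_buf);
}

/* Run-time users only touch the first PAGE_SIZE bytes of the buffer. */
size_t demo_buf_in_use(void)
{
	return PAGE_SIZE;
}

void demo_clear(void)
{
	memset(demo_buf, 0, demo_buf_in_use());
}
```

With a compile-time page size the same source would presumably define all three macros to the same constant, so the compiler could fold the arithmetic exactly as before.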
> >>>>>>>>>
> >>>>>>>>> When boot-time page size is active, the arch pgtable geometry macro
> >>>>>>>>> definitions resolve to something that can be configured at boot. The arm64
> >>>>>>>>> implementation in this series mainly uses global, __ro_after_init
> >>>>>>>>> variables. I've tried using alternatives patching, but that performs worse
> >>>>>>>>> than loading from memory; I think due to code size bloat.
> >>>>>>>>>
> >>>>>>>>> Status
> >>>>>>>>> ======
> >>>>>>>>>
> >>>>>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
> >>>>>>>>> enough to compile the kernel image itself with defconfig (and a few other
> >>>>>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
> >>>>>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
> >>>>>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
> >>>>>>>>> anyone wants to do any testing, and has a must-have config, let me know and
> >>>>>>>>> I'll prioritize enabling it first.
> >>>>>>>>>
> >>>>>>>>> The series is arranged as follows:
> >>>>>>>>>
> >>>>>>>>> - patch 1: Add macros required for converting non-arch code to support
> >>>>>>>>> boot-time page size selection
> >>>>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
> >>>>>>>>> all non-arch code
> >>>>>>>>> - patches 37-38: Some arm64 tidy ups
> >>>>>>>>> - patch 39: Add macros required for converting arm64 code to support
> >>>>>>>>> boot-time page size selection
> >>>>>>>>> - patches 40-56: arm64 changes to support boot-time page size selection
> >>>>>>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
> >>>>>>>>> selection
> >>>>>>>>>
> >>>>>>>>> Ideally, I'd like to get the basics merged (something like this series),
> >>>>>>>>> then incrementally improve it over a handful of kernel releases until we
> >>>>>>>>> can demonstrate that we have feature parity with the compile-time build and
> >>>>>>>>> no performance blockers. Once at that point, ideally the compile-time build
> >>>>>>>>> options would be removed and the code could be cleaned up further.
> >>>>>>>>>
> >>>>>>>>> One of the bigger pieces that I'd propose to add as a follow up, is to make
> >>>>>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> >>>>>>>>> handling.
> >>>>>>>>>
> >>>>>>>>> Assuming people are amenable to the rough shape, how would I go about
> >>>>>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
> >>>>>>>>> each piece need to go independently to each relevant maintainer or could it
> >>>>>>>>> all be merged together through the arm64 tree?
> >>>>>>>>>
> >>>>>>>>> Image Size
> >>>>>>>>> ==========
> >>>>>>>>>
> >>>>>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> >>>>>>>>> kernel image on disk for base (before any changes applied), compile (with
> >>>>>>>>> changes, configured for compile-time page size) and boot (with changes,
> >>>>>>>>> configured for boot-time page size).
> >>>>>>>>>
> >>>>>>>>> You can see that the compile-16k and 64k configs are actually slightly
> >>>>>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
> >>>>>>>>> which didn't need to depend on page size during the series. The boot-time
> >>>>>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
> >>>>>>>>> scope to improve this to make it equal to compile-64k if required:
> >>>>>>>>>
> >>>>>>>>> | config      | size/KB | diff/KB | diff/%  |
> >>>>>>>>> |-------------|---------|---------|---------|
> >>>>>>>>> | base-4k     | 54895   | 0       | 0.0%    |
> >>>>>>>>> | base-16k    | 55161   | 266     | 0.5%    |
> >>>>>>>>> | base-64k    | 56775   | 1880    | 3.4%    |
> >>>>>>>>> | compile-4k  | 54895   | 0       | 0.0%    |
> >>>>>>>>> | compile-16k | 55097   | 202     | 0.4%    |
> >>>>>>>>> | compile-64k | 56391   | 1496    | 2.7%    |
> >>>>>>>>> | boot-4K     | 57045   | 2150    | 3.9%    |
> >>>>>>>>>
> >>>>>>>>> And below shows the size of the image in memory at run-time, separated for
> >>>>>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
> >>>>>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
> >>>>>>>>> instructions to load the values and do arithmetic. I believe we could
> >>>>>>>>> eventually get the data cost to match the cost for the compile image for
> >>>>>>>>> the chosen page size by freeing the ends of the static buffers not needed
> >>>>>>>>> for the selected page size:
> >>>>>>>>>
> >>>>>>>>> |             | text    | text    | text    | data    | data    | data    |
> >>>>>>>>> | config      | size/KB | diff/KB | diff/%  | size/KB | diff/KB | diff/%  |
> >>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>>>>>>> | base-4k     | 20561   | 0       | 0.0%    | 14314   | 0       | 0.0%    |
> >>>>>>>>> | base-16k    | 20439   | -122    | -0.6%   | 14625   | 311     | 2.2%    |
> >>>>>>>>> | base-64k    | 20435   | -126    | -0.6%   | 15673   | 1359    | 9.5%    |
> >>>>>>>>> | compile-4k  | 20565   | 4       | 0.0%    | 14315   | 1       | 0.0%    |
> >>>>>>>>> | compile-16k | 20443   | -118    | -0.6%   | 14517   | 204     | 1.4%    |
> >>>>>>>>> | compile-64k | 20439   | -122    | -0.6%   | 15134   | 820     | 5.7%    |
> >>>>>>>>> | boot-4K     | 20811   | 250     | 1.2%    | 15287   | 973     | 6.8%    |
> >>>>>>>>>
> >>>>>>>>> Functional Testing
> >>>>>>>>> ==================
> >>>>>>>>>
> >>>>>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
> >>>>>>>>> most) without issue.
> >>>>>>>>>
> >>>>>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
> >>>>>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
> >>>>>>>>> with no regressions observed vs the equivalent compile-time page size build
> >>>>>>>>> (although the mm-selftests have a few existing failures when run against
> >>>>>>>>> 16K and 64K kernels - those should really be investigated and fixed
> >>>>>>>>> independently).
> >>>>>>>>>
> >>>>>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
> >>>>>>>>> many cases, I'm hoping the changes are simple enough that review might
> >>>>>>>>> suffice?
> >>>>>>>>>
> >>>>>>>>> Performance Testing
> >>>>>>>>> ===================
> >>>>>>>>>
> >>>>>>>>> I've run some limited performance benchmarks:
> >>>>>>>>>
> >>>>>>>>> First, a real-world benchmark that causes a lot of page table manipulation
> >>>>>>>>> (and therefore we would expect to see regression here if we are going to
> >>>>>>>>> see it anywhere); kernel compilation. It barely registers a change. Values
> >>>>>>>>> are times, so smaller is better. All relative to base-4k:
> >>>>>>>>>
> >>>>>>>>> |             | kern    | kern    | user    | user    | real    | real    |
> >>>>>>>>> | config      | mean    | stdev   | mean    | stdev   | mean    | stdev   |
> >>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>>>>>>> | base-4k     | 0.0%    | 1.1%    | 0.0%    | 0.3%    | 0.0%    | 0.3%    |
> >>>>>>>>> | compile-4k  | -0.2%   | 1.1%    | -0.2%   | 0.3%    | -0.1%   | 0.3%    |
> >>>>>>>>> | boot-4k     | 0.1%    | 1.0%    | -0.3%   | 0.2%    | -0.2%   | 0.2%    |
> >>>>>>>>>
> >>>>>>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
> >>>>>>>>> per min, so bigger is better. All relative to base-4k:
> >>>>>>>>>
> >>>>>>>>> | config      | mean    | stdev   |
> >>>>>>>>> |-------------|---------|---------|
> >>>>>>>>> | base-4k     | 0.0%    | 0.8%    |
> >>>>>>>>> | compile-4k  | 0.4%    | 0.8%    |
> >>>>>>>>> | boot-4k     | 0.0%    | 0.9%    |
> >>>>>>>>>
> >>>>>>>>> Finally, I've run some microbenchmarks known to stress page table
> >>>>>>>>> manipulations (originally from David Hildenbrand). The fork test
> >>>>>>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
> >>>>>>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
> >>>>>>>>> it. The fork test is known to be extremely sensitive to any changes that
> >>>>>>>>> cause instructions to be aligned differently in cachelines. When using this
> >>>>>>>>> test for other changes, I've seen double digit regressions for the
> >>>>>>>>> slightest thing, so 12% regression on this test is actually fairly good.
> >>>>>>>>> This likely represents the extreme worst case for regressions that will be
> >>>>>>>>> observed across other microbenchmarks (famous last words). Values are
> >>>>>>>>> times, so smaller is better. All relative to base-4k:
> >>>>>>>>>
> >>>>>>>>> |             | fork    | fork    | munmap  | munmap  |
> >>>>>>>>> | config      | mean    | stdev   | mean    | stdev   |
> >>>>>>>>> |-------------|---------|---------|---------|---------|
> >>>>>>>>> | base-4k     | 0.0%    | 1.3%    | 0.0%    | 0.3%    |
> >>>>>>>>> | compile-4k  | 0.1%    | 1.3%    | -0.9%   | 0.1%    |
> >>>>>>>>> | boot-4k     | 12.8%   | 1.2%    | 3.8%    | 1.0%    |
> >>>>>>>>>
> >>>>>>>>> NOTE: The series applies on top of v6.11.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Ryan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Ryan Roberts (57):
> >>>>>>>>> mm: Add macros ahead of supporting boot-time page size selection
> >>>>>>>>> vmlinux: Align to PAGE_SIZE_MAX
> >>>>>>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
> >>>>>>>>> mm/page_alloc: Make page_frag_cache boot-time page size compatible
> >>>>>>>>> mm: Avoid split pmd ptl if pmd level is run-time folded
> >>>>>>>>> mm: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
> >>>>>>>>> fs: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> fork: Permit boot-time THREAD_SIZE determination
> >>>>>>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> bpf: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> stackdepot: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> perf: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> kvm: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> trace: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> crash: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> crypto: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> sound: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> edac: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> optee: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> random: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> virtio: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> xen: Remove PAGE_SIZE compile-time constant assumption
> >>>>>>>>> arm64: Fix macros to work in C code in addition to the linker script
> >>>>>>>>> arm64: Track early pgtable allocation limit
> >>>>>>>>> arm64: Introduce macros required for boot-time page selection
> >>>>>>>>> arm64: Refactor early pgtable size calculation macros
> >>>>>>>>> arm64: Pass desired page size on command line
> >>>>>>>>> arm64: Divorce early init from PAGE_SIZE
> >>>>>>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
> >>>>>>>>> arm64: Align sections to PAGE_SIZE_MAX
> >>>>>>>>> arm64: Rework trampoline rodata mapping
> >>>>>>>>> arm64: Generalize fixmap for boot-time page size
> >>>>>>>>> arm64: Statically allocate and align for worst-case page size
> >>>>>>>>> arm64: Convert switch to if for non-const comparison values
> >>>>>>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON
> >>>>>>>>> arm64: Remove PAGE_SZ asm-offset
> >>>>>>>>> arm64: Introduce cpu features for page sizes
> >>>>>>>>> arm64: Remove PAGE_SIZE from assembly code
> >>>>>>>>> arm64: Runtime-fold pmd level
> >>>>>>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappings
> >>>>>>>>> arm64: TRAMP_VALIAS is no longer compile-time constant
> >>>>>>>>> arm64: Determine THREAD_SIZE at boot-time
> >>>>>>>>> arm64: Enable boot-time page size selection
> >>>>>>>>>
> >>>>>>>>> arch/alpha/include/asm/page.h | 1 +
> >>>>>>>>> arch/arc/include/asm/page.h | 1 +
> >>>>>>>>> arch/arm/include/asm/page.h | 1 +
> >>>>>>>>> arch/arm64/Kconfig | 26 ++-
> >>>>>>>>> arch/arm64/include/asm/assembler.h | 78 ++++++-
> >>>>>>>>> arch/arm64/include/asm/cpufeature.h | 44 +++-
> >>>>>>>>> arch/arm64/include/asm/efi.h | 2 +-
> >>>>>>>>> arch/arm64/include/asm/fixmap.h | 28 ++-
> >>>>>>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
> >>>>>>>>> arch/arm64/include/asm/kvm_arm.h | 21 +-
> >>>>>>>>> arch/arm64/include/asm/kvm_hyp.h | 11 +
> >>>>>>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +-
> >>>>>>>>> arch/arm64/include/asm/memory.h | 62 ++++--
> >>>>>>>>> arch/arm64/include/asm/page-def.h | 3 +-
> >>>>>>>>> arch/arm64/include/asm/pgalloc.h | 16 +-
> >>>>>>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
> >>>>>>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
> >>>>>>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +-
> >>>>>>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++---
> >>>>>>>>> arch/arm64/include/asm/processor.h | 10 +-
> >>>>>>>>> arch/arm64/include/asm/sections.h | 1 +
> >>>>>>>>> arch/arm64/include/asm/smp.h | 1 +
> >>>>>>>>> arch/arm64/include/asm/sparsemem.h | 15 +-
> >>>>>>>>> arch/arm64/include/asm/sysreg.h | 54 +++--
> >>>>>>>>> arch/arm64/include/asm/tlb.h | 3 +
> >>>>>>>>> arch/arm64/kernel/asm-offsets.c | 4 +-
> >>>>>>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++--
> >>>>>>>>> arch/arm64/kernel/efi.c | 2 +-
> >>>>>>>>> arch/arm64/kernel/entry.S | 60 +++++-
> >>>>>>>>> arch/arm64/kernel/head.S | 46 +++-
> >>>>>>>>> arch/arm64/kernel/hibernate-asm.S | 6 +-
> >>>>>>>>> arch/arm64/kernel/image-vars.h | 14 ++
> >>>>>>>>> arch/arm64/kernel/image.h | 4 +
> >>>>>>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
> >>>>>>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
> >>>>>>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
> >>>>>>>>> arch/arm64/kernel/pi/pi.h | 63 +++++-
> >>>>>>>>> arch/arm64/kernel/relocate_kernel.S | 10 +-
> >>>>>>>>> arch/arm64/kernel/vdso-wrap.S | 4 +-
> >>>>>>>>> arch/arm64/kernel/vdso.c | 7 +-
> >>>>>>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
> >>>>>>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +-
> >>>>>>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
> >>>>>>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++--
> >>>>>>>>> arch/arm64/kvm/arm.c | 10 +
> >>>>>>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
> >>>>>>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
> >>>>>>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
> >>>>>>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
> >>>>>>>>> arch/arm64/kvm/mmu.c | 39 ++--
> >>>>>>>>> arch/arm64/lib/clear_page.S | 7 +-
> >>>>>>>>> arch/arm64/lib/copy_page.S | 33 ++-
> >>>>>>>>> arch/arm64/lib/mte.S | 27 ++-
> >>>>>>>>> arch/arm64/mm/Makefile | 1 +
> >>>>>>>>> arch/arm64/mm/fixmap.c | 38 ++--
> >>>>>>>>> arch/arm64/mm/hugetlbpage.c | 40 +---
> >>>>>>>>> arch/arm64/mm/init.c | 26 +--
> >>>>>>>>> arch/arm64/mm/kasan_init.c | 8 +-
> >>>>>>>>> arch/arm64/mm/mmu.c | 53 +++--
> >>>>>>>>> arch/arm64/mm/pgd.c | 12 +-
> >>>>>>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++
> >>>>>>>>> arch/arm64/mm/proc.S | 128 ++++++++---
> >>>>>>>>> arch/arm64/mm/ptdump.c | 3 +-
> >>>>>>>>> arch/arm64/tools/cpucaps | 3 +
> >>>>>>>>> arch/csky/include/asm/page.h | 3 +
> >>>>>>>>> arch/hexagon/include/asm/page.h | 2 +
> >>>>>>>>> arch/loongarch/include/asm/page.h | 2 +
> >>>>>>>>> arch/m68k/include/asm/page.h | 1 +
> >>>>>>>>> arch/microblaze/include/asm/page.h | 1 +
> >>>>>>>>> arch/mips/include/asm/page.h | 1 +
> >>>>>>>>> arch/nios2/include/asm/page.h | 2 +
> >>>>>>>>> arch/openrisc/include/asm/page.h | 1 +
> >>>>>>>>> arch/parisc/include/asm/page.h | 1 +
> >>>>>>>>> arch/powerpc/include/asm/page.h | 2 +
> >>>>>>>>> arch/riscv/include/asm/page.h | 1 +
> >>>>>>>>> arch/s390/include/asm/page.h | 1 +
> >>>>>>>>> arch/sh/include/asm/page.h | 1 +
> >>>>>>>>> arch/sparc/include/asm/page.h | 3 +
> >>>>>>>>> arch/um/include/asm/page.h | 2 +
> >>>>>>>>> arch/x86/include/asm/page_types.h | 2 +
> >>>>>>>>> arch/xtensa/include/asm/page.h | 1 +
> >>>>>>>>> crypto/lskcipher.c | 4 +-
> >>>>>>>>> drivers/ata/sata_sil24.c | 46 ++--
> >>>>>>>>> drivers/base/node.c | 6 +-
> >>>>>>>>> drivers/base/topology.c | 32 +--
> >>>>>>>>> drivers/block/virtio_blk.c | 2 +-
> >>>>>>>>> drivers/char/random.c | 4 +-
> >>>>>>>>> drivers/edac/edac_mc.h | 13 +-
> >>>>>>>>> drivers/firmware/efi/libstub/arm64.c | 3 +-
> >>>>>>>>> drivers/irqchip/irq-gic-v3-its.c | 2 +-
> >>>>>>>>> drivers/mtd/mtdswap.c | 4 +-
> >>>>>>>>> drivers/net/ethernet/freescale/fec.h | 3 +-
> >>>>>>>>> drivers/net/ethernet/freescale/fec_main.c | 5 +-
> >>>>>>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
> >>>>>>>>> drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
> >>>>>>>>> drivers/net/ethernet/intel/igb/igb.h | 25 +--
> >>>>>>>>> drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
> >>>>>>>>> drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
> >>>>>>>>> drivers/net/ethernet/marvell/mvneta.c | 9 +-
> >>>>>>>>> drivers/net/ethernet/marvell/sky2.h | 2 +-
> >>>>>>>>> drivers/tee/optee/call.c | 7 +-
> >>>>>>>>> drivers/tee/optee/smc_abi.c | 2 +-
> >>>>>>>>> drivers/virtio/virtio_balloon.c | 10 +-
> >>>>>>>>> drivers/xen/balloon.c | 11 +-
> >>>>>>>>> drivers/xen/biomerge.c | 12 +-
> >>>>>>>>> drivers/xen/privcmd.c | 2 +-
> >>>>>>>>> drivers/xen/xenbus/xenbus_client.c | 5 +-
> >>>>>>>>> drivers/xen/xlate_mmu.c | 6 +-
> >>>>>>>>> fs/binfmt_elf.c | 11 +-
> >>>>>>>>> fs/buffer.c | 2 +-
> >>>>>>>>> fs/coredump.c | 8 +-
> >>>>>>>>> fs/ext4/ext4.h | 36 ++--
> >>>>>>>>> fs/ext4/move_extent.c | 2 +-
> >>>>>>>>> fs/ext4/readpage.c | 2 +-
> >>>>>>>>> fs/fat/dir.c | 4 +-
> >>>>>>>>> fs/fat/fatent.c | 4 +-
> >>>>>>>>> fs/nfs/nfs42proc.c | 2 +-
> >>>>>>>>> fs/nfs/nfs42xattr.c | 2 +-
> >>>>>>>>> fs/nfs/nfs4proc.c | 2 +-
> >>>>>>>>> include/asm-generic/pgtable-geometry.h | 71 +++++++
> >>>>>>>>> include/asm-generic/vmlinux.lds.h | 38 ++--
> >>>>>>>>> include/linux/buffer_head.h | 1 +
> >>>>>>>>> include/linux/cpumask.h | 5 +
> >>>>>>>>> include/linux/linkage.h | 4 +-
> >>>>>>>>> include/linux/mm.h | 17 +-
> >>>>>>>>> include/linux/mm_types.h | 15 +-
> >>>>>>>>> include/linux/mm_types_task.h | 2 +-
> >>>>>>>>> include/linux/mmzone.h | 3 +-
> >>>>>>>>> include/linux/netlink.h | 6 +-
> >>>>>>>>> include/linux/percpu-defs.h | 4 +-
> >>>>>>>>> include/linux/perf_event.h | 2 +-
> >>>>>>>>> include/linux/sched.h | 4 +-
> >>>>>>>>> include/linux/slab.h | 7 +-
> >>>>>>>>> include/linux/stackdepot.h | 6 +-
> >>>>>>>>> include/linux/sunrpc/svc.h | 8 +-
> >>>>>>>>> include/linux/sunrpc/svc_rdma.h | 4 +-
> >>>>>>>>> include/linux/sunrpc/svcsock.h | 2 +-
> >>>>>>>>> include/linux/swap.h | 17 +-
> >>>>>>>>> include/linux/swapops.h | 6 +-
> >>>>>>>>> include/linux/thread_info.h | 10 +-
> >>>>>>>>> include/xen/page.h | 2 +
> >>>>>>>>> init/main.c | 7 +-
> >>>>>>>>> kernel/bpf/core.c | 9 +-
> >>>>>>>>> kernel/bpf/ringbuf.c | 54 ++---
> >>>>>>>>> kernel/cgroup/cgroup.c | 8 +-
> >>>>>>>>> kernel/crash_core.c | 2 +-
> >>>>>>>>> kernel/events/core.c | 2 +-
> >>>>>>>>> kernel/fork.c | 71 +++----
> >>>>>>>>> kernel/power/power.h | 2 +-
> >>>>>>>>> kernel/power/snapshot.c | 2 +-
> >>>>>>>>> kernel/power/swap.c | 129 +++++++++--
> >>>>>>>>> kernel/trace/fgraph.c | 2 +-
> >>>>>>>>> kernel/trace/trace.c | 2 +-
> >>>>>>>>> lib/stackdepot.c | 6 +-
> >>>>>>>>> mm/kasan/report.c | 3 +-
> >>>>>>>>> mm/memcontrol.c | 11 +-
> >>>>>>>>> mm/memory.c | 4 +-
> >>>>>>>>> mm/mmap.c | 2 +-
> >>>>>>>>> mm/page-writeback.c | 2 +-
> >>>>>>>>> mm/page_alloc.c | 31 +--
> >>>>>>>>> mm/slub.c | 2 +-
> >>>>>>>>> mm/sparse.c | 2 +-
> >>>>>>>>> mm/swapfile.c | 2 +-
> >>>>>>>>> mm/vmalloc.c | 7 +-
> >>>>>>>>> net/9p/trans_virtio.c | 4 +-
> >>>>>>>>> net/core/hotdata.c | 4 +-
> >>>>>>>>> net/core/skbuff.c | 4 +-
> >>>>>>>>> net/core/sysctl_net_core.c | 2 +-
> >>>>>>>>> net/sunrpc/cache.c | 3 +-
> >>>>>>>>> net/unix/af_unix.c | 2 +-
> >>>>>>>>> sound/soc/soc-utils.c | 4 +-
> >>>>>>>>> virt/kvm/kvm_main.c | 2 +-
> >>>>>>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
> >>>>>>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> >>>>>>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> >>>>>>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> >>>>>>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> 2.43.0
> >>>>>>>>
> >>>>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
> >>>>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
> >>>>>>>>
> >>>>>>>> That said, I have a couple of questions:
> >>>>>>>>
> >>>>>>>> * Going forward, how would we handle drivers/modules that require a particular
> >>>>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
> >>>>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
> >>>>>>>> other page sizes.
> >>>>>>>
> >>>>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
> >>>>>>> unsupported page size is in use. Do you see any issue with that?
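The probe-time check suggested above might look roughly like the following user-space sketch. It assumes a driver that only works with 16K pages; demo_page_size, demo_boot_with_page_size(), demo_probe() and DEMO_REQUIRED_PAGE_SIZE are all hypothetical names, and the choice of -ENODEV as the error code is an assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the boot-time page size; hypothetical, set during early boot. */
static unsigned long demo_page_size = 4096;
#define PAGE_SIZE demo_page_size

/* Hypothetical requirement of an IOMMU-style driver that only handles 16K. */
#define DEMO_REQUIRED_PAGE_SIZE (16UL * 1024)

void demo_boot_with_page_size(unsigned long size)
{
	/* Only power-of-two sizes make sense as a page size. */
	assert(size != 0 && (size & (size - 1)) == 0);
	demo_page_size = size;
}

/* Returns 0 on success, or -ENODEV so the device is simply left unbound. */
int demo_probe(void)
{
	if (PAGE_SIZE != DEMO_REQUIRED_PAGE_SIZE)
		return -ENODEV;
	return 0;
}
```

Because the comparison happens at run time, the same module binary could be loaded under any page size and would only bind when the requirement is met.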
> >>>>>>>
> >>>>>>>>
> >>>>>>>> * How would we handle an invalid selection at boot?
> >>>>>>>
> >>>>>>> What do you mean by invalid here? The current policy validates that the
> >>>>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
> >>>>>>> passed on the command line, or the passed value is not supported by the HW, then
> >>>>>>> we default to the largest page size supported by the HW (so for Apple
> >>>>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
> >>>>>>> may be better to change that policy to use the smallest page size in this case;
> >>>>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
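The current policy described above can be sketched as a small pure function. This is an illustration only: the DEMO_HW_* flags stand in for whatever the hardware reports via mmfr0, and demo_select_page_size() is a hypothetical name, not the function used in the series.

```c
#include <assert.h>

/* Hypothetical bit flags for what the hardware reports as supported. */
#define DEMO_HW_4K  (1u << 0)
#define DEMO_HW_16K (1u << 1)
#define DEMO_HW_64K (1u << 2)

static unsigned int demo_flag_for(unsigned long size)
{
	switch (size) {
	case 4096:  return DEMO_HW_4K;
	case 16384: return DEMO_HW_16K;
	case 65536: return DEMO_HW_64K;
	default:    return 0;
	}
}

/* requested == 0 means nothing was passed on the command line. */
unsigned long demo_select_page_size(unsigned int hw_mask, unsigned long requested)
{
	/* At least one page size must be supported by the hardware. */
	assert(hw_mask & (DEMO_HW_4K | DEMO_HW_16K | DEMO_HW_64K));

	/* Honour the request only if the hardware supports it. */
	if (requested && (hw_mask & demo_flag_for(requested)))
		return requested;

	/* Fallback as described: the largest size the hardware supports. */
	if (hw_mask & DEMO_HW_64K)
		return 65536;
	if (hw_mask & DEMO_HW_16K)
		return 16384;
	return 4096;
}
```

Switching the fallback to the smallest supported size, as floated above, would only mean reversing the order of the final three checks.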
> >>>>>>>
> >>>>>>>> Can we program in a
> >>>>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
> >>>>>>>
> >>>>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
> >>>>>>> Silicon? The trouble is that we need to select the page size, very early in
> >>>>>>> boot, before start_kernel() is called, so we really only have generic arch code
> >>>>>>> and the command line with which to make the decision.
> >>>>>>
> >>>>>> Yes... I think a build-time CONFIG for default page size, which can be
> >>>>>> overridden by a karg makes sense... Even on platforms like Apple
> >>>>>> Silicon you may want to test very specific things in 4k by overriding
> >>>>>> with a karg.
> >>>>>
> >>>>> Ahh, yes, that would certainly work. I'll work it into the next version.
> >>>>>
> >>>>
> >>>> Could we maybe extend to have some kind of way to include a table of
> >>>> SoC IDs that certain modes are disabled (e.g. 64k on Apple Silicon)
> >>>
> >>> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
> >>> supported.
> >>>
> >>>> and preferred modes when no arg is set (16k for Apple Silicon)? That
> >>>
> >>> And it's not obvious that we should hard-code a page size preference to a SoC
> >>> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
> >>> to decide, not the SoC.
> >>>
> >>> I'm guessing your desire is to have a single kernel build that will boot 16k by
> >>> default on Apple Silicon and 4k by default on other systems, all without needing
> >>> to modify the command line? Personally I think it's cleaner to just require
> >>> setting the page size on the command line in these cases.
> >>>
> >>>> way it'd work something like this:
> >>>>
> >>>> 1. Table identification of 4/16/64 depending on identified SoC
> >>> So I'd prefer not to have this
> >>>
> >>>> 2. Unidentified ones follow build-time default
> >>>> 3. karg forces a mode regardless
> >>> But keep these 2.
> >>>
> >>
> > Since we are talking about Apple Silicon and page size, I would like to
> > add that on the Apple Silicon SoCs I am working on, the situation is like
> > this:
> >
> > Apple A7 (s5l8960x), A8 (T7000), A8X (T7001): CPU MMU support 4K and 64K
> > page sizes.
> >
> > Apple A9 (s8000/s8003), A9X (s8001), A10 (t8010), A10X (t8011), A11 (t8015):
> > CPU MMU Support 16K and 64K page sizes.
> >
> > However, all of them have 4K page DART IOMMUs.
> >
> >> I think it makes sense to have it, because it's not just Apple Silicon
> >> where such a preference/requirement may be necessary. Apple Silicon
> >> technically works at 4k, but is completely broken at 4k because Linux
> >> cannot do 16k IOMMU with 4k everything else, so being able to at least
> >> prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
> >> Hopper platform prefer 64k over other options (though I am unaware of
> >> a gross incompatibility that effectively requires it like Apple
> >> Silicon has).
> >>
> >> When we're trying to get to "single generic image that works
> >> everywhere", stuff like this matters and I would really like you to
> >> consider it from the lens of "we want things to work as automagic as
> >> they do on x86".
> > For me, in order to get to this level of automagic, there does need to be
> > a table of which SoC should use which page size.
>
> OK, but it's not clear to me that this table needs to be in the kernel. Could it
> not be something in user space (e.g. during installation) that configures the
> kernel command line?
>
This is not compatible with using things like ISOs with UEFI+ACPI
enabled desktop/server systems. We need to be able to safely,
automatically, and correctly boot up and support hardware. The only
place to do that early enough is in the kernel. But this can wait
until the core stuff is in.
> Regardless, the hard work here is getting the boot-time page size selection
> mechanism in place. Once that's there, follow up patches can add the desired
> policy. I'd rather leave it out for now to avoid anything slowing down the core
> work.
>
Sure, this can be done afterward.
--
真実はいつも一つ!/ Always, there's only one truth!
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 36/57] xen: Remove PAGE_SIZE compile-time constant assumption
2024-10-16 14:46 ` Ryan Roberts
@ 2024-10-23 1:23 ` Stefano Stabellini
2024-10-24 10:32 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Stefano Stabellini @ 2024-10-23 1:23 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Juergen Gross, Stefano Stabellini, linux-arm-kernel,
linux-kernel, linux-mm, xen-devel, julien
+Julien
On Wed, 16 Oct 2024, Ryan Roberts wrote:
> + Juergen Gross, Stefano Stabellini
>
> This was a rather tricky series to get the recipients correct for and my script
> did not realize that "supporter" was a pseudonym for "maintainer" so you were
> missed off the original post. Apologies!
>
> More context in cover letter:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
>
> On 14/10/2024 11:58, Ryan Roberts wrote:
> > To prepare for supporting boot-time page size selection, refactor code
> > to remove assumptions about PAGE_SIZE being compile-time constant. Code
> > intended to be equivalent when compile-time page size is active.
> >
> > Allocate enough "frame_list" static storage in the balloon driver for
> > the maximum supported page size. Although continue to use only the first
> > PAGE_SIZE of the buffer at run-time to maintain existing behaviour.
> >
> > Refactor xen_biovec_phys_mergeable() to convert ifdeffery to C if/else.
> > For compile-time page size, the compiler will choose one branch and
> > strip the dead one. For boot-time, it can be evaluated at run time.
> >
> > Refactor a BUILD_BUG_ON to evaluate the limit (when the minimum
> > supported page size is selected at boot-time).
> >
> > Reserve enough storage for max page size in "struct remap_data" and
> > "struct xenbus_map_node".
> >
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >
> > ***NOTE***
> > Any confused maintainers may want to read the cover note here for context:
> > https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> >
> > drivers/xen/balloon.c | 11 ++++++-----
> > drivers/xen/biomerge.c | 12 ++++++------
> > drivers/xen/privcmd.c | 2 +-
> > drivers/xen/xenbus/xenbus_client.c | 5 +++--
> > drivers/xen/xlate_mmu.c | 6 +++---
> > include/xen/page.h | 2 ++
> > 6 files changed, 21 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> > index 528395133b4f8..0ed5f6453af0e 100644
> > --- a/drivers/xen/balloon.c
> > +++ b/drivers/xen/balloon.c
> > @@ -131,7 +131,8 @@ struct balloon_stats balloon_stats;
> > EXPORT_SYMBOL_GPL(balloon_stats);
> >
> > /* We increase/decrease in batches which fit in a page */
> > -static xen_pfn_t frame_list[PAGE_SIZE / sizeof(xen_pfn_t)];
> > +static xen_pfn_t frame_list[PAGE_SIZE_MAX / sizeof(xen_pfn_t)];
> > +#define FRAME_LIST_NR_ENTRIES (PAGE_SIZE / sizeof(xen_pfn_t))
> >
> >
> > /* List of ballooned pages, threaded through the mem_map array. */
> > @@ -389,8 +390,8 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
> > unsigned long i;
> > struct page *page;
> >
> > - if (nr_pages > ARRAY_SIZE(frame_list))
> > - nr_pages = ARRAY_SIZE(frame_list);
> > + if (nr_pages > FRAME_LIST_NR_ENTRIES)
> > + nr_pages = FRAME_LIST_NR_ENTRIES;
> >
> > page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
> > for (i = 0; i < nr_pages; i++) {
> > @@ -434,8 +435,8 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> > int ret;
> > LIST_HEAD(pages);
> >
> > - if (nr_pages > ARRAY_SIZE(frame_list))
> > - nr_pages = ARRAY_SIZE(frame_list);
> > + if (nr_pages > FRAME_LIST_NR_ENTRIES)
> > + nr_pages = FRAME_LIST_NR_ENTRIES;
> >
> > for (i = 0; i < nr_pages; i++) {
> > page = alloc_page(gfp);
> > diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
> > index 05a286d24f148..28f0887e40026 100644
> > --- a/drivers/xen/biomerge.c
> > +++ b/drivers/xen/biomerge.c
> > @@ -8,16 +8,16 @@
> > bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
> > const struct page *page)
> > {
> > -#if XEN_PAGE_SIZE == PAGE_SIZE
> > - unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
> > - unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
> > + if (XEN_PAGE_SIZE == PAGE_SIZE) {
> > + unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
> > + unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
> > +
> > + return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
> > + }
> >
> > - return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
> > -#else
> > /*
> > * XXX: Add support for merging bio_vec when using different page
> > * size in Xen and Linux.
> > */
> > return false;
> > -#endif
> > }
> > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> > index 9563650dfbafc..847f7b806caf7 100644
> > --- a/drivers/xen/privcmd.c
> > +++ b/drivers/xen/privcmd.c
> > @@ -557,7 +557,7 @@ static long privcmd_ioctl_mmap_batch(
> > state.global_error = 0;
> > state.version = version;
> >
> > - BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
> > + BUILD_BUG_ON(((PAGE_SIZE_MIN / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE_MAX) != 0);
Is there any value in keeping this test? And if so, what should it look
like? I think we should turn it into a WARN_ON:
WARN_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
It doesn't make much sense to have a BUILD_BUG_ON on a value that can
change at boot, does it?
> > /* mmap_batch_fn guarantees ret == 0 */
> > BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
> > &pagelist, mmap_batch_fn, &state));
> > diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
> > index 51b3124b0d56c..99bde836c10c4 100644
> > --- a/drivers/xen/xenbus/xenbus_client.c
> > +++ b/drivers/xen/xenbus/xenbus_client.c
> > @@ -49,9 +49,10 @@
> >
> > #include "xenbus.h"
> >
> > -#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
> > +#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
> > +#define XENBUS_PAGES_MAX(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE_MIN))
> >
> > -#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES(XENBUS_MAX_RING_GRANTS))
> > +#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES_MAX(XENBUS_MAX_RING_GRANTS))
> >
> > struct xenbus_map_node {
> > struct list_head next;
> > diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
> > index f17c4c03db30c..a757c801a7542 100644
> > --- a/drivers/xen/xlate_mmu.c
> > +++ b/drivers/xen/xlate_mmu.c
> > @@ -74,9 +74,9 @@ struct remap_data {
> > int mapped;
> >
> > /* Hypercall parameters */
> > - int h_errs[XEN_PFN_PER_PAGE];
> > - xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
> > - xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
> > + int h_errs[XEN_PFN_PER_PAGE_MAX];
> > + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE_MAX];
> > + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE_MAX];
> >
> > int h_iter; /* Iterator */
> > };
> > diff --git a/include/xen/page.h b/include/xen/page.h
> > index 285677b42943a..86683a30038a3 100644
> > --- a/include/xen/page.h
> > +++ b/include/xen/page.h
> > @@ -21,6 +21,8 @@
> > ((page_to_pfn(page)) << (PAGE_SHIFT - XEN_PAGE_SHIFT))
> >
> > #define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
> > +#define XEN_PFN_PER_PAGE_MIN (PAGE_SIZE_MIN / XEN_PAGE_SIZE)
> > +#define XEN_PFN_PER_PAGE_MAX (PAGE_SIZE_MAX / XEN_PAGE_SIZE)
> >
> > #define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
> > #define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-17 12:32 ` Ryan Roberts
2024-10-18 12:56 ` Petr Tesarik
@ 2024-10-23 21:00 ` Thomas Tai
2024-10-24 10:48 ` Ryan Roberts
2024-11-11 12:14 ` Petr Tesarik
2024-12-05 17:20 ` Petr Tesarik
3 siblings, 1 reply; 196+ messages in thread
From: Thomas Tai @ 2024-10-23 21:00 UTC (permalink / raw)
To: Ryan Roberts, Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 10/17/2024 8:32 AM, Ryan Roberts wrote:
> On 17/10/2024 13:27, Petr Tesarik wrote:
>> On Mon, 14 Oct 2024 11:55:11 +0100
>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>> [...]
>>> The series is arranged as follows:
>>>
>>> - patch 1: Add macros required for converting non-arch code to support
>>> boot-time page size selection
>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>> non-arch code
>> I have just tried to recompile the openSUSE kernel with these patches
>> applied, and I'm running into this:
>>
>> CC arch/arm64/hyperv/hv_core.o
>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
>> u8 reserved2[PAGE_SIZE - 68];
>> ^~~~~~~~~
>>
>> It looks like one more place which needs a patch, right?
> As mentioned in the cover letter, so far I've only converted enough to get the
> defconfig *image* building (i.e. no modules). If you are compiling a different
> config or compiling the modules for defconfig, you will likely run into these
> types of issues.
It would be nice if you could provide the defconfig you are using; I also ran
into build issues when using the arch/arm64/configs/defconfig.
Thank you,
Thomas
>
> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> enough to send me.
>
> I understand that Suse might be able to help with wider performance testing - if
> that's the reason you are trying to compile, you could send me your config and
> I'll start working on fixing up other drivers?
>
> Thanks,
> Ryan
>
>> Petr T
>
* Re: [RFC PATCH v1 36/57] xen: Remove PAGE_SIZE compile-time constant assumption
2024-10-23 1:23 ` Stefano Stabellini
@ 2024-10-24 10:32 ` Ryan Roberts
2024-10-25 1:18 ` Stefano Stabellini
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-10-24 10:32 UTC (permalink / raw)
To: Stefano Stabellini
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, Juergen Gross, linux-arm-kernel, linux-kernel,
linux-mm, xen-devel, julien
On 23/10/2024 02:23, Stefano Stabellini wrote:
> +Julien
>
> On Wed, 16 Oct 2024, Ryan Roberts wrote:
>> + Juergen Gross, Stefano Stabellini
>>
>> This was a rather tricky series to get the recipients correct for and my script
>> did not realize that "supporter" was a synonym for "maintainer" so you were
>> missed off the original post. Apologies!
>>
>> More context in cover letter:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>>
>> On 14/10/2024 11:58, Ryan Roberts wrote:
>>> To prepare for supporting boot-time page size selection, refactor code
>>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>>> intended to be equivalent when compile-time page size is active.
>>>
>>> Allocate enough "frame_list" static storage in the balloon driver for
>>> the maximum supported page size. Although continue to use only the first
>>> PAGE_SIZE of the buffer at run-time to maintain existing behaviour.
>>>
>>> Refactor xen_biovec_phys_mergeable() to convert ifdeffery to c if/else.
>>> For compile-time page size, the compiler will choose one branch and
>>> strip the dead one. For boot-time, it can be evaluated at run time.
>>>
>>> Refactor a BUILD_BUG_ON to evaluate the limit (when the minimum
>>> supported page size is selected at boot-time).
>>>
>>> Reserve enough storage for max page size in "struct remap_data" and
>>> "struct xenbus_map_node".
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>
>>> ***NOTE***
>>> Any confused maintainers may want to read the cover note here for context:
>>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>>
>>> drivers/xen/balloon.c | 11 ++++++-----
>>> drivers/xen/biomerge.c | 12 ++++++------
>>> drivers/xen/privcmd.c | 2 +-
>>> drivers/xen/xenbus/xenbus_client.c | 5 +++--
>>> drivers/xen/xlate_mmu.c | 6 +++---
>>> include/xen/page.h | 2 ++
>>> 6 files changed, 21 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
>>> index 528395133b4f8..0ed5f6453af0e 100644
>>> --- a/drivers/xen/balloon.c
>>> +++ b/drivers/xen/balloon.c
>>> @@ -131,7 +131,8 @@ struct balloon_stats balloon_stats;
>>> EXPORT_SYMBOL_GPL(balloon_stats);
>>>
>>> /* We increase/decrease in batches which fit in a page */
>>> -static xen_pfn_t frame_list[PAGE_SIZE / sizeof(xen_pfn_t)];
>>> +static xen_pfn_t frame_list[PAGE_SIZE_MAX / sizeof(xen_pfn_t)];
>>> +#define FRAME_LIST_NR_ENTRIES (PAGE_SIZE / sizeof(xen_pfn_t))
>>>
>>>
>>> /* List of ballooned pages, threaded through the mem_map array. */
>>> @@ -389,8 +390,8 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
>>> unsigned long i;
>>> struct page *page;
>>>
>>> - if (nr_pages > ARRAY_SIZE(frame_list))
>>> - nr_pages = ARRAY_SIZE(frame_list);
>>> + if (nr_pages > FRAME_LIST_NR_ENTRIES)
>>> + nr_pages = FRAME_LIST_NR_ENTRIES;
>>>
>>> page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
>>> for (i = 0; i < nr_pages; i++) {
>>> @@ -434,8 +435,8 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
>>> int ret;
>>> LIST_HEAD(pages);
>>>
>>> - if (nr_pages > ARRAY_SIZE(frame_list))
>>> - nr_pages = ARRAY_SIZE(frame_list);
>>> + if (nr_pages > FRAME_LIST_NR_ENTRIES)
>>> + nr_pages = FRAME_LIST_NR_ENTRIES;
>>>
>>> for (i = 0; i < nr_pages; i++) {
>>> page = alloc_page(gfp);
>>> diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
>>> index 05a286d24f148..28f0887e40026 100644
>>> --- a/drivers/xen/biomerge.c
>>> +++ b/drivers/xen/biomerge.c
>>> @@ -8,16 +8,16 @@
>>> bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
>>> const struct page *page)
>>> {
>>> -#if XEN_PAGE_SIZE == PAGE_SIZE
>>> - unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
>>> - unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
>>> + if (XEN_PAGE_SIZE == PAGE_SIZE) {
>>> + unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
>>> + unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
>>> +
>>> + return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
>>> + }
>>>
>>> - return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
>>> -#else
>>> /*
>>> * XXX: Add support for merging bio_vec when using different page
>>> * size in Xen and Linux.
>>> */
>>> return false;
>>> -#endif
>>> }
>>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>>> index 9563650dfbafc..847f7b806caf7 100644
>>> --- a/drivers/xen/privcmd.c
>>> +++ b/drivers/xen/privcmd.c
>>> @@ -557,7 +557,7 @@ static long privcmd_ioctl_mmap_batch(
>>> state.global_error = 0;
>>> state.version = version;
>>>
>>> - BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
>>> + BUILD_BUG_ON(((PAGE_SIZE_MIN / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE_MAX) != 0);
>
> Is there any value in keeping this test? And if so, what should it look
> like? I think we should turn it into a WARN_ON:
>
> WARN_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
>
> It doesn't make much sense having a BUILD_BUG_ON on a variable that can
> change?
I believe that as long as we assume sizeof(xen_pfn_t), PAGE_SIZE and
XEN_PAGE_SIZE are all power-of-two sizes, then this single build-time test
should cover all possible boot-time PAGE_SIZEs.
Logic:
If PAGE_SIZE and XEN_PAGE_SIZE are power-of-two, then XEN_PFN_PER_PAGE must also
be power-of-two. XEN_PFN_PER_PAGE_MAX is just the worst case limit.
(PAGE_SIZE_MIN / sizeof(xen_pfn_t)) is the number of xen_pfn_t entries that fit
on the smallest page.
If you can get an integer multiple number of XEN_PFN_PER_PAGE_MAX on the
smallest page, then it remains an integer multiple as PAGE_SIZE gets bigger,
assuming it is restricted to power-of-two sizes.
Perhaps there is a flaw in my logic?
I'd prefer to keep BUILD_BUG_ON where possible to avoid the additional image
size bloat and runtime costs.
Thanks,
Ryan
>
>
>>> /* mmap_batch_fn guarantees ret == 0 */
>>> BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
>>> &pagelist, mmap_batch_fn, &state));
>>> diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
>>> index 51b3124b0d56c..99bde836c10c4 100644
>>> --- a/drivers/xen/xenbus/xenbus_client.c
>>> +++ b/drivers/xen/xenbus/xenbus_client.c
>>> @@ -49,9 +49,10 @@
>>>
>>> #include "xenbus.h"
>>>
>>> -#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
>>> +#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
>>> +#define XENBUS_PAGES_MAX(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE_MIN))
>>>
>>> -#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES(XENBUS_MAX_RING_GRANTS))
>>> +#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES_MAX(XENBUS_MAX_RING_GRANTS))
>>>
>>> struct xenbus_map_node {
>>> struct list_head next;
>>> diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
>>> index f17c4c03db30c..a757c801a7542 100644
>>> --- a/drivers/xen/xlate_mmu.c
>>> +++ b/drivers/xen/xlate_mmu.c
>>> @@ -74,9 +74,9 @@ struct remap_data {
>>> int mapped;
>>>
>>> /* Hypercall parameters */
>>> - int h_errs[XEN_PFN_PER_PAGE];
>>> - xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
>>> - xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
>>> + int h_errs[XEN_PFN_PER_PAGE_MAX];
>>> + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE_MAX];
>>> + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE_MAX];
>>>
>>> int h_iter; /* Iterator */
>>> };
>>> diff --git a/include/xen/page.h b/include/xen/page.h
>>> index 285677b42943a..86683a30038a3 100644
>>> --- a/include/xen/page.h
>>> +++ b/include/xen/page.h
>>> @@ -21,6 +21,8 @@
>>> ((page_to_pfn(page)) << (PAGE_SHIFT - XEN_PAGE_SHIFT))
>>>
>>> #define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
>>> +#define XEN_PFN_PER_PAGE_MIN (PAGE_SIZE_MIN / XEN_PAGE_SIZE)
>>> +#define XEN_PFN_PER_PAGE_MAX (PAGE_SIZE_MAX / XEN_PAGE_SIZE)
>>>
>>> #define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
>>> #define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
>>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-22 17:30 ` Neal Gompa
@ 2024-10-24 10:34 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-24 10:34 UTC (permalink / raw)
To: Neal Gompa
Cc: Nick Chan, Eric Curtin, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Will Deacon, Hector Martin,
linux-arm-kernel, linux-kernel, linux-mm, asahi
On 22/10/2024 18:30, Neal Gompa wrote:
[...]
>>>>>>>>>>
>>>>>>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
>>>>>>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>>>>>>>>>>
>>>>>>>>>> That said, I have a couple of questions:
>>>>>>>>>>
>>>>>>>>>> * Going forward, how would we handle drivers/modules that require a particular
>>>>>>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
>>>>>>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
>>>>>>>>>> other page sizes.
>>>>>>>>>
>>>>>>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
>>>>>>>>> unsupported page size is in use. Do you see any issue with that?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * How would we handle an invalid selection at boot?
>>>>>>>>>
>>>>>>>>> What do you mean by invalid here? The current policy validates that the
>>>>>>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
>>>>>>>>> passed on the command line, or the passed value is not supported by the HW, then
>>>>>>>>> we default to the largest page size supported by the HW (so for Apple
>>>>>>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
>>>>>>>>> may be better to change that policy to use the smallest page size in this case;
>>>>>>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
>>>>>>>>>
>>>>>>>>>> Can we program in a
>>>>>>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
>>>>>>>>>
>>>>>>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
>>>>>>>>> Silicon? The trouble is that we need to select the page size, very early in
>>>>>>>>> boot, before start_kernel() is called, so we really only have generic arch code
>>>>>>>>> and the command line with which to make the decision.
>>>>>>>>
>>>>>>>> Yes... I think a build-time CONFIG for default page size, which can be
>>>>>>>> overridden by a karg makes sense... Even on platforms like Apple
>>>>>>>> Silicon you may want to test very specific things in 4k by overriding
>>>>>>>> with a karg.
>>>>>>>
>>>>>>> Ahh, yes, that would certainly work. I'll work it into the next version.
>>>>>>>
>>>>>>
>>>>>> Could we maybe extend to have some kind of way to include a table of
>>>>>> SoC IDs that certain modes are disabled (e.g. 64k on Apple Silicon)
>>>>>
>>>>> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
>>>>> supported.
>>>>>
>>>>>> and preferred modes when no arg is set (16k for Apple Silicon)? That
>>>>>
>>>>> And it's not obvious that we should hard-code a page size preference to a SoC
>>>>> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
>>>>> to decide, not the SoC.
>>>>>
>>>>> I'm guessing your desire is to have a single kernel build that will boot 16k by
>>>>> default on Apple Silicon and 4k by default on other systems, all without needing
>>>>> to modify the command line? Personally I think it's cleaner to just require
>>>>> setting the page size on the command line in these cases.
>>>>>
>>>>>> way it'd work something like this:
>>>>>>
>>>>>> 1. Table identification of 4/16/64 depending on identified SoC
>>>>> So I'd prefer not to have this
>>>>>
>>>>>> 2. Unidentified ones follow build-time default
>>>>>> 3. karg forces a mode regardless
>>>>> But keep these 2.
>>>>>
>>>>
>>> Since we are talking about Apple Silicon and page size, I would like to
>>> add that on the Apple Silicon SoCs I am working on, the situation is like
>>> this:
>>>
>>> Apple A7 (s5l8960x), A8 (T7000), A8X (T7001): CPU MMU support 4K and 64K
>>> page sizes.
>>>
>>> Apple A9 (s8000/s8003), A9X (s8001), A10 (t8010), A10X (t8011), A11 (t8015):
>>> CPU MMU Support 16K and 64K page sizes.
>>>
>>> However, all of them have 4K page DART IOMMUs.
>>>
>>>> I think it makes sense to have it, because it's not just Apple Silicon
>>>> where such a preference/requirement may be necessary. Apple Silicon
>>>> technically works at 4k, but is completely broken at 4k because Linux
>>>> cannot do 16k IOMMU with 4k everything else, so being able to at least
>>>> prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
>>>> Hopper platform prefer 64k over other options (though I am unaware of
>>>> a gross incompatibility that effectively requires it like Apple
>>>> Silicon has).
>>>>
>>>> When we're trying to get to "single generic image that works
>>>> everywhere", stuff like this matters and I would really like you to
>>>> consider it from the lens of "we want things to work as automagic as
>>>> they do on x86".
>>> For me, in order to get to this level of automagic, there does need to be
>>> a table of which SoC should use which page size.
>>
>> OK, but it's not clear to me that this table needs to be in the kernel. Could it
>> not be something in user space (e.g. during installation) that configures the
>> kernel command line?
>>
>
> This is not compatible with using things like ISOs with UEFI+ACPI
> enabled desktop/server systems. We need to be able to safely,
> automatically, and correctly boot up and support hardware. The only
> place to do that early enough is in the kernel. But this can wait
> until the core stuff is in.
OK got it.
>
>> Regardless, the hard work here is getting the boot-time page size selection
>> mechanism in place. Once that's there, follow up patches can add the desired
>> policy. I'd rather leave it out for now to avoid anything slowing down the core
>> work.
>>
>
> Sure, this can be done afterward.
Thanks! I understand the problem a bit better now. I'm sure we can find a
solution once we have landed the core mechanism.
Thanks,
Ryan
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-23 21:00 ` Thomas Tai
@ 2024-10-24 10:48 ` Ryan Roberts
2024-10-24 11:45 ` Petr Tesarik
2024-10-30 22:11 ` Sumit Gupta
0 siblings, 2 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-24 10:48 UTC (permalink / raw)
To: Thomas Tai, Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 23/10/2024 22:00, Thomas Tai wrote:
>
> On 10/17/2024 8:32 AM, Ryan Roberts wrote:
>> On 17/10/2024 13:27, Petr Tesarik wrote:
>>> On Mon, 14 Oct 2024 11:55:11 +0100
>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>>> [...]
>>>> The series is arranged as follows:
>>>>
>>>> - patch 1: Add macros required for converting non-arch code to support
>>>> boot-time page size selection
>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>>> non-arch code
>>> I have just tried to recompile the openSUSE kernel with these patches
>>> applied, and I'm running into this:
>>>
>>> CC arch/arm64/hyperv/hv_core.o
>>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
>>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file
>>> scope
>>> u8 reserved2[PAGE_SIZE - 68];
>>> ^~~~~~~~~
>>>
>>> It looks like one more place which needs a patch, right?
>> As mentioned in the cover letter, so far I've only converted enough to get the
>> defconfig *image* building (i.e. no modules). If you are compiling a different
>> config or compiling the modules for defconfig, you will likely run into these
>> types of issues.
>
> It would be nice if you could provide the defconfig you are using; I also ran
> into build issues when using the arch/arm64/configs/defconfig.
git clean -xdfq
make defconfig
# Set CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
./scripts/config --disable CONFIG_ARM64_4K_PAGES
./scripts/config --disable CONFIG_ARM64_16K_PAGES
./scripts/config --disable CONFIG_ARM64_64K_PAGES
./scripts/config --disable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
# Set ARM64_VA_BITS_48
./scripts/config --disable ARM64_VA_BITS_36
./scripts/config --disable ARM64_VA_BITS_39
./scripts/config --disable ARM64_VA_BITS_42
./scripts/config --disable ARM64_VA_BITS_47
./scripts/config --disable ARM64_VA_BITS_48
./scripts/config --disable ARM64_VA_BITS_52
./scripts/config --enable ARM64_VA_BITS_48
# Optional: filesystems known to compile with boot-time page size
./scripts/config --enable CONFIG_SQUASHFS_LZ4
./scripts/config --enable CONFIG_SQUASHFS_LZO
./scripts/config --enable CONFIG_SQUASHFS_XZ
./scripts/config --enable CONFIG_SQUASHFS_ZSTD
./scripts/config --enable CONFIG_XFS_FS
# Optional: trace stuff known to compile with boot-time page size
./scripts/config --enable CONFIG_FTRACE
./scripts/config --enable CONFIG_FUNCTION_TRACER
./scripts/config --enable CONFIG_KPROBES
./scripts/config --enable CONFIG_HIST_TRIGGERS
./scripts/config --enable CONFIG_FTRACE_SYSCALLS
# Optional: misc mm stuff known to compile with boot-time page size
./scripts/config --enable CONFIG_PTDUMP_DEBUGFS
./scripts/config --enable CONFIG_READ_ONLY_THP_FOR_FS
./scripts/config --enable CONFIG_USERFAULTFD
# Optional: mm debug stuff known to compile with boot-time page size
./scripts/config --enable CONFIG_DEBUG_VM
./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
./scripts/config --enable CONFIG_DEBUG_VM_RB
./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK_ENFORCED
make olddefconfig
make -s -j`nproc` Image
So I'm explicitly only building and booting the kernel image, not the modules.
The kernel image contains all the drivers needed to get a VM up and running
under QEMU/KVM.
Thanks,
Ryan
>
> Thank you,
> Thomas
>
>>
>> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
>> enough to send me.
>>
>> I understand that Suse might be able to help with wider performance testing - if
>> that's the reason you are trying to compile, you could send me your config and
>> I'll start working on fixing up other drivers?
>>
>> Thanks,
>> Ryan
>>
>>> Petr T
>>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-24 10:48 ` Ryan Roberts
@ 2024-10-24 11:45 ` Petr Tesarik
2024-10-24 12:10 ` Ryan Roberts
2024-10-30 22:11 ` Sumit Gupta
1 sibling, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-10-24 11:45 UTC (permalink / raw)
To: Ryan Roberts
Cc: Thomas Tai, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-kernel,
linux-mm
On Thu, 24 Oct 2024 11:48:55 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 23/10/2024 22:00, Thomas Tai wrote:
> >
> > On 10/17/2024 8:32 AM, Ryan Roberts wrote:
> >> On 17/10/2024 13:27, Petr Tesarik wrote:
> >>> On Mon, 14 Oct 2024 11:55:11 +0100
> >>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>>> [...]
> >>>> The series is arranged as follows:
> >>>>
> >>>> - patch 1: Add macros required for converting non-arch code to support
> >>>> boot-time page size selection
> >>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> >>>> non-arch code
> >>> I have just tried to recompile the openSUSE kernel with these patches
> >>> applied, and I'm running into this:
> >>>
> >>> CC arch/arm64/hyperv/hv_core.o
> >>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> >>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file
> >>> scope
> >>> u8 reserved2[PAGE_SIZE - 68];
> >>> ^~~~~~~~~
> >>>
> >>> It looks like one more place which needs a patch, right?
> >> As mentioned in the cover letter, so far I've only converted enough to get the
> >> defconfig *image* building (i.e. no modules). If you are compiling a different
> >> config or compiling the modules for defconfig, you will likely run into these
> >> types of issues.
> >
> > It would be nice if you could provide the defconfig you are using; I also ran
> > into build issues when using the arch/arm64/configs/defconfig.
>
> git clean -xdfq
> make defconfig
>
> # Set CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> ./scripts/config --disable CONFIG_ARM64_4K_PAGES
> ./scripts/config --disable CONFIG_ARM64_16K_PAGES
> ./scripts/config --disable CONFIG_ARM64_64K_PAGES
> ./scripts/config --disable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> ./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>
> # Set ARM64_VA_BITS_48
> ./scripts/config --disable ARM64_VA_BITS_36
> ./scripts/config --disable ARM64_VA_BITS_39
> ./scripts/config --disable ARM64_VA_BITS_42
> ./scripts/config --disable ARM64_VA_BITS_47
> ./scripts/config --disable ARM64_VA_BITS_48
> ./scripts/config --disable ARM64_VA_BITS_52
> ./scripts/config --enable ARM64_VA_BITS_48
>
> # Optional: filesystems known to compile with boot-time page size
> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
> ./scripts/config --enable CONFIG_SQUASHFS_LZO
> ./scripts/config --enable CONFIG_SQUASHFS_XZ
> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
> ./scripts/config --enable CONFIG_XFS_FS
>
> # Optional: trace stuff known to compile with boot-time page size
> ./scripts/config --enable CONFIG_FTRACE
> ./scripts/config --enable CONFIG_FUNCTION_TRACER
> ./scripts/config --enable CONFIG_KPROBES
> ./scripts/config --enable CONFIG_HIST_TRIGGERS
> ./scripts/config --enable CONFIG_FTRACE_SYSCALLS
>
> # Optional: misc mm stuff known to compile with boot-time page size
> ./scripts/config --enable CONFIG_PTDUMP_DEBUGFS
> ./scripts/config --enable CONFIG_READ_ONLY_THP_FOR_FS
> ./scripts/config --enable CONFIG_USERFAULTFD
>
> # Optional: mm debug stuff known to compile with boot-time page size
> ./scripts/config --enable CONFIG_DEBUG_VM
> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
> ./scripts/config --enable CONFIG_DEBUG_VM_RB
> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK_ENFORCED
>
> make olddefconfig
> make -s -j`nproc` Image
>
> So I'm explicitly only building and booting the kernel image, not the modules.
> The kernel image contains all the drivers needed to get a VM up and running
> under QEMU/KVM.
FWIW with the attached patch I was also able to boot the kernel on
Ampere Altra bare metal and using modules.
Petr T
diff --git a/arch/arm64/mm/pgtable-geometry.c b/arch/arm64/mm/pgtable-geometry.c
index ba50637f1e9d..4eb074b99654 100644
--- a/arch/arm64/mm/pgtable-geometry.c
+++ b/arch/arm64/mm/pgtable-geometry.c
@@ -15,8 +15,14 @@
*/
int ptg_page_shift __read_mostly;
+EXPORT_SYMBOL_GPL(ptg_page_shift);
+
int ptg_pmd_shift __read_mostly;
+EXPORT_SYMBOL_GPL(ptg_pmd_shift);
+
int ptg_pud_shift __read_mostly;
+EXPORT_SYMBOL_GPL(ptg_pud_shift);
+
int ptg_p4d_shift __read_mostly;
int ptg_pgdir_shift __read_mostly;
int ptg_cont_pte_shift __read_mostly;
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-24 11:45 ` Petr Tesarik
@ 2024-10-24 12:10 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-24 12:10 UTC (permalink / raw)
To: Petr Tesarik
Cc: Thomas Tai, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-kernel,
linux-mm
On 24/10/2024 12:45, Petr Tesarik wrote:
> On Thu, 24 Oct 2024 11:48:55 +0100
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> On 23/10/2024 22:00, Thomas Tai wrote:
>>>
>>> On 10/17/2024 8:32 AM, Ryan Roberts wrote:
>>>> On 17/10/2024 13:27, Petr Tesarik wrote:
>>>>> On Mon, 14 Oct 2024 11:55:11 +0100
>>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>>> [...]
>>>>>> The series is arranged as follows:
>>>>>>
>>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>>> boot-time page size selection
>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>>>>> non-arch code
>>>>> I have just tried to recompile the openSUSE kernel with these patches
>>>>> applied, and I'm running into this:
>>>>>
>>>>> CC arch/arm64/hyperv/hv_core.o
>>>>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
>>>>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file
>>>>> scope
>>>>> u8 reserved2[PAGE_SIZE - 68];
>>>>> ^~~~~~~~~
>>>>>
>>>>> It looks like one more place which needs a patch, right?
>>>> As mentioned in the cover letter, so far I've only converted enough to get the
>>>> defconfig *image* building (i.e. no modules). If you are compiling a different
>>>> config or compiling the modules for defconfig, you will likely run into these
>>>> types of issues.
>>>
>>> It would be nice if you could provide the defconfig you are using; I also ran
>>> into build issues when using the arch/arm64/configs/defconfig.
>>
>> git clean -xdfq
>> make defconfig
>>
>> # Set CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> ./scripts/config --disable CONFIG_ARM64_4K_PAGES
>> ./scripts/config --disable CONFIG_ARM64_16K_PAGES
>> ./scripts/config --disable CONFIG_ARM64_64K_PAGES
>> ./scripts/config --disable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>> ./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>>
>> # Set ARM64_VA_BITS_48
>> ./scripts/config --disable ARM64_VA_BITS_36
>> ./scripts/config --disable ARM64_VA_BITS_39
>> ./scripts/config --disable ARM64_VA_BITS_42
>> ./scripts/config --disable ARM64_VA_BITS_47
>> ./scripts/config --disable ARM64_VA_BITS_48
>> ./scripts/config --disable ARM64_VA_BITS_52
>> ./scripts/config --enable ARM64_VA_BITS_48
>>
>> # Optional: filesystems known to compile with boot-time page size
>> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
>> ./scripts/config --enable CONFIG_SQUASHFS_LZO
>> ./scripts/config --enable CONFIG_SQUASHFS_XZ
>> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
>> ./scripts/config --enable CONFIG_XFS_FS
>>
>> # Optional: trace stuff known to compile with boot-time page size
>> ./scripts/config --enable CONFIG_FTRACE
>> ./scripts/config --enable CONFIG_FUNCTION_TRACER
>> ./scripts/config --enable CONFIG_KPROBES
>> ./scripts/config --enable CONFIG_HIST_TRIGGERS
>> ./scripts/config --enable CONFIG_FTRACE_SYSCALLS
>>
>> # Optional: misc mm stuff known to compile with boot-time page size
>> ./scripts/config --enable CONFIG_PTDUMP_DEBUGFS
>> ./scripts/config --enable CONFIG_READ_ONLY_THP_FOR_FS
>> ./scripts/config --enable CONFIG_USERFAULTFD
>>
>> # Optional: mm debug stuff known to compile with boot-time page size
>> ./scripts/config --enable CONFIG_DEBUG_VM
>> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
>> ./scripts/config --enable CONFIG_DEBUG_VM_RB
>> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
>> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
>> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK_ENFORCED
>>
>> make olddefconfig
>> make -s -j`nproc` Image
>>
>> So I'm explicitly only building and booting the kernel image, not the modules.
>> The kernel image contains all the drivers needed to get a VM up and running
>> under QEMU/KVM.
>
> FWIW with the attached patch I was also able to boot the kernel on
> Ampere Altra bare metal and using modules.
Nice!
Thanks for the below. That was already reported and I have a fix in my branch at
[1]. That also includes the btrfs patch you sent and the hyper-v patches, as
well as other fixups from review.
[1]
https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/boot-time-page-size-v2-wip
Thanks,
Ryan
>
> Petr T
>
> diff --git a/arch/arm64/mm/pgtable-geometry.c b/arch/arm64/mm/pgtable-geometry.c
> index ba50637f1e9d..4eb074b99654 100644
> --- a/arch/arm64/mm/pgtable-geometry.c
> +++ b/arch/arm64/mm/pgtable-geometry.c
> @@ -15,8 +15,14 @@
> */
>
> int ptg_page_shift __read_mostly;
> +EXPORT_SYMBOL_GPL(ptg_page_shift);
> +
> int ptg_pmd_shift __read_mostly;
> +EXPORT_SYMBOL_GPL(ptg_pmd_shift);
> +
> int ptg_pud_shift __read_mostly;
> +EXPORT_SYMBOL_GPL(ptg_pud_shift);
> +
> int ptg_p4d_shift __read_mostly;
> int ptg_pgdir_shift __read_mostly;
> int ptg_cont_pte_shift __read_mostly;
* Re: [RFC PATCH v1 36/57] xen: Remove PAGE_SIZE compile-time constant assumption
2024-10-24 10:32 ` Ryan Roberts
@ 2024-10-25 1:18 ` Stefano Stabellini
0 siblings, 0 replies; 196+ messages in thread
From: Stefano Stabellini @ 2024-10-25 1:18 UTC (permalink / raw)
To: Ryan Roberts
Cc: Stefano Stabellini, Andrew Morton, Anshuman Khandual,
Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Miroslav Benes, Will Deacon, Juergen Gross,
linux-arm-kernel, linux-kernel, linux-mm, xen-devel, julien
On Thu, 24 Oct 2024, Ryan Roberts wrote:
> On 23/10/2024 02:23, Stefano Stabellini wrote:
> > +Julien
> >
> > On Wed, 16 Oct 2024, Ryan Roberts wrote:
> >> + Juergen Gross, Stefano Stabellini
> >>
> >> This was a rather tricky series to get the recipients correct for and my script
> >> did not realize that "supporter" was a pseudonym for "maintainer" so you were
> >> missed off the original post. Apologies!
> >>
> >> More context in cover letter:
> >> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> >>
> >>
> >> On 14/10/2024 11:58, Ryan Roberts wrote:
> >>> To prepare for supporting boot-time page size selection, refactor code
> >>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> >>> intended to be equivalent when compile-time page size is active.
> >>>
> >>> Allocate enough "frame_list" static storage in the balloon driver for
> >>> the maximum supported page size, but continue to use only the first
> >>> PAGE_SIZE bytes of the buffer at run-time to maintain existing behaviour.
> >>>
> >>> Refactor xen_biovec_phys_mergeable() to convert ifdeffery to c if/else.
> >>> For compile-time page size, the compiler will choose one branch and
> >>> strip the dead one. For boot-time, it can be evaluated at run time.
> >>>
> >>> Refactor a BUILD_BUG_ON to evaluate the limit (when the minimum
> >>> supported page size is selected at boot-time).
> >>>
> >>> Reserve enough storage for max page size in "struct remap_data" and
> >>> "struct xenbus_map_node".
> >>>
> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>> ---
> >>>
> >>> ***NOTE***
> >>> Any confused maintainers may want to read the cover note here for context:
> >>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
> >>>
> >>> drivers/xen/balloon.c | 11 ++++++-----
> >>> drivers/xen/biomerge.c | 12 ++++++------
> >>> drivers/xen/privcmd.c | 2 +-
> >>> drivers/xen/xenbus/xenbus_client.c | 5 +++--
> >>> drivers/xen/xlate_mmu.c | 6 +++---
> >>> include/xen/page.h | 2 ++
> >>> 6 files changed, 21 insertions(+), 17 deletions(-)
> >>>
> >>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> >>> index 528395133b4f8..0ed5f6453af0e 100644
> >>> --- a/drivers/xen/balloon.c
> >>> +++ b/drivers/xen/balloon.c
> >>> @@ -131,7 +131,8 @@ struct balloon_stats balloon_stats;
> >>> EXPORT_SYMBOL_GPL(balloon_stats);
> >>>
> >>> /* We increase/decrease in batches which fit in a page */
> >>> -static xen_pfn_t frame_list[PAGE_SIZE / sizeof(xen_pfn_t)];
> >>> +static xen_pfn_t frame_list[PAGE_SIZE_MAX / sizeof(xen_pfn_t)];
> >>> +#define FRAME_LIST_NR_ENTRIES (PAGE_SIZE / sizeof(xen_pfn_t))
> >>>
> >>>
> >>> /* List of ballooned pages, threaded through the mem_map array. */
> >>> @@ -389,8 +390,8 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
> >>> unsigned long i;
> >>> struct page *page;
> >>>
> >>> - if (nr_pages > ARRAY_SIZE(frame_list))
> >>> - nr_pages = ARRAY_SIZE(frame_list);
> >>> + if (nr_pages > FRAME_LIST_NR_ENTRIES)
> >>> + nr_pages = FRAME_LIST_NR_ENTRIES;
> >>>
> >>> page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
> >>> for (i = 0; i < nr_pages; i++) {
> >>> @@ -434,8 +435,8 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> >>> int ret;
> >>> LIST_HEAD(pages);
> >>>
> >>> - if (nr_pages > ARRAY_SIZE(frame_list))
> >>> - nr_pages = ARRAY_SIZE(frame_list);
> >>> + if (nr_pages > FRAME_LIST_NR_ENTRIES)
> >>> + nr_pages = FRAME_LIST_NR_ENTRIES;
> >>>
> >>> for (i = 0; i < nr_pages; i++) {
> >>> page = alloc_page(gfp);
> >>> diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
> >>> index 05a286d24f148..28f0887e40026 100644
> >>> --- a/drivers/xen/biomerge.c
> >>> +++ b/drivers/xen/biomerge.c
> >>> @@ -8,16 +8,16 @@
> >>> bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
> >>> const struct page *page)
> >>> {
> >>> -#if XEN_PAGE_SIZE == PAGE_SIZE
> >>> - unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
> >>> - unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
> >>> + if (XEN_PAGE_SIZE == PAGE_SIZE) {
> >>> + unsigned long bfn1 = pfn_to_bfn(page_to_pfn(vec1->bv_page));
> >>> + unsigned long bfn2 = pfn_to_bfn(page_to_pfn(page));
> >>> +
> >>> + return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
> >>> + }
> >>>
> >>> - return bfn1 + PFN_DOWN(vec1->bv_offset + vec1->bv_len) == bfn2;
> >>> -#else
> >>> /*
> >>> * XXX: Add support for merging bio_vec when using different page
> >>> * size in Xen and Linux.
> >>> */
> >>> return false;
> >>> -#endif
> >>> }
> >>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> >>> index 9563650dfbafc..847f7b806caf7 100644
> >>> --- a/drivers/xen/privcmd.c
> >>> +++ b/drivers/xen/privcmd.c
> >>> @@ -557,7 +557,7 @@ static long privcmd_ioctl_mmap_batch(
> >>> state.global_error = 0;
> >>> state.version = version;
> >>>
> >>> - BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
> >>> + BUILD_BUG_ON(((PAGE_SIZE_MIN / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE_MAX) != 0);
> >
> > Is there any value in keeping this test? And if so, what should it look
> > like? I think we should turn it into a WARN_ON:
> >
> > WARN_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
> >
> > It doesn't make much sense having a BUILD_BUG_ON on a variable that can
> > change?
>
> I believe that as long as we assume sizeof(xen_pfn_t), PAGE_SIZE and
> XEN_PAGE_SIZE are all power-of-two sizes, then this single build-time test
> should cover all possible boot-time PAGE_SIZEs.
>
> Logic:
>
> If PAGE_SIZE and XEN_PAGE_SIZE are power-of-two, then XEN_PFN_PER_PAGE must also
> be power-of-two. XEN_PFN_PER_PAGE_MAX is just the worst case limit.
>
> (PAGE_SIZE_MIN / sizeof(xen_pfn_t)) is the number of xen_pfn_t that fit on the
> smallest page.
>
> If you can get an integer multiple number of XEN_PFN_PER_PAGE_MAX on the
> smallest page, then it remains an integer multiple as PAGE_SIZE gets bigger,
> assuming it is restricted to power-of-two sizes.
>
> Perhaps there is a flaw in my logic?
>
> I'd prefer to keep BUILD_BUG_ON where possible to avoid the additional image
> size bloat and runtime costs.
You are right. It would be nice to add an in-code comment to explain
this.
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
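
The divisibility argument above can be checked with a small user-space
sketch. This is not kernel code: every DEMO_ name is hypothetical, and the
values (4K to 64K page range, 4K Xen page, 8-byte xen_pfn_t) are assumptions
chosen to resemble arm64. It uses C11 static_assert in place of
BUILD_BUG_ON and then walks every power-of-two page size to confirm that the
per-page-size condition holds whenever the single build-time test passes.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only (hypothetical DEMO_ names). */
#define DEMO_PAGE_SIZE_MIN	4096UL	/* smallest selectable page size */
#define DEMO_PAGE_SIZE_MAX	65536UL	/* largest selectable page size */
#define DEMO_XEN_PAGE_SIZE	4096UL	/* assumed Xen page size */
#define DEMO_PFN_SIZE		8UL	/* assumed sizeof(xen_pfn_t) */

#define DEMO_XEN_PFN_PER_PAGE_MAX (DEMO_PAGE_SIZE_MAX / DEMO_XEN_PAGE_SIZE)

/* The single build-time check: worst-case divisor against the smallest page. */
static_assert((DEMO_PAGE_SIZE_MIN / DEMO_PFN_SIZE) % DEMO_XEN_PFN_PER_PAGE_MAX == 0,
	      "smallest page must hold a whole multiple of XEN_PFN_PER_PAGE_MAX");

/* The condition the original check expressed, for one given page size. */
static bool demo_runtime_check_ok(unsigned long page_size)
{
	unsigned long pfns_in_page = page_size / DEMO_PFN_SIZE;
	unsigned long xen_pfn_per_page = page_size / DEMO_XEN_PAGE_SIZE;

	return pfns_in_page % xen_pfn_per_page == 0;
}

/* Walk every power-of-two page size in range; true only if all of them pass. */
static bool demo_all_page_sizes_ok(void)
{
	unsigned long sz;

	for (sz = DEMO_PAGE_SIZE_MIN; sz <= DEMO_PAGE_SIZE_MAX; sz <<= 1)
		if (!demo_runtime_check_ok(sz))
			return false;
	return true;
}
```

Calling demo_all_page_sizes_ok() from a test harness should return true for
the assumed values, which matches the reasoning that one check against the
limits covers every selectable page size.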
> >>> /* mmap_batch_fn guarantees ret == 0 */
> >>> BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
> >>> &pagelist, mmap_batch_fn, &state));
> >>> diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
> >>> index 51b3124b0d56c..99bde836c10c4 100644
> >>> --- a/drivers/xen/xenbus/xenbus_client.c
> >>> +++ b/drivers/xen/xenbus/xenbus_client.c
> >>> @@ -49,9 +49,10 @@
> >>>
> >>> #include "xenbus.h"
> >>>
> >>> -#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
> >>> +#define XENBUS_PAGES(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE))
> >>> +#define XENBUS_PAGES_MAX(_grants) (DIV_ROUND_UP(_grants, XEN_PFN_PER_PAGE_MIN))
> >>>
> >>> -#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES(XENBUS_MAX_RING_GRANTS))
> >>> +#define XENBUS_MAX_RING_PAGES (XENBUS_PAGES_MAX(XENBUS_MAX_RING_GRANTS))
> >>>
> >>> struct xenbus_map_node {
> >>> struct list_head next;
> >>> diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
> >>> index f17c4c03db30c..a757c801a7542 100644
> >>> --- a/drivers/xen/xlate_mmu.c
> >>> +++ b/drivers/xen/xlate_mmu.c
> >>> @@ -74,9 +74,9 @@ struct remap_data {
> >>> int mapped;
> >>>
> >>> /* Hypercall parameters */
> >>> - int h_errs[XEN_PFN_PER_PAGE];
> >>> - xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
> >>> - xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
> >>> + int h_errs[XEN_PFN_PER_PAGE_MAX];
> >>> + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE_MAX];
> >>> + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE_MAX];
> >>>
> >>> int h_iter; /* Iterator */
> >>> };
> >>> diff --git a/include/xen/page.h b/include/xen/page.h
> >>> index 285677b42943a..86683a30038a3 100644
> >>> --- a/include/xen/page.h
> >>> +++ b/include/xen/page.h
> >>> @@ -21,6 +21,8 @@
> >>> ((page_to_pfn(page)) << (PAGE_SHIFT - XEN_PAGE_SHIFT))
> >>>
> >>> #define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
> >>> +#define XEN_PFN_PER_PAGE_MIN (PAGE_SIZE_MIN / XEN_PAGE_SIZE)
> >>> +#define XEN_PFN_PER_PAGE_MAX (PAGE_SIZE_MAX / XEN_PAGE_SIZE)
> >>>
> >>> #define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
> >>> #define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
> >>
>
* Re: [RFC PATCH v1 20/57] crypto: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 20/57] crypto: " Ryan Roberts
@ 2024-10-26 6:54 ` Herbert Xu
0 siblings, 0 replies; 196+ messages in thread
From: Herbert Xu @ 2024-10-26 6:54 UTC (permalink / raw)
To: Ryan Roberts
Cc: David S. Miller, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel, linux-crypto,
linux-kernel, linux-mm
On Mon, Oct 14, 2024 at 11:58:27AM +0100, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Updated BUILD_BUG_ON() to test against limit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> crypto/lskcipher.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
` (57 preceding siblings ...)
2024-10-16 14:36 ` Ryan Roberts
@ 2024-10-30 8:45 ` Ryan Roberts
58 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-10-30 8:45 UTC (permalink / raw)
To: David S. Miller, James E.J. Bottomley, Andreas Larsson,
Andrew Morton, Anshuman Khandual, Anton Ivanov, Ard Biesheuvel,
Arnd Bergmann, Borislav Petkov, Catalin Marinas, Chris Zankel,
Dave Hansen, David Hildenbrand, Dinh Nguyen, Geert Uytterhoeven,
Greg Marsden, Helge Deller, Huacai Chen, Ingo Molnar, Ivan Ivanov,
Johannes Berg, John Paul Adrian Glaubitz, Jonas Bonn,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Max Filippov, Miroslav Benes, Rich Felker, Richard Weinberger,
Stafford Horne, Stefan Kristiansson, Thomas Bogendoerfer,
Thomas Gleixner, Will Deacon, Yoshinori Sato, x86
Cc: linux-alpha, linux-arch, linux-arm-kernel, linux-csky,
linux-hexagon, linux-kernel, linux-m68k, linux-mips, linux-mm,
linux-openrisc, linux-parisc, linux-riscv, linux-s390, linux-sh,
linux-snps-arc, linux-um, linuxppc-dev, loongarch, sparclinux
Hi all (especially mm people!),
On 14/10/2024 11:58, Ryan Roberts wrote:
> arm64 can support multiple base page sizes. Instead of selecting a page
> size at compile time, as is done today, we will make it possible to
> select the desired page size on the command line.
>
> In this case PAGE_SHIFT and its derivatives, PAGE_SIZE and PAGE_MASK
> (as well as a number of other macros related to or derived from
> PAGE_SHIFT, but I'm not worrying about those yet), are no longer
> compile-time constants. So the code base needs to cope with that.
>
> As a first step, introduce MIN and MAX variants of these macros, which
> express the range of possible page sizes. These are always compile-time
> constants and can be used in many places where PAGE_[SHIFT|SIZE|MASK]
> were previously used where a compile-time constant is required.
> (Subsequent patches will do that conversion work). When the arch/build
> doesn't support boot-time page size selection, the MIN and MAX variants
> are equal and everything resolves as it did previously.
>
> Additionally, introduce DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() which wrap
> global variable definitions so that for boot-time page size selection
> builds, the variable being wrapped is initialized at boot-time, instead
> of compile-time. This is done by defining a function to do the
> assignment, which has the "constructor" attribute. Constructor is
> preferred over initcall, because when compiling a module, the module is
> limited to a single initcall but constructors are unlimited. For
> built-in code, constructors are now called earlier to guarantee that
> the variables are initialized by the time they are used. Any arch that
> wants to enable boot-time page size selection will need to select
> CONFIG_CONSTRUCTORS.
>
> These new macros need to be available anywhere PAGE_SHIFT and friends
> are available. Those are defined via asm/page.h (although some arches
> have a sub-include that defines them). Unfortunately there is no
> reliable asm-generic header we can easily piggy-back on, so let's define
> a new one, pgtable-geometry.h, which we include near where each arch
> defines PAGE_SHIFT. Ugh.
I haven't had any feedback on this particular patch yet. It would be great to
get this one into v6.13, since once this is in place, the changes in other
subsystems can go via their respective trees without any dependency issues.
Although time is getting tight.
If anyone has any feedback for this patch it would be great to hear it now. Then
I'll re-post on its own in a couple of days' time.
Thanks,
Ryan
>
> -------
>
> Most of the problems that need to be solved over the next few patches
> fall into these broad categories, which are all solved with the help of
> these new macros:
>
> 1. Assignment of values derived from PAGE_SIZE in global variables
>
> For boot-time page size builds, we must defer the initialization of
> these variables until boot-time, when the page size is known. See
> DEFINE_GLOBAL_PAGE_SIZE_VAR[_CONST]() as described above.
>
> 2. Define static storage in units related to PAGE_SIZE
>
> This static storage will be defined according to PAGE_SIZE_MAX.
>
> 3. Define size of struct so that it is related to PAGE_SIZE
>
> The struct often contains an array that is sized to fill the page. In
> this case, use a flexible array with dynamic allocation. In other
> cases, the struct fits exactly over a page, which is a header (e.g.
> swap file header). In this case, remove the padding, and manually
> determine the struct pointer within the page.
>
> 4. BUILD_BUG_ON() with values derived from PAGE_SIZE
>
> In most cases, we can change these to compare against the appropriate
> limit (either MIN or MAX). In other cases, we must change these to
> run-time BUG_ON().
>
> 5. Ensure page alignment of static data structures
>
> Align instead to PAGE_SIZE_MAX.
>
> 6. #ifdeffery based on PAGE_SIZE
>
> Often these can be changed to c code constructs. e.g. a macro that
> returns a different value depending on page size can be changed to use
> the ternary operator and the compiler will dead code strip it for the
> compile-time constant case and runtime evaluate it for the non-const
> case. Or #if/#else/#endif within a function can be converted to c
> if/else blocks, which are also dead code stripped for the const case.
> Sometimes we can change the c-preprocessor logic to use the
> appropriate MIN/MAX limit.
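
Categories 2, 4 and 6 from the list above can be sketched in user-space C as
follows. This is only an illustration under stated assumptions: all demo_ and
DEMO_ names are hypothetical, the 4K/64K limits are assumed, and a plain
variable stands in for a PAGE_SIZE that is chosen at boot. The storage
pattern mirrors what the xen balloon patch in this series does with
frame_list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the MIN/MAX limits (assumed values). */
#define DEMO_PAGE_SIZE_MIN	4096UL
#define DEMO_PAGE_SIZE_MAX	65536UL

/* Stand-in for a boot-time-selected PAGE_SIZE; no longer a constant. */
static unsigned long demo_page_size = DEMO_PAGE_SIZE_MIN;
#define DEMO_PAGE_SIZE	demo_page_size

/* Category 2: size static storage for the largest possible page... */
static uint64_t demo_frame_list[DEMO_PAGE_SIZE_MAX / sizeof(uint64_t)];
/* ...but only use one run-time page worth of entries. */
#define DEMO_FRAME_LIST_NR_ENTRIES (DEMO_PAGE_SIZE / sizeof(uint64_t))

/* Category 4: a build-time check expressed against a limit, not PAGE_SIZE. */
static_assert(sizeof(demo_frame_list) == DEMO_PAGE_SIZE_MAX,
	      "frame list must cover the largest page size");

/* Number of entries that were statically reserved. */
static size_t demo_frame_list_capacity(void)
{
	return sizeof(demo_frame_list) / sizeof(demo_frame_list[0]);
}

/* Clamp a batch to one run-time page of entries, then fill those entries. */
static size_t demo_fill_batch(size_t nr_pages)
{
	size_t i;

	if (nr_pages > DEMO_FRAME_LIST_NR_ENTRIES)
		nr_pages = DEMO_FRAME_LIST_NR_ENTRIES;
	for (i = 0; i < nr_pages; i++)
		demo_frame_list[i] = i;
	return nr_pages;
}

/* Category 6: a C if/else replaces "#if OTHER_PAGE_SIZE == PAGE_SIZE". */
static int demo_pages_match(unsigned long other_page_size)
{
	if (other_page_size == DEMO_PAGE_SIZE)
		return 1;
	return 0;
}
```

With a compile-time constant page size the compiler can fold the comparison
in demo_pages_match() and drop the dead branch; with a run-time value the
same source is simply evaluated at run time.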
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> arch/alpha/include/asm/page.h | 1 +
> arch/arc/include/asm/page.h | 1 +
> arch/arm/include/asm/page.h | 1 +
> arch/arm64/include/asm/page-def.h | 2 +
> arch/csky/include/asm/page.h | 3 ++
> arch/hexagon/include/asm/page.h | 2 +
> arch/loongarch/include/asm/page.h | 2 +
> arch/m68k/include/asm/page.h | 1 +
> arch/microblaze/include/asm/page.h | 1 +
> arch/mips/include/asm/page.h | 1 +
> arch/nios2/include/asm/page.h | 2 +
> arch/openrisc/include/asm/page.h | 1 +
> arch/parisc/include/asm/page.h | 1 +
> arch/powerpc/include/asm/page.h | 2 +
> arch/riscv/include/asm/page.h | 1 +
> arch/s390/include/asm/page.h | 1 +
> arch/sh/include/asm/page.h | 1 +
> arch/sparc/include/asm/page.h | 3 ++
> arch/um/include/asm/page.h | 2 +
> arch/x86/include/asm/page_types.h | 2 +
> arch/xtensa/include/asm/page.h | 1 +
> include/asm-generic/pgtable-geometry.h | 71 ++++++++++++++++++++++++++
> init/main.c | 5 +-
> 23 files changed, 107 insertions(+), 1 deletion(-)
> create mode 100644 include/asm-generic/pgtable-geometry.h
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 70419e6be1a35..d0096fb5521b8 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -88,5 +88,6 @@ typedef struct page *pgtable_t;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ALPHA_PAGE_H */
> diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
> index def0dfb95b436..8d56549db7a33 100644
> --- a/arch/arc/include/asm/page.h
> +++ b/arch/arc/include/asm/page.h
> @@ -6,6 +6,7 @@
> #define __ASM_ARC_PAGE_H
>
> #include <uapi/asm/page.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #ifdef CONFIG_ARC_HAS_PAE40
>
> diff --git a/arch/arm/include/asm/page.h b/arch/arm/include/asm/page.h
> index 62af9f7f9e963..417aa8533c718 100644
> --- a/arch/arm/include/asm/page.h
> +++ b/arch/arm/include/asm/page.h
> @@ -191,5 +191,6 @@ extern int pfn_valid(unsigned long);
>
> #include <asm-generic/getorder.h>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif
> diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
> index 792e9fe881dcf..d69971cf49cd2 100644
> --- a/arch/arm64/include/asm/page-def.h
> +++ b/arch/arm64/include/asm/page-def.h
> @@ -15,4 +15,6 @@
> #define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> #define PAGE_MASK (~(PAGE_SIZE-1))
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __ASM_PAGE_DEF_H */
> diff --git a/arch/csky/include/asm/page.h b/arch/csky/include/asm/page.h
> index 0ca6c408c07f2..95173d57adc8b 100644
> --- a/arch/csky/include/asm/page.h
> +++ b/arch/csky/include/asm/page.h
> @@ -92,4 +92,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
> #include <asm-generic/getorder.h>
>
> #endif /* !__ASSEMBLY__ */
> +
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __ASM_CSKY_PAGE_H */
> diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
> index 8a6af57274c2d..ba7ad5231695f 100644
> --- a/arch/hexagon/include/asm/page.h
> +++ b/arch/hexagon/include/asm/page.h
> @@ -139,4 +139,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
> #endif /* ifdef __ASSEMBLY__ */
> #endif /* ifdef __KERNEL__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif
> diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
> index e85df33f11c77..9862e8fb047a6 100644
> --- a/arch/loongarch/include/asm/page.h
> +++ b/arch/loongarch/include/asm/page.h
> @@ -123,4 +123,6 @@ extern int __virt_addr_valid(volatile void *kaddr);
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_PAGE_H */
> diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
> index 8cfb84b499751..4df4681b02194 100644
> --- a/arch/m68k/include/asm/page.h
> +++ b/arch/m68k/include/asm/page.h
> @@ -60,5 +60,6 @@ extern unsigned long _ramend;
>
> #include <asm-generic/getorder.h>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _M68K_PAGE_H */
> diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
> index 8810f4f1c3b02..abc23c3d743bd 100644
> --- a/arch/microblaze/include/asm/page.h
> +++ b/arch/microblaze/include/asm/page.h
> @@ -142,5 +142,6 @@ static inline const void *pfn_to_virt(unsigned long pfn)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_MICROBLAZE_PAGE_H */
> diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
> index 4609cb0326cf3..3d91021538f02 100644
> --- a/arch/mips/include/asm/page.h
> +++ b/arch/mips/include/asm/page.h
> @@ -227,5 +227,6 @@ static inline unsigned long kaslr_offset(void)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_PAGE_H */
> diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
> index 0722f88e63cc7..2e5f93beb42b7 100644
> --- a/arch/nios2/include/asm/page.h
> +++ b/arch/nios2/include/asm/page.h
> @@ -97,4 +97,6 @@ extern struct page *mem_map;
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_NIOS2_PAGE_H */
> diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
> index 1d5913f67c312..a0da2a9842241 100644
> --- a/arch/openrisc/include/asm/page.h
> +++ b/arch/openrisc/include/asm/page.h
> @@ -88,5 +88,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* __ASM_OPENRISC_PAGE_H */
> diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
> index 4bea2e95798f0..2a75496237c09 100644
> --- a/arch/parisc/include/asm/page.h
> +++ b/arch/parisc/include/asm/page.h
> @@ -173,6 +173,7 @@ extern int npmem_ranges;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
> #include <asm/pdc.h>
>
> #define PAGE0 ((struct zeropage *)absolute_pointer(__PAGE_OFFSET))
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 83d0a4fc5f755..4601c115b6485 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -300,4 +300,6 @@ static inline unsigned long kaslr_offset(void)
> #include <asm-generic/memory_model.h>
> #endif /* __ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_POWERPC_PAGE_H */
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 7ede2111c5917..e5af7579e45bf 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -204,5 +204,6 @@ static __always_inline void *pfn_to_kaddr(unsigned long pfn)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* _ASM_RISCV_PAGE_H */
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index 16e4caa931f1f..42157e7690a77 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -275,6 +275,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #define AMODE31_SIZE (3 * PAGE_SIZE)
>
> diff --git a/arch/sh/include/asm/page.h b/arch/sh/include/asm/page.h
> index f780b467e75d7..09533d46ef033 100644
> --- a/arch/sh/include/asm/page.h
> +++ b/arch/sh/include/asm/page.h
> @@ -162,5 +162,6 @@ typedef struct page *pgtable_t;
>
> #include <asm-generic/memory_model.h>
> #include <asm-generic/getorder.h>
> +#include <asm-generic/pgtable-geometry.h>
>
> #endif /* __ASM_SH_PAGE_H */
> diff --git a/arch/sparc/include/asm/page.h b/arch/sparc/include/asm/page.h
> index 5e44cdf2a8f2b..4327fe2bfa010 100644
> --- a/arch/sparc/include/asm/page.h
> +++ b/arch/sparc/include/asm/page.h
> @@ -9,4 +9,7 @@
> #else
> #include <asm/page_32.h>
> #endif
> +
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif
> diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
> index 9ef9a8aedfa66..f26011808f514 100644
> --- a/arch/um/include/asm/page.h
> +++ b/arch/um/include/asm/page.h
> @@ -119,4 +119,6 @@ extern unsigned long uml_physmem;
> #define __HAVE_ARCH_GATE_AREA 1
> #endif
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* __UM_PAGE_H */
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index 52f1b4ff0cc16..6d2381342047f 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -71,4 +71,6 @@ extern void initmem_init(void);
>
> #endif /* !__ASSEMBLY__ */
>
> +#include <asm-generic/pgtable-geometry.h>
> +
> #endif /* _ASM_X86_PAGE_DEFS_H */
> diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
> index 4db56ef052d22..86952cb32af23 100644
> --- a/arch/xtensa/include/asm/page.h
> +++ b/arch/xtensa/include/asm/page.h
> @@ -200,4 +200,5 @@ static inline unsigned long ___pa(unsigned long va)
> #endif /* __ASSEMBLY__ */
>
> #include <asm-generic/memory_model.h>
> +#include <asm-generic/pgtable-geometry.h>
> #endif /* _XTENSA_PAGE_H */
> diff --git a/include/asm-generic/pgtable-geometry.h b/include/asm-generic/pgtable-geometry.h
> new file mode 100644
> index 0000000000000..358e729a6ac37
> --- /dev/null
> +++ b/include/asm-generic/pgtable-geometry.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef ASM_GENERIC_PGTABLE_GEOMETRY_H
> +#define ASM_GENERIC_PGTABLE_GEOMETRY_H
> +
> +#if defined(PAGE_SHIFT_MAX) && defined(PAGE_SIZE_MAX) && defined(PAGE_MASK_MAX) && \
> + defined(PAGE_SHIFT_MIN) && defined(PAGE_SIZE_MIN) && defined(PAGE_MASK_MIN)
> +/* Arch supports boot-time page size selection. */
> +#elif defined(PAGE_SHIFT_MAX) || defined(PAGE_SIZE_MAX) || defined(PAGE_MASK_MAX) || \
> + defined(PAGE_SHIFT_MIN) || defined(PAGE_SIZE_MIN) || defined(PAGE_MASK_MIN)
> +#error Arch must define all or none of the boot-time page size macros
> +#else
> +/* Arch does not support boot-time page size selection. */
> +#define PAGE_SHIFT_MIN PAGE_SHIFT
> +#define PAGE_SIZE_MIN PAGE_SIZE
> +#define PAGE_MASK_MIN PAGE_MASK
> +#define PAGE_SHIFT_MAX PAGE_SHIFT
> +#define PAGE_SIZE_MAX PAGE_SIZE
> +#define PAGE_MASK_MAX PAGE_MASK
> +#endif
> +
> +/*
> + * Define a global variable (scalar or struct), whose value is derived from
> + * PAGE_SIZE and friends. When PAGE_SIZE is a compile-time constant, the global
> + * variable is simply defined with the static value. When PAGE_SIZE is
> + * determined at boot-time, a pure initcall is registered and run during boot to
> + * initialize the variable.
> + *
> + * @type: Unqualified type. Do not include "const"; implied by macro variant.
> + * @name: Variable name.
> + * @...: Initialization value. May be scalar or initializer.
> + *
> + * "static" is declared by placing "static" before the macro.
> + *
> + * Example:
> + *
> + * struct my_struct {
> + * int a;
> + * char b;
> + * };
> + *
> + * static DEFINE_GLOBAL_PAGE_SIZE_VAR(struct my_struct, my_variable, {
> + * .a = 10,
> + * .b = 'e',
> + * });
> + */
> +#if PAGE_SIZE_MIN != PAGE_SIZE_MAX
> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
> + type name attrib; \
> + static int __init __attribute__((constructor)) __##name##_init(void) \
> + { \
> + name = (type)__VA_ARGS__; \
> + return 0; \
> + }
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, __ro_after_init, __VA_ARGS__)
> +#else /* PAGE_SIZE_MIN == PAGE_SIZE_MAX */
> +#define __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, attrib, ...) \
> + type name attrib = __VA_ARGS__; \
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(type, name, , __VA_ARGS__)
> +
> +#define DEFINE_GLOBAL_PAGE_SIZE_VAR_CONST(type, name, ...) \
> + __DEFINE_GLOBAL_PAGE_SIZE_VAR(const type, name, , __VA_ARGS__)
> +#endif
> +
> +#endif /* ASM_GENERIC_PGTABLE_GEOMETRY_H */
> diff --git a/init/main.c b/init/main.c
> index 206acdde51f5a..ba1515eb20b9d 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -899,6 +899,8 @@ static void __init early_numa_node_init(void)
> #endif
> }
>
> +static __init void do_ctors(void);
> +
> asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
> void start_kernel(void)
> {
> @@ -910,6 +912,8 @@ void start_kernel(void)
> debug_objects_early_init();
> init_vmlinux_build_id();
>
> + do_ctors();
> +
> cgroup_init_early();
>
> local_irq_disable();
> @@ -1360,7 +1364,6 @@ static void __init do_basic_setup(void)
> cpuset_init_smp();
> driver_init();
> init_irq_proc();
> - do_ctors();
> do_initcalls();
> }
>
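For anyone reading the DEFINE_GLOBAL_PAGE_SIZE_VAR() hunk quoted above, a
rough userspace sketch of the boot-time branch may make the idea concrete.
It assumes GCC or Clang for __attribute__((constructor)). The names
DEFINE_RUNTIME_VAR, page_size_at_boot(), struct geometry and
geo_page_align() are hypothetical and are not part of the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for a page size that is only known at startup. */
static unsigned long page_size_at_boot(void)
{
	return 1UL << 14;
}

/*
 * Userspace analogue of the boot-time variant: declare the variable
 * uninitialized, then assign it from a constructor that runs before main().
 * The trailing prototype only exists to absorb the caller's semicolon.
 */
#define DEFINE_RUNTIME_VAR(type, name, ...)				\
	type name;							\
	static void __attribute__((constructor)) name##_init(void)	\
	{								\
		name = (type)__VA_ARGS__;				\
	}								\
	static void name##_init(void)

struct geometry {
	unsigned long size;
	unsigned long mask;
};

/* The initializer contains commas, hence the variadic macro parameter. */
static DEFINE_RUNTIME_VAR(struct geometry, geo, {
	.size = page_size_at_boot(),
	.mask = ~(page_size_at_boot() - 1),
});

/* Round addr up to the next page boundary using the derived values. */
static unsigned long geo_page_align(unsigned long addr)
{
	assert(geo.size != 0);	/* catches use before the constructor ran */
	return (addr + geo.size - 1) & geo.mask;
}
```

The point of the sketch is only the shape: a value derived from the page
size is assigned in a constructor rather than given a static initializer.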
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-24 10:48 ` Ryan Roberts
2024-10-24 11:45 ` Petr Tesarik
@ 2024-10-30 22:11 ` Sumit Gupta
1 sibling, 0 replies; 196+ messages in thread
From: Sumit Gupta @ 2024-10-30 22:11 UTC (permalink / raw)
To: Ryan Roberts, Thomas Tai, Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm,
linux-tegra
On 24/10/24 16:18, Ryan Roberts wrote:
> On 23/10/2024 22:00, Thomas Tai wrote:
>>
>> On 10/17/2024 8:32 AM, Ryan Roberts wrote:
>>> On 17/10/2024 13:27, Petr Tesarik wrote:
>>>> On Mon, 14 Oct 2024 11:55:11 +0100
>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>>> [...]
>>>>> The series is arranged as follows:
>>>>>
>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>> boot-time page size selection
>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>>>> non-arch code
>>>> I have just tried to recompile the openSUSE kernel with these patches
>>>> applied, and I'm running into this:
>>>>
>>>> CC arch/arm64/hyperv/hv_core.o
>>>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
>>>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file
>>>> scope
>>>> u8 reserved2[PAGE_SIZE - 68];
>>>> ^~~~~~~~~
>>>>
>>>> It looks like one more place which needs a patch, right?
>>> As mentioned in the cover letter, so far I've only converted enough to get the
>>> defconfig *image* building (i.e. no modules). If you are compiling a different
>>> config or compiling the modules for defconfig, you will likely run into these
>>> types of issues.
>>
>> It would be nice if you could provide the defconfig you are using; I also ran
>> into build issues when using the arch/arm64/configs/defconfig.
>
> git clean -xdfq
> make defconfig
>
> # Set CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> ./scripts/config --disable CONFIG_ARM64_4K_PAGES
> ./scripts/config --disable CONFIG_ARM64_16K_PAGES
> ./scripts/config --disable CONFIG_ARM64_64K_PAGES
> ./scripts/config --disable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> ./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>
> # Set ARM64_VA_BITS_48
> ./scripts/config --disable ARM64_VA_BITS_36
> ./scripts/config --disable ARM64_VA_BITS_39
> ./scripts/config --disable ARM64_VA_BITS_42
> ./scripts/config --disable ARM64_VA_BITS_47
> ./scripts/config --disable ARM64_VA_BITS_48
> ./scripts/config --disable ARM64_VA_BITS_52
> ./scripts/config --enable ARM64_VA_BITS_48
>
> # Optional: filesystems known to compile with boot-time page size
> ./scripts/config --enable CONFIG_SQUASHFS_LZ4
> ./scripts/config --enable CONFIG_SQUASHFS_LZO
> ./scripts/config --enable CONFIG_SQUASHFS_XZ
> ./scripts/config --enable CONFIG_SQUASHFS_ZSTD
> ./scripts/config --enable CONFIG_XFS_FS
>
> # Optional: trace stuff known to compile with boot-time page size
> ./scripts/config --enable CONFIG_FTRACE
> ./scripts/config --enable CONFIG_FUNCTION_TRACER
> ./scripts/config --enable CONFIG_KPROBES
> ./scripts/config --enable CONFIG_HIST_TRIGGERS
> ./scripts/config --enable CONFIG_FTRACE_SYSCALLS
>
> # Optional: misc mm stuff known to compile with boot-time page size
> ./scripts/config --enable CONFIG_PTDUMP_DEBUGFS
> ./scripts/config --enable CONFIG_READ_ONLY_THP_FOR_FS
> ./scripts/config --enable CONFIG_USERFAULTFD
>
> # Optional: mm debug stuff known to compile with boot-time page size
> ./scripts/config --enable CONFIG_DEBUG_VM
> ./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
> ./scripts/config --enable CONFIG_DEBUG_VM_RB
> ./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
> ./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
> ./scripts/config --enable CONFIG_PAGE_TABLE_CHECK_ENFORCED
>
> make olddefconfig
> make -s -j`nproc` Image
>
> So I'm explicitly only building and booting the kernel image, not the modules.
> The kernel image contains all the drivers needed to get a VM up and running
> under QEMU/KVM.
>
> Thanks,
> Ryan
>
Thank you for this patch set.
I was able to boot with minimal configs on a Tegra234 board. I will enable
more configs and discuss.
Thank you,
Sumit Gupta
>>
>> Thank you,
>> Thomas
>>
>>>
>>> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
>>> enough to send me.
>>>
>>> I understand that SUSE might be able to help with wider performance testing - if
>>> that's the reason you are trying to compile, you could send me your config and
>>> I'll start working on fixing up other drivers?
>>>
>>> Thanks,
>>> Ryan
>>>
>>>> Petr T
>>>
>
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
` (7 preceding siblings ...)
2024-10-19 15:47 ` Neal Gompa
@ 2024-10-31 21:07 ` Catalin Marinas
2024-11-06 11:37 ` Ryan Roberts
8 siblings, 1 reply; 196+ messages in thread
From: Catalin Marinas @ 2024-10-31 21:07 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
Hi Ryan,
On Mon, Oct 14, 2024 at 11:55:11AM +0100, Ryan Roberts wrote:
> This RFC series implements support for boot-time page size selection within the
> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
> size has been selected at compile-time, meaning the size is baked into a given
> kernel image. As use of larger-than-4K page sizes become more prevalent this
> starts to present a problem for distributions. Boot-time page size selection
> enables the creation of a single kernel image, which can be told which page size
> to use on the kernel command line.
That's great work, something I wasn't expecting to even build, let alone
run ;). I only looked briefly through the patches, there's probably room
for optimisation of micro-benchmarks like fork(), maybe using something
like runtime constants. The advantage for deployment and easy testing of
different configurations is pretty clear (distros mainly, not sure how
relevant it is for Android if apps can't move beyond 4K pages).
However, as a maintainer, my main concern is having to chase build
failures in obscure drivers that have not been tested/developed on
arm64. If people primarily test on x86, they wouldn't notice that
PAGE_SIZE/PAGE_SHIFT are no longer constants. Not looking forward to
trying to sort out allmodconfig builds every kernel release, especially
if they turn up in subsystems I have no clue about (like most stuff
outside arch/arm64).
So, first of all, I'd like to understand the overall maintainability
impact better. I assume you tested mostly defconfig. If you run an
allmodconfig build with make -k, how many build failures do you get with
this patchset? Similarly for some distro configs.
Do we have any better way to detect this other than actual compilation
on arm64? Can we hack something around COMPILE_TEST like redefine
PAGE_SIZE (for modules only) to a variable so that we have a better
chance of detecting build failures when modules are only tested on other
architectures?
At the moment, I'm not entirely convinced of the benefits vs. long term
maintainability. Even if we don't end up merging the dynamic PAGE_SIZE
support, parts of this series are needed for supporting 128-bit ptes on
arm64, hopefully dynamically as well.
Thanks.
--
Catalin
^ permalink raw reply [flat|nested] 196+ messages in thread
* [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-10-14 10:58 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
2024-10-16 14:37 ` Ryan Roberts
@ 2024-11-01 20:16 ` Dave Kleikamp
2024-11-06 11:44 ` Ryan Roberts
2024-11-14 10:09 ` Vlastimil Babka
2024-11-14 10:17 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Vlastimil Babka
2 siblings, 2 replies; 196+ messages in thread
From: Dave Kleikamp @ 2024-11-01 20:16 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton; +Cc: linux-arm-kernel, linux-kernel, linux-mm
When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
is no longer optimized out with a constant size, so a build bug may
occur on a path that won't be reached.
Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
---
Ryan,
Please consider incorporating this fix or something similar into your
mm patch in the boot-time page size patches.
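To illustrate the problem described in the commit message, here is a small
userspace sketch. It assumes the cache-size limit is two pages (which I
believe matches KMALLOC_MAX_CACHE_SIZE under SLUB). boot_page_shift,
max_cache_size() and takes_index_path() are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page shift that is only known at boot. */
static unsigned int boot_page_shift = 12;

/* Assumed limit of two pages, derived from the runtime page shift. */
static size_t max_cache_size(void)
{
	assert(boot_page_shift < 8 * sizeof(size_t) - 1);
	return (size_t)2 << boot_page_shift;
}

/*
 * Returns 1 if a request of this size would take the size-class lookup
 * path. Because the limit is a runtime value, the compiler cannot fold
 * this comparison even when size is a literal at the call site.
 */
static int takes_index_path(size_t size)
{
	return size <= max_cache_size();
}
```

With a compile-time PAGE_SHIFT the comparison folds to a constant and the
dead branch disappears. With a runtime shift both branches survive, so a
build-time assertion on the "impossible" branch can fire.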
include/linux/slab.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9848296ca6ba..a4c7507ab8ec 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
if (size <= 1024 * 1024) return 20;
if (size <= 2 * 1024 * 1024) return 21;
- if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
+ if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
+ !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
else
BUG();
--
2.47.0
^ permalink raw reply related [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-31 21:07 ` Catalin Marinas
@ 2024-11-06 11:37 ` Ryan Roberts
2024-11-07 12:35 ` Catalin Marinas
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-06 11:37 UTC (permalink / raw)
To: Catalin Marinas
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 31/10/2024 21:07, Catalin Marinas wrote:
> Hi Ryan,
>
> On Mon, Oct 14, 2024 at 11:55:11AM +0100, Ryan Roberts wrote:
>> This RFC series implements support for boot-time page size selection within the
>> arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to date, page
>> size has been selected at compile-time, meaning the size is baked into a given
>> kernel image. As use of larger-than-4K page sizes become more prevalent this
>> starts to present a problem for distributions. Boot-time page size selection
>> enables the creation of a single kernel image, which can be told which page size
>> to use on the kernel command line.
>
> That's great work, something I wasn't expecting to even build, let alone
> run ;).
Cheers!
> I only looked briefly through the patches, there's probably room
> for optimisation of micro-benchmarks like fork(), maybe using something
> like runtime constants.
Yes, I suspect there is room for some optimization. Note, though, that I already
tried alternatives patching, and for the fork() microbenchmark it performed
worse than the approach I ended up taking, which just loads a global variable. I
think this was likely down to code layout changes caused by all the extra
branches/nops - fork has been very sensitive to code layout changes in the past.
> The advantage for deployment and easy testing of
> different configurations is pretty clear (distros mainly, not sure how
> relevant it is for Android if apps can't move beyond 4K pages).
>
> However, as a maintainer, my main concern is having to chase build
> failures in obscure drivers that have not been tested/developed on
> arm64. If people primarily test on x86, they wouldn't notice that
> PAGE_SIZE/PAGE_SHIFT are no longer constants. Not looking forward to
> trying to sort out allmodconfig builds every kernel release, especially
> if they turn up in subsystems I have no clue about (like most stuff
> outside arch/arm64).
Yes, I understand that concern.
>
> So, first of all, I'd like to understand the overall maintainability
> impact better. I assume you tested mostly defconfig. If you run an
> allmodconfig build with make -k, how many build failures do you get with
> this patchset? Similarly for some distro configs.
I've roughly done:
make alldefconfig &&
./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE &&
make -s -j`nproc` -k &> allmodconfig.log
Then parsed the log for issues. Unfortunately the errors are very chatty and it
is difficult to perfectly extract stats.
If I search for r'(\S+\.[ch]):.*error:', that is optimistic because PAGE_SIZE
being non-const gets the ultimate blame for most things, but I'm interested in
the call sites. Number of affected files using this approach: 111.
If I just blindly search for all files, r'(\S+\.[ch]):', that is pessimistic
because when the issue is in a header, the full include chain is spat out.
Number of affected files using this approach: 1807.
If I just search for C files, r'(\S+\.[c]):', (all issues in headers terminate
in a C file) that is also pessimistic because the same single header issue is
reported for every C file it is included in. Number of affected files using this
approach: 1369.
In the end, I decided to go for r'(\S+\.[ch]):.*(error|note):', which is any
files described as having an error or being the callsite of the thing with the
error. I think this is likely most accurate from eyeballing the log:
| | C&H files | percentage of |
| directory | w/ error | all C&H files |
|------------|---------------|---------------|
| arch/arm64 | 7 | 1.3% |
| drivers | 127 | 0.4% |
| fs | 25 | 1.1% |
| include | 27 | 0.4% |
| init | 1 | 8.3% |
| kernel | 7 | 1.3% |
| lib | 1 | 0.2% |
| mm | 6 | 3.2% |
| net | 7 | 0.4% |
| security | 2 | 0.8% |
| sound | 21 | 0.8% |
|------------|---------------|---------------|
| TOTAL | 231 | 0.4% |
|------------|---------------|---------------|
I'm not sure how best to evaluate if this is a large or small number though! For
comparison, the RFC modified 172 files.
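For reference, the last pattern can be approximated in C with POSIX
<regex.h>. This is only a sketch: POSIX ERE has no \S, so [^[:space:]] is
substituted, and extract_flagged_file() is a hypothetical helper, not the
script used to produce the numbers above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * POSIX ERE approximation of r'(\S+\.[ch]):.*(error|note):'.
 * Copies the captured file name into out (truncating if needed).
 * Returns 1 on a match, 0 otherwise.
 */
static int extract_flagged_file(const char *line, char *out, size_t outlen)
{
	regex_t re;
	regmatch_t m[2];
	int found = 0;

	assert(out != NULL && outlen > 0);

	if (regcomp(&re, "([^[:space:]]+\\.[ch]):.*(error|note):",
		    REG_EXTENDED))
		return 0;

	if (regexec(&re, line, 2, m, 0) == 0) {
		size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

		if (len >= outlen)
			len = outlen - 1;
		memcpy(out, line + m[1].rm_so, len);
		out[len] = '\0';
		found = 1;
	}

	regfree(&re);
	return found;
}
```

Feeding each log line through this and de-duplicating the captured names
gives a per-file count of the kind shown in the table.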
>
> Do we have any better way to detect this other than actual compilation
> on arm64? Can we hack something around COMPILE_TEST like redefine
> PAGE_SIZE (for modules only) to a variable so that we have a better
> chance of detecting build failures when modules are only tested on other
> architectures?
I can certainly look into this. But if the concern is that drivers are not being
compiled against arm64, what is the likelihood of them being compiled against
COMPILE_TEST?
>
> At the moment, I'm not entirely convinced of the benefits vs. long term
> maintainability. Even if we don't end up merging the dynamic PAGE_SIZE
> support, parts of this series are needed for supporting 128-bit ptes on
> arm64, hopefully dynamically as well.
Agreed.
Thanks,
Ryan
>
> Thanks.
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-01 20:16 ` [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant Dave Kleikamp
@ 2024-11-06 11:44 ` Ryan Roberts
2024-11-06 15:20 ` Dave Kleikamp
2024-11-14 10:09 ` Vlastimil Babka
1 sibling, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-06 11:44 UTC (permalink / raw)
To: Dave Kleikamp, Andrew Morton; +Cc: linux-arm-kernel, linux-kernel, linux-mm
On 01/11/2024 20:16, Dave Kleikamp wrote:
> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
> is no longer optimized out with a constant size, so a build bug may
> occur on a path that won't be reached.
>
> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>
> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
> ---
>
> Ryan,
>
> Please consider incorporating this fix or something similar into your
> mm patch in the boot-time page size patches.
>
> include/linux/slab.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 9848296ca6ba..a4c7507ab8ec 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t
> size,
> if (size <= 1024 * 1024) return 20;
> if (size <= 2 * 1024 * 1024) return 21;
>
> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
Thanks for the patch! I think this may be better as:
if (PAGE_SHIFT_MIN == PAGE_SHIFT_MAX &&
Since that is independent of the architecture. Your approach wouldn't work if
another arch wanted to enable boot time page size, or if arm64 dropped the
Kconfig because it decided only boot time page size will be supported in future.
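As a rough sketch of what I mean, with the condition pulled out into a
helper so it can be read in isolation. The macro values are placeholders,
PROFILE_ALL_BRANCHES stands in for IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES),
and build_time_check_allowed() is a hypothetical name.

```c
#include <assert.h>

/* Placeholder values; a real arch header would provide these. */
#define PAGE_SHIFT_MIN		12
#define PAGE_SHIFT_MAX		16
/* Stand-in for IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES). */
#define PROFILE_ALL_BRANCHES	0

static_assert(PAGE_SHIFT_MIN <= PAGE_SHIFT_MAX, "MIN must not exceed MAX");

/*
 * Returns 1 only when the page size is a compile-time constant
 * (MIN == MAX), branch profiling is off, and the size is constant.
 * No arch-specific Kconfig symbol is involved.
 */
static inline int build_time_check_allowed(int size_is_constant)
{
	return PAGE_SHIFT_MIN == PAGE_SHIFT_MAX &&
	       !PROFILE_ALL_BRANCHES && size_is_constant;
}
```

With MIN and MAX equal the helper reduces to the existing condition, so
architectures without boot-time page size see no change.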
Thanks,
Ryan
> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
> else
> BUG();
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-06 11:44 ` Ryan Roberts
@ 2024-11-06 15:20 ` Dave Kleikamp
0 siblings, 0 replies; 196+ messages in thread
From: Dave Kleikamp @ 2024-11-06 15:20 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton; +Cc: linux-arm-kernel, linux-kernel, linux-mm
On 11/6/24 5:44AM, Ryan Roberts wrote:
> On 01/11/2024 20:16, Dave Kleikamp wrote:
>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>> is no longer optimized out with a constant size, so a build bug may
>> occur on a path that won't be reached.
>>
>> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>>
>> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
>> ---
>>
>> Ryan,
>>
>> Please consider incorporating this fix or something similar into your
>> mm patch in the boot-time page size patches.
>>
>> include/linux/slab.h | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index 9848296ca6ba..a4c7507ab8ec 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t
>> size,
>> if (size <= 1024 * 1024) return 20;
>> if (size <= 2 * 1024 * 1024) return 21;
>>
>> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
>
> Thanks for the patch! I think this may be better as:
>
> if (PAGE_SHIFT_MIN == PAGE_SHIFT_MAX &&
>
> Since that is independent of the architecture. Your approach wouldn't work if
> another arch wanted to enable boot time page size, or if arm64 dropped the
> Kconfig because it decided only boot time page size will be supported in future.
Absolutely. I may be sending some more. I haven't gotten to JFS yet, but
that one is my responsibility.
Thanks,
Shaggy
>
> Thanks,
> Ryan
>
>> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
>> else
>> BUG();
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-06 11:37 ` Ryan Roberts
@ 2024-11-07 12:35 ` Catalin Marinas
2024-11-07 12:47 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Catalin Marinas @ 2024-11-07 12:35 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Wed, Nov 06, 2024 at 11:37:58AM +0000, Ryan Roberts wrote:
> On 31/10/2024 21:07, Catalin Marinas wrote:
> > So, first of all, I'd like to understand the overall maintainability
> > impact better. I assume you tested mostly defconfig. If you run an
> > allmodconfig build with make -k, how many build failures do you get with
> > this patchset? Similarly for some distro configs.
>
> I've roughly done:
>
> make alldefconfig &&
> ./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE &&
> make -s -j`nproc` -k &> allmodconfig.log
Is it alldefconfig or allmodconfig? The former has far fewer symbols
enabled than even defconfig (fairly close to allnoconfig actually):
$ make defconfig
$ grep -v "^#\|^$" .config | wc -l
4449
$ make alldefconfig
$ grep -v "^#\|^$" .config | wc -l
713
$ make allmodconfig
$ grep -v "^#\|^$" .config | wc -l
14401
> In the end, I decided to go for r'(\S+\.[ch]):.*(error|note):', which is any
> files described as having an error or being the callsite of the thing with the
> error. I think this is likely most accurate from eyeballing the log:
I think that's good enough to give us a rough idea.
> | | C&H files | percentage of |
> | directory | w/ error | all C&H files |
> |------------|---------------|---------------|
> | arch/arm64 | 7 | 1.3% |
> | drivers | 127 | 0.4% |
> | fs | 25 | 1.1% |
> | include | 27 | 0.4% |
> | init | 1 | 8.3% |
> | kernel | 7 | 1.3% |
> | lib | 1 | 0.2% |
> | mm | 6 | 3.2% |
> | net | 7 | 0.4% |
> | security | 2 | 0.8% |
> | sound | 21 | 0.8% |
> |------------|---------------|---------------|
> | TOTAL | 231 | 0.4% |
> |------------|---------------|---------------|
This doesn't look that bad _if_ you actually built most modules. But if
it was alldefconfig, you likely missed the majority of modules.
> > Do we have any better way to detect this other than actual compilation
> > on arm64? Can we hack something around COMPILE_TEST like redefine
> > PAGE_SIZE (for modules only) to a variable so that we have a better
> > chance of detecting build failures when modules are only tested on other
> > architectures?
>
> I can certainly look into this. But if the concern is that drivers are not being
> compiled against arm64, what is the likelihood of them being compiled against
> COMPILE_TEST?
Hopefully there are some CIs out there catching them. Well, if we are to fix them
anyway, we might as well eventually force a non-const PAGE_SIZE
generically even if it returns a constant.
I'm building allmod now with something like below (and some hacks in
arch and core code to use STATIC_PAGE_* as I did not apply your
patches). alldefconfig passes with my hacks but, as you can see, the
non-const PAGE_SIZE kicks in only if MODULE is defined. So, not an
accurate test, just to get a feel of the modules problem.
----------8<---------------------------
diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index 792e9fe881dc..71a761f86b15 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -12,7 +12,19 @@
/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT CONFIG_PAGE_SHIFT
-#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
+#define STATIC_PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
+#define STATIC_PAGE_MASK (~(STATIC_PAGE_SIZE-1))
+
+#if !defined(MODULE) || defined(__ASSEMBLY__)
+#define PAGE_SIZE STATIC_PAGE_SIZE
+#else
+static inline unsigned long __runtime_page_size(void)
+{
+ return 1UL << PAGE_SHIFT;
+}
+#define PAGE_SIZE (__runtime_page_size())
+#endif
+
#define PAGE_MASK (~(PAGE_SIZE-1))
#endif /* __ASM_PAGE_DEF_H */
----------8<---------------------------
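A userspace sketch of the effect, assuming a 4K page and using the
hypothetical names runtime_page_size(), struct fixed_buf and
pages_needed(): array bounds that need a constant expression have to move
to the STATIC_* macro, while ordinary arithmetic keeps working with the
function-call PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT		12
#define STATIC_PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Hypothetical equivalent of the MODULE-only definition in the hack. */
static inline unsigned long runtime_page_size(void)
{
	return 1UL << PAGE_SHIFT;
}
#define PAGE_SIZE		(runtime_page_size())

/* A struct member array needs a constant bound, so use STATIC_PAGE_SIZE. */
struct fixed_buf {
	unsigned char reserved[STATIC_PAGE_SIZE - 68];
};
static_assert(sizeof(struct fixed_buf) == STATIC_PAGE_SIZE - 68,
	      "reserved[] should be 68 bytes short of a page");

/*
 * By contrast, "unsigned char bad[PAGE_SIZE - 68];" at file scope would be
 * rejected here, which is the class of error the hack is meant to expose.
 */

/* Ordinary arithmetic is unaffected by the non-constant PAGE_SIZE. */
static size_t pages_needed(size_t bytes)
{
	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}
```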
--
Catalin
^ permalink raw reply related [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-07 12:35 ` Catalin Marinas
@ 2024-11-07 12:47 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-11-07 12:47 UTC (permalink / raw)
To: Catalin Marinas
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 07/11/2024 12:35, Catalin Marinas wrote:
> On Wed, Nov 06, 2024 at 11:37:58AM +0000, Ryan Roberts wrote:
>> On 31/10/2024 21:07, Catalin Marinas wrote:
>>> So, first of all, I'd like to understand the overall maintainability
>>> impact better. I assume you tested mostly defconfig. If you run an
>>> allmodconfig build with make -k, how many build failures do you get with
>>> this patchset? Similarly for some distro configs.
>>
>> I've roughly done:
>>
>> make alldefconfig &&
>> ./scripts/config --enable CONFIG_ARM64_BOOT_TIME_PAGE_SIZE &&
>> make -s -j`nproc` -k &> allmodconfig.log
>
> Is it alldefconfig or allmodconfig? The former has far fewer symbols
> enabled than even defconfig (fairly close to allnoconfig actually):
Eek, that was a typo when I wrote the email... I built allmodconfig - the big one.
>
> $ make defconfig
> $ grep -v "^#\|^$" .config | wc -l
> 4449
>
> $ make alldefconfig
> $ grep -v "^#\|^$" .config | wc -l
> 713
>
> $ make allmodconfig
> $ grep -v "^#\|^$" .config | wc -l
> 14401
>
>> In the end, I decided to go for r'(\S+\.[ch]):.*(error|note):', which is any
>> files described as having an error or being the callsite of the thing with the
>> error. I think this is likely most accurate from eyeballing the log:
>
> I think that's good enough to give us a rough idea.
>
>> | | C&H files | percentage of |
>> | directory | w/ error | all C&H files |
>> |------------|---------------|---------------|
>> | arch/arm64 | 7 | 1.3% |
>> | drivers | 127 | 0.4% |
>> | fs | 25 | 1.1% |
>> | include | 27 | 0.4% |
>> | init | 1 | 8.3% |
>> | kernel | 7 | 1.3% |
>> | lib | 1 | 0.2% |
>> | mm | 6 | 3.2% |
>> | net | 7 | 0.4% |
>> | security | 2 | 0.8% |
>> | sound | 21 | 0.8% |
>> |------------|---------------|---------------|
>> | TOTAL | 231 | 0.4% |
>> |------------|---------------|---------------|
>
> This doesn't look that bad _if_ you actually built most modules. But if
> it was alldefconfig, you likely missed the majority of modules.
I definitely built allmodconfig, so I guess "this doesn't look bad" :)
>
>>> Do we have any better way to detect this other than actual compilation
>>> on arm64? Can we hack something around COMPILE_TEST like redefine
>>> PAGE_SIZE (for modules only) to a variable so that we have a better
>>> chance of detecting build failures when modules are only tested on other
>>> architectures?
>>
>> I can certainly look into this. But if the concern is that drivers are not being
>> compiled against arm64, what is the likelihood of them being compiled against
>> COMPILE_TEST?
>
> Hopefully there are some CIs out there catching them. Well, if we are to fix them
> anyway, we might as well eventually force a non-const PAGE_SIZE
> generically even if it returns a constant.
>
> I'm building allmod now with something like below (and some hacks in
> arch and core code to use STATIC_PAGE_* as I did not apply your
> patches). alldefconfig passes with my hacks but, as you can see, the
> non-const PAGE_SIZE kicks in only if MODULE is defined. So, not an
> accurate test, just to get a feel of the modules problem.
Nice. I guess that's pretty much the change we would add for x86 with COMPILE_TEST.
>
> ----------8<---------------------------
> diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
> index 792e9fe881dc..71a761f86b15 100644
> --- a/arch/arm64/include/asm/page-def.h
> +++ b/arch/arm64/include/asm/page-def.h
> @@ -12,7 +12,19 @@
>
> /* PAGE_SHIFT determines the page size */
> #define PAGE_SHIFT CONFIG_PAGE_SHIFT
> -#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> +#define STATIC_PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
> +#define STATIC_PAGE_MASK (~(STATIC_PAGE_SIZE-1))
> +
> +#if !defined(MODULE) || defined(__ASSEMBLY__)
> +#define PAGE_SIZE STATIC_PAGE_SIZE
> +#else
> +static inline unsigned long __runtime_page_size(void)
> +{
> + return 1UL << PAGE_SHIFT;
> +}
> +#define PAGE_SIZE (__runtime_page_size())
> +#endif
> +
> #define PAGE_MASK (~(PAGE_SIZE-1))
>
> #endif /* __ASM_PAGE_DEF_H */
> ----------8<---------------------------
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-17 12:32 ` Ryan Roberts
2024-10-18 12:56 ` Petr Tesarik
2024-10-23 21:00 ` Thomas Tai
@ 2024-11-11 12:14 ` Petr Tesarik
2024-11-11 12:25 ` Ryan Roberts
2024-12-05 17:20 ` Petr Tesarik
3 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-11-11 12:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
Hi Ryan,
On Thu, 17 Oct 2024 13:32:43 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
>[...]
> I understand that SUSE might be able to help with wider performance testing
Sorry for the delay (vacation, other tasks). Anyway, let me share some
results with you.
First, I have looked only at 4k pages (constant v. selected at boot
time) so far.
Second, the impact of the patch series is much smaller than I expected.
Most macro-benchmarks (dbench, io-bench) did not see any significant
slowdown. There appears to be a performance hit of approx. 1-2%, but
that's within noise, and I can't dedicate my time to running extensive
tests to find the distribution peak and compare. In short, I suspect a
slight performance hit, but I cannot quantify it.
Third, a few micro-benchmarks saw a significant regression.
Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
slower with variable page size. I don't know why, but I'm looking into
it. The system() library call was also about 18% slower, but that might
be related.
The dup() syscall was up to 5% slower (depends on underlying filesystem
type).
VMA unmap was slower for some sizes, but the pattern seemed random,
sometimes giving even better performance with variable page size, so
this micro-benchmark may be too unstable to draw any conclusions.
Stay tuned
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-11 12:14 ` Petr Tesarik
@ 2024-11-11 12:25 ` Ryan Roberts
2024-11-12 9:45 ` Petr Tesarik
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-11 12:25 UTC (permalink / raw)
To: Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
Hi Petr,
On 11/11/2024 12:14, Petr Tesarik wrote:
> Hi Ryan,
>
> On Thu, 17 Oct 2024 13:32:43 +0100
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> [...]
>> I understand that Suse might be able to help with wider performance testing
>
> Sorry for the delay (vacation, other tasks). Anyway, let me share some
> results with you.
Not at all; thanks for coming back with these results!
>
> First, I have looked only at 4k pages (constant v. selected at boot
> time) so far.
>
> Second, the impact of the patch series is much smaller than I expected.
> Most macro-benchmarks (dbench, io-bench) did not see any significant
> slowdown. There appears to be a performance hit of approx. 1-2%, but
> that's within noise, and I can't dedicate my time to running extensive
> tests to find the distribution peak and compare. In short, I suspect a
> slight performance hit, but I cannot quantify it.
>
> Third, a few micro-benchmarks saw a significant regression.
>
> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> slower with variable page size. I don't know why, but I'm looking into
> it. The system() library call was also about 18% slower, but that might
> be related.
OK, ouch. I think there are some things we can try to optimize the
implementation further. But I'll wait for your analysis before digging myself.
You probably also saw the conversation with Catalin about the cost vs benefit of
this series. Performance regressions will all need to be considered in the cost
column, of course. So understanding the root cause and trying to reduce the
regression as much as possible will increase chances of getting it accepted
upstream.
Thanks,
Ryan
>
> The dup() syscall was up to 5% slower (depends on underlying filesystem
> type).
>
> VMA unmap was slower for some sizes, but the pattern seemed random,
> sometimes giving even better performance with variable page size, so
> this micro-benchmark may be too unstable to draw any conclusions.
>
> Stay tuned
> Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-11 12:25 ` Ryan Roberts
@ 2024-11-12 9:45 ` Petr Tesarik
2024-11-12 10:19 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-11-12 9:45 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Mon, 11 Nov 2024 12:25:35 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> Hi Petr,
>
> On 11/11/2024 12:14, Petr Tesarik wrote:
> > Hi Ryan,
> >
> > On Thu, 17 Oct 2024 13:32:43 +0100
> > Ryan Roberts <ryan.roberts@arm.com> wrote:
>[...]
> > Third, a few micro-benchmarks saw a significant regression.
> >
> > Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> > slower with variable page size. I don't know why, but I'm looking into
> > it. The system() library call was also about 18% slower, but that might
> > be related.
>
> OK, ouch. I think there are some things we can try to optimize the
> implementation further. But I'll wait for your analysis before digging myself.
This turned out to be a false positive. The way this microbenchmark was
invoked did not get enough samples, so it was mostly dependent on
whether caches were hot or cold, and the timing on this specific system
with the specific sequence of benchmarks in the suite happens to favour
my baseline kernel.
After increasing the batch count, I'm getting pretty much the same
performance for 6.11 vanilla and patched kernels:
                   prc thr  usecs/call  samples  errors  cnt/samp
getenv (baseline)    1   1     0.14975       99       0    100000
getenv (patched)     1   1     0.14981       92       0    100000
> You probably also saw the conversation with Catalin about the cost vs benefit of
> this series. Performance regressions will all need to be considered in the cost
> column, of course. So understanding the root cause and trying to reduce the
> regression as much as possible will increase chances of getting it accepted
> upstream.
Yes. Now that the biggest number is off the table, I'm going to:
- look into the dup() slowdown
- verify whether VMA split/merge operations are indeed slower
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-12 9:45 ` Petr Tesarik
@ 2024-11-12 10:19 ` Ryan Roberts
2024-11-12 10:50 ` Petr Tesarik
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-12 10:19 UTC (permalink / raw)
To: Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 12/11/2024 09:45, Petr Tesarik wrote:
> On Mon, 11 Nov 2024 12:25:35 +0000
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
>> Hi Petr,
>>
>> On 11/11/2024 12:14, Petr Tesarik wrote:
>>> Hi Ryan,
>>>
>>> On Thu, 17 Oct 2024 13:32:43 +0100
>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>> [...]
>>> Third, a few micro-benchmarks saw a significant regression.
>>>
>>> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
>>> slower with variable page size. I don't know why, but I'm looking into
>>> it. The system() library call was also about 18% slower, but that might
>>> be related.
>>
>> OK, ouch. I think there are some things we can try to optimize the
>> implementation further. But I'll wait for your analysis before digging myself.
>
> This turned out to be a false positive. The way this microbenchmark was
> invoked did not get enough samples, so it was mostly dependent on
> whether caches were hot or cold, and the timing on this specific system
> with the specific sequence of benchmarks in the suite happens to favour
> my baseline kernel.
>
> After increasing the batch count, I'm getting pretty much the same
> performance for 6.11 vanilla and patched kernels:
>
> prc thr usecs/call samples errors cnt/samp
> getenv (baseline) 1 1 0.14975 99 0 100000
> getenv (patched) 1 1 0.14981 92 0 100000
Oh that's good news! Does this account for all 3 of the above tests (getenv,
getenvT2 and system())?
>
>> You probably also saw the conversation with Catalin about the cost vs benefit of
>> this series. Performance regressions will all need to be considered in the cost
>> column, of course. So understanding the root cause and trying to reduce the
>> regression as much as possible will increase chances of getting it accepted
>> upstream.
>
> Yes. Now that the biggest number is off the table, I'm going to:
>
> - look into the dup() slowdown
> - verify whether VMA split/merge operations are indeed slower
>
> Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-12 10:19 ` Ryan Roberts
@ 2024-11-12 10:50 ` Petr Tesarik
2024-11-13 12:40 ` Petr Tesarik
0 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-11-12 10:50 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Tue, 12 Nov 2024 10:19:34 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 12/11/2024 09:45, Petr Tesarik wrote:
> > On Mon, 11 Nov 2024 12:25:35 +0000
> > Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> >> Hi Petr,
> >>
> >> On 11/11/2024 12:14, Petr Tesarik wrote:
> >>> Hi Ryan,
> >>>
> >>> On Thu, 17 Oct 2024 13:32:43 +0100
> >>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> [...]
> >>> Third, a few micro-benchmarks saw a significant regression.
> >>>
> >>> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> >>> slower with variable page size. I don't know why, but I'm looking into
> >>> it. The system() library call was also about 18% slower, but that might
> >>> be related.
> >>
> >> OK, ouch. I think there are some things we can try to optimize the
> >> implementation further. But I'll wait for your analysis before digging myself.
> >
> > This turned out to be a false positive. The way this microbenchmark was
> > invoked did not get enough samples, so it was mostly dependent on
> > whether caches were hot or cold, and the timing on this specific system
> > with the specific sequence of benchmarks in the suite happens to favour
> > my baseline kernel.
> >
> > After increasing the batch count, I'm getting pretty much the same
> > performance for 6.11 vanilla and patched kernels:
> >
> > prc thr usecs/call samples errors cnt/samp
> > getenv (baseline) 1 1 0.14975 99 0 100000
> > getenv (patched) 1 1 0.14981 92 0 100000
>
> Oh that's good news! Does this account for all 3 of the above tests (getenv,
> getenvT2 and system())?
It does for getenvT2 (a variant of the test with 2 threads), but not
for system. Thanks for asking, I forgot about that one.
I'm getting substantial difference there (+29% on average over 100 runs):
                   prc thr  usecs/call  samples  errors  cnt/samp  command
system (baseline)    1   1  6937.18016      102       0       100  A=$$
system (patched)     1   1  8959.48032      102       0       100  A=$$
So, yeah, this should in fact be my priority #1.
The "system" benchmark measures the duration of system("A=$$"), which
involves starting the system shell (in my case bash-4.4.23), so this is
not really a microbenchmark. I hope perf can help match the difference
to a kernel API.
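For readers who want to reproduce the measurement outside libMicro, a
minimal sketch of timing one system("A=$$") call might look like the
following. This is not the libMicro source; the function name
time_system_usecs() is made up for illustration, and it only measures
wall-clock time with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper (not part of libMicro): wall-clock duration of one
 * system(cmd) call in microseconds, or -1.0 if a call fails.
 */
double time_system_usecs(const char *cmd)
{
	struct timespec start, end;

	assert(cmd != NULL);

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;
	/* system() forks a shell, so this includes fork/exec/exit cost. */
	if (system(cmd) == -1)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	return (double)(end.tv_sec - start.tv_sec) * 1e6 +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e3;
}
```

Calling time_system_usecs("A=$$") in a loop and recording every sample,
rather than only the average, makes it possible to inspect the
distribution later.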
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-12 10:50 ` Petr Tesarik
@ 2024-11-13 12:40 ` Petr Tesarik
2024-11-13 12:56 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-11-13 12:40 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Tue, 12 Nov 2024 11:50:39 +0100
Petr Tesarik <ptesarik@suse.com> wrote:
> On Tue, 12 Nov 2024 10:19:34 +0000
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> > On 12/11/2024 09:45, Petr Tesarik wrote:
> > > On Mon, 11 Nov 2024 12:25:35 +0000
> > > Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > >> Hi Petr,
> > >>
> > >> On 11/11/2024 12:14, Petr Tesarik wrote:
> > >>> Hi Ryan,
> > >>>
> > >>> On Thu, 17 Oct 2024 13:32:43 +0100
> > >>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >> [...]
> > >>> Third, a few micro-benchmarks saw a significant regression.
> > >>>
> > >>> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> > >>> slower with variable page size. I don't know why, but I'm looking into
> > >>> it. The system() library call was also about 18% slower, but that might
> > >>> be related.
> > >>
> > >> OK, ouch. I think there are some things we can try to optimize the
> > >> implementation further. But I'll wait for your analysis before digging myself.
> > >
> > > This turned out to be a false positive. The way this microbenchmark was
> > > invoked did not get enough samples, so it was mostly dependent on
> > > whether caches were hot or cold, and the timing on this specific system
> > > with the specific sequence of benchmarks in the suite happens to favour
> > > my baseline kernel.
> > >
> > > After increasing the batch count, I'm getting pretty much the same
> > > performance for 6.11 vanilla and patched kernels:
> > >
> > > prc thr usecs/call samples errors cnt/samp
> > > getenv (baseline) 1 1 0.14975 99 0 100000
> > > getenv (patched) 1 1 0.14981 92 0 100000
> >
> > Oh that's good news! Does this account for all 3 of the above tests (getenv,
> > getenvT2 and system())?
>
> It does for getenvT2 (a variant of the test with 2 threads), but not
> for system. Thanks for asking, I forgot about that one.
>
> I'm getting substantial difference there (+29% on average over 100 runs):
>
> prc thr usecs/call samples errors cnt/samp command
> system (baseline) 1 1 6937.18016 102 0 100 A=$$
> system (patched) 1 1 8959.48032 102 0 100 A=$$
>
> So, yeah, this should in fact be my priority #1.
Further testing reveals the workload is bimodal, that is to say the
distribution of results has two peaks. The first peak around 3.2 ms
covers 30% of runs, the second peak around 15.7 ms covers 11%. Two per
cent are faster than the fast peak, 5% are slower than the slow peak,
the rest is distributed almost evenly between them.
100 samples were not sufficient to see this distribution, and it was
mere bad luck that only the patched kernel originally reported bad
results. I can now see bad results even with the unpatched kernel.
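As a rough illustration of how two peaks can be spotted in raw samples,
here is a small histogram sketch. It is not the tooling used for the
numbers above; NBUCKETS, fill_histogram() and count_peaks() are
hypothetical names, and the peak test (a bucket strictly larger than both
neighbours and above a minimum count) is a deliberately crude assumption.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 20

/*
 * Count samples (in milliseconds) into NBUCKETS buckets of width_ms each.
 * Samples beyond the last bucket are clamped into it.
 */
void fill_histogram(const double *ms, size_t n, double width_ms,
		    unsigned int hist[NBUCKETS])
{
	assert(width_ms > 0.0);

	for (size_t b = 0; b < NBUCKETS; b++)
		hist[b] = 0;

	for (size_t i = 0; i < n; i++) {
		size_t b = ms[i] <= 0.0 ? 0 : (size_t)(ms[i] / width_ms);

		if (b >= NBUCKETS)
			b = NBUCKETS - 1;
		hist[b]++;
	}
}

/*
 * Count local maxima: buckets holding at least min_count samples that are
 * strictly larger than both neighbours (missing neighbours count as 0).
 */
unsigned int count_peaks(const unsigned int hist[NBUCKETS],
			 unsigned int min_count)
{
	unsigned int peaks = 0;

	for (size_t b = 0; b < NBUCKETS; b++) {
		unsigned int left = b > 0 ? hist[b - 1] : 0;
		unsigned int right = b + 1 < NBUCKETS ? hist[b + 1] : 0;

		if (hist[b] >= min_count && hist[b] > left && hist[b] > right)
			peaks++;
	}
	return peaks;
}
```

With 1 ms buckets, a result of two peaks from count_peaks() would be
consistent with the bimodal picture described above, whereas a plain
average over the same samples hides it.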
In short, I don't think there is a difference in system() performance.
I will still have a look at dup() and VMA performance, but so far it
all looks good to me. Good job! ;-)
I will also try running a more complete set of benchmarks during next
week. That's SUSE Hack Week, and I want to make a PoC for the MM
changes I proposed at LPC24, so I won't need this Ampere system for
interactive use.
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-13 12:40 ` Petr Tesarik
@ 2024-11-13 12:56 ` Ryan Roberts
2024-11-13 14:22 ` Petr Tesarik
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-13 12:56 UTC (permalink / raw)
To: Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On 13/11/2024 12:40, Petr Tesarik wrote:
> On Tue, 12 Nov 2024 11:50:39 +0100
> Petr Tesarik <ptesarik@suse.com> wrote:
>
>> On Tue, 12 Nov 2024 10:19:34 +0000
>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>> On 12/11/2024 09:45, Petr Tesarik wrote:
>>>> On Mon, 11 Nov 2024 12:25:35 +0000
>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>>> Hi Petr,
>>>>>
>>>>> On 11/11/2024 12:14, Petr Tesarik wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> On Thu, 17 Oct 2024 13:32:43 +0100
>>>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>> [...]
>>>>>> Third, a few micro-benchmarks saw a significant regression.
>>>>>>
>>>>>> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
>>>>>> slower with variable page size. I don't know why, but I'm looking into
>>>>>> it. The system() library call was also about 18% slower, but that might
>>>>>> be related.
>>>>>
>>>>> OK, ouch. I think there are some things we can try to optimize the
>>>>> implementation further. But I'll wait for your analysis before digging myself.
>>>>
>>>> This turned out to be a false positive. The way this microbenchmark was
>>>> invoked did not get enough samples, so it was mostly dependent on
>>>> whether caches were hot or cold, and the timing on this specific system
>>>> with the specific sequence of benchmarks in the suite happens to favour
>>>> my baseline kernel.
>>>>
>>>> After increasing the batch count, I'm getting pretty much the same
>>>> performance for 6.11 vanilla and patched kernels:
>>>>
>>>> prc thr usecs/call samples errors cnt/samp
>>>> getenv (baseline) 1 1 0.14975 99 0 100000
>>>> getenv (patched) 1 1 0.14981 92 0 100000
>>>
>>> Oh that's good news! Does this account for all 3 of the above tests (getenv,
>>> getenvT2 and system())?
>>
>> It does for getenvT2 (a variant of the test with 2 threads), but not
>> for system. Thanks for asking, I forgot about that one.
>>
>> I'm getting substantial difference there (+29% on average over 100 runs):
>>
>> prc thr usecs/call samples errors cnt/samp command
>> system (baseline) 1 1 6937.18016 102 0 100 A=$$
>> system (patched) 1 1 8959.48032 102 0 100 A=$$
>>
>> So, yeah, this should in fact be my priority #1.
>
> Further testing reveals the workload is bimodal, that is to say the
> distribution of results has two peaks. The first peak around 3.2 ms
> covers 30% runs, the second peak around 15.7 ms covers 11%. Two per
> cent are faster than the fast peak, 5% are slower than slow peak, the
> rest is distributed almost evenly between them.
FWIW, one source of bimodality I've seen on Ampere systems with 2 NUMA nodes is
placement of the kernel image vs placement of the running thread. If they are
remote from each other, you'll see a slowdown. I've hacked this source away in
the past by effectively using only a single NUMA node (with the help of
'maxcpus' and 'mem' kernel cmdline options).
>
> 100 samples were not sufficient to see this distribution, and it was
> mere bad luck that only the patched kernel originally reported bad
> results. I can now see bad results even with the unpatched kernel.
>
> In short, I don't think there is a difference in system() performance.
>
> I will still have a look at dup() and VMA performance, but so far it
> all looks good to me. Good job! ;-)
Thanks for digging into all this!
>
> I will also try running a more complete set of benchmarks during next
> week. That's SUSE Hack Week, and I want to make a PoC for the MM
> changes I proposed at LPC24, so I won't need this Ampere system for
> interactive use.
>
> Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-11-13 12:56 ` Ryan Roberts
@ 2024-11-13 14:22 ` Petr Tesarik
0 siblings, 0 replies; 196+ messages in thread
From: Petr Tesarik @ 2024-11-13 14:22 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
On Wed, 13 Nov 2024 12:56:24 +0000
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 13/11/2024 12:40, Petr Tesarik wrote:
> > On Tue, 12 Nov 2024 11:50:39 +0100
> > Petr Tesarik <ptesarik@suse.com> wrote:
> >
> >> On Tue, 12 Nov 2024 10:19:34 +0000
> >> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >>> On 12/11/2024 09:45, Petr Tesarik wrote:
> >>>> On Mon, 11 Nov 2024 12:25:35 +0000
> >>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>>> Hi Petr,
> >>>>>
> >>>>> On 11/11/2024 12:14, Petr Tesarik wrote:
> >>>>>> Hi Ryan,
> >>>>>>
> >>>>>> On Thu, 17 Oct 2024 13:32:43 +0100
> >>>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>> [...]
> >>>>>> Third, a few micro-benchmarks saw a significant regression.
> >>>>>>
> >>>>>> Most notably, getenv and getenvT2 tests from libMicro were 18% and 20%
> >>>>>> slower with variable page size. I don't know why, but I'm looking into
> >>>>>> it. The system() library call was also about 18% slower, but that might
> >>>>>> be related.
> >>>>>
> >>>>> OK, ouch. I think there are some things we can try to optimize the
> >>>>> implementation further. But I'll wait for your analysis before digging myself.
> >>>>
> >>>> This turned out to be a false positive. The way this microbenchmark was
> >>>> invoked did not get enough samples, so it was mostly dependent on
> >>>> whether caches were hot or cold, and the timing on this specific system
> >>>> with the specific sequence of benchmarks in the suite happens to favour
> >>>> my baseline kernel.
> >>>>
> >>>> After increasing the batch count, I'm getting pretty much the same
> >>>> performance for 6.11 vanilla and patched kernels:
> >>>>
> >>>> prc thr usecs/call samples errors cnt/samp
> >>>> getenv (baseline) 1 1 0.14975 99 0 100000
> >>>> getenv (patched) 1 1 0.14981 92 0 100000
> >>>
> >>> Oh that's good news! Does this account for all 3 of the above tests (getenv,
> >>> getenvT2 and system())?
> >>
> >> It does for getenvT2 (a variant of the test with 2 threads), but not
> >> for system. Thanks for asking, I forgot about that one.
> >>
> >> I'm getting substantial difference there (+29% on average over 100 runs):
> >>
> >> prc thr usecs/call samples errors cnt/samp command
> >> system (baseline) 1 1 6937.18016 102 0 100 A=$$
> >> system (patched) 1 1 8959.48032 102 0 100 A=$$
> >>
> >> So, yeah, this should in fact be my priority #1.
> >
> > Further testing reveals the workload is bimodal, that is to say the
> > distribution of results has two peaks. The first peak around 3.2 ms
> > covers 30% runs, the second peak around 15.7 ms covers 11%. Two per
> > cent are faster than the fast peak, 5% are slower than slow peak, the
> > rest is distributed almost evenly between them.
>
> FWIW, one source of bimodality I've seen on Ampere systems with 2 NUMA nodes is
> placement of the kernel image vs placement of the running thread. If they are
> remote from each other, you'll see a slowdown. I've hacked this source away in
> the past by effectively using only a single NUMA node (with the help of
> 'maxcpus' and 'mem' kernel cmdline options).
This system has only one NUMA node. But your comment leads in the right
direction. CPU placement does play a role here.
I can consistently get the fast results if I pin the benchmark process
to a single CPU core, or more generally to a CPU set which shares the
L2 cache (as found on eMAG). But the scheduler only considers LLC,
which (with CONFIG_SCHED_CLUSTER=y) follows the complex affinity of the
SLC.
Long story short, without explicit affinity, the scheduler may place a
forked child onto a CPU with a cold L2 cache, which harms short-lived
processes (like the ones created by this benchmark).
Now it all makes sense and it is totally unrelated to dynamic page size
selection. :-)
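For anyone repeating the experiment, pinning can also be done from
inside the benchmark process instead of from the command line. The
sketch below uses the Linux sched_setaffinity() and sched_getaffinity()
calls; pin_to_cpu() and first_allowed_cpu() are hypothetical helper
names, and pinning to a single CPU is only the simplest case of the
"CPU set which shares the L2 cache" mentioned above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for CPU_SET() and sched_setaffinity() */
#endif
#include <assert.h>
#include <sched.h>

/* Pin the calling process to one CPU. Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means "the calling thread". */
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Lowest CPU in the current affinity mask, or -1 if it cannot be read. */
int first_allowed_cpu(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;

	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

Children created by fork() inherit the affinity mask, so calling
pin_to_cpu() before the measured system() calls keeps the short-lived
shell on the same CPU as its parent.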
Petr T
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible
2024-10-14 10:58 ` [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible Ryan Roberts
@ 2024-11-14 8:23 ` Vlastimil Babka
2024-11-14 9:36 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-14 8:23 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 12:58, Ryan Roberts wrote:
> "struct page_frag_cache" has some optimizations that depend on page
> size. Let's refactor it a bit so that those optimizations can be
> determined at run-time for the case where page size is a boot-time
> parameter. For compile-time page size, the compiler should dead code
> strip and the result is very similar to before.
>
> One wrinkle is that we don't know if we need the size member until
> runtime. So remove the ifdeffery and always define offset as u32 (needed
> if PAGE_SIZE is >= 64K) and size as u16 (only used when PAGE_SIZE <=
> 32K). We move the members around a bit so that the overall size of the
> struct remains the same; 24 bytes for 64-bit and 16 bytes on 32 bit.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Looks ok, but ideally the PAGE_FRAG_CACHE_MAX_ORDER #define should also be
replaced by some variable that's populated just once. It can be static local
to page_alloc.c as nothing else seems to use it.
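To make the suggestion concrete, here is a userspace model of a maximum
order that is computed once and then read from a static variable. It is
only a sketch: model_get_order(), model_frag_cache_init() and the other
names are invented for illustration and are not kernel APIs, and the 32K
target is taken from the comment in the quoted patch.

```c
#include <assert.h>

/* Stand-in for a static local in page_alloc.c, populated once. */
static unsigned int model_frag_max_order;

/* Smallest order such that (page size << order) >= size. */
unsigned int model_get_order(unsigned long size, unsigned int page_shift)
{
	unsigned int order = 0;

	assert(size > 0);

	while ((1UL << (page_shift + order)) < size)
		order++;
	return order;
}

/* Would run once, after the boot-time page size is known. */
void model_frag_cache_init(unsigned int page_shift)
{
	model_frag_max_order = model_get_order(32768UL, page_shift);
}

/* Hot paths read the cached value instead of recomputing it. */
unsigned int model_frag_cache_max_order(void)
{
	return model_frag_max_order;
}
```

In this model a 4K page size gives order 3, 16K gives order 1 and 64K
gives order 0.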
>
> page_alloc
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> include/linux/mm_types.h | 13 ++++++-------
> mm/page_alloc.c | 31 ++++++++++++++++++-------------
> 2 files changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4854249792545..0844ed7cfaa53 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -544,16 +544,15 @@ static inline void *folio_get_private(struct folio *folio)
>
> struct page_frag_cache {
> void * va;
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - __u16 offset;
> - __u16 size;
> -#else
> - __u32 offset;
> -#endif
> /* we maintain a pagecount bias, so that we dont dirty cache line
> * containing page->_refcount every time we allocate a fragment.
> */
> - unsigned int pagecnt_bias;
> + unsigned int pagecnt_bias;
> + __u32 offset;
> + /* size only used when PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE, in which
> + * case PAGE_FRAG_CACHE_MAX_SIZE is 32K and 16 bits is sufficient.
> + */
> + __u16 size;
> bool pfmemalloc;
> };
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91ace8ca97e21..8678103b1b396 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4822,13 +4822,18 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> struct page *page = NULL;
> gfp_t gfp = gfp_mask;
>
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> - __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> - page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> - PAGE_FRAG_CACHE_MAX_ORDER);
> - nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> -#endif
> + if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) {
> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> + PAGE_FRAG_CACHE_MAX_ORDER);
> + /*
> + * Cast to silence warning due to 16-bit nc->size. Not real
> + * because PAGE_SIZE only less than PAGE_FRAG_CACHE_MAX_SIZE
> + * when PAGE_FRAG_CACHE_MAX_SIZE is 32K.
> + */
> + nc->size = (__u16)(page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE);
> + }
> if (unlikely(!page))
> page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
>
> @@ -4870,10 +4875,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
> if (!page)
> return NULL;
>
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> /* if size can vary use size else just use PAGE_SIZE */
> - size = nc->size;
> -#endif
> + if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> + size = nc->size;
> +
> /* Even if we own the page, we do not use atomic_set().
> * This would break get_page_unless_zero() users.
> */
> @@ -4897,10 +4902,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
> goto refill;
> }
>
> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> /* if size can vary use size else just use PAGE_SIZE */
> - size = nc->size;
> -#endif
> + if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> + size = nc->size;
> +
> /* OK, page count is 0, we can safely set it */
> set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>
^ permalink raw reply [flat|nested] 196+ messages in thread
* Re: [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible
2024-11-14 8:23 ` Vlastimil Babka
@ 2024-11-14 9:36 ` Ryan Roberts
2024-11-14 9:43 ` Vlastimil Babka
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-14 9:36 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 14/11/2024 08:23, Vlastimil Babka wrote:
> On 10/14/24 12:58, Ryan Roberts wrote:
>> "struct page_frag_cache" has some optimizations that depend on page
>> size. Let's refactor it a bit so that those optimizations can be
>> determined at run-time for the case where page size is a boot-time
>> parameter. For compile-time page size, the compiler should dead code
>> strip and the result is very similar to before.
>>
>> One wrinkle is that we don't know if we need the size member until
>> runtime. So remove the ifdeffery and always define offset as u32 (needed
>> if PAGE_SIZE is >= 64K) and size as u16 (only used when PAGE_SIZE <=
>> 32K). We move the members around a bit so that the overall size of the
>> struct remains the same; 24 bytes for 64-bit and 16 bytes on 32 bit.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> Looks ok, but ideally the PAGE_FRAG_CACHE_MAX_ORDER #define should also be
> replaced by some variable that's populated just once. It can be static local
> to page_alloc.c as nothing else seems to use it.
I can certainly do that, but wouldn't that be penalizing a compile-time page
size configuration? My current change means that PAGE_FRAG_CACHE_MAX_ORDER still
resolves to a compile-time constant in that situation and the compiler can
eliminate conditional branches it knows will never be taken. Or perhaps you're
suggesting I conditionally make it a variable if PAGE_SIZE_MIN != PAGE_SIZE_MAX?
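A userspace sketch of the conditional approach, to show what I mean. All
MODEL_* and model_* names are hypothetical stand-ins (the real series
uses PAGE_SIZE_MIN and PAGE_SIZE_MAX), and the 32K limit is simplified
to a plain constant here.

```c
#include <assert.h>

#ifndef MODEL_PAGE_SHIFT_MIN
#define MODEL_PAGE_SHIFT_MIN 12
#endif
#ifndef MODEL_PAGE_SHIFT_MAX
#define MODEL_PAGE_SHIFT_MAX 16
#endif

#define MODEL_FRAG_MAX_SIZE 32768UL

#if MODEL_PAGE_SHIFT_MIN == MODEL_PAGE_SHIFT_MAX
/* Compile-time page size: a constant the compiler can fold. */
#define model_page_size() (1UL << MODEL_PAGE_SHIFT_MIN)
#else
/* Boot-time page size: read from a variable set early in boot. */
unsigned int model_boot_page_shift = MODEL_PAGE_SHIFT_MIN;
#define model_page_size() (1UL << model_boot_page_shift)
#endif

/*
 * Size used for the frag cache: nc_size when the page is smaller than
 * the limit, otherwise the page size. With a constant page size the
 * untaken branch can be removed at compile time.
 */
unsigned long model_frag_size(unsigned long nc_size)
{
	assert(nc_size > 0);

	if (model_page_size() < MODEL_FRAG_MAX_SIZE)
		return nc_size;
	return model_page_size();
}
```

Building with both MODEL_PAGE_SHIFT_MIN and MODEL_PAGE_SHIFT_MAX defined
to the same value selects the constant variant, so the compile-time
configuration pays nothing for the runtime one.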
Thanks,
Ryan
>
>>
>> page_alloc
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> include/linux/mm_types.h | 13 ++++++-------
>> mm/page_alloc.c | 31 ++++++++++++++++++-------------
>> 2 files changed, 24 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 4854249792545..0844ed7cfaa53 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -544,16 +544,15 @@ static inline void *folio_get_private(struct folio *folio)
>>
>> struct page_frag_cache {
>> void * va;
>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> - __u16 offset;
>> - __u16 size;
>> -#else
>> - __u32 offset;
>> -#endif
>> /* we maintain a pagecount bias, so that we dont dirty cache line
>> * containing page->_refcount every time we allocate a fragment.
>> */
>> - unsigned int pagecnt_bias;
>> + unsigned int pagecnt_bias;
>> + __u32 offset;
>> + /* size only used when PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE, in which
>> + * case PAGE_FRAG_CACHE_MAX_SIZE is 32K and 16 bits is sufficient.
>> + */
>> + __u16 size;
>> bool pfmemalloc;
>> };
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 91ace8ca97e21..8678103b1b396 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4822,13 +4822,18 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>> struct page *page = NULL;
>> gfp_t gfp = gfp_mask;
>>
>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
>> - __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>> - page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>> - PAGE_FRAG_CACHE_MAX_ORDER);
>> - nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
>> -#endif
>> + if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) {
>> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
>> + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>> + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>> + PAGE_FRAG_CACHE_MAX_ORDER);
>> + /*
>> + * Cast to silence warning due to 16-bit nc->size. Not real
>> + * because PAGE_SIZE only less than PAGE_FRAG_CACHE_MAX_SIZE
>> + * when PAGE_FRAG_CACHE_MAX_SIZE is 32K.
>> + */
>> + nc->size = (__u16)(page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE);
>> + }
>> if (unlikely(!page))
>> page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
>>
>> @@ -4870,10 +4875,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
>> if (!page)
>> return NULL;
>>
>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> /* if size can vary use size else just use PAGE_SIZE */
>> - size = nc->size;
>> -#endif
>> + if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> + size = nc->size;
>> +
>> /* Even if we own the page, we do not use atomic_set().
>> * This would break get_page_unless_zero() users.
>> */
>> @@ -4897,10 +4902,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
>> goto refill;
>> }
>>
>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> /* if size can vary use size else just use PAGE_SIZE */
>> - size = nc->size;
>> -#endif
>> + if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> + size = nc->size;
>> +
>> /* OK, page count is 0, we can safely set it */
>> set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>
>
* Re: [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible
2024-11-14 9:36 ` Ryan Roberts
@ 2024-11-14 9:43 ` Vlastimil Babka
0 siblings, 0 replies; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-14 9:43 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 11/14/24 10:36, Ryan Roberts wrote:
> On 14/11/2024 08:23, Vlastimil Babka wrote:
>> On 10/14/24 12:58, Ryan Roberts wrote:
>>> "struct page_frag_cache" has some optimizations that depend on page
>>> size. Let's refactor it a bit so that those optimizations can be
>>> determined at run-time for the case where page size is a boot-time
>>> parameter. For compile-time page size, the compiler should dead code
>>> strip and the result is very similar to before.
>>>
>>> One wrinkle is that we don't know if we need the size member until
>>> runtime. So remove the ifdeffery and always define offset as u32 (needed
>>> if PAGE_SIZE is >= 64K) and size as u16 (only used when PAGE_SIZE <=
>>> 32K). We move the members around a bit so that the overall size of the
>>> struct remains the same; 24 bytes for 64-bit and 16 bytes on 32 bit.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>> Looks ok, but ideally the PAGE_FRAG_CACHE_MAX_ORDER #define should also be
>> replaced by some variable that's populated just once. It can be static local
>> to page_alloc.c as nothing else seems to use it.
>
> I can certainly do that, but wouldn't that be penalizing a compile-time page
> size configuration? My current change means that PAGE_FRAG_CACHE_MAX_ORDER still
> resolves to a compile-time constant in that situation and the compiler can
> eliminate conditional branches it knows will never be taken. Or perhaps you're
Ah, I see.
> suggesting I conditionally make it a variable if PAGE_SIZE_MIN != PAGE_SIZE_MAX?
Given the only place it's being used, it shouldn't be worth it after all.
You can add for this patch:
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Thanks,
> Ryan
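The layout claim in the quoted commit message (24 bytes on 64-bit, 16 bytes
on 32-bit) can be checked in user space. This is only a minimal sketch: the
struct name frag_cache_sketch is hypothetical, and the stdint types are
assumed stand-ins for the kernel's __u32/__u16. Only the member order is
taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the reordered struct page_frag_cache. */
struct frag_cache_sketch {
	void *va;
	unsigned int pagecnt_bias;
	uint32_t offset;	/* always 32 bits; needed once PAGE_SIZE >= 64K */
	uint16_t size;		/* only meaningful when PAGE_SIZE < 32K */
	bool pfmemalloc;
};

/*
 * LP64:  8 + 4 + 4 + 2 + 1 = 19 bytes, padded to 24 by pointer alignment.
 * ILP32: 4 + 4 + 4 + 2 + 1 = 15 bytes, padded to 16.
 */
_Static_assert(sizeof(struct frag_cache_sketch) ==
	       (sizeof(void *) == 8 ? 24 : 16),
	       "unexpected frag cache layout");
```

Placing the 16-bit size member after the two 32-bit members lets it share
the tail padding with pfmemalloc, which is why always defining it does not
grow the struct.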
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-01 20:16 ` [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant Dave Kleikamp
2024-11-06 11:44 ` Ryan Roberts
@ 2024-11-14 10:09 ` Vlastimil Babka
2024-11-26 12:18 ` Ryan Roberts
1 sibling, 1 reply; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-14 10:09 UTC (permalink / raw)
To: Dave Kleikamp, Ryan Roberts, Andrew Morton
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 11/1/24 21:16, Dave Kleikamp wrote:
> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
> is no longer optimized out with a constant size, so a build bug may
> occur on a path that won't be reached.
That's rather unfortunate, the __builtin_constant_p(size) part of
kmalloc_noprof() really expects things to resolve at compile time and it
would be better to keep it that way.
I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
PAGE_SHIFT_MAX and kept it constant, instead of introducing
KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
So if the kernel was built to support 4k to 64k, but booted as 4k, it would
still create and use kmalloc caches up to 128k. SLUB should handle that fine
(if not, please report it :)
Maybe we could also stop adding + 1 to PAGE_SHIFT_MAX if it's >=64k, so the
cache size is max 64k and not 128k but that should be probably evaluated
separately from this series.
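To put numbers on the above, here is a rough sketch of the arithmetic. The
_SKETCH names are hypothetical and the shift values are assumptions for a
kernel built to support 4K to 64K pages; only the "+ 1" relationship is
taken from the discussion.

```c
/* Assumed page shifts for a 4K..64K boot-time page size build. */
#define PAGE_SHIFT_4K_SKETCH	12
#define PAGE_SHIFT_MAX_SKETCH	16

/* Largest kmalloc cache is one order above the page size ("+ 1"). */
#define KMALLOC_SHIFT_HIGH_SKETCH(page_shift)	((page_shift) + 1)
#define KMALLOC_MAX_CACHE_SIZE_SKETCH(page_shift) \
	(1UL << KMALLOC_SHIFT_HIGH_SKETCH(page_shift))

/* Derived from the actual 4K page size: caches stop at 8K. */
#define MAX_CACHE_4K_SKETCH \
	KMALLOC_MAX_CACHE_SIZE_SKETCH(PAGE_SHIFT_4K_SKETCH)

/* Derived from PAGE_SHIFT_MAX: a build-time constant of 128K. */
#define MAX_CACHE_CONST_SKETCH \
	KMALLOC_MAX_CACHE_SIZE_SKETCH(PAGE_SHIFT_MAX_SKETCH)
```

Because the second value depends only on PAGE_SHIFT_MAX, a comparison of a
constant allocation size against it can still be folded at compile time,
which is the property __builtin_constant_p(size) callers rely on.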
Vlastimil
> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>
> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
> ---
>
> Ryan,
>
> Please consider incorporating this fix or something similar into your
> mm patch in the boot-time page size patches.
>
> include/linux/slab.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 9848296ca6ba..a4c7507ab8ec 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
> if (size <= 1024 * 1024) return 20;
> if (size <= 2 * 1024 * 1024) return 21;
>
> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
> else
> BUG();
* Re: [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
2024-10-16 14:37 ` Ryan Roberts
2024-11-01 20:16 ` [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant Dave Kleikamp
@ 2024-11-14 10:17 ` Vlastimil Babka
2024-11-26 10:08 ` Ryan Roberts
2 siblings, 1 reply; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-14 10:17 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, Christoph Lameter, David Hildenbrand,
David Rientjes, Greg Marsden, Ivan Ivanov, Johannes Weiner,
Joonsoo Kim, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Michal Hocko, Miquel Raynal, Miroslav Benes,
Pekka Enberg, Richard Weinberger, Shakeel Butt,
Vignesh Raghavendra, Will Deacon
Cc: cgroups, linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
linux-mtd
On 10/14/24 12:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> Refactor "struct vmap_block" to use a flexible array for used_map since
> VMAP_BBMAP_BITS is not a compile time constant for the boot-time page
> size case.
>
> Update various BUILD_BUG_ON() instances to check against appropriate
> page size limit.
>
> Re-define "union swap_header" so that it's no longer exactly page-sized.
> Instead define a flexible "magic" array with a define which tells the
> offset to where the magic signature begins.
>
> Consider page size limit in some CPP conditionals.
>
> Wrap global variables that are initialized with PAGE_SIZE derived values
> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
> deferred for boot-time page size builds.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> drivers/mtd/mtdswap.c | 4 ++--
> include/linux/mm.h | 2 +-
> include/linux/mm_types_task.h | 2 +-
> include/linux/mmzone.h | 3 ++-
> include/linux/slab.h | 7 ++++---
> include/linux/swap.h | 17 ++++++++++++-----
> include/linux/swapops.h | 6 +++++-
> mm/memcontrol.c | 2 +-
> mm/memory.c | 4 ++--
> mm/mmap.c | 2 +-
> mm/page-writeback.c | 2 +-
> mm/slub.c | 2 +-
> mm/sparse.c | 2 +-
> mm/swapfile.c | 2 +-
> mm/vmalloc.c | 7 ++++---
> 15 files changed, 39 insertions(+), 25 deletions(-)
>
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -132,10 +132,17 @@ static inline int current_is_kswapd(void)
> * bootbits...
> */
> union swap_header {
> - struct {
> - char reserved[PAGE_SIZE - 10];
> - char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
> - } magic;
> + /*
> + * Exists conceptually, but since PAGE_SIZE may not be known at compile
> + * time, we must access through pointer arithmetic at run time.
> + *
> + * struct {
> + * char reserved[PAGE_SIZE - 10];
> + * char magic[10]; SWAP-SPACE or SWAPSPACE2
> + * } magic;
> + */
> +#define SWAP_HEADER_MAGIC (PAGE_SIZE - 10)
> + char magic[1];
I wonder if it makes sense to even keep this magic field anymore.
> struct {
> char bootbits[1024]; /* Space for disklabel etc. */
> __u32 version;
> @@ -201,7 +208,7 @@ struct swap_extent {
> * Max bad pages in the new format..
> */
> #define MAX_SWAP_BADPAGES \
> - ((offsetof(union swap_header, magic.magic) - \
> + ((SWAP_HEADER_MAGIC - \
> offsetof(union swap_header, info.badpages)) / sizeof(int))
>
> enum {
<snip>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2931,7 +2931,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
> unsigned long swapfilepages;
> unsigned long last_page;
>
> - if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
> + if (memcmp("SWAPSPACE2", &swap_header->magic[SWAP_HEADER_MAGIC], 10)) {
I'd expect static checkers to scream here because we overflow the magic[1]
both due to copying 10 bytes into 1 byte array and also with the insane
offset. Hence my suggestion to drop the field and use purely pointer arithmetic.
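A minimal sketch of what the pure pointer arithmetic variant could look
like. The helper names swap_header_magic() and swap_header_is_v2() are
hypothetical, and page_size is passed in explicitly so the snippet builds
in user space; in the kernel it would be PAGE_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SWAP_MAGIC_LEN_SKETCH	10

/*
 * Hypothetical helper: the signature occupies the last 10 bytes of the
 * first page, so no array member is indexed out of bounds.
 */
static inline const char *swap_header_magic(const void *header,
					    size_t page_size)
{
	return (const char *)header + page_size - SWAP_MAGIC_LEN_SKETCH;
}

/* Returns true when the "SWAPSPACE2" signature is present. */
static inline bool swap_header_is_v2(const void *header, size_t page_size)
{
	return memcmp("SWAPSPACE2", swap_header_magic(header, page_size),
		      SWAP_MAGIC_LEN_SKETCH) == 0;
}
```

With a helper like this there is no magic[] member left for a static
checker to complain about, and MAX_SWAP_BADPAGES can keep using the
(PAGE_SIZE - 10) offset directly.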
> pr_err("Unable to find swap-space signature\n");
> return 0;
> }
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a0df1e2e155a8..b4fbba204603c 100644
Hm I'm actually looking at your wip branch which also has:
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -969,7 +969,7 @@ static inline int get_order_from_str(const char *size_str)
return -EINVAL;
}
-static char str_dup[PAGE_SIZE] __initdata;
+static char str_dup[PAGE_SIZE_MAX] __initdata;
static int __init setup_thp_anon(char *str)
{
char *token, *range, *policy, *subtoken;
Why PAGE_SIZE_MAX? Isn't this the same case as "mm/memcontrol: Fix seq_buf
size to save memory when PAGE_SIZE is large"
* Re: [RFC PATCH v1 11/57] fork: Permit boot-time THREAD_SIZE determination
2024-10-14 10:58 ` [RFC PATCH v1 11/57] fork: Permit boot-time THREAD_SIZE determination Ryan Roberts
@ 2024-11-14 10:42 ` Vlastimil Babka
0 siblings, 0 replies; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-14 10:42 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Andrey Ryabinin, Anshuman Khandual,
Ard Biesheuvel, Arnd Bergmann, Catalin Marinas, David Hildenbrand,
Greg Marsden, Ingo Molnar, Ivan Ivanov, Juri Lelli, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Peter Zijlstra, Vincent Guittot, Will Deacon
Cc: kasan-dev, linux-arch, linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 12:58, Ryan Roberts wrote:
> THREAD_SIZE defines the size of a kernel thread stack. To date, it has
> been set at compile-time. However, when using vmap stacks, the size must
> be a multiple of PAGE_SIZE, and given we are in the process of
> supporting boot-time page size, we must also do the same for
> THREAD_SIZE.
>
> The alternative would be to define THREAD_SIZE for the largest supported
> page size, but this would waste memory when using a smaller page size.
> For example, arm64 requires THREAD_SIZE to be 16K, but when using 64K
> pages and a vmap stack, we must increase the size to 64K. If we required
> 64K when 4K or 16K page size was in use, we would waste 48K per kernel
> thread.
>
> So let's refactor to allow THREAD_SIZE to not be a compile-time
> constant. THREAD_SIZE_MAX (and THREAD_ALIGN_MAX) are introduced to
> manage the limits, as is done for PAGE_SIZE.
>
> When THREAD_SIZE is a compile-time constant, behaviour and code size
> should be equivalent.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
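The 48K figure in the commit message can be reproduced with a small
sketch. The _SKETCH macros are hypothetical and simplified: they assume a
16K minimum stack and a vmap stack that must be at least one page, and
they are not the actual arm64 definitions.

```c
/* Assumed minimum kernel stack size on arm64, per the commit message. */
#define MIN_THREAD_SIZE_SKETCH	(16UL * 1024)

/* With vmap stacks the stack must be at least one page in size. */
#define THREAD_SIZE_SKETCH(page_size) \
	((page_size) > MIN_THREAD_SIZE_SKETCH ? \
	 (page_size) : MIN_THREAD_SIZE_SKETCH)

/* Worst case across the supported page sizes (64K pages). */
#define THREAD_SIZE_MAX_SKETCH	THREAD_SIZE_SKETCH(64UL * 1024)

/* Memory wasted per thread if THREAD_SIZE_MAX were always used. */
#define STACK_WASTE_SKETCH(page_size) \
	(THREAD_SIZE_MAX_SKETCH - THREAD_SIZE_SKETCH(page_size))
```

For 4K and 16K pages the waste works out to 64K - 16K = 48K per thread,
while for 64K pages it is zero, which is the motivation for resolving
THREAD_SIZE at boot.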
* Re: [RFC PATCH v1 15/57] stackdepot: Remove PAGE_SIZE compile-time constant assumption
2024-10-14 10:58 ` [RFC PATCH v1 15/57] stackdepot: " Ryan Roberts
@ 2024-11-14 11:15 ` Vlastimil Babka
2024-11-26 10:15 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-14 11:15 UTC (permalink / raw)
To: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 10/14/24 12:58, Ryan Roberts wrote:
> To prepare for supporting boot-time page size selection, refactor code
> to remove assumptions about PAGE_SIZE being compile-time constant. Code
> intended to be equivalent when compile-time page size is active.
>
> "union handle_parts" previously calculated the number of bits required
> for its pool index and offset members based on PAGE_SHIFT. This is
> problematic for boot-time page size builds because the actual page size
> isn't known until boot-time.
>
> We could use PAGE_SHIFT_MAX in calculating the worst case offset bits,
> but bits would be wasted that could be used for pool index when
> PAGE_SIZE is set smaller than MAX, the end result being that stack depot
> can address less memory than it should.
>
> To avoid needing to dynamically define the offset and index bit widths,
> let's instead fix the pool size and derive the order at runtime based on
> the PAGE_SIZE. This means that the fields' widths can remain static,
> with the down side being slightly increased risk of failing to allocate
> the large folio.
>
> This only affects boot-time page size builds. compile-time page size
> builds will still always allocate order-2 folios.
>
> Additionally, wrap global variables that are initialized with PAGE_SIZE
> derived values using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their
> initialization can be deferred for boot-time page size builds.
This is done for pool_offset but given it's initialized by DEPOT_POOL_SIZE,
it doesn't look derived from PAGE_SIZE?
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Other than that,
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>
> ***NOTE***
> Any confused maintainers may want to read the cover note here for context:
> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>
> include/linux/stackdepot.h | 6 +++---
> lib/stackdepot.c | 6 +++---
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
> index e9ec32fb97d4a..ac877a4e90406 100644
> --- a/include/linux/stackdepot.h
> +++ b/include/linux/stackdepot.h
> @@ -32,10 +32,10 @@ typedef u32 depot_stack_handle_t;
>
> #define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
>
> -#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
> -#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
> +#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages of PAGE_SIZE_MAX */
> +#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT_MAX + DEPOT_POOL_ORDER))
> #define DEPOT_STACK_ALIGN 4
> -#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
> +#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT_MAX - DEPOT_STACK_ALIGN)
> #define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
> STACK_DEPOT_EXTRA_BITS)
>
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 5ed34cc963fc3..974351f0e9e3c 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -68,7 +68,7 @@ static void *new_pool;
> /* Number of pools in stack_pools. */
> static int pools_num;
> /* Offset to the unused space in the currently used pool. */
> -static size_t pool_offset = DEPOT_POOL_SIZE;
> +static DEFINE_GLOBAL_PAGE_SIZE_VAR(size_t, pool_offset, DEPOT_POOL_SIZE);
> /* Freelist of stack records within stack_pools. */
> static LIST_HEAD(free_stacks);
> /* The lock must be held when performing pool or freelist modifications. */
> @@ -625,7 +625,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> */
> if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
> page = alloc_pages(gfp_nested_mask(alloc_flags),
> - DEPOT_POOL_ORDER);
> + get_order(DEPOT_POOL_SIZE));
> if (page)
> prealloc = page_address(page);
> }
> @@ -663,7 +663,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> exit:
> if (prealloc) {
> /* Stack depot didn't use this memory, free it. */
> - free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
> + free_pages((unsigned long)prealloc, get_order(DEPOT_POOL_SIZE));
> }
> if (found)
> handle = found->handle.handle;
* Re: [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption
2024-11-14 10:17 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Vlastimil Babka
@ 2024-11-26 10:08 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-11-26 10:08 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, Christoph Lameter, David Hildenbrand,
David Rientjes, Greg Marsden, Ivan Ivanov, Johannes Weiner,
Joonsoo Kim, Kalesh Singh, Marc Zyngier, Mark Rutland,
Matthias Brugger, Michal Hocko, Miquel Raynal, Miroslav Benes,
Pekka Enberg, Richard Weinberger, Shakeel Butt,
Vignesh Raghavendra, Will Deacon
Cc: cgroups, linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
linux-mtd
Hi Vlastimil,
Sorry about the slow response to your review of this series - I'm just getting
to it now. Comments below...
On 14/11/2024 10:17, Vlastimil Babka wrote:
> On 10/14/24 12:58, Ryan Roberts wrote:
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>>
>> Refactor "struct vmap_block" to use a flexible array for used_map since
>> VMAP_BBMAP_BITS is not a compile time constant for the boot-time page
>> size case.
>>
>> Update various BUILD_BUG_ON() instances to check against appropriate
>> page size limit.
>>
>> Re-define "union swap_header" so that it's no longer exactly page-sized.
>> Instead define a flexible "magic" array with a define which tells the
>> offset to where the magic signature begins.
>>
>> Consider page size limit in some CPP conditionals.
>>
>> Wrap global variables that are initialized with PAGE_SIZE derived values
>> using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their initialization can be
>> deferred for boot-time page size builds.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> drivers/mtd/mtdswap.c | 4 ++--
>> include/linux/mm.h | 2 +-
>> include/linux/mm_types_task.h | 2 +-
>> include/linux/mmzone.h | 3 ++-
>> include/linux/slab.h | 7 ++++---
>> include/linux/swap.h | 17 ++++++++++++-----
>> include/linux/swapops.h | 6 +++++-
>> mm/memcontrol.c | 2 +-
>> mm/memory.c | 4 ++--
>> mm/mmap.c | 2 +-
>> mm/page-writeback.c | 2 +-
>> mm/slub.c | 2 +-
>> mm/sparse.c | 2 +-
>> mm/swapfile.c | 2 +-
>> mm/vmalloc.c | 7 ++++---
>> 15 files changed, 39 insertions(+), 25 deletions(-)
>>
>
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -132,10 +132,17 @@ static inline int current_is_kswapd(void)
>> * bootbits...
>> */
>> union swap_header {
>> - struct {
>> - char reserved[PAGE_SIZE - 10];
>> - char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
>> - } magic;
>> + /*
>> + * Exists conceptually, but since PAGE_SIZE may not be known at compile
>> + * time, we must access through pointer arithmetic at run time.
>> + *
>> + * struct {
>> + * char reserved[PAGE_SIZE - 10];
>> + * char magic[10]; SWAP-SPACE or SWAPSPACE2
>> + * } magic;
>> + */
>> +#define SWAP_HEADER_MAGIC (PAGE_SIZE - 10)
>> + char magic[1];
>
> I wonder if it makes sense to even keep this magic field anymore.
>
>> struct {
>> char bootbits[1024]; /* Space for disklabel etc. */
>> __u32 version;
>> @@ -201,7 +208,7 @@ struct swap_extent {
>> * Max bad pages in the new format..
>> */
>> #define MAX_SWAP_BADPAGES \
>> - ((offsetof(union swap_header, magic.magic) - \
>> + ((SWAP_HEADER_MAGIC - \
>> offsetof(union swap_header, info.badpages)) / sizeof(int))
>>
>> enum {
>
> <snip>
>
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2931,7 +2931,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>> unsigned long swapfilepages;
>> unsigned long last_page;
>>
>> - if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) {
>> + if (memcmp("SWAPSPACE2", &swap_header->magic[SWAP_HEADER_MAGIC], 10)) {
>
> I'd expect static checkers to scream here because we overflow the magic[1]
> both due to copying 10 bytes into 1 byte array and also with the insane
> offset. Hence my suggestion to drop the field and use purely pointer arithmetic.
Yeah, good point. I'll remove magic[] and use pointer arithmetic.
>
>> pr_err("Unable to find swap-space signature\n");
>> return 0;
>> }
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index a0df1e2e155a8..b4fbba204603c 100644
>
> Hm I'm actually looking at your wip branch which also has:
>
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -969,7 +969,7 @@ static inline int get_order_from_str(const char *size_str)
> return -EINVAL;
> }
>
> -static char str_dup[PAGE_SIZE] __initdata;
> +static char str_dup[PAGE_SIZE_MAX] __initdata;
> static int __init setup_thp_anon(char *str)
> {
> char *token, *range, *policy, *subtoken;
>
> Why PAGE_SIZE_MAX? Isn't this the same case as "mm/memcontrol: Fix seq_buf
> size to save memory when PAGE_SIZE is large"
Hmm, you're probably right. I had a vague notion that "str", as passed into the
function, was guaranteed to be no bigger than PAGE_SIZE (perhaps I'm wrong). So
assumed that's where the original definition of str_dup[PAGE_SIZE] was coming from.
But I think your real question is "should the max size of str be a function of
PAGE_SIZE?". I think it could; there are more page orders that can legitimately
be described when the page size is bigger (at least for arm64). But in practice,
I'd expect any sane string for any page size to be easily within 4K.
So on that basis, I'll take your advice; changing this buffer to be 4K always.
Thanks,
Ryan
* Re: [RFC PATCH v1 15/57] stackdepot: Remove PAGE_SIZE compile-time constant assumption
2024-11-14 11:15 ` Vlastimil Babka
@ 2024-11-26 10:15 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-11-26 10:15 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 14/11/2024 11:15, Vlastimil Babka wrote:
> On 10/14/24 12:58, Ryan Roberts wrote:
>> To prepare for supporting boot-time page size selection, refactor code
>> to remove assumptions about PAGE_SIZE being compile-time constant. Code
>> intended to be equivalent when compile-time page size is active.
>>
>> "union handle_parts" previously calculated the number of bits required
>> for its pool index and offset members based on PAGE_SHIFT. This is
>> problematic for boot-time page size builds because the actual page size
>> isn't known until boot-time.
>>
>> We could use PAGE_SHIFT_MAX in calculating the worst case offset bits,
>> but bits would be wasted that could be used for pool index when
>> PAGE_SIZE is set smaller than MAX, the end result being that stack depot
>> can address less memory than it should.
>>
>> To avoid needing to dynamically define the offset and index bit widths,
>> let's instead fix the pool size and derive the order at runtime based on
>> the PAGE_SIZE. This means that the fields' widths can remain static,
>> with the down side being slightly increased risk of failing to allocate
>> the large folio.
>>
>> This only affects boot-time page size builds. compile-time page size
>> builds will still always allocate order-2 folios.
>>
>> Additionally, wrap global variables that are initialized with PAGE_SIZE
>> derived values using DEFINE_GLOBAL_PAGE_SIZE_VAR() so their
>> initialization can be deferred for boot-time page size builds.
>
> This is done for pool_offset but given it's initialized by DEPOT_POOL_SIZE,
> it doesn't look derived from PAGE_SIZE?
Good spot; I think I initially did this when DEPOT_POOL_SIZE was still based on
PAGE_SIZE. But then I've subsequently re-defined DEPOT_POOL_SIZE to be
independent of PAGE_SIZE. I'll remove this part of the change.
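To illustrate why pool_offset no longer needs deferred initialization,
here is a rough user-space sketch. PAGE_SHIFT_MAX_SKETCH is an assumed
value (64K as the largest supported page size) and get_order_sketch() is
a simplified, hypothetical stand-in for the kernel's get_order().

```c
/* Assumption: 64K is the largest page size the image supports. */
#define PAGE_SHIFT_MAX_SKETCH	16
#define DEPOT_POOL_ORDER_SKETCH	2
/* Fixed at build time: 4 pages of the maximum page size, i.e. 256K. */
#define DEPOT_POOL_SIZE_SKETCH \
	(1LL << (PAGE_SHIFT_MAX_SKETCH + DEPOT_POOL_ORDER_SKETCH))

/*
 * Simplified stand-in for get_order(): the smallest order such that
 * (page size << order) covers the requested size.
 */
static inline int get_order_sketch(long long size, int page_shift)
{
	int order = 0;

	while ((1LL << (page_shift + order)) < size)
		order++;
	return order;
}
```

The pool size is the same 256K constant whatever page size is booted; only
the allocation order changes (6 for 4K pages, 4 for 16K, 2 for 64K), which
is where the slightly higher risk of allocation failure on small page
sizes comes from.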
>
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> Other than that,
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!
>
>
>> ---
>>
>> ***NOTE***
>> Any confused maintainers may want to read the cover note here for context:
>> https://lore.kernel.org/all/20241014105514.3206191-1-ryan.roberts@arm.com/
>>
>> include/linux/stackdepot.h | 6 +++---
>> lib/stackdepot.c | 6 +++---
>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
>> index e9ec32fb97d4a..ac877a4e90406 100644
>> --- a/include/linux/stackdepot.h
>> +++ b/include/linux/stackdepot.h
>> @@ -32,10 +32,10 @@ typedef u32 depot_stack_handle_t;
>>
>> #define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
>>
>> -#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
>> -#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
>> +#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages of PAGE_SIZE_MAX */
>> +#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT_MAX + DEPOT_POOL_ORDER))
>> #define DEPOT_STACK_ALIGN 4
>> -#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
>> +#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT_MAX - DEPOT_STACK_ALIGN)
>> #define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
>> STACK_DEPOT_EXTRA_BITS)
>>
>> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
>> index 5ed34cc963fc3..974351f0e9e3c 100644
>> --- a/lib/stackdepot.c
>> +++ b/lib/stackdepot.c
>> @@ -68,7 +68,7 @@ static void *new_pool;
>> /* Number of pools in stack_pools. */
>> static int pools_num;
>> /* Offset to the unused space in the currently used pool. */
>> -static size_t pool_offset = DEPOT_POOL_SIZE;
>> +static DEFINE_GLOBAL_PAGE_SIZE_VAR(size_t, pool_offset, DEPOT_POOL_SIZE);
>> /* Freelist of stack records within stack_pools. */
>> static LIST_HEAD(free_stacks);
>> /* The lock must be held when performing pool or freelist modifications. */
>> @@ -625,7 +625,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>> */
>> if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
>> page = alloc_pages(gfp_nested_mask(alloc_flags),
>> - DEPOT_POOL_ORDER);
>> + get_order(DEPOT_POOL_SIZE));
>> if (page)
>> prealloc = page_address(page);
>> }
>> @@ -663,7 +663,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>> exit:
>> if (prealloc) {
>> /* Stack depot didn't use this memory, free it. */
>> - free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
>> + free_pages((unsigned long)prealloc, get_order(DEPOT_POOL_SIZE));
>> }
>> if (found)
>> handle = found->handle.handle;
>
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-14 10:09 ` Vlastimil Babka
@ 2024-11-26 12:18 ` Ryan Roberts
2024-11-26 12:36 ` Vlastimil Babka
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-26 12:18 UTC (permalink / raw)
To: Vlastimil Babka, Dave Kleikamp, Andrew Morton
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 14/11/2024 10:09, Vlastimil Babka wrote:
> On 11/1/24 21:16, Dave Kleikamp wrote:
>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>> is no longer optimized out with a constant size, so a build bug may
>> occur on a path that won't be reached.
>
> That's rather unfortunate, the __builtin_constant_p(size) part of
> kmalloc_noprof() really expects things to resolve at compile time and it
> would be better to keep it that way.
>
> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
> PAGE_SHIFT_MAX and kept it constant, instead of introducing
> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>
> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
> still create and use kmalloc caches up to 128k. SLUB should handle that fine
> (if not, please report it :)
So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
whereas before it only supported up to 8K. I was trying to avoid that since I
assumed that would be costly in terms of extra memory allocated for those higher
order buckets that will never be used. But I have no idea how SLUB works in
practice. Perhaps memory for the cache is only lazily allocated so we won't see
an issue in practice?
I'm happy to make this change if you're certain it's the right approach; please
confirm.
>
> Maybe we could also stop adding + 1 to PAGE_SHIFT_MAX if it's >=64k, so the
> cache size is max 64k and not 128k but that should be probably evaluated
> separately from this series.
I'm inferring from this that perhaps there is a memory cost with having the
higher orders defined but unused.
Thanks,
Ryan
>
> Vlastimil
>
>> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>>
>> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
>> ---
>>
>> Ryan,
>>
>> Please consider incorporating this fix or something similar into your
>> mm patch in the boot-time page size patches.
>>
>> include/linux/slab.h | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index 9848296ca6ba..a4c7507ab8ec 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
>> if (size <= 1024 * 1024) return 20;
>> if (size <= 2 * 1024 * 1024) return 21;
>>
>> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
>> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
>> else
>> BUG();
>
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-26 12:18 ` Ryan Roberts
@ 2024-11-26 12:36 ` Vlastimil Babka
2024-11-26 14:26 ` Ryan Roberts
2024-11-26 14:53 ` Ryan Roberts
0 siblings, 2 replies; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-26 12:36 UTC (permalink / raw)
To: Ryan Roberts, Dave Kleikamp, Andrew Morton
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 11/26/24 13:18, Ryan Roberts wrote:
> On 14/11/2024 10:09, Vlastimil Babka wrote:
>> On 11/1/24 21:16, Dave Kleikamp wrote:
>>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>>> is no longer optimized out with a constant size, so a build bug may
>>> occur on a path that won't be reached.
>>
>> That's rather unfortunate, the __builtin_constant_p(size) part of
>> kmalloc_noprof() really expects things to resolve at compile time and it
>> would be better to keep it that way.
>>
>> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
>> PAGE_SHIFT_MAX and kept it constant, instead of introducing
>> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>>
>> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
>> still create and use kmalloc caches up to 128k. SLUB should handle that fine
>> (if not, please report it :)
>
> So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
> whereas before it only supported up to 8K. I was trying to avoid that since I
> assumed that would be costly in terms of extra memory allocated for those higher
> order buckets that will never be used. But I have no idea how SLUB works in
> practice. Perhaps memory for the cache is only lazily allocated so we won't see
> an issue in practice?
Yes the e.g. 128k slabs themselves will be lazily allocated. There will be
some overhead with the management structures (struct kmem_cache etc) but
much smaller.
To be completely honest, some extra overhead might come to be when the slabs
are allocated and later the user frees those allocations. kmalloc_large()
would return them immediately, while a regular kmem_cache will keep one or
more per cpu for reuse. But if that becomes a visible problem we can tune
those caches to discard slabs more aggressively.
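For scale, the sizes under discussion can be worked out directly. The sketch
below is a user-space illustration, not kernel code: it assumes the largest
kmalloc cache is 1 << (PAGE_SHIFT + 1), which matches the 8K and 128K figures
quoted above, and the helper names max_cache_size() and extra_caches() are
made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption taken from the figures in this thread: the largest kmalloc
 * cache is twice the page size, i.e. 1 << (page_shift + 1).
 */
static size_t max_cache_size(unsigned int page_shift)
{
	return (size_t)1 << (page_shift + 1);
}

/*
 * Number of additional power-of-two caches that exist when the limit is
 * derived from shift_max instead of shift_min. Hypothetical helper.
 */
static unsigned int extra_caches(unsigned int shift_min, unsigned int shift_max)
{
	assert(shift_min <= shift_max);
	return shift_max - shift_min;
}
```

With a shift of 12 (4K pages) the limit is 8K and with a shift of 16 (64K
pages) it is 128K, so a 4K boot of a 4K-to-64K kernel would gain four caches:
16K, 32K, 64K and 128K.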
> I'm happy to make this change if you're certain it's the right approach; please
> confirm.
Yes, it's a much better option than breaking the build-time-constant part of
kmalloc_noprof().
>>
>> Maybe we could also stop adding + 1 to PAGE_SHIFT_MAX if it's >=64k, so the
>> cache size is max 64k and not 128k but that should be probably evaluated
>> separately from this series.
>
> I'm inferring from this that perhaps there is a memory cost with having the
> higher orders defined but unused.
Yeah, as per above, it should not be too large and we could tune it down if
necessary.
> Thanks,
> Ryan
>
>>
>> Vlastimil
>>
>>> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>>>
>>> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
>>> ---
>>>
>>> Ryan,
>>>
>>> Please consider incorporating this fix or something similar into your
>>> mm patch in the boot-time pages size patches.
>>>
>>> include/linux/slab.h | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>> index 9848296ca6ba..a4c7507ab8ec 100644
>>> --- a/include/linux/slab.h
>>> +++ b/include/linux/slab.h
>>> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
>>> if (size <= 1024 * 1024) return 20;
>>> if (size <= 2 * 1024 * 1024) return 21;
>>>
>>> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
>>> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
>>> else
>>> BUG();
>>
>
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-26 12:36 ` Vlastimil Babka
@ 2024-11-26 14:26 ` Ryan Roberts
2024-11-26 14:53 ` Ryan Roberts
1 sibling, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-11-26 14:26 UTC (permalink / raw)
To: Vlastimil Babka, Dave Kleikamp, Andrew Morton
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 26/11/2024 12:36, Vlastimil Babka wrote:
> On 11/26/24 13:18, Ryan Roberts wrote:
>> On 14/11/2024 10:09, Vlastimil Babka wrote:
>>> On 11/1/24 21:16, Dave Kleikamp wrote:
>>>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>>>> is no longer optimized out with a constant size, so a build bug may
>>>> occur on a path that won't be reached.
>>>
>>> That's rather unfortunate, the __builtin_constant_p(size) part of
>>> kmalloc_noprof() really expects things to resolve at compile time and it
>>> would be better to keep it that way.
>>>
>>> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
>>> PAGE_SHIFT_MAX and kept it constant, instead of introducing
>>> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>>>
>>> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
>>> still create and use kmalloc caches up to 128k. SLUB should handle that fine
>>> (if not, please report it :)
>>
>> So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
>> whereas before it only supported up to 8K. I was trying to avoid that since I
>> assumed that would be costly in terms of extra memory allocated for those higher
>> order buckets that will never be used. But I have no idea how SLUB works in
>> practice. Perhaps memory for the cache is only lazily allocated so we won't see
>> an issue in practice?
>
> Yes the e.g. 128k slabs themselves will be lazily allocated. There will be
> some overhead with the management structures (struct kmem_cache etc) but
> much smaller.
> To be completely honest, some extra overhead might come to be when the slabs
> are allocated and later the user frees those allocations. kmalloc_large()
> would return them immediately, while a regular kmem_cache will keep one or
> more per cpu for reuse. But if that becomes a visible problem we can tune
> those caches to discard slabs more aggressively.
>
>> I'm happy to make this change if you're certain it's the right approach; please
>> confirm.
>
> Yes it's much better option than breaking the build-time-constant part of
> kmalloc_noprof().
OK, I'll take this approach as you suggest.
Thanks,
Ryan
>
>>>
>>> Maybe we could also stop adding + 1 to PAGE_SHIFT_MAX if it's >=64k, so the
>>> cache size is max 64k and not 128k but that should be probably evaluated
>>> separately from this series.
>>
>> I'm inferring from this that perhaps there is a memory cost with having the
>> higher orders defined but unused.
>
> Yeah as per above, should not be too large and we could tune it down if
> necessary.
>
>> Thanks,
>> Ryan
>>
>>>
>>> Vlastimil
>>>
>>>> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>>>>
>>>> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
>>>> ---
>>>>
>>>> Ryan,
>>>>
>>>> Please consider incorporating this fix or something similar into your
>>>> mm patch in the boot-time pages size patches.
>>>>
>>>> include/linux/slab.h | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>> index 9848296ca6ba..a4c7507ab8ec 100644
>>>> --- a/include/linux/slab.h
>>>> +++ b/include/linux/slab.h
>>>> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
>>>> if (size <= 1024 * 1024) return 20;
>>>> if (size <= 2 * 1024 * 1024) return 21;
>>>>
>>>> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
>>>> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>>> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
>>>> else
>>>> BUG();
>>>
>>
>
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-26 12:36 ` Vlastimil Babka
2024-11-26 14:26 ` Ryan Roberts
@ 2024-11-26 14:53 ` Ryan Roberts
2024-11-26 15:09 ` Vlastimil Babka
1 sibling, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-11-26 14:53 UTC (permalink / raw)
To: Vlastimil Babka, Dave Kleikamp, Andrew Morton
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 26/11/2024 12:36, Vlastimil Babka wrote:
> On 11/26/24 13:18, Ryan Roberts wrote:
>> On 14/11/2024 10:09, Vlastimil Babka wrote:
>>> On 11/1/24 21:16, Dave Kleikamp wrote:
>>>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>>>> is no longer optimized out with a constant size, so a build bug may
>>>> occur on a path that won't be reached.
>>>
>>> That's rather unfortunate, the __builtin_constant_p(size) part of
>>> kmalloc_noprof() really expects things to resolve at compile time and it
>>> would be better to keep it that way.
>>>
>>> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
>>> PAGE_SHIFT_MAX and kept it constant, instead of introducing
>>> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>>>
>>> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
>>> still create and use kmalloc caches up to 128k. SLUB should handle that fine
>>> (if not, please report it :)
>>
>> So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
>> whereas before it only supported up to 8K. I was trying to avoid that since I
>> assumed that would be costly in terms of extra memory allocated for those higher
>> order buckets that will never be used. But I have no idea how SLUB works in
>> practice. Perhaps memory for the cache is only lazily allocated so we won't see
>> an issue in practice?
>
> Yes the e.g. 128k slabs themselves will be lazily allocated. There will be
> some overhead with the management structures (struct kmem_cache etc) but
> much smaller.
> To be completely honest, some extra overhead might come to be when the slabs
> are allocated and later the user frees those allocations. kmalloc_large()
> would return them immediately, while a regular kmem_cache will keep one or
> more per cpu for reuse. But if that becomes a visible problem we can tune
> those caches to discard slabs more aggressively.
Sorry to keep pushing on this; now that I've actually looked at the code, I feel
I have a slightly better understanding:
void *kmalloc_noprof(size_t size, gfp_t flags)
{
	if (__builtin_constant_p(size) && size) {
		if (size > KMALLOC_MAX_CACHE_SIZE)
			return __kmalloc_large_noprof(size, flags); <<< (1)
		index = kmalloc_index(size);
		return __kmalloc_cache_noprof(...); <<< (2)
	}
	return __kmalloc_noprof(size, flags); <<< (3)
}
So if size and KMALLOC_MAX_CACHE_SIZE are constant, we end up with this
resolving either to a call to (1) or (2), decided at compile time. If
KMALLOC_MAX_CACHE_SIZE is not constant, (1), (2) and the runtime conditional
need to be kept in the function.
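A minimal user-space sketch of that folding, assuming GCC or Clang for
__builtin_constant_p(). It uses a macro rather than an inline function so that
the result does not depend on the optimization level; PICK_PATH,
MAX_CACHE_SIZE, pick_runtime() and the PATH_* values are hypothetical names
that only mirror the (1), (2), (3) labels above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant limit (8K), standing in for KMALLOC_MAX_CACHE_SIZE. */
#define MAX_CACHE_SIZE ((size_t)1 << 13)

/* Values mirror the (1), (2), (3) labels in the snippet above. */
enum path { PATH_LARGE = 1, PATH_CACHE = 2, PATH_GENERIC = 3 };

/*
 * With a literal argument and a constant limit, every operand is known at
 * build time, so the whole expression folds to a single enum value.
 */
#define PICK_PATH(size) \
	((__builtin_constant_p(size) && (size)) ? \
		(((size) > MAX_CACHE_SIZE) ? PATH_LARGE : PATH_CACHE) : \
		PATH_GENERIC)

/* A value read through a volatile object is not a build-time constant. */
static enum path pick_runtime(void)
{
	volatile size_t n = 64;

	return PICK_PATH(n);
}

/* Small self-check showing the three outcomes. */
static void check_paths(void)
{
	assert(PICK_PATH(64) == PATH_CACHE);
	assert(PICK_PATH(1UL << 20) == PATH_LARGE);
	assert(pick_runtime() == PATH_GENERIC);
}
```

If MAX_CACHE_SIZE were a variable instead, the inner comparison could no longer
be folded and both outcomes would have to stay in the expanded code.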
But intuitively, I would have guessed that given the choice between the overhead
of keeping that runtime conditional vs keeping per-cpu slab caches for extra
sizes between 16K and 128K, then the runtime conditional would be preferable. I
would guess that quite a bit of memory could get tied up in those caches?
Why is your preference the opposite? What am I not understanding?
>
>> I'm happy to make this change if you're certain it's the right approach; please
>> confirm.
>
> Yes it's much better option than breaking the build-time-constant part of
> kmalloc_noprof().
>
>>>
>>> Maybe we could also stop adding + 1 to PAGE_SHIFT_MAX if it's >=64k, so the
>>> cache size is max 64k and not 128k but that should be probably evaluated
>>> separately from this series.
>>
>> I'm inferring from this that perhaps there is a memory cost with having the
>> higher orders defined but unused.
>
> Yeah as per above, should not be too large and we could tune it down if
> necessary.
>
>> Thanks,
>> Ryan
>>
>>>
>>> Vlastimil
>>>
>>>> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>>>>
>>>> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
>>>> ---
>>>>
>>>> Ryan,
>>>>
>>>> Please consider incorporating this fix or something similar into your
>>>> mm patch in the boot-time pages size patches.
>>>>
>>>> include/linux/slab.h | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>> index 9848296ca6ba..a4c7507ab8ec 100644
>>>> --- a/include/linux/slab.h
>>>> +++ b/include/linux/slab.h
>>>> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
>>>> if (size <= 1024 * 1024) return 20;
>>>> if (size <= 2 * 1024 * 1024) return 21;
>>>>
>>>> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
>>>> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>>> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
>>>> else
>>>> BUG();
>>>
>>
>
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-26 14:53 ` Ryan Roberts
@ 2024-11-26 15:09 ` Vlastimil Babka
2024-11-26 15:27 ` Vlastimil Babka
0 siblings, 1 reply; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-26 15:09 UTC (permalink / raw)
To: Ryan Roberts, Dave Kleikamp, Andrew Morton, Christoph Lameter,
David Rientjes, Hyeonggon Yoo
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 11/26/24 15:53, Ryan Roberts wrote:
> On 26/11/2024 12:36, Vlastimil Babka wrote:
>> On 11/26/24 13:18, Ryan Roberts wrote:
>>> On 14/11/2024 10:09, Vlastimil Babka wrote:
>>>> On 11/1/24 21:16, Dave Kleikamp wrote:
>>>>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>>>>> is no longer optimized out with a constant size, so a build bug may
>>>>> occur on a path that won't be reached.
>>>>
>>>> That's rather unfortunate, the __builtin_constant_p(size) part of
>>>> kmalloc_noprof() really expects things to resolve at compile time and it
>>>> would be better to keep it that way.
>>>>
>>>> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
>>>> PAGE_SHIFT_MAX and kept it constant, instead of introducing
>>>> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>>>>
>>>> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
>>>> still create and use kmalloc caches up to 128k. SLUB should handle that fine
>>>> (if not, please report it :)
>>>
>>> So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
>>> whereas before it only supported up to 8K. I was trying to avoid that since I
>>> assumed that would be costly in terms of extra memory allocated for those higher
>>> order buckets that will never be used. But I have no idea how SLUB works in
>>> practice. Perhaps memory for the cache is only lazily allocated so we won't see
>>> an issue in practice?
>>
>> Yes the e.g. 128k slabs themselves will be lazily allocated. There will be
>> some overhead with the management structures (struct kmem_cache etc) but
>> much smaller.
>> To be completely honest, some extra overhead might come to be when the slabs
>> are allocated and later the user frees those allocations. kmalloc_large()
>> would return them immediately, while a regular kmem_cache will keep one or
>> more per cpu for reuse. But if that becomes a visible problem we can tune
>> those caches to discard slabs more aggressively.
>
> Sorry to keep pushing on this, now that I've actually looked at the code, I feel
> I have a slightly better understanding:
>
> void *kmalloc_noprof(size_t size, gfp_t flags)
> {
> 	if (__builtin_constant_p(size) && size) {
>
> 		if (size > KMALLOC_MAX_CACHE_SIZE)
> 			return __kmalloc_large_noprof(size, flags); <<< (1)
>
> 		index = kmalloc_index(size);
> 		return __kmalloc_cache_noprof(...); <<< (2)
> 	}
> 	return __kmalloc_noprof(size, flags); <<< (3)
> }
>
> So if size and KMALLOC_MAX_CACHE_SIZE are constant, we end up with this
> resolving either to a call to (1) or (2), decided at compile time. If
> KMALLOC_MAX_CACHE_SIZE is not constant, (1), (2) and the runtime conditional
> need to be kept in the function.
>
> But intuitively, I would have guessed that given the choice between the overhead
> of keeping that runtime conditional vs keeping per-cpu slab caches for extra
> sizes between 16K and 128K, then the runtime conditional would be preferable. I
> would guess that quite a bit of memory could get tied up in those caches?
>
> Why is your preference the opposite? What am I not understanding?
+CC more slab people.
So the above is an inline function, but constructed in a way that it should,
without further inline code, become
- a call to __kmalloc_large_noprof() for build-time constant size larger
than KMALLOC_MAX_CACHE_SIZE
- a call to __kmalloc_cache_noprof() for build-time constant size no larger
than KMALLOC_MAX_CACHE_SIZE, where the cache is picked from an array with
compile-time calculated index
- a call to __kmalloc_noprof() for non-constant sizes otherwise
If KMALLOC_MAX_CACHE_SIZE stops being build-time constant, the sensible way
to handle it would be to #ifdef or otherwise compile out the whole "if
__builtin_constant_p(size)" part and just call __kmalloc_noprof() always, so
we don't blow the inline paths with a KMALLOC_MAX_CACHE_SIZE check leading
to choice between calling __kmalloc_large_noprof() or __kmalloc_cache_noprof().
I just don't believe we would waste so much memory with caches for the extra
sizes between 16K and 128K, so I would do that suggestion only if
proven wrong. But I wouldn't mind it that much if you chose it right away.
The solution earlier in this thread to patch __kmalloc_index() would be
worse than either of those two alternatives though.
>
>>
>>> I'm happy to make this change if you're certain it's the right approach; please
>>> confirm.
>>
>> Yes it's much better option than breaking the build-time-constant part of
>> kmalloc_noprof().
>>
>>>>
>>>> Maybe we could also stop adding + 1 to PAGE_SHIFT_MAX if it's >=64k, so the
>>>> cache size is max 64k and not 128k but that should be probably evaluated
>>>> separately from this series.
>>>
>>> I'm inferring from this that perhaps there is a memory cost with having the
>>> higher orders defined but unused.
>>
>> Yeah as per above, should not be too large and we could tune it down if
>> necessary.
>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>> Vlastimil
>>>>
>>>>> Found compiling drivers/net/ethernet/qlogic/qed/qed_sriov.c
>>>>>
>>>>> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
>>>>> ---
>>>>>
>>>>> Ryan,
>>>>>
>>>>> Please consider incorporating this fix or something similar into your
>>>>> mm patch in the boot-time pages size patches.
>>>>>
>>>>> include/linux/slab.h | 3 ++-
>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>>> index 9848296ca6ba..a4c7507ab8ec 100644
>>>>> --- a/include/linux/slab.h
>>>>> +++ b/include/linux/slab.h
>>>>> @@ -685,7 +685,8 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
>>>>> if (size <= 1024 * 1024) return 20;
>>>>> if (size <= 2 * 1024 * 1024) return 21;
>>>>>
>>>>> - if (!IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>>>> + if (!IS_ENABLED(CONFIG_ARM64_BOOT_TIME_PAGE_SIZE) &&
>>>>> + !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant)
>>>>> BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()");
>>>>> else
>>>>> BUG();
>>>>
>>>
>>
>
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-26 15:09 ` Vlastimil Babka
@ 2024-11-26 15:27 ` Vlastimil Babka
2024-11-26 15:33 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Vlastimil Babka @ 2024-11-26 15:27 UTC (permalink / raw)
To: Ryan Roberts, Dave Kleikamp, Andrew Morton, Christoph Lameter,
David Rientjes, Hyeonggon Yoo
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 11/26/24 16:09, Vlastimil Babka wrote:
> On 11/26/24 15:53, Ryan Roberts wrote:
>> On 26/11/2024 12:36, Vlastimil Babka wrote:
>>> On 11/26/24 13:18, Ryan Roberts wrote:
>>>> On 14/11/2024 10:09, Vlastimil Babka wrote:
>>>>> On 11/1/24 21:16, Dave Kleikamp wrote:
>>>>>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>>>>>> is no longer optimized out with a constant size, so a build bug may
>>>>>> occur on a path that won't be reached.
>>>>>
>>>>> That's rather unfortunate, the __builtin_constant_p(size) part of
>>>>> kmalloc_noprof() really expects things to resolve at compile time and it
>>>>> would be better to keep it that way.
>>>>>
>>>>> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
>>>>> PAGE_SHIFT_MAX and kept it constant, instead of introducing
>>>>> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>>>>>
>>>>> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
>>>>> still create and use kmalloc caches up to 128k. SLUB should handle that fine
>>>>> (if not, please report it :)
>>>>
>>>> So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
>>>> whereas before it only supported up to 8K. I was trying to avoid that since I
>>>> assumed that would be costly in terms of extra memory allocated for those higher
>>>> order buckets that will never be used. But I have no idea how SLUB works in
>>>> practice. Perhaps memory for the cache is only lazily allocated so we won't see
>>>> an issue in practice?
>>>
>>> Yes the e.g. 128k slabs themselves will be lazily allocated. There will be
>>> some overhead with the management structures (struct kmem_cache etc) but
>>> much smaller.
>>> To be completely honest, some extra overhead might come to be when the slabs
>>> are allocated and later the user frees those allocations. kmalloc_large()
>>> would return them immediately, while a regular kmem_cache will keep one or
>>> more per cpu for reuse. But if that becomes a visible problem we can tune
>>> those caches to discard slabs more aggressively.
>>
>> Sorry to keep pushing on this, now that I've actually looked at the code, I feel
>> I have a slightly better understanding:
>>
>> void *kmalloc_noprof(size_t size, gfp_t flags)
>> {
>> 	if (__builtin_constant_p(size) && size) {
>>
>> 		if (size > KMALLOC_MAX_CACHE_SIZE)
>> 			return __kmalloc_large_noprof(size, flags); <<< (1)
>>
>> 		index = kmalloc_index(size);
>> 		return __kmalloc_cache_noprof(...); <<< (2)
>> 	}
>> 	return __kmalloc_noprof(size, flags); <<< (3)
>> }
>>
>> So if size and KMALLOC_MAX_CACHE_SIZE are constant, we end up with this
>> resolving either to a call to (1) or (2), decided at compile time. If
>> KMALLOC_MAX_CACHE_SIZE is not constant, (1), (2) and the runtime conditional
>> need to be kept in the function.
>>
>> But intuitively, I would have guessed that given the choice between the overhead
>> of keeping that runtime conditional vs keeping per-cpu slab caches for extra
>> sizes between 16K and 128K, then the runtime conditional would be preferable. I
>> would guess that quite a bit of memory could get tied up in those caches?
>>
>> Why is your preference the opposite? What am I not understanding?
>
> +CC more slab people.
>
> So the above is an inline function, but constructed in a way that it should,
> without further inline code, become
> - a call to __kmalloc_large_noprof() for build-time constant size larger
> than KMALLOC_MAX_CACHE_SIZE
> - a call to __kmalloc_cache_noprof() for build-time constant size smaller
> than KMALLOC_MAX_CACHE_SIZE, where the cache is picked from an array with
> compile-time calculated index
> - call to __kmalloc_noprof() for non-constant sizes otherwise
>
> If KMALLOC_MAX_CACHE_SIZE stops being build-time constant, the sensible way
> to handle it would be to #ifdef or otherwise compile out the whole "if
> __builtin_constant_p(size)" part and just call __kmalloc_noprof() always, so
> we don't blow the inline paths with a KMALLOC_MAX_CACHE_SIZE check leading
> to choice between calling __kmalloc_large_noprof() or __kmalloc_cache_noprof().
Or maybe we could have a PAGE_SIZE_MAX-derived KMALLOC_MAX_CACHE_SIZE_MAX
behave as the code above currently does with KMALLOC_MAX_CACHE_SIZE, and
additionally have a PAGE_SIZE_MIN-derived KMALLOC_MAX_CACHE_SIZE_MIN, where a
build-time-constant size larger than KMALLOC_MAX_CACHE_SIZE_MIN (which is a
compile-time test) is redirected to __kmalloc_noprof() for a run-time test.
That seems like the optimum solution :)
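One possible reading of this scheme, as a user-space sketch rather than kernel
code. The limits assume a kernel built for 4K to 64K pages and a cache limit of
twice the page size; dispatch_constant(), dispatch_runtime() and the enum are
hypothetical names. dispatch_constant() models only the branch taken when size
is a non-zero build-time constant.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limits for a 4K..64K build: twice PAGE_SIZE_MIN and twice PAGE_SIZE_MAX. */
#define CACHE_SIZE_MIN ((size_t)1 << 13) /* stands in for KMALLOC_MAX_CACHE_SIZE_MIN */
#define CACHE_SIZE_MAX ((size_t)1 << 17) /* stands in for KMALLOC_MAX_CACHE_SIZE_MAX */

enum target {
	TARGET_LARGE,   /* __kmalloc_large_noprof() */
	TARGET_CACHE,   /* __kmalloc_cache_noprof() */
	TARGET_GENERIC, /* __kmalloc_noprof(), which tests at run time */
};

/* Decision for a build-time-constant size; both comparisons use constants. */
static enum target dispatch_constant(size_t size)
{
	if (size > CACHE_SIZE_MAX)
		return TARGET_LARGE;   /* too big for any supported page size */
	if (size > CACHE_SIZE_MIN)
		return TARGET_GENERIC; /* depends on the page size chosen at boot */
	return TARGET_CACHE;           /* cached for every supported page size */
}

/* Run-time test made on the generic path, using the limit selected at boot. */
static enum target dispatch_runtime(size_t size, size_t boot_limit)
{
	return size > boot_limit ? TARGET_LARGE : TARGET_CACHE;
}
```

Only constant sizes between the two limits pay for the run-time test. A
constant 32K request, for example, would land in a cache on a 64K boot and go
to the large path on a 4K boot.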
> I just don't believe we would waste so much memory with caches for the extra
> sizes between 16K and 128K, so I would do that suggestion only if
> proven wrong. But I wouldn't mind it that much if you chose it right away.
> The solution earlier in this thread to patch __kmalloc_index() would be
> worse than either of those two alternatives though.
* Re: [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant
2024-11-26 15:27 ` Vlastimil Babka
@ 2024-11-26 15:33 ` Ryan Roberts
0 siblings, 0 replies; 196+ messages in thread
From: Ryan Roberts @ 2024-11-26 15:33 UTC (permalink / raw)
To: Vlastimil Babka, Dave Kleikamp, Andrew Morton, Christoph Lameter,
David Rientjes, Hyeonggon Yoo
Cc: linux-arm-kernel, linux-kernel, linux-mm
On 26/11/2024 15:27, Vlastimil Babka wrote:
> On 11/26/24 16:09, Vlastimil Babka wrote:
>> On 11/26/24 15:53, Ryan Roberts wrote:
>>> On 26/11/2024 12:36, Vlastimil Babka wrote:
>>>> On 11/26/24 13:18, Ryan Roberts wrote:
>>>>> On 14/11/2024 10:09, Vlastimil Babka wrote:
>>>>>> On 11/1/24 21:16, Dave Kleikamp wrote:
>>>>>>> When boot-time page size is enabled, the test against KMALLOC_MAX_CACHE_SIZE
>>>>>>> is no longer optimized out with a constant size, so a build bug may
>>>>>>> occur on a path that won't be reached.
>>>>>>
>>>>>> That's rather unfortunate, the __builtin_constant_p(size) part of
>>>>>> kmalloc_noprof() really expects things to resolve at compile time and it
>>>>>> would be better to keep it that way.
>>>>>>
>>>>>> I think it would be better if we based KMALLOC_MAX_CACHE_SIZE itself on
>>>>>> PAGE_SHIFT_MAX and kept it constant, instead of introducing
>>>>>> KMALLOC_SHIFT_HIGH_MAX only for some sanity checks.
>>>>>>
>>>>>> So if the kernel was built to support 4k to 64k, but booted as 4k, it would
>>>>>> still create and use kmalloc caches up to 128k. SLUB should handle that fine
>>>>>> (if not, please report it :)
>>>>>
>>>>> So when PAGE_SIZE_MAX=64K and PAGE_SIZE=4K, kmalloc will support up to 128K
>>>>> whereas before it only supported up to 8K. I was trying to avoid that since I
>>>>> assumed that would be costly in terms of extra memory allocated for those higher
>>>>> order buckets that will never be used. But I have no idea how SLUB works in
>>>>> practice. Perhaps memory for the cache is only lazily allocated so we won't see
>>>>> an issue in practice?
>>>>
>>>> Yes the e.g. 128k slabs themselves will be lazily allocated. There will be
>>>> some overhead with the management structures (struct kmem_cache etc) but
>>>> much smaller.
>>>> To be completely honest, some extra overhead might come to be when the slabs
>>>> are allocated and later the user frees those allocations. kmalloc_large()
>>>> would return them immediately, while a regular kmem_cache will keep one or
>>>> more per cpu for reuse. But if that becomes a visible problem we can tune
>>>> those caches to discard slabs more aggressively.
>>>
>>> Sorry to keep pushing on this, now that I've actually looked at the code, I feel
>>> I have a slightly better understanding:
>>>
>>> void *kmalloc_noprof(size_t size, gfp_t flags)
>>> {
>>> 	if (__builtin_constant_p(size) && size) {
>>>
>>> 		if (size > KMALLOC_MAX_CACHE_SIZE)
>>> 			return __kmalloc_large_noprof(size, flags); <<< (1)
>>>
>>> 		index = kmalloc_index(size);
>>> 		return __kmalloc_cache_noprof(...); <<< (2)
>>> 	}
>>> 	return __kmalloc_noprof(size, flags); <<< (3)
>>> }
>>>
>>> So if size and KMALLOC_MAX_CACHE_SIZE are constant, we end up with this
>>> resolving either to a call to (1) or (2), decided at compile time. If
>>> KMALLOC_MAX_CACHE_SIZE is not constant, (1), (2) and the runtime conditional
>>> need to be kept in the function.
>>>
>>> But intuitively, I would have guessed that given the choice between the overhead
>>> of keeping that runtime conditional vs keeping per-cpu slab caches for extra
>>> sizes between 16K and 128K, then the runtime conditional would be preferable. I
>>> would guess that quite a bit of memory could get tied up in those caches?
>>>
>>> Why is your preference the opposite? What am I not understanding?
>>
>> +CC more slab people.
>>
>> So the above is an inline function, but constructed in a way that it should,
>> without further inline code, become
>> - a call to __kmalloc_large_noprof() for build-time constant size larger
>> than KMALLOC_MAX_CACHE_SIZE
>> - a call to __kmalloc_cache_noprof() for build-time constant size smaller
>> than KMALLOC_MAX_CACHE_SIZE, where the cache is picked from an array with
>> compile-time calculated index
>> - call to __kmalloc_noprof() for non-constant sizes otherwise
>>
>> If KMALLOC_MAX_CACHE_SIZE stops being build-time constant, the sensible way
>> to handle it would be to #ifdef or otherwise compile out the whole "if
>> __builtin_constant_p(size)" part and just call __kmalloc_noprof() always, so
>> we don't blow the inline paths with a KMALLOC_MAX_CACHE_SIZE check leading
>> to choice between calling __kmalloc_large_noprof() or __kmalloc_cache_noprof().
>
> Or maybe we could have PAGE_SIZE_MAX derived KMALLOC_MAX_CACHE_SIZE_MAX
> behave as the code above currently does with KMALLOC_MAX_CACHE_SIZE, and
> additionally have PAGE_SIZE_MIN derived KMALLOC_MAX_CACHE_SIZE_MIN, where
> build-time-constant size larger than KMALLOC_MAX_CACHE_SIZE_MIN (which is a
> compile-time test) is redirected to __kmalloc_noprof() for a run-time test.
>
> That seems like the optimum solution :)
Yes; that feels like the better approach to me. I'll implement this by default
unless anyone else objects.
>
>> I just don't believe we would waste so much memory with caches for the extra
>> sizes between 16K and 128K, so I would do that suggestion only if
>> proven wrong. But I wouldn't mind it that much if you chose it right away.
>> The solution earlier in this thread to patch __kmalloc_index() would be
>> worse than either of those two alternatives though.
>
>
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-10-17 12:32 ` Ryan Roberts
` (2 preceding siblings ...)
2024-11-11 12:14 ` Petr Tesarik
@ 2024-12-05 17:20 ` Petr Tesarik
2024-12-05 18:52 ` Michael Kelley
3 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-12-05 17:20 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel, linux-kernel, linux-mm
Hi Ryan,
On Thu, 17 Oct 2024 13:32:43 +0100
Ryan Roberts <ryan.roberts@arm.com> wrote:
> On 17/10/2024 13:27, Petr Tesarik wrote:
> > On Mon, 14 Oct 2024 11:55:11 +0100
> > Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> >> [...]
> >> The series is arranged as follows:
> >>
> >> - patch 1: Add macros required for converting non-arch code to support
> >> boot-time page size selection
> >> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> >> non-arch code
> >
> > I have just tried to recompile the openSUSE kernel with these patches
> > applied, and I'm running into this:
> >
> > CC arch/arm64/hyperv/hv_core.o
> > In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> > ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
> > u8 reserved2[PAGE_SIZE - 68];
> > ^~~~~~~~~
> >
> > It looks like one more place which needs a patch, right?
>
> As mentioned in the cover letter, so far I've only converted enough to get the
> defconfig *image* building (i.e. no modules). If you are compiling a different
> config or compiling the modules for defconfig, you will likely run into these
> types of issues.
>
> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> enough to send me.
>
> I understand that Suse might be able to help with wider performance testing - if
> that's the reason you are trying to compile, you could send me your config and
> I'll start working on fixing up other drivers?
This project was de-prioritised for some time, but I have just returned
to it, and one of our test systems uses a Mellanox 5 NIC, which did not build.
If you still have time to work on your patch series, please, can you
look into enabling MLX5_CORE_EN?
Oh, and have you rebased the series to 6.12 yet?
Petr T
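
A note on the build error quoted above: C requires the bound of an array
declared at file scope (struct members included) to be an integer constant
expression, so `u8 reserved2[PAGE_SIZE - 68]` stops compiling once PAGE_SIZE
becomes a boot-time value. Below is a minimal user-space sketch of one way
around this, which is not necessarily what the Hyper-V patches mentioned above
do: drop the padding member and compute the offset at run time. The names
`boot_page_size`, `struct ring_hdr`, `ring_data()` and `ring_padding()` are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page size chosen at boot. */
static size_t boot_page_size = 4096;
#define PAGE_SIZE boot_page_size

/*
 * The original pattern would be rejected here, because the array bound is
 * no longer an integer constant expression:
 *
 *   struct bad { uint8_t control[68]; uint8_t reserved2[PAGE_SIZE - 68]; };
 */

/* Fixed-size control fields only; the padding is no longer a member. */
struct ring_hdr {
	uint8_t control[68];
};

/* The data area starts one page after the header, whatever the page size. */
static inline uint8_t *ring_data(struct ring_hdr *hdr)
{
	return (uint8_t *)hdr + PAGE_SIZE;
}

/* Bytes of padding between the control fields and the data area. */
static inline size_t ring_padding(void)
{
	return PAGE_SIZE - sizeof(struct ring_hdr);
}
```

The trade-off is that sizeof(struct ring_hdr) no longer equals the page size,
so any code relying on that would need to use the helpers instead.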
* RE: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-12-05 17:20 ` Petr Tesarik
@ 2024-12-05 18:52 ` Michael Kelley
2024-12-06 7:50 ` Petr Tesarik
0 siblings, 1 reply; 196+ messages in thread
From: Michael Kelley @ 2024-12-05 18:52 UTC (permalink / raw)
To: Petr Tesarik, Ryan Roberts
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
From: Petr Tesarik <ptesarik@suse.com> Sent: Thursday, December 5, 2024 9:20 AM
>
> Hi Ryan,
>
> On Thu, 17 Oct 2024 13:32:43 +0100
> Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> > On 17/10/2024 13:27, Petr Tesarik wrote:
> > > On Mon, 14 Oct 2024 11:55:11 +0100
> > > Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > >> [...]
> > >> The series is arranged as follows:
> > >>
> > >> - patch 1: Add macros required for converting non-arch code to support
> > >> boot-time page size selection
> > >> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> > >> non-arch code
> > >
> > > I have just tried to recompile the openSUSE kernel with these patches
> > > applied, and I'm running into this:
> > >
> > > CC arch/arm64/hyperv/hv_core.o
> > > In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> > > ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
> > > u8 reserved2[PAGE_SIZE - 68];
> > > ^~~~~~~~~
> > >
> > > It looks like one more place which needs a patch, right?
> >
> > As mentioned in the cover letter, so far I've only converted enough to get the
> > defconfig *image* building (i.e. no modules). If you are compiling a different
> > config or compiling the modules for defconfig, you will likely run into these
> > types of issues.
> >
> > That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> > enough to send me.
> >
> > I understand that Suse might be able to help with wider performance testing - if
> > that's the reason you are trying to compile, you could send me your config and
> > I'll start working on fixing up other drivers?
>
> This project was de-prioritised for some time, but I have just returned
> to it, and one of our test systems uses a Mellanox 5 NIC, which did not build.
>
> If you still have time to work on your patch series, please, can you
> look into enabling MLX5_CORE_EN?
>
> Oh, and have you rebased the series to 6.12 yet?
>
FWIW, here's what I hacked together to compile and run the mlx5 driver in
a Hyper-V VM. This was against a 6.11 kernel code base.
Michael
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index d894a88fa9f2..d0b381df074c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
@@ -66,9 +66,10 @@ struct fw_page {
enum {
MLX5_MAX_RECLAIM_TIME_MILI = 5000,
- MLX5_NUM_4K_IN_PAGE = PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE,
};
+#define MLX5_NUM_4K_IN_PAGE ((int)(PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE))
+
static u32 get_function(u16 func_id, bool ec_function)
{
return (u32)func_id | (ec_function << 16);
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index ba875a619b97..2d39ba77b591 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -255,12 +255,14 @@ enum {
MLX5_NON_FP_BFREGS_PER_UAR,
MLX5_MAX_BFREGS = MLX5_MAX_UARS *
MLX5_NON_FP_BFREGS_PER_UAR,
- MLX5_UARS_IN_PAGE = PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE,
- MLX5_NON_FP_BFREGS_IN_PAGE = MLX5_NON_FP_BFREGS_PER_UAR * MLX5_UARS_IN_PAGE,
MLX5_MIN_DYN_BFREGS = 512,
MLX5_MAX_DYN_BFREGS = 1024,
};
+
+#define MLX5_UARS_IN_PAGE ((int)(PAGE_SIZE / MLX5_ADAPTER_PAGE_SIZE))
+#define MLX5_NON_FP_BFREGS_IN_PAGE ((int)(MLX5_NON_FP_BFREGS_PER_UAR * MLX5_UARS_IN_PAGE))
+
enum {
MLX5_MKEY_MASK_LEN = 1ull << 0,
MLX5_MKEY_MASK_PAGE_SIZE = 1ull << 1,
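
Why the diff above moves these values out of the enums: an enumerator must be
an integer constant expression, which PAGE_SIZE no longer is with boot-time
page size selection, whereas a macro is expanded and evaluated at each point
of use. The following is a minimal user-space sketch of that difference, not
the driver code itself. The variable `page_size`, the helper
`uars_in_page_for()`, and the values 4096 and 2 are assumptions chosen for
illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for a page size selected at boot. */
static unsigned long page_size = 4096;
#define PAGE_SIZE page_size

/* Assumed values, for illustration only. */
#define ADAPTER_PAGE_SIZE	4096UL
#define NON_FP_BFREGS_PER_UAR	2

/*
 * This would fail to compile, because the enumerator value is not an
 * integer constant expression:
 *
 *   enum { UARS_IN_PAGE = PAGE_SIZE / ADAPTER_PAGE_SIZE };
 *
 * A macro defers the division to run time. The (int) cast keeps the type
 * the old enumerators had.
 */
#define UARS_IN_PAGE ((int)(PAGE_SIZE / ADAPTER_PAGE_SIZE))
#define NON_FP_BFREGS_IN_PAGE ((int)(NON_FP_BFREGS_PER_UAR * UARS_IN_PAGE))

/* Simulate booting with a given page size and report UARs per page. */
static inline int uars_in_page_for(unsigned long size)
{
	page_size = size;
	return UARS_IN_PAGE;
}
```

One consequence of this approach is that the macros can no longer be used
where a constant is required, such as case labels, static array bounds, or
static initializers, so each remaining user has to be checked.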
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-12-05 18:52 ` Michael Kelley
@ 2024-12-06 7:50 ` Petr Tesarik
2024-12-06 10:26 ` Ryan Roberts
0 siblings, 1 reply; 196+ messages in thread
From: Petr Tesarik @ 2024-12-06 7:50 UTC (permalink / raw)
To: Michael Kelley
Cc: Ryan Roberts, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
Miroslav Benes, Will Deacon, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
On Thu, 5 Dec 2024 18:52:35 +0000
Michael Kelley <mhklinux@outlook.com> wrote:
> From: Petr Tesarik <ptesarik@suse.com> Sent: Thursday, December 5, 2024 9:20 AM
> >
> > Hi Ryan,
> >
> > On Thu, 17 Oct 2024 13:32:43 +0100
> > Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > > On 17/10/2024 13:27, Petr Tesarik wrote:
> > > > On Mon, 14 Oct 2024 11:55:11 +0100
> > > > Ryan Roberts <ryan.roberts@arm.com> wrote:
> > > >
> > > >> [...]
> > > >> The series is arranged as follows:
> > > >>
> > > >> - patch 1: Add macros required for converting non-arch code to support
> > > >> boot-time page size selection
> > > >> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> > > >> non-arch code
> > > >
> > > > I have just tried to recompile the openSUSE kernel with these patches
> > > > applied, and I'm running into this:
> > > >
> > > > CC arch/arm64/hyperv/hv_core.o
> > > > In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> > > > ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
> > > > u8 reserved2[PAGE_SIZE - 68];
> > > > ^~~~~~~~~
> > > >
> > > > It looks like one more place which needs a patch, right?
> > >
> > > As mentioned in the cover letter, so far I've only converted enough to get the
> > > defconfig *image* building (i.e. no modules). If you are compiling a different
> > > config or compiling the modules for defconfig, you will likely run into these
> > > types of issues.
> > >
> > > That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> > > enough to send me.
> > >
> > > I understand that Suse might be able to help with wider performance testing - if
> > > that's the reason you are trying to compile, you could send me your config and
> > > I'll start working on fixing up other drivers?
> >
> > This project was de-prioritised for some time, but I have just returned
> > to it, and one of our test systems uses a Mellanox 5 NIC, which did not build.
> >
> > If you still have time to work on your patch series, please, can you
> > look into enabling MLX5_CORE_EN?
> >
> > Oh, and have you rebased the series to 6.12 yet?
> >
>
> FWIW, here's what I hacked together to compile and run the mlx5 driver in
> a Hyper-V VM. This was against a 6.11 kernel code base.
Wow! Thank you, Michael. I'll give it a try.
Petr T
* Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-12-06 7:50 ` Petr Tesarik
@ 2024-12-06 10:26 ` Ryan Roberts
2024-12-06 13:05 ` Michael Kelley
0 siblings, 1 reply; 196+ messages in thread
From: Ryan Roberts @ 2024-12-06 10:26 UTC (permalink / raw)
To: Petr Tesarik, Michael Kelley
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
On 06/12/2024 07:50, Petr Tesarik wrote:
> On Thu, 5 Dec 2024 18:52:35 +0000
> Michael Kelley <mhklinux@outlook.com> wrote:
>
>> From: Petr Tesarik <ptesarik@suse.com> Sent: Thursday, December 5, 2024 9:20 AM
>>>
>>> Hi Ryan,
>>>
>>> On Thu, 17 Oct 2024 13:32:43 +0100
>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>>> On 17/10/2024 13:27, Petr Tesarik wrote:
>>>>> On Mon, 14 Oct 2024 11:55:11 +0100
>>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>>> [...]
>>>>>> The series is arranged as follows:
>>>>>>
>>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>>> boot-time page size selection
>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
>>>>>> non-arch code
>>>>>
>>>>> I have just tried to recompile the openSUSE kernel with these patches
>>>>> applied, and I'm running into this:
>>>>>
>>>>> CC arch/arm64/hyperv/hv_core.o
>>>>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
>>>>> ../include/linux/hyperv.h:158:5: error: variably modified ‘reserved2’ at file scope
>>>>> u8 reserved2[PAGE_SIZE - 68];
>>>>> ^~~~~~~~~
>>>>>
>>>>> It looks like one more place which needs a patch, right?
>>>>
>>>> As mentioned in the cover letter, so far I've only converted enough to get the
>>>> defconfig *image* building (i.e. no modules). If you are compiling a different
>>>> config or compiling the modules for defconfig, you will likely run into these
>>>> types of issues.
>>>>
>>>> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
>>>> enough to send me.
>>>>
>>>> I understand that Suse might be able to help with wider performance testing - if
>>>> that's the reason you are trying to compile, you could send me your config and
>>>> I'll start working on fixing up other drivers?
>>>
>>> This project was de-prioritised for some time, but I have just returned
>>> to it, and one of our test systems uses a Mellanox 5 NIC, which did not build.
No problem - I appreciate all the time you have spent on it so far!
>>>
>>> If you still have time to work on your patch series, please, can you
>>> look into enabling MLX5_CORE_EN?
I've also had other things that have been taking up my time. I'm planning to get
back to this series properly after Christmas and convert all the remaining
module code. I'm hoping that Michael's patch will solve your problem for now?
>>>
>>> Oh, and have you rebased the series to 6.12 yet?
Afraid the latest I have at the moment is based on v6.12-rc3. It also includes
all the changes from the review feedback:
https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/boot-time-page-size-v2-wip
>>>
>>
>> FWIW, here's what I hacked together to compile and run the mlx5 driver in
>> a Hyper-V VM. This was against a 6.11 kernel code base.
>
> Wow! Thank you, Michael. I'll give it a try.
Yes, thanks, Michael - I'll take a look at this and integrate into my tree after
Christmas.
Thanks,
Ryan
>
> Petr T
* RE: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
2024-12-06 10:26 ` Ryan Roberts
@ 2024-12-06 13:05 ` Michael Kelley
0 siblings, 0 replies; 196+ messages in thread
From: Michael Kelley @ 2024-12-06 13:05 UTC (permalink / raw)
To: Ryan Roberts, Petr Tesarik
Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Catalin Marinas,
David Hildenbrand, Greg Marsden, Ivan Ivanov, Kalesh Singh,
Marc Zyngier, Mark Rutland, Matthias Brugger, Miroslav Benes,
Will Deacon, linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
From: Ryan Roberts <ryan.roberts@arm.com> Sent: Friday, December 6, 2024 2:26 AM
>
> On 06/12/2024 07:50, Petr Tesarik wrote:
> > On Thu, 5 Dec 2024 18:52:35 +0000
> > Michael Kelley <mhklinux@outlook.com> wrote:
> >
> >> From: Petr Tesarik <ptesarik@suse.com> Sent: Thursday, December 5, 2024 9:20 AM
> >>>
> >>> Hi Ryan,
> >>>
> >>> On Thu, 17 Oct 2024 13:32:43 +0100
> >>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>>> On 17/10/2024 13:27, Petr Tesarik wrote:
> >>>>> On Mon, 14 Oct 2024 11:55:11 +0100
> >>>>> Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>>> [...]
> >>>>>> The series is arranged as follows:
> >>>>>>
> >>>>>> - patch 1: Add macros required for converting non-arch code to support
> >>>>>> boot-time page size selection
> >>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from all
> >>>>>> non-arch code
> >>>>>
> >>>>> I have just tried to recompile the openSUSE kernel with these patches
> >>>>> applied, and I'm running into this:
> >>>>>
> >>>>> CC arch/arm64/hyperv/hv_core.o
> >>>>> In file included from ../arch/arm64/hyperv/hv_core.c:14:0:
> >>>>> ../include/linux/hyperv.h:158:5: error: variably modified 'reserved2' at file scope
> >>>>> u8 reserved2[PAGE_SIZE - 68];
> >>>>> ^~~~~~~~~
> >>>>>
> >>>>> It looks like one more place which needs a patch, right?
> >>>>
> >>>> As mentioned in the cover letter, so far I've only converted enough to get the
> >>>> defconfig *image* building (i.e. no modules). If you are compiling a different
> >>>> config or compiling the modules for defconfig, you will likely run into these
> >>>> types of issues.
> >>>>
> >>>> That said, I do have some patches to fix Hyper-V, which Michael Kelley was kind
> >>>> enough to send me.
> >>>>
> >>>> I understand that Suse might be able to help with wider performance testing - if
> >>>> that's the reason you are trying to compile, you could send me your config and
> >>>> I'll start working on fixing up other drivers?
> >>>
> >>> This project was de-prioritised for some time, but I have just returned
> >>> to it, and one of our test systems uses a Mellanox 5 NIC, which did not build.
>
> No problem - I appreciate all the time you have spent on it so far!
>
> >>>
> >>> If you still have time to work on your patch series, please, can you
> >>> look into enabling MLX5_CORE_EN?
>
> I've also had other things that have been taking up my time. I'm planning to get
> back to this series properly after Christmas and convert all the remaining
> module code. I'm hoping that Michael's patch will solve your problem for now?
>
> >>>
> >>> Oh, and have you rebased the series to 6.12 yet?
>
> Afraid the latest I have at the moment is based on v6.12-rc3. It also includes
> all the changes from the review feedback:
>
> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/boot-time-page-size-v2-wip
>
> >>>
> >>
> >> FWIW, here's what I hacked together to compile and run the mlx5 driver in
> >> a Hyper-V VM. This was against a 6.11 kernel code base.
> >
> > Wow! Thank you, Michael. I'll give it a try.
>
> Yes, thanks, Michael - I'll take a look at this and integrate into my tree after
> Christmas.
>
To be clear, you probably do *not* want to integrate my changes into your tree.
They are a hack, and I'm not sure what approach the mlx driver folks will want to
take. But they made the Mellanox CX-5 NIC functional for my purposes in testing
variable page sizes.
Michael
Thread overview: 196+ messages
2024-10-14 10:55 [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting boot-time page size selection Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 02/57] vmlinux: Align to PAGE_SIZE_MAX Ryan Roberts
2024-10-14 16:50 ` Christoph Lameter (Ampere)
2024-10-15 10:53 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 03/57] mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large Ryan Roberts
2024-10-14 13:00 ` Johannes Weiner
2024-10-14 19:59 ` Shakeel Butt
2024-10-15 10:55 ` Ryan Roberts
2024-10-17 12:21 ` Michal Hocko
2024-10-17 16:09 ` Roman Gushchin
2024-10-14 10:58 ` [RFC PATCH v1 04/57] mm/page_alloc: Make page_frag_cache boot-time page size compatible Ryan Roberts
2024-11-14 8:23 ` Vlastimil Babka
2024-11-14 9:36 ` Ryan Roberts
2024-11-14 9:43 ` Vlastimil Babka
2024-10-14 10:58 ` [RFC PATCH v1 05/57] mm: Avoid split pmd ptl if pmd level is run-time folded Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
2024-10-16 14:37 ` Ryan Roberts
2024-11-01 20:16 ` [RFC PATCH] mm/slab: Avoid build bug for calls to kmalloc with a large constant Dave Kleikamp
2024-11-06 11:44 ` Ryan Roberts
2024-11-06 15:20 ` Dave Kleikamp
2024-11-14 10:09 ` Vlastimil Babka
2024-11-26 12:18 ` Ryan Roberts
2024-11-26 12:36 ` Vlastimil Babka
2024-11-26 14:26 ` Ryan Roberts
2024-11-26 14:53 ` Ryan Roberts
2024-11-26 15:09 ` Vlastimil Babka
2024-11-26 15:27 ` Vlastimil Babka
2024-11-26 15:33 ` Ryan Roberts
2024-11-14 10:17 ` [RFC PATCH v1 06/57] mm: Remove PAGE_SIZE compile-time constant assumption Vlastimil Babka
2024-11-26 10:08 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 07/57] fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 08/57] fs: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 09/57] fs/nfs: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 10/57] fs/ext4: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 11/57] fork: Permit boot-time THREAD_SIZE determination Ryan Roberts
2024-11-14 10:42 ` Vlastimil Babka
2024-10-14 10:58 ` [RFC PATCH v1 12/57] cgroup: Remove PAGE_SIZE compile-time constant assumption Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 13/57] bpf: " Ryan Roberts
2024-10-16 14:38 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 14/57] pm/hibernate: " Ryan Roberts
2024-10-16 14:39 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 15/57] stackdepot: " Ryan Roberts
2024-11-14 11:15 ` Vlastimil Babka
2024-11-26 10:15 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 16/57] perf: " Ryan Roberts
2024-10-16 14:40 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 17/57] kvm: " Ryan Roberts
2024-10-14 21:37 ` Sean Christopherson
2024-10-15 10:57 ` Ryan Roberts
2024-10-16 14:41 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 18/57] trace: " Ryan Roberts
2024-10-14 16:46 ` Steven Rostedt
2024-10-15 11:09 ` Ryan Roberts
2024-10-18 15:24 ` Steven Rostedt
2024-10-14 10:58 ` [RFC PATCH v1 19/57] crash: " Ryan Roberts
2024-10-15 3:47 ` Baoquan He
2024-10-15 11:13 ` Ryan Roberts
2024-10-18 3:00 ` Baoquan He
2024-10-14 10:58 ` [RFC PATCH v1 20/57] crypto: " Ryan Roberts
2024-10-26 6:54 ` Herbert Xu
2024-10-14 10:58 ` [RFC PATCH v1 21/57] sunrpc: " Ryan Roberts
2024-10-16 14:42 ` Ryan Roberts
2024-10-16 14:47 ` Chuck Lever
2024-10-16 14:54 ` Jeff Layton
2024-10-16 15:09 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 22/57] sound: " Ryan Roberts
2024-10-14 11:38 ` Mark Brown
2024-10-14 12:24 ` Ryan Roberts
2024-10-14 12:41 ` Takashi Iwai
2024-10-14 12:52 ` Ryan Roberts
2024-10-14 16:01 ` Mark Brown
2024-10-15 11:35 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 23/57] net: " Ryan Roberts
2024-10-16 14:43 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 24/57] net: fec: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 25/57] net: marvell: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 26/57] net: hns3: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 27/57] net: e1000: " Ryan Roberts
2024-10-16 14:43 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 28/57] net: igbvf: " Ryan Roberts
2024-10-16 14:44 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 29/57] net: igb: " Ryan Roberts
2024-10-16 14:45 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 30/57] drivers/base: " Ryan Roberts
2024-10-16 14:45 ` Ryan Roberts
2024-10-16 15:04 ` Greg Kroah-Hartman
2024-10-16 15:12 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 31/57] edac: " Ryan Roberts
2024-10-16 14:46 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 32/57] optee: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 33/57] random: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 34/57] sata_sil24: " Ryan Roberts
2024-10-17 9:09 ` Niklas Cassel
2024-10-17 12:42 ` Ryan Roberts
2024-10-17 12:51 ` Niklas Cassel
2024-10-21 9:24 ` Ryan Roberts
2024-10-21 11:04 ` Niklas Cassel
2024-10-21 11:26 ` Ryan Roberts
2024-10-21 11:43 ` Niklas Cassel
2024-10-14 10:58 ` [RFC PATCH v1 35/57] virtio: " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 36/57] xen: " Ryan Roberts
2024-10-16 14:46 ` Ryan Roberts
2024-10-23 1:23 ` Stefano Stabellini
2024-10-24 10:32 ` Ryan Roberts
2024-10-25 1:18 ` Stefano Stabellini
2024-10-14 10:58 ` [RFC PATCH v1 37/57] arm64: Fix macros to work in C code in addition to the linker script Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 38/57] arm64: Track early pgtable allocation limit Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 39/57] arm64: Introduce macros required for boot-time page selection Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 40/57] arm64: Refactor early pgtable size calculation macros Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 41/57] arm64: Pass desired page size on command line Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 42/57] arm64: Divorce early init from PAGE_SIZE Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 43/57] arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 44/57] arm64: Align sections to PAGE_SIZE_MAX Ryan Roberts
2024-10-19 14:16 ` Thomas Weißschuh
2024-10-21 11:20 ` Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 45/57] arm64: Rework trampoline rodata mapping Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 46/57] arm64: Generalize fixmap for boot-time page size Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 47/57] arm64: Statically allocate and align for worst-case " Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 48/57] arm64: Convert switch to if for non-const comparison values Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 49/57] arm64: Convert BUILD_BUG_ON to VM_BUG_ON Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 50/57] arm64: Remove PAGE_SZ asm-offset Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 51/57] arm64: Introduce cpu features for page sizes Ryan Roberts
2024-10-14 10:58 ` [RFC PATCH v1 52/57] arm64: Remove PAGE_SIZE from assembly code Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 53/57] arm64: Runtime-fold pmd level Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 54/57] arm64: Support runtime folding in idmap_kpti_install_ng_mappings Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 55/57] arm64: TRAMP_VALIAS is no longer compile-time constant Ryan Roberts
2024-10-14 11:21 ` Ard Biesheuvel
2024-10-14 11:28 ` Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 56/57] arm64: Determine THREAD_SIZE at boot-time Ryan Roberts
2024-10-14 10:59 ` [RFC PATCH v1 57/57] arm64: Enable boot-time page size selection Ryan Roberts
2024-10-15 17:42 ` Zi Yan
2024-10-16 8:14 ` Ryan Roberts
2024-10-16 14:21 ` Zi Yan
2024-10-16 14:31 ` Ryan Roberts
2024-10-16 14:35 ` Zi Yan
2024-10-15 17:52 ` Michael Kelley
2024-10-16 8:17 ` Ryan Roberts
2024-10-14 13:54 ` [RFC PATCH v1 01/57] mm: Add macros ahead of supporting " Pingfan Liu
2024-10-14 14:07 ` Ryan Roberts
2024-10-15 3:04 ` Pingfan Liu
2024-10-15 11:16 ` Ryan Roberts
2024-10-16 14:36 ` Ryan Roberts
2024-10-30 8:45 ` Ryan Roberts
2024-10-14 17:32 ` [RFC PATCH v1 00/57] Boot-time page size selection for arm64 Florian Fainelli
2024-10-15 11:48 ` Ryan Roberts
2024-10-15 18:38 ` Michael Kelley
2024-10-16 8:23 ` Ryan Roberts
2024-10-16 15:16 ` David Hildenbrand
2024-10-16 16:08 ` Ryan Roberts
2024-10-17 12:27 ` Petr Tesarik
2024-10-17 12:32 ` Ryan Roberts
2024-10-18 12:56 ` Petr Tesarik
2024-10-18 14:41 ` Petr Tesarik
2024-10-21 11:47 ` Ryan Roberts
2024-10-23 21:00 ` Thomas Tai
2024-10-24 10:48 ` Ryan Roberts
2024-10-24 11:45 ` Petr Tesarik
2024-10-24 12:10 ` Ryan Roberts
2024-10-30 22:11 ` Sumit Gupta
2024-11-11 12:14 ` Petr Tesarik
2024-11-11 12:25 ` Ryan Roberts
2024-11-12 9:45 ` Petr Tesarik
2024-11-12 10:19 ` Ryan Roberts
2024-11-12 10:50 ` Petr Tesarik
2024-11-13 12:40 ` Petr Tesarik
2024-11-13 12:56 ` Ryan Roberts
2024-11-13 14:22 ` Petr Tesarik
2024-12-05 17:20 ` Petr Tesarik
2024-12-05 18:52 ` Michael Kelley
2024-12-06 7:50 ` Petr Tesarik
2024-12-06 10:26 ` Ryan Roberts
2024-12-06 13:05 ` Michael Kelley
2024-10-17 22:05 ` Dave Kleikamp
2024-10-21 11:49 ` Ryan Roberts
2024-10-18 18:15 ` Joseph Salisbury
2024-10-18 18:27 ` David Hildenbrand
2024-10-18 19:19 ` [External] : " Joseph Salisbury
2024-10-18 19:27 ` David Hildenbrand
2024-10-18 20:06 ` Joseph Salisbury
2024-10-21 9:55 ` Ryan Roberts
2024-10-19 15:47 ` Neal Gompa
2024-10-21 11:02 ` Ryan Roberts
2024-10-21 11:32 ` Eric Curtin
2024-10-21 11:51 ` Ryan Roberts
2024-10-21 13:49 ` Neal Gompa
2024-10-21 15:01 ` Ryan Roberts
2024-10-22 9:33 ` Neal Gompa
2024-10-22 15:03 ` Nick Chan
2024-10-22 15:12 ` Ryan Roberts
2024-10-22 17:30 ` Neal Gompa
2024-10-24 10:34 ` Ryan Roberts
2024-10-31 21:07 ` Catalin Marinas
2024-11-06 11:37 ` Ryan Roberts
2024-11-07 12:35 ` Catalin Marinas
2024-11-07 12:47 ` Ryan Roberts