linux-mm.kvack.org archive mirror
* [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support
@ 2024-05-14 14:04 Björn Töpel
  2024-05-14 14:04 ` [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries Björn Töpel
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

================================================================
Memory Hot(Un)Plug support (and ZONE_DEVICE) for the RISC-V port
================================================================

Introduction
============

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series adds memory hot(un)plugging, and ZONE_DEVICE support for
the RISC-V Linux port.

I'm sending this series while LSF/MM/BPF is on-going, and with some
luck some MM person can review the series while zoning out on a talk.
;-)

MM configuration
================

RISC-V MM has the following configuration:

 * Memory blocks are 128M, analogous to x86-64. The port uses PMD
   ("hugepage") vmemmap mappings. From that it follows that 2M (one
   PMD) worth of vmemmap describes 32768 pages of 4K each, which gets
   us 128M (see the sketch below).

 * The pageblock size is the minimum virtio_mem size, and on RISC-V
   it's 2M (2^9 * 4K).
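
As a back-of-the-envelope check (a hedged sketch, not from the
series; it assumes sizeof(struct page) == 64 on riscv64):

  | /* One PMD-mapped vmemmap page covers one 128M memory block: */
  | #define VMEMMAP_PMD_SIZE   (2UL << 20)                           /* 2M PMD leaf */
  | #define STRUCT_PAGE_SIZE   64UL                                  /* assumption  */
  | #define PAGES_PER_VMEMMAP  (VMEMMAP_PMD_SIZE / STRUCT_PAGE_SIZE) /* 32768       */
  | #define MEMORY_BLOCK_SIZE  (PAGES_PER_VMEMMAP << 12)             /* 128M        */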

Implementation
==============

The PGD table on RISC-V is shared/copied between all processes. To
avoid doing page table synchronization, the first patch (patch 1)
pre-allocates the PGD entries for the vmemmap/direct map. By doing
that the init_mm PGD will be fixed at kernel init, and
synchronization can be avoided altogether.

The following two patches (patch 2-3) do some preparation, followed
by the actual MHP implementation (patch 4-5). Then, MHP and virtio-mem
are enabled (patch 6-7), and finally ZONE_DEVICE support is added
(patch 8).

MHP and locking
===============

TL;DR: The MHP does not step on any toes, except for ptdump.
Additional locking is required for ptdump.

Long version: For v2 I spent some time digging into init_mm
synchronization/update. Here are my findings, and I'd love them to be
corrected if incorrect.

It's been a gnarly path...

The `init_mm` structure is a special mm (perhaps not a "real" one).
It's a "lazy context" that tracks kernel page table resources, e.g.,
the kernel page table (swapper_pg_dir), a kernel page_table_lock (more
about the usage below), mmap_lock, and such.

`init_mm` does not track/contain any VMAs. Having the `init_mm` is
convenient, so that the regular kernel page table walk/modify
functions can be used.

Now, `init_mm` being special means that the locking for kernel page
tables is special as well.

On RISC-V the PGD (top-level page table structure), similar to x86, is
shared (copied) with user processes. If the kernel PGD is modified, it
has to be synced to the user-mode processes' PGDs. This is avoided by
pre-populating the PGD, so it'll be fixed from boot.

The in-kernel PGD regions are documented in
`Documentation/arch/riscv/vm-layout.rst`.

The distinct regions are:
 * vmemmap
 * vmalloc/ioremap space
 * direct mapping of all physical memory
 * kasan
 * modules, BPF
 * kernel

Memory hotplug is the process of adding/removing memory to/from the
kernel.

Adding is done in two phases:
 1. Add the memory to the kernel
 2. Online the memory, making it available to the page allocator.

Step 1 is partially architecture dependent, and updates the init_mm
page table:
 * Update the direct map page tables. The direct map is a linear map,
   representing all physical memory: `virt = phys + PAGE_OFFSET`
 * Add a `struct page` for each added page of memory, i.e. update the
   vmemmap (the virtual mapping to the `struct page`s, so we can
   easily transform a kernel virtual address to a `struct page *`
   address). See the sketch below.
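
A minimal sketch of the two address transforms (simplified; the
kernel's real helpers are __va()/__pa() and the `vmemmap` base
pointer, which on RISC-V is additionally offset by the first RAM
PFN):

  | /* direct map: virt = phys + PAGE_OFFSET */
  | static inline void *direct_map_va(phys_addr_t pa)
  | {
  |         return (void *)((unsigned long)pa + PAGE_OFFSET);
  | }
  |
  | /* vmemmap: one struct page per PFN, indexed from VMEMMAP_START */
  | static inline struct page *pfn_to_vmemmap(unsigned long pfn)
  | {
  |         return (struct page *)VMEMMAP_START + pfn;
  | }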

From an MHP perspective, there are two regions of the PGD that are
updated:
 * vmemmap
 * direct mapping of all physical memory

The `struct mm_struct` has a couple of locks in play:
 * `spinlock_t page_table_lock` protects the page table, and some
    counters
 * `struct rw_semaphore mmap_lock` protects an mm's VMAs

Note again that `init_mm` does not contain any VMAs, but still uses
the mmap_lock in some places.

The `page_table_lock` was originally used to protect all page
tables, but more recently a split page table lock has been introduced.
The split lock has a per-table lock for the PTE and PMD tables. If
split lock is disabled, all tables are guarded by
`mm->page_table_lock` (for user processes). Split page table locks are
not used for init_mm.

MHP operations are typically synchronized using
`DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock)`.
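
For reference, the reader side of that lock looks roughly like this
(the same pattern patch 5 applies to ptdump):

  | get_online_mems();   /* percpu_down_read(&mem_hotplug_lock) */
  | /* ... walk/dump kernel page tables safely vs. MHP ... */
  | put_online_mems();   /* percpu_up_read(&mem_hotplug_lock) */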

Actors
------

The following non-MHP actors in the kernel traverse (read) and/or
modify the kernel PGD.

 * `ptdump`

   Walks the entire `init_mm`, via `ptdump_walk_pgd()` with the
   `mmap_write_lock(init_mm)` taken.

   Observation: ptdump can race with MHP, and needs additional locking
   to avoid crashes/races.

 * `set_direct_*` / `arch/riscv/mm/pageattr.c`

   The `set_direct_*` functionality is used to "synchronize" the
   direct map with other kernel mappings, e.g. modules/kernel text.
   The direct map uses "as large huge page mappings as possible",
   which means that `set_direct_*` might need to split the direct
   map.

   The `set_direct_*` functions operate with the
   `mmap_write_lock(init_mm)` taken.

   Observation: `set_direct_*` uses the direct map, but will never
   modify the same entry as MHP. If there is a mapping, that entry
   will never race with MHP. Further, MHP acts only when memory is
   offline.

 * HVO / `mm/hugetlb_vmemmap`

   HVO optimizes the backing `struct page` for hugetlb pages, which
   means changing the "vmemmap" region. HVO can split (merge?) a
   vmemmap PMD. However, it will never race with MHP, since HVO only
   operates on online memory. HVO cannot touch memory being MHP added
   or removed.

 * `apply_to_page_range`

   Walks a range, creating page tables where needed, and applies a
   callback (e.g. setting permissions) for each page.

   When creating a table, it might use `int __pte_alloc_kernel(pmd_t
   *pmd)`, which takes the `init_mm.page_table_lock` to synchronize
   PMD population.

   Used by: `mm/vmalloc.c` and `mm/kasan/shadow.c`. The KASAN callback
   takes the `init_mm.page_table_lock` to synchronize pte creation.

   Observation: `apply_to_page_range` applies to the "vmalloc/ioremap
   space" and "kasan" regions. *Not* affected by MHP. A hedged sketch
   of the callback pattern follows after this list.

 * `apply_to_existing_page_range`

   Walks a range, applies a callback (setting permissions) for each
   page (no page table creation).

   Used by: `kernel/bpf/arena.c` and `mm/kasan/shadow.c`. The KASAN
   callback takes the `init_mm.page_table_lock` to synchronize pte
   creation.

   Observation: `apply_to_existing_page_range` applies to the
   "vmalloc/ioremap space" and "kasan" regions. *Not* affected by MHP
   regions.

 * `ioremap_page_range` and `vmap_page_range`

   Both use the same internal function, and might create table entries
   in the "vmalloc/ioremap space" region. They can call
   `__pte_alloc_kernel()`, which takes the `init_mm.page_table_lock`,
   synchronizing PMD population in the region. *Not* affected by MHP
   regions.
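
A hedged sketch of the `apply_to_page_range()` callback pattern
mentioned above (modeled on the KASAN user; the callback name and the
prot handling are illustrative, not taken from the tree):

  | static int set_page_prot_cb(pte_t *ptep, unsigned long addr, void *data)
  | {
  |         pgprot_t prot = *(pgprot_t *)data;
  |
  |         set_pte_at(&init_mm, addr, ptep, pfn_pte(pte_pfn(ptep_get(ptep)), prot));
  |         return 0;
  | }
  |
  | /* Creates missing page tables, then invokes the callback per PTE: */
  | err = apply_to_page_range(&init_mm, start, size, set_page_prot_cb, &prot);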

Summary:
  * MHP add will never modify the same page table entries as any of
    the other actors.
  * MHP remove is done when memory is offlined, and will not clash
    with any of the actors.
  * Functions that walk the entire kernel page table need
    synchronization.
  * It's sufficient to add the MHP lock to ptdump.

Testing
=======

This series adds basic DT-based hotplug support. There is a QEMU patch
enabling MHP for the RISC-V "virt" machine here: [1]

ACPI/MSI support is still in the making for RISC-V, and before proper
(ACPI) PCI MSI support [2] and NUMA SRAT support [3] land, it is hard
to try it out.

I've prepared a QEMU branch with proper ACPI GED/PC-DIMM support [4],
and this series with the required prerequisites [5] (AIA, ACPI AIA
MADT, ACPI NUMA SRAT).


To test with virtio-mem, e.g.:
  | qemu-system-riscv64 \
  |     -machine virt,aia=aplic-imsic \
  |     -cpu rv64,v=true,vlen=256,elen=64,h=true,zbkb=on,zbkc=on,zbkx=on,zkr=on,zkt=on,svinval=on,svnapot=on,svpbmt=on \
  |     -nodefaults \
  |     -nographic -smp 8 -kernel rv64-u-boot.bin \
  |     -drive file=rootfs.img,format=raw,if=virtio \
  |     -device virtio-rng-pci \
  |     -m 16G,slots=3,maxmem=32G \
  |     -object memory-backend-ram,id=mem0,size=16G \
  |     -numa node,nodeid=0,memdev=mem0 \
  |     -serial chardev:char0 \
  |     -mon chardev=char0,mode=readline \
  |     -chardev stdio,mux=on,id=char0 \
  |     -device pci-serial,id=serial0,chardev=char0 \
  |     -object memory-backend-ram,id=vmem0,size=2G \
  |     -device virtio-mem-pci,id=vm0,memdev=vmem0,node=0

In the QEMU monitor:
  | (qemu) info memory-devices
  | (qemu) qom-set vm0 requested-size 1G

...to test DAX/KMEM, use the following QEMU parameters:
  |  -object memory-backend-file,id=mem1,share=on,mem-path=virtio_pmem.img,size=4G \
  |  -device virtio-pmem-pci,memdev=mem1,id=nv1

and the regular ndctl/daxctl dance.
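
A hedged example of that dance (device/namespace names depend on your
setup and ndctl/daxctl versions):
  | ndctl create-namespace --mode=devdax
  | daxctl reconfigure-device --mode=system-ram dax0.0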

If you're brave enough to try the ACPI branch, add "acpi=on" to
"-machine virt", and test PC-DIMM MHP (in addition to virtio-{p,}mem):

In the QEMU monitor:
  | (qemu) object_add memory-backend-ram,id=mem1,size=1G
  | (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

where "rv64-u-boot.bin" is U-boot with EFI/ACPI-support (use [6] if
you're lazy).

Remove "acpi=on" to run with DT.

Thanks to Alex, David, and Andrew for all comments/tests/fixups.

Changelog
=========

v1->v2:
 * Reviewed a lot of MHP locking scenarios
 * Fixed various config build issues (e.g. building with !CONFIG_MEMORY_HOTPLUG) (Andrew)
 * Added arch_get_mappable_range() implementation
 * Acquire MHP lock for ptdump, analogous to arm64
 * ACPI MHP tests
 * Add ZONE_DEVICE patch

References
==========

[1] https://lore.kernel.org/qemu-devel/20240514110615.399065-1-bjorn@kernel.org/
[2] https://lore.kernel.org/linux-riscv/20240501121742.1215792-1-sunilvl@ventanamicro.com/
[3] https://lore.kernel.org/linux-riscv/cover.1713778236.git.haibo1.xu@intel.com/
[4] https://github.com/bjoto/qemu/commits/virtio-mem-pc-dimm-mhp-acpi/
[5] https://github.com/bjoto/linux/commits/mhp-v2-next-acpi/
[6] https://github.com/bjoto/riscv-rootfs-utils/tree/acpi


Thanks,
Björn


Björn Töpel (8):
  riscv: mm: Pre-allocate vmemmap/direct map PGD entries
  riscv: mm: Change attribute from __init to __meminit for page
    functions
  riscv: mm: Refactor create_linear_mapping_range() for memory hot add
  riscv: mm: Add memory hotplugging support
  riscv: mm: Take memory hotplug read-lock during kernel page table dump
  riscv: Enable memory hotplugging for RISC-V
  virtio-mem: Enable virtio-mem for RISC-V
  riscv: mm: Add support for ZONE_DEVICE

 arch/riscv/Kconfig                    |   3 +
 arch/riscv/include/asm/kasan.h        |   4 +-
 arch/riscv/include/asm/mmu.h          |   4 +-
 arch/riscv/include/asm/pgtable-64.h   |  20 ++
 arch/riscv/include/asm/pgtable-bits.h |   1 +
 arch/riscv/include/asm/pgtable.h      |  17 +-
 arch/riscv/mm/init.c                  | 320 ++++++++++++++++++++++----
 arch/riscv/mm/ptdump.c                |   3 +
 drivers/virtio/Kconfig                |   2 +-
 9 files changed, 328 insertions(+), 46 deletions(-)


base-commit: 0a16a172879012c42f55ae8c2883e17c1e4e388f
-- 
2.40.1




* [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 15:57   ` Alexandre Ghiti
  2024-05-14 14:04 ` [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
all userland page tables, which means that if the PGD level table is
changed, other page tables have to be updated as well.

Instead of having the PGD changes ripple out to all tables, the
synchronization can be avoided by pre-allocating the PGD entries/pages
at boot, avoiding the synchronization altogether.

This is currently done for the bpf/modules, and vmalloc PGD regions.
Extend this scheme for the PGD regions touched by memory hotplugging.

Prepare the RISC-V port for memory hotplug by pre-allocating
vmemmap/direct map entries at the PGD level. This will waste roughly
128 4K pages (~512K) when memory hotplugging is enabled in the
kernel configuration.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/include/asm/kasan.h | 4 ++--
 arch/riscv/mm/init.c           | 7 +++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 0b85e363e778..e6a0071bdb56 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -6,8 +6,6 @@
 
 #ifndef __ASSEMBLY__
 
-#ifdef CONFIG_KASAN
-
 /*
  * The following comment was copied from arm64:
  * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
@@ -34,6 +32,8 @@
  */
 #define KASAN_SHADOW_START	((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
 #define KASAN_SHADOW_END	MODULES_LOWEST_VADDR
+
+#ifdef CONFIG_KASAN
 #define KASAN_SHADOW_OFFSET	_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
 
 void kasan_init(void);
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2574f6a3b0e7..5b8cdfafb52a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -27,6 +27,7 @@
 
 #include <asm/fixmap.h>
 #include <asm/io.h>
+#include <asm/kasan.h>
 #include <asm/numa.h>
 #include <asm/pgtable.h>
 #include <asm/sections.h>
@@ -1488,10 +1489,16 @@ static void __init preallocate_pgd_pages_range(unsigned long start, unsigned lon
 	panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
 }
 
+#define PAGE_END KASAN_SHADOW_START
+
 void __init pgtable_cache_init(void)
 {
 	preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
 	if (IS_ENABLED(CONFIG_MODULES))
 		preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, "bpf/modules");
+	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
+		preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
+		preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
+	}
 }
 #endif
-- 
2.40.1




* [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
  2024-05-14 14:04 ` [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 15:59   ` David Hildenbrand
                     ` (2 more replies)
  2024-05-14 14:04 ` [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add Björn Töpel
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

Prepare for memory hotplugging support by changing from __init to
__meminit for the page table functions that are used by the upcoming
architecture specific callbacks.

Changing the __init attribute to __meminit avoids having the
functions removed after init. The __meminit attribute makes sure the
functions are kept in the kernel text post init, but only if memory
hotplugging is enabled for the build.
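
Conceptually (a hedged paraphrase of the __meminit definition and the
linker script behavior, not the verbatim kernel text):

  | #ifdef CONFIG_MEMORY_HOTPLUG
  | # define __meminit            /* kept in kernel text after boot */
  | #else
  | # define __meminit __init     /* freed with the rest of .init.text */
  | #endif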

Also, make sure that the altmap parameter is properly passed on to
vmemmap_populate_hugepages().

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/include/asm/mmu.h     |  4 +--
 arch/riscv/include/asm/pgtable.h |  2 +-
 arch/riscv/mm/init.c             | 58 ++++++++++++++------------------
 3 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
index 60be458e94da..c09c3c79f496 100644
--- a/arch/riscv/include/asm/mmu.h
+++ b/arch/riscv/include/asm/mmu.h
@@ -28,8 +28,8 @@ typedef struct {
 #endif
 } mm_context_t;
 
-void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
-			       phys_addr_t sz, pgprot_t prot);
+void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+				  pgprot_t prot);
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_RISCV_MMU_H */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 58fd7b70b903..7933f493db71 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -162,7 +162,7 @@ struct pt_alloc_ops {
 #endif
 };
 
-extern struct pt_alloc_ops pt_ops __initdata;
+extern struct pt_alloc_ops pt_ops __meminitdata;
 
 #ifdef CONFIG_MMU
 /* Number of PGD entries that a user-mode program can use */
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 5b8cdfafb52a..c969427eab88 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
 }
 
 #ifdef CONFIG_MMU
-struct pt_alloc_ops pt_ops __initdata;
+struct pt_alloc_ops pt_ops __meminitdata;
 
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
@@ -357,7 +357,7 @@ static inline pte_t *__init get_pte_virt_fixmap(phys_addr_t pa)
 	return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
 }
 
-static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
+static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
 {
 	return (pte_t *) __va(pa);
 }
@@ -376,7 +376,7 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t __init alloc_pte_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
@@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
 	return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
-static void __init create_pte_mapping(pte_t *ptep,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	uintptr_t pte_idx = pte_index(va);
 
@@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
 	return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
 }
 
-static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
+static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
 {
 	return (pmd_t *) __va(pa);
 }
@@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t __init alloc_pmd_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
@@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
 	return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
-static void __init create_pmd_mapping(pmd_t *pmdp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pmd_mapping(pmd_t *pmdp,
+					 uintptr_t va, phys_addr_t pa,
+					 phys_addr_t sz, pgprot_t prot)
 {
 	pte_t *ptep;
 	phys_addr_t pte_phys;
@@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
 	return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
 }
 
-static pud_t *__init get_pud_virt_late(phys_addr_t pa)
+static pud_t *__meminit get_pud_virt_late(phys_addr_t pa)
 {
 	return (pud_t *)__va(pa);
 }
@@ -521,7 +520,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t alloc_pud_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pud_late(uintptr_t va)
 {
 	unsigned long vaddr;
 
@@ -541,7 +540,7 @@ static p4d_t *__init get_p4d_virt_fixmap(phys_addr_t pa)
 	return (p4d_t *)set_fixmap_offset(FIX_P4D, pa);
 }
 
-static p4d_t *__init get_p4d_virt_late(phys_addr_t pa)
+static p4d_t *__meminit get_p4d_virt_late(phys_addr_t pa)
 {
 	return (p4d_t *)__va(pa);
 }
@@ -559,7 +558,7 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va)
 	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t alloc_p4d_late(uintptr_t va)
+static phys_addr_t __meminit alloc_p4d_late(uintptr_t va)
 {
 	unsigned long vaddr;
 
@@ -568,9 +567,8 @@ static phys_addr_t alloc_p4d_late(uintptr_t va)
 	return __pa(vaddr);
 }
 
-static void __init create_pud_mapping(pud_t *pudp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	pmd_t *nextp;
 	phys_addr_t next_phys;
@@ -595,9 +593,8 @@ static void __init create_pud_mapping(pud_t *pudp,
 	create_pmd_mapping(nextp, va, pa, sz, prot);
 }
 
-static void __init create_p4d_mapping(p4d_t *p4dp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+static void __meminit create_p4d_mapping(p4d_t *p4dp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
 	pud_t *nextp;
 	phys_addr_t next_phys;
@@ -653,9 +650,8 @@ static void __init create_p4d_mapping(p4d_t *p4dp,
 #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot) do {} while(0)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-void __init create_pgd_mapping(pgd_t *pgdp,
-				      uintptr_t va, phys_addr_t pa,
-				      phys_addr_t sz, pgprot_t prot)
+void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+				  pgprot_t prot)
 {
 	pgd_next_t *nextp;
 	phys_addr_t next_phys;
@@ -680,8 +676,7 @@ void __init create_pgd_mapping(pgd_t *pgdp,
 	create_pgd_next_mapping(nextp, va, pa, sz, prot);
 }
 
-static uintptr_t __init best_map_size(phys_addr_t pa, uintptr_t va,
-				      phys_addr_t size)
+static uintptr_t __meminit best_map_size(phys_addr_t pa, uintptr_t va, phys_addr_t size)
 {
 	if (pgtable_l5_enabled &&
 	    !(pa & (P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= P4D_SIZE)
@@ -714,7 +709,7 @@ asmlinkage void __init __copy_data(void)
 #endif
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
-static __init pgprot_t pgprot_from_va(uintptr_t va)
+static __meminit pgprot_t pgprot_from_va(uintptr_t va)
 {
 	if (is_va_kernel_text(va))
 		return PAGE_KERNEL_READ_EXEC;
@@ -739,7 +734,7 @@ void mark_rodata_ro(void)
 				  set_memory_ro);
 }
 #else
-static __init pgprot_t pgprot_from_va(uintptr_t va)
+static __meminit pgprot_t pgprot_from_va(uintptr_t va)
 {
 	if (IS_ENABLED(CONFIG_64BIT) && !is_kernel_mapping(va))
 		return PAGE_KERNEL;
@@ -1231,9 +1226,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	pt_ops_set_fixmap();
 }
 
-static void __init create_linear_mapping_range(phys_addr_t start,
-					       phys_addr_t end,
-					       uintptr_t fixed_map_size)
+static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
+						  uintptr_t fixed_map_size)
 {
 	phys_addr_t pa;
 	uintptr_t va, map_size;
@@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	 * memory hotplug, we are not able to update all the page tables with
 	 * the new PMDs.
 	 */
-	return vmemmap_populate_hugepages(start, end, node, NULL);
+	return vmemmap_populate_hugepages(start, end, node, altmap);
 }
 #endif
 
-- 
2.40.1




* [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
  2024-05-14 14:04 ` [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries Björn Töpel
  2024-05-14 14:04 ` [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 16:00   ` David Hildenbrand
                     ` (2 more replies)
  2024-05-14 14:04 ` [PATCH v2 4/8] riscv: mm: Add memory hotplugging support Björn Töpel
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

Add a parameter to the direct map setup function, so it can be used in
arch_add_memory() later.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/mm/init.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index c969427eab88..6f72b0b2b854 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 }
 
 static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
-						  uintptr_t fixed_map_size)
+						  uintptr_t fixed_map_size, const pgprot_t *pgprot)
 {
 	phys_addr_t pa;
 	uintptr_t va, map_size;
@@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t
 					    best_map_size(pa, va, end - pa);
 
 		create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
-				   pgprot_from_va(va));
+				   pgprot ? *pgprot : pgprot_from_va(va));
 	}
 }
 
@@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void)
 		if (end >= __pa(PAGE_OFFSET) + memory_limit)
 			end = __pa(PAGE_OFFSET) + memory_limit;
 
-		create_linear_mapping_range(start, end, 0);
+		create_linear_mapping_range(start, end, 0, NULL);
 	}
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
-	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
-	create_linear_mapping_range(krodata_start,
-				    krodata_start + krodata_size, 0);
+	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL);
+	create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL);
 
 	memblock_clear_nomap(ktext_start,  ktext_size);
 	memblock_clear_nomap(krodata_start, krodata_size);
 #endif
 
 #ifdef CONFIG_KFENCE
-	create_linear_mapping_range(kfence_pool,
-				    kfence_pool + KFENCE_POOL_SIZE,
-				    PAGE_SIZE);
+	create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
 
 	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
 #endif
-- 
2.40.1




* [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (2 preceding siblings ...)
  2024-05-14 14:04 ` [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 16:04   ` David Hildenbrand
                     ` (2 more replies)
  2024-05-14 14:04 ` [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump Björn Töpel
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

For an architecture to support memory hotplugging, a couple of
callbacks need to be implemented:

 arch_add_memory()
  This callback is responsible for adding the physical memory into the
  direct map, and calls into the generic memory hotplugging code via
  __add_pages(), which adds the corresponding struct page entries and
  updates the vmemmap mapping.

 arch_remove_memory()
  This is the inverse of the callback above.

 vmemmap_free()
  This function tears down the vmemmap mappings (if
  CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
  backing vmemmap pages. Note that for persistent memory, an
  alternative allocator for the backing pages can be used: the
  vmem_altmap. This means that when the backing pages are cleared,
  extra care is needed so that the correct deallocation method is
  used.

 arch_get_mappable_range()
  This function returns the PA range that the direct map can map.
  Used by the MHP internals for sanity checks.
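
A rough view of how the callbacks fit together on the add path
(hedged sketch; the generic entry point is add_memory_resource() in
mm/memory_hotplug.c):

  | add_memory_resource(nid, res, mhp_flags)
  |   -> arch_add_memory(nid, start, size, &params)
  |        -> create_linear_mapping_range(start, start + size, 0, &params->pgprot)
  |        -> __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params)
  |             -> sparse_add_section() -> ... -> vmemmap_populate()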

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/mm/init.c | 242 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
 	}
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+	pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+	pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+	p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+					   struct vmem_altmap *altmap)
+{
+	if (altmap)
+		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+	else
+		free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pte_t *ptep, pte;
+
+	for (; addr < end; addr = next) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		ptep = pte_base + pte_index(addr);
+		pte = READ_ONCE(*ptep);
+
+		if (!pte_present(pte))
+			continue;
+
+		pte_clear(&init_mm, addr, ptep);
+		if (is_vmemmap)
+			free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
+	}
+}
+
+static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pte_t *pte_base;
+	pmd_t *pmdp, pmd;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_base + pmd_index(addr);
+		pmd = READ_ONCE(*pmdp);
+
+		if (!pmd_present(pmd))
+			continue;
+
+		if (pmd_leaf(pmd)) {
+			pmd_clear(pmdp);
+			if (is_vmemmap)
+				free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, altmap);
+			continue;
+		}
+
+		pte_base = (pte_t *)pmd_page_vaddr(pmd);
+		remove_pte_mapping(pte_base, addr, next, is_vmemmap, altmap);
+		free_pte_table(pte_base, pmdp);
+	}
+}
+
+static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+	pmd_t *pmd_base;
+
+	for (; addr < end; addr = next) {
+		next = pud_addr_end(addr, end);
+		pudp = pud_base + pud_index(addr);
+		pud = READ_ONCE(*pudp);
+
+		if (!pud_present(pud))
+			continue;
+
+		if (pud_leaf(pud)) {
+			if (pgtable_l4_enabled) {
+				pud_clear(pudp);
+				if (is_vmemmap)
+					free_vmemmap_storage(pud_page(pud), PUD_SIZE, altmap);
+			}
+			continue;
+		}
+
+		pmd_base = pmd_offset(pudp, 0);
+		remove_pmd_mapping(pmd_base, addr, next, is_vmemmap, altmap);
+
+		if (pgtable_l4_enabled)
+			free_pmd_table(pmd_base, pudp);
+	}
+}
+
+static void __meminit remove_p4d_mapping(p4d_t *p4d_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	p4d_t *p4dp, p4d;
+	pud_t *pud_base;
+
+	for (; addr < end; addr = next) {
+		next = p4d_addr_end(addr, end);
+		p4dp = p4d_base + p4d_index(addr);
+		p4d = READ_ONCE(*p4dp);
+
+		if (!p4d_present(p4d))
+			continue;
+
+		if (p4d_leaf(p4d)) {
+			if (pgtable_l5_enabled) {
+				p4d_clear(p4dp);
+				if (is_vmemmap)
+					free_vmemmap_storage(p4d_page(p4d), P4D_SIZE, altmap);
+			}
+			continue;
+		}
+
+		pud_base = pud_offset(p4dp, 0);
+		remove_pud_mapping(pud_base, addr, next, is_vmemmap, altmap);
+
+		if (pgtable_l5_enabled)
+			free_pud_table(pud_base, p4dp);
+	}
+}
+
+static void __meminit remove_pgd_mapping(unsigned long va, unsigned long end, bool is_vmemmap,
+					 struct vmem_altmap *altmap)
+{
+	unsigned long addr, next;
+	p4d_t *p4d_base;
+	pgd_t *pgd;
+
+	for (addr = va; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+		pgd = pgd_offset_k(addr);
+
+		if (!pgd_present(*pgd))
+			continue;
+
+		if (pgd_leaf(*pgd))
+			continue;
+
+		p4d_base = p4d_offset(pgd, 0);
+		remove_p4d_mapping(p4d_base, addr, next, is_vmemmap, altmap);
+	}
+
+	flush_tlb_all();
+}
+
+static void __meminit remove_linear_mapping(phys_addr_t start, u64 size)
+{
+	unsigned long va = (unsigned long)__va(start);
+	unsigned long end = (unsigned long)__va(start + size);
+
+	remove_pgd_mapping(va, end, false, NULL);
+}
+
+struct range arch_get_mappable_range(void)
+{
+	struct range mhp_range;
+
+	mhp_range.start = __pa(PAGE_OFFSET);
+	mhp_range.end = __pa(PAGE_END - 1);
+	return mhp_range;
+}
+
+int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
+{
+	int ret;
+
+	create_linear_mapping_range(start, start + size, 0, &params->pgprot);
+	flush_tlb_all();
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
+	if (ret) {
+		remove_linear_mapping(start, size);
+		return ret;
+	}
+
+	max_pfn = PFN_UP(start + size);
+	max_low_pfn = max_pfn;
+	return 0;
+}
+
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+{
+	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
+	remove_linear_mapping(start, size);
+}
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
+{
+	remove_pgd_mapping(start, end, true, altmap);
+}
+#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+#endif /* CONFIG_MEMORY_HOTPLUG */
-- 
2.40.1




* [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (3 preceding siblings ...)
  2024-05-14 14:04 ` [PATCH v2 4/8] riscv: mm: Add memory hotplugging support Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 20:39   ` David Hildenbrand
  2024-05-14 21:03   ` Oscar Salvador
  2024-05-14 14:04 ` [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V Björn Töpel
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

During memory hot remove, the ptdump functionality can end up touching
stale data. Avoid any potential crashes (or worse) by holding the
memory hotplug read-lock while traversing the page table.

This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
Hold memory hotplug lock while walking for kernel page table dump").

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/mm/ptdump.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 1289cc6d3700..9d5f657a251b 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,6 +6,7 @@
 #include <linux/efi.h>
 #include <linux/init.h>
 #include <linux/debugfs.h>
+#include <linux/memory_hotplug.h>
 #include <linux/seq_file.h>
 #include <linux/ptdump.h>
 
@@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
+	get_online_mems();
 	ptdump_walk(m, m->private);
+	put_online_mems();
 
 	return 0;
 }
-- 
2.40.1




* [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (4 preceding siblings ...)
  2024-05-14 14:04 ` [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 18:00   ` Alexandre Ghiti
  2024-05-14 21:06   ` Oscar Salvador
  2024-05-14 14:04 ` [PATCH v2 7/8] virtio-mem: Enable virtio-mem " Björn Töpel
  2024-05-14 14:04 ` [PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE Björn Töpel
  7 siblings, 2 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
RISC-V.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6bec1bce6586..b9398b64bb69 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,6 +16,8 @@ config RISCV
 	select ACPI_REDUCED_HARDWARE_ONLY if ACPI
 	select ARCH_DMA_DEFAULT_COHERENT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
+	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
+	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
 	select ARCH_HAS_BINFMT_FLAT
-- 
2.40.1




* [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (5 preceding siblings ...)
  2024-05-14 14:04 ` [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-14 15:58   ` David Hildenbrand
  2024-05-14 14:04 ` [PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE Björn Töpel
  7 siblings, 1 reply; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

Now that RISC-V has memory hotplugging support, virtio-mem can be used
on the platform.

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 drivers/virtio/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index c17193544268..4e5cebf1b82a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -122,7 +122,7 @@ config VIRTIO_BALLOON
 
 config VIRTIO_MEM
 	tristate "Virtio mem driver"
-	depends on X86_64 || ARM64
+	depends on X86_64 || ARM64 || RISCV
 	depends on VIRTIO
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
-- 
2.40.1




* [PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE
  2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
                   ` (6 preceding siblings ...)
  2024-05-14 14:04 ` [PATCH v2 7/8] virtio-mem: Enable virtio-mem " Björn Töpel
@ 2024-05-14 14:04 ` Björn Töpel
  2024-05-15  7:03   ` Björn Töpel
  7 siblings, 1 reply; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 14:04 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

From: Björn Töpel <bjorn@rivosinc.com>

ZONE_DEVICE pages need DEVMAP PTE support to function
(ARCH_HAS_PTE_DEVMAP). Claim another RSW (reserved for software) bit
in the PTE as the DEVMAP marker, add the corresponding helpers, and
enable ARCH_HAS_PTE_DEVMAP for riscv64.
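
For context, the resulting RSW (reserved for software) PTE bit layout
(a summary of the pgtable-bits.h hunk below, not additional kernel
text):

  | _PAGE_SOFT    = (3 << 8)   /* RSW field: bits 9..8 */
  | _PAGE_SPECIAL = (1 << 8)   /* RSW: 0x1 */
  | _PAGE_DEVMAP  = (1 << 9)   /* RSW: 0x2, claimed by this patch */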

Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
---
 arch/riscv/Kconfig                    |  1 +
 arch/riscv/include/asm/pgtable-64.h   | 20 ++++++++++++++++++++
 arch/riscv/include/asm/pgtable-bits.h |  1 +
 arch/riscv/include/asm/pgtable.h      | 15 +++++++++++++++
 4 files changed, 37 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b9398b64bb69..6d426afdd904 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -36,6 +36,7 @@ config RISCV
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API
 	select ARCH_HAS_PREPARE_SYNC_CORE_CMD
+	select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SET_DIRECT_MAP if MMU
 	select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 221a5c1ee287..c67a9bbfd010 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -400,4 +400,24 @@ static inline struct page *pgd_page(pgd_t pgd)
 #define p4d_offset p4d_offset
 p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pte_devmap(pte_t pte);
+static inline pte_t pmd_pte(pmd_t pmd);
+
+static inline int pmd_devmap(pmd_t pmd)
+{
+	return pte_devmap(pmd_pte(pmd));
+}
+
+static inline int pud_devmap(pud_t pud)
+{
+	return 0;
+}
+
+static inline int pgd_devmap(pgd_t pgd)
+{
+	return 0;
+}
+#endif
+
 #endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index 179bd4afece4..a8f5205cea54 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -19,6 +19,7 @@
 #define _PAGE_SOFT      (3 << 8)    /* Reserved for software */
 
 #define _PAGE_SPECIAL   (1 << 8)    /* RSW: 0x1 */
+#define _PAGE_DEVMAP    (1 << 9)    /* RSW, devmap */
 #define _PAGE_TABLE     _PAGE_PRESENT
 
 /*
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 7933f493db71..216de1db3cd0 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -387,6 +387,11 @@ static inline int pte_special(pte_t pte)
 	return pte_val(pte) & _PAGE_SPECIAL;
 }
 
+static inline int pte_devmap(pte_t pte)
+{
+	return pte_val(pte) & _PAGE_DEVMAP;
+}
+
 /* static inline pte_t pte_rdprotect(pte_t pte) */
 
 static inline pte_t pte_wrprotect(pte_t pte)
@@ -428,6 +433,11 @@ static inline pte_t pte_mkspecial(pte_t pte)
 	return __pte(pte_val(pte) | _PAGE_SPECIAL);
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+	return __pte(pte_val(pte) | _PAGE_DEVMAP);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte;
@@ -711,6 +721,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+	return pte_pmd(pte_mkdevmap(pmd_pte(pmd)));
+}
+
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 				pmd_t *pmdp, pmd_t pmd)
 {
-- 
2.40.1




* Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries
  2024-05-14 14:04 ` [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries Björn Töpel
@ 2024-05-14 15:57   ` Alexandre Ghiti
  2024-05-14 16:33     ` Björn Töpel
  0 siblings, 1 reply; 34+ messages in thread
From: Alexandre Ghiti @ 2024-05-14 15:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

Hi Björn,

On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> From: Björn Töpel <bjorn@rivosinc.com>
>
> The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
> all userland page tables, which means that if the PGD level table is
> changed, other page tables have to be updated as well.
>
> Instead of having the PGD changes ripple out to all tables, the
> synchronization can be avoided by pre-allocating the PGD entries/pages
> at boot, avoiding the synchronization altogether.
>
> This is currently done for the bpf/modules, and vmalloc PGD regions.
> Extend this scheme for the PGD regions touched by memory hotplugging.
>
> Prepare the RISC-V port for memory hotplug by pre-allocating
> vmemmap/direct map entries at the PGD level. This will waste roughly
> 128 4K pages (~512K) when memory hotplugging is enabled in the
> kernel configuration.
>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/include/asm/kasan.h | 4 ++--
>  arch/riscv/mm/init.c           | 7 +++++++
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index 0b85e363e778..e6a0071bdb56 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -6,8 +6,6 @@
>
>  #ifndef __ASSEMBLY__
>
> -#ifdef CONFIG_KASAN
> -
>  /*
>   * The following comment was copied from arm64:
>   * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
> @@ -34,6 +32,8 @@
>   */
>  #define KASAN_SHADOW_START     ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
>  #define KASAN_SHADOW_END       MODULES_LOWEST_VADDR
> +
> +#ifdef CONFIG_KASAN
>  #define KASAN_SHADOW_OFFSET    _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
>  void kasan_init(void);
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 2574f6a3b0e7..5b8cdfafb52a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -27,6 +27,7 @@
>
>  #include <asm/fixmap.h>
>  #include <asm/io.h>
> +#include <asm/kasan.h>
>  #include <asm/numa.h>
>  #include <asm/pgtable.h>
>  #include <asm/sections.h>
> @@ -1488,10 +1489,16 @@ static void __init preallocate_pgd_pages_range(unsigned long start, unsigned lon
>         panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
>  }
>
> +#define PAGE_END KASAN_SHADOW_START
> +
>  void __init pgtable_cache_init(void)
>  {
>         preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
>         if (IS_ENABLED(CONFIG_MODULES))
>                 preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, "bpf/modules");
> +       if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
> +               preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
> +               preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
> +       }
>  }
>  #endif
> --
> 2.40.1
>

As you asked, with
https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexghiti@rivosinc.com/T/#u,
you will be able to remove the usage of KASAN_SHADOW_START.

But anyhow, you can add:

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>

Thanks,

Alex



* Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V
  2024-05-14 14:04 ` [PATCH v2 7/8] virtio-mem: Enable virtio-mem " Björn Töpel
@ 2024-05-14 15:58   ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-05-14 15:58 UTC (permalink / raw)
  To: Björn Töpel, Alexandre Ghiti, Albert Ou, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Now that RISC-V has memory hotplugging support, virtio-mem can be used
> on the platform.
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>   drivers/virtio/Kconfig | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index c17193544268..4e5cebf1b82a 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -122,7 +122,7 @@ config VIRTIO_BALLOON
>   
>   config VIRTIO_MEM
>   	tristate "Virtio mem driver"
> -	depends on X86_64 || ARM64
> +	depends on X86_64 || ARM64 || RISCV
>   	depends on VIRTIO
>   	depends on MEMORY_HOTPLUG
>   	depends on MEMORY_HOTREMOVE


Nice!

Acked-by: David Hildenbrand <david@redhat.com>
-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
  2024-05-14 14:04 ` [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
@ 2024-05-14 15:59   ` David Hildenbrand
  2024-05-14 17:17   ` Alexandre Ghiti
  2024-05-14 20:32   ` Oscar Salvador
  2 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-05-14 15:59 UTC (permalink / raw)
  To: Björn Töpel, Alexandre Ghiti, Albert Ou, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
> 
> Changing the __init attribute to __meminit avoids having the
> functions removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
> 
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
  2024-05-14 14:04 ` [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add Björn Töpel
@ 2024-05-14 16:00   ` David Hildenbrand
  2024-05-14 17:24   ` Alexandre Ghiti
  2024-05-14 20:37   ` Oscar Salvador
  2 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-05-14 16:00 UTC (permalink / raw)
  To: Björn Töpel, Alexandre Ghiti, Albert Ou, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 14:04 ` [PATCH v2 4/8] riscv: mm: Add memory hotplugging support Björn Töpel
@ 2024-05-14 16:04   ` David Hildenbrand
  2024-05-14 16:34     ` Björn Töpel
  2024-05-14 17:49   ` Alexandre Ghiti
  2024-05-14 20:49   ` Oscar Salvador
  2 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2024-05-14 16:04 UTC (permalink / raw)
  To: Björn Töpel, Alexandre Ghiti, Albert Ou, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> For an architecture to support memory hotplugging, a couple of
> callbacks need to be implemented:
> 
>   arch_add_memory()
>    This callback is responsible for adding the physical memory into the
>    direct map, and calls into the generic memory hotplugging code via
>    __add_pages(), which adds the corresponding struct page entries and
>    updates the vmemmap mapping.
> 
>   arch_remove_memory()
>    This is the inverse of the callback above.
> 
>   vmemmap_free()
>    This function tears down the vmemmap mappings (if
>    CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>    backing vmemmap pages. Note that for persistent memory, an
>    alternative allocator for the backing pages can be used: the
>    vmem_altmap. This means that when the backing pages are cleared,
>    extra care is needed so that the correct deallocation method is
>    used.
> 
>   arch_get_mappable_range()
>    This function returns the PA range that the direct map can map.
>    Used by the MHP internals for sanity checks.
> 
> The page table unmap/teardown functions are heavily based on code from
> the x86 tree. The same remove_pgd_mapping() function is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>   arch/riscv/mm/init.c | 242 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 242 insertions(+)
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 6f72b0b2b854..7f0b921a3d3a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
>   	}
>   }
>   #endif
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (!pte_none(*pte))
> +			return;
> +	}
> +
> +	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
> +	pmd_clear(pmd);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (!pmd_none(*pmd))
> +			return;
> +	}
> +
> +	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
> +	pud_clear(pud);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (!pud_none(*pud))
> +			return;
> +	}
> +
> +	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
> +	p4d_clear(p4d);
> +}
> +
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +					   struct vmem_altmap *altmap)
> +{
> +	if (altmap)
> +		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +	else
> +		free_pages((unsigned long)page_address(page), get_order(size));

If you unplug a DIMM that was added during boot (can happen on x86-64, 
can it happen on riscv?), free_pages() would not be sufficient. You'd be 
freeing a PG_reserved page that has to be freed differently.
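
For reference, x86 handles this in free_pagetable(). Roughly (an
untested sketch paraphrased from arch/x86/mm/init_64.c; the
bootmem-info part is only relevant with CONFIG_HAVE_BOOTMEM_INFO_NODE):

	unsigned int nr_pages = 1 << get_order(size);

	if (PageReserved(page)) {
		/* Pages were handed out by memblock during boot. */
		unsigned long magic;

		__ClearPageReserved(page);
		magic = page->index;
		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
			while (nr_pages--)
				put_page_bootmem(page++);
		} else {
			while (nr_pages--)
				free_reserved_page(page++);
		}
	} else {
		free_pages((unsigned long)page_address(page), get_order(size));
	}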

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries
  2024-05-14 15:57   ` Alexandre Ghiti
@ 2024-05-14 16:33     ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 16:33 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

Alexandre Ghiti <alexghiti@rivosinc.com> writes:

> Hi Björn,
>
> On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> From: Björn Töpel <bjorn@rivosinc.com>
>>
>> The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
>> all userland page tables, which means that if the PGD level table is
>> changed, other page tables have to be updated as well.
>>
>> Instead of having the PGD changes ripple out to all tables, the
>> synchronization can be avoided by pre-allocating the PGD entries/pages
>> at boot.
>>
>> This is currently done for the bpf/modules, and vmalloc PGD regions.
>> Extend this scheme for the PGD regions touched by memory hotplugging.
>>
>> Prepare the RISC-V port for memory hotplug by pre-allocating
>> vmemmap/direct map entries at the PGD level. This will waste roughly
>> 128 4K pages when memory hotplugging is enabled in the
>> kernel configuration.
>>
>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>> ---
>>  arch/riscv/include/asm/kasan.h | 4 ++--
>>  arch/riscv/mm/init.c           | 7 +++++++
>>  2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
>> index 0b85e363e778..e6a0071bdb56 100644
>> --- a/arch/riscv/include/asm/kasan.h
>> +++ b/arch/riscv/include/asm/kasan.h
>> @@ -6,8 +6,6 @@
>>
>>  #ifndef __ASSEMBLY__
>>
>> -#ifdef CONFIG_KASAN
>> -
>>  /*
>>   * The following comment was copied from arm64:
>>   * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
>> @@ -34,6 +32,8 @@
>>   */
>>  #define KASAN_SHADOW_START     ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
>>  #define KASAN_SHADOW_END       MODULES_LOWEST_VADDR
>> +
>> +#ifdef CONFIG_KASAN
>>  #define KASAN_SHADOW_OFFSET    _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>>
>>  void kasan_init(void);
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 2574f6a3b0e7..5b8cdfafb52a 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -27,6 +27,7 @@
>>
>>  #include <asm/fixmap.h>
>>  #include <asm/io.h>
>> +#include <asm/kasan.h>
>>  #include <asm/numa.h>
>>  #include <asm/pgtable.h>
>>  #include <asm/sections.h>
>> @@ -1488,10 +1489,16 @@ static void __init preallocate_pgd_pages_range(unsigned long start, unsigned lon
>>         panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
>>  }
>>
>> +#define PAGE_END KASAN_SHADOW_START
>> +
>>  void __init pgtable_cache_init(void)
>>  {
>>         preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
>>         if (IS_ENABLED(CONFIG_MODULES))
>>                 preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, "bpf/modules");
>> +       if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
>> +               preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
>> +               preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
>> +       }
>>  }
>>  #endif
>> --
>> 2.40.1
>>
>
> As you asked, with
> https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexghiti@rivosinc.com/T/#u,
> you will be able to remove the usage of KASAN_SHADOW_START.

Very nice -- consistency! I'll need to respin, so I'll clean this up for
the next version.

> But anyhow, you can add:
>
> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>


Thank you!
Björn


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 16:04   ` David Hildenbrand
@ 2024-05-14 16:34     ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 16:34 UTC (permalink / raw)
  To: David Hildenbrand, Alexandre Ghiti, Albert Ou, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

David Hildenbrand <david@redhat.com> writes:

> On 14.05.24 16:04, Björn Töpel wrote:
>> From: Björn Töpel <bjorn@rivosinc.com>
>> 
>> For an architecture to support memory hotplugging, a couple of
>> callbacks need to be implemented:
>> 
>>   arch_add_memory()
>>    This callback is responsible for adding the physical memory into the
>>    direct map, and calls into the generic memory hotplugging code via
>>    __add_pages(), which adds the corresponding struct page entries and
>>    updates the vmemmap mapping.
>> 
>>   arch_remove_memory()
>>    This is the inverse of the callback above.
>> 
>>   vmemmap_free()
>>    This function tears down the vmemmap mappings (if
>>    CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>>    backing vmemmap pages. Note that for persistent memory, an
>>    alternative allocator for the backing pages can be used: the
>>    vmem_altmap. This means that when the backing pages are cleared,
>>    extra care is needed so that the correct deallocation method is
>>    used.
>> 
>>   arch_get_mappable_range()
>>    This function returns the PA range that the direct map can map.
>>    Used by the MHP internals for sanity checks.
>> 
>> The page table unmap/teardown functions are heavily based on code from
>> the x86 tree. The same remove_pgd_mapping() function is used in both
>> vmemmap_free() and arch_remove_memory(), but in the latter function
>> the backing pages are not removed.
>> 
>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>> ---
>>   arch/riscv/mm/init.c | 242 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 242 insertions(+)
>> 
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 6f72b0b2b854..7f0b921a3d3a 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
>>   	}
>>   }
>>   #endif
>> +
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> +	pte_t *pte;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		pte = pte_start + i;
>> +		if (!pte_none(*pte))
>> +			return;
>> +	}
>> +
>> +	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
>> +	pmd_clear(pmd);
>> +}
>> +
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> +	pmd_t *pmd;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>> +		pmd = pmd_start + i;
>> +		if (!pmd_none(*pmd))
>> +			return;
>> +	}
>> +
>> +	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
>> +	pud_clear(pud);
>> +}
>> +
>> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
>> +{
>> +	pud_t *pud;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>> +		pud = pud_start + i;
>> +		if (!pud_none(*pud))
>> +			return;
>> +	}
>> +
>> +	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
>> +	p4d_clear(p4d);
>> +}
>> +
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +					   struct vmem_altmap *altmap)
>> +{
>> +	if (altmap)
>> +		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +	else
>> +		free_pages((unsigned long)page_address(page), get_order(size));
>
> If you unplug a DIMM that was added during boot (can happen on x86-64, 
> can it happen on riscv?), free_pages() would not be sufficient. You'd be 
> freeing a PG_reserved page that has to be freed differently.

I'd say if it can happen on x86-64, it probably can on RISC-V. I'll look
into this for the next spin!

Thanks for spending time on the series!


Cheers,
Björn


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
  2024-05-14 14:04 ` [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
  2024-05-14 15:59   ` David Hildenbrand
@ 2024-05-14 17:17   ` Alexandre Ghiti
  2024-05-14 17:45     ` Björn Töpel
  2024-05-14 20:32   ` Oscar Salvador
  2 siblings, 1 reply; 34+ messages in thread
From: Alexandre Ghiti @ 2024-05-14 17:17 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> From: Björn Töpel <bjorn@rivosinc.com>
>
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
>
> Changing the __init attribute to __meminit avoids the functions being
> removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
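
(Side note for readers: the __meminit machinery in include/linux/init.h
boils down to the below, IIRC, so the functions stay in the kernel
image exactly when CONFIG_MEMORY_HOTPLUG is set:

	#ifdef CONFIG_MEMORY_HOTPLUG
	#define __meminit
	#define __meminitdata
	#else
	#define __meminit	__init
	#define __meminitdata	__initdata
	#endif
)
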
>
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/include/asm/mmu.h     |  4 +--
>  arch/riscv/include/asm/pgtable.h |  2 +-
>  arch/riscv/mm/init.c             | 58 ++++++++++++++------------------
>  3 files changed, 29 insertions(+), 35 deletions(-)
>
> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
> index 60be458e94da..c09c3c79f496 100644
> --- a/arch/riscv/include/asm/mmu.h
> +++ b/arch/riscv/include/asm/mmu.h
> @@ -28,8 +28,8 @@ typedef struct {
>  #endif
>  } mm_context_t;
>
> -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
> -                              phys_addr_t sz, pgprot_t prot);
> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> +                                 pgprot_t prot);
>  #endif /* __ASSEMBLY__ */
>
>  #endif /* _ASM_RISCV_MMU_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 58fd7b70b903..7933f493db71 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -162,7 +162,7 @@ struct pt_alloc_ops {
>  #endif
>  };
>
> -extern struct pt_alloc_ops pt_ops __initdata;
> +extern struct pt_alloc_ops pt_ops __meminitdata;
>
>  #ifdef CONFIG_MMU
>  /* Number of PGD entries that a user-mode program can use */
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 5b8cdfafb52a..c969427eab88 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
>  }
>
>  #ifdef CONFIG_MMU
> -struct pt_alloc_ops pt_ops __initdata;
> +struct pt_alloc_ops pt_ops __meminitdata;
>
>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>  pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> @@ -357,7 +357,7 @@ static inline pte_t *__init get_pte_virt_fixmap(phys_addr_t pa)
>         return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
>  }
>
> -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
> +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
>  {
>         return (pte_t *) __va(pa);
>  }
> @@ -376,7 +376,7 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va)
>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t __init alloc_pte_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
>  {
>         struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>
> @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
>         return __pa((pte_t *)ptdesc_address(ptdesc));
>  }
>
> -static void __init create_pte_mapping(pte_t *ptep,
> -                                     uintptr_t va, phys_addr_t pa,
> -                                     phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> +                                        pgprot_t prot)
>  {
>         uintptr_t pte_idx = pte_index(va);
>
> @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
>         return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
>  }
>
> -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
> +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
>  {
>         return (pmd_t *) __va(pa);
>  }
> @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t __init alloc_pmd_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
>  {
>         struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>
> @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
>         return __pa((pmd_t *)ptdesc_address(ptdesc));
>  }
>
> -static void __init create_pmd_mapping(pmd_t *pmdp,
> -                                     uintptr_t va, phys_addr_t pa,
> -                                     phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pmd_mapping(pmd_t *pmdp,
> +                                        uintptr_t va, phys_addr_t pa,
> +                                        phys_addr_t sz, pgprot_t prot)
>  {
>         pte_t *ptep;
>         phys_addr_t pte_phys;
> @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
>         return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>  }
>
> -static pud_t *__init get_pud_virt_late(phys_addr_t pa)
> +static pud_t *__meminit get_pud_virt_late(phys_addr_t pa)
>  {
>         return (pud_t *)__va(pa);
>  }
> @@ -521,7 +520,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t alloc_pud_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pud_late(uintptr_t va)
>  {
>         unsigned long vaddr;
>
> @@ -541,7 +540,7 @@ static p4d_t *__init get_p4d_virt_fixmap(phys_addr_t pa)
>         return (p4d_t *)set_fixmap_offset(FIX_P4D, pa);
>  }
>
> -static p4d_t *__init get_p4d_virt_late(phys_addr_t pa)
> +static p4d_t *__meminit get_p4d_virt_late(phys_addr_t pa)
>  {
>         return (p4d_t *)__va(pa);
>  }
> @@ -559,7 +558,7 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va)
>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t alloc_p4d_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_p4d_late(uintptr_t va)
>  {
>         unsigned long vaddr;
>
> @@ -568,9 +567,8 @@ static phys_addr_t alloc_p4d_late(uintptr_t va)
>         return __pa(vaddr);
>  }
>
> -static void __init create_pud_mapping(pud_t *pudp,
> -                                     uintptr_t va, phys_addr_t pa,
> -                                     phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> +                                        pgprot_t prot)
>  {
>         pmd_t *nextp;
>         phys_addr_t next_phys;
> @@ -595,9 +593,8 @@ static void __init create_pud_mapping(pud_t *pudp,
>         create_pmd_mapping(nextp, va, pa, sz, prot);
>  }
>
> -static void __init create_p4d_mapping(p4d_t *p4dp,
> -                                     uintptr_t va, phys_addr_t pa,
> -                                     phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_p4d_mapping(p4d_t *p4dp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> +                                        pgprot_t prot)
>  {
>         pud_t *nextp;
>         phys_addr_t next_phys;
> @@ -653,9 +650,8 @@ static void __init create_p4d_mapping(p4d_t *p4dp,
>  #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot) do {} while(0)
>  #endif /* __PAGETABLE_PMD_FOLDED */
>
> -void __init create_pgd_mapping(pgd_t *pgdp,
> -                                     uintptr_t va, phys_addr_t pa,
> -                                     phys_addr_t sz, pgprot_t prot)
> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> +                                 pgprot_t prot)
>  {
>         pgd_next_t *nextp;
>         phys_addr_t next_phys;
> @@ -680,8 +676,7 @@ void __init create_pgd_mapping(pgd_t *pgdp,
>         create_pgd_next_mapping(nextp, va, pa, sz, prot);
>  }
>
> -static uintptr_t __init best_map_size(phys_addr_t pa, uintptr_t va,
> -                                     phys_addr_t size)
> +static uintptr_t __meminit best_map_size(phys_addr_t pa, uintptr_t va, phys_addr_t size)
>  {
>         if (pgtable_l5_enabled &&
>             !(pa & (P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= P4D_SIZE)
> @@ -714,7 +709,7 @@ asmlinkage void __init __copy_data(void)
>  #endif
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -static __init pgprot_t pgprot_from_va(uintptr_t va)
> +static __meminit pgprot_t pgprot_from_va(uintptr_t va)
>  {
>         if (is_va_kernel_text(va))
>                 return PAGE_KERNEL_READ_EXEC;
> @@ -739,7 +734,7 @@ void mark_rodata_ro(void)
>                                   set_memory_ro);
>  }
>  #else
> -static __init pgprot_t pgprot_from_va(uintptr_t va)
> +static __meminit pgprot_t pgprot_from_va(uintptr_t va)
>  {
>         if (IS_ENABLED(CONFIG_64BIT) && !is_kernel_mapping(va))
>                 return PAGE_KERNEL;
> @@ -1231,9 +1226,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>         pt_ops_set_fixmap();
>  }
>
> -static void __init create_linear_mapping_range(phys_addr_t start,
> -                                              phys_addr_t end,
> -                                              uintptr_t fixed_map_size)
> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> +                                                 uintptr_t fixed_map_size)
>  {
>         phys_addr_t pa;
>         uintptr_t va, map_size;
> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>          * memory hotplug, we are not able to update all the page tables with
>          * the new PMDs.
>          */
> -       return vmemmap_populate_hugepages(start, end, node, NULL);
> +       return vmemmap_populate_hugepages(start, end, node, altmap);

Is this a fix? Does this deserve to be split into another patch then?

>  }
>  #endif
>
> --
> 2.40.1
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
  2024-05-14 14:04 ` [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add Björn Töpel
  2024-05-14 16:00   ` David Hildenbrand
@ 2024-05-14 17:24   ` Alexandre Ghiti
  2024-05-14 20:37   ` Oscar Salvador
  2 siblings, 0 replies; 34+ messages in thread
From: Alexandre Ghiti @ 2024-05-14 17:24 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> From: Björn Töpel <bjorn@rivosinc.com>
>
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/mm/init.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>
>  static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> -                                                 uintptr_t fixed_map_size)
> +                                                 uintptr_t fixed_map_size, const pgprot_t *pgprot)
>  {
>         phys_addr_t pa;
>         uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t
>                                             best_map_size(pa, va, end - pa);
>
>                 create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -                                  pgprot_from_va(va));
> +                                  pgprot ? *pgprot : pgprot_from_va(va));
>         }
>  }
>
> @@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void)
>                 if (end >= __pa(PAGE_OFFSET) + memory_limit)
>                         end = __pa(PAGE_OFFSET) + memory_limit;
>
> -               create_linear_mapping_range(start, end, 0);
> +               create_linear_mapping_range(start, end, 0, NULL);
>         }
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -       create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> -       create_linear_mapping_range(krodata_start,
> -                                   krodata_start + krodata_size, 0);
> +       create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL);
> +       create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL);
>
>         memblock_clear_nomap(ktext_start,  ktext_size);
>         memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>
>  #ifdef CONFIG_KFENCE
> -       create_linear_mapping_range(kfence_pool,
> -                                   kfence_pool + KFENCE_POOL_SIZE,
> -                                   PAGE_SIZE);
> +       create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>
>         memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> --
> 2.40.1
>

You can add:

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>

Thanks,

Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
  2024-05-14 17:17   ` Alexandre Ghiti
@ 2024-05-14 17:45     ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 17:45 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

Alexandre Ghiti <alexghiti@rivosinc.com> writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> From: Björn Töpel <bjorn@rivosinc.com>
>>
>> Prepare for memory hotplugging support by changing from __init to
>> __meminit for the page table functions that are used by the upcoming
>> architecture specific callbacks.
>>
>> Changing the __init attribute to __meminit avoids the functions being
>> removed after init. The __meminit attribute makes sure the
>> functions are kept in the kernel text post init, but only if memory
>> hotplugging is enabled for the build.
>>
>> Also, make sure that the altmap parameter is properly passed on to
>> vmemmap_populate_hugepages().
>>
>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>> ---
>>  arch/riscv/include/asm/mmu.h     |  4 +--
>>  arch/riscv/include/asm/pgtable.h |  2 +-
>>  arch/riscv/mm/init.c             | 58 ++++++++++++++------------------
>>  3 files changed, 29 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
>> index 60be458e94da..c09c3c79f496 100644
>> --- a/arch/riscv/include/asm/mmu.h
>> +++ b/arch/riscv/include/asm/mmu.h
>> @@ -28,8 +28,8 @@ typedef struct {
>>  #endif
>>  } mm_context_t;
>>
>> -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
>> -                              phys_addr_t sz, pgprot_t prot);
>> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> +                                 pgprot_t prot);
>>  #endif /* __ASSEMBLY__ */
>>
>>  #endif /* _ASM_RISCV_MMU_H */
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 58fd7b70b903..7933f493db71 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -162,7 +162,7 @@ struct pt_alloc_ops {
>>  #endif
>>  };
>>
>> -extern struct pt_alloc_ops pt_ops __initdata;
>> +extern struct pt_alloc_ops pt_ops __meminitdata;
>>
>>  #ifdef CONFIG_MMU
>>  /* Number of PGD entries that a user-mode program can use */
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 5b8cdfafb52a..c969427eab88 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
>>  }
>>
>>  #ifdef CONFIG_MMU
>> -struct pt_alloc_ops pt_ops __initdata;
>> +struct pt_alloc_ops pt_ops __meminitdata;
>>
>>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>>  pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>> @@ -357,7 +357,7 @@ static inline pte_t *__init get_pte_virt_fixmap(phys_addr_t pa)
>>         return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
>>  }
>>
>> -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
>> +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
>>  {
>>         return (pte_t *) __va(pa);
>>  }
>> @@ -376,7 +376,7 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va)
>>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>  }
>>
>> -static phys_addr_t __init alloc_pte_late(uintptr_t va)
>> +static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
>>  {
>>         struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>>
>> @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
>>         return __pa((pte_t *)ptdesc_address(ptdesc));
>>  }
>>
>> -static void __init create_pte_mapping(pte_t *ptep,
>> -                                     uintptr_t va, phys_addr_t pa,
>> -                                     phys_addr_t sz, pgprot_t prot)
>> +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> +                                        pgprot_t prot)
>>  {
>>         uintptr_t pte_idx = pte_index(va);
>>
>> @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
>>         return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
>>  }
>>
>> -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
>> +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
>>  {
>>         return (pmd_t *) __va(pa);
>>  }
>> @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
>>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>  }
>>
>> -static phys_addr_t __init alloc_pmd_late(uintptr_t va)
>> +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
>>  {
>>         struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>>
>> @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
>>         return __pa((pmd_t *)ptdesc_address(ptdesc));
>>  }
>>
>> -static void __init create_pmd_mapping(pmd_t *pmdp,
>> -                                     uintptr_t va, phys_addr_t pa,
>> -                                     phys_addr_t sz, pgprot_t prot)
>> +static void __meminit create_pmd_mapping(pmd_t *pmdp,
>> +                                        uintptr_t va, phys_addr_t pa,
>> +                                        phys_addr_t sz, pgprot_t prot)
>>  {
>>         pte_t *ptep;
>>         phys_addr_t pte_phys;
>> @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
>>         return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>>  }
>>
>> -static pud_t *__init get_pud_virt_late(phys_addr_t pa)
>> +static pud_t *__meminit get_pud_virt_late(phys_addr_t pa)
>>  {
>>         return (pud_t *)__va(pa);
>>  }
>> @@ -521,7 +520,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
>>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>  }
>>
>> -static phys_addr_t alloc_pud_late(uintptr_t va)
>> +static phys_addr_t __meminit alloc_pud_late(uintptr_t va)
>>  {
>>         unsigned long vaddr;
>>
>> @@ -541,7 +540,7 @@ static p4d_t *__init get_p4d_virt_fixmap(phys_addr_t pa)
>>         return (p4d_t *)set_fixmap_offset(FIX_P4D, pa);
>>  }
>>
>> -static p4d_t *__init get_p4d_virt_late(phys_addr_t pa)
>> +static p4d_t *__meminit get_p4d_virt_late(phys_addr_t pa)
>>  {
>>         return (p4d_t *)__va(pa);
>>  }
>> @@ -559,7 +558,7 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va)
>>         return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>  }
>>
>> -static phys_addr_t alloc_p4d_late(uintptr_t va)
>> +static phys_addr_t __meminit alloc_p4d_late(uintptr_t va)
>>  {
>>         unsigned long vaddr;
>>
>> @@ -568,9 +567,8 @@ static phys_addr_t alloc_p4d_late(uintptr_t va)
>>         return __pa(vaddr);
>>  }
>>
>> -static void __init create_pud_mapping(pud_t *pudp,
>> -                                     uintptr_t va, phys_addr_t pa,
>> -                                     phys_addr_t sz, pgprot_t prot)
>> +static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> +                                        pgprot_t prot)
>>  {
>>         pmd_t *nextp;
>>         phys_addr_t next_phys;
>> @@ -595,9 +593,8 @@ static void __init create_pud_mapping(pud_t *pudp,
>>         create_pmd_mapping(nextp, va, pa, sz, prot);
>>  }
>>
>> -static void __init create_p4d_mapping(p4d_t *p4dp,
>> -                                     uintptr_t va, phys_addr_t pa,
>> -                                     phys_addr_t sz, pgprot_t prot)
>> +static void __meminit create_p4d_mapping(p4d_t *p4dp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> +                                        pgprot_t prot)
>>  {
>>         pud_t *nextp;
>>         phys_addr_t next_phys;
>> @@ -653,9 +650,8 @@ static void __init create_p4d_mapping(p4d_t *p4dp,
>>  #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot) do {} while(0)
>>  #endif /* __PAGETABLE_PMD_FOLDED */
>>
>> -void __init create_pgd_mapping(pgd_t *pgdp,
>> -                                     uintptr_t va, phys_addr_t pa,
>> -                                     phys_addr_t sz, pgprot_t prot)
>> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> +                                 pgprot_t prot)
>>  {
>>         pgd_next_t *nextp;
>>         phys_addr_t next_phys;
>> @@ -680,8 +676,7 @@ void __init create_pgd_mapping(pgd_t *pgdp,
>>         create_pgd_next_mapping(nextp, va, pa, sz, prot);
>>  }
>>
>> -static uintptr_t __init best_map_size(phys_addr_t pa, uintptr_t va,
>> -                                     phys_addr_t size)
>> +static uintptr_t __meminit best_map_size(phys_addr_t pa, uintptr_t va, phys_addr_t size)
>>  {
>>         if (pgtable_l5_enabled &&
>>             !(pa & (P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= P4D_SIZE)
>> @@ -714,7 +709,7 @@ asmlinkage void __init __copy_data(void)
>>  #endif
>>
>>  #ifdef CONFIG_STRICT_KERNEL_RWX
>> -static __init pgprot_t pgprot_from_va(uintptr_t va)
>> +static __meminit pgprot_t pgprot_from_va(uintptr_t va)
>>  {
>>         if (is_va_kernel_text(va))
>>                 return PAGE_KERNEL_READ_EXEC;
>> @@ -739,7 +734,7 @@ void mark_rodata_ro(void)
>>                                   set_memory_ro);
>>  }
>>  #else
>> -static __init pgprot_t pgprot_from_va(uintptr_t va)
>> +static __meminit pgprot_t pgprot_from_va(uintptr_t va)
>>  {
>>         if (IS_ENABLED(CONFIG_64BIT) && !is_kernel_mapping(va))
>>                 return PAGE_KERNEL;
>> @@ -1231,9 +1226,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>         pt_ops_set_fixmap();
>>  }
>>
>> -static void __init create_linear_mapping_range(phys_addr_t start,
>> -                                              phys_addr_t end,
>> -                                              uintptr_t fixed_map_size)
>> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
>> +                                                 uintptr_t fixed_map_size)
>>  {
>>         phys_addr_t pa;
>>         uintptr_t va, map_size;
>> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>          * memory hotplug, we are not able to update all the page tables with
>>          * the new PMDs.
>>          */
>> -       return vmemmap_populate_hugepages(start, end, node, NULL);
>> +       return vmemmap_populate_hugepages(start, end, node, altmap);
>
> Is this a fix? Does this deserve to be split into another patch then?

It's enablement. The altmap can't be used unless there's ZONE_DEVICE
support AFAIU.
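
To be concrete: vmemmap_populate_hugepages() just hands the altmap down
to the vmemmap allocator, which does roughly this (paraphrasing
mm/sparse-vmemmap.c):

	void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
						 struct vmem_altmap *altmap)
	{
		if (altmap)
			/* Carve the backing pages out of the hot-added
			 * (device) range itself. */
			return altmap_alloc_block_buf(size, altmap);

		return vmemmap_alloc_block(size, node);
	}

So passing the altmap through is a NOP until the ZONE_DEVICE patch
makes someone actually provide one.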


Björn


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 14:04 ` [PATCH v2 4/8] riscv: mm: Add memory hotplugging support Björn Töpel
  2024-05-14 16:04   ` David Hildenbrand
@ 2024-05-14 17:49   ` Alexandre Ghiti
  2024-05-14 19:31     ` Björn Töpel
  2024-05-14 20:49   ` Oscar Salvador
  2 siblings, 1 reply; 34+ messages in thread
From: Alexandre Ghiti @ 2024-05-14 17:49 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> From: Björn Töpel <bjorn@rivosinc.com>
>
> For an architecture to support memory hotplugging, a couple of
> callbacks need to be implemented:
>
>  arch_add_memory()
>   This callback is responsible for adding the physical memory into the
>   direct map, and calls into the generic memory hotplugging code via
>   __add_pages(), which adds the corresponding struct page entries and
>   updates the vmemmap mapping.
>
>  arch_remove_memory()
>   This is the inverse of the callback above.
>
>  vmemmap_free()
>   This function tears down the vmemmap mappings (if
>   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>   backing vmemmap pages. Note that for persistent memory, an
>   alternative allocator for the backing pages can be used: the
>   vmem_altmap. This means that when the backing pages are cleared,
>   extra care is needed so that the correct deallocation method is
>   used.
>
>  arch_get_mappable_range()
>   This function returns the PA range that the direct map can map.
>   Used by the MHP internals for sanity checks.
>
> The page table unmap/teardown functions are heavily based on code from
> the x86 tree. The same remove_pgd_mapping() function is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/mm/init.c | 242 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 242 insertions(+)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 6f72b0b2b854..7f0b921a3d3a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
>         }
>  }
>  #endif
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +       pte_t *pte;
> +       int i;
> +
> +       for (i = 0; i < PTRS_PER_PTE; i++) {
> +               pte = pte_start + i;
> +               if (!pte_none(*pte))
> +                       return;
> +       }
> +
> +       free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
> +       pmd_clear(pmd);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +       pmd_t *pmd;
> +       int i;
> +
> +       for (i = 0; i < PTRS_PER_PMD; i++) {
> +               pmd = pmd_start + i;
> +               if (!pmd_none(*pmd))
> +                       return;
> +       }
> +
> +       free_pages((unsigned long)page_address(pud_page(*pud)), 0);
> +       pud_clear(pud);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +       pud_t *pud;
> +       int i;
> +
> +       for (i = 0; i < PTRS_PER_PUD; i++) {
> +               pud = pud_start + i;
> +               if (!pud_none(*pud))
> +                       return;
> +       }
> +
> +       free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
> +       p4d_clear(p4d);
> +}
> +
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +                                          struct vmem_altmap *altmap)
> +{
> +       if (altmap)
> +               vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +       else
> +               free_pages((unsigned long)page_address(page), get_order(size));
> +}
> +
> +static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, unsigned long end,
> +                                        bool is_vmemmap, struct vmem_altmap *altmap)
> +{
> +       unsigned long next;
> +       pte_t *ptep, pte;
> +
> +       for (; addr < end; addr = next) {
> +               next = (addr + PAGE_SIZE) & PAGE_MASK;
> +               if (next > end)
> +                       next = end;
> +
> +               ptep = pte_base + pte_index(addr);
> +               pte = READ_ONCE(*ptep);
> +
> +               if (!pte_present(pte))
> +                       continue;
> +
> +               pte_clear(&init_mm, addr, ptep);
> +               if (is_vmemmap)
> +                       free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
> +       }
> +}
> +
> +static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, unsigned long end,
> +                                        bool is_vmemmap, struct vmem_altmap *altmap)
> +{
> +       unsigned long next;
> +       pte_t *pte_base;
> +       pmd_t *pmdp, pmd;
> +
> +       for (; addr < end; addr = next) {
> +               next = pmd_addr_end(addr, end);
> +               pmdp = pmd_base + pmd_index(addr);
> +               pmd = READ_ONCE(*pmdp);
> +
> +               if (!pmd_present(pmd))
> +                       continue;
> +
> +               if (pmd_leaf(pmd)) {
> +                       pmd_clear(pmdp);
> +                       if (is_vmemmap)
> +                               free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, altmap);
> +                       continue;
> +               }
> +
> +               pte_base = (pte_t *)pmd_page_vaddr(*pmdp);
> +               remove_pte_mapping(pte_base, addr, next, is_vmemmap, altmap);
> +               free_pte_table(pte_base, pmdp);
> +       }
> +}
> +
> +static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, unsigned long end,
> +                                        bool is_vmemmap, struct vmem_altmap *altmap)
> +{
> +       unsigned long next;
> +       pud_t *pudp, pud;
> +       pmd_t *pmd_base;
> +
> +       for (; addr < end; addr = next) {
> +               next = pud_addr_end(addr, end);
> +               pudp = pud_base + pud_index(addr);
> +               pud = READ_ONCE(*pudp);
> +
> +               if (!pud_present(pud))
> +                       continue;
> +
> +               if (pud_leaf(pud)) {
> +                       if (pgtable_l4_enabled) {
> +                               pud_clear(pudp);
> +                               if (is_vmemmap)
> +                                       free_vmemmap_storage(pud_page(pud), PUD_SIZE, altmap);
> +                       }
> +                       continue;
> +               }
> +
> +               pmd_base = pmd_offset(pudp, 0);
> +               remove_pmd_mapping(pmd_base, addr, next, is_vmemmap, altmap);
> +
> +               if (pgtable_l4_enabled)
> +                       free_pmd_table(pmd_base, pudp);
> +       }
> +}
> +
> +static void __meminit remove_p4d_mapping(p4d_t *p4d_base, unsigned long addr, unsigned long end,
> +                                        bool is_vmemmap, struct vmem_altmap *altmap)
> +{
> +       unsigned long next;
> +       p4d_t *p4dp, p4d;
> +       pud_t *pud_base;
> +
> +       for (; addr < end; addr = next) {
> +               next = p4d_addr_end(addr, end);
> +               p4dp = p4d_base + p4d_index(addr);
> +               p4d = READ_ONCE(*p4dp);
> +
> +               if (!p4d_present(p4d))
> +                       continue;
> +
> +               if (p4d_leaf(p4d)) {
> +                       if (pgtable_l5_enabled) {
> +                               p4d_clear(p4dp);
> +                               if (is_vmemmap)
> +                                       free_vmemmap_storage(p4d_page(p4d), P4D_SIZE, altmap);
> +                       }
> +                       continue;
> +               }
> +
> +               pud_base = pud_offset(p4dp, 0);
> +               remove_pud_mapping(pud_base, addr, next, is_vmemmap, altmap);
> +
> +               if (pgtable_l5_enabled)
> +                       free_pud_table(pud_base, p4dp);
> +       }
> +}
> +
> +static void __meminit remove_pgd_mapping(unsigned long va, unsigned long end, bool is_vmemmap,
> +                                        struct vmem_altmap *altmap)
> +{
> +       unsigned long addr, next;
> +       p4d_t *p4d_base;
> +       pgd_t *pgd;
> +
> +       for (addr = va; addr < end; addr = next) {
> +               next = pgd_addr_end(addr, end);
> +               pgd = pgd_offset_k(addr);
> +
> +               if (!pgd_present(*pgd))
> +                       continue;
> +
> +               if (pgd_leaf(*pgd))
> +                       continue;
> +
> +               p4d_base = p4d_offset(pgd, 0);
> +               remove_p4d_mapping(p4d_base, addr, next, is_vmemmap, altmap);
> +       }
> +
> +       flush_tlb_all();
> +}
> +
> +static void __meminit remove_linear_mapping(phys_addr_t start, u64 size)
> +{
> +       unsigned long va = (unsigned long)__va(start);
> +       unsigned long end = (unsigned long)__va(start + size);
> +
> +       remove_pgd_mapping(va, end, false, NULL);
> +}
> +
> +struct range arch_get_mappable_range(void)
> +{
> +       struct range mhp_range;
> +
> +       mhp_range.start = __pa(PAGE_OFFSET);
> +       mhp_range.end = __pa(PAGE_END - 1);
> +       return mhp_range;
> +}
> +
> +int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
> +{
> +       int ret;
> +
> +       create_linear_mapping_range(start, start + size, 0, &params->pgprot);
> +       flush_tlb_all();
> +       ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
> +       if (ret) {
> +               remove_linear_mapping(start, size);
> +               return ret;
> +       }
> +

You need to flush the TLB here too since __add_pages() populates the
page table with the new vmemmap mapping (only because riscv allows
caching invalid entries; I'll adapt this in my next version of Svvptc
support).
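
I.e., something like this (untested sketch):

	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
	if (ret) {
		remove_linear_mapping(start, size);
		return ret;
	}

	/* __add_pages() populated the vmemmap; flush so that no stale,
	 * cached invalid entries linger when the new struct pages are
	 * first dereferenced. */
	flush_tlb_all();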

> +       max_pfn = PFN_UP(start + size);
> +       max_low_pfn = max_pfn;
> +       return 0;
> +}
> +
> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
> +{
> +       __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
> +       remove_linear_mapping(start, size);

You need to flush the TLB here too.

> +}
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
> +{
> +       remove_pgd_mapping(start, end, true, altmap);
> +}
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> --
> 2.40.1
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
  2024-05-14 14:04 ` [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V Björn Töpel
@ 2024-05-14 18:00   ` Alexandre Ghiti
  2024-05-14 18:17     ` Björn Töpel
  2024-05-14 21:06   ` Oscar Salvador
  1 sibling, 1 reply; 34+ messages in thread
From: Alexandre Ghiti @ 2024-05-14 18:00 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> From: Björn Töpel <bjorn@rivosinc.com>
>
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
>         select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>         select ARCH_DMA_DEFAULT_COHERENT
>         select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> +       select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

I think this should be SPARSEMEM_VMEMMAP here.

> +       select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
>         select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>         select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>         select ARCH_HAS_BINFMT_FLAT
> --
> 2.40.1
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
  2024-05-14 18:00   ` Alexandre Ghiti
@ 2024-05-14 18:17     ` Björn Töpel
  2024-05-14 18:41       ` Alexandre Ghiti
  2024-05-14 20:40       ` David Hildenbrand
  0 siblings, 2 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 18:17 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

Alexandre Ghiti <alexghiti@rivosinc.com> writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>>
>> From: Björn Töpel <bjorn@rivosinc.com>
>>
>> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
>> RISC-V.
>>
>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>> ---
>>  arch/riscv/Kconfig | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 6bec1bce6586..b9398b64bb69 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -16,6 +16,8 @@ config RISCV
>>         select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>>         select ARCH_DMA_DEFAULT_COHERENT
>>         select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>> +       select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
>
> I think this should be SPARSEMEM_VMEMMAP here.

Hmm, care to elaborate? I thought that was optional.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
  2024-05-14 18:17     ` Björn Töpel
@ 2024-05-14 18:41       ` Alexandre Ghiti
  2024-05-14 20:40       ` David Hildenbrand
  1 sibling, 0 replies; 34+ messages in thread
From: Alexandre Ghiti @ 2024-05-14 18:41 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On Tue, May 14, 2024 at 8:17 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Alexandre Ghiti <alexghiti@rivosinc.com> writes:
>
> > On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
> >>
> >> From: Björn Töpel <bjorn@rivosinc.com>
> >>
> >> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> >> RISC-V.
> >>
> >> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> >> ---
> >>  arch/riscv/Kconfig | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >> index 6bec1bce6586..b9398b64bb69 100644
> >> --- a/arch/riscv/Kconfig
> >> +++ b/arch/riscv/Kconfig
> >> @@ -16,6 +16,8 @@ config RISCV
> >>         select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> >>         select ARCH_DMA_DEFAULT_COHERENT
> >>         select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> >> +       select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
> >
> > I think this should be SPARSEMEM_VMEMMAP here.
>
> Hmm, care to elaborate? I thought that was optional.

My bad, I thought VMEMMAP was required in your patchset. Sorry for the noise!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 17:49   ` Alexandre Ghiti
@ 2024-05-14 19:31     ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-14 19:31 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Albert Ou, David Hildenbrand, Palmer Dabbelt, Paul Walmsley,
	linux-riscv, Björn Töpel, Andrew Bresticker,
	Chethan Seshadri, Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

Alexandre Ghiti <alexghiti@rivosinc.com> writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:

>> +int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
>> +{
>> +       int ret;
>> +
>> +       create_linear_mapping_range(start, start + size, 0, &params->pgprot);
>> +       flush_tlb_all();
>> +       ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
>> +       if (ret) {
>> +               remove_linear_mapping(start, size);
>> +               return ret;
>> +       }
>> +
>
> You need to flush the TLB here too since __add_pages() populates the
> page table with the new vmemmap mapping (only because riscv allows
> caching invalid entries; I'll adapt this in my next version of Svvptc
> support).
>
>> +       max_pfn = PFN_UP(start + size);
>> +       max_low_pfn = max_pfn;
>> +       return 0;
>> +}
>> +
>> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
>> +{
>> +       __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
>> +       remove_linear_mapping(start, size);
>
> You need to flush the TLB here too.

I'll address all of the above in the next version. Thanks for reviewing
the series!


Björn


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
  2024-05-14 14:04 ` [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
  2024-05-14 15:59   ` David Hildenbrand
  2024-05-14 17:17   ` Alexandre Ghiti
@ 2024-05-14 20:32   ` Oscar Salvador
  2024-05-15  5:39     ` Björn Töpel
  2 siblings, 1 reply; 34+ messages in thread
From: Oscar Salvador @ 2024-05-14 20:32 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
> 
> Changing the __init attribute to __meminit avoids the functions being
> removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
> 
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> +						  uintptr_t fixed_map_size)
>  {
>  	phys_addr_t pa;
>  	uintptr_t va, map_size;
> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  	 * memory hotplug, we are not able to update all the page tables with
>  	 * the new PMDs.
>  	 */
> -	return vmemmap_populate_hugepages(start, end, node, NULL);
> +	return vmemmap_populate_hugepages(start, end, node, altmap);

I would have put this into a separate patch.

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
  2024-05-14 14:04 ` [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add Björn Töpel
  2024-05-14 16:00   ` David Hildenbrand
  2024-05-14 17:24   ` Alexandre Ghiti
@ 2024-05-14 20:37   ` Oscar Salvador
  2 siblings, 0 replies; 34+ messages in thread
From: Oscar Salvador @ 2024-05-14 20:37 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

On Tue, May 14, 2024 at 04:04:41PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  arch/riscv/mm/init.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>  
>  static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> -						  uintptr_t fixed_map_size)
> +						  uintptr_t fixed_map_size, const pgprot_t *pgprot)
>  {
>  	phys_addr_t pa;
>  	uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t
>  					    best_map_size(pa, va, end - pa);
>  
>  		create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -				   pgprot_from_va(va));
> +				   pgprot ? *pgprot : pgprot_from_va(va));
>  	}
>  }
>  
> @@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void)
>  		if (end >= __pa(PAGE_OFFSET) + memory_limit)
>  			end = __pa(PAGE_OFFSET) + memory_limit;
>  
> -		create_linear_mapping_range(start, end, 0);
> +		create_linear_mapping_range(start, end, 0, NULL);
>  	}
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> -	create_linear_mapping_range(krodata_start,
> -				    krodata_start + krodata_size, 0);
> +	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL);
> +	create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL);
>  
>  	memblock_clear_nomap(ktext_start,  ktext_size);
>  	memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>  
>  #ifdef CONFIG_KFENCE
> -	create_linear_mapping_range(kfence_pool,
> -				    kfence_pool + KFENCE_POOL_SIZE,
> -				    PAGE_SIZE);
> +	create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>  
>  	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> -- 
> 2.40.1
> 
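As the commit message notes, the new pgprot parameter exists so that
arch_add_memory() can later map hot-added memory with the protections it
is handed. A rough sketch of that caller, assuming the shape patch 4
gives it (remove_linear_mapping() comes from that patch and is not
quoted here):

	int __ref arch_add_memory(int nid, u64 start, u64 size,
				  struct mhp_params *params)
	{
		int ret;

		/* Use the caller-supplied protections rather than
		 * pgprot_from_va() when mapping the new range. */
		create_linear_mapping_range(start, start + size, 0,
					    &params->pgprot);
		ret = __add_pages(nid, start >> PAGE_SHIFT,
				  size >> PAGE_SHIFT, params);
		if (ret)
			remove_linear_mapping(start, size);
		return ret;
	}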

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
  2024-05-14 14:04 ` [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump Björn Töpel
@ 2024-05-14 20:39   ` David Hildenbrand
  2024-05-14 21:03   ` Oscar Salvador
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-05-14 20:39 UTC (permalink / raw)
  To: Björn Töpel, Alexandre Ghiti, Albert Ou, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse) by holding the
> memory hotplug read-lock while traversing the page table.
> 
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>   arch/riscv/mm/ptdump.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
> index 1289cc6d3700..9d5f657a251b 100644
> --- a/arch/riscv/mm/ptdump.c
> +++ b/arch/riscv/mm/ptdump.c
> @@ -6,6 +6,7 @@
>   #include <linux/efi.h>
>   #include <linux/init.h>
>   #include <linux/debugfs.h>
> +#include <linux/memory_hotplug.h>
>   #include <linux/seq_file.h>
>   #include <linux/ptdump.h>
>   
> @@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
>   
>   static int ptdump_show(struct seq_file *m, void *v)
>   {
> +	get_online_mems();
>   	ptdump_walk(m, m->private);
> +	put_online_mems();
>   
>   	return 0;
>   }

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
  2024-05-14 18:17     ` Björn Töpel
  2024-05-14 18:41       ` Alexandre Ghiti
@ 2024-05-14 20:40       ` David Hildenbrand
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2024-05-14 20:40 UTC (permalink / raw)
  To: Björn Töpel, Alexandre Ghiti
  Cc: Albert Ou, Palmer Dabbelt, Paul Walmsley, linux-riscv,
	Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

On 14.05.24 20:17, Björn Töpel wrote:
> Alexandre Ghiti <alexghiti@rivosinc.com> writes:
> 
>> On Tue, May 14, 2024 at 4:05 PM Björn Töpel <bjorn@kernel.org> wrote:
>>>
>>> From: Björn Töpel <bjorn@rivosinc.com>
>>>
>>> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
>>> RISC-V.
>>>
>>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>>> ---
>>>   arch/riscv/Kconfig | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index 6bec1bce6586..b9398b64bb69 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -16,6 +16,8 @@ config RISCV
>>>          select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>>>          select ARCH_DMA_DEFAULT_COHERENT
>>>          select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>>> +       select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
>>
>> I think this should be SPARSEMEM_VMEMMAP here.
> 
> Hmm, care to elaborate? I thought that was optional.

There was a discussion at LSF/MM today about possibly requiring
SPARSEMEM_VMEMMAP for hotplug. Would that work here as well?

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 14:04 ` [PATCH v2 4/8] riscv: mm: Add memory hotplugging support Björn Töpel
  2024-05-14 16:04   ` David Hildenbrand
  2024-05-14 17:49   ` Alexandre Ghiti
@ 2024-05-14 20:49   ` Oscar Salvador
  2024-05-15  5:41     ` Björn Töpel
  2 siblings, 1 reply; 34+ messages in thread
From: Oscar Salvador @ 2024-05-14 20:49 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +					   struct vmem_altmap *altmap)
> +{
> +	if (altmap)
> +		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +	else
> +		free_pages((unsigned long)page_address(page), get_order(size));

David already pointed this out, but you can check
arch/x86/mm/init_64.c:free_pagetable().

You will see that we have to do some magic for bootmem memory (DIMMs
which were not hotplugged but were already present at boot).
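For reference, the x86 logic in question is roughly the following
condensed sketch of free_pagetable(); the PG_reserved branch is the
"magic" for bootmem-backed pages:

	static void __meminit free_pagetable(struct page *page, int order)
	{
		unsigned long nr_pages = 1 << order;

		if (PageReserved(page)) {
			/* Bootmem allocations carry PG_reserved and must go
			 * back through the bootmem helpers, not free_pages(). */
			unsigned long magic = page->index;

			__ClearPageReserved(page);
			if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
				while (nr_pages--)
					put_page_bootmem(page++);
			else
				while (nr_pages--)
					free_reserved_page(page++);
		} else {
			free_pages((unsigned long)page_address(page), order);
		}
	}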

> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
> +{
> +	remove_pgd_mapping(start, end, true, altmap);
> +}
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +#endif /* CONFIG_MEMORY_HOTPLUG */

I will comment on the patch where you add support for hotplug and the
dependency, but in a track at LSF/MM today we decided that we will most
likely drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
environments.
So, since you are adding this completely fresh, please consider tying the
hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
As a bonus, you will only have to maintain one flavour of functions.


-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
  2024-05-14 14:04 ` [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump Björn Töpel
  2024-05-14 20:39   ` David Hildenbrand
@ 2024-05-14 21:03   ` Oscar Salvador
  1 sibling, 0 replies; 34+ messages in thread
From: Oscar Salvador @ 2024-05-14 21:03 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

On Tue, May 14, 2024 at 04:04:43PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse) by holding the
> memory hotplug read-lock while traversing the page table.
> 
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Funnily enough, it seems arm64 and riscv are the only ones holding the
hotplug lock here.
I think we have the same problem on the other arches as well (at least
on x86_64, as far as I can see).

If we eventually need the lock in those too, I would rather have a
central function in the generic mm code that takes the lock and then
calls an arch-specific ptdump_show function, so the locking is not
scattered. But that is another story.
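For illustration, that could look something like this, with
arch_ptdump_show() being a hypothetical per-arch hook that does not
exist today:

	/* In generic mm code, e.g. mm/ptdump.c: the lock is taken here... */
	static int ptdump_show(struct seq_file *m, void *v)
	{
		get_online_mems();
		arch_ptdump_show(m, m->private);	/* ...and each arch only walks. */
		put_online_mems();

		return 0;
	}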

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
  2024-05-14 14:04 ` [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V Björn Töpel
  2024-05-14 18:00   ` Alexandre Ghiti
@ 2024-05-14 21:06   ` Oscar Salvador
  1 sibling, 0 replies; 34+ messages in thread
From: Oscar Salvador @ 2024-05-14 21:06 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

On Tue, May 14, 2024 at 04:04:44PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
> 
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
>  	select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>  	select ARCH_DMA_DEFAULT_COHERENT
>  	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> +	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

Hopefully this should be SPARSEMEM_VMEMMAP.
We are trying to deprecate memory-hotplug on !SPARSEMEM_VMEMMAP.

And it is always easier to do it now than after the code has already
gone in, so please consider whether you really need plain SPARSEMEM and
why (I do not think you do).
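With the suggested tightening, the select from this patch would read:

	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM_VMEMMAP && 64BIT && MMU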

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
  2024-05-14 20:32   ` Oscar Salvador
@ 2024-05-15  5:39     ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-15  5:39 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

Oscar Salvador <osalvador@suse.de> writes:

> On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote:
>> From: Björn Töpel <bjorn@rivosinc.com>
>> 
>> Prepare for memory hotplugging support by changing from __init to
>> __meminit for the page table functions that are used by the upcoming
>> architecture specific callbacks.
>> 
>> Changing the __init attribute to __meminit avoids the functions being
>> removed after init. The __meminit attribute makes sure the
>> functions are kept in the kernel text post init, but only if memory
>> hotplugging is enabled for the build.
>> 
>> Also, make sure that the altmap parameter is properly passed on to
>> vmemmap_populate_hugepages().
>> 
>> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>
>> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
>> +						  uintptr_t fixed_map_size)
>>  {
>>  	phys_addr_t pa;
>>  	uintptr_t va, map_size;
>> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  	 * memory hotplug, we are not able to update all the page tables with
>>  	 * the new PMDs.
>>  	 */
>> -	return vmemmap_populate_hugepages(start, end, node, NULL);
>> +	return vmemmap_populate_hugepages(start, end, node, altmap);
>
> I would have put this into a separate patch.

Thanks for the review, Oscar!

I'll split this up (also suggested by Alex!).


Cheers,
Björn



* Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
  2024-05-14 20:49   ` Oscar Salvador
@ 2024-05-15  5:41     ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-15  5:41 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv, Björn Töpel,
	Andrew Bresticker, Chethan Seshadri, Lorenzo Stoakes,
	Santosh Mamila, Sivakumar Munnangi, Sunil V L, linux-kernel,
	linux-mm, virtualization

Oscar Salvador <osalvador@suse.de> writes:

> On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +					   struct vmem_altmap *altmap)
>> +{
>> +	if (altmap)
>> +		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +	else
>> +		free_pages((unsigned long)page_address(page), get_order(size));
>
> David already pointed this out, but you can check
> arch/x86/mm/init_64.c:free_pagetable().
>
> You will see that we have to do some magic for bootmem memory (DIMMs
> which were not hotplugged but were already present at boot).

Thank you!

>> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
>> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
>> +{
>> +	remove_pgd_mapping(start, end, true, altmap);
>> +}
>> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>
> I will comment on the patch where you add support for hotplug and the
> dependency, but in a track at LSF/MM today we decided that we will most
> likely drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
> environments.
> So, since you are adding this completely fresh, please consider tying the
> hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
> As a bonus, you will only have to maintain one flavour of functions.

Ah, yeah, I saw it mentioned among the LSF/MM/BPF topics. Less is
definitely more -- I'll make the next version depend on
SPARSEMEM_VMEMMAP.


Björn



* Re: [PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE
  2024-05-14 14:04 ` [PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE Björn Töpel
@ 2024-05-15  7:03   ` Björn Töpel
  0 siblings, 0 replies; 34+ messages in thread
From: Björn Töpel @ 2024-05-15  7:03 UTC (permalink / raw)
  To: Alexandre Ghiti, Albert Ou, David Hildenbrand, Palmer Dabbelt,
	Paul Walmsley, linux-riscv
  Cc: Björn Töpel, Andrew Bresticker, Chethan Seshadri,
	Lorenzo Stoakes, Oscar Salvador, Santosh Mamila,
	Sivakumar Munnangi, Sunil V L, linux-kernel, linux-mm,
	virtualization

Björn Töpel <bjorn@kernel.org> writes:

> From: Björn Töpel <bjorn@rivosinc.com>
>
> ZONE_DEVICE pages need DEVMAP PTE support to function
> (ARCH_HAS_PTE_DEVMAP). Claim another RSW (reserved for software) bit
> in the PTE for DEVMAP mark, add the corresponding helpers, and enable
> ARCH_HAS_PTE_DEVMAP for riscv64.

...and this patch has rv32 build issues. Will fix as well.
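For reference, the helpers the commit message describes amount to
claiming one of the two RSW bits and testing/setting it. A sketch, with
the bit choice being an assumption (RSW covers PTE bits 8-9, and bit 8
is already taken by _PAGE_SPECIAL):

	#define _PAGE_DEVMAP	(1 << 9)	/* assumed free RSW bit */

	static inline int pte_devmap(pte_t pte)
	{
		return pte_val(pte) & _PAGE_DEVMAP;
	}

	static inline pte_t pte_mkdevmap(pte_t pte)
	{
		return __pte(pte_val(pte) | _PAGE_DEVMAP);
	}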



Thread overview: 34+ messages
2024-05-14 14:04 [PATCH v2 0/8] riscv: Memory Hot(Un)Plug support Björn Töpel
2024-05-14 14:04 ` [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries Björn Töpel
2024-05-14 15:57   ` Alexandre Ghiti
2024-05-14 16:33     ` Björn Töpel
2024-05-14 14:04 ` [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions Björn Töpel
2024-05-14 15:59   ` David Hildenbrand
2024-05-14 17:17   ` Alexandre Ghiti
2024-05-14 17:45     ` Björn Töpel
2024-05-14 20:32   ` Oscar Salvador
2024-05-15  5:39     ` Björn Töpel
2024-05-14 14:04 ` [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add Björn Töpel
2024-05-14 16:00   ` David Hildenbrand
2024-05-14 17:24   ` Alexandre Ghiti
2024-05-14 20:37   ` Oscar Salvador
2024-05-14 14:04 ` [PATCH v2 4/8] riscv: mm: Add memory hotplugging support Björn Töpel
2024-05-14 16:04   ` David Hildenbrand
2024-05-14 16:34     ` Björn Töpel
2024-05-14 17:49   ` Alexandre Ghiti
2024-05-14 19:31     ` Björn Töpel
2024-05-14 20:49   ` Oscar Salvador
2024-05-15  5:41     ` Björn Töpel
2024-05-14 14:04 ` [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump Björn Töpel
2024-05-14 20:39   ` David Hildenbrand
2024-05-14 21:03   ` Oscar Salvador
2024-05-14 14:04 ` [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V Björn Töpel
2024-05-14 18:00   ` Alexandre Ghiti
2024-05-14 18:17     ` Björn Töpel
2024-05-14 18:41       ` Alexandre Ghiti
2024-05-14 20:40       ` David Hildenbrand
2024-05-14 21:06   ` Oscar Salvador
2024-05-14 14:04 ` [PATCH v2 7/8] virtio-mem: Enable virtio-mem " Björn Töpel
2024-05-14 15:58   ` David Hildenbrand
2024-05-14 14:04 ` [PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE Björn Töpel
2024-05-15  7:03   ` Björn Töpel
