* [RFC PATCH v2 01/21] riscv: mm: Distinguish hardware base page and software base page
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 02/21] riscv: mm: Configure satp with hw page pfn Xu Lu
` (20 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
The key idea behind implementing a larger base page on an MMU that only
supports 4K pages is to decouple the MMU page from the software page as
seen by kernel mm. In contrast to the software page, we denote the MMU
page as the hardware page.
To decouple these two kinds of pages, we should manage, allocate and map
memory at the granularity of the software page, which is exactly what the
existing mm code does. Page table operations, however, should configure
page table entries at the granularity of the hardware page, which is the
responsibility of arch code.
This commit introduces the concept of the hardware base page for RISC-V.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/Kconfig | 10 ++++++++++
arch/riscv/include/asm/page.h | 7 +++++++
arch/riscv/include/asm/pgtable-32.h | 5 +++--
arch/riscv/include/asm/pgtable-64.h | 5 +++--
arch/riscv/include/asm/pgtable-bits.h | 3 ++-
arch/riscv/include/asm/pgtable.h | 1 +
6 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index fa8f2da87a0a..2c0cb175a92a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -289,6 +289,16 @@ config PAGE_OFFSET
default 0xc0000000 if 32BIT
default 0xff60000000000000 if 64BIT
+config RISCV_HW_PAGE_SHIFT
+ int
+ default 12
+
+config RISCV_USE_SW_PAGE
+ bool
+ depends on 64BIT
+ depends on RISCV_HW_PAGE_SHIFT != PAGE_SHIFT
+ default n
+
config KASAN_SHADOW_OFFSET
hex
depends on KASAN_GENERIC
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 32d308a3355f..7c581a3e057b 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -12,6 +12,10 @@
#include <linux/pfn.h>
#include <linux/const.h>
+#define HW_PAGE_SHIFT CONFIG_RISCV_HW_PAGE_SHIFT
+#define HW_PAGE_SIZE (_AC(1, UL) << HW_PAGE_SHIFT)
+#define HW_PAGE_MASK (~(HW_PAGE_SIZE - 1))
+
#define PAGE_SHIFT CONFIG_PAGE_SHIFT
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE - 1))
@@ -185,6 +189,9 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
#define __pa(x) __virt_to_phys((unsigned long)(x))
#define __va(x) ((void *)__pa_to_va_nodebug((phys_addr_t)(x)))
+#define pfn_to_hwpfn(pfn) (pfn << (PAGE_SHIFT - HW_PAGE_SHIFT))
+#define hwpfn_to_pfn(hwpfn) (hwpfn >> (PAGE_SHIFT - HW_PAGE_SHIFT))
+
#define phys_to_pfn(phys) (PFN_DOWN(phys))
#define pfn_to_phys(pfn) (PFN_PHYS(pfn))
diff --git a/arch/riscv/include/asm/pgtable-32.h b/arch/riscv/include/asm/pgtable-32.h
index 00f3369570a8..159a668c3dd8 100644
--- a/arch/riscv/include/asm/pgtable-32.h
+++ b/arch/riscv/include/asm/pgtable-32.h
@@ -20,9 +20,10 @@
/*
* rv32 PTE format:
* | XLEN-1 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
- * PFN reserved for SW D A G U X W R V
+ * HW_PFN reserved for SW D A G U X W R V
*/
-#define _PAGE_PFN_MASK GENMASK(31, 10)
+#define _PAGE_HW_PFN_MASK GENMASK(31, 10)
+#define _PAGE_PFN_MASK GENMASK(31, (10 + PAGE_SHIFT - HW_PAGE_SHIFT))
#define _PAGE_NOCACHE 0
#define _PAGE_IO 0
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 0897dd99ab8d..963aa4be9eed 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -72,9 +72,10 @@ typedef struct {
/*
* rv64 PTE format:
* | 63 | 62 61 | 60 54 | 53 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
- * N MT RSV PFN reserved for SW D A G U X W R V
+ * N MT RSV HW_PFN reserved for SW D A G U X W R V
*/
-#define _PAGE_PFN_MASK GENMASK(53, 10)
+#define _PAGE_HW_PFN_MASK GENMASK(53, 10)
+#define _PAGE_PFN_MASK GENMASK(53, (10 + PAGE_SHIFT - HW_PAGE_SHIFT))
/*
* [63] Svnapot definitions:
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index a8f5205cea54..e5bb6a805505 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -31,7 +31,8 @@
/* Used for swap PTEs only. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_ACCESSED
-#define _PAGE_PFN_SHIFT 10
+#define _PAGE_HWPFN_SHIFT 10
+#define _PAGE_PFN_SHIFT (_PAGE_HWPFN_SHIFT + (PAGE_SHIFT - HW_PAGE_SHIFT))
/*
* when all of R/W/X are zero, the PTE is a pointer to the next level
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index e79f15293492..9d6d0ff86c76 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -114,6 +114,7 @@
#include <linux/mm_types.h>
#include <asm/compat.h>
+#define __page_val_to_hwpfn(_val) (((_val) & _PAGE_HW_PFN_MASK) >> _PAGE_HWPFN_SHIFT)
#define __page_val_to_pfn(_val) (((_val) & _PAGE_PFN_MASK) >> _PAGE_PFN_SHIFT)
#ifdef CONFIG_64BIT
--
2.20.1
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v2 02/21] riscv: mm: Configure satp with hw page pfn
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 01/21] riscv: mm: Distinguish hardware base page and software " Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 03/21] riscv: mm: Reimplement page table entry structures Xu Lu
` (19 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
The control register CSR_SATP on RISC-V, which points to the root page
table page, is used by the MMU to translate virtual addresses to physical
addresses when a TLB miss happens. Thus it should be encoded at the
granularity of the hardware page, while existing code usually encodes it
via the software page frame number.
This commit corrects the encoding operations on the CSR_SATP register. To
free developers from the encoding format of CSR_SATP and the conversion
between sw pfn and hw pfn, we abstract the encoding operations on
CSR_SATP into a dedicated function.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 14 ++++++++++++++
arch/riscv/kernel/head.S | 4 ++--
arch/riscv/kernel/hibernate.c | 3 ++-
arch/riscv/mm/context.c | 7 +++----
arch/riscv/mm/fault.c | 2 +-
arch/riscv/mm/init.c | 7 +++++--
arch/riscv/mm/kasan_init.c | 7 +++++--
7 files changed, 32 insertions(+), 12 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9d6d0ff86c76..9d3947ec3523 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -206,6 +206,20 @@ extern pgd_t swapper_pg_dir[];
extern pgd_t trampoline_pg_dir[];
extern pgd_t early_pg_dir[];
+static inline unsigned long make_satp(unsigned long pfn,
+ unsigned long asid, unsigned long satp_mode)
+{
+ return (pfn_to_hwpfn(pfn) |
+ ((asid & SATP_ASID_MASK) << SATP_ASID_SHIFT) | satp_mode);
+}
+
+static inline unsigned long satp_pfn(unsigned long satp)
+{
+ unsigned long hwpfn = satp & SATP_PPN;
+
+ return hwpfn_to_pfn(hwpfn);
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline int pmd_present(pmd_t pmd)
{
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 356d5397b2a2..b8568e3ddefa 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -86,7 +86,7 @@ relocate_enable_mmu:
csrw CSR_TVEC, a2
/* Compute satp for kernel page tables, but don't load it yet */
- srl a2, a0, PAGE_SHIFT
+ srl a2, a0, HW_PAGE_SHIFT
la a1, satp_mode
XIP_FIXUP_OFFSET a1
REG_L a1, 0(a1)
@@ -100,7 +100,7 @@ relocate_enable_mmu:
*/
la a0, trampoline_pg_dir
XIP_FIXUP_OFFSET a0
- srl a0, a0, PAGE_SHIFT
+ srl a0, a0, HW_PAGE_SHIFT
or a0, a0, a1
sfence.vma
csrw CSR_SATP, a0
diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
index 671b686c0158..155be6b1d32c 100644
--- a/arch/riscv/kernel/hibernate.c
+++ b/arch/riscv/kernel/hibernate.c
@@ -395,7 +395,8 @@ int swsusp_arch_resume(void)
if (ret)
return ret;
- hibernate_restore_image(resume_hdr.saved_satp, (PFN_DOWN(__pa(resume_pg_dir)) | satp_mode),
+ hibernate_restore_image(resume_hdr.saved_satp,
+ make_satp(PFN_DOWN(__pa(resume_pg_dir)), 0, satp_mode),
resume_hdr.restore_cpu_addr);
return 0;
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 4abe3de23225..229c78d9ad3a 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -189,9 +189,8 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
raw_spin_unlock_irqrestore(&context_lock, flags);
switch_mm_fast:
- csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
- (cntx2asid(cntx) << SATP_ASID_SHIFT) |
- satp_mode);
+ csr_write(CSR_SATP, make_satp(virt_to_pfn(mm->pgd), cntx2asid(cntx),
+ satp_mode));
if (need_flush_tlb)
local_flush_tlb_all();
@@ -200,7 +199,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
static void set_mm_noasid(struct mm_struct *mm)
{
/* Switch the page table and blindly nuke entire local TLB */
- csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
+ csr_write(CSR_SATP, make_satp(virt_to_pfn(mm->pgd), 0, satp_mode));
local_flush_tlb_all_asid(0);
}
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index a9f2b4af8f3f..4772152be0f9 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -133,7 +133,7 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
* of a task switch.
*/
index = pgd_index(addr);
- pfn = csr_read(CSR_SATP) & SATP_PPN;
+ pfn = satp_pfn(csr_read(CSR_SATP));
pgd = (pgd_t *)pfn_to_virt(pfn) + index;
pgd_k = init_mm.pgd + index;
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 0e8c20adcd98..f9334aab45a6 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -836,7 +836,7 @@ static __init void set_satp_mode(uintptr_t dtb_pa)
(uintptr_t)early_p4d : (uintptr_t)early_pud,
PGDIR_SIZE, PAGE_TABLE);
- identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
+ identity_satp = make_satp(PFN_DOWN((uintptr_t)&early_pg_dir), 0, satp_mode);
local_flush_tlb_all();
csr_write(CSR_SATP, identity_satp);
@@ -1316,6 +1316,8 @@ static void __init create_linear_mapping_page_table(void)
static void __init setup_vm_final(void)
{
+ unsigned long satp;
+
/* Setup swapper PGD for fixmap */
#if !defined(CONFIG_64BIT)
/*
@@ -1349,7 +1351,8 @@ static void __init setup_vm_final(void)
clear_fixmap(FIX_P4D);
/* Move to swapper page table */
- csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
+ satp = make_satp(PFN_DOWN(__pa_symbol(swapper_pg_dir)), 0, satp_mode);
+ csr_write(CSR_SATP, satp);
local_flush_tlb_all();
pt_ops_set_late();
diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index c301c8d291d2..3eee1665358e 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -482,11 +482,13 @@ static void __init create_tmp_mapping(void)
void __init kasan_init(void)
{
+ unsigned long satp;
phys_addr_t p_start, p_end;
u64 i;
create_tmp_mapping();
- csr_write(CSR_SATP, PFN_DOWN(__pa(tmp_pg_dir)) | satp_mode);
+ satp = make_satp(PFN_DOWN(__pa(tmp_pg_dir)), 0, satp_mode);
+ csr_write(CSR_SATP, satp);
kasan_early_clear_pgd(pgd_offset_k(KASAN_SHADOW_START),
KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -531,6 +533,7 @@ void __init kasan_init(void)
memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE);
init_task.kasan_depth = 0;
- csr_write(CSR_SATP, PFN_DOWN(__pa(swapper_pg_dir)) | satp_mode);
+ satp = make_satp(PFN_DOWN(__pa(swapper_pg_dir)), 0, satp_mode);
+ csr_write(CSR_SATP, satp);
local_flush_tlb_all();
}
--
2.20.1
* [RFC PATCH v2 03/21] riscv: mm: Reimplement page table entry structures
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 01/21] riscv: mm: Distinguish hardware base page and software " Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 02/21] riscv: mm: Configure satp with hw page pfn Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 04/21] riscv: mm: Reimplement page table entry constructor function Xu Lu
` (18 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
After decoupling the hardware base page and the software base page, each
software page can now consist of several hardware base pages. The pte
struct should be turned into an array of mapping entries that map the
software page. For example, in a 64K page size kernel, each software page
consists of 16 contiguous hardware pages, so the pte struct should
contain 16 mapping entries to map them.
This commit reimplements the pte structure.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/page.h | 43 +++++++++++++++++++++++----
arch/riscv/include/asm/pgtable-64.h | 41 +++++++++++++++++++++----
arch/riscv/include/asm/pgtable.h | 23 +++++++++++++--
arch/riscv/mm/pgtable.c | 46 +++++++++++++++++++++++++++++
4 files changed, 141 insertions(+), 12 deletions(-)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 7c581a3e057b..9bc908d94c7a 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -63,6 +63,36 @@ void clear_page(void *page);
* Use struct definitions to apply C type checking
*/
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+
+#define HW_PAGES_PER_PAGE (1 << (PAGE_SHIFT - HW_PAGE_SHIFT))
+
+struct page_table_entry {
+ union {
+ unsigned long pgds[HW_PAGES_PER_PAGE];
+ unsigned long p4ds[HW_PAGES_PER_PAGE];
+ unsigned long puds[HW_PAGES_PER_PAGE];
+ unsigned long pmds[HW_PAGES_PER_PAGE];
+ unsigned long ptes[HW_PAGES_PER_PAGE];
+ };
+};
+
+/* Page Global Directory entry */
+typedef struct page_table_entry pgd_t;
+
+/* Page Table entry */
+typedef struct page_table_entry pte_t;
+
+#define pte_val(x) ((x).ptes[0])
+#define pgd_val(x) ((x).pgds[0])
+
+pte_t __pte(unsigned long pteval);
+pgd_t __pgd(unsigned long pgdval);
+#define __pte __pte
+#define __pgd __pgd
+
+#else /* CONFIG_RISCV_USE_SW_PAGE */
+
/* Page Global Directory entry */
typedef struct {
unsigned long pgd;
@@ -73,18 +103,21 @@ typedef struct {
unsigned long pte;
} pte_t;
+#define pte_val(x) ((x).pte)
+#define pgd_val(x) ((x).pgd)
+
+#define __pte(x) ((pte_t) { (x) })
+#define __pgd(x) ((pgd_t) { (x) })
+
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
typedef struct {
unsigned long pgprot;
} pgprot_t;
typedef struct page *pgtable_t;
-#define pte_val(x) ((x).pte)
-#define pgd_val(x) ((x).pgd)
#define pgprot_val(x) ((x).pgprot)
-
-#define __pte(x) ((pte_t) { (x) })
-#define __pgd(x) ((pgd_t) { (x) })
#define __pgprot(x) ((pgprot_t) { (x) })
#ifdef CONFIG_64BIT
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 963aa4be9eed..e736873d7768 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -41,6 +41,35 @@ extern bool pgtable_l5_enabled;
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+
+/* Page 4th Directory entry */
+typedef struct page_table_entry p4d_t;
+
+#define p4d_val(x) ((x).p4ds[0])
+p4d_t __p4d(unsigned long p4dval);
+#define __p4d __p4d
+#define PTRS_PER_P4D (PAGE_SIZE / sizeof(p4d_t))
+
+/* Page Upper Directory entry */
+typedef struct page_table_entry pud_t;
+
+#define pud_val(x) ((x).puds[0])
+pud_t __pud(unsigned long pudval);
+#define __pud __pud
+#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
+
+/* Page Middle Directory entry */
+typedef struct page_table_entry pmd_t;
+
+#define pmd_val(x) ((x).pmds[0])
+pmd_t __pmd(unsigned long pmdval);
+#define __pmd __pmd
+
+#define PTRS_PER_PMD (PAGE_SIZE / sizeof(pmd_t))
+
+#else /* CONFIG_RISCV_USE_SW_PAGE */
+
/* Page 4th Directory entry */
typedef struct {
unsigned long p4d;
@@ -69,6 +98,8 @@ typedef struct {
#define PTRS_PER_PMD (PAGE_SIZE / sizeof(pmd_t))
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
/*
* rv64 PTE format:
* | 63 | 62 61 | 60 54 | 53 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
@@ -98,7 +129,7 @@ enum napot_cont_order {
#define for_each_napot_order_rev(order) \
for (order = NAPOT_ORDER_MAX - 1; \
order >= NAPOT_CONT_ORDER_BASE; order--)
-#define napot_cont_order(val) (__builtin_ctzl((val.pte >> _PAGE_PFN_SHIFT) << 1))
+#define napot_cont_order(val) (__builtin_ctzl((pte_val(val) >> _PAGE_PFN_SHIFT) << 1))
#define napot_cont_shift(order) ((order) + PAGE_SHIFT)
#define napot_cont_size(order) BIT(napot_cont_shift(order))
@@ -279,7 +310,7 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
if (pgtable_l4_enabled)
WRITE_ONCE(*p4dp, p4d);
else
- set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
+ set_pud((pud_t *)p4dp, __pud(p4d_val(p4d)));
}
static inline int p4d_none(p4d_t p4d)
@@ -327,7 +358,7 @@ static inline pud_t *p4d_pgtable(p4d_t p4d)
if (pgtable_l4_enabled)
return (pud_t *)pfn_to_virt(__page_val_to_pfn(p4d_val(p4d)));
- return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
+ return (pud_t *)pud_pgtable(__pud(p4d_val(p4d)));
}
#define p4d_page_vaddr(p4d) ((unsigned long)p4d_pgtable(p4d))
@@ -346,7 +377,7 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
if (pgtable_l5_enabled)
WRITE_ONCE(*pgdp, pgd);
else
- set_p4d((p4d_t *)pgdp, (p4d_t){ pgd_val(pgd) });
+ set_p4d((p4d_t *)pgdp, __p4d(pgd_val(pgd)));
}
static inline int pgd_none(pgd_t pgd)
@@ -384,7 +415,7 @@ static inline p4d_t *pgd_pgtable(pgd_t pgd)
if (pgtable_l5_enabled)
return (p4d_t *)pfn_to_virt(__page_val_to_pfn(pgd_val(pgd)));
- return (p4d_t *)p4d_pgtable((p4d_t) { pgd_val(pgd) });
+ return (p4d_t *)p4d_pgtable(__p4d(pgd_val(pgd)));
}
#define pgd_page_vaddr(pgd) ((unsigned long)pgd_pgtable(pgd))
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9d3947ec3523..f9aed43809b3 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -574,6 +574,25 @@ static inline void __set_pte_at(struct mm_struct *mm, pte_t *ptep, pte_t pteval)
#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+ unsigned int i;
+
+ if (pte_present(pte) && !pte_napot(pte))
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++)
+ pte.ptes[i] += nr << _PAGE_PFN_SHIFT;
+
+ return pte;
+}
+#else
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+ return __pte(pte_val(pte) + (nr << _PAGE_PFN_SHIFT));
+}
+#endif
+#define pte_advance_pfn pte_advance_pfn
+
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr)
{
@@ -584,7 +603,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
if (--nr == 0)
break;
ptep++;
- pte_val(pteval) += 1 << _PAGE_PFN_SHIFT;
+ pteval = pte_advance_pfn(pteval, 1);
}
}
#define set_ptes set_ptes
@@ -882,7 +901,7 @@ extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
((offset) << __SWP_OFFSET_SHIFT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
-#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
+#define __swp_entry_to_pte(x) (__pte((x).val))
static inline int pte_swp_exclusive(pte_t pte)
{
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index 4ae67324f992..0c6b2fc6be58 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -5,6 +5,52 @@
#include <linux/kernel.h>
#include <linux/pgtable.h>
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+
+pte_t __pte(unsigned long pteval)
+{
+ pte_t pte;
+
+ return pte;
+}
+EXPORT_SYMBOL(__pte);
+
+pgd_t __pgd(unsigned long pgdval)
+{
+ pgd_t pgd;
+
+ return pgd;
+}
+EXPORT_SYMBOL(__pgd);
+
+#ifdef CONFIG_64BIT
+p4d_t __p4d(unsigned long p4dval)
+{
+ p4d_t p4d;
+
+ return p4d;
+}
+EXPORT_SYMBOL(__p4d);
+
+pud_t __pud(unsigned long pudval)
+{
+ pud_t pud;
+
+ return pud;
+}
+EXPORT_SYMBOL(__pud);
+
+pmd_t __pmd(unsigned long pmdval)
+{
+ pmd_t pmd;
+
+ return pmd;
+}
+EXPORT_SYMBOL(__pmd);
+#endif /* CONFIG_64BIT */
+
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
--
2.20.1
* [RFC PATCH v2 04/21] riscv: mm: Reimplement page table entry constructor function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (2 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 03/21] riscv: mm: Reimplement page table entry structures Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 05/21] riscv: mm: Reimplement conversion functions between page table entry Xu Lu
` (17 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
This commit reimplements the page table entry constructors. As each page
can now contain several hardware pages, a pte constructor needs to
initialize all the mapping entries for these hardware pages. Note that
the stride between mapping entries differs across page table levels. For
example, at the PTE level, the stride between hardware mapping entries is
the hardware page size (i.e. 4K). At the PMD level, the stride is
(2 ^ 9) * hardware page size (i.e. 2M), and so on.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable-32.h | 5 +++
arch/riscv/include/asm/pgtable-64.h | 41 +++++++++++++++++++---
arch/riscv/include/asm/pgtable.h | 54 ++++++++++++++++++++++++-----
arch/riscv/mm/pgtable.c | 47 +++++++++++++++++++++++++
4 files changed, 133 insertions(+), 14 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable-32.h b/arch/riscv/include/asm/pgtable-32.h
index 159a668c3dd8..2959ab72f926 100644
--- a/arch/riscv/include/asm/pgtable-32.h
+++ b/arch/riscv/include/asm/pgtable-32.h
@@ -37,4 +37,9 @@
static const __maybe_unused int pgtable_l4_enabled;
static const __maybe_unused int pgtable_l5_enabled;
+static inline int __pgd_present(unsigned long pgdval)
+{
+ return pgdval & _PAGE_PRESENT;
+}
+
#endif /* _ASM_RISCV_PGTABLE_32_H */
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index e736873d7768..efcf63667f93 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -204,9 +204,14 @@ static inline u64 riscv_page_io(void)
_PAGE_USER | _PAGE_GLOBAL | \
_PAGE_MTMASK))
+static inline int __pud_present(unsigned long pudval)
+{
+ return pudval & _PAGE_PRESENT;
+}
+
static inline int pud_present(pud_t pud)
{
- return (pud_val(pud) & _PAGE_PRESENT);
+ return __pud_present(pud_val(pud));
}
static inline int pud_none(pud_t pud)
@@ -219,11 +224,16 @@ static inline int pud_bad(pud_t pud)
return !pud_present(pud);
}
-#define pud_leaf pud_leaf
+static inline bool __pud_leaf(unsigned long pudval)
+{
+ return __pud_present(pudval) && (pudval & _PAGE_LEAF);
+}
+
static inline bool pud_leaf(pud_t pud)
{
- return pud_present(pud) && (pud_val(pud) & _PAGE_LEAF);
+ return __pud_leaf(pud_val(pud));
}
+#define pud_leaf pud_leaf
static inline int pud_user(pud_t pud)
{
@@ -321,14 +331,30 @@ static inline int p4d_none(p4d_t p4d)
return 0;
}
+static inline int __p4d_present(unsigned long p4dval)
+{
+ return p4dval & _PAGE_PRESENT;
+}
+
static inline int p4d_present(p4d_t p4d)
{
if (pgtable_l4_enabled)
- return (p4d_val(p4d) & _PAGE_PRESENT);
+ return __p4d_present(p4d_val(p4d));
return 1;
}
+static inline int __p4d_leaf(unsigned long p4dval)
+{
+ return 0;
+}
+
+static inline int p4d_leaf(p4d_t p4d)
+{
+ return __p4d_leaf(p4d_val(p4d));
+}
+#define p4d_leaf p4d_leaf
+
static inline int p4d_bad(p4d_t p4d)
{
if (pgtable_l4_enabled)
@@ -388,10 +414,15 @@ static inline int pgd_none(pgd_t pgd)
return 0;
}
+static inline int __pgd_present(unsigned long pgdval)
+{
+ return pgdval & _PAGE_PRESENT;
+}
+
static inline int pgd_present(pgd_t pgd)
{
if (pgtable_l5_enabled)
- return (pgd_val(pgd) & _PAGE_PRESENT);
+ return __pgd_present(pgd_val(pgd));
return 1;
}
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index f9aed43809b3..1d5f533edbd5 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -220,8 +220,19 @@ static inline unsigned long satp_pfn(unsigned long satp)
return hwpfn_to_pfn(hwpfn);
}
+static inline int __pgd_leaf(unsigned long pgdval)
+{
+ return __pgd_present(pgdval) && (pgdval & _PAGE_LEAF);
+}
+
+static inline int pgd_leaf(pgd_t pgd)
+{
+ return __pgd_leaf(pgd_val(pgd));
+}
+#define pgd_leaf pgd_leaf
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_present(pmd_t pmd)
+static inline int __pmd_present(unsigned long pmdval)
{
/*
* Checking for _PAGE_LEAF is needed too because:
@@ -229,15 +240,20 @@ static inline int pmd_present(pmd_t pmd)
* the present bit, in this situation, pmd_present() and
* pmd_trans_huge() still needs to return true.
*/
- return (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PROT_NONE | _PAGE_LEAF));
+ return (pmdval & (_PAGE_PRESENT | _PAGE_PROT_NONE | _PAGE_LEAF));
}
#else
-static inline int pmd_present(pmd_t pmd)
+static inline int __pmd_present(unsigned long pmdval)
{
- return (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PROT_NONE));
+ return (pmdval & (_PAGE_PRESENT | _PAGE_PROT_NONE));
}
#endif
+static inline int pmd_present(pmd_t pmd)
+{
+ return __pmd_present(pmd_val(pmd));
+}
+
static inline int pmd_none(pmd_t pmd)
{
return (pmd_val(pmd) == 0);
@@ -248,11 +264,16 @@ static inline int pmd_bad(pmd_t pmd)
return !pmd_present(pmd) || (pmd_val(pmd) & _PAGE_LEAF);
}
-#define pmd_leaf pmd_leaf
+static inline bool __pmd_leaf(unsigned long pmdval)
+{
+ return __pmd_present(pmdval) && (pmdval & _PAGE_LEAF);
+}
+
static inline bool pmd_leaf(pmd_t pmd)
{
- return pmd_present(pmd) && (pmd_val(pmd) & _PAGE_LEAF);
+ return __pmd_leaf(pmd_val(pmd));
}
+#define pmd_leaf pmd_leaf
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
@@ -306,9 +327,14 @@ static __always_inline bool has_svnapot(void)
return riscv_has_extension_likely(RISCV_ISA_EXT_SVNAPOT);
}
+static inline unsigned long __pte_napot(unsigned long val)
+{
+ return val & _PAGE_NAPOT;
+}
+
static inline unsigned long pte_napot(pte_t pte)
{
- return pte_val(pte) & _PAGE_NAPOT;
+ return __pte_napot(pte_val(pte));
}
static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
@@ -324,11 +350,16 @@ static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
static __always_inline bool has_svnapot(void) { return false; }
-static inline unsigned long pte_napot(pte_t pte)
+static inline unsigned long __pte_napot(unsigned long pteval)
{
return 0;
}
+static inline unsigned long pte_napot(pte_t pte)
+{
+ return __pte_napot(pte_val(pte));
+}
+
#endif /* CONFIG_RISCV_ISA_SVNAPOT */
/* Yields the page frame number (PFN) of a page table entry */
@@ -356,9 +387,14 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
#define mk_pte(page, prot) pfn_pte(page_to_pfn(page), prot)
+static inline int __pte_present(unsigned long pteval)
+{
+ return (pteval & (_PAGE_PRESENT | _PAGE_PROT_NONE));
+}
+
static inline int pte_present(pte_t pte)
{
- return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE));
+ return __pte_present(pte_val(pte));
}
#define pte_accessible pte_accessible
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index 0c6b2fc6be58..f57ada26a183 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -10,6 +10,13 @@
pte_t __pte(unsigned long pteval)
{
pte_t pte;
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ pte.ptes[i] = pteval;
+ if (__pte_present(pteval) && !__pte_napot(pteval))
+ pteval += 1 << _PAGE_HWPFN_SHIFT;
+ }
return pte;
}
@@ -18,6 +25,16 @@ EXPORT_SYMBOL(__pte);
pgd_t __pgd(unsigned long pgdval)
{
pgd_t pgd;
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ pgd.pgds[i] = pgdval;
+ if (__pgd_leaf(pgdval))
+ pgdval += (1 << (PGDIR_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__pgd_present(pgdval))
+ pgdval += 1 << _PAGE_HWPFN_SHIFT;
+ }
return pgd;
}
@@ -27,6 +44,16 @@ EXPORT_SYMBOL(__pgd);
p4d_t __p4d(unsigned long p4dval)
{
p4d_t p4d;
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ p4d.p4ds[i] = p4dval;
+ if (__p4d_leaf(p4dval))
+ p4dval += (1 << (P4D_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__p4d_present(p4dval))
+ p4dval += 1 << _PAGE_HWPFN_SHIFT;
+ }
return p4d;
}
@@ -35,6 +62,16 @@ EXPORT_SYMBOL(__p4d);
pud_t __pud(unsigned long pudval)
{
pud_t pud;
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ pud.puds[i] = pudval;
+ if (__pud_leaf(pudval))
+ pudval += (1 << (PUD_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__pud_present(pudval))
+ pudval += 1 << _PAGE_HWPFN_SHIFT;
+ }
return pud;
}
@@ -43,6 +80,16 @@ EXPORT_SYMBOL(__pud);
pmd_t __pmd(unsigned long pmdval)
{
pmd_t pmd;
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ pmd.pmds[i] = pmdval;
+ if (__pmd_leaf(pmdval))
+ pmdval += (1 << (PMD_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__pmd_present(pmdval))
+ pmdval += 1 << _PAGE_HWPFN_SHIFT;
+ }
return pmd;
}
--
2.20.1
* [RFC PATCH v2 05/21] riscv: mm: Reimplement conversion functions between page table entry
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (3 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 04/21] riscv: mm: Reimplement page table entry constructor function Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 06/21] riscv: mm: Avoid pte constructor during pte conversion Xu Lu
` (16 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Some code converts a higher-level page table entry into a lower-level one
to reuse the pte functions. For example, pmd_dirty() converts a pmd
struct into a pte struct to check whether it is dirty via pte_dirty(). As
pte structs at different levels now have different constructors, we
cannot apply a pte constructor during the conversion. Thus, this commit
converts the entries by directly converting the structure type.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 81 ++++++++++++++++++++++++++++++--
1 file changed, 76 insertions(+), 5 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1d5f533edbd5..f7b51c52b815 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -309,6 +309,50 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
return (unsigned long)pfn_to_virt(__page_val_to_pfn(pmd_val(pmd)));
}
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+
+static inline pte_t pmd_pte(pmd_t pmd)
+{
+ return (pte_t)pmd;
+}
+
+static inline pte_t pud_pte(pud_t pud)
+{
+ return (pte_t)pud;
+}
+
+static inline pte_t p4d_pte(p4d_t p4d)
+{
+ return (pte_t)p4d;
+}
+
+static inline pte_t pgd_pte(pgd_t pgd)
+{
+ return (pte_t)pgd;
+}
+
+static inline pmd_t pte_pmd(pte_t pte)
+{
+ return (pmd_t)pte;
+}
+
+static inline pud_t pte_pud(pte_t pte)
+{
+ return (pud_t)pte;
+}
+
+static inline p4d_t pte_p4d(pte_t pte)
+{
+ return (p4d_t)pte;
+}
+
+static inline pgd_t pte_pgd(pte_t pte)
+{
+ return (pgd_t)pte;
+}
+
+#else /* CONFIG_RISCV_USE_SW_PAGE */
+
static inline pte_t pmd_pte(pmd_t pmd)
{
return __pte(pmd_val(pmd));
@@ -319,6 +363,38 @@ static inline pte_t pud_pte(pud_t pud)
return __pte(pud_val(pud));
}
+static inline pte_t p4d_pte(p4d_t p4d)
+{
+ return __pte(p4d_val(p4d));
+}
+
+static inline pte_t pgd_pte(pgd_t pgd)
+{
+ return __pte(pgd_val(pgd));
+}
+
+static inline pmd_t pte_pmd(pte_t pte)
+{
+ return __pmd(pte_val(pte));
+}
+
+static inline pud_t pte_pud(pte_t pte)
+{
+ return __pud(pte_val(pte));
+}
+
+static inline p4d_t pte_p4d(pte_t pte)
+{
+ return __p4d(pte_val(pte));
+}
+
+static inline pgd_t pte_pgd(pte_t pte)
+{
+ return __pgd(pte_val(pte));
+}
+
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
#ifdef CONFIG_RISCV_ISA_SVNAPOT
#include <asm/cpufeature.h>
@@ -728,11 +804,6 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
/*
* THP functions
*/
-static inline pmd_t pte_pmd(pte_t pte)
-{
- return __pmd(pte_val(pte));
-}
-
static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
return pmd;
--
2.20.1
* [RFC PATCH v2 06/21] riscv: mm: Avoid pte constructor during pte conversion
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (4 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 05/21] riscv: mm: Reimplement conversion functions between page table entry Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 07/21] riscv: mm: Reimplement page table entry get function Xu Lu
` (15 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
This commit converts ptes at different levels by directly converting
the pte type instead of using the pte constructor, as ptes from
different levels have different constructors.
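The flag helpers introduced here can be sketched standalone as follows (entry count and layout are assumed for illustration): each software pte applies the flag change to every hardware entry it contains.

```c
#define HW_PAGES_PER_PAGE 4

typedef struct { unsigned long ptes[HW_PAGES_PER_PAGE]; } pte_t;

static pte_t pte_set_flag(pte_t pte, unsigned long flag)
{
	for (unsigned int i = 0; i < HW_PAGES_PER_PAGE; i++)
		pte.ptes[i] |= flag;	/* set in every hardware entry */
	return pte;
}

static pte_t pte_clear_flag(pte_t pte, unsigned long flag)
{
	for (unsigned int i = 0; i < HW_PAGES_PER_PAGE; i++)
		pte.ptes[i] &= ~flag;	/* clear in every hardware entry */
	return pte;
}
```

Level-independent helpers such as pte_wrprotect() or pte_mkdirty() can then be written once in terms of these two, which is what the rest of the patch does.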
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 52 +++++++++++++++++++++++++-------
1 file changed, 41 insertions(+), 11 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index f7b51c52b815..d3da8aee213c 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -351,6 +351,26 @@ static inline pgd_t pte_pgd(pte_t pte)
return (pgd_t)pte;
}
+static inline pte_t pte_set_flag(pte_t pte, unsigned long flag)
+{
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++)
+ pte.ptes[i] |= flag;
+
+ return pte;
+}
+
+static inline pte_t pte_clear_flag(pte_t pte, unsigned long flag)
+{
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++)
+ pte.ptes[i] &= (~flag);
+
+ return pte;
+}
+
#else /* CONFIG_RISCV_USE_SW_PAGE */
static inline pte_t pmd_pte(pmd_t pmd)
@@ -393,6 +413,16 @@ static inline pgd_t pte_pgd(pte_t pte)
return __pgd(pte_val(pte));
}
+static inline pte_t pte_set_flag(pte_t pte, unsigned long flag)
+{
+ return __pte(pte_val(pte) | flag);
+}
+
+static inline pte_t pte_clear_flag(pte_t pte, unsigned long flag)
+{
+ return __pte(pte_val(pte) & (~flag));
+}
+
#endif /* CONFIG_RISCV_USE_SW_PAGE */
#ifdef CONFIG_RISCV_ISA_SVNAPOT
@@ -537,46 +567,46 @@ static inline int pte_devmap(pte_t pte)
static inline pte_t pte_wrprotect(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_WRITE));
+ return pte_clear_flag(pte, _PAGE_WRITE);
}
/* static inline pte_t pte_mkread(pte_t pte) */
static inline pte_t pte_mkwrite_novma(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_WRITE);
+ return pte_set_flag(pte, _PAGE_WRITE);
}
/* static inline pte_t pte_mkexec(pte_t pte) */
static inline pte_t pte_mkdirty(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_DIRTY);
+ return pte_set_flag(pte, _PAGE_DIRTY);
}
static inline pte_t pte_mkclean(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_DIRTY));
+ return pte_clear_flag(pte, _PAGE_DIRTY);
}
static inline pte_t pte_mkyoung(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_ACCESSED);
+ return pte_set_flag(pte, _PAGE_ACCESSED);
}
static inline pte_t pte_mkold(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_ACCESSED));
+ return pte_clear_flag(pte, _PAGE_ACCESSED);
}
static inline pte_t pte_mkspecial(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_SPECIAL);
+ return pte_set_flag(pte, _PAGE_SPECIAL);
}
static inline pte_t pte_mkdevmap(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_DEVMAP);
+ return pte_set_flag(pte, _PAGE_DEVMAP);
}
static inline pte_t pte_mkhuge(pte_t pte)
@@ -612,7 +642,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
ALT_THEAD_PMA(newprot_val);
- return __pte((pte_val(pte) & _PAGE_CHG_MASK) | newprot_val);
+ return pte_set_flag(pte_clear_flag(pte, ~_PAGE_CHG_MASK), newprot_val);
}
#define pgd_ERROR(e) \
@@ -1017,12 +1047,12 @@ static inline int pte_swp_exclusive(pte_t pte)
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
- return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
+ return pte_set_flag(pte, _PAGE_SWP_EXCLUSIVE);
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
- return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
+ return pte_clear_flag(pte, _PAGE_SWP_EXCLUSIVE);
}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
--
2.20.1
* [RFC PATCH v2 07/21] riscv: mm: Reimplement page table entry get function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (5 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 06/21] riscv: mm: Avoid pte constructor during pte conversion Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 08/21] riscv: mm: Reimplement page table entry atomic " Xu Lu
` (14 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
This commit reimplements the ptep_get()/pmdp_get()/... functions. As a
pte structure now contains multiple mapping entries, we cannot use
READ_ONCE() to fetch its value. Instead, we use a plain dereference.
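A minimal sketch of the accessor (entry layout assumed): the plain dereference copies the whole multi-word structure, which is convenient but, unlike READ_ONCE() on a single word, not atomic; lockless readers therefore need the additional consistency checks this series adds separately.

```c
#define HW_PAGES_PER_PAGE 4

typedef struct { unsigned long ptes[HW_PAGES_PER_PAGE]; } pte_t;

/* A pte is now wider than one machine word, so no single-word atomic
 * load can fetch it; READ_ONCE() on the whole struct is not valid. */
_Static_assert(sizeof(pte_t) > sizeof(unsigned long),
	       "pte spans multiple words");

/* Plain (non-atomic) structure copy. */
static pte_t ptep_get(pte_t *ptep)
{
	return *ptep;
}
```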
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index d3da8aee213c..ba4a083b7210 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -704,6 +704,36 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
WRITE_ONCE(*ptep, pteval);
}
+static inline pte_t ptep_get(pte_t *ptep)
+{
+ return *ptep;
+}
+#define ptep_get ptep_get
+
+static inline pmd_t pmdp_get(pmd_t *pmdp)
+{
+ return *pmdp;
+}
+#define pmdp_get pmdp_get
+
+static inline pud_t pudp_get(pud_t *pudp)
+{
+ return *pudp;
+}
+#define pudp_get pudp_get
+
+static inline p4d_t p4dp_get(p4d_t *p4dp)
+{
+ return *p4dp;
+}
+#define p4dp_get p4dp_get
+
+static inline pgd_t pgdp_get(pgd_t *pgdp)
+{
+ return *pgdp;
+}
+#define pgdp_get pgdp_get
+
void flush_icache_pte(struct mm_struct *mm, pte_t pte);
static inline void __set_pte_at(struct mm_struct *mm, pte_t *ptep, pte_t pteval)
--
2.20.1
* [RFC PATCH v2 08/21] riscv: mm: Reimplement page table entry atomic get function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (6 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 07/21] riscv: mm: Reimplement page table entry get function Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 09/21] riscv: mm: Replace READ_ONCE with atomic pte " Xu Lu
` (13 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
This commit implements lockless functions to atomically fetch a pte's
value. For each pte structure, we atomically fetch the first mapping
entry, then fetch each following entry in a loop and compare it
against the first mapping entry plus the expected pfn step. If we find
any difference in their pfns or protection bits, the pte structure has
been modified concurrently and needs to be reloaded.
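The retry condition can be modelled in userspace as a pure consistency check over one snapshot (the shift values, bit positions, and field layout below are illustrative assumptions, and the present/NAPOT special cases are omitted): entry i must map the first entry's pfn plus i and carry the same protection bits once the hardware-managed A/D bits are masked out. In the kernel, a failed check triggers a reload instead of returning.

```c
#define HW_PAGES_PER_PAGE 4
#define HWPFN_SHIFT 10			/* assumed: pfn stored above bit 9 */
#define PROT_MASK ((1UL << HWPFN_SHIFT) - 1)
#define FLAG_ACCESSED (1UL << 6)	/* assumed A/D bit positions */
#define FLAG_DIRTY    (1UL << 7)

typedef struct { unsigned long ptes[HW_PAGES_PER_PAGE]; } pte_t;

/* Returns 1 if the snapshot is self-consistent: each entry maps the
 * next consecutive hardware pfn with identical protections, ignoring
 * A/D bits that hardware may set on individual 4K entries at any
 * time. Returns 0 if a concurrent update tore the snapshot. */
static int pte_snapshot_consistent(pte_t pte)
{
	unsigned long expected = pte.ptes[0];

	for (unsigned int i = 0; i < HW_PAGES_PER_PAGE; i++) {
		unsigned long cur = pte.ptes[i];

		if ((expected >> HWPFN_SHIFT) != (cur >> HWPFN_SHIFT))
			return 0;	/* pfn mismatch */
		if (((expected & PROT_MASK) | FLAG_DIRTY | FLAG_ACCESSED) !=
		    ((cur & PROT_MASK) | FLAG_DIRTY | FLAG_ACCESSED))
			return 0;	/* protection mismatch */
		expected += 1UL << HWPFN_SHIFT;	/* consecutive hw pfns */
	}
	return 1;
}
```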
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 156 +++++++++++++++++++++++++++++++
include/linux/pgtable.h | 21 +++++
2 files changed, 177 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ba4a083b7210..fe42afb4441e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -220,6 +220,18 @@ static inline unsigned long satp_pfn(unsigned long satp)
return hwpfn_to_pfn(hwpfn);
}
+static inline unsigned long __pte_pgprot(unsigned long pteval)
+{
+ unsigned long prot_mask = GENMASK(_PAGE_HWPFN_SHIFT - 1, 0);
+
+ return pteval & prot_mask;
+}
+
+static inline pgprot_t pte_pgprot(pte_t pte)
+{
+ return __pgprot(__pte_pgprot(pte_val(pte)));
+}
+
static inline int __pgd_leaf(unsigned long pgdval)
{
return __pgd_present(pgdval) && (pgdval & _PAGE_LEAF);
@@ -734,6 +746,150 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
}
#define pgdp_get pgdp_get
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+ unsigned long pteval;
+ pte_t pte;
+ int i;
+
+retry:
+ pteval = READ_ONCE(ptep->ptes[0]);
+ pte = *ptep;
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ if (__page_val_to_pfn(pteval) !=
+ __page_val_to_pfn(pte.ptes[i]))
+ goto retry;
+ if ((__pte_pgprot(pteval) | _PAGE_DIRTY | _PAGE_ACCESSED) !=
+ (__pte_pgprot(pte.ptes[i]) | _PAGE_DIRTY | _PAGE_ACCESSED))
+ goto retry;
+
+ if (__pte_present(pteval) && !__pte_napot(pteval))
+ pteval += 1 << _PAGE_HWPFN_SHIFT;
+ }
+
+ return pte;
+}
+#define ptep_get_lockless ptep_get_lockless
+
+static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
+{
+ unsigned long pmdval;
+ pmd_t pmd;
+ int i;
+
+retry:
+ pmdval = READ_ONCE(pmdp->pmds[0]);
+ pmd = *pmdp;
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ if (__page_val_to_pfn(pmdval) !=
+ __page_val_to_pfn(pmd.pmds[i]))
+ goto retry;
+ if ((__pte_pgprot(pmdval) | _PAGE_DIRTY | _PAGE_ACCESSED) !=
+ (__pte_pgprot(pmd.pmds[i]) | _PAGE_DIRTY | _PAGE_ACCESSED))
+ goto retry;
+
+ if (__pmd_leaf(pmdval))
+ pmdval += (1 << (PMD_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__pmd_present(pmdval))
+ pmdval += 1 << _PAGE_HWPFN_SHIFT;
+ }
+
+ return pmd;
+}
+#define pmdp_get_lockless pmdp_get_lockless
+
+static inline void pmdp_get_lockless_sync(void)
+{
+}
+
+static inline pud_t pudp_get_lockless(pud_t *pudp)
+{
+ unsigned long pudval;
+ pud_t pud;
+ int i;
+
+retry:
+ pudval = READ_ONCE(pudp->puds[0]);
+ pud = *pudp;
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ if (__page_val_to_pfn(pudval) !=
+ __page_val_to_pfn(pud.puds[i]))
+ goto retry;
+ if ((__pte_pgprot(pudval) | _PAGE_DIRTY | _PAGE_ACCESSED) !=
+ (__pte_pgprot(pud.puds[i]) | _PAGE_DIRTY | _PAGE_ACCESSED))
+ goto retry;
+
+ if (__pud_leaf(pudval))
+ pudval += (1 << (PUD_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__pud_present(pudval))
+ pudval += 1 << _PAGE_HWPFN_SHIFT;
+ }
+
+ return pud;
+}
+#define pudp_get_lockless pudp_get_lockless
+
+static inline p4d_t p4dp_get_lockless(p4d_t *p4dp)
+{
+ unsigned long p4dval;
+ p4d_t p4d;
+ int i;
+
+retry:
+ p4dval = READ_ONCE(p4dp->p4ds[0]);
+ p4d = *p4dp;
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ if (__page_val_to_pfn(p4dval) !=
+ __page_val_to_pfn(p4d.p4ds[i]))
+ goto retry;
+ if ((__pte_pgprot(p4dval) | _PAGE_DIRTY | _PAGE_ACCESSED) !=
+ (__pte_pgprot(p4d.p4ds[i]) | _PAGE_DIRTY | _PAGE_ACCESSED))
+ goto retry;
+
+ if (__p4d_leaf(p4dval))
+ p4dval += (1 << (P4D_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__p4d_present(p4dval))
+ p4dval += 1 << _PAGE_HWPFN_SHIFT;
+ }
+
+ return p4d;
+}
+#define p4dp_get_lockless p4dp_get_lockless
+
+static inline pgd_t pgdp_get_lockless(pgd_t *pgdp)
+{
+ unsigned long pgdval;
+ pgd_t pgd;
+ int i;
+
+retry:
+ pgdval = READ_ONCE(pgdp->pgds[0]);
+ pgd = *pgdp;
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
+ if (__page_val_to_pfn(pgdval) !=
+ __page_val_to_pfn(pgd.pgds[i]))
+ goto retry;
+ if ((__pte_pgprot(pgdval) | _PAGE_DIRTY | _PAGE_ACCESSED) !=
+ (__pte_pgprot(pgd.pgds[i]) | _PAGE_DIRTY | _PAGE_ACCESSED))
+ goto retry;
+
+ if (__pgd_leaf(pgdval))
+ pgdval += (1 << (PGDIR_SHIFT - PAGE_SHIFT)) <<
+ _PAGE_HWPFN_SHIFT;
+ else if (__pgd_present(pgdval))
+ pgdval += 1 << _PAGE_HWPFN_SHIFT;
+ }
+
+ return pgd;
+}
+#define pgdp_get_lockless pgdp_get_lockless
+
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
void flush_icache_pte(struct mm_struct *mm, pte_t pte);
static inline void __set_pte_at(struct mm_struct *mm, pte_t *ptep, pte_t pteval)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8b2ac6bd2ae..b629c48b980b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -598,6 +598,27 @@ static inline void pmdp_get_lockless_sync(void)
}
#endif
+#ifndef pudp_get_lockless
+static inline pud_t pudp_get_lockless(pud_t *pudp)
+{
+ return pudp_get(pudp);
+}
+#endif
+
+#ifndef p4dp_get_lockless
+static inline p4d_t p4dp_get_lockless(p4d_t *p4dp)
+{
+ return p4dp_get(p4dp);
+}
+#endif
+
+#ifndef pgdp_get_lockless
+static inline pgd_t pgdp_get_lockless(pgd_t *pgdp)
+{
+ return pgdp_get(pgdp);
+}
+#endif
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
--
2.20.1
* [RFC PATCH v2 09/21] riscv: mm: Replace READ_ONCE with atomic pte get function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (7 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 08/21] riscv: mm: Reimplement page table entry atomic " Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 10/21] riscv: mm: Reimplement PTE A/D bit check function Xu Lu
` (12 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
READ_ONCE() cannot be applied to a pte structure with multiple mapping
entries. This commit replaces READ_ONCE() with the atomic pte get
functions.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable-64.h | 6 +++---
arch/riscv/include/asm/pgtable.h | 21 +++++++++++++--------
arch/riscv/kernel/hibernate.c | 18 +++++++++---------
arch/riscv/mm/pgtable.c | 12 +++++++++---
kernel/events/core.c | 6 +++---
mm/debug_vm_pgtable.c | 4 ++--
mm/gup.c | 10 +++++-----
mm/hmm.c | 2 +-
mm/mapping_dirty_helpers.c | 2 +-
mm/memory.c | 4 ++--
mm/mprotect.c | 2 +-
mm/ptdump.c | 8 ++++----
mm/sparse-vmemmap.c | 2 +-
mm/vmscan.c | 2 +-
14 files changed, 55 insertions(+), 44 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index efcf63667f93..2649cc90b14e 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -242,7 +242,7 @@ static inline int pud_user(pud_t pud)
static inline void set_pud(pud_t *pudp, pud_t pud)
{
- WRITE_ONCE(*pudp, pud);
+ *pudp = pud;
}
static inline void pud_clear(pud_t *pudp)
@@ -318,7 +318,7 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
{
if (pgtable_l4_enabled)
- WRITE_ONCE(*p4dp, p4d);
+ *p4dp = p4d;
else
set_pud((pud_t *)p4dp, __pud(p4d_val(p4d)));
}
@@ -401,7 +401,7 @@ pud_t *pud_offset(p4d_t *p4d, unsigned long address);
static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
{
if (pgtable_l5_enabled)
- WRITE_ONCE(*pgdp, pgd);
+ *pgdp = pgd;
else
set_p4d((p4d_t *)pgdp, __p4d(pgd_val(pgd)));
}
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index fe42afb4441e..bf724d006236 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -289,7 +289,7 @@ static inline bool pmd_leaf(pmd_t pmd)
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
- WRITE_ONCE(*pmdp, pmd);
+ *pmdp = pmd;
}
static inline void pmd_clear(pmd_t *pmdp)
@@ -713,7 +713,7 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
*/
static inline void set_pte(pte_t *ptep, pte_t pteval)
{
- WRITE_ONCE(*ptep, pteval);
+ *ptep = pteval;
}
static inline pte_t ptep_get(pte_t *ptep)
@@ -953,10 +953,9 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long a
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
- pte_t pte = __pte(atomic_long_xchg((atomic_long_t *)ptep, 0));
-
+ pte_t pte = ptep_get(ptep);
+ pte_clear(mm, address, ptep);
page_table_check_pte_clear(mm, pte);
-
return pte;
}
@@ -964,7 +963,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
- atomic_long_and(~(unsigned long)_PAGE_WRITE, (atomic_long_t *)ptep);
+ pte_t old_pte = ptep_get(ptep);
+ set_pte(ptep, pte_wrprotect(old_pte));
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1170,8 +1170,9 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long address, pmd_t *pmdp)
{
- pmd_t pmd = __pmd(atomic_long_xchg((atomic_long_t *)pmdp, 0));
+ pmd_t pmd = pmdp_get(pmdp);
+ pmd_clear(pmdp);
page_table_check_pmd_clear(mm, pmd);
return pmd;
@@ -1188,8 +1189,12 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp, pmd_t pmd)
{
+ pmd_t old_pmd = pmdp_get(pmdp);
+
page_table_check_pmd_set(vma->vm_mm, pmdp, pmd);
- return __pmd(atomic_long_xchg((atomic_long_t *)pmdp, pmd_val(pmd)));
+ set_pmd(pmdp, pmd);
+
+ return old_pmd;
}
#define pmdp_collapse_flush pmdp_collapse_flush
diff --git a/arch/riscv/kernel/hibernate.c b/arch/riscv/kernel/hibernate.c
index 155be6b1d32c..5018d38f5280 100644
--- a/arch/riscv/kernel/hibernate.c
+++ b/arch/riscv/kernel/hibernate.c
@@ -171,7 +171,7 @@ static int temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long
pte_t *src_ptep;
pte_t *dst_ptep;
- if (pmd_none(READ_ONCE(*dst_pmdp))) {
+ if (pmd_none(pmdp_get_lockless(dst_pmdp))) {
dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
if (!dst_ptep)
return -ENOMEM;
@@ -183,7 +183,7 @@ static int temp_pgtable_map_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long
src_ptep = pte_offset_kernel(src_pmdp, start);
do {
- pte_t pte = READ_ONCE(*src_ptep);
+ pte_t pte = ptep_get_lockless(src_ptep);
if (pte_present(pte))
set_pte(dst_ptep, __pte(pte_val(pte) | pgprot_val(prot)));
@@ -200,7 +200,7 @@ static int temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long
pmd_t *src_pmdp;
pmd_t *dst_pmdp;
- if (pud_none(READ_ONCE(*dst_pudp))) {
+ if (pud_none(pudp_get_lockless(dst_pudp))) {
dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
if (!dst_pmdp)
return -ENOMEM;
@@ -212,7 +212,7 @@ static int temp_pgtable_map_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long
src_pmdp = pmd_offset(src_pudp, start);
do {
- pmd_t pmd = READ_ONCE(*src_pmdp);
+ pmd_t pmd = pmdp_get_lockless(src_pmdp);
next = pmd_addr_end(start, end);
@@ -239,7 +239,7 @@ static int temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long
pud_t *dst_pudp;
pud_t *src_pudp;
- if (p4d_none(READ_ONCE(*dst_p4dp))) {
+ if (p4d_none(p4dp_get_lockless(dst_p4dp))) {
dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
if (!dst_pudp)
return -ENOMEM;
@@ -251,7 +251,7 @@ static int temp_pgtable_map_pud(p4d_t *dst_p4dp, p4d_t *src_p4dp, unsigned long
src_pudp = pud_offset(src_p4dp, start);
do {
- pud_t pud = READ_ONCE(*src_pudp);
+ pud_t pud = pudp_get_lockless(src_pudp);
next = pud_addr_end(start, end);
@@ -278,7 +278,7 @@ static int temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long
p4d_t *dst_p4dp;
p4d_t *src_p4dp;
- if (pgd_none(READ_ONCE(*dst_pgdp))) {
+ if (pgd_none(pgdp_get_lockless(dst_pgdp))) {
dst_p4dp = (p4d_t *)get_safe_page(GFP_ATOMIC);
if (!dst_p4dp)
return -ENOMEM;
@@ -290,7 +290,7 @@ static int temp_pgtable_map_p4d(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long
src_p4dp = p4d_offset(src_pgdp, start);
do {
- p4d_t p4d = READ_ONCE(*src_p4dp);
+ p4d_t p4d = p4dp_get_lockless(src_p4dp);
next = p4d_addr_end(start, end);
@@ -317,7 +317,7 @@ static int temp_pgtable_mapping(pgd_t *pgdp, unsigned long start, unsigned long
unsigned long ret;
do {
- pgd_t pgd = READ_ONCE(*src_pgdp);
+ pgd_t pgd = pgdp_get_lockless(src_pgdp);
next = pgd_addr_end(start, end);
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index f57ada26a183..150aea8e2d7a 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -128,9 +128,15 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pte_t *ptep)
{
- if (!pte_young(ptep_get(ptep)))
- return 0;
- return test_and_clear_bit(_PAGE_ACCESSED_OFFSET, &pte_val(*ptep));
+ int r = 1;
+ pte_t pte = ptep_get(ptep);
+
+ if (!pte_young(pte))
+ r = 0;
+ else
+ set_pte(ptep, pte_mkold(pte));
+
+ return r;
}
EXPORT_SYMBOL_GPL(ptep_test_and_clear_young);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index df27d08a7232..84d49c60f55b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7709,7 +7709,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
pte_t *ptep, pte;
pgdp = pgd_offset(mm, addr);
- pgd = READ_ONCE(*pgdp);
+ pgd = pgdp_get_lockless(pgdp);
if (pgd_none(pgd))
return 0;
@@ -7717,7 +7717,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pgd_leaf_size(pgd);
p4dp = p4d_offset_lockless(pgdp, pgd, addr);
- p4d = READ_ONCE(*p4dp);
+ p4d = p4dp_get_lockless(p4dp);
if (!p4d_present(p4d))
return 0;
@@ -7725,7 +7725,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return p4d_leaf_size(p4d);
pudp = pud_offset_lockless(p4dp, p4d, addr);
- pud = READ_ONCE(*pudp);
+ pud = pudp_get_lockless(pudp);
if (!pud_present(pud))
return 0;
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index bc748f700a9e..1cec548cc6c7 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -438,7 +438,7 @@ static void __init pmd_huge_tests(struct pgtable_debug_args *args)
* X86 defined pmd_set_huge() verifies that the given
* PMD is not a populated non-leaf entry.
*/
- WRITE_ONCE(*args->pmdp, __pmd(0));
+ set_pmd(args->pmdp, __pmd(0));
WARN_ON(!pmd_set_huge(args->pmdp, __pfn_to_phys(args->fixed_pmd_pfn), args->page_prot));
WARN_ON(!pmd_clear_huge(args->pmdp));
pmd = pmdp_get(args->pmdp);
@@ -458,7 +458,7 @@ static void __init pud_huge_tests(struct pgtable_debug_args *args)
* X86 defined pud_set_huge() verifies that the given
* PUD is not a populated non-leaf entry.
*/
- WRITE_ONCE(*args->pudp, __pud(0));
+ set_pud(args->pudp, __pud(0));
WARN_ON(!pud_set_huge(args->pudp, __pfn_to_phys(args->fixed_pud_pfn), args->page_prot));
WARN_ON(!pud_clear_huge(args->pudp));
pud = pudp_get(args->pudp);
diff --git a/mm/gup.c b/mm/gup.c
index ad0c8922dac3..db444d732028 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1004,7 +1004,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pudp = pud_offset(p4dp, address);
- pud = READ_ONCE(*pudp);
+ pud = pudp_get_lockless(pudp);
if (!pud_present(pud))
return no_page_table(vma, flags, address);
if (pud_leaf(pud)) {
@@ -1029,7 +1029,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
p4d_t *p4dp, p4d;
p4dp = p4d_offset(pgdp, address);
- p4d = READ_ONCE(*p4dp);
+ p4d = p4dp_get_lockless(p4dp);
BUILD_BUG_ON(p4d_leaf(p4d));
if (!p4d_present(p4d) || p4d_bad(p4d))
@@ -3259,7 +3259,7 @@ static int gup_fast_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
pudp = pud_offset_lockless(p4dp, p4d, addr);
do {
- pud_t pud = READ_ONCE(*pudp);
+ pud_t pud = pudp_get_lockless(pudp);
next = pud_addr_end(addr, end);
if (unlikely(!pud_present(pud)))
@@ -3285,7 +3285,7 @@ static int gup_fast_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr,
p4dp = p4d_offset_lockless(pgdp, pgd, addr);
do {
- p4d_t p4d = READ_ONCE(*p4dp);
+ p4d_t p4d = p4dp_get_lockless(p4dp);
next = p4d_addr_end(addr, end);
if (!p4d_present(p4d))
@@ -3307,7 +3307,7 @@ static void gup_fast_pgd_range(unsigned long addr, unsigned long end,
pgdp = pgd_offset(current->mm, addr);
do {
- pgd_t pgd = READ_ONCE(*pgdp);
+ pgd_t pgd = pgdp_get_lockless(pgdp);
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..fa56b735883e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -423,7 +423,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
/* Normally we don't want to split the huge page */
walk->action = ACTION_CONTINUE;
- pud = READ_ONCE(*pudp);
+ pud = pudp_get_lockless(pudp);
if (!pud_present(pud)) {
spin_unlock(ptl);
return hmm_vma_walk_hole(start, end, -1, walk);
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2f8829b3541a..8771432c3300 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -149,7 +149,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
- pud_t pudval = READ_ONCE(*pud);
+ pud_t pudval = pudp_get_lockless(pud);
/* Do not split a huge pud */
if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
diff --git a/mm/memory.c b/mm/memory.c
index bdf77a3ec47b..03ee104cb009 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6428,12 +6428,12 @@ int follow_pfnmap_start(struct follow_pfnmap_args *args)
goto out;
p4dp = p4d_offset(pgdp, address);
- p4d = READ_ONCE(*p4dp);
+ p4d = p4dp_get_lockless(p4dp);
if (p4d_none(p4d) || unlikely(p4d_bad(p4d)))
goto out;
pudp = pud_offset(p4dp, address);
- pud = READ_ONCE(*pudp);
+ pud = pudp_get_lockless(pudp);
if (pud_none(pud))
goto out;
if (pud_leaf(pud)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f450af3252e..a165ab597a73 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -447,7 +447,7 @@ static inline long change_pud_range(struct mmu_gather *tlb,
break;
}
- pud = READ_ONCE(*pudp);
+ pud = pudp_get_lockless(pudp);
if (pud_none(pud))
continue;
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 106e1d66e9f9..b8a2ad43392f 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -30,7 +30,7 @@ static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
struct ptdump_state *st = walk->private;
- pgd_t val = READ_ONCE(*pgd);
+ pgd_t val = pgdp_get_lockless(pgd);
#if CONFIG_PGTABLE_LEVELS > 4 && \
(defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS))
@@ -53,7 +53,7 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
struct ptdump_state *st = walk->private;
- p4d_t val = READ_ONCE(*p4d);
+ p4d_t val = p4dp_get_lockless(p4d);
#if CONFIG_PGTABLE_LEVELS > 3 && \
(defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS))
@@ -76,7 +76,7 @@ static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
struct ptdump_state *st = walk->private;
- pud_t val = READ_ONCE(*pud);
+ pud_t val = pudp_get_lockless(pud);
#if CONFIG_PGTABLE_LEVELS > 2 && \
(defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS))
@@ -99,7 +99,7 @@ static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
struct ptdump_state *st = walk->private;
- pmd_t val = READ_ONCE(*pmd);
+ pmd_t val = pmdp_get_lockless(pmd);
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
if (pmd_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pte)))
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c0388b2e959d..6621fb096fd0 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -337,7 +337,7 @@ int __meminit vmemmap_populate_hugepages(unsigned long start, unsigned long end,
return -ENOMEM;
pmd = pmd_offset(pud, addr);
- if (pmd_none(READ_ONCE(*pmd))) {
+ if (pmd_none(pmdp_get_lockless(pmd))) {
void *p;
p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 28ba2b06fc7d..2bc78c339fd1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3608,7 +3608,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
pud = pud_offset(p4d, start & P4D_MASK);
restart:
for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
- pud_t val = READ_ONCE(pud[i]);
+ pud_t val = pudp_get_lockless(&pud[i]);
next = pud_addr_end(addr, end);
--
2.20.1
* [RFC PATCH v2 10/21] riscv: mm: Reimplement PTE A/D bit check function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (8 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 09/21] riscv: mm: Replace READ_ONCE with atomic pte " Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 11/21] riscv: mm: Reimplement mk_huge_pte function Xu Lu
` (11 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
A CPU that supports only a 4K MMU usually updates the access/dirty bits
at the 4K pte level. As each software page can contain multiple 4K
hardware pages, we need to traverse all mapping entries in
pte_dirty()/pte_young() to check whether any corresponding 4K page has
been accessed or dirtied.
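The check therefore ORs the A/D state across all hardware entries; a standalone sketch under assumed bit positions and entry count (not the real RISC-V _PAGE_* encoding):

```c
#define HW_PAGES_PER_PAGE 4
#define FLAG_ACCESSED (1UL << 6)	/* assumed bit positions */
#define FLAG_DIRTY    (1UL << 7)

typedef struct { unsigned long ptes[HW_PAGES_PER_PAGE]; } pte_t;

/* The software page counts as dirty/young if ANY of its hardware
 * pages is: hardware sets A/D only on the 4K entry it touched. */
static int pte_dirty(pte_t pte)
{
	for (unsigned int i = 0; i < HW_PAGES_PER_PAGE; i++)
		if (pte.ptes[i] & FLAG_DIRTY)
			return 1;
	return 0;
}

static int pte_young(pte_t pte)
{
	for (unsigned int i = 0; i < HW_PAGES_PER_PAGE; i++)
		if (pte.ptes[i] & FLAG_ACCESSED)
			return 1;
	return 0;
}
```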
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index bf724d006236..c0f7442c8a9e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -553,6 +553,29 @@ static inline int pte_huge(pte_t pte)
return pte_present(pte) && (pte_val(pte) & _PAGE_LEAF);
}
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+static inline int pte_dirty(pte_t pte)
+{
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++)
+ if (pte.ptes[i] & _PAGE_DIRTY)
+ return 1;
+
+ return 0;
+}
+
+static inline int pte_young(pte_t pte)
+{
+ unsigned int i;
+
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++)
+ if (pte.ptes[i] & _PAGE_ACCESSED)
+ return 1;
+
+ return 0;
+}
+#else
static inline int pte_dirty(pte_t pte)
{
return pte_val(pte) & _PAGE_DIRTY;
@@ -562,6 +585,7 @@ static inline int pte_young(pte_t pte)
{
return pte_val(pte) & _PAGE_ACCESSED;
}
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
static inline int pte_special(pte_t pte)
{
--
2.20.1
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
* [RFC PATCH v2 11/21] riscv: mm: Reimplement mk_huge_pte function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (9 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 10/21] riscv: mm: Reimplement PTE A/D bit check function Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 12/21] riscv: mm: Reimplement tlb flush function Xu Lu
` (10 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
A huge pte can be a pud, a pmd, or an svnapot pte. Huge ptes at different page
table levels have different pte constructors. This commit reimplements the
mk_huge_pte() function: we take the vma struct as an argument to determine the
target huge pte level and apply the corresponding constructor.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/hugetlb.h | 5 +++++
arch/riscv/mm/hugetlbpage.c | 23 ++++++++++++++++++++++-
arch/s390/include/asm/hugetlb.h | 2 +-
include/asm-generic/hugetlb.h | 5 ++++-
mm/debug_vm_pgtable.c | 2 +-
mm/hugetlb.c | 4 ++--
6 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/arch/riscv/include/asm/hugetlb.h b/arch/riscv/include/asm/hugetlb.h
index faf3624d8057..eafd00f4b74f 100644
--- a/arch/riscv/include/asm/hugetlb.h
+++ b/arch/riscv/include/asm/hugetlb.h
@@ -51,6 +51,11 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
#endif /*CONFIG_RISCV_ISA_SVNAPOT*/
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+#define __HAVE_ARCH_MK_HUGE_PTE
+pte_t mk_huge_pte(struct vm_area_struct *vma, struct page *page, pgprot_t pgprot);
+#endif
+
#include <asm-generic/hugetlb.h>
#endif /* _ASM_RISCV_HUGETLB_H */
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 42314f093922..8896c28ec881 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -2,6 +2,27 @@
#include <linux/hugetlb.h>
#include <linux/err.h>
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+pte_t mk_huge_pte(struct vm_area_struct *vma, struct page *page, pgprot_t pgprot)
+{
+ pte_t pte;
+ unsigned int shift = huge_page_shift(hstate_vma(vma));
+
+ if (shift == PGDIR_SHIFT)
+ pte = pgd_pte(pfn_pgd(page_to_pfn(page), pgprot));
+ else if (shift == P4D_SHIFT)
+ pte = p4d_pte(pfn_p4d(page_to_pfn(page), pgprot));
+ else if (shift == PUD_SHIFT)
+ pte = pud_pte(pfn_pud(page_to_pfn(page), pgprot));
+ else if (shift == PMD_SHIFT)
+ pte = pmd_pte(pfn_pmd(page_to_pfn(page), pgprot));
+ else
+ pte = pfn_pte(page_to_pfn(page), pgprot);
+
+ return pte;
+}
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
#ifdef CONFIG_RISCV_ISA_SVNAPOT
pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
@@ -74,7 +95,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
out:
if (pte) {
- pte_t pteval = ptep_get_lockless(pte);
+ pte_t pteval = ptep_get(pte);
WARN_ON_ONCE(pte_present(pteval) && !pte_huge(pteval));
}
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index cf1b5d6fb1a6..cea9118d4bba 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -79,7 +79,7 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
__set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
}
-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
+static inline pte_t mk_huge_pte(struct vm_area_struct *vma, struct page *page, pgprot_t pgprot)
{
return mk_pte(page, pgprot);
}
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 594d5905f615..90765bc03bba 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -5,10 +5,13 @@
#include <linux/swap.h>
#include <linux/swapops.h>
-static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
+#ifndef __HAVE_ARCH_MK_HUGE_PTE
+static inline pte_t mk_huge_pte(struct vm_area_struct *vma, struct page *page,
+ pgprot_t pgprot)
{
return mk_pte(page, pgprot);
}
+#endif
static inline unsigned long huge_pte_write(pte_t pte)
{
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 1cec548cc6c7..24839883d513 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -919,7 +919,7 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
* as it was previously derived from a real kernel symbol.
*/
page = pfn_to_page(args->fixed_pmd_pfn);
- pte = mk_huge_pte(page, args->page_prot);
+ pte = mk_huge_pte(args->vma, page, args->page_prot);
WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..2b33eb46408f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5140,10 +5140,10 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
unsigned int shift = huge_page_shift(hstate_vma(vma));
if (writable) {
- entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
+ entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(vma, page,
vma->vm_page_prot)));
} else {
- entry = huge_pte_wrprotect(mk_huge_pte(page,
+ entry = huge_pte_wrprotect(mk_huge_pte(vma, page,
vma->vm_page_prot));
}
entry = pte_mkyoung(entry);
--
2.20.1
* [RFC PATCH v2 12/21] riscv: mm: Reimplement tlb flush function
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (10 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 11/21] riscv: mm: Reimplement mk_huge_pte function Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 13/21] riscv: mm: Adjust PGDIR/P4D/PUD/PMD_SHIFT Xu Lu
` (9 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
When flushing the TLB for a page corresponding to a certain address, the CPU
actually only flushes the TLB entries of the first 4K hardware page. This commit
reimplements the TLB flush functions to flush the TLB entries of all hardware
pages within the same software page.
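A minimal sketch of the flush arithmetic (assuming HW_PAGE_SHIFT = 12 and a 64K software page; the names mirror the patch but the program is standalone): flushing one software page takes one sfence.vma per contained 4K hardware page, and the address stride between flushes is the hardware page size.

```c
#include <assert.h>

#define HW_PAGE_SHIFT 12UL	/* 4K MMU page */
#define PAGE_SHIFT    16UL	/* 64K software page (illustrative) */

/* Number of sfence.vma iterations needed per software page. */
static unsigned long hw_page_num(void)
{
	return 1UL << (PAGE_SHIFT - HW_PAGE_SHIFT);
}

/* Address stride between iterations, derived as in the patch:
 * the flushed size scaled down by the sw/hw page-size ratio. */
static unsigned long hw_page_size(unsigned long page_size)
{
	return page_size >> (PAGE_SHIFT - HW_PAGE_SHIFT);
}
```

With these values, one 64K page costs 16 sfence.vma executions at 4K strides.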
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 9 ++++++---
arch/riscv/include/asm/tlbflush.h | 26 ++++++++++++++++++++------
arch/riscv/mm/fault.c | 13 +++++++++----
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/tlbflush.c | 31 +++++++++++++++++++++----------
5 files changed, 57 insertions(+), 24 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index c0f7442c8a9e..9fa16c0c20aa 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -701,7 +701,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
* the extra traps reduce performance. So, eagerly SFENCE.VMA.
*/
while (nr--)
- local_flush_tlb_page(address + nr * PAGE_SIZE);
+ local_flush_tlb_page(address + nr * PAGE_SIZE, PAGE_SIZE);
svvptc:;
/*
@@ -719,9 +719,12 @@ svvptc:;
static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
{
- pte_t *ptep = (pte_t *)pmdp;
+ asm goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+ : : : : svvptc);
- update_mmu_cache(vma, address, ptep);
+ local_flush_tlb_page(address, PMD_SIZE);
+
+svvptc:;
}
#define __HAVE_ARCH_PTE_SAME
diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
index 72e559934952..25cc39ab84d5 100644
--- a/arch/riscv/include/asm/tlbflush.h
+++ b/arch/riscv/include/asm/tlbflush.h
@@ -29,18 +29,32 @@ static inline void local_flush_tlb_all_asid(unsigned long asid)
}
/* Flush one page from local TLB */
-static inline void local_flush_tlb_page(unsigned long addr)
+static inline void local_flush_tlb_page(unsigned long addr,
+ unsigned long page_size)
{
- ALT_SFENCE_VMA_ADDR(addr);
+ unsigned int i;
+ unsigned long hw_page_num = 1 << (PAGE_SHIFT - HW_PAGE_SHIFT);
+ unsigned long hw_page_size = page_size >> (PAGE_SHIFT - HW_PAGE_SHIFT);
+
+ for (i = 0; i < hw_page_num; i++, addr += hw_page_size)
+ ALT_SFENCE_VMA_ADDR(addr);
}
static inline void local_flush_tlb_page_asid(unsigned long addr,
+ unsigned long page_size,
unsigned long asid)
{
- if (asid != FLUSH_TLB_NO_ASID)
- ALT_SFENCE_VMA_ADDR_ASID(addr, asid);
- else
- local_flush_tlb_page(addr);
+ unsigned int i;
+ unsigned long hw_page_num, hw_page_size;
+
+ if (asid != FLUSH_TLB_NO_ASID) {
+ hw_page_num = 1 << (PAGE_SHIFT - HW_PAGE_SHIFT);
+ hw_page_size = page_size >> (PAGE_SHIFT - HW_PAGE_SHIFT);
+
+ for (i = 0; i < hw_page_num; i++, addr += hw_page_size)
+ ALT_SFENCE_VMA_ADDR_ASID(addr, asid);
+ } else
+ local_flush_tlb_page(addr, page_size);
}
void flush_tlb_all(void);
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 4772152be0f9..94524e5adc0b 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -118,7 +118,7 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
pmd_t *pmd_k;
pte_t *pte_k;
int index;
- unsigned long pfn;
+ unsigned long pfn, page_size;
/* User mode accesses just cause a SIGSEGV */
if (user_mode(regs))
@@ -154,8 +154,10 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
no_context(regs, addr);
return;
}
- if (pud_leaf(pudp_get(pud_k)))
+ if (pud_leaf(pudp_get(pud_k))) {
+ page_size = PUD_SIZE;
goto flush_tlb;
+ }
/*
* Since the vmalloc area is global, it is unnecessary
@@ -166,8 +168,10 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
no_context(regs, addr);
return;
}
- if (pmd_leaf(pmdp_get(pmd_k)))
+ if (pmd_leaf(pmdp_get(pmd_k))) {
+ page_size = PMD_SIZE;
goto flush_tlb;
+ }
/*
* Make sure the actual PTE exists as well to
@@ -180,6 +184,7 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
no_context(regs, addr);
return;
}
+ page_size = PAGE_SIZE;
/*
* The kernel assumes that TLBs don't cache invalid
@@ -188,7 +193,7 @@ static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long a
* necessary even after writing invalid entries.
*/
flush_tlb:
- local_flush_tlb_page(addr);
+ local_flush_tlb_page(addr, page_size);
}
static inline bool access_error(unsigned long cause, struct vm_area_struct *vma)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f9334aab45a6..678b892b4ed8 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -356,7 +356,7 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, prot));
else
pte_clear(&init_mm, addr, ptep);
- local_flush_tlb_page(addr);
+ local_flush_tlb_page(addr, PAGE_SIZE);
}
static inline pte_t *__init get_pte_virt_early(phys_addr_t pa)
diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
index 9b6e86ce3867..d5036f2a8244 100644
--- a/arch/riscv/mm/tlbflush.c
+++ b/arch/riscv/mm/tlbflush.c
@@ -27,7 +27,7 @@ static void local_flush_tlb_range_threshold_asid(unsigned long start,
}
for (i = 0; i < nr_ptes_in_range; ++i) {
- local_flush_tlb_page_asid(start, asid);
+ local_flush_tlb_page_asid(start, stride, asid);
start += stride;
}
}
@@ -36,7 +36,7 @@ static inline void local_flush_tlb_range_asid(unsigned long start,
unsigned long size, unsigned long stride, unsigned long asid)
{
if (size <= stride)
- local_flush_tlb_page_asid(start, asid);
+ local_flush_tlb_page_asid(start, stride, asid);
else if (size == FLUSH_TLB_MAX_SIZE)
local_flush_tlb_all_asid(asid);
else
@@ -126,14 +126,7 @@ void flush_tlb_mm_range(struct mm_struct *mm,
start, end - start, page_size);
}
-void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
-{
- __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
- addr, PAGE_SIZE, PAGE_SIZE);
-}
-
-void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end)
+static inline unsigned long local_flush_tlb_page_size(struct vm_area_struct *vma)
{
unsigned long stride_size;
@@ -161,6 +154,24 @@ void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
}
}
+ return stride_size;
+}
+
+void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ unsigned long page_size;
+
+ page_size = local_flush_tlb_page_size(vma);
+ __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
+ addr, page_size, page_size);
+}
+
+void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ unsigned long stride_size;
+
+ stride_size = local_flush_tlb_page_size(vma);
__flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
start, end - start, stride_size);
}
--
2.20.1
* [RFC PATCH v2 13/21] riscv: mm: Adjust PGDIR/P4D/PUD/PMD_SHIFT
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (11 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 12/21] riscv: mm: Reimplement tlb flush function Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 14/21] riscv: mm: Only apply svnapot region bigger than software page Xu Lu
` (8 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
This commit adjusts the shift of the pte index bits at each page table
level.
For example, in SV39, the traditional va behaves as:
----------------------------------------------
| pgd index | pmd index | pte index | offset |
----------------------------------------------
| 38 30 | 29 21 | 20 12 | 11 0 |
----------------------------------------------
When we choose 64K as the base software page size, the va behaves as:
----------------------------------------------
| pgd index | pmd index | pte index | offset |
----------------------------------------------
| 38 34 | 33 25 | 24 16 | 15 0 |
----------------------------------------------
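The new shifts follow mechanically from PAGE_SHIFT, since each non-leaf level still resolves 9 index bits (the radix of a 4K-entry table). A standalone sketch of the arithmetic, assuming Sv39:

```c
#include <assert.h>

/* Sv39 level shifts as a function of the software page shift:
 * each non-leaf level resolves 9 index bits. */
static int pmd_shift(int page_shift)
{
	return 9 + page_shift;
}

static int pgdir_shift_l3(int page_shift)
{
	return 9 + 9 + page_shift;
}
```

With PAGE_SHIFT = 16 the pgd index occupies va[38:34], i.e. only 5 bits, which is why this patch truncates the pgd index width via __PTRS_PER_PGD.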
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable-32.h | 2 +-
arch/riscv/include/asm/pgtable-64.h | 16 ++++++++--------
arch/riscv/include/asm/pgtable.h | 19 +++++++++++++++++++
3 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable-32.h b/arch/riscv/include/asm/pgtable-32.h
index 2959ab72f926..e0c5c62f88d9 100644
--- a/arch/riscv/include/asm/pgtable-32.h
+++ b/arch/riscv/include/asm/pgtable-32.h
@@ -11,7 +11,7 @@
#include <linux/const.h>
/* Size of region mapped by a page global directory */
-#define PGDIR_SHIFT 22
+#define PGDIR_SHIFT (10 + PAGE_SHIFT)
#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE - 1))
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 2649cc90b14e..26c13484e721 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -13,9 +13,9 @@
extern bool pgtable_l4_enabled;
extern bool pgtable_l5_enabled;
-#define PGDIR_SHIFT_L3 30
-#define PGDIR_SHIFT_L4 39
-#define PGDIR_SHIFT_L5 48
+#define PGDIR_SHIFT_L3 (9 + 9 + PAGE_SHIFT)
+#define PGDIR_SHIFT_L4 (9 + PGDIR_SHIFT_L3)
+#define PGDIR_SHIFT_L5 (9 + PGDIR_SHIFT_L4)
#define PGDIR_SHIFT (pgtable_l5_enabled ? PGDIR_SHIFT_L5 : \
(pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3))
/* Size of region mapped by a page global directory */
@@ -23,20 +23,20 @@ extern bool pgtable_l5_enabled;
#define PGDIR_MASK (~(PGDIR_SIZE - 1))
/* p4d is folded into pgd in case of 4-level page table */
-#define P4D_SHIFT_L3 30
-#define P4D_SHIFT_L4 39
-#define P4D_SHIFT_L5 39
+#define P4D_SHIFT_L3 (9 + 9 + PAGE_SHIFT)
+#define P4D_SHIFT_L4 (9 + P4D_SHIFT_L3)
+#define P4D_SHIFT_L5 (9 + P4D_SHIFT_L3)
#define P4D_SHIFT (pgtable_l5_enabled ? P4D_SHIFT_L5 : \
(pgtable_l4_enabled ? P4D_SHIFT_L4 : P4D_SHIFT_L3))
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))
/* pud is folded into pgd in case of 3-level page table */
-#define PUD_SHIFT 30
+#define PUD_SHIFT (9 + 9 + PAGE_SHIFT)
#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE - 1))
-#define PMD_SHIFT 21
+#define PMD_SHIFT (9 + PAGE_SHIFT)
/* Size of region mapped by a page middle directory */
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9fa16c0c20aa..0fd9bd4e0d13 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -30,12 +30,27 @@
/* Number of entries in the page table */
#define PTRS_PER_PTE (PAGE_SIZE / sizeof(pte_t))
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+
+/*
+ * PGDIR_SHIFT grows as PAGE_SIZE grows. To avoid va exceeds limitation, pgd
+ * index bits should be cut. Thus we use HW_PAGE_SIZE instead.
+ */
+#define __PTRS_PER_PGD (HW_PAGE_SIZE / sizeof(pgd_t))
+#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (__PTRS_PER_PGD - 1))
+
+#define KERN_VIRT_SIZE ((__PTRS_PER_PGD / 2 * PGDIR_SIZE) / 2)
+
+#else
+
/*
* Half of the kernel address space (1/4 of the entries of the page global
* directory) is for the direct mapping.
*/
#define KERN_VIRT_SIZE ((PTRS_PER_PGD / 2 * PGDIR_SIZE) / 2)
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
#define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
#define VMALLOC_END PAGE_OFFSET
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
@@ -1304,7 +1319,11 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
* Similarly for SV57, bits 63–57 must be equal to bit 56.
*/
#ifdef CONFIG_64BIT
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+#define TASK_SIZE_64 (PGDIR_SIZE * __PTRS_PER_PGD / 2)
+#else
#define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
+#endif
#define TASK_SIZE_MAX LONG_MAX
#ifdef CONFIG_COMPAT
--
2.20.1
* [RFC PATCH v2 14/21] riscv: mm: Only apply svnapot region bigger than software page
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (12 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 13/21] riscv: mm: Adjust PGDIR/P4D/PUD/PMD_SHIFT Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 15/21] riscv: mm: Adjust FIX_BTMAPS_SLOTS for variable PAGE_SIZE Xu Lu
` (7 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Usually, when we refer to the napot pte order, we mean the order in units of
software pages. Thus, this commit updates the napot order calculation
accordingly. Also, we only treat an svnapot pte as a huge pte when its
napot size is bigger than the software page size.
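To make the unit change concrete, here is a standalone sketch of the base-order conversion (assuming a 4K hardware page, a 64K software page, and Svnapot's smallest contiguous order of 4, i.e. 64K — these values are taken from the series and current kernel headers, so treat them as assumptions):

```c
#include <assert.h>

#define HW_PAGE_SHIFT 12
#define PAGE_SHIFT    16
#define NAPOT_CONT_ORDER_BASE 4	/* 64K napot region, in 4K hw pages */

/* Napot order expressed in software pages, mirroring the patch's
 * NAPOT_PAGE_ORDER_BASE: orders that would be smaller than one
 * software page are clamped. */
static int napot_page_order_base(void)
{
	int diff = PAGE_SHIFT - HW_PAGE_SHIFT;

	return (NAPOT_CONT_ORDER_BASE >= diff) ?
	       (NAPOT_CONT_ORDER_BASE - diff) : 1;
}
```

A result of 0 means the smallest napot region equals exactly one 64K software page, which is why napot_hugetlbpages_init() only registers orders whose napot_cont_shift() exceeds PAGE_SHIFT.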
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable-64.h | 21 +++++++++---
arch/riscv/include/asm/pgtable.h | 50 +++++++++++++++++++++++------
arch/riscv/mm/hugetlbpage.c | 7 ++--
3 files changed, 61 insertions(+), 17 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 26c13484e721..fbdaad9a98dd 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -124,12 +124,23 @@ enum napot_cont_order {
NAPOT_ORDER_MAX,
};
+#define NAPOT_PAGE_ORDER_BASE \
+ ((NAPOT_CONT_ORDER_BASE >= (PAGE_SHIFT - HW_PAGE_SHIFT)) ? \
+ (NAPOT_CONT_ORDER_BASE - (PAGE_SHIFT - HW_PAGE_SHIFT)) : 1)
+#define NAPOT_PAGE_ORDER_MAX \
+ ((NAPOT_ORDER_MAX > (PAGE_SHIFT - HW_PAGE_SHIFT)) ? \
+ (NAPOT_ORDER_MAX - (PAGE_SHIFT - HW_PAGE_SHIFT)) : \
+ NAPOT_PAGE_ORDER_BASE)
+
#define for_each_napot_order(order) \
- for (order = NAPOT_CONT_ORDER_BASE; order < NAPOT_ORDER_MAX; order++)
+ for (order = NAPOT_PAGE_ORDER_BASE; \
+ order < NAPOT_PAGE_ORDER_MAX; order++)
#define for_each_napot_order_rev(order) \
- for (order = NAPOT_ORDER_MAX - 1; \
- order >= NAPOT_CONT_ORDER_BASE; order--)
-#define napot_cont_order(val) (__builtin_ctzl((pte_val(val) >> _PAGE_PFN_SHIFT) << 1))
+ for (order = NAPOT_PAGE_ORDER_MAX - 1; \
+ order >= NAPOT_PAGE_ORDER_BASE; order--)
+#define napot_cont_order(val) \
+ (__builtin_ctzl((pte_val(val) >> _PAGE_HWPFN_SHIFT) << 1) - \
+ (PAGE_SHIFT - HW_PAGE_SHIFT))
#define napot_cont_shift(order) ((order) + PAGE_SHIFT)
#define napot_cont_size(order) BIT(napot_cont_shift(order))
@@ -137,7 +148,7 @@ enum napot_cont_order {
#define napot_pte_num(order) BIT(order)
#ifdef CONFIG_RISCV_ISA_SVNAPOT
-#define HUGE_MAX_HSTATE (2 + (NAPOT_ORDER_MAX - NAPOT_CONT_ORDER_BASE))
+#define HUGE_MAX_HSTATE (2 + (NAPOT_ORDER_MAX - NAPOT_PAGE_ORDER_BASE))
#else
#define HUGE_MAX_HSTATE 2
#endif
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0fd9bd4e0d13..07d557bc8b39 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -130,7 +130,7 @@
#include <asm/compat.h>
#define __page_val_to_hwpfn(_val) (((_val) & _PAGE_HW_PFN_MASK) >> _PAGE_HWPFN_SHIFT)
-#define __page_val_to_pfn(_val) (((_val) & _PAGE_PFN_MASK) >> _PAGE_PFN_SHIFT)
+static inline unsigned long __page_val_to_pfn(unsigned long val);
#ifdef CONFIG_64BIT
#include <asm/pgtable-64.h>
@@ -470,15 +470,42 @@ static inline unsigned long pte_napot(pte_t pte)
return __pte_napot(pte_val(pte));
}
-static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
+static inline unsigned long __pte_mknapot(unsigned long pteval,
+ unsigned int order)
{
int pos = order - 1 + _PAGE_PFN_SHIFT;
unsigned long napot_bit = BIT(pos);
- unsigned long napot_mask = ~GENMASK(pos, _PAGE_PFN_SHIFT);
+ unsigned long napot_mask = ~GENMASK(pos, _PAGE_HWPFN_SHIFT);
+
+ BUG_ON(__pte_napot(pteval));
+ pteval = (pteval & napot_mask) | napot_bit | _PAGE_NAPOT;
- return __pte((pte_val(pte) & napot_mask) | napot_bit | _PAGE_NAPOT);
+ return pteval;
}
+#ifdef CONFIG_RISCV_USE_SW_PAGE
+static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
+{
+ unsigned long pteval = pte_val(pte);
+ unsigned int i;
+
+ pteval = __pte_mknapot(pteval, order);
+ for (i = 0; i < HW_PAGES_PER_PAGE; i++)
+ pte.ptes[i] = pteval;
+
+ return pte;
+}
+#else
+static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
+{
+ unsigned long pteval = pte_val(pte);
+
+ pte_val(pte) = __pte_mknapot(pteval, order);
+
+ return pte;
+}
+#endif /* CONFIG_RISCV_USE_SW_PAGE */
+
#else
static __always_inline bool has_svnapot(void) { return false; }
@@ -495,15 +522,20 @@ static inline unsigned long pte_napot(pte_t pte)
#endif /* CONFIG_RISCV_ISA_SVNAPOT */
-/* Yields the page frame number (PFN) of a page table entry */
-static inline unsigned long pte_pfn(pte_t pte)
+static inline unsigned long __page_val_to_pfn(unsigned long pteval)
{
- unsigned long res = __page_val_to_pfn(pte_val(pte));
+ unsigned long res = __page_val_to_hwpfn(pteval);
- if (has_svnapot() && pte_napot(pte))
+ if (has_svnapot() && __pte_napot(pteval))
res = res & (res - 1UL);
- return res;
+ return hwpfn_to_pfn(res);
+}
+
+/* Yields the page frame number (PFN) of a page table entry */
+static inline unsigned long pte_pfn(pte_t pte)
+{
+ return __page_val_to_pfn(pte_val(pte));
}
#define pte_page(x) pfn_to_page(pte_pfn(x))
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 8896c28ec881..4286c7dea68d 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -212,7 +212,7 @@ pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
break;
}
}
- if (order == NAPOT_ORDER_MAX)
+ if (order == NAPOT_PAGE_ORDER_MAX)
entry = pte_mkhuge(entry);
return entry;
@@ -405,7 +405,8 @@ static __init int napot_hugetlbpages_init(void)
unsigned long order;
for_each_napot_order(order)
- hugetlb_add_hstate(order);
+ if (napot_cont_shift(order) > PAGE_SHIFT)
+ hugetlb_add_hstate(order);
}
return 0;
}
@@ -426,7 +427,7 @@ static bool __hugetlb_valid_size(unsigned long size)
return true;
else if (IS_ENABLED(CONFIG_64BIT) && size == PUD_SIZE)
return true;
- else if (is_napot_size(size))
+ else if (is_napot_size(size) && size > PAGE_SIZE)
return true;
else
return false;
--
2.20.1
* [RFC PATCH v2 15/21] riscv: mm: Adjust FIX_BTMAPS_SLOTS for variable PAGE_SIZE
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (13 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 14/21] riscv: mm: Only apply svnapot region bigger than software page Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 16/21] riscv: mm: Adjust FIX_FDT_SIZE for variable PMD_SIZE Xu Lu
` (6 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/fixmap.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 0a55099bb734..17bf31334bd5 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -44,7 +44,8 @@ enum fixed_addresses {
* before ioremap() is functional.
*/
#define NR_FIX_BTMAPS (SZ_256K / PAGE_SIZE)
-#define FIX_BTMAPS_SLOTS 7
+#define FIX_BTMAPS_SIZE (FIXADDR_SIZE - ((FIX_BTMAP_END + 1) << PAGE_SHIFT))
+#define FIX_BTMAPS_SLOTS (FIX_BTMAPS_SIZE / SZ_256K)
#define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
FIX_BTMAP_END = __end_of_permanent_fixed_addresses,
--
2.20.1
* [RFC PATCH v2 16/21] riscv: mm: Adjust FIX_FDT_SIZE for variable PMD_SIZE
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (14 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 15/21] riscv: mm: Adjust FIX_BTMAPS_SLOTS for variable PAGE_SIZE Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 17/21] riscv: mm: Apply Svnapot for base page mapping if possible Xu Lu
` (5 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 07d557bc8b39..5b2ca92ad833 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -108,10 +108,10 @@
#define PCI_IO_END VMEMMAP_START
#define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE)
-#define FIXADDR_TOP PCI_IO_START
+#define FIXADDR_TOP (PCI_IO_START & PMD_MASK)
#ifdef CONFIG_64BIT
#define MAX_FDT_SIZE PMD_SIZE
-#define FIX_FDT_SIZE (MAX_FDT_SIZE + SZ_2M)
+#define FIX_FDT_SIZE (MAX_FDT_SIZE + PMD_SIZE)
#define FIXADDR_SIZE (PMD_SIZE + FIX_FDT_SIZE)
#else
#define MAX_FDT_SIZE PGDIR_SIZE
--
2.20.1
* [RFC PATCH v2 17/21] riscv: mm: Apply Svnapot for base page mapping if possible
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (15 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 16/21] riscv: mm: Adjust FIX_FDT_SIZE for variable PMD_SIZE Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 18/21] riscv: Kconfig: Introduce 64K page size Xu Lu
` (4 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
All hardware pages in the same software page map the same contiguous
memory region (whose size equals the software page size) and have the
same prots. Thus, this commit uses the Svnapot extension to optimize the
software page mapping and reduce TLB pressure.
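The napot encode/decode pair underlying this can be sketched standalone (the bit positions below are illustrative approximations of the RISC-V pte layout — pfn field at bit 10, napot flag at bit 63 — and the pfn is assumed naturally aligned to the napot size):

```c
#include <assert.h>

#define _PAGE_PFN_SHIFT 10
#define _PAGE_NAPOT (1UL << 63)

/* Svnapot marks a 2^order region by setting bit (order - 1) inside the
 * pfn field of a pte whose pfn is aligned to 2^order pages. */
static unsigned long mknapot(unsigned long pfn, unsigned int order)
{
	return ((pfn | (1UL << (order - 1))) << _PAGE_PFN_SHIFT) | _PAGE_NAPOT;
}

/* Clearing the lowest set bit of the encoded pfn recovers the base pfn,
 * as in the series' __pte_denapot(): res & (res - 1). */
static unsigned long napot_base_pfn(unsigned long pteval)
{
	unsigned long pfn = (pteval >> _PAGE_PFN_SHIFT) & ((1UL << 44) - 1);

	return pfn & (pfn - 1);
}
```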
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/pgtable.h | 16 +++++++++++++++-
arch/riscv/mm/pgtable.c | 6 ++++++
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 5b2ca92ad833..9f347e5eefeb 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -483,13 +483,27 @@ static inline unsigned long __pte_mknapot(unsigned long pteval,
return pteval;
}
+static inline unsigned long __pte_denapot(unsigned long pteval)
+{
+ unsigned long prot_mask = ~(_PAGE_HW_PFN_MASK | _PAGE_NAPOT);
+ unsigned long res;
+
+ if (!__pte_napot(pteval))
+ return pteval;
+ res = __page_val_to_hwpfn(pteval);
+ res = res & (res - 1UL);
+ pteval = (res << _PAGE_HWPFN_SHIFT) | (pteval & prot_mask);
+
+ return pteval;
+}
+
#ifdef CONFIG_RISCV_USE_SW_PAGE
static inline pte_t pte_mknapot(pte_t pte, unsigned int order)
{
unsigned long pteval = pte_val(pte);
unsigned int i;
- pteval = __pte_mknapot(pteval, order);
+ pteval = __pte_denapot(pteval);
for (i = 0; i < HW_PAGES_PER_PAGE; i++)
pte.ptes[i] = pteval;
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index 150aea8e2d7a..0bcaffe798d5 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -11,6 +11,12 @@ pte_t __pte(unsigned long pteval)
{
pte_t pte;
unsigned int i;
+ unsigned int order;
+
+ if (has_svnapot() && __pte_present(pteval) && !__pte_napot(pteval))
+ for_each_napot_order(order)
+ if (napot_cont_shift(order) == PAGE_SHIFT)
+ pteval = __pte_mknapot(pteval, order);
for (i = 0; i < HW_PAGES_PER_PAGE; i++) {
pte.ptes[i] = pteval;
--
2.20.1
* [RFC PATCH v2 18/21] riscv: Kconfig: Introduce 64K page size
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (16 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 17/21] riscv: mm: Apply Svnapot for base page mapping if possible Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 19/21] riscv: Kconfig: Adjust mmap rnd bits for 64K Page Xu Lu
` (3 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
This patch introduces a new config option to control whether the 64K base
page feature is enabled on RISC-V.
The 64K config sets the software page size to 64K and automatically uses
Svnapot to accelerate base page mapping.
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/Kconfig | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2c0cb175a92a..592eb5766029 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -167,7 +167,6 @@ config RISCV
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
select HAVE_MOVE_PUD
- select HAVE_PAGE_SIZE_4KB
select HAVE_PCI
select HAVE_PERF_EVENTS
select HAVE_PERF_REGS
@@ -885,6 +884,28 @@ config RISCV_BOOT_SPINWAIT
If unsure what to do here, say N.
+choice
+ prompt "Page size"
+ default RISCV_4K_PAGES
+ help
+ Page size (translation granule) configuration.
+
+config RISCV_4K_PAGES
+ bool "4KB"
+ select HAVE_PAGE_SIZE_4KB
+ help
+ This feature enables 4KB pages support.
+
+config RISCV_64K_PAGES
+ bool "64KB"
+ depends on RISCV_ISA_SVNAPOT && 64BIT
+ select HAVE_PAGE_SIZE_64KB
+ select RISCV_USE_SW_PAGE
+ help
+ This feature enables 64KB pages support.
+
+endchoice
+
config ARCH_SUPPORTS_KEXEC
def_bool y
--
2.20.1
* [RFC PATCH v2 19/21] riscv: Kconfig: Adjust mmap rnd bits for 64K Page
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (17 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 18/21] riscv: Kconfig: Introduce 64K page size Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 20/21] riscv: mm: Adjust address space layout and init page table " Xu Lu
` (2 subsequent siblings)
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 592eb5766029..9dfe95a12890 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -253,6 +253,7 @@ config ARCH_MMAP_RND_COMPAT_BITS_MIN
# max bits determined by the following formula:
# VA_BITS - PAGE_SHIFT - 3
config ARCH_MMAP_RND_BITS_MAX
+ default 20 if 64BIT && RISCV_64K_PAGES # SV39 based
default 24 if 64BIT # SV39 based
default 17
--
2.20.1
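The Kconfig comment above this hunk states the max is determined by
"VA_BITS - PAGE_SHIFT - 3"; a quick sketch (assuming SV39's 39 VA bits, as the
"SV39 based" comments indicate) confirms why the 64K default drops from 24
to 20:

```python
# "max bits determined by the following formula: VA_BITS - PAGE_SHIFT - 3"
def mmap_rnd_bits_max(va_bits, page_shift):
    return va_bits - page_shift - 3

assert mmap_rnd_bits_max(39, 12) == 24  # SV39, 4K pages
assert mmap_rnd_bits_max(39, 16) == 20  # SV39, 64K pages
```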
* [RFC PATCH v2 20/21] riscv: mm: Adjust address space layout and init page table for 64K Page
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (18 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 19/21] riscv: Kconfig: Adjust mmap rnd bits for 64K Page Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-05 10:37 ` [RFC PATCH v2 21/21] riscv: mm: Update EXEC_PAGESIZE " Xu Lu
2024-12-06 2:00 ` [RFC PATCH v2 00/21] riscv: Introduce 64K base page Zi Yan
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/asm/page.h | 6 +++++-
arch/riscv/include/asm/pgtable.h | 12 +++++++++++
arch/riscv/mm/init.c | 36 +++++++++++++++++++++++---------
3 files changed, 43 insertions(+), 11 deletions(-)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 9bc908d94c7a..236b0106a1c9 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -40,8 +40,12 @@
* By default, CONFIG_PAGE_OFFSET value corresponds to SV57 address space so
* define the PAGE_OFFSET value for SV48 and SV39.
*/
+#ifdef CONFIG_RISCV_64K_PAGES
+#define PAGE_OFFSET_L4 _AC(0xffffa80000000000, UL)
+#else
#define PAGE_OFFSET_L4 _AC(0xffffaf8000000000, UL)
-#define PAGE_OFFSET_L3 _AC(0xffffffd600000000, UL)
+#endif /* CONFIG_RISCV_64K_PAGES */
+#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
#else
#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
#endif /* CONFIG_64BIT */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9f347e5eefeb..fbc397c4e1c8 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -147,6 +147,18 @@ static inline unsigned long __page_val_to_pfn(unsigned long val);
#include <asm/pgtable-32.h>
#endif /* CONFIG_64BIT */
+#define __PMD_SHIFT (PMD_SHIFT - (PAGE_SHIFT - HW_PAGE_SHIFT))
+#define __PMD_SIZE (_AC(1, UL) << __PMD_SHIFT)
+
+#define __PUD_SHIFT (PUD_SHIFT - (PAGE_SHIFT - HW_PAGE_SHIFT))
+#define __PUD_SIZE (_AC(1, UL) << __PUD_SHIFT)
+
+#define __P4D_SHIFT (P4D_SHIFT - (PAGE_SHIFT - HW_PAGE_SHIFT))
+#define __P4D_SIZE (_AC(1, UL) << __P4D_SHIFT)
+
+#define __PGD_SHIFT (PGD_SHIFT - (PAGE_SHIFT - HW_PAGE_SHIFT))
+#define __PGD_SIZE (_AC(1, UL) << __PGD_SHIFT)
+
#include <linux/page_table_check.h>
#ifdef CONFIG_XIP_KERNEL
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 678b892b4ed8..2c6b7ea33009 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -695,15 +695,15 @@ static uintptr_t __meminit best_map_size(phys_addr_t pa, uintptr_t va, phys_addr
return PAGE_SIZE;
if (pgtable_l5_enabled &&
- !(pa & (P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= P4D_SIZE)
+ !(pa & (__P4D_SIZE - 1)) && !(va & (P4D_SIZE - 1)) && size >= P4D_SIZE)
return P4D_SIZE;
if (pgtable_l4_enabled &&
- !(pa & (PUD_SIZE - 1)) && !(va & (PUD_SIZE - 1)) && size >= PUD_SIZE)
+ !(pa & (__PUD_SIZE - 1)) && !(va & (PUD_SIZE - 1)) && size >= PUD_SIZE)
return PUD_SIZE;
if (IS_ENABLED(CONFIG_64BIT) &&
- !(pa & (PMD_SIZE - 1)) && !(va & (PMD_SIZE - 1)) && size >= PMD_SIZE)
+ !(pa & (__PMD_SIZE - 1)) && !(va & (PMD_SIZE - 1)) && size >= PMD_SIZE)
return PMD_SIZE;
return PAGE_SIZE;
@@ -937,17 +937,33 @@ static void __init create_kernel_page_table(pgd_t *pgdir,
PMD_SIZE, PAGE_KERNEL);
}
#else
+
+#ifdef CONFIG_RISCV_64K_PAGES
+/* TODO: better implementation */
+#define KERNEL_MAP_STEP PAGE_SIZE
+#else
+#define KERNEL_MAP_STEP PMD_SIZE
+#endif
+
static void __init create_kernel_page_table(pgd_t *pgdir, bool early)
{
uintptr_t va, end_va;
end_va = kernel_map.virt_addr + kernel_map.size;
- for (va = kernel_map.virt_addr; va < end_va; va += PMD_SIZE)
- create_pgd_mapping(pgdir, va,
- kernel_map.phys_addr + (va - kernel_map.virt_addr),
- PMD_SIZE,
- early ?
- PAGE_KERNEL_EXEC : pgprot_from_va(va));
+ if (early)
+ for (va = kernel_map.virt_addr; va < end_va; va += PMD_SIZE)
+ create_pgd_mapping(pgdir, va,
+ kernel_map.phys_addr + (va - kernel_map.virt_addr),
+ PMD_SIZE,
+ early ?
+ PAGE_KERNEL_EXEC : pgprot_from_va(va));
+ else
+ for (va = kernel_map.virt_addr; va < end_va; va += KERNEL_MAP_STEP)
+ create_pgd_mapping(pgdir, va,
+ kernel_map.phys_addr + (va - kernel_map.virt_addr),
+ KERNEL_MAP_STEP,
+ early ?
+ PAGE_KERNEL_EXEC : pgprot_from_va(va));
}
#endif
@@ -1138,7 +1154,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
/* Sanity check alignment and size */
BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
- BUG_ON((kernel_map.phys_addr % PMD_SIZE) != 0);
+ BUG_ON((kernel_map.phys_addr % __PMD_SIZE) != 0);
#ifdef CONFIG_64BIT
/*
--
2.20.1
* [RFC PATCH v2 21/21] riscv: mm: Update EXEC_PAGESIZE for 64K Page
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (19 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 20/21] riscv: mm: Adjust address space layout and init page table " Xu Lu
@ 2024-12-05 10:37 ` Xu Lu
2024-12-06 2:00 ` [RFC PATCH v2 00/21] riscv: Introduce 64K base page Zi Yan
21 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-05 10:37 UTC (permalink / raw)
To: paul.walmsley, palmer, aou, ardb, anup, atishp
Cc: xieyongji, lihangjing, punit.agrawal, linux-kernel, linux-riscv,
Xu Lu
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
---
arch/riscv/include/uapi/asm/param.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
create mode 100644 arch/riscv/include/uapi/asm/param.h
diff --git a/arch/riscv/include/uapi/asm/param.h b/arch/riscv/include/uapi/asm/param.h
new file mode 100644
index 000000000000..1221e570a077
--- /dev/null
+++ b/arch/riscv/include/uapi/asm/param.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (C) 2024 RISCV Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __ASM_PARAM_H
+#define __ASM_PARAM_H
+
+#define EXEC_PAGESIZE 65536
+
+#include <asm-generic/param.h>
+
+#endif
--
2.20.1
* Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-05 10:37 [RFC PATCH v2 00/21] riscv: Introduce 64K base page Xu Lu
` (20 preceding siblings ...)
2024-12-05 10:37 ` [RFC PATCH v2 21/21] riscv: mm: Update EXEC_PAGESIZE " Xu Lu
@ 2024-12-06 2:00 ` Zi Yan
2024-12-06 2:41 ` [External] " Xu Lu
2024-12-06 10:13 ` David Hildenbrand
21 siblings, 2 replies; 30+ messages in thread
From: Zi Yan @ 2024-12-06 2:00 UTC (permalink / raw)
To: Xu Lu
Cc: paul.walmsley, palmer, aou, ardb, anup, atishp, xieyongji,
lihangjing, punit.agrawal, linux-kernel, linux-riscv, Linux MM
On 5 Dec 2024, at 5:37, Xu Lu wrote:
> This patch series attempts to break through the limitation of MMU and
> supports larger base page on RISC-V, which only supports 4K page size
> now. The key idea is to always manage and allocate memory at a
> granularity of 64K and use SVNAPOT to accelerate address translation.
> This is the second version and the detailed introduction can be found
> in [1].
>
> Changes from v1:
> - Rebase on v6.12.
>
> - Adjust the page table entry shift to reduce page table memory usage.
> For example, in SV39, the traditional va behaves as:
>
> ----------------------------------------------
> | pgd index | pmd index | pte index | offset |
> ----------------------------------------------
> | 38 30 | 29 21 | 20 12 | 11 0 |
> ----------------------------------------------
>
> When we choose 64K as basic software page, va now behaves as:
>
> ----------------------------------------------
> | pgd index | pmd index | pte index | offset |
> ----------------------------------------------
> | 38 34 | 33 25 | 24 16 | 15 0 |
> ----------------------------------------------
>
> - Fix some bugs in v1.
>
> Thanks in advance for comments.
>
> [1] https://lwn.net/Articles/952722/
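For concreteness, the quoted 64K layout splits a virtual address as follows
(a hedged sketch; field widths are taken directly from the tables above):

```python
def sv39_64k_indices(va):
    """Split a virtual address per the 64K SV39 layout quoted above."""
    offset    = va & ((1 << 16) - 1)   # bits 15..0
    pte_index = (va >> 16) & 0x1ff     # bits 24..16 (9 bits, 512 entries)
    pmd_index = (va >> 25) & 0x1ff     # bits 33..25 (9 bits)
    pgd_index = (va >> 34) & 0x1f      # bits 38..34 (5 bits)
    return pgd_index, pmd_index, pte_index, offset
```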
This looks very interesting. Can you cc me and linux-mm@kvack.org
in the future? Thanks.
Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
should have a similar effect to RISC-V’s SVNAPOT, right?
>
> Xu Lu (21):
> riscv: mm: Distinguish hardware base page and software base page
> riscv: mm: Configure satp with hw page pfn
> riscv: mm: Reimplement page table entry structures
> riscv: mm: Reimplement page table entry constructor function
> riscv: mm: Reimplement conversion functions between page table entry
> riscv: mm: Avoid pte constructor during pte conversion
> riscv: mm: Reimplement page table entry get function
> riscv: mm: Reimplement page table entry atomic get function
> riscv: mm: Replace READ_ONCE with atomic pte get function
> riscv: mm: Reimplement PTE A/D bit check function
> riscv: mm: Reimplement mk_huge_pte function
> riscv: mm: Reimplement tlb flush function
> riscv: mm: Adjust PGDIR/P4D/PUD/PMD_SHIFT
> riscv: mm: Only apply svnapot region bigger than software page
> riscv: mm: Adjust FIX_BTMAPS_SLOTS for variable PAGE_SIZE
> riscv: mm: Adjust FIX_FDT_SIZE for variable PMD_SIZE
> riscv: mm: Apply Svnapot for base page mapping if possible
> riscv: Kconfig: Introduce 64K page size
> riscv: Kconfig: Adjust mmap rnd bits for 64K Page
> riscv: mm: Adjust address space layout and init page table for 64K
> Page
> riscv: mm: Update EXEC_PAGESIZE for 64K Page
>
> arch/riscv/Kconfig | 34 +-
> arch/riscv/include/asm/fixmap.h | 3 +-
> arch/riscv/include/asm/hugetlb.h | 5 +
> arch/riscv/include/asm/page.h | 56 ++-
> arch/riscv/include/asm/pgtable-32.h | 12 +-
> arch/riscv/include/asm/pgtable-64.h | 128 ++++--
> arch/riscv/include/asm/pgtable-bits.h | 3 +-
> arch/riscv/include/asm/pgtable.h | 564 +++++++++++++++++++++++---
> arch/riscv/include/asm/tlbflush.h | 26 +-
> arch/riscv/include/uapi/asm/param.h | 24 ++
> arch/riscv/kernel/head.S | 4 +-
> arch/riscv/kernel/hibernate.c | 21 +-
> arch/riscv/mm/context.c | 7 +-
> arch/riscv/mm/fault.c | 15 +-
> arch/riscv/mm/hugetlbpage.c | 30 +-
> arch/riscv/mm/init.c | 45 +-
> arch/riscv/mm/kasan_init.c | 7 +-
> arch/riscv/mm/pgtable.c | 111 ++++-
> arch/riscv/mm/tlbflush.c | 31 +-
> arch/s390/include/asm/hugetlb.h | 2 +-
> include/asm-generic/hugetlb.h | 5 +-
> include/linux/pgtable.h | 21 +
> kernel/events/core.c | 6 +-
> mm/debug_vm_pgtable.c | 6 +-
> mm/gup.c | 10 +-
> mm/hmm.c | 2 +-
> mm/hugetlb.c | 4 +-
> mm/mapping_dirty_helpers.c | 2 +-
> mm/memory.c | 4 +-
> mm/mprotect.c | 2 +-
> mm/ptdump.c | 8 +-
> mm/sparse-vmemmap.c | 2 +-
> mm/vmscan.c | 2 +-
> 33 files changed, 1029 insertions(+), 173 deletions(-)
> create mode 100644 arch/riscv/include/uapi/asm/param.h
>
> --
> 2.20.1
Best Regards,
Yan, Zi
* Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-06 2:00 ` [RFC PATCH v2 00/21] riscv: Introduce 64K base page Zi Yan
@ 2024-12-06 2:41 ` Xu Lu
2024-12-06 10:13 ` David Hildenbrand
1 sibling, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-06 2:41 UTC (permalink / raw)
To: Zi Yan
Cc: paul.walmsley, palmer, aou, ardb, anup, atishp, xieyongji,
lihangjing, punit.agrawal, linux-kernel, linux-riscv, Linux MM
Hi Zi Yan,
On Fri, Dec 6, 2024 at 10:00 AM Zi Yan <ziy@nvidia.com> wrote:
>
> On 5 Dec 2024, at 5:37, Xu Lu wrote:
>
> > This patch series attempts to break through the limitation of MMU and
> > supports larger base page on RISC-V, which only supports 4K page size
> > now. The key idea is to always manage and allocate memory at a
> > granularity of 64K and use SVNAPOT to accelerate address translation.
> > This is the second version and the detailed introduction can be found
> > in [1].
> >
> > Changes from v1:
> > - Rebase on v6.12.
> >
> > - Adjust the page table entry shift to reduce page table memory usage.
> > For example, in SV39, the traditional va behaves as:
> >
> > ----------------------------------------------
> > | pgd index | pmd index | pte index | offset |
> > ----------------------------------------------
> > | 38 30 | 29 21 | 20 12 | 11 0 |
> > ----------------------------------------------
> >
> > When we choose 64K as basic software page, va now behaves as:
> >
> > ----------------------------------------------
> > | pgd index | pmd index | pte index | offset |
> > ----------------------------------------------
> > | 38 34 | 33 25 | 24 16 | 15 0 |
> > ----------------------------------------------
> >
> > - Fix some bugs in v1.
> >
> > Thanks in advance for comments.
> >
> > [1] https://lwn.net/Articles/952722/
>
> This looks very interesting. Can you cc me and linux-mm@kvack.org
> in the future? Thanks.
Of course. Hope this patch can be of any help.
>
> Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> should have similar effect of RISC-V’s SVNAPOT, right?
I have not thought about it yet. ARM64 has a native 64K MMU: the kernel
can directly configure the page size as 64K and the MMU will translate
at the corresponding granularity, so I doubt there is a need to
implement a 64K page size on top of contiguous PTEs. If you want to use
contiguous PTEs for acceleration instead of the 64K MMU, you could try
THP_CONTPTE [1], which has been merged.
[1] https://lwn.net/Articles/935887/
Best regards,
Xu Lu
>
> >
> > Xu Lu (21):
> > riscv: mm: Distinguish hardware base page and software base page
> > riscv: mm: Configure satp with hw page pfn
> > riscv: mm: Reimplement page table entry structures
> > riscv: mm: Reimplement page table entry constructor function
> > riscv: mm: Reimplement conversion functions between page table entry
> > riscv: mm: Avoid pte constructor during pte conversion
> > riscv: mm: Reimplement page table entry get function
> > riscv: mm: Reimplement page table entry atomic get function
> > riscv: mm: Replace READ_ONCE with atomic pte get function
> > riscv: mm: Reimplement PTE A/D bit check function
> > riscv: mm: Reimplement mk_huge_pte function
> > riscv: mm: Reimplement tlb flush function
> > riscv: mm: Adjust PGDIR/P4D/PUD/PMD_SHIFT
> > riscv: mm: Only apply svnapot region bigger than software page
> > riscv: mm: Adjust FIX_BTMAPS_SLOTS for variable PAGE_SIZE
> > riscv: mm: Adjust FIX_FDT_SIZE for variable PMD_SIZE
> > riscv: mm: Apply Svnapot for base page mapping if possible
> > riscv: Kconfig: Introduce 64K page size
> > riscv: Kconfig: Adjust mmap rnd bits for 64K Page
> > riscv: mm: Adjust address space layout and init page table for 64K
> > Page
> > riscv: mm: Update EXEC_PAGESIZE for 64K Page
> >
> > arch/riscv/Kconfig | 34 +-
> > arch/riscv/include/asm/fixmap.h | 3 +-
> > arch/riscv/include/asm/hugetlb.h | 5 +
> > arch/riscv/include/asm/page.h | 56 ++-
> > arch/riscv/include/asm/pgtable-32.h | 12 +-
> > arch/riscv/include/asm/pgtable-64.h | 128 ++++--
> > arch/riscv/include/asm/pgtable-bits.h | 3 +-
> > arch/riscv/include/asm/pgtable.h | 564 +++++++++++++++++++++++---
> > arch/riscv/include/asm/tlbflush.h | 26 +-
> > arch/riscv/include/uapi/asm/param.h | 24 ++
> > arch/riscv/kernel/head.S | 4 +-
> > arch/riscv/kernel/hibernate.c | 21 +-
> > arch/riscv/mm/context.c | 7 +-
> > arch/riscv/mm/fault.c | 15 +-
> > arch/riscv/mm/hugetlbpage.c | 30 +-
> > arch/riscv/mm/init.c | 45 +-
> > arch/riscv/mm/kasan_init.c | 7 +-
> > arch/riscv/mm/pgtable.c | 111 ++++-
> > arch/riscv/mm/tlbflush.c | 31 +-
> > arch/s390/include/asm/hugetlb.h | 2 +-
> > include/asm-generic/hugetlb.h | 5 +-
> > include/linux/pgtable.h | 21 +
> > kernel/events/core.c | 6 +-
> > mm/debug_vm_pgtable.c | 6 +-
> > mm/gup.c | 10 +-
> > mm/hmm.c | 2 +-
> > mm/hugetlb.c | 4 +-
> > mm/mapping_dirty_helpers.c | 2 +-
> > mm/memory.c | 4 +-
> > mm/mprotect.c | 2 +-
> > mm/ptdump.c | 8 +-
> > mm/sparse-vmemmap.c | 2 +-
> > mm/vmscan.c | 2 +-
> > 33 files changed, 1029 insertions(+), 173 deletions(-)
> > create mode 100644 arch/riscv/include/uapi/asm/param.h
> >
> > --
> > 2.20.1
>
>
> Best Regards,
> Yan, Zi
* Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-06 2:00 ` [RFC PATCH v2 00/21] riscv: Introduce 64K base page Zi Yan
2024-12-06 2:41 ` [External] " Xu Lu
@ 2024-12-06 10:13 ` David Hildenbrand
2024-12-06 13:42 ` [External] " Xu Lu
1 sibling, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2024-12-06 10:13 UTC (permalink / raw)
To: Zi Yan, Xu Lu
Cc: paul.walmsley, palmer, aou, ardb, anup, atishp, xieyongji,
lihangjing, punit.agrawal, linux-kernel, linux-riscv, Linux MM
On 06.12.24 03:00, Zi Yan wrote:
> On 5 Dec 2024, at 5:37, Xu Lu wrote:
>
>> This patch series attempts to break through the limitation of MMU and
>> supports larger base page on RISC-V, which only supports 4K page size
>> now. The key idea is to always manage and allocate memory at a
>> granularity of 64K and use SVNAPOT to accelerate address translation.
>> This is the second version and the detailed introduction can be found
>> in [1].
>>
>> Changes from v1:
>> - Rebase on v6.12.
>>
>> - Adjust the page table entry shift to reduce page table memory usage.
>> For example, in SV39, the traditional va behaves as:
>>
>> ----------------------------------------------
>> | pgd index | pmd index | pte index | offset |
>> ----------------------------------------------
>> | 38 30 | 29 21 | 20 12 | 11 0 |
>> ----------------------------------------------
>>
>> When we choose 64K as basic software page, va now behaves as:
>>
>> ----------------------------------------------
>> | pgd index | pmd index | pte index | offset |
>> ----------------------------------------------
>> | 38 34 | 33 25 | 24 16 | 15 0 |
>> ----------------------------------------------
>>
>> - Fix some bugs in v1.
>>
>> Thanks in advance for comments.
>>
>> [1] https://lwn.net/Articles/952722/
>
> This looks very interesting. Can you cc me and linux-mm@kvack.org
> in the future? Thanks.
>
> Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> should have similar effect of RISC-V’s SVNAPOT, right?
What is the real benefit over 4k + large folios/mTHP?
64K comes with the problem of internal fragmentation: for example, a
page table that only occupies 4k of memory suddenly consumes 64K; quite
a downside.
--
Cheers,
David / dhildenb
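The fragmentation cost raised above can be quantified with the series' own
numbers (a hedged back-of-the-envelope, assuming the SV39/64K layout from the
cover letter):

```python
PTE_BYTES = 8
HW_PAGES_PER_PAGE = 16    # one 64K software page = 16 hardware 4K PTEs
SW_PTES_PER_TABLE = 512   # 9 pte-index bits in the 64K SV39 layout

# A last-level table stores 512 software PTEs of 16 hardware entries each,
# so the table itself occupies a full 64K software page:
table_bytes = SW_PTES_PER_TABLE * HW_PAGES_PER_PAGE * PTE_BYTES
assert table_bytes == 64 * 1024   # vs. 4K for a table on a 4K-page kernel
```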
* Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-06 10:13 ` David Hildenbrand
@ 2024-12-06 13:42 ` Xu Lu
2024-12-06 18:48 ` Pedro Falcato
0 siblings, 1 reply; 30+ messages in thread
From: Xu Lu @ 2024-12-06 13:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, paul.walmsley, palmer, aou, ardb, anup, atishp, xieyongji,
lihangjing, punit.agrawal, linux-kernel, linux-riscv, Linux MM
Hi David,
On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.12.24 03:00, Zi Yan wrote:
> > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> >
> >> This patch series attempts to break through the limitation of MMU and
> >> supports larger base page on RISC-V, which only supports 4K page size
> >> now. The key idea is to always manage and allocate memory at a
> >> granularity of 64K and use SVNAPOT to accelerate address translation.
> >> This is the second version and the detailed introduction can be found
> >> in [1].
> >>
> >> Changes from v1:
> >> - Rebase on v6.12.
> >>
> >> - Adjust the page table entry shift to reduce page table memory usage.
> >> For example, in SV39, the traditional va behaves as:
> >>
> >> ----------------------------------------------
> >> | pgd index | pmd index | pte index | offset |
> >> ----------------------------------------------
> >> | 38 30 | 29 21 | 20 12 | 11 0 |
> >> ----------------------------------------------
> >>
> >> When we choose 64K as basic software page, va now behaves as:
> >>
> >> ----------------------------------------------
> >> | pgd index | pmd index | pte index | offset |
> >> ----------------------------------------------
> >> | 38 34 | 33 25 | 24 16 | 15 0 |
> >> ----------------------------------------------
> >>
> >> - Fix some bugs in v1.
> >>
> >> Thanks in advance for comments.
> >>
> >> [1] https://lwn.net/Articles/952722/
> >
> > This looks very interesting. Can you cc me and linux-mm@kvack.org
> > in the future? Thanks.
> >
> > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > should have similar effect of RISC-V’s SVNAPOT, right?
>
> What is the real benefit over 4k + large folios/mTHP?
>
> 64K comes with the problem of internal fragmentation: for example, a
> page table that only occupies 4k of memory suddenly consumes 64K; quite
> a downside.
The original idea comes from the performance benefits we achieved on
the ARM 64K kernel. We ran several real-world applications on the ARM
Ampere Altra platform and found their performance on the 64K page
kernel to be significantly higher than on the 4K page kernel:
For Redis, throughput increased by 250% and latency decreased by 70%.
For MySQL, throughput increased by 16.9% and latency decreased by 14.5%.
For our own NewSQL database, throughput increased by 16.5% and latency
decreased by 13.8%.
We also compared 64K against 4K + large folios/mTHP on ARM
Neoverse-N2. The results show a considerable performance improvement on
the 64K kernel for both SPEC CPU and lmbench, even when the 4K kernel
enables THP and ARM64_CONTPTE:
For the SPEC CPU benchmark, the 64K kernel without any huge page
optimization still achieves a 4.17% higher score than the 4K kernel
with transparent huge pages and the CONTPTE optimization.
For lmbench, the 64K kernel achieves 75.98% lower memory mapping
latency (16MB) than the 4K kernel with transparent huge pages and the
CONTPTE optimization, 84.34% higher mmap read open2close
bandwidth (16MB), and 10.71% lower random load latency (16MB).
Interestingly, kernels with transparent huge page support sometimes
perform worse for both 4K and 64K (for example, in the mmap read
bandwidth benchmark). We assume this is due to the overhead of
combining and collapsing huge pages.
Also, if you check the full results, you will find that the larger the
memory size used for testing, the better the 64K kernel performs
compared to the 4K kernel, unless the memory size lies in a range where
the 4K kernel can apply 2MB huge pages while the 64K kernel cannot.
In summary, for performance-sensitive applications that require higher
bandwidth and lower latency, 4K pages plus huge pages may not always be
the best choice, and 64K pages can achieve better results. The test
environment and results are attached.
As RISC-V has no native 64K MMU support, we introduce a software
implementation and accelerate it via Svnapot. Of course, there is some
extra overhead compared with a native 64K MMU, so we are also trying to
persuade the RISC-V community to support a native 64K MMU
extension [1]. Please join us if you are interested.
[1] https://lists.riscv.org/g/tech-privileged/topic/query_about_risc_v_s_support/108641509
Best Regards,
Xu Lu
>
> --
> Cheers,
>
> David / dhildenb
>
[-- Attachment #2: ARM 64K Result.xlsx --]
[-- Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, Size: 54752 bytes --]
* Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-06 13:42 ` [External] " Xu Lu
@ 2024-12-06 18:48 ` Pedro Falcato
2024-12-07 8:03 ` Xu Lu
0 siblings, 1 reply; 30+ messages in thread
From: Pedro Falcato @ 2024-12-06 18:48 UTC (permalink / raw)
To: Xu Lu
Cc: David Hildenbrand, Zi Yan, paul.walmsley, palmer, aou, ardb, anup,
atishp, xieyongji, lihangjing, punit.agrawal, linux-kernel,
linux-riscv, Linux MM
On Fri, Dec 6, 2024 at 1:42 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> Hi David,
>
> On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 06.12.24 03:00, Zi Yan wrote:
> > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > >
> > >> This patch series attempts to break through the limitation of MMU and
> > >> supports larger base page on RISC-V, which only supports 4K page size
> > >> now. The key idea is to always manage and allocate memory at a
> > >> granularity of 64K and use SVNAPOT to accelerate address translation.
> > >> This is the second version and the detailed introduction can be found
> > >> in [1].
> > >>
> > >> Changes from v1:
> > >> - Rebase on v6.12.
> > >>
> > >> - Adjust the page table entry shift to reduce page table memory usage.
> > >> For example, in SV39, the traditional va behaves as:
> > >>
> > >> ----------------------------------------------
> > >> | pgd index | pmd index | pte index | offset |
> > >> ----------------------------------------------
> > >> | 38 30 | 29 21 | 20 12 | 11 0 |
> > >> ----------------------------------------------
> > >>
> > >> When we choose 64K as basic software page, va now behaves as:
> > >>
> > >> ----------------------------------------------
> > >> | pgd index | pmd index | pte index | offset |
> > >> ----------------------------------------------
> > >> | 38 34 | 33 25 | 24 16 | 15 0 |
> > >> ----------------------------------------------
> > >>
> > >> - Fix some bugs in v1.
> > >>
> > >> Thanks in advance for comments.
> > >>
> > >> [1] https://lwn.net/Articles/952722/
> > >
> > > This looks very interesting. Can you cc me and linux-mm@kvack.org
> > > in the future? Thanks.
> > >
> > > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > > should have similar effect of RISC-V’s SVNAPOT, right?
> >
> > What is the real benefit over 4k + large folios/mTHP?
> >
> > 64K comes with the problem of internal fragmentation: for example, a
> > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > a downside.
>
> The original idea comes from the performance benefits we achieved on
> the ARM 64K kernel. We run several real world applications on the ARM
> Ampere Altra platform and found these apps' performance based on the
> 64K page kernel is significantly higher than that on the 4K page
> kernel:
> For Redis, the throughput has increased by 250% and latency has
> decreased by 70%.
> For Mysql, the throughput has increased by 16.9% and latency has
> decreased by 14.5%.
> For our own newsql database, throughput has increased by 16.5% and
> latency has decreased by 13.8%.
>
> Also, we have compared the performance between 64K and 4k + large
> folios/mTHP on ARM Neoverse-N2. The result shows considerable
> performance improvement on 64K kernel for both speccpu and lmbench,
> even when 4K kernel enables THP and ARM64_CONTPTE:
> For speccpu benchmark, 64K kernel without any huge pages optimization
> can still achieve 4.17% higher score than 4K kernel with transparent
> huge pages as well as CONTPTE optimization.
> For lmbench, 64K kernel achieves 75.98% lower memory mapping
> latency(16MB) than 4K kernel with transparent huge pages and CONTPTE
> optimization, 84.34% higher map read open2close bandwidth(16MB), and
> 10.71% lower random load latency(16MB).
> Interestingly, sometimes kernel with transparent pages support have
> poorer performance for both 4K and 64K (for example, mmap read
> bandwidth bench). We assume this is due to the overhead of huge pages'
> combination and collapse.
> Also, if you check the full result, you will find that usually the
> larger the memory size used for testing is, the better the performance
> of 64k kernel is (compared to 4K kernel). Unless the memory size lies
> in a range where 4K kernel can apply 2MB huge pages while 64K kernel
> can't.
> In summary, for performance sensitive applications which require
> higher bandwidth and lower latency, sometimes 4K pages with huge pages
> may not be the best choice and 64k page can achieve better results.
> The test environment and result is attached.
>
> As RISC-V has no native 64K MMU support, we introduce a software
> implementation and accelerate it via Svnapot. Of course, there will be
> some extra overhead compared with a native 64K MMU. Thus, we are also
> trying to persuade the RISC-V community to support a native 64K MMU
> extension [1]. Please join us if you are interested.
>
Ok, so you... didn't test this on riscv? And you're basing this
patchset off of a native 64KiB page size kernel being faster than 4KiB
+ CONTPTE? I don't see how that makes sense?
/me is confused
How many of these PAGE_SIZE wins are related to e.g. userspace basing
its buffer sizes (or whatever) off of the system page size? Where
exactly are you gaining time versus the CONTPTE stuff?
I think MM in general would be better off if we were more transparent
with regard to CONTPTE and page sizes instead of hand waving with
"hardware page size != software page size", which is such a *checks
notes* 4.4BSD idea... :) At the very least, this patchset seems to go
against all the work on better supporting large folios and CONTPTE.
--
Pedro
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-06 18:48 ` Pedro Falcato
@ 2024-12-07 8:03 ` Xu Lu
2024-12-07 22:02 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Xu Lu @ 2024-12-07 8:03 UTC (permalink / raw)
To: Pedro Falcato
Cc: David Hildenbrand, Zi Yan, paul.walmsley, palmer, aou, ardb, anup,
atishp, xieyongji, lihangjing, punit.agrawal, linux-kernel,
linux-riscv, Linux MM
Hi Pedro,
On Sat, Dec 7, 2024 at 2:49 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Dec 6, 2024 at 1:42 PM Xu Lu <luxu.kernel@bytedance.com> wrote:
> >
> > Hi David,
> >
> > On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 06.12.24 03:00, Zi Yan wrote:
> > > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > > >
> > > >> This patch series attempts to break through the limitation of the MMU
> > > >> and support a larger base page on RISC-V, which currently only supports
> > > >> a 4K page size. The key idea is to always manage and allocate memory at
> > > >> a granularity of 64K and use Svnapot to accelerate address translation.
> > > >> This is the second version; a detailed introduction can be found in [1].
> > > >>
> > > >> Changes from v1:
> > > >> - Rebase on v6.12.
> > > >>
> > > >> - Adjust the page table entry shift to reduce page table memory usage.
> > > >> For example, in SV39, the traditional va behaves as:
> > > >>
> > > >> ----------------------------------------------
> > > >> | pgd index | pmd index | pte index | offset |
> > > >> ----------------------------------------------
> > > >> | 38 30 | 29 21 | 20 12 | 11 0 |
> > > >> ----------------------------------------------
> > > >>
> > > >> When we choose 64K as basic software page, va now behaves as:
> > > >>
> > > >> ----------------------------------------------
> > > >> | pgd index | pmd index | pte index | offset |
> > > >> ----------------------------------------------
> > > >> | 38 34 | 33 25 | 24 16 | 15 0 |
> > > >> ----------------------------------------------
> > > >>
> > > >> - Fix some bugs in v1.
> > > >>
> > > >> Thanks in advance for comments.
> > > >>
> > > >> [1] https://lwn.net/Articles/952722/
> > > >
> > > > This looks very interesting. Can you cc me and linux-mm@kvack.org
> > > > in the future? Thanks.
> > > >
> > > > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > > > should have similar effect of RISC-V’s SVNAPOT, right?
> > >
> > > What is the real benefit over 4k + large folios/mTHP?
> > >
> > > 64K comes with the problem of internal fragmentation: for example, a
> > > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > > a downside.
> >
> > [...]
> >
>
> Ok, so you... didn't test this on riscv? And you're basing this
> patchset off of a native 64KiB page size kernel being faster than 4KiB
> + CONTPTE? I don't see how that makes sense?
Sorry for the confusion. I didn't intend to use the ARM data to justify
this patch, only to explain where the idea came from. We do prefer a
native 64K MMU for the performance improvement it brings to real
applications and benchmarks. Since RISC-V does not support one yet, we
use this patch internally as a transitional solution, and it can be
dropped once a native 64K MMU becomes available. The only remaining use
I can think of for it then is to let the kernel support more page sizes
than the MMU does, as long as Svnapot supports the corresponding size.
We will try to release the performance data in the next version. There
are still issues with application and OS adaptation :) So this version
remains an RFC.
>
> /me is confused
>
> How many of these PAGE_SIZE wins are related to e.g userspace basing
> its buffer sizes (or whatever) off of the system page size? Where
> exactly are you gaining time versus the CONTPTE stuff?
> I think MM in general would be better off if we were more transparent
> with regard to CONTPTE and page sizes instead of hand waving with
> "hardware page size != software page size", which is such a *checks
> notes* 4.4BSD idea... :) At the very least, this patchset seems to go
> against all the work on better supporting large folios and CONTPTE.
By the way, the core modification of this patch is turning the pte
structure into an array of 16 hardware entries to map a 64K page, and
accelerating that mapping via Svnapot. I think this is purely about the
architectural pte and has little impact on pages or folios. Please
point out anything I have missed and I will try to fix it.
Thanks,
Xu Lu
* Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-07 8:03 ` Xu Lu
@ 2024-12-07 22:02 ` Yu Zhao
2024-12-09 3:36 ` Xu Lu
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-12-07 22:02 UTC (permalink / raw)
To: Xu Lu
Cc: Pedro Falcato, David Hildenbrand, Zi Yan, paul.walmsley, palmer,
aou, ardb, anup, atishp, xieyongji, lihangjing, punit.agrawal,
linux-kernel, linux-riscv, Linux MM
On Sat, Dec 7, 2024 at 1:03 AM Xu Lu <luxu.kernel@bytedance.com> wrote:
>
> Hi Pedro,
>
> On Sat, Dec 7, 2024 at 2:49 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > > [...]
> >
> > Ok, so you... didn't test this on riscv? And you're basing this
> > patchset off of a native 64KiB page size kernel being faster than 4KiB
> > + CONTPTE? I don't see how that makes sense?
>
> Sorry for the misleading. I didn't intend to use ARM data to support
> this patch, just to explain the idea source. We do prefer 64K MMU for
> the performance improvement it brought to real applications and
> benchmarks.
This breaks ABI, doesn't it? Not only does userspace need to be
recompiled with 64KB alignment, it also must not assume a 4KB base
page size.
> And since RISC-V does not support it yet, we internally
> use this patch as a transitional solution for RISC-V.
Distros need to support this as well; otherwise it's a tech island.
Also, why RISC-V specifically? This could be a generic feature that
applies to other arches like x86, right? See "page clustering" [1][2].
[1] https://lwn.net/Articles/23785/
[2] https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/
> And if native
> 64k MMU is available, this patch can be canceled.
Why 64KB? Why not 32KB or 128KB? In general, the less dependency on
h/w, the better. Ideally, *if* we want to consider this, it should be
a s/w feature applicable to all (or most) arches.
* Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page
2024-12-07 22:02 ` Yu Zhao
@ 2024-12-09 3:36 ` Xu Lu
0 siblings, 0 replies; 30+ messages in thread
From: Xu Lu @ 2024-12-09 3:36 UTC (permalink / raw)
To: Yu Zhao
Cc: Pedro Falcato, David Hildenbrand, Zi Yan, paul.walmsley, palmer,
aou, ardb, anup, atishp, xieyongji, lihangjing, punit.agrawal,
linux-kernel, linux-riscv, Linux MM
Hi Yu Zhao,
On Sun, Dec 8, 2024 at 6:03 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Sat, Dec 7, 2024 at 1:03 AM Xu Lu <luxu.kernel@bytedance.com> wrote:
> >
> > [...]
> > Sorry for the misleading. I didn't intend to use ARM data to support
> > this patch, just to explain the idea source. We do prefer 64K MMU for
> > the performance improvement it brought to real applications and
> > benchmarks.
>
> This breaks ABI, doesn't it? Not only userspace needs to be recompiled
> with 64KB alignment, it also needs not to assume 4KB base page size.
Yes, it does.
>
> > And since RISC-V does not support it yet, we internally
> > use this patch as a transitional solution for RISC-V.
>
> Distros need to support this as well. Otherwise it's a tech island.
> Also why RV? It can be a generic feature which can apply to other
> archs like x86, right? See "page clustering" [1][2].
>
> [1] https://lwn.net/Articles/23785/
> [2] https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/
>
> > And if native
> > 64k MMU is available, this patch can be canceled.
>
> Why 64KB? Why not 32KB or 128KB? In general, the less dependency on
> h/w, the better. Ideally, *if* we want to consider this, it should be
> a s/w feature applicable to all (or most of) archs.
We chose RISC-V because of internal business needs, and chose 64K
because of the benefits we achieved on the ARM 64K kernel.
Applying such a feature to all architectures is a pretty ambitious
goal. We would be glad to do so, and to ask for more help, if everyone
agrees it is the better approach. For now, though, perhaps it is better
to try it on RISC-V first? After all, not all architectures support
features like Svnapot or CONTPTE. Of course, even on architectures
without Svnapot, a bigger page size can still reduce metadata memory
overhead and the number of page faults.
We are pleased to see that similar things have been considered before.
We have great respect for the work of William Lee Irwin and Hugh
Dickins and hope it can continue. We will cc them on future emails.
Best Regards,
Xu Lu