* [PATCH v3 00/20] xen/riscv: introduce p2m functionality
@ 2025-07-31 15:57 Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
` (19 more replies)
0 siblings, 20 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini,
Bertrand Marquis, Volodymyr Babchuk
This patch series introduces the functions necessary to build and manage
RISC-V guest page tables and MMIO/RAM mappings.
---
Changes in V3:
- Introduce metadata table to store P2M types.
- Use x86's way to allocate VMID.
- Abstract Arm-specific p2m type name for device MMIO mappings.
- For all other updates, please see the specific patches.
---
Changes in V2:
- Merged to staging:
- [PATCH v1 1/6] xen/riscv: add inclusion of xen/bitops.h to asm/cmpxchg.h
- New patches:
- xen/riscv: implement sbi_remote_hfence_gvma{_vmid}().
- Split patch "xen/riscv: implement p2m mapping functionality" into smaller
patches:
- xen/riscv: introduce page_set_xenheap_gfn()
- xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
- xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
- xen/riscv: Implement p2m_free_entry() and related helpers
- xen/riscv: Implement superpage splitting for p2m mappings
- xen/riscv: implement p2m_next_level()
- xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
- Move root p2m table allocation to separate patch:
xen/riscv: add root page table allocation
- Drop the dependency of this patch series on the patch introducing SvPBMT,
as it was merged.
- Patch "[PATCH v1 4/6] xen/riscv: define pt_t and pt_walk_t structures" was
renamed to xen/riscv: introduce pte_{set,get}_mfn() as, after dropping the
bitfields for the PTE structure, this patch introduces only pte_{set,get}_mfn().
- Rename "xen/riscv: define pt_t and pt_walk_t structures" to
"xen/riscv: introduce pte_{set,get}_mfn()" as pt_t and pt_walk_t were
dropped.
- Introduce guest domain's VMID allocation and management.
- Add patches necessary to implement p2m lookup:
- xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
- xen/riscv: add support of page lookup by GFN
- Re-sort patch series.
- All other changes are patch-specific. Please check them.
---
Oleksii Kurochko (20):
xen/riscv: implement sbi_remote_hfence_gvma()
xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
xen/riscv: introduce VMID allocation and manegement
xen/riscv: introduce things necessary for p2m initialization
xen/riscv: construct the P2M pages pool for guests
xen/riscv: add root page table allocation
xen/riscv: introduce pte_{set,get}_mfn()
xen/riscv: add new p2m types and helper macros for type classification
xen/dom0less: abstract Arm-specific p2m type name for device MMIO
mappings
xen/riscv: introduce page_{get,set}_xenheap_gfn()
xen/riscv: implement function to map memory in guest p2m
xen/riscv: implement p2m_set_range()
xen/riscv: Implement p2m_free_subtree() and related helpers
xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
xen/riscv: implement p2m_next_level()
xen/riscv: Implement superpage splitting for p2m mappings
xen/riscv: implement put_page()
xen/riscv: implement mfn_valid() and page reference, ownership
handling helpers
xen/riscv: add support of page lookup by GFN
xen/riscv: introduce metadata table to store P2M type
xen/arch/arm/include/asm/p2m.h | 2 +
xen/arch/riscv/Makefile | 3 +
xen/arch/riscv/include/asm/Makefile | 1 -
xen/arch/riscv/include/asm/domain.h | 23 +
xen/arch/riscv/include/asm/flushtlb.h | 5 +
xen/arch/riscv/include/asm/mm.h | 72 +-
xen/arch/riscv/include/asm/p2m.h | 145 ++-
xen/arch/riscv/include/asm/page.h | 38 +
xen/arch/riscv/include/asm/paging.h | 19 +
xen/arch/riscv/include/asm/riscv_encoding.h | 6 +
xen/arch/riscv/include/asm/sbi.h | 32 +
xen/arch/riscv/include/asm/vmid.h | 8 +
xen/arch/riscv/mm.c | 73 +-
xen/arch/riscv/p2m.c | 1107 +++++++++++++++++++
xen/arch/riscv/paging.c | 112 ++
xen/arch/riscv/sbi.c | 14 +
xen/arch/riscv/setup.c | 3 +
xen/arch/riscv/vmid.c | 165 +++
xen/common/device-tree/dom0less-build.c | 2 +-
19 files changed, 1809 insertions(+), 21 deletions(-)
create mode 100644 xen/arch/riscv/include/asm/paging.h
create mode 100644 xen/arch/riscv/include/asm/vmid.h
create mode 100644 xen/arch/riscv/p2m.c
create mode 100644 xen/arch/riscv/paging.c
create mode 100644 xen/arch/riscv/vmid.c
--
2.50.1
* [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 13:52 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
` (18 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
covering the range of guest physical addresses between start_addr and
start_addr + size for all VMIDs.
The remote fence operation applies to the entire address space if either:
- start_addr and size are both 0, or
- size is equal to 2^XLEN-1.
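The two full-flush conditions above can be captured in a small predicate.
This is a minimal sketch, not code from the patch: the helper name
rfence_covers_all() is hypothetical, and it assumes that unsigned long is
XLEN bits wide, so that 2^XLEN-1 equals ULONG_MAX.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of Xen or of the SBI interface): returns
 * true when a remote fence request is treated as covering the entire
 * address space.
 *
 * Assumption: unsigned long is XLEN bits wide, so 2^XLEN-1 == ULONG_MAX.
 */
static bool rfence_covers_all(unsigned long start_addr, unsigned long size)
{
    return (start_addr == 0 && size == 0) || size == ULONG_MAX;
}
```

Any other (start_addr, size) pair is treated as an ordinary range request.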
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Update the comment message above declaration of sbi_remote_hfence_gvma()
and update the commit message in sync.
- Drop ASSERT() in sbi_remote_hfence_gvma().
---
xen/arch/riscv/include/asm/sbi.h | 19 +++++++++++++++++++
xen/arch/riscv/sbi.c | 7 +++++++
2 files changed, 26 insertions(+)
diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 527d773277..0277aab747 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -89,6 +89,25 @@ bool sbi_has_rfence(void);
int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
size_t size);
+/*
+ * Instructs the remote harts to execute one or more HFENCE.GVMA
+ * instructions, covering the range of guest physical addresses
+ * between start_addr and start_addr + size for all VMIDs.
+ *
+ * Returns 0 if IPI was sent to all the targeted harts successfully
+ * or negative value if start_addr or size is not valid.
+ *
+ * The remote fence operation applies to the entire address space if either:
+ * - start_addr and size are both 0, or
+ * - size is equal to 2^XLEN-1.
+ *
+ * @param cpu_mask a cpu mask containing all the target CPUs (in Xen space).
+ * @param start guest physical address start
+ * @param size guest physical address range size
+ */
+int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
+ size_t size);
+
/*
* Initialize SBI library
*
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 4209520389..1809f614c5 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -258,6 +258,13 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
cpu_mask, start, size, 0, 0);
}
+int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
+ size_t size)
+{
+ return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
+ cpu_mask, start, size, 0, 0);
+}
+
/* This function must always succeed. */
#define sbi_get_spec_version() \
sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
--
2.50.1
* [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 13:55 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
` (17 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
It instructs the remote harts to execute one or more HFENCE.GVMA instructions
by making an SBI call, covering the range of guest physical addresses between
start_addr and start_addr + size only for the given VMID.
The remote fence operation applies to the entire address space if either:
- start_addr and size are both 0, or
- size is equal to 2^XLEN-1.
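To illustrate what "only for the given VMID" means, here is a toy user-space
model of VMID-tagged guest TLB entries. It is a sketch under stated
assumptions, not how the hardware or the SBI firmware is implemented: the
struct and function names are hypothetical, and unsigned long is assumed to
be XLEN bits wide.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one cached guest-physical translation. */
struct tlb_entry_model {
    unsigned long vmid;
    unsigned long gpa;
    bool valid;
};

/* True if gpa falls in the fenced range, including the "everything" cases. */
static bool fence_covers(unsigned long start, unsigned long size,
                         unsigned long gpa)
{
    if ( (start == 0 && size == 0) || size == ULONG_MAX )
        return true;

    return gpa >= start && gpa - start < size;
}

/* Invalidate only the entries tagged with the given VMID. */
static void model_hfence_gvma_vmid(struct tlb_entry_model *tlb, size_t n,
                                   unsigned long start, unsigned long size,
                                   unsigned long vmid)
{
    for ( size_t i = 0; i < n; i++ )
        if ( tlb[i].vmid == vmid && fence_covers(start, size, tlb[i].gpa) )
            tlb[i].valid = false;
}
```

In this model, entries that belong to other VMIDs survive even a full-range
fence, which is the difference from the all-VMID variant introduced in the
previous patch.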
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Drop ASSERT() in sbi_remote_hfence_gvma_vmid() as failure will happen anyway if
rfence isn't initialized.
- Drop "This function call is only valid for harts implementing hypervisor
extension." from the commit message and the comment above the declaration
of sbi_remote_hfence_gvma_vmid().
- Use proper FID for sbi_remote_hfence_gvma_vmid().
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/sbi.h | 13 +++++++++++++
xen/arch/riscv/sbi.c | 7 +++++++
2 files changed, 20 insertions(+)
diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 0277aab747..10930dea93 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -108,6 +108,19 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
size_t size);
+/*
+ * Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
+ * covering the range of guest physical addresses between start_addr and
+ * start_addr + size only for the given VMID.
+ *
+ * @param cpu_mask a cpu mask containing all the target CPUs (in Xen space).
+ * @param start guest physical address start
+ * @param size guest physical address range size
+ * @param vmid virtual machine id
+ */
+int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
+ size_t size, unsigned long vmid);
+
/*
* Initialize SBI library
*
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 1809f614c5..425dce44c6 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -265,6 +265,13 @@ int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
cpu_mask, start, size, 0, 0);
}
+int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
+ size_t size, unsigned long vmid)
+{
+ return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID,
+ cpu_mask, start, size, vmid, 0);
+}
+
/* This function must always succeed. */
#define sbi_get_spec_version() \
sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
--
2.50.1
* [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 15:19 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
` (16 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
The current implementation is based on x86's way of allocating VMIDs:
VMIDs partition the physical TLB. In the current implementation VMIDs are
introduced to reduce the number of TLB flushes. Each time the guest's
virtual address space changes, instead of flushing the TLB, a new VMID is
assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
The biggest advantage is that hot parts of the hypervisor's code and data
remain in the TLB.
VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
VMIDs are assigned in a round-robin scheme. To minimize the overhead of
VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
64-bit generation. Only on a generation overflow does the code need to
invalidate all VMID information stored in the VCPUs which are run on the
specific physical processor. This overflow appears after about 2^80
host processor cycles, so we do not optimize this case, but simply disable
VMID usage to retain correctness.
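The round-robin, generation-tagged scheme described above can be modelled in
plain user-space C. This is a simplified sketch rather than the patch's code:
the type and function names are hypothetical, the atomics, the per-CPU
storage and the "disabled" path on generation overflow are omitted, and
max_vmid is assumed to be smaller than UINT16_MAX so that next_vmid cannot
wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state: the VMID and the generation it belongs to. */
struct vcpu_vmid_model {
    uint64_t generation;   /* 0 means "no valid VMID" */
    uint16_t vmid;
};

/* Hypothetical per-hart allocator state. */
struct hart_vmid_model {
    uint64_t generation;
    uint16_t next_vmid;
    uint16_t max_vmid;
};

static void hart_vmid_model_init(struct hart_vmid_model *h, uint16_t max_vmid)
{
    /* Assumption of this sketch: at least one VMID and no uint16_t wrap. */
    assert(max_vmid >= 1 && max_vmid < UINT16_MAX);

    h->generation = 1;     /* generation 0 is reserved for "invalid" */
    h->next_vmid = 1;      /* VMID 0 is reserved for "VMIDs disabled" */
    h->max_vmid = max_vmid;
}

/* Returns true when the caller has to flush the guest TLB of this hart. */
static bool vmid_model_vmenter(struct hart_vmid_model *h,
                               struct vcpu_vmid_model *v)
{
    /* The vCPU already owns a VMID from the current generation. */
    if ( v->generation == h->generation )
        return false;

    /* All VMIDs handed out: start a new generation (overflow not modelled). */
    if ( h->next_vmid > h->max_vmid )
    {
        h->generation++;
        h->next_vmid = 1;
    }

    v->vmid = h->next_vmid++;
    v->generation = h->generation;

    /* VMID 1 marks the start of a generation: old entries are stale. */
    return v->vmid == 1;
}
```

With max_vmid set to 2, a third vCPU entering the hart starts generation 2
and requests a flush, and the first vCPU then picks up a fresh VMID without
requesting another flush.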
Only minor changes are made compared to the x86 implementation.
These include using RISC-V-specific terminology, adding a check to ensure
the type used for storing the VMID has enough bits to hold VMIDLEN,
and introducing a new function vmidlen_detect() to clarify the VMIDLEN
value.
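As a rough illustration of the VMIDLEN probe: after all ones are written to
the VMID field of hgatp, the number of VMID bits is the position of the
highest bit that reads back as set. The sketch below assumes the read-back
field has already been extracted and shifted down to bit 0; the function name
is hypothetical, and the loop stands in for the find-last-set helper.

```c
/*
 * Hypothetical helper: given the VMID field read back from hgatp after
 * writing all ones (already shifted down to bit 0), return the number of
 * implemented VMID bits, i.e. the 1-based index of the highest set bit.
 */
static unsigned int vmidlen_from_readback(unsigned long vmid_field)
{
    unsigned int bits = 0;

    while ( vmid_field != 0 )
    {
        bits++;
        vmid_field >>= 1;
    }

    return bits;
}
```

A read-back value of 0 means that no VMID bits are implemented.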
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Reimplement VMID allocation similar to what x86 has implemented.
---
Changes in V2:
- New patch.
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/domain.h | 6 +
xen/arch/riscv/include/asm/flushtlb.h | 5 +
xen/arch/riscv/include/asm/vmid.h | 8 ++
xen/arch/riscv/setup.c | 3 +
xen/arch/riscv/vmid.c | 165 ++++++++++++++++++++++++++
6 files changed, 188 insertions(+)
create mode 100644 xen/arch/riscv/include/asm/vmid.h
create mode 100644 xen/arch/riscv/vmid.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index e2b8aa42c8..745a85e116 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -16,6 +16,7 @@ obj-y += smpboot.o
obj-y += stubs.o
obj-y += time.o
obj-y += traps.o
+obj-y += vmid.o
obj-y += vm_event.o
$(TARGET): $(TARGET)-syms
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index c3d965a559..aac1040658 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -5,6 +5,11 @@
#include <xen/xmalloc.h>
#include <public/hvm/params.h>
+struct vcpu_vmid {
+ uint64_t generation;
+ uint16_t vmid;
+};
+
struct hvm_domain
{
uint64_t params[HVM_NR_PARAMS];
@@ -14,6 +19,7 @@ struct arch_vcpu_io {
};
struct arch_vcpu {
+ struct vcpu_vmid vmid;
};
struct arch_domain {
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index 51c8f753c5..f391ae6eb7 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -7,6 +7,11 @@
#include <asm/sbi.h>
+static inline void local_hfence_gvma_all(void)
+{
+ asm volatile ( "hfence.gvma zero, zero" ::: "memory" );
+}
+
/* Flush TLB of local processor for address va. */
static inline void flush_tlb_one_local(vaddr_t va)
{
diff --git a/xen/arch/riscv/include/asm/vmid.h b/xen/arch/riscv/include/asm/vmid.h
new file mode 100644
index 0000000000..2f1f7ec9a2
--- /dev/null
+++ b/xen/arch/riscv/include/asm/vmid.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef ASM_RISCV_VMID_H
+#define ASM_RISCV_VMID_H
+
+void vmid_init(void);
+
+#endif /* ASM_RISCV_VMID_H */
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 483cdd7e17..549228d73f 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -25,6 +25,7 @@
#include <asm/sbi.h>
#include <asm/setup.h>
#include <asm/traps.h>
+#include <asm/vmid.h>
/* Xen stack for bringing up the first CPU. */
unsigned char __initdata cpu0_boot_stack[STACK_SIZE]
@@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
console_init_postirq();
+ vmid_init();
+
printk("All set up\n");
machine_halt();
diff --git a/xen/arch/riscv/vmid.c b/xen/arch/riscv/vmid.c
new file mode 100644
index 0000000000..7ad1b91ee2
--- /dev/null
+++ b/xen/arch/riscv/vmid.c
@@ -0,0 +1,165 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <xen/domain.h>
+#include <xen/init.h>
+#include <xen/sections.h>
+#include <xen/lib.h>
+#include <xen/param.h>
+#include <xen/percpu.h>
+
+#include <asm/atomic.h>
+#include <asm/csr.h>
+#include <asm/flushtlb.h>
+
+/* Xen command-line option to enable VMIDs */
+static bool __read_mostly opt_vmid_enabled = true;
+boolean_param("vmid", opt_vmid_enabled);
+
+/*
+ * VMIDs partition the physical TLB. In the current implementation VMIDs are
+ * introduced to reduce the number of TLB flushes. Each time the guest's
+ * virtual address space changes, instead of flushing the TLB, a new VMID is
+ * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
+ * The biggest advantage is that hot parts of the hypervisor's code and data
+ * remain in the TLB.
+ *
+ * Sketch of the Implementation:
+ *
+ * VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
+ * VMIDs are assigned in a round-robin scheme. To minimize the overhead of
+ * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
+ * 64-bit generation. Only on a generation overflow does the code need to
+ * invalidate all VMID information stored in the VCPUs which are run on the
+ * specific physical processor. This overflow appears after about 2^80
+ * host processor cycles, so we do not optimize this case, but simply disable
+ * VMID usage to retain correctness.
+ */
+
+/* Per-Hart VMID management. */
+struct vmid_data {
+ uint64_t hart_vmid_generation;
+ uint16_t next_vmid;
+ uint16_t max_vmid;
+ bool disabled;
+};
+
+static DEFINE_PER_CPU(struct vmid_data, vmid_data);
+
+static unsigned long vmidlen_detect(void)
+{
+ unsigned long vmid_bits;
+ unsigned long old;
+
+ /* Figure-out number of VMID bits in HW */
+ old = csr_read(CSR_HGATP);
+
+ csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
+ vmid_bits = csr_read(CSR_HGATP);
+ vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
+ vmid_bits = flsl(vmid_bits);
+ csr_write(CSR_HGATP, old);
+
+ /*
+ * We polluted local TLB so flush all guest TLB as
+ * a speculative access can happen at any time.
+ */
+ local_hfence_gvma_all();
+
+ return vmid_bits;
+}
+
+void vmid_init(void)
+{
+ static bool g_disabled = false;
+
+ unsigned long vmid_len = vmidlen_detect();
+ struct vmid_data *data = &this_cpu(vmid_data);
+ unsigned long max_available_bits = sizeof(data->max_vmid) << 3;
+
+ if ( vmid_len > max_available_bits )
+ panic("%s: VMIDLEN is bigger than the type which represents VMID: %lu(%lu)\n",
+ __func__, vmid_len, max_available_bits);
+
+ data->max_vmid = BIT(vmid_len, U) - 1;
+ data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
+
+ if ( g_disabled != data->disabled )
+ {
+ printk("%s: VMIDs %sabled.\n", __func__,
+ data->disabled ? "dis" : "en");
+ if ( !g_disabled )
+ g_disabled = data->disabled;
+ }
+
+ /* Zero indicates 'invalid generation', so we start the count at one. */
+ data->hart_vmid_generation = 1;
+
+ /* Zero indicates 'VMIDs disabled', so we start the count at one. */
+ data->next_vmid = 1;
+}
+
+void vcpu_vmid_flush_vcpu(struct vcpu *v)
+{
+ write_atomic(&v->arch.vmid.generation, 0);
+}
+
+void vmid_flush_hart(void)
+{
+ struct vmid_data *data = &this_cpu(vmid_data);
+
+ if ( data->disabled )
+ return;
+
+ if ( likely(++data->hart_vmid_generation != 0) )
+ return;
+
+ /*
+ * VMID generations are 64 bit. Overflow of generations never happens.
+ * For safety, we simply disable VMIDs, so correctness is established; it
+ * only runs a bit slower.
+ */
+ printk("%s: VMID generation overrun. Disabling VMIDs.\n", __func__);
+ data->disabled = true;
+}
+
+bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
+{
+ struct vmid_data *data = &this_cpu(vmid_data);
+
+ /* Test if VCPU has valid VMID. */
+ if ( read_atomic(&vmid->generation) == data->hart_vmid_generation )
+ return 0;
+
+ /* If there are no free VMIDs, need to go to a new generation. */
+ if ( unlikely(data->next_vmid > data->max_vmid) )
+ {
+ vmid_flush_hart();
+ data->next_vmid = 1;
+ if ( data->disabled )
+ goto disabled;
+ }
+
+ /* Now guaranteed to be a free VMID. */
+ vmid->vmid = data->next_vmid++;
+ write_atomic(&vmid->generation, data->hart_vmid_generation);
+
+ /*
+ * When we assign VMID 1, flush all TLB entries as we are starting a new
+ * generation, and all old VMID allocations are now stale.
+ */
+ return (vmid->vmid == 1);
+
+ disabled:
+ vmid->vmid = 0;
+ return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
--
2.50.1
* [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (2 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 15:53 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
` (15 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce the following things:
- Update p2m_domain structure, which describes per p2m-table state, with:
- lock to protect updates to p2m.
- pool with pages used to construct p2m.
- clean_pte which indicates whether it is required to clean the cache when
writing an entry.
- back pointer to domain structure.
- p2m_init() to initialize members introduced in p2m_domain structure.
- Call of paging_domain_init() in p2m_init() to initialize paging spinlock
and freelist head.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- s/p2m_type/p2m_types.
- Drop init. of p2m->clean_pte in p2m_init() as CONFIG_HAS_PASSTHROUGH is
going to be selected unconditionally. Plus CONFIG_HAS_PASSTHROUGH isn't
ready to be used for RISC-V.
Add compilation error to not forget to init p2m->clean_pte.
- Move definition of p2m->domain up in p2m_init().
- Add iommu_use_hap_pt() when p2m->clean_pte is initialized.
- Add the comment above p2m_types member of p2m_domain struct.
- Add need_flush member to p2m_domain structure.
- Move introduction of p2m_write_(un)lock() and p2m_tlb_flush_sync()
to the patch where they are really used:
xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFN
- Add p2m member to arch_domain structure.
- Drop p2m_types from struct p2m_domain as P2M type for PTE will be stored
differently.
- Drop default_access as it isn't going to be used for now.
- Move definition of p2m_is_write_locked() to "implement function to map memory
in guest p2m" where it is really used.
---
Changes in V2:
- Use the earlier introduced sbi_remote_hfence_gvma_vmid() for proper implementation
of p2m_force_tlb_flush_sync() as TLB flushing needs to happen for each pCPU
which potentially has cached a mapping, which is tracked by d->dirty_cpumask.
- Drop unnecessary blanks.
- Fix code style for # of pre-processor directive.
- Drop max_mapped_gfn and lowest_mapped_gfn as they aren't used now.
- [p2m_init()] Set p2m->clean_pte=false if CONFIG_HAS_PASSTHROUGH=n.
- [p2m_init()] Update the comment above p2m->domain = d;
- Drop p2m->need_flush as it seems to be always true for RISC-V and as a
consequence drop p2m_tlb_flush_sync().
- Move to separate patch an introduction of root page table allocation.
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/domain.h | 5 +++++
xen/arch/riscv/include/asm/p2m.h | 34 +++++++++++++++++++++++++++++
xen/arch/riscv/p2m.c | 32 +++++++++++++++++++++++++++
4 files changed, 72 insertions(+)
create mode 100644 xen/arch/riscv/p2m.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 745a85e116..e2499210c8 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -7,6 +7,7 @@ obj-y += intc.o
obj-y += irq.o
obj-y += mm.o
obj-y += pt.o
+obj-y += p2m.o
obj-$(CONFIG_RISCV_64) += riscv64/
obj-y += sbi.o
obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index aac1040658..e688980efa 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -5,6 +5,8 @@
#include <xen/xmalloc.h>
#include <public/hvm/params.h>
+#include <asm/p2m.h>
+
struct vcpu_vmid {
uint64_t generation;
uint16_t vmid;
@@ -24,6 +26,9 @@ struct arch_vcpu {
struct arch_domain {
struct hvm_domain hvm;
+
+ /* Virtual MMU */
+ struct p2m_domain p2m;
};
#include <xen/sched.h>
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 28f57a74f2..f8051ed893 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -3,11 +3,45 @@
#define ASM__RISCV__P2M_H
#include <xen/errno.h>
+#include <xen/mm.h>
+#include <xen/rwlock.h>
+#include <xen/types.h>
#include <asm/page-bits.h>
#define paddr_bits PADDR_BITS
+/* Get host p2m table */
+#define p2m_get_hostp2m(d) (&(d)->arch.p2m)
+
+/* Per-p2m-table state */
+struct p2m_domain {
+ /*
+ * Lock that protects updates to the p2m.
+ */
+ rwlock_t lock;
+
+ /* Pages used to construct the p2m */
+ struct page_list_head pages;
+
+ /* Indicate if it is required to clean the cache when writing an entry */
+ bool clean_pte;
+
+ /* Back pointer to domain */
+ struct domain *domain;
+
+ /*
+ * P2M updates may require TLBs to be flushed (invalidated).
+ *
+ * Flushes may be deferred by setting 'need_flush' and then flushing
+ * when the p2m write lock is released.
+ *
+ * If an immediate flush is required (e.g, if a super page is
+ * shattered), call p2m_tlb_flush_sync().
+ */
+ bool need_flush;
+};
+
/*
* List of possible type for each page in the p2m entry.
* The number of available bit per page in the pte for this purpose is 2 bits.
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
new file mode 100644
index 0000000000..ae937e9bdd
--- /dev/null
+++ b/xen/arch/riscv/p2m.c
@@ -0,0 +1,32 @@
+#include <xen/mm.h>
+#include <xen/rwlock.h>
+#include <xen/sched.h>
+
+int p2m_init(struct domain *d)
+{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+ /*
+ * "Trivial" initialisation is now complete. Set the backpointer so the
+ * users of p2m could get an access to domain structure.
+ */
+ p2m->domain = d;
+
+ rwlock_init(&p2m->lock);
+ INIT_PAGE_LIST_HEAD(&p2m->pages);
+
+ /*
+ * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+ * is not ready for RISC-V support.
+ *
+ * When CONFIG_HAS_PASSTHROUGH=y, p2m->clean_pte must be properly
+ * initialized.
+ * At the moment, it defaults to false because the p2m structure is
+ * zero-initialized.
+ */
+#ifdef CONFIG_HAS_PASSTHROUGH
+# error "Add init of p2m->clean_pte"
+#endif
+
+ return 0;
+}
--
2.50.1
* [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (3 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 15:58 ` Jan Beulich
2025-08-05 10:40 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 06/20] xen/riscv: add root page table allocation Oleksii Kurochko
` (14 subsequent siblings)
19 siblings, 2 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement p2m_set_allocation() to construct p2m pages pool for guests
based on required number of pages.
This is implemented by:
- Adding a `struct paging_domain` which contains a freelist, a
counter variable and a spinlock to `struct arch_domain` to
indicate the free p2m pages and the number of p2m total pages in
the p2m pages pool.
- Adding a helper `p2m_set_allocation` to set the p2m pages pool
size. This helper should be called before allocating memory for
a guest and is called from domain_p2m_set_allocation(), the latter
is a part of common dom0less code.
- Adding paging_freelist_init() to grow or shrink the freelist of
struct paging_domain to the requested number of pages.
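The pool resizing described above can be sketched as a user-space model. It
is not the patch's code: the names are hypothetical, malloc()/free() stand in
for the domheap allocator, pages are pushed onto the head of the list rather
than appended to the tail, and locking and preemption checks are left out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page sitting on the free list. */
struct pool_page {
    struct pool_page *next;
};

/* Hypothetical stand-in for the free list plus its page counter. */
struct page_pool {
    struct pool_page *freelist;
    unsigned long total_pages;
};

/* Grow or shrink the pool until it holds exactly 'pages' pages. */
static int pool_set_allocation(struct page_pool *pool, unsigned long pages)
{
    while ( pool->total_pages != pages )
    {
        struct pool_page *pg;

        if ( pool->total_pages < pages )
        {
            /* Need more memory: allocate a page and put it on the list. */
            pg = malloc(sizeof(*pg));
            if ( pg == NULL )
                return -ENOMEM;

            pg->next = pool->freelist;
            pool->freelist = pg;
            pool->total_pages++;
        }
        else
        {
            /* Too many pages: take one off the list and release it. */
            pg = pool->freelist;
            if ( pg == NULL )
                return -ENOMEM;

            pool->freelist = pg->next;
            pool->total_pages--;
            free(pg);
        }
    }

    return 0;
}
```

In this model every page is always on the free list, so shrinking cannot
fail; in the real pool, pages already handed out for page tables are not on
the list, which is why the empty-freelist case has to be handled.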
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v3:
- Drop usage of p2m_ prefix inside struct paging_domain.
- Introduce paging_domain_init() to init paging struct.
---
Changes in v2:
- Drop the comment above inclusion of <xen/event.h> in riscv/p2m.c.
- Use ACCESS_ONCE() for lhs and rhs for the expressions in
p2m_set_allocation().
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/Makefile | 1 -
xen/arch/riscv/include/asm/domain.h | 12 ++++++
xen/arch/riscv/include/asm/paging.h | 13 ++++++
xen/arch/riscv/p2m.c | 19 +++++++++
xen/arch/riscv/paging.c | 64 +++++++++++++++++++++++++++++
6 files changed, 109 insertions(+), 1 deletion(-)
create mode 100644 xen/arch/riscv/include/asm/paging.h
create mode 100644 xen/arch/riscv/paging.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index e2499210c8..6b912465b9 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -6,6 +6,7 @@ obj-y += imsic.o
obj-y += intc.o
obj-y += irq.o
obj-y += mm.o
+obj-y += paging.o
obj-y += pt.o
obj-y += p2m.o
obj-$(CONFIG_RISCV_64) += riscv64/
diff --git a/xen/arch/riscv/include/asm/Makefile b/xen/arch/riscv/include/asm/Makefile
index bfdf186c68..3824f31c39 100644
--- a/xen/arch/riscv/include/asm/Makefile
+++ b/xen/arch/riscv/include/asm/Makefile
@@ -6,7 +6,6 @@ generic-y += hardirq.h
generic-y += hypercall.h
generic-y += iocap.h
generic-y += irq-dt.h
-generic-y += paging.h
generic-y += percpu.h
generic-y += perfc_defn.h
generic-y += random.h
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index e688980efa..316e7c6c84 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -2,6 +2,8 @@
#ifndef ASM__RISCV__DOMAIN_H
#define ASM__RISCV__DOMAIN_H
+#include <xen/mm.h>
+#include <xen/spinlock.h>
#include <xen/xmalloc.h>
#include <public/hvm/params.h>
@@ -24,11 +26,21 @@ struct arch_vcpu {
struct vcpu_vmid vmid;
};
+struct paging_domain {
+ spinlock_t lock;
+ /* Free pages from the pre-allocated pool */
+ struct page_list_head freelist;
+ /* Number of pages from the pre-allocated pool */
+ unsigned long total_pages;
+};
+
struct arch_domain {
struct hvm_domain hvm;
/* Virtual MMU */
struct p2m_domain p2m;
+
+ struct paging_domain paging;
};
#include <xen/sched.h>
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
new file mode 100644
index 0000000000..8fdaeeb2e4
--- /dev/null
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -0,0 +1,13 @@
+#ifndef ASM_RISCV_PAGING_H
+#define ASM_RISCV_PAGING_H
+
+#include <asm-generic/paging.h>
+
+struct domain;
+
+int paging_domain_init(struct domain *d);
+
+int paging_freelist_init(struct domain *d, unsigned long pages,
+ bool *preempted);
+
+#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index ae937e9bdd..214b4861d2 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -2,6 +2,8 @@
#include <xen/rwlock.h>
#include <xen/sched.h>
+#include <asm/paging.h>
+
int p2m_init(struct domain *d)
{
struct p2m_domain *p2m = p2m_get_hostp2m(d);
@@ -12,6 +14,8 @@ int p2m_init(struct domain *d)
*/
p2m->domain = d;
+ paging_domain_init(d);
+
rwlock_init(&p2m->lock);
INIT_PAGE_LIST_HEAD(&p2m->pages);
@@ -30,3 +34,18 @@ int p2m_init(struct domain *d)
return 0;
}
+
+/*
+ * Set the pool of pages to the required number of pages.
+ * Returns 0 for success, non-zero for failure.
+ * Call with d->arch.paging.lock held.
+ */
+int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
+{
+ int rc;
+
+ if ( (rc = paging_freelist_init(d, pages, preempted)) )
+ return rc;
+
+ return 0;
+}
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
new file mode 100644
index 0000000000..8882be5ac9
--- /dev/null
+++ b/xen/arch/riscv/paging.c
@@ -0,0 +1,64 @@
+#include <xen/event.h>
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/sched.h>
+#include <xen/spinlock.h>
+
+int paging_freelist_init(struct domain *d, unsigned long pages,
+ bool *preempted)
+{
+ struct page_info *pg;
+
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ for ( ; ; )
+ {
+ if ( d->arch.paging.total_pages < pages )
+ {
+ /* Need to allocate more memory from domheap */
+ pg = alloc_domheap_page(d, MEMF_no_owner);
+ if ( pg == NULL )
+ {
+ printk(XENLOG_ERR "Failed to allocate pages.\n");
+ return -ENOMEM;
+ }
+ ACCESS_ONCE(d->arch.paging.total_pages)++;
+ page_list_add_tail(pg, &d->arch.paging.freelist);
+ }
+ else if ( d->arch.paging.total_pages > pages )
+ {
+ /* Need to return memory to domheap */
+ pg = page_list_remove_head(&d->arch.paging.freelist);
+ if ( pg )
+ {
+ ACCESS_ONCE(d->arch.paging.total_pages)--;
+ free_domheap_page(pg);
+ }
+ else
+ {
+ printk(XENLOG_ERR
+ "Failed to free pages, freelist is empty.\n");
+ return -ENOMEM;
+ }
+ }
+ else
+ break;
+
+ /* Check to see if we need to yield and try again */
+ if ( preempted && general_preempt_check() )
+ {
+ *preempted = true;
+ return -ERESTART;
+ }
+ }
+
+ return 0;
+}
+/* Domain paging struct initialization. */
+int paging_domain_init(struct domain *d)
+{
+ spin_lock_init(&d->arch.paging.lock);
+ INIT_PAGE_LIST_HEAD(&d->arch.paging.freelist);
+
+ return 0;
+}
--
2.50.1
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (4 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-05 10:37 ` Jan Beulich
2025-08-05 10:43 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 07/20] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
` (13 subsequent siblings)
19 siblings, 2 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce support for allocating and initializing the root page table
required for RISC-V stage-2 address translation.
To implement root page table allocation the following is introduced:
- clear_and_clean_page(), p2m_alloc_root_table() and p2m_allocate_root()
helpers to allocate and zero a 16 KiB root page table, as mandated
by the RISC-V privileged specification for Sv32x4/Sv39x4/Sv48x4/Sv57x4
modes.
- Update p2m_init() to initialize p2m_root_order.
- Add maddr_to_page() and page_to_maddr() macros for easier address
manipulation.
- Introduce paging_ret_pages_to_domheap() to return some pages to the
domheap before allocating the 16 KiB root page table.
- Allocate root p2m table after p2m pool is initialized.
- Add construct_hgatp() to construct the hgatp register value based on
p2m->root, p2m->hgatp_mode and VMID.
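For readers who want to see how the hgatp fields compose, here is a minimal stand-alone sketch for RV64. It is not the code from this patch: sketch_hgatp() is a hypothetical helper, PAGE_SHIFT of 12 is assumed, MASK_INSR() is re-defined locally in the same spirit as the Xen macro, and the Sv39x4 mode value of 8 is taken from the privileged spec rather than from the hunks below.

```c
#include <stdint.h>

/* RV64 hgatp field masks, as in the riscv_encoding.h hunk of this patch. */
#define HGATP64_MODE_MASK 0xF000000000000000ULL
#define HGATP64_VMID_MASK 0x03FFF00000000000ULL
#define HGATP64_PPN       0x00000FFFFFFFFFFFULL

/* Assumption: Sv39x4 is encoded as mode 8 in hgatp.MODE. */
#define HGATP_MODE_SV39X4 8ULL

/* Assumption: 4 KiB pages. */
#define PAGE_SHIFT 12

/* Local stand-in: scale v by the lowest set bit of mask m, then clip to m. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/*
 * Hypothetical helper: root_maddr is the machine address of the
 * 16 KiB-aligned root table, mode is an HGATP_MODE_* value.
 */
static uint64_t sketch_hgatp(uint64_t root_maddr, uint64_t mode, uint16_t vmid)
{
    uint64_t ppn = (root_maddr >> PAGE_SHIFT) & HGATP64_PPN;

    return ppn | MASK_INSR(mode, HGATP64_MODE_MASK) |
           MASK_INSR((uint64_t)vmid, HGATP64_VMID_MASK);
}
```

Since the VMID is passed in by the caller, the same root table can be combined with whatever VMID is current on the pCPU at the time hgatp is written.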
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v3:
- Drop inserting of p2m->vmid in hgatp_from_page() as now vmid is allocated
per-CPU, not per-domain, so it will be inserted later somewhere in
context_switch or before returning control to a guest.
- Use BIT() to init nr_pages in p2m_allocate_root() instead of open-coding
the BIT() macro.
- Fix order in clear_and_clean_page().
- s/panic("Specify more xen,domain-p2m-mem-mb\n")/return NULL.
- Take the lock around the procedure that returns the pages needed for the
p2m root table.
- Update the comment about allocation of page for root page table.
- Update an argument of hgatp_from_page() to "struct page_info *p2m_root_page"
to be consistent with the function name.
- Use p2m_get_hostp2m(d) instead of open-coding it.
- Update the comment above the call of p2m_alloc_root_table().
- Update the comments in p2m_allocate_root().
- Move part which returns some page to domheap before root page table allocation
to paging.c.
- Pass p2m_domain * instead of struct domain * for p2m_alloc_root_table().
- Introduce construct_hgatp() instead of hgatp_from_page().
- Add vmid and hgatp_mode member of struct p2m_domain.
- Add explanatory comment above clean_dcache_va_range() in
clear_and_clean_page().
- Introduce P2M_ROOT_ORDER and P2M_ROOT_PAGES.
- Drop vmid member from p2m_domain as now we are using per-pCPU
VMID allocation.
- Update the declaration of construct_hgatp() to receive VMID as it
isn't per-VM anymore.
- Drop hgatp member of p2m_domain struct as, with the new VMID allocation
scheme, hgatp will need to be constructed more often.
- Drop is_hardware_domain() case in p2m_allocate_root(), just always
allocate root using p2m pool pages.
- Refactor p2m_alloc_root_table() and p2m_alloc_table().
---
Changes in v2:
- This patch was created from "xen/riscv: introduce things necessary for p2m
initialization" with the following changes:
- [clear_and_clean_page()] Add missed call of clean_dcache_va_range().
- Drop p2m_get_clean_page() as it is going to be used only once to allocate
root page table. Open-code it explicitly in p2m_allocate_root(). Also,
it will help avoid duplication of the code connected to order and nr_pages
of p2m root page table.
- Instead of using order 2 for alloc_domheap_pages(), use
get_order_from_bytes(KB(16)).
- Clear and clean a proper amount of allocated pages in p2m_allocate_root().
- Drop _info from the function name hgatp_from_page_info() and its argument
page_info.
- Introduce HGATP_MODE_MASK and use MASK_INSR() instead of shift to calculate
value of hgatp.
- Drop unnecessary parentheses in definition of page_to_maddr().
- Add support of VMID.
- Drop TLB flushing in p2m_alloc_root_table() and do that once when VMID
is re-used. [Look at p2m_alloc_vmid()]
- Allocate p2m root table after p2m pool is fully initialized: first
return pages to p2m pool, then allocate p2m root table.
---
xen/arch/riscv/include/asm/mm.h | 4 +
xen/arch/riscv/include/asm/p2m.h | 12 +++
xen/arch/riscv/include/asm/paging.h | 2 +
xen/arch/riscv/include/asm/riscv_encoding.h | 6 ++
xen/arch/riscv/p2m.c | 90 +++++++++++++++++++++
xen/arch/riscv/paging.c | 30 +++++++
6 files changed, 144 insertions(+)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 9283616c02..dd8cdc9782 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -167,6 +167,10 @@ extern struct page_info *frametable_virt_start;
#define mfn_to_page(mfn) (frametable_virt_start + mfn_x(mfn))
#define page_to_mfn(pg) _mfn((pg) - frametable_virt_start)
+/* Convert between machine addresses and page-info structures. */
+#define maddr_to_page(ma) mfn_to_page(maddr_to_mfn(ma))
+#define page_to_maddr(pg) mfn_to_maddr(page_to_mfn(pg))
+
static inline void *page_to_virt(const struct page_info *pg)
{
return mfn_to_virt(mfn_x(page_to_mfn(pg)));
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index f8051ed893..3c37a708db 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -9,6 +9,10 @@
#include <asm/page-bits.h>
+extern unsigned int p2m_root_order;
+#define P2M_ROOT_ORDER p2m_root_order
+#define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
+
#define paddr_bits PADDR_BITS
/* Get host p2m table */
@@ -24,6 +28,12 @@ struct p2m_domain {
/* Pages used to construct the p2m */
struct page_list_head pages;
+ /* The root of the p2m tree. May be concatenated */
+ struct page_info *root;
+
+ /* G-stage (stage-2) address translation mode */
+ unsigned long hgatp_mode;
+
/* Indicate if it is required to clean the cache when writing an entry */
bool clean_pte;
@@ -127,6 +137,8 @@ static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx)
/* Not supported on RISCV. */
}
+unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid);
+
#endif /* ASM__RISCV__P2M_H */
/*
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index 8fdaeeb2e4..557fbd1abc 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -10,4 +10,6 @@ int paging_domain_init(struct domain *d);
int paging_freelist_init(struct domain *d, unsigned long pages,
bool *preempted);
+bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 6cc8f4eb45..8362df8784 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -133,11 +133,13 @@
#define HGATP_MODE_SV48X4 _UL(9)
#define HGATP32_MODE_SHIFT 31
+#define HGATP32_MODE_MASK _UL(0x80000000)
#define HGATP32_VMID_SHIFT 22
#define HGATP32_VMID_MASK _UL(0x1FC00000)
#define HGATP32_PPN _UL(0x003FFFFF)
#define HGATP64_MODE_SHIFT 60
+#define HGATP64_MODE_MASK _ULL(0xF000000000000000)
#define HGATP64_VMID_SHIFT 44
#define HGATP64_VMID_MASK _ULL(0x03FFF00000000000)
#define HGATP64_PPN _ULL(0x00000FFFFFFFFFFF)
@@ -170,6 +172,7 @@
#define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
#define HGATP_VMID_MASK HGATP64_VMID_MASK
#define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT
+#define HGATP_MODE_MASK HGATP64_MODE_MASK
#else
#define MSTATUS_SD MSTATUS32_SD
#define SSTATUS_SD SSTATUS32_SD
@@ -181,8 +184,11 @@
#define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT
#define HGATP_VMID_MASK HGATP32_VMID_MASK
#define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT
+#define HGATP_MODE_MASK HGATP32_MODE_MASK
#endif
+#define GUEST_ROOT_PAGE_TABLE_SIZE KB(16)
+
#define TOPI_IID_SHIFT 16
#define TOPI_IID_MASK 0xfff
#define TOPI_IPRIO_MASK 0xff
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 214b4861d2..cac07c51c9 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -1,8 +1,86 @@
+#include <xen/domain_page.h>
#include <xen/mm.h>
#include <xen/rwlock.h>
#include <xen/sched.h>
#include <asm/paging.h>
+#include <asm/p2m.h>
+#include <asm/riscv_encoding.h>
+
+unsigned int __read_mostly p2m_root_order;
+
+static void clear_and_clean_page(struct page_info *page)
+{
+ clear_domain_page(page_to_mfn(page));
+
+ /*
+ * If the IOMMU doesn't support coherent walks and the p2m tables are
+ * shared between the CPU and IOMMU, it is necessary to clean the
+ * d-cache.
+ */
+ clean_dcache_va_range(page, PAGE_SIZE);
+}
+
+static struct page_info *p2m_allocate_root(struct domain *d)
+{
+ struct page_info *page;
+
+ /*
+ * As mentioned in the Privileged Architecture Spec (version 20240411)
+ * in Section 18.5.1, for the paged virtual-memory schemes (Sv32x4,
+ * Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB and must
+ * be aligned to a 16-KiB boundary.
+ */
+ page = alloc_domheap_pages(d, P2M_ROOT_ORDER, MEMF_no_owner);
+ if ( !page )
+ return NULL;
+
+ for ( unsigned int i = 0; i < P2M_ROOT_PAGES; i++ )
+ clear_and_clean_page(page + i);
+
+ return page;
+}
+
+unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid)
+{
+ unsigned long ppn;
+
+ ppn = PFN_DOWN(page_to_maddr(p2m->root)) & HGATP_PPN;
+
+ /* TODO: add detection of hgatp_mode instead of hard-coding it. */
+#if RV_STAGE1_MODE == SATP_MODE_SV39
+ p2m->hgatp_mode = HGATP_MODE_SV39X4;
+#elif RV_STAGE1_MODE == SATP_MODE_SV48
+ p2m->hgatp_mode = HGATP_MODE_SV48X4;
+#else
+# error "add HGATP_MODE"
+#endif
+
+ return ppn | MASK_INSR(p2m->hgatp_mode, HGATP_MODE_MASK) |
+ MASK_INSR(vmid, HGATP_VMID_MASK);
+}
+
+static int p2m_alloc_root_table(struct p2m_domain *p2m)
+{
+ struct domain *d = p2m->domain;
+ struct page_info *page;
+ const unsigned int nr_root_pages = P2M_ROOT_PAGES;
+
+ /*
+ * Return nr_root_pages pages to the domheap to ensure the root table
+ * memory is also accounted against the P2M pool of the domain.
+ */
+ if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
+ return -ENOMEM;
+
+ page = p2m_allocate_root(d);
+ if ( !page )
+ return -ENOMEM;
+
+ p2m->root = page;
+
+ return 0;
+}
int p2m_init(struct domain *d)
{
@@ -32,6 +110,8 @@ int p2m_init(struct domain *d)
# error "Add init of p2m->clean_pte"
#endif
+ p2m_root_order = get_order_from_bytes(GUEST_ROOT_PAGE_TABLE_SIZE);
+
return 0;
}
@@ -42,10 +122,20 @@ int p2m_init(struct domain *d)
*/
int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
int rc;
if ( (rc = paging_freelist_init(d, pages, preempted)) )
return rc;
+ /*
+ * First, initialize p2m pool. Then allocate the root
+ * table so that the necessary pages can be returned from the p2m pool,
+ * since the root table must be allocated using alloc_domheap_pages(...)
+ * to meet its specific requirements.
+ */
+ if ( !p2m->root )
+ p2m_alloc_root_table(p2m);
+
return 0;
}
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index 8882be5ac9..bbe1186900 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -54,6 +54,36 @@ int paging_freelist_init(struct domain *d, unsigned long pages,
return 0;
}
+
+bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
+{
+ struct page_info *page;
+
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ if ( ACCESS_ONCE(d->arch.paging.total_pages) < nr_pages )
+ return false;
+
+ for ( unsigned int i = 0; i < nr_pages; i++ )
+ {
+ /* Return memory to domheap. */
+ page = page_list_remove_head(&d->arch.paging.freelist);
+ if ( page )
+ {
+ ACCESS_ONCE(d->arch.paging.total_pages)--;
+ free_domheap_page(page);
+ }
+ else
+ {
+ printk(XENLOG_ERR
+ "Failed to free P2M pages, P2M freelist is empty.\n");
+ return false;
+ }
+ }
+
+ return true;
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
--
2.50.1
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v3 07/20] xen/riscv: introduce pte_{set,get}_mfn()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (5 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 06/20] xen/riscv: add root page table allocation Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
` (12 subsequent siblings)
19 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce helpers pte_{set,get}_mfn() to simplify setting and getting
of mfn.
Also, introduce PTE_PPN_MASK and add BUILD_BUG_ON() to be sure that
PTE_PPN_MASK remains the same for all MMU modes except Sv32.
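As a rough illustration of what the two helpers do with the mask, here is a stand-alone sketch for the non-Sv32 case. The sketch_* names and the sketch_pte_t type are hypothetical, a plain uint64_t stands in for mfn_t, and MASK_INSR()/MASK_EXTR() are re-defined locally rather than pulled from Xen headers.

```c
#include <stdint.h>

/* PPN occupies bits 53:10 of the PTE for Sv39/Sv48/Sv57. */
#define PTE_PPN_MASK 0x3FFFFFFFFFFC00ULL

/* Local stand-ins: insert/extract a field described by mask m. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* Hypothetical stand-in for pte_t. */
typedef struct {
    uint64_t pte;
} sketch_pte_t;

/* Replace the PPN field and leave the permission/flag bits untouched. */
static void sketch_pte_set_mfn(sketch_pte_t *p, uint64_t mfn)
{
    p->pte &= ~PTE_PPN_MASK;
    p->pte |= MASK_INSR(mfn, PTE_PPN_MASK);
}

/* Read the PPN field back as a frame number. */
static uint64_t sketch_pte_get_mfn(sketch_pte_t p)
{
    return MASK_EXTR(p.pte, PTE_PPN_MASK);
}
```

Clearing the field before OR-ing in the new value is what makes the setter safe to call on a PTE that already holds an MFN.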
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V3:
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V2:
- Patch "[PATCH v1 4/6] xen/riscv: define pt_t and pt_walk_t structures" was
renamed to xen/riscv: introduce pte_{set,get}_mfn() as after dropping of
bitfields for PTE structure, this patch introduces only pte_{set,get}_mfn().
- As pt_t and pt_walk_t were dropped, update implementation of
pte_{set,get}_mfn() to use bit operations and shifts instead of bitfields.
- Introduce PTE_PPN_MASK to be able to use MASK_INSR for setting/getting PPN.
- Add BUILD_BUG_ON(RV_STAGE1_MODE > SATP_MODE_SV57) to be sure that when a
new MMU mode is added, someone checks that PPN is still bits 53:10.
---
xen/arch/riscv/include/asm/page.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index ddcc4da0a3..66cb192316 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -112,6 +112,30 @@ typedef struct {
#endif
} pte_t;
+#if RV_STAGE1_MODE != SATP_MODE_SV32
+#define PTE_PPN_MASK _UL(0x3FFFFFFFFFFC00)
+#else
+#define PTE_PPN_MASK _U(0xFFFFFC00)
+#endif
+
+static inline void pte_set_mfn(pte_t *p, mfn_t mfn)
+{
+ /*
+ * At the moment the spec provides Sv32 - Sv57.
+ * If a new MMU mode is added one day, it will be necessary
+ * to check that the PPN mask still covers bits 53:10.
+ */
+ BUILD_BUG_ON(RV_STAGE1_MODE > SATP_MODE_SV57);
+
+ p->pte &= ~PTE_PPN_MASK;
+ p->pte |= MASK_INSR(mfn_x(mfn), PTE_PPN_MASK);
+}
+
+static inline mfn_t pte_get_mfn(pte_t p)
+{
+ return _mfn(MASK_EXTR(p.pte, PTE_PPN_MASK));
+}
+
static inline bool pte_is_valid(pte_t p)
{
return p.pte & PTE_VALID;
--
2.50.1
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (6 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 07/20] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 14:16 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings Oleksii Kurochko
` (11 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
- Extended p2m_type_t with additional types: p2m_mmio_direct_io,
p2m_ext_storage, p2m_grant_map_{rw,ro}.
- Added macros to classify memory types: P2M_RAM_TYPES, P2M_GRANT_TYPES.
- Introduced helper predicates: p2m_is_ram(), p2m_is_any_ram().
- Define p2m_mmio_direct to tell handle_passthrough_prop() from common
code how to map device memory.
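To make the mask-based classification concrete, here is a small stand-alone sketch. The enum and the two group masks mirror the hunk below; the sketch_* predicate functions are hypothetical wrappers added only so the result can be checked as a bool, and p2m_to_mask() is open-coded with a shift instead of BIT().

```c
#include <stdbool.h>

/* Same ordering as the p2m_type_t in this patch. */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* One bit per type, so a group of types is a plain bitmask. */
#define p2m_to_mask(t) (1UL << (t))

#define P2M_RAM_TYPES   (p2m_to_mask(p2m_ram_rw))
#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
                         p2m_to_mask(p2m_grant_map_ro))

/* Hypothetical wrapper: true only for types that are plain domain RAM. */
static bool sketch_is_ram(p2m_type_t t)
{
    return (p2m_to_mask(t) & P2M_RAM_TYPES) != 0;
}

/* Hypothetical wrapper: true for RAM and for grant mappings. */
static bool sketch_is_any_ram(p2m_type_t t)
{
    return (p2m_to_mask(t) & (P2M_RAM_TYPES | P2M_GRANT_TYPES)) != 0;
}
```

With this layout a grant mapping is "any RAM" but not "RAM", while p2m_mmio_direct_io is neither.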
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Drop p2m_ram_ro.
- Rename p2m_mmio_direct_dev to p2m_mmio_direct_io to make it more RISC-V specific.
---
Changes in V2:
- Drop stuff connected to foreign mapping as it isn't necessary for RISC-V
right now.
---
xen/arch/riscv/include/asm/p2m.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 3c37a708db..5f253da1dd 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -62,8 +62,30 @@ struct p2m_domain {
typedef enum {
p2m_invalid = 0, /* Nothing mapped here */
p2m_ram_rw, /* Normal read/write domain RAM */
+ p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
+ PTE_PBMT_IO will be used for such mappings */
+ p2m_ext_storage, /* Following types'll be stored outside PTE bits: */
+ p2m_grant_map_rw, /* Read/write grant mapping */
+ p2m_grant_map_ro, /* Read-only grant mapping */
} p2m_type_t;
+#define p2m_mmio_direct p2m_mmio_direct_io
+
+/* We use bitmaps and mask to handle groups of types */
+#define p2m_to_mask(t_) BIT(t_, UL)
+
+/* RAM types, which map to real machine frames */
+#define P2M_RAM_TYPES (p2m_to_mask(p2m_ram_rw))
+
+/* Grant mapping types, which map to a real frame in another VM */
+#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
+ p2m_to_mask(p2m_grant_map_ro))
+
+/* Useful predicates */
+#define p2m_is_ram(t_) (p2m_to_mask(t_) & P2M_RAM_TYPES)
+#define p2m_is_any_ram(t_) (p2m_to_mask(t_) & \
+ (P2M_RAM_TYPES | P2M_GRANT_TYPES))
+
#include <xen/p2m-common.h>
static inline int get_page_and_type(struct page_info *page,
--
2.50.1
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (7 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-04 14:11 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn() Oleksii Kurochko
` (10 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Jan Beulich
Rename `p2m_mmio_direct_dev` to a more architecture-neutral alias
`p2m_mmio_direct` to avoid leaking Arm-specific naming into common Xen code,
such as dom0less passthrough property handling.
This helps reduce platform-specific terminology in shared logic and
improves clarity for future non-Arm ports (e.g. RISC-V or PowerPC).
No functional changes — the definition is preserved via a macro alias
for Arm.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v3:
- New patch
---
xen/arch/arm/include/asm/p2m.h | 2 ++
xen/common/device-tree/dom0less-build.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/xen/arch/arm/include/asm/p2m.h b/xen/arch/arm/include/asm/p2m.h
index 2d53bf9b61..bade1eb71b 100644
--- a/xen/arch/arm/include/asm/p2m.h
+++ b/xen/arch/arm/include/asm/p2m.h
@@ -137,6 +137,8 @@ typedef enum {
p2m_max_real_type, /* Types after this won't be store in the p2m */
} p2m_type_t;
+#define p2m_mmio_direct p2m_mmio_direct_dev
+
/* We use bitmaps and mask to handle groups of types */
#define p2m_to_mask(_t) (1UL << (_t))
diff --git a/xen/common/device-tree/dom0less-build.c b/xen/common/device-tree/dom0less-build.c
index 6bb038111d..5b97bf0343 100644
--- a/xen/common/device-tree/dom0less-build.c
+++ b/xen/common/device-tree/dom0less-build.c
@@ -185,7 +185,7 @@ static int __init handle_passthrough_prop(struct kernel_info *kinfo,
gaddr_to_gfn(gstart),
PFN_DOWN(size),
maddr_to_mfn(mstart),
- p2m_mmio_direct_dev);
+ p2m_mmio_direct);
if ( res < 0 )
{
printk(XENLOG_ERR
--
2.50.1
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (8 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-05 14:11 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m Oleksii Kurochko
` (9 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce page_set_xenheap_gfn() to encode the GFN associated with a Xen heap
page directly into the type_info field of struct page_info.
Introduce page_get_xenheap_gfn() to retrieve the GFN from a Xen heap page.
Reserve the upper 10 bits of type_info for the usage counter and frame type;
use the remaining lower bits to store the grant table frame GFN.
This is sufficient for all supported RISC-V MMU modes: Sv32 uses 22-bit GFNs,
while Sv39, Sv48, and Sv57 use up to 44-bit GFNs.
Define PGT_gfn_mask and PGT_gfn_width to ensure a consistent bit layout
across all RISC-V MMU modes, avoiding the need for mode-specific ifdefs.
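The update pattern can be sketched outside of Xen with C11 atomics. This is only an illustration under stated assumptions: a 64-bit type_info is assumed (so the GFN field is 64 - 10 = 54 bits wide), the sketch_* names are hypothetical, and atomic_compare_exchange_weak() stands in for Xen's cmpxchg().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: 64-bit type_info with the upper 10 bits reserved. */
#define SKETCH_GFN_WIDTH 54
#define SKETCH_GFN_MASK  ((UINT64_C(1) << SKETCH_GFN_WIDTH) - 1)

/*
 * Replace only the GFN field. The loop retries if another CPU changed
 * type_info (for example the use count) between the load and the swap.
 */
static void sketch_set_gfn(_Atomic uint64_t *type_info, uint64_t gfn)
{
    uint64_t old = atomic_load(type_info);
    uint64_t nx;

    do {
        nx = (old & ~SKETCH_GFN_MASK) | (gfn & SKETCH_GFN_MASK);
        /* On failure, old is refreshed with the current value. */
    } while ( !atomic_compare_exchange_weak(type_info, &old, nx) );
}

/* Read the GFN field back. */
static uint64_t sketch_get_gfn(_Atomic uint64_t *type_info)
{
    return atomic_load(type_info) & SKETCH_GFN_MASK;
}
```

The compare-and-swap loop is what lets the GFN be updated without disturbing the upper bits when the P2M lock cannot be held.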
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v3:
- Update the comment above definitions of PGT_gfn_width, PGT_gfn_mask.
- Add page_get_xenheap_gfn().
- Make commit message clearer.
---
Changes in v2:
- These changes were part of "xen/riscv: implement p2m mapping functionality".
No additional changes were made.
---
xen/arch/riscv/include/asm/mm.h | 43 ++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 3 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index dd8cdc9782..7950d132c1 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -12,6 +12,7 @@
#include <xen/sections.h>
#include <xen/types.h>
+#include <asm/cmpxchg.h>
#include <asm/page.h>
#include <asm/page-bits.h>
@@ -247,9 +248,17 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
#define PGT_writable_page PG_mask(1, 1) /* has writable mappings? */
#define PGT_type_mask PG_mask(1, 1) /* Bits 31 or 63. */
-/* Count of uses of this frame as its current type. */
-#define PGT_count_width PG_shift(2)
-#define PGT_count_mask ((1UL << PGT_count_width) - 1)
+ /* 9-bit count of uses of this frame as its current type. */
+#define PGT_count_mask PG_mask(0x1FF, 10)
+
+/*
+ * GFN of a xenheap page, stored in bits [21:0] (Sv32) or [43:0]
+ * (Sv39, Sv48, Sv57).
+ */
+#define PGT_gfn_width PG_shift(10)
+#define PGT_gfn_mask (BIT(PGT_gfn_width, UL) - 1)
+
+#define PGT_INVALID_XENHEAP_GFN _gfn(PGT_gfn_mask)
/*
* Page needs to be scrubbed. Since this bit can only be set on a page that is
@@ -301,6 +310,34 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
#define PFN_ORDER(pg) ((pg)->v.free.order)
+/*
+ * All accesses to the GFN portion of type_info field should always be
+ * protected by the P2M lock. In case when it is not feasible to satisfy
+ * that requirement (risk of deadlock, lock inversion, etc) it is important
+ * to make sure that all non-protected updates to this field are atomic.
+ */
+static inline gfn_t page_get_xenheap_gfn(const struct page_info *p)
+{
+ gfn_t gfn = _gfn(ACCESS_ONCE(p->u.inuse.type_info) & PGT_gfn_mask);
+
+ ASSERT(is_xen_heap_page(p));
+
+ return gfn_eq(gfn, PGT_INVALID_XENHEAP_GFN) ? INVALID_GFN : gfn;
+}
+
+static inline void page_set_xenheap_gfn(struct page_info *p, gfn_t gfn)
+{
+ gfn_t gfn_ = gfn_eq(gfn, INVALID_GFN) ? PGT_INVALID_XENHEAP_GFN : gfn;
+ unsigned long x, nx, y = p->u.inuse.type_info;
+
+ ASSERT(is_xen_heap_page(p));
+
+ do {
+ x = y;
+ nx = (x & ~PGT_gfn_mask) | gfn_x(gfn_);
+ } while ( (y = cmpxchg(&p->u.inuse.type_info, x, nx)) != x );
+}
+
extern unsigned char cpu0_boot_stack[];
void setup_initial_pagetables(void);
--
2.50.1
^ permalink raw reply related [flat|nested] 84+ messages in thread
* [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (9 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-05 15:20 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 12/20] xen/riscv: implement p2m_set_range() Oleksii Kurochko
` (8 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement map_regions_p2mt() to map a region in the guest p2m with
a specific p2m type. The memory attributes will be derived from the
p2m type. This function is going to be called from dom0less common
code.
To implement it, introduce:
- p2m_write_(un)lock() to ensure safe concurrent updates to the P2M.
As part of this change, introduce p2m_tlb_flush_sync() and
p2m_force_tlb_flush_sync().
- A stub for p2m_set_range() to map a range of GFNs to MFNs.
- p2m_insert_mapping().
- p2m_is_write_locked().
Drop guest_physmap_add_entry() and call map_regions_p2mt() directly
from guest_physmap_add_page(), making guest_physmap_add_entry()
unnecessary.
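The lock/flush ordering can be illustrated with a single-threaded stand-in. Everything here is hypothetical scaffolding: struct sketch_p2m replaces struct p2m_domain, a bool replaces the rwlock, and a counter replaces the remote fence issued by the patch. Only the ordering (flush first, then drop the lock) is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct p2m_domain. */
struct sketch_p2m {
    bool write_locked;    /* stand-in for the rwlock, single-threaded only */
    bool need_flush;      /* set by whoever changes a live mapping */
    unsigned int flushes; /* counts stand-in TLB flushes */
};

static void sketch_write_lock(struct sketch_p2m *p2m)
{
    p2m->write_locked = true;
}

/* Flush only if an earlier update asked for it. */
static void sketch_tlb_flush_sync(struct sketch_p2m *p2m)
{
    if ( !p2m->need_flush )
        return;

    p2m->flushes++; /* the patch issues sbi_remote_hfence_gvma() here */
    p2m->need_flush = false;
}

/* The pending flush completes before the lock is released. */
static void sketch_write_unlock(struct sketch_p2m *p2m)
{
    sketch_tlb_flush_sync(p2m);
    p2m->write_locked = false;
}
```

Deferring the flush to unlock time means several updates made under one lock hold cost a single flush.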
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v3:
- Introduce p2m_write_lock() and p2m_is_write_locked().
- Introduce p2m_force_tlb_flush_sync() and p2m_tlb_flush_sync() to flush TLBs
after p2m table update.
- Change an argument of p2m_insert_mapping() from struct domain *d to
p2m_domain *p2m.
- Drop guest_physmap_add_entry() and use map_regions_p2mt() to define
guest_physmap_add_page().
- Add declaration of map_regions_p2mt() to asm/p2m.h.
- Rewrite commit message and subject.
- Drop p2m_access_t related stuff.
- Add definition of p2m_is_write_locked().
---
Changes in v2:
- These changes were part of "xen/riscv: implement p2m mapping functionality".
No additional significant changes were made.
---
xen/arch/riscv/include/asm/p2m.h | 31 ++++++++++-----
xen/arch/riscv/p2m.c | 65 ++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+), 9 deletions(-)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 5f253da1dd..ada3c398b4 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -121,21 +121,22 @@ static inline int guest_physmap_mark_populate_on_demand(struct domain *d,
return -EOPNOTSUPP;
}
-static inline int guest_physmap_add_entry(struct domain *d,
- gfn_t gfn, mfn_t mfn,
- unsigned long page_order,
- p2m_type_t t)
-{
- BUG_ON("unimplemented");
- return -EINVAL;
-}
+/*
+ * Map a region in the guest p2m with a specific p2m type.
+ * The memory attributes will be derived from the p2m type.
+ */
+int map_regions_p2mt(struct domain *d,
+ gfn_t gfn,
+ unsigned long nr,
+ mfn_t mfn,
+ p2m_type_t p2mt);
/* Untyped version for RAM only, for compatibility */
static inline int __must_check
guest_physmap_add_page(struct domain *d, gfn_t gfn, mfn_t mfn,
unsigned int page_order)
{
- return guest_physmap_add_entry(d, gfn, mfn, page_order, p2m_ram_rw);
+ return map_regions_p2mt(d, gfn, BIT(page_order, UL), mfn, p2m_ram_rw);
}
static inline mfn_t gfn_to_mfn(struct domain *d, gfn_t gfn)
@@ -159,6 +160,18 @@ static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx)
/* Not supported on RISCV. */
}
+static inline void p2m_write_lock(struct p2m_domain *p2m)
+{
+ write_lock(&p2m->lock);
+}
+
+void p2m_write_unlock(struct p2m_domain *p2m);
+
+static inline int p2m_is_write_locked(struct p2m_domain *p2m)
+{
+ return rw_is_write_locked(&p2m->lock);
+}
+
unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid);
#endif /* ASM__RISCV__P2M_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index cac07c51c9..7cfcf76f24 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,41 @@
unsigned int __read_mostly p2m_root_order;
+/*
+ * Force a synchronous P2M TLB flush.
+ *
+ * Must be called with the p2m lock held.
+ */
+static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
+{
+ struct domain *d = p2m->domain;
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
+
+ p2m->need_flush = false;
+}
+
+void p2m_tlb_flush_sync(struct p2m_domain *p2m)
+{
+ if ( p2m->need_flush )
+ p2m_force_tlb_flush_sync(p2m);
+}
+
+/* Unlock the P2M and do a P2M TLB flush if necessary */
+void p2m_write_unlock(struct p2m_domain *p2m)
+{
+ /*
+ * The final flush is done with the P2M write lock taken to avoid
+ * someone else modifying the P2M before the TLB invalidation has
+ * completed.
+ */
+ p2m_tlb_flush_sync(p2m);
+
+ write_unlock(&p2m->lock);
+}
+
static void clear_and_clean_page(struct page_info *page)
{
clear_domain_page(page_to_mfn(page));
@@ -139,3 +174,33 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
return 0;
}
+
+static int p2m_set_range(struct p2m_domain *p2m,
+ gfn_t sgfn,
+ unsigned long nr,
+ mfn_t smfn,
+ p2m_type_t t)
+{
+ return -EOPNOTSUPP;
+}
+
+static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn,
+ unsigned long nr, mfn_t mfn, p2m_type_t t)
+{
+ int rc;
+
+ p2m_write_lock(p2m);
+ rc = p2m_set_range(p2m, start_gfn, nr, mfn, t);
+ p2m_write_unlock(p2m);
+
+ return rc;
+}
+
+int map_regions_p2mt(struct domain *d,
+ gfn_t gfn,
+ unsigned long nr,
+ mfn_t mfn,
+ p2m_type_t p2mt)
+{
+ return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
+}
--
2.50.1
* [PATCH v3 12/20] xen/riscv: implement p2m_set_range()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (10 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-05 16:04 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
` (7 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.
The main changes are:
- Simplification of the Break-Before-Make (BBM) approach, because according
to the RISC-V spec:
It is permitted for multiple address-translation cache entries to co-exist
for the same address. This represents the fact that in a conventional
TLB hierarchy, it is possible for multiple entries to match a single
address if, for example, a page is upgraded to a superpage without first
clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
hierarchy. In this case, just as if an SFENCE.VMA is not executed between
a write to the memory-management tables and subsequent implicit read of the
same address: it is unpredictable whether the old non-leaf PTE or the new
leaf PTE is used, but the behavior is otherwise well defined.
In contrast to the Arm architecture, where BBM is mandatory and failing to
use it in some cases can lead to CPU instability, RISC-V guarantees
stability, and the behavior remains safe — though unpredictable in terms of
which translation will be used.
- Unlike Arm, the valid bit is not repurposed for other uses in this
implementation. Instead, entry validity is determined based solely on P2M
PTE's valid bit.
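As a rough illustration of the first point, the sketch below models an in-place PTE update followed by a single deferred flush at unlock time. It is a toy model, not Xen code: struct toy_p2m, toy_set_entry() and toy_unlock() are hypothetical names, and the flush is only simulated by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names are Xen APIs. */
struct toy_p2m {
    uint64_t pte;          /* a single page-table entry */
    bool need_flush;       /* set whenever the entry changes */
    unsigned int flushes;  /* counts simulated remote fences */
};

/*
 * Overwrite the entry in place. No intermediate "invalid" state is
 * written first, i.e. no break-before-make sequence.
 */
static void toy_set_entry(struct toy_p2m *p2m, uint64_t new_pte)
{
    assert(p2m);

    p2m->pte = new_pte;
    p2m->need_flush = true;
}

/* On unlock, issue at most one flush covering all pending updates. */
static void toy_unlock(struct toy_p2m *p2m)
{
    assert(p2m);

    if ( p2m->need_flush )
    {
        p2m->flushes++;
        p2m->need_flush = false;
    }
}
```

Two back-to-back updates therefore cost one simulated flush rather than two, which is the benefit of deferring the invalidation.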
The main functionality is in p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
p2m_set_range() breaks a region down into block-aligned mappings and calls
p2m_set_entry() accordingly.
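The splitting described above can be sketched in isolation as follows. This is a simplified model assuming 9 index bits per level and plain unsigned long frame numbers; toy_mapping_order() and toy_count_entries() are hypothetical helpers that mirror the idea of p2m_mapping_order() and the loop in p2m_set_range(), not their exact code.

```c
#include <assert.h>

#define TOY_PAGETABLE_ORDER  9U  /* assumed: 512 entries per table */
#define TOY_LEVEL_ORDER(lvl) ((lvl) * TOY_PAGETABLE_ORDER)

/*
 * Pick the largest order (1G, 2M, then 4K) for which both frame numbers
 * are aligned and at least that many pages remain.
 */
static unsigned int toy_mapping_order(unsigned long gfn, unsigned long mfn,
                                      unsigned long nr)
{
    unsigned long mask = gfn | mfn;
    unsigned int level;

    for ( level = 2; level != 0; level-- )
    {
        unsigned long size = 1UL << TOY_LEVEL_ORDER(level);

        if ( !(mask & (size - 1)) && (nr >= size) )
            return TOY_LEVEL_ORDER(level);
    }

    return 0;
}

/* Count how many page-table entries a range would need. */
static unsigned long toy_count_entries(unsigned long gfn, unsigned long mfn,
                                       unsigned long nr)
{
    unsigned long entries = 0;

    while ( nr )
    {
        unsigned long step = 1UL << toy_mapping_order(gfn, mfn, nr);

        assert(step <= nr);

        gfn += step;
        mfn += step;
        nr -= step;
        entries++;
    }

    return entries;
}
```

For instance, a 514-page range starting one page below a 2M boundary splits into a 4K entry, a 2M entry and a trailing 4K entry.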
Stub implementations (to be completed later) include:
- p2m_free_subtree()
- p2m_next_level()
- p2m_pte_from_mfn()
Note: Support for shattering block entries is not implemented in this
patch and will be added separately.
Additionally, some straightforward helper functions are now implemented:
- p2m_write_pte()
- p2m_remove_pte()
- p2m_get_root_pointer()
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Drop p2m_access_t connected stuff as it isn't going to be used, at least
now.
- Move definition of P2M_ROOT_ORDER and P2M_ROOT_PAGES to earlier patches.
- Update the comment above lowest_mapped_gfn declaration.
- Update the comment above p2m_get_root_pointer(): s/"...offset of the root
table"/"...offset into root table".
- s/p2m_remove_pte/p2m_clean_pte.
- Use plain 0 instead of 0x00 in p2m_clean_pte().
- s/p2m_entry_from_mfn/p2m_pte_from_mfn.
- s/GUEST_TABLE_*/P2M_TABLE_*.
- Update the comment above p2m_next_level(): "GFN entry" -> "the entry
corresponding to the GFN".
- s/__p2m_set_entry/_p2m_set_entry.
- Drop "s" for sgfn and smfn prefixes of _p2m_set_entry()'s arguments
as this function works only with one GFN and one MFN.
- Return correct return code when p2m_next_level() failed in _p2m_set_entry(),
also drop "else" and just handle case (rc != P2M_TABLE_NORMAL) separately.
- Code style fixes.
- Use unsigned int for "order" in p2m_set_entry().
- s/p2m_free_entry/p2m_free_subtree.
- Update ASSERT() in __p2m_set_entry() to check that page_order is properly
aligned.
- Return -EACCES instead of -ENOMEM in the case when domain is dying and
someone called p2m_set_entry.
- s/p2m_set_entry/p2m_set_range.
- s/__p2m_set_entry/p2m_set_entry
- s/p2me_is_valid/p2m_is_valid()
- Return a number of successfully mapped GFNs in case if not all were mapped
in p2m_set_range().
- Use BIT(order, UL) instead of 1 << order.
- Drop IOMMU flushing code from p2m_set_entry().
- set p2m->need_flush=true when entry in p2m_set_entry() is changed.
- Introduce p2m_mapping_order() to support superpages.
- Drop p2m_is_valid() and use pte_is_valid() instead as there are no tricks
with copying of valid bit anymore.
- Update p2m_pte_from_mfn() prototype: drop p2m argument.
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- Update the way the p2m TLB is flushed:
- RISC-V doesn't require BBM so there is no need to remove the PTE before
making a new one, so drop 'if /*pte_is_valid(orig_pte) */' and remove the PTE
only when removal has been requested.
- Drop p2m->need_flush |= !!pte_is_valid(orig_pte); for the case when
PTE's removing is happening as RISC-V could cache invalid PTE and thereby
it requires to do a flush each time and it doesn't matter if PTE is valid
or not at the moment when PTE removing is happening.
- Drop the check whether the PTE is valid when a PTE is modified: as mentioned
above, BBM isn't required, so TLB flushing can be deferred and there is
no need to do it before modifying the PTE.
- Drop p2m->need_flush as it seems like it will be always true.
- Drop foreign mapping things as it isn't necessary for RISC-V right now.
- s/p2m_is_valid/p2me_is_valid.
- Move definition and initialization of p2m->{max_mapped_gfn,lowest_mapped_gfn}
to this patch.
---
xen/arch/riscv/include/asm/p2m.h | 12 ++
xen/arch/riscv/p2m.c | 250 ++++++++++++++++++++++++++++++-
2 files changed, 261 insertions(+), 1 deletion(-)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index ada3c398b4..26ad87b8df 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -7,11 +7,13 @@
#include <xen/rwlock.h>
#include <xen/types.h>
+#include <asm/page.h>
#include <asm/page-bits.h>
extern unsigned int p2m_root_order;
#define P2M_ROOT_ORDER p2m_root_order
#define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL HYP_PT_ROOT_LEVEL
#define paddr_bits PADDR_BITS
@@ -50,6 +52,16 @@ struct p2m_domain {
* shattered), call p2m_tlb_flush_sync().
*/
bool need_flush;
+
+ /* Highest guest frame that's ever been mapped in the p2m */
+ gfn_t max_mapped_gfn;
+
+ /*
+ * Lowest mapped gfn in the p2m. When releasing mapped gfn's in a
+ * preemptible manner this is updated to track where to resume
+ * the search. Apart from during teardown this can only decrease.
+ */
+ gfn_t lowest_mapped_gfn;
};
/*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 7cfcf76f24..6c99719c66 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -3,6 +3,7 @@
#include <xen/rwlock.h>
#include <xen/sched.h>
+#include <asm/page.h>
#include <asm/paging.h>
#include <asm/p2m.h>
#include <asm/riscv_encoding.h>
@@ -132,6 +133,9 @@ int p2m_init(struct domain *d)
rwlock_init(&p2m->lock);
INIT_PAGE_LIST_HEAD(&p2m->pages);
+ p2m->max_mapped_gfn = _gfn(0);
+ p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
+
/*
* Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
* is not ready for RISC-V support.
@@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
return 0;
}
+/*
+ * Find and map the root page table. The caller is responsible for
+ * unmapping the table.
+ *
+ * The function will return NULL if the offset into the root table is
+ * invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+ unsigned long root_table_indx;
+
+ root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
+ if ( root_table_indx >= P2M_ROOT_PAGES )
+ return NULL;
+
+ return __map_domain_page(p2m->root + root_table_indx);
+}
+
+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+ write_pte(p, pte);
+ if ( clean_pte )
+ clean_dcache_va_range(p, sizeof(*p));
+}
+
+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
+{
+ pte_t pte;
+
+ memset(&pte, 0, sizeof(pte));
+ p2m_write_pte(p, pte, clean_pte);
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+{
+ panic("%s: hasn't been implemented yet\n", __func__);
+
+ return (pte_t) { .pte = 0 };
+}
+
+#define P2M_TABLE_MAP_NONE 0
+#define P2M_TABLE_MAP_NOMEM 1
+#define P2M_TABLE_SUPER_PAGE 2
+#define P2M_TABLE_NORMAL 3
+
+/*
+ * Take the currently mapped table, find the entry corresponding to
+ * the GFN, and map the next table, if available.
+ * The previous table will be unmapped if the next level was mapped
+ * (e.g. P2M_TABLE_NORMAL returned).
+ *
+ * `alloc_tbl` parameter indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ * P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
+ * P2M_TABLE_MAP_NOMEM: allocating a new page failed.
+ * P2M_TABLE_SUPER_PAGE: The next entry points to a superpage.
+ * P2M_TABLE_NORMAL: next level or leaf mapped normally.
+ */
+static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
+ unsigned int level, pte_t **table,
+ unsigned int offset)
+{
+ panic("%s: hasn't been implemented yet\n", __func__);
+
+ return P2M_TABLE_MAP_NONE;
+}
+
+/* Free pte sub-tree behind an entry */
+static void p2m_free_subtree(struct p2m_domain *p2m,
+ pte_t entry, unsigned int level)
+{
+ panic("%s: hasn't been implemented yet\n", __func__);
+}
+
+/*
+ * Insert an entry in the p2m. This should be called with a mapping
+ * equal to a page/superpage.
+ */
+static int p2m_set_entry(struct p2m_domain *p2m,
+ gfn_t gfn,
+ unsigned long page_order,
+ mfn_t mfn,
+ p2m_type_t t)
+{
+ unsigned int level;
+ unsigned int target = page_order / PAGETABLE_ORDER;
+ pte_t *entry, *table, orig_pte;
+ int rc;
+ /* A mapping is removed if the MFN is invalid. */
+ bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
+ DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ /*
+ * Check if the level target is valid: we only support
+ * 4K - 2M - 1G mapping.
+ */
+ ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));
+
+ table = p2m_get_root_pointer(p2m, gfn);
+ if ( !table )
+ return -EINVAL;
+
+ for ( level = P2M_ROOT_LEVEL; level > target; level-- )
+ {
+ /*
+ * Don't try to allocate intermediate page table if the mapping
+ * is about to be removed.
+ */
+ rc = p2m_next_level(p2m, !removing_mapping,
+ level, &table, offsets[level]);
+ if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
+ {
+ rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
+ /*
+ * We are here because p2m_next_level has failed to map
+ * the intermediate page table (e.g. the table does not exist
+ * and the p2m tree is read-only). It is a valid case
+ * when removing a mapping as it may not exist in the
+ * page table. In this case, just ignore it.
+ */
+ rc = removing_mapping ? 0 : rc;
+ goto out;
+ }
+
+ if ( rc != P2M_TABLE_NORMAL )
+ break;
+ }
+
+ entry = table + offsets[level];
+
+ /*
+ * If we are here with level > target, we must be at a leaf node,
+ * and we need to break up the superpage.
+ */
+ if ( level > target )
+ {
+ panic("Shattering isn't implemented\n");
+ }
+
+ /*
+ * We should always be here with the correct level because all the
+ * intermediate tables have been installed if necessary.
+ */
+ ASSERT(level == target);
+
+ orig_pte = *entry;
+
+ if ( removing_mapping )
+ p2m_clean_pte(entry, p2m->clean_pte);
+ else
+ {
+ pte_t pte = p2m_pte_from_mfn(mfn, t);
+
+ p2m_write_pte(entry, pte, p2m->clean_pte);
+
+ p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+ gfn_add(gfn, BIT(page_order, UL) - 1));
+ p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
+ }
+
+ p2m->need_flush = true;
+
+ /*
+ * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+ * is not ready for RISC-V support.
+ *
+ * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
+ * here.
+ */
+#ifdef CONFIG_HAS_PASSTHROUGH
+# error "add code to flush IOMMU TLB"
+#endif
+
+ rc = 0;
+
+ /*
+ * Free the entry only if the original pte was valid and the base
+ * is different (to avoid freeing when permission is changed).
+ */
+ if ( pte_is_valid(orig_pte) &&
+ !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
+ p2m_free_subtree(p2m, orig_pte, level);
+
+ out:
+ unmap_domain_page(table);
+
+ return rc;
+}
+
+/* Return mapping order for given gfn, mfn and nr */
+static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
+{
+ unsigned long mask;
+ /* 1gb, 2mb, 4k mappings are supported */
+ unsigned int level = min(P2M_ROOT_LEVEL, 2);
+ unsigned long order = 0;
+
+ mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+ mask |= gfn_x(gfn);
+
+ for ( ; level != 0; level-- )
+ {
+ if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1)) &&
+ (nr >= BIT(XEN_PT_LEVEL_ORDER(level), UL)) )
+ {
+ order = XEN_PT_LEVEL_ORDER(level);
+ break;
+ }
+ }
+
+ return order;
+}
+
static int p2m_set_range(struct p2m_domain *p2m,
gfn_t sgfn,
unsigned long nr,
mfn_t smfn,
p2m_type_t t)
{
- return -EOPNOTSUPP;
+ int rc = 0;
+ unsigned long left = nr;
+
+ /*
+ * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+ * be dropped in relinquish_p2m_mapping(). As the P2M will still
+ * be accessible afterwards, we need to prevent mappings from being added
+ * when the domain is dying.
+ */
+ if ( unlikely(p2m->domain->is_dying) )
+ return -EACCES;
+
+ while ( left )
+ {
+ unsigned long order = p2m_mapping_order(sgfn, smfn, left);
+
+ rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
+ if ( rc )
+ break;
+
+ sgfn = gfn_add(sgfn, BIT(order, UL));
+ if ( !mfn_eq(smfn, INVALID_MFN) )
+ smfn = mfn_add(smfn, BIT(order, UL));
+
+ left -= BIT(order, UL);
+ }
+
+ return !left ? 0 : left == nr ? rc : (nr - left);
}
static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn,
--
2.50.1
* [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (11 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 12/20] xen/riscv: implement p2m_set_range() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-06 15:55 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
` (6 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
This patch introduces a working implementation of p2m_free_subtree() for RISC-V
based on ARM's implementation of p2m_free_entry(), enabling proper cleanup
of page table entries in the P2M (physical-to-machine) mapping.
Only a few things are changed:
- Introduce and use p2m_get_type() to get a type of p2m entry as
RISC-V's PTE doesn't have enough space to store all necessary types so
a type is stored outside PTE. But, at the moment, handle only types
which fit into PTE's bits.
Key additions include:
- p2m_free_subtree(): Recursively frees page table entries at all levels. It
handles both regular and superpage mappings and ensures that TLB entries
are flushed before freeing intermediate tables.
- p2m_put_page() and helpers:
- p2m_put_4k_page(): Clears GFN from xenheap pages if applicable.
- p2m_put_2m_superpage(): Releases foreign page references in a 2MB
superpage.
- p2m_get_type(): Extracts the stored p2m_type from the PTE bits.
- p2m_free_page(): Returns a page to a domain's freelist.
- Introduce p2m_is_foreign() and the related foreign-mapping types and masks.
Defines XEN_PT_ENTRIES in asm/page.h to simplify loops over page table
entries.
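To make the "type stored in PTE bits" idea concrete, here is a minimal standalone sketch of packing a small type value into bits 8 and 9 of a 64-bit PTE (the mask 0x300 matches P2M_TYPE_PTE_BITS_MASK in this patch). toy_set_type() and toy_get_type() are hypothetical helpers; unlike the series' p2m_set_type(), this sketch clears the field before writing it, which is a choice made here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TYPE_MASK   0x300ULL  /* bits 8 and 9 of the PTE */
#define TOY_TYPE_SHIFT  8

/* Store a 2-bit type value in the PTE, leaving all other bits untouched. */
static uint64_t toy_set_type(uint64_t pte, unsigned int type)
{
    assert(type <= 3);  /* only two bits are available */

    return (pte & ~TOY_TYPE_MASK) | ((uint64_t)type << TOY_TYPE_SHIFT);
}

/* Read the 2-bit type value back from the PTE. */
static unsigned int toy_get_type(uint64_t pte)
{
    return (unsigned int)((pte & TOY_TYPE_MASK) >> TOY_TYPE_SHIFT);
}
```

With only two bits, at most four type values fit in the PTE itself, which is why the remaining types need external storage.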
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Use p2m_tlb_flush_sync(p2m) instead of p2m_force_tlb_flush_sync() in
p2m_free_subtree().
- Drop p2m_is_valid() implementation as pte_is_valid() is going to be used
instead.
- Drop p2m_is_superpage() and introduce pte_is_superpage() instead.
- s/p2m_free_entry/p2m_free_subtree.
- s/p2m_type_radix_get/p2m_get_type.
- Update implementation of p2m_get_type() to get the type from PTE bits;
other cases will be covered in a separate patch. This requires the
introduction of a new P2M_TYPE_PTE_BITS_MASK macro.
- Drop p2m argument of p2m_get_type() as it isn't needed anymore.
- Put cheapest checks first in p2m_is_superpage().
- Use switch() in p2m_put_page().
- Update the comment in p2m_put_foreign_page().
- Code style fixes.
- Move p2m_foreign stuff to this commit.
- Drop p2m argument of p2m_put_page() as it isn't used anymore.
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- s/p2m_is_superpage/p2me_is_superpage.
---
xen/arch/riscv/include/asm/p2m.h | 18 +++-
xen/arch/riscv/include/asm/page.h | 6 ++
xen/arch/riscv/include/asm/paging.h | 2 +
xen/arch/riscv/p2m.c | 137 +++++++++++++++++++++++++++-
xen/arch/riscv/paging.c | 7 ++
5 files changed, 168 insertions(+), 2 deletions(-)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 26ad87b8df..fbc73448a7 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -79,10 +79,20 @@ typedef enum {
p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
p2m_grant_map_rw, /* Read/write grant mapping */
p2m_grant_map_ro, /* Read-only grant mapping */
+ p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
+ p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
} p2m_type_t;
#define p2m_mmio_direct p2m_mmio_direct_io
+/*
+ * Bits 8 and 9 are reserved for use by supervisor software;
+ * the implementation shall ignore this field.
+ * We use these bits to store frequently used types, which avoids a
+ * get/set of the type from the radix tree.
+ */
+#define P2M_TYPE_PTE_BITS_MASK 0x300
+
/* We use bitmaps and mask to handle groups of types */
#define p2m_to_mask(t_) BIT(t_, UL)
@@ -93,10 +103,16 @@ typedef enum {
#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
p2m_to_mask(p2m_grant_map_ro))
+ /* Foreign mappings types */
+#define P2M_FOREIGN_TYPES (p2m_to_mask(p2m_map_foreign_rw) | \
+ p2m_to_mask(p2m_map_foreign_ro))
+
/* Useful predicates */
#define p2m_is_ram(t_) (p2m_to_mask(t_) & P2M_RAM_TYPES)
#define p2m_is_any_ram(t_) (p2m_to_mask(t_) & \
- (P2M_RAM_TYPES | P2M_GRANT_TYPES))
+ (P2M_RAM_TYPES | P2M_GRANT_TYPES | \
+ P2M_FOREIGN_TYPES))
+#define p2m_is_foreign(t_) (p2m_to_mask(t_) & P2M_FOREIGN_TYPES)
#include <xen/p2m-common.h>
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 66cb192316..cb303af0c0 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -20,6 +20,7 @@
#define XEN_PT_LEVEL_SIZE(lvl) (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
#define XEN_PT_LEVEL_MAP_MASK(lvl) (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
#define XEN_PT_LEVEL_MASK(lvl) (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
+#define XEN_PT_ENTRIES (_AT(unsigned int, 1) << PAGETABLE_ORDER)
/*
* PTE format:
@@ -182,6 +183,11 @@ static inline bool pte_is_mapping(pte_t p)
return (p.pte & PTE_VALID) && (p.pte & PTE_ACCESS_MASK);
}
+static inline bool pte_is_superpage(pte_t p, unsigned int level)
+{
+ return (level > 0) && pte_is_mapping(p);
+}
+
static inline int clean_and_invalidate_dcache_va_range(const void *p,
unsigned long size)
{
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index 557fbd1abc..c9063b7f76 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -12,4 +12,6 @@ int paging_freelist_init(struct domain *d, unsigned long pages,
bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages);
+void paging_free_page(struct domain *d, struct page_info *pg);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 6c99719c66..2467e459cc 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -197,6 +197,16 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
+static p2m_type_t p2m_get_type(const pte_t pte)
+{
+ p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
+
+ if ( type == p2m_ext_storage )
+ panic("unimplemented\n");
+
+ return type;
+}
+
static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
{
write_pte(p, pte);
@@ -248,11 +258,136 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
return P2M_TABLE_MAP_NONE;
}
+static void p2m_put_foreign_page(struct page_info *pg)
+{
+ /*
+ * It’s safe to call put_page() here because arch_flush_tlb_mask()
+ * will be invoked if the page is reallocated before the end of
+ * this loop, which will trigger a flush of the guest TLBs.
+ */
+ put_page(pg);
+}
+
+/* Put any references on the single 4K page referenced by mfn. */
+static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
+{
+ /* TODO: Handle other p2m types */
+
+ if ( p2m_is_foreign(type) )
+ {
+ ASSERT(mfn_valid(mfn));
+ p2m_put_foreign_page(mfn_to_page(mfn));
+ }
+
+ /*
+ * Detect the xenheap page and mark the stored GFN as invalid.
+ * We don't free the underlying page until the guest requested to do so.
+ * So we only need to tell the page is not mapped anymore in the P2M by
+ * marking the stored GFN as invalid.
+ */
+ if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
+ page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
+}
+
+/* Put any references on the superpage referenced by mfn. */
+static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
+{
+ struct page_info *pg;
+ unsigned int i;
+
+ ASSERT(mfn_valid(mfn));
+
+ pg = mfn_to_page(mfn);
+
+ for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
+ p2m_put_foreign_page(pg);
+}
+
+/* Put any references on the page referenced by pte. */
+static void p2m_put_page(const pte_t pte, unsigned int level)
+{
+ mfn_t mfn = pte_get_mfn(pte);
+ p2m_type_t p2m_type = p2m_get_type(pte);
+
+ ASSERT(pte_is_valid(pte));
+
+ /*
+ * TODO: Currently we don't handle level 2 super-page, Xen is not
+ * preemptible and therefore some work is needed to handle such
+ * superpages, for which at some point Xen might end up freeing memory
+ * and therefore for such a big mapping it could end up in a very long
+ * operation.
+ */
+ switch ( level )
+ {
+ case 1:
+ return p2m_put_2m_superpage(mfn, p2m_type);
+
+ case 0:
+ return p2m_put_4k_page(mfn, p2m_type);
+ }
+}
+
+static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg)
+{
+ page_list_del(pg, &p2m->pages);
+
+ paging_free_page(p2m->domain, pg);
+}
+
/* Free pte sub-tree behind an entry */
static void p2m_free_subtree(struct p2m_domain *p2m,
pte_t entry, unsigned int level)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ unsigned int i;
+ pte_t *table;
+ mfn_t mfn;
+ struct page_info *pg;
+
+ /* Nothing to do if the entry is invalid. */
+ if ( !pte_is_valid(entry) )
+ return;
+
+ if ( pte_is_superpage(entry, level) || (level == 0) )
+ {
+#ifdef CONFIG_IOREQ_SERVER
+ /*
+ * If this gets called then either the entry was replaced by an entry
+ * with a different base (valid case) or the shattering of a superpage
+ * has failed (error case).
+ * So, at worst, the spurious mapcache invalidation might be sent.
+ */
+ if ( p2m_is_ram(p2m_get_type(entry)) &&
+ domain_has_ioreq_server(p2m->domain) )
+ ioreq_request_mapcache_invalidate(p2m->domain);
+#endif
+
+ p2m_put_page(entry, level);
+
+ return;
+ }
+
+ table = map_domain_page(pte_get_mfn(entry));
+ for ( i = 0; i < XEN_PT_ENTRIES; i++ )
+ p2m_free_subtree(p2m, table[i], level - 1);
+
+ unmap_domain_page(table);
+
+ /*
+ * Make sure all the references in the TLB have been removed before
+ * freeing the intermediate page table.
+ * XXX: Should we defer the free of the page table to avoid the
+ * flush?
+ */
+ p2m_tlb_flush_sync(p2m);
+
+ mfn = pte_get_mfn(entry);
+ ASSERT(mfn_valid(mfn));
+
+ pg = mfn_to_page(mfn);
+
+ /* p2m_free_page() also removes the page from p2m->pages. */
+ p2m_free_page(p2m, pg);
}
/*
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index bbe1186900..853e0e14c6 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -84,6 +84,13 @@ bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
return true;
}
+void paging_free_page(struct domain *d, struct page_info *pg)
+{
+ spin_lock(&d->arch.paging.lock);
+ page_list_add_tail(pg, &d->arch.paging.freelist);
+ spin_unlock(&d->arch.paging.lock);
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
--
2.50.1
* [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (12 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 11:36 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 15/20] xen/riscv: implement p2m_next_level() Oleksii Kurochko
` (5 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
This patch adds the initial logic for constructing PTEs from MFNs in the RISC-V
p2m subsystem. It includes:
- Implementation of p2m_pte_from_mfn(): Generates a valid PTE using the
given MFN, p2m_type_t, including permission encoding and PBMT attribute
setup.
- New helper p2m_set_permission(): Encodes access rights (r, w, x) into the
PTE based on the p2m type.
- p2m_set_type(): Stores the p2m type in PTE's bits. The storage of types,
which don't fit PTE bits, will be implemented separately later.
PBMT type encoding support:
- Introduces an enum pbmt_type to represent the PBMT field values.
- Maps p2m_mmio_direct_io to the PBMT I/O encoding (PTE_PBMT_IO); all other
types keep the default PMA encoding.
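The PTE construction described above can be sketched as a standalone function. The bit positions (V/R/W/X in bits 0-3, PPN starting at bit 10, PBMT in bits 62:61 with I/O encoded as 2) follow the RISC-V privileged specification with Svpbmt; the TOY_* names, the two-value enum toy_type and toy_pte_from_mfn() are hypothetical simplifications and do not reproduce Xen's p2m_pte_from_mfn().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PTE_V          (1ULL << 0)
#define TOY_PTE_R          (1ULL << 1)
#define TOY_PTE_W          (1ULL << 2)
#define TOY_PTE_X          (1ULL << 3)
#define TOY_PTE_PPN_SHIFT  10
#define TOY_PTE_PBMT_IO    (2ULL << 61)  /* Svpbmt: 0=PMA, 1=NC, 2=IO */

/* Hypothetical, reduced set of mapping types. */
enum toy_type { TOY_RAM_RW, TOY_MMIO_IO };

/*
 * Build a PTE for the given frame. A table (non-leaf) PTE keeps
 * R=W=X=0; a leaf PTE gets permissions derived from the type.
 */
static uint64_t toy_pte_from_mfn(uint64_t mfn, enum toy_type t, bool is_table)
{
    uint64_t pte;

    assert(mfn < (1ULL << 44));  /* PPN field is 44 bits wide */

    pte = TOY_PTE_V | (mfn << TOY_PTE_PPN_SHIFT);

    if ( is_table )
        return pte;

    switch ( t )
    {
    case TOY_RAM_RW:
        pte |= TOY_PTE_R | TOY_PTE_W;
        break;

    case TOY_MMIO_IO:
        pte |= TOY_PTE_R | TOY_PTE_W | TOY_PTE_X | TOY_PTE_PBMT_IO;
        break;
    }

    return pte;
}
```

The MFN is placed before permissions are added, matching the ordering noted in the V3 changelog below.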
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- s/p2m_entry_from_mfn/p2m_pte_from_mfn.
- s/pbmt_type_t/pbmt_type.
- s/pbmt_max/pbmt_count.
- s/p2m_type_radix_set/p2m_set_type.
- Rework p2m_set_type() to handle only types which fit into the PTE's bits.
Other types will be covered separately.
Update arguments of p2m_set_type(): there is no longer any reason to pass p2m.
- p2m_set_permissions() updates:
- Update the code in p2m_set_permission() for cases p2m_ram_rw and
p2m_mmio_direct_io to set proper type permissions.
- Add cases for p2m_grant_map_rw and p2m_grant_map_ro.
- Use ASSERT_UNREACHABLE() instead of BUG() in switch cases of
p2m_set_permissions.
- Add blank lines between non-fall-through case blocks in switch cases.
- Set MFN before permissions are set in p2m_pte_from_mfn().
- Update prototype of p2m_entry_from_mfn().
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
---
xen/arch/riscv/include/asm/page.h | 8 +++
xen/arch/riscv/p2m.c | 81 +++++++++++++++++++++++++++++--
2 files changed, 85 insertions(+), 4 deletions(-)
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index cb303af0c0..4fa0556073 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -74,6 +74,14 @@
#define PTE_SMALL BIT(10, UL)
#define PTE_POPULATE BIT(11, UL)
+enum pbmt_type {
+ pbmt_pma,
+ pbmt_nc,
+ pbmt_io,
+ pbmt_rsvd,
+ pbmt_count,
+};
+
#define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
#define PTE_PBMT_MASK (PTE_PBMT_NOCACHE | PTE_PBMT_IO)
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 2467e459cc..efc7320619 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -1,3 +1,4 @@
+#include <xen/bug.h>
#include <xen/domain_page.h>
#include <xen/mm.h>
#include <xen/rwlock.h>
@@ -197,6 +198,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
+static int p2m_set_type(pte_t *pte, p2m_type_t t)
+{
+ int rc = 0;
+
+ if ( t > p2m_ext_storage )
+ panic("unimplemented\n");
+ else
+ pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
+
+ return rc;
+}
+
static p2m_type_t p2m_get_type(const pte_t pte)
{
p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
@@ -222,11 +235,71 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
p2m_write_pte(p, pte, clean_pte);
}
-static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+static void p2m_set_permission(pte_t *e, p2m_type_t t)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ e->pte &= ~PTE_ACCESS_MASK;
+
+ switch ( t )
+ {
+ case p2m_grant_map_rw:
+ case p2m_ram_rw:
+ e->pte |= PTE_READABLE | PTE_WRITABLE;
+ break;
+
+ case p2m_ext_storage:
+ case p2m_mmio_direct_io:
+ e->pte |= PTE_ACCESS_MASK;
+ break;
+
+ case p2m_invalid:
+ e->pte &= ~(PTE_ACCESS_MASK | PTE_VALID);
+ break;
+
+ case p2m_grant_map_ro:
+ e->pte |= PTE_READABLE;
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ break;
+ }
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
+{
+ pte_t e = (pte_t) { PTE_VALID };
+
+ switch ( t )
+ {
+ case p2m_mmio_direct_io:
+ e.pte |= PTE_PBMT_IO;
+ break;
+
+ default:
+ break;
+ }
+
+ pte_set_mfn(&e, mfn);
+
+ ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
+
+ if ( !is_table )
+ {
+ p2m_set_permission(&e, t);
+
+ if ( t < p2m_ext_storage )
+ p2m_set_type(&e, t);
+ else
+ panic("unimplemented\n");
+ }
+ else
+ /*
+ * According to the spec and table "Encoding of PTE R/W/X fields":
+ * X=W=R=0 -> Pointer to next level of page table.
+ */
+ e.pte &= ~PTE_ACCESS_MASK;
- return (pte_t) { .pte = 0 };
+ return e;
}
#define P2M_TABLE_MAP_NONE 0
@@ -469,7 +542,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
p2m_clean_pte(entry, p2m->clean_pte);
else
{
- pte_t pte = p2m_pte_from_mfn(mfn, t);
+ pte_t pte = p2m_pte_from_mfn(mfn, t, false);
p2m_write_pte(entry, pte, p2m->clean_pte);
--
2.50.1
* [PATCH v3 15/20] xen/riscv: implement p2m_next_level()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (13 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 11:44 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
` (4 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement the p2m_next_level() function, which enables traversal and dynamic
allocation of intermediate levels (if necessary) in the RISC-V
p2m (physical-to-machine) page table hierarchy.
To support this, the following helpers are introduced:
- page_to_p2m_table(): Constructs non-leaf PTEs pointing to next-level page
tables with correct attributes.
- p2m_alloc_page(): Allocates page table pages, supporting both hardware and
guest domains.
- p2m_create_table(): Allocates and initializes a new page table page and
installs it into the hierarchy.
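The walk-or-allocate step described above can be sketched in isolation. This is a simplified user-space model, not the Xen code: `toy_pte`, `toy_next_level()` and the `TOY_*` constants are hypothetical names, and `calloc()` stands in for the paging pool allocator and the domain-page mapping.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_ENTRIES 512
#define TOY_VALID   0x1u   /* entry is in use */
#define TOY_LEAF    0x2u   /* entry maps memory instead of pointing to a table */

/* Return codes modelled on the P2M_TABLE_* values used by the patch. */
enum toy_rc { TOY_MAP_NONE, TOY_MAP_NOMEM, TOY_SUPER_PAGE, TOY_NORMAL };

struct toy_pte {
    unsigned int flags;
    struct toy_pte *next;   /* next-level table when TOY_LEAF is clear */
};

/* Step from *table to the table referenced by (*table)[offset]. */
static enum toy_rc toy_next_level(int alloc_tbl, unsigned int level,
                                  struct toy_pte **table, unsigned int offset)
{
    struct toy_pte *entry;

    /* Never called at the last level: there is nothing below level 0. */
    assert(level != 0);

    entry = *table + offset;

    if ( !(entry->flags & TOY_VALID) )
    {
        if ( !alloc_tbl )
            return TOY_MAP_NONE;

        /* Allocate a zeroed table and hook it in via the entry. */
        entry->next = calloc(TOY_ENTRIES, sizeof(*entry->next));
        if ( !entry->next )
            return TOY_MAP_NOMEM;
        entry->flags = TOY_VALID;
    }

    if ( entry->flags & TOY_LEAF )
        return TOY_SUPER_PAGE;

    *table = entry->next;
    return TOY_NORMAL;
}
```

A lookup would pass `alloc_tbl = 0` so that a missing table is reported rather than created, while an insertion would pass `alloc_tbl = 1`.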
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- s/p2me_is_mapping/p2m_is_mapping to be in sync with other p2m_is_*() functions.
- clear_and_clean_page() in p2m_create_table() instead of clear_page() to be
sure that page is cleared and d-cache is flushed for it.
- Move ASSERT(level != 0) in p2m_next_level() ahead of trying to allocate a
page table.
- Update p2m_create_table() to allocate metadata page to store p2m type in it
for each entry of page table.
- Introduce paging_alloc_page() and use it inside p2m_alloc_page().
- Add allocated page to p2m->pages list in p2m_alloc_page() to simplify
the caller's code a little.
- Drop p2m_is_mapping() and use pte_is_mapping() instead as P2M PTE's valid
bit doesn't have another purpose anymore.
- Update an implementation and prototype of page_to_p2m_table(), it is enough
to pass only a page as an argument.
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- s/p2m_is_mapping/p2m_is_mapping.
---
xen/arch/riscv/include/asm/paging.h | 2 +
xen/arch/riscv/p2m.c | 80 ++++++++++++++++++++++++++++-
xen/arch/riscv/paging.c | 11 ++++
3 files changed, 91 insertions(+), 2 deletions(-)
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index c9063b7f76..3642bcfc7a 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -14,4 +14,6 @@ bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages);
void paging_free_page(struct domain *d, struct page_info *pg);
+struct page_info * paging_alloc_page(struct domain *d);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index efc7320619..e04cfde8c7 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -302,6 +302,48 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
return e;
}
+/* Generate table entry with correct attributes. */
+static pte_t page_to_p2m_table(struct page_info *page)
+{
+ /*
+ * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
+ * set to true and p2m_type_t shouldn't be applied for PTEs which
+ * describe an intermediate table.
+ */
+ return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
+}
+
+static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
+{
+ struct page_info *pg = paging_alloc_page(p2m->domain);
+
+ if ( pg )
+ page_list_add(pg, &p2m->pages);
+
+ return pg;
+}
+
+/*
+ * Allocate a new page table page with an extra metadata page and hook it
+ * in via the given entry.
+ */
+static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
+{
+ struct page_info *page;
+
+ ASSERT(!pte_is_valid(*entry));
+
+ page = p2m_alloc_page(p2m);
+ if ( page == NULL )
+ return -ENOMEM;
+
+ clear_and_clean_page(page);
+
+ p2m_write_pte(entry, page_to_p2m_table(page), p2m->clean_pte);
+
+ return 0;
+}
+
#define P2M_TABLE_MAP_NONE 0
#define P2M_TABLE_MAP_NOMEM 1
#define P2M_TABLE_SUPER_PAGE 2
@@ -326,9 +368,43 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
unsigned int level, pte_t **table,
unsigned int offset)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ pte_t *entry;
+ int ret;
+ mfn_t mfn;
+
+ /* The function p2m_next_level() is never called at the last level */
+ ASSERT(level != 0);
+
+ entry = *table + offset;
+
+ if ( !pte_is_valid(*entry) )
+ {
+ if ( !alloc_tbl )
+ return P2M_TABLE_MAP_NONE;
+
+ ret = p2m_create_table(p2m, entry);
+ if ( ret )
+ return P2M_TABLE_MAP_NOMEM;
+ }
+
+ /* The function p2m_next_level() is never called at the last level */
+ ASSERT(level != 0);
+ if ( pte_is_mapping(*entry) )
+ return P2M_TABLE_SUPER_PAGE;
+
+ mfn = mfn_from_pte(*entry);
+
+ unmap_domain_page(*table);
+
+ /*
+ * TODO: There's an inefficiency here:
+ * In p2m_create_table(), the page is mapped to clear it.
+ * Then that mapping is torn down in p2m_create_table(),
+ * only to be re-established here.
+ */
+ *table = map_domain_page(mfn);
- return P2M_TABLE_MAP_NONE;
+ return P2M_TABLE_NORMAL;
}
static void p2m_put_foreign_page(struct page_info *pg)
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index 853e0e14c6..72ff183260 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -91,6 +91,17 @@ void paging_free_page(struct domain *d, struct page_info *pg)
spin_unlock(&d->arch.paging.lock);
}
+struct page_info * paging_alloc_page(struct domain *d)
+{
+ struct page_info *pg;
+
+ spin_lock(&d->arch.paging.lock);
+ pg = page_list_remove_head(&d->arch.paging.freelist);
+ spin_unlock(&d->arch.paging.lock);
+
+ return pg;
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
--
2.50.1
* [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (14 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 15/20] xen/riscv: implement p2m_next_level() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 11:59 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 17/20] xen/riscv: implement put_page() Oleksii Kurochko
` (3 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Add support for breaking down large memory mappings ("superpages") in the
RISC-V p2m so that smaller, finer-grained entries can be inserted into lower
levels of the page table hierarchy.
To implement that the following is done:
- Introduce p2m_split_superpage(): Recursively shatters a superpage into
smaller page table entries down to the target level, preserving original
permissions and attributes.
- p2m_set_entry() updated to invoke superpage splitting when inserting
entries at lower levels within a superpage-mapped region.
This implementation is based on the Arm code, with modifications to the part
that follows the BBM (break-before-make) approach. Some parts are simplified
because, according to the RISC-V spec:
It is permitted for multiple address-translation cache entries to co-exist
for the same address. This represents the fact that in a conventional
TLB hierarchy, it is possible for multiple entries to match a single
address if, for example, a page is upgraded to a superpage without first
clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
hierarchy. In this case, just as if an SFENCE.VMA is not executed between
a write to the memory-management tables and subsequent implicit read of the
same address: it is unpredictable whether the old non-leaf PTE or the new
leaf PTE is used, but the behavior is otherwise well defined.
In contrast to the Arm architecture, where BBM is mandatory and failing to
use it in some cases can lead to CPU instability, RISC-V guarantees
stability, and the behavior remains safe — though unpredictable in terms of
which translation will be used.
Additionally, the page table walk logic has been adjusted, as Arm numbers
page table levels in the opposite order to RISC-V.
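The per-entry arithmetic used when shattering one superpage into 512 entries of the next lower level can be sketched as follows. This is an illustrative model only: `toy_leaf`, `toy_split()` and the `TOY_*` macros are hypothetical, and a 9-bit index per level (as in Sv39-style tables) is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PT_ENTRIES    512u
#define TOY_PT_LEVEL_BITS 9u
/* log2 of the number of 4K frames covered by one entry at `level`. */
#define TOY_LEVEL_ORDER(level) ((level) * TOY_PT_LEVEL_BITS)

struct toy_leaf {
    uint64_t mfn;        /* first machine frame covered by the entry */
    unsigned int perms;  /* permission/attribute bits, copied verbatim */
};

/* Fill `out` with the 512 lower-level entries that replace `super`. */
static void toy_split(const struct toy_leaf *super, unsigned int level,
                      struct toy_leaf out[TOY_PT_ENTRIES])
{
    unsigned int order;
    uint64_t i;

    assert(level > 0);   /* a level-0 entry cannot be split further */
    order = TOY_LEVEL_ORDER(level - 1);

    for ( i = 0; i < TOY_PT_ENTRIES; i++ )
    {
        out[i] = *super;                          /* keep permissions */
        out[i].mfn = super->mfn + (i << order);   /* advance by one child */
    }
}
```

Splitting a level-2 (1G) entry therefore yields level-1 (2M) entries whose frames are 512 apart, and splitting a level-1 entry yields consecutive 4K frames.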
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Move page_list_add(page, &p2m->pages) inside p2m_alloc_page().
- Use 'unsigned long' for local variable 'i' in p2m_split_superpage().
- Update the comment above if ( next_level != target ) in p2m_split_superpage().
- Reverse cycle to iterate through page table levels in p2m_set_entry().
- Update p2m_split_superpage() with the same changes which are done in the
patch "P2M: Don't try to free the existing PTE if we can't allocate a new table".
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- Update the comment above the loop which creates new page tables, as
RISC-V traverses page tables in the opposite order to Arm.
- RISC-V doesn't require BBM, so there is no need for invalidating
and TLB flushing before updating a PTE.
---
xen/arch/riscv/p2m.c | 118 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 117 insertions(+), 1 deletion(-)
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index e04cfde8c7..e9e6818da2 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -539,6 +539,91 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
p2m_free_page(p2m, pg);
}
+static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
+ unsigned int level, unsigned int target,
+ const unsigned int *offsets)
+{
+ struct page_info *page;
+ unsigned long i;
+ pte_t pte, *table;
+ bool rv = true;
+
+ /* Convenience aliases */
+ mfn_t mfn = pte_get_mfn(*entry);
+ unsigned int next_level = level - 1;
+ unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
+
+ /*
+ * This should only be called with target != level and the entry is
+ * a superpage.
+ */
+ ASSERT(level > target);
+ ASSERT(pte_is_superpage(*entry, level));
+
+ page = p2m_alloc_page(p2m);
+ if ( !page )
+ {
+ /*
+ * The caller is in charge to free the sub-tree.
+ * As we didn't manage to allocate anything, just tell the
+ * caller there is nothing to free by invalidating the PTE.
+ */
+ memset(entry, 0, sizeof(*entry));
+ return false;
+ }
+
+ table = __map_domain_page(page);
+
+ /*
+ * We are either splitting a second level 1G page into 512 first level
+ * 2M pages, or a first level 2M page into 512 zero level 4K pages.
+ */
+ for ( i = 0; i < XEN_PT_ENTRIES; i++ )
+ {
+ pte_t *new_entry = table + i;
+
+ /*
+ * Use the content of the superpage entry and override
+ * the necessary fields, so the correct permissions are kept.
+ */
+ pte = *entry;
+ pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+
+ write_pte(new_entry, pte);
+ }
+
+ /*
+ * Shatter superpage in the page to the level we want to make the
+ * changes.
+ * This is done outside the loop to avoid checking the offset
+ * for every entry to know whether the entry should be shattered.
+ */
+ if ( next_level != target )
+ rv = p2m_split_superpage(p2m, table + offsets[next_level],
+ level - 1, target, offsets);
+
+ if ( p2m->clean_pte )
+ clean_dcache_va_range(table, PAGE_SIZE);
+
+ /*
+ * TODO: an inefficiency here: the caller almost certainly wants to map
+ * the same page again, to update the one entry that caused the
+ * request to shatter the page.
+ */
+ unmap_domain_page(table);
+
+ /*
+ * Even if we failed, we should (according to the current implementation
+ * of how the sub-tree is freed if p2m_split_superpage() hasn't
+ * finished fully) install the newly allocated PTE
+ * entry.
+ * The caller will be in charge of freeing the sub-tree.
+ */
+ p2m_write_pte(entry, page_to_p2m_table(page), p2m->clean_pte);
+
+ return rv;
+}
+
/*
* Insert an entry in the p2m. This should be called with a mapping
* equal to a page/superpage.
@@ -603,7 +688,38 @@ static int p2m_set_entry(struct p2m_domain *p2m,
*/
if ( level > target )
{
- panic("Shattering isn't implemented\n");
+ /* We need to split the original page. */
+ pte_t split_pte = *entry;
+
+ ASSERT(pte_is_superpage(*entry, level));
+
+ if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+ {
+ /* Free the allocated sub-tree */
+ p2m_free_subtree(p2m, split_pte, level);
+
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ p2m_write_pte(entry, split_pte, p2m->clean_pte);
+
+ p2m->need_flush = true;
+
+ /* Then move to the level we want to make real changes */
+ for ( ; level > target; level-- )
+ {
+ rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
+
+ /*
+ * The entry should be found and either be a table
+ * or a superpage if level 0 is not targeted
+ */
+ ASSERT(rc == P2M_TABLE_NORMAL ||
+ (rc == P2M_TABLE_SUPER_PAGE && target > 0));
+ }
+
+ entry = table + offsets[level];
}
/*
--
2.50.1
* [PATCH v3 17/20] xen/riscv: implement put_page()
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (15 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 12:43 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
` (2 subsequent siblings)
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement put_page(), as it will be used by p2m_put_code().
Although CONFIG_STATIC_MEMORY has not yet been introduced for RISC-V,
a stub for PGC_static is added to avoid cluttering the code of
put_page_nr() with #ifdefs.
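The reference-drop loop used by put_page() follows a common compare-and-swap pattern. A minimal user-space sketch, assuming C11 atomics in place of Xen's cmpxchg() and a hypothetical `toy_page` layout where the low bits of `count_info` hold the reference count:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_COUNT_MASK 0x00ffffffUL   /* low bits: reference count */

struct toy_page {
    _Atomic unsigned long count_info; /* flags in high bits, count in low bits */
};

/* Drop one reference; return true when the last reference went away. */
static bool toy_put_page(struct toy_page *pg)
{
    unsigned long x = atomic_load(&pg->count_info);
    unsigned long nx;

    do {
        /* Dropping a reference that was never taken is a caller bug. */
        assert((x & TOY_COUNT_MASK) >= 1);
        nx = x - 1;
        /* On failure x is refreshed with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x, nx) );

    return (nx & TOY_COUNT_MASK) == 0;
}
```

In the patch, the point where this sketch returns true is where the page is handed back through free_domstatic_page() or free_domheap_page(), depending on PGC_static.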
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
xen/arch/riscv/include/asm/mm.h | 7 +++++++
xen/arch/riscv/mm.c | 25 ++++++++++++++++++++-----
2 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 7950d132c1..b914813e52 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -273,6 +273,13 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
/* Page is Xen heap? */
#define _PGC_xen_heap PG_shift(2)
#define PGC_xen_heap PG_mask(1, 2)
+#ifdef CONFIG_STATIC_MEMORY
+/* Page is static memory */
+#define _PGC_static PG_shift(3)
+#define PGC_static PG_mask(1, 3)
+#else
+#define PGC_static 0
+#endif
/* Page is broken? */
#define _PGC_broken PG_shift(7)
#define PGC_broken PG_mask(1, 7)
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 1ef015f179..3cac16f1b7 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -362,11 +362,6 @@ unsigned long __init calc_phys_offset(void)
return phys_offset;
}
-void put_page(struct page_info *page)
-{
- BUG_ON("unimplemented");
-}
-
void arch_dump_shared_mem_info(void)
{
BUG_ON("unimplemented");
@@ -627,3 +622,23 @@ void flush_page_to_ram(unsigned long mfn, bool sync_icache)
if ( sync_icache )
invalidate_icache();
}
+
+void put_page(struct page_info *page)
+{
+ unsigned long nx, x, y = page->count_info;
+
+ do {
+ ASSERT((y & PGC_count_mask) >= 1);
+ x = y;
+ nx = x - 1;
+ }
+ while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );
+
+ if ( unlikely((nx & PGC_count_mask) == 0) )
+ {
+ if ( unlikely(nx & PGC_static) )
+ free_domstatic_page(page);
+ else
+ free_domheap_page(page);
+ }
+}
--
2.50.1
* [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (16 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 17/20] xen/riscv: implement put_page() Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 12:50 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement the mfn_valid() macro to verify whether a given MFN is valid by
checking that it falls within the range [start_page, max_page).
These bounds are initialized based on the start and end addresses of RAM.
As part of this patch, start_page is introduced and initialized with the
PFN of the first RAM page.
Also, initialize pdx_group_valid() by calling set_pdx_range() when
memory banks are being mapped.
Also, after providing a non-stub implementation of the mfn_valid() macro,
the following compilation errors started to occur:
riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
/build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
/build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
/build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
/build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
riscv64-linux-gnu-ld: final link failed: bad value
make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
To resolve these errors, the following functions have also been introduced,
based on their Arm counterparts:
- page_get_owner_and_reference() and its variant to safely acquire a
reference to a page and retrieve its owner.
- A stub for page_is_ram_type() that currently always returns 0 and asserts
unreachable, as RAM type checking is not yet implemented.
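The reference acquisition in page_get_owner_and_reference() relies on a single comparison to reject both an unallocated page and a counter that would wrap. A minimal sketch of that check, assuming C11 atomics instead of cmpxchg() and hypothetical names (`demo_page`, `demo_get_ref()`, `DEMO_COUNT_MASK`):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_COUNT_MASK 0x00ffffffUL   /* low bits: reference count */

struct demo_page {
    _Atomic unsigned long count_info;
};

/* Try to take one reference; fail if the page is free or the count is full. */
static bool demo_get_ref(struct demo_page *pg)
{
    unsigned long x = atomic_load(&pg->count_info);

    do {
        /*
         * (x + 1) & mask == 1: the count was 0, the page is not allocated.
         * (x + 1) & mask == 0: the count was all-ones and would wrap.
         */
        if ( ((x + 1) & DEMO_COUNT_MASK) <= 1 )
            return false;
        /* On failure x is refreshed with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x, x + 1) );

    return true;
}
```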
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Update definition of mfn_valid().
- Use __ro_after_init for variable start_page.
- Drop ASSERT_UNREACHABLE() in page_get_owner_and_nr_reference().
- Update the comment inside do/while in page_get_owner_and_nr_reference().
- Define _PGC_static and drop "#ifdef CONFIG_STATIC_MEMORY" in put_page_nr().
- Initialize pdx_group_valid() by calling set_pdx_range() when memory banks are mapped.
- Drop page_get_owner_and_nr_reference() and implement page_get_owner_and_reference()
without reusing page_get_owner_and_nr_reference(), to avoid potential dead code.
- Move definition of get_page() to "xen/riscv: add support of page lookup by GFN", where
it is really used.
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/mm.h | 9 +++++++--
xen/arch/riscv/mm.c | 35 +++++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index b914813e52..d5be328906 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -5,6 +5,7 @@
#include <public/xen.h>
#include <xen/bug.h>
+#include <xen/compiler.h>
#include <xen/const.h>
#include <xen/mm-frame.h>
#include <xen/pdx.h>
@@ -309,8 +310,12 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
#define page_get_owner(p) (p)->v.inuse.domain
#define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
-/* TODO: implement */
-#define mfn_valid(mfn) ({ (void)(mfn); 0; })
+extern unsigned long start_page;
+
+#define mfn_valid(mfn) ({ \
+ unsigned long mfn__ = mfn_x(mfn); \
+ likely((mfn__ >= start_page)) && likely(__mfn_valid(mfn__)); \
+})
#define domain_set_alloc_bitsize(d) ((void)(d))
#define domain_clamp_alloc_bitsize(d, b) ((void)(d), (b))
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 3cac16f1b7..3ad2b9cf93 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -521,6 +521,8 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
#error setup_{directmap,frametable}_mapping() should be implemented for RV_32
#endif
+unsigned long __ro_after_init start_page;
+
/*
* Setup memory management
*
@@ -570,9 +572,13 @@ void __init setup_mm(void)
ram_end = max(ram_end, bank_end);
setup_directmap_mappings(PFN_DOWN(bank_start), PFN_DOWN(bank_size));
+
+ set_pdx_range(paddr_to_pfn(bank_start), paddr_to_pfn(bank_end));
}
setup_frametable_mappings(ram_start, ram_end);
+
+ start_page = PFN_DOWN(ram_start);
max_page = PFN_DOWN(ram_end);
}
@@ -642,3 +648,32 @@ void put_page(struct page_info *page)
free_domheap_page(page);
}
}
+
+int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
+{
+ ASSERT_UNREACHABLE();
+
+ return 0;
+}
+
+struct domain *page_get_owner_and_reference(struct page_info *page)
+{
+ unsigned long x, y = page->count_info;
+ struct domain *owner;
+
+ do {
+ x = y;
+ /*
+ * Count == 0: Page is not allocated, so we cannot take a reference.
+ * Count == -1: Reference count would wrap, which is invalid.
+ */
+ if ( unlikely(((x + 1) & PGC_count_mask) <= 1) )
+ return NULL;
+ }
+ while ( (y = cmpxchg(&page->count_info, x, x + 1)) != x );
+
+ owner = page_get_owner(page);
+ ASSERT(owner);
+
+ return owner;
+}
--
2.50.1
* [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (17 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 13:25 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce helper functions for safely querying the P2M (physical-to-machine)
mapping:
- add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
P2M lock state.
- Implement p2m_get_entry() to retrieve mapping details for a given GFN,
including MFN, page order, and validity.
- Add p2m_lookup() to encapsulate read-locked MFN retrieval.
- Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
pointer, acquiring a reference to the page if valid.
- Introduce get_page().
Implementations are based on Arm's functions with some minor modifications:
- p2m_get_entry():
- Reverse traversal of page tables, as RISC-V uses the opposite level
numbering compared to Arm.
- Removed the return of p2m_access_t from p2m_get_entry() since
mem_access_settings is not introduced for RISC-V.
- Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
to Arm's THIRD_MASK.
- Replaced open-coded bit shifts with the BIT() macro.
- Other minor changes, such as using RISC-V-specific functions to validate
P2M PTEs, and replacing Arm-specific GUEST_* macros with their RISC-V
equivalents.
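When a lookup ends at a superpage, the MFN for the requested GFN is the superpage's base frame plus the GFN's offset within that superpage. A small sketch of that calculation, assuming 9 index bits per level; `toy_mfn_for_gfn()` and the `TOY_*` macros are hypothetical stand-ins for the mfn_add()/XEN_PT_LEVEL_ORDER() expression in p2m_get_entry():

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LEVEL_BITS 9u
/* log2 of the number of 4K frames covered by one entry at `level`. */
#define TOY_LEVEL_ORDER(level) ((level) * TOY_LEVEL_BITS)

/* MFN backing `gfn`, given a leaf at `level` whose first frame is base_mfn. */
static uint64_t toy_mfn_for_gfn(uint64_t base_mfn, uint64_t gfn,
                                unsigned int level)
{
    /* Low bits of the GFN select the 4K frame inside the (super)page. */
    uint64_t mask = (UINT64_C(1) << TOY_LEVEL_ORDER(level)) - 1;

    return base_mfn + (gfn & mask);
}
```

At level 0 the mask is zero, so the base frame is returned unchanged; at level 1 the low 9 bits of the GFN are added, and at level 2 the low 18 bits.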
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Add is_p2m_foreign() macro and connected stuff.
- Change struct domain *d argument of p2m_get_page_from_gfn() to
struct p2m_domain.
- Update the comment above p2m_get_entry().
- s/_t/p2mt for local variable in p2m_get_entry().
- Drop local variable addr in p2m_get_entry() and use gfn_to_gaddr(gfn)
to define offsets array.
- Code style fixes.
- Update a check of rc code from p2m_next_level() in p2m_get_entry()
and drop "else" case.
- Do not call p2m_get_type() if p2m_get_entry()'s t argument is NULL.
- Use struct p2m_domain instead of struct domain for p2m_lookup() and
p2m_get_page_from_gfn().
- Move definition of get_page() from "xen/riscv: implement mfn_valid() and page reference, ownership handling helpers"
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/p2m.h | 18 ++++
xen/arch/riscv/mm.c | 13 +++
xen/arch/riscv/p2m.c | 136 +++++++++++++++++++++++++++++++
3 files changed, 167 insertions(+)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index fbc73448a7..dc3a77cc15 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -202,6 +202,24 @@ static inline int p2m_is_write_locked(struct p2m_domain *p2m)
unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid);
+static inline void p2m_read_lock(struct p2m_domain *p2m)
+{
+ read_lock(&p2m->lock);
+}
+
+static inline void p2m_read_unlock(struct p2m_domain *p2m)
+{
+ read_unlock(&p2m->lock);
+}
+
+static inline int p2m_is_locked(struct p2m_domain *p2m)
+{
+ return rw_is_locked(&p2m->lock);
+}
+
+struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
+ p2m_type_t *t);
+
#endif /* ASM__RISCV__P2M_H */
/*
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 3ad2b9cf93..5e09d46a75 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -677,3 +677,16 @@ struct domain *page_get_owner_and_reference(struct page_info *page)
return owner;
}
+
+bool get_page(struct page_info *page, const struct domain *domain)
+{
+ const struct domain *owner = page_get_owner_and_reference(page);
+
+ if ( likely(owner == domain) )
+ return true;
+
+ if ( owner != NULL )
+ put_page(page);
+
+ return false;
+}
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index e9e6818da2..24a09d4537 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -852,3 +852,139 @@ int map_regions_p2mt(struct domain *d,
{
return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
}
+
+/*
+ * Get the details of a given gfn.
+ *
+ * If the entry is present, the associated MFN is returned and the type filled in.
+ * The page_order will correspond to the order of the mapping in the page
+ * table (i.e. it could be a superpage).
+ *
+ * If the entry is not present, INVALID_MFN will be returned and the
+ * page_order will be set according to the order of the invalid range.
+ *
+ * valid will contain the value of bit[0] (i.e. the valid bit) of the
+ * entry.
+ */
+static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
+ p2m_type_t *t,
+ unsigned int *page_order,
+ bool *valid)
+{
+ unsigned int level = 0;
+ pte_t entry, *table;
+ int rc;
+ mfn_t mfn = INVALID_MFN;
+ DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
+
+ ASSERT(p2m_is_locked(p2m));
+ BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
+
+ if ( valid )
+ *valid = false;
+
+ /* XXX: Check if the mapping is lower than the mapped gfn */
+
+ /* This gfn is higher than the highest the p2m map currently holds */
+ if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
+ {
+ for ( level = P2M_ROOT_LEVEL; level; level-- )
+ if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
+ gfn_x(p2m->max_mapped_gfn) )
+ break;
+
+ goto out;
+ }
+
+ table = p2m_get_root_pointer(p2m, gfn);
+
+ /*
+ * The table should always be non-NULL because the gfn is below
+ * p2m->max_mapped_gfn and the root table pages are always present.
+ */
+ if ( !table )
+ {
+ ASSERT_UNREACHABLE();
+ level = P2M_ROOT_LEVEL;
+ goto out;
+ }
+
+ for ( level = P2M_ROOT_LEVEL; level; level-- )
+ {
+ rc = p2m_next_level(p2m, false, level, &table, offsets[level]);
+ if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
+ goto out_unmap;
+
+ if ( rc != P2M_TABLE_NORMAL )
+ break;
+ }
+
+ entry = table[offsets[level]];
+
+ if ( pte_is_valid(entry) )
+ {
+ if ( t )
+ *t = p2m_get_type(entry);
+
+ mfn = pte_get_mfn(entry);
+ /*
+ * The entry may point to a superpage. Find the MFN associated
+ * to the GFN.
+ */
+ mfn = mfn_add(mfn,
+ gfn_x(gfn) & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1));
+
+ if ( valid )
+ *valid = pte_is_valid(entry);
+ }
+
+ out_unmap:
+ unmap_domain_page(table);
+
+ out:
+ if ( page_order )
+ *page_order = XEN_PT_LEVEL_ORDER(level);
+
+ return mfn;
+}
+
+static mfn_t p2m_lookup(struct p2m_domain *p2m, gfn_t gfn, p2m_type_t *t)
+{
+ mfn_t mfn;
+
+ p2m_read_lock(p2m);
+ mfn = p2m_get_entry(p2m, gfn, t, NULL, NULL);
+ p2m_read_unlock(p2m);
+
+ return mfn;
+}
+
+struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
+ p2m_type_t *t)
+{
+ struct page_info *page;
+ p2m_type_t p2mt = p2m_invalid;
+ mfn_t mfn = p2m_lookup(p2m, gfn, &p2mt);
+
+ if ( t )
+ *t = p2mt;
+
+ if ( !mfn_valid(mfn) )
+ return NULL;
+
+ page = mfn_to_page(mfn);
+
+ /*
+ * get_page won't work on foreign mapping because the page doesn't
+ * belong to the current domain.
+ */
+ if ( p2m_is_foreign(p2mt) )
+ {
+ struct domain *fdom = page_get_owner_and_reference(page);
+ ASSERT(fdom != NULL);
+ ASSERT(fdom != p2m->domain);
+ return page;
+ }
+
+ return get_page(page, p2m->domain) ? page : NULL;
+}
--
2.50.1
* [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (18 preceding siblings ...)
2025-07-31 15:58 ` [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
@ 2025-07-31 15:58 ` Oleksii Kurochko
2025-08-11 15:44 ` Jan Beulich
19 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-07-31 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
RISC-V's PTE has only two available bits that can be used to store the P2M
type. This is insufficient to represent all the current RISC-V P2M types.
Therefore, some P2M types must be stored outside the PTE bits.
To address this, a metadata table is introduced to store P2M types that
cannot fit in the PTE itself. Not all P2M types are stored in the
metadata table—only those that require it.
The metadata table is linked to the intermediate page table via the
`struct page_info`'s list field of the corresponding intermediate page.
To simplify the allocation and linking of intermediate and metadata page
tables, `p2m_{alloc,free}_table()` functions are implemented.
These changes impact `p2m_split_superpage()`, since when a superpage is
split, it is necessary to update the metadata table of the new
intermediate page table — if the entry being split has its P2M type set
to `p2m_ext_storage` in its `P2M_TYPES` bits. In addition to updating
the metadata of the new intermediate page table, the corresponding entry
in the metadata for the original superpage is invalidated.
Also, update p2m_{get,set}_type to work with P2M types which don't fit
into PTE bits.
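The split between in-PTE types and metadata-backed types can be illustrated with a small model. Everything here is an assumption made for illustration: the type names and values, the position of the two software bits, and the flat `metadata[]` array standing in for the separate metadata page that the patch links to an intermediate table.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TYPE_SHIFT 8u   /* assumed position of the two software bits */
#define TOY_TYPE_MASK  (UINT64_C(3) << TOY_TYPE_SHIFT)
#define TOY_ENTRIES    512u

/* Types below TOY_EXT_STORAGE fit in the PTE; the rest need metadata. */
enum toy_type { TOY_INVALID, TOY_RAM_RW, TOY_MMIO, TOY_EXT_STORAGE,
                TOY_FOREIGN_RW, TOY_FOREIGN_RO };

struct toy_table {
    uint64_t pte[TOY_ENTRIES];
    uint8_t  metadata[TOY_ENTRIES];   /* one type slot per PTE */
};

static void toy_set_type(struct toy_table *t, unsigned int i,
                         enum toy_type type)
{
    /* Types that do not fit are marked in the PTE as "look in metadata". */
    unsigned int in_pte = type < TOY_EXT_STORAGE ? type : TOY_EXT_STORAGE;

    t->pte[i] = (t->pte[i] & ~TOY_TYPE_MASK) |
                ((uint64_t)in_pte << TOY_TYPE_SHIFT);
    /* Clear the slot when it is unused so stale types never leak. */
    t->metadata[i] = type < TOY_EXT_STORAGE ? TOY_INVALID : (uint8_t)type;
}

static enum toy_type toy_get_type(const struct toy_table *t, unsigned int i)
{
    unsigned int in_pte = (t->pte[i] & TOY_TYPE_MASK) >> TOY_TYPE_SHIFT;

    return in_pte == TOY_EXT_STORAGE ? (enum toy_type)t->metadata[i]
                                     : (enum toy_type)in_pte;
}
```

The index `i` plays the same role as the extra argument added to p2m_set_type() in this patch: it identifies which metadata slot belongs to the PTE being updated.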
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V3:
- Add is_p2m_foreign() macro and connected stuff.
- Change struct domain *d argument of p2m_get_page_from_gfn() to
struct p2m_domain.
- Update the comment above p2m_get_entry().
- s/_t/p2mt for local variable in p2m_get_entry().
- Drop local variable addr in p2m_get_entry() and use gfn_to_gaddr(gfn)
to define offsets array.
- Code style fixes.
- Update a check of rc code from p2m_next_level() in p2m_get_entry()
and drop "else" case.
- Do not call p2m_get_type() if p2m_get_entry()'s t argument is NULL.
- Use struct p2m_domain instead of struct domain for p2m_lookup() and
p2m_get_page_from_gfn().
- Move definition of get_page() from "xen/riscv: implement mfn_valid() and page reference, ownership handling helpers"
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/mm.h | 9 ++
xen/arch/riscv/p2m.c | 205 +++++++++++++++++++++++++-------
2 files changed, 170 insertions(+), 44 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index d5be328906..7cf0988f44 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -150,6 +150,15 @@ struct page_info
/* Order-size of the free chunk this page is the head of. */
unsigned int order;
} free;
+
+ /* Page is used to store metadata: p2m type. */
+ struct {
+ /*
+ * Pointer to a page which stores metadata for an intermediate page
+ * table.
+ */
+ struct page_info *metadata;
+ } md;
} v;
union {
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 24a09d4537..a909db654a 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -101,7 +101,16 @@ static int p2m_alloc_root_table(struct p2m_domain *p2m)
{
struct domain *d = p2m->domain;
struct page_info *page;
- const unsigned int nr_root_pages = P2M_ROOT_PAGES;
+ /*
+ * If the root page table starts at Level <= 2, and since only 1GB, 2MB,
+ * and 4KB mappings are supported (as enforced by the ASSERT() in
+ * p2m_set_entry()), it is necessary to allocate P2M_ROOT_PAGES for
+ * the root page table itself, plus an additional P2M_ROOT_PAGES for
+ * metadata storage. This is because only two free bits are available in
+ * the PTE, which are not sufficient to represent all possible P2M types.
+ */
+ const unsigned int nr_root_pages = P2M_ROOT_PAGES *
+ ((P2M_ROOT_LEVEL <= 2) ? 2 : 1);
/*
* Return back nr_root_pages to assure the root table memory is also
@@ -114,6 +123,23 @@ static int p2m_alloc_root_table(struct p2m_domain *p2m)
if ( !page )
return -ENOMEM;
+ if ( P2M_ROOT_LEVEL <= 2 )
+ {
+ /*
+ * In the case where P2M_ROOT_LEVEL <= 2, it is necessary to allocate
+ * a page of the same size as that used for the root page table.
+ * Therefore, p2m_allocate_root() can be safely reused.
+ */
+ struct page_info *metadata = p2m_allocate_root(d);
+ if ( !metadata )
+ {
+ free_domheap_pages(page, P2M_ROOT_ORDER);
+ return -ENOMEM;
+ }
+
+ page->v.md.metadata = metadata;
+ }
+
p2m->root = page;
return 0;
@@ -198,24 +224,25 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
-static int p2m_set_type(pte_t *pte, p2m_type_t t)
+static void p2m_set_type(pte_t *pte, const p2m_type_t t, const unsigned int i)
{
- int rc = 0;
-
if ( t > p2m_ext_storage )
- panic("unimplemeted\n");
+ {
+ ASSERT(pte);
+
+ pte[i].pte = t;
+ }
else
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
-
- return rc;
}
-static p2m_type_t p2m_get_type(const pte_t pte)
+static p2m_type_t p2m_get_type(const pte_t pte, const pte_t *metadata,
+ const unsigned int i)
{
p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
if ( type == p2m_ext_storage )
- panic("unimplemented\n");
+ type = metadata[i].pte;
return type;
}
@@ -265,7 +292,10 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
}
}
-static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
+ struct page_info *metadata_pg,
+ const unsigned int indx,
+ bool is_table)
{
pte_t e = (pte_t) { PTE_VALID };
@@ -285,12 +315,21 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
if ( !is_table )
{
+ pte_t *metadata = __map_domain_page(metadata_pg);
+
p2m_set_permission(&e, t);
+ metadata[indx].pte = p2m_invalid;
+
if ( t < p2m_ext_storage )
- p2m_set_type(&e, t);
+ p2m_set_type(&e, t, indx);
else
- panic("unimplemeted\n");
+ {
+ e.pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
+ p2m_set_type(metadata, t, indx);
+ }
+
+ unmap_domain_page(metadata);
}
else
/*
@@ -309,8 +348,10 @@ static pte_t page_to_p2m_table(struct page_info *page)
* p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
* set to true and p2m_type_t shouldn't be applied for PTEs which
* describe an intermidiate table.
+ * That is also the reason why the `metadata_pg` and `indx` arguments of
+ * p2m_pte_from_mfn() are NULL and 0.
*/
- return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
+ return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, 0, true);
}
static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
@@ -323,22 +364,71 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
return pg;
}
+static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
+
+/*
+ * Allocate a page table with an additional extra page to store
+ * metadata for each entry of the page table.
+ * Link this metadata page to page table page's list field.
+ */
+static struct page_info * p2m_alloc_table(struct p2m_domain *p2m)
+{
+ enum table_type
+ {
+ INTERMEDIATE_TABLE=0,
+ /*
+ * At the moment, metadata is going to store P2M type
+ * for each PTE of page table.
+ */
+ METADATA_TABLE,
+ TABLE_MAX
+ };
+
+ struct page_info *tables[TABLE_MAX];
+
+ for ( unsigned int i = 0; i < TABLE_MAX; i++ )
+ {
+ tables[i] = p2m_alloc_page(p2m);
+
+ if ( !tables[i] )
+ goto out;
+
+ clear_and_clean_page(tables[i]);
+ }
+
+ tables[INTERMEDIATE_TABLE]->v.md.metadata = tables[METADATA_TABLE];
+
+ return tables[INTERMEDIATE_TABLE];
+
+ out:
+ for ( unsigned int i = 0; i < TABLE_MAX; i++ )
+ if ( tables[i] )
+ p2m_free_page(p2m, tables[i]);
+
+ return NULL;
+}
+
+/*
+ * Free page table's page and metadata page linked to page table's page.
+ */
+static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
+{
+ ASSERT(tbl_pg->v.md.metadata);
+
+ p2m_free_page(p2m, tbl_pg->v.md.metadata);
+ p2m_free_page(p2m, tbl_pg);
+}
+
/*
* Allocate a new page table page with an extra metadata page and hook it
* in via the given entry.
*/
static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
{
- struct page_info *page;
+ struct page_info *page = p2m_alloc_table(p2m);
ASSERT(!pte_is_valid(*entry));
- page = p2m_alloc_page(p2m);
- if ( page == NULL )
- return -ENOMEM;
-
- clear_and_clean_page(page);
-
p2m_write_pte(entry, page_to_p2m_table(page), p2m->clean_pte);
return 0;
@@ -453,10 +543,9 @@ static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
}
/* Put any references on the page referenced by pte. */
-static void p2m_put_page(const pte_t pte, unsigned int level)
+static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
{
mfn_t mfn = pte_get_mfn(pte);
- p2m_type_t p2m_type = p2m_get_type(pte);
ASSERT(pte_is_valid(pte));
@@ -470,10 +559,10 @@ static void p2m_put_page(const pte_t pte, unsigned int level)
switch ( level )
{
case 1:
- return p2m_put_2m_superpage(mfn, p2m_type);
+ return p2m_put_2m_superpage(mfn, p2mt);
case 0:
- return p2m_put_4k_page(mfn, p2m_type);
+ return p2m_put_4k_page(mfn, p2mt);
}
}
@@ -486,10 +575,11 @@ static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg)
/* Free pte sub-tree behind an entry */
static void p2m_free_subtree(struct p2m_domain *p2m,
- pte_t entry, unsigned int level)
+ pte_t entry, unsigned int level,
+ const pte_t *metadata, const unsigned int index)
{
unsigned int i;
- pte_t *table;
+ pte_t *table, *tmp_metadata;
mfn_t mfn;
struct page_info *pg;
@@ -499,6 +589,8 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
if ( pte_is_superpage(entry, level) || (level == 0) )
{
+ p2m_type_t p2mt = p2m_get_type(entry, metadata, index);
+
#ifdef CONFIG_IOREQ_SERVER
/*
* If this gets called then either the entry was replaced by an entry
@@ -511,15 +603,21 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
ioreq_request_mapcache_invalidate(p2m->domain);
#endif
- p2m_put_page(entry, level);
+ p2m_put_page(entry, level, p2mt);
return;
}
- table = map_domain_page(pte_get_mfn(entry));
+ mfn = pte_get_mfn(entry);
+ ASSERT(mfn_valid(mfn));
+ table = map_domain_page(mfn);
+ pg = mfn_to_page(mfn);
+ tmp_metadata = __map_domain_page(pg->v.md.metadata);
+
for ( i = 0; i < XEN_PT_ENTRIES; i++ )
- p2m_free_subtree(p2m, table[i], level - 1);
+ p2m_free_subtree(p2m, table[i], level - 1, tmp_metadata, i);
+ unmap_domain_page(tmp_metadata);
unmap_domain_page(table);
/*
@@ -530,23 +628,19 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
*/
p2m_tlb_flush_sync(p2m);
- mfn = pte_get_mfn(entry);
- ASSERT(mfn_valid(mfn));
-
- pg = mfn_to_page(mfn);
-
- page_list_del(pg, &p2m->pages);
- p2m_free_page(p2m, pg);
+ p2m_free_table(p2m, pg);
}
static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
unsigned int level, unsigned int target,
- const unsigned int *offsets)
+ const unsigned int *offsets,
+ pte_t *metadata_tbl)
{
struct page_info *page;
unsigned long i;
pte_t pte, *table;
bool rv = true;
+ pte_t *tmp_metadata_tbl;
/* Convenience aliases */
mfn_t mfn = pte_get_mfn(*entry);
@@ -560,7 +654,7 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
ASSERT(level > target);
ASSERT(pte_is_superpage(*entry, level));
- page = p2m_alloc_page(p2m->domain);
+ page = p2m_alloc_table(p2m);
if ( !page )
{
/*
@@ -572,6 +666,8 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
return false;
}
+ tmp_metadata_tbl = __map_domain_page(page->v.md.metadata);
+
table = __map_domain_page(page);
/*
@@ -589,6 +685,9 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
pte = *entry;
pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+ if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
+ tmp_metadata_tbl[i] = metadata_tbl[offsets[level]];
+
write_pte(new_entry, pte);
}
@@ -600,7 +699,7 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
*/
if ( next_level != target )
rv = p2m_split_superpage(p2m, table + offsets[next_level],
- level - 1, target, offsets);
+ level - 1, target, offsets, tmp_metadata_tbl);
if ( p2m->clean_pte )
clean_dcache_va_range(table, PAGE_SIZE);
@@ -612,6 +711,8 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
*/
unmap_domain_page(table);
+ unmap_domain_page(tmp_metadata_tbl);
+
/*
* Even if we failed, we should (according to the current implemetation
* of a way how sub-tree is freed if p2m_split_superpage hasn't been
@@ -690,18 +791,23 @@ static int p2m_set_entry(struct p2m_domain *p2m,
{
/* We need to split the original page. */
pte_t split_pte = *entry;
+ struct page_info *metadata = virt_to_page(table)->v.md.metadata;
+ pte_t *metadata_tbl = __map_domain_page(metadata);
ASSERT(pte_is_superpage(*entry, level));
- if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+ if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets,
+ metadata_tbl) )
{
/* Free the allocated sub-tree */
- p2m_free_subtree(p2m, split_pte, level);
+ p2m_free_subtree(p2m, split_pte, level, metadata_tbl, offsets[level]);
rc = -ENOMEM;
goto out;
}
+ unmap_domain_page(metadata_tbl);
+
p2m_write_pte(entry, split_pte, p2m->clean_pte);
p2m->need_flush = true;
@@ -734,7 +840,8 @@ static int p2m_set_entry(struct p2m_domain *p2m,
p2m_clean_pte(entry, p2m->clean_pte);
else
{
- pte_t pte = p2m_pte_from_mfn(mfn, t, false);
+ pte_t pte = p2m_pte_from_mfn(mfn, t, virt_to_page(table)->v.md.metadata,
+ offsets[level], false);
p2m_write_pte(entry, pte, p2m->clean_pte);
@@ -764,7 +871,12 @@ static int p2m_set_entry(struct p2m_domain *p2m,
*/
if ( pte_is_valid(orig_pte) &&
!mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
- p2m_free_subtree(p2m, orig_pte, level);
+ {
+ struct page_info *metadata = virt_to_page(table)->v.md.metadata;
+ pte_t *metadata_tbl = __map_domain_page(metadata);
+ p2m_free_subtree(p2m, orig_pte, level, metadata_tbl, offsets[level]);
+ unmap_domain_page(metadata_tbl);
+ }
out:
unmap_domain_page(table);
@@ -924,7 +1036,12 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
if ( pte_is_valid(entry) )
{
if ( t )
- *t = p2m_get_type(entry);
+ {
+ struct page_info *metadata_pg = virt_to_page(table)->v.md.metadata;
+ pte_t *metadata = __map_domain_page(metadata_pg);
+ *t = p2m_get_type(entry, metadata, offsets[level]);
+ unmap_domain_page(metadata);
+ }
mfn = pte_get_mfn(entry);
/*
--
2.50.1
* Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()
2025-07-31 15:58 ` [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
@ 2025-08-04 13:52 ` Jan Beulich
2025-08-05 14:45 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 13:52 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
> covering the range of guest physical addresses between start_addr and
> start_addr + size for all VMIDs.
>
> The remote fence operation applies to the entire address space if either:
> - start_addr and size are both 0, or
> - size is equal to 2^XLEN-1.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
However, ...
> --- a/xen/arch/riscv/include/asm/sbi.h
> +++ b/xen/arch/riscv/include/asm/sbi.h
> @@ -89,6 +89,25 @@ bool sbi_has_rfence(void);
> int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
> size_t size);
>
> +/*
> + * Instructs the remote harts to execute one or more HFENCE.GVMA
> + * instructions, covering the range of guest physical addresses
> + * between start_addr and start_addr + size for all VMIDs.
... I'd like to ask that you avoid fuzzy terminology like this one. Afaict
you mean [start, start + size). Help yourself and future readers by then
also saying it exactly like this. (Happy to make a respective edit while
committing.)
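
To make the half-open reading concrete, here is a small sketch of a membership test for [start, start + size). The function name is hypothetical and is not part of the SBI interface or of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr lies in the half-open range [start, start + size). */
static bool in_fence_range(uint64_t addr, uint64_t start, uint64_t size)
{
    /* Written as a subtraction so that start + size cannot overflow. */
    return addr >= start && addr - start < size;
}
```

With this wording the first byte at start + size is unambiguously outside the range, which "between start_addr and start_addr + size" leaves open.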
> + * Returns 0 if IPI was sent to all the targeted harts successfully
> + * or negative value if start_addr or size is not valid.
This similarly is ambiguous: The union of the success case stated and the
error case stated isn't obviously all possible states. The success
statement in particular alludes to the possibility of an IPI not actually
reaching its target.
> + * The remote fence operation applies to the entire address space if either:
> + * - start_addr and size are both 0, or
> + * - size is equal to 2^XLEN-1.
Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's
not unambiguous, unless there's specific terminology that I'm unaware of,
yet which would make this unambiguous.)
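
A sketch of the "entire address space" rule quoted above, under the assumption (the one made in this reply, not confirmed by the patch) that XLEN is the host's, i.e. the width of unsigned long on the hart running Xen. The function name is hypothetical.

```c
#include <stdbool.h>
#include <limits.h>

/*
 * Assumption: XLEN means the host's register width, so 2^XLEN - 1 is
 * ULONG_MAX for the hypervisor build.
 */
static bool fence_covers_everything(unsigned long start, unsigned long size)
{
    return (start == 0 && size == 0) || size == ULONG_MAX;
}
```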
Jan
* Re: [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
2025-07-31 15:58 ` [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
@ 2025-08-04 13:55 ` Jan Beulich
2025-08-05 14:57 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 13:55 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> It instructs the remote harts to execute one or more HFENCE.GVMA instructions
> by making an SBI call, covering the range of guest physical addresses between
> start_addr and start_addr + size only for the given VMID.
>
> The remote fence operation applies to the entire address space if either:
> - start_addr and size are both 0, or
> - size is equal to 2^XLEN-1.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
with perhaps a similar on-commit edit as suggested for patch 1.
Jan
* Re: [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings
2025-07-31 15:58 ` [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings Oleksii Kurochko
@ 2025-08-04 14:11 ` Jan Beulich
2025-08-07 15:23 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 14:11 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Rename `p2m_mmio_direct_dev` to a more architecture-neutral alias
> `p2m_mmio_direct` to avoid leaking Arm-specific naming into common Xen code,
> such as dom0less passthrough property handling.
>
> This helps reduce platform-specific terminology in shared logic and
> improves clarity for future non-Arm ports (e.g. RISC-V or PowerPC).
>
> No functional changes — the definition is preserved via a macro alias
> for Arm.
>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
I'm sorry, but no, ...
> --- a/xen/arch/arm/include/asm/p2m.h
> +++ b/xen/arch/arm/include/asm/p2m.h
> @@ -137,6 +137,8 @@ typedef enum {
> p2m_max_real_type, /* Types after this won't be store in the p2m */
> } p2m_type_t;
>
> +#define p2m_mmio_direct p2m_mmio_direct_dev
... this isn't what I suggested. When Arm has three p2m_mmio_direct_*,
randomly aliasing one to p2m_mmio_direct is imo more likely to create
confusion than to help things. Imo you want to introduce ...
> --- a/xen/common/device-tree/dom0less-build.c
> +++ b/xen/common/device-tree/dom0less-build.c
> @@ -185,7 +185,7 @@ static int __init handle_passthrough_prop(struct kernel_info *kinfo,
> gaddr_to_gfn(gstart),
> PFN_DOWN(size),
> maddr_to_mfn(mstart),
> - p2m_mmio_direct_dev);
> + p2m_mmio_direct);
... a per-arch inline function which returns the type to use here.
The name of the function would want to properly reflect the purpose;
my limited DT knowledge may make arch_dt_passthrough_p2m_type() an
entirely wrong suggestion.
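
A rough sketch of what such a per-arch helper could look like on the Arm side. The enum is an abbreviated stand-in for Arm's p2m_type_t, and the function name is only the tentative one floated above; both are illustrative assumptions.

```c
/* Abbreviated stand-in for Arm's p2m_type_t (illustration only). */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_dev,
    p2m_mmio_direct_nc,
    p2m_mmio_direct_c,
} p2m_type_t;

/*
 * Would live in the Arm asm/p2m.h; each other architecture would provide
 * its own version returning whichever of its types suits DT passthrough.
 */
static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
{
    return p2m_mmio_direct_dev;
}
```

Common code would then pass arch_dt_passthrough_p2m_type() instead of naming any architecture's type directly, so no alias has to pick one of Arm's three p2m_mmio_direct_* types.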
Jan
* Re: [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification
2025-07-31 15:58 ` [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
@ 2025-08-04 14:16 ` Jan Beulich
2025-08-07 15:41 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 14:16 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> - Extended p2m_type_t with additional types: p2m_mmio_direct,
> p2m_grant_map_{rw,ro}.
> - Added macros to classify memory types: P2M_RAM_TYPES, P2M_GRANT_TYPES.
> - Introduced helper predicates: p2m_is_ram(), p2m_is_any_ram().
> - Define p2m_mmio_direct to tell handle_passthrough_prop() from common
> code how to map device memory.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Almost ready to be acked, except for ...
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -62,8 +62,30 @@ struct p2m_domain {
> typedef enum {
> p2m_invalid = 0, /* Nothing mapped here */
> p2m_ram_rw, /* Normal read/write domain RAM */
> + p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
> + PTE_PBMT_IO will be used for such mappings */
> + p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
> + p2m_grant_map_rw, /* Read/write grant mapping */
> + p2m_grant_map_ro, /* Read-only grant mapping */
> } p2m_type_t;
>
> +#define p2m_mmio_direct p2m_mmio_direct_io
... this (see reply to patch 09).
> +/* We use bitmaps and mask to handle groups of types */
> +#define p2m_to_mask(t_) BIT(t_, UL)
I notice that you moved the underscore to the back of the parameters,
compared to how Arm has it. I wonder though: What use are these
underscores in the first place, here and below? (There are macros where
conflicts could arise, but the ones here don't fall in that group,
afaict.)
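
For reference, a sketch of the mask-based classification without the underscore suffixes. The type values follow the quoted enum; which types belong to P2M_RAM_TYPES and P2M_GRANT_TYPES is an assumption here, since the patch's definitions are not quoted in this reply.

```c
#include <stdbool.h>

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* One bit per type; plain parameter names cannot conflict with anything. */
#define p2m_to_mask(t) (1UL << (t))

/* Assumed group membership, for illustration only. */
#define P2M_RAM_TYPES   (p2m_to_mask(p2m_ram_rw))
#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
                         p2m_to_mask(p2m_grant_map_ro))

#define p2m_is_ram(t)     ((p2m_to_mask(t) & P2M_RAM_TYPES) != 0)
#define p2m_is_any_ram(t) ((p2m_to_mask(t) & \
                            (P2M_RAM_TYPES | P2M_GRANT_TYPES)) != 0)
```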
Jan
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-07-31 15:58 ` [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
@ 2025-08-04 15:19 ` Jan Beulich
2025-08-06 11:33 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 15:19 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Current implementation is based on x86's way to allocate VMIDs:
> VMIDs partition the physical TLB. In the current implementation VMIDs are
> introduced to reduce the number of TLB flushes. Each time the guest's
> virtual address space changes,
virtual?
> instead of flushing the TLB, a new VMID is
> assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
> The biggest advantage is that hot parts of the hypervisor's code and data
> retain in the TLB.
>
> VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
> VMIDs are assigned in a round-robin scheme. To minimize the overhead of
> VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
> 64-bit generation. Only on a generation overflow the code needs to
> invalidate all VMID information stored at the VCPUs with are run on the
> specific physical processor. This overflow appears after about 2^80
> host processor cycles,
Where's this number coming from? (If you provide numbers, I think they will
want to be "reproducible" by the reader. Which I fear isn't the case here.)
> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>
> console_init_postirq();
>
> + vmid_init();
This lives here only temporarily, I assume? Every hart will need to execute
it, and hence (like we have it on x86) this may want to be a central place
elsewhere.
> --- /dev/null
> +++ b/xen/arch/riscv/vmid.c
> @@ -0,0 +1,165 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include <xen/domain.h>
> +#include <xen/init.h>
> +#include <xen/sections.h>
> +#include <xen/lib.h>
> +#include <xen/param.h>
> +#include <xen/percpu.h>
> +
> +#include <asm/atomic.h>
> +#include <asm/csr.h>
> +#include <asm/flushtlb.h>
> +
> +/* Xen command-line option to enable VMIDs */
> +static bool __read_mostly opt_vmid_enabled = true;
__ro_after_init ?
> +boolean_param("vmid", opt_vmid_enabled);
> +
> +/*
> + * VMIDs partition the physical TLB. In the current implementation VMIDs are
> + * introduced to reduce the number of TLB flushes. Each time the guest's
> + * virtual address space changes, instead of flushing the TLB, a new VMID is
The same odd "virtual" again? All the code here is about guest-physical, isn't
it?
> + * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
> + * The biggest advantage is that hot parts of the hypervisor's code and data
> + * retain in the TLB.
> + *
> + * Sketch of the Implementation:
> + *
> + * VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
> + * VMIDs are assigned in a round-robin scheme. To minimize the overhead of
> + * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
> + * 64-bit generation. Only on a generation overflow the code needs to
> + * invalidate all VMID information stored at the VCPUs with are run on the
> + * specific physical processor. This overflow appears after about 2^80
And the same interesting number again.
> + * host processor cycles, so we do not optimize this case, but simply disable
> + * VMID useage to retain correctness.
> + */
> +
> +/* Per-Hart VMID management. */
> +struct vmid_data {
> + uint64_t hart_vmid_generation;
Any reason not to simply use "generation"?
> + uint16_t next_vmid;
> + uint16_t max_vmid;
> + bool disabled;
> +};
> +
> +static DEFINE_PER_CPU(struct vmid_data, vmid_data);
> +
> +static unsigned long vmidlen_detect(void)
__init ? Or wait, are you (deliberately) permitting different VMIDLEN
across harts?
> +{
> + unsigned long vmid_bits;
Why "long" (also for the function return type)?
> + unsigned long old;
> +
> + /* Figure-out number of VMID bits in HW */
> + old = csr_read(CSR_HGATP);
> +
> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
> + vmid_bits = csr_read(CSR_HGATP);
> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
Nit: Stray blank.
> + vmid_bits = flsl(vmid_bits);
> + csr_write(CSR_HGATP, old);
> +
> + /*
> + * We polluted local TLB so flush all guest TLB as
> + * a speculative access can happen at any time.
> + */
> + local_hfence_gvma_all();
There's no guest running. If you wrote hgatp.MODE as zero, as per my
understanding no new TLB entries could even purely theoretically appear.
In fact, with no guest running (yet) I'm having a hard time seeing why
you shouldn't be able to simply write the register with just
HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
whether "old" needs restoring; writing plain zero afterwards ought to
suffice. You're in charge of the register, after all.
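
The simplified probe suggested here can be sketched against a simulated register. The sim_* helpers below are hypothetical stand-ins for csr_write()/csr_read() on hgatp, modelling a hart that implements only some of the VMID bits; the mask and shift are the RV64 values from riscv_encoding.h.

```c
#include <stdint.h>

#define HGATP_VMID_SHIFT 44
#define HGATP_VMID_MASK  UINT64_C(0x03FFF00000000000)

/* Simulated hart: only the low sim_implemented_bits VMID bits stick. */
static uint64_t sim_hgatp;
static unsigned int sim_implemented_bits = 7;

static void sim_csr_write_hgatp(uint64_t val)
{
    uint64_t vmid = (val & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;

    vmid &= (UINT64_C(1) << sim_implemented_bits) - 1;
    sim_hgatp = (val & ~HGATP_VMID_MASK) | (vmid << HGATP_VMID_SHIFT);
}

static uint64_t sim_csr_read_hgatp(void)
{
    return sim_hgatp;
}

static unsigned int vmidlen_detect(void)
{
    unsigned int bits = 0;
    uint64_t vmid;

    /* Write all-ones to the VMID field only; MODE stays 0 (Bare). */
    sim_csr_write_hgatp(HGATP_VMID_MASK);
    vmid = (sim_csr_read_hgatp() & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;

    /* No guest is running yet, so plain zero is written back. */
    sim_csr_write_hgatp(0);

    /* Count the bits that stuck; same result as flsl() on the field. */
    while ( vmid != 0 )
    {
        bits++;
        vmid >>= 1;
    }

    return bits;
}
```

The return type is unsigned int, since the result can never exceed the width of the VMID field.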
> + return vmid_bits;
> +}
> +
> +void vmid_init(void)
> +{
> + static bool g_disabled = false;
> + unsigned long vmid_len = vmidlen_detect();
> + struct vmid_data *data = &this_cpu(vmid_data);
> + unsigned long max_availalbe_bits = sizeof(data->max_vmid) << 3;
Nit: Typo in "available". Also now that we have it, better use
BITS_PER_BYTE here?
> + if ( vmid_len > max_availalbe_bits )
> + panic("%s: VMIDLEN is bigger then a type which represent VMID: %lu(%lu)\n",
> + __func__, vmid_len, max_availalbe_bits);
This shouldn't be a runtime check imo. What you want to check (at build
time) is that the bits set in HGATP_VMID_MASK can be held in ->max_vmid.
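
A sketch of such a build-time check, written with C11 _Static_assert so that it stands alone; in-tree, Xen's BUILD_BUG_ON() would be the natural equivalent. The struct is a reduced copy of the quoted one, and HGATP_VMID_FIELD_MAX is a hypothetical helper macro.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define HGATP_VMID_SHIFT 44
#define HGATP_VMID_MASK  UINT64_C(0x03FFF00000000000)

struct vmid_data {
    uint64_t generation;
    uint16_t next_vmid;
    uint16_t max_vmid;
    bool disabled;
};

/* Largest VMID the hgatp field can encode (all field bits set). */
#define HGATP_VMID_FIELD_MAX (HGATP_VMID_MASK >> HGATP_VMID_SHIFT)

/* Fails the build if the widest possible VMID would not fit in max_vmid. */
_Static_assert(HGATP_VMID_FIELD_MAX <=
               (UINT64_C(1) <<
                (CHAR_BIT * sizeof(((struct vmid_data *)0)->max_vmid))) - 1,
               "hgatp VMID field does not fit in vmid_data.max_vmid");
```

Because the bound is derived from the field's own size, narrowing max_vmid later would break the build instead of panicking at boot.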
> + data->max_vmid = BIT(vmid_len, U) - 1;
> + data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
Actually, what exactly does it mean that "VMIDs are disabled"? There's
no enable bit that I could find anywhere. Isn't it rather that in this
case you need to arrange to flush always on VM entry (or always after a
P2M change, depending how the TLB is split between guest and host use)?
If you look at vmx_vmenter_helper(), its flipping of
SECONDARY_EXEC_ENABLE_VPID tweaks CPU behavior, such that the flush
would be implicit (when the bit is off). I don't expect RISC-V has any
such "implicit" flushing behavior?
> + if ( g_disabled != data->disabled )
> + {
> + printk("%s: VMIDs %sabled.\n", __func__,
> + data->disabled ? "dis" : "en");
> + if ( !g_disabled )
> + g_disabled = data->disabled;
This doesn't match x86 code. g_disabled is a tristate there, which only
the boot CPU would ever write to.
A clear shortcoming of the x86 code (that you copied) is that the log
message doesn't identify the CPU in question. A sequence of "disabled"
and "enabled" could thus result, without the last one (or in fact any
one) making clear what the overall state is. I think you want to avoid
this from the beginning.
Jan
* Re: [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization
2025-07-31 15:58 ` [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
@ 2025-08-04 15:53 ` Jan Beulich
2025-08-06 11:43 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 15:53 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -3,11 +3,45 @@
> #define ASM__RISCV__P2M_H
>
> #include <xen/errno.h>
> +#include <xen/mm.h>
> +#include <xen/rwlock.h>
> +#include <xen/types.h>
>
> #include <asm/page-bits.h>
>
> #define paddr_bits PADDR_BITS
>
> +/* Get host p2m table */
> +#define p2m_get_hostp2m(d) (&(d)->arch.p2m)
> +
> +/* Per-p2m-table state */
> +struct p2m_domain {
> + /*
> + * Lock that protects updates to the p2m.
> + */
> + rwlock_t lock;
> +
> + /* Pages used to construct the p2m */
> + struct page_list_head pages;
> +
> + /* Indicate if it is required to clean the cache when writing an entry */
> + bool clean_pte;
I'm a little puzzled by this field still being here, despite the extensive
revlog commentary. If you really feel you need to keep it, please ...
> + /* Back pointer to domain */
> + struct domain *domain;
> +
> + /*
> + * P2M updates may required TLBs to be flushed (invalidated).
> + *
> + * Flushes may be deferred by setting 'need_flush' and then flushing
> + * when the p2m write lock is released.
> + *
> + * If an immediate flush is required (e.g, if a super page is
> + * shattered), call p2m_tlb_flush_sync().
> + */
> + bool need_flush;
... group booleans together, for better packing.
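
The packing effect can be seen with two reduced layouts. The structs below are hypothetical, with void * standing in for the lock and the domain pointer; on a typical LP64 ABI the scattered variant occupies 32 bytes and the grouped one 24, though exact sizes depend on the ABI.

```c
#include <stdbool.h>
#include <stddef.h>

/* Booleans separated by pointer-aligned members: each one drags padding. */
struct scattered {
    void *lock;
    bool clean_pte;
    void *domain;
    bool need_flush;
};

/* Booleans adjacent: they share a single padded slot at the end. */
struct grouped {
    void *lock;
    void *domain;
    bool clean_pte;
    bool need_flush;
};
```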
Jan
* Re: [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests
2025-07-31 15:58 ` [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
@ 2025-08-04 15:58 ` Jan Beulich
2025-08-05 10:40 ` Jan Beulich
1 sibling, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-04 15:58 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Implement p2m_set_allocation() to construct p2m pages pool for guests
> based on required number of pages.
>
> This is implemented by:
> - Adding a `struct paging_domain` which contains a freelist, a
> counter variable and a spinlock to `struct arch_domain` to
> indicate the free p2m pages and the number of p2m total pages in
> the p2m pages pool.
> - Adding a helper `p2m_set_allocation` to set the p2m pages pool
> size. This helper should be called before allocating memory for
> a guest and is called from domain_p2m_set_allocation(), the latter
> is a part of common dom0less code.
> - Adding paging_freelist_init() to struct paging_domain.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-07-31 15:58 ` [PATCH v3 06/20] xen/riscv: add root page table allocation Oleksii Kurochko
@ 2025-08-05 10:37 ` Jan Beulich
2025-08-07 12:00 ` Oleksii Kurochko
2025-08-05 10:43 ` Jan Beulich
1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 10:37 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Introduce support for allocating and initializing the root page table
> required for RISC-V stage-2 address translation.
>
> To implement root page table allocation the following is introduced:
> - p2m_get_clean_page() and p2m_alloc_root_table(), p2m_allocate_root()
> helpers to allocate and zero a 16 KiB root page table, as mandated
> by the RISC-V privileged specification for Sv32x4/Sv39x4/Sv48x4/Sv57x4
> modes.
> - Update p2m_init() to inititialize p2m_root_order.
> - Add maddr_to_page() and page_to_maddr() macros for easier address
> manipulation.
> - Introduce paging_ret_pages_to_domheap() to return some pages before
> allocate 16 KiB pages for root page table.
> - Allocate root p2m table after p2m pool is initialized.
> - Add construct_hgatp() to construct the hgatp register value based on
> p2m->root, p2m->hgatp_mode and VMID.
Imo for this to be complete, freeing of the root table also wants taking
care of. Much like imo p2m_init() would better immediately be accompanied
by the respective teardown function. Once you start using them, you want
to use them in pairs, after all.
> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -133,11 +133,13 @@
> #define HGATP_MODE_SV48X4 _UL(9)
>
> #define HGATP32_MODE_SHIFT 31
> +#define HGATP32_MODE_MASK _UL(0x80000000)
> #define HGATP32_VMID_SHIFT 22
> #define HGATP32_VMID_MASK _UL(0x1FC00000)
> #define HGATP32_PPN _UL(0x003FFFFF)
>
> #define HGATP64_MODE_SHIFT 60
> +#define HGATP64_MODE_MASK _ULL(0xF000000000000000)
> #define HGATP64_VMID_SHIFT 44
> #define HGATP64_VMID_MASK _ULL(0x03FFF00000000000)
> #define HGATP64_PPN _ULL(0x00000FFFFFFFFFFF)
> @@ -170,6 +172,7 @@
> #define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
> #define HGATP_VMID_MASK HGATP64_VMID_MASK
> #define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT
> +#define HGATP_MODE_MASK HGATP64_MODE_MASK
> #else
> #define MSTATUS_SD MSTATUS32_SD
> #define SSTATUS_SD SSTATUS32_SD
> @@ -181,8 +184,11 @@
> #define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT
> #define HGATP_VMID_MASK HGATP32_VMID_MASK
> #define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT
> +#define HGATP_MODE_MASK HGATP32_MODE_MASK
> #endif
>
> +#define GUEST_ROOT_PAGE_TABLE_SIZE KB(16)
In another context I already mentioned that imo you want to be careful with
the use of "guest" in identifiers. It's not the guest page tables which have
an order-2 root table, but the P2M (Xen terminology) or G-stage / second
stage (RISC-V spec terminology) ones. As long as you're only doing P2M
work, this may not look significant. But once you actually start dealing
with guest page tables, it easily can end up confusing.
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -1,8 +1,86 @@
> +#include <xen/domain_page.h>
> #include <xen/mm.h>
> #include <xen/rwlock.h>
> #include <xen/sched.h>
>
> #include <asm/paging.h>
> +#include <asm/p2m.h>
> +#include <asm/riscv_encoding.h>
> +
> +unsigned int __read_mostly p2m_root_order;
If this is to be a variable at all, it ought to be __ro_after_init, and
hence it shouldn't be written every time p2m_init() is run. If you want
it to remain a variable, what's wrong with
const unsigned int p2m_root_order = ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT;
or some such? But of course equally well you could have
#define P2M_ROOT_ORDER (ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
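To make the arithmetic behind the macro variant concrete, here is a minimal standalone sketch. KB(), PAGE_SHIFT and ilog2_ul() are stand-ins written for this illustration (Xen has its own KB(), PAGE_SHIFT and ilog2()), and P2M_ROOT_TABLE_SIZE is a hypothetical name; only the 16 KiB size comes from the patch.

```c
#include <assert.h>

/* Stand-ins for illustration; Xen provides its own KB(), PAGE_SHIFT and ilog2(). */
#define KB(x)      ((x) * 1024UL)
#define PAGE_SHIFT 12

/* Hypothetical name, used here only to keep the sketch self-contained. */
#define P2M_ROOT_TABLE_SIZE KB(16)

/* Integer log2 of a power of two, evaluated at run time in this sketch. */
static unsigned int ilog2_ul(unsigned long x)
{
    unsigned int n = 0;

    assert(x && !(x & (x - 1)));    /* only powers of two are meaningful here */

    while ( (x >>= 1) != 0 )
        n++;

    return n;
}

/* 16 KiB = 2^14 bytes, pages are 2^12 bytes, hence order 2 (4 pages). */
#define P2M_ROOT_ORDER (ilog2_ul(P2M_ROOT_TABLE_SIZE) - PAGE_SHIFT)
#define P2M_ROOT_PAGES (1U << P2M_ROOT_ORDER)
```

With these definitions nothing needs writing in p2m_init(); the order follows from the table size alone.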
> +static void clear_and_clean_page(struct page_info *page)
> +{
> + clear_domain_page(page_to_mfn(page));
> +
> + /*
> + * If the IOMMU doesn't support coherent walks and the p2m tables are
> + * shared between the CPU and IOMMU, it is necessary to clean the
> + * d-cache.
> + */
That is, ...
> + clean_dcache_va_range(page, PAGE_SIZE);
... this call really wants to be conditional?
> +}
> +
> +static struct page_info *p2m_allocate_root(struct domain *d)
With there also being p2m_alloc_root_table() and with that being the sole
caller of the function here, I wonder: Is having this in a separate
function really outweighing the possible confusion of which of the two
functions to use?
> +{
> + struct page_info *page;
> +
> + /*
> + * As mentioned in the Priviliged Architecture Spec (version 20240411)
> + * in Section 18.5.1, for the paged virtual-memory schemes (Sv32x4,
> + * Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB and must
> + * be aligned to a 16-KiB boundary.
> + */
> + page = alloc_domheap_pages(d, P2M_ROOT_ORDER, MEMF_no_owner);
> + if ( !page )
> + return NULL;
> +
> + for ( unsigned int i = 0; i < P2M_ROOT_PAGES; i++ )
> + clear_and_clean_page(page + i);
> +
> + return page;
> +}
> +
> +unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid)
> +{
> + unsigned long ppn;
> +
> + ppn = PFN_DOWN(page_to_maddr(p2m->root)) & HGATP_PPN;
Why not page_to_pfn() or mfn_x(page_to_mfn())? I.e. why mix different groups
of accessors?
As to "& HGATP_PPN" - that's making an assumption that you could avoid by
using ...
> + /* TODO: add detection of hgatp_mode instead of hard-coding it. */
> +#if RV_STAGE1_MODE == SATP_MODE_SV39
> + p2m->hgatp_mode = HGATP_MODE_SV39X4;
> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
> + p2m->hgatp_mode = HGATP_MODE_SV48X4;
> +#else
> +# error "add HGATP_MODE"
> +#endif
> +
> + return ppn | MASK_INSR(p2m->hgatp_mode, HGATP_MODE_MASK) |
> + MASK_INSR(vmid, HGATP_VMID_MASK);
... MASK_INSR() also on "ppn".
As to the writing of p2m->hgatp_mode - you don't want to do this here, when
this is the function to calculate the value to put into hgatp. This field
needs calculating only once, perhaps in p2m_init().
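A minimal sketch of what using MASK_INSR() for all three fields could look like. The masks are the RV64 values from the quoted riscv_encoding.h hunk and MASK_INSR() is written in the same shape as Xen's macro; make_hgatp() and its parameters are hypothetical names for illustration, not the function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by the lowest set bit of the mask (a shift), then apply the mask. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* RV64 hgatp field masks, as in the quoted hunk. */
#define HGATP64_MODE_MASK 0xF000000000000000ULL
#define HGATP64_VMID_MASK 0x03FFF00000000000ULL
#define HGATP64_PPN       0x00000FFFFFFFFFFFULL

#define HGATP_MODE_SV39X4 8ULL

static_assert((HGATP64_MODE_MASK & HGATP64_VMID_MASK) == 0 &&
              (HGATP64_VMID_MASK & HGATP64_PPN) == 0,
              "hgatp fields must not overlap");

/*
 * Hypothetical helper: the mode is passed in rather than written here,
 * so the function only computes the register value.
 */
static uint64_t make_hgatp(uint64_t root_ppn, uint64_t mode, uint64_t vmid)
{
    return MASK_INSR(root_ppn, HGATP64_PPN) |
           MASK_INSR(mode, HGATP64_MODE_MASK) |
           MASK_INSR(vmid, HGATP64_VMID_MASK);
}
```

Treating the PPN like the other two fields removes the assumption that the PPN field starts at bit 0.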
> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
> +{
> + struct domain *d = p2m->domain;
> + struct page_info *page;
> + const unsigned int nr_root_pages = P2M_ROOT_PAGES;
Is this local variable really of any use?
> + /*
> + * Return back nr_root_pages to assure the root table memory is also
> + * accounted against the P2M pool of the domain.
> + */
> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
> + return -ENOMEM;
> +
> + page = p2m_allocate_root(d);
> + if ( !page )
> + return -ENOMEM;
Hmm, and the pool is then left shrunk by 4 pages?
> --- a/xen/arch/riscv/paging.c
> +++ b/xen/arch/riscv/paging.c
> @@ -54,6 +54,36 @@ int paging_freelist_init(struct domain *d, unsigned long pages,
>
> return 0;
> }
> +
> +bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
> +{
> + struct page_info *page;
> +
> + ASSERT(spin_is_locked(&d->arch.paging.lock));
> +
> + if ( ACCESS_ONCE(d->arch.paging.total_pages) < nr_pages )
> + return false;
> +
> + for ( unsigned int i = 0; i < nr_pages; i++ )
> + {
> + /* Return memory to domheap. */
> + page = page_list_remove_head(&d->arch.paging.freelist);
> + if( page )
> + {
> + ACCESS_ONCE(d->arch.paging.total_pages)--;
> + free_domheap_page(page);
> + }
> + else
> + {
> + printk(XENLOG_ERR
> + "Failed to free P2M pages, P2M freelist is empty.\n");
> + return false;
Looks pretty redundant with half of paging_freelist_init(), including the
stray full stop in the log message.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests
2025-07-31 15:58 ` [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
2025-08-04 15:58 ` Jan Beulich
@ 2025-08-05 10:40 ` Jan Beulich
2025-08-06 12:01 ` Oleksii Kurochko
1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 10:40 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> @@ -30,3 +34,18 @@ int p2m_init(struct domain *d)
>
> return 0;
> }
> +
> +/*
> + * Set the pool of pages to the required number of pages.
> + * Returns 0 for success, non-zero for failure.
> + * Call with d->arch.paging.lock held.
> + */
> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
Noticed only when looking at the subsequent patch: With this being ...
> +{
> + int rc;
> +
> + if ( (rc = paging_freelist_init(d, pages, preempted)) )
... a caller of this function, the "init" in the name feels wrong.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-07-31 15:58 ` [PATCH v3 06/20] xen/riscv: add root page table allocation Oleksii Kurochko
2025-08-05 10:37 ` Jan Beulich
@ 2025-08-05 10:43 ` Jan Beulich
2025-08-07 13:35 ` Oleksii Kurochko
1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 10:43 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
> +{
> + struct domain *d = p2m->domain;
> + struct page_info *page;
> + const unsigned int nr_root_pages = P2M_ROOT_PAGES;
> +
> + /*
> + * Return back nr_root_pages to assure the root table memory is also
> + * accounted against the P2M pool of the domain.
> + */
> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
> + return -ENOMEM;
> +
> + page = p2m_allocate_root(d);
> + if ( !page )
> + return -ENOMEM;
> +
> + p2m->root = page;
> +
> + return 0;
> +}
In the success case, shouldn't you bump the paging pool's total_pages by
P2M_ROOT_PAGES? (As the freeing side is missing so far, it's not easy to
tell whether there's [going to be] a balancing problem in the long run.
In the short run there certainly is.)
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn()
2025-07-31 15:58 ` [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn() Oleksii Kurochko
@ 2025-08-05 14:11 ` Jan Beulich
2025-08-08 9:16 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 14:11 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -12,6 +12,7 @@
> #include <xen/sections.h>
> #include <xen/types.h>
>
> +#include <asm/cmpxchg.h>
> #include <asm/page.h>
> #include <asm/page-bits.h>
>
> @@ -247,9 +248,17 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
> #define PGT_writable_page PG_mask(1, 1) /* has writable mappings? */
> #define PGT_type_mask PG_mask(1, 1) /* Bits 31 or 63. */
>
> -/* Count of uses of this frame as its current type. */
> -#define PGT_count_width PG_shift(2)
> -#define PGT_count_mask ((1UL << PGT_count_width) - 1)
> + /* 9-bit count of uses of this frame as its current type. */
Nit: Stray blank at start of line.
> +#define PGT_count_mask PG_mask(0x3FF, 10)
A 9-bit count corresponds to a mask of 0x1ff, doesn't it? With 0x3ff the count
can spill over the type.
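The overlap can be checked mechanically. The sketch below models PG_shift()/PG_mask() on the macros used in the quoted header, assuming a 64-bit build and using ULL so that it compiles anywhere; the COUNT_MASK_* names are made up for this illustration.

```c
#include <assert.h>

/* Stand-ins modelled on the quoted header, assuming BITS_PER_LONG == 64. */
#define BITS_PER_LONG   64
#define PG_shift(idx)   (BITS_PER_LONG - (idx))
#define PG_mask(x, idx) (x ## ULL << PG_shift(idx))

#define PGT_type_mask        PG_mask(1, 1)        /* bit 63 */

/* Hypothetical names for the two candidate masks. */
#define COUNT_MASK_AS_POSTED PG_mask(0x3FF, 10)   /* 10 bits: 54..63 */
#define COUNT_MASK_9BIT      PG_mask(0x1FF, 10)   /*  9 bits: 54..62 */

/* The 9-bit mask stays clear of the type bit. */
static_assert((COUNT_MASK_9BIT & PGT_type_mask) == 0,
              "count must not overlap the type");
```

The posted 0x3FF mask would fail the same assertion, since its top bit coincides with PGT_type_mask.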
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()
2025-08-04 13:52 ` Jan Beulich
@ 2025-08-05 14:45 ` Oleksii Kurochko
2025-08-05 15:01 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-05 14:45 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/4/25 3:52 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>> covering the range of guest physical addresses between start_addr and
>> start_addr + size for all VMIDs.
>>
>> The remote fence operation applies to the entire address space if either:
>> - start_addr and size are both 0, or
>> - size is equal to 2^XLEN-1.
>>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
> Acked-by: Jan Beulich<jbeulich@suse.com>
>
> However, ...
>
>> --- a/xen/arch/riscv/include/asm/sbi.h
>> +++ b/xen/arch/riscv/include/asm/sbi.h
>> @@ -89,6 +89,25 @@ bool sbi_has_rfence(void);
>> int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
>> size_t size);
>>
>> +/*
>> + * Instructs the remote harts to execute one or more HFENCE.GVMA
>> + * instructions, covering the range of guest physical addresses
>> + * between start_addr and start_addr + size for all VMIDs.
> ... I'd like to ask that you avoid fuzzy terminology like this one. Afaict
> you mean [start, start + size). Help yourself and future readers by then
> also saying it exactly like this. (Happy to make a respective edit while
> committing.)
I just tried to follow the wording in the SBI spec.
I agree that using [start, start + size) is clearer, as otherwise I end up
checking the SBI code each time to verify whether 'start + size' is included.
I would be happy if you could update this part of the commit message during
commit.
>
>> + * Returns 0 if IPI was sent to all the targeted harts successfully
>> + * or negative value if start_addr or size is not valid.
> This similarly is ambiguous: The union of the success case stated and the
> error case stated isn't obviously all possible states. The success
> statement in particular alludes to the possibility of an IPI not actually
> reaching its target.
Same as above: this is what the SBI spec says.
I've not checked the SBI code deeply, but judging by the code below it seems
to wait until the IPI has been handled:
/**
 * As this this function only handlers scalar values of hart mask, it must be
 * set to all online harts if the intention is to send IPIs to all the harts.
 * If hmask is zero, no IPIs will be sent.
 */
int sbi_ipi_send_many(ulong hmask, ulong hbase, u32 event, void *data)
{
    ...

    /* Send IPIs */
    do {
        retry_needed = false;
        sbi_hartmask_for_each_hart(i, &target_mask) {
            rc = sbi_ipi_send(scratch, i, event, data);
            if (rc == SBI_IPI_UPDATE_RETRY)
                retry_needed = true;
            else
                sbi_hartmask_clear_hart(i, &target_mask);
        }
    } while (retry_needed);

    /* Sync IPIs */
    sbi_ipi_sync(scratch, event);

    return 0;
}

and

static int sbi_ipi_sync(struct sbi_scratch *scratch, u32 event)
{
    const struct sbi_ipi_event_ops *ipi_ops;

    if ((SBI_IPI_EVENT_MAX <= event) ||
        !ipi_ops_array[event])
        return SBI_EINVAL;
    ipi_ops = ipi_ops_array[event];

    if (ipi_ops->sync)
        ipi_ops->sync(scratch);

    return 0;
}

which calls:

static void tlb_sync(struct sbi_scratch *scratch)
{
    atomic_t *tlb_sync =
        sbi_scratch_offset_ptr(scratch, tlb_sync_off);

    while (atomic_read(tlb_sync) > 0) {
        /*
         * While we are waiting for remote hart to set the sync,
         * consume fifo requests to avoid deadlock.
         */
        tlb_process_once(scratch);
    }

    return;
}
>
>> + * The remote fence operation applies to the entire address space if either:
>> + * - start_addr and size are both 0, or
>> + * - size is equal to 2^XLEN-1.
> Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's
> not unambiguous, unless there's specific terminology that I'm unaware of,
> yet which would make this unambiguous.)
The RISC-V spec is not fully consistent in its terminology around XLEN
(3.1.6.2. Base ISA Control in mstatus Register):
For RV64 harts, the SXL and UXL fields are WARL fields that control the value
of XLEN for S-mode and U-mode, respectively. The encoding of these fields is
the same as the MXL field of misa, shown in Table 9. The effective XLEN in
S-mode and U-mode are termed SXLEN and UXLEN, respectively
Basically, RISC-V privileged architecture defines different XLEN values for
various privilege modes:
- MXLEN for Machine mode
- SXLEN for Supervisor mode.
- HSXLEN for Hypervisor-Supervisor mode.
- VSXLEN for Virtual Supervisor mode.
Considering that SBI is an API provided to S-mode, I expect that XLEN = SXLEN
in this case, but the SBI spec just says XLEN.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
2025-08-04 13:55 ` Jan Beulich
@ 2025-08-05 14:57 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-05 14:57 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/4/25 3:55 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> It instructs the remote harts to execute one or more HFENCE.GVMA instructions
>> by making an SBI call, covering the range of guest physical addresses between
>> start_addr and start_addr + size only for the given VMID.
>>
>> The remote fence operation applies to the entire address space if either:
>> - start_addr and size are both 0, or
>> - size is equal to 2^XLEN-1.
>>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
> Acked-by: Jan Beulich<jbeulich@suse.com>
> with perhaps a similar on-commit edit as suggested for patch 1.
I would be happy if you could do that during merge. Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma()
2025-08-05 14:45 ` Oleksii Kurochko
@ 2025-08-05 15:01 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 15:01 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 05.08.2025 16:45, Oleksii Kurochko wrote:
> On 8/4/25 3:52 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> + * Returns 0 if IPI was sent to all the targeted harts successfully
>>> + * or negative value if start_addr or size is not valid.
>> This similarly is ambiguous: The union of the success case stated and the
>> error case stated isn't obviously all possible states. The success
>> statement in particular alludes to the possibility of an IPI not actually
>> reaching its target.
>
> The same as above this is what SBI spec. tells.
>
> I've not checked SBI code deeply, but it seems like the code is waiting while
> IPI will be reached as looking at the code:
> /**
> * As this this function only handlers scalar values of hart mask, it must be
> * set to all online harts if the intention is to send IPIs to all the harts.
> * If hmask is zero, no IPIs will be sent.
> */
> int sbi_ipi_send_many(ulong hmask, ulong hbase, u32 event, void *data)
> {
> ...
>
> /* Send IPIs */
> do {
> retry_needed = false;
> sbi_hartmask_for_each_hart(i, &target_mask) {
> rc = sbi_ipi_send(scratch, i, event, data);
> if (rc == SBI_IPI_UPDATE_RETRY)
> retry_needed = true;
> else
> sbi_hartmask_clear_hart(i, &target_mask);
> }
> } while (retry_needed);
>
> /* Sync IPIs */
> sbi_ipi_sync(scratch, event);
>
> return 0;
> }
> and
> static int sbi_ipi_sync(struct sbi_scratch *scratch, u32 event)
> {
> const struct sbi_ipi_event_ops *ipi_ops;
>
> if ((SBI_IPI_EVENT_MAX <= event) ||
> !ipi_ops_array[event])
> return SBI_EINVAL;
> ipi_ops = ipi_ops_array[event];
>
> if (ipi_ops->sync)
> ipi_ops->sync(scratch);
>
> return 0;
> }
> which calls:
> static void tlb_sync(struct sbi_scratch *scratch)
> {
> atomic_t *tlb_sync =
> sbi_scratch_offset_ptr(scratch, tlb_sync_off);
>
> while (atomic_read(tlb_sync) > 0) {
> /*
> * While we are waiting for remote hart to set the sync,
> * consume fifo requests to avoid deadlock.
> */
> tlb_process_once(scratch);
> }
>
> return;
> }
I'll leave that comment as-is then, even if I'm not really happy with it.
>>> + * The remote fence operation applies to the entire address space if either:
>>> + * - start_addr and size are both 0, or
>>> + * - size is equal to 2^XLEN-1.
>> Whose XLEN is this? The guest's? The host's? (I assume the latter, but it's
>> not unambiguous, unless there's specific terminology that I'm unaware of,
>> yet which would make this unambiguous.)
>
> RISC-V spec quite mixes the terminology (3.1.6.2. Base ISA Control in mstatus Register)
> around XLEN:
> For RV64 harts, the SXL and UXL fields are WARL fields that control the value
> of XLEN for S-mode and U-mode, respectively. The encoding of these fields is
> the same as the MXL field of misa, shown in Table 9. The effective XLEN in
> S-mode and U-mode are termed SXLEN and UXLEN, respectively
>
> Basically, RISC-V privileged architecture defines different XLEN values for
> various privilege modes:
> - MXLEN for Machine mode
> - SXLEN for Supervisor mode.
> - HSXLEN for Hypervisor-Supervisor mode.
> - VSXLEN for Virtual Supervisor mode.
>
> Considering that SBI is an API that is provided for S-mode I expect that XLEN = SXLEN
> in this case, but SBI spec. is using just XLEN.
Very helpful.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
2025-07-31 15:58 ` [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m Oleksii Kurochko
@ 2025-08-05 15:20 ` Jan Beulich
2025-08-08 13:46 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 15:20 UTC (permalink / raw)
To: Oleksii Kurochko, Andrew Cooper
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Anthony PERARD,
Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Implement map_regions_p2mt() to map a region in the guest p2m with
> a specific p2m type. The memory attributes will be derived from the
> p2m type. This function is going to be called from dom0less common
> code.
s/is going to be/is/ ? Such a call exists already, after all.
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -121,21 +121,22 @@ static inline int guest_physmap_mark_populate_on_demand(struct domain *d,
> return -EOPNOTSUPP;
> }
>
> -static inline int guest_physmap_add_entry(struct domain *d,
> - gfn_t gfn, mfn_t mfn,
> - unsigned long page_order,
> - p2m_type_t t)
> -{
> - BUG_ON("unimplemented");
> - return -EINVAL;
> -}
> +/*
> + * Map a region in the guest p2m with a specific p2m type.
What is "the guest p2m"? In your answer, please consider the possible
(and at some point likely necessary) existence of altp2m and nestedp2m.
In patch 04 you introduce p2m_get_hostp2m(), and I expect that's
what you mean here.
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -9,6 +9,41 @@
>
> unsigned int __read_mostly p2m_root_order;
>
> +/*
> + * Force a synchronous P2M TLB flush.
> + *
> + * Must be called with the p2m lock held.
> + */
> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
> +{
> + struct domain *d = p2m->domain;
Pointer-to-const please. Personally, given the implementation of this
function (and also ...
> + ASSERT(p2m_is_write_locked(p2m));
> +
> + sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
> +
> + p2m->need_flush = false;
> +}
> +
> +void p2m_tlb_flush_sync(struct p2m_domain *p2m)
> +{
> + if ( p2m->need_flush )
> + p2m_force_tlb_flush_sync(p2m);
> +}
... this one) I'd further ask for the function parameters to also be
pointer-to-const, but Andrew may object to that. Andrew - it continues to
be unclear to me under what conditions you agree with adding const, and
under what conditions you would object to me asking for such. Please can
you take the time to clarify this?
> +/* Unlock the flush and do a P2M TLB flush if necessary */
> +void p2m_write_unlock(struct p2m_domain *p2m)
> +{
> + /*
> + * The final flush is done with the P2M write lock taken to avoid
> + * someone else modifying the P2M wbefore the TLB invalidation has
Nit: Stray 'w'.
> + * completed.
> + */
> + p2m_tlb_flush_sync(p2m);
Wasn't the plan to have this be conditional?
> @@ -139,3 +174,33 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>
> return 0;
> }
> +
> +static int p2m_set_range(struct p2m_domain *p2m,
> + gfn_t sgfn,
> + unsigned long nr,
> + mfn_t smfn,
> + p2m_type_t t)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn,
> + unsigned long nr, mfn_t mfn, p2m_type_t t)
> +{
> + int rc;
> +
> + p2m_write_lock(p2m);
> + rc = p2m_set_range(p2m, start_gfn, nr, mfn, t);
> + p2m_write_unlock(p2m);
> +
> + return rc;
> +}
> +
> +int map_regions_p2mt(struct domain *d,
> + gfn_t gfn,
> + unsigned long nr,
> + mfn_t mfn,
> + p2m_type_t p2mt)
> +{
> + return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
> +}
And eventually both helper functions will gain further callers? Otherwise
it's a little hard to see why they would both need to be separate functions.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()
2025-07-31 15:58 ` [PATCH v3 12/20] xen/riscv: implement p2m_set_range() Oleksii Kurochko
@ 2025-08-05 16:04 ` Jan Beulich
2025-08-15 9:52 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-05 16:04 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -7,11 +7,13 @@
> #include <xen/rwlock.h>
> #include <xen/types.h>
>
> +#include <asm/page.h>
> #include <asm/page-bits.h>
>
> extern unsigned int p2m_root_order;
> #define P2M_ROOT_ORDER p2m_root_order
> #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
> +#define P2M_ROOT_LEVEL HYP_PT_ROOT_LEVEL
I think I commented on this before, and I would have hoped for at least a remark
in the description to appear (perhaps even a comment here): It's okay(ish) to tie
these together for now, but in the longer run I don't expect this is going to be
wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force
all P2Ms to use 5 levels of page tables.
> @@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
> return 0;
> }
>
> +/*
> + * Find and map the root page table. The caller is responsible for
> + * unmapping the table.
> + *
> + * The function will return NULL if the offset into the root table is
> + * invalid.
> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> + unsigned long root_table_indx;
> +
> + root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
Right now page table layouts / arrangements are indeed similar enough to
share accessor constructs. Nevertheless I find it problematic (doc-wise
at the very least) that a Xen page table construct is used to access a
P2M page table. If and when these needed to be decoupled, it would likely
help if the distinction was already made, by - for now - simply
introducing aliases (here e.g. P2M_LEVEL_ORDER(), expanding to
XEN_PT_LEVEL_ORDER() for the time being).
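A minimal sketch of such an alias. PAGETABLE_ORDER == 9 and the shape of XEN_PT_LEVEL_ORDER() are assumptions made so the fragment stands alone; P2M_LEVEL_ORDER() is the name suggested above, and p2m_level_frames() is a hypothetical helper used only to show the alias in use.

```c
#include <assert.h>

/* Assumed stand-ins for the existing Xen page table constructs. */
#define PAGETABLE_ORDER         9
#define XEN_PT_LEVEL_ORDER(lvl) ((lvl) * PAGETABLE_ORDER)

/* P2M-side alias: identical for now, but free to diverge later. */
#define P2M_LEVEL_ORDER(lvl)    XEN_PT_LEVEL_ORDER(lvl)

static_assert(P2M_LEVEL_ORDER(0) == 0, "level 0 maps single pages");
static_assert(P2M_LEVEL_ORDER(2) == 18, "level 2 maps 1G with 4K pages");

/* Hypothetical helper: number of 4K frames covered by one entry at a level. */
static unsigned long p2m_level_frames(unsigned int lvl)
{
    return 1UL << P2M_LEVEL_ORDER(lvl);
}
```

P2M code written against the alias would need no changes at its call sites if the two layouts were ever decoupled.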
> + if ( root_table_indx >= P2M_ROOT_PAGES )
> + return NULL;
> +
> + return __map_domain_page(p2m->root + root_table_indx);
> +}
> +
> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
> +{
> + write_pte(p, pte);
> + if ( clean_pte )
> + clean_dcache_va_range(p, sizeof(*p));
Not necessarily for right away, but if multiple adjacent PTEs are
written without releasing the lock, the then-redundant cache flushing
can be a performance issue.
> +}
> +
> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> +{
> + pte_t pte;
> +
> + memset(&pte, 0, sizeof(pte));
Why memset()? Why not simply give the variable an appropriate initializer?
Or use ...
> + p2m_write_pte(p, pte, clean_pte);
... a compound literal here, like you do ...
> +}
> +
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
> +{
> + panic("%s: hasn't been implemented yet\n", __func__);
> +
> + return (pte_t) { .pte = 0 };
... here? (Just {} would also do, if I'm not mistaken.)
> +}
> +
> +#define P2M_TABLE_MAP_NONE 0
> +#define P2M_TABLE_MAP_NOMEM 1
> +#define P2M_TABLE_SUPER_PAGE 2
> +#define P2M_TABLE_NORMAL 3
> +
> +/*
> + * Take the currently mapped table, find the corresponding the entry
> + * corresponding to the GFN, and map the next table, if available.
Nit: Double "corresponding".
> + * The previous table will be unmapped if the next level was mapped
> + * (e.g P2M_TABLE_NORMAL returned).
> + *
> + * `alloc_tbl` parameter indicates whether intermediate tables should
> + * be allocated when not present.
> + *
> + * Return values:
> + * P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
> + * P2M_TABLE_MAP_NOMEM: allocating a new page failed.
> + * P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
> + * P2M_TABLE_NORMAL: The next entry points to a superpage.
> + */
> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> + unsigned int level, pte_t **table,
> + unsigned int offset)
> +{
> + panic("%s: hasn't been implemented yet\n", __func__);
> +
> + return P2M_TABLE_MAP_NONE;
> +}
> +
> +/* Free pte sub-tree behind an entry */
> +static void p2m_free_subtree(struct p2m_domain *p2m,
> + pte_t entry, unsigned int level)
> +{
> + panic("%s: hasn't been implemented yet\n", __func__);
> +}
> +
> +/*
> + * Insert an entry in the p2m. This should be called with a mapping
> + * equal to a page/superpage.
> + */
> +static int p2m_set_entry(struct p2m_domain *p2m,
> + gfn_t gfn,
> + unsigned long page_order,
> + mfn_t mfn,
> + p2m_type_t t)
> +{
> + unsigned int level;
> + unsigned int target = page_order / PAGETABLE_ORDER;
> + pte_t *entry, *table, orig_pte;
> + int rc;
> + /* A mapping is removed if the MFN is invalid. */
> + bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
Comment and code don't fit together. Many MFNs are invalid (any for which
mfn_valid() returns false), yet you only check for INVALID_MFN here.
> + DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
> +
> + ASSERT(p2m_is_write_locked(p2m));
> +
> + /*
> + * Check if the level target is valid: we only support
> + * 4K - 2M - 1G mapping.
> + */
> + ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));
If you think you need to check this, don't you also want to check that
GFN and MFN (the latter if it isn't INVALID_MFN) fit the requested order?
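A minimal sketch of such a check, using raw 64-bit frame numbers instead of Xen's typesafe gfn_t/mfn_t. INVALID_MFN_RAW, frame_aligned() and mapping_fits_order() are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for Xen's INVALID_MFN (all bits set). */
#define INVALID_MFN_RAW UINT64_MAX

/* True if the frame number is a multiple of 2^order. */
static bool frame_aligned(uint64_t frame, unsigned int order)
{
    assert(order < 64);

    return !(frame & ((UINT64_C(1) << order) - 1));
}

/*
 * The GFN must always fit the order; the MFN only needs checking when
 * a mapping is being established rather than removed.
 */
static bool mapping_fits_order(uint64_t gfn, uint64_t mfn, unsigned int order)
{
    if ( !frame_aligned(gfn, order) )
        return false;

    return mfn == INVALID_MFN_RAW || frame_aligned(mfn, order);
}
```

In p2m_set_entry() a check of this kind would sit next to the existing ASSERT() on the target level.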
> + table = p2m_get_root_pointer(p2m, gfn);
> + if ( !table )
> + return -EINVAL;
> +
> + for ( level = P2M_ROOT_LEVEL; level > target; level-- )
> + {
> + /*
> + * Don't try to allocate intermediate page table if the mapping
> + * is about to be removed.
> + */
> + rc = p2m_next_level(p2m, !removing_mapping,
> + level, &table, offsets[level]);
> + if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
> + {
> + rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
> + /*
> + * We are here because p2m_next_level has failed to map
> + * the intermediate page table (e.g the table does not exist
> + * and they p2m tree is read-only). It is a valid case
> + * when removing a mapping as it may not exist in the
> + * page table. In this case, just ignore it.
> + */
> + rc = removing_mapping ? 0 : rc;
Nit: Stray blank.
> + goto out;
> + }
> +
> + if ( rc != P2M_TABLE_NORMAL )
> + break;
> + }
> +
> + entry = table + offsets[level];
> +
> + /*
> + * If we are here with level > target, we must be at a leaf node,
> + * and we need to break up the superpage.
> + */
> + if ( level > target )
> + {
> + panic("Shattering isn't implemented\n");
> + }
> +
> + /*
> + * We should always be there with the correct level because all the
> + * intermediate tables have been installed if necessary.
> + */
> + ASSERT(level == target);
> +
> + orig_pte = *entry;
> +
> + if ( removing_mapping )
> + p2m_clean_pte(entry, p2m->clean_pte);
> + else
> + {
> + pte_t pte = p2m_pte_from_mfn(mfn, t);
> +
> + p2m_write_pte(entry, pte, p2m->clean_pte);
> +
> + p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
> + gfn_add(gfn, BIT(page_order, UL) - 1));
> + p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
> + }
> +
> + p2m->need_flush = true;
> +
> + /*
> + * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
> + * is not ready for RISC-V support.
> + *
> + * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
> + * here.
> + */
> +#ifdef CONFIG_HAS_PASSTHROUGH
> +# error "add code to flush IOMMU TLB"
> +#endif
> +
> + rc = 0;
> +
> + /*
> + * Free the entry only if the original pte was valid and the base
> + * is different (to avoid freeing when permission is changed).
> + */
> + if ( pte_is_valid(orig_pte) &&
> + !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
I'm puzzled by this 2nd check: A permission change would - I expect - only
occur to a leaf entry. If the new entry is a super-page one and the old
wasn't, don't you still need to free the sub-tree, no matter whether the
MFNs are the same? Plus consider the special case of MFN 0: If you clear
an entry using MFN 0, you will find old and new PTEs both having the same
MFN.
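(Illustration, not part of the patch: a minimal model of the MFN 0 case. It assumes the RISC-V PTE layout with the valid bit at bit 0 and the PPN field starting at bit 10; pte_valid(), pte_mfn(), pte_from_mfn() and should_free_as_posted() are hypothetical stand-ins for the helpers used in the patch.)

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID      0x1ULL               /* V bit (bit 0) */
#define PTE_PPN_SHIFT  10                   /* PPN field starts at bit 10 */
#define PTE_PPN_MASK   ((1ULL << 44) - 1)   /* 44-bit PPN, as in Sv39/48/57 */

typedef uint64_t pte_t;                     /* simplified stand-in for pte_t */

static bool pte_valid(pte_t pte)
{
    return pte & PTE_VALID;
}

static uint64_t pte_mfn(pte_t pte)
{
    return (pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK;
}

/* Build a valid leaf PTE pointing at 'mfn' (permission bits omitted). */
static pte_t pte_from_mfn(uint64_t mfn)
{
    return (mfn << PTE_PPN_SHIFT) | PTE_VALID;
}

/* The condition as posted: free only if old was valid and the MFNs differ. */
static bool should_free_as_posted(pte_t orig, pte_t now)
{
    return pte_valid(orig) && pte_mfn(now) != pte_mfn(orig);
}
```

In this model, clearing an entry that mapped MFN 5 reports that the old entry needs freeing, while clearing an entry that mapped MFN 0 does not, because an all-zero PTE also decodes to MFN 0.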
> static int p2m_set_range(struct p2m_domain *p2m,
> gfn_t sgfn,
> unsigned long nr,
> mfn_t smfn,
> p2m_type_t t)
> {
> - return -EOPNOTSUPP;
> + int rc = 0;
> + unsigned long left = nr;
> +
> + /*
> + * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> + * be dropped in relinquish_p2m_mapping(). As the P2M will still
> + * be accessible after, we need to prevent mapping to be added when the
> + * domain is dying.
> + */
> + if ( unlikely(p2m->domain->is_dying) )
> + return -EACCES;
> +
> + while ( left )
> + {
> + unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> + rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> + if ( rc )
> + break;
> +
> + sgfn = gfn_add(sgfn, BIT(order, UL));
> + if ( !mfn_eq(smfn, INVALID_MFN) )
> + smfn = mfn_add(smfn, BIT(order, UL));
> +
> + left -= BIT(order, UL);
> + }
> +
> + return !left ? 0 : left == nr ? rc : (nr - left);
The function returning "int", you may be truncating the return value here.
In the worst case indicating success (0) or an error (negative) when some
of the upper bits were set.
Also looks like you could get away with a single conditional operator here:
return !left || left == nr ? rc : (nr - left);
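(Illustration, not part of the patch: a small sketch of both points. The function names are hypothetical; only the two return expressions come from the thread. The return type is widened to long here purely so that the two forms can be compared without truncation, and the two forms agree only because rc is 0 whenever the loop runs to completion with left == 0.)

```c
#include <limits.h>
#include <stdbool.h>

/* Return expression as posted: two nested conditional operators. */
static long ret_as_posted(unsigned long nr, unsigned long left, int rc)
{
    return !left ? 0 : left == nr ? (long)rc : (long)(nr - left);
}

/* Suggested form: one conditional operator (rc is 0 when left is 0). */
static long ret_single(unsigned long nr, unsigned long left, int rc)
{
    return (!left || left == nr) ? (long)rc : (long)(nr - left);
}

/* True if a count of processed pages survives conversion to int unchanged. */
static bool fits_in_int(unsigned long done)
{
    return done <= (unsigned long)INT_MAX;
}
```

A count such as 1UL << 31 already fails fits_in_int() where int is 32 bits wide, which is the truncation hazard described above.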
Jan
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-04 15:19 ` Jan Beulich
@ 2025-08-06 11:33 ` Oleksii Kurochko
2025-08-06 12:05 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-06 11:33 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/4/25 5:19 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Current implementation is based on x86's way to allocate VMIDs:
>> VMIDs partition the physical TLB. In the current implementation VMIDs are
>> introduced to reduce the number of TLB flushes. Each time the guest's
>> virtual address space changes,
> virtual?
I assumed the original wording meant that, from Xen's point of view, the
guest's address space could be called virtual, as the guest doesn't really
work with real physical addresses. But it seems it would be clearer to use
guest-physical, as you suggested below.
>> instead of flushing the TLB, a new VMID is
>> assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
>> The biggest advantage is that hot parts of the hypervisor's code and data
>> retain in the TLB.
>>
>> VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
>> VMIDs are assigned in a round-robin scheme. To minimize the overhead of
>> VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
>> 64-bit generation. Only on a generation overflow the code needs to
>> invalidate all VMID information stored at the VCPUs with are run on the
>> specific physical processor. This overflow appears after about 2^80
>> host processor cycles,
> Where's this number coming from? (If you provide numbers, I think they will
> want to be "reproducible" by the reader. Which I fear isn't the case here.)
The 2^80 cycles (based on x86-related numbers) result from:
1. A VM-Entry/-Exit cycle takes at least 1800 cycles (approximated by 2^10)
2. We have 64 ASIDs (2^6)
3. 2^64 generations.
I removed this part of the comment earlier because I assumed that the first
item is quite CPU-dependent, even for x86, let alone for other architectures,
which may have a different number (?).
However, this part of the comment was reintroduced during one of the merges.
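(Illustration: written out, the estimate simply multiplies the three figures listed above. This restates the x86-derived numbers only; it is not a measurement for any RISC-V implementation.)

```latex
\underbrace{2^{10}}_{\text{cycles per VM entry/exit}} \times
\underbrace{2^{6}}_{\text{ASIDs per generation}} \times
\underbrace{2^{64}}_{\text{generations}}
= 2^{10+6+64} = 2^{80}\ \text{cycles}
```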
>> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>>
>> console_init_postirq();
>>
>> + vmid_init();
> This lives here only temporarily, I assume? Every hart will need to execute
> it, and hence (like we have it on x86) this may want to be a central place
> elsewhere.
I haven’t checked how it is done on x86; I probably should.
I planned to call it for each hart separately during secondary hart bring-up,
since accessing the hgatp register of a hart is required to detect VMIDLEN.
Therefore, vmid_init() should be called for secondary harts when their
initialization code starts executing.
>> --- /dev/null
>> +++ b/xen/arch/riscv/vmid.c
>> @@ -0,0 +1,165 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +
>> +#include <xen/domain.h>
>> +#include <xen/init.h>
>> +#include <xen/sections.h>
>> +#include <xen/lib.h>
>> +#include <xen/param.h>
>> +#include <xen/percpu.h>
>> +
>> +#include <asm/atomic.h>
>> +#include <asm/csr.h>
>> +#include <asm/flushtlb.h>
>> +
>> +/* Xen command-line option to enable VMIDs */
>> +static bool __read_mostly opt_vmid_enabled = true;
> __ro_after_init ?
Agree, __ro_after_init would be better.
>> +boolean_param("vmid", opt_vmid_enabled);
>> +
>> +/*
>> + * VMIDs partition the physical TLB. In the current implementation VMIDs are
>> + * introduced to reduce the number of TLB flushes. Each time the guest's
>> + * virtual address space changes, instead of flushing the TLB, a new VMID is
> The same odd "virtual" again? All the code here is about guest-physical, isn't
> it?
Answered above.
>> + * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
>> + * The biggest advantage is that hot parts of the hypervisor's code and data
>> + * retain in the TLB.
>> + *
>> + * Sketch of the Implementation:
>> + *
>> + * VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
>> + * VMIDs are assigned in a round-robin scheme. To minimize the overhead of
>> + * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
>> + * 64-bit generation. Only on a generation overflow the code needs to
>> + * invalidate all VMID information stored at the VCPUs with are run on the
>> + * specific physical processor. This overflow appears after about 2^80
> And the same interesting number again.
Answered above.
>> + * host processor cycles, so we do not optimize this case, but simply disable
>> + * VMID useage to retain correctness.
>> + */
>> +
>> +/* Per-Hart VMID management. */
>> +struct vmid_data {
>> + uint64_t hart_vmid_generation;
> Any reason not to simply use "generation"?
No specific reason for that, it could be renamed to "generation".
>> + uint16_t next_vmid;
>> + uint16_t max_vmid;
>> + bool disabled;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct vmid_data, vmid_data);
>> +
>> +static unsigned long vmidlen_detect(void)
> __init ? Or wait, are you (deliberately) permitting different VMIDLEN
> across harts?
All I was able to find in the RISC-V spec is that:
The number of VMID bits is UNSPECIFIED and may be zero. The number of
implemented VMID bits, termed VMIDLEN, may be determined by writing one
to every bit position in the VMID field, then reading back the value in
hgatp to see which bit positions in the VMID field hold a one. The least-
significant bits of VMID are implemented first: that is, if VMIDLEN > 0,
VMID[VMIDLEN-1:0] is writable. The maximal value of VMIDLEN, termed
VMIDMAX, is 7 for Sv32x4 or 14 for Sv39x4, Sv48x4, and Sv57x4.
And I couldn't find an explicit statement that VMIDLEN will be the same
across harts.
Therefore, IMO, while the specification doesn't say that VMIDLEN will
differ, the "unspecified" nature and the per-hart discovery mechanism
of VMIDLEN in the hgatp CSR allow VMIDLEN to be different on
different harts in an implementation without violating the
RISC-V privileged specification.
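(Illustration, not Xen code: the probing procedure quoted from the spec, run against a simulated register. fake_hgatp, fake_csr_write(), fake_csr_read() and implemented_vmid_bits are invented for this sketch, and the mask assumes the RV64 hgatp layout with the VMID field in bits 57:44. The count uses the highest set bit, as flsl() does in the patch.)

```c
#include <stdint.h>

#define HGATP_VMID_SHIFT 44
#define HGATP_VMID_MASK  (0x3FFFULL << HGATP_VMID_SHIFT)   /* bits 57:44 */

/* Simulated hart: only the low 'implemented_vmid_bits' of VMID are writable. */
static unsigned int implemented_vmid_bits = 7;
static uint64_t fake_hgatp;

static void fake_csr_write(uint64_t val)
{
    uint64_t writable =
        ((1ULL << implemented_vmid_bits) - 1) << HGATP_VMID_SHIFT;

    /* Unimplemented VMID bits read back as zero; other fields are kept. */
    fake_hgatp = (val & ~HGATP_VMID_MASK) | (val & writable);
}

static uint64_t fake_csr_read(void)
{
    return fake_hgatp;
}

/* 1-based index of the highest set bit, 0 if no bit is set. */
static unsigned int highest_set_bit(uint64_t x)
{
    unsigned int n = 0;

    while ( x )
    {
        n++;
        x >>= 1;
    }

    return n;
}

/* Write all ones to the VMID field, read back, count, restore. */
static unsigned int vmidlen_detect_model(void)
{
    uint64_t old = fake_csr_read();
    unsigned int len;

    fake_csr_write(old | HGATP_VMID_MASK);
    len = highest_set_bit((fake_csr_read() & HGATP_VMID_MASK) >>
                          HGATP_VMID_SHIFT);
    fake_csr_write(old);

    return len;
}
```

Because the simulated width is a per-hart variable here, the same routine yields different results on "harts" configured with different widths, which is the situation the per-CPU vmid_data is meant to cope with.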
>
>> +{
>> + unsigned long vmid_bits;
> Why "long" (also for the function return type)?
Because csr_read() returns unsigned long, as the HGATP register is
'unsigned long' wide.
But it could be done in this way:
csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
vmid_bits = ffs_g(vmid_bits);
csr_write(CSR_HGATP, old);
And then use uint16_t for vmid_bits and use uint16_t as a return type.
>
>> + unsigned long old;
>> +
>> + /* Figure-out number of VMID bits in HW */
>> + old = csr_read(CSR_HGATP);
>> +
>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>> + vmid_bits = csr_read(CSR_HGATP);
>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
> Nit: Stray blank.
>
>> + vmid_bits = flsl(vmid_bits);
>> + csr_write(CSR_HGATP, old);
>> +
>> + /*
>> + * We polluted local TLB so flush all guest TLB as
>> + * a speculative access can happen at any time.
>> + */
>> + local_hfence_gvma_all();
> There's no guest running. If you wrote hgat.MODE as zero, as per my
> understanding now new TLB entries could even purely theoretically appear.
It could be an issue (or, at least, it is recommended) when hgatp.MODE is
changed:
If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
(and rs2 set to either x0 or the VMID) must be executed to order subsequent
guest translations with the MODE change—even if the old MODE or new MODE
is Bare.
On the other hand, it is guaranteed that, at least, on reset (and so I assume
for power on):
If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
fields are reset to 0.
So it seems that if no guest has run, there is no need even to write
hgatp.MODE as zero, but it might make sense to do that explicitly just to
be sure.
I thought it was possible to have a running guest and perform a CPU hotplug.
In that case, I expect that during the hotplug, vmidlen_detect() will be
called and return the vmid_bits value, which is used as the active VMID.
At that moment, the local TLB could be speculatively polluted, I think.
Likely, it makes sense to call vmidlen_detect() only once for each hart
during initial bringup.
> In fact, with no guest running (yet) I'm having a hard time seeing why
> you shouldn't be able to simply write the register with just
> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
> whether "old" needs restoring; writing plain zero afterwards ought to
> suffice. You're in charcge of the register, after all.
It makes sense (but I don't know if it is a possible case) to make sure that
HGATP.MODE remains the same, so that no TLB flush is needed. If
HGATP.MODE is changed, a TLB flush will be needed, as I mentioned
above.
If we agree to keep local_hfence_gvma_all(), then I think there isn't really
any sense in restoring 'old' or OR-ing it with HGATP_VMID_MASK.
Generally, if 'old' is guaranteed to be zero (and, probably, it makes sense
to check that in vmidlen_detect() and panic if it isn't zero) and if
vmidlen_detect() is called before any guest domain(s) are run,
then I could agree that we don't need local_hfence_gvma_all() here.
As an option, we can do local_hfence_gvma_all() only if the 'old' value
wasn't zero.
Does it make sense?
>
>> + return vmid_bits;
>> +}
>> +
>> +void vmid_init(void)
>> +{
>> + static bool g_disabled = false;
>> + unsigned long vmid_len = vmidlen_detect();
>> + struct vmid_data *data = &this_cpu(vmid_data);
>> + unsigned long max_availalbe_bits = sizeof(data->max_vmid) << 3;
> Nit: Typo in "available". Also now that we have it, better use
> BITS_PER_BYTE here?
Sure, I will use BITS_PER_BYTE.
>
>> + if ( vmid_len > max_availalbe_bits )
>> + panic("%s: VMIDLEN is bigger then a type which represent VMID: %lu(%lu)\n",
>> + __func__, vmid_len, max_availalbe_bits);
> This shouldn't be a runtime check imo. What you want to check (at build
> time) is that the bits set in HGATP_VMID_MASK can be held in ->max_vmid.
Oh, I just noticed that this check isn't even really correct, because
data->max_vmid is initialized after this check.
Anyway, a build-time check would be better.
>
>> + data->max_vmid = BIT(vmid_len, U) - 1;
>> + data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
> Actually, what exactly does it mean that "VMIDs are disabled"? There's
> no enable bit that I could find anywhere. Isn't it rather that in this
> case you need to arrange to flush always on VM entry (or always after a
> P2M change, depending how the TLB is split between guest and host use)?
"VMIDs are disabled" here means that TLB flush will happen each time p2m
is changed.
> If you look at vmx_vmenter_helper(), its flipping of
> SECONDARY_EXEC_ENABLE_VPID tweaks CPU behavior, such that the flush
> would be implicit (when the bit is off). I don't expect RISC-V has any
> such "implicit" flushing behavior?
RISC-V relies on explicit software-managed fence instructions for TLB
flushing.
It seems like vmid_handle_vmenter() should be updated then to return
true if VMIDs are disabled:
bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
{
struct vmid_data *data = &this_cpu(vmid_data);
...
/*
* When we assign VMID 1, flush all TLB entries as we are starting a new
* generation, and all old VMID allocations are now stale.
*/
return (vmid->vmid == 1);
disabled:
vmid->vmid = 0;
- return 0;
+ return true;
}
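(Illustration, not the patch itself: a self-contained model of the round-robin scheme with generations, including the proposed "return true when VMIDs are not used" behaviour. It follows the description in the patch comment only loosely. The struct layouts and the name vmid_handle_vmenter_model() are simplified assumptions, the per-hart state is passed explicitly instead of via this_cpu(), and generation overflow handling is left out.)

```c
#include <stdbool.h>
#include <stdint.h>

struct vmid_data {             /* per-hart state */
    uint64_t generation;
    uint16_t next_vmid;
    uint16_t max_vmid;
    bool disabled;
};

struct vcpu_vmid {             /* per-vCPU state */
    uint64_t generation;       /* 0 means "no valid VMID yet" */
    uint16_t vmid;
};

/* Returns true if the caller must flush guest TLB entries before VM entry. */
static bool vmid_handle_vmenter_model(struct vmid_data *data,
                                      struct vcpu_vmid *vmid)
{
    if ( data->disabled )
    {
        /* VMIDs not in use: everything shares VMID 0, so always flush. */
        vmid->vmid = 0;
        return true;
    }

    /* The vCPU still holds a VMID from the current generation. */
    if ( vmid->generation == data->generation )
        return false;

    /* No free VMIDs left: start a new generation. */
    if ( data->next_vmid > data->max_vmid )
    {
        data->generation++;
        data->next_vmid = 1;
    }

    vmid->vmid = data->next_vmid++;
    vmid->generation = data->generation;

    /* Handing out VMID 1 marks the start of a generation: flush. */
    return vmid->vmid == 1;
}
```

With max_vmid set to 2, the third distinct vCPU entering on a hart starts a new generation and triggers a flush, while re-entering a vCPU whose generation still matches costs nothing.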
>
>> + if ( g_disabled != data->disabled )
>> + {
>> + printk("%s: VMIDs %sabled.\n", __func__,
>> + data->disabled ? "dis" : "en");
>> + if ( !g_disabled )
>> + g_disabled = data->disabled;
> This doesn't match x86 code. g_disabled is a tristate there, which only
> the boot CPU would ever write to.
Why is g_disabled written only by the boot CPU? Does x86 have only two
options: either VMIDs are enabled for all CPUs, or they are disabled for all
of them?
For RISC-V, as I mentioned above, all harts need to be checked, as the spec
doesn't explicitly mention that VMIDLEN is equal for all harts...
>
> A clear shortcoming of the x86 code (that you copied) is that the log
> message doesn't identify the CPU in question. A sequence of "disabled"
> and "enabled" could thus result, without the last one (or in fact any
> one) making clear what the overall state is. I think you want to avoid
> this from the beginning.
... Therefore it seems that the declaration of g_disabled should be moved
outside of vmid_init(), and a new function added which returns the g_disabled
value (or g_disabled could just be made non-static and renamed to something
like g_vmids_disabled).
The message would then be printed once, after all harts have been
initialized, somewhere in setup.c in start_xen() after:
    for_each_present_cpu ( i )
    {
        if ( (num_online_cpus() < nr_cpu_ids) && !cpu_online(i) )
        {
            int ret = cpu_up(i);
            if ( ret != 0 )
                printk("Failed to bring up CPU %u (error %d)\n", i, ret);
        }
    }
(RISC-V doesn't have such code at the moment)
~ Oleksii
* Re: [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization
2025-08-04 15:53 ` Jan Beulich
@ 2025-08-06 11:43 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-06 11:43 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/4/25 5:53 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -3,11 +3,45 @@
>> #define ASM__RISCV__P2M_H
>>
>> #include <xen/errno.h>
>> +#include <xen/mm.h>
>> +#include <xen/rwlock.h>
>> +#include <xen/types.h>
>>
>> #include <asm/page-bits.h>
>>
>> #define paddr_bits PADDR_BITS
>>
>> +/* Get host p2m table */
>> +#define p2m_get_hostp2m(d) (&(d)->arch.p2m)
>> +
>> +/* Per-p2m-table state */
>> +struct p2m_domain {
>> + /*
>> + * Lock that protects updates to the p2m.
>> + */
>> + rwlock_t lock;
>> +
>> + /* Pages used to construct the p2m */
>> + struct page_list_head pages;
>> +
>> + /* Indicate if it is required to clean the cache when writing an entry */
>> + bool clean_pte;
> I'm a little puzzled by this field still being here, despite the extensive
> revlog commentary. If you really feel you need to keep it, please ...
I still think it could be useful to have clean_pte, but likely not in this
patch. I will move its introduction to one of the later patches, where it
actually starts being used.
>
>> + /* Back pointer to domain */
>> + struct domain *domain;
>> +
>> + /*
>> + * P2M updates may required TLBs to be flushed (invalidated).
>> + *
>> + * Flushes may be deferred by setting 'need_flush' and then flushing
>> + * when the p2m write lock is released.
>> + *
>> + * If an immediate flush is required (e.g, if a super page is
>> + * shattered), call p2m_tlb_flush_sync().
>> + */
>> + bool need_flush;
> ... group booleans together, for better packing.
I will take that into account. Thanks.
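(Illustration of the packing remark: the two hypothetical layouts below hold the same members in a different order. The struct names are invented and the members are cut down from struct p2m_domain; exact sizes depend on the ABI, so the sketch only compares them.)

```c
#include <stdbool.h>
#include <stddef.h>

/* Booleans interleaved with pointers: each one drags padding along. */
struct p2m_spread {
    void *pages;
    bool clean_pte;      /* followed by padding up to pointer alignment */
    void *domain;
    bool need_flush;     /* followed by tail padding */
};

/* Booleans grouped at the end: both share one padded slot. */
struct p2m_grouped {
    void *pages;
    void *domain;
    bool clean_pte;
    bool need_flush;
};

/* Number of bytes saved by grouping, for the ABI being compiled for. */
static size_t bytes_saved(void)
{
    return sizeof(struct p2m_spread) - sizeof(struct p2m_grouped);
}
```

On a typical LP64 target the first layout occupies 32 bytes and the second 24.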
~ Oleksii
* Re: [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests
2025-08-05 10:40 ` Jan Beulich
@ 2025-08-06 12:01 ` Oleksii Kurochko
2025-08-06 12:07 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-06 12:01 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/5/25 12:40 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> @@ -30,3 +34,18 @@ int p2m_init(struct domain *d)
>>
>> return 0;
>> }
>> +
>> +/*
>> + * Set the pool of pages to the required number of pages.
>> + * Returns 0 for success, non-zero for failure.
>> + * Call with d->arch.paging.lock held.
>> + */
>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
> Noticed only when looking at the subsequent patch: With this being ...
>
>> +{
>> + int rc;
>> +
>> + if ( (rc = paging_freelist_init(d, pages, preempted)) )
> ... a caller of this function, the "init" in the name feels wrong.
I thought about paging_freelist_alloc(), but it feels wrong too, as it sounds
like the freelist itself is being allocated inside this function, whereas what
really happens is that pages are allocated and just added to / removed from
the freelist.
Maybe something like paging_freelist_resize() or *_adjust() would be better?
~ Oleksii
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-06 11:33 ` Oleksii Kurochko
@ 2025-08-06 12:05 ` Jan Beulich
2025-08-06 16:24 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-06 12:05 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 06.08.2025 13:33, Oleksii Kurochko wrote:
> On 8/4/25 5:19 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>>>
>>> console_init_postirq();
>>>
>>> + vmid_init();
>> This lives here only temporarily, I assume? Every hart will need to execute
>> it, and hence (like we have it on x86) this may want to be a central place
>> elsewhere.
>
> I haven’t checked how it is done on x86; I probably should.
>
> I planned to call it for each hart separately during secondary hart bring-up,
> since accessing the hgatp register of a hart is required to detect VMIDLEN.
> Therefore, vmid_init() should be called for secondary harts when their
> initialization code starts executing.
But is this going to be the only per-hart thing that will need doing? Otherwise
the same larger "container" function may want calling instead.
>>> +static unsigned long vmidlen_detect(void)
>> __init ? Or wait, are you (deliberately) permitting different VMIDLEN
>> across harts?
>
> All what I was able in RISC-V spec is that:
> The number of VMID bits is UNSPECIFIED and may be zero. The number of
> implemented VMID bits, termed VMIDLEN, may be determined by writing one
> to every bit position in the VMID field, then reading back the value in
> hgatp to see which bit positions in the VMID field hold a one. The least-
> significant bits of VMID are implemented first: that is, if VMIDLEN > 0,
> VMID[VMIDLEN-1:0] is writable. The maximal value of VMIDLEN, termed
> VMIDMAX, is 7 for Sv32x4 or 14 for Sv39x4, Sv48x4, and Sv57x4..
> And I couldn't find explicitly that VMIDLEN will be the same across harts.
>
> Therefore, IMO, while the specification doesn't guarantee VMID will be
> different, the "unspecified" nature and the per-hart discovery mechanism
> of VMIDLEN in the hgatp CSR allows for VMIDLEN to be different on
> different harts in an implementation without violating the
> RISC-V privileged specification.
Okay, since that's easily feasible with the present implementation, why not
keep it like that then.
>>> +{
>>> + unsigned long vmid_bits;
>> Why "long" (also for the function return type)?
>
> Because csr_read() returns unsigned long as HGATP register has
> 'unsigned long' length.
Oh, right, I should have commented on the function return type only.
Yet then I also can't resist stating that this kind of use of a variable,
which initially is assigned a value that doesn't really fit its name, is
easily misleading towards giving such comments.
> But it could be done in this way:
> csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
> vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
> vmid_bits = ffs_g(vmid_bits);
> csr_write(CSR_HGATP, old);
> And then use uint16_t for vmid_bits and use uin16_t as a return type.
Please check ./CODING_STYLE again as to the use of fixed-width types.
>>> + unsigned long old;
>>> +
>>> + /* Figure-out number of VMID bits in HW */
>>> + old = csr_read(CSR_HGATP);
>>> +
>>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>> + vmid_bits = csr_read(CSR_HGATP);
>>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>> Nit: Stray blank.
>>
>>> + vmid_bits = flsl(vmid_bits);
>>> + csr_write(CSR_HGATP, old);
>>> +
>>> + /*
>>> + * We polluted local TLB so flush all guest TLB as
>>> + * a speculative access can happen at any time.
>>> + */
>>> + local_hfence_gvma_all();
>> There's no guest running. If you wrote hgat.MODE as zero, as per my
>> understanding now new TLB entries could even purely theoretically appear.
>
> It could be an issue (or, at least, it is recommended) when hgatp.MODE is
> changed:
> If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
> (and rs2 set to either x0 or the VMID) must be executed to order subsequent
> guest translations with the MODE change—even if the old MODE or new MODE
> is Bare.
> On other hand it is guaranteed that, at least, on Reset (and so I assume
> for power on) that:
> If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
> fields are reset to 0.
>
> So it seems like if no guest is ran then there is no need even to write
> hgatp.MODE as zero, but it might be sense to do that explicitly just to
> be sure.
>
> I thought it was possible to have a running guest and perform a CPU hotplug.
But that guest will run on another hart.
> In that case, I expect that during the hotplug, vmidlen_detect() will be
> called and return the vmid_bits value, which is used as the active VMID.
> At that moment, the local TLB could be speculatively polluted, I think.
> Likely, it makes sense to call vmidlen_detect() only once for each hart
> during initial bringup.
That may bring you more problems than it solves. You'd need to stash away
the value originally read somewhere. And that somewhere isn't per-CPU data.
>> In fact, with no guest running (yet) I'm having a hard time seeing why
>> you shouldn't be able to simply write the register with just
>> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
>> whether "old" needs restoring; writing plain zero afterwards ought to
>> suffice. You're in charcge of the register, after all.
>
> It make sense (but I don't know if it is a possible case) to be sure that
> HGATP.MODE remains the same, so there is no need to have TLB flush. If
> HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
> above.
>
> If we agreed to keep local_hfence_gvma_all() then I think it isn't really
> any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.
>
> Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
> to check that in vmidlen_detect() and panic if it isn't zero) and if
> vmidlen_detect() function will be called before any guest domain(s) will
> be ran then I could agree that we don't need local_hfence_gvma_all() here.
>
> As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
> set to zero.
>
> Does it make sense?
Well - I'd like the pre-conditions to be understood better. For example, can
a hart really speculate into guest mode, when the hart is only in the
process of being brought up?
>>> + data->max_vmid = BIT(vmid_len, U) - 1;
>>> + data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
>> Actually, what exactly does it mean that "VMIDs are disabled"? There's
>> no enable bit that I could find anywhere. Isn't it rather that in this
>> case you need to arrange to flush always on VM entry (or always after a
>> P2M change, depending how the TLB is split between guest and host use)?
>
> "VMIDs are disabled" here means that TLB flush will happen each time p2m
> is changed.
That's better described as "VMIDs aren't used" then?
>>> + if ( g_disabled != data->disabled )
>>> + {
>>> + printk("%s: VMIDs %sabled.\n", __func__,
>>> + data->disabled ? "dis" : "en");
>>> + if ( !g_disabled )
>>> + g_disabled = data->disabled;
>> This doesn't match x86 code. g_disabled is a tristate there, which only
>> the boot CPU would ever write to.
>
> Why g_disabled is written only by boot CPU? Does x86 have only two options
> or VMIDs are enabled for all CPUs or it is disabled for all of them?
Did you look at the x86 code again, or the patch that I sent for it?
> For RISC-V as I mentioned above it is needed to check all harts as the spec.
> doesn't explicitly mention that VMIDLEN is equal for all harts...
Even if in practice x86 systems are symmetric in this regard, you will
have seen that we support varying values there as well. Up to and
including ASIDs being in use on some CPUs, but not on others. So that
code can serve as a reference for you, I think.
>> A clear shortcoming of the x86 code (that you copied) is that the log
>> message doesn't identify the CPU in question. A sequence of "disabled"
>> and "enabled" could thus result, without the last one (or in fact any
>> one) making clear what the overall state is. I think you want to avoid
>> this from the beginning.
>
> ... Thereby it seems like declaration of g_disabled should be moved outside
> vmid_init() function and add a new function which will return g_disabled
> value (or just make g_disabled not static and rename to something like
> g_vmids_disabled).
No, why? While I didn't Cc you on my patch submission, I specifically
replied to it with you (alone) on the To: list, just so you can look
there first before suggesting (sorry) odd things.
Jan
* Re: [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests
2025-08-06 12:01 ` Oleksii Kurochko
@ 2025-08-06 12:07 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-06 12:07 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 06.08.2025 14:01, Oleksii Kurochko wrote:
>
> On 8/5/25 12:40 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> @@ -30,3 +34,18 @@ int p2m_init(struct domain *d)
>>>
>>> return 0;
>>> }
>>> +
>>> +/*
>>> + * Set the pool of pages to the required number of pages.
>>> + * Returns 0 for success, non-zero for failure.
>>> + * Call with d->arch.paging.lock held.
>>> + */
>>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>> Noticed only when looking at the subsequent patch: With this being ...
>>
>>> +{
>>> + int rc;
>>> +
>>> + if ( (rc = paging_freelist_init(d, pages, preempted)) )
>> ... a caller of this function, the "init" in the name feels wrong.
>
> I thought about paging_freelist_alloc(), but it feels wrong too as it sounds like
> freelist is being allocated inside this functions, but what really happens that
> pages are allocated and just added/removed to/from freelist.
>
> Maybe something like paging_freelist_resize() or *_adjust() would be better?
Yes; whichever of the two you like better.
Jan
* Re: [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-07-31 15:58 ` [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
@ 2025-08-06 15:55 ` Jan Beulich
2025-08-14 15:09 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-06 15:55 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -79,10 +79,20 @@ typedef enum {
> p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
> p2m_grant_map_rw, /* Read/write grant mapping */
> p2m_grant_map_ro, /* Read-only grant mapping */
> + p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
> + p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
> } p2m_type_t;
>
> #define p2m_mmio_direct p2m_mmio_direct_io
>
> +/*
> + * Bits 8 and 9 are reserved for use by supervisor software;
> + * the implementation shall ignore this field.
> + * We are going to use to save in these bits frequently used types to avoid
> + * get/set of a type from radix tree.
> + */
> +#define P2M_TYPE_PTE_BITS_MASK 0x300
> +
> /* We use bitmaps and mask to handle groups of types */
> #define p2m_to_mask(t_) BIT(t_, UL)
>
> @@ -93,10 +103,16 @@ typedef enum {
> #define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
> p2m_to_mask(p2m_grant_map_ro))
>
> + /* Foreign mappings types */
Nit: Why so far to the right?
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -197,6 +197,16 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> return __map_domain_page(p2m->root + root_table_indx);
> }
>
> +static p2m_type_t p2m_get_type(const pte_t pte)
> +{
> + p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
> +
> + if ( type == p2m_ext_storage )
> + panic("unimplemented\n");
That is, as per p2m.h additions you pretend to add support for foreign types
here, but then you don't?
> @@ -248,11 +258,136 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> return P2M_TABLE_MAP_NONE;
> }
>
> +static void p2m_put_foreign_page(struct page_info *pg)
> +{
> + /*
> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
> + * will be invoked if the page is reallocated before the end of
> + * this loop, which will trigger a flush of the guest TLBs.
> + */
> + put_page(pg);
> +}
How can one know the comment is true? arch_flush_tlb_mask() still lives in
stubs.c, and hence what it is eventually going to do (something like Arm's
vs more like x86'es) is entirely unknown right now.
> +/* Put any references on the single 4K page referenced by mfn. */
> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
> +{
> + /* TODO: Handle other p2m types */
> +
> + if ( p2m_is_foreign(type) )
> + {
> + ASSERT(mfn_valid(mfn));
> + p2m_put_foreign_page(mfn_to_page(mfn));
> + }
> +
> + /*
> + * Detect the xenheap page and mark the stored GFN as invalid.
> + * We don't free the underlying page until the guest requested to do so.
> + * So we only need to tell the page is not mapped anymore in the P2M by
> + * marking the stored GFN as invalid.
> + */
> + if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
> + page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
Isn't this for grants? p2m_is_ram() doesn't cover p2m_grant_map_*.
> +}
> +
> +/* Put any references on the superpage referenced by mfn. */
> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
> +{
> + struct page_info *pg;
> + unsigned int i;
> +
> + ASSERT(mfn_valid(mfn));
> +
> + pg = mfn_to_page(mfn);
> +
> + for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
> + p2m_put_foreign_page(pg);
> +}
In p2m_put_4k_page() you check the type, whereas here you don't.
> +/* Put any references on the page referenced by pte. */
> +static void p2m_put_page(const pte_t pte, unsigned int level)
> +{
> + mfn_t mfn = pte_get_mfn(pte);
> + p2m_type_t p2m_type = p2m_get_type(pte);
> +
> + ASSERT(pte_is_valid(pte));
> +
> + /*
> + * TODO: Currently we don't handle level 2 super-page, Xen is not
> + * preemptible and therefore some work is needed to handle such
> + * superpages, for which at some point Xen might end up freeing memory
> + * and therefore for such a big mapping it could end up in a very long
> + * operation.
> + */
> + switch ( level )
> + {
> + case 1:
> + return p2m_put_2m_superpage(mfn, p2m_type);
> +
> + case 0:
> + return p2m_put_4k_page(mfn, p2m_type);
> + }
Yet despite the comment not even an assertion for level 2 and up?
> /* Free pte sub-tree behind an entry */
> static void p2m_free_subtree(struct p2m_domain *p2m,
> pte_t entry, unsigned int level)
> {
> - panic("%s: hasn't been implemented yet\n", __func__);
> + unsigned int i;
> + pte_t *table;
> + mfn_t mfn;
> + struct page_info *pg;
> +
> + /* Nothing to do if the entry is invalid. */
> + if ( !pte_is_valid(entry) )
> + return;
> +
> + if ( pte_is_superpage(entry, level) || (level == 0) )
Perhaps swap the two conditions around?
> + {
> +#ifdef CONFIG_IOREQ_SERVER
> + /*
> + * If this gets called then either the entry was replaced by an entry
> + * with a different base (valid case) or the shattering of a superpage
> + * has failed (error case).
> + * So, at worst, the spurious mapcache invalidation might be sent.
> + */
> + if ( p2m_is_ram(p2m_get_type(p2m, entry)) &&
> + domain_has_ioreq_server(p2m->domain) )
> + ioreq_request_mapcache_invalidate(p2m->domain);
> +#endif
> +
> + p2m_put_page(entry, level);
> +
> + return;
> + }
> +
> + table = map_domain_page(pte_get_mfn(entry));
> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
> + p2m_free_subtree(p2m, table[i], level - 1);
In p2m_put_page() you comment towards concerns for level >= 2; no similar
concerns for the resulting recursion here?
> + unmap_domain_page(table);
> +
> + /*
> + * Make sure all the references in the TLB have been removed before
> + * freing the intermediate page table.
> + * XXX: Should we defer the free of the page table to avoid the
> + * flush?
> + */
> + p2m_tlb_flush_sync(p2m);
> +
> + mfn = pte_get_mfn(entry);
> + ASSERT(mfn_valid(mfn));
> +
> + pg = mfn_to_page(mfn);
> +
> + page_list_del(pg, &p2m->pages);
> + p2m_free_page(p2m, pg);
Once again I wonder whether this code path was actually tested: p2m_free_page()
also invokes page_list_del(), and double deletions typically won't end very
well.
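To illustrate the concern with a standalone sketch (not Xen code; all demo_*
names are hypothetical): if the free helper already unlinks the page from the
list, the caller must not unlink it a second time. Poisoning the links on
deletion lets an assertion catch the double deletion.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive circular doubly-linked list. */
struct demo_node {
    struct demo_node *prev, *next;
};

static void demo_list_init(struct demo_node *head)
{
    head->prev = head->next = head;
}

static void demo_list_add(struct demo_node *n, struct demo_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void demo_list_del(struct demo_node *n)
{
    /* Links are NULLed on removal, so a second deletion trips here. */
    assert(n->prev != NULL && n->next != NULL);

    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
}

static size_t demo_list_len(const struct demo_node *head)
{
    size_t len = 0;
    const struct demo_node *n;

    for ( n = head->next; n != head; n = n->next )
        len++;

    return len;
}

/*
 * Hypothetical analogue of a free helper that owns the unlinking step.
 * Callers hand the node over and do not call demo_list_del() themselves.
 */
static void demo_free_page(struct demo_node *pg)
{
    demo_list_del(pg);
    /* ... return the page to its pool here ... */
}
```

Keeping the unlink in exactly one place (here, the free helper) makes the
ownership rule easy to check at each call site.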
Jan
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-06 12:05 ` Jan Beulich
@ 2025-08-06 16:24 ` Oleksii Kurochko
2025-08-06 16:50 ` Demi Marie Obenour
2025-08-07 10:11 ` Jan Beulich
0 siblings, 2 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-06 16:24 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/6/25 2:05 PM, Jan Beulich wrote:
> On 06.08.2025 13:33, Oleksii Kurochko wrote:
>> On 8/4/25 5:19 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>>>>
>>>> console_init_postirq();
>>>>
>>>> + vmid_init();
>>> This lives here only temporarily, I assume? Every hart will need to execute
>>> it, and hence (like we have it on x86) this may want to be a central place
>>> elsewhere.
>> I haven’t checked how it is done on x86; I probably should.
>>
>> I planned to call it for each hart separately during secondary hart bring-up,
>> since accessing the|hgatp| register of a hart is required to detect|VMIDLEN|.
>> Therefore,|vmid_init()| should be called for secondary harts when their
>> initialization code starts executing.
> But is this going to be the only per-hart thing that will need doing? Otherwise
> the same larger "container" function may want calling instead.
Yes, it is going to be the only per-hart operation.
There is __cpu_up() (not yet upstreamed [1]), which calls
sbi_hsm_hart_start(hartid, boot_addr, hsm_data) to start a hart, and I planned
to place vmid_init() somewhere in the code executed at boot_addr.
[1] https://gitlab.com/xen-project/people/olkur/xen/-/blob/latest/xen/arch/riscv/smpboot.c#L40
>>>> +{
>>>> + unsigned long vmid_bits;
>>> Why "long" (also for the function return type)?
>> Because csr_read() returns unsigned long as HGATP register has
>> 'unsigned long' length.
> Oh, right, I should have commented on the function return type only.
> Yet then I also can't resist stating that this kind of use of a variable,
> which initially is assigned a value that doesn't really fit its name, is
> easily misleading towards giving such comments.
>
>> But it could be done in this way:
>> csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>> vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
>> vmid_bits = ffs_g(vmid_bits);
>> csr_write(CSR_HGATP, old);
>> And then use uint16_t for vmid_bits and use uin16_t as a return type.
> Please check ./CODING_STYLE again as to the use of fixed-width types.
I meant unsigned short; uint16_t was just shorter to write. I'll try to be
more specific.
>
>>>> + unsigned long old;
>>>> +
>>>> + /* Figure-out number of VMID bits in HW */
>>>> + old = csr_read(CSR_HGATP);
>>>> +
>>>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>>> + vmid_bits = csr_read(CSR_HGATP);
>>>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>>> Nit: Stray blank.
>>>
>>>> + vmid_bits = flsl(vmid_bits);
>>>> + csr_write(CSR_HGATP, old);
>>>> +
>>>> + /*
>>>> + * We polluted local TLB so flush all guest TLB as
>>>> + * a speculative access can happen at any time.
>>>> + */
>>>> + local_hfence_gvma_all();
>>> There's no guest running. If you wrote hgat.MODE as zero, as per my
>>> understanding now new TLB entries could even purely theoretically appear.
>> It could be an issue (or, at least, it is recommended) when hgatp.MODE is
>> changed:
>> If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
>> (and rs2 set to either x0 or the VMID) must be executed to order subsequent
>> guest translations with the MODE change—even if the old MODE or new MODE
>> is Bare.
>> On other hand it is guaranteed that, at least, on Reset (and so I assume
>> for power on) that:
>> If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
>> fields are reset to 0.
>>
>> So it seems like if no guest is ran then there is no need even to write
>> hgatp.MODE as zero, but it might be sense to do that explicitly just to
>> be sure.
>>
>> I thought it was possible to have a running guest and perform a CPU hotplug.
> But that guest will run on another hart.
>
>> In that case, I expect that during the hotplug,|vmidlen_detect()| will be
>> called and return the|vmid_bits| value, which is used as the active VMID.
>> At that moment, the local TLB could be speculatively polluted, I think.
>> Likely, it makes sense to call vmidlen_detect() only once for each hart
>> during initial bringup.
> That may bring you more problems than it solves. You'd need to stash away
> the value originally read somewhere. And that somewhere isn't per-CPU data.
>
>>> In fact, with no guest running (yet) I'm having a hard time seeing why
>>> you shouldn't be able to simply write the register with just
>>> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
>>> whether "old" needs restoring; writing plain zero afterwards ought to
>>> suffice. You're in charcge of the register, after all.
>> It make sense (but I don't know if it is a possible case) to be sure that
>> HGATP.MODE remains the same, so there is no need to have TLB flush. If
>> HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
>> above.
>>
>> If we agreed to keep local_hfence_gvma_all() then I think it isn't really
>> any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.
>>
>> Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
>> to check that in vmidlen_detect() and panic if it isn't zero) and if
>> vmidlen_detect() function will be called before any guest domain(s) will
>> be ran then I could agree that we don't need local_hfence_gvma_all() here.
>>
>> As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
>> set to zero.
>>
>> Does it make sense?
> Well - I'd like the pre-conditions to be understood better. For example, can
> a hart really speculate into guest mode, when the hart is only in the
> process of being brought up?
I couldn't find explicit wording stating that a hart can't speculate into
guest mode, either during bring-up or during normal operation.
But there are some passages in the spec which say:
Implementations with virtual memory are permitted to perform address
translations speculatively and earlier than required by an explicit
memory access, and are permitted to cache them in address translation
cache structures—including possibly caching the identity mappings from
effective address to physical address used in Bare translation modes and
M-mode.
And here:
Implementations may also execute the address-translation algorithm
speculatively at any time, for any virtual address, as long as satp is
active (as defined in Section 10.1.11). Such speculative executions have
the effect of pre-populating the address-translation cache.
Here it is explicitly mentioned that speculation can happen at *any time*.
And at the same time:
Speculative executions of the address-translation algorithm behave as
non-speculative executions of the algorithm do, except that they must
not set the dirty bit for a PTE, they must not trigger an exception,
and they must not create address-translation cache entries if those
entries would have been invalidated by any SFENCE.VMA instruction
executed by the hart since the speculative execution of the algorithm began.
Which I read as: if the TLB was empty before, it will stay empty.
Also, despite the fact that the spec says two-stage address translation is
inactive when V=0:
The current virtualization mode, denoted V, indicates whether the hart is
currently executing in a guest. When V=1, the hart is either in virtual
S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest OS running
in VS-mode. When V=0, the hart is either in M-mode, in HS-mode, or in
U-mode atop an OS running in HS-mode. The virtualization mode also
indicates whether two-stage address translation is active (V=1) or
inactive (V=0).
But at the same time, writing to the hgatp register activates it:
The hgatp register is considered active for the purposes of
the address-translation algorithm unless the effective privilege mode
is U and hstatus.HU=0.
If so, and considering that speculation could happen at any time and that
we are in HS-mode, not in U-mode, then I would say that it really could
speculate into guest mode.
>
>>>> + data->max_vmid = BIT(vmid_len, U) - 1;
>>>> + data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
>>> Actually, what exactly does it mean that "VMIDs are disabled"? There's
>>> no enable bit that I could find anywhere. Isn't it rather that in this
>>> case you need to arrange to flush always on VM entry (or always after a
>>> P2M change, depending how the TLB is split between guest and host use)?
>> "VMIDs are disabled" here means that TLB flush will happen each time p2m
>> is changed.
> That's better described as "VMIDs aren't used" then?
To me it sounds like just the opposite wording of "disabled" (i.e. it means
basically the same), but I am okay with using "used" instead.
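As a small standalone sketch of the two assignments quoted above (the struct
and function names are hypothetical, and the 32-bit limit is only an
assumption of this sketch):

```c
#include <assert.h>
#include <stdbool.h>

struct demo_vmid_data {
    unsigned int max_vmid;
    bool disabled; /* "VMIDs aren't used": flush on every p2m change */
};

static void demo_vmid_setup(struct demo_vmid_data *data,
                            unsigned int vmid_len, bool opt_vmid_enabled)
{
    assert(vmid_len < 32); /* keep the shift below well defined */

    /* Largest VMID representable in vmid_len bits. */
    data->max_vmid = (1U << vmid_len) - 1;

    /* With 0 or 1 bits there is nothing useful to allocate from. */
    data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
}
```

For example, vmid_len = 7 gives max_vmid = 127 with VMIDs in use, while
vmid_len = 1 leaves them unused regardless of the command line option.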
>
>>>> + if ( g_disabled != data->disabled )
>>>> + {
>>>> + printk("%s: VMIDs %sabled.\n", __func__,
>>>> + data->disabled ? "dis" : "en");
>>>> + if ( !g_disabled )
>>>> + g_disabled = data->disabled;
>>> This doesn't match x86 code. g_disabled is a tristate there, which only
>>> the boot CPU would ever write to.
>> Why g_disabled is written only by boot CPU? Does x86 have only two options
>> or VMIDs are enabled for all CPUs or it is disabled for all of them?
> Did you look at the x86 code again, or the patch that I sent for it?
>
>> For RISC-V as I mentioned above it is needed to check all harts as the spec.
>> doesn't explicitly mention that VMIDLEN is equal for all harts...
> Even if in practice x86 systems are symmetric in this regard, you will
> have seen that we support varying values there as well. Up to and
> including ASIDs being in use on some CPUs, but not on others. So that
> code can serve as a reference for you, I think.
>
>>> A clear shortcoming of the x86 code (that you copied) is that the log
>>> message doesn't identify the CPU in question. A sequence of "disabled"
>>> and "enabled" could thus result, without the last one (or in fact any
>>> one) making clear what the overall state is. I think you want to avoid
>>> this from the beginning.
>> ... Thereby it seems like declaration of g_disabled should be moved outside
>> vmid_init() function and add a new function which will return g_disabled
>> value (or just make g_disabled not static and rename to something like
>> g_vmids_disabled).
> No, why? While I didn't Cc you on my patch submission, I specifically
> replied to it with you (alone) on the To: list, just so you can look
> there first before suggesting (sorry) odd things.
I haven't received this e-mail, but I was able to find it on lore.kernel.org.
Thanks.
~ Oleksii
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-06 16:24 ` Oleksii Kurochko
@ 2025-08-06 16:50 ` Demi Marie Obenour
2025-08-07 8:43 ` Oleksii Kurochko
2025-08-07 10:11 ` Jan Beulich
1 sibling, 1 reply; 84+ messages in thread
From: Demi Marie Obenour @ 2025-08-06 16:50 UTC (permalink / raw)
To: Oleksii Kurochko, Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/6/25 12:24, Oleksii Kurochko wrote:
>
> On 8/6/25 2:05 PM, Jan Beulich wrote:
>> On 06.08.2025 13:33, Oleksii Kurochko wrote:
>>> On 8/4/25 5:19 PM, Jan Beulich wrote:
>>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>>> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>>>>>
>>>>> console_init_postirq();
>>>>>
>>>>> + vmid_init();
>>>> This lives here only temporarily, I assume? Every hart will need to execute
>>>> it, and hence (like we have it on x86) this may want to be a central place
>>>> elsewhere.
>>> I haven’t checked how it is done on x86; I probably should.
>>>
>>> I planned to call it for each hart separately during secondary hart bring-up,
>>> since accessing the|hgatp| register of a hart is required to detect|VMIDLEN|.
>>> Therefore,|vmid_init()| should be called for secondary harts when their
>>> initialization code starts executing.
>> But is this going to be the only per-hart thing that will need doing? Otherwise
>> the same larger "container" function may want calling instead.
>
> Yes, it is going to be the only per-hart operation.
>
> There is|__cpu_up()| (not yet upstreamed [1]), which calls
> |sbi_hsm_hart_start(hartid, boot_addr, hsm_data)| to start a hart, and I planned
> to place|vmid_init()| somewhere in the code executed at|boot_addr|.
>
> [1]https://gitlab.com/xen-project/people/olkur/xen/-/blob/latest/xen/arch/riscv/smpboot.c#L40
>
>>>>> +{
>>>>> + unsigned long vmid_bits;
>>>> Why "long" (also for the function return type)?
>>> Because csr_read() returns unsigned long as HGATP register has
>>> 'unsigned long' length.
>> Oh, right, I should have commented on the function return type only.
>> Yet then I also can't resist stating that this kind of use of a variable,
>> which initially is assigned a value that doesn't really fit its name, is
>> easily misleading towards giving such comments.
>>
>>> But it could be done in this way:
>>> csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>> vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
>>> vmid_bits = ffs_g(vmid_bits);
>>> csr_write(CSR_HGATP, old);
>>> And then use uint16_t for vmid_bits and use uin16_t as a return type.
>> Please check ./CODING_STYLE again as to the use of fixed-width types.
>
> I meant unsigned short, uint16_t was just short to write. I'll try to be
> more specific.
>
>>
>>>>> + unsigned long old;
>>>>> +
>>>>> + /* Figure-out number of VMID bits in HW */
>>>>> + old = csr_read(CSR_HGATP);
>>>>> +
>>>>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>>>> + vmid_bits = csr_read(CSR_HGATP);
>>>>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>>>> Nit: Stray blank.
>>>>
>>>>> + vmid_bits = flsl(vmid_bits);
>>>>> + csr_write(CSR_HGATP, old);
>>>>> +
>>>>> + /*
>>>>> + * We polluted local TLB so flush all guest TLB as
>>>>> + * a speculative access can happen at any time.
>>>>> + */
>>>>> + local_hfence_gvma_all();
>>>> There's no guest running. If you wrote hgat.MODE as zero, as per my
>>>> understanding now new TLB entries could even purely theoretically appear.
>>> It could be an issue (or, at least, it is recommended) when hgatp.MODE is
>>> changed:
>>> If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
>>> (and rs2 set to either x0 or the VMID) must be executed to order subsequent
>>> guest translations with the MODE change—even if the old MODE or new MODE
>>> is Bare.
>>> On other hand it is guaranteed that, at least, on Reset (and so I assume
>>> for power on) that:
>>> If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
>>> fields are reset to 0.
>>>
>>> So it seems like if no guest is ran then there is no need even to write
>>> hgatp.MODE as zero, but it might be sense to do that explicitly just to
>>> be sure.
>>>
>>> I thought it was possible to have a running guest and perform a CPU hotplug.
>> But that guest will run on another hart.
>>
>>> In that case, I expect that during the hotplug,|vmidlen_detect()| will be
>>> called and return the|vmid_bits| value, which is used as the active VMID.
>>> At that moment, the local TLB could be speculatively polluted, I think.
>>> Likely, it makes sense to call vmidlen_detect() only once for each hart
>>> during initial bringup.
>> That may bring you more problems than it solves. You'd need to stash away
>> the value originally read somewhere. And that somewhere isn't per-CPU data.
>>
>>>> In fact, with no guest running (yet) I'm having a hard time seeing why
>>>> you shouldn't be able to simply write the register with just
>>>> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
>>>> whether "old" needs restoring; writing plain zero afterwards ought to
>>>> suffice. You're in charcge of the register, after all.
>>> It make sense (but I don't know if it is a possible case) to be sure that
>>> HGATP.MODE remains the same, so there is no need to have TLB flush. If
>>> HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
>>> above.
>>>
>>> If we agreed to keep local_hfence_gvma_all() then I think it isn't really
>>> any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.
>>>
>>> Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
>>> to check that in vmidlen_detect() and panic if it isn't zero) and if
>>> vmidlen_detect() function will be called before any guest domain(s) will
>>> be ran then I could agree that we don't need local_hfence_gvma_all() here.
>>>
>>> As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
>>> set to zero.
>>>
>>> Does it make sense?
>> Well - I'd like the pre-conditions to be understood better. For example, can
>> a hart really speculate into guest mode, when the hart is only in the
>> process of being brought up?
>
> I couldn't explicit words that a hart can't speculate into guest mode
> either on bring up or during its work.
>
> But there are some moments in the spec which tells:
> Implementations with virtual memory are permitted to perform address
> translations speculatively and earlier than required by an explicit
> memory access, and are permitted to cache them in address translation
> cache structures—including possibly caching the identity mappings from
> effective address to physical address used in Bare translation modes and
> M-mode.
> And here:
> Implementations may also execute the address-translation algorithm
> speculatively at any time, for any virtual address, as long as satp is
> active (as defined in Section 10.1.11). Such speculative executions have
> the effect of pre-populating the address-translation cache.
> Where it is explicitly mentioned that speculation can happen in *any time*.
> And at the same time:
> Speculative executions of the address-translation algorithm behave as
> non-speculative executions of the algorithm do, except that they must
> not set the dirty bit for a PTE, they must not trigger an exception,
> and they must not create address-translation cache entries if those
> entries would have been invalidated by any SFENCE.VMA instruction
> executed by the hart since the speculative execution of the algorithm began.
> What I read as if TLB was empty before it will stay empty.
I read that as "flushing the TLB invalidates entries created by speculative
execution before the TLB flush". That is the bare minimum needed for TLB
flushing to work. You have to do the TLB flush *after* changing the PTEs,
not before.
This is true on at least x86 but I expect it to hold in general.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-06 16:50 ` Demi Marie Obenour
@ 2025-08-07 8:43 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 8:43 UTC (permalink / raw)
To: Demi Marie Obenour, Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/6/25 6:50 PM, Demi Marie Obenour wrote:
> On 8/6/25 12:24, Oleksii Kurochko wrote:
>> On 8/6/25 2:05 PM, Jan Beulich wrote:
>>> On 06.08.2025 13:33, Oleksii Kurochko wrote:
>>>> On 8/4/25 5:19 PM, Jan Beulich wrote:
>>>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>>>> @@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
>>>>>>
>>>>>> console_init_postirq();
>>>>>>
>>>>>> + vmid_init();
>>>>> This lives here only temporarily, I assume? Every hart will need to execute
>>>>> it, and hence (like we have it on x86) this may want to be a central place
>>>>> elsewhere.
>>>> I haven’t checked how it is done on x86; I probably should.
>>>>
>>>> I planned to call it for each hart separately during secondary hart bring-up,
>>>> since accessing the|hgatp| register of a hart is required to detect|VMIDLEN|.
>>>> Therefore,|vmid_init()| should be called for secondary harts when their
>>>> initialization code starts executing.
>>> But is this going to be the only per-hart thing that will need doing? Otherwise
>>> the same larger "container" function may want calling instead.
>> Yes, it is going to be the only per-hart operation.
>>
>> There is|__cpu_up()| (not yet upstreamed [1]), which calls
>> |sbi_hsm_hart_start(hartid, boot_addr, hsm_data)| to start a hart, and I planned
>> to place|vmid_init()| somewhere in the code executed at|boot_addr|.
>>
>> [1]https://gitlab.com/xen-project/people/olkur/xen/-/blob/latest/xen/arch/riscv/smpboot.c#L40
>>
>>>>>> +{
>>>>>> + unsigned long vmid_bits;
>>>>> Why "long" (also for the function return type)?
>>>> Because csr_read() returns unsigned long as HGATP register has
>>>> 'unsigned long' length.
>>> Oh, right, I should have commented on the function return type only.
>>> Yet then I also can't resist stating that this kind of use of a variable,
>>> which initially is assigned a value that doesn't really fit its name, is
>>> easily misleading towards giving such comments.
>>>
>>>> But it could be done in this way:
>>>> csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>>> vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
>>>> vmid_bits = ffs_g(vmid_bits);
>>>> csr_write(CSR_HGATP, old);
>>>> And then use uint16_t for vmid_bits and use uin16_t as a return type.
>>> Please check ./CODING_STYLE again as to the use of fixed-width types.
>> I meant unsigned short, uint16_t was just short to write. I'll try to be
>> more specific.
>>
>>>>>> + unsigned long old;
>>>>>> +
>>>>>> + /* Figure-out number of VMID bits in HW */
>>>>>> + old = csr_read(CSR_HGATP);
>>>>>> +
>>>>>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>>>>> + vmid_bits = csr_read(CSR_HGATP);
>>>>>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>>>>> Nit: Stray blank.
>>>>>
>>>>>> + vmid_bits = flsl(vmid_bits);
>>>>>> + csr_write(CSR_HGATP, old);
>>>>>> +
>>>>>> + /*
>>>>>> + * We polluted local TLB so flush all guest TLB as
>>>>>> + * a speculative access can happen at any time.
>>>>>> + */
>>>>>> + local_hfence_gvma_all();
>>>>> There's no guest running. If you wrote hgat.MODE as zero, as per my
>>>>> understanding now new TLB entries could even purely theoretically appear.
>>>> It could be an issue (or, at least, it is recommended) when hgatp.MODE is
>>>> changed:
>>>> If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
>>>> (and rs2 set to either x0 or the VMID) must be executed to order subsequent
>>>> guest translations with the MODE change—even if the old MODE or new MODE
>>>> is Bare.
>>>> On other hand it is guaranteed that, at least, on Reset (and so I assume
>>>> for power on) that:
>>>> If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
>>>> fields are reset to 0.
>>>>
>>>> So it seems like if no guest is ran then there is no need even to write
>>>> hgatp.MODE as zero, but it might be sense to do that explicitly just to
>>>> be sure.
>>>>
>>>> I thought it was possible to have a running guest and perform a CPU hotplug.
>>> But that guest will run on another hart.
>>>
>>>> In that case, I expect that during the hotplug,|vmidlen_detect()| will be
>>>> called and return the|vmid_bits| value, which is used as the active VMID.
>>>> At that moment, the local TLB could be speculatively polluted, I think.
>>>> Likely, it makes sense to call vmidlen_detect() only once for each hart
>>>> during initial bringup.
>>> That may bring you more problems than it solves. You'd need to stash away
>>> the value originally read somewhere. And that somewhere isn't per-CPU data.
>>>
>>>>> In fact, with no guest running (yet) I'm having a hard time seeing why
>>>>> you shouldn't be able to simply write the register with just
>>>>> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
>>>>> whether "old" needs restoring; writing plain zero afterwards ought to
>>>>> suffice. You're in charcge of the register, after all.
>>>> It make sense (but I don't know if it is a possible case) to be sure that
>>>> HGATP.MODE remains the same, so there is no need to have TLB flush. If
>>>> HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
>>>> above.
>>>>
>>>> If we agreed to keep local_hfence_gvma_all() then I think it isn't really
>>>> any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.
>>>>
>>>> Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
>>>> to check that in vmidlen_detect() and panic if it isn't zero) and if
>>>> vmidlen_detect() function will be called before any guest domain(s) will
>>>> be ran then I could agree that we don't need local_hfence_gvma_all() here.
>>>>
>>>> As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
>>>> set to zero.
>>>>
>>>> Does it make sense?
>>> Well - I'd like the pre-conditions to be understood better. For example, can
>>> a hart really speculate into guest mode, when the hart is only in the
>>> process of being brought up?
>> I couldn't explicit words that a hart can't speculate into guest mode
>> either on bring up or during its work.
>>
>> But there are some moments in the spec which tells:
>> Implementations with virtual memory are permitted to perform address
>> translations speculatively and earlier than required by an explicit
>> memory access, and are permitted to cache them in address translation
>> cache structures—including possibly caching the identity mappings from
>> effective address to physical address used in Bare translation modes and
>> M-mode.
>> And here:
>> Implementations may also execute the address-translation algorithm
>> speculatively at any time, for any virtual address, as long as satp is
>> active (as defined in Section 10.1.11). Such speculative executions have
>> the effect of pre-populating the address-translation cache.
>> Where it is explicitly mentioned that speculation can happen in *any time*.
>> And at the same time:
>> Speculative executions of the address-translation algorithm behave as
>> non-speculative executions of the algorithm do, except that they must
>> not set the dirty bit for a PTE, they must not trigger an exception,
>> and they must not create address-translation cache entries if those
>> entries would have been invalidated by any SFENCE.VMA instruction
>> executed by the hart since the speculative execution of the algorithm began.
>> What I read as if TLB was empty before it will stay empty.
> I read that as "flushing the TLB invalidates entries created by speculative
> execution before the TLB flush".
But what about this part:
they must not create address-translation cache entries if those entries
would have been invalidated by any SFENCE.VMA instruction
Doesn't it mean that entries which would have been invalidated by SFENCE.VMA
can't be inserted into the TLB during speculative execution?
So, if the speculative page walk started before SFENCE.VMA, then SFENCE.VMA
indicates: “All previous TLB entries might be invalid". Therefore, any
speculative TLB entry from a walk that started before it must *not* be
inserted into the TLB afterward.
So, hardware tracks whether an SFENCE.VMA occurred *after* speculation started.
If so, any speculative address translations must be *discarded* or
*not committed*.
> That is the bare minimum needed for TLB
> flushing to work. You have to do the TLB flush *after* changing the PTEs,
> not before.
>
> This is true on at least x86 but I expect it to hold in general.
Agree with that.
~ Oleksii
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-06 16:24 ` Oleksii Kurochko
2025-08-06 16:50 ` Demi Marie Obenour
@ 2025-08-07 10:11 ` Jan Beulich
2025-08-07 14:45 ` Oleksii Kurochko
1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-07 10:11 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 06.08.2025 18:24, Oleksii Kurochko wrote:
> On 8/6/25 2:05 PM, Jan Beulich wrote:
>> On 06.08.2025 13:33, Oleksii Kurochko wrote:
>>> On 8/4/25 5:19 PM, Jan Beulich wrote:
>>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>>> +{
>>>>> + unsigned long vmid_bits;
>>>> Why "long" (also for the function return type)?
>>> Because csr_read() returns unsigned long as HGATP register has
>>> 'unsigned long' length.
>> Oh, right, I should have commented on the function return type only.
>> Yet then I also can't resist stating that this kind of use of a variable,
>> which initially is assigned a value that doesn't really fit its name, is
>> easily misleading towards giving such comments.
>>
>>> But it could be done in this way:
>>> csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>> vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
>>> vmid_bits = ffs_g(vmid_bits);
>>> csr_write(CSR_HGATP, old);
>>> And then use uint16_t for vmid_bits and use uin16_t as a return type.
>> Please check ./CODING_STYLE again as to the use of fixed-width types.
>
> I meant unsigned short, uint16_t was just short to write. I'll try to be
> more specific.
I'd also recommend against unsigned short when there are no space concerns.
unsigned int is what wants using in the general case.
>>>>> + unsigned long old;
>>>>> +
>>>>> + /* Figure-out number of VMID bits in HW */
>>>>> + old = csr_read(CSR_HGATP);
>>>>> +
>>>>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>>>> + vmid_bits = csr_read(CSR_HGATP);
>>>>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>>>> Nit: Stray blank.
>>>>
>>>>> + vmid_bits = flsl(vmid_bits);
>>>>> + csr_write(CSR_HGATP, old);
>>>>> +
>>>>> + /*
>>>>> + * We polluted local TLB so flush all guest TLB as
>>>>> + * a speculative access can happen at any time.
>>>>> + */
>>>>> + local_hfence_gvma_all();
>>>> There's no guest running. If you wrote hgat.MODE as zero, as per my
>>>> understanding now new TLB entries could even purely theoretically appear.
>>> It could be an issue (or, at least, it is recommended) when hgatp.MODE is
>>> changed:
>>> If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
>>> (and rs2 set to either x0 or the VMID) must be executed to order subsequent
>>> guest translations with the MODE change—even if the old MODE or new MODE
>>> is Bare.
>>> On other hand it is guaranteed that, at least, on Reset (and so I assume
>>> for power on) that:
>>> If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
>>> fields are reset to 0.
>>>
>>> So it seems like if no guest is ran then there is no need even to write
>>> hgatp.MODE as zero, but it might be sense to do that explicitly just to
>>> be sure.
>>>
>>> I thought it was possible to have a running guest and perform a CPU hotplug.
>> But that guest will run on another hart.
>>
>>> In that case, I expect that during the hotplug, vmidlen_detect() will be
>>> called and return the vmid_bits value, which is used as the active VMID.
>>> At that moment, the local TLB could be speculatively polluted, I think.
>>> Likely, it makes sense to call vmidlen_detect() only once for each hart
>>> during initial bringup.
>> That may bring you more problems than it solves. You'd need to stash away
>> the value originally read somewhere. And that somewhere isn't per-CPU data.
>>
>>>> In fact, with no guest running (yet) I'm having a hard time seeing why
>>>> you shouldn't be able to simply write the register with just
>>>> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
>>>> whether "old" needs restoring; writing plain zero afterwards ought to
>>>> suffice. You're in charcge of the register, after all.
>>> It make sense (but I don't know if it is a possible case) to be sure that
>>> HGATP.MODE remains the same, so there is no need to have TLB flush. If
>>> HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
>>> above.
>>>
>>> If we agreed to keep local_hfence_gvma_all() then I think it isn't really
>>> any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.
>>>
>>> Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
>>> to check that in vmidlen_detect() and panic if it isn't zero) and if
>>> vmidlen_detect() function will be called before any guest domain(s) will
>>> be ran then I could agree that we don't need local_hfence_gvma_all() here.
>>>
>>> As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
>>> set to zero.
>>>
>>> Does it make sense?
>> Well - I'd like the pre-conditions to be understood better. For example, can
>> a hart really speculate into guest mode, when the hart is only in the
>> process of being brought up?
>
> I couldn't explicit words that a hart can't speculate into guest mode
> either on bring up or during its work.
>
> But there are some moments in the spec which tells:
> Implementations with virtual memory are permitted to perform address
> translations speculatively and earlier than required by an explicit
> memory access, and are permitted to cache them in address translation
> cache structures—including possibly caching the identity mappings from
> effective address to physical address used in Bare translation modes and
> M-mode.
> And here:
> Implementations may also execute the address-translation algorithm
> speculatively at any time, for any virtual address, as long as satp is
> active (as defined in Section 10.1.11). Such speculative executions have
> the effect of pre-populating the address-translation cache.
That's satp though, not hgatp.
> Where it is explicitly mentioned that speculation can happen in *any time*.
> And at the same time:
> Speculative executions of the address-translation algorithm behave as
> non-speculative executions of the algorithm do, except that they must
> not set the dirty bit for a PTE, they must not trigger an exception,
> and they must not create address-translation cache entries if those
> entries would have been invalidated by any SFENCE.VMA instruction
> executed by the hart since the speculative execution of the algorithm began.
> What I read as if TLB was empty before it will stay empty.
>
> Also, despite of the fact here it is mentioned that when V=0 two-stage address
> translation is inactivated:
> The current virtualization mode, denoted V, indicates whether the hart is
> currently executing in a guest. When V=1, the hart is either in virtual
> S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest OS running
> in VS-mode. When V=0, the hart is either in M-mode, in HS-mode, or in
> U-mode atop an OS running in HS-mode. The virtualization mode also
> indicates whether two-stage address translation is active (V=1) or
> inactive (V=0).
> But on the same side, writing to hgatp register activates it:
> The hgatp register is considered active for the purposes of
> the address-translation algorithm unless the effective privilege mode
> is U and hstatus.HU=0.
> And if so + considering that speculation could happen at any time, and
> we are in HS-mode, not it U mode then I would say that it could really
> speculate into guest mode.
Hmm, that leaves some things to be desired. What I'm particularly puzzled
by is that there's nothing said either way towards speculation through SRET.
That's the important aspect here aiui, because without that the hart can't
speculate into guest mode.
But yes, in the absence of any clear indication to the contrary, I think
you want to keep the local_hfence_gvma_all() (with a suitable comment).
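For reference, a minimal sketch of how the detection could look with the
points from this sub-thread folded in: write just HGATP_VMID_MASK, write zero
afterwards, return unsigned int, and keep the fence. The CSR and the fence are
replaced by mock_* stand-ins (my invention, so the snippet can be built on any
host); the mask value is the RV64 one from riscv_encoding.h.

```c
#include <assert.h>
#include <stdint.h>

#define HGATP_VMID_SHIFT 44
#define HGATP_VMID_MASK  0x03FFF00000000000ULL

/* Mock state standing in for the real hgatp CSR and the real fence. */
static uint64_t mock_hgatp;
static uint64_t mock_vmid_impl_mask = HGATP_VMID_MASK; /* bits the "HW" has */
static unsigned int mock_flush_count;

static void mock_csr_write_hgatp(uint64_t val)
{
    /* WARL behaviour: unimplemented VMID bits read back as zero. */
    uint64_t vmid = val & HGATP_VMID_MASK & mock_vmid_impl_mask;

    mock_hgatp = (val & ~HGATP_VMID_MASK) | vmid;
}

static uint64_t mock_csr_read_hgatp(void)
{
    return mock_hgatp;
}

static void mock_local_hfence_gvma_all(void)
{
    mock_flush_count++;
}

static unsigned int vmidlen_detect_sketch(void)
{
    uint64_t vmid;
    unsigned int bits = 0;

    /* Set all VMID bits (MODE stays Bare) and see which ones stick. */
    mock_csr_write_hgatp(HGATP_VMID_MASK);
    vmid = (mock_csr_read_hgatp() & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;
    mock_csr_write_hgatp(0);

    /* Position of the highest set bit == number of implemented bits. */
    while ( vmid )
    {
        bits++;
        vmid >>= 1;
    }

    /*
     * Nothing in the spec clearly rules out speculative G-stage walks
     * while the probe value was in hgatp, so flush to be safe.
     */
    mock_local_hfence_gvma_all();

    return bits;
}
```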
>>>>> + data->max_vmid = BIT(vmid_len, U) - 1;
>>>>> + data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
>>>> Actually, what exactly does it mean that "VMIDs are disabled"? There's
>>>> no enable bit that I could find anywhere. Isn't it rather that in this
>>>> case you need to arrange to flush always on VM entry (or always after a
>>>> P2M change, depending how the TLB is split between guest and host use)?
>>> "VMIDs are disabled" here means that TLB flush will happen each time p2m
>>> is changed.
>> That's better described as "VMIDs aren't used" then?
>
> It sounds a little bit just like an opposite to "disabled" (i.e. means
> basically the same), but I am okay to use "used" instead.
If you want to stick to using "disabled", then how about "VMID use is
disabled"? (You probably meanwhile understood that what I'm after is it
becoming clear that this is a software decision, not something you can
enforce in hardware.)
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-05 10:37 ` Jan Beulich
@ 2025-08-07 12:00 ` Oleksii Kurochko
2025-08-07 15:30 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 12:00 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
[-- Attachment #1: Type: text/plain, Size: 9459 bytes --]
On 8/5/25 12:37 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Introduce support for allocating and initializing the root page table
>> required for RISC-V stage-2 address translation.
>>
>> To implement root page table allocation the following is introduced:
>> - p2m_get_clean_page() and p2m_alloc_root_table(), p2m_allocate_root()
>> helpers to allocate and zero a 16 KiB root page table, as mandated
>> by the RISC-V privileged specification for Sv32x4/Sv39x4/Sv48x4/Sv57x4
>> modes.
>> - Update p2m_init() to inititialize p2m_root_order.
>> - Add maddr_to_page() and page_to_maddr() macros for easier address
>> manipulation.
>> - Introduce paging_ret_pages_to_domheap() to return some pages before
>> allocate 16 KiB pages for root page table.
>> - Allocate root p2m table after p2m pool is initialized.
>> - Add construct_hgatp() to construct the hgatp register value based on
>> p2m->root, p2m->hgatp_mode and VMID.
> Imo for this to be complete, freeing of the root table also wants taking
> care of. Much like imo p2m_init() would better immediately be accompanied
> by the respective teardown function. Once you start using them, you want
> to use them in pairs, after all.
I decided to skip freeing of the root table and tearing down of the p2m
mappings for now: they are only needed for domain destruction, which isn't
supported at the moment, so their implementation can be deferred until they
are actually used.
>
>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>> @@ -133,11 +133,13 @@
>> #define HGATP_MODE_SV48X4 _UL(9)
>>
>> #define HGATP32_MODE_SHIFT 31
>> +#define HGATP32_MODE_MASK _UL(0x80000000)
>> #define HGATP32_VMID_SHIFT 22
>> #define HGATP32_VMID_MASK _UL(0x1FC00000)
>> #define HGATP32_PPN _UL(0x003FFFFF)
>>
>> #define HGATP64_MODE_SHIFT 60
>> +#define HGATP64_MODE_MASK _ULL(0xF000000000000000)
>> #define HGATP64_VMID_SHIFT 44
>> #define HGATP64_VMID_MASK _ULL(0x03FFF00000000000)
>> #define HGATP64_PPN _ULL(0x00000FFFFFFFFFFF)
>> @@ -170,6 +172,7 @@
>> #define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
>> #define HGATP_VMID_MASK HGATP64_VMID_MASK
>> #define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT
>> +#define HGATP_MODE_MASK HGATP64_MODE_MASK
>> #else
>> #define MSTATUS_SD MSTATUS32_SD
>> #define SSTATUS_SD SSTATUS32_SD
>> @@ -181,8 +184,11 @@
>> #define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT
>> #define HGATP_VMID_MASK HGATP32_VMID_MASK
>> #define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT
>> +#define HGATP_MODE_MASK HGATP32_MODE_MASK
>> #endif
>>
>> +#define GUEST_ROOT_PAGE_TABLE_SIZE KB(16)
> In another context I already mentioned that imo you want to be careful with
> the use of "guest" in identifiers. It's not the guest page tables which have
> an order-2 root table, but the P2M (Xen terminology) or G-stage / second
> stage (RISC-V spec terminology) ones. As long as you're only doing P2M
> work, this may not look significant. But once you actually start dealing
> with guest page tables, it easily can end up confusing.
I thought that GUEST_ROOT_PAGE_TABLE would be read as the G-stage root page
table. But if it is confusing even now, then I'll use
GSTAGE_ROOT_PAGE_TABLE_SIZE instead.
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -1,8 +1,86 @@
>> +#include <xen/domain_page.h>
>> #include <xen/mm.h>
>> #include <xen/rwlock.h>
>> #include <xen/sched.h>
>>
>> #include <asm/paging.h>
>> +#include <asm/p2m.h>
>> +#include <asm/riscv_encoding.h>
>> +
>> +unsigned int __read_mostly p2m_root_order;
> If this is to be a variable at all, it ought to be __ro_after_init, and
> hence it shouldn't be written every time p2m_init() is run. If you want
> to to remain as a variable, what's wrong with
>
> const unsigned int p2m_root_order = ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT;
>
> or some such? But of course equally well you could have
>
> #define P2M_ROOT_ORDER (ilog2(GUEST_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
The only reason p2m_root_order was introduced as a variable is that I had
a compilation issue when defining P2M_ROOT_ORDER this way:
#define P2M_ROOT_ORDER get_order_from_bytes(GUEST_ROOT_PAGE_TABLE_SIZE)
But I can't reproduce it anymore.
Anyway, your option is better as it should be faster.
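As a quick sanity check of the constant, here is a host-buildable sketch. It
assumes 4 KiB pages and uses a small local CONST_ORDER() macro in place of
Xen's ilog2() (the macro and the P2M_ROOT_NR name are mine, for illustration
only); the GSTAGE_ name is the one proposed above.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define KB(x)      ((x) * 1024UL)

#define GSTAGE_ROOT_PAGE_TABLE_SIZE KB(16)

/* Number of 4 KiB pages making up the root table. */
#define P2M_ROOT_NR (GSTAGE_ROOT_PAGE_TABLE_SIZE / PAGE_SIZE)

/* Tiny constant-expression stand-in for ilog2(), enough for 1..8 pages. */
#define CONST_ORDER(n) \
    ((n) <= 1 ? 0 : (n) <= 2 ? 1 : (n) <= 4 ? 2 : (n) <= 8 ? 3 : -1)

#define P2M_ROOT_ORDER CONST_ORDER(P2M_ROOT_NR)
#define P2M_ROOT_PAGES (1U << P2M_ROOT_ORDER)

/* Evaluated entirely at compile time: 16 KiB / 4 KiB = 4 pages, order 2. */
_Static_assert(P2M_ROOT_ORDER == 2, "root table must be order 2");
_Static_assert(P2M_ROOT_PAGES == 4, "root table must span 4 pages");
```

With a macro like this there is nothing left to write at p2m_init() time.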
>
>> +static void clear_and_clean_page(struct page_info *page)
>> +{
>> + clear_domain_page(page_to_mfn(page));
>> +
>> + /*
>> + * If the IOMMU doesn't support coherent walks and the p2m tables are
>> + * shared between the CPU and IOMMU, it is necessary to clean the
>> + * d-cache.
>> + */
> That is, ...
>
>> + clean_dcache_va_range(page, PAGE_SIZE);
> ... this call really wants to be conditional?
It makes sense. I will add "if ( p2m->clean_pte )" and update the declaration
of clear_and_clean_page() accordingly.
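Roughly what I have in mind, as a host-buildable sketch: memset() stands in
for clear_domain_page(), the cache clean is a counting mock, and the boolean
parameter stands in for p2m->clean_pte. All of these stand-ins are assumptions
for illustration, not the final interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Counting mock in place of the real clean_dcache_va_range(). */
static unsigned int mock_clean_calls;

static void mock_clean_dcache_va_range(const void *va, unsigned long size)
{
    (void)va;
    (void)size;
    mock_clean_calls++;
}

/*
 * Zero the page, and clean the d-cache only when the page tables are
 * shared with an IOMMU that cannot do coherent walks.
 */
static void clear_and_clean_page_sketch(void *page_va, bool clean_dcache)
{
    memset(page_va, 0, PAGE_SIZE);

    if ( clean_dcache )
        mock_clean_dcache_va_range(page_va, PAGE_SIZE);
}
```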
>
>> +}
>> +
>> +static struct page_info *p2m_allocate_root(struct domain *d)
> With there also being p2m_alloc_root_table() and with that being the sole
> caller of the function here, I wonder: Is having this in a separate
> function really outweighing the possible confusion of which of the two
> functions to use?
p2m_allocate_root() will also be used in further patches to allocate the
root's metadata page(s), although that too happens from within
p2m_alloc_root_table().
To avoid confusion, it probably makes sense to rename p2m_allocate_root() to
p2m_allocate_root_page().
>
>> +{
>> + struct page_info *page;
>> +
>> + /*
>> + * As mentioned in the Priviliged Architecture Spec (version 20240411)
>> + * in Section 18.5.1, for the paged virtual-memory schemes (Sv32x4,
>> + * Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB and must
>> + * be aligned to a 16-KiB boundary.
>> + */
>> + page = alloc_domheap_pages(d, P2M_ROOT_ORDER, MEMF_no_owner);
>> + if ( !page )
>> + return NULL;
>> +
>> + for ( unsigned int i = 0; i < P2M_ROOT_PAGES; i++ )
>> + clear_and_clean_page(page + i);
>> +
>> + return page;
>> +}
>> +
>> +unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid)
>> +{
>> + unsigned long ppn;
>> +
>> + ppn = PFN_DOWN(page_to_maddr(p2m->root)) & HGATP_PPN;
> Why not page_to_pfn() or mfn_x(page_to_mfn())? I.e. why mix different groups
> of accessors?
No specific reason, I just missed that option.
>
> As to "& HGATP_PPN" - that's making an assumption that you could avoid by
> using ...
>
>> + /* TODO: add detection of hgatp_mode instead of hard-coding it. */
>> +#if RV_STAGE1_MODE == SATP_MODE_SV39
>> + p2m->hgatp_mode = HGATP_MODE_SV39X4;
>> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
>> + p2m->hgatp_mode = HGATP_MODE_SV48X4;
>> +#else
>> +# error "add HGATP_MODE"
>> +#endif
>> +
>> + return ppn | MASK_INSR(p2m->hgatp_mode, HGATP_MODE_MASK) |
>> + MASK_INSR(vmid, HGATP_VMID_MASK);
> ... MASK_INSR() also on "ppn".
>
> As to the writing of p2m->hgatp_mode - you don't want to do this here, when
> this is the function to calculate the value to put into hgatp. This field
> needs calculating only once, perhaps in p2m_init().
Agree, it makes sense to move hgatp_mode detection to p2m_init().
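A sketch of the reworked shape, assuming RV64 masks and Xen's usual
MASK_INSR()/MASK_EXTR() definitions: the mode is chosen once at init time and
construct_hgatp() only assembles the fields, using MASK_INSR() for the PPN as
well. The p2m_sketch structure and the *_sketch function names are
placeholders for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define HGATP_MODE_MASK   0xF000000000000000ULL
#define HGATP_VMID_MASK   0x03FFF00000000000ULL
#define HGATP_PPN_MASK    0x00000FFFFFFFFFFFULL
#define HGATP_MODE_SV39X4 8ULL

/* (m & -m) isolates the lowest set bit of the mask, i.e. the field's LSB. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

struct p2m_sketch {
    uint64_t root_pfn;   /* frame number of the 16 KiB-aligned root table */
    uint64_t hgatp_mode; /* set once at init, not on every hgatp build */
};

static void p2m_init_sketch(struct p2m_sketch *p2m, uint64_t root_pfn)
{
    p2m->root_pfn = root_pfn;
    p2m->hgatp_mode = HGATP_MODE_SV39X4; /* hard-coded here, detection TBD */
}

static uint64_t construct_hgatp_sketch(const struct p2m_sketch *p2m,
                                       unsigned int vmid)
{
    return MASK_INSR(p2m->root_pfn, HGATP_PPN_MASK) |
           MASK_INSR(p2m->hgatp_mode, HGATP_MODE_MASK) |
           MASK_INSR((uint64_t)vmid, HGATP_VMID_MASK);
}
```

Using MASK_INSR() for all three fields means no field's position has to be
assumed separately from its mask.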
>
>> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
>> +{
>> + struct domain *d = p2m->domain;
>> + struct page_info *page;
>> + const unsigned int nr_root_pages = P2M_ROOT_PAGES;
> Is this local variable really of any use?
It will be needed by one of the next patches; to keep the changes in that
later patch smaller, I decided to introduce it here.
>
>> + /*
>> + * Return back nr_root_pages to assure the root table memory is also
>> + * accounted against the P2M pool of the domain.
>> + */
>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>> + return -ENOMEM;
>> +
>> + page = p2m_allocate_root(d);
>> + if ( !page )
>> + return -ENOMEM;
> Hmm, and the pool is then left shrunk by 4 pages?
Yes, while they are used for the root table they shouldn't be in the p2m pool
(freelist); when the root table is freed, it makes sense to return them.
Am I missing something?
Or did you mean that p2m->pages needs to be updated?
>
>> --- a/xen/arch/riscv/paging.c
>> +++ b/xen/arch/riscv/paging.c
>> @@ -54,6 +54,36 @@ int paging_freelist_init(struct domain *d, unsigned long pages,
>>
>> return 0;
>> }
>> +
>> +bool paging_ret_pages_to_domheap(struct domain *d, unsigned int nr_pages)
>> +{
>> + struct page_info *page;
>> +
>> + ASSERT(spin_is_locked(&d->arch.paging.lock));
>> +
>> + if ( ACCESS_ONCE(d->arch.paging.total_pages) < nr_pages )
>> + return false;
>> +
>> + for ( unsigned int i = 0; i < nr_pages; i++ )
>> + {
>> + /* Return memory to domheap. */
>> + page = page_list_remove_head(&d->arch.paging.freelist);
>> + if( page )
>> + {
>> + ACCESS_ONCE(d->arch.paging.total_pages)--;
>> + free_domheap_page(page);
>> + }
>> + else
>> + {
>> + printk(XENLOG_ERR
>> + "Failed to free P2M pages, P2M freelist is empty.\n");
>> + return false;
> Looks pretty redundant with half of paging_freelist_init(), including the
> stray full stop in the log message.
Then I will introduce a separate function for the code inside the for-loop
and use it both here and in paging_freelist_init().
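The factoring could look roughly like this. It is a toy model in which the
freelist is just a counter; all toy_* names are invented for illustration,
and the real code would of course operate on struct domain and page lists.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pool: the freelist is modelled as a plain page count. */
struct toy_pool {
    unsigned long total_pages;   /* pages accounted to the pool */
    unsigned long free_pages;    /* pages currently on the freelist */
    unsigned long heap_returned; /* pages handed back to the heap */
};

/* Shared helper: move one page from the freelist back to the heap. */
static bool toy_ret_page_to_domheap(struct toy_pool *p)
{
    if ( !p->free_pages )
        return false; /* freelist unexpectedly empty */

    p->free_pages--;
    p->total_pages--;
    p->heap_returned++;
    return true;
}

/* User 1: give back a fixed number of pages (root table case). */
static bool toy_ret_pages_to_domheap(struct toy_pool *p, unsigned int nr)
{
    if ( p->total_pages < nr )
        return false;

    for ( unsigned int i = 0; i < nr; i++ )
        if ( !toy_ret_page_to_domheap(p) )
            return false;

    return true;
}

/* User 2: shrink the pool down to a target size (freelist init case). */
static bool toy_pool_shrink_to(struct toy_pool *p, unsigned long target)
{
    while ( p->total_pages > target )
        if ( !toy_ret_page_to_domheap(p) )
            return false;

    return true;
}
```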
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-05 10:43 ` Jan Beulich
@ 2025-08-07 13:35 ` Oleksii Kurochko
2025-08-07 15:57 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 13:35 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]
On 8/5/25 12:43 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
>> +{
>> + struct domain *d = p2m->domain;
>> + struct page_info *page;
>> + const unsigned int nr_root_pages = P2M_ROOT_PAGES;
>> +
>> + /*
>> + * Return back nr_root_pages to assure the root table memory is also
>> + * accounted against the P2M pool of the domain.
>> + */
>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>> + return -ENOMEM;
>> +
>> + page = p2m_allocate_root(d);
>> + if ( !page )
>> + return -ENOMEM;
>> +
>> + p2m->root = page;
>> +
>> + return 0;
>> +}
> In the success case, shouldn't you bump the paging pool's total_pages by
> P2M_ROOT_PAGES? (As the freeing side is missing so far, it's not easy to
> tell whether there's [going to be] a balancing problem in the long run.
> In the short run there certainly is.)
I think that total_pages should be updated only when a page is added to
the freelist.
In the case of the p2m root table, we just return some pages to the domheap,
decreasing total_pages in the process as the freelist now holds fewer pages,
and then allocate pages from the domheap without adding them to the freelist.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement
2025-08-07 10:11 ` Jan Beulich
@ 2025-08-07 14:45 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 14:45 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
[-- Attachment #1: Type: text/plain, Size: 8310 bytes --]
On 8/7/25 12:11 PM, Jan Beulich wrote:
>>>>>> + unsigned long old;
>>>>>> +
>>>>>> + /* Figure-out number of VMID bits in HW */
>>>>>> + old = csr_read(CSR_HGATP);
>>>>>> +
>>>>>> + csr_write(CSR_HGATP, old | HGATP_VMID_MASK);
>>>>>> + vmid_bits = csr_read(CSR_HGATP);
>>>>>> + vmid_bits = MASK_EXTR(vmid_bits, HGATP_VMID_MASK);
>>>>> Nit: Stray blank.
>>>>>
>>>>>> + vmid_bits = flsl(vmid_bits);
>>>>>> + csr_write(CSR_HGATP, old);
>>>>>> +
>>>>>> + /*
>>>>>> + * We polluted local TLB so flush all guest TLB as
>>>>>> + * a speculative access can happen at any time.
>>>>>> + */
>>>>>> + local_hfence_gvma_all();
>>>>> There's no guest running. If you wrote hgat.MODE as zero, as per my
>>>>> understanding now new TLB entries could even purely theoretically appear.
>>>> It could be an issue (or, at least, it is recommended) when hgatp.MODE is
>>>> changed:
>>>> If hgatp.MODE is changed for a given VMID, an HFENCE.GVMA with rs1=x0
>>>> (and rs2 set to either x0 or the VMID) must be executed to order subsequent
>>>> guest translations with the MODE change—even if the old MODE or new MODE
>>>> is Bare.
>>>> On other hand it is guaranteed that, at least, on Reset (and so I assume
>>>> for power on) that:
>>>> If the hypervisor extension is implemented, the hgatp.MODE and vsatp.MODE
>>>> fields are reset to 0.
>>>>
>>>> So it seems like if no guest is ran then there is no need even to write
>>>> hgatp.MODE as zero, but it might be sense to do that explicitly just to
>>>> be sure.
>>>>
>>>> I thought it was possible to have a running guest and perform a CPU hotplug.
>>> But that guest will run on another hart.
>>>
>>>> In that case, I expect that during the hotplug, vmidlen_detect() will be
>>>> called and return the vmid_bits value, which is used as the active VMID.
>>>> At that moment, the local TLB could be speculatively polluted, I think.
>>>> Likely, it makes sense to call vmidlen_detect() only once for each hart
>>>> during initial bringup.
>>> That may bring you more problems than it solves. You'd need to stash away
>>> the value originally read somewhere. And that somewhere isn't per-CPU data.
>>>
>>>>> In fact, with no guest running (yet) I'm having a hard time seeing why
>>>>> you shouldn't be able to simply write the register with just
>>>>> HGATP_VMID_MASK, i.e. without OR-ing in "old". It's even questionable
>>>>> whether "old" needs restoring; writing plain zero afterwards ought to
>>>>> suffice. You're in charcge of the register, after all.
>>>> It make sense (but I don't know if it is a possible case) to be sure that
>>>> HGATP.MODE remains the same, so there is no need to have TLB flush. If
>>>> HGATP.MODE is changed then it will be needed to do TLB flush as I mentioned
>>>> above.
>>>>
>>>> If we agreed to keep local_hfence_gvma_all() then I think it isn't really
>>>> any sense to restore 'old' or OR-ing it with HGATP_VMID_MASK.
>>>>
>>>> Generally if 'old' is guaranteed to be zero (and, probably, it makes sense
>>>> to check that in vmidlen_detect() and panic if it isn't zero) and if
>>>> vmidlen_detect() function will be called before any guest domain(s) will
>>>> be ran then I could agree that we don't need local_hfence_gvma_all() here.
>>>>
>>>> As an option we can do local_hfence_gvma_all() only if 'old' value wasn't
>>>> set to zero.
>>>>
>>>> Does it make sense?
>>> Well - I'd like the pre-conditions to be understood better. For example, can
>>> a hart really speculate into guest mode, when the hart is only in the
>>> process of being brought up?
>> I couldn't explicit words that a hart can't speculate into guest mode
>> either on bring up or during its work.
>>
>> But there are some moments in the spec which tells:
>> Implementations with virtual memory are permitted to perform address
>> translations speculatively and earlier than required by an explicit
>> memory access, and are permitted to cache them in address translation
>> cache structures—including possibly caching the identity mappings from
>> effective address to physical address used in Bare translation modes and
>> M-mode.
>> And here:
>> Implementations may also execute the address-translation algorithm
>> speculatively at any time, for any virtual address, as long as satp is
>> active (as defined in Section 10.1.11). Such speculative executions have
>> the effect of pre-populating the address-translation cache.
> That's satp though, not hgatp.
>
>> Where it is explicitly mentioned that speculation can happen in *any time*.
>> And at the same time:
>> Speculative executions of the address-translation algorithm behave as
>> non-speculative executions of the algorithm do, except that they must
>> not set the dirty bit for a PTE, they must not trigger an exception,
>> and they must not create address-translation cache entries if those
>> entries would have been invalidated by any SFENCE.VMA instruction
>> executed by the hart since the speculative execution of the algorithm began.
>> What I read as if TLB was empty before it will stay empty.
>>
>> Also, despite of the fact here it is mentioned that when V=0 two-stage address
>> translation is inactivated:
>> The current virtualization mode, denoted V, indicates whether the hart is
>> currently executing in a guest. When V=1, the hart is either in virtual
>> S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest OS running
>> in VS-mode. When V=0, the hart is either in M-mode, in HS-mode, or in
>> U-mode atop an OS running in HS-mode. The virtualization mode also
>> indicates whether two-stage address translation is active (V=1) or
>> inactive (V=0).
>> But on the same side, writing to hgatp register activates it:
>> The hgatp register is considered active for the purposes of
>> the address-translation algorithm unless the effective privilege mode
>> is U and hstatus.HU=0.
>> And if so + considering that speculation could happen at any time, and
>> we are in HS-mode, not it U mode then I would say that it could really
>> speculate into guest mode.
> Hmm, that leaves some things to be desired. What I'm particularly puzzled
> by is that there's nothing said either way towards speculation through SRET.
> That's the important aspect here aiui, because without that the hart can't
> speculate into guest mode.
>
> But yes, in the absence of any clear indication to the contrary, I think
> you want to keep the local_hfence_gvma_all() (with a suitable comment).
Interestingly, for VS-stage translation the spec explicitly mentions that it
is possible to stop speculation:
No mechanism is provided to atomically change vsatp and hgatp together.
Hence, to prevent speculative execution causing one guest’s VS-stage
translations to be cached under another guest’s VMID, world-switch code
should zero vsatp, then swap hgatp, then finally write the new vsatp value.
Similarly, if henvcfg.PBMTE need be world-switched, it should be switched
after zeroing vsatp but before writing the new vsatp value, obviating
the need to execute an HFENCE.VVMA instruction.
So if vsatp is 0 then, apparently, there is no speculation, as there is no
need to execute HFENCE.VVMA.
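The ordering from that paragraph of the spec, written down as a sketch. CSR
writes are only recorded in a log here so the order can be checked on any
host; the toy_* names and the log are assumptions for illustration, and a
real implementation would use csr_write() on CSR_VSATP and CSR_HGATP.

```c
#include <assert.h>
#include <stdint.h>

enum toy_csr { TOY_VSATP, TOY_HGATP };

struct toy_write {
    enum toy_csr csr;
    uint64_t val;
};

/* Log of modelled CSR writes, in program order. */
static struct toy_write toy_log[8];
static unsigned int toy_log_len;

static void toy_csr_write(enum toy_csr csr, uint64_t val)
{
    if ( toy_log_len < 8 )
    {
        toy_log[toy_log_len].csr = csr;
        toy_log[toy_log_len].val = val;
        toy_log_len++;
    }
}

/*
 * World switch as described by the spec: zero vsatp first so no VS-stage
 * translation can be cached under the wrong VMID, then swap hgatp, and
 * only then install the new vsatp.
 */
static void toy_world_switch(uint64_t new_hgatp, uint64_t new_vsatp)
{
    toy_csr_write(TOY_VSATP, 0);
    toy_csr_write(TOY_HGATP, new_hgatp);
    toy_csr_write(TOY_VSATP, new_vsatp);
}
```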
>
>>>>>> + data->max_vmid = BIT(vmid_len, U) - 1;
>>>>>> + data->disabled = !opt_vmid_enabled || (vmid_len <= 1);
>>>>> Actually, what exactly does it mean that "VMIDs are disabled"? There's
>>>>> no enable bit that I could find anywhere. Isn't it rather that in this
>>>>> case you need to arrange to flush always on VM entry (or always after a
>>>>> P2M change, depending how the TLB is split between guest and host use)?
>>>> "VMIDs are disabled" here means that TLB flush will happen each time p2m
>>>> is changed.
>>> That's better described as "VMIDs aren't used" then?
>> It sounds a little bit just like an opposite to "disabled" (i.e. means
>> basically the same), but I am okay to use "used" instead.
> If you want to stick to using "disabled", then how about "VMID use is
> disabled"? (You probably meanwhile understood that what I'm after is it
> becoming clear that this is a software decision, not something you can
> enforce in hardware.)
"VMID use is disabled" really sounds more clear. Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings
2025-08-04 14:11 ` Jan Beulich
@ 2025-08-07 15:23 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 15:23 UTC (permalink / raw)
To: Jan Beulich
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, xen-devel
[-- Attachment #1: Type: text/plain, Size: 2018 bytes --]
On 8/4/25 4:11 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Rename `p2m_mmio_direct_dev` to a more architecture-neutral alias
>> `p2m_mmio_direct` to avoid leaking Arm-specific naming into common Xen code,
>> such as dom0less passthrough property handling.
>>
>> This helps reduce platform-specific terminology in shared logic and
>> improves clarity for future non-Arm ports (e.g. RISC-V or PowerPC).
>>
>> No functional changes — the definition is preserved via a macro alias
>> for Arm.
>>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
> I'm sorry, but no, ...
>
>> --- a/xen/arch/arm/include/asm/p2m.h
>> +++ b/xen/arch/arm/include/asm/p2m.h
>> @@ -137,6 +137,8 @@ typedef enum {
>> p2m_max_real_type, /* Types after this won't be store in the p2m */
>> } p2m_type_t;
>>
>> +#define p2m_mmio_direct p2m_mmio_direct_dev
> ... this isn't what I suggested. When Arm has three p2m_mmio_direct_*,
> randomly aliasing one to p2m_mmio_direct is imo more likely to create
> confusion than to help things. Imo you want to introduce ...
This is not random; this is what Arm uses for the device node(s) which are
going to be passed through...
>
>> --- a/xen/common/device-tree/dom0less-build.c
>> +++ b/xen/common/device-tree/dom0less-build.c
>> @@ -185,7 +185,7 @@ static int __init handle_passthrough_prop(struct kernel_info *kinfo,
>> gaddr_to_gfn(gstart),
>> PFN_DOWN(size),
>> maddr_to_mfn(mstart),
>> - p2m_mmio_direct_dev);
>> + p2m_mmio_direct);
> ... a per-arch inline function which returns the type to use here.
> The name of the function would want to properly reflect the purpose;
> my limited DT knowledge may make arch_dt_passthrough_p2m_type() an
> entirely wrong suggestion.
... But making it even more generic, by providing an inline function which
just returns the p2m_type_t to use, would indeed be better.
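Something along these lines, as a host-buildable sketch. The enum is a
trimmed stand-in for Arm's p2m_type_t, the function name is the one suggested
above (and may still change), and toy_map_passthrough_region() is an invented
placeholder for the common dom0less caller.

```c
#include <assert.h>

/* Trimmed stand-in for Arm's p2m_type_t; the real enum has more types. */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_dev,
    p2m_mmio_direct_nc,
    p2m_mmio_direct_c,
} p2m_type_t;

/* Per-arch hook: which p2m type to use for DT passthrough MMIO regions. */
static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
{
    return p2m_mmio_direct_dev;
}

/* Placeholder for common code: it no longer names an Arm-specific type. */
static p2m_type_t toy_map_passthrough_region(void)
{
    p2m_type_t t = arch_dt_passthrough_p2m_type();

    /* The real caller would pass 't' on to the mapping function here. */
    return t;
}
```

Each architecture then provides its own arch_dt_passthrough_p2m_type(), and
no alias macro is needed on the Arm side.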
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-07 12:00 ` Oleksii Kurochko
@ 2025-08-07 15:30 ` Jan Beulich
2025-08-07 15:59 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-07 15:30 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 07.08.2025 14:00, Oleksii Kurochko wrote:
> On 8/5/25 12:37 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> + /*
>>> + * Return back nr_root_pages to assure the root table memory is also
>>> + * accounted against the P2M pool of the domain.
>>> + */
>>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>>> + return -ENOMEM;
>>> +
>>> + page = p2m_allocate_root(d);
>>> + if ( !page )
>>> + return -ENOMEM;
>> Hmm, and the pool is then left shrunk by 4 pages?
>
> Yes until they are used for root table it shouldn't be in p2m pool (freelist),
> when root table will be freed then it makes sense to return them back.
> Am I missing something?
I'm commenting specifically on the error path here.
> Probably, you meant that it is needed to update p2m->pages?
That (I think) I commented on elsewhere, yes.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification
2025-08-04 14:16 ` Jan Beulich
@ 2025-08-07 15:41 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 15:41 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/4/25 4:16 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> - Extended p2m_type_t with additional types: p2m_mmio_direct,
>> p2m_grant_map_{rw,ro}.
>> - Added macros to classify memory types: P2M_RAM_TYPES, P2M_GRANT_TYPES.
>> - Introduced helper predicates: p2m_is_ram(), p2m_is_any_ram().
>> - Define p2m_mmio_direct to tell handle_passthrough_prop() from common
>> code how to map device memory.
>>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
> Almost ready to be acked, except for ...
>
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -62,8 +62,30 @@ struct p2m_domain {
>> typedef enum {
>> p2m_invalid = 0, /* Nothing mapped here */
>> p2m_ram_rw, /* Normal read/write domain RAM */
>> + p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
>> + PTE_PBMT_IO will be used for such mappings */
>> + p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
>> + p2m_grant_map_rw, /* Read/write grant mapping */
>> + p2m_grant_map_ro, /* Read-only grant mapping */
>> } p2m_type_t;
>>
>> +#define p2m_mmio_direct p2m_mmio_direct_io
> ... this (see reply to patch 09).
>
>> +/* We use bitmaps and mask to handle groups of types */
>> +#define p2m_to_mask(t_) BIT(t_, UL)
> I notice that you moved the underscore to the back of the parameters,
> compared to how Arm has it. I wonder though: What use are these
> underscores in the first place, here and below? (There are macros where
> conflicts could arise, but the ones here don't fall in that group,
> afaict.)
Good point, there are really no name conflicts here, so the underscores could
just be dropped.
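
A rough sketch of how the macros might read without the underscores. BIT() here
is a local stand-in for Xen's helper, the membership of P2M_RAM_TYPES and
P2M_GRANT_TYPES is an assumption, and p2m_is_grant() is a hypothetical
predicate added only for illustration:

```c
#include <assert.h>

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* Local stand-in for Xen's BIT(pos, suffix). */
#define BIT(pos, sfx) (1 ## sfx << (pos))

/* Plain parameter names: nothing in these macros can clash with them. */
#define p2m_to_mask(t) BIT(t, UL)

/* Assumed group membership, for illustration only. */
#define P2M_RAM_TYPES   (p2m_to_mask(p2m_ram_rw))
#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
                         p2m_to_mask(p2m_grant_map_ro))

#define p2m_is_ram(t)   (!!(p2m_to_mask(t) & P2M_RAM_TYPES))
#define p2m_is_grant(t) (!!(p2m_to_mask(t) & P2M_GRANT_TYPES))
```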
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-07 13:35 ` Oleksii Kurochko
@ 2025-08-07 15:57 ` Jan Beulich
2025-08-08 9:14 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-07 15:57 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 07.08.2025 15:35, Oleksii Kurochko wrote:
>
> On 8/5/25 12:43 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
>>> +{
>>> + struct domain *d = p2m->domain;
>>> + struct page_info *page;
>>> + const unsigned int nr_root_pages = P2M_ROOT_PAGES;
>>> +
>>> + /*
>>> + * Return back nr_root_pages to assure the root table memory is also
>>> + * accounted against the P2M pool of the domain.
>>> + */
>>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>>> + return -ENOMEM;
>>> +
>>> + page = p2m_allocate_root(d);
>>> + if ( !page )
>>> + return -ENOMEM;
>>> +
>>> + p2m->root = page;
>>> +
>>> + return 0;
>>> +}
>> In the success case, shouldn't you bump the paging pool's total_pages by
>> P2M_ROOT_PAGES? (As the freeing side is missing so far, it's not easy to
>> tell whether there's [going to be] a balancing problem in the long run.
>> In the short run there certainly is.)
>
> I think that total_pages should be updated only in case when page is added
> to freelist.
> In the case of p2m root table, we just returning some pages to domheap and
> durint that decreasing an amount of total_pages as freelist has lesser pages,
> and then just allocate pages from domheap without adding them to freelist.
But how's freeing of a root table going to look like? Logically that group
of 4 pages would be put back into the pool. And from that the pool's
total_pages should reflect that right after successful allocation.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-07 15:30 ` Jan Beulich
@ 2025-08-07 15:59 ` Oleksii Kurochko
2025-08-07 16:03 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-07 15:59 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/7/25 5:30 PM, Jan Beulich wrote:
> On 07.08.2025 14:00, Oleksii Kurochko wrote:
>> On 8/5/25 12:37 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> + /*
>>>> + * Return back nr_root_pages to assure the root table memory is also
>>>> + * accounted against the P2M pool of the domain.
>>>> + */
>>>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>>>> + return -ENOMEM;
>>>> +
>>>> + page = p2m_allocate_root(d);
>>>> + if ( !page )
>>>> + return -ENOMEM;
>>> Hmm, and the pool is then left shrunk by 4 pages?
>> Yes until they are used for root table it shouldn't be in p2m pool (freelist),
>> when root table will be freed then it makes sense to return them back.
>> Am I missing something?
> I'm commenting specifically on the error path here.
Ohh, got it.
In this case, should we really care about these 4 pages? A domain can't be run
without an allocated root page table, and a panic() will occur anyway according
to the create_domUs() common code (construct_domU() -> domain_p2m_set_allocation()
-> p2m_set_allocation() -> p2m_alloc_root_table()):
...
rc = construct_domU(&ki, node);
if ( rc )
panic("Could not set up domain %s (rc = %d)\n",
dt_node_name(node), rc);
...
(Note: I forgot to return the value returned by p2m_alloc_root_table() in p2m_set_allocation(),
so it isn't really propagated at the moment, but I will fix that in the next patch
version) ...
>> Probably, you meant that it is needed to update p2m->pages?
> That (I think) I commented on elsewhere, yes.
...
if it is really necessary to update p2m->pages when a page is allocated, I think
it would be better to do that in p2m_allocate_root(), immediately after
alloc_domheap_pages() is called.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-07 15:59 ` Oleksii Kurochko
@ 2025-08-07 16:03 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-07 16:03 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 07.08.2025 17:59, Oleksii Kurochko wrote:
> On 8/7/25 5:30 PM, Jan Beulich wrote:
>> On 07.08.2025 14:00, Oleksii Kurochko wrote:
>>> On 8/5/25 12:37 PM, Jan Beulich wrote:
>>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>>> + /*
>>>>> + * Return back nr_root_pages to assure the root table memory is also
>>>>> + * accounted against the P2M pool of the domain.
>>>>> + */
>>>>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>>>>> + return -ENOMEM;
>>>>> +
>>>>> + page = p2m_allocate_root(d);
>>>>> + if ( !page )
>>>>> + return -ENOMEM;
>>>> Hmm, and the pool is then left shrunk by 4 pages?
>>> Yes until they are used for root table it shouldn't be in p2m pool (freelist),
>>> when root table will be freed then it makes sense to return them back.
>>> Am I missing something?
>> I'm commenting specifically on the error path here.
>
> Ohh, got it.
>
> In this case, should we really care about these 4 pages? A domain can't be run
> without an allocated root page table, and a panic() will occur anyway according
> to the create_domUs() common code (construct_domU() -> domain_p2m_set_allocation()
> -> p2m_set_allocation() -> p2m_alloc_root_table()):
> ...
> rc = construct_domU(&ki, node);
> if ( rc )
> panic("Could not set up domain %s (rc = %d)\n",
> dt_node_name(node), rc);
Well, that's for dom0less. Even for tool-stack created VMs there would be
no problem. But root tables required on demand (altp2m, nested) would be
different. So what you do here may be good enough for now, but likely will
want improving later on. (Such temporary restrictions may want putting
down somewhere.)
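
For reference, a toy model of an error path that leaves the pool unchanged when
the root allocation fails. Everything here is hypothetical: the pool is reduced
to a page counter, heap_has_memory merely lets the failure be forced, and in
real code re-growing the pool could itself fail and would need handling:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define P2M_ROOT_PAGES 4U

/* Toy paging pool: only the page count is tracked. */
struct pool {
    unsigned int total_pages;
    bool heap_has_memory;   /* false forces the root allocation to fail */
};

/* Stand-in for handing pages from the pool back to the domheap. */
static inline bool pool_ret_pages_to_heap(struct pool *p, unsigned int nr)
{
    if ( p->total_pages < nr )
        return false;

    p->total_pages -= nr;

    return true;
}

/* Inverse operation, used to undo the shrink on failure. */
static inline void pool_take_pages_from_heap(struct pool *p, unsigned int nr)
{
    p->total_pages += nr;
}

static inline int alloc_root_table(struct pool *p)
{
    if ( !pool_ret_pages_to_heap(p, P2M_ROOT_PAGES) )
        return -ENOMEM;

    if ( !p->heap_has_memory )
    {
        /* Error path: restore the pool to its size on entry. */
        pool_take_pages_from_heap(p, P2M_ROOT_PAGES);
        return -ENOMEM;
    }

    return 0;
}
```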
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 06/20] xen/riscv: add root page table allocation
2025-08-07 15:57 ` Jan Beulich
@ 2025-08-08 9:14 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-08 9:14 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/7/25 5:57 PM, Jan Beulich wrote:
> On 07.08.2025 15:35, Oleksii Kurochko wrote:
>> On 8/5/25 12:43 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> +static int p2m_alloc_root_table(struct p2m_domain *p2m)
>>>> +{
>>>> + struct domain *d = p2m->domain;
>>>> + struct page_info *page;
>>>> + const unsigned int nr_root_pages = P2M_ROOT_PAGES;
>>>> +
>>>> + /*
>>>> + * Return back nr_root_pages to assure the root table memory is also
>>>> + * accounted against the P2M pool of the domain.
>>>> + */
>>>> + if ( !paging_ret_pages_to_domheap(d, nr_root_pages) )
>>>> + return -ENOMEM;
>>>> +
>>>> + page = p2m_allocate_root(d);
>>>> + if ( !page )
>>>> + return -ENOMEM;
>>>> +
>>>> + p2m->root = page;
>>>> +
>>>> + return 0;
>>>> +}
>>> In the success case, shouldn't you bump the paging pool's total_pages by
>>> P2M_ROOT_PAGES? (As the freeing side is missing so far, it's not easy to
>>> tell whether there's [going to be] a balancing problem in the long run.
>>> In the short run there certainly is.)
>> I think that total_pages should be updated only in case when page is added
>> to freelist.
>> In the case of p2m root table, we just returning some pages to domheap and
>> durint that decreasing an amount of total_pages as freelist has lesser pages,
>> and then just allocate pages from domheap without adding them to freelist.
> But how's freeing of a root table going to look like?
We have a saved pointer to the first of the P2M_ROOT_PAGES pages allocated for the
root page table, stored in p2m->root. Then, when a domain is going to be destroyed,
do something like:
for ( i = 0; i < P2M_ROOT_PAGES; i++ )
clear_and_clean_page(p2m->root + i);
...
> Logically that group
> of 4 pages would be put back into the pool. And from that the pool's
> total_pages should reflect that right after successful allocation.
... I think instead of having the loop mentioned above we could add the root table
pages to p2m->pages (as you suggested) in p2m_allocate_root(), and then, when a domain
is being destroyed, just do the following:
while ( (pg = page_list_remove_head(&p2m->pages)) )
{
    p2m_free_page(p2m->domain, pg);
}
It will then be the job of the internals of p2m_free_page() -> paging_free_page() to
adjust the freelist's total_pages and return the page(s) allocated for the root table
to the freelist. (Note: the current implementation of paging_free_page() just
adds a page to the freelist without updating the freelist's total_pages, which looks
incorrect. And that will be enough, as total_pages exists only for the freelist
and there is no separate total_pages (or something similar) for p2m->pages.)
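
To illustrate the accounting, a small self-contained model in which the root
pages are tracked on the p2m's page list and teardown hands each page back to
the pool while bumping total_pages. All types and names here are simplified
stand-ins, not the real Xen structures:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal singly linked page list, standing in for Xen's page lists. */
struct page {
    struct page *next;
};

struct pool {
    struct page *freelist;
    unsigned int total_pages;
};

struct p2m {
    struct page *pages;   /* pages owned by this p2m, root table included */
};

static inline void list_push(struct page **head, struct page *pg)
{
    pg->next = *head;
    *head = pg;
}

static inline struct page *list_pop(struct page **head)
{
    struct page *pg = *head;

    if ( pg )
        *head = pg->next;

    return pg;
}

/* Models paging_free_page() with the total_pages adjustment included. */
static inline void pool_free_page(struct pool *pool, struct page *pg)
{
    list_push(&pool->freelist, pg);
    pool->total_pages++;
}

/* Teardown: every page on the p2m's list goes back to the pool. */
static inline void p2m_teardown(struct p2m *p2m, struct pool *pool)
{
    struct page *pg;

    while ( (pg = list_pop(&p2m->pages)) != NULL )
        pool_free_page(pool, pg);
}
```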
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn()
2025-08-05 14:11 ` Jan Beulich
@ 2025-08-08 9:16 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-08 9:16 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/5/25 4:11 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/mm.h
>> +++ b/xen/arch/riscv/include/asm/mm.h
>> @@ -12,6 +12,7 @@
>> #include <xen/sections.h>
>> #include <xen/types.h>
>>
>> +#include <asm/cmpxchg.h>
>> #include <asm/page.h>
>> #include <asm/page-bits.h>
>>
>> @@ -247,9 +248,17 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>> #define PGT_writable_page PG_mask(1, 1) /* has writable mappings? */
>> #define PGT_type_mask PG_mask(1, 1) /* Bits 31 or 63. */
>>
>> -/* Count of uses of this frame as its current type. */
>> -#define PGT_count_width PG_shift(2)
>> -#define PGT_count_mask ((1UL << PGT_count_width) - 1)
>> + /* 9-bit count of uses of this frame as its current type. */
> Nit: Stray blank at start of line.
>
>> +#define PGT_count_mask PG_mask(0x3FF, 10)
> A 9-bit count corresponds to a mask of 0x1ff, doesn't it? With 0x3ff the count
> can spill over the type.
It should indeed be 0x1ff, thanks.
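
A quick self-contained check of the overlap. PG_shift() and PG_mask() below are
written to mirror Xen's helpers as I understand them (treat the definitions as
assumptions); the _10bit/_9bit suffixes and masks_overlap() are made up for
the comparison:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Assumed to match Xen: fields are placed counting from the top bit. */
#define PG_shift(idx)   (BITS_PER_LONG - (idx))
#define PG_mask(x, idx) (x ## UL << PG_shift(idx))

#define PGT_type_mask        PG_mask(1, 1)        /* top bit */
#define PGT_count_mask_10bit PG_mask(0x3FF, 10)   /* as posted */
#define PGT_count_mask_9bit  PG_mask(0x1FF, 10)   /* as corrected */

static inline bool masks_overlap(unsigned long a, unsigned long b)
{
    return (a & b) != 0;
}
```

The 0x3FF variant covers the top ten bits and therefore includes the type bit,
while 0x1FF stops one bit short of it.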
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
2025-08-05 15:20 ` Jan Beulich
@ 2025-08-08 13:46 ` Oleksii Kurochko
2025-08-11 7:28 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-08 13:46 UTC (permalink / raw)
To: Jan Beulich, Andrew Cooper
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Anthony PERARD,
Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/5/25 5:20 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Implement map_regions_p2mt() to map a region in the guest p2m with
>> a specific p2m type. The memory attributes will be derived from the
>> p2m type. This function is going to be called from dom0less common
>> code.
> s/is going to be/is/ ? Such a call exists already, after all.
>
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -121,21 +121,22 @@ static inline int guest_physmap_mark_populate_on_demand(struct domain *d,
>> return -EOPNOTSUPP;
>> }
>>
>> -static inline int guest_physmap_add_entry(struct domain *d,
>> - gfn_t gfn, mfn_t mfn,
>> - unsigned long page_order,
>> - p2m_type_t t)
>> -{
>> - BUG_ON("unimplemented");
>> - return -EINVAL;
>> -}
>> +/*
>> + * Map a region in the guest p2m with a specific p2m type.
> What is "the guest p2m"? In your answer, please consider the possible
> (and at some point likely necessary) existence of altp2m and nestedp2m.
> In patch 04 you introduce p2m_get_hostp2m(), and I expect it's that
> what you mean here.
In the current context it is the host p2m. I can update the comment to say:
"guest's hostp2m".
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -9,6 +9,41 @@
>>
>> unsigned int __read_mostly p2m_root_order;
>>
>> +/*
>> + * Force a synchronous P2M TLB flush.
>> + *
>> + * Must be called with the p2m lock held.
>> + */
>> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
>> +{
>> + struct domain *d = p2m->domain;
> Pointer-to-const please. Personally, given the implementation of this
> function (and also ...
>
>> + ASSERT(p2m_is_write_locked(p2m));
>> +
>> + sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
>> +
>> + p2m->need_flush = false;
>> +}
>> +
>> +void p2m_tlb_flush_sync(struct p2m_domain *p2m)
>> +{
>> + if ( p2m->need_flush )
>> + p2m_force_tlb_flush_sync(p2m);
>> +}
> ... this one) I'd further ask for the function parameters to also be
> pointer-to-const, but Andrew may object to that. Andrew - it continues to
> be unclear to me under what conditions you agree with adding const, and
> under what conditions you would object to me asking for such. Please can
> you take the time to clarify this?
>
>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>> +void p2m_write_unlock(struct p2m_domain *p2m)
>> +{
>> + /*
>> + * The final flush is done with the P2M write lock taken to avoid
>> + * someone else modifying the P2M wbefore the TLB invalidation has
> Nit: Stray 'w'.
>
>> + * completed.
>> + */
>> + p2m_tlb_flush_sync(p2m);
> Wasn't the plan to have this be conditional?
Not really; I probably misunderstood you before.
Previously, I only had |p2m_force_tlb_flush_sync()| here, instead of
|p2m_tlb_flush_sync()|, and the latter includes a condition check on
|p2m->need_flush|.
>
>> @@ -139,3 +174,33 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>
>> return 0;
>> }
>> +
>> +static int p2m_set_range(struct p2m_domain *p2m,
>> + gfn_t sgfn,
>> + unsigned long nr,
>> + mfn_t smfn,
>> + p2m_type_t t)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +
>> +static int p2m_insert_mapping(struct p2m_domain *p2m, gfn_t start_gfn,
>> + unsigned long nr, mfn_t mfn, p2m_type_t t)
>> +{
>> + int rc;
>> +
>> + p2m_write_lock(p2m);
>> + rc = p2m_set_range(p2m, start_gfn, nr, mfn, t);
>> + p2m_write_unlock(p2m);
>> +
>> + return rc;
>> +}
>> +
>> +int map_regions_p2mt(struct domain *d,
>> + gfn_t gfn,
>> + unsigned long nr,
>> + mfn_t mfn,
>> + p2m_type_t p2mt)
>> +{
>> + return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
>> +}
> And eventually both helper functions will gain further callers? Otherwise
> it's a little hard to see why they would both need to be separate functions.
Good point.
Actually, I think that it is enough to have map_regions_p2mt(), as it is used
by the dom0less common code, and to re-use it everywhere p2m_insert_mapping()
would potentially be needed.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
2025-08-08 13:46 ` Oleksii Kurochko
@ 2025-08-11 7:28 ` Jan Beulich
2025-08-11 9:29 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 7:28 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Anthony PERARD,
Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel, Andrew Cooper
On 08.08.2025 15:46, Oleksii Kurochko wrote:
> On 8/5/25 5:20 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>>> +void p2m_write_unlock(struct p2m_domain *p2m)
>>> +{
>>> + /*
>>> + * The final flush is done with the P2M write lock taken to avoid
>>> + * someone else modifying the P2M wbefore the TLB invalidation has
>> Nit: Stray 'w'.
>>
>>> + * completed.
>>> + */
>>> + p2m_tlb_flush_sync(p2m);
>> Wasn't the plan to have this be conditional?
>
> Not really; I probably misunderstood you before.
>
> Previously, I only had |p2m_force_tlb_flush_sync()| here, instead of
> |p2m_tlb_flush_sync()|, and the latter includes a condition check on
> |p2m->need_flush|.
Just to re-iterate my point: Not every unlock will require a flush. Hence
why I expect the flush to be conditional upon there being an indication
that some change was done that requires flushing.
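
A toy model of that expectation, with the flush reduced to a counter. All names
are hypothetical and locking is omitted; the only point is that an unlock with
no preceding modification issues no flush:

```c
#include <assert.h>
#include <stdbool.h>

struct p2m_model {
    bool need_flush;
    unsigned int flushes;   /* number of TLB flushes actually issued */
};

/* Stand-in for the remote fence; clears the pending indication. */
static inline void p2m_model_flush(struct p2m_model *p2m)
{
    p2m->flushes++;
    p2m->need_flush = false;
}

/* Any change that can leave stale translations marks a flush as pending. */
static inline void p2m_model_set_entry(struct p2m_model *p2m)
{
    p2m->need_flush = true;
}

static inline void p2m_model_write_unlock(struct p2m_model *p2m)
{
    /* Flush only if something was changed while the lock was held. */
    if ( p2m->need_flush )
        p2m_model_flush(p2m);

    /* The real code would drop the write lock here. */
}
```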
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
2025-08-11 7:28 ` Jan Beulich
@ 2025-08-11 9:29 ` Oleksii Kurochko
2025-08-11 9:35 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-11 9:29 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Anthony PERARD,
Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel, Andrew Cooper
On 8/11/25 9:28 AM, Jan Beulich wrote:
> On 08.08.2025 15:46, Oleksii Kurochko wrote:
>> On 8/5/25 5:20 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>>>> +void p2m_write_unlock(struct p2m_domain *p2m)
>>>> +{
>>>> + /*
>>>> + * The final flush is done with the P2M write lock taken to avoid
>>>> + * someone else modifying the P2M wbefore the TLB invalidation has
>>> Nit: Stray 'w'.
>>>
>>>> + * completed.
>>>> + */
>>>> + p2m_tlb_flush_sync(p2m);
>>> Wasn't the plan to have this be conditional?
>> Not really; I probably misunderstood you before.
>>
>> Previously, I only had |p2m_force_tlb_flush_sync()| here, instead of
>> |p2m_tlb_flush_sync()|, and the latter includes a condition check on
>> |p2m->need_flush|.
> Just to re-iterate my point: Not every unlock will require a flush. Hence
> why I expect the flush to be conditional upon there being an indication
> that some change was done that requires flushing.
>
The flush is actually conditional; the condition is inside
|p2m_tlb_flush_sync()|:
void p2m_tlb_flush_sync(struct p2m_domain *p2m)
{
if ( p2m->need_flush )
p2m_force_tlb_flush_sync(p2m);
}
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m
2025-08-11 9:29 ` Oleksii Kurochko
@ 2025-08-11 9:35 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 9:35 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Anthony PERARD,
Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel, Andrew Cooper
On 11.08.2025 11:29, Oleksii Kurochko wrote:
>
> On 8/11/25 9:28 AM, Jan Beulich wrote:
>> On 08.08.2025 15:46, Oleksii Kurochko wrote:
>>> On 8/5/25 5:20 PM, Jan Beulich wrote:
>>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>>>>> +void p2m_write_unlock(struct p2m_domain *p2m)
>>>>> +{
>>>>> + /*
>>>>> + * The final flush is done with the P2M write lock taken to avoid
>>>>> + * someone else modifying the P2M wbefore the TLB invalidation has
>>>> Nit: Stray 'w'.
>>>>
>>>>> + * completed.
>>>>> + */
>>>>> + p2m_tlb_flush_sync(p2m);
>>>> Wasn't the plan to have this be conditional?
>>> Not really; I probably misunderstood you before.
>>>
>>> Previously, I only had |p2m_force_tlb_flush_sync()| here, instead of
>>> |p2m_tlb_flush_sync()|, and the latter includes a condition check on
>>> |p2m->need_flush|.
>> Just to re-iterate my point: Not every unlock will require a flush. Hence
>> why I expect the flush to be conditional upon there being an indication
>> that some change was done that requires flushing.
>>
> The flush is actually conditional; the condition is inside
> |p2m_tlb_flush_sync()|:
> void p2m_tlb_flush_sync(struct p2m_domain *p2m)
> {
> if ( p2m->need_flush )
> p2m_force_tlb_flush_sync(p2m);
> }
Hmm, I'd consider this misleading function naming then. Especially with
"force" and "sync" being kind of redundant with one another already anyway.
See x86'es naming.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-07-31 15:58 ` [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
@ 2025-08-11 11:36 ` Jan Beulich
2025-08-11 14:44 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 11:36 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -1,3 +1,4 @@
> +#include <xen/bug.h>
> #include <xen/domain_page.h>
> #include <xen/mm.h>
> #include <xen/rwlock.h>
> @@ -197,6 +198,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> return __map_domain_page(p2m->root + root_table_indx);
> }
>
> +static int p2m_set_type(pte_t *pte, p2m_type_t t)
> +{
> + int rc = 0;
> +
> + if ( t > p2m_ext_storage )
Seeing this separator enumerator in use, it becomes pretty clear that its name
needs to change, so one doesn't need to go look at its definition to understand
whether it's inclusive or exclusive. (This isn't helped by there presently being
a spare entry, which, when made use of, might then cause problems with
expressions like this one as well.)
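
One way to keep call sites from open-coding the comparison, sketched under the
assumption that the enum stays as posted. The two predicate names are
hypothetical, and this complements rather than replaces a better name for the
separator:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,      /* separator: not itself a mappable type */
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* Types strictly below the separator are encoded in the PTE itself. */
static inline bool p2m_type_fits_in_pte(p2m_type_t t)
{
    return t < p2m_ext_storage;
}

/* Types strictly above the separator need storage outside the PTE. */
static inline bool p2m_type_needs_ext_storage(p2m_type_t t)
{
    return t > p2m_ext_storage;
}
```

The separator itself satisfies neither predicate, which makes the exclusive
bound explicit in one place.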
> @@ -222,11 +235,71 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> p2m_write_pte(p, pte, clean_pte);
> }
>
> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
> {
> - panic("%s: hasn't been implemented yet\n", __func__);
> + e->pte &= ~PTE_ACCESS_MASK;
> +
> + switch ( t )
> + {
> + case p2m_grant_map_rw:
> + case p2m_ram_rw:
> + e->pte |= PTE_READABLE | PTE_WRITABLE;
> + break;
While I agree for r/w grants, shouldn't r/w RAM also be executable?
> + case p2m_ext_storage:
Why exactly would this placeholder ...
> + case p2m_mmio_direct_io:
> + e->pte |= PTE_ACCESS_MASK;
> + break;
... gain full access? It shouldn't make it here at all, should it?
> +
> + case p2m_invalid:
> + e->pte &= ~(PTE_ACCESS_MASK | PTE_VALID);
Redundantly masking off PTE_ACCESS_MASK? (Plus, for the entry to be
invalid, turning off PTE_VALID alone ought to suffice anyway?)
> + break;
> +
> + case p2m_grant_map_ro:
> + e->pte |= PTE_READABLE;
> + break;
> +
> + default:
> + ASSERT_UNREACHABLE();
> + break;
> + }
> +}
> +
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> +{
> + pte_t e = (pte_t) { PTE_VALID };
This and the rest of the function demand that mfn != INVALID_MFN, no matter
whether ...
> + switch ( t )
> + {
> + case p2m_mmio_direct_io:
> + e.pte |= PTE_PBMT_IO;
> + break;
> +
> + default:
> + break;
> + }
> +
> + pte_set_mfn(&e, mfn);
> +
> + ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
... PADDR_MASK is actually narrow enough to catch that case. Maybe best to
add an explicit assertion to that effect?
> + if ( !is_table )
> + {
> + p2m_set_permission(&e, t);
> +
> + if ( t < p2m_ext_storage )
> + p2m_set_type(&e, t);
> + else
> + panic("unimplemeted\n");
The check is already done inside p2m_set_type() - why open-code it here?
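
For illustration only, one possible reading of the p2m_set_permission()
remarks above: r/w RAM also executable, the placeholder type rejected, and the
invalid case only clearing the valid bit. The bit positions follow the RISC-V
privileged spec's PTE layout; the definition of PTE_ACCESS_MASK as R|W|X and
the function name are assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_VALID       (UINT64_C(1) << 0)
#define PTE_READABLE    (UINT64_C(1) << 1)
#define PTE_WRITABLE    (UINT64_C(1) << 2)
#define PTE_EXECUTABLE  (UINT64_C(1) << 3)
#define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)

typedef struct { uint64_t pte; } pte_t;

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

static inline void p2m_set_permission_sketch(pte_t *e, p2m_type_t t)
{
    /* Access bits are cleared once, up front. */
    e->pte &= ~PTE_ACCESS_MASK;

    switch ( t )
    {
    case p2m_ram_rw:
    case p2m_mmio_direct_io:
        e->pte |= PTE_ACCESS_MASK;
        break;

    case p2m_grant_map_rw:
        e->pte |= PTE_READABLE | PTE_WRITABLE;
        break;

    case p2m_grant_map_ro:
        e->pte |= PTE_READABLE;
        break;

    case p2m_invalid:
        /* Access bits are already clear; only the valid bit remains. */
        e->pte &= ~PTE_VALID;
        break;

    default:
        /* p2m_ext_storage is a placeholder and should never get here. */
        assert(!"unexpected p2m type");
        break;
    }
}
```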
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 15/20] xen/riscv: implement p2m_next_level()
2025-07-31 15:58 ` [PATCH v3 15/20] xen/riscv: implement p2m_next_level() Oleksii Kurochko
@ 2025-08-11 11:44 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 11:44 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -302,6 +302,48 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> return e;
> }
>
> +/* Generate table entry with correct attributes. */
> +static pte_t page_to_p2m_table(struct page_info *page)
You don't mean to alter what page points to, so pointer-to-const please.
> @@ -326,9 +368,43 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> unsigned int level, pte_t **table,
> unsigned int offset)
> {
> - panic("%s: hasn't been implemented yet\n", __func__);
> + pte_t *entry;
> + int ret;
Please can this move into the more narrow scope it's (solely) used in?
> + mfn_t mfn;
> +
> + /* The function p2m_next_level() is never called at the last level */
> + ASSERT(level != 0);
The revlog says "move", but ...
> + entry = *table + offset;
> +
> + if ( !pte_is_valid(*entry) )
> + {
> + if ( !alloc_tbl )
> + return P2M_TABLE_MAP_NONE;
> +
> + ret = p2m_create_table(p2m, entry);
> + if ( ret )
> + return P2M_TABLE_MAP_NOMEM;
> + }
> +
> + /* The function p2m_next_level() is never called at the last level */
> + ASSERT(level != 0);
... the original one's still here.
> --- a/xen/arch/riscv/paging.c
> +++ b/xen/arch/riscv/paging.c
> @@ -91,6 +91,17 @@ void paging_free_page(struct domain *d, struct page_info *pg)
> spin_unlock(&d->arch.paging.lock);
> }
>
> +struct page_info * paging_alloc_page(struct domain *d)
Nit: Stray blank after *.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings
2025-07-31 15:58 ` [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
@ 2025-08-11 11:59 ` Jan Beulich
2025-08-11 15:19 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 11:59 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Add support for down large memory mappings ("superpages") in the RISC-V
> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
> can be inserted into lower levels of the page table hierarchy.
>
> To implement that the following is done:
> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
> smaller page table entries down to the target level, preserving original
> permissions and attributes.
> - p2m_set_entry() updated to invoke superpage splitting when inserting
> entries at lower levels within a superpage-mapped region.
>
> This implementation is based on the ARM code, with modifications to the part
> that follows the BBM (break-before-make) approach, some parts are simplified
> as according to RISC-V spec:
> It is permitted for multiple address-translation cache entries to co-exist
> for the same address. This represents the fact that in a conventional
> TLB hierarchy, it is possible for multiple entries to match a single
> address if, for example, a page is upgraded to a superpage without first
> clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
> with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
> hierarchy. In this case, just as if an SFENCE.VMA is not executed between
> a write to the memory-management tables and subsequent implicit read of the
> same address: it is unpredictable whether the old non-leaf PTE or the new
> leaf PTE is used, but the behavior is otherwise well defined.
> In contrast to the Arm architecture, where BBM is mandatory and failing to
> use it in some cases can lead to CPU instability, RISC-V guarantees
> stability, and the behavior remains safe — though unpredictable in terms of
> which translation will be used.
>
> Additionally, the page table walk logic has been adjusted, as ARM uses the
> opposite number of levels compared to RISC-V.
As before, I think you mean "numbering".
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -539,6 +539,91 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
> p2m_free_page(p2m, pg);
> }
>
> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> + unsigned int level, unsigned int target,
> + const unsigned int *offsets)
> +{
> + struct page_info *page;
> + unsigned long i;
> + pte_t pte, *table;
> + bool rv = true;
> +
> + /* Convenience aliases */
> + mfn_t mfn = pte_get_mfn(*entry);
> + unsigned int next_level = level - 1;
> + unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
> +
> + /*
> + * This should only be called with target != level and the entry is
> + * a superpage.
> + */
> + ASSERT(level > target);
> + ASSERT(pte_is_superpage(*entry, level));
> +
> + page = p2m_alloc_page(p2m->domain);
> + if ( !page )
> + {
> + /*
> + * The caller is in charge to free the sub-tree.
> + * As we didn't manage to allocate anything, just tell the
> + * caller there is nothing to free by invalidating the PTE.
> + */
> + memset(entry, 0, sizeof(*entry));
> + return false;
> + }
> +
> + table = __map_domain_page(page);
> +
> + /*
> + * We are either splitting a second level 1G page into 512 first level
> + * 2M pages, or a first level 2M page into 512 zero level 4K pages.
> + */
Such a comment is at risk of (silently) going stale when support for 512G
mappings is added. I wonder if it's really that informative to have here.
> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
> + {
> + pte_t *new_entry = table + i;
> +
> + /*
> + * Use the content of the superpage entry and override
> + * the necessary fields. So the correct permission are kept.
> + */
It's not just permissions though? The memory type field also needs
retaining (and is being retained this way). Maybe better say "attributes"?
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 17/20] xen/riscv: implement put_page()
2025-07-31 15:58 ` [PATCH v3 17/20] xen/riscv: implement put_page() Oleksii Kurochko
@ 2025-08-11 12:43 ` Jan Beulich
2025-08-11 15:32 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 12:43 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Implement put_page(), as it will be used by p2m_put_code().
I would have ack-ed the code change, but the description is irritating:
Who or what is p2m_put_code() (going to be)?
> Although CONFIG_STATIC_MEMORY has not yet been introduced for RISC-V,
> a stub for PGC_static is added to avoid cluttering the code of
> put_page_nr() with #ifdefs.
There isn't any put_page_nr() being introduced (anymore), though.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
2025-07-31 15:58 ` [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
@ 2025-08-11 12:50 ` Jan Beulich
2025-08-11 15:34 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 12:50 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Implement the mfn_valid() macro to verify whether a given MFN is valid by
> checking that it falls within the range [start_page, max_page).
> These bounds are initialized based on the start and end addresses of RAM.
>
> As part of this patch, start_page is introduced and initialized with the
> PFN of the first RAM page.
> Also, initialize pdx_group_valid() by calling set_pdx_range() when
> memory banks are being mapped.
>
> Also, after providing a non-stub implementation of the mfn_valid() macro,
> the following compilation errors started to occur:
> riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
> /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
> riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
> /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
> riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
> /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
> riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
> riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
> /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
> riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
> riscv64-linux-gnu-ld: final link failed: bad value
> make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
> To resolve these errors, the following functions have also been introduced,
> based on their Arm counterparts:
> - page_get_owner_and_reference() and its variant to safely acquire a
> reference to a page and retrieve its owner.
> - A stub for page_is_ram_type() that currently always returns 0 and asserts
> unreachable, as RAM type checking is not yet implemented.
For this latter part I can only repeat that the code is reachable, and hence it
is wrong to put ASSERT_UNREACHABLE() there. That's true for Arm's code as well.
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
With said line dropped:
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN
2025-07-31 15:58 ` [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
@ 2025-08-11 13:25 ` Jan Beulich
2025-08-12 11:42 ` Oleksii Kurochko
2025-08-22 8:39 ` Oleksii Kurochko
0 siblings, 2 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 13:25 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> Introduce helper functions for safely querying the P2M (physical-to-machine)
> mapping:
> - add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
> P2M lock state.
> - Implement p2m_get_entry() to retrieve mapping details for a given GFN,
> including MFN, page order, and validity.
> - Add p2m_lookup() to encapsulate read-locked MFN retrieval.
> - Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
> pointer, acquiring a reference to the page if valid.
> - Introduce get_page().
>
> Implementations are based on Arm's functions with some minor modifications:
> - p2m_get_entry():
> - Reverse traversal of page tables, as RISC-V uses the opposite level
> numbering compared to Arm.
> - Removed the return of p2m_access_t from p2m_get_entry() since
> mem_access_settings is not introduced for RISC-V.
> - Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
> to Arm's THIRD_MASK.
> - Replaced open-coded bit shifts with the BIT() macro.
> - Other minor changes, such as using RISC-V-specific functions to validate
> P2M PTEs, and replacing Arm-specific GUEST_* macros with their RISC-V
> equivalents.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
> Changes in V3:
> - Add is_p2m_foreign() macro and connected stuff.
What is this about?
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -202,6 +202,24 @@ static inline int p2m_is_write_locked(struct p2m_domain *p2m)
>
> unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid);
>
> +static inline void p2m_read_lock(struct p2m_domain *p2m)
> +{
> + read_lock(&p2m->lock);
> +}
> +
> +static inline void p2m_read_unlock(struct p2m_domain *p2m)
> +{
> + read_unlock(&p2m->lock);
> +}
> +
> +static inline int p2m_is_locked(struct p2m_domain *p2m)
bool return type (also for p2m_is_write_locked() in patch 11)? Also perhaps
pointer-to-const parameter?
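For illustration, the requested shape (bool result, pointer-to-const parameter) could look roughly like the sketch below. mock_rwlock_t is a hypothetical stand-in for Xen's rwlock_t so that the fragment builds on its own; the real helpers would query the actual lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for Xen's rwlock_t; only the state needed here. */
typedef struct {
    int readers;   /* number of active readers */
    bool writer;   /* true while a writer holds the lock */
} mock_rwlock_t;

struct p2m_domain {
    mock_rwlock_t lock;
};

/* Predicates return bool and take a pointer-to-const: they only inspect. */
static inline bool p2m_is_write_locked(const struct p2m_domain *p2m)
{
    return p2m->lock.writer;
}

static inline bool p2m_is_locked(const struct p2m_domain *p2m)
{
    return p2m->lock.writer || p2m->lock.readers > 0;
}
```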
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -852,3 +852,139 @@ int map_regions_p2mt(struct domain *d,
> {
> return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
> }
> +
> +/*
> + * Get the details of a given gfn.
> + *
> + * If the entry is present, the associated MFN will be returned type filled up.
This sentence doesn't really parse, perhaps due to missing words.
> + * The page_order will correspond to the order of the mapping in the page
> + * table (i.e it could be a superpage).
> + *
> + * If the entry is not present, INVALID_MFN will be returned and the
> + * page_order will be set according to the order of the invalid range.
> + *
> + * valid will contain the value of bit[0] (e.g valid bit) of the
> + * entry.
> + */
> +static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
> + p2m_type_t *t,
> + unsigned int *page_order,
> + bool *valid)
> +{
> + unsigned int level = 0;
> + pte_t entry, *table;
> + int rc;
> + mfn_t mfn = INVALID_MFN;
> + DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
> +
> + ASSERT(p2m_is_locked(p2m));
> + BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
What function-wide property is this check about? Even when moved ...
> + if ( valid )
> + *valid = false;
> +
> + /* XXX: Check if the mapping is lower than the mapped gfn */
(Nested: What is this about?)
> + /* This gfn is higher than the highest the p2m map currently holds */
> + if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
> + {
> + for ( level = P2M_ROOT_LEVEL; level; level-- )
> + if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
... into the more narrow scope where another XEN_PT_LEVEL_MASK() exists I
can't really spot what the check is to guard against.
> + gfn_x(p2m->max_mapped_gfn) )
> + break;
> +
> + goto out;
> + }
> +
> + table = p2m_get_root_pointer(p2m, gfn);
> +
> + /*
> + * the table should always be non-NULL because the gfn is below
> + * p2m->max_mapped_gfn and the root table pages are always present.
> + */
Nit: Style.
> + if ( !table )
> + {
> + ASSERT_UNREACHABLE();
> + level = P2M_ROOT_LEVEL;
> + goto out;
> + }
> +
> + for ( level = P2M_ROOT_LEVEL; level; level-- )
> + {
> + rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
Why would you blindly allocate a page table (hierarchy) here? If anything,
this may need doing upon caller request (as it's only up the call chain
where the necessary knowledge exists). For example, ...
> +static mfn_t p2m_lookup(struct p2m_domain *p2m, gfn_t gfn, p2m_type_t *t)
> +{
> + mfn_t mfn;
> +
> + p2m_read_lock(p2m);
> + mfn = p2m_get_entry(p2m, gfn, t, NULL, NULL);
... this (by its name) pretty likely won't want allocation, while ...
> + p2m_read_unlock(p2m);
> +
> + return mfn;
> +}
> +
> +struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
> + p2m_type_t *t)
> +{
... this will. Yet then ...
> + struct page_info *page;
> + p2m_type_t p2mt = p2m_invalid;
> + mfn_t mfn = p2m_lookup(p2m, gfn, t);
... you use the earlier one here.
> + if ( !mfn_valid(mfn) )
> + return NULL;
> +
> + if ( t )
> + p2mt = *t;
> +
> + page = mfn_to_page(mfn);
> +
> + /*
> + * get_page won't work on foreign mapping because the page doesn't
> + * belong to the current domain.
> + */
> + if ( p2m_is_foreign(p2mt) )
> + {
> + struct domain *fdom = page_get_owner_and_reference(page);
> + ASSERT(fdom != NULL);
> + ASSERT(fdom != p2m->domain);
> + return page;
In a release build (with no assertions) this will be wrong if either of the
two conditions is not satisfied. See x86's respective code.
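A minimal sketch of handling both conditions at run time rather than only asserting them. All types and helpers below are simplified mocks with the same names as the Xen ones, and get_foreign_page() is a hypothetical name; this is not the x86 code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified mocks; the real structures carry far more state. */
struct domain { int id; };
struct page_info { struct domain *owner; unsigned int refcount; };

/* Mock: takes a reference and returns the owner, or NULL without a ref. */
static struct domain *page_get_owner_and_reference(struct page_info *pg)
{
    if ( !pg->owner )
        return NULL;

    pg->refcount++;
    return pg->owner;
}

static void put_page(struct page_info *pg)
{
    assert(pg->refcount > 0);
    pg->refcount--;
}

/* Hypothetical helper: returns the page only if a foreign ref was taken. */
static struct page_info *get_foreign_page(struct page_info *page,
                                          const struct domain *cur)
{
    struct domain *fdom = page_get_owner_and_reference(page);

    if ( !fdom )
        return NULL;        /* no reference was acquired */

    if ( fdom == cur )
    {
        put_page(page);     /* not foreign after all: drop the reference */
        return NULL;
    }

    return page;
}
```

With this shape a release build never hands back a page for which no reference is held.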
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-08-11 11:36 ` Jan Beulich
@ 2025-08-11 14:44 ` Oleksii Kurochko
2025-08-11 15:11 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-11 14:44 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 1:36 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -1,3 +1,4 @@
>> +#include <xen/bug.h>
>> #include <xen/domain_page.h>
>> #include <xen/mm.h>
>> #include <xen/rwlock.h>
>> @@ -197,6 +198,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> return __map_domain_page(p2m->root + root_table_indx);
>> }
>>
>> +static int p2m_set_type(pte_t *pte, p2m_type_t t)
>> +{
>> + int rc = 0;
>> +
>> + if ( t > p2m_ext_storage )
> Seeing this separator enumerator in use, it becomes pretty clear that its name
> needs to change, so one doesn't need to go look at its definition to understand
> whether it's inclusive or exclusive. (This isn't helped by there presently being
> a spare entry, which, when made use of, might then cause problems with
> expressions like this one as well.)
Then p2m_pte_type_count might be a better name, as it indicates how many types are
stored directly in the PTE bits.
>
>> @@ -222,11 +235,71 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>> p2m_write_pte(p, pte, clean_pte);
>> }
>>
>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>> {
>> - panic("%s: hasn't been implemented yet\n", __func__);
>> + e->pte &= ~PTE_ACCESS_MASK;
>> +
>> + switch ( t )
>> + {
>> + case p2m_grant_map_rw:
>> + case p2m_ram_rw:
>> + e->pte |= PTE_READABLE | PTE_WRITABLE;
>> + break;
> While I agree for r/w grants, shouldn't r/w RAM also be executable?
>
>> + case p2m_ext_storage:
> Why exactly would this placeholder ...
>
>> + case p2m_mmio_direct_io:
>> + e->pte |= PTE_ACCESS_MASK;
>> + break;
> ... gain full access? It shouldn't make it here at all, should it?
I forgot to add a break between them, and I don't remember why I
put it here.
It could freely be moved before "default".
And yes, you are right: it shouldn't be handled in this function at
all, as the function isn't expected to be called with this type. The
type is only used to indicate that the real type is stored somewhere
else.
>
>> +
>> + case p2m_invalid:
>> + e->pte &= ~(PTE_ACCESS_MASK | PTE_VALID);
> Redundantly masking off PTE_ACCESS_MASK? (Plus, for the entry to be
> invalid, turning off PTE_VALID alone ought to suffice anyway?)
Agreed, turning off PTE_VALID alone would be enough.
>> + break;
>> +
>> + case p2m_grant_map_ro:
>> + e->pte |= PTE_READABLE;
>> + break;
>> +
>> + default:
>> + ASSERT_UNREACHABLE();
>> + break;
>> + }
>> +}
>> +
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> +{
>> + pte_t e = (pte_t) { PTE_VALID };
> This and the rest of the function demand that mfn != INVALID_MFN, no matter
> whether ...
>
>> + switch ( t )
>> + {
>> + case p2m_mmio_direct_io:
>> + e.pte |= PTE_PBMT_IO;
>> + break;
>> +
>> + default:
>> + break;
>> + }
>> +
>> + pte_set_mfn(&e, mfn);
>> +
>> + ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
> ... PADDR_MASK is actually narrow enough to catch that case. Maybe best to
> add an explicit assertion to that effect?
Then the following should be enough, instead of what we have now:
ASSERT(mfn_valid(mfn));
>
>> + if ( !is_table )
>> + {
>> + p2m_set_permission(&e, t);
>> +
>> + if ( t < p2m_ext_storage )
>> + p2m_set_type(&e, t);
>> + else
>> + panic("unimplemeted\n");
> The check is already done inside p2m_set_type() - why open-code it here?
It doesn't really matter now (so it could be dropped), but in a later patch this part
of the code will look like:
    metadata[indx].pte = p2m_invalid;

    if ( t < p2m_ext_storage )
        p2m_set_type(&e, t, indx);
    else
    {
        e.pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
        p2m_set_type(metadata, t, indx);
    }
So my intention was to re-use p2m_set_type() without changing its prototype: if
the type is stored in the PTE bits we pass the PTE directly, otherwise we pass
the metadata.
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-08-11 14:44 ` Oleksii Kurochko
@ 2025-08-11 15:11 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 15:11 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11.08.2025 16:44, Oleksii Kurochko wrote:
> On 8/11/25 1:36 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>>> +{
>>> + pte_t e = (pte_t) { PTE_VALID };
>> This and the rest of the function demand that mfn != INVALID_MFN, no matter
>> whether ...
>>
>>> + switch ( t )
>>> + {
>>> + case p2m_mmio_direct_io:
>>> + e.pte |= PTE_PBMT_IO;
>>> + break;
>>> +
>>> + default:
>>> + break;
>>> + }
>>> +
>>> + pte_set_mfn(&e, mfn);
>>> +
>>> + ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
>> ... PADDR_MASK is actually narrow enough to catch that case. Maybe best to
>> add an explicit assertion to that effect?
>
> Then it should be enough instead of what we have now:
> ASSERT(mfn_valid(mfn));
No, that would exclude MMIO living beyond max_page.
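To make the distinction concrete, here is a small sketch with made-up RAM bounds. mfn_t, INVALID_MFN, mfn_eq() and mfn_valid() are local mocks named after the Xen ones, and check_mappable() is a hypothetical helper. An MMIO frame above max_page fails the range check but still passes the narrower "not INVALID_MFN" assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mock of the typesafe MFN wrapper. */
typedef struct { uint64_t mfn; } mfn_t;

#define INVALID_MFN ((mfn_t){ UINT64_MAX })

/* Assumed RAM bounds, for illustration only. */
static const uint64_t start_page = 0x80000;
static const uint64_t max_page   = 0x100000;

static bool mfn_eq(mfn_t a, mfn_t b)
{
    return a.mfn == b.mfn;
}

/* Range check as described for patch 18: true only for RAM frames. */
static bool mfn_valid(mfn_t mfn)
{
    return mfn.mfn >= start_page && mfn.mfn < max_page;
}

/* Hypothetical helper: rejects INVALID_MFN but accepts MMIO frames. */
static void check_mappable(mfn_t mfn)
{
    assert(!mfn_eq(mfn, INVALID_MFN));
}
```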
>>> + if ( !is_table )
>>> + {
>>> + p2m_set_permission(&e, t);
>>> +
>>> + if ( t < p2m_ext_storage )
>>> + p2m_set_type(&e, t);
>>> + else
>>> + panic("unimplemeted\n");
>> The check is already done inside p2m_set_type() - why open-code it here?
>
> It isn't really matters now (so could be dropped), but in further patch this part
> of code will look like:
> metadata[indx].pte = p2m_invalid;
>
> if ( t < p2m_ext_storage )
> p2m_set_type(&e, t, indx);
> else
> {
> e.pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
> p2m_set_type(metadata, t, indx);
> }
> So my intention was to re-use p2m_set_type() without changing of a prototype. So,
> if a type is stored in PTE bits then we pass PTE directly, if not - then pass
> metadata.
Then at the very least p2m_set_type() may not be a good name; a function of this
name imo should set the type, whatever it takes to do so. But I'm unconvinced of
the model as a whole.
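As a rough sketch of a setter that does "whatever it takes", assuming the type lives in the two software-reserved PTE bits (8 and 9) and that a per-table metadata array is indexed like the table itself. struct p2m_pte_ctx and the enumerator order are hypothetical, not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t pte; } pte_t;

/* Assumption: the two software-reserved PTE bits hold the type. */
#define P2M_TYPE_SHIFT 8
#define P2M_TYPE_MASK  (UINT64_C(3) << P2M_TYPE_SHIFT)

/* Hypothetical type list; the real enumeration lives in the series. */
typedef enum {
    p2m_invalid,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,     /* marker: the real type is in the metadata */
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* Hypothetical bundle keeping table, metadata and index in sync. */
struct p2m_pte_ctx {
    pte_t *table;        /* page table holding the entry */
    pte_t *metadata;     /* metadata page paired with that table */
    unsigned int index;  /* slot of the entry in both arrays */
};

static void p2m_set_type(const struct p2m_pte_ctx *ctx, p2m_type_t t)
{
    pte_t *e = &ctx->table[ctx->index];
    pte_t *md = &ctx->metadata[ctx->index];
    p2m_type_t in_pte = t;

    assert(t != p2m_ext_storage);    /* the marker is never a real type */

    md->pte = p2m_invalid;
    if ( t > p2m_ext_storage )
    {
        md->pte = t;                 /* real type goes to the metadata */
        in_pte = p2m_ext_storage;    /* the PTE only carries the marker */
    }

    e->pte &= ~P2M_TYPE_MASK;
    e->pte |= ((uint64_t)in_pte << P2M_TYPE_SHIFT) & P2M_TYPE_MASK;
}

static p2m_type_t p2m_get_type(const struct p2m_pte_ctx *ctx)
{
    uint64_t raw = (ctx->table[ctx->index].pte & P2M_TYPE_MASK) >>
                   P2M_TYPE_SHIFT;

    if ( raw == p2m_ext_storage )
        return (p2m_type_t)ctx->metadata[ctx->index].pte;

    return (p2m_type_t)raw;
}
```

Callers then never decide where a type is stored; they only name the slot.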
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings
2025-08-11 11:59 ` Jan Beulich
@ 2025-08-11 15:19 ` Oleksii Kurochko
2025-08-11 15:47 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-11 15:19 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 1:59 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Add support for down large memory mappings ("superpages") in the RISC-V
>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>> can be inserted into lower levels of the page table hierarchy.
>>
>> To implement that the following is done:
>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>> smaller page table entries down to the target level, preserving original
>> permissions and attributes.
>> - p2m_set_entry() updated to invoke superpage splitting when inserting
>> entries at lower levels within a superpage-mapped region.
>>
>> This implementation is based on the ARM code, with modifications to the part
>> that follows the BBM (break-before-make) approach, some parts are simplified
>> as according to RISC-V spec:
>> It is permitted for multiple address-translation cache entries to co-exist
>> for the same address. This represents the fact that in a conventional
>> TLB hierarchy, it is possible for multiple entries to match a single
>> address if, for example, a page is upgraded to a superpage without first
>> clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
>> with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
>> hierarchy. In this case, just as if an SFENCE.VMA is not executed between
>> a write to the memory-management tables and subsequent implicit read of the
>> same address: it is unpredictable whether the old non-leaf PTE or the new
>> leaf PTE is used, but the behavior is otherwise well defined.
>> In contrast to the Arm architecture, where BBM is mandatory and failing to
>> use it in some cases can lead to CPU instability, RISC-V guarantees
>> stability, and the behavior remains safe — though unpredictable in terms of
>> which translation will be used.
>>
>> Additionally, the page table walk logic has been adjusted, as ARM uses the
>> opposite number of levels compared to RISC-V.
> As before, I think you mean "numbering".
Yes, level numbering would be better.
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -539,6 +539,91 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
>> p2m_free_page(p2m, pg);
>> }
>>
>> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> + unsigned int level, unsigned int target,
>> + const unsigned int *offsets)
>> +{
>> + struct page_info *page;
>> + unsigned long i;
>> + pte_t pte, *table;
>> + bool rv = true;
>> +
>> + /* Convenience aliases */
>> + mfn_t mfn = pte_get_mfn(*entry);
>> + unsigned int next_level = level - 1;
>> + unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
>> +
>> + /*
>> + * This should only be called with target != level and the entry is
>> + * a superpage.
>> + */
>> + ASSERT(level > target);
>> + ASSERT(pte_is_superpage(*entry, level));
>> +
>> + page = p2m_alloc_page(p2m->domain);
>> + if ( !page )
>> + {
>> + /*
>> + * The caller is in charge to free the sub-tree.
>> + * As we didn't manage to allocate anything, just tell the
>> + * caller there is nothing to free by invalidating the PTE.
>> + */
>> + memset(entry, 0, sizeof(*entry));
>> + return false;
>> + }
>> +
>> + table = __map_domain_page(page);
>> +
>> + /*
>> + * We are either splitting a second level 1G page into 512 first level
>> + * 2M pages, or a first level 2M page into 512 zero level 4K pages.
>> + */
> Such a comment is at risk of (silently) going stale when support for 512G
> mappings is added. I wonder if it's really that informative to have here.
Good point, I think we could really drop it.
Regarding support for 512G mappings: does it really make sense to support
such big mappings? It seems that some operations, such as splitting or sub-entry
freeing, could take quite long under some circumstances.
>
>> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>> + {
>> + pte_t *new_entry = table + i;
>> +
>> + /*
>> + * Use the content of the superpage entry and override
>> + * the necessary fields. So the correct permission are kept.
>> + */
> It's not just permissions though? The memory type field also needs
> retaining (and is being retained this way). Maybe better say "attributes"?
Sure, I'll use "attributes" instead.
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 17/20] xen/riscv: implement put_page()
2025-08-11 12:43 ` Jan Beulich
@ 2025-08-11 15:32 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-11 15:32 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 2:43 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Implement put_page(), as it will be used by p2m_put_code().
> I would have ack-ed the code change, but the description is irritating:
> Who or what is p2m_put_code() (going to be)?
It should be p2m_put_*-related code.
>
>> Although CONFIG_STATIC_MEMORY has not yet been introduced for RISC-V,
>> a stub for PGC_static is added to avoid cluttering the code of
>> put_page_nr() with #ifdefs.
> There isn't any put_page_nr() being introduced (anymore), though.
I'll correct the commit message; it should be put_page() here.
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
2025-08-11 12:50 ` Jan Beulich
@ 2025-08-11 15:34 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-11 15:34 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 2:50 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Implement the mfn_valid() macro to verify whether a given MFN is valid by
>> checking that it falls within the range [start_page, max_page).
>> These bounds are initialized based on the start and end addresses of RAM.
>>
>> As part of this patch, start_page is introduced and initialized with the
>> PFN of the first RAM page.
>> Also, initialize pdx_group_valid() by calling set_pdx_range() when
>> memory banks are being mapped.
>>
>> Also, after providing a non-stub implementation of the mfn_valid() macro,
>> the following compilation errors started to occur:
>> riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>> /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>> riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>> /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>> riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>> /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>> riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>> riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>> /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>> riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>> riscv64-linux-gnu-ld: final link failed: bad value
>> make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
>> To resolve these errors, the following functions have also been introduced,
>> based on their Arm counterparts:
>> - page_get_owner_and_reference() and its variant to safely acquire a
>> reference to a page and retrieve its owner.
>> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>> unreachable, as RAM type checking is not yet implemented.
> For this latter part I can only repeat that the code is reachable, and hence it
> is wrong to put ASSERT_UNREACHABLE() there. That's true for Arm's code as well.
I will drop this stuff in the next patch version.
>
>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> With said line dropped:
> Acked-by: Jan Beulich <jbeulich@suse.com>
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type
2025-07-31 15:58 ` [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
@ 2025-08-11 15:44 ` Jan Beulich
2025-08-12 14:52 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 15:44 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 31.07.2025 17:58, Oleksii Kurochko wrote:
> RISC-V's PTE has only two available bits that can be used to store the P2M
> type. This is insufficient to represent all the current RISC-V P2M types.
> Therefore, some P2M types must be stored outside the PTE bits.
>
> To address this, a metadata table is introduced to store P2M types that
> cannot fit in the PTE itself. Not all P2M types are stored in the
> metadata table—only those that require it.
>
> The metadata table is linked to the intermediate page table via the
> `struct page_info`'s list field of the corresponding intermediate page.
>
> To simplify the allocation and linking of intermediate and metadata page
> tables, `p2m_{alloc,free}_table()` functions are implemented.
>
> These changes impact `p2m_split_superpage()`, since when a superpage is
> split, it is necessary to update the metadata table of the new
> intermediate page table — if the entry being split has its P2M type set
> to `p2m_ext_storage` in its `P2M_TYPES` bits.
Oh, this was an aspect I didn't realize when commenting on the name of
the enumerator. I think you want to keep the name for the purpose here,
but you'd better not apply relational operators to it (and hence
have a second value to serve that purpose).
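One way this could be arranged, purely as a sketch: keep p2m_ext_storage as the marker and add a separate boundary enumerator for range checks. p2m_first_external, the two helpers, and the enumerator order are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,      /* marker kept in the PTE bits, never a real type */

    /* Hypothetical boundary: first type that must live in the metadata. */
    p2m_first_external,
    p2m_grant_map_rw = p2m_first_external,
    p2m_grant_map_ro,
} p2m_type_t;

/* The marker itself has to fit into the two available PTE bits. */
_Static_assert(p2m_ext_storage < (1 << 2), "marker must fit in 2 PTE bits");

/* Relational checks go against the boundary, not against the marker. */
static bool p2m_type_is_external(p2m_type_t t)
{
    return t >= p2m_first_external;
}

/* Value to place in the PTE type bits for a given real type. */
static p2m_type_t p2m_type_pte_bits(p2m_type_t t)
{
    assert(t != p2m_ext_storage);

    return p2m_type_is_external(t) ? p2m_ext_storage : t;
}
```

The marker is then only ever compared for equality, so adding or reordering types cannot silently change what it means.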
> In addition to updating
> the metadata of the new intermediate page table, the corresponding entry
> in the metadata for the original superpage is invalidated.
>
> Also, update p2m_{get,set}_type to work with P2M types which don't fit
> into PTE bits.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
No Suggested-by: or anything?
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -150,6 +150,15 @@ struct page_info
> /* Order-size of the free chunk this page is the head of. */
> unsigned int order;
> } free;
> +
> + /* Page is used to store metadata: p2m type. */
That's not correct. The page thus described is what the pointer below
points to. Here it's more like "Page is used as an intermediate P2M
page table".
> + struct {
> + /*
> + * Pointer to a page which store metadata for an intermediate page
> + * table.
> + */
> + struct page_info *metadata;
> + } md;
In the description you say you would re-use the list field.
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -101,7 +101,16 @@ static int p2m_alloc_root_table(struct p2m_domain *p2m)
> {
> struct domain *d = p2m->domain;
> struct page_info *page;
> - const unsigned int nr_root_pages = P2M_ROOT_PAGES;
> + /*
> + * If the root page table starts at Level <= 2, and since only 1GB, 2MB,
> + * and 4KB mappings are supported (as enforced by the ASSERT() in
> + * p2m_set_entry()), it is necessary to allocate P2M_ROOT_PAGES for
> + * the root page table itself, plus an additional P2M_ROOT_PAGES for
> + * metadata storage. This is because only two free bits are available in
> + * the PTE, which are not sufficient to represent all possible P2M types.
> + */
> + const unsigned int nr_root_pages = P2M_ROOT_PAGES *
> + ((P2M_ROOT_LEVEL <= 2) ? 2 : 1);
>
> /*
> * Return back nr_root_pages to assure the root table memory is also
> @@ -114,6 +123,23 @@ static int p2m_alloc_root_table(struct p2m_domain *p2m)
> if ( !page )
> return -ENOMEM;
>
> + if ( P2M_ROOT_LEVEL <= 2 )
> + {
> + /*
> + * In the case where P2M_ROOT_LEVEL <= 2, it is necessary to allocate
> + * a page of the same size as that used for the root page table.
> + * Therefore, p2m_allocate_root() can be safely reused.
> + */
> + struct page_info *metadata = p2m_allocate_root(d);
> + if ( !metadata )
> + {
> + free_domheap_pages(page, P2M_ROOT_ORDER);
> + return -ENOMEM;
> + }
> +
> + page->v.md.metadata = metadata;
Don't you need to install such a link for every one of the 4 pages?
> @@ -198,24 +224,25 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> return __map_domain_page(p2m->root + root_table_indx);
> }
>
> -static int p2m_set_type(pte_t *pte, p2m_type_t t)
> +static void p2m_set_type(pte_t *pte, const p2m_type_t t, const unsigned int i)
> {
> - int rc = 0;
> -
> if ( t > p2m_ext_storage )
> - panic("unimplemeted\n");
> + {
> + ASSERT(pte);
> +
> + pte[i].pte = t;
What does i identify here?
> + }
> else
> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
> -
> - return rc;
> }
>
> -static p2m_type_t p2m_get_type(const pte_t pte)
> +static p2m_type_t p2m_get_type(const pte_t pte, const pte_t *metadata,
> + const unsigned int i)
> {
> p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
>
> if ( type == p2m_ext_storage )
> - panic("unimplemented\n");
> + type = metadata[i].pte;
>
> return type;
> }
Overall this feels pretty fragile, as the caller has to pass several values
which all need to be in sync with one another. If you ...
> @@ -265,7 +292,10 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
> }
> }
>
> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
> + struct page_info *metadata_pg,
> + const unsigned int indx,
> + bool is_table)
> {
> pte_t e = (pte_t) { PTE_VALID };
>
> @@ -285,12 +315,21 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>
> if ( !is_table )
> {
> + pte_t *metadata = __map_domain_page(metadata_pg);
... map the page anyway, no matter whether ...
> p2m_set_permission(&e, t);
>
> + metadata[indx].pte = p2m_invalid;
> +
> if ( t < p2m_ext_storage )
> - p2m_set_type(&e, t);
> + p2m_set_type(&e, t, indx);
> else
> - panic("unimplemeted\n");
> + {
> + e.pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
> + p2m_set_type(metadata, t, indx);
> + }
... you'll actually use it, maybe best to map both pages at the same point?
And as said elsewhere, no, I don't think you want to use p2m_set_type() for
two entirely different purposes.
> @@ -323,22 +364,71 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
> return pg;
> }
>
> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
> +
> +/*
> + * Allocate a page table with an additional extra page to store
> + * metadata for each entry of the page table.
> + * Link this metadata page to page table page's list field.
> + */
> +static struct page_info * p2m_alloc_table(struct p2m_domain *p2m)
Nit: Stray blank after * again.
> +{
> + enum table_type
> + {
> + INTERMEDIATE_TABLE=0,
If you really think you need the "= 0", then please with blanks around '='.
> + /*
> + * At the moment, metadata is going to store P2M type
> + * for each PTE of page table.
> + */
> + METADATA_TABLE,
> + TABLE_MAX
> + };
> +
> + struct page_info *tables[TABLE_MAX];
> +
> + for ( unsigned int i = 0; i < TABLE_MAX; i++ )
> + {
> + tables[i] = p2m_alloc_page(p2m);
> +
> + if ( !tables[i] )
> + goto out;
> +
> + clear_and_clean_page(tables[i]);
> + }
> +
> + tables[INTERMEDIATE_TABLE]->v.md.metadata = tables[METADATA_TABLE];
> +
> + return tables[INTERMEDIATE_TABLE];
> +
> + out:
> + for ( unsigned int i = 0; i < TABLE_MAX; i++ )
> + if ( tables[i] )
You didn't clear all of tables[] first, though. This kind of cleanup is
often better done as
while ( i-- > 0 )
...
You don't even need an if() then, as you know allocations succeeded for all
earlier array slots.
> + p2m_free_page(p2m, tables[i]);
> +
> + return NULL;
> +}
I'm also surprised you allocate the metadata table no matter whether you'll
actually need it. That'll double your average paging pool usage, when in a
typical case only very few entries would actually require this extra
storage.
> @@ -453,10 +543,9 @@ static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
> }
>
> /* Put any references on the page referenced by pte. */
> -static void p2m_put_page(const pte_t pte, unsigned int level)
> +static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
> {
> mfn_t mfn = pte_get_mfn(pte);
> - p2m_type_t p2m_type = p2m_get_type(pte);
>
> ASSERT(pte_is_valid(pte));
>
> @@ -470,10 +559,10 @@ static void p2m_put_page(const pte_t pte, unsigned int level)
> switch ( level )
> {
> case 1:
> - return p2m_put_2m_superpage(mfn, p2m_type);
> + return p2m_put_2m_superpage(mfn, p2mt);
>
> case 0:
> - return p2m_put_4k_page(mfn, p2m_type);
> + return p2m_put_4k_page(mfn, p2mt);
> }
> }
Might it be better to introduce this function in this shape right away, in
the earlier patch?
> @@ -690,18 +791,23 @@ static int p2m_set_entry(struct p2m_domain *p2m,
> {
> /* We need to split the original page. */
> pte_t split_pte = *entry;
> + struct page_info *metadata = virt_to_page(table)->v.md.metadata;
This (or along these lines) is how I would have expected things to be done
elsewhere as well, limiting the amount of arguments you need to pass
around.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings
2025-08-11 15:19 ` Oleksii Kurochko
@ 2025-08-11 15:47 ` Jan Beulich
0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2025-08-11 15:47 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11.08.2025 17:19, Oleksii Kurochko wrote:
> On 8/11/25 1:59 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -539,6 +539,91 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
>>> p2m_free_page(p2m, pg);
>>> }
>>>
>>> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> + unsigned int level, unsigned int target,
>>> + const unsigned int *offsets)
>>> +{
>>> + struct page_info *page;
>>> + unsigned long i;
>>> + pte_t pte, *table;
>>> + bool rv = true;
>>> +
>>> + /* Convenience aliases */
>>> + mfn_t mfn = pte_get_mfn(*entry);
>>> + unsigned int next_level = level - 1;
>>> + unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
>>> +
>>> + /*
>>> + * This should only be called with target != level and the entry is
>>> + * a superpage.
>>> + */
>>> + ASSERT(level > target);
>>> + ASSERT(pte_is_superpage(*entry, level));
>>> +
>>> + page = p2m_alloc_page(p2m->domain);
>>> + if ( !page )
>>> + {
>>> + /*
>>> + * The caller is in charge to free the sub-tree.
>>> + * As we didn't manage to allocate anything, just tell the
>>> + * caller there is nothing to free by invalidating the PTE.
>>> + */
>>> + memset(entry, 0, sizeof(*entry));
>>> + return false;
>>> + }
>>> +
>>> + table = __map_domain_page(page);
>>> +
>>> + /*
>>> + * We are either splitting a second level 1G page into 512 first level
>>> + * 2M pages, or a first level 2M page into 512 zero level 4K pages.
>>> + */
>> Such a comment is at risk of (silently) going stale when support for 512G
>> mappings is added. I wonder if it's really that informative to have here.
>
> Good point, I think we could really drop it.
> Regarding support for 512G mappings. Is it really make sense to support
> such big mappings?
I think so, yes (in the longer run). And yes, ...
> It seems like some operations as splitting or sub-entry
> freeing could be pretty long under some circumstances.
... such will need sorting.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN
2025-08-11 13:25 ` Jan Beulich
@ 2025-08-12 11:42 ` Oleksii Kurochko
2025-08-22 8:39 ` Oleksii Kurochko
1 sibling, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-12 11:42 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 3:25 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> Introduce helper functions for safely querying the P2M (physical-to-machine)
>> mapping:
>> - add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
>> P2M lock state.
>> - Implement p2m_get_entry() to retrieve mapping details for a given GFN,
>> including MFN, page order, and validity.
>> - Add p2m_lookup() to encapsulate read-locked MFN retrieval.
>> - Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
>> pointer, acquiring a reference to the page if valid.
>> - Introduce get_page().
>>
>> Implementations are based on Arm's functions with some minor modifications:
>> - p2m_get_entry():
>> - Reverse traversal of page tables, as RISC-V uses the opposite level
>> numbering compared to Arm.
>> - Removed the return of p2m_access_t from p2m_get_entry() since
>> mem_access_settings is not introduced for RISC-V.
>> - Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
>> to Arm's THIRD_MASK.
>> - Replaced open-coded bit shifts with the BIT() macro.
>> - Other minor changes, such as using RISC-V-specific functions to validate
>> P2M PTEs, and replacing Arm-specific GUEST_* macros with their RISC-V
>> equivalents.
>>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
>> ---
>> Changes in V3:
>> - Add is_p2m_foreign() macro and connected stuff.
> What is this about?
Sorry for that, it is a stale change. I will drop it in the next patch version.
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -202,6 +202,24 @@ static inline int p2m_is_write_locked(struct p2m_domain *p2m)
>>
>> unsigned long construct_hgatp(struct p2m_domain *p2m, uint16_t vmid);
>>
>> +static inline void p2m_read_lock(struct p2m_domain *p2m)
>> +{
>> + read_lock(&p2m->lock);
>> +}
>> +
>> +static inline void p2m_read_unlock(struct p2m_domain *p2m)
>> +{
>> + read_unlock(&p2m->lock);
>> +}
>> +
>> +static inline int p2m_is_locked(struct p2m_domain *p2m)
> bool return type (also for p2m_is_write_locked() in patch 11)? Also perhaps
> pointer-to-const parameter?
I hadn't checked the argument type that rw_is_locked() takes, so I just used a
plain pointer parameter; now I see that it really can be const.
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -852,3 +852,139 @@ int map_regions_p2mt(struct domain *d,
>> {
>> return p2m_insert_mapping(p2m_get_hostp2m(d), gfn, nr, mfn, p2mt);
>> }
>> +
>> +/*
>> + * Get the details of a given gfn.
>> + *
>> + * If the entry is present, the associated MFN will be returned type filled up.
> This sentence doesn't really parse, perhaps due to missing words.
I don't know what happened, but it should be:
... the associated MFN will be returned and the type filled up ...
Perhaps it would be better to just say:
... the associated MFN will be returned, along with the p2m type of the mapping.
(or just the entry's type)
>> + * The page_order will correspond to the order of the mapping in the page
>> + * table (i.e it could be a superpage).
>> + *
>> + * If the entry is not present, INVALID_MFN will be returned and the
>> + * page_order will be set according to the order of the invalid range.
>> + *
>> + * valid will contain the value of bit[0] (e.g valid bit) of the
>> + * entry.
>> + */
>> +static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> + p2m_type_t *t,
>> + unsigned int *page_order,
>> + bool *valid)
>> +{
>> + unsigned int level = 0;
>> + pte_t entry, *table;
>> + int rc;
>> + mfn_t mfn = INVALID_MFN;
>> + DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
>> +
>> + ASSERT(p2m_is_locked(p2m));
>> + BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
> What function-wide property is this check about? Even when moved ...
I think this check isn't needed anymore.
It was needed to make sure that 4k pages are used for the L3 (in Arm terms)
mapping, as Arm can support 4k, 16k and 64k pages.
The check originally derives from:
https://lore.kernel.org/xen-devel/1402394278-9850-4-git-send-email-ian.campbell@citrix.com/
There it was needed because of the way maddr is calculated: the calculation
could be wrong if the page size isn't 4k.
Later the check was moved to p2m_get_entry():
https://lore.kernel.org/xen-devel/1469717505-8026-13-git-send-email-julien.grall@arm.com/
But the way maddr is obtained doesn't depend on the mask and PAGE_MASK, and I
don't see any other reason why BUILD_BUG_ON() would be needed now.
>
>> + if ( valid )
>> + *valid = false;
>> +
>> + /* XXX: Check if the mapping is lower than the mapped gfn */
> (Nested: What is this about?)
>
>> + /* This gfn is higher than the highest the p2m map currently holds */
>> + if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
>> + {
>> + for ( level = P2M_ROOT_LEVEL; level; level-- )
>> + if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
> ... into the more narrow scope where another XEN_PT_LEVEL_MASK() exists I
> can't really spot what the check is to guard against.
>
>> + gfn_x(p2m->max_mapped_gfn) )
>> + break;
>> +
>> + goto out;
>> + }
>> +
>> + table = p2m_get_root_pointer(p2m, gfn);
>> +
>> + /*
>> + * the table should always be non-NULL because the gfn is below
>> + * p2m->max_mapped_gfn and the root table pages are always present.
>> + */
> Nit: Style.
>
>> + if ( !table )
>> + {
>> + ASSERT_UNREACHABLE();
>> + level = P2M_ROOT_LEVEL;
>> + goto out;
>> + }
>> +
>> + for ( level = P2M_ROOT_LEVEL; level; level-- )
>> + {
>> + rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
> Why would you blindly allocate a page table (hierarchy) here? If anything,
> this may need doing upon caller request (as it's only up the call chain
> where the necessary knowledge exists).
I wanted to always set it to false since, going by the name p2m_get_entry(),
the page tables are expected to be allocated already.
> For example, ...
>
>> +static mfn_t p2m_lookup(struct p2m_domain *p2m, gfn_t gfn, p2m_type_t *t)
>> +{
>> + mfn_t mfn;
>> +
>> + p2m_read_lock(p2m);
>> + mfn = p2m_get_entry(p2m, gfn, t, NULL, NULL);
> ... this (by its name) pretty likely won't want allocation, while ...
>
>> + p2m_read_unlock(p2m);
>> +
>> + return mfn;
>> +}
>> +
>> +struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
>> + p2m_type_t *t)
>> +{
> ... this will. Yet then ...
I don't really get why p2m_get_page_from_gfn() would be expected to allocate
a page table. My understanding is that a GFN will point to a page only if
a mapping was established for that GFN beforehand.
>
>> + struct page_info *page;
>> + p2m_type_t p2mt = p2m_invalid;
>> + mfn_t mfn = p2m_lookup(p2m, gfn, t);
> ... you use the earlier one here.
We don't need page_order and/or the valid bit in p2m_get_page_from_gfn().
>
>> + if ( !mfn_valid(mfn) )
>> + return NULL;
>> +
>> + if ( t )
>> + p2mt = *t;
>> +
>> + page = mfn_to_page(mfn);
>> +
>> + /*
>> + * get_page won't work on foreign mapping because the page doesn't
>> + * belong to the current domain.
>> + */
>> + if ( p2m_is_foreign(p2mt) )
>> + {
>> + struct domain *fdom = page_get_owner_and_reference(page);
>> + ASSERT(fdom != NULL);
>> + ASSERT(fdom != p2m->domain);
>> + return page;
> In a release build (with no assertions) this will be wrong if either of the
> two condition would not be satisfied. See x86'es respective code.
I will then add the following instead:
    if ( unlikely(p2m_is_foreign(p2mt)) )
    {
        const struct domain *fdom = page_get_owner_and_reference(page);

        if ( fdom )
        {
            if ( likely(fdom != p2m->domain) )
                return page;

            ASSERT_UNREACHABLE();
            put_page(page);
        }

        return NULL;
    }
I'm not sure whether unlikely() is needed, but x86 has it.
It seems Arm needs such a change too.
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type
2025-08-11 15:44 ` Jan Beulich
@ 2025-08-12 14:52 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-12 14:52 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 5:44 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> RISC-V's PTE has only two available bits that can be used to store the P2M
>> type. This is insufficient to represent all the current RISC-V P2M types.
>> Therefore, some P2M types must be stored outside the PTE bits.
>>
>> To address this, a metadata table is introduced to store P2M types that
>> cannot fit in the PTE itself. Not all P2M types are stored in the
>> metadata table—only those that require it.
>>
>> The metadata table is linked to the intermediate page table via the
>> `struct page_info`'s list field of the corresponding intermediate page.
>>
>> To simplify the allocation and linking of intermediate and metadata page
>> tables, `p2m_{alloc,free}_table()` functions are implemented.
>>
>> These changes impact `p2m_split_superpage()`, since when a superpage is
>> split, it is necessary to update the metadata table of the new
>> intermediate page table — if the entry being split has its P2M type set
>> to `p2m_ext_storage` in its `P2M_TYPES` bits.
> Oh, this was an aspect I didn't realize when commenting on the name of
> the enumerator. I think you want to keep the name for the purpose here,
> but you better wouldn't apply relational operators to it (and hence
> have a second value to serve that purpose).
It could be done that way, but I think it would be better to just have one
value with a better name, as I suggested in my reply to the other patch.
>
>> In addition to updating
>> the metadata of the new intermediate page table, the corresponding entry
>> in the metadata for the original superpage is invalidated.
>>
>> Also, update p2m_{get,set}_type to work with P2M types which don't fit
>> into PTE bits.
>>
>> Signed-off-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
> No Suggested-by: or anything?
Sorry for that, Suggested-by should be added here, I'll fix that in the
next patch series version.
>
>> --- a/xen/arch/riscv/include/asm/mm.h
>> +++ b/xen/arch/riscv/include/asm/mm.h
>> @@ -150,6 +150,15 @@ struct page_info
>> /* Order-size of the free chunk this page is the head of. */
>> unsigned int order;
>> } free;
>> +
>> + /* Page is used to store metadata: p2m type. */
> That's not correct. The page thus described is what the pointer below
> points to. Here it's more like "Page is used as an intermediate P2M
> page table".
>
>> + struct {
>> + /*
>> + * Pointer to a page which store metadata for an intermediate page
>> + * table.
>> + */
>> + struct page_info *metadata;
>> + } md;
> In the description you say you would re-use the list field.
That was the case in the first version of storing the P2M type outside the PTE
bits, so this is a stale part of the commit message. I'll correct it.
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -101,7 +101,16 @@ static int p2m_alloc_root_table(struct p2m_domain *p2m)
>> {
>> struct domain *d = p2m->domain;
>> struct page_info *page;
>> - const unsigned int nr_root_pages = P2M_ROOT_PAGES;
>> + /*
>> + * If the root page table starts at Level <= 2, and since only 1GB, 2MB,
>> + * and 4KB mappings are supported (as enforced by the ASSERT() in
>> + * p2m_set_entry()), it is necessary to allocate P2M_ROOT_PAGES for
>> + * the root page table itself, plus an additional P2M_ROOT_PAGES for
>> + * metadata storage. This is because only two free bits are available in
>> + * the PTE, which are not sufficient to represent all possible P2M types.
>> + */
>> + const unsigned int nr_root_pages = P2M_ROOT_PAGES *
>> + ((P2M_ROOT_LEVEL <= 2) ? 2 : 1);
>>
>> /*
>> * Return back nr_root_pages to assure the root table memory is also
>> @@ -114,6 +123,23 @@ static int p2m_alloc_root_table(struct p2m_domain *p2m)
>> if ( !page )
>> return -ENOMEM;
>>
>> + if ( P2M_ROOT_LEVEL <= 2 )
>> + {
>> + /*
>> + * In the case where P2M_ROOT_LEVEL <= 2, it is necessary to allocate
>> + * a page of the same size as that used for the root page table.
>> + * Therefore, p2m_allocate_root() can be safely reused.
>> + */
>> + struct page_info *metadata = p2m_allocate_root(d);
>> + if ( !metadata )
>> + {
>> + free_domheap_pages(page, P2M_ROOT_ORDER);
>> + return -ENOMEM;
>> + }
>> +
>> + page->v.md.metadata = metadata;
> Don't you need to install such a link for every one of the 4 pages?
Yes, I need to do that. Thanks.
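A minimal sketch of what installing the link for every root page could look
like, assuming each root page is simply paired with the metadata page at the
same offset. The struct and ROOT_PAGES below are simplified stand-ins for
illustration only, not the real struct page_info or P2M_ROOT_PAGES:

```c
#include <assert.h>
#include <stddef.h>

#define ROOT_PAGES 4 /* stand-in for P2M_ROOT_PAGES (assumption) */

/* Simplified stand-in for struct page_info; only the link is modelled. */
struct root_page_stub {
    struct root_page_stub *metadata;
};

/*
 * Link every root page to its own metadata page, rather than only the
 * first one. Both arguments point to ROOT_PAGES contiguous entries.
 */
static void link_root_metadata(struct root_page_stub *root,
                               struct root_page_stub *metadata)
{
    unsigned int i;

    assert(root && metadata);

    for ( i = 0; i < ROOT_PAGES; i++ )
        root[i].metadata = &metadata[i];
}
```

Whether the pairing should be per page, as here, or whether all root pages
should point at the head of the metadata allocation is a design choice the
thread leaves open.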
>
>> @@ -198,24 +224,25 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> return __map_domain_page(p2m->root + root_table_indx);
>> }
>>
>> -static int p2m_set_type(pte_t *pte, p2m_type_t t)
>> +static void p2m_set_type(pte_t *pte, const p2m_type_t t, const unsigned int i)
>> {
>> - int rc = 0;
>> -
>> if ( t > p2m_ext_storage )
>> - panic("unimplemeted\n");
>> + {
>> + ASSERT(pte);
>> +
>> + pte[i].pte = t;
> What does i identify here?
An index into the metadata page, identifying where the P2M type for the
corresponding PTE is stored.
I will rename it to metadata_indx for more clarity.
>
>> + }
>> else
>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>> -
>> - return rc;
>> }
>>
>> -static p2m_type_t p2m_get_type(const pte_t pte)
>> +static p2m_type_t p2m_get_type(const pte_t pte, const pte_t *metadata,
>> + const unsigned int i)
>> {
>> p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
>>
>> if ( type == p2m_ext_storage )
>> - panic("unimplemented\n");
>> + type = metadata[i].pte;
>>
>> return type;
>> }
> Overall this feels pretty fragile, as the caller has to pass several values
> which all need to be in sync with one another. If you ...
Generally, I agree it is rather fragile.
>
>> @@ -265,7 +292,10 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
>> }
>> }
>>
>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
>> + struct page_info *metadata_pg,
>> + const unsigned int indx,
>> + bool is_table)
>> {
>> pte_t e = (pte_t) { PTE_VALID };
>>
>> @@ -285,12 +315,21 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>>
>> if ( !is_table )
>> {
>> + pte_t *metadata = __map_domain_page(metadata_pg);
> ... map the page anyway, no matter whether ...
>
>> p2m_set_permission(&e, t);
>>
>> + metadata[indx].pte = p2m_invalid;
>> +
>> if ( t < p2m_ext_storage )
>> - p2m_set_type(&e, t);
>> + p2m_set_type(&e, t, indx);
>> else
>> - panic("unimplemeted\n");
>> + {
>> + e.pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>> + p2m_set_type(metadata, t, indx);
>> + }
> ... you'll actually use it, maybe best to map both pages at the same point?
Only one page is mapped here (?), and I suppose it should be mapped here: a
previously set type may be overwritten, in which case the type written to the
metadata needs invalidating.
> And as said elsewhere, no, I don't think you want to use p2m_set_type() for
> two entirely different purposes.
I wasn't very happy with it either, but at the same time I didn't want a
prototype where it isn't really clear when metadata needs to be passed and when
it doesn't. Considering your comments, though, this solution isn't good either.
So maybe it would be better to just have two separate functions:
p2m_set_pte_type() and p2m_set_metadata_type().
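A rough sketch of how such a split could look, under stated assumptions: the
enum is a reduced list (the real one has more members), pte_t is reduced to a
single 64-bit field, the mask macros are local copies so the example builds on
its own, and set_type_dispatch() is a hypothetical caller-side helper, not
something from the series:

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the mask helpers, so the sketch is self-contained. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

#define P2M_TYPE_PTE_BITS_MASK UINT64_C(0x300) /* PTE bits 8 and 9 */

typedef struct { uint64_t pte; } pte_t;

/* Reduced type list for illustration only. */
typedef enum {
    p2m_invalid,
    p2m_ram_rw,
    p2m_mmio_direct_io,
    p2m_ext_storage,    /* marker: the real type lives in the metadata */
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* Store a value that fits directly into the two PTE software bits. */
static void p2m_set_pte_type(pte_t *pte, p2m_type_t t)
{
    assert(t <= p2m_ext_storage);
    pte->pte &= ~P2M_TYPE_PTE_BITS_MASK;
    pte->pte |= MASK_INSR((uint64_t)t, P2M_TYPE_PTE_BITS_MASK);
}

/* Store a type in the metadata slot matching the PTE's index. */
static void p2m_set_metadata_type(pte_t *metadata, unsigned int indx,
                                  p2m_type_t t)
{
    assert(metadata && t > p2m_ext_storage);
    metadata[indx].pte = t;
}

/* Hypothetical dispatcher: picks the storage location for a type. */
static void set_type_dispatch(pte_t *pte, pte_t *metadata,
                              unsigned int indx, p2m_type_t t)
{
    if ( t < p2m_ext_storage )
        p2m_set_pte_type(pte, t);
    else
    {
        p2m_set_pte_type(pte, p2m_ext_storage);
        p2m_set_metadata_type(metadata, indx, t);
    }
}

static p2m_type_t p2m_get_type(pte_t pte, const pte_t *metadata,
                               unsigned int indx)
{
    p2m_type_t t = (p2m_type_t)MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);

    if ( t == p2m_ext_storage )
        t = (p2m_type_t)metadata[indx].pte;

    return t;
}
```

With this shape each helper has exactly one job, and only the dispatcher needs
to know about both the PTE and the metadata slot.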
>
>> @@ -323,22 +364,71 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
>> return pg;
>> }
>>
>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>> +
>> +/*
>> + * Allocate a page table with an additional extra page to store
>> + * metadata for each entry of the page table.
>> + * Link this metadata page to page table page's list field.
>> + */
>> +static struct page_info * p2m_alloc_table(struct p2m_domain *p2m)
> Nit: Stray blank after * again.
>
>> +{
>> + enum table_type
>> + {
>> + INTERMEDIATE_TABLE=0,
> If you really think you need the "= 0", then please with blanks around '='.
>
>> + /*
>> + * At the moment, metadata is going to store P2M type
>> + * for each PTE of page table.
>> + */
>> + METADATA_TABLE,
>> + TABLE_MAX
>> + };
>> +
>> + struct page_info *tables[TABLE_MAX];
>> +
>> + for ( unsigned int i = 0; i < TABLE_MAX; i++ )
>> + {
>> + tables[i] = p2m_alloc_page(p2m);
>> +
>> + if ( !tables[i] )
>> + goto out;
>> +
>> + clear_and_clean_page(tables[i]);
>> + }
>> +
>> + tables[INTERMEDIATE_TABLE]->v.md.metadata = tables[METADATA_TABLE];
>> +
>> + return tables[INTERMEDIATE_TABLE];
>> +
>> + out:
>> + for ( unsigned int i = 0; i < TABLE_MAX; i++ )
>> + if ( tables[i] )
> You didn't clear all of tables[] first, though.
Oh, right, I missed an initializer for the tables[] array.
> This kind of cleanup is
> often better done as
>
> while ( i-- > 0 )
> ...
>
> You don't even need an if() then, as you know allocations succeeded for all
> earlier array slots.
Yes, it looks very nice.
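For reference, a self-contained sketch of that unwinding pattern. The
allocator and free functions are malloc-based stand-ins with a failure
injection knob (fail_at), all invented for illustration; they are not
p2m_alloc_page() or p2m_free_page():

```c
#include <assert.h>
#include <stdlib.h>

#define NR_TABLES 2

/* Stand-ins: fail_at selects which allocation (0-based) fails. */
static int fail_at = -1;
static int nr_allocs, live_pages;

static void *alloc_page_stub(void)
{
    if ( nr_allocs++ == fail_at )
        return NULL;

    live_pages++;
    return calloc(1, 4096);
}

static void free_page_stub(void *pg)
{
    assert(pg); /* the unwind loop never hands us an unallocated slot */
    live_pages--;
    free(pg);
}

static int alloc_tables(void *tables[NR_TABLES])
{
    unsigned int i;

    for ( i = 0; i < NR_TABLES; i++ )
    {
        tables[i] = alloc_page_stub();
        if ( !tables[i] )
            goto out;
    }

    return 0;

 out:
    /* i indexes the failed slot; every slot below it is known valid. */
    while ( i-- > 0 )
        free_page_stub(tables[i]);

    return -1;
}
```

Since the loop counter is shared between the allocation and the unwind, no
initializer for the array and no NULL check per slot is needed.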
>
>> + p2m_free_page(p2m, tables[i]);
>> +
>> + return NULL;
>> +}
> I'm also surprised you allocate the metadata table no matter whether you'll
> actually need it. That'll double your average paging pool usage, when in a
> typical case only very few entries would actually require this extra
> storage.
Good point, we could indeed do a delayed allocation instead and allocate only
when the requested P2M type is > p2m_ext_storage.
I'll implement that.
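A possible shape for the delayed allocation, as a sketch only. The table
struct, the byte-per-entry metadata layout, the md_* type names and the
calloc() backing are all assumptions made to keep the example standalone; the
real code would take the page from the paging pool:

```c
#include <assert.h>
#include <stdlib.h>

#define PT_ENTRIES 512

/* Reduced, hypothetical type list; md_ext_storage is the marker value. */
typedef enum {
    md_invalid,
    md_ram_rw,
    md_mmio,
    md_ext_storage,
    md_grant_rw,
    md_foreign_rw,
} md_type_t;

struct table_stub {
    unsigned char *metadata; /* NULL until an entry first needs it */
};

static unsigned int metadata_allocs;

/* Returns 0 on success, -1 if the metadata could not be allocated. */
static int set_ext_type(struct table_stub *tbl, unsigned int indx,
                        md_type_t t)
{
    assert(indx < PT_ENTRIES);

    if ( t <= md_ext_storage )
    {
        /* Fits into the PTE bits: only drop a stale metadata entry. */
        if ( tbl->metadata )
            tbl->metadata[indx] = md_invalid;
        return 0;
    }

    if ( !tbl->metadata )
    {
        tbl->metadata = calloc(PT_ENTRIES, sizeof(*tbl->metadata));
        if ( !tbl->metadata )
            return -1;
        metadata_allocs++;
    }

    tbl->metadata[indx] = (unsigned char)t;
    return 0;
}

static md_type_t get_ext_type(const struct table_stub *tbl,
                              unsigned int indx)
{
    return tbl->metadata ? (md_type_t)tbl->metadata[indx] : md_invalid;
}
```

One consequence of this approach is that setting a type can now fail with an
allocation error, so callers that previously could not fail would need an
error path.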
>
>> @@ -453,10 +543,9 @@ static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>> }
>>
>> /* Put any references on the page referenced by pte. */
>> -static void p2m_put_page(const pte_t pte, unsigned int level)
>> +static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
>> {
>> mfn_t mfn = pte_get_mfn(pte);
>> - p2m_type_t p2m_type = p2m_get_type(pte);
>>
>> ASSERT(pte_is_valid(pte));
>>
>> @@ -470,10 +559,10 @@ static void p2m_put_page(const pte_t pte, unsigned int level)
>> switch ( level )
>> {
>> case 1:
>> - return p2m_put_2m_superpage(mfn, p2m_type);
>> + return p2m_put_2m_superpage(mfn, p2mt);
>>
>> case 0:
>> - return p2m_put_4k_page(mfn, p2m_type);
>> + return p2m_put_4k_page(mfn, p2mt);
>> }
>> }
> Might it be better to introduce this function in this shape right away, in
> the earlier patch?
Agreed. I probably did that intentionally, but I don't remember why. I will try
to avoid these changes in this patch, as they look unnecessary here.
>
>> @@ -690,18 +791,23 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>> {
>> /* We need to split the original page. */
>> pte_t split_pte = *entry;
>> + struct page_info *metadata = virt_to_page(table)->v.md.metadata;
> This (or along these lines) is how I would have expected things to be done
> elsewhere as well, limiting the amount of arguments you need to pass
> around.
I will try to re-use this approach wherever I can.
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-08-06 15:55 ` Jan Beulich
@ 2025-08-14 15:09 ` Oleksii Kurochko
2025-08-14 15:17 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-14 15:09 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/6/25 5:55 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -79,10 +79,20 @@ typedef enum {
>> p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
>> p2m_grant_map_rw, /* Read/write grant mapping */
>> p2m_grant_map_ro, /* Read-only grant mapping */
>> + p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
>> + p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
>> } p2m_type_t;
>>
>> #define p2m_mmio_direct p2m_mmio_direct_io
>>
>> +/*
>> + * Bits 8 and 9 are reserved for use by supervisor software;
>> + * the implementation shall ignore this field.
>> + * We are going to use to save in these bits frequently used types to avoid
>> + * get/set of a type from radix tree.
>> + */
>> +#define P2M_TYPE_PTE_BITS_MASK 0x300
>> +
>> /* We use bitmaps and mask to handle groups of types */
>> #define p2m_to_mask(t_) BIT(t_, UL)
>>
>> @@ -93,10 +103,16 @@ typedef enum {
>> #define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
>> p2m_to_mask(p2m_grant_map_ro))
>>
>> + /* Foreign mappings types */
> Nit: Why so far to the right?
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -197,6 +197,16 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> return __map_domain_page(p2m->root + root_table_indx);
>> }
>>
>> +static p2m_type_t p2m_get_type(const pte_t pte)
>> +{
>> + p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
>> +
>> + if ( type == p2m_ext_storage )
>> + panic("unimplemented\n");
> That is, as per p2m.h additions you pretend to add support for foreign types
> here, but then you don't?
I count the foreign types as p2m_ext_storage types, so support for them will be
added in patch [1] of this series: a type covered by p2m_ext_storage will be
stored in metadata due to the lack of free bits in the PTE.
[1] https://lore.kernel.org/xen-devel/cover.1753973161.git.oleksii.kurochko@gmail.com/T/#mcc1a0367fdbfbf3ca073f152efa799c1a4354974
>> @@ -248,11 +258,136 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>> return P2M_TABLE_MAP_NONE;
>> }
>>
>> +static void p2m_put_foreign_page(struct page_info *pg)
>> +{
>> + /*
>> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
>> + * will be invoked if the page is reallocated before the end of
>> + * this loop, which will trigger a flush of the guest TLBs.
>> + */
>> + put_page(pg);
>> +}
> How can one know the comment is true? arch_flush_tlb_mask() still lives in
> stubs.c, and hence what it is eventually going to do (something like Arm's
> vs more like x86'es) is entirely unknown right now.
I'll introduce arch_flush_tlb_mask() in this patch in the next version.
>> +/* Put any references on the single 4K page referenced by mfn. */
>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>> +{
>> + /* TODO: Handle other p2m types */
>> +
>> + if ( p2m_is_foreign(type) )
>> + {
>> + ASSERT(mfn_valid(mfn));
>> + p2m_put_foreign_page(mfn_to_page(mfn));
>> + }
>> +
>> + /*
>> + * Detect the xenheap page and mark the stored GFN as invalid.
>> + * We don't free the underlying page until the guest requested to do so.
>> + * So we only need to tell the page is not mapped anymore in the P2M by
>> + * marking the stored GFN as invalid.
>> + */
>> + if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
>> + page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
> Isn't this for grants? p2m_is_ram() doesn't cover p2m_grant_map_*.
p2m_is_ram() really does look unnecessary here. I'm wondering whether it could
be useful to store GFNs for RAM types too, to have something like an M2P.
>> +}
>> +
>> +/* Put any references on the superpage referenced by mfn. */
>> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>> +{
>> + struct page_info *pg;
>> + unsigned int i;
>> +
>> + ASSERT(mfn_valid(mfn));
>> +
>> + pg = mfn_to_page(mfn);
>> +
>> + for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
>> + p2m_put_foreign_page(pg);
>> +}
> In p2m_put_4k_page() you check the type, whereas here you don't.
I missed adding this:
    if ( !p2m_is_foreign(type) )
        return;
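In context, that early return would sit at the top of the superpage helper. A
standalone sketch, with a simplified page struct, a reduced type enum and a
refcount-only put_page_stub(), all of which are stand-ins invented for
illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define SP_ENTRIES 512 /* stand-in for XEN_PT_ENTRIES */

typedef enum { sp_ram_rw, sp_foreign_rw, sp_foreign_ro } sp_type_t;

struct sp_page_stub {
    unsigned int refcount;
};

static bool sp_is_foreign(sp_type_t t)
{
    return t == sp_foreign_rw || t == sp_foreign_ro;
}

static void put_page_stub(struct sp_page_stub *pg)
{
    assert(pg->refcount > 0);
    pg->refcount--;
}

/* Drop one reference per 4k page, but only for foreign mappings. */
static void put_2m_superpage(struct sp_page_stub *pg, sp_type_t type)
{
    unsigned int i;

    if ( !sp_is_foreign(type) )
        return;

    for ( i = 0; i < SP_ENTRIES; i++, pg++ )
        put_page_stub(pg);
}
```

Without the check, a non-foreign 2M mapping would have references dropped that
were never taken, which is the mismatch with p2m_put_4k_page() pointed out
above.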
>> +/* Put any references on the page referenced by pte. */
>> +static void p2m_put_page(const pte_t pte, unsigned int level)
>> +{
>> + mfn_t mfn = pte_get_mfn(pte);
>> + p2m_type_t p2m_type = p2m_get_type(pte);
>> +
>> + ASSERT(pte_is_valid(pte));
>> +
>> + /*
>> + * TODO: Currently we don't handle level 2 super-page, Xen is not
>> + * preemptible and therefore some work is needed to handle such
>> + * superpages, for which at some point Xen might end up freeing memory
>> + * and therefore for such a big mapping it could end up in a very long
>> + * operation.
>> + */
>> + switch ( level )
>> + {
>> + case 1:
>> + return p2m_put_2m_superpage(mfn, p2m_type);
>> +
>> + case 0:
>> + return p2m_put_4k_page(mfn, p2m_type);
>> + }
> Yet despite the comment not even an assertion for level 2 and up?
I'm not sure an ASSERT() is needed here, as the reference(s) for such page(s)
will be put during domain_relinquish_resources(), where preemption is possible.
Something like what Arm does here:
https://gitlab.com/xen-project/people/olkur/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c?ref_type=heads#L1587
I'm thinking that it probably makes sense to put only 4k page(s) and
postpone all other cases until domain_relinquish_resources() is called.
>> /* Free pte sub-tree behind an entry */
>> static void p2m_free_subtree(struct p2m_domain *p2m,
>> pte_t entry, unsigned int level)
>> {
>> - panic("%s: hasn't been implemented yet\n", __func__);
>> + unsigned int i;
>> + pte_t *table;
>> + mfn_t mfn;
>> + struct page_info *pg;
>> +
>> + /* Nothing to do if the entry is invalid. */
>> + if ( !pte_is_valid(entry) )
>> + return;
>> +
>> + if ( pte_is_superpage(entry, level) || (level == 0) )
> Perhaps swap the two conditions around?
>
>> + {
>> +#ifdef CONFIG_IOREQ_SERVER
>> + /*
>> + * If this gets called then either the entry was replaced by an entry
>> + * with a different base (valid case) or the shattering of a superpage
>> + * has failed (error case).
>> + * So, at worst, the spurious mapcache invalidation might be sent.
>> + */
>> + if ( p2m_is_ram(p2m_get_type(p2m, entry)) &&
>> + domain_has_ioreq_server(p2m->domain) )
>> + ioreq_request_mapcache_invalidate(p2m->domain);
>> +#endif
>> +
>> + p2m_put_page(entry, level);
>> +
>> + return;
>> + }
>> +
>> + table = map_domain_page(pte_get_mfn(entry));
>> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>> + p2m_free_subtree(p2m, table[i], level - 1);
> In p2m_put_page() you comment towards concerns for level >= 2; no similar
> concerns for the resulting recursion here?
This function is generic enough to handle any level.
One case where that matters: if a 1G mapping ever needs to be split into
something smaller, p2m_free_subtree() could be called to free the subtree
of that 1G mapping.
>> + unmap_domain_page(table);
>> +
>> + /*
>> + * Make sure all the references in the TLB have been removed before
>> + * freing the intermediate page table.
>> + * XXX: Should we defer the free of the page table to avoid the
>> + * flush?
>> + */
>> + p2m_tlb_flush_sync(p2m);
>> +
>> + mfn = pte_get_mfn(entry);
>> + ASSERT(mfn_valid(mfn));
>> +
>> + pg = mfn_to_page(mfn);
>> +
>> + page_list_del(pg, &p2m->pages);
>> + p2m_free_page(p2m, pg);
> Once again I wonder whether this code path was actually tested: p2m_free_page()
> also invokes page_list_del(), and double deletions typically won't end very
> well.
Agreed, it should be dropped here and left only in p2m_free_page().
This path should have been exercised, as I have a test case where I change the MFN, which should trigger this call:
+ /*
+ * Free the entry only if the original pte was valid and the base
+ * is different (to avoid freeing when permission is changed).
+ */
+ if ( pte_is_valid(orig_pte) &&
+ !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
+ p2m_free_subtree(p2m, orig_pte, level);
I will double check.
But I think I was lucky, because I've only tested the whole patch series, and in one of the
later patches page_list_del(pg, &p2m->pages) is dropped.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-08-14 15:09 ` Oleksii Kurochko
@ 2025-08-14 15:17 ` Jan Beulich
2025-08-18 8:22 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-14 15:17 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 14.08.2025 17:09, Oleksii Kurochko wrote:
> On 8/6/25 5:55 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> +/* Put any references on the page referenced by pte. */
>>> +static void p2m_put_page(const pte_t pte, unsigned int level)
>>> +{
>>> + mfn_t mfn = pte_get_mfn(pte);
>>> + p2m_type_t p2m_type = p2m_get_type(pte);
>>> +
>>> + ASSERT(pte_is_valid(pte));
>>> +
>>> + /*
>>> + * TODO: Currently we don't handle level 2 super-page, Xen is not
>>> + * preemptible and therefore some work is needed to handle such
>>> + * superpages, for which at some point Xen might end up freeing memory
>>> + * and therefore for such a big mapping it could end up in a very long
>>> + * operation.
>>> + */
>>> + switch ( level )
>>> + {
>>> + case 1:
>>> + return p2m_put_2m_superpage(mfn, p2m_type);
>>> +
>>> + case 0:
>>> + return p2m_put_4k_page(mfn, p2m_type);
>>> + }
>> Yet despite the comment not even an assertion for level 2 and up?
>
> Not sure that an ASSERT() is needed here as a reference(s) for such page(s)
> will be put during domain_relinquish_resources() as there we could do preemption.
> Something like Arm does here:
> https://gitlab.com/xen-project/people/olkur/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c?ref_type=heads#L1587
>
> I'm thinking that probably it makes sense to put only 4k page(s) and
> all other cases postpone until domain_relinquish_resources() is called.
How can you defer to domain cleanup? How would handling of foreign mappings
(or e.g. ballooning? not sure) work when you don't drop references as
necessary?
>>> /* Free pte sub-tree behind an entry */
>>> static void p2m_free_subtree(struct p2m_domain *p2m,
>>> pte_t entry, unsigned int level)
>>> {
>>> - panic("%s: hasn't been implemented yet\n", __func__);
>>> + unsigned int i;
>>> + pte_t *table;
>>> + mfn_t mfn;
>>> + struct page_info *pg;
>>> +
>>> + /* Nothing to do if the entry is invalid. */
>>> + if ( !pte_is_valid(entry) )
>>> + return;
>>> +
>>> + if ( pte_is_superpage(entry, level) || (level == 0) )
>> Perhaps swap the two conditions around?
>>
>>> + {
>>> +#ifdef CONFIG_IOREQ_SERVER
>>> + /*
>>> + * If this gets called then either the entry was replaced by an entry
>>> + * with a different base (valid case) or the shattering of a superpage
>>> + * has failed (error case).
>>> + * So, at worst, the spurious mapcache invalidation might be sent.
>>> + */
>>> + if ( p2m_is_ram(p2m_get_type(p2m, entry)) &&
>>> + domain_has_ioreq_server(p2m->domain) )
>>> + ioreq_request_mapcache_invalidate(p2m->domain);
>>> +#endif
>>> +
>>> + p2m_put_page(entry, level);
>>> +
>>> + return;
>>> + }
>>> +
>>> + table = map_domain_page(pte_get_mfn(entry));
>>> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>>> + p2m_free_subtree(p2m, table[i], level - 1);
>> In p2m_put_page() you comment towards concerns for level >= 2; no similar
>> concerns for the resulting recursion here?
>
> This function is generic enough to handle any level.
>
> Except that it is possible that it will be needed, for example, to split 1G mapping
> into something smaller then p2m_free_subtree() could be called for freeing a subtree
> of 1gb mapping.
The question wasn't about it being generic enough, but it possibly taking
too much time for level >= 2.
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()
2025-08-05 16:04 ` Jan Beulich
@ 2025-08-15 9:52 ` Oleksii Kurochko
2025-08-15 12:50 ` Jan Beulich
0 siblings, 1 reply; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-15 9:52 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/5/25 6:04 PM, Jan Beulich wrote:
> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
> Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.
>
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -7,11 +7,13 @@
>> #include <xen/rwlock.h>
>> #include <xen/types.h>
>>
>> +#include <asm/page.h>
>> #include <asm/page-bits.h>
>>
>> extern unsigned int p2m_root_order;
>> #define P2M_ROOT_ORDER p2m_root_order
>> #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
>> +#define P2M_ROOT_LEVEL HYP_PT_ROOT_LEVEL
> I think I commented on this before, and I would have hoped for at least a remark
> in the description to appear (perhaps even a comment here): It's okay(ish) to tie
> these together for now, but in the longer run I don't expect this is going to be
> wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force
> all P2Ms to use 5 levels of page tables.
Do you mean that any SvXX mode could be chosen for the G-stage, to limit the number
of page table levels the G-stage needs? If yes, then I agree that, at least, a
comment should be added or, probably, "#warning optimize an amount of p2m root level
for MMU mode > Sv48" (or maybe >=).
Or do you mean that with hgatp.mode=Sv57 set it is possible to limit the number of
page table levels in use? In that case I think the hardware will still expect to see
5 levels of page tables.
>> @@ -175,13 +179,257 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>> return 0;
>> }
>>
>> +/*
>> + * Find and map the root page table. The caller is responsible for
>> + * unmapping the table.
>> + *
>> + * The function will return NULL if the offset into the root table is
>> + * invalid.
>> + */
>> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> +{
>> + unsigned long root_table_indx;
>> +
>> + root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
> Right now page table layouts / arrangements are indeed similar enough to
> share accessor constructs. Nevertheless I find it problematic (doc-wise
> at the very least) that a Xen page table construct is used to access a
> P2M page table. If and when these needed to be decoupled, it would likely
> help of the distinction was already made, by - for now - simply
> introducing aliases (here e.g. P2M_LEVEL_ORDER(), expanding to
> XEN_PT_LEVEL_ORDER() for the time being).
I think it's better to define this correctly now, as I initially missed
that only non-root page tables and all page table entries (PTEs) share
the same format as their corresponding Sv39, Sv48, or Sv57 modes
(i.e., corresponding to SvXXx4).
However, in this case, we're dealing with the root level, which is extended
by 2 extra bits. Therefore, using XEN_PT_LEVEL_ORDER() would be incorrect
if those two extra bits are actually used.
We're just fortunate that no one currently uses these extra 2 bits.
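To make the two extra bits concrete, here is a minimal sketch, assuming Sv39x4 as laid out in the RISC-V privileged specification: guest physical addresses are 41 bits wide, so the root-level index is 11 bits instead of 9 and the root table spans four 4 KiB pages. All macro and helper names below (SV39X4_*, p2m_root_index() and friends) are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT              12
#define PT_INDEX_BITS           9   /* 512 entries per 4 KiB table */
/* Bit position where the root-level index starts for a 3-level layout. */
#define SV39X4_ROOT_SHIFT       (PAGE_SHIFT + 2 * PT_INDEX_BITS)
/* The "x4" widening: two extra index bits at the root level only. */
#define SV39X4_ROOT_INDEX_BITS  (PT_INDEX_BITS + 2)
#define SV39X4_GPA_BITS         41

/* Full 11-bit index into the 16 KiB root table. */
static inline unsigned int p2m_root_index(uint64_t gpa)
{
    assert(gpa < (1ULL << SV39X4_GPA_BITS));

    return (gpa >> SV39X4_ROOT_SHIFT) &
           ((1u << SV39X4_ROOT_INDEX_BITS) - 1);
}

/* Which of the four concatenated root pages the index falls into. */
static inline unsigned int p2m_root_page(uint64_t gpa)
{
    return p2m_root_index(gpa) >> PT_INDEX_BITS;
}

/* Entry number within that root page. */
static inline unsigned int p2m_root_slot(uint64_t gpa)
{
    return p2m_root_index(gpa) & ((1u << PT_INDEX_BITS) - 1);
}
```

With a 9-bit mask at the root level, any address at or above 1 << 39 would silently alias into the first root page, which is the case the paragraph above is worried about.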
>> + if ( root_table_indx >= P2M_ROOT_PAGES )
>> + return NULL;
>> +
>> + return __map_domain_page(p2m->root + root_table_indx);
>> +}
>> +
>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>> +{
>> + write_pte(p, pte);
>> + if ( clean_pte )
>> + clean_dcache_va_range(p, sizeof(*p));
> Not necessarily for right away, but if multiple adjacent PTEs are
> written without releasing the lock, this then redundant cache flushing
> can be a performance issue.
Can't it be resolved on the caller side? Something like:
p2m_write_pte(p1, pte1, false);
p2m_write_pte(p2, pte2, false);
p2m_write_pte(p3, pte3, false);
p2m_write_pte(p4, pte4, true);
where p1-p4 are adjacent.
>> +}
>> +
>> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>> +{
>> + pte_t pte;
>> +
>> + memset(&pte, 0, sizeof(pte));
> Why memset()? Why not simply give the variable an appropriate initializer?
Good point. It would be much better just to use an appropriate initializer.
> Or use ...
>
>> + p2m_write_pte(p, pte, clean_pte);
> ... a compound literal here, like you do ...
>
>> +}
>> +
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>> +{
>> + panic("%s: hasn't been implemented yet\n", __func__);
>> +
>> + return (pte_t) { .pte = 0 };
> ... here? (Just {} would also do, if I'm not mistaken.)
Plain {} would be enough (GCC accepts it as an extension; ISO C only allows an
empty initializer from C23 on), but {.pte = 0} is more obvious and clearer, IMO.
>> +}
>> +
>> +#define P2M_TABLE_MAP_NONE 0
>> +#define P2M_TABLE_MAP_NOMEM 1
>> +#define P2M_TABLE_SUPER_PAGE 2
>> +#define P2M_TABLE_NORMAL 3
>> +
>> +/*
>> + * Take the currently mapped table, find the corresponding the entry
>> + * corresponding to the GFN, and map the next table, if available.
> Nit: Double "corresponding".
>
>> + * The previous table will be unmapped if the next level was mapped
>> + * (e.g P2M_TABLE_NORMAL returned).
>> + *
>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>> + * be allocated when not present.
>> + *
>> + * Return values:
>> + * P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>> + * P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>> + * P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
>> + * P2M_TABLE_NORMAL: The next entry points to a superpage.
>> + */
>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>> + unsigned int level, pte_t **table,
>> + unsigned int offset)
>> +{
>> + panic("%s: hasn't been implemented yet\n", __func__);
>> +
>> + return P2M_TABLE_MAP_NONE;
>> +}
>> +
>> +/* Free pte sub-tree behind an entry */
>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>> + pte_t entry, unsigned int level)
>> +{
>> + panic("%s: hasn't been implemented yet\n", __func__);
>> +}
>> +
>> +/*
>> + * Insert an entry in the p2m. This should be called with a mapping
>> + * equal to a page/superpage.
>> + */
>> +static int p2m_set_entry(struct p2m_domain *p2m,
>> + gfn_t gfn,
>> + unsigned long page_order,
>> + mfn_t mfn,
>> + p2m_type_t t)
>> +{
>> + unsigned int level;
>> + unsigned int target = page_order / PAGETABLE_ORDER;
>> + pte_t *entry, *table, orig_pte;
>> + int rc;
>> + /* A mapping is removed if the MFN is invalid. */
>> + bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
> Comment and code don't fit together. Many MFNs are invalid (any for which
> mfn_valid() returns false), yet you only check for INVALID_MFN here.
Probably, it makes sense to add an ASSERT() here for the case when
mfn_valid(mfn) is false, but the MFN is not explicitly equal to INVALID_MFN.
This would indicate that someone attempted to perform a mapping with an
incorrect MFN, which, IMO, is entirely wrong.
In the case where the MFN is explicitly set to INVALID_MFN, I believe it's
enough to indicate that a mapping is being removed.
>> + DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
>> +
>> + ASSERT(p2m_is_write_locked(p2m));
>> +
>> + /*
>> + * Check if the level target is valid: we only support
>> + * 4K - 2M - 1G mapping.
>> + */
>> + ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));
> If you think you need to check this, don't you also want to check that
> GFN and MFN (the latter if it isn't INVALID_MFN) fit the requested order?
I'm not 100% sure it makes sense to check page_order, as, IMO, proper
alignment is already guaranteed by p2m_mapping_order(sgfn, smfn, left),
and thus both GFN and MFN are aligned accordingly.
So, I think this part of the check could be dropped.
>> + table = p2m_get_root_pointer(p2m, gfn);
>> + if ( !table )
>> + return -EINVAL;
>> +
>> + for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>> + {
>> + /*
>> + * Don't try to allocate intermediate page table if the mapping
>> + * is about to be removed.
>> + */
>> + rc = p2m_next_level(p2m, !removing_mapping,
>> + level, &table, offsets[level]);
>> + if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
>> + {
>> + rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
>> + /*
>> + * We are here because p2m_next_level has failed to map
>> + * the intermediate page table (e.g the table does not exist
>> + * and they p2m tree is read-only). It is a valid case
>> + * when removing a mapping as it may not exist in the
>> + * page table. In this case, just ignore it.
>> + */
>> + rc = removing_mapping ? 0 : rc;
> Nit: Stray blank.
>
>> + goto out;
>> + }
>> +
>> + if ( rc != P2M_TABLE_NORMAL )
>> + break;
>> + }
>> +
>> + entry = table + offsets[level];
>> +
>> + /*
>> + * If we are here with level > target, we must be at a leaf node,
>> + * and we need to break up the superpage.
>> + */
>> + if ( level > target )
>> + {
>> + panic("Shattering isn't implemented\n");
>> + }
>> +
>> + /*
>> + * We should always be there with the correct level because all the
>> + * intermediate tables have been installed if necessary.
>> + */
>> + ASSERT(level == target);
>> +
>> + orig_pte = *entry;
>> +
>> + if ( removing_mapping )
>> + p2m_clean_pte(entry, p2m->clean_pte);
>> + else
>> + {
>> + pte_t pte = p2m_pte_from_mfn(mfn, t);
>> +
>> + p2m_write_pte(entry, pte, p2m->clean_pte);
>> +
>> + p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>> + gfn_add(gfn, BIT(page_order, UL) - 1));
>> + p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>> + }
>> +
>> + p2m->need_flush = true;
>> +
>> + /*
>> + * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>> + * is not ready for RISC-V support.
>> + *
>> + * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>> + * here.
>> + */
>> +#ifdef CONFIG_HAS_PASSTHROUGH
>> +# error "add code to flush IOMMU TLB"
>> +#endif
>> +
>> + rc = 0;
>> +
>> + /*
>> + * Free the entry only if the original pte was valid and the base
>> + * is different (to avoid freeing when permission is changed).
>> + */
>> + if ( pte_is_valid(orig_pte) &&
>> + !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
> I'm puzzled by this 2nd check: A permission change would - I expect - only
> occur to a leaf entry. If the new entry is a super-page one and the old
> wasn't, don't you still need to free the sub-tree, no matter whether the
> MFNs are the same?
I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.
Based on your example (the new entry is a super-page and the old entry isn't):
For the old mapping (let's say, a 4 KiB leaf) p2m_set_entry() walks all levels down
to L0, so we will have the following MMU page table walk:
L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM
When a new mapping (let's say, a 2 MiB superpage) is requested, p2m_set_entry()
will stop at L1 (the superpage level):
L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
points to RAM)
The old L1 PTE is then saved in 'orig_pte', and 'entry' is written with
the RAM MFN for the 2 MiB mapping. The walk becomes:
L2 PTE -> L1 PTE -> RAM
Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
sub-tree.
> Plus consider the special case of MFN 0: If you clear
> an entry using MFN 0, you will find old and new PTEs' both having the same
> MFN.
Doesn't this happen only when a mapping removal is explicitly requested?
In the case of a mapping removal it seems to me it is enough just to
clear the PTE to all zeroes.
> static int p2m_set_range(struct p2m_domain *p2m,
> gfn_t sgfn,
> unsigned long nr,
> mfn_t smfn,
> p2m_type_t t)
> {
> - return -EOPNOTSUPP;
> + int rc = 0;
> + unsigned long left = nr;
> +
> + /*
> + * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> + * be dropped in relinquish_p2m_mapping(). As the P2M will still
> + * be accessible after, we need to prevent mapping to be added when the
> + * domain is dying.
> + */
> + if ( unlikely(p2m->domain->is_dying) )
> + return -EACCES;
> +
> + while ( left )
> + {
> + unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> + rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> + if ( rc )
> + break;
> +
> + sgfn = gfn_add(sgfn, BIT(order, UL));
> + if ( !mfn_eq(smfn, INVALID_MFN) )
> + smfn = mfn_add(smfn, BIT(order, UL));
> +
> + left -= BIT(order, UL);
> + }
> +
> + return !left ? 0 : left == nr ? rc : (nr - left);
> The function returning "int", you may be truncating the return value here.
> In the worst case indicating success (0) or an error (negative) when some
> of the upper bits were set.
I think one of the following would be better:
(1) Return long instead of int and add the following check:
long result = nr - left;
if (result < 0 || result > LONG_MAX)
return -ERANGE;
or (2) Just add a new `left_to_map` (or just `left`) argument to p2m_set_range().
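Option (2) could look roughly like the sketch below, where the error code and the number of unmapped pages travel separately, so nothing has to be squeezed into an int. This is only an illustration under simplifying assumptions: set_one() is a hypothetical stand-in for p2m_set_entry() that maps one page at a time, and set_range(), fail_at and mapped are made-up names.

```c
#include <assert.h>
#include <errno.h>

/* Test knobs for the stand-in below: fail once `fail_at` pages are mapped. */
static unsigned long fail_at = ~0UL;
static unsigned long mapped;

/* Hypothetical stand-in for p2m_set_entry(), one page per call. */
static int set_one(void)
{
    if ( mapped == fail_at )
        return -ENOMEM;

    mapped++;

    return 0;
}

/*
 * Returns 0 or a negative errno value. On return, *left holds the number
 * of pages that were NOT mapped, whatever its magnitude.
 */
static int set_range(unsigned long nr, unsigned long *left)
{
    int rc = 0;

    assert(left);

    for ( *left = nr; *left; --*left )
    {
        rc = set_one();
        if ( rc )
            break;
    }

    return rc;
}
```

A caller can then distinguish "nothing mapped" from "partially mapped" by comparing *left against nr, without the return value ever carrying a page count.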
> Also looks like you could get away with a single conditional operator here:
>
> return !left || left == nr ? rc : (nr - left);
Thanks.
~ Oleksii
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()
2025-08-15 9:52 ` Oleksii Kurochko
@ 2025-08-15 12:50 ` Jan Beulich
2025-08-18 11:03 ` Oleksii Kurochko
0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2025-08-15 12:50 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 15.08.2025 11:52, Oleksii Kurochko wrote:
> On 8/5/25 6:04 PM, Jan Beulich wrote:
>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>> This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
>> Nit: This patch doesn't introduce p2m_set_range(); it merely fleshes it out.
>>
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -7,11 +7,13 @@
>>> #include <xen/rwlock.h>
>>> #include <xen/types.h>
>>>
>>> +#include <asm/page.h>
>>> #include <asm/page-bits.h>
>>>
>>> extern unsigned int p2m_root_order;
>>> #define P2M_ROOT_ORDER p2m_root_order
>>> #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
>>> +#define P2M_ROOT_LEVEL HYP_PT_ROOT_LEVEL
>> I think I commented on this before, and I would have hoped for at least a remark
>> in the description to appear (perhaps even a comment here): It's okay(ish) to tie
>> these together for now, but in the longer run I don't expect this is going to be
>> wanted. If e.g. we ran Xen in Sv57 mode, there would be no reason at all to force
>> all P2Ms to use 5 levels of page tables.
>
> Do you mean that for G-stage it could be chosen any SvXX mode to limit an amount
> of page tables necessary for G-stage? If yes, then, at least, I agree that a
> comment should be added or, probably, "#warning optimize an amount of p2m root level
> for MMU mode > Sv48" (or maybe >=).
Yes.
> Or do you mean if we set hgatp.mode=Sv57 then it is possible to limit an amount of
> page table's levels to use? In this case I think hardware still will expect to see
> 5 levels of page tables.
No.
>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>> +{
>>> + write_pte(p, pte);
>>> + if ( clean_pte )
>>> + clean_dcache_va_range(p, sizeof(*p));
>> Not necessarily for right away, but if multiple adjacent PTEs are
>> written without releasing the lock, this then redundant cache flushing
>> can be a performance issue.
>
> Can't it be resolved on a caller side? Something like:
> p2m_write_pte(p1, pte1, false);
> p2m_write_pte(p2, pte2, false);
> p2m_write_pte(p3, pte3, false);
> p2m_write_pte(p4, pte4, true);
> where p1-p4 are adjacent.
No. You wouldn't know whether the last write flushes what the earlier
three have written. There may be a cacheline boundary in between. Plus
I didn't really think of back-to-back writes, but e.g. a loop doing
many of them, where a single wider flush may then be more efficient.
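The "single wider flush" idea might be sketched as follows. This is a simplified, self-contained illustration, not code from the series: pte_t is reduced to a plain 64-bit wrapper, clean_dcache_va_range() is a stub that only records what it was asked to clean, a plain assignment stands in for write_pte(), and p2m_write_ptes() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t pte; } pte_t;

/* Stub for the cache maintenance primitive: it only records the request. */
static const void *last_clean_start;
static size_t last_clean_size;
static unsigned int clean_calls;

static void clean_dcache_va_range(const void *p, size_t size)
{
    last_clean_start = p;
    last_clean_size = size;
    clean_calls++;
}

/*
 * Hypothetical helper: write nr adjacent PTEs, then clean the whole range
 * in one go. Covering the full range avoids having to reason about where
 * cacheline boundaries fall between the individual entries.
 */
static void p2m_write_ptes(pte_t *first, const pte_t *vals, size_t nr,
                           bool clean_pte)
{
    size_t i;

    assert(first && vals);

    for ( i = 0; i < nr; i++ )
        first[i] = vals[i];   /* stands in for write_pte() */

    if ( clean_pte && nr )
        clean_dcache_va_range(first, nr * sizeof(*first));
}
```

Compared with passing clean_pte=true only on the last of several p2m_write_pte() calls, the range-based call does not depend on all entries sharing a cacheline.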
>>> +#define P2M_TABLE_MAP_NONE 0
>>> +#define P2M_TABLE_MAP_NOMEM 1
>>> +#define P2M_TABLE_SUPER_PAGE 2
>>> +#define P2M_TABLE_NORMAL 3
>>> +
>>> +/*
>>> + * Take the currently mapped table, find the corresponding the entry
>>> + * corresponding to the GFN, and map the next table, if available.
>> Nit: Double "corresponding".
>>
>>> + * The previous table will be unmapped if the next level was mapped
>>> + * (e.g P2M_TABLE_NORMAL returned).
>>> + *
>>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>>> + * be allocated when not present.
>>> + *
>>> + * Return values:
>>> + * P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>>> + * P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>>> + * P2M_TABLE_SUPER_PAGE: next level or leaf mapped normally.
>>> + * P2M_TABLE_NORMAL: The next entry points to a superpage.
>>> + */
>>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>> + unsigned int level, pte_t **table,
>>> + unsigned int offset)
>>> +{
>>> + panic("%s: hasn't been implemented yet\n", __func__);
>>> +
>>> + return P2M_TABLE_MAP_NONE;
>>> +}
>>> +
>>> +/* Free pte sub-tree behind an entry */
>>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>>> + pte_t entry, unsigned int level)
>>> +{
>>> + panic("%s: hasn't been implemented yet\n", __func__);
>>> +}
>>> +
>>> +/*
>>> + * Insert an entry in the p2m. This should be called with a mapping
>>> + * equal to a page/superpage.
>>> + */
>>> +static int p2m_set_entry(struct p2m_domain *p2m,
>>> + gfn_t gfn,
>>> + unsigned long page_order,
>>> + mfn_t mfn,
>>> + p2m_type_t t)
>>> +{
>>> + unsigned int level;
>>> + unsigned int target = page_order / PAGETABLE_ORDER;
>>> + pte_t *entry, *table, orig_pte;
>>> + int rc;
>>> + /* A mapping is removed if the MFN is invalid. */
>>> + bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
>> Comment and code don't fit together. Many MFNs are invalid (any for which
>> mfn_valid() returns false), yet you only check for INVALID_MFN here.
>
> Probably, it makes sense to add an|ASSERT()| here for the case when
> |mfn_valid(mfn)| is false, but the MFN is not explicitly equal to|INVALID_MFN|.
> This would indicate that someone attempted to perform a mapping with an
> incorrect MFN, which, IMO, is entirely wrong.
No, and we've been there before. MMIO can live anywhere, and mappings for
such still will need to be permitted. It is correct to check only for
INVALID_MFN here imo; it's just the comment which also needs to reflect
that.
>>> + /*
>>> + * If we are here with level > target, we must be at a leaf node,
>>> + * and we need to break up the superpage.
>>> + */
>>> + if ( level > target )
>>> + {
>>> + panic("Shattering isn't implemented\n");
>>> + }
>>> +
>>> + /*
>>> + * We should always be there with the correct level because all the
>>> + * intermediate tables have been installed if necessary.
>>> + */
>>> + ASSERT(level == target);
>>> +
>>> + orig_pte = *entry;
>>> +
>>> + if ( removing_mapping )
>>> + p2m_clean_pte(entry, p2m->clean_pte);
>>> + else
>>> + {
>>> + pte_t pte = p2m_pte_from_mfn(mfn, t);
>>> +
>>> + p2m_write_pte(entry, pte, p2m->clean_pte);
>>> +
>>> + p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>>> + gfn_add(gfn, BIT(page_order, UL) - 1));
>>> + p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>>> + }
>>> +
>>> + p2m->need_flush = true;
>>> +
>>> + /*
>>> + * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>>> + * is not ready for RISC-V support.
>>> + *
>>> + * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>>> + * here.
>>> + */
>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>>> +# error "add code to flush IOMMU TLB"
>>> +#endif
>>> +
>>> + rc = 0;
>>> +
>>> + /*
>>> + * Free the entry only if the original pte was valid and the base
>>> + * is different (to avoid freeing when permission is changed).
>>> + */
>>> + if ( pte_is_valid(orig_pte) &&
>>> + !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>> I'm puzzled by this 2nd check: A permission change would - I expect - only
>> occur to a leaf entry. If the new entry is a super-page one and the old
>> wasn't, don't you still need to free the sub-tree, no matter whether the
>> MFNs are the same?
>
> I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.
You expecting something isn't a good criterion. If it's possible, even if
unexpected (by you), it needs dealing with correctly.
> Based on your example (new entry is super-page and old entry isn't):
> For old mapping (lets say, 4 KiB leaf) p2m_set_entry() walks all levels down
> to L0, so we will have the following MMU page table walks:
> L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM
>
> When new mapping (lets say, 2 MiB superpage) will be requested, p2m_set_entry()
> will stop at L1 (the superpage level):
> L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
> points to RAM)
> Then the old L1 PTE will be saved in 'orig_pte', then writes 'entry' with
> the RAM MFN for the 2 MiB mapping. The walk becomes:
> L2 PTE -> L1 PTE -> RAM
>
> Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
> still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
> differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
> sub-tree.
A particular example doesn't help. All possible cases need handling correctly.
>> Plus consider the special case of MFN 0: If you clear
>> an entry using MFN 0, you will find old and new PTEs' both having the same
>> MFN.
>
> Isn't this happen only when a mapping removal is explicitly requested?
> In the case of a mapping removal it seems to me it is enough just to
> clear PTE with all zeroes.
Correct. Which means original MFN (PPN) and new MFN (PPN) would match.
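The MFN 0 case can be reproduced with a small stand-alone model. It assumes only the basic RISC-V PTE layout (V in bit 0, PPN starting at bit 10); pte_t, make_pte() and needs_free_as_posted() are simplified, hypothetical stand-ins, and the predicate mirrors the condition in the posted patch rather than proposing a replacement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t pte; } pte_t;

#define PTE_VALID      (1ULL << 0)
#define PTE_PPN_SHIFT  10
#define PTE_PPN_MASK   ((1ULL << 44) - 1)

static bool pte_is_valid(pte_t p)
{
    return p.pte & PTE_VALID;
}

static uint64_t pte_get_mfn(pte_t p)
{
    return (p.pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK;
}

/* Build a valid PTE for the given MFN (permission bits omitted). */
static pte_t make_pte(uint64_t mfn)
{
    return (pte_t){ .pte = (mfn << PTE_PPN_SHIFT) | PTE_VALID };
}

/* The condition as posted: free only if the old PTE was valid and MFNs differ. */
static bool needs_free_as_posted(pte_t orig, pte_t cur)
{
    return pte_is_valid(orig) && pte_get_mfn(cur) != pte_get_mfn(orig);
}
```

Removing a mapping writes an all-zero PTE, whose PPN field reads back as 0. If the old entry mapped MFN 0, both MFNs compare equal and the predicate returns false, so the old entry is never handed to p2m_free_subtree().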
Jan
^ permalink raw reply [flat|nested] 84+ messages in thread
* Re: [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-08-14 15:17 ` Jan Beulich
@ 2025-08-18 8:22 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-18 8:22 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/14/25 5:17 PM, Jan Beulich wrote:
> On 14.08.2025 17:09, Oleksii Kurochko wrote:
>> On 8/6/25 5:55 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> +/* Put any references on the page referenced by pte. */
>>>> +static void p2m_put_page(const pte_t pte, unsigned int level)
>>>> +{
>>>> + mfn_t mfn = pte_get_mfn(pte);
>>>> + p2m_type_t p2m_type = p2m_get_type(pte);
>>>> +
>>>> + ASSERT(pte_is_valid(pte));
>>>> +
>>>> + /*
>>>> + * TODO: Currently we don't handle level 2 super-page, Xen is not
>>>> + * preemptible and therefore some work is needed to handle such
>>>> + * superpages, for which at some point Xen might end up freeing memory
>>>> + * and therefore for such a big mapping it could end up in a very long
>>>> + * operation.
>>>> + */
>>>> + switch ( level )
>>>> + {
>>>> + case 1:
>>>> + return p2m_put_2m_superpage(mfn, p2m_type);
>>>> +
>>>> + case 0:
>>>> + return p2m_put_4k_page(mfn, p2m_type);
>>>> + }
>>> Yet despite the comment not even an assertion for level 2 and up?
>> Not sure that an ASSERT() is needed here as a reference(s) for such page(s)
>> will be put during domain_relinquish_resources() as there we could do preemption.
>> Something like Arm does here:
>> https://gitlab.com/xen-project/people/olkur/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c?ref_type=heads#L1587
>>
>> I'm thinking that probably it makes sense to put only 4k page(s) and
>> all other cases postpone until domain_relinquish_resources() is called.
> How can you defer to domain cleanup? How would handling of foreign mappings
> (or e.g. ballooning? not sure) work when you don't drop references as
> necessary?
I was confused by the code in relinquish_p2m_mapping(), since it removes
foreign mappings from the P2M. My current understanding is that it is called
for foreign mappings that weren’t explicitly unmapped, in order to drop the
page reference taken when the mapping was created. Initially, I thought it
would be enough to just perform the (un)map in the P2M page tables to have
foreign mapping working, but that could result in a page never being fully
released, which would in turn break or confuse other logic.
So, yes, I agree that your initial suggestion to add ASSERT() is useful to
be sure that no one is using level 2 super-pages for foreign mapping.
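A minimal sketch of what such an assertion could look like, assuming stand-in types and helpers (pte_t here, put_4k_page(), put_2m_superpage() and level_is_supported() are hypothetical and are not Xen's definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for Xen's pte_t; illustration only. */
typedef struct {
    unsigned long mfn;
    bool valid;
} pte_t;

/* Counters so the effect of each call can be observed. */
static unsigned int put_4k_calls, put_2m_calls;

static void put_4k_page(unsigned long mfn) { (void)mfn; put_4k_calls++; }
static void put_2m_superpage(unsigned long mfn) { (void)mfn; put_2m_calls++; }

/* Only 4 KiB (level 0) and 2 MiB (level 1) mappings are handled. */
static bool level_is_supported(unsigned int level)
{
    return level <= 1;
}

static void put_page_sketch(pte_t pte, unsigned int level)
{
    assert(pte.valid);
    /* Fail loudly instead of silently skipping level 2 and up. */
    assert(level_is_supported(level));

    switch ( level )
    {
    case 1:
        put_2m_superpage(pte.mfn);
        break;

    case 0:
        put_4k_page(pte.mfn);
        break;
    }
}
```

With this shape, a level 2 entry reaching the function trips the assertion in debug builds rather than leaving its references undropped.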
>>>> /* Free pte sub-tree behind an entry */
>>>> static void p2m_free_subtree(struct p2m_domain *p2m,
>>>> pte_t entry, unsigned int level)
>>>> {
>>>> - panic("%s: hasn't been implemented yet\n", __func__);
>>>> + unsigned int i;
>>>> + pte_t *table;
>>>> + mfn_t mfn;
>>>> + struct page_info *pg;
>>>> +
>>>> + /* Nothing to do if the entry is invalid. */
>>>> + if ( !pte_is_valid(entry) )
>>>> + return;
>>>> +
>>>> + if ( pte_is_superpage(entry, level) || (level == 0) )
>>> Perhaps swap the two conditions around?
>>>
>>>> + {
>>>> +#ifdef CONFIG_IOREQ_SERVER
>>>> + /*
>>>> + * If this gets called then either the entry was replaced by an entry
>>>> + * with a different base (valid case) or the shattering of a superpage
>>>> + * has failed (error case).
>>>> + * So, at worst, the spurious mapcache invalidation might be sent.
>>>> + */
>>>> + if ( p2m_is_ram(p2m_get_type(p2m, entry)) &&
>>>> + domain_has_ioreq_server(p2m->domain) )
>>>> + ioreq_request_mapcache_invalidate(p2m->domain);
>>>> +#endif
>>>> +
>>>> + p2m_put_page(entry, level);
>>>> +
>>>> + return;
>>>> + }
>>>> +
>>>> + table = map_domain_page(pte_get_mfn(entry));
>>>> + for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>>>> + p2m_free_subtree(p2m, table[i], level - 1);
>>> In p2m_put_page() you comment towards concerns for level >= 2; no similar
>>> concerns for the resulting recursion here?
>> This function is generic enough to handle any level.
>>
>> Except that it is possible that it will be needed, for example, to split 1G mapping
>> into something smaller then p2m_free_subtree() could be called for freeing a subtree
>> of 1gb mapping.
> The question wasn't about it being generic enough, but it possibly taking
> too much time for level >= 2.
In those terms, it makes sense to add such an assertion, which will check that
we are working with levels <= 2.
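To put a number on the cost concern, here is a rough sketch of the worst-case count of level 0 entries reachable from a single entry at a given level. It assumes 512 entries per table (PT_ENTRIES is a stand-in for XEN_PT_ENTRIES, whose value is not shown in this thread):

```c
#include <stdint.h>

/* Assumed: 512 PTEs per table (4 KiB tables with 8-byte PTEs). */
#define PT_ENTRIES 512u

/*
 * Worst-case number of level-0 entries visited when recursively
 * freeing the sub-tree behind one entry at 'level'.
 */
static uint64_t worst_case_leaves(unsigned int level)
{
    uint64_t n = 1;

    while ( level-- )
        n *= PT_ENTRIES;

    return n;
}
```

Under that assumption, one level 1 entry covers at most 512 leaves, while one level 2 entry covers 262144, which is where a non-preemptible walk starts to look long.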
Thanks.
~ Oleksii
* Re: [PATCH v3 12/20] xen/riscv: implement p2m_set_range()
2025-08-15 12:50 ` Jan Beulich
@ 2025-08-18 11:03 ` Oleksii Kurochko
0 siblings, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-18 11:03 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/15/25 2:50 PM, Jan Beulich wrote:
> On 15.08.2025 11:52, Oleksii Kurochko wrote:
>> On 8/5/25 6:04 PM, Jan Beulich wrote:
>>> On 31.07.2025 17:58, Oleksii Kurochko wrote:
>>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>>> +{
>>>> + write_pte(p, pte);
>>>> + if ( clean_pte )
>>>> + clean_dcache_va_range(p, sizeof(*p));
>>> Not necessarily for right away, but if multiple adjacent PTEs are
>>> written without releasing the lock, this then redundant cache flushing
>>> can be a performance issue.
>> Can't it be resolved on a caller side? Something like:
>> p2m_write_pte(p1, pte1, false);
>> p2m_write_pte(p2, pte2, false);
>> p2m_write_pte(p3, pte3, false);
>> p2m_write_pte(p4, pte4, true);
>> where p1-p4 are adjacent.
> No. You wouldn't know whether the last write flushes what the earlier
> three have written. There may be a cacheline boundary in between.
Oh, correct. It would be hard to detect, so I agree that it would work
badly...
> Plus
> I didn't really think of back-to-back writes, but e.g. a loop doing
> many of them, where a single wider flush may then be more efficient.
... So IIUC you mean something like:
  for (i = 0; i < nr_entries; i++)
      p2m_write_pte(&pt[i], entries[i], false); // no flush yet
  clean_dcache_va_range(pt, nr_entries * sizeof(pte_t));
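A small sketch of why the single wider clean helps, and why flushing only on the last write is unsafe. It just counts cache lines; the 64-byte line size, the 8-byte pte_t and all function names are assumptions for illustration, not Xen's API:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; real hardware may differ. */
#define CACHELINE 64u

/* Assumed 8-byte PTE, for illustration only. */
typedef uint64_t pte_t;

/* Number of cache lines touched by the byte range [addr, addr + size). */
static size_t lines_covered(uintptr_t addr, size_t size)
{
    uintptr_t first, last;

    if ( size == 0 )
        return 0;

    first = addr / CACHELINE;
    last = (addr + size - 1) / CACHELINE;

    return (size_t)(last - first + 1);
}

/* Line cleans issued when every PTE write is followed by its own clean. */
static size_t cleans_per_entry(uintptr_t base, size_t nr)
{
    size_t i, total = 0;

    for ( i = 0; i < nr; i++ )
        total += lines_covered(base + i * sizeof(pte_t), sizeof(pte_t));

    return total;
}

/* Line cleans issued when all PTEs are written first, then cleaned once. */
static size_t cleans_batched(uintptr_t base, size_t nr)
{
    return lines_covered(base, nr * sizeof(pte_t));
}
```

For a full table of 512 entries this gives 512 line cleans per-entry versus 64 batched. It also shows the boundary problem: four adjacent PTEs starting at offset 0x30 within a line span two lines, but cleaning only the last PTE touches one.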
>>>> +#define P2M_TABLE_MAP_NONE 0
>>>> +#define P2M_TABLE_MAP_NOMEM 1
>>>> +#define P2M_TABLE_SUPER_PAGE 2
>>>> +#define P2M_TABLE_NORMAL 3
>>>> +
>>>> +/*
>>>> + * Take the currently mapped table, find the corresponding the entry
>>>> + * corresponding to the GFN, and map the next table, if available.
>>> Nit: Double "corresponding".
>>>
>>>> + * The previous table will be unmapped if the next level was mapped
>>>> + * (e.g P2M_TABLE_NORMAL returned).
>>>> + *
>>>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>>>> + * be allocated when not present.
>>>> + *
>>>> + * Return values:
>>>> + * P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
>>>> + * P2M_TABLE_MAP_NOMEM: allocating a new page failed.
>>>> + * P2M_TABLE_SUPER_PAGE: The next entry points to a superpage.
>>>> + * P2M_TABLE_NORMAL: next level or leaf mapped normally.
>>>> + */
>>>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>>> + unsigned int level, pte_t **table,
>>>> + unsigned int offset)
>>>> +{
>>>> + panic("%s: hasn't been implemented yet\n", __func__);
>>>> +
>>>> + return P2M_TABLE_MAP_NONE;
>>>> +}
>>>> +
>>>> +/* Free pte sub-tree behind an entry */
>>>> +static void p2m_free_subtree(struct p2m_domain *p2m,
>>>> + pte_t entry, unsigned int level)
>>>> +{
>>>> + panic("%s: hasn't been implemented yet\n", __func__);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Insert an entry in the p2m. This should be called with a mapping
>>>> + * equal to a page/superpage.
>>>> + */
>>>> +static int p2m_set_entry(struct p2m_domain *p2m,
>>>> + gfn_t gfn,
>>>> + unsigned long page_order,
>>>> + mfn_t mfn,
>>>> + p2m_type_t t)
>>>> +{
>>>> + unsigned int level;
>>>> + unsigned int target = page_order / PAGETABLE_ORDER;
>>>> + pte_t *entry, *table, orig_pte;
>>>> + int rc;
>>>> + /* A mapping is removed if the MFN is invalid. */
>>>> + bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
>>> Comment and code don't fit together. Many MFNs are invalid (any for which
>>> mfn_valid() returns false), yet you only check for INVALID_MFN here.
>> Probably, it makes sense to add an ASSERT() here for the case when
>> mfn_valid(mfn) is false, but the MFN is not explicitly equal to INVALID_MFN.
>> This would indicate that someone attempted to perform a mapping with an
>> incorrect MFN, which, IMO, is entirely wrong.
> No, and we've been there before. MMIO can live anywhere, and mappings for
> such still will need to be permitted. It is correct to check only for
> INVALID_MFN here imo; it's just the comment which also needs to reflect
> that.
Got it now. The original comment looked clear to me, but considering what
you wrote, I will update the comment to:
  A mapping is removed only if the MFN is explicitly passed as INVALID_MFN.
Also, perhaps it makes sense to add the following:
  Other MFNs that are not valid (e.g., MMIO) from mfn_valid()'s point of
  view are allowed.
Does it make more sense now?
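The distinction can be sketched as follows. The RAM window and both helper names are hypothetical; only the idea that INVALID_MFN is a single all-ones sentinel, while many other MFNs fail a validity check, is taken from the discussion:

```c
#include <stdbool.h>

/* Sentinel meaning "remove the mapping" (assumed all-ones). */
#define INVALID_MFN (~0UL)

/* Hypothetical RAM window standing in for what mfn_valid() accepts. */
#define RAM_START 0x80000UL
#define RAM_END   0xC0000UL

/* Stand-in for mfn_valid(): true only for frames backed by RAM. */
static bool mfn_valid_sketch(unsigned long mfn)
{
    return mfn >= RAM_START && mfn < RAM_END;
}

/* A mapping is removed only for the explicit sentinel. */
static bool is_removal(unsigned long mfn)
{
    return mfn == INVALID_MFN;
}
```

An MMIO frame such as 0x10000 fails mfn_valid_sketch() yet is not a removal, so it must still be mappable.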
>
>>>> + /*
>>>> + * If we are here with level > target, we must be at a leaf node,
>>>> + * and we need to break up the superpage.
>>>> + */
>>>> + if ( level > target )
>>>> + {
>>>> + panic("Shattering isn't implemented\n");
>>>> + }
>>>> +
>>>> + /*
>>>> + * We should always be there with the correct level because all the
>>>> + * intermediate tables have been installed if necessary.
>>>> + */
>>>> + ASSERT(level == target);
>>>> +
>>>> + orig_pte = *entry;
>>>> +
>>>> + if ( removing_mapping )
>>>> + p2m_clean_pte(entry, p2m->clean_pte);
>>>> + else
>>>> + {
>>>> + pte_t pte = p2m_pte_from_mfn(mfn, t);
>>>> +
>>>> + p2m_write_pte(entry, pte, p2m->clean_pte);
>>>> +
>>>> + p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
>>>> + gfn_add(gfn, BIT(page_order, UL) - 1));
>>>> + p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
>>>> + }
>>>> +
>>>> + p2m->need_flush = true;
>>>> +
>>>> + /*
>>>> + * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
>>>> + * is not ready for RISC-V support.
>>>> + *
>>>> + * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
>>>> + * here.
>>>> + */
>>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>>>> +# error "add code to flush IOMMU TLB"
>>>> +#endif
>>>> +
>>>> + rc = 0;
>>>> +
>>>> + /*
>>>> + * Free the entry only if the original pte was valid and the base
>>>> + * is different (to avoid freeing when permission is changed).
>>>> + */
>>>> + if ( pte_is_valid(orig_pte) &&
>>>> + !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>>> I'm puzzled by this 2nd check: A permission change would - I expect - only
>>> occur to a leaf entry. If the new entry is a super-page one and the old
>>> wasn't, don't you still need to free the sub-tree, no matter whether the
>>> MFNs are the same?
>> I expect the MFNs to differ in this scenario, so the old sub-tree will be freed.
> You expecting something isn't a good criteria. If it's possible, even if
> unexpected (by you), it needs dealing with correctly.
>
>> Based on your example (new entry is super-page and old entry isn't):
>> For old mapping (lets say, 4 KiB leaf) p2m_set_entry() walks all levels down
>> to L0, so we will have the following MMU page table walks:
>> L2 PTE -> L1 PTE (MFN of L0 page table) -> L0 PTE -> RAM
>>
>> When new mapping (lets say, 2 MiB superpage) will be requested, p2m_set_entry()
>> will stop at L1 (the superpage level):
>> L2 PTE -> L1 PTE (at this moment, L1 PTE points to L0 page table, which
>> points to RAM)
>> Then the old L1 PTE will be saved in 'orig_pte', then writes 'entry' with
>> the RAM MFN for the 2 MiB mapping. The walk becomes:
>> L2 PTE -> L1 PTE -> RAM
>>
>> Therefore, 'entry' now holds an MFN pointing to RAM (superpage leaf). 'orig_pte'
>> still holds an MFN pointing to the L0 table (the old sub-tree). Since these MFNs
>> differ, the code calls p2m_free_subtree(p2m, orig_pte, …) and frees the old L0
>> sub-tree.
> A particular example doesn't help. All possible cases need handling correctly.
For sure, all possible cases need handling correctly, but I don't see any cases
where the MFNs will be the same except the one you mentioned below.
>
>>> Plus consider the special case of MFN 0: If you clear
>>> an entry using MFN 0, you will find old and new PTEs' both having the same
>>> MFN.
>> Isn't this happen only when a mapping removal is explicitly requested?
>> In the case of a mapping removal it seems to me it is enough just to
>> clear PTE with all zeroes.
> Correct. Which means original MFN (PPN) and new MFN (PPN) would match.
Oh, I see what the issue is now. If MFN 0 was mapped previously and the
mapping is then removed, the cleared PTE also reads back as MFN 0, so the old
and new MFNs compare equal, p2m_free_subtree() isn't called, and the reference
taken on the MFN 0 page when it was mapped is never put.
In this case, either the if-condition should be updated:
@@ -883,7 +890,8 @@ static int p2m_set_entry(struct p2m_domain *p2m,
      * is different (to avoid freeing when permission is changed).
      */
     if ( pte_is_valid(orig_pte) &&
-         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
+         (!mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) ||
+          (removing_mapping && mfn_eq(pte_get_mfn(*entry), _mfn(0)))) )
or p2m_free_subtree() should be called in the removing_mapping handling:
@@ -850,7 +852,12 @@ static int p2m_set_entry(struct p2m_domain *p2m,
     orig_pte = *entry;
     if ( removing_mapping )
+    {
+        if ( mfn_eq(pte_get_mfn(*entry), _mfn(0)) )
+            p2m_free_subtree(p2m, orig_pte, level, virt_to_page(table), offsets[level]);
+
         p2m_clean_pte(entry, p2m->clean_pte);
+    }
     else
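The MFN 0 corner case can be reproduced in isolation with a simplified PTE model. The layout (valid bit 0, frame number from bit 10) and every name below are assumptions for illustration, and should_free_fixed() is just one possible shape of the check, not the code proposed above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical PTE layout: bit 0 = valid, MFN from bit 10. */
#define PTE_VALID     1ULL
#define PTE_PPN_SHIFT 10

typedef uint64_t pte_t;

static pte_t make_pte(uint64_t mfn)
{
    return (mfn << PTE_PPN_SHIFT) | PTE_VALID;
}

static uint64_t pte_mfn(pte_t pte)
{
    return pte >> PTE_PPN_SHIFT;
}

static bool pte_valid(pte_t pte)
{
    return (pte & PTE_VALID) != 0;
}

/* Original check: free only if the old PTE was valid and the MFN changed. */
static bool should_free_orig(pte_t orig, pte_t cur)
{
    return pte_valid(orig) && pte_mfn(cur) != pte_mfn(orig);
}

/* Possible fix: a removal always releases a valid old entry. */
static bool should_free_fixed(pte_t orig, pte_t cur, bool removing)
{
    return pte_valid(orig) && (removing || pte_mfn(cur) != pte_mfn(orig));
}
```

With an old PTE for MFN 0 and a cleared (all-zero) new PTE, should_free_orig() returns false and the reference leaks, while a pure permission change on the same MFN is still left alone by both variants.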
~ Oleksii
* Re: [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN
2025-08-11 13:25 ` Jan Beulich
2025-08-12 11:42 ` Oleksii Kurochko
@ 2025-08-22 8:39 ` Oleksii Kurochko
1 sibling, 0 replies; 84+ messages in thread
From: Oleksii Kurochko @ 2025-08-22 8:39 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 8/11/25 3:25 PM, Jan Beulich wrote:
>> + * The page_order will correspond to the order of the mapping in the page
>> + * table (i.e it could be a superpage).
>> + *
>> + * If the entry is not present, INVALID_MFN will be returned and the
>> + * page_order will be set according to the order of the invalid range.
>> + *
>> + * valid will contain the value of bit[0] (e.g valid bit) of the
>> + * entry.
>> + */
>> +static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> + p2m_type_t *t,
>> + unsigned int *page_order,
>> + bool *valid)
>> +{
>> + unsigned int level = 0;
>> + pte_t entry, *table;
>> + int rc;
>> + mfn_t mfn = INVALID_MFN;
>> + DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
>> +
>> + ASSERT(p2m_is_locked(p2m));
>> + BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
> What function-wide property is this check about? Even when moved ...
>
>> + if ( valid )
>> + *valid = false;
>> +
>> + /* XXX: Check if the mapping is lower than the mapped gfn */
> (Nested: What is this about?)
>
>> + /* This gfn is higher than the highest the p2m map currently holds */
>> + if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
>> + {
>> + for ( level = P2M_ROOT_LEVEL; level; level-- )
>> + if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
> ... into the more narrow scope where another XEN_PT_LEVEL_MASK() exists I
> can't really spot what the check is to guard against.
I missed answering this in my previous reply and only noticed it when I
started reworking the patch.
I think it makes sense to update the comment above the if condition. The loop
is needed to find the highest possible order, by checking whether the base of
the block mapping at each level is greater than the max mapped GFN: as
mentioned in the description of the function, if the entry is not present, the
function will return the order of the invalid range.
It probably makes sense to do something similar for ->lowest_mapped_gfn, which
is why the /* XXX: ... */ comment exists.
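A rough sketch of that search, under stated assumptions: 9 bits per level, a three-level table, and a block_base() helper that aligns the GFN down to the block size. The helper and constants are hypothetical and do not reproduce the XEN_PT_LEVEL_MASK() expression from the patch:

```c
/* Assumed: 512 entries per table and a root at level 2. */
#define PAGETABLE_ORDER 9
#define ROOT_LEVEL      2

/* Base GFN of the level-'level' block that contains 'gfn'. */
static unsigned long block_base(unsigned long gfn, unsigned int level)
{
    return gfn & ~((1UL << (level * PAGETABLE_ORDER)) - 1);
}

/*
 * For a gfn above max_mapped_gfn, pick the largest level whose block
 * starts above max_mapped_gfn and report its order. Falls back to
 * order 0 when no larger block fits.
 */
static unsigned int invalid_range_order(unsigned long gfn,
                                        unsigned long max_mapped_gfn)
{
    unsigned int level;

    for ( level = ROOT_LEVEL; level; level-- )
        if ( block_base(gfn, level) > max_mapped_gfn )
            break;

    return level * PAGETABLE_ORDER;
}
```

The caller is expected to have already checked that gfn is greater than max_mapped_gfn, as the surrounding if condition in the patch does.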
~ Oleksii
>
>> + gfn_x(p2m->max_mapped_gfn) )
>> + break;
>> +
>> + goto out;
>> + }
end of thread, other threads:[~2025-08-22 8:40 UTC | newest]
Thread overview: 84+ messages
2025-07-31 15:57 [PATCH v3 00/20] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 01/20] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
2025-08-04 13:52 ` Jan Beulich
2025-08-05 14:45 ` Oleksii Kurochko
2025-08-05 15:01 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 02/20] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
2025-08-04 13:55 ` Jan Beulich
2025-08-05 14:57 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 03/20] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
2025-08-04 15:19 ` Jan Beulich
2025-08-06 11:33 ` Oleksii Kurochko
2025-08-06 12:05 ` Jan Beulich
2025-08-06 16:24 ` Oleksii Kurochko
2025-08-06 16:50 ` Demi Marie Obenour
2025-08-07 8:43 ` Oleksii Kurochko
2025-08-07 10:11 ` Jan Beulich
2025-08-07 14:45 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 04/20] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
2025-08-04 15:53 ` Jan Beulich
2025-08-06 11:43 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 05/20] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
2025-08-04 15:58 ` Jan Beulich
2025-08-05 10:40 ` Jan Beulich
2025-08-06 12:01 ` Oleksii Kurochko
2025-08-06 12:07 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 06/20] xen/riscv: add root page table allocation Oleksii Kurochko
2025-08-05 10:37 ` Jan Beulich
2025-08-07 12:00 ` Oleksii Kurochko
2025-08-07 15:30 ` Jan Beulich
2025-08-07 15:59 ` Oleksii Kurochko
2025-08-07 16:03 ` Jan Beulich
2025-08-05 10:43 ` Jan Beulich
2025-08-07 13:35 ` Oleksii Kurochko
2025-08-07 15:57 ` Jan Beulich
2025-08-08 9:14 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 07/20] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 08/20] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
2025-08-04 14:16 ` Jan Beulich
2025-08-07 15:41 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 09/20] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings Oleksii Kurochko
2025-08-04 14:11 ` Jan Beulich
2025-08-07 15:23 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 10/20] xen/riscv: introduce page_{get,set}_xenheap_gfn() Oleksii Kurochko
2025-08-05 14:11 ` Jan Beulich
2025-08-08 9:16 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 11/20] xen/riscv: implement function to map memory in guest p2m Oleksii Kurochko
2025-08-05 15:20 ` Jan Beulich
2025-08-08 13:46 ` Oleksii Kurochko
2025-08-11 7:28 ` Jan Beulich
2025-08-11 9:29 ` Oleksii Kurochko
2025-08-11 9:35 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 12/20] xen/riscv: implement p2m_set_range() Oleksii Kurochko
2025-08-05 16:04 ` Jan Beulich
2025-08-15 9:52 ` Oleksii Kurochko
2025-08-15 12:50 ` Jan Beulich
2025-08-18 11:03 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 13/20] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
2025-08-06 15:55 ` Jan Beulich
2025-08-14 15:09 ` Oleksii Kurochko
2025-08-14 15:17 ` Jan Beulich
2025-08-18 8:22 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 14/20] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
2025-08-11 11:36 ` Jan Beulich
2025-08-11 14:44 ` Oleksii Kurochko
2025-08-11 15:11 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 15/20] xen/riscv: implement p2m_next_level() Oleksii Kurochko
2025-08-11 11:44 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 16/20] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
2025-08-11 11:59 ` Jan Beulich
2025-08-11 15:19 ` Oleksii Kurochko
2025-08-11 15:47 ` Jan Beulich
2025-07-31 15:58 ` [PATCH v3 17/20] xen/riscv: implement put_page() Oleksii Kurochko
2025-08-11 12:43 ` Jan Beulich
2025-08-11 15:32 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 18/20] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
2025-08-11 12:50 ` Jan Beulich
2025-08-11 15:34 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 19/20] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
2025-08-11 13:25 ` Jan Beulich
2025-08-12 11:42 ` Oleksii Kurochko
2025-08-22 8:39 ` Oleksii Kurochko
2025-07-31 15:58 ` [PATCH v3 20/20] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
2025-08-11 15:44 ` Jan Beulich
2025-08-12 14:52 ` Oleksii Kurochko