[PATCH v2 00/17] xen/riscv: introduce p2m functionality

* [PATCH v2 00/17] xen/riscv: introduce p2m functionality
@ 2025-06-10 13:05 Oleksii Kurochko
  2025-06-10 13:05 ` [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
                   ` (16 more replies)
  0 siblings, 17 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

This patch series introduces the functions necessary to build and manage
RISC-V guest page tables and MMIO/RAM mappings.

CI tests:
  https://gitlab.com/xen-project/people/olkur/xen/-/pipelines/1862284573

---
Changes in V2:
 - Merged to staging:
   - [PATCH v1 1/6] xen/riscv: add inclusion of xen/bitops.h to asm/cmpxchg.h
 - New patches:
   - xen/riscv: implement sbi_remote_hfence_gvma{_vmid}().
 - Split patch "xen/riscv: implement p2m mapping functionality" into smaller
   patches:
   - xen/riscv: introduce page_set_xenheap_gfn()
   - xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
   - xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
   - xen/riscv: Implement p2m_free_entry() and related helpers
   - xen/riscv: Implement superpage splitting for p2m mappings
   - xen/riscv: implement p2m_next_level()
   - xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
 - Move root p2m table allocation to separate patch:
   xen/riscv: add root page table allocation
 - Drop the dependency of this patch series on the patch introducing
   SvPBMT, as it was merged.
 - Patch "[PATCH v1 4/6] xen/riscv: define pt_t and pt_walk_t structures" was
   renamed to "xen/riscv: introduce pte_{set,get}_mfn()" because, after
   dropping the bitfields for the PTE structure, this patch introduces only
   pte_{set,get}_mfn().
 - Rename "xen/riscv: define pt_t and pt_walk_t structures" to
   "xen/riscv: introduce pte_{set,get}_mfn()" as pt_t and pt_walk_t were
   dropped.
 - Introduce guest domain's VMID allocation and management.
 - Add patches necessary to implement p2m lookup:
   - xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
   - xen/riscv: add support of page lookup by GFN
 - Re-sort patch series.
 - Add link to CI tests.
 - All other changes are patch-specific. Please check them.
---

Oleksii Kurochko (17):
  xen/riscv: implement sbi_remote_hfence_gvma()
  xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
  xen/riscv: introduce guest domain's VMID allocation and manegement
  xen/riscv: construct the P2M pages pool for guests
  xen/riscv: introduce things necessary for p2m initialization
  xen/riscv: add root page table allocation
  xen/riscv: introduce pte_{set,get}_mfn()
  xen/riscv: add new p2m types and helper macros for type classification
  xen/riscv: introduce page_set_xenheap_gfn()
  xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to
    MFNs
  xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  xen/riscv: Implement p2m_free_entry() and related helpers
  xen/riscv: Implement p2m_entry_from_mfn() and support PBMT
    configuration
  xen/riscv: implement p2m_next_level()
  xen/riscv: Implement superpage splitting for p2m mappings
  xen/riscv: implement mfn_valid() and page reference, ownership
    handling helpers
  xen/riscv: add support of page lookup by GFN

 xen/arch/riscv/Makefile                     |    1 +
 xen/arch/riscv/include/asm/domain.h         |   16 +
 xen/arch/riscv/include/asm/mm.h             |   45 +-
 xen/arch/riscv/include/asm/p2m.h            |  125 +-
 xen/arch/riscv/include/asm/page.h           |   33 +
 xen/arch/riscv/include/asm/riscv_encoding.h |    4 +
 xen/arch/riscv/include/asm/sbi.h            |   38 +
 xen/arch/riscv/mm.c                         |   97 +-
 xen/arch/riscv/p2m.c                        | 1188 +++++++++++++++++++
 xen/arch/riscv/sbi.c                        |   18 +
 xen/arch/riscv/setup.c                      |    3 +
 11 files changed, 1550 insertions(+), 18 deletions(-)
 create mode 100644 xen/arch/riscv/p2m.c

-- 
2.49.0




* [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-18 15:15   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
covering the range of guest physical addresses between start_addr and
start_addr + size for all the guests.

The remote fence operation applies to the entire address space if either:
 - start_addr and size are both 0, or
 - size is equal to 2^XLEN-1.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch.
---
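The "entire address space" rule from the commit message can be sketched as
a small predicate. This is only an illustration: rfence_covers_all() is a
hypothetical helper that is not part of this patch, and uintptr_t is assumed
to stand in for an XLEN-wide register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of this patch): returns true when a
 * (start, size) pair selects the entire guest physical address space.
 * uintptr_t is assumed to be as wide as XLEN.
 */
static bool rfence_covers_all(uintptr_t start, uintptr_t size)
{
    /* Case 1: start_addr and size are both 0. */
    if ( start == 0 && size == 0 )
        return true;

    /* Case 2: size is 2^XLEN - 1, i.e. all bits are set. */
    return size == UINTPTR_MAX;
}
```

A caller wanting a full flush would therefore pass either (0, 0) or any start
together with an all-ones size.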
 xen/arch/riscv/include/asm/sbi.h | 21 +++++++++++++++++++++
 xen/arch/riscv/sbi.c             |  9 +++++++++
 2 files changed, 30 insertions(+)

diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 527d773277..8e346347af 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -89,6 +89,27 @@ bool sbi_has_rfence(void);
 int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
                           size_t size);
 
+/*
+ * Instructs the remote harts to execute one or more HFENCE.GVMA
+ * instructions, covering the range of guest physical addresses
+ * between start_addr and start_addr + size for all the guests.
+ * This function call is only valid for harts implementing
+ * the hypervisor extension.
+ *
+ * Returns 0 if IPI was sent to all the targeted harts successfully
+ * or negative value if start_addr or size is not valid.
+ *
+ * The remote fence operation applies to the entire address space if either:
+ *  - start_addr and size are both 0, or
+ *  - size is equal to 2^XLEN-1.
+ *
+ * @param cpu_mask a cpu mask containing all the target CPUs (in Xen space).
+ * @param start guest physical address start
+ * @param size guest physical address range size
+ */
+int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
+                           size_t size);
+
 /*
  * Initialize SBI library
  *
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 4209520389..0613ad1cb0 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -258,6 +258,15 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
                       cpu_mask, start, size, 0, 0);
 }
 
+int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
+                           size_t size)
+{
+    ASSERT(sbi_rfence);
+
+    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
+                      cpu_mask, start, size, 0, 0);
+}
+
 /* This function must always succeed. */
 #define sbi_get_spec_version()  \
     sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
-- 
2.49.0




* [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
  2025-06-10 13:05 ` [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-18 15:20   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement Oleksii Kurochko
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Instruct the remote harts to execute one or more HFENCE.GVMA instructions
by making an SBI call, covering the range of guest physical addresses between
start_addr and start_addr + size only for the given VMID.

This function call is only valid for harts implementing the hypervisor
extension.

The remote fence operation applies to the entire address space if either:
  - start_addr and size are both 0, or
  - size is equal to 2^XLEN-1.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch.
---
 xen/arch/riscv/include/asm/sbi.h | 17 +++++++++++++++++
 xen/arch/riscv/sbi.c             |  9 +++++++++
 2 files changed, 26 insertions(+)

diff --git a/xen/arch/riscv/include/asm/sbi.h b/xen/arch/riscv/include/asm/sbi.h
index 8e346347af..2644833eb4 100644
--- a/xen/arch/riscv/include/asm/sbi.h
+++ b/xen/arch/riscv/include/asm/sbi.h
@@ -110,6 +110,23 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
 int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
                            size_t size);
 
+/*
+ * Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
+ * covering the range of guest physical addresses between start_addr and
+ * start_addr + size only for the given VMID. This function call is only
+ * valid for harts implementing the hypervisor extension.
+ * The remote fence operation applies to the entire address space if either:
+ *  - start_addr and size are both 0, or
+ *  - size is equal to 2^XLEN-1.
+ *
+ * @param cpu_mask a cpu mask containing all the target CPUs (in Xen space).
+ * @param start guest physical address start
+ * @param size guest physical address range size
+ * @param vmid virtual machine id
+ */
+int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
+                                size_t size, unsigned long vmid);
+
 /*
  * Initialize SBI library
  *
diff --git a/xen/arch/riscv/sbi.c b/xen/arch/riscv/sbi.c
index 0613ad1cb0..bfd1193509 100644
--- a/xen/arch/riscv/sbi.c
+++ b/xen/arch/riscv/sbi.c
@@ -267,6 +267,15 @@ int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
                       cpu_mask, start, size, 0, 0);
 }
 
+int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
+                                size_t size, unsigned long vmid)
+{
+    ASSERT(sbi_rfence);
+
+    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA_VMID,
+                      cpu_mask, start, size, vmid, 0);
+}
+
 /* This function must always succeed. */
 #define sbi_get_spec_version()  \
     sbi_ext_base_func(SBI_EXT_BASE_GET_SPEC_VERSION)
-- 
2.49.0




* [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
  2025-06-10 13:05 ` [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
  2025-06-10 13:05 ` [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-18 15:46   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Implementation is based on Arm code with some minor changes:
 - Re-define INVALID_VMID.
 - Re-define MAX_VMID.
 - Add TLB flushing when VMID is re-used.

Also, as a part of this patch, structure p2m_domain is introduced with
a vmid member inside it. It is necessary for VMID management functions.

Add a bitmap-based allocator to manage VMID space, supporting up to 127
VMIDs on RV32 and 16,383 on RV64 platforms, in accordance with the
architecture's hgatp VMID field (RV32 - 7 bit long, others - 14 bit long).

Reserve the highest VMID as INVALID_VMID to ensure it's not reused.

Implement p2m_alloc_vmid() and p2m_free_vmid() for dynamic allocation
and release of VMIDs per domain.

Integrate VMID initialization into p2m_init() and ensure domain-specific
TLB flushes on VMID release using sbi_remote_hfence_gvma_vmid().

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch.
---
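The allocator described above can be modelled in user space roughly as
follows. This is a sketch under stated assumptions, not the Xen code: it
hard-codes the 7-bit (RV32-sized) VMID field, the names vmid_alloc(),
vmid_free(), vmid_stale and flushes are made up for the illustration, and a
counter stands in for the sbi_remote_hfence_gvma_vmid() call. The flush is
keyed on the VMID being handed out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define VMID_BITS     7                         /* assumed: RV32-sized field */
#define MAX_VMID      ((1u << VMID_BITS) - 2)   /* 126 */
#define INVALID_VMID  (MAX_VMID + 1)            /* 127, reserved */

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
/* Enough words to hold bits 0..INVALID_VMID inclusive. */
#define NR_WORDS      ((INVALID_VMID + BITS_PER_WORD) / BITS_PER_WORD)

static unsigned long vmid_mask[NR_WORDS];   /* bit set: VMID is in use */
static unsigned long vmid_stale[NR_WORDS];  /* bit set: flush before re-use */
static unsigned int flushes;                /* counts simulated TLB flushes */

static bool bit_test(const unsigned long *map, unsigned int nr)
{
    return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

static void bit_set(unsigned long *map, unsigned int nr)
{
    map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static void bit_clear(unsigned long *map, unsigned int nr)
{
    map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

/* Returns the allocated VMID, or INVALID_VMID if the pool is exhausted. */
static unsigned int vmid_alloc(void)
{
    /* Like the patch, search only bits 0..MAX_VMID-1. */
    for ( unsigned int nr = 0; nr < MAX_VMID; nr++ )
    {
        if ( bit_test(vmid_mask, nr) )
            continue;

        bit_set(vmid_mask, nr);

        /* A previously freed VMID may still have stale TLB entries. */
        if ( bit_test(vmid_stale, nr) )
        {
            bit_clear(vmid_stale, nr);
            flushes++;  /* stands in for sbi_remote_hfence_gvma_vmid() */
        }

        return nr;
    }

    return INVALID_VMID;
}

static void vmid_free(unsigned int vmid)
{
    if ( vmid == INVALID_VMID )
        return;

    bit_clear(vmid_mask, vmid);
    bit_set(vmid_stale, vmid);  /* defer the flush until the VMID is re-used */
}
```

Deferring the flush to re-use time means a domain teardown does not pay for
an IPI round; the cost is only incurred if the same VMID is handed out again.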
 xen/arch/riscv/Makefile             |   1 +
 xen/arch/riscv/include/asm/domain.h |   4 +
 xen/arch/riscv/include/asm/p2m.h    |  14 ++++
 xen/arch/riscv/p2m.c                | 115 ++++++++++++++++++++++++++++
 xen/arch/riscv/setup.c              |   3 +
 5 files changed, 137 insertions(+)
 create mode 100644 xen/arch/riscv/p2m.c

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index a1c145c506..1034f2c9cd 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -6,6 +6,7 @@ obj-y += intc.o
 obj-y += irq.o
 obj-y += mm.o
 obj-y += pt.o
+obj-y += p2m.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index c3d965a559..b9a03e91c5 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -5,6 +5,8 @@
 #include <xen/xmalloc.h>
 #include <public/hvm/params.h>
 
+#include <asm/p2m.h>
+
 struct hvm_domain
 {
     uint64_t              params[HVM_NR_PARAMS];
@@ -18,6 +20,8 @@ struct arch_vcpu {
 
 struct arch_domain {
     struct hvm_domain hvm;
+
+    struct p2m_domain p2m;
 };
 
 #include <xen/sched.h>
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 28f57a74f2..359408e1be 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -3,11 +3,21 @@
 #define ASM__RISCV__P2M_H
 
 #include <xen/errno.h>
+#include <xen/types.h>
 
 #include <asm/page-bits.h>
 
 #define paddr_bits PADDR_BITS
 
+/* Get host p2m table */
+#define p2m_get_hostp2m(d) (&(d)->arch.p2m)
+
+/* Per-p2m-table state */
+struct p2m_domain {
+    /* Current VMID in use */
+    uint16_t vmid;
+};
+
 /*
  * List of possible type for each page in the p2m entry.
  * The number of available bit per page in the pte for this purpose is 2 bits.
@@ -93,6 +103,10 @@ static inline void p2m_altp2m_check(struct vcpu *v, uint16_t idx)
     /* Not supported on RISCV. */
 }
 
+void p2m_vmid_allocator_init(void);
+
+int p2m_init(struct domain *d);
+
 #endif /* ASM__RISCV__P2M_H */
 
 /*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
new file mode 100644
index 0000000000..9f7fd8290a
--- /dev/null
+++ b/xen/arch/riscv/p2m.c
@@ -0,0 +1,115 @@
+#include <xen/bitops.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/spinlock.h>
+#include <xen/xvmalloc.h>
+
+#include <asm/p2m.h>
+#include <asm/sbi.h>
+
+static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
+
+/*
+ * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
+ * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
+ * concurrent domains. The bitmap space will be allocated dynamically
+ * based on whether 7 or 14 bit VMIDs are supported.
+ */
+static unsigned long *vmid_mask;
+static unsigned long *vmid_flushing_needed;
+
+/*
+ * -2 here because:
+ *    - -1 is needed to get the maximal possible VMID
+ *    - -1 is reserved for being used as INVALID_VMID
+ */
+#ifdef CONFIG_RISCV_32
+#define MAX_VMID (BIT(7, U) - 2)
+#else
+#define MAX_VMID (BIT(14, U) - 2)
+#endif
+
+/* Reserve the max possible VMID to be INVALID. */
+#define INVALID_VMID (MAX_VMID + 1)
+
+void p2m_vmid_allocator_init(void)
+{
+    /*
+     * Allocate space for vmid_mask and vmid_flushing_needed
+     * based on INVALID_VMID as it is the max possible VMID which just
+     * was reserved to be INVALID_VMID.
+     */
+    vmid_mask = xvzalloc_array(unsigned long, BITS_TO_LONGS(INVALID_VMID));
+    vmid_flushing_needed =
+        xvzalloc_array(unsigned long, BITS_TO_LONGS(INVALID_VMID));
+
+    if ( !vmid_mask || !vmid_flushing_needed )
+        panic("Could not allocate VMID bitmap space or VMID flushing map\n");
+
+    set_bit(INVALID_VMID, vmid_mask);
+}
+
+int p2m_alloc_vmid(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    int rc, nr;
+
+    spin_lock(&vmid_alloc_lock);
+
+    nr = find_first_zero_bit(vmid_mask, MAX_VMID);
+
+    ASSERT(nr != INVALID_VMID);
+
+    if ( nr == MAX_VMID )
+    {
+        rc = -EBUSY;
+        printk(XENLOG_ERR "p2m.c: dom%d: VMID pool exhausted\n", d->domain_id);
+        goto out;
+    }
+
+    set_bit(nr, vmid_mask);
+
+    if ( test_bit(nr, vmid_flushing_needed) )
+    {
+        clear_bit(nr, vmid_flushing_needed);
+        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, nr);
+    }
+
+    p2m->vmid = nr;
+
+    rc = 0;
+
+out:
+    spin_unlock(&vmid_alloc_lock);
+    return rc;
+}
+
+void p2m_free_vmid(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    spin_lock(&vmid_alloc_lock);
+
+    if ( p2m->vmid != INVALID_VMID )
+    {
+        clear_bit(p2m->vmid, vmid_mask);
+        set_bit(p2m->vmid, vmid_flushing_needed);
+    }
+
+    spin_unlock(&vmid_alloc_lock);
+}
+
+int p2m_init(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int rc;
+
+    p2m->vmid = INVALID_VMID;
+
+    rc = p2m_alloc_vmid(d);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 8bcd19218d..aa8f5646ea 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -19,6 +19,7 @@
 #include <asm/early_printk.h>
 #include <asm/fixmap.h>
 #include <asm/intc.h>
+#include <asm/p2m.h>
 #include <asm/sbi.h>
 #include <asm/setup.h>
 #include <asm/traps.h>
@@ -134,6 +135,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
 
     intc_preinit();
 
+    p2m_vmid_allocator_init();
+
     printk("All set up\n");
 
     machine_halt();
-- 
2.49.0




* [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (2 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-18 15:53   ` Jan Beulich
  2025-07-01 13:04   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Implement p2m_set_allocation() to construct the p2m pages pool for guests
based on the required number of pages.

This is implemented by:
- Adding a `struct paging_domain` which contains a freelist, a
  counter variable and a spinlock to `struct arch_domain` to
  indicate the free p2m pages and the number of p2m total pages in
  the p2m pages pool.
- Adding a helper `p2m_set_allocation` to set the p2m pages pool
  size. This helper should be called before allocating memory for
  a guest and is called from domain_p2m_set_allocation(), the latter
  is a part of common dom0less code.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v2:
 - Drop the comment above inclusion of <xen/event.h> in riscv/p2m.c.
 - Use ACCESS_ONCE() for lhs and rhs for the expressions in
   p2m_set_allocation().
---
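The grow/shrink loop of p2m_set_allocation() can be illustrated with a
user-space model. This is a simplified sketch, not the Xen code: struct pool,
struct pool_page and pool_set_allocation() are hypothetical names, malloc()
and free() stand in for the domheap allocator, the freelist is LIFO rather
than tail-add/head-remove, and preemption checks and locking are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct page_info and struct paging_domain. */
struct pool_page {
    struct pool_page *next;
};

struct pool {
    struct pool_page *freelist;   /* free pages of the pool */
    unsigned long total_pages;    /* number of pages owned by the pool */
};

/* Resize the pool to exactly 'pages'. Returns 0 or -ENOMEM. */
static int pool_set_allocation(struct pool *p, unsigned long pages)
{
    while ( p->total_pages != pages )
    {
        if ( p->total_pages < pages )
        {
            /* Grow: take one more page from the allocator. */
            struct pool_page *pg = malloc(sizeof(*pg));

            if ( !pg )
                return -ENOMEM;

            pg->next = p->freelist;
            p->freelist = pg;
            p->total_pages++;
        }
        else
        {
            /* Shrink: only pages still on the freelist can be returned. */
            struct pool_page *pg = p->freelist;

            if ( !pg )
                return -ENOMEM;

            p->freelist = pg->next;
            p->total_pages--;
            free(pg);
        }
    }

    return 0;
}
```

As in the patch, shrinking fails once the freelist is empty, because the
remaining pages are in use and cannot be handed back.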
 xen/arch/riscv/include/asm/domain.h | 12 ++++++
 xen/arch/riscv/p2m.c                | 59 +++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index b9a03e91c5..b818127f9f 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -2,6 +2,8 @@
 #ifndef ASM__RISCV__DOMAIN_H
 #define ASM__RISCV__DOMAIN_H
 
+#include <xen/mm.h>
+#include <xen/spinlock.h>
 #include <xen/xmalloc.h>
 #include <public/hvm/params.h>
 
@@ -18,10 +20,20 @@ struct arch_vcpu_io {
 struct arch_vcpu {
 };
 
+struct paging_domain {
+    spinlock_t lock;
+    /* Free P2M pages from the pre-allocated P2M pool */
+    struct page_list_head p2m_freelist;
+    /* Number of pages from the pre-allocated P2M pool */
+    unsigned long p2m_total_pages;
+};
+
 struct arch_domain {
     struct hvm_domain hvm;
 
     struct p2m_domain p2m;
+
+    struct paging_domain paging;
 };
 
 #include <xen/sched.h>
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 9f7fd8290a..f33c7147ff 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -1,4 +1,5 @@
 #include <xen/bitops.h>
+#include <xen/event.h>
 #include <xen/lib.h>
 #include <xen/sched.h>
 #include <xen/spinlock.h>
@@ -105,6 +106,9 @@ int p2m_init(struct domain *d)
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int rc;
 
+    spin_lock_init(&d->arch.paging.lock);
+    INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
+
     p2m->vmid = INVALID_VMID;
 
     rc = p2m_alloc_vmid(d);
@@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
 
     return 0;
 }
+
+/*
+ * Set the pool of pages to the required number of pages.
+ * Returns 0 for success, non-zero for failure.
+ * Call with d->arch.paging.lock held.
+ */
+int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
+{
+    struct page_info *pg;
+
+    ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+    for ( ; ; )
+    {
+        if ( d->arch.paging.p2m_total_pages < pages )
+        {
+            /* Need to allocate more memory from domheap */
+            pg = alloc_domheap_page(d, MEMF_no_owner);
+            if ( pg == NULL )
+            {
+                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
+                return -ENOMEM;
+            }
+            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
+            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
+        }
+        else if ( d->arch.paging.p2m_total_pages > pages )
+        {
+            /* Need to return memory to domheap */
+            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
+            if ( pg )
+            {
+                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
+                free_domheap_page(pg);
+            }
+            else
+            {
+                printk(XENLOG_ERR
+                       "Failed to free P2M pages, P2M freelist is empty.\n");
+                return -ENOMEM;
+            }
+        }
+        else
+            break;
+
+        /* Check to see if we need to yield and try again */
+        if ( preempted && general_preempt_check() )
+        {
+            *preempted = true;
+            return -ERESTART;
+        }
+    }
+
+    return 0;
+}
-- 
2.49.0




* [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (3 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-18 16:08   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 06/17] xen/riscv: add root page table allocation Oleksii Kurochko
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Introduce the following things:
- Update p2m_domain structure, which describes per p2m-table state, with:
  - lock to protect updates to p2m.
  - pool with pages used to construct p2m.
  - clean_pte which indicates whether it is required to clean the cache when
    writing an entry.
  - radix tree to store p2m type as PTE doesn't have enough free bits to
    store type.
  - default_access to store p2m access type for each page in the domain.
  - back pointer to domain structure.
- p2m_init() to initialize members introduced in p2m_domain structure.
- Introduce p2m_write_lock() and p2m_is_write_locked().
- Introduce p2m_force_tlb_flush_sync() to flush TLBs after p2m table
  update.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - Use the earlier introduced sbi_remote_hfence_gvma_vmid() for a proper
   implementation of p2m_force_tlb_flush_sync(), as TLB flushing needs to
   happen for each pCPU which potentially has cached a mapping, which is
   tracked by d->dirty_cpumask.
 - Drop unnecessary blanks.
 - Fix code style for # of pre-processor directive.
 - Drop max_mapped_gfn and lowest_mapped_gfn as they aren't used now.
 - [p2m_init()] Set p2m->clean_pte=false if CONFIG_HAS_PASSTHROUGH=n.
 - [p2m_init()] Update the comment above p2m->domain = d;
 - Drop p2m->need_flush as it seems to be always true for RISC-V and as a
   consequence drop p2m_tlb_flush_sync().
 - Move to separate patch an introduction of root page table allocation.
---
 xen/arch/riscv/include/asm/p2m.h | 39 +++++++++++++++++++++
 xen/arch/riscv/p2m.c             | 58 ++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 359408e1be..9570eff014 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -3,6 +3,10 @@
 #define ASM__RISCV__P2M_H
 
 #include <xen/errno.h>
+#include <xen/mem_access.h>
+#include <xen/mm.h>
+#include <xen/radix-tree.h>
+#include <xen/rwlock.h>
 #include <xen/types.h>
 
 #include <asm/page-bits.h>
@@ -14,6 +18,29 @@
 
 /* Per-p2m-table state */
 struct p2m_domain {
+    /*
+     * Lock that protects updates to the p2m.
+     */
+    rwlock_t lock;
+
+    /* Pages used to construct the p2m */
+    struct page_list_head pages;
+
+    /* Indicate if it is required to clean the cache when writing an entry */
+    bool clean_pte;
+
+    struct radix_tree_root p2m_type;
+
+    /*
+     * Default P2M access type for each page in the domain: new pages,
+     * swapped in pages, cleared pages, and pages that are ambiguously
+     * retyped get this access type.  See definition of p2m_access_t.
+     */
+    p2m_access_t default_access;
+
+    /* Back pointer to domain */
+    struct domain *domain;
+
     /* Current VMID in use */
     uint16_t vmid;
 };
@@ -107,6 +134,18 @@ void p2m_vmid_allocator_init(void);
 
 int p2m_init(struct domain *d);
 
+static inline void p2m_write_lock(struct p2m_domain *p2m)
+{
+    write_lock(&p2m->lock);
+}
+
+void p2m_write_unlock(struct p2m_domain *p2m);
+
+static inline int p2m_is_write_locked(struct p2m_domain *p2m)
+{
+    return rw_is_write_locked(&p2m->lock);
+}
+
 #endif /* ASM__RISCV__P2M_H */
 
 /*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index f33c7147ff..e409997499 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -1,13 +1,46 @@
 #include <xen/bitops.h>
+#include <xen/domain_page.h>
 #include <xen/event.h>
+#include <xen/iommu.h>
 #include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/pfn.h>
+#include <xen/rwlock.h>
 #include <xen/sched.h>
 #include <xen/spinlock.h>
 #include <xen/xvmalloc.h>
 
+#include <asm/page.h>
 #include <asm/p2m.h>
 #include <asm/sbi.h>
 
+/*
+ * Force a synchronous P2M TLB flush.
+ *
+ * Must be called with the p2m lock held.
+ */
+static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
+
+    ASSERT(p2m_is_write_locked(p2m));
+
+    sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
+}
+
+/* Do a P2M TLB flush, then release the P2M write lock */
+void p2m_write_unlock(struct p2m_domain *p2m)
+{
+    /*
+     * The final flush is done with the P2M write lock taken to avoid
+     * someone else modifying the P2M before the TLB invalidation has
+     * completed.
+     */
+    p2m_force_tlb_flush_sync(p2m);
+
+    write_unlock(&p2m->lock);
+}
+
 static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
 
 /*
@@ -109,8 +142,33 @@ int p2m_init(struct domain *d)
     spin_lock_init(&d->arch.paging.lock);
     INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
 
+    rwlock_init(&p2m->lock);
+    INIT_PAGE_LIST_HEAD(&p2m->pages);
+
     p2m->vmid = INVALID_VMID;
 
+    p2m->default_access = p2m_access_rwx;
+
+    radix_tree_init(&p2m->p2m_type);
+
+#ifdef CONFIG_HAS_PASSTHROUGH
+    /*
+     * Some IOMMUs don't support coherent PT walk. When the p2m is
+     * shared with the CPU, Xen has to make sure that the PT changes have
+     * reached the memory.
+     */
+    p2m->clean_pte = is_iommu_enabled(d) &&
+        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);
+#else
+    p2m->clean_pte = false;
+#endif
+
+    /*
+     * "Trivial" initialisation is now complete.  Set the backpointer so the
+     * users of p2m could get an access to domain structure.
+     */
+    p2m->domain = d;
+
     rc = p2m_alloc_vmid(d);
     if ( rc )
         return rc;
-- 
2.49.0




* [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (4 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-30 15:22   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Introduce support for allocating and initializing the root page table
required for RISC-V stage-2 address translation.

To implement root page table allocation the following is introduced:
- p2m_get_clean_page() and p2m_allocate_root() helpers to allocate and
  zero a 16 KiB root page table, as mandated by the RISC-V privileged
  specification for Sv39x4/Sv48x4 modes.
- Add hgatp_from_page() to construct the hgatp register value from the
  allocated root page.
- Update p2m_init() to allocate the root table and initialize
  p2m->root and p2m->hgatp.
- Add maddr_to_page() and page_to_maddr() macros for easier address
  manipulation.
- Allocate root p2m table after p2m pool is initialized.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v2:
 - This patch was created from "xen/riscv: introduce things necessary for p2m
   initialization" with the following changes:
   - [clear_and_clean_page()] Add missed call of clean_dcache_va_range().
   - Drop p2m_get_clean_page() as it is going to be used only once to allocate
     root page table. Open-code it explicitly in p2m_allocate_root(). Also,
     it will help avoid duplication of the code connected to order and nr_pages
     of p2m root page table.
   - Instead of using order 2 for alloc_domheap_pages(), use
     get_order_from_bytes(KB(16)).
   - Clear and clean a proper amount of allocated pages in p2m_allocate_root().
   - Drop _info from the function name hgatp_from_page_info() and its argument
     page_info.
   - Introduce HGATP_MODE_MASK and use MASK_INSR() instead of shift to calculate
     value of hgatp.
   - Drop unnecessary parentheses in definition of page_to_maddr().
   - Add support of VMID.
   - Drop TLB flushing in p2m_alloc_root_table() and do that once when VMID
     is re-used. [Look at p2m_alloc_vmid()]
   - Allocate p2m root table after p2m pool is fully initialized: first
     return pages to p2m pool, then allocate p2m root table.
---
 xen/arch/riscv/include/asm/mm.h             |  4 +
 xen/arch/riscv/include/asm/p2m.h            |  6 ++
 xen/arch/riscv/include/asm/riscv_encoding.h |  4 +
 xen/arch/riscv/p2m.c                        | 94 +++++++++++++++++++++
 4 files changed, 108 insertions(+)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 01bbd92a06..912bc79e1b 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -149,6 +149,10 @@ extern struct page_info *frametable_virt_start;
 #define mfn_to_page(mfn)    (frametable_virt_start + mfn_x(mfn))
 #define page_to_mfn(pg)     _mfn((pg) - frametable_virt_start)
 
+/* Convert between machine addresses and page-info structures. */
+#define maddr_to_page(ma) mfn_to_page(maddr_to_mfn(ma))
+#define page_to_maddr(pg) mfn_to_maddr(page_to_mfn(pg))
+
 static inline void *page_to_virt(const struct page_info *pg)
 {
     return mfn_to_virt(mfn_x(page_to_mfn(pg)));
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 9570eff014..a31b05bd50 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -26,6 +26,12 @@ struct p2m_domain {
     /* Pages used to construct the p2m */
     struct page_list_head pages;
 
+    /* The root of the p2m tree. May be concatenated */
+    struct page_info *root;
+
+    /* Address Translation Table for the p2m */
+    paddr_t hgatp;
+
     /* Indicate if it is required to clean the cache when writing an entry */
     bool clean_pte;
 
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 6cc8f4eb45..a71b7546ef 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -133,11 +133,13 @@
 #define HGATP_MODE_SV48X4		_UL(9)
 
 #define HGATP32_MODE_SHIFT		31
+#define HGATP32_MODE_MASK		_UL(0x80000000)
 #define HGATP32_VMID_SHIFT		22
 #define HGATP32_VMID_MASK		_UL(0x1FC00000)
 #define HGATP32_PPN			_UL(0x003FFFFF)
 
 #define HGATP64_MODE_SHIFT		60
+#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
 #define HGATP64_VMID_SHIFT		44
 #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
 #define HGATP64_PPN			_ULL(0x00000FFFFFFFFFFF)
@@ -170,6 +172,7 @@
 #define HGATP_VMID_SHIFT		HGATP64_VMID_SHIFT
 #define HGATP_VMID_MASK			HGATP64_VMID_MASK
 #define HGATP_MODE_SHIFT		HGATP64_MODE_SHIFT
+#define HGATP_MODE_MASK			HGATP64_MODE_MASK
 #else
 #define MSTATUS_SD			MSTATUS32_SD
 #define SSTATUS_SD			SSTATUS32_SD
@@ -181,6 +184,7 @@
 #define HGATP_VMID_SHIFT		HGATP32_VMID_SHIFT
 #define HGATP_VMID_MASK			HGATP32_VMID_MASK
 #define HGATP_MODE_SHIFT		HGATP32_MODE_SHIFT
+#define HGATP_MODE_MASK			HGATP32_MODE_MASK
 #endif
 
 #define TOPI_IID_SHIFT			16
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index e409997499..2419a61d8c 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
     write_unlock(&p2m->lock);
 }
 
+static void clear_and_clean_page(struct page_info *page)
+{
+    clean_dcache_va_range(page, PAGE_SIZE);
+    clear_domain_page(page_to_mfn(page));
+}
+
+static struct page_info *p2m_allocate_root(struct domain *d)
+{
+    struct page_info *page;
+    unsigned int order = get_order_from_bytes(KB(16));
+    unsigned int nr_pages = _AC(1,U) << order;
+
+    /* Return back nr_pages necessary for p2m root table. */
+
+    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
+        panic("Specify more xen,domain-p2m-mem-mb\n");
+
+    for ( unsigned int i = 0; i < nr_pages; i++ )
+    {
+        /* Return memory to domheap. */
+        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
+        if( page )
+        {
+            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
+            free_domheap_page(page);
+        }
+        else
+        {
+            printk(XENLOG_ERR
+                   "Failed to free P2M pages, P2M freelist is empty.\n");
+            return NULL;
+        }
+    }
+
+    /* Allocate memory for p2m root table. */
+
+    /*
+     * As explained in Section 18.5.1 of the Privileged Architecture Spec
+     * (version 20240411), for the paged virtual-memory schemes
+     * (Sv32x4, Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB
+     * and must be aligned to a 16-KiB boundary.
+     */
+    page = alloc_domheap_pages(d, order, MEMF_no_owner);
+    if ( page == NULL )
+        return NULL;
+
+    for ( unsigned int i = 0; i < nr_pages; i++ )
+        clear_and_clean_page(page + i);
+
+    return page;
+}
+
+static unsigned long hgatp_from_page(struct p2m_domain *p2m)
+{
+    struct page_info *p2m_root_page = p2m->root;
+    unsigned long ppn;
+    unsigned long hgatp_mode;
+
+    ppn = PFN_DOWN(page_to_maddr(p2m_root_page)) & HGATP_PPN;
+
+#if RV_STAGE1_MODE == SATP_MODE_SV39
+    hgatp_mode = HGATP_MODE_SV39X4;
+#elif RV_STAGE1_MODE == SATP_MODE_SV48
+    hgatp_mode = HGATP_MODE_SV48X4;
+#else
+#   error "add HGATP_MODE"
+#endif
+
+    return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
+           MASK_INSR(hgatp_mode, HGATP_MODE_MASK);
+}
+
+static int p2m_alloc_root_table(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    p2m->root = p2m_allocate_root(d);
+    if ( !p2m->root )
+        return -ENOMEM;
+
+    p2m->hgatp = hgatp_from_page(p2m);
+
+    return 0;
+}
+
 static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
 
 /*
@@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
         }
     }
 
+    /*
+     * First, wait for the p2m pool to be initialized. Then allocate the root
+     * table so that the necessary pages can be returned from the p2m pool,
+     * since the root table must be allocated using alloc_domheap_pages(...)
+     * to meet its specific requirements.
+     */
+    if ( !d->arch.p2m.root )
+        p2m_alloc_root_table(d);
+
     return 0;
 }
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn()
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (5 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 06/17] xen/riscv: add root page table allocation Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-26 14:57   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Introduce helpers pte_{set,get}_mfn() to simplify setting and getting
of mfn.

Also, introduce PTE_PPN_MASK and add BUILD_BUG_ON() to be sure that
PTE_PPN_MASK remains the same for all MMU modes except Sv32.
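
The mask-based accessors can be sketched in isolation as follows. This is
only an illustration for the non-Sv32 layout: pte_set_ppn() and pte_get_ppn()
are hypothetical stand-ins operating on a raw 64-bit value rather than pte_t,
and MASK_INSR()/MASK_EXTR() are re-defined locally on the assumption that
they behave like Xen's generic macros.

```c
#include <assert.h>
#include <stdint.h>

/* PPN occupies bits 53:10 of a PTE for Sv39/Sv48/Sv57. */
#define PTE_PPN_MASK 0x3FFFFFFFFFFC00ULL

/* Local re-definitions, assumed to match Xen's generic helpers. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* Hypothetical helper: replace the PPN field, keep all other bits. */
static uint64_t pte_set_ppn(uint64_t pte, uint64_t mfn)
{
    /* The frame number has to fit into the 44-bit PPN field. */
    assert(mfn <= (PTE_PPN_MASK >> 10));

    return (pte & ~PTE_PPN_MASK) | MASK_INSR(mfn, PTE_PPN_MASK);
}

/* Hypothetical helper: read the PPN field back. */
static uint64_t pte_get_ppn(uint64_t pte)
{
    return MASK_EXTR(pte, PTE_PPN_MASK);
}
```

Clearing the field before inserting is what lets the setter be used on an
already populated PTE without leaving stale PPN bits behind.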

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - Patch "[PATCH v1 4/6] xen/riscv: define pt_t and pt_walk_t structures" was
   renamed to xen/riscv: introduce pte_{set,get}_mfn() because, after dropping
   bitfields for PTE structure, this patch introduces only pte_{set,get}_mfn().
 - As pt_t and pt_walk_t were dropped, update implementation of
   pte_{set,get}_mfn() to use bit operations and shifts instead of bitfields.
 - Introduce PTE_PPN_MASK to be able to use MASK_INSR for setting/getting PPN.
 - Add BUILD_BUG_ON(RV_STAGE1_MODE > SATP_MODE_SV57) to be sure that when
   a new MMU mode is added, someone checks that PPN is still bits 53:10.
---
 xen/arch/riscv/include/asm/page.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 4cb0179648..1b8b145663 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -114,6 +114,30 @@ typedef struct {
 #endif
 } pte_t;
 
+#if RV_STAGE1_MODE != SATP_MODE_SV32
+#define PTE_PPN_MASK _UL(0x3FFFFFFFFFFC00)
+#else
+#define PTE_PPN_MASK _U(0xFFFFFC00)
+#endif
+
+static inline void pte_set_mfn(pte_t *p, mfn_t mfn)
+{
+    /*
+     * At the moment the spec provides Sv32 - Sv57.
+     * If a new MMU mode is added one day, it will be necessary
+     * to check that the PPN mask still covers bits 53:10.
+     */
+    BUILD_BUG_ON(RV_STAGE1_MODE > SATP_MODE_SV57);
+
+    p->pte &= ~PTE_PPN_MASK;
+    p->pte |= MASK_INSR(mfn_x(mfn), PTE_PPN_MASK);
+}
+
+static inline mfn_t pte_get_mfn(pte_t p)
+{
+    return _mfn(MASK_EXTR(p.pte, PTE_PPN_MASK));
+}
+
 static inline pte_t paddr_to_pte(paddr_t paddr,
                                  unsigned int permissions)
 {
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (6 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-26 14:59   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn() Oleksii Kurochko
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

- Extended p2m_type_t with additional types: p2m_ram_ro, p2m_mmio_direct_dev,
  p2m_grant_map_{rw,ro}.
- Added macros to classify memory types: P2M_RAM_TYPES, P2M_GRANT_TYPES.
- Introduced helper predicates: p2m_is_ram(), p2m_is_any_ram().
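
For readers unfamiliar with the type-mask idiom, the following standalone
sketch mirrors the enum and macros added by this patch. The function wrappers
is_ram() and is_any_ram() are hypothetical and exist only to make the
predicates easy to exercise; the patch itself uses macros, and BIT() is
replaced here by a plain shift.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_ram_ro,
    p2m_mmio_direct_dev,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* One bit per type, so a group of types is a plain OR of bits. */
#define p2m_to_mask(t)  (1UL << (t))

#define P2M_RAM_TYPES   (p2m_to_mask(p2m_ram_rw) | p2m_to_mask(p2m_ram_ro))
#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
                         p2m_to_mask(p2m_grant_map_ro))

/* Hypothetical wrapper around the p2m_is_ram() idea. */
static bool is_ram(p2m_type_t t)
{
    assert(t <= p2m_grant_map_ro);
    return (p2m_to_mask(t) & P2M_RAM_TYPES) != 0;
}

/* Hypothetical wrapper around the p2m_is_any_ram() idea. */
static bool is_any_ram(p2m_type_t t)
{
    assert(t <= p2m_grant_map_ro);
    return (p2m_to_mask(t) & (P2M_RAM_TYPES | P2M_GRANT_TYPES)) != 0;
}
```

A membership test is then a single AND, however many types a group contains.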

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - Drop stuff connected to foreign mapping as it isn't necessary for RISC-V
   right now.
---
 xen/arch/riscv/include/asm/p2m.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index a31b05bd50..0c05b58992 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -61,8 +61,28 @@ struct p2m_domain {
 typedef enum {
     p2m_invalid = 0,    /* Nothing mapped here */
     p2m_ram_rw,         /* Normal read/write domain RAM */
+    p2m_ram_ro,         /* Read-only; writes are silently dropped */
+    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
+    p2m_grant_map_rw,   /* Read/write grant mapping */
+    p2m_grant_map_ro,   /* Read-only grant mapping */
 } p2m_type_t;
 
+/* We use bitmaps and mask to handle groups of types */
+#define p2m_to_mask(t_) BIT(t_, UL)
+
+/* RAM types, which map to real machine frames */
+#define P2M_RAM_TYPES (p2m_to_mask(p2m_ram_rw) | \
+                       p2m_to_mask(p2m_ram_ro))
+
+/* Grant mapping types, which map to a real frame in another VM */
+#define P2M_GRANT_TYPES (p2m_to_mask(p2m_grant_map_rw) | \
+                         p2m_to_mask(p2m_grant_map_ro))
+
+/* Useful predicates */
+#define p2m_is_ram(t_) (p2m_to_mask(t_) & P2M_RAM_TYPES)
+#define p2m_is_any_ram(t_) (p2m_to_mask(t_) & \
+                            (P2M_RAM_TYPES | P2M_GRANT_TYPES))
+
 #include <xen/p2m-common.h>
 
 static inline int get_page_and_type(struct page_info *page,
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn()
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (7 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-30 15:48   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs Oleksii Kurochko
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Introduce page_set_xenheap_gfn() helper to encode the GFN associated with
a Xen heap page directly into the type_info field of struct page_info.

Introduce a GFN field in the type_info of a Xen heap page by reserving 10
bits (sufficient for both Sv32 and Sv39+ modes), and define PGT_gfn_mask
and PGT_gfn_width accordingly. This ensures a consistent bit layout across
all RISC-V MMU modes, avoiding the need for mode-specific ifdefs.
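
The update of the GFN field is a classic compare-and-swap retry loop. Below
is a rough userspace sketch of the same pattern using C11 atomics instead of
Xen's cmpxchg(). set_gfn_field() is a hypothetical name, and the field width
assumes that PG_shift(10) evaluates to 64 - 10 = 54 on RV64.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: on RV64 the low 54 bits of type_info hold the GFN. */
#define GFN_WIDTH 54
#define GFN_MASK  ((UINT64_C(1) << GFN_WIDTH) - 1)

/* Hypothetical helper: atomically replace only the GFN bits of type_info. */
static void set_gfn_field(_Atomic uint64_t *type_info, uint64_t gfn)
{
    uint64_t old = atomic_load(type_info);
    uint64_t next;

    assert(gfn <= GFN_MASK);

    do {
        /* Keep the type and count bits, swap in the new GFN. */
        next = (old & ~GFN_MASK) | gfn;
        /* On failure, old is refreshed with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(type_info, &old, next) );
}
```

The loop only retries when another CPU changed type_info between the read and
the exchange, so concurrent updates to the upper bits are not lost.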

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v2:
 - These changes were part of "xen/riscv: implement p2m mapping functionality".
   No additional changes were done.
---
 xen/arch/riscv/include/asm/mm.h | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 912bc79e1b..41bf9002d7 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -12,6 +12,7 @@
 #include <xen/sections.h>
 #include <xen/types.h>
 
+#include <asm/cmpxchg.h>
 #include <asm/page-bits.h>
 
 extern vaddr_t directmap_virt_start;
@@ -229,9 +230,21 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 #define PGT_writable_page PG_mask(1, 1)  /* has writable mappings?         */
 #define PGT_type_mask     PG_mask(1, 1)  /* Bits 31 or 63.                 */
 
-/* Count of uses of this frame as its current type. */
-#define PGT_count_width   PG_shift(2)
-#define PGT_count_mask    ((1UL << PGT_count_width) - 1)
+/* 10-bit count of uses of this frame as its current type. */
+#define PGT_count_mask    PG_mask(0x3FF, 10)
+
+/*
+ * Sv32 has 22-bit GFN. Sv{39, 48, 57} have 44-bit GFN.
+ * Thereby we can use for `type_info` 10 bits for all modes, having the same
+ * amount of bits for `type_info` for all MMU modes let us avoid introducing
+ * an extra #ifdef to that header:
+ *   if we go with maximum possible bits for count on each configuration
+ *   we would need to have a set of PGT_count_* and PGT_gfn_*).
+ */
+#define PGT_gfn_width     PG_shift(10)
+#define PGT_gfn_mask      (BIT(PGT_gfn_width, UL) - 1)
+
+#define PGT_INVALID_XENHEAP_GFN   _gfn(PGT_gfn_mask)
 
 /*
  * Page needs to be scrubbed. Since this bit can only be set on a page that is
@@ -283,6 +296,19 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 
 #define PFN_ORDER(pg) ((pg)->v.free.order)
 
+static inline void page_set_xenheap_gfn(struct page_info *p, gfn_t gfn)
+{
+    gfn_t gfn_ = gfn_eq(gfn, INVALID_GFN) ? PGT_INVALID_XENHEAP_GFN : gfn;
+    unsigned long x, nx, y = p->u.inuse.type_info;
+
+    ASSERT(is_xen_heap_page(p));
+
+    do {
+        x = y;
+        nx = (x & ~PGT_gfn_mask) | gfn_x(gfn_);
+    } while ( (y = cmpxchg(&p->u.inuse.type_info, x, nx)) != x );
+}
+
 extern unsigned char cpu0_boot_stack[];
 
 void setup_initial_pagetables(void);
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (8 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn() Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-06-30 15:59   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry() Oleksii Kurochko
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Introduce an initial implementation of guest_physmap_add_entry() on RISC-V
by adding a basic framework to insert guest physical memory mappings.
This allows mapping a range of GFNs to MFNs using a placeholder
p2m_set_entry() function, which currently returns -EOPNOTSUPP.

Changes included:
- Promoting guest_physmap_add_entry() from a stub to a functional
  interface calling a new p2m_insert_mapping() helper.
- Adding map_regions_p2mt() for generic mapping purposes.
- Introducing p2m_insert_mapping() and a skeleton for p2m_set_entry() to
  prepare for future support of actual page table manipulation.
- Enclosing the actual mapping logic within
  p2m_write_lock() / p2m_write_unlock() to ensure safe concurrent
  updates to the P2M.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in v2:
 - These changes were part of "xen/riscv: implement p2m mapping functionality".
   No additional significant changes were done.
---
 xen/arch/riscv/include/asm/p2m.h | 12 ++++------
 xen/arch/riscv/p2m.c             | 41 ++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 0c05b58992..af2025b9fd 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -118,14 +118,10 @@ static inline int guest_physmap_mark_populate_on_demand(struct domain *d,
     return -EOPNOTSUPP;
 }
 
-static inline int guest_physmap_add_entry(struct domain *d,
-                                          gfn_t gfn, mfn_t mfn,
-                                          unsigned long page_order,
-                                          p2m_type_t t)
-{
-    BUG_ON("unimplemented");
-    return -EINVAL;
-}
+int guest_physmap_add_entry(struct domain *d,
+                            gfn_t gfn, mfn_t mfn,
+                            unsigned long page_order,
+                            p2m_type_t t);
 
 /* Untyped version for RAM only, for compatibility */
 static inline int __must_check
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 2419a61d8c..cea37c8bda 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -324,3 +324,44 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
 
     return 0;
 }
+
+static int p2m_set_entry(struct p2m_domain *p2m,
+                         gfn_t sgfn,
+                         unsigned long nr,
+                         mfn_t smfn,
+                         p2m_type_t t,
+                         p2m_access_t a)
+{
+    return -EOPNOTSUPP;
+}
+
+static int p2m_insert_mapping(struct domain *d, gfn_t start_gfn,
+                              unsigned long nr, mfn_t mfn, p2m_type_t t)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int rc;
+
+    p2m_write_lock(p2m);
+    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
+    p2m_write_unlock(p2m);
+
+    return rc;
+}
+
+int map_regions_p2mt(struct domain *d,
+                     gfn_t gfn,
+                     unsigned long nr,
+                     mfn_t mfn,
+                     p2m_type_t p2mt)
+{
+    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
+}
+
+int guest_physmap_add_entry(struct domain *d,
+                            gfn_t gfn,
+                            mfn_t mfn,
+                            unsigned long page_order,
+                            p2m_type_t t)
+{
+    return p2m_insert_mapping(d, gfn, (1 << page_order), mfn, t);
+}
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (9 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-01 13:49   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers Oleksii Kurochko
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.

Key differences include:
- TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
  break-before-make (BBM). As a result, the flushing logic is simplified.
  TLB invalidation can be deferred until p2m_write_unlock() is called.
  Consequently, the p2m->need_flush flag is always considered true and is
  removed.
- Page Table Traversal: The order of walking the page tables differs from Arm,
  and this implementation reflects that reversed traversal.
- Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
  P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.

The main functionality is in __p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).

p2m_set_entry() breaks a region down into block-aligned mappings and calls
__p2m_set_entry() accordingly.
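
The way p2m_set_entry() chooses the mapping order for each step can be
sketched standalone as below. pick_order() is a hypothetical helper written
for this illustration; it assumes 9 bits per page-table level (4 KiB pages
with 512 entries per table) and that at most level 2 (1 GiB) superpages are
used, as in the loop in this patch. An MFN of 0 stands in for the removal
case, where the MFN does not contribute to the alignment mask.

```c
#include <assert.h>

#define PAGETABLE_ORDER   9
#define LEVEL_ORDER(lvl)  ((lvl) * PAGETABLE_ORDER)

/*
 * Hypothetical helper: return the largest supported order (18, 9 or 0)
 * such that gfn, mfn and nr are all multiples of 2^order.
 */
static unsigned int pick_order(unsigned long gfn, unsigned long mfn,
                               unsigned long nr)
{
    unsigned long mask = gfn | mfn | nr;

    assert(nr != 0);

    for ( unsigned int lvl = 2; lvl != 0; lvl-- )
    {
        /* All low bits clear means everything is aligned to this level. */
        if ( !(mask & ((1UL << LEVEL_ORDER(lvl)) - 1)) )
            return LEVEL_ORDER(lvl);
    }

    return 0; /* Fall back to a single 4 KiB page. */
}
```

Including nr in the mask is what produces the limitation noted in the XXX
comment of the patch: a region whose length is not a multiple of a superpage
size is mapped with smaller pages throughout.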

Stub implementations (to be completed later) include:
- p2m_free_entry()
- p2m_next_level()
- p2m_entry_from_mfn()
- p2me_is_valid()

Note: Support for shattering block entries is not implemented in this patch
and will be added separately.

Additionally, some straightforward helper functions are now implemented:
- p2m_write_pte()
- p2m_remove_pte()
- p2m_get_root_pointer()

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
   functionality" which was split into smaller ones.
 - Update the way the p2m TLB is flushed:
 - RISC-V doesn't require BBM, so there is no need to remove a PTE before
   writing a new one; drop 'if /*pte_is_valid(orig_pte) */' and remove the PTE
   only when removal has been requested.
 - Drop p2m->need_flush |= !!pte_is_valid(orig_pte); for the case when a
   PTE is being removed, as RISC-V could cache an invalid PTE and thereby
   requires a flush each time, regardless of whether the PTE is valid
   at the moment it is removed.
 - Drop the check of whether the PTE is valid when a PTE is modified: as
   mentioned above, BBM isn't required, so TLB flushing can be deferred and
   there is no need to do it before modifying the PTE.
 - Drop p2m->need_flush as it seems like it will always be true.
 - Drop foreign mapping things as it isn't necessary for RISC-V right now.
 - s/p2m_is_valid/p2me_is_valid.
 - Move definition and initialization of p2m->{max_mapped_gfn,lowest_mapped_gfn}
   to this patch.
---
 xen/arch/riscv/include/asm/p2m.h |  16 ++
 xen/arch/riscv/p2m.c             | 260 ++++++++++++++++++++++++++++++-
 2 files changed, 275 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index af2025b9fd..fdebd18356 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -9,8 +9,13 @@
 #include <xen/rwlock.h>
 #include <xen/types.h>
 
+#include <asm/page.h>
 #include <asm/page-bits.h>
 
+#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
+#define P2M_ROOT_ORDER  XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL)
+#define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
+
 #define paddr_bits PADDR_BITS
 
 /* Get host p2m table */
@@ -49,6 +54,17 @@ struct p2m_domain {
 
     /* Current VMID in use */
     uint16_t vmid;
+
+    /* Highest guest frame that's ever been mapped in the p2m */
+    gfn_t max_mapped_gfn;
+
+    /*
+     * Lowest mapped gfn in the p2m. When releasing mapped gfn's in a
+     * preemptible manner this is updated to record where to
+     * resume the search. Apart from during teardown this can only
+     * decrease.
+     */
+    gfn_t lowest_mapped_gfn;
 };
 
 /*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index cea37c8bda..27499a86bb 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -231,6 +231,8 @@ int p2m_init(struct domain *d)
     INIT_PAGE_LIST_HEAD(&p2m->pages);
 
     p2m->vmid = INVALID_VMID;
+    p2m->max_mapped_gfn = _gfn(0);
+    p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
 
     p2m->default_access = p2m_access_rwx;
 
@@ -325,6 +327,214 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
     return 0;
 }
 
+/*
+ * Find and map the root page table. The caller is responsible for
+ * unmapping the table.
+ *
+ * The function will return NULL if the offset of the root table is
+ * invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+    unsigned long root_table_indx;
+
+    root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
+    if ( root_table_indx >= P2M_ROOT_PAGES )
+        return NULL;
+
+    return __map_domain_page(p2m->root + root_table_indx);
+}
+
+static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
+{
+    panic("%s: isn't implemented for now\n", __func__);
+
+    return false;
+}
+
+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+    write_pte(p, pte);
+    if ( clean_pte )
+        clean_dcache_va_range(p, sizeof(*p));
+}
+
+static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
+{
+    pte_t pte;
+
+    memset(&pte, 0x00, sizeof(pte));
+    p2m_write_pte(p, pte, clean_pte);
+}
+
+static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn,
+                                p2m_type_t t, p2m_access_t a)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return (pte_t) { .pte = 0 };
+}
+
+#define GUEST_TABLE_MAP_NONE 0
+#define GUEST_TABLE_MAP_NOMEM 1
+#define GUEST_TABLE_SUPER_PAGE 2
+#define GUEST_TABLE_NORMAL 3
+
+/*
+ * Take the currently mapped table, find the corresponding GFN entry,
+ * and map the next table, if available. The previous table will be
+ * unmapped if the next level was mapped (e.g GUEST_TABLE_NORMAL
+ * returned).
+ *
+ * `alloc_tbl` parameter indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ *  GUEST_TABLE_MAP_NONE: a table allocation isn't permitted.
+ *  GUEST_TABLE_MAP_NOMEM: allocating a new page failed.
+ *  GUEST_TABLE_SUPER_PAGE: The next entry points to a superpage.
+ *  GUEST_TABLE_NORMAL: next level or leaf mapped normally.
+ */
+static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
+                          unsigned int level, pte_t **table,
+                          unsigned int offset)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+
+    return GUEST_TABLE_MAP_NONE;
+}
+
+/* Free pte sub-tree behind an entry */
+static void p2m_free_entry(struct p2m_domain *p2m,
+                           pte_t entry, unsigned int level)
+{
+    panic("%s: hasn't been implemented yet\n", __func__);
+}
+
+/*
+ * Insert an entry in the p2m. This should be called with a mapping
+ * equal to a page/superpage.
+ */
+static int __p2m_set_entry(struct p2m_domain *p2m,
+                           gfn_t sgfn,
+                           unsigned int page_order,
+                           mfn_t smfn,
+                           p2m_type_t t,
+                           p2m_access_t a)
+{
+    unsigned int level;
+    unsigned int target = page_order / PAGETABLE_ORDER;
+    pte_t *entry, *table, orig_pte;
+    int rc;
+    /* A mapping is removed if the MFN is invalid. */
+    bool removing_mapping = mfn_eq(smfn, INVALID_MFN);
+    DECLARE_OFFSETS(offsets, gfn_to_gaddr(sgfn));
+
+    ASSERT(p2m_is_write_locked(p2m));
+
+    /*
+     * Check if the level target is valid: we only support
+     * 4K - 2M - 1G mapping.
+     */
+    ASSERT(target <= 2);
+
+    table = p2m_get_root_pointer(p2m, sgfn);
+    if ( !table )
+        return -EINVAL;
+
+    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
+    {
+        /*
+         * Don't try to allocate intermediate page table if the mapping
+         * is about to be removed.
+         */
+        rc = p2m_next_level(p2m, !removing_mapping,
+                            level, &table, offsets[level]);
+        if ( (rc == GUEST_TABLE_MAP_NONE) || (rc == GUEST_TABLE_MAP_NOMEM) )
+        {
+            /*
+             * We are here because p2m_next_level has failed to map
+             * the intermediate page table (e.g the table does not exist
+             * and the p2m tree is read-only). It is a valid case
+             * when removing a mapping as it may not exist in the
+             * page table. In this case, just ignore it.
+             */
+            rc = removing_mapping ?  0 : -ENOENT;
+            goto out;
+        }
+        else if ( rc != GUEST_TABLE_NORMAL )
+            break;
+    }
+
+    entry = table + offsets[level];
+
+    /*
+     * If we are here with level > target, we must be at a leaf node,
+     * and we need to break up the superpage.
+     */
+    if ( level > target )
+    {
+        panic("Shattering isn't implemented\n");
+    }
+
+    /*
+     * We should always be here with the correct level because
+     * all the intermediate tables have been installed if necessary.
+     */
+    ASSERT(level == target);
+
+    orig_pte = *entry;
+
+    /*
+     * The access type should always be p2m_access_rwx when the mapping
+     * is removed.
+     */
+    ASSERT(!mfn_eq(INVALID_MFN, smfn) || (a == p2m_access_rwx));
+
+    if ( removing_mapping )
+        p2m_remove_pte(entry, p2m->clean_pte);
+    else {
+        pte_t pte = p2m_entry_from_mfn(p2m, smfn, t, a);
+
+        p2m_write_pte(entry, pte, p2m->clean_pte);
+
+        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+                                      gfn_add(sgfn, (1UL << page_order) - 1));
+        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, sgfn);
+    }
+
+#ifdef CONFIG_HAS_PASSTHROUGH
+    if ( is_iommu_enabled(p2m->domain) &&
+         (pte_is_valid(orig_pte) || pte_is_valid(*entry)) )
+    {
+        unsigned int flush_flags = 0;
+
+        if ( pte_is_valid(orig_pte) )
+            flush_flags |= IOMMU_FLUSHF_modified;
+        if ( pte_is_valid(*entry) )
+            flush_flags |= IOMMU_FLUSHF_added;
+
+        rc = iommu_iotlb_flush(p2m->domain, _dfn(gfn_x(sgfn)),
+                               1UL << page_order, flush_flags);
+    }
+    else
+#endif
+        rc = 0;
+
+    /*
+     * Free the entry only if the original pte was valid and the base
+     * is different (to avoid freeing when permission is changed).
+     */
+    if ( p2me_is_valid(p2m, orig_pte) &&
+         !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
+        p2m_free_entry(p2m, orig_pte, level);
+
+out:
+    unmap_domain_page(table);
+
+    return rc;
+}
+
 static int p2m_set_entry(struct p2m_domain *p2m,
                          gfn_t sgfn,
                          unsigned long nr,
@@ -332,7 +542,55 @@ static int p2m_set_entry(struct p2m_domain *p2m,
                          p2m_type_t t,
                          p2m_access_t a)
 {
-    return -EOPNOTSUPP;
+    int rc = 0;
+
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -ENOMEM;
+
+    while ( nr )
+    {
+        unsigned long mask;
+        unsigned long order = 0;
+        /* 1gb, 2mb, 4k mappings are supported */
+        unsigned int i = ( P2M_ROOT_LEVEL > 2 ) ? 2 : P2M_ROOT_LEVEL;
+
+        /*
+         * Don't take into account the MFN when removing mapping (i.e.
+         * INVALID_MFN) to calculate the correct target order.
+         *
+         * XXX: Support superpage mappings if nr is not aligned to a
+         * superpage size.
+         */
+        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
+        mask |= gfn_x(sgfn) | nr;
+
+        for ( ; i != 0; i-- )
+        {
+            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
+            {
+                order = XEN_PT_LEVEL_ORDER(i);
+                break;
+            }
+        }
+
+        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
+        if ( rc )
+            break;
+
+        sgfn = gfn_add(sgfn, (1 << order));
+        if ( !mfn_eq(smfn, INVALID_MFN) )
+            smfn = mfn_add(smfn, (1 << order));
+
+        nr -= (1 << order);
+    }
+
+    return rc;
 }
 
 static int p2m_insert_mapping(struct domain *d, gfn_t start_gfn,
-- 
2.49.0
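
The order selection done by the loop in p2m_set_entry() above can be tried
in isolation. The following is only a sketch under stated assumptions: it
assumes 9 translation bits per level (so orders 18, 9 and 0 for 1G, 2M and 4K
mappings), uses plain unsigned long values instead of gfn_t/mfn_t, and every
sketch_* name is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 9 translation bits per level, i.e. 512 entries per table. */
#define SKETCH_PAGETABLE_ORDER   9u
#define SKETCH_LEVEL_ORDER(lvl)  ((lvl) * SKETCH_PAGETABLE_ORDER)

/*
 * Pick the largest order (18, 9 or 0) for which GFN, MFN and the remaining
 * page count are all aligned. The MFN is ignored when removing a mapping.
 */
static unsigned int sketch_pick_order(unsigned long gfn, unsigned long mfn,
                                      bool removing, unsigned long nr)
{
    unsigned long mask = (removing ? 0UL : mfn) | gfn | nr;
    unsigned int lvl;

    assert(nr != 0);

    for ( lvl = 2; lvl != 0; lvl-- )
        if ( !(mask & ((1UL << SKETCH_LEVEL_ORDER(lvl)) - 1)) )
            return SKETCH_LEVEL_ORDER(lvl);

    return 0;
}

/* Count how many per-chunk insertions a range of nr pages is split into. */
static unsigned long sketch_count_chunks(unsigned long gfn, unsigned long mfn,
                                         unsigned long nr)
{
    unsigned long chunks = 0;

    while ( nr )
    {
        unsigned int order = sketch_pick_order(gfn, mfn, false, nr);

        gfn += 1UL << order;
        mfn += 1UL << order;
        nr -= 1UL << order;
        chunks++;
    }

    return chunks;
}
```

Because nr is part of the mask, a range of 0x201 pages starting at a
2M-aligned GFN is mapped entirely with 4K entries in this sketch, which is
the limitation the XXX comment in the loop refers to.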




* [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (10 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry() Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-01 14:23   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration Oleksii Kurochko
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

This patch introduces a working implementation of p2m_free_entry() for RISC-V
based on ARM's implementation of p2m_free_entry(), enabling proper cleanup
of page table entries in the P2M (physical-to-machine) mapping.

Only a few things are changed:
- Use p2m_force_tlb_flush_sync() instead of p2m_tlb_flush_sync() as the
  latter isn't implemented on RISC-V.
- Introduce and use p2m_type_radix_get() to get a type of p2m entry as
  RISC-V's PTE doesn't have enough space to store all necessary types so
  a type is stored in a radix tree.

Key additions include:
- p2m_free_entry(): Recursively frees page table entries at all levels. It
  handles both regular and superpage mappings and ensures that TLB entries
  are flushed before freeing intermediate tables.
- p2m_put_page() and helpers:
  - p2m_put_4k_page(): Clears GFN from xenheap pages if applicable.
  - p2m_put_2m_superpage(): Releases foreign page references in a 2MB
    superpage.
  - p2m_type_radix_get(): Extracts the stored p2m_type from the radix tree
    using the PTE.
- p2m_free_page(): Returns a page either to the domheap (for the hardware
  domain) or to the domain's p2m freelist (for all other domains).

Defines XEN_PT_ENTRIES in asm/page.h to simplify loops over page table
entries.
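
The recursion in p2m_free_entry() can be illustrated with a toy model. This
is only a sketch: the toy_* types and counters are hypothetical, a table has
4 entries instead of XEN_PT_ENTRIES, validity is a plain flag rather than a
radix tree lookup, and free() stands in for p2m_free_page().

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_ENTRIES 4u   /* a real table has XEN_PT_ENTRIES entries */

struct toy_table;

struct toy_entry {
    int valid;               /* stands in for p2me_is_valid() */
    int leaf;                /* non-zero for a 4K page or a superpage */
    struct toy_table *next;  /* next-level table when not a leaf */
};

struct toy_table {
    struct toy_entry e[TOY_ENTRIES];
};

static unsigned int toy_tables_freed;
static unsigned int toy_leaves_put;

static struct toy_table *toy_new_table(void)
{
    struct toy_table *t = calloc(1, sizeof(*t));

    assert(t != NULL);
    return t;
}

/* Free the sub-tree behind an entry, children first. */
static void toy_free_entry(struct toy_entry entry, unsigned int level)
{
    unsigned int i;

    if ( !entry.valid )
        return;

    /* Leaf (superpage or level 0): only drop the page reference. */
    if ( entry.leaf || level == 0 )
    {
        toy_leaves_put++;
        return;
    }

    for ( i = 0; i < TOY_ENTRIES; i++ )
        toy_free_entry(entry.next->e[i], level - 1);

    /* The patch flushes the TLB at this point, before freeing the table. */
    free(entry.next);
    toy_tables_freed++;
}
```

Calling toy_free_entry() on an entry that points to a two-level sub-tree
frees the lower table before the upper one, mirroring the order in which the
patch returns intermediate tables.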

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch. It was part of the big patch "xen/riscv: implement p2m mapping
   functionality", which was split into smaller ones.
 - s/p2m_is_superpage/p2me_is_superpage.
---
 xen/arch/riscv/include/asm/page.h |   1 +
 xen/arch/riscv/p2m.c              | 144 +++++++++++++++++++++++++++++-
 2 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 1b8b145663..c67b9578c9 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -22,6 +22,7 @@
 #define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
 #define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
 #define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
+#define XEN_PT_ENTRIES              (_AT(unsigned int, 1) << PAGETABLE_ORDER)
 
 /*
  * PTE format:
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 27499a86bb..6b11e87b22 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -345,11 +345,33 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
     return __map_domain_page(p2m->root + root_table_indx);
 }
 
+static p2m_type_t p2m_type_radix_get(struct p2m_domain *p2m, pte_t pte)
+{
+    void *ptr;
+    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
+
+    ptr = radix_tree_lookup(&p2m->p2m_type, gfn_x(gfn));
+
+    if ( !ptr )
+        return p2m_invalid;
+
+    return radix_tree_ptr_to_int(ptr);
+}
+
+/*
+ * In the case of the P2M, the valid bit is used for another purpose. Use
+ * the type to check whether an entry is valid.
+ */
 static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
 {
-    panic("%s: isn't implemented for now\n", __func__);
+    return p2m_type_radix_get(p2m, pte) != p2m_invalid;
+}
 
-    return false;
+static inline bool p2me_is_superpage(struct p2m_domain *p2m, pte_t pte,
+                                    unsigned int level)
+{
+    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK) &&
+           (level > 0);
 }
 
 static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
@@ -404,11 +426,127 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
     return GUEST_TABLE_MAP_NONE;
 }
 
+static void p2m_put_foreign_page(struct page_info *pg)
+{
+    /*
+     * It's safe to do the put_page here because page_alloc will
+     * flush the TLBs if the page is reallocated before the end of
+     * this loop.
+     */
+    put_page(pg);
+}
+
+/* Put any references on the single 4K page referenced by mfn. */
+static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
+{
+    /* TODO: Handle other p2m types */
+
+    /* Detect the xenheap page and mark the stored GFN as invalid. */
+    if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
+        page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
+}
+
+/* Put any references on the superpage referenced by mfn. */
+static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
+{
+    struct page_info *pg;
+    unsigned int i;
+
+    ASSERT(mfn_valid(mfn));
+
+    pg = mfn_to_page(mfn);
+
+    for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
+        p2m_put_foreign_page(pg);
+}
+
+/* Put any references on the page referenced by pte. */
+static void p2m_put_page(struct p2m_domain *p2m, const pte_t pte,
+                         unsigned int level)
+{
+    mfn_t mfn = pte_get_mfn(pte);
+    p2m_type_t p2m_type = p2m_type_radix_get(p2m, pte);
+
+    ASSERT(p2me_is_valid(p2m, pte));
+
+    /*
+     * TODO: Currently we don't handle level 2 super-page, Xen is not
+     * preemptible and therefore some work is needed to handle such
+     * superpages, for which at some point Xen might end up freeing memory
+     * and therefore for such a big mapping it could end up in a very long
+     * operation.
+     */
+    if ( level == 1 )
+        return p2m_put_2m_superpage(mfn, p2m_type);
+    else if ( level == 0 )
+        return p2m_put_4k_page(mfn, p2m_type);
+}
+
+static void p2m_free_page(struct domain *d, struct page_info *pg)
+{
+    if ( is_hardware_domain(d) )
+        free_domheap_page(pg);
+    else
+    {
+        spin_lock(&d->arch.paging.lock);
+        page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
+        spin_unlock(&d->arch.paging.lock);
+    }
+}
+
 /* Free pte sub-tree behind an entry */
 static void p2m_free_entry(struct p2m_domain *p2m,
                            pte_t entry, unsigned int level)
 {
-    panic("%s: hasn't been implemented yet\n", __func__);
+    unsigned int i;
+    pte_t *table;
+    mfn_t mfn;
+    struct page_info *pg;
+
+    /* Nothing to do if the entry is invalid. */
+    if ( !p2me_is_valid(p2m, entry) )
+        return;
+
+    if ( p2me_is_superpage(p2m, entry, level) || (level == 0) )
+    {
+#ifdef CONFIG_IOREQ_SERVER
+        /*
+         * If this gets called then either the entry was replaced by an entry
+         * with a different base (valid case) or the shattering of a superpage
+         * has failed (error case).
+         * So, at worst, the spurious mapcache invalidation might be sent.
+         */
+        if ( p2m_is_ram(p2m_type_radix_get(p2m, entry)) &&
+             domain_has_ioreq_server(p2m->domain) )
+            ioreq_request_mapcache_invalidate(p2m->domain);
+#endif
+
+        p2m_put_page(p2m, entry, level);
+
+        return;
+    }
+
+    table = map_domain_page(pte_get_mfn(entry));
+    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
+        p2m_free_entry(p2m, *(table + i), level - 1);
+
+    unmap_domain_page(table);
+
+    /*
+     * Make sure all the references in the TLB have been removed before
+     * freeing the intermediate page table.
+     * XXX: Should we defer the free of the page table to avoid the
+     * flush?
+     */
+    p2m_force_tlb_flush_sync(p2m);
+
+    mfn = pte_get_mfn(entry);
+    ASSERT(mfn_valid(mfn));
+
+    pg = mfn_to_page(mfn);
+
+    page_list_del(pg, &p2m->pages);
+    p2m_free_page(p2m->domain, pg);
 }
 
 /*
-- 
2.49.0




* [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (11 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-01 15:08   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 14/17] xen/riscv: implement p2m_next_level() Oleksii Kurochko
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

This patch adds the initial logic for constructing PTEs from MFNs in the RISC-V
p2m subsystem. It includes:
- Implementation of p2m_entry_from_mfn(): Generates a valid PTE using the
  given MFN, p2m_type_t, and p2m_access_t, including permission encoding and
  PBMT attribute setup.
- New helper p2m_set_permission(): Encodes access rights (r, w, x) into the
  PTE based on both p2m type and access permissions.
- p2m_type_radix_set(): Stores the p2m type in a radix tree, keyed by the GFN
  derived from the PTE's MFN, for later retrieval.

PBMT type encoding support:
- Introduces an enum pbmt_type_t to represent the PBMT field values.
- Maps types like p2m_mmio_direct_dev to pbmt_io, others default to pbmt_pma.
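
The way p2m_set_permission() combines the p2m type with the access value can
be sketched as an intersection of two R/W/X sets. This is a simplified
illustration, not the patch's code: the toy_* names are hypothetical, only
three types are modelled, and the R/W/X bit positions (bits 1, 2 and 3) are
those of the RISC-V PTE format.

```c
#include <assert.h>

/* R/W/X bit positions in a RISC-V PTE. */
#define TOY_PTE_R    (1UL << 1)
#define TOY_PTE_W    (1UL << 2)
#define TOY_PTE_X    (1UL << 3)
#define TOY_PTE_RWX  (TOY_PTE_R | TOY_PTE_W | TOY_PTE_X)

enum toy_type { TOY_RAM_RW, TOY_MMIO_DIRECT_DEV, TOY_INVALID };

enum toy_access {
    TOY_ACCESS_N, TOY_ACCESS_R, TOY_ACCESS_W, TOY_ACCESS_X,
    TOY_ACCESS_RW, TOY_ACCESS_RX, TOY_ACCESS_WX, TOY_ACCESS_RWX,
};

/* Permissions the type allows at most. */
static unsigned long toy_type_bits(enum toy_type t)
{
    switch ( t )
    {
    case TOY_RAM_RW:          return TOY_PTE_RWX;
    case TOY_MMIO_DIRECT_DEV: return TOY_PTE_R | TOY_PTE_W;
    default:                  return 0;
    }
}

/* Permissions the access value leaves enabled. */
static unsigned long toy_access_bits(enum toy_access a)
{
    switch ( a )
    {
    case TOY_ACCESS_R:   return TOY_PTE_R;
    case TOY_ACCESS_W:   return TOY_PTE_W;
    case TOY_ACCESS_X:   return TOY_PTE_X;
    case TOY_ACCESS_RW:  return TOY_PTE_R | TOY_PTE_W;
    case TOY_ACCESS_RX:  return TOY_PTE_R | TOY_PTE_X;
    case TOY_ACCESS_WX:  return TOY_PTE_W | TOY_PTE_X;
    case TOY_ACCESS_RWX: return TOY_PTE_RWX;
    default:             return 0;
    }
}

/* Apply the type first, then restrict with the access value. */
static unsigned long toy_set_permission(unsigned long pte, enum toy_type t,
                                        enum toy_access a)
{
    pte &= ~TOY_PTE_RWX;
    pte |= toy_type_bits(t) & toy_access_bits(a);

    return pte;
}
```

An access value that clears all three bits yields R = W = X = 0, which the
RISC-V PTE encoding treats as a pointer to the next-level table rather than
a leaf.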

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch. It was part of the big patch "xen/riscv: implement p2m mapping
   functionality", which was split into smaller ones.
---
 xen/arch/riscv/include/asm/page.h |   8 +++
 xen/arch/riscv/p2m.c              | 103 ++++++++++++++++++++++++++++--
 2 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index c67b9578c9..1d1054fa5c 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -76,6 +76,14 @@
 #define PTE_SMALL       BIT(10, UL)
 #define PTE_POPULATE    BIT(11, UL)
 
+enum pbmt_type_t {
+    pbmt_pma,
+    pbmt_nc,
+    pbmt_io,
+    pbmt_rsvd,
+    pbmt_max,
+};
+
 #define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
 
 #define PTE_PBMT_MASK   (PTE_PBMT_NOCACHE | PTE_PBMT_IO)
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 6b11e87b22..cba04acf38 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
     return __map_domain_page(p2m->root + root_table_indx);
 }
 
+static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
+{
+    int rc;
+    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
+
+    rc = radix_tree_insert(&p2m->p2m_type, gfn_x(gfn),
+                           radix_tree_int_to_ptr(t));
+    if ( rc == -EEXIST )
+    {
+        /* If a setting already exists, change it to the new one */
+        radix_tree_replace_slot(
+            radix_tree_lookup_slot(
+                &p2m->p2m_type, gfn_x(gfn)),
+            radix_tree_int_to_ptr(t));
+        rc = 0;
+    }
+
+    return rc;
+}
+
 static p2m_type_t p2m_type_radix_get(struct p2m_domain *p2m, pte_t pte)
 {
     void *ptr;
@@ -389,12 +409,87 @@ static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
     p2m_write_pte(p, pte, clean_pte);
 }
 
-static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn,
-                                p2m_type_t t, p2m_access_t a)
+static void p2m_set_permission(pte_t *e, p2m_type_t t, p2m_access_t a)
 {
-    panic("%s: hasn't been implemented yet\n", __func__);
+    /* First apply type permissions */
+    switch ( t )
+    {
+    case p2m_ram_rw:
+        e->pte |= PTE_ACCESS_MASK;
+        break;
+
+    case p2m_mmio_direct_dev:
+        e->pte |= (PTE_READABLE | PTE_WRITABLE);
+        e->pte &= ~PTE_EXECUTABLE;
+        break;
+
+    case p2m_invalid:
+        e->pte &= ~PTE_ACCESS_MASK;
+        break;
+
+    default:
+        BUG();
+        break;
+    }
+
+    /* Then restrict with access permissions */
+    switch ( a )
+    {
+    case p2m_access_rwx:
+        break;
+    case p2m_access_wx:
+        e->pte &= ~PTE_READABLE;
+        break;
+    case p2m_access_rw:
+        e->pte &= ~PTE_EXECUTABLE;
+        break;
+    case p2m_access_w:
+        e->pte &= ~(PTE_READABLE | PTE_EXECUTABLE);
+        e->pte &= ~PTE_EXECUTABLE;
+        break;
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        e->pte &= ~PTE_WRITABLE;
+        break;
+    case p2m_access_x:
+        e->pte &= ~(PTE_READABLE | PTE_WRITABLE);
+        break;
+    case p2m_access_r:
+        e->pte &= ~(PTE_WRITABLE | PTE_EXECUTABLE);
+        break;
+    case p2m_access_n:
+    case p2m_access_n2rwx:
+        e->pte &= ~PTE_ACCESS_MASK;
+        break;
+    default:
+        BUG();
+        break;
+    }
+}
+
+static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t, p2m_access_t a)
+{
+    pte_t e = (pte_t) { 1 };
+
+    switch ( t )
+    {
+    case p2m_mmio_direct_dev:
+        e.pte |= PTE_PBMT_IO;
+        break;
+
+    default:
+        break;
+    }
+
+    p2m_set_permission(&e, t, a);
+
+    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
+
+    pte_set_mfn(&e, mfn);
+
+    BUG_ON(p2m_type_radix_set(p2m, e, t));
 
-    return (pte_t) { .pte = 0 };
+    return e;
 }
 
 #define GUEST_TABLE_MAP_NONE 0
-- 
2.49.0




* [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (12 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-02  8:35   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Implement the p2m_next_level() function, which enables traversal and dynamic
allocation of intermediate levels (if necessary) in the RISC-V
p2m (physical-to-machine) page table hierarchy.

To support this, the following helpers are introduced:
- p2me_is_mapping(): Determines whether a PTE represents a valid mapping.
- page_to_p2m_table(): Constructs non-leaf PTEs pointing to next-level page
  tables with correct attributes.
- p2m_alloc_page(): Allocates page table pages, supporting both hardware and
  guest domains.
- p2m_create_table(): Allocates and initializes a new page table page and
  installs it into the hierarchy.
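
The decision made by p2m_next_level() for a single entry can be sketched as
follows. This is a simplified model under stated assumptions: validity is
passed in as a flag (the patch derives it from the stored p2m type), the
allocation outcome is a parameter, and the toy_* names and enum values are
hypothetical stand-ins for the GUEST_TABLE_* constants.

```c
#include <assert.h>
#include <stdbool.h>

/* R/W/X bit positions in a RISC-V PTE. */
#define TOY_PTE_RWX ((1UL << 1) | (1UL << 2) | (1UL << 3))

enum toy_walk {
    TOY_MAP_NONE,    /* no entry and allocation not requested */
    TOY_MAP_NOMEM,   /* allocation of a new table failed */
    TOY_SUPER_PAGE,  /* entry is a leaf above level 0 */
    TOY_NORMAL,      /* entry points to the next-level table */
};

static enum toy_walk toy_next_level(bool valid, unsigned long pte,
                                    bool alloc_tbl, bool alloc_ok)
{
    if ( !valid )
    {
        if ( !alloc_tbl )
            return TOY_MAP_NONE;

        if ( !alloc_ok )
            return TOY_MAP_NOMEM;

        /* A freshly installed table entry has R = W = X = 0. */
        pte &= ~TOY_PTE_RWX;
    }

    /* Any of R/W/X set means the entry maps memory directly. */
    if ( pte & TOY_PTE_RWX )
        return TOY_SUPER_PAGE;

    return TOY_NORMAL;
}
```

Only the TOY_NORMAL case corresponds to the path where the patch unmaps the
current table and maps the next one.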

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch. It was part of the big patch "xen/riscv: implement p2m mapping
   functionality", which was split into smaller ones.
 - s/p2m_is_mapping/p2me_is_mapping.
---
 xen/arch/riscv/p2m.c | 103 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 101 insertions(+), 2 deletions(-)

diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index cba04acf38..87dd636b80 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
     return p2m_type_radix_get(p2m, pte) != p2m_invalid;
 }
 
+/*
+ * pte_is_* helpers are checking the valid bit set in the
+ * PTE but we have to check p2m_type instead (look at the comment above
+ * p2me_is_valid())
+ * Provide our own overlay to check the valid bit.
+ */
+static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
+{
+    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
+}
+
 static inline bool p2me_is_superpage(struct p2m_domain *p2m, pte_t pte,
                                     unsigned int level)
 {
@@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
     return e;
 }
 
+/* Generate table entry with correct attributes. */
+static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
+{
+    /*
+     * Since this function generates a table entry, according to "Encoding
+     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
+     * to point to the next level of the page table.
+     * Therefore, to ensure that an entry is a page table entry,
+     * `p2m_access_n2rwx` is passed to `p2m_entry_from_mfn()` as the access value,
+     * which overrides whatever was passed as `p2m_type_t` and guarantees that
+     * the entry is a page table entry by setting r = w = x = 0.
+     */
+    return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw, p2m_access_n2rwx);
+}
+
+static struct page_info *p2m_alloc_page(struct domain *d)
+{
+    struct page_info *pg;
+
+    /*
+     * For hardware domain, there should be no limit in the number of pages that
+     * can be allocated, so that the kernel may take advantage of the extended
+     * regions. Hence, allocate p2m pages for hardware domains from heap.
+     */
+    if ( is_hardware_domain(d) )
+    {
+        pg = alloc_domheap_page(d, MEMF_no_owner);
+        if ( pg == NULL )
+            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
+    }
+    else
+    {
+        spin_lock(&d->arch.paging.lock);
+        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
+        spin_unlock(&d->arch.paging.lock);
+    }
+
+    return pg;
+}
+
+/* Allocate a new page table page and hook it in via the given entry. */
+static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
+{
+    struct page_info *page;
+    pte_t *p;
+
+    ASSERT(!p2me_is_valid(p2m, *entry));
+
+    page = p2m_alloc_page(p2m->domain);
+    if ( page == NULL )
+        return -ENOMEM;
+
+    page_list_add(page, &p2m->pages);
+
+    p = __map_domain_page(page);
+    clear_page(p);
+
+    unmap_domain_page(p);
+
+    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
+
+    return 0;
+}
+
 #define GUEST_TABLE_MAP_NONE 0
 #define GUEST_TABLE_MAP_NOMEM 1
 #define GUEST_TABLE_SUPER_PAGE 2
@@ -516,9 +591,33 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
                           unsigned int level, pte_t **table,
                           unsigned int offset)
 {
-    panic("%s: hasn't been implemented yet\n", __func__);
+    pte_t *entry;
+    int ret;
+    mfn_t mfn;
+
+    entry = *table + offset;
+
+    if ( !p2me_is_valid(p2m, *entry) )
+    {
+        if ( !alloc_tbl )
+            return GUEST_TABLE_MAP_NONE;
+
+        ret = p2m_create_table(p2m, entry);
+        if ( ret )
+            return GUEST_TABLE_MAP_NOMEM;
+    }
+
+    /* The function p2m_next_level() is never called at the last level */
+    ASSERT(level != 0);
+    if ( p2me_is_mapping(p2m, *entry) )
+        return GUEST_TABLE_SUPER_PAGE;
+
+    mfn = mfn_from_pte(*entry);
+
+    unmap_domain_page(*table);
+    *table = map_domain_page(mfn);
 
-    return GUEST_TABLE_MAP_NONE;
+    return GUEST_TABLE_NORMAL;
 }
 
 static void p2m_put_foreign_page(struct page_info *pg)
-- 
2.49.0




* [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (13 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 14/17] xen/riscv: implement p2m_next_level() Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-02  9:25   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
  2025-06-10 13:05 ` [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Add support for breaking down large memory mappings ("superpages") in the RISC-V
p2m mapping so that smaller, more precise mappings ("finer-grained entries")
can be inserted into lower levels of the page table hierarchy.

To implement this, the following is done:
- Introduce p2m_split_superpage(): Recursively shatters a superpage into
  smaller page table entries down to the target level, preserving original
  permissions and attributes.
- __p2m_set_entry() updated to invoke superpage splitting when inserting
  entries at lower levels within a superpage-mapped region.

This implementation is based on the ARM code, with modifications to the part
that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
not require BBM, so there is no need to invalidate the PTE and flush the
TLB before updating it with the newly created, split page table.
Additionally, the page table walk logic has been adjusted, as ARM uses the
opposite walk order compared to RISC-V.
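
The MFN arithmetic used when a superpage is split can be sketched on its
own. This is an illustration only: it assumes 9 translation bits per level
(512 entries per table), and toy_split_child_mfn() is a hypothetical helper,
not a function from the patch.

```c
#include <assert.h>

/* Assumption: 9 translation bits per level, 512 entries per table. */
#define TOY_PAGETABLE_ORDER 9u
#define TOY_PT_ENTRIES      (1u << TOY_PAGETABLE_ORDER)

/*
 * MFN written into child entry i when the superpage at the given level,
 * starting at mfn, is split into entries of the next lower level.
 */
static unsigned long toy_split_child_mfn(unsigned long mfn,
                                         unsigned int level, unsigned int i)
{
    unsigned int next_order;

    assert(level > 0);
    assert(i < TOY_PT_ENTRIES);

    /* Each child covers 2^next_order 4K pages. */
    next_order = (level - 1) * TOY_PAGETABLE_ORDER;

    return mfn + ((unsigned long)i << next_order);
}
```

Splitting a 2M entry (level 1) advances the MFN by one page per child, while
splitting a 1G entry (level 2) advances it by 512 pages per child.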

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch. It was part of the big patch "xen/riscv: implement p2m mapping
   functionality", which was split into smaller ones.
 - Update the comment above the loop which creates the new page table, as
   RISC-V traverses page tables in the opposite order to ARM.
 - RISC-V doesn't require BBM, so there is no need for invalidating the PTE
   and flushing the TLB before updating it.
---
 xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 101 insertions(+), 1 deletion(-)

diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 87dd636b80..79c4473f1f 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
     p2m_free_page(p2m->domain, pg);
 }
 
+static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
+                                unsigned int level, unsigned int target,
+                                const unsigned int *offsets)
+{
+    struct page_info *page;
+    unsigned int i;
+    pte_t pte, *table;
+    bool rv = true;
+
+    /* Convenience aliases */
+    mfn_t mfn = pte_get_mfn(*entry);
+    unsigned int next_level = level - 1;
+    unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
+
+    /*
+     * This should only be called with target != level and the entry is
+     * a superpage.
+     */
+    ASSERT(level > target);
+    ASSERT(p2me_is_superpage(p2m, *entry, level));
+
+    page = p2m_alloc_page(p2m->domain);
+    if ( !page )
+        return false;
+
+    page_list_add(page, &p2m->pages);
+    table = __map_domain_page(page);
+
+    /*
+     * We are either splitting a second level 1G page into 512 first level
+     * 2M pages, or a first level 2M page into 512 zero level 4K pages.
+     */
+    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
+    {
+        pte_t *new_entry = table + i;
+
+        /*
+         * Use the content of the superpage entry and override
+         * the necessary fields, so that the correct permissions are kept.
+         */
+        pte = *entry;
+        pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+
+        write_pte(new_entry, pte);
+    }
+
+    /*
+     * Shatter superpage in the page to the level we want to make the
+     * changes.
+     * This is done outside the loop to avoid checking the offset to
+     * know whether the entry should be shattered for every entry.
+     */
+    if ( next_level != target )
+        rv = p2m_split_superpage(p2m, table + offsets[next_level],
+                                 level - 1, target, offsets);
+
+    /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
+    if ( p2m->clean_pte )
+        clean_dcache_va_range(table, PAGE_SIZE);
+
+    unmap_domain_page(table);
+
+    /*
+     * Even if we failed, we should install the newly allocated PTE
+     * entry. The caller will be in charge to free the sub-tree.
+     */
+    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
+
+    return rv;
+}
+
 /*
  * Insert an entry in the p2m. This should be called with a mapping
  * equal to a page/superpage.
@@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
      */
     if ( level > target )
     {
-        panic("Shattering isn't implemented\n");
+        /* We need to split the original page. */
+        pte_t split_pte = *entry;
+
+        ASSERT(p2me_is_superpage(p2m, *entry, level));
+
+        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+        {
+            /* Free the allocated sub-tree */
+            p2m_free_entry(p2m, split_pte, level);
+
+            rc = -ENOMEM;
+            goto out;
+        }
+
+        p2m_write_pte(entry, split_pte, p2m->clean_pte);
+
+        /* Then move to the level we want to make real changes */
+        for ( ; level > target; level-- )
+        {
+            rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
+
+            /*
+             * The entry should be found and either be a table
+             * or a superpage if level 0 is not targeted
+             */
+            ASSERT(rc == GUEST_TABLE_NORMAL ||
+                   (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
+        }
+
+        entry = table + offsets[level];
     }
 
     /*
-- 
2.49.0




* [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (14 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-02 10:09   ` Jan Beulich
  2025-06-10 13:05 ` [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Implement the mfn_valid() macro to verify whether a given MFN is valid by
checking that it falls within the range [start_page, max_page).
These bounds are initialized based on the start and end addresses of RAM.

As part of this patch, start_page is introduced and initialized with the
PFN of the first RAM page.

Also, after providing a non-stub implementation of the mfn_valid() macro,
the following compilation errors started to occur:
  riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
  /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
  riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
  /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
  riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
  /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
  riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
  riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
  /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
  riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
  riscv64-linux-gnu-ld: final link failed: bad value
  make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
To resolve these errors, the following functions have also been introduced,
based on their Arm counterparts:
- page_get_owner_and_reference() and its variant to safely acquire a
  reference to a page and retrieve its owner.
- put_page() and put_page_nr() to release page references and free the page
  when the count drops to zero.
  For put_page_nr(), code related to static memory configuration is wrapped
  with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
  common code. Therefore, PGC_static and free_domstatic_page() are not
  introduced for RISC-V. However, since this configuration could be useful
  in the future, the relevant code is retained and conditionally compiled.
- A stub for page_is_ram_type() that currently always returns 0 and asserts
  unreachable, as RAM type checking is not yet implemented.
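
The range check behind mfn_valid() can be sketched as a standalone snippet.
It is a simplified model, assuming 4K pages (a shift of 12) and a single
contiguous RAM region; the toy_* names are hypothetical and only mirror
start_page, max_page and the assignments in setup_mm().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4K pages */

static unsigned long toy_start_page;
static unsigned long toy_max_page;

/* Derive the page-frame bounds from the RAM start and end addresses. */
static void toy_setup_mm(uint64_t ram_start, uint64_t ram_end)
{
    assert(ram_start < ram_end);

    toy_start_page = (unsigned long)(ram_start >> TOY_PAGE_SHIFT);
    toy_max_page = (unsigned long)(ram_end >> TOY_PAGE_SHIFT);
}

/* An MFN is valid when it lies in [toy_start_page, toy_max_page). */
static bool toy_mfn_valid(unsigned long mfn)
{
    return (mfn >= toy_start_page) && (mfn < toy_max_page);
}
```

With RAM spanning 0x80000000 to 0x88000000, frames 0x80000 through 0x87fff
are valid and 0x88000 is the first frame past the end.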

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch.
---
 xen/arch/riscv/include/asm/mm.h |  9 ++-
 xen/arch/riscv/mm.c             | 97 +++++++++++++++++++++++++++++++--
 2 files changed, 99 insertions(+), 7 deletions(-)

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 41bf9002d7..bd8511e5f9 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -5,6 +5,7 @@
 
 #include <public/xen.h>
 #include <xen/bug.h>
+#include <xen/compiler.h>
 #include <xen/const.h>
 #include <xen/mm-frame.h>
 #include <xen/pdx.h>
@@ -288,8 +289,12 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 #define page_get_owner(p)    (p)->v.inuse.domain
 #define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
 
-/* TODO: implement */
-#define mfn_valid(mfn) ({ (void)(mfn); 0; })
+extern unsigned long start_page;
+
+#define mfn_valid(mfn) ({                                   \
+    unsigned long mfn__ = mfn_x(mfn);                       \
+    likely((mfn__ >= start_page) && (mfn__ < max_page));    \
+})
 
 #define domain_set_alloc_bitsize(d) ((void)(d))
 #define domain_clamp_alloc_bitsize(d, b) ((void)(d), (b))
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 4047d67c0e..c88908d4f0 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -361,11 +361,6 @@ unsigned long __init calc_phys_offset(void)
     return phys_offset;
 }
 
-void put_page(struct page_info *page)
-{
-    BUG_ON("unimplemented");
-}
-
 void arch_dump_shared_mem_info(void)
 {
     BUG_ON("unimplemented");
@@ -525,6 +520,8 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
 #error setup_{directmap,frametable}_mapping() should be implemented for RV_32
 #endif
 
+unsigned long __read_mostly start_page;
+
 /*
  * Setup memory management
  *
@@ -577,6 +574,8 @@ void __init setup_mm(void)
     }
 
     setup_frametable_mappings(ram_start, ram_end);
+
+    start_page = PFN_DOWN(ram_start);
     max_page = PFN_DOWN(ram_end);
 }
 
@@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
 {
     return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
 }
+
+int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
+{
+    ASSERT_UNREACHABLE();
+
+    return 0;
+}
+
+static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
+                                                      unsigned long nr)
+{
+    unsigned long x, y = page->count_info;
+    struct domain *owner;
+
+    /* Restrict nr to avoid "double" overflow */
+    if ( nr >= PGC_count_mask )
+    {
+        ASSERT_UNREACHABLE();
+        return NULL;
+    }
+
+    do {
+        x = y;
+        /*
+         * Count ==  0: Page is not allocated, so we cannot take a reference.
+         * Count == -1: Reference count would wrap, which is invalid.
+         */
+        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
+            return NULL;
+    }
+    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
+
+    owner = page_get_owner(page);
+    ASSERT(owner);
+
+    return owner;
+}
+
+struct domain *page_get_owner_and_reference(struct page_info *page)
+{
+    return page_get_owner_and_nr_reference(page, 1);
+}
+
+void put_page_nr(struct page_info *page, unsigned long nr)
+{
+    unsigned long nx, x, y = page->count_info;
+
+    do {
+        ASSERT((y & PGC_count_mask) >= nr);
+        x  = y;
+        nx = x - nr;
+    }
+    while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );
+
+    if ( unlikely((nx & PGC_count_mask) == 0) )
+    {
+#ifdef CONFIG_STATIC_MEMORY
+        if ( unlikely(nx & PGC_static) )
+            free_domstatic_page(page);
+        else
+#endif
+            free_domheap_page(page);
+    }
+}
+
+void put_page(struct page_info *page)
+{
+    put_page_nr(page, 1);
+}
+
+bool get_page_nr(struct page_info *page, const struct domain *domain,
+                 unsigned long nr)
+{
+    const struct domain *owner = page_get_owner_and_nr_reference(page, nr);
+
+    if ( likely(owner == domain) )
+        return true;
+
+    if ( owner != NULL )
+        put_page_nr(page, nr);
+
+    return false;
+}
+
+bool get_page(struct page_info *page, const struct domain *domain)
+{
+    return get_page_nr(page, domain, 1);
+}
-- 
2.49.0
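
For illustration only (not part of the posted patch): the compare-and-swap loops in page_get_owner_and_nr_reference() and put_page_nr() can be sketched outside of Xen with C11 atomics. The 8-bit COUNT_MASK, struct page, get_refs() and put_refs() are hypothetical stand-ins for PGC_count_mask, struct page_info and the real helpers; owner lookup and page freeing are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: the low 8 bits hold the count, upper bits are flags. */
#define COUNT_MASK 0xffUL

struct page {
    _Atomic unsigned long count_info;
};

/* Try to take nr references; fails on a free page or if the count would wrap. */
static bool get_refs(struct page *pg, unsigned long nr)
{
    unsigned long x = atomic_load(&pg->count_info);

    if ( nr >= COUNT_MASK )
        return false;

    do {
        /*
         * count == 0:  (0 + nr) & mask == nr, so the request is rejected.
         * count wraps: the masked sum ends up <= nr, so it is rejected too.
         */
        if ( ((x + nr) & COUNT_MASK) <= nr )
            return false;
        /* On failure the CAS reloads x, and the check above runs again. */
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x, x + nr) );

    return true;
}

/* Drop nr references; returns true when the count reached zero. */
static bool put_refs(struct page *pg, unsigned long nr)
{
    unsigned long x = atomic_load(&pg->count_info), nx;

    do {
        assert((x & COUNT_MASK) >= nr);
        nx = x - nr;
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x, nx) );

    return (nx & COUNT_MASK) == 0;
}
```

In this sketch a caller would free the page when put_refs() returns true, which mirrors the free_domheap_page() call in the patch.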



* [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN
  2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
                   ` (15 preceding siblings ...)
  2025-06-10 13:05 ` [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
@ 2025-06-10 13:05 ` Oleksii Kurochko
  2025-07-02 11:44   ` Jan Beulich
  16 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-10 13:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Introduce helper functions for safely querying the P2M (physical-to-machine)
mapping:
 - Add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
   P2M lock state.
 - Implement p2m_get_entry() to retrieve mapping details for a given GFN,
   including MFN, page order, and validity.
 - Add p2m_lookup() to encapsulate read-locked MFN retrieval.
 - Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
   pointer, acquiring a reference to the page if valid.

Implementations are based on Arm's functions with some minor modifications:
- p2m_get_entry():
  - Reverse traversal of page tables, as RISC-V uses the opposite order
    compared to Arm.
  - Removed the return of p2m_access_t from p2m_get_entry() since
    mem_access_settings is not introduced for RISC-V.
  - Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
    to Arm's THIRD_MASK.
  - Replaced open-coded bit shifts with the BIT() macro.
  - Other minor changes, such as using RISC-V-specific functions to validate
    P2M PTEs, and replacing Arm-specific GUEST_* macros with their RISC-V
    equivalents.
- p2m_get_page_from_gfn():
  - Removed p2m_is_foreign() and related logic, as this functionality is not
    implemented for RISC-V.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V2:
 - New patch.
---
 xen/arch/riscv/include/asm/p2m.h |  18 +++++
 xen/arch/riscv/p2m.c             | 131 +++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+)

diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index fdebd18356..96e0790dbc 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -184,6 +184,24 @@ static inline int p2m_is_write_locked(struct p2m_domain *p2m)
     return rw_is_write_locked(&p2m->lock);
 }
 
+static inline void p2m_read_lock(struct p2m_domain *p2m)
+{
+    read_lock(&p2m->lock);
+}
+
+static inline void p2m_read_unlock(struct p2m_domain *p2m)
+{
+    read_unlock(&p2m->lock);
+}
+
+static inline int p2m_is_locked(struct p2m_domain *p2m)
+{
+    return rw_is_locked(&p2m->lock);
+}
+
+struct page_info *p2m_get_page_from_gfn(struct domain *d, gfn_t gfn,
+                                        p2m_type_t *t);
+
 #endif /* ASM__RISCV__P2M_H */
 
 /*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 79c4473f1f..034b1888c5 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -1055,3 +1055,134 @@ int guest_physmap_add_entry(struct domain *d,
 {
     return p2m_insert_mapping(d, gfn, (1 << page_order), mfn, t);
 }
+
+/*
+ * Get the details of a given gfn.
+ *
+ * If the entry is present, the associated MFN will be returned and the
+ * type filled in. The page_order will correspond to the order of the
+ * mapping in the page table (i.e. it could be a superpage).
+ *
+ * If the entry is not present, INVALID_MFN will be returned and the
+ * page_order will be set according to the order of the invalid range.
+ *
+ * valid will contain the value of bit[0] (i.e. the valid bit) of the
+ * entry.
+ */
+static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
+                           p2m_type_t *t,
+                           unsigned int *page_order,
+                           bool *valid)
+{
+    paddr_t addr = gfn_to_gaddr(gfn);
+    unsigned int level = 0;
+    pte_t entry, *table;
+    int rc;
+    mfn_t mfn = INVALID_MFN;
+    p2m_type_t _t;
+    DECLARE_OFFSETS(offsets, addr);
+
+    ASSERT(p2m_is_locked(p2m));
+    BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
+
+    /* Allow t to be NULL */
+    t = t ?: &_t;
+
+    *t = p2m_invalid;
+
+    if ( valid )
+        *valid = false;
+
+    /* XXX: Check if the mapping is lower than the mapped gfn */
+
+    /* This gfn is higher than the highest the p2m map currently holds */
+    if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
+    {
+        for ( level = P2M_ROOT_LEVEL; level ; level-- )
+            if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
+                 gfn_x(p2m->max_mapped_gfn) )
+                break;
+
+        goto out;
+    }
+
+    table = p2m_get_root_pointer(p2m, gfn);
+
+    /*
+     * the table should always be non-NULL because the gfn is below
+     * p2m->max_mapped_gfn and the root table pages are always present.
+     */
+    if ( !table )
+    {
+        ASSERT_UNREACHABLE();
+        level = P2M_ROOT_LEVEL;
+        goto out;
+    }
+
+    for ( level = P2M_ROOT_LEVEL; level ; level-- )
+    {
+        rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
+        if ( (rc == GUEST_TABLE_MAP_NONE) && (rc != GUEST_TABLE_MAP_NOMEM) )
+            goto out_unmap;
+        else if ( rc != GUEST_TABLE_NORMAL )
+            break;
+    }
+
+    entry = table[offsets[level]];
+
+    if ( p2me_is_valid(p2m, entry) )
+    {
+        *t = p2m_type_radix_get(p2m, entry);
+
+        mfn = pte_get_mfn(entry);
+        /*
+         * The entry may point to a superpage. Find the MFN associated
+         * to the GFN.
+         */
+        mfn = mfn_add(mfn,
+                      gfn_x(gfn) & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1));
+
+        if ( valid )
+            *valid = pte_is_valid(entry);
+    }
+
+out_unmap:
+    unmap_domain_page(table);
+
+out:
+    if ( page_order )
+        *page_order = XEN_PT_LEVEL_ORDER(level);
+
+    return mfn;
+}
+
+static mfn_t p2m_lookup(struct domain *d, gfn_t gfn, p2m_type_t *t)
+{
+    mfn_t mfn;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+    p2m_read_lock(p2m);
+    mfn = p2m_get_entry(p2m, gfn, t, NULL, NULL);
+    p2m_read_unlock(p2m);
+
+    return mfn;
+}
+
+struct page_info *p2m_get_page_from_gfn(struct domain *d, gfn_t gfn,
+                                        p2m_type_t *t)
+{
+    p2m_type_t p2mt = {0};
+    struct page_info *page;
+
+    mfn_t mfn = p2m_lookup(d, gfn, &p2mt);
+
+    if ( t )
+        *t = p2mt;
+
+    if ( !mfn_valid(mfn) )
+        return NULL;
+
+    page = mfn_to_page(mfn);
+
+    return get_page(page, d) ? page : NULL;
+}
-- 
2.49.0
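
For illustration only (not part of the posted patch): the superpage adjustment in p2m_get_entry() adds the low GFN bits covered by the mapping to the base MFN stored in the PTE. A minimal sketch, assuming 9 index bits per page-table level (as in Sv39-style layouts); PT_LEVEL_ORDER() and superpage_mfn() are hypothetical names standing in for XEN_PT_LEVEL_ORDER() and the open-coded mfn_add() expression.

```c
#include <assert.h>

/* Assumed geometry: each level resolves 9 bits of the frame number. */
#define PT_LEVEL_ORDER(lvl) ((lvl) * 9u)

/*
 * Given the base MFN of a mapping found at 'level', return the MFN that
 * backs 'gfn'. For level 0 (a 4K page) the mask is 0 and the base MFN is
 * returned unchanged.
 */
static unsigned long superpage_mfn(unsigned long base_mfn, unsigned long gfn,
                                   unsigned int level)
{
    unsigned long mask = (1UL << PT_LEVEL_ORDER(level)) - 1;

    return base_mfn + (gfn & mask);
}
```

For example, with a level-1 mapping (order 9) the low 9 bits of the GFN select the 4K frame inside the 2M superpage.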



* Re: [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-10 13:05 ` [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
@ 2025-06-18 15:15   ` Jan Beulich
  2025-06-23 14:31     ` Oleksii Kurochko
  2025-06-24 10:33     ` Oleksii Kurochko
  0 siblings, 2 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-18 15:15 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
> covering the range of guest physical addresses between start_addr and
> start_addr + size for all the guests.

Here and in the code comment: Why "for all the guests"? Under what conditions
would you require such a broad (guest) TLB flush?

> --- a/xen/arch/riscv/sbi.c
> +++ b/xen/arch/riscv/sbi.c
> @@ -258,6 +258,15 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
>                        cpu_mask, start, size, 0, 0);
>  }
>  
> +int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
> +                           size_t size)
> +{
> +    ASSERT(sbi_rfence);

As previously indicated, I question the usefulness of such assertions. If the
pointer is still NULL, ...

> +    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
> +                      cpu_mask, start, size, 0, 0);

... you'll crash here anyway (much like you will in a release build).

Jan


* Re: [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
  2025-06-10 13:05 ` [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
@ 2025-06-18 15:20   ` Jan Beulich
  2025-06-23 14:38     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-18 15:20 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> It instructs the remote harts to execute one or more HFENCE.GVMA instructions
> by making an SBI call, covering the range of guest physical addresses between
> start_addr and start_addr + size only for the given VMID.
> 
> This function call is only valid for harts implementing hypervisor extension.

We require H now, don't we? It's also odd to have this here, but not in patch 1.

> --- a/xen/arch/riscv/sbi.c
> +++ b/xen/arch/riscv/sbi.c
> @@ -267,6 +267,15 @@ int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
>                        cpu_mask, start, size, 0, 0);
>  }
>  
> +int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
> +                           size_t size, unsigned long vmid)
> +{
> +    ASSERT(sbi_rfence);
> +
> +    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
> +                      cpu_mask, start, size, vmid, 0);
> +}

sbi_remote_hfence_gvma() may want implementing in terms of this new function,
requiring the patches to be swapped. Provided (see comment there) that helper
is actually needed.

Jan


* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-10 13:05 ` [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement Oleksii Kurochko
@ 2025-06-18 15:46   ` Jan Beulich
  2025-06-24  9:46     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-18 15:46 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Implementation is based on Arm code with some minor changes:
>  - Re-define INVALID_VMID.
>  - Re-define MAX_VMID.
>  - Add TLB flushing when VMID is re-used.
> 
> Also, as a part of this patch, structure p2m_domain is introduced with
> vmid member inside it. It is necessary for VMID management functions.
> 
> Add a bitmap-based allocator to manage VMID space, supporting up to 127
> VMIDs on RV32 and 16,383 on RV64 platforms, in accordance with the
> architecture's hgatp VMID field (RV32 - 7 bit long, others - 14 bit long).
> 
> Reserve the highest VMID as INVALID_VMID to ensure it's not reused.

Why must that VMID not be (re)used? INVALID_VMID can be any value wider
than the hgatp.VMID field.

> --- a/xen/arch/riscv/Makefile
> +++ b/xen/arch/riscv/Makefile
> @@ -6,6 +6,7 @@ obj-y += intc.o
>  obj-y += irq.o
>  obj-y += mm.o
>  obj-y += pt.o
> +obj-y += p2m.o

Nit: Numbers typically sort ahead of letters.

> --- /dev/null
> +++ b/xen/arch/riscv/p2m.c
> @@ -0,0 +1,115 @@
> +#include <xen/bitops.h>
> +#include <xen/lib.h>
> +#include <xen/sched.h>
> +#include <xen/spinlock.h>
> +#include <xen/xvmalloc.h>
> +
> +#include <asm/p2m.h>
> +#include <asm/sbi.h>
> +
> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
> +
> +/*
> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
> + * concurrent domains.

Which is pretty limiting especially in the RV32 case. Hence why we don't
assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
not per-vCPU).

> The bitmap space will be allocated dynamically
> + * based on whether 7 or 14 bit VMIDs are supported.
> + */
> +static unsigned long *vmid_mask;
> +static unsigned long *vmid_flushing_needed;
> +
> +/*
> + * -2 here because:
> + *    - -1 is needed to get the maximal possible VMID

I don't follow this part.

> + *    - -1 is reserved for being used as INVALID_VMID

Whereas for this part - see above.

> + */
> +#ifdef CONFIG_RISCV_32
> +#define MAX_VMID (BIT(7, U) - 2)
> +#else

Better "#elif defined(CONFIG_RISCV_64)"?

> +#define MAX_VMID (BIT(14, U) - 2)
> +#endif
> +
> +/* Reserve the max possible VMID to be INVALID. */
> +#define INVALID_VMID (MAX_VMID + 1)
> +
> +void p2m_vmid_allocator_init(void)

__init

> +{
> +    /*
> +     * Allocate space for vmid_mask and vmid_flushing_needed
> +     * based on INVALID_VMID as it is the max possible VMID which just
> +     * was reserved to be INVALID_VMID.
> +     */
> +    vmid_mask = xvzalloc_array(unsigned long, BITS_TO_LONGS(INVALID_VMID));
> +    vmid_flushing_needed =
> +        xvzalloc_array(unsigned long, BITS_TO_LONGS(INVALID_VMID));

These both want to use MAX_VMID + 1; there's no logical connection here to
INVALID_VMID.

Furthermore don't you first need to determine how many bits hgatp.VMID actually
implements? The 7 and 14 bits respectively are maximum values only, after all.

VMIDLEN being permitted to be 0, how would you run more than one VM (e.g. Dom0)
on such a system?
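
For illustration only: the usual way to learn how many VMID bits a hart implements is to write all ones to hgatp.VMID and read the field back, since unimplemented bits read as zero. The sketch below models the CSR with a hypothetical struct fake_hart so it can run anywhere; hgatp_read(), hgatp_write() and probe_vmid_bits() are assumed names, and a real implementation would use CSR accessors instead and would also have to take care of any fencing the write requires.

```c
#include <assert.h>
#include <stdint.h>

/* RV64 hgatp layout: VMID occupies bits 57:44 (at most 14 bits). */
#define HGATP_VMID_SHIFT 44
#define HGATP_VMID_MASK  (UINT64_C(0x3fff) << HGATP_VMID_SHIFT)

/* Hypothetical stand-in for a hart with 'implemented_bits' VMID bits. */
struct fake_hart {
    unsigned int implemented_bits;
    uint64_t hgatp;
};

static void hgatp_write(struct fake_hart *h, uint64_t val)
{
    uint64_t impl = ((UINT64_C(1) << h->implemented_bits) - 1)
                    << HGATP_VMID_SHIFT;

    /* Only implemented VMID bits stick; other fields are ignored here. */
    h->hgatp = val & impl;
}

static uint64_t hgatp_read(const struct fake_hart *h)
{
    return h->hgatp;
}

/* Write all ones to the VMID field, count the bits that read back as set. */
static unsigned int probe_vmid_bits(struct fake_hart *h)
{
    uint64_t old = hgatp_read(h), vmid;
    unsigned int bits = 0;

    hgatp_write(h, old | HGATP_VMID_MASK);
    vmid = (hgatp_read(h) & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;
    hgatp_write(h, old);               /* restore the previous value */

    for ( ; vmid; vmid >>= 1 )
        bits++;

    return bits;
}
```

A result of 0 corresponds to the VMIDLEN == 0 case raised above, where every guest would have to share the single VMID 0.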

> +    if ( !vmid_mask || !vmid_flushing_needed )
> +        panic("Could not allocate VMID bitmap space or VMID flushing map\n");
> +
> +    set_bit(INVALID_VMID, vmid_mask);

If (see above) this is really needed, __set_bit() please.

> +}
> +
> +int p2m_alloc_vmid(struct domain *d)

Looks like this can be static? (p2m_free_vmid() has no caller at all, so
it's not clear what use it is going to be.)

> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    int rc, nr;

No need for the blank line between the (few) declarations?

> +    spin_lock(&vmid_alloc_lock);
> +
> +    nr = find_first_zero_bit(vmid_mask, MAX_VMID);

As per this nr wants to be unsigned int.

> +    ASSERT(nr != INVALID_VMID);
> +
> +    if ( nr == MAX_VMID )
> +    {
> +        rc = -EBUSY;
> +        printk(XENLOG_ERR "p2m.c: dom%d: VMID pool exhausted\n", d->domain_id);

Please use %pd.

> +        goto out;
> +    }
> +
> +    set_bit(nr, vmid_mask);

Since you do this under lock, even here __set_bit() ought to be sufficient.

> +    if ( test_bit(p2m->vmid, vmid_flushing_needed) )
> +    {
> +        clear_bit(p2m->vmid, vmid_flushing_needed);

And __clear_bit() here, or yet better use __test_and_clear_bit() in the if().

> +        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);

You're creating d; it cannot possibly have run on any CPU yet. IOW
d->dirty_cpumask will be reliably empty here. I think it would be hard to
avoid issuing the flush to all CPUs here in this scheme.

> +    }
> +
> +    p2m->vmid = nr;
> +
> +    rc = 0;
> +
> +out:

Nit: Style.

> +    spin_unlock(&vmid_alloc_lock);
> +    return rc;
> +}
> +
> +void p2m_free_vmid(struct domain *d)
> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    spin_lock(&vmid_alloc_lock);
> +
> +    if ( p2m->vmid != INVALID_VMID )
> +    {
> +        clear_bit(p2m->vmid, vmid_mask);
> +        set_bit(p2m->vmid, vmid_flushing_needed);

Does this scheme really avoid any flushes (except near when the system is
about to go down)?

As to choice of functions - see above.

> +    }
> +
> +    spin_unlock(&vmid_alloc_lock);
> +}
> +
> +int p2m_init(struct domain *d)
> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    int rc;
> +
> +    p2m->vmid = INVALID_VMID;

Given the absence of callers of p2m_free_vmid() it's also not clear what use
this is.

Jan
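
For illustration only: a minimal bitmap allocator sketch that folds in several of the remarks above (unsigned index, plain non-atomic bit operations on the assumption that the caller holds the allocator lock, and a combined test-and-clear of the "flush needed" bit). NR_VMIDS, vmid_used, vmid_stale, vmid_alloc() and vmid_free() are hypothetical names. The sketch checks the stale bit of the newly allocated ID, which is only one possible reading of the intent of the posted code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_VMIDS      8u   /* hypothetical small pool, i.e. 2^VMIDLEN */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS  ((NR_VMIDS + BITS_PER_LONG - 1) / BITS_PER_LONG)

static unsigned long vmid_used[BITMAP_LONGS];   /* IDs currently in use */
static unsigned long vmid_stale[BITMAP_LONGS];  /* IDs needing a flush  */

static bool bit_test(const unsigned long *map, unsigned int nr)
{
    return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void bit_set(unsigned long *map, unsigned int nr)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static bool bit_test_and_clear(unsigned long *map, unsigned int nr)
{
    bool was_set = bit_test(map, nr);

    map[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
    return was_set;
}

/*
 * Returns the allocated VMID, or -1 if the pool is exhausted.
 * *flush is set when the ID had a previous owner, so stale TLB entries
 * tagged with it may still exist. Caller is assumed to hold the lock.
 */
static int vmid_alloc(bool *flush)
{
    *flush = false;

    for ( unsigned int nr = 0; nr < NR_VMIDS; nr++ )
    {
        if ( bit_test(vmid_used, nr) )
            continue;

        bit_set(vmid_used, nr);
        *flush = bit_test_and_clear(vmid_stale, nr);
        return (int)nr;
    }

    return -1;
}

/* Release a VMID and remember that it needs flushing before reuse. */
static void vmid_free(unsigned int nr)
{
    bit_test_and_clear(vmid_used, nr);
    bit_set(vmid_stale, nr);
}
```

Since no domain has run yet when vmid_alloc() is called, the flush requested through *flush would, as noted above, most likely have to target all CPUs rather than d->dirty_cpumask.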


* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-06-10 13:05 ` [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
@ 2025-06-18 15:53   ` Jan Beulich
  2025-06-25 14:48     ` Oleksii Kurochko
  2025-07-01 13:04   ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-18 15:53 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> @@ -18,10 +20,20 @@ struct arch_vcpu_io {
>  struct arch_vcpu {
>  };
>  
> +struct paging_domain {
> +    spinlock_t lock;
> +    /* Free P2M pages from the pre-allocated P2M pool */
> +    struct page_list_head p2m_freelist;
> +    /* Number of pages from the pre-allocated P2M pool */
> +    unsigned long p2m_total_pages;
> +};
> +
>  struct arch_domain {
>      struct hvm_domain hvm;
>  
>      struct p2m_domain p2m;
> +
> +    struct paging_domain paging;

With the separate structures, do you have plans to implement e.g. shadow paging?
Or some other paging mode beyond the basic one based on the H extension? If the
structures are to remain separate, may I suggest that you keep things properly
separated (no matter how e.g. Arm may have it) in terms of naming? I.e. no
single "p2m" inside struct paging_domain.

> @@ -105,6 +106,9 @@ int p2m_init(struct domain *d)
>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
>      int rc;
>  
> +    spin_lock_init(&d->arch.paging.lock);
> +    INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);

If you want p2m and paging to be separate, you will want to put these in a new
paging_init().

Jan


* Re: [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-10 13:05 ` [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
@ 2025-06-18 16:08   ` Jan Beulich
  2025-06-25 15:31     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-18 16:08 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Introduce the following things:
> - Update p2m_domain structure, which describes per p2m-table state, with:
>   - lock to protect updates to p2m.
>   - pool with pages used to construct p2m.
>   - clean_pte, which indicates whether it is required to clean the cache
>     when writing an entry.
>   - radix tree to store p2m type as PTE doesn't have enough free bits to
>     store type.
>   - default_access to store p2m access type for each page in the domain.
>   - back pointer to domain structure.
> - p2m_init() to initialize members introduced in p2m_domain structure.
> - Introduce p2m_write_lock() and p2m_is_write_locked().

What about the reader variant? If you don't need that, why not use a simple
spin lock?

> @@ -14,6 +18,29 @@
>  
>  /* Per-p2m-table state */
>  struct p2m_domain {
> +    /*
> +     * Lock that protects updates to the p2m.
> +     */
> +    rwlock_t lock;
> +
> +    /* Pages used to construct the p2m */
> +    struct page_list_head pages;
> +
> +    /* Indicate if it is required to clean the cache when writing an entry */
> +    bool clean_pte;
> +
> +    struct radix_tree_root p2m_type;

A field with a p2m_ prefix in a p2m struct? And is this tree really about
just a single "type"?

> +    /*
> +     * Default P2M access type for each page in the domain: new pages,
> +     * swapped in pages, cleared pages, and pages that are ambiguously
> +     * retyped get this access type.  See definition of p2m_access_t.
> +     */
> +    p2m_access_t default_access;
> +
> +    /* Back pointer to domain */
> +    struct domain *domain;

This you may want to introduce earlier, to prefer passing around struct
p2m_domain * in / to P2M functions (which would benefit earlier patches
already, I think).

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -1,13 +1,46 @@
>  #include <xen/bitops.h>
> +#include <xen/domain_page.h>
>  #include <xen/event.h>
> +#include <xen/iommu.h>
>  #include <xen/lib.h>
> +#include <xen/mm.h>
> +#include <xen/pfn.h>
> +#include <xen/rwlock.h>
>  #include <xen/sched.h>
>  #include <xen/spinlock.h>
>  #include <xen/xvmalloc.h>
>  
> +#include <asm/page.h>
>  #include <asm/p2m.h>
>  #include <asm/sbi.h>
>  
> +/*
> + * Force a synchronous P2M TLB flush.
> + *
> + * Must be called with the p2m lock held.
> + */
> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
> +{
> +    struct domain *d = p2m->domain;
> +
> +    ASSERT(p2m_is_write_locked(p2m));
> +
> +    sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
> +}
> +
> +/* Unlock the flush and do a P2M TLB flush if necessary */
> +void p2m_write_unlock(struct p2m_domain *p2m)
> +{
> +    /*
> +     * The final flush is done with the P2M write lock taken to avoid
> +     * someone else modifying the P2M before the TLB invalidation has
> +     * completed.
> +     */
> +    p2m_force_tlb_flush_sync(p2m);

The comment ahead of the function says "if necessary". Yet there's no
conditional here. I also question the need for a global flush in all
cases.

> @@ -109,8 +142,33 @@ int p2m_init(struct domain *d)
>      spin_lock_init(&d->arch.paging.lock);
>      INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
>  
> +    rwlock_init(&p2m->lock);
> +    INIT_PAGE_LIST_HEAD(&p2m->pages);
> +
>      p2m->vmid = INVALID_VMID;
>  
> +    p2m->default_access = p2m_access_rwx;
> +
> +    radix_tree_init(&p2m->p2m_type);
> +
> +#ifdef CONFIG_HAS_PASSTHROUGH

Do you expect this to be conditionally selected on RISC-V?

> +    /*
> +     * Some IOMMUs don't support coherent PT walk. When the p2m is
> +     * shared with the CPU, Xen has to make sure that the PT changes have
> +     * reached the memory
> +     */
> +    p2m->clean_pte = is_iommu_enabled(d) &&
> +        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);

The comment talks about shared page tables, yet you don't check whether
page table sharing is actually enabled for the domain.

> +#else
> +    p2m->clean_pte = false;

I hope the struct starts out zero-filled, in which case you wouldn't need
this.

> +#endif
> +
> +    /*
> +     * "Trivial" initialisation is now complete.  Set the backpointer so the
> +     * users of p2m could get an access to domain structure.
> +     */
> +    p2m->domain = d;

Better set this about the very first thing?

Jan
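
For illustration only: one way to make the flush in p2m_write_unlock() match its "if necessary" comment is to track whether anything was modified while the lock was held. This is a sketch under assumptions, not the posted code; struct p2m_sketch, its need_flush flag, and the flushes counter (which stands in for the remote HFENCE.GVMA call) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct p2m_sketch {
    bool need_flush;        /* set by any path that changes a live mapping */
    unsigned int flushes;   /* counts issued flushes, for illustration     */
};

/* Stand-in for a P2M update made while holding the write lock. */
static void p2m_sketch_modify(struct p2m_sketch *p2m)
{
    p2m->need_flush = true;
}

/* Flush only if something changed, then release the lock. */
static void p2m_sketch_write_unlock(struct p2m_sketch *p2m)
{
    if ( p2m->need_flush )
    {
        p2m->flushes++;             /* the TLB flush would be issued here */
        p2m->need_flush = false;
    }

    /* The actual write_unlock() of the rwlock would follow here. */
}
```

With this shape, a lock/unlock pair that changes nothing issues no flush, while the flush is still done before the lock is dropped.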


* Re: [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-18 15:15   ` Jan Beulich
@ 2025-06-23 14:31     ` Oleksii Kurochko
  2025-06-23 14:39       ` Jan Beulich
  2025-06-24 10:33     ` Oleksii Kurochko
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-23 14:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 6/18/25 5:15 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>> covering the range of guest physical addresses between start_addr and
>> start_addr + size for all the guests.
> Here and in the code comment: Why "for all the guests"? Under what conditions
> would you require such a broad (guest) TLB flush?

Originally, it came from Andrew's reply:
```
TLB flushing needs to happen for each pCPU which potentially has cached
a mapping.

In other arches, this is tracked by d->dirty_cpumask which is the bitmap
of pCPUs where this domain is scheduled.

CPUs need to flush their TLBs before removing themselves from
d->dirty_cpumask, which is typically done during context switch, but it
means that to flush the P2M, you only need to IPI a subset of CPUs.
```

But specifically, this function was introduced for the case where there is no
VMID support, as we then can't distinguish which TLB entries belong to which
domain. As a result, we have no choice but to flush the entire TLB to avoid
incorrect translations.

However, this patch may no longer be necessary, as VMID support has been
introduced and sbi_remote_hfence_gvma_vmid() will be used instead.

>
>> --- a/xen/arch/riscv/sbi.c
>> +++ b/xen/arch/riscv/sbi.c
>> @@ -258,6 +258,15 @@ int sbi_remote_sfence_vma(const cpumask_t *cpu_mask, vaddr_t start,
>>                         cpu_mask, start, size, 0, 0);
>>   }
>>   
>> +int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
>> +                           size_t size)
>> +{
>> +    ASSERT(sbi_rfence);
> As previously indicated, I question the usefulness of such assertions. If the
> pointer is still NULL, ...
>
>> +    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
>> +                      cpu_mask, start, size, 0, 0);
> ... you'll crash here anyway (much like you will in a release build).

I will drop ASSERT() for rfence functions.

Thanks.

~ Oleksii

* Re: [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid()
  2025-06-18 15:20   ` Jan Beulich
@ 2025-06-23 14:38     ` Oleksii Kurochko
  0 siblings, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-23 14:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


On 6/18/25 5:20 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> It instructs the remote harts to execute one or more HFENCE.GVMA instructions
>> by making an SBI call, covering the range of guest physical addresses between
>> start_addr and start_addr + size only for the given VMID.
>>
>> This function call is only valid for harts implementing hypervisor extension.
> We require H now, don't we? It's also odd to have this here, but not in patch 1.

Yes, we require it. I will drop this part from the commit message and from the
comment above the declaration of sbi_remote_hfence_gvma_vmid().

>
>> --- a/xen/arch/riscv/sbi.c
>> +++ b/xen/arch/riscv/sbi.c
>> @@ -267,6 +267,15 @@ int sbi_remote_hfence_gvma(const cpumask_t *cpu_mask, vaddr_t start,
>>                         cpu_mask, start, size, 0, 0);
>>   }
>>   
>> +int sbi_remote_hfence_gvma_vmid(const cpumask_t *cpu_mask, vaddr_t start,
>> +                           size_t size, unsigned long vmid)
>> +{
>> +    ASSERT(sbi_rfence);
>> +
>> +    return sbi_rfence(SBI_EXT_RFENCE_REMOTE_HFENCE_GVMA,
>> +                      cpu_mask, start, size, vmid, 0);
>> +}
> sbi_remote_hfence_gvma() may want implementing in terms of this new function,
> requiring the patches to be swapped. Provided (see comment there) that helper
> is actually needed.

It makes sense.
But it seems like there is no need for sbi_remote_hfence_gvma() now that VMID
support has been introduced.

~ Oleksii

* Re: [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-23 14:31     ` Oleksii Kurochko
@ 2025-06-23 14:39       ` Jan Beulich
  2025-06-23 14:45         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-23 14:39 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 23.06.2025 16:31, Oleksii Kurochko wrote:
> On 6/18/25 5:15 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>>> covering the range of guest physical addresses between start_addr and
>>> start_addr + size for all the guests.
>> Here and in the code comment: Why "for all the guests"? Under what conditions
>> would you require such a broad (guest) TLB flush?
> 
> Originally, it came from Andrew reply:
> ```
> TLB flushing needs to happen for each pCPU which potentially has cached
> a mapping.
> 
> In other arches, this is tracked by d->dirty_cpumask which is the bitmap
> of pCPUs where this domain is scheduled.
> 
> CPUs need to flush their TLBs before removing themselves from
> d->dirty_cpumask, which is typically done during context switch, but it
> means that to flush the P2M, you only need to IPI a subset of CPUs.
> ```

Hmm, but the word "guest" doesn't even appear there. "Each pCPU" isn't quite
the same as "all guests".

> But specifically this function was introduced to work in case no VMID support
> as we can't distinguish which TLB entries belong to which domain. As a result,
> we have no choice but to flush the entire TLB to avoid incorrect translations.
> 
> However, this patch may no longer be necessary, as VMID support has been
> introduced and sbi_remote_hfence_gvma_vmid() will be used instead.

Good.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-23 14:39       ` Jan Beulich
@ 2025-06-23 14:45         ` Oleksii Kurochko
  0 siblings, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-23 14:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 6/23/25 4:39 PM, Jan Beulich wrote:
> On 23.06.2025 16:31, Oleksii Kurochko wrote:
>> On 6/18/25 5:15 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>>>> covering the range of guest physical addresses between start_addr and
>>>> start_addr + size for all the guests.
>>> Here and in the code comment: Why "for all the guests"? Under what conditions
>>> would you require such a broad (guest) TLB flush?
>> Originally, it came from Andrew reply:
>> ```
>> TLB flushing needs to happen for each pCPU which potentially has cached
>> a mapping.
>>
>> In other arches, this is tracked by d->dirty_cpumask which is the bitmap
>> of pCPUs where this domain is scheduled.
>>
>> CPUs need to flush their TLBs before removing themselves from
>> d->dirty_cpumask, which is typically done during context switch, but it
>> means that to flush the P2M, you only need to IPI a subset of CPUs.
>> ```
> Hmm, but the word "guest" doesn't even appear there. "Each pCPU" isn't quite
> the same as "all guests".

Agreed, that is just how the SBI spec words it...

pCPU here corresponds to the first argument of sbi_remote_hfence_gvma(unsigned long hart_mask, ...).

It is even more confusing because, based on the explanation in the RISC-V privileged spec,
hfence.gvma is used to flush the G-stage (stage-2) TLB and hfence.vvma the VS-stage (stage-1) TLB.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-18 15:46   ` Jan Beulich
@ 2025-06-24  9:46     ` Oleksii Kurochko
  2025-06-24 10:44       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-24  9:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


On 6/18/25 5:46 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Implementation is based on Arm code with some minor changes:
>>   - Re-define INVALID_VMID.
>>   - Re-define MAX_VMID.
>>   - Add TLB flushing when VMID is re-used.
>>
>> Also, as a part of this path structure p2m_domain is introduced with
>> vmid member inside it. It is necessary for VMID management functions.
>>
>> Add a bitmap-based allocator to manage VMID space, supporting up to 127
>> VMIDs on RV32 and 16,383 on RV64 platforms, in accordance with the
>> architecture's hgatp VMID field (RV32 - 7 bit long, others - 14 bit long).
>>
>> Reserve the highest VMID as INVALID_VMID to ensure it's not reused.
> Why must that VMID not be (re)used? INVALID_VMID can be any value wider
> than the hgatp.VMID field.

Oh, agreed, it could be just any value wider than the hgatp.VMID field. I forgot
that hgatp.VMID is at most 14 bits long, so we have two spare bits
in uint16_t.

>> --- /dev/null
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -0,0 +1,115 @@
>> +#include <xen/bitops.h>
>> +#include <xen/lib.h>
>> +#include <xen/sched.h>
>> +#include <xen/spinlock.h>
>> +#include <xen/xvmalloc.h>
>> +
>> +#include <asm/p2m.h>
>> +#include <asm/sbi.h>
>> +
>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>> +
>> +/*
>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>> + * concurrent domains.
> Which is pretty limiting especially in the RV32 case. Hence why we don't
> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
> not per-vCPU).

Good point.

I don't believe anyone will use RV32.
For RV64, the available ID space seems sufficiently large.

However, if it turns out that the value isn't large enough even for RV64,
I can rework it to manage IDs per physical CPU.
Wouldn't that approach result in more TLB entries being flushed compared
to per-vCPU allocation, potentially leading to slightly worse performance?

What about then to allocate VMID per-domain?

>> The bitmap space will be allocated dynamically
>> + * based on whether 7 or 14 bit VMIDs are supported.
>> + */
>> +static unsigned long *vmid_mask;
>> +static unsigned long *vmid_flushing_needed;
>> +
>> +/*
>> + * -2 here because:
>> + *    - -1 is needed to get the maximal possible VMID
> I don't follow this part.

Probably, I'm missing something.

hgatp.VMID is 7 bits long. BIT(7, U) = 1 << 7 = 128, which is bigger
than 7 bits can cover (0b1000_0000 vs. 0b0111_1111). Therefore MAX_VMID is:
  BIT(7, U) - 1 (in case of RV32).
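
A minimal standalone sketch of that arithmetic, for reference only.
vmid_max_for_width() is a hypothetical helper, not something from the patch:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): largest VMID value that
 * fits into a field of 'width' bits, i.e. BIT(width) - 1.
 */
static unsigned int vmid_max_for_width(unsigned int width)
{
    assert(width < 32);          /* keep the shift well defined */

    return (1U << width) - 1;    /* width 7 gives 0b111_1111 */
}
```

With width 7 this yields 127 and with width 14 it yields 16383, so a single
"- 1" is enough to get the maximal VMID.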

>> + */
>> +#ifdef CONFIG_RISCV_32
>> +#define MAX_VMID (BIT(7, U) - 2)
>> +#else
> Better "#elif defined(CONFIG_RISCV_64)"?

At first I read the spec as saying that for any bitness other than 32 it would be 14 bits long, but I re-read it and
that is true only for HSXLEN=64, so RV128 will/can have a different number of VMID bits. I will
update to "#elif defined(CONFIG_RISCV_64)" + #error "Define MAX_VMID" if bitness isn't 32 or 64.

>> +{
>> +    /*
>> +     * Allocate space for vmid_mask and vmid_flushing_needed
>> +     * based on INVALID_VMID as it is the max possible VMID which just
>> +     * was reserved to be INVALID_VMID.
>> +     */
>> +    vmid_mask = xvzalloc_array(unsigned long, BITS_TO_LONGS(INVALID_VMID));
>> +    vmid_flushing_needed =
>> +        xvzalloc_array(unsigned long, BITS_TO_LONGS(INVALID_VMID));
> These both want to use MAX_VMID + 1; there's no logical connection here to
> INVALID_VMID.
>
> Furthermore don't you first need to determine how many bits hgatp.VMID actually
> implements? The 7 and 14 bits respectively are maximum values only, after all.

I missed that it depends on VMIDLEN:
```
The number of VMID bits is UNSPECIFIED and may be zero. The number of implemented VMID bits,
termed VMIDLEN, may be determined by writing one to every bit position in the VMID field, then
reading back the value in hgatp to see which bit positions in the VMID field hold a one. The least-
significant bits of VMID are implemented first: that is, if VMIDLEN > 0, VMID[VMIDLEN-1:0] is
writable. The maximal value of VMIDLEN, termed VMIDMAX, is 7 for Sv32x4 or 14 for Sv39x4,
Sv48x4, and Sv57x4.
```
So yes, I have to determine first how many bits are supported by an implementation.
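
The probing procedure from the spec quote above could look roughly like the
sketch below. It is only an illustration under assumptions: the CSR is
replaced by a mock (struct mock_hgatp and its accessors are hypothetical) so
that the logic can run outside of HS-mode, and the VMID field position
(bits 57:44) is the RV64 hgatp layout.

```c
#include <assert.h>
#include <stdint.h>

/* RV64 layout: hgatp.VMID occupies bits 57:44 (14 bits). */
#define HGATP_VMID_SHIFT 44
#define HGATP_VMID_MASK  (UINT64_C(0x3fff) << HGATP_VMID_SHIFT)

/* Hypothetical stand-in for the real CSR; only for illustration. */
struct mock_hgatp {
    uint64_t value;
    unsigned int implemented_bits; /* VMIDLEN of the pretend hardware */
};

static void mock_hgatp_write(struct mock_hgatp *csr, uint64_t val)
{
    uint64_t writable;

    assert(csr->implemented_bits <= 14);

    /* Only the low VMIDLEN bits of the VMID field are writable. */
    writable = ((UINT64_C(1) << csr->implemented_bits) - 1)
               << HGATP_VMID_SHIFT;
    csr->value = (val & ~HGATP_VMID_MASK) | (val & writable);
}

static uint64_t mock_hgatp_read(const struct mock_hgatp *csr)
{
    return csr->value;
}

/* Write ones to every VMID bit, read back, then restore the old value. */
static unsigned int detect_vmidlen(struct mock_hgatp *csr)
{
    uint64_t old = mock_hgatp_read(csr);
    uint64_t vmid;
    unsigned int len = 0;

    mock_hgatp_write(csr, old | HGATP_VMID_MASK);
    vmid = (mock_hgatp_read(csr) & HGATP_VMID_MASK) >> HGATP_VMID_SHIFT;
    mock_hgatp_write(csr, old);

    /* Least-significant bits are implemented first: count the ones. */
    while ( vmid & 1 )
    {
        len++;
        vmid >>= 1;
    }

    return len;
}
```

A result of 0 is a legal outcome, so whatever consumes the detected length
has to cope with having no usable VMID bits at all.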

> VMIDLEN being permitted to be 0, how would you run more than one VM (e.g. Dom0)
> on such a system?

Hmm, good question.

Then it will be needed to flush TLB on each VM switch by using
sbi_remote_hfence_gvma().

>> +    if ( !vmid_mask || !vmid_flushing_needed )
>> +        panic("Could not allocate VMID bitmap space or VMID flushing map\n");
>> +
>> +    set_bit(INVALID_VMID, vmid_mask);
> If (see above) this is really needed, __set_bit() please.
>
>> +}
>> +
>> +int p2m_alloc_vmid(struct domain *d)
> Looks like this can be static? (p2m_free_vmid() has no caller at all, so
> it's not clear what use it is going to be.)

It really can be static. p2m_free_vmid() can be too, but as there is no caller
of p2m_free_vmid() yet, it probably makes sense to do it in the following way:
   /* Uncomment static when p2m_free_vmid() will be called. */
   /* static */ void p2m_free_vmid(struct domain *d)
Or just drop it for the moment, until it is really needed.

>> +        goto out;
>> +    }
>> +
>> +    set_bit(nr, vmid_mask);
> Since you do this under lock, even here __set_bit() ought to be sufficient.
>
>> +    if ( test_bit(p2m->vmid, vmid_flushing_needed) )
>> +    {
>> +        clear_bit(p2m->vmid, vmid_flushing_needed);
> And __clear_bit() here, or yet better use __test_and_clear_bit() in the if().
>
>> +        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
> You're creating d; it cannot possibly have run on any CPU yet. IOW
> d->dirty_cpumask will be reliably empty here. I think it would be hard to
> avoid issuing the flush to all CPUs here in this scheme.

I didn't double check, but I was sure that if d->dirty_cpumask is empty then
an rfence would be sent to all CPUs. But I was wrong about that.

What about just updating the code of sbi_rfence_v02()?

At the moment, we check whether the cpu_mask pointer is NULL, and if it is NULL we
do the rfence for all CPUs:

static int cf_check sbi_rfence_v02(unsigned long fid,
                                    const cpumask_t *cpu_mask,
                                    vaddr_t start, size_t size,
                                    unsigned long arg4, unsigned long arg5)
{
    ...

     /*
      * hart_mask_base can be set to -1 to indicate that hart_mask can be
      * ignored and all available harts must be considered.
      */
     if ( !cpu_mask )
         return sbi_rfence_v02_real(fid, 0UL, -1UL, start, size, arg4);
    ...

What about just adding here:
     if ( !cpu_mask || cpumask_empty(cpu_mask) )

Does it make sense?
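
To make the intended semantics concrete, here is a toy model of
"NULL or empty mask means all harts". Everything in it (toy_cpumask_t,
NR_HARTS, rfence_targets()) is hypothetical and only mimics the idea; it is
not the Xen cpumask API and not the real sbi_rfence_v02():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_HARTS 8 /* illustrative number of harts */

/* Toy replacement for cpumask_t: one bit per hart. */
typedef struct {
    unsigned long bits;
} toy_cpumask_t;

static bool toy_cpumask_empty(const toy_cpumask_t *mask)
{
    return mask->bits == 0;
}

/* Returns the set of harts the remote fence would be sent to. */
static unsigned long rfence_targets(const toy_cpumask_t *mask)
{
    const unsigned long all_harts = (1UL << NR_HARTS) - 1;

    /* A NULL or an empty mask both fall back to "every hart". */
    if ( !mask || toy_cpumask_empty(mask) )
        return all_harts;

    assert((mask->bits & ~all_harts) == 0);

    return mask->bits;
}
```

Whether such a fallback belongs in the SBI layer or in the caller is a
separate question; the sketch only shows the behavior being proposed.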

>> +    spin_unlock(&vmid_alloc_lock);
>> +    return rc;
>> +}
>> +
>> +void p2m_free_vmid(struct domain *d)
>> +{
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +    spin_lock(&vmid_alloc_lock);
>> +
>> +    if ( p2m->vmid != INVALID_VMID )
>> +    {
>> +        clear_bit(p2m->vmid, vmid_mask);
>> +        set_bit(p2m->vmid, vmid_flushing_needed);
> Does this scheme really avoid any flushes (except near when the system is
> about to go down)?
>
> As to choice of functions - see above.

I think yes. My idea was that as long as no VMID has been freed, we have enough free VMIDs
and no flush is needed, as each vcpu has a unique, not-yet-used VMID,
and if there is no free VMID then an error will be returned from p2m_alloc_vmid():
     if ( nr == MAX_VMID )
     {
         rc = -EBUSY;
         printk(XENLOG_ERR "p2m.c: dom%pd: VMID pool exhausted\n", d->domain_id);
         goto out;
     }

On the other hand, if a VMID was freed and then re-used in p2m_alloc_vmid(), then
vmid_flushing_needed will have that VMID's bit set, which means that a TLB flush is needed.
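
The idea can be sketched in a self-contained way as follows. This is a toy
model with only 8 VMIDs held in a single word, no locking and no real flush;
the toy_* names are hypothetical and the code is not the patch itself:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_VMIDS 8U

static unsigned long toy_vmid_mask;            /* bit set: VMID in use */
static unsigned long toy_vmid_flushing_needed; /* bit set: freed, stale TLB */

/*
 * Returns the allocated VMID, or -1 if the pool is exhausted.
 * *flush is set when the VMID was used before and needs a TLB flush.
 */
static int toy_alloc_vmid(bool *flush)
{
    unsigned int nr;

    assert(flush);

    for ( nr = 0; nr < TOY_NR_VMIDS; nr++ )
        if ( !(toy_vmid_mask & (1UL << nr)) )
            break;

    if ( nr == TOY_NR_VMIDS )
        return -1;

    toy_vmid_mask |= 1UL << nr;

    /* Test-and-clear of the "flush needed" bit for this VMID. */
    *flush = (toy_vmid_flushing_needed & (1UL << nr)) != 0;
    toy_vmid_flushing_needed &= ~(1UL << nr);

    return (int)nr;
}

static void toy_free_vmid(unsigned int vmid)
{
    assert(vmid < TOY_NR_VMIDS);

    toy_vmid_mask &= ~(1UL << vmid);
    toy_vmid_flushing_needed |= 1UL << vmid;
}
```

In this model the first allocation of each VMID needs no flush, while every
re-use of a freed VMID does.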

>
>> +    }
>> +
>> +    spin_unlock(&vmid_alloc_lock);
>> +}
>> +
>> +int p2m_init(struct domain *d)
>> +{
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +    int rc;
>> +
>> +    p2m->vmid = INVALID_VMID;
> Given the absence of callers of p2m_free_vmid() it's also not clear what use
> this is.

It just marks that a VMID for this domain hasn't been allocated yet.

Anyway, it will be called from arch_domain_create() via arch_domain_destroy(), so if some
error happens during arch_domain_create() and p2m->vmid wasn't allocated yet (so is equal to
INVALID_VMID), there is no point in updating vmid_mask or vmid_flushing_needed.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-18 15:15   ` Jan Beulich
  2025-06-23 14:31     ` Oleksii Kurochko
@ 2025-06-24 10:33     ` Oleksii Kurochko
  2025-06-24 10:48       ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-24 10:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 6/18/25 5:15 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>> covering the range of guest physical addresses between start_addr and
>> start_addr + size for all the guests.
> Here and in the code comment: Why "for all the guests"? Under what conditions
> would you require such a broad (guest) TLB flush?

Hmm, it seems like KVM always does such a broad (guest) TLB flush during detection
of VMIDLEN:
	void __init kvm_riscv_gstage_vmid_detect(void)
	{
		unsigned long old;
	
		/* Figure-out number of VMID bits in HW */
		old = csr_read(CSR_HGATP);
		csr_write(CSR_HGATP, old | HGATP_VMID);
		vmid_bits = csr_read(CSR_HGATP);
		vmid_bits = (vmid_bits & HGATP_VMID) >> HGATP_VMID_SHIFT;
		vmid_bits = fls_long(vmid_bits);
		csr_write(CSR_HGATP, old);
	
		/* We polluted local TLB so flush all guest TLB */
		kvm_riscv_local_hfence_gvma_all();
	
		/* We don't use VMID bits if they are not sufficient */
		if ((1UL << vmid_bits) < num_possible_cpus())
			vmid_bits = 0;
	}

It is actually not clear why it is so broad and why it is not hfence_gvma_vmid(vmid_bits).

And I am not really 100% sure that any hfence_gvma() is needed here, as I don't see
what could pollute the local guest TLB between the csr_write() calls.

RISC-V spec. says that:
	Note that writing hgatp does not imply any ordering constraints between page-table updates and
	subsequent G-stage address translations. If the new virtual machine’s guest physical page tables have
	been modified, or if a VMID is reused, it may be necessary to execute an HFENCE.GVMA instruction
	(see Section 18.3.2) before or after writing hgatp.

But we don't modify the VM's guest physical page tables. We could potentially reuse a VMID between the csr_write()
calls, but the old value is restored and we don't switch to a guest with this "new" VMID, so it isn't really used.

Do you have any thoughts about that?

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-24  9:46     ` Oleksii Kurochko
@ 2025-06-24 10:44       ` Jan Beulich
  2025-06-24 13:47         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-24 10:44 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 24.06.2025 11:46, Oleksii Kurochko wrote:
> On 6/18/25 5:46 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -0,0 +1,115 @@
>>> +#include <xen/bitops.h>
>>> +#include <xen/lib.h>
>>> +#include <xen/sched.h>
>>> +#include <xen/spinlock.h>
>>> +#include <xen/xvmalloc.h>
>>> +
>>> +#include <asm/p2m.h>
>>> +#include <asm/sbi.h>
>>> +
>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>> +
>>> +/*
>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>> + * concurrent domains.
>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>> not per-vCPU).
> 
> Good point.
> 
> I don't believe anyone will use RV32.
> For RV64, the available ID space seems sufficiently large.
> 
> However, if it turns out that the value isn't large enough even for RV64,
> I can rework it to manage IDs per physical CPU.
> Wouldn't that approach result in more TLB entries being flushed compared
> to per-vCPU allocation, potentially leading to slightly worse performance?

Depends on the condition for when to flush. Of course performance is
unavoidably going to suffer if you have only very few VMIDs to use.
Nevertheless, as indicated before, the model used on x86 may be a
candidate to use here, too. See hvm_asid_handle_vmenter() for the
core (and vendor-independent) part of it.

> What about then to allocate VMID per-domain?

That's what you're doing right now, isn't it? And that gets problematic when
you have only very few bits in hgatp.VMID, as mentioned below.

>>> The bitmap space will be allocated dynamically
>>> + * based on whether 7 or 14 bit VMIDs are supported.
>>> + */
>>> +static unsigned long *vmid_mask;
>>> +static unsigned long *vmid_flushing_needed;
>>> +
>>> +/*
>>> + * -2 here because:
>>> + *    - -1 is needed to get the maximal possible VMID
>> I don't follow this part.
> 
> Probably, I'm missing something.
> 
> hgat.vmid is 7 bit long. BIT(7,U) = 1 << 7 = 128 which is bigger
> then 7 bit can cover (0b1000_0000 and 0x111_1111). Thereby the MAX_VMID is:
>   BIT(7, U) - 1 (in case of RV32).

Right, but then why -2? (Maybe this is moot now that you agreed that
INVALID_VMID can be defined differently.)

>> VMIDLEN being permitted to be 0, how would you run more than one VM (e.g. Dom0)
>> on such a system?
> 
> Hmm, good question.
> 
> Then it will be needed to flush TLB on each VM switch by using
> sbi_remote_hfence_gvma().

Right, but just to be clear: That flush should not be conditional upon
VMIDLEN being 0. In whatever model you chose, the handling of this special
case should come out "natural".

>>> +        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>> You're creating d; it cannot possibly have run on any CPU yet. IOW
>> d->dirty_cpumask will be reliably empty here. I think it would be hard to
>> avoid issuing the flush to all CPUs here in this scheme.
> 
> I didn't double check, but I was sure that in case d->dirty_cpumask is empty then
> rfence for all CPUs will be send. But I was wrong about that.
> 
> What about just update a code of sbi_rfence_v02()?

I don't know, but dealing with the issue there feels wrong. However,
before deciding where to do something, it needs to be clear what you
actually want to achieve. To me at least, that's not clear at all.

>>> +    spin_unlock(&vmid_alloc_lock);
>>> +    return rc;
>>> +}
>>> +
>>> +void p2m_free_vmid(struct domain *d)
>>> +{
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +
>>> +    spin_lock(&vmid_alloc_lock);
>>> +
>>> +    if ( p2m->vmid != INVALID_VMID )
>>> +    {
>>> +        clear_bit(p2m->vmid, vmid_mask);
>>> +        set_bit(p2m->vmid, vmid_flushing_needed);
>> Does this scheme really avoid any flushes (except near when the system is
>> about to go down)?
>>
>> As to choice of functions - see above.
> 
> I think yes, so my idea was that if vmid isn't freed then we have enough free VMID
> and in this case flush isn't needed as each vcpu has unique not-used yet VMID,
> and if there is no free VMID then and error will return in p2m_alloc_vmid():
>      if ( nr == MAX_VMID )
>      {
>          rc = -EBUSY;
>          printk(XENLOG_ERR "p2m.c: dom%pd: VMID pool exhausted\n", d->domain_id);
>          goto out;
>      }

Which, as said, is a problem when there are only very few VMIDs.

> On other hand, if VMID was freed and then re-used in p2m_alloc_vmid(), then it means
> that vmid_flushing_needed will have VMID bit set, what means that a TLB flush is needed.

Let's assume over the uptime of a system you cycle through all VMIDs a thousand
times. While you manage to delay some TLB flushes, the percentage of ones actually
saved is going to be very low then.

>>> +    }
>>> +
>>> +    spin_unlock(&vmid_alloc_lock);
>>> +}
>>> +
>>> +int p2m_init(struct domain *d)
>>> +{
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +    int rc;
>>> +
>>> +    p2m->vmid = INVALID_VMID;
>> Given the absence of callers of p2m_free_vmid() it's also not clear what use
>> this is.
> 
> Just mark that VMID for this domain wasn't yet allocated.
> 
> Anyway, it will be called from arch_domain_create() by arch_domain_destroy() so if the some
> error happens during arch_domain_create() and p2m->vmid wasn't allocated yet (so is equal to
> INVALID_VMID), it means that there is no sense to update vmid_mask or vmid_flushing_needed.

But only if you actually came through p2m_init() prior to the error. My point
is: If you allocate a VMID here anyway, why first set the field like this?
(Again, this is likely moot since the allocation scheme is likely to change
altogether.)

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma()
  2025-06-24 10:33     ` Oleksii Kurochko
@ 2025-06-24 10:48       ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-24 10:48 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 24.06.2025 12:33, Oleksii Kurochko wrote:
> On 6/18/25 5:15 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Instruct the remote harts to execute one or more HFENCE.GVMA instructions,
>>> covering the range of guest physical addresses between start_addr and
>>> start_addr + size for all the guests.
>> Here and in the code comment: Why "for all the guests"? Under what conditions
>> would you require such a broad (guest) TLB flush?
> 
> Hmm, it seems like KVM always do such a broad (guest) TLB flush during detection
> of VMIDLEN:
> 	void __init kvm_riscv_gstage_vmid_detect(void)
> 	{
> 		unsigned long old;
> 	
> 		/* Figure-out number of VMID bits in HW */
> 		old = csr_read(CSR_HGATP);
> 		csr_write(CSR_HGATP, old | HGATP_VMID);
> 		vmid_bits = csr_read(CSR_HGATP);
> 		vmid_bits = (vmid_bits & HGATP_VMID) >> HGATP_VMID_SHIFT;
> 		vmid_bits = fls_long(vmid_bits);
> 		csr_write(CSR_HGATP, old);
> 	
> 		/* We polluted local TLB so flush all guest TLB */
> 		kvm_riscv_local_hfence_gvma_all();
> 	
> 		/* We don't use VMID bits if they are not sufficient */
> 		if ((1UL << vmid_bits) < num_possible_cpus())
> 			vmid_bits = 0;
> 	}
> 
> It is not clear actually why so broad and why not hfence_gvma_vmid(vmid_bits).
> 
> And I am not really 100% sure that any hfence_gvma() is needed here as I don't see
> what could pollutes local guest TLB between csr_write() calls.
> 
> RISC-V spec. says that:
> 	Note that writing hgatp does not imply any ordering constraints between page-table updates and
> 	subsequent G-stage address translations. If the new virtual machine’s guest physical page tables have
> 	been modified, or if a VMID is reused, it may be necessary to execute an HFENCE.GVMA instruction
> 	(see Section 18.3.2) before or after writing hgatp.
> 
> But we don't modify VM's guest physical page table. We could potentially reuse VMID between csr_write()
> calls, but it is returning back and we don't switch to a guest with this "new" VMID, so it isn't really used.

That would be my expectation, too. Yet I don't know if RISC-V has any
peculiarities there.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-24 10:44       ` Jan Beulich
@ 2025-06-24 13:47         ` Oleksii Kurochko
  2025-06-24 14:01           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-24 13:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 6/24/25 12:44 PM, Jan Beulich wrote:
> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- /dev/null
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -0,0 +1,115 @@
>>>> +#include <xen/bitops.h>
>>>> +#include <xen/lib.h>
>>>> +#include <xen/sched.h>
>>>> +#include <xen/spinlock.h>
>>>> +#include <xen/xvmalloc.h>
>>>> +
>>>> +#include <asm/p2m.h>
>>>> +#include <asm/sbi.h>
>>>> +
>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>> +
>>>> +/*
>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>> + * concurrent domains.
>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>> not per-vCPU).
>> Good point.
>>
>> I don't believe anyone will use RV32.
>> For RV64, the available ID space seems sufficiently large.
>>
>> However, if it turns out that the value isn't large enough even for RV64,
>> I can rework it to manage IDs per physical CPU.
>> Wouldn't that approach result in more TLB entries being flushed compared
>> to per-vCPU allocation, potentially leading to slightly worse performance?
> Depends on the condition for when to flush. Of course performance is
> unavoidably going to suffer if you have only very few VMIDs to use.
> Nevertheless, as indicated before, the model used on x86 may be a
> candidate to use here, too. See hvm_asid_handle_vmenter() for the
> core (and vendor-independent) part of it.

Thanks.

IIUC, it is basically just round-robin: when the VMIDs run out,
do a full guest TLB flush and start re-using VMIDs from the beginning.
It makes sense to me, I'll implement something similar. (I'm not really
sure that we need data->core_asid_generation; I will probably understand it better when
I start to implement it.)
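
Roughly, the generation-based round-robin could be sketched as below. This is
only my simplified reading of the model, with hypothetical toy_* names; it is
not the x86 code, it leaves out locking and atomics, and it does not cover the
case where no VMID bits are implemented at all:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per physical CPU state (hypothetical layout). */
struct toy_cpu_vmid {
    uint64_t generation; /* current generation, never 0 */
    uint16_t next_vmid;  /* next VMID to hand out */
    uint16_t max_vmid;   /* largest usable VMID */
};

/* Per vCPU state; generation 0 means "no VMID assigned yet". */
struct toy_vcpu_vmid {
    uint64_t generation;
    uint16_t vmid;
};

/*
 * Called before entering the guest. Returns true when the caller must
 * flush all guest TLB entries on this CPU before using the VMID.
 */
static bool toy_vmid_handle_enter(struct toy_cpu_vmid *cpu,
                                  struct toy_vcpu_vmid *v)
{
    assert(cpu->max_vmid >= 1);
    assert(cpu->generation != 0);

    /* The vCPU still holds a VMID from the current generation. */
    if ( v->generation == cpu->generation )
        return false;

    /* Out of VMIDs: start a new generation, old VMIDs become stale. */
    if ( cpu->next_vmid > cpu->max_vmid )
    {
        cpu->generation++;
        cpu->next_vmid = 1;
    }

    v->vmid = cpu->next_vmid++;
    v->generation = cpu->generation;

    /* Handing out the first VMID of a generation requires a flush. */
    return v->vmid == 1;
}
```

The generation counter is what lets a vCPU notice, without any per-VMID
bookkeeping, that its VMID was handed out before the last wrap-around and
must not be reused as is.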

>
>> What about then to allocate VMID per-domain?
> That's what you're doing right now, isn't it? And that gets problematic when
> you have only very few bits in hgatp.VMID, as mentioned below.

Right, I just phrased my question poorly, sorry about that.

What I meant to ask is: does the approach described above actually depend on whether
VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
since it's more likely that a platform will have more than VMID_MAX domains than
VMID_MAX physical CPUs. Am I right?


>
>>>> The bitmap space will be allocated dynamically
>>>> + * based on whether 7 or 14 bit VMIDs are supported.
>>>> + */
>>>> +static unsigned long *vmid_mask;
>>>> +static unsigned long *vmid_flushing_needed;
>>>> +
>>>> +/*
>>>> + * -2 here because:
>>>> + *    - -1 is needed to get the maximal possible VMID
>>> I don't follow this part.
>> Probably, I'm missing something.
>>
>> hgat.vmid is 7 bit long. BIT(7,U) = 1 << 7 = 128 which is bigger
>> then 7 bit can cover (0b1000_0000 and 0x111_1111). Thereby the MAX_VMID is:
>>    BIT(7, U) - 1 (in case of RV32).
> Right, but then why -2? (Maybe this is moot now that you agreed that
> INVALID_VMID can be defined differently.

Yes, another one -1 was because how INVALID_VMID was defined.

>
>>> VMIDLEN being permitted to be 0, how would you run more than one VM (e.g. Dom0)
>>> on such a system?
>> Hmm, good question.
>>
>> Then it will be needed to flush TLB on each VM switch by using
>> sbi_remote_hfence_gvma().
> Right, but just to be clear: That flush should not be conditional upon
> VMIDLEN being 0. In whatever model you chose, the handling of this special
> case should come out "natural".

Sure. I have some ideas on how to handle it naturally.

>
>>>> +        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>>> You're creating d; it cannot possibly have run on any CPU yet. IOW
>>> d->dirty_cpumask will be reliably empty here. I think it would be hard to
>>> avoid issuing the flush to all CPUs here in this scheme.
>> I didn't double check, but I was sure that in case d->dirty_cpumask is empty then
>> rfence for all CPUs will be send. But I was wrong about that.
>>
>> What about just update a code of sbi_rfence_v02()?
> I don't know, but dealing with the issue there feels wrong. However,
> before deciding where to do something, it needs to be clear what you
> actually want to achieve. To me at least, that's not clear at all.

I want to achieve the following behavior: if a mask is empty
(specifically, in our case d->dirty_cpumask), then perform the flush
on all CPUs.

If you think it's not a good idea to change the current implementation
of sbi_rfence_v02(), then I'll just check whether d->dirty_cpumask is empty
before calling sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid).

If it is empty, I'll call sbi_remote_hfence_gvma() instead:
    if ( !cpumask_empty(d->dirty_cpumask) )
        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
    else
        sbi_remote_hfence_gvma(NULL, 0, 0);

A similar check will be needed in p2m_force_tlb_flush_sync(), which is
implemented in one of the following patches in this series.

However, if we instead move the if ( !cpumask_empty(d->dirty_cpumask) ) check into
https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/riscv/sbi.c?ref_type=heads#L178,
we could call only:
    sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
and get the same effect, which might result in cleaner code overall,
as sbi_rfence_v02() already has a similar check (cpumask == NULL), the result of which
is just to send the rfence operation to all CPUs.

>
>>>> +    spin_unlock(&vmid_alloc_lock);
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +void p2m_free_vmid(struct domain *d)
>>>> +{
>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +
>>>> +    spin_lock(&vmid_alloc_lock);
>>>> +
>>>> +    if ( p2m->vmid != INVALID_VMID )
>>>> +    {
>>>> +        clear_bit(p2m->vmid, vmid_mask);
>>>> +        set_bit(p2m->vmid, vmid_flushing_needed);
>>> Does this scheme really avoid any flushes (except near when the system is
>>> about to go down)?
>>>
>>> As to choice of functions - see above.
>> I think yes, so my idea was that if vmid isn't freed then we have enough free VMID
>> and in this case flush isn't needed as each vcpu has unique not-used yet VMID,
>> and if there is no free VMID then and error will return in p2m_alloc_vmid():
>>       if ( nr == MAX_VMID )
>>       {
>>           rc = -EBUSY;
>>           printk(XENLOG_ERR "p2m.c: dom%pd: VMID pool exhausted\n", d->domain_id);
>>           goto out;
>>       }
> Which, as said, is a problem when there are only very few VMIDs.
>
>> On other hand, if VMID was freed and then re-used in p2m_alloc_vmid(), then it means
>> that vmid_flushing_needed will have VMID bit set, what means that a TLB flush is needed.
> Let's assume over the uptime of a system you cycle through all VMIDs a thousand
> times. While you manage to delay some TLB flushes, the percentage of ones actually
> saved is going to be very low then.

Then it is simply better to update the VMID allocation algorithm.

>
>>>> +    }
>>>> +
>>>> +    spin_unlock(&vmid_alloc_lock);
>>>> +}
>>>> +
>>>> +int p2m_init(struct domain *d)
>>>> +{
>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +    int rc;
>>>> +
>>>> +    p2m->vmid = INVALID_VMID;
>>> Given the absence of callers of p2m_free_vmid() it's also not clear what use
>>> this is.
>> Just mark that VMID for this domain wasn't yet allocated.
>>
>> Anyway, it will be called from arch_domain_create() by arch_domain_destroy() so if the some
>> error happens during arch_domain_create() and p2m->vmid wasn't allocated yet (so is equal to
>> INVALID_VMID), it means that there is no sense to update vmid_mask or vmid_flushing_needed.
> But only if you actually came through p2m_init() prior to the error. My point
> is: If you allocate a VMID here anyway, why first set the field like this?

Oh, I get your point. Indeed, it makes no sense.

> (Again, this is likely moot since the allocation scheme is likely to change
> altogether.)

Yes, it won't be really needed in the new allocation scheme.

Thanks.

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 14221 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-24 13:47         ` Oleksii Kurochko
@ 2025-06-24 14:01           ` Jan Beulich
  2025-06-24 15:32             ` Oleksii Kurochko
  2025-06-26 10:05             ` Oleksii Kurochko
  0 siblings, 2 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-24 14:01 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 24.06.2025 15:47, Oleksii Kurochko wrote:
> On 6/24/25 12:44 PM, Jan Beulich wrote:
>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- /dev/null
>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>> @@ -0,0 +1,115 @@
>>>>> +#include <xen/bitops.h>
>>>>> +#include <xen/lib.h>
>>>>> +#include <xen/sched.h>
>>>>> +#include <xen/spinlock.h>
>>>>> +#include <xen/xvmalloc.h>
>>>>> +
>>>>> +#include <asm/p2m.h>
>>>>> +#include <asm/sbi.h>
>>>>> +
>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>> +
>>>>> +/*
>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>> + * concurrent domains.
>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>> not per-vCPU).
>>> Good point.
>>>
>>> I don't believe anyone will use RV32.
>>> For RV64, the available ID space seems sufficiently large.
>>>
>>> However, if it turns out that the value isn't large enough even for RV64,
>>> I can rework it to manage IDs per physical CPU.
>>> Wouldn't that approach result in more TLB entries being flushed compared
>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>> Depends on the condition for when to flush. Of course performance is
>> unavoidably going to suffer if you have only very few VMIDs to use.
>> Nevertheless, as indicated before, the model used on x86 may be a
>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>> core (and vendor-independent) part of it.
> 
> IIUC, so basically it is just a round-robin and when VMIDs are ran out
> then just do full guest TLB flush and start to re-use VMIDs from the start.
> It makes sense to me, I'll implement something similar. (as I'm not really
> sure that we need data->core_asid_generation, probably, I will understand it better when
> start to implement it)

Well. The fewer VMID bits you have, the more quickly you will need a new
generation. And to keep track of the generation you're at, you also need to
track the present number somewhere.
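
A minimal sketch of such generation tracking follows. It is only loosely
modelled on the idea behind hvm_asid_handle_vmenter() and is not Xen's code:
struct vmid_data, struct vcpu_vmid, vmid_init() and vmid_handle_enter() are
hypothetical names, and VMID 0 is assumed to be reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-pCPU allocator state; names are illustrative only. */
struct vmid_data {
    uint64_t generation; /* current generation on this CPU */
    uint16_t next_vmid;  /* next VMID to hand out */
    uint16_t max_vmid;   /* largest usable VMID; 0 is kept reserved */
};

/* Hypothetical per-vCPU record of the VMID it last used. */
struct vcpu_vmid {
    uint64_t generation; /* generation the VMID belongs to; 0 = none yet */
    uint16_t vmid;
};

static void vmid_init(struct vmid_data *data, unsigned int vmid_bits)
{
    assert(vmid_bits >= 1 && vmid_bits <= 14);

    data->generation = 1;
    data->next_vmid = 1;
    data->max_vmid = (uint16_t)((1u << vmid_bits) - 1);
}

/*
 * Make sure v holds a VMID valid for the current generation.
 * Returns true if the caller must flush guest TLB entries before
 * entering the guest, i.e. a new generation was started.
 */
static bool vmid_handle_enter(struct vmid_data *data, struct vcpu_vmid *v)
{
    bool need_flush = false;

    /* VMID still belongs to the current generation: nothing to do. */
    if ( v->generation == data->generation )
        return false;

    /* Pool exhausted: start a new generation and reuse IDs from 1. */
    if ( data->next_vmid > data->max_vmid )
    {
        data->generation++;
        data->next_vmid = 1;
        need_flush = true;
    }

    v->vmid = data->next_vmid++;
    v->generation = data->generation;

    return need_flush;
}
```

With only 2 VMID bits, for example, the fourth distinct vCPU entering the
guest already forces a new generation, which illustrates how the flush
frequency grows as the VMID space shrinks.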

>>> What about then to allocate VMID per-domain?
>> That's what you're doing right now, isn't it? And that gets problematic when
>> you have only very few bits in hgatp.VMID, as mentioned below.
> 
> Right, I just phrased my question poorly—sorry about that.
> 
> What I meant to ask is: does the approach described above actually depend on whether
> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
> since it's more likely that a platform will have more than VMID_MAX domains than
> VMID_MAX physical CPUs, am I right?

Seeing that there can be systems with hundreds or even thousands of CPUs,
I don't think I can agree here. Plus per-pCPU allocation would similarly
get you in trouble when you have only very few VMID bits.

>>>>> +        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>>>> You're creating d; it cannot possibly have run on any CPU yet. IOW
>>>> d->dirty_cpumask will be reliably empty here. I think it would be hard to
>>>> avoid issuing the flush to all CPUs here in this scheme.
>>> I didn't double check, but I was sure that in case d->dirty_cpumask is empty then
>>> rfence for all CPUs will be send. But I was wrong about that.
>>>
>>> What about just update a code of sbi_rfence_v02()?
>> I don't know, but dealing with the issue there feels wrong. However,
>> before deciding where to do something, it needs to be clear what you
>> actually want to achieve. To me at least, that's not clear at all.
> 
> I want to achieve the following behavior: if a mask is empty
> (specifically, in our case d->dirty_cpumask), then perform the flush
> on all CPUs.

That's still too far into the "how". The "why" here is still unclear: Why
do you need any flushing here at all? (With the scheme you now mean to
implement I expect it'll become yet more clear that no flush is needed
during domain construction.)

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-24 14:01           ` Jan Beulich
@ 2025-06-24 15:32             ` Oleksii Kurochko
  2025-06-26 10:05             ` Oleksii Kurochko
  1 sibling, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-24 15:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1929 bytes --]


On 6/24/25 4:01 PM, Jan Beulich wrote:
>>>>>> +        sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>>>>> You're creating d; it cannot possibly have run on any CPU yet. IOW
>>>>> d->dirty_cpumask will be reliably empty here. I think it would be hard to
>>>>> avoid issuing the flush to all CPUs here in this scheme.
>>>> I didn't double check, but I was sure that in case d->dirty_cpumask is empty then
>>>> rfence for all CPUs will be send. But I was wrong about that.
>>>>
>>>> What about just update a code of sbi_rfence_v02()?
>>> I don't know, but dealing with the issue there feels wrong. However,
>>> before deciding where to do something, it needs to be clear what you
>>> actually want to achieve. To me at least, that's not clear at all.
>> I want to achieve the following behavior: if a mask is empty
>> (specifically, in our case d->dirty_cpumask), then perform the flush
>> on all CPUs.
> That's still too far into the "how". The "why" here is still unclear: Why
> do you need any flushing here at all? (With the scheme you now mean to
> implement I expect it'll become yet more clear that no flush is needed
> during domain construction.)

For the same reason x86 has a flush:
     /* If there are no free ASIDs, need to go to a new generation */
     if ( unlikely(data->next_asid > data->max_asid) )
     {
         hvm_asid_flush_core();

But hvm_asid_flush_core() isn't doing a "real" flush, which I failed to notice
on my first look at hvm_asid_handle_vmenter().

So I assume a "real" flush will then be done somewhere before entry to guest
context.

I think now it is more or less clear.

Anyway, what should be done for cases where an ASID is needed which isn't
expected to change?
With this cycling approach, once a new generation is needed, all ASIDs
could/will change. That isn't the case for RISC-V (at least at the moment),
but AFAIK it is an issue for AMD SEV.

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 2839 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-06-18 15:53   ` Jan Beulich
@ 2025-06-25 14:48     ` Oleksii Kurochko
  2025-06-25 14:55       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-25 14:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2065 bytes --]


On 6/18/25 5:53 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> @@ -18,10 +20,20 @@ struct arch_vcpu_io {
>>   struct arch_vcpu {
>>   };
>>   
>> +struct paging_domain {
>> +    spinlock_t lock;
>> +    /* Free P2M pages from the pre-allocated P2M pool */
>> +    struct page_list_head p2m_freelist;
>> +    /* Number of pages from the pre-allocated P2M pool */
>> +    unsigned long p2m_total_pages;
>> +};
>> +
>>   struct arch_domain {
>>       struct hvm_domain hvm;
>>   
>>       struct p2m_domain p2m;
>> +
>> +    struct paging_domain paging;
> With the separate structures, do you have plans to implement e.g. shadow paging?
> Or some other paging mode beyond the basic one based on the H extension?

No, there are no such plans.

>   If the
> structures are to remain separate, may I suggest that you keep things properly
> separated (no matter how e.g. Arm may have it) in terms of naming? I.e. no
> single "p2m" inside struct paging_domain.

Arm doesn't implement shadow paging either (AFAIK), and this approach was probably
copied from x86, and then to RISC-V.
I thought the reason for that was just to have two separate entities: one which
covers page tables and one which covers the full available guest memory.
And if the only idea behind that was to have shadow paging, then I don't know how it
should be done better. As the p2m code is based on Arm's, perhaps it makes sense to keep
this stuff separated, so porting will be easier.

>
>> @@ -105,6 +106,9 @@ int p2m_init(struct domain *d)
>>       struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>       int rc;
>>   
>> +    spin_lock_init(&d->arch.paging.lock);
>> +    INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
> If you want p2m and paging to be separate, you will want to put these in a new
> paging_init().

I don't really understand what is wrong with having it here, but that is likely because
I don't really get the initial purpose of having p2m and paging separate.
It seems like p2m and paging are connected to each other, so it is fine
to init them together.

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 3002 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-06-25 14:48     ` Oleksii Kurochko
@ 2025-06-25 14:55       ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-25 14:55 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 25.06.2025 16:48, Oleksii Kurochko wrote:
> 
> On 6/18/25 5:53 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> @@ -18,10 +20,20 @@ struct arch_vcpu_io {
>>>   struct arch_vcpu {
>>>   };
>>>   
>>> +struct paging_domain {
>>> +    spinlock_t lock;
>>> +    /* Free P2M pages from the pre-allocated P2M pool */
>>> +    struct page_list_head p2m_freelist;
>>> +    /* Number of pages from the pre-allocated P2M pool */
>>> +    unsigned long p2m_total_pages;
>>> +};
>>> +
>>>   struct arch_domain {
>>>       struct hvm_domain hvm;
>>>   
>>>       struct p2m_domain p2m;
>>> +
>>> +    struct paging_domain paging;
>> With the separate structures, do you have plans to implement e.g. shadow paging?
>> Or some other paging mode beyond the basic one based on the H extension?
> 
> No, there is no such plans.
> 
>>   If the
>> structures are to remain separate, may I suggest that you keep things properly
>> separated (no matter how e.g. Arm may have it) in terms of naming? I.e. no
>> single "p2m" inside struct paging_domain.
> 
> Arm doesn't implement shadow paging too (AFAIK) and probably this approach was
> copied from x86, and then to RISC-V.
> I thought that a reason for that was just to have two separate entities: one which
> covers page tables and which covers the full available guest memory.
> And if the only idea of that was to have shadow paging then I don't how it should
> be done better. As p2m code is based on Arm's, perhaps, it makes sense to have
> this stuff separated, so easier porting will be.
> 
>>> @@ -105,6 +106,9 @@ int p2m_init(struct domain *d)
>>>       struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>       int rc;
>>>   
>>> +    spin_lock_init(&d->arch.paging.lock);
>>> +    INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
>> If you want p2m and paging to be separate, you will want to put these in a new
>> paging_init().
> 
> I am not really understand what is wrong to have it here, but likely it is because
> I don't really get an initial purpose of having p2m and paging separately.
> It seems like p2m and paging are connected between each other, so it is fine
> to init them together.

If you want to retain the separation, imo you want to follow what x86 has:
paging_domain_init() calling p2m_init(). And d->arch.paging.* would then
be initialized in paging_domain_init(), like x86 has it.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-18 16:08   ` Jan Beulich
@ 2025-06-25 15:31     ` Oleksii Kurochko
  2025-06-25 15:53       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-25 15:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5312 bytes --]


On 6/18/25 6:08 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Introduce the following things:
>> - Update p2m_domain structure, which describe per p2m-table state, with:
>>    - lock to protect updates to p2m.
>>    - pool with pages used to construct p2m.
>>    - clean_pte which indicate if it is requires to clean the cache when
>>      writing an entry.
>>    - radix tree to store p2m type as PTE doesn't have enough free bits to
>>      store type.
>>    - default_access to store p2m access type for each page in the domain.
>>    - back pointer to domain structure.
>> - p2m_init() to initalize members introduced in p2m_domain structure.
>> - Introudce p2m_write_lock() and p2m_is_write_locked().
> What about the reader variant? If you don't need that, why not use a simple
> spin lock?

It will be introduced later in this patch series, in "xen/riscv: add support of
page lookup by GFN", where it is actually used.

But I can move it here.

>
>> @@ -14,6 +18,29 @@
>>   
>>   /* Per-p2m-table state */
>>   struct p2m_domain {
>> +    /*
>> +     * Lock that protects updates to the p2m.
>> +     */
>> +    rwlock_t lock;
>> +
>> +    /* Pages used to construct the p2m */
>> +    struct page_list_head pages;
>> +
>> +    /* Indicate if it is required to clean the cache when writing an entry */
>> +    bool clean_pte;
>> +
>> +    struct radix_tree_root p2m_type;
> A field with a p2m_ prefix in a p2m struct?

The p2m_ prefix could indeed be dropped.

>   And is this tree really about
> just a single "type"?

Yes, we don't have enough bits in the PTE, so we need some extra storage to store the type.

>
>> +    /*
>> +     * Default P2M access type for each page in the the domain: new pages,
>> +     * swapped in pages, cleared pages, and pages that are ambiguously
>> +     * retyped get this access type.  See definition of p2m_access_t.
>> +     */
>> +    p2m_access_t default_access;
>> +
>> +    /* Back pointer to domain */
>> +    struct domain *domain;
> This you may want to introduce earlier, to prefer passing around struct
> p2m_domain * in / to P2M functions (which would benefit earlier patches
> already, I think).

But nothing uses it earlier.

>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -1,13 +1,46 @@
>>   #include <xen/bitops.h>
>> +#include <xen/domain_page.h>
>>   #include <xen/event.h>
>> +#include <xen/iommu.h>
>>   #include <xen/lib.h>
>> +#include <xen/mm.h>
>> +#include <xen/pfn.h>
>> +#include <xen/rwlock.h>
>>   #include <xen/sched.h>
>>   #include <xen/spinlock.h>
>>   #include <xen/xvmalloc.h>
>>   
>> +#include <asm/page.h>
>>   #include <asm/p2m.h>
>>   #include <asm/sbi.h>
>>   
>> +/*
>> + * Force a synchronous P2M TLB flush.
>> + *
>> + * Must be called with the p2m lock held.
>> + */
>> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
>> +{
>> +    struct domain *d = p2m->domain;
>> +
>> +    ASSERT(p2m_is_write_locked(p2m));
>> +
>> +    sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>> +}
>> +
>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>> +void p2m_write_unlock(struct p2m_domain *p2m)
>> +{
>> +    /*
>> +     * The final flush is done with the P2M write lock taken to avoid
>> +     * someone else modifying the P2M wbefore the TLB invalidation has
>> +     * completed.
>> +     */
>> +    p2m_force_tlb_flush_sync(p2m);
> The comment ahead of the function says "if necessary". Yet there's no
> conditional here. I also question the need for a global flush in all
> cases.

Stale comment.

But if the p2m page table was modified, then a flush is needed for the CPUs
in d->dirty_cpumask.

>
>> @@ -109,8 +142,33 @@ int p2m_init(struct domain *d)
>>       spin_lock_init(&d->arch.paging.lock);
>>       INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
>>   
>> +    rwlock_init(&p2m->lock);
>> +    INIT_PAGE_LIST_HEAD(&p2m->pages);
>> +
>>       p2m->vmid = INVALID_VMID;
>>   
>> +    p2m->default_access = p2m_access_rwx;
>> +
>> +    radix_tree_init(&p2m->p2m_type);
>> +
>> +#ifdef CONFIG_HAS_PASSTHROUGH
> Do you expect this to be conditionally selected on RISC-V?

No, once it is implemented it will just be selected unconditionally by config RISC-V.
It was done this way because iommu_has_feature() isn't implemented now, as IOMMU
isn't supported yet and it depends on CONFIG_HAS_PASSTHROUGH.

>
>> +    /*
>> +     * Some IOMMUs don't support coherent PT walk. When the p2m is
>> +     * shared with the CPU, Xen has to make sure that the PT changes have
>> +     * reached the memory
>> +     */
>> +    p2m->clean_pte = is_iommu_enabled(d) &&
>> +        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);
> The comment talks about shared page tables, yet you don't check whether
> page table sharing is actually enabled for the domain.

Do we have such a function/macro? It is shared by implementation now.

>
>> +#else
>> +    p2m->clean_pte = false;
> I hope the struct starts out zero-filled, in which case you wouldn't need
> this.
>
>> +#endif
>> +
>> +    /*
>> +     * "Trivial" initialisation is now complete.  Set the backpointer so the
>> +     * users of p2m could get an access to domain structure.
>> +     */
>> +    p2m->domain = d;
> Better set this about the very first thing?

It makes sense. I will move it up.

Thanks.

~ Oleksii


[-- Attachment #2: Type: text/html, Size: 7863 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-25 15:31     ` Oleksii Kurochko
@ 2025-06-25 15:53       ` Jan Beulich
  2025-06-26  8:40         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-25 15:53 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 25.06.2025 17:31, Oleksii Kurochko wrote:
> On 6/18/25 6:08 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> @@ -14,6 +18,29 @@
>>>   
>>>   /* Per-p2m-table state */
>>>   struct p2m_domain {
>>> +    /*
>>> +     * Lock that protects updates to the p2m.
>>> +     */
>>> +    rwlock_t lock;
>>> +
>>> +    /* Pages used to construct the p2m */
>>> +    struct page_list_head pages;
>>> +
>>> +    /* Indicate if it is required to clean the cache when writing an entry */
>>> +    bool clean_pte;
>>> +
>>> +    struct radix_tree_root p2m_type;
>> A field with a p2m_ prefix in a p2m struct?
> 
> p2m_ prefix could be really dropped.
> 
>>   And is this tree really about
>> just a single "type"?
> 
> Yes, we don't have enough bits in PTE so we need some extra storage to store type.

My question wasn't about that, though. My question was whether, in the name,
"type" (singular) is appropriate. I didn't think you need a tree to store just
a single type.

>>> +    /*
>>> +     * Default P2M access type for each page in the the domain: new pages,
>>> +     * swapped in pages, cleared pages, and pages that are ambiguously
>>> +     * retyped get this access type.  See definition of p2m_access_t.
>>> +     */
>>> +    p2m_access_t default_access;
>>> +
>>> +    /* Back pointer to domain */
>>> +    struct domain *domain;
>> This you may want to introduce earlier, to prefer passing around struct
>> p2m_domain * in / to P2M functions (which would benefit earlier patches
>> already, I think).
> 
> But nothing uses it earlier.

If you do as suggested and pass around struct p2m_domain * for p2m_*()
functions, you'll quickly find it used, I think.

>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -1,13 +1,46 @@
>>>   #include <xen/bitops.h>
>>> +#include <xen/domain_page.h>
>>>   #include <xen/event.h>
>>> +#include <xen/iommu.h>
>>>   #include <xen/lib.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/pfn.h>
>>> +#include <xen/rwlock.h>
>>>   #include <xen/sched.h>
>>>   #include <xen/spinlock.h>
>>>   #include <xen/xvmalloc.h>
>>>   
>>> +#include <asm/page.h>
>>>   #include <asm/p2m.h>
>>>   #include <asm/sbi.h>
>>>   
>>> +/*
>>> + * Force a synchronous P2M TLB flush.
>>> + *
>>> + * Must be called with the p2m lock held.
>>> + */
>>> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
>>> +{
>>> +    struct domain *d = p2m->domain;
>>> +
>>> +    ASSERT(p2m_is_write_locked(p2m));
>>> +
>>> +    sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>>> +}
>>> +
>>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>>> +void p2m_write_unlock(struct p2m_domain *p2m)
>>> +{
>>> +    /*
>>> +     * The final flush is done with the P2M write lock taken to avoid
>>> +     * someone else modifying the P2M wbefore the TLB invalidation has
>>> +     * completed.
>>> +     */
>>> +    p2m_force_tlb_flush_sync(p2m);
>> The comment ahead of the function says "if necessary". Yet there's no
>> conditional here. I also question the need for a global flush in all
>> cases.
> 
> Stale comment.
> 
> But if p2m page table was modified that it is needed to do a flush for CPUs
> in d->dirty_cpumask.

Right, but is that true for each and every case where you acquire the
lock in write mode? There may e.g. be early-out paths which end up doing
nothing, yet you would then still flush the TLB.

>>> @@ -109,8 +142,33 @@ int p2m_init(struct domain *d)
>>>       spin_lock_init(&d->arch.paging.lock);
>>>       INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
>>>   
>>> +    rwlock_init(&p2m->lock);
>>> +    INIT_PAGE_LIST_HEAD(&p2m->pages);
>>> +
>>>       p2m->vmid = INVALID_VMID;
>>>   
>>> +    p2m->default_access = p2m_access_rwx;
>>> +
>>> +    radix_tree_init(&p2m->p2m_type);
>>> +
>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>> Do you expect this to be conditionally selected on RISC-V?
> 
> No, once it will be implemented it will be just selected once by config RISC-V.
> And it was done so because iommu_has_feature() isn't implemented now as IOMMU
> isn't supported now and depends on CONFIG_HAS_PASSTHROUGH.

If the selection isn't going to be conditional, then I see no reason to have
such conditionals in RISC-V-specific code. The piece of code presently inside
that #ifdef may simply need adding later, once there's enough infrastructure
to allow that code to compile. Or maybe it would even compile fine already now?

>>> +    /*
>>> +     * Some IOMMUs don't support coherent PT walk. When the p2m is
>>> +     * shared with the CPU, Xen has to make sure that the PT changes have
>>> +     * reached the memory
>>> +     */
>>> +    p2m->clean_pte = is_iommu_enabled(d) &&
>>> +        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);
>> The comment talks about shared page tables, yet you don't check whether
>> page table sharing is actually enabled for the domain.
> 
> Do we have such function/macros?

We have iommu_hap_pt_share, and we have the per-domain hap_pt_share flag.

> It is shared by implementation now.

I don't understand. There's no IOMMU support yet for RISC-V. Hence it's in
neither state - not shared, but also not not shared.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-25 15:53       ` Jan Beulich
@ 2025-06-26  8:40         ` Oleksii Kurochko
  2025-06-26 11:01           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-26  8:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5841 bytes --]


On 6/25/25 5:53 PM, Jan Beulich wrote:
> On 25.06.2025 17:31, Oleksii Kurochko wrote:
>> On 6/18/25 6:08 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> @@ -14,6 +18,29 @@
>>>>    
>>>>    /* Per-p2m-table state */
>>>>    struct p2m_domain {
>>>> +    /*
>>>> +     * Lock that protects updates to the p2m.
>>>> +     */
>>>> +    rwlock_t lock;
>>>> +
>>>> +    /* Pages used to construct the p2m */
>>>> +    struct page_list_head pages;
>>>> +
>>>> +    /* Indicate if it is required to clean the cache when writing an entry */
>>>> +    bool clean_pte;
>>>> +
>>>> +    struct radix_tree_root p2m_type;
>>> A field with a p2m_ prefix in a p2m struct?
>> p2m_ prefix could be really dropped.
>>
>>>    And is this tree really about
>>> just a single "type"?
>> Yes, we don't have enough bits in PTE so we need some extra storage to store type.
> My question wasn't about that, though. My question was whether in the name
> "type" (singular) is appropriate. I didn't think you need a tree to store just
> a single type.

I need a tree to store pairs of <gfn, p2m_type>, where gfn is the index. And it seems
to me a tree is a good structure for fast insert/search.
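
As a rough illustration of that gfn-indexed side storage, here is a minimal
sketch. It deliberately does not use Xen's radix tree API: struct type_map,
type_map_set() and type_map_get() are hypothetical, a fixed two-level table
stands in for the tree, and type value 0 is assumed to mean "no entry".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LEAF_BITS 9
#define LEAF_SIZE (1u << LEAF_BITS)
#define ROOT_SIZE (1u << 9) /* this sketch covers only 18 bits of GFN */

/* Hypothetical side table mapping a GFN to a small type value. */
struct type_map {
    uint8_t *leaf[ROOT_SIZE]; /* leaves are allocated on first use */
};

/* Store a type for gfn; returns 0 on success, -1 on range or memory error. */
static int type_map_set(struct type_map *m, uint64_t gfn, uint8_t type)
{
    uint64_t root = gfn >> LEAF_BITS;

    if ( root >= ROOT_SIZE )
        return -1;

    if ( !m->leaf[root] )
    {
        m->leaf[root] = calloc(LEAF_SIZE, sizeof(uint8_t));
        if ( !m->leaf[root] )
            return -1;
    }

    m->leaf[root][gfn & (LEAF_SIZE - 1)] = type;

    return 0;
}

/* Look up the type for gfn; 0 means nothing was stored. */
static uint8_t type_map_get(const struct type_map *m, uint64_t gfn)
{
    uint64_t root = gfn >> LEAF_BITS;

    if ( root >= ROOT_SIZE || !m->leaf[root] )
        return 0;

    return m->leaf[root][gfn & (LEAF_SIZE - 1)];
}
```

The point of the sketch is only that the structure holds one type per GFN,
i.e. many types overall, which is what the naming question is about.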

>
>>>> +    /*
>>>> +     * Default P2M access type for each page in the the domain: new pages,
>>>> +     * swapped in pages, cleared pages, and pages that are ambiguously
>>>> +     * retyped get this access type.  See definition of p2m_access_t.
>>>> +     */
>>>> +    p2m_access_t default_access;
>>>> +
>>>> +    /* Back pointer to domain */
>>>> +    struct domain *domain;
>>> This you may want to introduce earlier, to prefer passing around struct
>>> p2m_domain * in / to P2M functions (which would benefit earlier patches
>>> already, I think).
>> But nothing uses it earlier.
> If you do as suggested and pass around struct p2m_domain * for p2m_*()
> functions, you'll quickly find it used, I think.
>
>>>> --- a/xen/arch/riscv/p2m.c
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -1,13 +1,46 @@
>>>>    #include <xen/bitops.h>
>>>> +#include <xen/domain_page.h>
>>>>    #include <xen/event.h>
>>>> +#include <xen/iommu.h>
>>>>    #include <xen/lib.h>
>>>> +#include <xen/mm.h>
>>>> +#include <xen/pfn.h>
>>>> +#include <xen/rwlock.h>
>>>>    #include <xen/sched.h>
>>>>    #include <xen/spinlock.h>
>>>>    #include <xen/xvmalloc.h>
>>>>    
>>>> +#include <asm/page.h>
>>>>    #include <asm/p2m.h>
>>>>    #include <asm/sbi.h>
>>>>    
>>>> +/*
>>>> + * Force a synchronous P2M TLB flush.
>>>> + *
>>>> + * Must be called with the p2m lock held.
>>>> + */
>>>> +static void p2m_force_tlb_flush_sync(struct p2m_domain *p2m)
>>>> +{
>>>> +    struct domain *d = p2m->domain;
>>>> +
>>>> +    ASSERT(p2m_is_write_locked(p2m));
>>>> +
>>>> +    sbi_remote_hfence_gvma_vmid(d->dirty_cpumask, 0, 0, p2m->vmid);
>>>> +}
>>>> +
>>>> +/* Unlock the flush and do a P2M TLB flush if necessary */
>>>> +void p2m_write_unlock(struct p2m_domain *p2m)
>>>> +{
>>>> +    /*
>>>> +     * The final flush is done with the P2M write lock taken to avoid
>>>> +     * someone else modifying the P2M wbefore the TLB invalidation has
>>>> +     * completed.
>>>> +     */
>>>> +    p2m_force_tlb_flush_sync(p2m);
>>> The comment ahead of the function says "if necessary". Yet there's no
>>> conditional here. I also question the need for a global flush in all
>>> cases.
>> Stale comment.
>>
>> But if p2m page table was modified that it is needed to do a flush for CPUs
>> in d->dirty_cpumask.
> Right, but is that true for each and every case where you acquire the
> lock in write mode? There may e.g. be early-out path which end up doing
> nothing, yet you would then still flush the TLB.

Initially, I assumed that the early-out path would be taken mostly in cases where
some error happens, so it would be okay to flush the TLB each time.

But, yes, I missed some cases where it will end up doing nothing. I will bring
back need_flush.
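
Roughly what I have in mind, as a simplified sketch rather than the final
code: struct p2m_sketch and the sketch_*() helpers are hypothetical, a bool
models the write lock, and a counter stands in for the real remote fence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the p2m state discussed above. */
struct p2m_sketch {
    bool write_locked;        /* models the rwlock being held for writing */
    bool need_flush;          /* set when an entry was actually changed */
    unsigned int flush_count; /* counts flushes, for illustration only */
};

static void sketch_write_lock(struct p2m_sketch *p2m)
{
    assert(!p2m->write_locked);
    p2m->write_locked = true;
}

/* Record a modification; the flush itself is deferred until unlock. */
static void sketch_set_entry(struct p2m_sketch *p2m)
{
    assert(p2m->write_locked);
    p2m->need_flush = true;
}

/* Flush only if something changed, and do it before dropping the lock. */
static void sketch_write_unlock(struct p2m_sketch *p2m)
{
    assert(p2m->write_locked);

    if ( p2m->need_flush )
    {
        p2m->flush_count++; /* a real implementation would issue the fence here */
        p2m->need_flush = false;
    }

    p2m->write_locked = false;
}
```

An early-out path that takes the lock but changes nothing then unlocks
without any TLB flush.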

>
>>>> @@ -109,8 +142,33 @@ int p2m_init(struct domain *d)
>>>>        spin_lock_init(&d->arch.paging.lock);
>>>>        INIT_PAGE_LIST_HEAD(&d->arch.paging.p2m_freelist);
>>>>    
>>>> +    rwlock_init(&p2m->lock);
>>>> +    INIT_PAGE_LIST_HEAD(&p2m->pages);
>>>> +
>>>>        p2m->vmid = INVALID_VMID;
>>>>    
>>>> +    p2m->default_access = p2m_access_rwx;
>>>> +
>>>> +    radix_tree_init(&p2m->p2m_type);
>>>> +
>>>> +#ifdef CONFIG_HAS_PASSTHROUGH
>>> Do you expect this to be conditionally selected on RISC-V?
>> No, once it will be implemented it will be just selected once by config RISC-V.
>> And it was done so because iommu_has_feature() isn't implemented now as IOMMU
>> isn't supported now and depends on CONFIG_HAS_PASSTHROUGH.
> If the selection isn't going to be conditional, then I see no reason to have
> such conditionals in RISC-V-specific code. The piece of code presently inside
> that #ifdef may simply need adding later, once there's enough infrastructure
> to allow that code to compile. Or maybe it would even compile fine already now?

I haven't tried. Anyway, I get your point.

>
>>>> +    /*
>>>> +     * Some IOMMUs don't support coherent PT walk. When the p2m is
>>>> +     * shared with the CPU, Xen has to make sure that the PT changes have
>>>> +     * reached the memory
>>>> +     */
>>>> +    p2m->clean_pte = is_iommu_enabled(d) &&
>>>> +        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);
>>> The comment talks about shared page tables, yet you don't check whether
>>> page table sharing is actually enabled for the domain.
>> Do we have such function/macros?
> We have iommu_hap_pt_share, and we have the per-domain hap_pt_share flag.
>
>> It is shared by implementation now.
> I don't understand. There's no IOMMU support yet for RISC-V. Hence it's in
> neither state - not shared, but also not not shared.

Downstream, there is IOMMU support for RISC-V.

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 8586 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-24 14:01           ` Jan Beulich
  2025-06-24 15:32             ` Oleksii Kurochko
@ 2025-06-26 10:05             ` Oleksii Kurochko
  2025-06-26 10:41               ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-26 10:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 4100 bytes --]


On 6/24/25 4:01 PM, Jan Beulich wrote:
> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> --- /dev/null
>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>> @@ -0,0 +1,115 @@
>>>>>> +#include <xen/bitops.h>
>>>>>> +#include <xen/lib.h>
>>>>>> +#include <xen/sched.h>
>>>>>> +#include <xen/spinlock.h>
>>>>>> +#include <xen/xvmalloc.h>
>>>>>> +
>>>>>> +#include <asm/p2m.h>
>>>>>> +#include <asm/sbi.h>
>>>>>> +
>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>> +
>>>>>> +/*
>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>>> + * concurrent domains.
>>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>>> not per-vCPU).
>>>> Good point.
>>>>
>>>> I don't believe anyone will use RV32.
>>>> For RV64, the available ID space seems sufficiently large.
>>>>
>>>> However, if it turns out that the value isn't large enough even for RV64,
>>>> I can rework it to manage IDs per physical CPU.
>>>> Wouldn't that approach result in more TLB entries being flushed compared
>>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>>> Depends on the condition for when to flush. Of course performance is
>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>> Nevertheless, as indicated before, the model used on x86 may be a
>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>> core (and vendor-independent) part of it.
>> IIUC, so basically it is just a round-robin and when VMIDs are ran out
>> then just do full guest TLB flush and start to re-use VMIDs from the start.
>> It makes sense to me, I'll implement something similar. (as I'm not really
>> sure that we needdata->core_asid_generation, probably, I will understand it better when
>> start to implement it)
> Well. The fewer VMID bits you have the more quickly you will need a new
> generation. And keep track of the generation you're at you also need to
> track the present number somewhere.
>
>>>> What about then to allocate VMID per-domain?
>>> That's what you're doing right now, isn't it? And that gets problematic when
>>> you have only very few bits in hgatp.VMID, as mentioned below.
>> Right, I just phrased my question poorly—sorry about that.
>>
>> What I meant to ask is: does the approach described above actually depend on whether
>> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
>> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
>> since it's more likely that a platform will have more than|VMID_MAX| domains than
>> |VMID_MAX| physical CPUs—am I right?
> Seeing that there can be systems with hundreds or even thousands of CPUs,
> I don't think I can agree here. Plus per-pCPU allocation would similarly
> get you in trouble when you have only very few VMID bits.

But not as quickly as with per-domain allocation, right?

I mean that if we have only 4 bits, then with per-domain allocation we will
need to do a TLB flush + VMID re-assignment once we have more than 16 domains.

But with per-pCPU allocation we could run 16 domains on 1 pCPU, and at the same
time multiprocessor systems have more pCPUs, which will allow us to run more
domains and avoid TLB flushes.
On the other hand, we need to consider that a domain is unlikely to have
only one vCPU. And the number of vCPUs is likely to be bigger than the number
of domains, so a round-robin approach (as on x86) without permanent ID allocation
for each domain will work better than per-pCPU allocation.
In other words, I'm not 100% sure I get the point of why x86 chose per-pCPU allocation
instead of per-domain allocation with the same VMID for all vCPUs of a domain.

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 5735 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 10:05             ` Oleksii Kurochko
@ 2025-06-26 10:41               ` Jan Beulich
  2025-06-26 11:34                 ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-26 10:41 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 26.06.2025 12:05, Oleksii Kurochko wrote:
> 
> On 6/24/25 4:01 PM, Jan Beulich wrote:
>> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> --- /dev/null
>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>> @@ -0,0 +1,115 @@
>>>>>>> +#include <xen/bitops.h>
>>>>>>> +#include <xen/lib.h>
>>>>>>> +#include <xen/sched.h>
>>>>>>> +#include <xen/spinlock.h>
>>>>>>> +#include <xen/xvmalloc.h>
>>>>>>> +
>>>>>>> +#include <asm/p2m.h>
>>>>>>> +#include <asm/sbi.h>
>>>>>>> +
>>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>>> +
>>>>>>> +/*
>>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>>>> + * concurrent domains.
>>>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>>>> not per-vCPU).
>>>>> Good point.
>>>>>
>>>>> I don't believe anyone will use RV32.
>>>>> For RV64, the available ID space seems sufficiently large.
>>>>>
>>>>> However, if it turns out that the value isn't large enough even for RV64,
>>>>> I can rework it to manage IDs per physical CPU.
>>>>> Wouldn't that approach result in more TLB entries being flushed compared
>>>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>>>> Depends on the condition for when to flush. Of course performance is
>>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>>> Nevertheless, as indicated before, the model used on x86 may be a
>>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>>> core (and vendor-independent) part of it.
>>> IIUC, so basically it is just a round-robin and when VMIDs are ran out
>>> then just do full guest TLB flush and start to re-use VMIDs from the start.
>>> It makes sense to me, I'll implement something similar. (as I'm not really
>>> sure that we needdata->core_asid_generation, probably, I will understand it better when
>>> start to implement it)
>> Well. The fewer VMID bits you have the more quickly you will need a new
>> generation. And keep track of the generation you're at you also need to
>> track the present number somewhere.
>>
>>>>> What about then to allocate VMID per-domain?
>>>> That's what you're doing right now, isn't it? And that gets problematic when
>>>> you have only very few bits in hgatp.VMID, as mentioned below.
>>> Right, I just phrased my question poorly—sorry about that.
>>>
>>> What I meant to ask is: does the approach described above actually depend on whether
>>> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
>>> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
>>> since it's more likely that a platform will have more than|VMID_MAX| domains than
>>> |VMID_MAX| physical CPUs—am I right?
>> Seeing that there can be systems with hundreds or even thousands of CPUs,
>> I don't think I can agree here. Plus per-pCPU allocation would similarly
>> get you in trouble when you have only very few VMID bits.
> 
> But not so fast as in case of per-domain allocation, right?
> 
> I mean that if we have only 4 bits, then in case of per-domain allocation we will
> need to do TLB flush + VMID re-assigning when we have more then 16 domains.
> 
> But in case of per-pCPU allocation we could run 16 domains on 1 pCPU and at the same
> time in multiprocessor systems we have more pCPUs, which will allow us to run more
> domains and avoid TLB flushes.
> On other hand, it is needed to consider that it's unlikely that a domain will have
> only one vCPU. And it is likely that amount of vCPUs will be bigger then an amount
> of domains, so to have a round-robin approach (as x86) without permanent ID allocation
> for each domain will work better then per-pCPU allocation.

Here you (appear to) say one thing, ...

> In other words, I'm not 100% sure that I get a point why x86 chose per-pCPU allocation
> instead of per-domain allocation with having the same VMID for all vCPUs of domains.

... and then here the opposite. Overall I'm in severe trouble understanding this
reply of yours as a whole, so I fear I can't really respond to it (or even just
parts thereof).

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-26  8:40         ` Oleksii Kurochko
@ 2025-06-26 11:01           ` Jan Beulich
  2025-06-26 11:55             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-26 11:01 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 26.06.2025 10:40, Oleksii Kurochko wrote:
> On 6/25/25 5:53 PM, Jan Beulich wrote:
>> On 25.06.2025 17:31, Oleksii Kurochko wrote:
>>> On 6/18/25 6:08 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> @@ -14,6 +18,29 @@
>>>>>    
>>>>>    /* Per-p2m-table state */
>>>>>    struct p2m_domain {
>>>>> +    /*
>>>>> +     * Lock that protects updates to the p2m.
>>>>> +     */
>>>>> +    rwlock_t lock;
>>>>> +
>>>>> +    /* Pages used to construct the p2m */
>>>>> +    struct page_list_head pages;
>>>>> +
>>>>> +    /* Indicate if it is required to clean the cache when writing an entry */
>>>>> +    bool clean_pte;
>>>>> +
>>>>> +    struct radix_tree_root p2m_type;
>>>> A field with a p2m_ prefix in a p2m struct?
>>> p2m_ prefix could be really dropped.
>>>
>>>>    And is this tree really about
>>>> just a single "type"?
>>> Yes, we don't have enough bits in PTE so we need some extra storage to store type.
>> My question wasn't about that, though. My question was whether in the name
>> "type" (singular) is appropriate. I didn't think you need a tree to store just
>> a single type.
> 
> I need tree to store a pair of <gfn, p2m_type>, where gfn is an index. And it seems
> to me a tree is a good structure for fast insert/search.

Hmm, I'm increasingly puzzled. I tried to emphasize that my question was towards
the singular "type" in the variable name. I can't see any relationship between
that and your reply. (And yes, using a tree here may be appropriate. There is a
concern towards memory consumption, but that's a separate topic.)

Having said that, aiui you don't use the two RSW bits in the PTE. Do you have
any plans there? If not, can't they be used to at least represent the most
commonly used types, such that the number of entries in that tree can be kept
(relatively) low?

>>>>> +    /*
>>>>> +     * Some IOMMUs don't support coherent PT walk. When the p2m is
>>>>> +     * shared with the CPU, Xen has to make sure that the PT changes have
>>>>> +     * reached the memory
>>>>> +     */
>>>>> +    p2m->clean_pte = is_iommu_enabled(d) &&
>>>>> +        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);
>>>> The comment talks about shared page tables, yet you don't check whether
>>>> page table sharing is actually enabled for the domain.
>>> Do we have such function/macros?
>> We have iommu_hap_pt_share, and we have the per-domain hap_pt_share flag.
>>
>>> It is shared by implementation now.
>> I don't understand. There's no IOMMU support yet for RISC-V. Hence it's in
>> neither state - not shared, but also not not shared.
> 
> In downstream there is a support of IOMMU for RISC-V.

And there page tables are unconditionally shared? I'll be surprised if no
want/need for non-shared page tables would ever appear.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 10:41               ` Jan Beulich
@ 2025-06-26 11:34                 ` Oleksii Kurochko
  2025-06-26 11:43                   ` Juergen Gross
  2025-06-26 12:16                   ` Jan Beulich
  0 siblings, 2 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-26 11:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 6378 bytes --]


On 6/26/25 12:41 PM, Jan Beulich wrote:
> On 26.06.2025 12:05, Oleksii Kurochko wrote:
>> On 6/24/25 4:01 PM, Jan Beulich wrote:
>>> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>>>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>> @@ -0,0 +1,115 @@
>>>>>>>> +#include <xen/bitops.h>
>>>>>>>> +#include <xen/lib.h>
>>>>>>>> +#include <xen/sched.h>
>>>>>>>> +#include <xen/spinlock.h>
>>>>>>>> +#include <xen/xvmalloc.h>
>>>>>>>> +
>>>>>>>> +#include <asm/p2m.h>
>>>>>>>> +#include <asm/sbi.h>
>>>>>>>> +
>>>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>>>>> + * concurrent domains.
>>>>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>>>>> not per-vCPU).
>>>>>> Good point.
>>>>>>
>>>>>> I don't believe anyone will use RV32.
>>>>>> For RV64, the available ID space seems sufficiently large.
>>>>>>
>>>>>> However, if it turns out that the value isn't large enough even for RV64,
>>>>>> I can rework it to manage IDs per physical CPU.
>>>>>> Wouldn't that approach result in more TLB entries being flushed compared
>>>>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>>>>> Depends on the condition for when to flush. Of course performance is
>>>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>>>> Nevertheless, as indicated before, the model used on x86 may be a
>>>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>>>> core (and vendor-independent) part of it.
>>>> IIUC, so basically it is just a round-robin and when VMIDs are ran out
>>>> then just do full guest TLB flush and start to re-use VMIDs from the start.
>>>> It makes sense to me, I'll implement something similar. (as I'm not really
>>>> sure that we needdata->core_asid_generation, probably, I will understand it better when
>>>> start to implement it)
>>> Well. The fewer VMID bits you have the more quickly you will need a new
>>> generation. And keep track of the generation you're at you also need to
>>> track the present number somewhere.
>>>
>>>>>> What about then to allocate VMID per-domain?
>>>>> That's what you're doing right now, isn't it? And that gets problematic when
>>>>> you have only very few bits in hgatp.VMID, as mentioned below.
>>>> Right, I just phrased my question poorly—sorry about that.
>>>>
>>>> What I meant to ask is: does the approach described above actually depend on whether
>>>> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
>>>> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
>>>> since it's more likely that a platform will have more than|VMID_MAX| domains than
>>>> |VMID_MAX| physical CPUs—am I right?
>>> Seeing that there can be systems with hundreds or even thousands of CPUs,
>>> I don't think I can agree here. Plus per-pCPU allocation would similarly
>>> get you in trouble when you have only very few VMID bits.
>> But not so fast as in case of per-domain allocation, right?
>>
>> I mean that if we have only 4 bits, then in case of per-domain allocation we will
>> need to do TLB flush + VMID re-assigning when we have more then 16 domains.
>>
>> But in case of per-pCPU allocation we could run 16 domains on 1 pCPU and at the same
>> time in multiprocessor systems we have more pCPUs, which will allow us to run more
>> domains and avoid TLB flushes.
>> On other hand, it is needed to consider that it's unlikely that a domain will have
>> only one vCPU. And it is likely that amount of vCPUs will be bigger then an amount
>> of domains, so to have a round-robin approach (as x86) without permanent ID allocation
>> for each domain will work better then per-pCPU allocation.
> Here you (appear to) say one thing, ...
>
>> In other words, I'm not 100% sure that I get a point why x86 chose per-pCPU allocation
>> instead of per-domain allocation with having the same VMID for all vCPUs of domains.
> ... and then here the opposite. Overall I'm in severe trouble understanding this
> reply of yours as a whole, so I fear I can't really respond to it (or even just
> parts thereof).

IIUC, x86 allocates VMIDs per physical CPU (pCPU) "dynamically" — these are just
sequential numbers, and once VMIDs run out on a given pCPU, there's no guarantee
that a vCPU will receive the same VMID again.

On the other hand, RISC-V currently allocates a single VMID per domain, and that
VMID is considered "permanent" until the domain is destroyed. This means we are
limited to at most VMID_MAX domains. To avoid this limitation, I plan to implement
a round-robin reuse approach: when no free VMIDs remain, we start a new generation
and begin reusing old VMIDs.
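
A rough sketch of such generation-based reuse, loosely following the idea
behind x86's hvm_asid_handle_vmenter() rather than copying it. The struct and
function names below are hypothetical, the allocator state is shown per pCPU
purely for illustration, and keeping VMID 0 out of the pool is an assumption
of this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocator state (one instance per pCPU in this sketch). */
struct vmid_pcpu {
    uint64_t generation;   /* current generation */
    uint16_t next_vmid;    /* next VMID to hand out */
    uint16_t max_vmid;     /* highest usable VMID, (1 << VMIDLEN) - 1 */
};

/* Hypothetical per-vCPU record of the VMID it was last given. */
struct vmid_vcpu {
    uint64_t generation;   /* generation the VMID belongs to; 0 = none yet */
    uint16_t vmid;
};

static void vmid_pcpu_init(struct vmid_pcpu *c, unsigned int vmid_bits)
{
    c->generation = 1;     /* so zero-initialised vCPU records never match */
    c->next_vmid = 1;      /* VMID 0 is kept out of the pool in this sketch */
    c->max_vmid = (uint16_t)((1u << vmid_bits) - 1);
}

/*
 * Called before entering the guest. Returns true when the caller has to
 * flush the guest TLB entries because a new generation was started.
 */
static bool vmid_handle_enter(struct vmid_pcpu *c, struct vmid_vcpu *v)
{
    bool flush = false;

    /* The VMID is still valid if it was handed out in this generation. */
    if ( v->generation == c->generation )
        return false;

    /* Pool exhausted: start a new generation and reuse VMIDs from 1. */
    if ( c->next_vmid > c->max_vmid )
    {
        c->generation++;
        c->next_vmid = 1;
        flush = true;
    }

    v->vmid = c->next_vmid++;
    v->generation = c->generation;

    return flush;
}
```

The generation counter is what makes stale VMIDs detectable: a vCPU whose
recorded generation differs from the current one simply gets a fresh VMID
the next time it is entered.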

The only remaining design question is whether we want RISC-V to follow a global
VMID allocation policy (i.e., one VMID per domain, shared across all of its vCPUs),
or adopt a policy similar to x86 with per-CPU VMID allocation (each vCPU gets its
own VMID, local to the CPU it's running on).

Each policy has its own trade-offs. But in the case where the number of available
VMIDs is small (i.e., low VMIDLEN), a global allocation policy may be more suitable,
as it requires fewer VMIDs overall.

So my main question was:
What are the advantages of per-pCPU VMID allocation in scenarios with limited VMID
space, and why did x86 choose that design?

From what I can tell, the benefits of per-pCPU VMID allocation include:
- Minimized inter-CPU TLB flushes — since VMIDs are local, TLB entries don’t need
   to be invalidated on other CPUs when reused.
- Better scalability — this approach works better on systems with a large number
   of CPUs.
- Frequent VM switches don’t require global TLB flushes — reducing the overhead
   of context switching.
However, the downside is that this model consumes more VMIDs. For example,
if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4 VMIDs instead
of just one.

~ Oleksii

[-- Attachment #2: Type: text/html, Size: 8107 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 11:34                 ` Oleksii Kurochko
@ 2025-06-26 11:43                   ` Juergen Gross
  2025-06-26 12:05                     ` Oleksii Kurochko
  2025-06-26 12:17                     ` Teddy Astie
  2025-06-26 12:16                   ` Jan Beulich
  1 sibling, 2 replies; 161+ messages in thread
From: Juergen Gross @ 2025-06-26 11:43 UTC (permalink / raw)
  To: Oleksii Kurochko, Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 7001 bytes --]

On 26.06.25 13:34, Oleksii Kurochko wrote:
> 
> On 6/26/25 12:41 PM, Jan Beulich wrote:
>> On 26.06.2025 12:05, Oleksii Kurochko wrote:
>>> On 6/24/25 4:01 PM, Jan Beulich wrote:
>>>> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>>>>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>>>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>> @@ -0,0 +1,115 @@
>>>>>>>>> +#include <xen/bitops.h>
>>>>>>>>> +#include <xen/lib.h>
>>>>>>>>> +#include <xen/sched.h>
>>>>>>>>> +#include <xen/spinlock.h>
>>>>>>>>> +#include <xen/xvmalloc.h>
>>>>>>>>> +
>>>>>>>>> +#include <asm/p2m.h>
>>>>>>>>> +#include <asm/sbi.h>
>>>>>>>>> +
>>>>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>>>>> +
>>>>>>>>> +/*
>>>>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>>>>>> + * concurrent domains.
>>>>>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>>>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>>>>>> not per-vCPU).
>>>>>>> Good point.
>>>>>>>
>>>>>>> I don't believe anyone will use RV32.
>>>>>>> For RV64, the available ID space seems sufficiently large.
>>>>>>>
>>>>>>> However, if it turns out that the value isn't large enough even for RV64,
>>>>>>> I can rework it to manage IDs per physical CPU.
>>>>>>> Wouldn't that approach result in more TLB entries being flushed compared
>>>>>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>>>>>> Depends on the condition for when to flush. Of course performance is
>>>>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>>>>> Nevertheless, as indicated before, the model used on x86 may be a
>>>>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>>>>> core (and vendor-independent) part of it.
>>>>> IIUC, so basically it is just a round-robin and when VMIDs are ran out
>>>>> then just do full guest TLB flush and start to re-use VMIDs from the start.
>>>>> It makes sense to me, I'll implement something similar. (as I'm not really
>>>>> sure that we needdata->core_asid_generation, probably, I will understand it better when
>>>>> start to implement it)
>>>> Well. The fewer VMID bits you have the more quickly you will need a new
>>>> generation. And keep track of the generation you're at you also need to
>>>> track the present number somewhere.
>>>>
>>>>>>> What about then to allocate VMID per-domain?
>>>>>> That's what you're doing right now, isn't it? And that gets problematic when
>>>>>> you have only very few bits in hgatp.VMID, as mentioned below.
>>>>> Right, I just phrased my question poorly—sorry about that.
>>>>>
>>>>> What I meant to ask is: does the approach described above actually depend on whether
>>>>> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
>>>>> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
>>>>> since it's more likely that a platform will have more than|VMID_MAX| domains than
>>>>> |VMID_MAX| physical CPUs—am I right?
>>>> Seeing that there can be systems with hundreds or even thousands of CPUs,
>>>> I don't think I can agree here. Plus per-pCPU allocation would similarly
>>>> get you in trouble when you have only very few VMID bits.
>>> But not so fast as in case of per-domain allocation, right?
>>>
>>> I mean that if we have only 4 bits, then in case of per-domain allocation we will
>>> need to do TLB flush + VMID re-assigning when we have more then 16 domains.
>>>
>>> But in case of per-pCPU allocation we could run 16 domains on 1 pCPU and at the same
>>> time in multiprocessor systems we have more pCPUs, which will allow us to run more
>>> domains and avoid TLB flushes.
>>> On other hand, it is needed to consider that it's unlikely that a domain will have
>>> only one vCPU. And it is likely that amount of vCPUs will be bigger then an amount
>>> of domains, so to have a round-robin approach (as x86) without permanent ID allocation
>>> for each domain will work better then per-pCPU allocation.
>> Here you (appear to) say one thing, ...
>>
>>> In other words, I'm not 100% sure that I get a point why x86 chose per-pCPU allocation
>>> instead of per-domain allocation with having the same VMID for all vCPUs of domains.
>> ... and then here the opposite. Overall I'm in severe trouble understanding this
>> reply of yours as a whole, so I fear I can't really respond to it (or even just
>> parts thereof).
> 
> IIUC, x86 allocates VMIDs per physical CPU (pCPU) "dynamically" — these are just
> sequential numbers, and once VMIDs run out on a given pCPU, there's no guarantee
> that a vCPU will receive the same VMID again.
> 
> On the other hand, RISC-V currently allocates a single VMID per domain, and that
> VMID is considered "permanent" until the domain is destroyed. This means we are
> limited to at most VMID_MAX domains. To avoid this limitation, I plan to implement
> a round-robin reuse approach: when no free VMIDs remain, we start a new generation
> and begin reusing old VMIDs.
> 
> The only remaining design question is whether we want RISC-V to follow a global
> VMID allocation policy (i.e., one VMID per domain, shared across all of its vCPUs),
> or adopt a policy similar to x86 with per-CPU VMID allocation (each vCPU gets its
> own VMID, local to the CPU it's running on).
> 
> Each policy has its own trade-offs. But in the case where the number of available
> VMIDs is small (i.e., low VMIDLEN), a global allocation policy may be more suitable,
> as it requires fewer VMIDs overall.
> 
> So my main question was:
> What are the advantages of per-pCPU VMID allocation in scenarios with limited VMID
> space, and why did x86 choose that design?
> 
> From what I can tell, the benefits of per-pCPU VMID allocation include:
> - Minimized inter-CPU TLB flushes — since VMIDs are local, TLB entries don’t need
>    to be invalidated on other CPUs when reused.
> - Better scalability — this approach works better on systems with a large number
>    of CPUs.
> - Frequent VM switches don’t require global TLB flushes — reducing the overhead
>    of context switching.
> However, the downside is that this model consumes more VMIDs. For example,
> if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4 VMIDs instead
> of just one.

Consider you have 4 bits for VMIDs, resulting in 16 VMID values.

If you have a system with 32 physical CPUs and 32 domains with 1 vCPU each
on that system, your scheme would NOT allow keeping every physical CPU busy
by running a domain on it, as only 16 domains could be active at the same
time.
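
The arithmetic can be written down as a small sketch. The function names are
made up for illustration, and the model assumes all 2^bits VMID values are
usable and that every domain has the same number of vCPUs:

```c
/*
 * Global policy: each running domain needs its own system-wide VMID, so at
 * most 2^vmid_bits domains can be active at the same time.
 */
static unsigned int busy_pcpus_global(unsigned int vmid_bits,
                                      unsigned int pcpus,
                                      unsigned int domains,
                                      unsigned int vcpus_per_domain)
{
    unsigned int vmids = 1u << vmid_bits;
    unsigned int active = domains < vmids ? domains : vmids;
    unsigned int runnable = active * vcpus_per_domain;

    return runnable < pcpus ? runnable : pcpus;
}

/*
 * Per-pCPU policy: every pCPU has its own VMID space and only needs one
 * VMID for the vCPU it is running right now, so VMIDs do not limit how
 * many pCPUs can be busy.
 */
static unsigned int busy_pcpus_per_pcpu(unsigned int pcpus,
                                        unsigned int domains,
                                        unsigned int vcpus_per_domain)
{
    unsigned int runnable = domains * vcpus_per_domain;

    return runnable < pcpus ? runnable : pcpus;
}
```

For 4 VMID bits, 32 pCPUs and 32 single-vCPU domains the global policy keeps
16 pCPUs busy, while the per-pCPU policy can keep all 32 busy.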


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization
  2025-06-26 11:01           ` Jan Beulich
@ 2025-06-26 11:55             ` Oleksii Kurochko
  0 siblings, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-26 11:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 3799 bytes --]


On 6/26/25 1:01 PM, Jan Beulich wrote:
> On 26.06.2025 10:40, Oleksii Kurochko wrote:
>> On 6/25/25 5:53 PM, Jan Beulich wrote:
>>> On 25.06.2025 17:31, Oleksii Kurochko wrote:
>>>> On 6/18/25 6:08 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> @@ -14,6 +18,29 @@
>>>>>>     
>>>>>>     /* Per-p2m-table state */
>>>>>>     struct p2m_domain {
>>>>>> +    /*
>>>>>> +     * Lock that protects updates to the p2m.
>>>>>> +     */
>>>>>> +    rwlock_t lock;
>>>>>> +
>>>>>> +    /* Pages used to construct the p2m */
>>>>>> +    struct page_list_head pages;
>>>>>> +
>>>>>> +    /* Indicate if it is required to clean the cache when writing an entry */
>>>>>> +    bool clean_pte;
>>>>>> +
>>>>>> +    struct radix_tree_root p2m_type;
>>>>> A field with a p2m_ prefix in a p2m struct?
>>>> p2m_ prefix could be really dropped.
>>>>
>>>>>     And is this tree really about
>>>>> just a single "type"?
>>>> Yes, we don't have enough bits in PTE so we need some extra storage to store type.
>>> My question wasn't about that, though. My question was whether in the name
>>> "type" (singular) is appropriate. I didn't think you need a tree to store just
>>> a single type.
>> I need tree to store a pair of <gfn, p2m_type>, where gfn is an index. And it seems
>> to me a tree is a good structure for fast insert/search.
> Hmm, I'm increasingly puzzled. I tried to emphasize that my question was towards
> the singular "type" in the variable name. I can't see any relationship between
> that and your reply. (And yes, using a tree here may be appropriate. There is a
> concern towards memory consumption, but that's a separate topic.)

Oh, I got your initial intention. For sure, it should be "types".

>
> Having said that, aiui you don't use the two RSW bits in the PTE. Do you have
> any plans there? If not, can't they be used to at least represent the most
> commonly used types, such that the number of entries in that tree can be kept
> (relatively) low?

It could be really an option for optimization.

In this case I have to p2m_type_t by adding a new type p2m_tree_type:
typedef enum {
     p2m_invalid = 0,    /* Nothing mapped here */
     p2m_ram_rw,         /* Normal read/write domain RAM */
     p2m_ram_ro,         /* Read-only */
     
     + p2m_tree_type,    /* The types below p2m_tree_type will be stored outside PTE's bits */

     p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
     p2m_grant_map_rw,   /* Read/write grant mapping */
     p2m_grant_map_ro,   /* Read-only grant mapping */
} p2m_type_t;

It probably makes sense to swap p2m_ram_ro and p2m_mmio_direct_dev, since I think
device mappings are the more common operation.
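
A minimal sketch of the idea above, assuming the swapped ordering and assuming
the type is kept in the two RSW bits (bits 9:8 of a RISC-V PTE). The helper
names pte_set_type()/pte_get_type() and the side_table[] array are hypothetical;
the array only stands in for the radix tree and is not how the series stores
types:

```c
#include <assert.h>
#include <stdint.h>

/* RSW (reserved for software) field of a RISC-V PTE: bits 9:8. */
#define PTE_RSW_SHIFT 8
#define PTE_RSW_MASK  (UINT64_C(3) << PTE_RSW_SHIFT)

/* Assumed ordering: values below p2m_tree_type fit the 2-bit RSW field. */
typedef enum {
    p2m_invalid = 0,      /* Nothing mapped here */
    p2m_ram_rw,           /* Normal read/write domain RAM */
    p2m_mmio_direct_dev,  /* Device MMIO, assumed to be the common case */
    p2m_tree_type,        /* Marker: the real type lives outside the PTE */
    p2m_ram_ro,
    p2m_grant_map_rw,
    p2m_grant_map_ro,
} p2m_type_t;

/* Hypothetical stand-in for the radix tree, indexed by GFN. */
#define NR_GFNS 16
static p2m_type_t side_table[NR_GFNS];

/* Store the type in the RSW bits if it fits, otherwise in the side table. */
static uint64_t pte_set_type(uint64_t pte, unsigned long gfn, p2m_type_t t)
{
    assert(gfn < NR_GFNS);
    assert(t != p2m_tree_type);   /* the marker is not a real type */

    pte &= ~PTE_RSW_MASK;
    if ( t < p2m_tree_type )
        return pte | ((uint64_t)t << PTE_RSW_SHIFT);

    side_table[gfn] = t;
    return pte | ((uint64_t)p2m_tree_type << PTE_RSW_SHIFT);
}

/* Read the type back, consulting the side table only for the marker value. */
static p2m_type_t pte_get_type(uint64_t pte, unsigned long gfn)
{
    p2m_type_t t = (p2m_type_t)((pte & PTE_RSW_MASK) >> PTE_RSW_SHIFT);

    assert(gfn < NR_GFNS);
    return t == p2m_tree_type ? side_table[gfn] : t;
}
```

With this layout only the less common types would ever create an entry in the
external storage.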

>
>>>>>> +    /*
>>>>>> +     * Some IOMMUs don't support coherent PT walk. When the p2m is
>>>>>> +     * shared with the CPU, Xen has to make sure that the PT changes have
>>>>>> +     * reached the memory
>>>>>> +     */
>>>>>> +    p2m->clean_pte = is_iommu_enabled(d) &&
>>>>>> +        !iommu_has_feature(d, IOMMU_FEAT_COHERENT_WALK);
>>>>> The comment talks about shared page tables, yet you don't check whether
>>>>> page table sharing is actually enabled for the domain.
>>>> Do we have such function/macros?
>>> We have iommu_hap_pt_share, and we have the per-domain hap_pt_share flag.
>>>
>>>> It is shared by implementation now.
>>> I don't understand. There's no IOMMU support yet for RISC-V. Hence it's in
>>> neither state - not shared, but also not not shared.
>> In downstream there is a support of IOMMU for RISC-V.
> And there page tables are unconditionally shared? I'll be surprised if no
> want/need for non-shared page tables would ever appear.

At the moment, yes, but it isn't a strict limitation. So yes, page tables
should be conditionally shared.
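
A sketch of what the conditional check could look like, assuming the three
inputs are available as plain booleans. struct iommu_state and
p2m_needs_clean_pte() are hypothetical names used only for illustration; in
Xen the inputs would come from is_iommu_enabled(), the per-domain hap_pt_share
flag and iommu_has_feature():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened view of the inputs; not Xen's real structures. */
struct iommu_state {
    bool enabled;        /* IOMMU is enabled for the domain */
    bool pt_shared;      /* p2m page tables are shared with the IOMMU */
    bool coherent_walk;  /* IOMMU supports coherent page-table walks */
};

/*
 * PTE writes only need an explicit cache clean when the IOMMU actually walks
 * the same tables and cannot do so coherently.
 */
static bool p2m_needs_clean_pte(const struct iommu_state *s)
{
    assert(s);
    return s->enabled && s->pt_shared && !s->coherent_walk;
}
```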

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 11:43                   ` Juergen Gross
@ 2025-06-26 12:05                     ` Oleksii Kurochko
  2025-06-26 12:17                     ` Teddy Astie
  1 sibling, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-26 12:05 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7403 bytes --]


On 6/26/25 1:43 PM, Juergen Gross wrote:
> On 26.06.25 13:34, Oleksii Kurochko wrote:
>>
>> On 6/26/25 12:41 PM, Jan Beulich wrote:
>>> On 26.06.2025 12:05, Oleksii Kurochko wrote:
>>>> On 6/24/25 4:01 PM, Jan Beulich wrote:
>>>>> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>>>>>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>>>>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>>>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>> --- /dev/null
>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>> @@ -0,0 +1,115 @@
>>>>>>>>>> +#include <xen/bitops.h>
>>>>>>>>>> +#include <xen/lib.h>
>>>>>>>>>> +#include <xen/sched.h>
>>>>>>>>>> +#include <xen/spinlock.h>
>>>>>>>>>> +#include <xen/xvmalloc.h>
>>>>>>>>>> +
>>>>>>>>>> +#include <asm/p2m.h>
>>>>>>>>>> +#include <asm/sbi.h>
>>>>>>>>>> +
>>>>>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 
>>>>>>>>>> 14-bit VMID.
>>>>>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 
>>>>>>>>>> (2^14 - 1)
>>>>>>>>>> + * concurrent domains.
>>>>>>>>> Which is pretty limiting especially in the RV32 case. Hence 
>>>>>>>>> why we don't
>>>>>>>>> assign a permanent ID to VMs on x86, but rather manage IDs 
>>>>>>>>> per-CPU (note:
>>>>>>>>> not per-vCPU).
>>>>>>>> Good point.
>>>>>>>>
>>>>>>>> I don't believe anyone will use RV32.
>>>>>>>> For RV64, the available ID space seems sufficiently large.
>>>>>>>>
>>>>>>>> However, if it turns out that the value isn't large enough even 
>>>>>>>> for RV64,
>>>>>>>> I can rework it to manage IDs per physical CPU.
>>>>>>>> Wouldn't that approach result in more TLB entries being flushed 
>>>>>>>> compared
>>>>>>>> to per-vCPU allocation, potentially leading to slightly worse 
>>>>>>>> performance?
>>>>>>> Depends on the condition for when to flush. Of course 
>>>>>>> performance is
>>>>>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>>>>>> Nevertheless, as indicated before, the model used on x86 may be a
>>>>>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>>>>>> core (and vendor-independent) part of it.
>>>>>> IIUC, so basically it is just a round-robin and when VMIDs are 
>>>>>> ran out
>>>>>> then just do full guest TLB flush and start to re-use VMIDs from 
>>>>>> the start.
>>>>>> It makes sense to me, I'll implement something similar. (as I'm 
>>>>>> not really
>>>>>> sure that we needdata->core_asid_generation, probably, I will 
>>>>>> understand it better when
>>>>>> start to implement it)
>>>>> Well. The fewer VMID bits you have the more quickly you will need 
>>>>> a new
>>>>> generation. And keep track of the generation you're at you also 
>>>>> need to
>>>>> track the present number somewhere.
>>>>>
>>>>>>>> What about then to allocate VMID per-domain?
>>>>>>> That's what you're doing right now, isn't it? And that gets 
>>>>>>> problematic when
>>>>>>> you have only very few bits in hgatp.VMID, as mentioned below.
>>>>>> Right, I just phrased my question poorly—sorry about that.
>>>>>>
>>>>>> What I meant to ask is: does the approach described above 
>>>>>> actually depend on whether
>>>>>> VMIDs are allocated per-domain or per-pCPU? It seems that the 
>>>>>> main advantage of
>>>>>> allocating VMIDs per-pCPU is potentially reducing the number of 
>>>>>> TLB flushes,
>>>>>> since it's more likely that a platform will have more 
>>>>>> than|VMID_MAX| domains than
>>>>>> |VMID_MAX| physical CPUs—am I right?
>>>>> Seeing that there can be systems with hundreds or even thousands 
>>>>> of CPUs,
>>>>> I don't think I can agree here. Plus per-pCPU allocation would 
>>>>> similarly
>>>>> get you in trouble when you have only very few VMID bits.
>>>> But not so fast as in case of per-domain allocation, right?
>>>>
>>>> I mean that if we have only 4 bits, then in case of per-domain 
>>>> allocation we will
>>>> need to do TLB flush + VMID re-assigning when we have more then 16 
>>>> domains.
>>>>
>>>> But in case of per-pCPU allocation we could run 16 domains on 1 
>>>> pCPU and at the same
>>>> time in multiprocessor systems we have more pCPUs, which will allow 
>>>> us to run more
>>>> domains and avoid TLB flushes.
>>>> On other hand, it is needed to consider that it's unlikely that a 
>>>> domain will have
>>>> only one vCPU. And it is likely that amount of vCPUs will be bigger 
>>>> then an amount
>>>> of domains, so to have a round-robin approach (as x86) without 
>>>> permanent ID allocation
>>>> for each domain will work better then per-pCPU allocation.
>>> Here you (appear to) say one thing, ...
>>>
>>>> In other words, I'm not 100% sure that I get a point why x86 chose 
>>>> per-pCPU allocation
>>>> instead of per-domain allocation with having the same VMID for all 
>>>> vCPUs of domains.
>>> ... and then here the opposite. Overall I'm in severe trouble 
>>> understanding this
>>> reply of yours as a whole, so I fear I can't really respond to it 
>>> (or even just
>>> parts thereof).
>>
>> IIUC, x86 allocates VMIDs per physical CPU (pCPU) "dynamically" — 
>> these are just
>> sequential numbers, and once VMIDs run out on a given pCPU, there's 
>> no guarantee
>> that a vCPU will receive the same VMID again.
>>
>> On the other hand, RISC-V currently allocates a single VMID per 
>> domain, and that
>> VMID is considered "permanent" until the domain is destroyed. This 
>> means we are
>> limited to at most VMID_MAX domains. To avoid this limitation, I plan 
>> to implement
>> a round-robin reuse approach: when no free VMIDs remain, we start a 
>> new generation
>> and begin reusing old VMIDs.
>>
>> The only remaining design question is whether we want RISC-V to 
>> follow a global
>> VMID allocation policy (i.e., one VMID per domain, shared across all 
>> of its vCPUs),
>> or adopt a policy similar to x86 with per-CPU VMID allocation (each 
>> vCPU gets its
>> own VMID, local to the CPU it's running on).
>>
>> Each policy has its own trade-offs. But in the case where the number 
>> of available
>> VMIDs is small (i.e., low VMIDLEN), a global allocation policy may be 
>> more suitable,
>> as it requires fewer VMIDs overall.
>>
>> So my main question was:
>> What are the advantages of per-pCPU VMID allocation in scenarios with 
>> limited VMID
>> space, and why did x86 choose that design?
>>
>> From what I can tell, the benefits of per-pCPU VMID allocation include:
>> - Minimized inter-CPU TLB flushes — since VMIDs are local, TLB 
>> entries don’t need
>>    to be invalidated on other CPUs when reused.
>> - Better scalability — this approach works better on systems with a 
>> large number
>>    of CPUs.
>> - Frequent VM switches don’t require global TLB flushes — reducing 
>> the overhead
>>    of context switching.
>> However, the downside is that this model consumes more VMIDs. For 
>> example,
>> if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4 
>> VMIDs instead
>> of just one.
>
> Consider you have 4 bits for VMIDs, resulting in 16 VMID values.
>
> If you have a system with 32 physical CPUs and 32 domains with 1 vcpu 
> each
> on that system, your scheme would NOT allow to keep each physical cpu 
> busy
> by running a domain on it, as only 16 domains could be active at the same
> time.

It makes sense to me.
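
The example above can be checked with a small calculation. This is only a
sketch under the assumption that every domain has exactly one vCPU; the
function names are hypothetical:

```c
#include <assert.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * Global policy: every running domain needs its own system-wide VMID, so the
 * number of VMID values caps how many domains can run at the same time.
 */
static unsigned int max_running_global(unsigned int vmidlen,
                                       unsigned int pcpus,
                                       unsigned int domains)
{
    assert(vmidlen <= 14);   /* hgatp.VMID is at most 14 bits wide */
    return min_u(1u << vmidlen, min_u(pcpus, domains));
}

/*
 * Per-pCPU policy: each pCPU has its own VMID space, so only the number of
 * pCPUs and domains limits concurrency.
 */
static unsigned int max_running_per_pcpu(unsigned int pcpus,
                                         unsigned int domains)
{
    return min_u(pcpus, domains);
}
```

For 4 VMID bits, 32 pCPUs and 32 single-vCPU domains, the first function gives
16 and the second gives 32.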

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 11:34                 ` Oleksii Kurochko
  2025-06-26 11:43                   ` Juergen Gross
@ 2025-06-26 12:16                   ` Jan Beulich
  2025-06-26 12:25                     ` Oleksii Kurochko
  1 sibling, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-26 12:16 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 26.06.2025 13:34, Oleksii Kurochko wrote:
> 
> On 6/26/25 12:41 PM, Jan Beulich wrote:
>> On 26.06.2025 12:05, Oleksii Kurochko wrote:
>>> On 6/24/25 4:01 PM, Jan Beulich wrote:
>>>> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>>>>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>>>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>> @@ -0,0 +1,115 @@
>>>>>>>>> +#include <xen/bitops.h>
>>>>>>>>> +#include <xen/lib.h>
>>>>>>>>> +#include <xen/sched.h>
>>>>>>>>> +#include <xen/spinlock.h>
>>>>>>>>> +#include <xen/xvmalloc.h>
>>>>>>>>> +
>>>>>>>>> +#include <asm/p2m.h>
>>>>>>>>> +#include <asm/sbi.h>
>>>>>>>>> +
>>>>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>>>>> +
>>>>>>>>> +/*
>>>>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>>>>>> + * concurrent domains.
>>>>>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>>>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>>>>>> not per-vCPU).
>>>>>>> Good point.
>>>>>>>
>>>>>>> I don't believe anyone will use RV32.
>>>>>>> For RV64, the available ID space seems sufficiently large.
>>>>>>>
>>>>>>> However, if it turns out that the value isn't large enough even for RV64,
>>>>>>> I can rework it to manage IDs per physical CPU.
>>>>>>> Wouldn't that approach result in more TLB entries being flushed compared
>>>>>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>>>>>> Depends on the condition for when to flush. Of course performance is
>>>>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>>>>> Nevertheless, as indicated before, the model used on x86 may be a
>>>>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>>>>> core (and vendor-independent) part of it.
>>>>> IIUC, so basically it is just a round-robin and when VMIDs are ran out
>>>>> then just do full guest TLB flush and start to re-use VMIDs from the start.
>>>>> It makes sense to me, I'll implement something similar. (as I'm not really
>>>>> sure that we needdata->core_asid_generation, probably, I will understand it better when
>>>>> start to implement it)
>>>> Well. The fewer VMID bits you have the more quickly you will need a new
>>>> generation. And keep track of the generation you're at you also need to
>>>> track the present number somewhere.
>>>>
>>>>>>> What about then to allocate VMID per-domain?
>>>>>> That's what you're doing right now, isn't it? And that gets problematic when
>>>>>> you have only very few bits in hgatp.VMID, as mentioned below.
>>>>> Right, I just phrased my question poorly—sorry about that.
>>>>>
>>>>> What I meant to ask is: does the approach described above actually depend on whether
>>>>> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
>>>>> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
>>>>> since it's more likely that a platform will have more than|VMID_MAX| domains than
>>>>> |VMID_MAX| physical CPUs—am I right?
>>>> Seeing that there can be systems with hundreds or even thousands of CPUs,
>>>> I don't think I can agree here. Plus per-pCPU allocation would similarly
>>>> get you in trouble when you have only very few VMID bits.
>>> But not so fast as in case of per-domain allocation, right?
>>>
>>> I mean that if we have only 4 bits, then in case of per-domain allocation we will
>>> need to do TLB flush + VMID re-assigning when we have more then 16 domains.
>>>
>>> But in case of per-pCPU allocation we could run 16 domains on 1 pCPU and at the same
>>> time in multiprocessor systems we have more pCPUs, which will allow us to run more
>>> domains and avoid TLB flushes.
>>> On other hand, it is needed to consider that it's unlikely that a domain will have
>>> only one vCPU. And it is likely that amount of vCPUs will be bigger then an amount
>>> of domains, so to have a round-robin approach (as x86) without permanent ID allocation
>>> for each domain will work better then per-pCPU allocation.
>> Here you (appear to) say one thing, ...
>>
>>> In other words, I'm not 100% sure that I get a point why x86 chose per-pCPU allocation
>>> instead of per-domain allocation with having the same VMID for all vCPUs of domains.
>> ... and then here the opposite. Overall I'm in severe trouble understanding this
>> reply of yours as a whole, so I fear I can't really respond to it (or even just
>> parts thereof).
> 
> IIUC, x86 allocates VMIDs per physical CPU (pCPU) "dynamically" — these are just
> sequential numbers, and once VMIDs run out on a given pCPU, there's no guarantee
> that a vCPU will receive the same VMID again.
> 
> On the other hand, RISC-V currently allocates a single VMID per domain, and that
> VMID is considered "permanent" until the domain is destroyed. This means we are
> limited to at most VMID_MAX domains. To avoid this limitation, I plan to implement
> a round-robin reuse approach: when no free VMIDs remain, we start a new generation
> and begin reusing old VMIDs.
> 
> The only remaining design question is whether we want RISC-V to follow a global
> VMID allocation policy (i.e., one VMID per domain, shared across all of its vCPUs),
> or adopt a policy similar to x86 with per-CPU VMID allocation (each vCPU gets its
> own VMID, local to the CPU it's running on).

Besides what Jürgen has said, what would this mean if you have 16 VMIDs and a 17th
domain appears? You can't "take away" the VMID from any domain, unless you fully
suspended it first (that is, all of its vCPU-s).
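
For comparison, a rough sketch of the generation-based per-pCPU scheme
mentioned earlier in the thread, where an ID is never taken away from a
running vCPU but simply becomes stale when the generation changes. This is
not a copy of hvm_asid_handle_vmenter(); all structure and function names are
hypothetical, VMID 0 is assumed to be reserved, and generation 0 is assumed
to mean "no valid VMID":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One instance per physical CPU. */
struct pcpu_vmid_data {
    uint64_t generation;  /* bumped on every roll-over, starts at 1 */
    uint32_t next_vmid;   /* next VMID to hand out, starts at 1 */
    uint32_t max_vmid;    /* (1 << VMIDLEN) - 1 */
};

/* One instance per vCPU. */
struct vcpu_vmid {
    uint64_t generation;  /* generation the VMID belongs to; 0 = none */
    uint32_t vmid;
};

/*
 * Called before entering the guest on the pCPU owning 'data'.
 * Returns true when the caller must flush all guest TLB entries on this pCPU
 * first. The caller is assumed to reset v->generation to 0 whenever the vCPU
 * moves to a different pCPU.
 */
static bool vmid_handle_enter(struct pcpu_vmid_data *data, struct vcpu_vmid *v)
{
    bool need_flush = false;

    assert(data->max_vmid >= 1);

    /* The vCPU still holds a VMID from the current generation: reuse it. */
    if ( v->generation == data->generation )
        return false;

    /* Out of VMIDs: start a new generation and flush once. */
    if ( data->next_vmid > data->max_vmid )
    {
        data->generation++;
        data->next_vmid = 1;
        need_flush = true;
    }

    v->vmid = data->next_vmid++;
    v->generation = data->generation;

    return need_flush;
}
```

A vCPU whose generation is stale just picks up the next free VMID the next
time it is entered; nothing has to be suspended to reclaim an ID.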

> Each policy has its own trade-offs. But in the case where the number of available
> VMIDs is small (i.e., low VMIDLEN), a global allocation policy may be more suitable,
> as it requires fewer VMIDs overall.
> 
> So my main question was:
> What are the advantages of per-pCPU VMID allocation in scenarios with limited VMID
> space, and why did x86 choose that design?
> 
>  From what I can tell, the benefits of per-pCPU VMID allocation include:
> - Minimized inter-CPU TLB flushes — since VMIDs are local, TLB entries don’t need
>    to be invalidated on other CPUs when reused.
> - Better scalability — this approach works better on systems with a large number
>    of CPUs.
> - Frequent VM switches don’t require global TLB flushes — reducing the overhead
>    of context switching.
> However, the downside is that this model consumes more VMIDs. For example,
> if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4 VMIDs instead
> of just one.

I don't understand this, nor why it's a downside. Looking at a domain as a whole
simply doesn't make sense in this model. Or if you do, then you need to consider
the system-wide number of VMIDs you have available:
(1 << VMIDLEN) * num_online_cpus(). That is, in your calculation a domain with
4 vCPU-s may indeed use up to 4 VMIDs at a time, but out of a pool at least 4
times the size of that of an individual pCPU.
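
The pool size from the formula above, as a tiny sketch (vmid_pool_size() is a
hypothetical helper, not an existing Xen function):

```c
#include <assert.h>

/* System-wide number of VMIDs with per-pCPU allocation. */
static unsigned long vmid_pool_size(unsigned int vmidlen,
                                    unsigned int online_cpus)
{
    assert(vmidlen <= 14);   /* hgatp.VMID is at most 14 bits wide */
    return (1UL << vmidlen) * online_cpus;
}
```

With VMIDLEN = 4 and 4 online pCPUs this gives 64, so a domain with 4 vCPUs
uses at most 4 out of 64 IDs rather than 4 out of 16.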

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 11:43                   ` Juergen Gross
  2025-06-26 12:05                     ` Oleksii Kurochko
@ 2025-06-26 12:17                     ` Teddy Astie
  2025-06-26 12:37                       ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Teddy Astie @ 2025-06-26 12:17 UTC (permalink / raw)
  To: Juergen Gross, Oleksii Kurochko, Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 26/06/2025 at 13:46, Juergen Gross wrote:
> On 26.06.25 13:34, Oleksii Kurochko wrote:
>>
>> On 6/26/25 12:41 PM, Jan Beulich wrote:
>> - Minimized inter-CPU TLB flushes — since VMIDs are local, TLB entries
>> don’t need
>>    to be invalidated on other CPUs when reused.
>> - Better scalability — this approach works better on systems with a
>> large number
>>    of CPUs.
>> - Frequent VM switches don’t require global TLB flushes — reducing the
>> overhead
>>    of context switching.
>> However, the downside is that this model consumes more VMIDs. For
>> example,
>> if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4
>> VMIDs instead
>> of just one.
>
> Consider you have 4 bits for VMIDs, resulting in 16 VMID values.
>
> If you have a system with 32 physical CPUs and 32 domains with 1 vcpu each
> on that system, your scheme would NOT allow to keep each physical cpu busy
> by running a domain on it, as only 16 domains could be active at the same
> time.
>

Why not instead consider dropping the use of a VMID when none remain?
(i.e., systematically flush the guest TLB before entering the vCPU and
use a "blank" VMID)

I don't expect a lot of platforms to allow for 32 pCPUs while not giving
more than 16 VMID values. So it would just be less efficient in that
case at worst.
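
A sketch of this fallback, assuming VMID 0 is set aside as the shared "blank"
value and at most 32 VMID values exist so that a single word can serve as the
bitmap. All names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMID_BLANK 0u   /* assumed shared VMID for domains without their own */

struct vmid_pool {
    uint32_t used;      /* bitmap of allocated VMIDs */
    uint32_t nr;        /* number of VMID values, including VMID_BLANK */
};

struct dom_vmid {
    uint32_t vmid;
    bool flush_on_entry;  /* guest TLB must be flushed on every entry */
};

/* Hand out a private VMID if one is free, else fall back to the blank one. */
static struct dom_vmid vmid_alloc_or_blank(struct vmid_pool *p)
{
    assert(p->nr >= 1 && p->nr <= 32);

    for ( uint32_t id = 1; id < p->nr; id++ )
    {
        if ( !(p->used & (UINT32_C(1) << id)) )
        {
            p->used |= UINT32_C(1) << id;
            return (struct dom_vmid){ .vmid = id, .flush_on_entry = false };
        }
    }

    /* No private VMID left: share the blank one and always flush. */
    return (struct dom_vmid){ .vmid = VMID_BLANK, .flush_on_entry = true };
}
```

Which domains end up on the blank VMID depends purely on allocation order in
this sketch.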

Teddy


Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 12:16                   ` Jan Beulich
@ 2025-06-26 12:25                     ` Oleksii Kurochko
  0 siblings, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-26 12:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7551 bytes --]


On 6/26/25 2:16 PM, Jan Beulich wrote:
> On 26.06.2025 13:34, Oleksii Kurochko wrote:
>> On 6/26/25 12:41 PM, Jan Beulich wrote:
>>> On 26.06.2025 12:05, Oleksii Kurochko wrote:
>>>> On 6/24/25 4:01 PM, Jan Beulich wrote:
>>>>> On 24.06.2025 15:47, Oleksii Kurochko wrote:
>>>>>> On 6/24/25 12:44 PM, Jan Beulich wrote:
>>>>>>> On 24.06.2025 11:46, Oleksii Kurochko wrote:
>>>>>>>> On 6/18/25 5:46 PM, Jan Beulich wrote:
>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>> --- /dev/null
>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>> @@ -0,0 +1,115 @@
>>>>>>>>>> +#include <xen/bitops.h>
>>>>>>>>>> +#include <xen/lib.h>
>>>>>>>>>> +#include <xen/sched.h>
>>>>>>>>>> +#include <xen/spinlock.h>
>>>>>>>>>> +#include <xen/xvmalloc.h>
>>>>>>>>>> +
>>>>>>>>>> +#include <asm/p2m.h>
>>>>>>>>>> +#include <asm/sbi.h>
>>>>>>>>>> +
>>>>>>>>>> +static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * hgatp's VMID field is 7 or 14 bits. RV64 may support 14-bit VMID.
>>>>>>>>>> + * Using a bitmap here limits us to 127 (2^7 - 1) or 16383 (2^14 - 1)
>>>>>>>>>> + * concurrent domains.
>>>>>>>>> Which is pretty limiting especially in the RV32 case. Hence why we don't
>>>>>>>>> assign a permanent ID to VMs on x86, but rather manage IDs per-CPU (note:
>>>>>>>>> not per-vCPU).
>>>>>>>> Good point.
>>>>>>>>
>>>>>>>> I don't believe anyone will use RV32.
>>>>>>>> For RV64, the available ID space seems sufficiently large.
>>>>>>>>
>>>>>>>> However, if it turns out that the value isn't large enough even for RV64,
>>>>>>>> I can rework it to manage IDs per physical CPU.
>>>>>>>> Wouldn't that approach result in more TLB entries being flushed compared
>>>>>>>> to per-vCPU allocation, potentially leading to slightly worse performance?
>>>>>>> Depends on the condition for when to flush. Of course performance is
>>>>>>> unavoidably going to suffer if you have only very few VMIDs to use.
>>>>>>> Nevertheless, as indicated before, the model used on x86 may be a
>>>>>>> candidate to use here, too. See hvm_asid_handle_vmenter() for the
>>>>>>> core (and vendor-independent) part of it.
>>>>>> IIUC, so basically it is just a round-robin and when VMIDs are ran out
>>>>>> then just do full guest TLB flush and start to re-use VMIDs from the start.
>>>>>> It makes sense to me, I'll implement something similar. (as I'm not really
>>>>>> sure that we needdata->core_asid_generation, probably, I will understand it better when
>>>>>> start to implement it)
>>>>> Well. The fewer VMID bits you have the more quickly you will need a new
>>>>> generation. And keep track of the generation you're at you also need to
>>>>> track the present number somewhere.
>>>>>
>>>>>>>> What about then to allocate VMID per-domain?
>>>>>>> That's what you're doing right now, isn't it? And that gets problematic when
>>>>>>> you have only very few bits in hgatp.VMID, as mentioned below.
>>>>>> Right, I just phrased my question poorly—sorry about that.
>>>>>>
>>>>>> What I meant to ask is: does the approach described above actually depend on whether
>>>>>> VMIDs are allocated per-domain or per-pCPU? It seems that the main advantage of
>>>>>> allocating VMIDs per-pCPU is potentially reducing the number of TLB flushes,
>>>>>> since it's more likely that a platform will have more than|VMID_MAX| domains than
>>>>>> |VMID_MAX| physical CPUs—am I right?
>>>>> Seeing that there can be systems with hundreds or even thousands of CPUs,
>>>>> I don't think I can agree here. Plus per-pCPU allocation would similarly
>>>>> get you in trouble when you have only very few VMID bits.
>>>> But not so fast as in case of per-domain allocation, right?
>>>>
>>>> I mean that if we have only 4 bits, then in case of per-domain allocation we will
>>>> need to do TLB flush + VMID re-assigning when we have more then 16 domains.
>>>>
>>>> But in case of per-pCPU allocation we could run 16 domains on 1 pCPU and at the same
>>>> time in multiprocessor systems we have more pCPUs, which will allow us to run more
>>>> domains and avoid TLB flushes.
>>>> On other hand, it is needed to consider that it's unlikely that a domain will have
>>>> only one vCPU. And it is likely that amount of vCPUs will be bigger then an amount
>>>> of domains, so to have a round-robin approach (as x86) without permanent ID allocation
>>>> for each domain will work better then per-pCPU allocation.
>>> Here you (appear to) say one thing, ...
>>>
>>>> In other words, I'm not 100% sure that I get a point why x86 chose per-pCPU allocation
>>>> instead of per-domain allocation with having the same VMID for all vCPUs of domains.
>>> ... and then here the opposite. Overall I'm in severe trouble understanding this
>>> reply of yours as a whole, so I fear I can't really respond to it (or even just
>>> parts thereof).
>> IIUC, x86 allocates VMIDs per physical CPU (pCPU) "dynamically" — these are just
>> sequential numbers, and once VMIDs run out on a given pCPU, there's no guarantee
>> that a vCPU will receive the same VMID again.
>>
>> On the other hand, RISC-V currently allocates a single VMID per domain, and that
>> VMID is considered "permanent" until the domain is destroyed. This means we are
>> limited to at most VMID_MAX domains. To avoid this limitation, I plan to implement
>> a round-robin reuse approach: when no free VMIDs remain, we start a new generation
>> and begin reusing old VMIDs.
>>
>> The only remaining design question is whether we want RISC-V to follow a global
>> VMID allocation policy (i.e., one VMID per domain, shared across all of its vCPUs),
>> or adopt a policy similar to x86 with per-CPU VMID allocation (each vCPU gets its
>> own VMID, local to the CPU it's running on).
> Besides what Jürgen has said, what would this mean if you have 16 VMIDs and a 17th
> domain appears? You can't "take away" the VMID from any domain, unless you fully
> suspended it first (that is, all of its vCPU-s).

In this case, the use of a VMID could be dropped and the guest TLB simply flushed
before entering the domain. Not efficient, but still an option.

>
>> Each policy has its own trade-offs. But in the case where the number of available
>> VMIDs is small (i.e., low VMIDLEN), a global allocation policy may be more suitable,
>> as it requires fewer VMIDs overall.
>>
>> So my main question was:
>> What are the advantages of per-pCPU VMID allocation in scenarios with limited VMID
>> space, and why did x86 choose that design?
>>
>>   From what I can tell, the benefits of per-pCPU VMID allocation include:
>> - Minimized inter-CPU TLB flushes — since VMIDs are local, TLB entries don’t need
>>     to be invalidated on other CPUs when reused.
>> - Better scalability — this approach works better on systems with a large number
>>     of CPUs.
>> - Frequent VM switches don’t require global TLB flushes — reducing the overhead
>>     of context switching.
>> However, the downside is that this model consumes more VMIDs. For example,
>> if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4 VMIDs instead
>> of just one.
> I don't understand this, nor why it's a downside. Looking at a domain as a whole
> simply doesn't make sense in this model. Or if you do, then you need to consider
> the system-wide number of VMIDs you have available:
> (1 << VMIDLEN) * num_online_cpus(). That is, in your calculation a domain with
> 4 vCPU-s may indeed use up to 4 VMIDs at a time, but out of a pool at least 4
> times the size of that of an individual pCPU.

Good point, I thought about that too.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement
  2025-06-26 12:17                     ` Teddy Astie
@ 2025-06-26 12:37                       ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-26 12:37 UTC (permalink / raw)
  To: Teddy Astie
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel, Juergen Gross, Oleksii Kurochko

On 26.06.2025 14:17, Teddy Astie wrote:
> On 26/06/2025 at 13:46, Juergen Gross wrote:
>> On 26.06.25 13:34, Oleksii Kurochko wrote:
>>>
>>> On 6/26/25 12:41 PM, Jan Beulich wrote:
>>> - Minimized inter-CPU TLB flushes — since VMIDs are local, TLB entries 
>>> don’t need
>>>    to be invalidated on other CPUs when reused.
>>> - Better scalability — this approach works better on systems with a 
>>> large number
>>>    of CPUs.
>>> - Frequent VM switches don’t require global TLB flushes — reducing the 
>>> overhead
>>>    of context switching.
>>> However, the downside is that this model consumes more VMIDs. For 
>>> example,
>>> if a single domain runs on 4 vCPUs across 4 CPUs, it will consume 4 
>>> VMIDs instead
>>> of just one.
>>
>> Consider you have 4 bits for VMIDs, resulting in 16 VMID values.
>>
>> If you have a system with 32 physical CPUs and 32 domains with 1 vcpu each
>> on that system, your scheme would NOT allow to keep each physical cpu busy
>> by running a domain on it, as only 16 domains could be active at the same
>> time.
> 
> Why not instead consider dropping use of VMID in case there is no one 
> remaining ?
> (i.e systematically flush the guest TLB before entering the vcpu and 
> using a "blank" VMID)

Why would one want to do that, when there's a better scheme available?
And how would you decide which VMs to penalize?

> I don't expect a lot of platforms to allow for 32 pCPU while not giving 
> more than 16 VMID values. So it would just be less efficient in that 
> case at worst.

How would you know? How many CPUs (cores) to have in a system is entirely
independent of the capabilities of the individual CPUs.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn()
  2025-06-10 13:05 ` [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
@ 2025-06-26 14:57   ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-26 14:57 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Introduce helpers pte_{set,get}_mfn() to simplify setting and getting
> of mfn.
> 
> Also, introduce PTE_PPN_MASK and add BUILD_BUG_ON() to be sure that
> PTE_PPN_MASK remains the same for all MMU modes except Sv32.
> 
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

Acked-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-10 13:05 ` [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
@ 2025-06-26 14:59   ` Jan Beulich
  2025-06-30 14:33     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-26 14:59 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -61,8 +61,28 @@ struct p2m_domain {
>  typedef enum {
>      p2m_invalid = 0,    /* Nothing mapped here */
>      p2m_ram_rw,         /* Normal read/write domain RAM */
> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */

As indicated before - this type should be added when the special handling that
it requires is also introduced.

> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */

What's the _dev suffix indicating here?

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-26 14:59   ` Jan Beulich
@ 2025-06-30 14:33     ` Oleksii Kurochko
  2025-06-30 14:38       ` Oleksii Kurochko
  2025-06-30 14:42       ` Jan Beulich
  0 siblings, 2 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-30 14:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 6/26/25 4:59 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>   typedef enum {
>>       p2m_invalid = 0,    /* Nothing mapped here */
>>       p2m_ram_rw,         /* Normal read/write domain RAM */
>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
> As indicated before - this type should be added when the special handling that
> it requires is also introduced.

Perhaps, I missed that. I will drop this type for now.

>
>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
> What's the _dev suffix indicating here?

It indicates that it is device memory. It probably isn't really necessary in the case of RISC-V, as
the spec doesn't use such terminology. In RISC-V only IO and NC are available, and we are
using PTE_PBMT_IO for p2m_mmio_direct_dev.

Maybe it would be better to just rename it: s/p2m_mmio_direct_dev/p2m_mmio_direct_io/.

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 14:33     ` Oleksii Kurochko
@ 2025-06-30 14:38       ` Oleksii Kurochko
  2025-06-30 14:45         ` Jan Beulich
  2025-06-30 14:42       ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-30 14:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 6/30/25 4:33 PM, Oleksii Kurochko wrote:
>
>
> On 6/26/25 4:59 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>   typedef enum {
>>>       p2m_invalid = 0,    /* Nothing mapped here */
>>>       p2m_ram_rw,         /* Normal read/write domain RAM */
>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>> As indicated before - this type should be added when the special handling that
>> it requires is also introduced.
> Perhaps, I missed that. I will drop this type for now.
>
>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>> What's the _dev suffix indicating here?
> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>
> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.

I forgot that p2m_mmio_direct_dev is used by the common dom0less code (handle_passthrough_prop()).

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 14:33     ` Oleksii Kurochko
  2025-06-30 14:38       ` Oleksii Kurochko
@ 2025-06-30 14:42       ` Jan Beulich
  2025-06-30 15:13         ` Oleksii Kurochko
  1 sibling, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 14:42 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 30.06.2025 16:33, Oleksii Kurochko wrote:
> On 6/26/25 4:59 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>   typedef enum {
>>>       p2m_invalid = 0,    /* Nothing mapped here */
>>>       p2m_ram_rw,         /* Normal read/write domain RAM */
>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>> As indicated before - this type should be added when the special handling that
>> it requires is also introduced.
> 
> Perhaps, I missed that. I will drop this type for now.
> 
>>
>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>> What's the _dev suffix indicating here?
> 
> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
> using PTE_PBMT_IO for p2m_mmio_direct_dev.
> 
> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.

And what would the _io suffix indicate, beyond what "mmio" already indicates?

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 14:38       ` Oleksii Kurochko
@ 2025-06-30 14:45         ` Jan Beulich
  2025-06-30 15:27           ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 14:45 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 30.06.2025 16:38, Oleksii Kurochko wrote:
> On 6/30/25 4:33 PM, Oleksii Kurochko wrote:
>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>   typedef enum {
>>>>       p2m_invalid = 0,    /* Nothing mapped here */
>>>>       p2m_ram_rw,         /* Normal read/write domain RAM */
>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>> As indicated before - this type should be added when the special handling that
>>> it requires is also introduced.
>> Perhaps, I missed that. I will drop this type for now.
>>
>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>> What's the _dev suffix indicating here?
>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>
>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
> 
> I forgot that p2m_mmio_direct_dev is used by common code for dom0less code (handle_passthrough_prop())

That'll want abstracting out, I think. I don't view it as helpful to clutter
RISC-V (and later perhaps also PPC) with Arm-specific terminology.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 14:42       ` Jan Beulich
@ 2025-06-30 15:13         ` Oleksii Kurochko
  2025-06-30 15:27           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-30 15:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 6/30/25 4:42 PM, Jan Beulich wrote:
> On 30.06.2025 16:33, Oleksii Kurochko wrote:
>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>    typedef enum {
>>>>        p2m_invalid = 0,    /* Nothing mapped here */
>>>>        p2m_ram_rw,         /* Normal read/write domain RAM */
>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>> As indicated before - this type should be added when the special handling that
>>> it requires is also introduced.
>> Perhaps, I missed that. I will drop this type for now.
>>
>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>> What's the _dev suffix indicating here?
>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>
>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
> And what would the _io suffix indicate, beyond what "mmio" already indicates?

Just that PBMT_IO will be used for device memory and not PBMT_NC.

~ Oleksii
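
For illustration, the point about IO versus NC can be sketched as a mapping from a p2m type to the Svpbmt bits of a PTE. This is only a minimal model, not the code from the patch: the MODEL_* macros, the enum and model_pbmt_bits() are hypothetical names, and the type naming in particular is still under discussion in this thread. The only fact assumed from the Svpbmt extension is the encoding of PTE bits 62:61 (0 = PMA, 1 = NC, 2 = IO).

```c
#include <assert.h>
#include <stdint.h>

/* Svpbmt: PTE bits 62:61 select the memory type (0 = PMA, 1 = NC, 2 = IO). */
#define MODEL_PTE_PBMT_SHIFT 61
#define MODEL_PTE_PBMT_NC    (UINT64_C(1) << MODEL_PTE_PBMT_SHIFT)
#define MODEL_PTE_PBMT_IO    (UINT64_C(2) << MODEL_PTE_PBMT_SHIFT)

static_assert(MODEL_PTE_PBMT_IO == UINT64_C(0x4000000000000000),
              "IO is encoding 2 in bits 62:61");

/* Hypothetical type names; the final naming is not settled in this thread. */
enum model_p2m_type {
    model_p2m_invalid,
    model_p2m_ram_rw,
    model_p2m_mmio_direct_io,
};

/* Return the PBMT bits to OR into a leaf PTE for the given p2m type. */
static uint64_t model_pbmt_bits(enum model_p2m_type t)
{
    switch ( t )
    {
    case model_p2m_mmio_direct_io:
        return MODEL_PTE_PBMT_IO;   /* device MMIO: strongly ordered I/O */

    default:
        return 0;                   /* PMA: take attributes from the PMAs */
    }
}
```

If a separate NC type were added later, it would presumably become one more case returning MODEL_PTE_PBMT_NC.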

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-06-10 13:05 ` [PATCH v2 06/17] xen/riscv: add root page table allocation Oleksii Kurochko
@ 2025-06-30 15:22   ` Jan Beulich
  2025-06-30 16:18     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 15:22 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -26,6 +26,12 @@ struct p2m_domain {
>      /* Pages used to construct the p2m */
>      struct page_list_head pages;
>  
> +    /* The root of the p2m tree. May be concatenated */
> +    struct page_info *root;
> +
> +    /* Address Translation Table for the p2m */
> +    paddr_t hgatp;

Does this really need holding in a struct field? Can't it be re-created at
any time from "root" above? And such re-creation is apparently infrequent,
if happening at all after initial allocation. (But of course I don't know
what future patches of yours will bring.) This is even more so if ...

> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -133,11 +133,13 @@
>  #define HGATP_MODE_SV48X4		_UL(9)
>  
>  #define HGATP32_MODE_SHIFT		31
> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>  #define HGATP32_VMID_SHIFT		22
>  #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>  #define HGATP32_PPN			_UL(0x003FFFFF)
>  
>  #define HGATP64_MODE_SHIFT		60
> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>  #define HGATP64_VMID_SHIFT		44
>  #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)

... VMID management is going to change as previously discussed, at which
point the value to put in hgatp will need (partly) re-calculating at certain
points anyway.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>      write_unlock(&p2m->lock);
>  }
>  
> +static void clear_and_clean_page(struct page_info *page)
> +{
> +    clean_dcache_va_range(page, PAGE_SIZE);
> +    clear_domain_page(page_to_mfn(page));
> +}

A function of this name can, imo, only clear and then clean. Question is why
it's the other way around, and what the underlying requirement is for the
cleaning part to be there in the first place. Maybe that's obvious for a
RISC-V person, but it's entirely non-obvious to me (Arm being different in
this regard because of running with caches disabled at certain points in
time).

> +static struct page_info *p2m_allocate_root(struct domain *d)
> +{
> +    struct page_info *page;
> +    unsigned int order = get_order_from_bytes(KB(16));

While better than a hard-coded order of 2, this still is lacking. Is there
a reason there can't be a suitable manifest constant in the header?

> +    unsigned int nr_pages = _AC(1,U) << order;

Nit (style): Missing blank after comma.

> +    /* Return back nr_pages necessary for p2m root table. */
> +
> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
> +        panic("Specify more xen,domain-p2m-mem-mb\n");

You shouldn't panic() in anything involved in domain creation. You want to
return NULL in this case.

Further, to me the use of "more" looks misleading here. Do you perhaps mean
"larger" or "bigger"?

This also looks to be happening without any lock held. If that's intentional,
I think the "why" wants clarifying in a code comment.

> +    for ( unsigned int i = 0; i < nr_pages; i++ )
> +    {
> +        /* Return memory to domheap. */
> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
> +        if( page )
> +        {
> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
> +            free_domheap_page(page);
> +        }
> +        else
> +        {
> +            printk(XENLOG_ERR
> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
> +            return NULL;
> +        }
> +    }

The reason for doing this may also want to be put in a comment.

> +    /* Allocate memory for p2m root table. */
> +
> +    /*
> +     * As mentioned in the Priviliged Architecture Spec (version 20240411)
> +     * As explained in Section 18.5.1, for the paged virtual-memory schemes

The first sentence didn't finish when the 2nd starts. Is there a piece missing?
Do the two sentences want to be joined together?

> +static unsigned long hgatp_from_page(struct p2m_domain *p2m)

Function name and parameter type/name don't fit together.

> +{
> +    struct page_info *p2m_root_page = p2m->root;

As always: pointer-to-const wherever possible, please. But: Is this local
variable really useful to have?

> +    unsigned long ppn;
> +    unsigned long hgatp_mode;
> +
> +    ppn = PFN_DOWN(page_to_maddr(p2m_root_page)) & HGATP_PPN;
> +
> +#if RV_STAGE1_MODE == SATP_MODE_SV39
> +    hgatp_mode = HGATP_MODE_SV39X4;
> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
> +    hgatp_mode = HGATP_MODE_SV48X4;
> +#else
> +#   error "add HGATP_MODE"
> +#endif
> +
> +    return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
> +           MASK_INSR(hgatp_mode, HGATP_MODE_MASK);
> +}
> +
> +static int p2m_alloc_root_table(struct domain *d)

As indicated earlier, in a wider context - this is a good candidate where
the caller rather wants to pass struct p2m_domain *. Once you get variations
on P2Ms (like x86's altp2m or nestedp2m), the domain won't be enough
here to know which P2M to allocate the root for.

> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    p2m->root = p2m_allocate_root(d);
> +    if ( !p2m->root )
> +        return -ENOMEM;
> +
> +    p2m->hgatp = hgatp_from_page(p2m);
> +
> +    return 0;
> +}
> +
>  static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>  
>  /*
> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>          }
>      }
>  
> +    /*
> +    * First, wait for the p2m pool to be initialized. Then allocate the root

Why "wait"? There's no waiting here.

> +    * table so that the necessary pages can be returned from the p2m pool,
> +    * since the root table must be allocated using alloc_domheap_pages(...)
> +    * to meet its specific requirements.
> +    */
> +    if ( !d->arch.p2m.root )

Aren't you open-coding p2m_get_hostp2m() here?

Jan

> +        p2m_alloc_root_table(d);
> +
>      return 0;
>  }
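
To illustrate the remark that the hgatp value can be re-created from "root" at any time, here is a minimal sketch of composing an RV64 hgatp value from the root table address, a VMID and a mode. It is a standalone model, not the patch's hgatp_from_page(): all MODEL_* names and both helpers are hypothetical. The mode and VMID masks match the ones quoted above; the PPN mask (bits 43:0) and the Sv39x4 mode value 8 are taken from the privileged spec's RV64 hgatp layout.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_HGATP64_MODE_MASK  UINT64_C(0xF000000000000000)
#define MODEL_HGATP64_VMID_MASK  UINT64_C(0x03FFF00000000000)
#define MODEL_HGATP64_PPN_MASK   UINT64_C(0x00000FFFFFFFFFFF)
#define MODEL_HGATP_MODE_SV39X4  UINT64_C(8)
#define MODEL_PAGE_SHIFT         12

/* Place v into the field described by mask m (lowest set bit of m = shift). */
static uint64_t model_mask_insr(uint64_t v, uint64_t m)
{
    return (v * (m & (~m + 1))) & m;
}

/* Compose hgatp from the machine address of the root table. */
static uint64_t model_hgatp(uint64_t root_maddr, uint64_t vmid, uint64_t mode)
{
    /* The Sv39x4/Sv48x4 root table is 16 KiB and must be 16 KiB aligned. */
    assert((root_maddr & (UINT64_C(0x4000) - 1)) == 0);

    return ((root_maddr >> MODEL_PAGE_SHIFT) & MODEL_HGATP64_PPN_MASK) |
           model_mask_insr(vmid, MODEL_HGATP64_VMID_MASK) |
           model_mask_insr(mode, MODEL_HGATP64_MODE_MASK);
}
```

Since only the VMID field changes when a VMID is reassigned, calling such a helper again with the same root address and a new VMID is cheap, which is the trade-off being weighed against caching the value in struct p2m_domain.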

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 15:13         ` Oleksii Kurochko
@ 2025-06-30 15:27           ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 15:27 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 30.06.2025 17:13, Oleksii Kurochko wrote:
> 
> On 6/30/25 4:42 PM, Jan Beulich wrote:
>> On 30.06.2025 16:33, Oleksii Kurochko wrote:
>>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>>    typedef enum {
>>>>>        p2m_invalid = 0,    /* Nothing mapped here */
>>>>>        p2m_ram_rw,         /* Normal read/write domain RAM */
>>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>>> As indicated before - this type should be added when the special handling that
>>>> it requires is also introduced.
>>> Perhaps, I missed that. I will drop this type for now.
>>>
>>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>>> What's the _dev suffix indicating here?
>>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>>
>>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
>> And what would the _io suffix indicate, beyond what "mmio" already indicates?
> 
> Just that PBMT_IO will be used for device memory and not PBMT_NC.

And will there (later) also be a p2m_mmio_direct_nc type? If so, I can see the point
of the suffix.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 14:45         ` Jan Beulich
@ 2025-06-30 15:27           ` Oleksii Kurochko
  2025-06-30 15:50             ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-30 15:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 6/30/25 4:45 PM, Jan Beulich wrote:
> On 30.06.2025 16:38, Oleksii Kurochko wrote:
>> On 6/30/25 4:33 PM, Oleksii Kurochko wrote:
>>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>>    typedef enum {
>>>>>        p2m_invalid = 0,    /* Nothing mapped here */
>>>>>        p2m_ram_rw,         /* Normal read/write domain RAM */
>>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>>> As indicated before - this type should be added when the special handling that
>>>> it requires is also introduced.
>>> Perhaps, I missed that. I will drop this type for now.
>>>
>>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>>> What's the _dev suffix indicating here?
>>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>>
>>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
>> I forgot that p2m_mmio_direct_dev is used by common code for dom0less code (handle_passthrough_prop())
> That'll want abstracting out, I think. I don't view it as helpful to clutter
> RISC-V (and later perhaps also PPC) with Arm-specific terminology.

Would it be better then to just rename it to p2m_device? But then it won't be clear for Arm which of the
MMIO p2m types is used, as Arm has three MMIO types: *_dev, *_nc, *_c.

As an option (which I don't really like) it could be "#define p2m_mmio_direct_dev ARCH_specific_name" in
asm/p2m.h to not touch common code.

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn()
  2025-06-10 13:05 ` [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn() Oleksii Kurochko
@ 2025-06-30 15:48   ` Jan Beulich
  2025-07-02 15:59     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 15:48 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Introduce page_set_xenheap_gfn() helper to encode the GFN associated with
> a Xen heap page directly into the type_info field of struct page_info.
> 
> Introduce a GFN field in the type_info of a Xen heap page by reserving 10
> bits (sufficient for both Sv32 and Sv39+ modes), and define PGT_gfn_mask
> and PGT_gfn_width accordingly.

This reads as if you wanted to encode the GFN in 10 bits.

What would also help is if you said why you actually need this. x86, after
all, gets away without anything like this. (But I understand you're more
Arm-like here.)

> @@ -229,9 +230,21 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>  #define PGT_writable_page PG_mask(1, 1)  /* has writable mappings?         */
>  #define PGT_type_mask     PG_mask(1, 1)  /* Bits 31 or 63.                 */
>  
> -/* Count of uses of this frame as its current type. */
> -#define PGT_count_width   PG_shift(2)
> -#define PGT_count_mask    ((1UL << PGT_count_width) - 1)
> + /* 9-bit count of uses of this frame as its current type. */
> +#define PGT_count_mask    PG_mask(0x3FF, 10)
> +
> +/*
> + * Sv32 has 22-bit GFN. Sv{39, 48, 57} have 44-bit GFN.
> + * Thereby we can use for `type_info` 10 bits for all modes, having the same
> + * amount of bits for `type_info` for all MMU modes let us avoid introducing
> + * an extra #ifdef to that header:
> + *   if we go with maximum possible bits for count on each configuration
> + *   we would need to have a set of PGT_count_* and PGT_gfn_*).
> + */
> +#define PGT_gfn_width     PG_shift(10)
> +#define PGT_gfn_mask      (BIT(PGT_gfn_width, UL) - 1)
> +
> +#define PGT_INVALID_XENHEAP_GFN   _gfn(PGT_gfn_mask)

Commentary here would imo be preferable to be much closer to Arm's. I don't
see the point of the extra verbosity (part of which may be fine to have in
the description, except you already say something along these lines there).
While in turn the comment talks of fewer bits than are actually being used
in the RV64 case.

> @@ -283,6 +296,19 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>  
>  #define PFN_ORDER(pg) ((pg)->v.free.order)
>  
> +static inline void page_set_xenheap_gfn(struct page_info *p, gfn_t gfn)
> +{
> +    gfn_t gfn_ = gfn_eq(gfn, INVALID_GFN) ? PGT_INVALID_XENHEAP_GFN : gfn;
> +    unsigned long x, nx, y = p->u.inuse.type_info;
> +
> +    ASSERT(is_xen_heap_page(p));
> +
> +    do {
> +        x = y;
> +        nx = (x & ~PGT_gfn_mask) | gfn_x(gfn_);
> +    } while ( (y = cmpxchg(&p->u.inuse.type_info, x, nx)) != x );
> +}
> +
>  extern unsigned char cpu0_boot_stack[];
>  
>  void setup_initial_pagetables(void);

What about the "get" counterpart?

Jan
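
As an illustration of what a "get" counterpart could look like, the following is a minimal standalone model using C11 atomics rather than Xen's cmpxchg() and gfn_t. Every name in it is hypothetical. It assumes a 64-bit type_info whose top 10 bits are reserved for type and count, leaving a 54-bit GFN field, and it mirrors the quoted setter's translation between INVALID_GFN and the in-field "invalid" encoding.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_GFN_WIDTH            54   /* assumed: 64 bits minus 10 reserved */
#define MODEL_GFN_MASK             ((UINT64_C(1) << MODEL_GFN_WIDTH) - 1)
#define MODEL_INVALID_GFN          UINT64_MAX      /* stands in for INVALID_GFN */
#define MODEL_INVALID_XENHEAP_GFN  MODEL_GFN_MASK  /* all-ones within the field */

struct model_page {
    _Atomic uint64_t type_info;
};

/* Store gfn in the low bits of type_info, leaving the upper bits untouched. */
static void model_set_xenheap_gfn(struct model_page *p, uint64_t gfn)
{
    uint64_t g = (gfn == MODEL_INVALID_GFN) ? MODEL_INVALID_XENHEAP_GFN : gfn;
    uint64_t old = atomic_load(&p->type_info);
    uint64_t new;

    do {
        new = (old & ~MODEL_GFN_MASK) | (g & MODEL_GFN_MASK);
        /* On failure, old is refreshed with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(&p->type_info, &old, new) );
}

/* Read the GFN back, translating the in-field encoding to the invalid GFN. */
static uint64_t model_get_xenheap_gfn(struct model_page *p)
{
    uint64_t g = atomic_load(&p->type_info) & MODEL_GFN_MASK;

    return (g == MODEL_INVALID_XENHEAP_GFN) ? MODEL_INVALID_GFN : g;
}
```

The getter needs no retry loop because it only performs a single atomic read; the setter loops so that concurrent updates to the type and count bits are not lost.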


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 15:27           ` Oleksii Kurochko
@ 2025-06-30 15:50             ` Jan Beulich
  2025-07-02 10:13               ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 15:50 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 30.06.2025 17:27, Oleksii Kurochko wrote:
> 
> On 6/30/25 4:45 PM, Jan Beulich wrote:
>> On 30.06.2025 16:38, Oleksii Kurochko wrote:
>>> On 6/30/25 4:33 PM, Oleksii Kurochko wrote:
>>>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>>>    typedef enum {
>>>>>>        p2m_invalid = 0,    /* Nothing mapped here */
>>>>>>        p2m_ram_rw,         /* Normal read/write domain RAM */
>>>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>>>> As indicated before - this type should be added when the special handling that
>>>>> it requires is also introduced.
>>>> Perhaps, I missed that. I will drop this type for now.
>>>>
>>>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>>>> What's the _dev suffix indicating here?
>>>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>>>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>>>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>>>
>>>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
>>> I forgot that p2m_mmio_direct_dev is used by common code for dom0less code (handle_passthrough_prop())
>> That'll want abstracting out, I think. I don't view it as helpful to clutter
>> RISC-V (and later perhaps also PPC) with Arm-specific terminology.
> 
> Would it be better then just rename it to p2m_device? Then it won't be clear for Arm which type of MMIO p2m's
> types is used as Arm has three MMIO types: *_dev, *_nc, *_c.

I don't understand why Arm matters here. P2M types want naming in a way that makes
sense for RISC-V.

> As an option (which I don't really like) it could be "#define p2m_mmio_direct_dev ARCH_specific_name" in
> asm/p2m.h to not touch common code.

A #define may be needed, but not one to _still_ introduce Arm naming into non-Arm
code.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-06-10 13:05 ` [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs Oleksii Kurochko
@ 2025-06-30 15:59   ` Jan Beulich
  2025-07-03 11:02     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-06-30 15:59 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -324,3 +324,44 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>  
>      return 0;
>  }
> +
> +static int p2m_set_entry(struct p2m_domain *p2m,
> +                         gfn_t sgfn,
> +                         unsigned long nr,
> +                         mfn_t smfn,
> +                         p2m_type_t t,
> +                         p2m_access_t a)
> +{
> +    return -EOPNOTSUPP;
> +}
> +
> +static int p2m_insert_mapping(struct domain *d, gfn_t start_gfn,

This likely again wants to be struct p2m_domain *.

> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    int rc;
> +
> +    p2m_write_lock(p2m);
> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
> +    p2m_write_unlock(p2m);
> +
> +    return rc;
> +}
> +
> +int map_regions_p2mt(struct domain *d,
> +                     gfn_t gfn,
> +                     unsigned long nr,
> +                     mfn_t mfn,
> +                     p2m_type_t p2mt)
> +{
> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
> +}

What is this function doing here? The description says "for generic mapping
purposes", which really may mean anything. Plus, if and when you need it, it
wants to come with a name that fits with e.g. ...

> +int guest_physmap_add_entry(struct domain *d,
> +                            gfn_t gfn,
> +                            mfn_t mfn,
> +                            unsigned long page_order,
> +                            p2m_type_t t)

... this one, to understand their relationship / difference.

> +{
> +    return p2m_insert_mapping(d, gfn, (1 << page_order), mfn, t);

1UL please, while at the same time the parentheses could be omitted.

Jan
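
The "1UL please" remark can be illustrated with a small sketch. In "1 << page_order" the literal has type int, so the shift is done in int and is undefined once the result no longer fits (order 31 with a 32-bit int), even though the destination parameter is unsigned long. The helper name below is hypothetical and is not part of the patch.

```c
#include <assert.h>
#include <limits.h>

/* Number of pages covered by a mapping of the given order. */
static unsigned long model_nr_pages(unsigned int page_order)
{
    /* The shift count must stay below the width of unsigned long. */
    assert(page_order < sizeof(unsigned long) * CHAR_BIT);

    /* 1UL makes the shift happen in unsigned long, not in int. */
    return 1UL << page_order;
}
```

For example, order 9 (a 2 MiB superpage with 4 KiB pages) yields 512 pages and order 18 (1 GiB) yields 262144 pages.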


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-06-30 15:22   ` Jan Beulich
@ 2025-06-30 16:18     ` Oleksii Kurochko
  2025-07-01  6:29       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-06-30 16:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 6/30/25 5:22 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>       /* Pages used to construct the p2m */
>>       struct page_list_head pages;
>>   
>> +    /* The root of the p2m tree. May be concatenated */
>> +    struct page_info *root;
>> +
>> +    /* Address Translation Table for the p2m */
>> +    paddr_t hgatp;
> Does this really need holding in a struct field? Can't it be re-created at
> any time from "root" above?

Yes, with the current implementation I agree that root alone would be
enough. But as you noticed below...

> And such re-creation is apparently infrequent,
> if happening at all after initial allocation. (But of course I don't know
> what future patches of yours will bring.) This is even more so if ...
>
>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>> @@ -133,11 +133,13 @@
>>   #define HGATP_MODE_SV48X4		_UL(9)
>>   
>>   #define HGATP32_MODE_SHIFT		31
>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>   #define HGATP32_VMID_SHIFT		22
>>   #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>   #define HGATP32_PPN			_UL(0x003FFFFF)
>>   
>>   #define HGATP64_MODE_SHIFT		60
>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>   #define HGATP64_VMID_SHIFT		44
>>   #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
> ... VMID management is going to change as previously discussed, at which
> point the value to put in hgatp will need (partly) re-calculating at certain
> points anyway.

... once VMID management is changed to a per-CPU basis, hgatp will need to be
re-calculated each time the vCPU running on a pCPU changes.
In that case I prefer to keep a partially calculated 'hgatp'.

>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>       write_unlock(&p2m->lock);
>>   }
>>   
>> +static void clear_and_clean_page(struct page_info *page)
>> +{
>> +    clean_dcache_va_range(page, PAGE_SIZE);
>> +    clear_domain_page(page_to_mfn(page));
>> +}
> A function of this name can, imo, only clear and then clean. Question is why
> it's the other way around, and what the underlying requirement is for the
> cleaning part to be there in the first place. Maybe that's obvious for a
> RISC-V person, but it's entirely non-obvious to me (Arm being different in
> this regard because of running with caches disabled at certain points in
> time).

You're right, the current name clear_and_clean_page() implies that clearing
should come before cleaning, which contradicts the current implementation.
The intent here is to ensure that the page contents are consistent in RAM
(not just in cache) before use by other entities (guests or devices).

The clean must follow the clear, so yes, the order needs to be reversed.
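
A minimal sketch of the reversed ordering, for illustration only. The stub_*
helpers are hypothetical stand-ins for clear_domain_page() and
clean_dcache_va_range(); they only record the call order. Whether the clean
is required at all on RISC-V is not settled by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical stand-ins for the Xen helpers; they only record call order. */
static char call_log[8];
static size_t call_pos;

static void stub_clear_page(void *va)
{
    memset(va, 0, PAGE_SIZE);
    call_log[call_pos++] = 'z';   /* 'z' = page zeroed */
}

static void stub_clean_dcache_range(const void *va, size_t size)
{
    (void)va;
    (void)size;                   /* real code would write dirty lines back */
    call_log[call_pos++] = 'c';   /* 'c' = cache cleaned */
}

/* Clear first, then clean, so that the zeroes are what reaches RAM. */
static void clear_and_clean_page(void *va)
{
    assert(call_pos + 2 <= sizeof(call_log));

    stub_clear_page(va);
    stub_clean_dcache_range(va, PAGE_SIZE);
}
```

With this order the function name and the behaviour agree: the page is zeroed,
and only afterwards are the dirty lines written back.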

>
>> +static struct page_info *p2m_allocate_root(struct domain *d)
>> +{
>> +    struct page_info *page;
>> +    unsigned int order = get_order_from_bytes(KB(16));
> While better than a hard-coded order of 2, this still is lacking. Is there
> a reason there can't be a suitable manifest constant in the header?

No specific reason, I just decided not to introduce a new definition as
it is going to be used only inside this function.

I think it would make sense to have in p2m.c:
  #define P2M_ROOT_PT_SIZE KB(16)
If that isn't the best option, then what about moving this definition
to config.h or asm/p2m.h?

>
>> +    unsigned int nr_pages = _AC(1,U) << order;
> Nit (style): Missing blank after comma.

I've changed that to BIT(order, U)

>
>> +    /* Return back nr_pages necessary for p2m root table. */
>> +
>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
> You shouldn't panic() in anything involved in domain creation. You want to
> return NULL in this case.

It makes sense in this case just to return NULL.

>
> Further, to me the use of "more" looks misleading here. Do you perhaps mean
> "larger" or "bigger"?
>
> This also looks to be happening without any lock held. If that's intentional,
> I think the "why" wants clarifying in a code comment.

Agreed, returning the pages needed for the p2m root table should be done under
spin_lock(&d->arch.paging.lock).

>
>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>> +    {
>> +        /* Return memory to domheap. */
>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>> +        if( page )
>> +        {
>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>> +            free_domheap_page(page);
>> +        }
>> +        else
>> +        {
>> +            printk(XENLOG_ERR
>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>> +            return NULL;
>> +        }
>> +    }
> The reason for doing this may also want to be put in a comment.

I thought the comment above would be enough: /* Return back nr_pages necessary for p2m root table. */

>
>> +    /* Allocate memory for p2m root table. */
>> +
>> +    /*
>> +     * As mentioned in the Priviliged Architecture Spec (version 20240411)
>> +     * As explained in Section 18.5.1, for the paged virtual-memory schemes
> The first sentence didn't finish when the 2nd starts. Is there a piece missing?
> Do the two sentences want to be joined together?

Nothing is missing, just bad wording. I will update it to:
   As mentioned in the Privileged Architecture Spec (version 20240411) in Section 18.5.1, ...

>
>> +static unsigned long hgatp_from_page(struct p2m_domain *p2m)
> Function name and parameter type/name don't fit together.

I'll change the argument to struct page_info *root.

>
>> +{
>> +    struct page_info *p2m_root_page = p2m->root;
> As always: pointer-to-const wherever possible, please. But: Is this local
> variable really useful to have?

No, it will just be passed as an argument.

>
>> +    unsigned long ppn;
>> +    unsigned long hgatp_mode;
>> +
>> +    ppn = PFN_DOWN(page_to_maddr(p2m_root_page)) & HGATP_PPN;
>> +
>> +#if RV_STAGE1_MODE == SATP_MODE_SV39
>> +    hgatp_mode = HGATP_MODE_SV39X4;
>> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
>> +    hgatp_mode = HGATP_MODE_SV48X4;
>> +#else
>> +#   error "add HGATP_MODE"
>> +#endif
>> +
>> +    return ppn | MASK_INSR(p2m->vmid, HGATP_VMID_MASK) |
>> +           MASK_INSR(hgatp_mode, HGATP_MODE_MASK);
>> +}
>> +
>> +static int p2m_alloc_root_table(struct domain *d)
> As indicated earlier, in a wider context - this is a good candidate where
> the caller rather wants to pass struct p2m_domain *. Once you get variations
> on P2Ms (like x86's altp2m or nestedp2m), the domain won't be meaningful
> here to know which P2M to allocate the root for.

Good point. I will re-work that.

>
>> +{
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +    p2m->root = p2m_allocate_root(d);
>> +    if ( !p2m->root )
>> +        return -ENOMEM;
>> +
>> +    p2m->hgatp = hgatp_from_page(p2m);
>> +
>> +    return 0;
>> +}
>> +
>>   static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>   
>>   /*
>> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>           }
>>       }
>>   
>> +    /*
>> +    * First, wait for the p2m pool to be initialized. Then allocate the root
> Why "wait"? There's no waiting here.

I don't really get your question.

"wait" here refers to the initialization of the pool, which happens above this comment.

>
>> +    * table so that the necessary pages can be returned from the p2m pool,
>> +    * since the root table must be allocated using alloc_domheap_pages(...)
>> +    * to meet its specific requirements.
>> +    */
>> +    if ( !d->arch.p2m.root )
> Aren't you open-coding p2m_get_hostp2m() here?

Yes, p2m_get_hostp2m() should be used here.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-06-30 16:18     ` Oleksii Kurochko
@ 2025-07-01  6:29       ` Jan Beulich
  2025-07-01  9:44         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-01  6:29 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 30.06.2025 18:18, Oleksii Kurochko wrote:
> On 6/30/25 5:22 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>       /* Pages used to construct the p2m */
>>>       struct page_list_head pages;
>>>   
>>> +    /* The root of the p2m tree. May be concatenated */
>>> +    struct page_info *root;
>>> +
>>> +    /* Address Translation Table for the p2m */
>>> +    paddr_t hgatp;
>> Does this really need holding in a struct field? Can't it be re-created at
>> any time from "root" above?
> 
> Yes, with the current one implementation, I agree it would be enough only
> root. But as you noticed below...
> 
>> And such re-creation is apparently infrequent,
>> if happening at all after initial allocation. (But of course I don't know
>> what future patches of yours will bring.) This is even more so if ...
>>
>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>> @@ -133,11 +133,13 @@
>>>   #define HGATP_MODE_SV48X4		_UL(9)
>>>   
>>>   #define HGATP32_MODE_SHIFT		31
>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>   #define HGATP32_VMID_SHIFT		22
>>>   #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>   #define HGATP32_PPN			_UL(0x003FFFFF)
>>>   
>>>   #define HGATP64_MODE_SHIFT		60
>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>   #define HGATP64_VMID_SHIFT		44
>>>   #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>> ... VMID management is going to change as previously discussed, at which
>> point the value to put in hgatp will need (partly) re-calculating at certain
>> points anyway.
> 
> ... after VMID management will changed to per-CPU base then it will be needed
> to update re-calculate hgatp each time vCPU on pCPU is changed.
> In this case I prefer to have partially calculated 'hgatp'.

But why, when you need to do some recalculation anyway?

>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>       write_unlock(&p2m->lock);
>>>   }
>>>   
>>> +static void clear_and_clean_page(struct page_info *page)
>>> +{
>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>> +    clear_domain_page(page_to_mfn(page));
>>> +}
>> A function of this name can, imo, only clear and then clean. Question is why
>> it's the other way around, and what the underlying requirement is for the
>> cleaning part to be there in the first place. Maybe that's obvious for a
>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>> this regard because of running with caches disabled at certain points in
>> time).
> 
> You're right, the current name|clear_and_clean_page()| implies that clearing
> should come before cleaning, which contradicts the current implementation.
> The intent here is to ensure that the page contents are consistent in RAM
> (not just in cache) before use by other entities (guests or devices).
> 
> The clean must follow the clear — so yes, the order needs to be reversed.

What you don't address though - why's the cleaning needed in the first place?

>>> +static struct page_info *p2m_allocate_root(struct domain *d)
>>> +{
>>> +    struct page_info *page;
>>> +    unsigned int order = get_order_from_bytes(KB(16));
>> While better than a hard-coded order of 2, this still is lacking. Is there
>> a reason there can't be a suitable manifest constant in the header?
> 
> No any specific reason, I just decided not to introduce new definition as
> it is going to be used only inside this function.
> 
> I think it will make sense to have in p2m.c:
>   #define P2M_ROOT_PT_SIZE KB(16)
> If it isn't the best one option, then what about to move this defintion
> to config.h or asm/p2m.h.

It's defined by the hardware, so neither of the two headers looks to be a
good fit. Nor is the P2M_ prefix really in line with this being hardware-
defined. page.h has various paging-related hw definitions, and
riscv_encoding.h may also be a suitable place. There may be other candidates
that I'm presently overlooking.
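
To illustrate the point about a hardware-defined constant, here is a sketch
of deriving the allocation order from a named size instead of a literal.
GSTAGE_ROOT_PT_SIZE is a hypothetical name, not an existing Xen macro, and
order_from_bytes() is a simplified stand-in for get_order_from_bytes(). The
16 KiB value is the root table size of the x4 G-stage modes.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)
#define KB(x)      ((size_t)(x) << 10)

/* Hypothetical name: root table size for the Sv39x4/Sv48x4 G-stage modes. */
#define GSTAGE_ROOT_PT_SIZE KB(16)

/* Simplified stand-in for get_order_from_bytes(): smallest covering order. */
static unsigned int order_from_bytes(size_t bytes)
{
    unsigned int order = 0;

    assert(bytes > 0);

    while ( (PAGE_SIZE << order) < bytes )
        order++;

    return order;
}
```

With 4 KiB pages this yields order 2, i.e. four contiguous pages, which is the
value the patch currently computes from KB(16) inline.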

>>> +    unsigned int nr_pages = _AC(1,U) << order;
>> Nit (style): Missing blank after comma.
> 
> I've changed that to BIT(order, U)
> 
>>
>>> +    /* Return back nr_pages necessary for p2m root table. */
>>> +
>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>> You shouldn't panic() in anything involved in domain creation. You want to
>> return NULL in this case.
> 
> It makes sense in this case just to return NULL.
> 
>>
>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>> "larger" or "bigger"?
>>
>> This also looks to be happening without any lock held. If that's intentional,
>> I think the "why" wants clarifying in a code comment.
> 
> Agree, returning back pages necessary for p2m root table should be done under
> spin_lock(&d->arch.paging.lock).

Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
(As long as you mean to have that separation, that is. See the earlier discussion
on that matter.)

>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>> +    {
>>> +        /* Return memory to domheap. */
>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>> +        if( page )
>>> +        {
>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>> +            free_domheap_page(page);
>>> +        }
>>> +        else
>>> +        {
>>> +            printk(XENLOG_ERR
>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>> +            return NULL;
>>> +        }
>>> +    }
>> The reason for doing this may also want to be put in a comment.
> 
> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */

That describes what the code does, but not why.

>>> +{
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +
>>> +    p2m->root = p2m_allocate_root(d);
>>> +    if ( !p2m->root )
>>> +        return -ENOMEM;
>>> +
>>> +    p2m->hgatp = hgatp_from_page(p2m);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>   
>>>   /*
>>> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>           }
>>>       }
>>>   
>>> +    /*
>>> +    * First, wait for the p2m pool to be initialized. Then allocate the root
>> Why "wait"? There's no waiting here.
> 
> I am not really get your question.
> 
> "wait" here is about the initialization of the pool which happens above this comment.

But there's no "waiting" involved. What you talk about is one thing needing to
happen after the other.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-07-01  6:29       ` Jan Beulich
@ 2025-07-01  9:44         ` Oleksii Kurochko
  2025-07-01 10:27           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-01  9:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/1/25 8:29 AM, Jan Beulich wrote:
> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>>        /* Pages used to construct the p2m */
>>>>        struct page_list_head pages;
>>>>    
>>>> +    /* The root of the p2m tree. May be concatenated */
>>>> +    struct page_info *root;
>>>> +
>>>> +    /* Address Translation Table for the p2m */
>>>> +    paddr_t hgatp;
>>> Does this really need holding in a struct field? Can't it be re-created at
>>> any time from "root" above?
>> Yes, with the current one implementation, I agree it would be enough only
>> root. But as you noticed below...
>>
>>> And such re-creation is apparently infrequent,
>>> if happening at all after initial allocation. (But of course I don't know
>>> what future patches of yours will bring.) This is even more so if ...
>>>
>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>> @@ -133,11 +133,13 @@
>>>>    #define HGATP_MODE_SV48X4		_UL(9)
>>>>    
>>>>    #define HGATP32_MODE_SHIFT		31
>>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>>    #define HGATP32_VMID_SHIFT		22
>>>>    #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>>    #define HGATP32_PPN			_UL(0x003FFFFF)
>>>>    
>>>>    #define HGATP64_MODE_SHIFT		60
>>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>>    #define HGATP64_VMID_SHIFT		44
>>>>    #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>>> ... VMID management is going to change as previously discussed, at which
>>> point the value to put in hgatp will need (partly) re-calculating at certain
>>> points anyway.
>> ... after VMID management will changed to per-CPU base then it will be needed
>> to update re-calculate hgatp each time vCPU on pCPU is changed.
>> In this case I prefer to have partially calculated 'hgatp'.
> But why, when you need to do some recalculation anyway?

Fewer operations will be needed.
If we have a partially prepared 'hgatp', then we only have to update the VMID
bits instead of getting the ppn for the page and calculating hgatp_mode each time.
But if you think it isn't really needed, I can add a vmid argument to hgatp_from_page()
and just call this function when an update of hgatp is needed.
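
A rough sketch of the "partially prepared" variant, to make the cost
comparison concrete. hgatp_base() and hgatp_with_vmid() are hypothetical
helpers, not functions from the patch. The MODE and VMID constants mirror the
quoted riscv_encoding.h hunk; HGATP64_PPN_MASK (bits 43:0) and the SV39X4
mode value of 8 are assumptions taken from the RV64 hgatp layout.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT          12

#define HGATP_MODE_SV39X4   UINT64_C(8)
#define HGATP64_MODE_SHIFT  60
#define HGATP64_VMID_SHIFT  44
#define HGATP64_VMID_MASK   UINT64_C(0x03FFF00000000000)
#define HGATP64_PPN_MASK    UINT64_C(0x00000FFFFFFFFFFF) /* assumed: bits 43:0 */

/* Hypothetical: compute the MODE and PPN fields once; VMID is left as 0. */
static uint64_t hgatp_base(uint64_t root_maddr, uint64_t mode)
{
    return (mode << HGATP64_MODE_SHIFT) |
           ((root_maddr >> PAGE_SHIFT) & HGATP64_PPN_MASK);
}

/* Hypothetical: replace only the VMID field of a prepared value. */
static uint64_t hgatp_with_vmid(uint64_t base, uint64_t vmid)
{
    assert(vmid <= (HGATP64_VMID_MASK >> HGATP64_VMID_SHIFT));

    return (base & ~HGATP64_VMID_MASK) |
           ((vmid << HGATP64_VMID_SHIFT) & HGATP64_VMID_MASK);
}
```

The per-switch work is then one mask and one OR. The full recalculation adds a
page-to-address conversion and a shift, so the saving is likely small.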

>
>>>> --- a/xen/arch/riscv/p2m.c
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>        write_unlock(&p2m->lock);
>>>>    }
>>>>    
>>>> +static void clear_and_clean_page(struct page_info *page)
>>>> +{
>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>> +    clear_domain_page(page_to_mfn(page));
>>>> +}
>>> A function of this name can, imo, only clear and then clean. Question is why
>>> it's the other way around, and what the underlying requirement is for the
>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>> this regard because of running with caches disabled at certain points in
>>> time).
>> You're right, the current name|clear_and_clean_page()| implies that clearing
>> should come before cleaning, which contradicts the current implementation.
>> The intent here is to ensure that the page contents are consistent in RAM
>> (not just in cache) before use by other entities (guests or devices).
>>
>> The clean must follow the clear — so yes, the order needs to be reversed.
> What you don't address though - why's the cleaning needed in the first place?

If we clean the data cache first, we flush the d-cache and then use the page to
perform the clear operation. As a result, the "cleared" value will be written into
the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
sequence is to clear the page first, then clean the data cache.

>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>> Nit (style): Missing blank after comma.
>> I've changed that to BIT(order, U)
>>
>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>> +
>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>> You shouldn't panic() in anything involved in domain creation. You want to
>>> return NULL in this case.
>> It makes sense in this case just to return NULL.
>>
>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>> "larger" or "bigger"?
>>>
>>> This also looks to be happening without any lock held. If that's intentional,
>>> I think the "why" wants clarifying in a code comment.
>> Agree, returning back pages necessary for p2m root table should be done under
>> spin_lock(&d->arch.paging.lock).
> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
> (As long as you mean to have that separation, that is. See the earlier discussion
> on that matter.)

Then part of p2m_set_allocation() should be moved to paging_*() too.

>>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>>> +    {
>>>> +        /* Return memory to domheap. */
>>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>> +        if( page )
>>>> +        {
>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>> +            free_domheap_page(page);
>>>> +        }
>>>> +        else
>>>> +        {
>>>> +            printk(XENLOG_ERR
>>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>>> +            return NULL;
>>>> +        }
>>>> +    }
>>> The reason for doing this may also want to be put in a comment.
>> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */
> That describes what the code does, but not why.

I will add to the comment: "... to get the memory accounting right".

>
>>>> +{
>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +
>>>> +    p2m->root = p2m_allocate_root(d);
>>>> +    if ( !p2m->root )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    p2m->hgatp = hgatp_from_page(p2m);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>>    static spinlock_t vmid_alloc_lock = SPIN_LOCK_UNLOCKED;
>>>>    
>>>>    /*
>>>> @@ -228,5 +313,14 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>>            }
>>>>        }
>>>>    
>>>> +    /*
>>>> +    * First, wait for the p2m pool to be initialized. Then allocate the root
>>> Why "wait"? There's no waiting here.
>> I am not really get your question.
>>
>> "wait" here is about the initialization of the pool which happens above this comment.
> But there's no "waiting" involved. What you talk about is one thing needing to
> happen after the other.

Okay, then I will just reword the comment.

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-07-01  9:44         ` Oleksii Kurochko
@ 2025-07-01 10:27           ` Jan Beulich
  2025-07-01 14:02             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-01 10:27 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 01.07.2025 11:44, Oleksii Kurochko wrote:
> On 7/1/25 8:29 AM, Jan Beulich wrote:
>> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>>>        /* Pages used to construct the p2m */
>>>>>        struct page_list_head pages;
>>>>>    
>>>>> +    /* The root of the p2m tree. May be concatenated */
>>>>> +    struct page_info *root;
>>>>> +
>>>>> +    /* Address Translation Table for the p2m */
>>>>> +    paddr_t hgatp;
>>>> Does this really need holding in a struct field? Can't it be re-created at
>>>> any time from "root" above?
>>> Yes, with the current one implementation, I agree it would be enough only
>>> root. But as you noticed below...
>>>
>>>> And such re-creation is apparently infrequent,
>>>> if happening at all after initial allocation. (But of course I don't know
>>>> what future patches of yours will bring.) This is even more so if ...
>>>>
>>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>> @@ -133,11 +133,13 @@
>>>>>    #define HGATP_MODE_SV48X4		_UL(9)
>>>>>    
>>>>>    #define HGATP32_MODE_SHIFT		31
>>>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>>>    #define HGATP32_VMID_SHIFT		22
>>>>>    #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>>>    #define HGATP32_PPN			_UL(0x003FFFFF)
>>>>>    
>>>>>    #define HGATP64_MODE_SHIFT		60
>>>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>>>    #define HGATP64_VMID_SHIFT		44
>>>>>    #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>>>> ... VMID management is going to change as previously discussed, at which
>>>> point the value to put in hgatp will need (partly) re-calculating at certain
>>>> points anyway.
>>> ... after VMID management will changed to per-CPU base then it will be needed
>>> to update re-calculate hgatp each time vCPU on pCPU is changed.
>>> In this case I prefer to have partially calculated 'hgatp'.
>> But why, when you need to do some recalculation anyway?
> 
> Less operations will be needed to do.

Right; I wonder how big the savings would be.

> If we have partially prepared 'hgatp' then we have to only update VMID bits
> instead of getting ppn for page, then calculate hgatp_mode each time.
> But if you think it isn't really needed I can add vmid argument for hgatp_from_page()
> and just call this function when an update of hgatp is needed.

I think it'll need to be struct p2m_domain * that you (also?) pass in. In the
longer run I think you will want to support all three permitted modes, with
smaller guests using fewer page table levels.

As to "also" - maybe it's better to change the name of the function, and pass
in just (const if possible) struct p2m_domain *.

>>>>> --- a/xen/arch/riscv/p2m.c
>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>>        write_unlock(&p2m->lock);
>>>>>    }
>>>>>    
>>>>> +static void clear_and_clean_page(struct page_info *page)
>>>>> +{
>>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>>> +    clear_domain_page(page_to_mfn(page));
>>>>> +}
>>>> A function of this name can, imo, only clear and then clean. Question is why
>>>> it's the other way around, and what the underlying requirement is for the
>>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>>> this regard because of running with caches disabled at certain points in
>>>> time).
>>> You're right, the current name|clear_and_clean_page()| implies that clearing
>>> should come before cleaning, which contradicts the current implementation.
>>> The intent here is to ensure that the page contents are consistent in RAM
>>> (not just in cache) before use by other entities (guests or devices).
>>>
>>> The clean must follow the clear — so yes, the order needs to be reversed.
>> What you don't address though - why's the cleaning needed in the first place?
> 
> If we clean the data cache first, we flush the d-cache and then use the page to
> perform the clear operation. As a result, the "cleared" value will be written into
> the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
> sequence is to clear the page first, then clean the data cache.

If you want to avoid cache pollution, I think you'd need to use a form of stores
which simply bypass the cache. Yet then - why would this matter here, but not
elsewhere? Wouldn't you better leave such to the hardware, unless you can prove
a (meaningful) performance gain?

>>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>>> Nit (style): Missing blank after comma.
>>> I've changed that to BIT(order, U)
>>>
>>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>>> +
>>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>>> You shouldn't panic() in anything involved in domain creation. You want to
>>>> return NULL in this case.
>>> It makes sense in this case just to return NULL.
>>>
>>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>>> "larger" or "bigger"?
>>>>
>>>> This also looks to be happening without any lock held. If that's intentional,
>>>> I think the "why" wants clarifying in a code comment.
>>> Agree, returning back pages necessary for p2m root table should be done under
>>> spin_lock(&d->arch.paging.lock).
>> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
>> (As long as you mean to have that separation, that is. See the earlier discussion
>> on that matter.)
> 
> Then partly p2m_set_allocation() should be moved to paging_*() too.

Not exactly sure what you mean. On x86 at least the paging layer part of
the function is pretty slim.

>>>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>>>> +    {
>>>>> +        /* Return memory to domheap. */
>>>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>> +        if( page )
>>>>> +        {
>>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>>> +            free_domheap_page(page);
>>>>> +        }
>>>>> +        else
>>>>> +        {
>>>>> +            printk(XENLOG_ERR
>>>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>>>> +            return NULL;
>>>>> +        }
>>>>> +    }
>>>> The reason for doing this may also want to be put in a comment.
>>> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */
>> That describes what the code does, but not why.
> 
> I will add to the comment: "... to get the memory accounting right".

I'm sorry to be picky, but what is "right"? You want to ensure the root table
memory is also accounted against the P2M pool of the domain. Can't you say
exactly that?

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-06-10 13:05 ` [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
  2025-06-18 15:53   ` Jan Beulich
@ 2025-07-01 13:04   ` Jan Beulich
  2025-07-02 10:30     ` Oleksii Kurochko
  2025-07-02 11:48     ` Oleksii Kurochko
  1 sibling, 2 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-01 13:04 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>  
>      return 0;
>  }
> +
> +/*
> + * Set the pool of pages to the required number of pages.
> + * Returns 0 for success, non-zero for failure.
> + * Call with d->arch.paging.lock held.
> + */
> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
> +{
> +    struct page_info *pg;
> +
> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
> +
> +    for ( ; ; )
> +    {
> +        if ( d->arch.paging.p2m_total_pages < pages )
> +        {
> +            /* Need to allocate more memory from domheap */
> +            pg = alloc_domheap_page(d, MEMF_no_owner);
> +            if ( pg == NULL )
> +            {
> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
> +                return -ENOMEM;
> +            }
> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
> +        }
> +        else if ( d->arch.paging.p2m_total_pages > pages )
> +        {
> +            /* Need to return memory to domheap */
> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
> +            if( pg )
> +            {
> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
> +                free_domheap_page(pg);
> +            }
> +            else
> +            {
> +                printk(XENLOG_ERR
> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
> +                return -ENOMEM;
> +            }
> +        }
> +        else
> +            break;
> +
> +        /* Check to see if we need to yield and try again */
> +        if ( preempted && general_preempt_check() )
> +        {
> +            *preempted = true;
> +            return -ERESTART;
> +        }
> +    }
> +
> +    return 0;
> +}

Btw, with the order-2 requirement for the root page table, you may want to
consider an alternative approach: Here you could allocate some order-2
pages (possibly up to as many as a domain might need, which right now
would be exactly one), put them on a separate list, and consume the root
table(s) from there. If you run out of pages on the order-0 list, you
could shatter a page from the order-2 one (as long as that's still non-
empty). The difficulty would be with freeing, where a previously shattered
order-2 page would be nice to re-combine once all of its constituents are
free again. The main benefit would be avoiding the back and forth in patch
6.
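
A very rough sketch of that idea, using counters only. The struct and function
names are hypothetical, no real page lists or Xen allocator calls are
involved, and re-combining shattered chunks on free is deliberately not
modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define ROOT_ORDER       2
#define PAGES_PER_CHUNK  (1u << ROOT_ORDER)

/* Hypothetical pool state: counts stand in for the two free lists. */
struct p2m_pool_sketch {
    unsigned int free_order0;   /* single pages available */
    unsigned int free_order2;   /* intact order-2 chunks available */
};

/* Take one order-2 chunk for a root table. */
static bool pool_take_root(struct p2m_pool_sketch *p)
{
    assert(p);

    if ( !p->free_order2 )
        return false;

    p->free_order2--;
    return true;
}

/* Take one order-0 page, shattering an order-2 chunk if none is left. */
static bool pool_take_page(struct p2m_pool_sketch *p)
{
    assert(p);

    if ( !p->free_order0 )
    {
        if ( !p->free_order2 )
            return false;

        p->free_order2--;
        p->free_order0 += PAGES_PER_CHUNK;
    }

    p->free_order0--;
    return true;
}
```

In this model the root table comes straight from the pool, so nothing has to
be handed back to the domheap and re-allocated at a higher order.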

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-06-10 13:05 ` [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry() Oleksii Kurochko
@ 2025-07-01 13:49   ` Jan Beulich
  2025-07-04 15:01     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-01 13:49 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
> RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
> modifications.
> 
> Key differences include:
> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>   break-before-make (BBM). As a result, the flushing logic is simplified.
>   TLB invalidation can be deferred until p2m_write_unlock() is called.
>   Consequently, the p2m->need_flush flag is always considered true and is
>   removed.
> - Page Table Traversal: The order of walking the page tables differs from Arm,
>   and this implementation reflects that reversed traversal.
> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>   P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
> 
> The main functionality is in __p2m_set_entry(), which handles mappings aligned
> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
> 
> p2m_set_entry() breaks a region down into block-aligned mappings and calls
> __p2m_set_entry() accordingly.
> 
> Stub implementations (to be completed later) include:
> - p2m_free_entry()

What would a function of this name do? You can clear entries, but you can't
free them, can you?

> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -9,8 +9,13 @@
>  #include <xen/rwlock.h>
>  #include <xen/types.h>
>  
> +#include <asm/page.h>
>  #include <asm/page-bits.h>
>  
> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
> +#define P2M_ROOT_ORDER  XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL)

This is confusing, as in patch 6 we see that p2m root table order is 2.
Something needs doing about the naming, so the two sets of things can't
be confused.

> @@ -49,6 +54,17 @@ struct p2m_domain {
>  
>      /* Current VMID in use */
>      uint16_t vmid;
> +
> +    /* Highest guest frame that's ever been mapped in the p2m */
> +    gfn_t max_mapped_gfn;
> +
> +    /*
> +     * Lowest mapped gfn in the p2m. When releasing mapped gfn's in a
> +     * preemptible manner this is update to track recall where to
> +     * resume the search. Apart from during teardown this can only
> +     * decrease.
> +     */
> +    gfn_t lowest_mapped_gfn;

When you copied the comment, you surely read it. Yet you copied pretty
obvious flaws as-is. That is s/update/updated/, and something wants
doing about "track recall", which makes no sense to me.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -231,6 +231,8 @@ int p2m_init(struct domain *d)
>      INIT_PAGE_LIST_HEAD(&p2m->pages);
>  
>      p2m->vmid = INVALID_VMID;
> +    p2m->max_mapped_gfn = _gfn(0);
> +    p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
>  
>      p2m->default_access = p2m_access_rwx;
>  
> @@ -325,6 +327,214 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>      return 0;
>  }
>  
> +/*
> + * Find and map the root page table. The caller is responsible for
> + * unmapping the table.
> + *
> + * The function will return NULL if the offset of the root table is
> + * invalid.

Don't you mean "offset into ..."?

> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> +    unsigned long root_table_indx;
> +
> +    root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
> +    if ( root_table_indx >= P2M_ROOT_PAGES )
> +        return NULL;
> +
> +    return __map_domain_page(p2m->root + root_table_indx);
> +}
> +
> +static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)

The rule of thumb is to have inline functions only in header files, leaving
decisions to the compiler elsewhere.

> +{
> +    panic("%s: isn't implemented for now\n", __func__);
> +
> +    return false;
> +}

For this function in particular, though: Besides the "p2me" in the name
being somewhat odd (supposedly page table entries here are simply pte_t),
how is this going to be different from pte_is_valid()?

> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
> +{
> +    write_pte(p, pte);
> +    if ( clean_pte )
> +        clean_dcache_va_range(p, sizeof(*p));
> +}
> +
> +static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
> +{
> +    pte_t pte;
> +
> +    memset(&pte, 0x00, sizeof(pte));
> +    p2m_write_pte(p, pte, clean_pte);
> +}

May I suggest "clear" instead of "remove" and plain 0 instead of 0x00
(or simply give the variable a trivial initializer)?

As to the earlier function that I commented on: Seeing the names here,
wouldn't p2m_pte_is_valid() be a more consistent name there?

> +static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn,
> +                                p2m_type_t t, p2m_access_t a)
> +{
> +    panic("%s: hasn't been implemented yet\n", __func__);
> +
> +    return (pte_t) { .pte = 0 };
> +}

And then perhaps p2m_pte_from_mfn() here?

> +#define GUEST_TABLE_MAP_NONE 0
> +#define GUEST_TABLE_MAP_NOMEM 1
> +#define GUEST_TABLE_SUPER_PAGE 2
> +#define GUEST_TABLE_NORMAL 3

Is GUEST_ a good prefix? The guest doesn't control these tables, and the
word could also mean the guest's own page tables.

> +/*
> + * Take the currently mapped table, find the corresponding GFN entry,

That's not what you mean though, is it? It's more like "the entry
corresponding to the GFN" (implying "at the given level").

> + * and map the next table, if available. The previous table will be
> + * unmapped if the next level was mapped (e.g GUEST_TABLE_NORMAL
> + * returned).
> + *
> + * `alloc_tbl` parameter indicates whether intermediate tables should
> + * be allocated when not present.
> + *
> + * Return values:
> + *  GUEST_TABLE_MAP_NONE: a table allocation isn't permitted.
> + *  GUEST_TABLE_MAP_NOMEM: allocating a new page failed.
> + *  GUEST_TABLE_SUPER_PAGE: next level or leaf mapped normally.
> + *  GUEST_TABLE_NORMAL: The next entry points to a superpage.
> + */
> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> +                          unsigned int level, pte_t **table,
> +                          unsigned int offset)
> +{
> +    panic("%s: hasn't been implemented yet\n", __func__);
> +
> +    return GUEST_TABLE_MAP_NONE;
> +}
> +
> +/* Free pte sub-tree behind an entry */
> +static void p2m_free_entry(struct p2m_domain *p2m,
> +                           pte_t entry, unsigned int level)
> +{
> +    panic("%s: hasn't been implemented yet\n", __func__);
> +}
> +
> +/*
> + * Insert an entry in the p2m. This should be called with a mapping
> + * equal to a page/superpage.
> + */
> +static int __p2m_set_entry(struct p2m_domain *p2m,

No double leading underscores, please. A single one is fine and will do.

> +                           gfn_t sgfn,
> +                           unsigned int page_order,
> +                           mfn_t smfn,

What are the "s" in "sgfn" and "smfn" indicating? Possibly "start", except
that you don't process multiple GFNs here (unlike in the caller).

> +                           p2m_type_t t,
> +                           p2m_access_t a)
> +{
> +    unsigned int level;
> +    unsigned int target = page_order / PAGETABLE_ORDER;
> +    pte_t *entry, *table, orig_pte;
> +    int rc;
> +    /* A mapping is removed if the MFN is invalid. */
> +    bool removing_mapping = mfn_eq(smfn, INVALID_MFN);
> +    DECLARE_OFFSETS(offsets, gfn_to_gaddr(sgfn));
> +
> +    ASSERT(p2m_is_write_locked(p2m));
> +
> +    /*
> +     * Check if the level target is valid: we only support
> +     * 4K - 2M - 1G mapping.
> +     */
> +    ASSERT(target <= 2);

No provisions towards the division that produced the value having left
a remainder?
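
A minimal sketch of such a check, assuming PAGETABLE_ORDER is 9 (512 entries
per table); order_to_level() and MAX_TARGET_LEVEL are hypothetical names, not
part of the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGETABLE_ORDER  9 /* assumed: 512 entries per table */
#define MAX_TARGET_LEVEL 2 /* 4K, 2M and 1G mappings */

/* Translate a page order into a page table level, rejecting bad orders. */
static int order_to_level(unsigned int page_order, unsigned int *level)
{
    assert(level != NULL);

    /* An order which isn't a multiple of PAGETABLE_ORDER has no level. */
    if ( page_order % PAGETABLE_ORDER )
        return -EINVAL;

    if ( page_order / PAGETABLE_ORDER > MAX_TARGET_LEVEL )
        return -EINVAL;

    *level = page_order / PAGETABLE_ORDER;

    return 0;
}
```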

> +    table = p2m_get_root_pointer(p2m, sgfn);
> +    if ( !table )
> +        return -EINVAL;
> +
> +    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
> +    {
> +        /*
> +         * Don't try to allocate intermediate page table if the mapping
> +         * is about to be removed.
> +         */
> +        rc = p2m_next_level(p2m, !removing_mapping,
> +                            level, &table, offsets[level]);
> +        if ( (rc == GUEST_TABLE_MAP_NONE) || (rc == GUEST_TABLE_MAP_NOMEM) )
> +        {
> +            /*
> +             * We are here because p2m_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and they p2m tree is read-only). It is a valid case
> +             * when removing a mapping as it may not exist in the
> +             * page table. In this case, just ignore it.
> +             */
> +            rc = removing_mapping ?  0 : -ENOENT;

Shouldn't GUEST_TABLE_MAP_NOMEM be transformed to -ENOMEM?
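
One possible shape of that translation, as a sketch only; the helper name
next_level_error() is hypothetical, while the GUEST_TABLE_* values are the
ones from the quoted patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values as in the quoted patch. */
#define GUEST_TABLE_MAP_NONE  0
#define GUEST_TABLE_MAP_NOMEM 1

/* Map a failed table walk onto an errno value. */
static int next_level_error(int rc, bool removing_mapping)
{
    assert(rc == GUEST_TABLE_MAP_NONE || rc == GUEST_TABLE_MAP_NOMEM);

    /* An allocation failure is reported as such. */
    if ( rc == GUEST_TABLE_MAP_NOMEM )
        return -ENOMEM;

    /* A missing table is fine when the mapping is being removed anyway. */
    return removing_mapping ? 0 : -ENOENT;
}
```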

> +            goto out;
> +        }
> +        else if ( rc != GUEST_TABLE_NORMAL )

No need for "else" here.

> +            break;
> +    }
> +
> +    entry = table + offsets[level];
> +
> +    /*
> +     * If we are here with level > target, we must be at a leaf node,
> +     * and we need to break up the superpage.
> +     */
> +    if ( level > target )
> +    {
> +        panic("Shattering isn't implemented\n");
> +    }
> +
> +    /*
> +     * We should always be there with the correct level because
> +     * all the intermediate tables have been installed if necessary.
> +     */
> +    ASSERT(level == target);
> +
> +    orig_pte = *entry;
> +
> +    /*
> +     * The access type should always be p2m_access_rwx when the mapping
> +     * is removed.
> +     */
> +    ASSERT(!mfn_eq(INVALID_MFN, smfn) || (a == p2m_access_rwx));
> +
> +    if ( removing_mapping )
> +        p2m_remove_pte(entry, p2m->clean_pte);
> +    else {

Nit: Style.

> +        pte_t pte = p2m_entry_from_mfn(p2m, smfn, t, a);
> +
> +        p2m_write_pte(entry, pte, p2m->clean_pte);
> +
> +        p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
> +                                      gfn_add(sgfn, (1UL << page_order) - 1));
> +        p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, sgfn);
> +    }
> +
> +#ifdef CONFIG_HAS_PASSTHROUGH

See my earlier comment regarding this kind of #ifdef.

> @@ -332,7 +542,55 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>                           p2m_type_t t,
>                           p2m_access_t a)
>  {
> -    return -EOPNOTSUPP;
> +    int rc = 0;
> +
> +    /*
> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
> +     * be accessible after, we need to prevent mapping to be added when the
> +     * domain is dying.
> +     */
> +    if ( unlikely(p2m->domain->is_dying) )
> +        return -ENOMEM;

Why ENOMEM?

> +    while ( nr )

Why's there a loop here? The function name uses singular, i.e. means to
create exactly one entry.

> +    {
> +        unsigned long mask;
> +        unsigned long order = 0;

unsigned int?

> +        /* 1gb, 2mb, 4k mappings are supported */
> +        unsigned int i = ( P2M_ROOT_LEVEL > 2 ) ? 2 : P2M_ROOT_LEVEL;

Nit (style): Excess blanks. Yet then aren't you open-coding min() here
anyway? Plus isn't P2M_ROOT_LEVEL always >= 2?

> +        /*
> +         * Don't take into account the MFN when removing mapping (i.e
> +         * MFN_INVALID) to calculate the correct target order.
> +         *
> +         * XXX: Support superpage mappings if nr is not aligned to a
> +         * superpage size.
> +         */

Does this really need leaving as a to-do?

> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
> +        mask |= gfn_x(sgfn) | nr;
> +
> +        for ( ; i != 0; i-- )
> +        {
> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
> +            {
> +                    order = XEN_PT_LEVEL_ORDER(i);
> +                    break;

Nit: Style.

> +            }
> +        }
> +
> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
> +        if ( rc )
> +            break;
> +
> +        sgfn = gfn_add(sgfn, (1 << order));
> +        if ( !mfn_eq(smfn, INVALID_MFN) )
> +           smfn = mfn_add(smfn, (1 << order));
> +
> +        nr -= (1 << order);

Throughout maybe better be safe right away and use 1UL?

> +    }
> +
> +    return rc;
>  }

How's the caller going to know how much of the range was successfully
mapped? That part may need undoing (if not here, then in the caller),
or a caller may want to retry.
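
As a sketch of one way to report progress (not the patch's code): the loop
below returns the number of pages processed through an out-parameter. GFNs
and MFNs are plain unsigned long here, set_one_entry() is a stand-in with
hypothetical fault injection, PAGETABLE_ORDER is assumed to be 9, and all
shifts use 1UL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGETABLE_ORDER 9 /* assumed: 512 entries per table */

static unsigned long fail_gfn = ~0UL; /* hypothetical fault injection */

/* Stand-in for the single-entry helper; fails for one chosen GFN. */
static int set_one_entry(unsigned long gfn, unsigned int order)
{
    (void)order;
    return gfn == fail_gfn ? -ENOMEM : 0;
}

/* Largest of the orders 18, 9, 0 that gfn, mfn and nr are all aligned to. */
static unsigned int pick_order(unsigned long gfn, unsigned long mfn,
                               unsigned long nr)
{
    unsigned long mask = gfn | mfn | nr;
    unsigned int level;

    for ( level = 2; level != 0; level-- )
        if ( !(mask & ((1UL << (level * PAGETABLE_ORDER)) - 1)) )
            break;

    return level * PAGETABLE_ORDER;
}

/* Process nr pages; *done tells the caller how far the loop got. */
static int set_range(unsigned long gfn, unsigned long mfn, unsigned long nr,
                     unsigned long *done)
{
    int rc = 0;

    assert(done != NULL);
    *done = 0;

    while ( nr )
    {
        unsigned int order = pick_order(gfn, mfn, nr);

        rc = set_one_entry(gfn, order);
        if ( rc )
            break;

        gfn += 1UL << order;
        mfn += 1UL << order;
        nr -= 1UL << order;
        *done += 1UL << order;
    }

    return rc;
}
```

With *done in hand, a caller can either undo the first *done pages or retry
from gfn + *done.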

Jan


* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-07-01 10:27           ` Jan Beulich
@ 2025-07-01 14:02             ` Oleksii Kurochko
  2025-07-01 14:28               ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-01 14:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


On 7/1/25 12:27 PM, Jan Beulich wrote:
> On 01.07.2025 11:44, Oleksii Kurochko wrote:
>> On 7/1/25 8:29 AM, Jan Beulich wrote:
>>> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>>>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>>> @@ -26,6 +26,12 @@ struct p2m_domain {
>>>>>>         /* Pages used to construct the p2m */
>>>>>>         struct page_list_head pages;
>>>>>>     
>>>>>> +    /* The root of the p2m tree. May be concatenated */
>>>>>> +    struct page_info *root;
>>>>>> +
>>>>>> +    /* Address Translation Table for the p2m */
>>>>>> +    paddr_t hgatp;
>>>>> Does this really need holding in a struct field? Can't it be re-created at
>>>>> any time from "root" above?
>>>> Yes, with the current implementation, I agree that root alone would be
>>>> enough. But as you noticed below...
>>>>
>>>>> And such re-creation is apparently infrequent,
>>>>> if happening at all after initial allocation. (But of course I don't know
>>>>> what future patches of yours will bring.) This is even more so if ...
>>>>>
>>>>>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>>>>>> @@ -133,11 +133,13 @@
>>>>>>     #define HGATP_MODE_SV48X4		_UL(9)
>>>>>>     
>>>>>>     #define HGATP32_MODE_SHIFT		31
>>>>>> +#define HGATP32_MODE_MASK		_UL(0x80000000)
>>>>>>     #define HGATP32_VMID_SHIFT		22
>>>>>>     #define HGATP32_VMID_MASK		_UL(0x1FC00000)
>>>>>>     #define HGATP32_PPN			_UL(0x003FFFFF)
>>>>>>     
>>>>>>     #define HGATP64_MODE_SHIFT		60
>>>>>> +#define HGATP64_MODE_MASK		_ULL(0xF000000000000000)
>>>>>>     #define HGATP64_VMID_SHIFT		44
>>>>>>     #define HGATP64_VMID_MASK		_ULL(0x03FFF00000000000)
>>>>> ... VMID management is going to change as previously discussed, at which
>>>>> point the value to put in hgatp will need (partly) re-calculating at certain
>>>>> points anyway.
>>>> ... once VMID management is changed to a per-CPU basis, hgatp will need to be
>>>> re-calculated each time the vCPU running on a pCPU changes.
>>>> In this case I prefer to have a partially calculated 'hgatp'.
>>> But why, when you need to do some recalculation anyway?
>> Less operations will be needed to do.
> Right; I wonder how big the savings would be.

Probably not big.

>
>> If we have partially prepared 'hgatp' then we have to only update VMID bits
>> instead of getting ppn for page, then calculate hgatp_mode each time.
>> But if you think it isn't really needed I can add vmid argument for hgatp_from_page()
>> and just call this function when an update of hgatp is needed.
> I think it'll need to be struct p2m_domain * that you (also?) pass in. In the
> longer run I think you will want to support all three permitted modes, with
> smaller guests using fewer page table levels.

Yes, but I guess the mode will be constant for a domain: once a mode has been set,
it isn't going to change. The VMID, however, is going to change each time a vCPU
gives control to another vCPU.
Anyway, I am okay with making the hgatp update more generic and just recalculating
it fully each time it is needed.
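
For reference, a minimal sketch of a full recalculation on RV64. The helper
name make_hgatp() is hypothetical. The MODE and VMID constants are the ones
from the quoted patch; HGATP64_PPN and HGATP_MODE_SV39X4 are assumed here from
the privileged specification's hgatp layout.

```c
#include <assert.h>
#include <stdint.h>

#define HGATP64_MODE_SHIFT 60
#define HGATP64_VMID_SHIFT 44
#define HGATP64_VMID_MASK  0x03FFF00000000000ULL
#define HGATP64_PPN        0x00000FFFFFFFFFFFULL /* assumed: 44-bit PPN */
#define HGATP_MODE_SV39X4  8ULL                  /* assumed mode value */

/* Build a complete hgatp value from mode, VMID and root table PPN. */
static uint64_t make_hgatp(uint64_t mode, uint16_t vmid, uint64_t root_ppn)
{
    /* The root table is order 2, hence 16K-aligned. */
    assert(!(root_ppn & 3));

    return (mode << HGATP64_MODE_SHIFT) |
           (((uint64_t)vmid << HGATP64_VMID_SHIFT) & HGATP64_VMID_MASK) |
           (root_ppn & HGATP64_PPN);
}
```

This is only a few shifts and ORs, which matches the expectation that the
savings from caching a partial value are probably not big.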

>
> As to "also" - maybe it's better to change the name of the function, and pass
> in just (const if possible) struct p2m_domain *.
>
>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>>>         write_unlock(&p2m->lock);
>>>>>>     }
>>>>>>     
>>>>>> +static void clear_and_clean_page(struct page_info *page)
>>>>>> +{
>>>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>>>> +    clear_domain_page(page_to_mfn(page));
>>>>>> +}
>>>>> A function of this name can, imo, only clear and then clean. Question is why
>>>>> it's the other way around, and what the underlying requirement is for the
>>>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>>>> this regard because of running with caches disabled at certain points in
>>>>> time).
>>>> You're right, the current name clear_and_clean_page() implies that clearing
>>>> should come before cleaning, which contradicts the current implementation.
>>>> The intent here is to ensure that the page contents are consistent in RAM
>>>> (not just in cache) before use by other entities (guests or devices).
>>>>
>>>> The clean must follow the clear — so yes, the order needs to be reversed.
>>> What you don't address though - why's the cleaning needed in the first place?
>> If we clean the data cache first, we flush the d-cache and then use the page to
>> perform the clear operation. As a result, the "cleared" value will be written into
>> the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
>> sequence is to clear the page first, then clean the data cache.
> If you want to avoid cache pollution, I think you'd need to use a form of stores
> which simply bypass the cache. Yet then - why would this matter here, but not
> elsewhere? Wouldn't you better leave such to the hardware, unless you can prove
> a (meaningful) performance gain?

I was thinking of the case where the IOMMU doesn't support coherent walks and the p2m
tables are shared between the CPU and the IOMMU. Then my understanding is:
- clear_page(p) just zeroes the page in the CPU's cache.
- The IOMMU can still see old or uninitialized data while the zeroes are only in the cache.
- So clean_cache() is needed to write the data back from cache to RAM before the
   page is used as part of a page table for the IOMMU.
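
The three points above can be modelled with a toy write-back cache. This is a
sketch only: struct page_model and the model_*() helpers are hypothetical and
stand in for clear_page(), the cache clean and a non-coherent IOMMU read.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Toy model of one page: RAM contents plus a write-back cache copy. */
struct page_model {
    unsigned char ram[PAGE_SIZE];
    unsigned char cache[PAGE_SIZE];
    bool dirty; /* cache holds data not yet written back */
};

/* Zeroing only touches the cached copy. */
static void model_clear(struct page_model *pg)
{
    memset(pg->cache, 0, PAGE_SIZE);
    pg->dirty = true;
}

/* Cleaning writes dirty cache contents back to RAM. */
static void model_clean(struct page_model *pg)
{
    if ( pg->dirty )
    {
        memcpy(pg->ram, pg->cache, PAGE_SIZE);
        pg->dirty = false;
    }
}

/* A non-coherent device reads RAM directly, bypassing the cache. */
static unsigned char model_device_read(const struct page_model *pg,
                                       unsigned int off)
{
    assert(off < PAGE_SIZE);
    return pg->ram[off];
}
```

In this model, clean followed by clear leaves the device reading stale data,
while clear followed by clean makes the zeroes visible to it.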

>
>>>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>>>> Nit (style): Missing blank after comma.
>>>> I've changed that to BIT(order, U)
>>>>
>>>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>>>> +
>>>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>>>> You shouldn't panic() in anything involved in domain creation. You want to
>>>>> return NULL in this case.
>>>> It makes sense in this case just to return NULL.
>>>>
>>>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>>>> "larger" or "bigger"?
>>>>>
>>>>> This also looks to be happening without any lock held. If that's intentional,
>>>>> I think the "why" wants clarifying in a code comment.
>>>> Agree, returning back pages necessary for p2m root table should be done under
>>>> spin_lock(&d->arch.paging.lock).
>>> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
>>> (As long as you mean to have that separation, that is. See the earlier discussion
>>> on that matter.)
>> Then partly p2m_set_allocation() should be moved to paging_*() too.
> Not exactly sure what you mean. On x86 at least the paging layer part of
> the function is pretty slim.

I meant that the part of p2m_set_allocation() which sits between
spin_lock(&d->arch.paging.lock) and spin_unlock(&d->arch.paging.lock) should be moved
to the paging_*() layer, by the same logic by which you suggested moving the part of
p2m_allocate_root() guarded by d->arch.paging.lock to the paging_*() layer.

Or I just misunderstood your initial idea of this paging_*() layer and why it is needed.

>
>>>>>> +    for ( unsigned int i = 0; i < nr_pages; i++ )
>>>>>> +    {
>>>>>> +        /* Return memory to domheap. */
>>>>>> +        page = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>>> +        if( page )
>>>>>> +        {
>>>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>>>> +            free_domheap_page(page);
>>>>>> +        }
>>>>>> +        else
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR
>>>>>> +                   "Failed to free P2M pages, P2M freelist is empty.\n");
>>>>>> +            return NULL;
>>>>>> +        }
>>>>>> +    }
>>>>> The reason for doing this may also want to be put in a comment.
>>>> I thought it would be enough the comment above: /* Return back nr_pages necessary for p2m root table. */
>>> That describes what the code does, but not why.
>> I will add to the comment: "... to get the memory accounting right".
> I'm sorry to be picky, but what is "right"? You want to assure the root table
> memory is also accounted against the P2M pool of the domain. Can't you say
> exactly that?

Yes, it can be put that way.

Thanks.

~ Oleksii

* Re: [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers
  2025-06-10 13:05 ` [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers Oleksii Kurochko
@ 2025-07-01 14:23   ` Jan Beulich
  2025-07-11 15:56     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-01 14:23 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> This patch introduces a working implementation of p2m_free_entry() for RISC-V
> based on ARM's implementation of p2m_free_entry(), enabling proper cleanup
> of page table entries in the P2M (physical-to-machine) mapping.
> 
> Only few things are changed:
> - Use p2m_force_flush_sync() instead of p2m_tlb_flush_sync() as latter
>   isn't implemented on RISC-V.
> - Introduce and use p2m_type_radix_get() to get a type of p2m entry as
>   RISC-V's PTE doesn't have enough space to store all necessary types so
>   a type is stored in a radix tree.
> 
> Key additions include:
> - p2m_free_entry(): Recursively frees page table entries at all levels. It
>   handles both regular and superpage mappings and ensures that TLB entries
>   are flushed before freeing intermediate tables.
> - p2m_put_page() and helpers:
>   - p2m_put_4k_page(): Clears GFN from xenheap pages if applicable.
>   - p2m_put_2m_superpage(): Releases foreign page references in a 2MB
>     superpage.
>   - p2m_type_radix_get(): Extracts the stored p2m_type from the radix tree
>     using the PTE.
> - p2m_free_page(): Returns a page either to the domain's freelist or to
>   the domheap, depending on whether the domain is hardware-backed.

What is "hardware-backed"?

> Defines XEN_PT_ENTRIES in asm/page.h to simplify loops over page table
> entries.
> 
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
> Changes in V2:
>  - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>    functionality" which was splitted to smaller.
>  - s/p2m_is_superpage/p2me_is_superpage.

See my earlier comments regarding naming.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -345,11 +345,33 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>      return __map_domain_page(p2m->root + root_table_indx);
>  }
>  
> +static p2m_type_t p2m_type_radix_get(struct p2m_domain *p2m, pte_t pte)

Does it matter to callers that ...

> +{
> +    void *ptr;
> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
> +
> +    ptr = radix_tree_lookup(&p2m->p2m_type, gfn_x(gfn));
> +
> +    if ( !ptr )
> +        return p2m_invalid;
> +
> +    return radix_tree_ptr_to_int(ptr);
> +}

... this is a radix tree lookup? IOW does "radix" need to be part of the
function name? Also "get" may want to move forward in the name, to better
match the naming of other functions.

> +/*
> + * In the case of the P2M, the valid bit is used for other purpose. Use
> + * the type to check whether an entry is valid.
> + */
>  static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>  {
> -    panic("%s: isn't implemented for now\n", __func__);
> +    return p2m_type_radix_get(p2m, pte) != p2m_invalid;
> +}

No checking of the valid bit?

> -    return false;
> +static inline bool p2me_is_superpage(struct p2m_domain *p2m, pte_t pte,
> +                                    unsigned int level)
> +{
> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK) &&
> +           (level > 0);

In such combinations of conditions it's usually helpful to put the
cheapest check(s) first. IOW what point is there in doing a radix
tree lookup when the other two conditions aren't satisfied? (FTAOD
applies elsewhere as well, even within this same patch.)
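
A sketch of the reordering, relying on && short-circuiting. type_is_valid()
is a hypothetical stand-in for the radix tree lookup and merely counts its
invocations; the PTE_ACCESS_MASK value is assumed to cover the R, W and X bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESS_MASK 0xeUL /* assumed: R, W and X bits */

static unsigned int nr_lookups; /* counts expensive lookups, for illustration */

/* Hypothetical stand-in for the (expensive) radix tree based type check. */
static bool type_is_valid(uint64_t pte)
{
    nr_lookups++;
    return pte != 0;
}

/* Cheap checks first; the lookup only runs if both of them pass. */
static bool is_superpage(uint64_t pte, unsigned int level)
{
    return (level > 0) && (pte & PTE_ACCESS_MASK) && type_is_valid(pte);
}
```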

> @@ -404,11 +426,127 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>      return GUEST_TABLE_MAP_NONE;
>  }
>  
> +static void p2m_put_foreign_page(struct page_info *pg)
> +{
> +    /*
> +     * It's safe to do the put_page here because page_alloc will
> +     * flush the TLBs if the page is reallocated before the end of
> +     * this loop.
> +     */
> +    put_page(pg);

Is the comment really true? The page allocator will flush the normal
TLBs, but not the stage-2 ones. Yet those are what you care about here,
aiui.

> +/* Put any references on the single 4K page referenced by mfn. */
> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
> +{
> +    /* TODO: Handle other p2m types */
> +
> +    /* Detect the xenheap page and mark the stored GFN as invalid. */
> +    if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
> +        page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);

Is this a valid thing to do? How do you make sure the respective uses
(in gnttab's shared and status page arrays) are / were also removed?

> +}
> +
> +/* Put any references on the superpage referenced by mfn. */
> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
> +{
> +    struct page_info *pg;
> +    unsigned int i;
> +
> +    ASSERT(mfn_valid(mfn));
> +
> +    pg = mfn_to_page(mfn);
> +
> +    for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
> +        p2m_put_foreign_page(pg);
> +}
> +
> +/* Put any references on the page referenced by pte. */
> +static void p2m_put_page(struct p2m_domain *p2m, const pte_t pte,
> +                         unsigned int level)
> +{
> +    mfn_t mfn = pte_get_mfn(pte);
> +    p2m_type_t p2m_type = p2m_type_radix_get(p2m, pte);

This gives you the type of the 1st page. What guarantees that all other pages
in a superpage are of the exact same type?

> +    ASSERT(p2me_is_valid(p2m, pte));
> +
> +    /*
> +     * TODO: Currently we don't handle level 2 super-page, Xen is not
> +     * preemptible and therefore some work is needed to handle such
> +     * superpages, for which at some point Xen might end up freeing memory
> +     * and therefore for such a big mapping it could end up in a very long
> +     * operation.
> +     */

This is pretty unsatisfactory. Imo, if you don't deal with that right away,
you're setting yourself up for a significant re-write.

> +    if ( level == 1 )
> +        return p2m_put_2m_superpage(mfn, p2m_type);
> +    else if ( level == 0 )
> +        return p2m_put_4k_page(mfn, p2m_type);

Use switch() right away?

> +}
> +
> +static void p2m_free_page(struct domain *d, struct page_info *pg)
> +{
> +    if ( is_hardware_domain(d) )
> +        free_domheap_page(pg);

Why's the hardware domain different here? It should have a pool just like
all other domains have.

> +    else
> +    {
> +        spin_lock(&d->arch.paging.lock);
> +        page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
> +        spin_unlock(&d->arch.paging.lock);
> +    }
> +}
> +
>  /* Free pte sub-tree behind an entry */
>  static void p2m_free_entry(struct p2m_domain *p2m,
>                             pte_t entry, unsigned int level)
>  {
> -    panic("%s: hasn't been implemented yet\n", __func__);
> +    unsigned int i;
> +    pte_t *table;
> +    mfn_t mfn;
> +    struct page_info *pg;
> +
> +    /* Nothing to do if the entry is invalid. */
> +    if ( !p2me_is_valid(p2m, entry) )
> +        return;

Does this actually apply to intermediate page tables (which you handle
later in the function), when that's (only) a P2M type check?

> +    if ( p2me_is_superpage(p2m, entry, level) || (level == 0) )
> +    {
> +#ifdef CONFIG_IOREQ_SERVER
> +        /*
> +         * If this gets called then either the entry was replaced by an entry
> +         * with a different base (valid case) or the shattering of a superpage
> +         * has failed (error case).
> +         * So, at worst, the spurious mapcache invalidation might be sent.
> +         */
> +        if ( p2m_is_ram( p2m_type_radix_get(p2m, entry)) &&

Nit: Style.

> +             domain_has_ioreq_server(p2m->domain) )
> +            ioreq_request_mapcache_invalidate(p2m->domain);
> +#endif
> +
> +        p2m_put_page(p2m, entry, level);
> +
> +        return;
> +    }
> +
> +    table = map_domain_page(pte_get_mfn(entry));
> +    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
> +        p2m_free_entry(p2m, *(table + i), level - 1);

Better table[i]?

Jan


* Re: [PATCH v2 06/17] xen/riscv: add root page table allocation
  2025-07-01 14:02             ` Oleksii Kurochko
@ 2025-07-01 14:28               ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-01 14:28 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 01.07.2025 16:02, Oleksii Kurochko wrote:
> On 7/1/25 12:27 PM, Jan Beulich wrote:
>> On 01.07.2025 11:44, Oleksii Kurochko wrote:
>>> On 7/1/25 8:29 AM, Jan Beulich wrote:
>>>> On 30.06.2025 18:18, Oleksii Kurochko wrote:
>>>>> On 6/30/25 5:22 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>> @@ -41,6 +41,91 @@ void p2m_write_unlock(struct p2m_domain *p2m)
>>>>>>>         write_unlock(&p2m->lock);
>>>>>>>     }
>>>>>>>     
>>>>>>> +static void clear_and_clean_page(struct page_info *page)
>>>>>>> +{
>>>>>>> +    clean_dcache_va_range(page, PAGE_SIZE);
>>>>>>> +    clear_domain_page(page_to_mfn(page));
>>>>>>> +}
>>>>>> A function of this name can, imo, only clear and then clean. Question is why
>>>>>> it's the other way around, and what the underlying requirement is for the
>>>>>> cleaning part to be there in the first place. Maybe that's obvious for a
>>>>>> RISC-V person, but it's entirely non-obvious to me (Arm being different in
>>>>>> this regard because of running with caches disabled at certain points in
>>>>>> time).
>>>>> You're right, the current name clear_and_clean_page() implies that clearing
>>>>> should come before cleaning, which contradicts the current implementation.
>>>>> The intent here is to ensure that the page contents are consistent in RAM
>>>>> (not just in cache) before use by other entities (guests or devices).
>>>>>
>>>>> The clean must follow the clear — so yes, the order needs to be reversed.
>>>> What you don't address though - why's the cleaning needed in the first place?
>>> If we clean the data cache first, we flush the d-cache and then use the page to
>>> perform the clear operation. As a result, the "cleared" value will be written into
>>> the d-cache. To avoid polluting the d-cache with the "cleared" value, the correct
>>> sequence is to clear the page first, then clean the data cache.
>> If you want to avoid cache pollution, I think you'd need to use a form of stores
>> which simply bypass the cache. Yet then - why would this matter here, but not
>> elsewhere? Wouldn't you better leave such to the hardware, unless you can prove
>> a (meaningful) performance gain?
> 
> I thought about a case when IOMMU doesn't support coherent walks and p2m tables are
> shared between CPU and IOMMU. Then my understanding is:
> - clear_page(p) just zeroes a page in the CPU's cache.
> - But the IOMMU can see old or uninitialized data if the zeroes are still only in the cache.
> - So clean_cache() is needed to write the data back from cache to RAM before the
>    page is used as part of a page table for the IOMMU.

Okay, so this is purely about something that doesn't matter at all for now
(until IOMMU support is introduced). Fair enough then to play safe from the
beginning.
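
As a rough illustration of the ordering agreed on above (clear first, then clean), here is a standalone C sketch. It is not Xen code: clean_dcache_range_stub() is a hypothetical stand-in that only records when it was called, and PAGE_SIZE is assumed to be 4096.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096 /* assumed page size for the sketch */

/* Bookkeeping so the ordering can be observed from outside. */
static unsigned int clean_calls;
static int page_cleared;
static int cleaned_after_clear;

/* Hypothetical stand-in for a d-cache write-back; it performs no real cache op. */
static void clean_dcache_range_stub(const void *va, size_t len)
{
    (void)va;
    (void)len;
    clean_calls++;
    cleaned_after_clear = page_cleared;
}

/* Clear first, so the zeroes exist, then clean, so they reach RAM. */
static void clear_and_clean_page_sketch(void *page)
{
    memset(page, 0, PAGE_SIZE);
    page_cleared = 1;
    clean_dcache_range_stub(page, PAGE_SIZE);
}
```

With this order a non-coherent observer (such as an IOMMU walking the tables) would see the zeroed contents rather than stale data.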

>>>>>>> +    unsigned int nr_pages = _AC(1,U) << order;
>>>>>> Nit (style): Missing blank after comma.
>>>>> I've changed that to BIT(order, U)
>>>>>
>>>>>>> +    /* Return back nr_pages necessary for p2m root table. */
>>>>>>> +
>>>>>>> +    if ( ACCESS_ONCE(d->arch.paging.p2m_total_pages) < nr_pages )
>>>>>>> +        panic("Specify more xen,domain-p2m-mem-mb\n");
>>>>>> You shouldn't panic() in anything involved in domain creation. You want to
>>>>>> return NULL in this case.
>>>>> It makes sense in this case just to return NULL.
>>>>>
>>>>>> Further, to me the use of "more" looks misleading here. Do you perhaps mean
>>>>>> "larger" or "bigger"?
>>>>>>
>>>>>> This also looks to be happening without any lock held. If that's intentional,
>>>>>> I think the "why" wants clarifying in a code comment.
>>>>> Agree, returning back pages necessary for p2m root table should be done under
>>>>> spin_lock(&d->arch.paging.lock).
>>>> Which should be acquired at the paging_*() layer then, not at the p2m_*() layer.
>>>> (As long as you mean to have that separation, that is. See the earlier discussion
>>>> on that matter.)
>>> Then partly p2m_set_allocation() should be moved to paging_*() too.
>> Not exactly sure what you mean. On x86 at least the paging layer part of
>> the function is pretty slim.
> 
> I meant that part of code which is spin_lock(&d->arch.paging.lock); ... spin_unlock(&d->arch.paging.lock)
> in function p2m_set_allocation() should be moved somewhere to paging_*() layer for the same logic as you
> suggested to move part of  p2m_allocate_root()'s code which is guarded by d->arch.paging.lock to
> paging_*() layer.

Yes, of course.

Jan



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-06-10 13:05 ` [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration Oleksii Kurochko
@ 2025-07-01 15:08   ` Jan Beulich
  2025-07-15 14:47     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-01 15:08 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/page.h
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -76,6 +76,14 @@
>  #define PTE_SMALL       BIT(10, UL)
>  #define PTE_POPULATE    BIT(11, UL)
>  
> +enum pbmt_type_t {

Please can we stick to _t suffixes only being used on typedef-ed identifiers?

> +    pbmt_pma,
> +    pbmt_nc,
> +    pbmt_io,
> +    pbmt_rsvd,
> +    pbmt_max,

It's a 2-bit field in the PTE, isn't it? In which case the maximum valid value
to put there is 3. That's what an identifier named "max" should evaluate to.
The value 4 here would want to be named "count", "num", "nr", or alike.
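
A minimal sketch of the naming suggested above, assuming the field really is 2 bits wide. The identifiers pbmt_type, pbmt_count and PBMT_MAX are illustrative only, not what the series ended up using.

```c
/* No _t suffix, since this is not a typedef. */
enum pbmt_type {
    pbmt_pma,
    pbmt_nc,
    pbmt_io,
    pbmt_rsvd,
    pbmt_count, /* number of encodings (4), not itself a valid encoding */
};

/* Largest value that fits the 2-bit field. */
#define PBMT_MAX (pbmt_count - 1)
```

Here "count" names the number of encodings, while "max" names the largest value that may actually be stored in the PTE.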

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>      return __map_domain_page(p2m->root + root_table_indx);
>  }
>  
> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)

See comments on the earlier patch regarding naming.

> +{
> +    int rc;
> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));

How does this work, when you record GFNs only for Xenheap pages? I don't
think you can get around having the caller pass in the GFN. At which point
the PTE probably doesn't need passing.

> +    rc = radix_tree_insert(&p2m->p2m_type, gfn_x(gfn),
> +                           radix_tree_int_to_ptr(t));
> +    if ( rc == -EEXIST )
> +    {
> +        /* If a setting already exists, change it to the new one */
> +        radix_tree_replace_slot(
> +            radix_tree_lookup_slot(
> +                &p2m->p2m_type, gfn_x(gfn)),
> +            radix_tree_int_to_ptr(t));
> +        rc = 0;
> +    }
> +
> +    return rc;
> +}
> +
>  static p2m_type_t p2m_type_radix_get(struct p2m_domain *p2m, pte_t pte)
>  {
>      void *ptr;
> @@ -389,12 +409,87 @@ static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
>      p2m_write_pte(p, pte, clean_pte);
>  }
>  
> -static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn,
> -                                p2m_type_t t, p2m_access_t a)
> +static void p2m_set_permission(pte_t *e, p2m_type_t t, p2m_access_t a)
>  {
> -    panic("%s: hasn't been implemented yet\n", __func__);
> +    /* First apply type permissions */
> +    switch ( t )
> +    {
> +    case p2m_ram_rw:
> +        e->pte |= PTE_ACCESS_MASK;
> +        break;
> +
> +    case p2m_mmio_direct_dev:
> +        e->pte |= (PTE_READABLE | PTE_WRITABLE);
> +        e->pte &= ~PTE_EXECUTABLE;

What's wrong with code living in MMIO, e.g. in the ROM of a PCI device?
Such code would want to be executable.

> +        break;
> +
> +    case p2m_invalid:
> +        e->pte &= ~PTE_ACCESS_MASK;
> +        break;
> +
> +    default:
> +        BUG();
> +        break;
> +    }

I think you ought to handle all types that are defined right away. I also
don't think you should BUG() in the default case (also in the other switch()
below). ASSERT_UNREACHABLE() may be fine, along with clearing all permissions
in the entry for release builds.
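
One possible shape for such a default case, as a standalone sketch. The R/W/X bit positions follow the RISC-V PTE layout. The enum, the function name and ASSERT_UNREACHABLE_SKETCH() are hypothetical, and the macro is defined as a no-op here to model a release build.

```c
#include <stdint.h>

#define PTE_READABLE    (UINT64_C(1) << 1)
#define PTE_WRITABLE    (UINT64_C(1) << 2)
#define PTE_EXECUTABLE  (UINT64_C(1) << 3)
#define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)

/* Models release-build behaviour; a debug build would assert here. */
#define ASSERT_UNREACHABLE_SKETCH() ((void)0)

enum sketch_p2m_type { sketch_invalid, sketch_ram_rw };

static void set_type_permission_sketch(uint64_t *pte, enum sketch_p2m_type t)
{
    switch ( t )
    {
    case sketch_ram_rw:
        *pte |= PTE_ACCESS_MASK;
        break;

    case sketch_invalid:
        *pte &= ~PTE_ACCESS_MASK;
        break;

    default:
        /* Unknown type: complain in debug builds, fail closed otherwise. */
        ASSERT_UNREACHABLE_SKETCH();
        *pte &= ~PTE_ACCESS_MASK;
        break;
    }
}
```

An unexpected type then yields an entry without any access rights instead of bringing the hypervisor down.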

> +    /* Then restrict with access permissions */
> +    switch ( a )
> +    {
> +    case p2m_access_rwx:
> +        break;
> +    case p2m_access_wx:
> +        e->pte &= ~PTE_READABLE;
> +        break;
> +    case p2m_access_rw:
> +        e->pte &= ~PTE_EXECUTABLE;
> +        break;
> +    case p2m_access_w:
> +        e->pte &= ~(PTE_READABLE | PTE_EXECUTABLE);
> +        e->pte &= ~PTE_EXECUTABLE;
> +        break;
> +    case p2m_access_rx:
> +    case p2m_access_rx2rw:
> +        e->pte &= ~PTE_WRITABLE;
> +        break;
> +    case p2m_access_x:
> +        e->pte &= ~(PTE_READABLE | PTE_WRITABLE);
> +        break;
> +    case p2m_access_r:
> +        e->pte &= ~(PTE_WRITABLE | PTE_EXECUTABLE);
> +        break;
> +    case p2m_access_n:
> +    case p2m_access_n2rwx:
> +        e->pte &= ~PTE_ACCESS_MASK;
> +        break;
> +    default:
> +        BUG();
> +        break;
> +    }

Nit: Blank lines between non-fall-through case blocks, please.

> +static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t, p2m_access_t a)
> +{
> +    pte_t e = (pte_t) { 1 };

What's the 1 doing here?

> +    switch ( t )
> +    {
> +    case p2m_mmio_direct_dev:
> +        e.pte |= PTE_PBMT_IO;
> +        break;
> +
> +    default:
> +        break;
> +    }
> +
> +    p2m_set_permission(&e, t, a);
> +
> +    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
> +
> +    pte_set_mfn(&e, mfn);

Based on how things work on x86 (and how I would have expected them to also
work on Arm), may I suggest that you set MFN ahead of permissions, so that
the permissions setting function can use the MFN for e.g. a lookup in
mmio_ro_ranges.

> +    BUG_ON(p2m_type_radix_set(p2m, e, t));

I'm not convinced of this error handling here either. Radix tree insertion
_can_ fail, e.g. when there's no memory left. This must not bring down Xen,
or we'll have an XSA right away. You could zap the PTE, or if need be you
could crash the offending domain.
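
To illustrate the "zap the PTE" option, here is a standalone sketch in which the failure is handed back to the caller instead of hitting BUG_ON(). type_store_insert_stub() and make_entry_sketch() are hypothetical names; the stub merely simulates an insertion that may fail with -ENOMEM.

```c
#include <errno.h>
#include <stdint.h>

/* Test knob: when non-zero, the simulated insertion fails. */
static int fail_insert;

/* Hypothetical stand-in for the type store (e.g. a radix tree insert). */
static int type_store_insert_stub(unsigned long gfn, int type)
{
    (void)gfn;
    (void)type;
    return fail_insert ? -ENOMEM : 0;
}

/* Record the type first; only install the entry if that succeeded. */
static int make_entry_sketch(uint64_t *pte, unsigned long gfn, int type,
                             uint64_t new_val)
{
    int rc = type_store_insert_stub(gfn, type);

    if ( rc )
    {
        *pte = 0; /* zap: leave an invalid entry behind */
        return rc;
    }

    *pte = new_val;

    return 0;
}
```

The caller can then propagate the error, or crash only the affected domain, as it sees fit.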

In this context (not sure if I asked before): With this use of a radix tree,
how do you intend to bound the amount of memory that a domain can use, by
making Xen insert very many entries?

Jan



* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-06-10 13:05 ` [PATCH v2 14/17] xen/riscv: implement p2m_next_level() Oleksii Kurochko
@ 2025-07-02  8:35   ` Jan Beulich
  2025-07-16 11:32     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-02  8:35 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>      return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>  }
>  
> +/*
> + * pte_is_* helpers are checking the valid bit set in the
> + * PTE but we have to check p2m_type instead (look at the comment above
> + * p2me_is_valid())
> + * Provide our own overlay to check the valid bit.
> + */
> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
> +{
> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
> +}

Same question as on the earlier patch - does P2M type apply to intermediate
page tables at all? (Conceptually it shouldn't.)

> @@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
>      return e;
>  }
>  
> +/* Generate table entry with correct attributes. */
> +static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
> +{
> +    /*
> +     * Since this function generates a table entry, according to "Encoding
> +     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
> +     * to point to the next level of the page table.
> +     * Therefore, to ensure that an entry is a page table entry,
> +     * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
> +     * which overrides whatever was passed as `p2m_type_t` and guarantees that
> +     * the entry is a page table entry by setting r = w = x = 0.
> +     */
> +    return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw, p2m_access_n2rwx);

Similarly P2M access shouldn't apply to intermediate page tables. (Moot
with that, but (ab)using p2m_access_n2rwx would also look wrong: You did
read what it means, didn't you?)

> +}
> +
> +static struct page_info *p2m_alloc_page(struct domain *d)
> +{
> +    struct page_info *pg;
> +
> +    /*
> +     * For hardware domain, there should be no limit in the number of pages that
> +     * can be allocated, so that the kernel may take advantage of the extended
> +     * regions. Hence, allocate p2m pages for hardware domains from heap.
> +     */
> +    if ( is_hardware_domain(d) )
> +    {
> +        pg = alloc_domheap_page(d, MEMF_no_owner);
> +        if ( pg == NULL )
> +            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
> +    }

The comment looks to have been taken verbatim from Arm. Whatever "extended
regions" are, does the same concept even exist on RISC-V?

Also, special casing Dom0 like this has benefits, but also comes with a
pitfall: If the system's out of memory, allocations will fail. A pre-
populated pool would avoid that (until exhausted, of course). If special-
casing of Dom0 is needed, I wonder whether ...

> +    else
> +    {
> +        spin_lock(&d->arch.paging.lock);
> +        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
> +        spin_unlock(&d->arch.paging.lock);
> +    }

... going this path but with a Dom0-only fallback to general allocation
wouldn't be the better route.

> +    return pg;
> +}
> +
> +/* Allocate a new page table page and hook it in via the given entry. */
> +static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
> +{
> +    struct page_info *page;
> +    pte_t *p;
> +
> +    ASSERT(!p2me_is_valid(p2m, *entry));
> +
> +    page = p2m_alloc_page(p2m->domain);
> +    if ( page == NULL )
> +        return -ENOMEM;
> +
> +    page_list_add(page, &p2m->pages);
> +
> +    p = __map_domain_page(page);
> +    clear_page(p);
> +
> +    unmap_domain_page(p);

clear_domain_page()? Or actually clear_and_clean_page()?

> @@ -516,9 +591,33 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>                            unsigned int level, pte_t **table,
>                            unsigned int offset)
>  {
> -    panic("%s: hasn't been implemented yet\n", __func__);
> +    pte_t *entry;
> +    int ret;
> +    mfn_t mfn;
> +
> +    entry = *table + offset;
> +
> +    if ( !p2me_is_valid(p2m, *entry) )
> +    {
> +        if ( !alloc_tbl )
> +            return GUEST_TABLE_MAP_NONE;
> +
> +        ret = p2m_create_table(p2m, entry);
> +        if ( ret )
> +            return GUEST_TABLE_MAP_NOMEM;
> +    }
> +
> +    /* The function p2m_next_level() is never called at the last level */
> +    ASSERT(level != 0);

Logically you would perhaps better do this ahead of trying to allocate a
page table. Calls here with level == 0 are invalid in all cases aiui, not
just when you make it here.

> +    if ( p2me_is_mapping(p2m, *entry) )
> +        return GUEST_TABLE_SUPER_PAGE;
> +
> +    mfn = mfn_from_pte(*entry);
> +
> +    unmap_domain_page(*table);
> +    *table = map_domain_page(mfn);

Just to mention it (may not need taking care of right away), there's an
inefficiency here: In p2m_create_table() you map the page to clear it.
Then you tear down that mapping, just to re-establish it here.

Jan



* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-06-10 13:05 ` [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
@ 2025-07-02  9:25   ` Jan Beulich
  2025-07-17 16:37     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-02  9:25 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Add support for breaking down large memory mappings ("superpages") in the RISC-V
> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
> can be inserted into lower levels of the page table hierarchy.
> 
> To implement that the following is done:
> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>   smaller page table entries down to the target level, preserving original
>   permissions and attributes.
> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>   entries at lower levels within a superpage-mapped region.
> 
> This implementation is based on the ARM code, with modifications to the part
> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
> not require BBM, so there is no need to invalidate the PTE and flush the
> TLB before updating it with the newly created, split page table.

But some flushing is going to be necessary. As long as you only ever do
global flushes, the one after the individual PTE modification (within the
split table) will do (if BBM isn't required, see below), but once you move
to more fine-grained flushing, that's not going to be enough anymore. Not
sure it's a good idea to leave such a pitfall.

As to (no need for) BBM: I couldn't find anything to that effect in the
privileged spec. Can you provide some pointer? What I found instead is e.g.
this sentence: "To ensure that implicit reads observe writes to the same
memory locations, an SFENCE.VMA instruction must be executed after the
writes to flush the relevant cached translations." And this: "Accessing the
same location using different cacheability attributes may cause loss of
coherence." (This may not only occur when the same physical address is
mapped twice at different VAs, but also after the shattering of a superpage
when the new entry differs in cacheability.)

> Additionally, the page table walk logic has been adjusted, as ARM uses the
> opposite walk order compared to RISC-V.

I think you used some similar wording already in an earlier patch. I find
this confusing: Walk order is, aiui, the same. It's merely the numbering
of levels that is the opposite way round, isn't it?

> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
> Changes in V2:
>  - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>    functionality" which was split into smaller ones.
>  - Update the comment above the loop which creates the new page table, as
>    RISC-V traverses page tables in the opposite order to ARM.
>  - RISC-V doesn't require BBM, so there is no need for invalidating
>    and TLB flushing before updating the PTE.
> ---
>  xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 101 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
> index 87dd636b80..79c4473f1f 100644
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
>      p2m_free_page(p2m->domain, pg);
>  }
>  
> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> +                                unsigned int level, unsigned int target,
> +                                const unsigned int *offsets)
> +{
> +    struct page_info *page;
> +    unsigned int i;
> +    pte_t pte, *table;
> +    bool rv = true;
> +
> +    /* Convenience aliases */
> +    mfn_t mfn = pte_get_mfn(*entry);
> +    unsigned int next_level = level - 1;
> +    unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
> +
> +    /*
> +     * This should only be called with target != level and the entry is
> +     * a superpage.
> +     */
> +    ASSERT(level > target);
> +    ASSERT(p2me_is_superpage(p2m, *entry, level));
> +
> +    page = p2m_alloc_page(p2m->domain);
> +    if ( !page )
> +        return false;
> +
> +    page_list_add(page, &p2m->pages);

Is there a reason this list maintenance isn't done in p2m_alloc_page()?

> +    table = __map_domain_page(page);
> +
> +    /*
> +     * We are either splitting a second level 1G page into 512 first level
> +     * 2M pages, or a first level 2M page into 512 zero level 4K pages.
> +     */
> +    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
> +    {
> +        pte_t *new_entry = table + i;
> +
> +        /*
> +         * Use the content of the superpage entry and override
> +         * the necessary fields. So the correct permission are kept.
> +         */
> +        pte = *entry;
> +        pte_set_mfn(&pte, mfn_add(mfn, i << level_order));

While okay as long as you only permit superpages up to 1G, this is another
trap for someone to fall into: Imo i would better be unsigned long right
away, considering that RISC-V permits large pages at all levels.
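
A small worked example of the trap, as standalone C with illustrative helper names. It assumes 9 bits per level, so a table one level below a hypothetical 256T superpage would have entries of order 27. With i == 511 the shift no longer fits in 32 bits.

```c
#include <stdint.h>

/* Shift carried out in 32 bits: the top bits are silently lost. */
static uint64_t mfn_offset_32bit(uint32_t i, unsigned int order)
{
    return i << order;
}

/* Shift carried out in 64 bits: the full offset is kept. */
static uint64_t mfn_offset_64bit(uint32_t i, unsigned int order)
{
    return (uint64_t)i << order;
}
```

For order 9 (splitting 1G into 2M) both variants agree. For order 27 the 32-bit variant yields 0xF8000000 where 0xFF8000000 is wanted.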

> +        write_pte(new_entry, pte);
> +    }
> +
> +    /*
> +     * Shatter superpage in the page to the level we want to make the
> +     * changes.
> +     * This is done outside the loop to avoid checking the offset to
> +     * know whether the entry should be shattered for every entry.
> +     */
> +    if ( next_level != target )
> +        rv = p2m_split_superpage(p2m, table + offsets[next_level],
> +                                 level - 1, target, offsets);

I don't understand the comment: Under what conditions would every entry
need (further) shattering? And where's that happening? Or is this merely
a word ordering issue in the sentence, and "for every entry" wants
moving ahead? (In that case I'm unconvinced this is in need of commenting
upon.)

> +    /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
> +    if ( p2m->clean_pte )
> +        clean_dcache_va_range(table, PAGE_SIZE);
> +
> +    unmap_domain_page(table);

Again likely not something that wants taking care of right away, but there
again is an inefficiency here: The caller almost certainly wants to map
the same page again, to update the one entry that caused the request to
shatter the page.

> +    /*
> +     * Even if we failed, we should install the newly allocated PTE
> +     * entry. The caller will be in charge to free the sub-tree.
> +     */
> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);

Why would it be wrong to free the page right here, vacating the entry at
the same time (or leaving just that to the caller)? (IOW - if this is an
implementation decision of yours, I think the word "should" would want
dropping.) After all, the caller invoking p2m_free_entry() on the thus
split PTE is less efficient (needs to iterate over all entries) than on
the original one (where it's just a single superpage).

> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>       */
>      if ( level > target )

This condition is likely too strong, unless you actually mean to also
split a superpage if it really wouldn't need splitting (new entry written
still fitting with the superpage mapping, i.e. suitable MFN and same
attributes).

>      {
> -        panic("Shattering isn't implemented\n");
> +        /* We need to split the original page. */
> +        pte_t split_pte = *entry;
> +
> +        ASSERT(p2me_is_superpage(p2m, *entry, level));
> +
> +        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
> +        {
> +            /* Free the allocated sub-tree */
> +            p2m_free_entry(p2m, split_pte, level);
> +
> +            rc = -ENOMEM;
> +            goto out;
> +        }
> +
> +        p2m_write_pte(entry, split_pte, p2m->clean_pte);
> +
> +        /* Then move to the level we want to make real changes */
> +        for ( ; level < target; level++ )

Don't you mean to move downwards here? At which point I wonder: Did you test
this code?
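
To make the direction issue concrete, here is a standalone sketch with hypothetical function names that just counts loop iterations. Inside the enclosing "if ( level > target )", an upward-counting loop never runs, whereas a downward-counting one performs level - target steps.

```c
/* Loop as posted: condition is false on entry whenever level > target. */
static unsigned int steps_counting_up(unsigned int level, unsigned int target)
{
    unsigned int n = 0;

    for ( ; level < target; level++ )
        n++;

    return n;
}

/* Loop moving towards the leaf level (RISC-V numbers the leaf level as 0). */
static unsigned int steps_counting_down(unsigned int level, unsigned int target)
{
    unsigned int n = 0;

    for ( ; level > target; level-- )
        n++;

    return n;
}
```

Starting at level 2 with target 0, the first variant takes 0 steps and the second takes 2.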

> +        {
> +            rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
> +
> +            /*
> +             * The entry should be found and either be a table
> +             * or a superpage if level 0 is not targeted
> +             */
> +            ASSERT(rc == GUEST_TABLE_NORMAL ||
> +                   (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
> +        }

This, too, is inefficient (but likely good enough as a starting point): You walk
tables twice - first when splitting, and then again when finding the target level.

Considering the enclosing if(), this also again is a do/while() candidate.

Jan


* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-06-10 13:05 ` [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
@ 2025-07-02 10:09   ` Jan Beulich
  2025-07-02 10:28     ` Jan Beulich
                       ` (2 more replies)
  0 siblings, 3 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 10:09 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Implement the mfn_valid() macro to verify whether a given MFN is valid by
> checking that it falls within the range [start_page, max_page).
> These bounds are initialized based on the start and end addresses of RAM.
> 
> As part of this patch, start_page is introduced and initialized with the
> PFN of the first RAM page.
> 
> Also, after providing a non-stub implementation of the mfn_valid() macro,
> the following compilation errors started to occur:
>   riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>   /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>   riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>   /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>   riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>   /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>   riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>   riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>   /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>   riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>   riscv64-linux-gnu-ld: final link failed: bad value
>   make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
> To resolve these errors, the following functions have also been introduced,
> based on their Arm counterparts:
> - page_get_owner_and_reference() and its variant to safely acquire a
>   reference to a page and retrieve its owner.
> - put_page() and put_page_nr() to release page references and free the page
>   when the count drops to zero.
>   For put_page_nr(), code related to static memory configuration is wrapped
>   with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
>   common code. Therefore, PGC_static and free_domstatic_page() are not
>   introduced for RISC-V. However, since this configuration could be useful
>   in the future, the relevant code is retained and conditionally compiled.
> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>   unreachable, as RAM type checking is not yet implemented.

How does this end up working when common code references the function?

> @@ -288,8 +289,12 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>  #define page_get_owner(p)    (p)->v.inuse.domain
>  #define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
>  
> -/* TODO: implement */
> -#define mfn_valid(mfn) ({ (void)(mfn); 0; })
> +extern unsigned long start_page;
> +
> +#define mfn_valid(mfn) ({                                   \
> +    unsigned long mfn__ = mfn_x(mfn);                       \
> +    likely((mfn__ >= start_page) && (mfn__ < max_page));    \
> +})

I don't think you should try to be clever and avoid using __mfn_valid() here,
at least not without an easily identifiable TODO. Surely you've seen that both
Arm and x86 use it.

Also, according to all I know, likely() doesn't work very well when used like
this, except for architectures supporting conditionally executed insns (like
Arm32 or IA-64, i.e. beyond conditional branches). I.e. if you want to use
likely() here, I think you need two of them.
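
Purely to illustrate the "two likely()" remark (it does not address the __mfn_valid() point), here is a standalone sketch. It assumes a GCC/Clang-style __builtin_expect(); the variable names, their values and mfn_valid_sketch() are made up for the example.

```c
#include <stdbool.h>

/* Branch hint, in the style commonly built on __builtin_expect(). */
#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical RAM bounds, in page frame numbers. */
static unsigned long start_page_sketch = 0x80000UL;
static unsigned long max_page_sketch = 0x100000UL;

static bool mfn_valid_sketch(unsigned long mfn)
{
    /* One hint per comparison, so each conditional branch is annotated. */
    return likely(mfn >= start_page_sketch) &&
           likely(mfn < max_page_sketch);
}
```

Each side of the && typically becomes its own conditional branch, which is why a single hint around the whole expression tends not to help.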

> @@ -525,6 +520,8 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
>  #error setup_{directmap,frametable}_mapping() should be implemented for RV_32
>  #endif
>  
> +unsigned long __read_mostly start_page;

Memory hotplug question again: __read_mostly or __ro_after_init?

> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>  {
>      return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>  }
> +
> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
> +{
> +    ASSERT_UNREACHABLE();
> +
> +    return 0;
> +}
> +
> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
> +                                                      unsigned long nr)
> +{
> +    unsigned long x, y = page->count_info;
> +    struct domain *owner;
> +
> +    /* Restrict nr to avoid "double" overflow */
> +    if ( nr >= PGC_count_mask )
> +    {
> +        ASSERT_UNREACHABLE();
> +        return NULL;
> +    }

I question the validity of this, already in the Arm original: I can't spot
how the caller guarantees to stay below that limit. Without such an
(attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
any limit check.

> +    do {
> +        x = y;
> +        /*
> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
> +         * Count == -1: Reference count would wrap, which is invalid.
> +         */

May I once again ask that you look carefully at comments (as much as at code)
you copy. Clearly this comment wasn't properly updated when the bumping by 1
was changed to bumping by nr.
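
For reference, here is what the check does once the increment is nr rather than 1, as a standalone sketch. COUNT_MASK_SKETCH is a deliberately narrow, hypothetical mask, and 1 <= nr < mask is assumed, as enforced by the earlier range check. The comment wording is only a suggestion.

```c
#include <stdbool.h>

#define COUNT_MASK_SKETCH 0xffUL /* hypothetical 8-bit count field */

static bool can_take_refs_sketch(unsigned long x, unsigned long nr)
{
    /*
     * Count == 0: Page is not allocated, so we cannot take references.
     * Count + nr beyond the count field: the count would wrap, which
     * is invalid.
     * In both cases the masked sum ends up <= nr.
     */
    return ((x + nr) & COUNT_MASK_SKETCH) > nr;
}
```

With nr == 3, a count of 0 is rejected (3 <= 3), a count of 0xfc is accepted (result 0xff), and a count of 0xfd is rejected because the masked sum wraps to 0.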

> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
> +            return NULL;
> +    }
> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
> +
> +    owner = page_get_owner(page);
> +    ASSERT(owner);
> +
> +    return owner;
> +}
> +
> +struct domain *page_get_owner_and_reference(struct page_info *page)
> +{
> +    return page_get_owner_and_nr_reference(page, 1);
> +}
> +
> +void put_page_nr(struct page_info *page, unsigned long nr)
> +{
> +    unsigned long nx, x, y = page->count_info;
> +
> +    do {
> +        ASSERT((y & PGC_count_mask) >= nr);
> +        x  = y;
> +        nx = x - nr;
> +    }
> +    while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );
> +
> +    if ( unlikely((nx & PGC_count_mask) == 0) )
> +    {
> +#ifdef CONFIG_STATIC_MEMORY
> +        if ( unlikely(nx & PGC_static) )
> +            free_domstatic_page(page);
> +        else
> +#endif

Such #ifdef-ed-out code is liable to go stale. Minimally use IS_ENABLED().
Even better would imo be if you introduced a "stub" PGC_static, resolving
to 0 (i.e. for now unconditionally).
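
A sketch of the "stub" idea, with hypothetical names and counters standing in for the two free paths. With the flag defined as 0 the static branch is still parsed and type-checked by the compiler, yet can never be taken.

```c
/* Stub: no static-memory support yet, so the flag has no bits set. */
#define PGC_static_sketch 0UL

static unsigned int freed_static;
static unsigned int freed_normal;

static void free_page_sketch(unsigned long count_info)
{
    if ( count_info & PGC_static_sketch ) /* constant false for now */
        freed_static++;
    else
        freed_normal++;
}
```

Once the real flag is introduced, only the definition needs changing; no #ifdef has to be removed.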

Jan



* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-06-30 15:50             ` Jan Beulich
@ 2025-07-02 10:13               ` Oleksii Kurochko
  2025-07-02 10:36                 ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-02 10:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 6/30/25 5:50 PM, Jan Beulich wrote:
> On 30.06.2025 17:27, Oleksii Kurochko wrote:
>> On 6/30/25 4:45 PM, Jan Beulich wrote:
>>> On 30.06.2025 16:38, Oleksii Kurochko wrote:
>>>> On 6/30/25 4:33 PM, Oleksii Kurochko wrote:
>>>>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>>>>     typedef enum {
>>>>>>>         p2m_invalid = 0,    /* Nothing mapped here */
>>>>>>>         p2m_ram_rw,         /* Normal read/write domain RAM */
>>>>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>>>>> As indicated before - this type should be added when the special handling that
>>>>>> it requires is also introduced.
>>>>> Perhaps, I missed that. I will drop this type for now.
>>>>>
>>>>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>>>>> What's the _dev suffix indicating here?
>>>>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>>>>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>>>>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>>>>
>>>>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
>>>> I forgot that p2m_mmio_direct_dev is used by common code for dom0less code (handle_passthrough_prop())
>>> That'll want abstracting out, I think. I don't view it as helpful to clutter
>>> RISC-V (and later perhaps also PPC) with Arm-specific terminology.
>> Would it be better then just rename it to p2m_device? Then it won't clear for Arm which type of MMIO p2m's
>> types is used as Arm has there MMIO types: *_dev, *_nc, *_c.
> I don't understand why Arm matters here. P2M types want naming in a way that makes
> sense for RISC-V.

It doesn't matter.
But if we want to change the type name from p2m_mmio_direct_dev to p2m_mmio_direct or p2m_device, then it will
affect Arm too, as p2m_mmio_direct_dev is used in dom0less code which is shared with Arm.
I just re-used p2m_mmio_direct_dev because it looked pretty generic to me and makes clear what this type is for.

>> As an option (which I don't really like) it could be "#define p2m_mmio_direct_dev ARCH_specific_name" in
>> asm/p2m.h to not touch common code.
> A #define may be needed, but not one to _still_ introduce Arm naming into non-Arm
> code.

As I mentioned above, p2m_mmio_direct_dev sounds pretty generic to me, and I am okay with using it for
RISC-V. But if you have a better suggestion, I will be happy to consider it.

~ Oleksii



* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-02 10:09   ` Jan Beulich
@ 2025-07-02 10:28     ` Jan Beulich
  2025-07-18 14:37       ` Oleksii Kurochko
  2025-07-02 12:52     ` Orzel, Michal
  2025-07-18 14:49     ` Oleksii Kurochko
  2 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 10:28 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel, Oleksii Kurochko

On 02.07.2025 12:09, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>  {
>>      return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>  }
>> +
>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>> +{
>> +    ASSERT_UNREACHABLE();
>> +
>> +    return 0;
>> +}
>> +
>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>> +                                                      unsigned long nr)
>> +{
>> +    unsigned long x, y = page->count_info;
>> +    struct domain *owner;
>> +
>> +    /* Restrict nr to avoid "double" overflow */
>> +    if ( nr >= PGC_count_mask )
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        return NULL;
>> +    }
> 
> I question the validity of this, already in the Arm original: I can't spot
> how the caller guarantees to stay below that limit. Without such an
> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
> any limit check.
> 
>> +    do {
>> +        x = y;
>> +        /*
>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>> +         * Count == -1: Reference count would wrap, which is invalid.
>> +         */
> 
> May I once again ask that you look carefully at comments (as much as at code)
> you copy. Clearly this comment wasn't properly updated when the bumping by 1
> was changed to bumping by nr.
> 
>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>> +            return NULL;
>> +    }
>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>> +
>> +    owner = page_get_owner(page);
>> +    ASSERT(owner);
>> +
>> +    return owner;
>> +}

There also looks to be a dead code concern here (towards the "nr" parameters
here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
leave out Misra rule 2.2 entirely.
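
To make the "bump by nr" check discussed above concrete, here is a minimal
sketch using C11 atomics rather than Xen's cmpxchg(). COUNT_MASK and
get_nr_refs() are hypothetical stand-ins, and the 8-bit count field is an
assumption chosen to keep the numbers small. With a count c and a bump of nr,
((c + nr) & mask) <= nr holds exactly when c is 0 (the result equals nr) or
when the count field would wrap (the result is below nr):

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: low 8 bits hold the reference count, flags sit above. */
#define COUNT_MASK 0xffUL
static_assert(COUNT_MASK < ULONG_MAX, "flag bits live above the count field");

/* Try to take nr references; fail on an unallocated page or on overflow. */
static bool get_nr_refs(_Atomic unsigned long *count_info, unsigned long nr)
{
    unsigned long x = atomic_load(count_info);

    /* Reject requests the count field could never represent. */
    if ( nr == 0 || nr >= COUNT_MASK )
        return false;

    do {
        /*
         * Count == 0: page is not allocated, no reference can be taken.
         * Count + nr wraps: the masked sum drops below nr.
         */
        if ( ((x + nr) & COUNT_MASK) <= nr )
            return false;
        /* On failure x is reloaded with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(count_info, &x, x + nr) );

    return true;
}
```

Note that this sketch simply returns failure for an out-of-range nr; whether
an assertion is appropriate there depends on what callers guarantee, which is
the question raised above.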

Jan



* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-01 13:04   ` Jan Beulich
@ 2025-07-02 10:30     ` Oleksii Kurochko
  2025-07-02 10:34       ` Jan Beulich
  2025-07-02 11:48     ` Oleksii Kurochko
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-02 10:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/1/25 3:04 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>   
>>       return 0;
>>   }
>> +
>> +/*
>> + * Set the pool of pages to the required number of pages.
>> + * Returns 0 for success, non-zero for failure.
>> + * Call with d->arch.paging.lock held.
>> + */
>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>> +{
>> +    struct page_info *pg;
>> +
>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>> +
>> +    for ( ; ; )
>> +    {
>> +        if ( d->arch.paging.p2m_total_pages < pages )
>> +        {
>> +            /* Need to allocate more memory from domheap */
>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>> +            if ( pg == NULL )
>> +            {
>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>> +                return -ENOMEM;
>> +            }
>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>> +        }
>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>> +        {
>> +            /* Need to return memory to domheap */
>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>> +            if( pg )
>> +            {
>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>> +                free_domheap_page(pg);
>> +            }
>> +            else
>> +            {
>> +                printk(XENLOG_ERR
>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>> +                return -ENOMEM;
>> +            }
>> +        }
>> +        else
>> +            break;
>> +
>> +        /* Check to see if we need to yield and try again */
>> +        if ( preempted && general_preempt_check() )
>> +        {
>> +            *preempted = true;
>> +            return -ERESTART;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
> Btw, with the order-2 requirement for the root page table, you may want to
> consider an alternative approach: Here you could allocate some order-2
> pages (possibly up to as many as a domain might need, which right now
> would be exactly one), put them on a separate list, and consume the root
> table(s) from there. If you run out of pages on the order-0 list, you
> could shatter a page from the order-2 one (as long as that's still non-
> empty). The difficulty would be with freeing, where a previously shattered
> order-2 page would be nice to re-combine once all of its constituents are
> free again. The main benefit would be avoiding the back and forth in patch
> 6.

It is an option.

But I'm still not 100% sure it's necessary to allocate the root page table
from the freelist. We could simply allocate the root page table from the
domheap (as is done for hardware domains) and reserve the freelist for other
pages.
The freelist is specific to Dom0less guest domains and is primarily used to
limit the amount of memory available for the guest—essentially for static
configurations where you want a clear and fixed limit on p2m allocations.

~ Oleksii




* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-02 10:30     ` Oleksii Kurochko
@ 2025-07-02 10:34       ` Jan Beulich
  2025-07-02 11:17         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 10:34 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 02.07.2025 12:30, Oleksii Kurochko wrote:
> 
> On 7/1/25 3:04 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>>   
>>>       return 0;
>>>   }
>>> +
>>> +/*
>>> + * Set the pool of pages to the required number of pages.
>>> + * Returns 0 for success, non-zero for failure.
>>> + * Call with d->arch.paging.lock held.
>>> + */
>>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>> +{
>>> +    struct page_info *pg;
>>> +
>>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>>> +
>>> +    for ( ; ; )
>>> +    {
>>> +        if ( d->arch.paging.p2m_total_pages < pages )
>>> +        {
>>> +            /* Need to allocate more memory from domheap */
>>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>>> +            if ( pg == NULL )
>>> +            {
>>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>>> +                return -ENOMEM;
>>> +            }
>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>>> +        }
>>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>>> +        {
>>> +            /* Need to return memory to domheap */
>>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>> +            if( pg )
>>> +            {
>>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>> +                free_domheap_page(pg);
>>> +            }
>>> +            else
>>> +            {
>>> +                printk(XENLOG_ERR
>>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>>> +                return -ENOMEM;
>>> +            }
>>> +        }
>>> +        else
>>> +            break;
>>> +
>>> +        /* Check to see if we need to yield and try again */
>>> +        if ( preempted && general_preempt_check() )
>>> +        {
>>> +            *preempted = true;
>>> +            return -ERESTART;
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> Btw, with the order-2 requirement for the root page table, you may want to
>> consider an alternative approach: Here you could allocate some order-2
>> pages (possibly up to as many as a domain might need, which right now
>> would be exactly one), put them on a separate list, and consume the root
>> table(s) from there. If you run out of pages on the order-0 list, you
>> could shatter a page from the order-2 one (as long as that's still non-
>> empty). The difficulty would be with freeing, where a previously shattered
>> order-2 page would be nice to re-combine once all of its constituents are
>> free again. The main benefit would be avoiding the back and forth in patch
>> 6.
> 
> It is an option.
> 
> But I'm still not 100% sure it's necessary to allocate the root page table
> from the freelist. We could simply allocate the root page table from the
> domheap (as is done for hardware domains) and reserve the freelist for other
> pages.
> The freelist is specific to Dom0less guest domains and is primarily used to
> limit the amount of memory available for the guest—essentially for static
> configurations where you want a clear and fixed limit on p2m allocations.

Is that true? My understanding is that this pre-populated pool is used by
all DomU-s, whether or not under dom0less.

Plus we're meaning to move towards better accounting of memory used by a
domain (besides its actual allocation). Allocating the root table from the
domain heap would move us one small step farther away from there.

Jan



* Re: [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification
  2025-07-02 10:13               ` Oleksii Kurochko
@ 2025-07-02 10:36                 ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 10:36 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 02.07.2025 12:13, Oleksii Kurochko wrote:
> 
> On 6/30/25 5:50 PM, Jan Beulich wrote:
>> On 30.06.2025 17:27, Oleksii Kurochko wrote:
>>> On 6/30/25 4:45 PM, Jan Beulich wrote:
>>>> On 30.06.2025 16:38, Oleksii Kurochko wrote:
>>>>> On 6/30/25 4:33 PM, Oleksii Kurochko wrote:
>>>>>> On 6/26/25 4:59 PM, Jan Beulich wrote:
>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>> --- a/xen/arch/riscv/include/asm/p2m.h
>>>>>>>> +++ b/xen/arch/riscv/include/asm/p2m.h
>>>>>>>> @@ -61,8 +61,28 @@ struct p2m_domain {
>>>>>>>>     typedef enum {
>>>>>>>>         p2m_invalid = 0,    /* Nothing mapped here */
>>>>>>>>         p2m_ram_rw,         /* Normal read/write domain RAM */
>>>>>>>> +    p2m_ram_ro,         /* Read-only; writes are silently dropped */
>>>>>>> As indicated before - this type should be added when the special handling that
>>>>>>> it requires is also introduced.
>>>>>> Perhaps, I missed that. I will drop this type for now.
>>>>>>
>>>>>>>> +    p2m_mmio_direct_dev,/* Read/write mapping of genuine Device MMIO area */
>>>>>>> What's the _dev suffix indicating here?
>>>>>> It indicates that it is device memory, probably, it isn't so necessary in case of RISC-V as
>>>>>> spec doesn't use such terminology. In RISC-V there is only available IO, NC. And we are
>>>>>> using PTE_PBMT_IO for p2m_mmio_direct_dev.
>>>>>>
>>>>>> Maybe it would be better just to rename s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
>>>>> I forgot that p2m_mmio_direct_dev is used by common code for dom0less code (handle_passthrough_prop())
>>>> That'll want abstracting out, I think. I don't view it as helpful to clutter
>>>> RISC-V (and later perhaps also PPC) with Arm-specific terminology.
>>> Would it be better then just rename it to p2m_device? Then it won't clear for Arm which type of MMIO p2m's
>>> types is used as Arm has there MMIO types: *_dev, *_nc, *_c.
>> I don't understand why Arm matters here. P2M types want naming in a way that makes
>> sense for RISC-V.
> 
> It doesn't matter.
> But if we want to change the type name from p2m_mmio_direct_dev to p2m_mmio_direct or p2m_device then it will
> affect Arm too as p2m_mmio_direct_dev is used in dom0less code which is also used by Arm.

As said - imo this needs abstracting away.

> I just re-used p2m_mmio_direct_dev as it looked for me pretty generic and clear for what this type is.
> 
>>> As an option (which I don't really like) it could be "#define p2m_mmio_direct_dev ARCH_specific_name" in
>>> asm/p2m.h to not touch common code.
>> A #define may be needed, but not one to _still_ introduce Arm naming into non-Arm
>> code.
> 
> As I mentioned above that p2m_mmio_direct_dev sounds pretty generic to me and I am okay to use it for
> RISC-V. But if you have better suggestions I will be happy to consider it.

Well, the name we use on x86 (and I think this was quite obviously implied
by earlier replies of mine): p2m_mmio_direct.
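
One possible shape of the abstraction, as a sketch only: common code asks for
an arch-neutral name, and each architecture maps it to its own P2M type.
ARCH_PASSTHROUGH_MMIO_TYPE, passthrough_mmio_type() and SKETCH_ARCH_ARM are
hypothetical names invented for this illustration; only the p2m_* type names
come from the discussion above.

```c
#include <assert.h>

#if defined(SKETCH_ARCH_ARM)
/* Arm keeps its three direct-MMIO flavours. */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct_dev,
    p2m_mmio_direct_nc,
    p2m_mmio_direct_c,
} p2m_type_t;
#define ARCH_PASSTHROUGH_MMIO_TYPE p2m_mmio_direct_dev
#else
/* RISC-V (and x86-style naming): a single direct-MMIO type. */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct,
} p2m_type_t;
#define ARCH_PASSTHROUGH_MMIO_TYPE p2m_mmio_direct
#endif

static_assert(ARCH_PASSTHROUGH_MMIO_TYPE != p2m_invalid,
              "passthrough type must be a real mapping type");

/* What common (dom0less-like) code would call instead of naming a type. */
static p2m_type_t passthrough_mmio_type(void)
{
    return ARCH_PASSTHROUGH_MMIO_TYPE;
}
```

With this arrangement no Arm-specific identifier appears in the non-Arm
branch, and common code never spells out either name.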

Jan



* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-02 10:34       ` Jan Beulich
@ 2025-07-02 11:17         ` Oleksii Kurochko
  0 siblings, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-02 11:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/2/25 12:34 PM, Jan Beulich wrote:
> On 02.07.2025 12:30, Oleksii Kurochko wrote:
>> On 7/1/25 3:04 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>>>    
>>>>        return 0;
>>>>    }
>>>> +
>>>> +/*
>>>> + * Set the pool of pages to the required number of pages.
>>>> + * Returns 0 for success, non-zero for failure.
>>>> + * Call with d->arch.paging.lock held.
>>>> + */
>>>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>> +{
>>>> +    struct page_info *pg;
>>>> +
>>>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>>>> +
>>>> +    for ( ; ; )
>>>> +    {
>>>> +        if ( d->arch.paging.p2m_total_pages < pages )
>>>> +        {
>>>> +            /* Need to allocate more memory from domheap */
>>>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>>>> +            if ( pg == NULL )
>>>> +            {
>>>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>>>> +                return -ENOMEM;
>>>> +            }
>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>>>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>>>> +        }
>>>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>>>> +        {
>>>> +            /* Need to return memory to domheap */
>>>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>> +            if( pg )
>>>> +            {
>>>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>> +                free_domheap_page(pg);
>>>> +            }
>>>> +            else
>>>> +            {
>>>> +                printk(XENLOG_ERR
>>>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>>>> +                return -ENOMEM;
>>>> +            }
>>>> +        }
>>>> +        else
>>>> +            break;
>>>> +
>>>> +        /* Check to see if we need to yield and try again */
>>>> +        if ( preempted && general_preempt_check() )
>>>> +        {
>>>> +            *preempted = true;
>>>> +            return -ERESTART;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>> Btw, with the order-2 requirement for the root page table, you may want to
>>> consider an alternative approach: Here you could allocate some order-2
>>> pages (possibly up to as many as a domain might need, which right now
>>> would be exactly one), put them on a separate list, and consume the root
>>> table(s) from there. If you run out of pages on the order-0 list, you
>>> could shatter a page from the order-2 one (as long as that's still non-
>>> empty). The difficulty would be with freeing, where a previously shattered
>>> order-2 page would be nice to re-combine once all of its constituents are
>>> free again. The main benefit would be avoiding the back and forth in patch
>>> 6.
>> It is an option.
>>
>> But I'm still not 100% sure it's necessary to allocate the root page table
>> from the freelist. We could simply allocate the root page table from the
>> domheap (as is done for hardware domains) and reserve the freelist for other
>> pages.
>> The freelist is specific to Dom0less guest domains and is primarily used to
>> limit the amount of memory available for the guest—essentially for static
>> configurations where you want a clear and fixed limit on p2m allocations.
> Is that true? My understanding is that this pre-populated pool is used by
> all DomU-s, whether or not under dom0less.

I think you are right. I just assumed so because this pre-populated
pool is currently set up only in dom0less.

~ Oleksii

>
> Plus we're meaning to move towards better accounting of memory used by a
> domain (besides its actual allocation). Allocating the root table from the
> domain heap would move us one small step farther away from there.
>
> Jan



* Re: [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN
  2025-06-10 13:05 ` [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
@ 2025-07-02 11:44   ` Jan Beulich
  2025-07-21  9:43     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 11:44 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 10.06.2025 15:05, Oleksii Kurochko wrote:
> Introduce helper functions for safely querying the P2M (physical-to-machine)
> mapping:
>  - add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
>    P2M lock state.
>  - Implement p2m_get_entry() to retrieve mapping details for a given GFN,
>    including MFN, page order, and validity.
>  - Add p2m_lookup() to encapsulate read-locked MFN retrieval.
>  - Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
>    pointer, acquiring a reference to the page if valid.
> 
> Implementations are based on Arm's functions with some minor modifications:
> - p2m_get_entry():
>   - Reverse traversal of page tables, as RISC-V uses the opposite order
>     compared to Arm.
>   - Removed the return of p2m_access_t from p2m_get_entry() since
>     mem_access_settings is not introduced for RISC-V.

Didn't I see uses of p2m_access in earlier patches? If you don't mean to have
that, then please consistently {every,no}where.

>   - Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
>     to Arm's THIRD_MASK.
>   - Replaced open-coded bit shifts with the BIT() macro.
>   - Other minor changes, such as using RISC-V-specific functions to validate
>     P2M PTEs, and replacing Arm-specific GUEST_* macros with their RISC-V
>     equivalents.
> - p2m_get_page_from_gfn():
>   - Removed p2m_is_foreign() and related logic, as this functionality is not
>     implemented for RISC-V.

Yet I expect you'll need this, sooner or later.

> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -184,6 +184,24 @@ static inline int p2m_is_write_locked(struct p2m_domain *p2m)
>      return rw_is_write_locked(&p2m->lock);
>  }
>  
> +static inline void p2m_read_lock(struct p2m_domain *p2m)
> +{
> +    read_lock(&p2m->lock);
> +}
> +
> +static inline void p2m_read_unlock(struct p2m_domain *p2m)
> +{
> +    read_unlock(&p2m->lock);
> +}
> +
> +static inline int p2m_is_locked(struct p2m_domain *p2m)
> +{
> +    return rw_is_locked(&p2m->lock);
> +}
> +
> +struct page_info *p2m_get_page_from_gfn(struct domain *d, gfn_t gfn,
> +                                        p2m_type_t *t);

Once again I don't think you can pass struct domain * here, when in
the long run a domain can have multiple P2Ms.

> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -1055,3 +1055,134 @@ int guest_physmap_add_entry(struct domain *d,
>  {
>      return p2m_insert_mapping(d, gfn, (1 << page_order), mfn, t);
>  }
> +
> +/*
> + * Get the details of a given gfn.
> + *
> + * If the entry is present, the associated MFN will be returned and the
> + * access and type filled up. The page_order will correspond to the

You removed p2m_access_t * from the parameters; you need to also update
the comment then accordingly.

> + * order of the mapping in the page table (i.e it could be a superpage).
> + *
> + * If the entry is not present, INVALID_MFN will be returned and the
> + * page_order will be set according to the order of the invalid range.
> + *
> + * valid will contain the value of bit[0] (e.g valid bit) of the
> + * entry.
> + */
> +static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
> +                           p2m_type_t *t,
> +                           unsigned int *page_order,
> +                           bool *valid)
> +{
> +    paddr_t addr = gfn_to_gaddr(gfn);
> +    unsigned int level = 0;
> +    pte_t entry, *table;
> +    int rc;
> +    mfn_t mfn = INVALID_MFN;
> +    p2m_type_t _t;

Please no local variables with leading underscores. In x86 we commonly
name such variables p2mt.

> +    DECLARE_OFFSETS(offsets, addr);

This is the sole use of "addr". Is such a local variable really worth having?

> +    ASSERT(p2m_is_locked(p2m));
> +    BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
> +
> +    /* Allow t to be NULL */
> +    t = t ?: &_t;
> +
> +    *t = p2m_invalid;
> +
> +    if ( valid )
> +        *valid = false;
> +
> +    /* XXX: Check if the mapping is lower than the mapped gfn */
> +
> +    /* This gfn is higher than the highest the p2m map currently holds */
> +    if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
> +    {
> +        for ( level = P2M_ROOT_LEVEL; level ; level-- )

Nit: Stray blank before the 2nd semicolon. (Again at least once below.)

> +            if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
> +                 gfn_x(p2m->max_mapped_gfn) )
> +                break;
> +
> +        goto out;
> +    }
> +
> +    table = p2m_get_root_pointer(p2m, gfn);
> +
> +    /*
> +     * the table should always be non-NULL because the gfn is below
> +     * p2m->max_mapped_gfn and the root table pages are always present.
> +     */
> +    if ( !table )
> +    {
> +        ASSERT_UNREACHABLE();
> +        level = P2M_ROOT_LEVEL;
> +        goto out;
> +    }
> +
> +    for ( level = P2M_ROOT_LEVEL; level ; level-- )
> +    {
> +        rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
> +        if ( (rc == GUEST_TABLE_MAP_NONE) && (rc != GUEST_TABLE_MAP_NOMEM) )

This condition looks odd. As written the rhs of the && is redundant.

> +            goto out_unmap;
> +        else if ( rc != GUEST_TABLE_NORMAL )

As before, no real need for "else" in such cases.

> +            break;
> +    }
> +
> +    entry = table[offsets[level]];
> +
> +    if ( p2me_is_valid(p2m, entry) )
> +    {
> +        *t = p2m_type_radix_get(p2m, entry);

If the incoming argument is NULL, the somewhat expensive radix tree lookup
is unnecessary here.

> +        mfn = pte_get_mfn(entry);
> +        /*
> +         * The entry may point to a superpage. Find the MFN associated
> +         * to the GFN.
> +         */
> +        mfn = mfn_add(mfn,
> +                      gfn_x(gfn) & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1));
> +
> +        if ( valid )
> +            *valid = pte_is_valid(entry);

Interesting. Why not the P2M counterpart of the function? Yes, the comment
ahead of the function says so, but I don't see why the valid bit suddenly
is relevant here (besides the P2M type).
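
Separately from the validity question, the superpage arithmetic in the quoted
hunk can be sketched as follows. This assumes 9 index bits per level (so a
level-N leaf covers 2^(9*N) pages), which is what XEN_PT_LEVEL_ORDER() is
taken to express here; LEVEL_ORDER() and superpage_mfn() are hypothetical
names for illustration.

```c
#include <assert.h>

#define PAGETABLE_ORDER   9                         /* assumed: 512 entries per table */
#define LEVEL_ORDER(lvl)  ((lvl) * PAGETABLE_ORDER) /* order of a leaf at this level */

/*
 * Given the base MFN stored in a leaf PTE found at 'level', return the MFN
 * backing 'gfn': the low LEVEL_ORDER(level) bits of the GFN select the 4k
 * page inside the superpage.
 */
static unsigned long superpage_mfn(unsigned long base_mfn, unsigned long gfn,
                                   unsigned int level)
{
    unsigned long mask = (1UL << LEVEL_ORDER(level)) - 1;

    /* A leaf at this level must be naturally aligned. */
    assert((base_mfn & mask) == 0);

    return base_mfn + (gfn & mask);
}
```

For a level-0 entry the mask is 0, so the base MFN is returned unchanged.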

> +    }
> +
> +out_unmap:
> +    unmap_domain_page(table);
> +
> +out:

Nit: Style (both labels).

> +    if ( page_order )
> +        *page_order = XEN_PT_LEVEL_ORDER(level);
> +
> +    return mfn;
> +}
> +
> +static mfn_t p2m_lookup(struct domain *d, gfn_t gfn, p2m_type_t *t)

pointer-to-const for the 1st arg? But again more likely struct p2m_domain *
anyway?

> +{
> +    mfn_t mfn;
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    p2m_read_lock(p2m);
> +    mfn = p2m_get_entry(p2m, gfn, t, NULL, NULL);
> +    p2m_read_unlock(p2m);
> +
> +    return mfn;
> +}
> +
> +struct page_info *p2m_get_page_from_gfn(struct domain *d, gfn_t gfn,

Same here - likely you mean struct p2m_domain * instead.

> +                                        p2m_type_t *t)
> +{
> +    p2m_type_t p2mt = {0};

Why a compound initializer for something that isn't a compound object?
And why plain 0 for something that is an enumerated type?

> +    struct page_info *page;
> +
> +    mfn_t mfn = p2m_lookup(d, gfn, &p2mt);
> +
> +    if ( t )
> +        *t = p2mt;

What's wrong with passing t directly to p2m_lookup()?

Jan



* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-01 13:04   ` Jan Beulich
  2025-07-02 10:30     ` Oleksii Kurochko
@ 2025-07-02 11:48     ` Oleksii Kurochko
  2025-07-02 11:56       ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-02 11:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/1/25 3:04 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>   
>>       return 0;
>>   }
>> +
>> +/*
>> + * Set the pool of pages to the required number of pages.
>> + * Returns 0 for success, non-zero for failure.
>> + * Call with d->arch.paging.lock held.
>> + */
>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>> +{
>> +    struct page_info *pg;
>> +
>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>> +
>> +    for ( ; ; )
>> +    {
>> +        if ( d->arch.paging.p2m_total_pages < pages )
>> +        {
>> +            /* Need to allocate more memory from domheap */
>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>> +            if ( pg == NULL )
>> +            {
>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>> +                return -ENOMEM;
>> +            }
>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>> +        }
>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>> +        {
>> +            /* Need to return memory to domheap */
>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>> +            if( pg )
>> +            {
>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>> +                free_domheap_page(pg);
>> +            }
>> +            else
>> +            {
>> +                printk(XENLOG_ERR
>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>> +                return -ENOMEM;
>> +            }
>> +        }
>> +        else
>> +            break;
>> +
>> +        /* Check to see if we need to yield and try again */
>> +        if ( preempted && general_preempt_check() )
>> +        {
>> +            *preempted = true;
>> +            return -ERESTART;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
> Btw, with the order-2 requirement for the root page table, you may want to
> consider an alternative approach: Here you could allocate some order-2
> pages (possibly up to as many as a domain might need, which right now
> would be exactly one), put them on a separate list, and consume the root
> table(s) from there. If you run out of pages on the order-0 list, you
> could shatter a page from the order-2 one (as long as that's still non-
> empty). The difficulty would be with freeing, where a previously shattered
> order-2 page would be nice to re-combine once all of its constituents are
> free again.

Do we really need to re-combine shattered order-2 pages?
It seems the only use of this order-2 list is to hold one order-2 page
for the root page table. All other pages are 4k pages, so even if we don't re-combine
them, nothing serious will happen.

And if there are no further users of the order-2 list, do we
really need a separate order-2 list just for the root page table?
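
For reference, the shatter-without-recombine variant is small when modelled
with plain counters. This is only a sketch: struct p2m_pool and the pool_*()
helpers are hypothetical, and real code would move struct page_info entries
between lists rather than adjust counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct p2m_pool {
    unsigned int order0_free;  /* single 4k pages */
    unsigned int order2_free;  /* naturally aligned blocks of four 4k pages */
};

/* Take one 4k page, shattering an order-2 block if the order-0 list is empty. */
static bool pool_alloc_order0(struct p2m_pool *p)
{
    assert(p != NULL);

    if ( !p->order0_free )
    {
        if ( !p->order2_free )
            return false;

        /* Shatter: one order-2 block becomes four order-0 pages. */
        p->order2_free--;
        p->order0_free += 4;
    }

    p->order0_free--;

    return true;
}

/* Take one order-2 block for a root table; never built from order-0 pages. */
static bool pool_alloc_root(struct p2m_pool *p)
{
    assert(p != NULL);

    if ( !p->order2_free )
        return false;

    p->order2_free--;

    return true;
}
```

The cost of not re-combining shows up in pool_alloc_root(): once the last
order-2 block has been shattered, a root table can no longer be served from
the pool even if all four constituent pages are free again.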

...

>   The main benefit would be avoiding the back and forth in patch
> 6.

...
Can’t we just avoid putting the pages for the root page table (which will be taken
back later) into the freelist at all? That way, there would be no need to return
them later. Something like:
int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
{
     struct page_info *pg;

     ASSERT(spin_is_locked(&d->arch.paging.lock));

     pages -= root_page_table_num;
     
     for ( ; ; )
     {
         if ( d->arch.paging.p2m_total_pages < pages )
         {
    ...
}
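
To make the idea concrete, here is a minimal standalone sketch of the same loop with the root-table pages subtracted up front. It is not the patch code: struct page, struct pool, pool_set_allocation() and the root_pages parameter are hypothetical stand-ins, malloc()/free() replace the domheap allocator, preemption is left out, and the guard against pages < root_pages is an addition of this sketch.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for Xen's page_info and the per-domain pool. */
struct page { struct page *next; };

struct pool {
    struct page *freelist;      /* order-0 pages available to the P2M */
    unsigned long total_pages;  /* pages currently owned by the pool */
};

/*
 * Grow or shrink the pool to 'pages' minus the pages reserved for the root
 * table, which are never placed on the freelist in this variant.
 * Returns 0 on success, -1 on allocation failure or an empty freelist.
 */
static int pool_set_allocation(struct pool *p, unsigned long pages,
                               unsigned long root_pages)
{
    if ( pages < root_pages )
        return -1;              /* request cannot even cover the root table */
    pages -= root_pages;

    while ( p->total_pages != pages )
    {
        if ( p->total_pages < pages )
        {
            /* Need more memory: allocate one page and queue it. */
            struct page *pg = malloc(sizeof(*pg));

            if ( !pg )
                return -1;
            pg->next = p->freelist;
            p->freelist = pg;
            p->total_pages++;
        }
        else
        {
            /* Need to give memory back: only free pages can be returned. */
            struct page *pg = p->freelist;

            if ( !pg )
                return -1;
            p->freelist = pg->next;
            p->total_pages--;
            free(pg);
        }
    }

    return 0;
}
```

With root_pages set to 4, a request for 8 pages leaves 4 pages on the freelist, and nothing has to be pulled back out of it for the root table afterwards.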

~ Oleksii



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-02 11:48     ` Oleksii Kurochko
@ 2025-07-02 11:56       ` Jan Beulich
  2025-07-02 12:34         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 11:56 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 02.07.2025 13:48, Oleksii Kurochko wrote:
> On 7/1/25 3:04 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>>   
>>>       return 0;
>>>   }
>>> +
>>> +/*
>>> + * Set the pool of pages to the required number of pages.
>>> + * Returns 0 for success, non-zero for failure.
>>> + * Call with d->arch.paging.lock held.
>>> + */
>>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>> +{
>>> +    struct page_info *pg;
>>> +
>>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>>> +
>>> +    for ( ; ; )
>>> +    {
>>> +        if ( d->arch.paging.p2m_total_pages < pages )
>>> +        {
>>> +            /* Need to allocate more memory from domheap */
>>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>>> +            if ( pg == NULL )
>>> +            {
>>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>>> +                return -ENOMEM;
>>> +            }
>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>>> +        }
>>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>>> +        {
>>> +            /* Need to return memory to domheap */
>>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>> +            if( pg )
>>> +            {
>>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>> +                free_domheap_page(pg);
>>> +            }
>>> +            else
>>> +            {
>>> +                printk(XENLOG_ERR
>>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>>> +                return -ENOMEM;
>>> +            }
>>> +        }
>>> +        else
>>> +            break;
>>> +
>>> +        /* Check to see if we need to yield and try again */
>>> +        if ( preempted && general_preempt_check() )
>>> +        {
>>> +            *preempted = true;
>>> +            return -ERESTART;
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>> Btw, with the order-2 requirement for the root page table, you may want to
>> consider an alternative approach: Here you could allocate some order-2
>> pages (possibly up to as many as a domain might need, which right now
>> would be exactly one), put them on a separate list, and consume the root
>> table(s) from there. If you run out of pages on the order-0 list, you
>> could shatter a page from the order-2 one (as long as that's still non-
>> empty). The difficulty would be with freeing, where a previously shattered
>> order-2 page would be nice to re-combine once all of its constituents are
>> free again.
> 
> Do we really need to re-combine shattered order-2 pages?
> It seems like the only usage for this order-2-list is to have 1 order-2 page
> for root page table. All other pages are 4k pages so even if we won't re-combine
> them, nothing serious will happen.

That's true as long as you have only the host-P2M for each domain. Once you
have alternative or nested ones, things may change (unless they all have
their roots also set up right during domain creation, which would seem
wasteful to me).

>>   The main benefit would be avoiding the back and forth in patch
>> 6.
> 
> ...
> Can’t we just avoid putting the pages (which will get back) for the root page table into the
> freelist at all?

Again, this may be fine as long as there's only the host-P2M. That sole root
won't ever be freed anyway during the lifetime of a domain.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-02 11:56       ` Jan Beulich
@ 2025-07-02 12:34         ` Oleksii Kurochko
  2025-07-02 12:49           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-02 12:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/2/25 1:56 PM, Jan Beulich wrote:
> On 02.07.2025 13:48, Oleksii Kurochko wrote:
>> On 7/1/25 3:04 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>>>    
>>>>        return 0;
>>>>    }
>>>> +
>>>> +/*
>>>> + * Set the pool of pages to the required number of pages.
>>>> + * Returns 0 for success, non-zero for failure.
>>>> + * Call with d->arch.paging.lock held.
>>>> + */
>>>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>> +{
>>>> +    struct page_info *pg;
>>>> +
>>>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>>>> +
>>>> +    for ( ; ; )
>>>> +    {
>>>> +        if ( d->arch.paging.p2m_total_pages < pages )
>>>> +        {
>>>> +            /* Need to allocate more memory from domheap */
>>>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>>>> +            if ( pg == NULL )
>>>> +            {
>>>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>>>> +                return -ENOMEM;
>>>> +            }
>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>>>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>>>> +        }
>>>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>>>> +        {
>>>> +            /* Need to return memory to domheap */
>>>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>> +            if( pg )
>>>> +            {
>>>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>> +                free_domheap_page(pg);
>>>> +            }
>>>> +            else
>>>> +            {
>>>> +                printk(XENLOG_ERR
>>>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>>>> +                return -ENOMEM;
>>>> +            }
>>>> +        }
>>>> +        else
>>>> +            break;
>>>> +
>>>> +        /* Check to see if we need to yield and try again */
>>>> +        if ( preempted && general_preempt_check() )
>>>> +        {
>>>> +            *preempted = true;
>>>> +            return -ERESTART;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>> Btw, with the order-2 requirement for the root page table, you may want to
>>> consider an alternative approach: Here you could allocate some order-2
>>> pages (possibly up to as many as a domain might need, which right now
>>> would be exactly one), put them on a separate list, and consume the root
>>> table(s) from there. If you run out of pages on the order-0 list, you
>>> could shatter a page from the order-2 one (as long as that's still non-
>>> empty). The difficulty would be with freeing, where a previously shattered
>>> order-2 page would be nice to re-combine once all of its constituents are
>>> free again.
>> Do we really need to re-combine shattered order-2 pages?
>> It seems like the only usage for this order-2-list is to have 1 order-2 page
>> for root page table. All other pages are 4k pages so even if we won't re-combine
>> them, nothing serious will happen.
> That's true as long as you have only the host-P2M for each domain. Once you
> have alternative or nested ones, things may change (unless they all have
> their roots also set up right during domain creation, which would seem
> wasteful to me).

I don't know how it is implemented on x86, but I thought that if alternative
or nested P2Ms are needed, then they need their own page tables, separate from
the host-P2M's (root page table included).

~ Oleksii

>
>>>    The main benefit would be avoiding the back and forth in patch
>>> 6.
>> ...
>> Can’t we just avoid putting the pages (which will get back) for the root page table into the
>> freelist at all?
> Again, this may be fine as long as there's only the host-P2M. That sole root
> won't ever be freed anyway during the lifetime of a domain.
>
> Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests
  2025-07-02 12:34         ` Oleksii Kurochko
@ 2025-07-02 12:49           ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-02 12:49 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 02.07.2025 14:34, Oleksii Kurochko wrote:
> 
> On 7/2/25 1:56 PM, Jan Beulich wrote:
>> On 02.07.2025 13:48, Oleksii Kurochko wrote:
>>> On 7/1/25 3:04 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> @@ -113,3 +117,58 @@ int p2m_init(struct domain *d)
>>>>>    
>>>>>        return 0;
>>>>>    }
>>>>> +
>>>>> +/*
>>>>> + * Set the pool of pages to the required number of pages.
>>>>> + * Returns 0 for success, non-zero for failure.
>>>>> + * Call with d->arch.paging.lock held.
>>>>> + */
>>>>> +int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>>>> +{
>>>>> +    struct page_info *pg;
>>>>> +
>>>>> +    ASSERT(spin_is_locked(&d->arch.paging.lock));
>>>>> +
>>>>> +    for ( ; ; )
>>>>> +    {
>>>>> +        if ( d->arch.paging.p2m_total_pages < pages )
>>>>> +        {
>>>>> +            /* Need to allocate more memory from domheap */
>>>>> +            pg = alloc_domheap_page(d, MEMF_no_owner);
>>>>> +            if ( pg == NULL )
>>>>> +            {
>>>>> +                printk(XENLOG_ERR "Failed to allocate P2M pages.\n");
>>>>> +                return -ENOMEM;
>>>>> +            }
>>>>> +            ACCESS_ONCE(d->arch.paging.p2m_total_pages)++;
>>>>> +            page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>>>>> +        }
>>>>> +        else if ( d->arch.paging.p2m_total_pages > pages )
>>>>> +        {
>>>>> +            /* Need to return memory to domheap */
>>>>> +            pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>> +            if( pg )
>>>>> +            {
>>>>> +                ACCESS_ONCE(d->arch.paging.p2m_total_pages)--;
>>>>> +                free_domheap_page(pg);
>>>>> +            }
>>>>> +            else
>>>>> +            {
>>>>> +                printk(XENLOG_ERR
>>>>> +                       "Failed to free P2M pages, P2M freelist is empty.\n");
>>>>> +                return -ENOMEM;
>>>>> +            }
>>>>> +        }
>>>>> +        else
>>>>> +            break;
>>>>> +
>>>>> +        /* Check to see if we need to yield and try again */
>>>>> +        if ( preempted && general_preempt_check() )
>>>>> +        {
>>>>> +            *preempted = true;
>>>>> +            return -ERESTART;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>> Btw, with the order-2 requirement for the root page table, you may want to
>>>> consider an alternative approach: Here you could allocate some order-2
>>>> pages (possibly up to as many as a domain might need, which right now
>>>> would be exactly one), put them on a separate list, and consume the root
>>>> table(s) from there. If you run out of pages on the order-0 list, you
>>>> could shatter a page from the order-2 one (as long as that's still non-
>>>> empty). The difficulty would be with freeing, where a previously shattered
>>>> order-2 page would be nice to re-combine once all of its constituents are
>>>> free again.
>>> Do we really need to re-combine shattered order-2 pages?
>>> It seems like the only usage for this order-2-list is to have 1 order-2 page
>>> for root page table. All other pages are 4k pages so even if we won't re-combine
>>> them, nothing serious will happen.
>> That's true as long as you have only the host-P2M for each domain. Once you
>> have alternative or nested ones, things may change (unless they all have
>> their roots also set up right during domain creation, which would seem
>> wasteful to me).
> 
> I don't know how it is implemented on x86, but I thought that if it is needed alternative
> or nested P2Ms then it is needed to provide separated from host-P2M page tables (root page
> table including).

Correct, hence why you will then need to allocate multiple root tables.
Those secondary page tables are nevertheless all allocated from the
single pool that a domain has.
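
A rough sketch of what such a single pool with two lists could look like, under heavy simplification: only counters are tracked instead of real page lists, re-combining on free is not modelled, and struct pool2, pool_alloc_root() and pool_alloc_page() are made-up names.

```c
#define ORDER2_PAGES 4UL  /* an order-2 allocation spans 2^2 = 4 pages */

/* Hypothetical per-domain pool, reduced to two counters. */
struct pool2 {
    unsigned long order0_free;  /* free order-0 pages */
    unsigned long order2_free;  /* free (unshattered) order-2 chunks */
};

/* Take one root table, i.e. one order-2 chunk. Returns 0 on success. */
static int pool_alloc_root(struct pool2 *p)
{
    if ( !p->order2_free )
        return -1;
    p->order2_free--;
    return 0;
}

/* Take one order-0 page, shattering an order-2 chunk if none is left. */
static int pool_alloc_page(struct pool2 *p)
{
    if ( !p->order0_free )
    {
        if ( !p->order2_free )
            return -1;
        p->order2_free--;
        p->order0_free += ORDER2_PAGES;
    }
    p->order0_free--;
    return 0;
}
```

Once a chunk has been shattered, a later root-table request can fail even though enough order-0 pages are free, which is where re-combining would start to matter with more than one root.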

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-02 10:09   ` Jan Beulich
  2025-07-02 10:28     ` Jan Beulich
@ 2025-07-02 12:52     ` Orzel, Michal
  2025-07-18 14:49     ` Oleksii Kurochko
  2 siblings, 0 replies; 161+ messages in thread
From: Orzel, Michal @ 2025-07-02 12:52 UTC (permalink / raw)
  To: Jan Beulich, Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 02/07/2025 12:09, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Implement the mfn_valid() macro to verify whether a given MFN is valid by
>> checking that it falls within the range [start_page, max_page).
>> These bounds are initialized based on the start and end addresses of RAM.
>>
>> As part of this patch, start_page is introduced and initialized with the
>> PFN of the first RAM page.
>>
>> Also, after providing a non-stub implementation of the mfn_valid() macro,
>> the following compilation errors started to occur:
>>   riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>>   /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>>   riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>>   /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>>   riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>>   /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>>   riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>>   riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>>   /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>>   riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>>   riscv64-linux-gnu-ld: final link failed: bad value
>>   make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
>> To resolve these errors, the following functions have also been introduced,
>> based on their Arm counterparts:
>> - page_get_owner_and_reference() and its variant to safely acquire a
>>   reference to a page and retrieve its owner.
>> - put_page() and put_page_nr() to release page references and free the page
>>   when the count drops to zero.
>>   For put_page_nr(), code related to static memory configuration is wrapped
>>   with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
>>   common code. Therefore, PGC_static and free_domstatic_page() are not
>>   introduced for RISC-V. However, since this configuration could be useful
>>   in the future, the relevant code is retained and conditionally compiled.
>> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>>   unreachable, as RAM type checking is not yet implemented.
> 
> How does this end up working when common code references the function?
> 
>> @@ -288,8 +289,12 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>>  #define page_get_owner(p)    (p)->v.inuse.domain
>>  #define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
>>  
>> -/* TODO: implement */
>> -#define mfn_valid(mfn) ({ (void)(mfn); 0; })
>> +extern unsigned long start_page;
>> +
>> +#define mfn_valid(mfn) ({                                   \
>> +    unsigned long mfn__ = mfn_x(mfn);                       \
>> +    likely((mfn__ >= start_page) && (mfn__ < max_page));    \
>> +})
> 
> I don't think you should try to be clever and avoid using __mfn_valid() here,
> at least not without an easily identifiable TODO. Surely you've seen that both
> Arm and x86 use it.
> 
> Also, according to all I know, likely() doesn't work very well when used like
> this, except for architectures supporting conditionally executed insns (like
> Arm32 or IA-64, i.e. beyond conditional branches). I.e. if you want to use
> likely() here, I think you need two of them.
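
One possible reading of "two of them", as a standalone sketch only: each comparison gets its own hint, so that each conditional branch carries one. This does not use __mfn_valid() as suggested above; mfn_in_ram_range() and the bounds are hypothetical, and likely() is defined here the way GCC/Clang builds commonly do.

```c
#include <stdbool.h>

/* Branch-prediction hint as commonly defined for GCC/Clang. */
#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical bounds; in the patch these are derived from the RAM range. */
static unsigned long start_page = 0x80000;
static unsigned long max_page   = 0x90000;

/* Range check [start_page, max_page) with one hint per comparison. */
static bool mfn_in_ram_range(unsigned long mfn)
{
    return likely(mfn >= start_page) && likely(mfn < max_page);
}
```
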
> 
>> @@ -525,6 +520,8 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
>>  #error setup_{directmap,frametable}_mapping() should be implemented for RV_32
>>  #endif
>>  
>> +unsigned long __read_mostly start_page;
> 
> Memory hotplug question again: __read_mostly or __ro_after_init?
> 
>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>  {
>>      return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>  }
>> +
>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>> +{
>> +    ASSERT_UNREACHABLE();
>> +
>> +    return 0;
>> +}
>> +
>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>> +                                                      unsigned long nr)
>> +{
>> +    unsigned long x, y = page->count_info;
>> +    struct domain *owner;
>> +
>> +    /* Restrict nr to avoid "double" overflow */
>> +    if ( nr >= PGC_count_mask )
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        return NULL;
>> +    }
> 
> I question the validity of this, already in the Arm original: I can't spot
> how the caller guarantees to stay below that limit. Without such an
> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
> any limit check.
Honestly I don't know why this assert was placed here. I checked the code and we
don't limit nr_shm_borrowers in any place, so in theory it's possible to end up
here.
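
For illustration, a standalone sketch of taking nr references with a plain runtime check instead of the assertion. It is not the Arm or RISC-V code: C11 atomics stand in for cmpxchg(), and COUNT_MASK (a 16-bit count field) and get_refs() are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: the low 16 bits hold the reference count. */
#define COUNT_MASK 0xffffUL

/*
 * Try to take 'nr' references. Fails if nr is out of range, if the page is
 * unallocated (count == 0), or if the count field would overflow.
 */
static bool get_refs(_Atomic unsigned long *count_info, unsigned long nr)
{
    unsigned long x = atomic_load(count_info);

    if ( nr >= COUNT_MASK )
        return false;           /* reachable, so no assertion here */

    do {
        /*
         * ((x + nr) & mask) == nr: the count was 0, page not allocated.
         * ((x + nr) & mask) <  nr: the count field would wrap.
         */
        if ( ((x + nr) & COUNT_MASK) <= nr )
            return false;
    } while ( !atomic_compare_exchange_weak(count_info, &x, x + nr) );

    return true;
}
```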

~Michal

> 
>> +    do {
>> +        x = y;
>> +        /*
>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>> +         * Count == -1: Reference count would wrap, which is invalid.
>> +         */
> 
> May I once again ask that you look carefully at comments (as much as at code)
> you copy. Clearly this comment wasn't properly updated when the bumping by 1
> was changed to bumping by nr.
> 
>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>> +            return NULL;
>> +    }
>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>> +
>> +    owner = page_get_owner(page);
>> +    ASSERT(owner);
>> +
>> +    return owner;
>> +}
>> +
>> +struct domain *page_get_owner_and_reference(struct page_info *page)
>> +{
>> +    return page_get_owner_and_nr_reference(page, 1);
>> +}
>> +
>> +void put_page_nr(struct page_info *page, unsigned long nr)
>> +{
>> +    unsigned long nx, x, y = page->count_info;
>> +
>> +    do {
>> +        ASSERT((y & PGC_count_mask) >= nr);
>> +        x  = y;
>> +        nx = x - nr;
>> +    }
>> +    while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );
>> +
>> +    if ( unlikely((nx & PGC_count_mask) == 0) )
>> +    {
>> +#ifdef CONFIG_STATIC_MEMORY
>> +        if ( unlikely(nx & PGC_static) )
>> +            free_domstatic_page(page);
>> +        else
>> +#endif
> 
> Such #ifdef-ed-out code is liable to go stale. Minimally use IS_ENABLED().
> Even better would imo be if you introduced a "stub" PGC_static, resolving
> to 0 (i.e. for now unconditionally).
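
A small sketch of the stub idea, with made-up names: when the flag resolves to 0 the check is still compiled and type-checked, but the compiler can drop the branch. FLAG_STATIC, COUNT_MASK, HAVE_STATIC_MEMORY and release_path() are hypothetical and only model the decision, not the actual freeing.

```c
/* Hypothetical flag layout for illustration only. */
#define COUNT_MASK 0xffffUL
#ifdef HAVE_STATIC_MEMORY           /* hypothetical build switch */
#define FLAG_STATIC (1UL << 16)
#else
#define FLAG_STATIC 0UL             /* stub: the test below folds to false */
#endif

enum free_path { FREE_NONE, FREE_STATIC, FREE_DOMHEAP };

/* Decide how a page would be released once 'nx' is its new count_info. */
static enum free_path release_path(unsigned long nx)
{
    if ( (nx & COUNT_MASK) != 0 )
        return FREE_NONE;           /* references remain */

    if ( nx & FLAG_STATIC )         /* always compiled, dead when stubbed */
        return FREE_STATIC;

    return FREE_DOMHEAP;
}
```
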
> 
> Jan



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn()
  2025-06-30 15:48   ` Jan Beulich
@ 2025-07-02 15:59     ` Oleksii Kurochko
  2025-07-03  5:59       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-02 15:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 6/30/25 5:48 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Introduce page_set_xenheap_gfn() helper to encode the GFN associated with
>> a Xen heap page directly into the type_info field of struct page_info.
>>
>> Introduce a GFN field in the type_info of a Xen heap page by reserving 10
>> bits (sufficient for both Sv32 and Sv39+ modes), and define PGT_gfn_mask
>> and PGT_gfn_width accordingly.
> This reads as if you wanted to encode the GFN in 10 bits.

I will reword it to:
   Reserve the 10 most significant bits to store the usage counter and frame type;
   use all remaining bits to store the grant table frame GFN.
   This is enough, as Sv32 uses 22-bit GFNs and Sv{39, 48, 57} use 44-bit GFNs.

>
> What would also help is if you said why you actually need this. x86, after
> all, gets away without anything like this. (But I understand you're more
> Arm-like here.)

I think with the rewording mentioned above it will be clear that it is needed for
grant tables. But I can also add the following:
   The grant table frame GFN will be stored directly in struct page_info instead
   of being maintained in separate status/shared arrays. To avoid increasing the
   size of struct page_info, the necessary bits are borrowed from the
   type_info field of struct page_info.

>> @@ -229,9 +230,21 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>>   #define PGT_writable_page PG_mask(1, 1)  /* has writable mappings?         */
>>   #define PGT_type_mask     PG_mask(1, 1)  /* Bits 31 or 63.                 */
>>   
>> -/* Count of uses of this frame as its current type. */
>> -#define PGT_count_width   PG_shift(2)
>> -#define PGT_count_mask    ((1UL << PGT_count_width) - 1)
>> + /* 9-bit count of uses of this frame as its current type. */
>> +#define PGT_count_mask    PG_mask(0x3FF, 10)
>> +
>> +/*
>> + * Sv32 has 22-bit GFN. Sv{39, 48, 57} have 44-bit GFN.
>> + * Thereby we can use for `type_info` 10 bits for all modes, having the same
>> + * amount of bits for `type_info` for all MMU modes let us avoid introducing
>> + * an extra #ifdef to that header:
>> + *   if we go with maximum possible bits for count on each configuration
>> + *   we would need to have a set of PGT_count_* and PGT_gfn_*).
>> + */
>> +#define PGT_gfn_width     PG_shift(10)
>> +#define PGT_gfn_mask      (BIT(PGT_gfn_width, UL) - 1)
>> +
>> +#define PGT_INVALID_XENHEAP_GFN   _gfn(PGT_gfn_mask)
> Commentary here would imo be preferable to be much closer to Arm's. I don't
> see the point of the extra verbosity (part of which may be fine to have in
> the description, except you already say something along these lines there).
> While in turn the comment talks of fewer bits than are actually being used
> in the RV64 case.

Sure, I will replace this comment with:
/*
  * Stored in bits [22:0] (Sv32) or [44:0] (Sv39,48,57) GFN if page is xenheap page.
  */

>> @@ -283,6 +296,19 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>>   
>>   #define PFN_ORDER(pg) ((pg)->v.free.order)
>>   
>> +static inline void page_set_xenheap_gfn(struct page_info *p, gfn_t gfn)
>> +{
>> +    gfn_t gfn_ = gfn_eq(gfn, INVALID_GFN) ? PGT_INVALID_XENHEAP_GFN : gfn;
>> +    unsigned long x, nx, y = p->u.inuse.type_info;
>> +
>> +    ASSERT(is_xen_heap_page(p));
>> +
>> +    do {
>> +        x = y;
>> +        nx = (x & ~PGT_gfn_mask) | gfn_x(gfn_);
>> +    } while ( (y = cmpxchg(&p->u.inuse.type_info, x, nx)) != x );
>> +}
>> +
>>   extern unsigned char cpu0_boot_stack[];
>>   
>>   void setup_initial_pagetables(void);
> What about the "get" counterpart?

I haven't added it because it isn't used yet, and, being a static inline (as on Arm),
it would lead to a compilation error.

Alternatively, this patch could be dropped and reintroduced together with grant tables.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn()
  2025-07-02 15:59     ` Oleksii Kurochko
@ 2025-07-03  5:59       ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-03  5:59 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 02.07.2025 17:59, Oleksii Kurochko wrote:
> 
> On 6/30/25 5:48 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Introduce page_set_xenheap_gfn() helper to encode the GFN associated with
>>> a Xen heap page directly into the type_info field of struct page_info.
>>>
>>> Introduce a GFN field in the type_info of a Xen heap page by reserving 10
>>> bits (sufficient for both Sv32 and Sv39+ modes), and define PGT_gfn_mask
>>> and PGT_gfn_width accordingly.
>> This reads as if you wanted to encode the GFN in 10 bits.
> 
> I will reword it to:
>    Reserve 10 MSB bits to store the usage counter and frame type;
>    use all remaining bits to store the grant table frame GFN.
>    It will be enough as Sv32 uses 22-bit GFNs and Sv{39, 48, 57} uses 44-bit GFNs.
> 
>>
>> What would also help is if you said why you actually need this. x86, after
>> all, gets away without anything like this. (But I understand you're more
>> Arm-like here.)
> 
> I think with the rewording mentioned above it will be clear that it is needed for
> grant tables. But I also can add the following:

I agree it's fine with just the re-wording.

>>> @@ -283,6 +296,19 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>>>   
>>>   #define PFN_ORDER(pg) ((pg)->v.free.order)
>>>   
>>> +static inline void page_set_xenheap_gfn(struct page_info *p, gfn_t gfn)
>>> +{
>>> +    gfn_t gfn_ = gfn_eq(gfn, INVALID_GFN) ? PGT_INVALID_XENHEAP_GFN : gfn;
>>> +    unsigned long x, nx, y = p->u.inuse.type_info;
>>> +
>>> +    ASSERT(is_xen_heap_page(p));
>>> +
>>> +    do {
>>> +        x = y;
>>> +        nx = (x & ~PGT_gfn_mask) | gfn_x(gfn_);
>>> +    } while ( (y = cmpxchg(&p->u.inuse.type_info, x, nx)) != x );
>>> +}
>>> +
>>>   extern unsigned char cpu0_boot_stack[];
>>>   
>>>   void setup_initial_pagetables(void);
>> What about the "get" counterpart?
> 
> I haven't added it as it isn't used now and it will lead to compilation error as it will be static inline
> (in a similar way as Arm introduces it).

Why would a static inline (in a header) cause compilation errors?
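
For reference, a standalone sketch of a setter together with a "get" counterpart, both as static inline. The names and the C11 atomics are assumptions made for the example (the patch uses cmpxchg() on type_info), and the extra masking of the incoming GFN is an addition of this sketch. With GCC, as far as I'm aware, -Wunused-function only covers non-inline static functions, so an unused static inline like the getter would not normally trigger a diagnostic.

```c
#include <limits.h>
#include <stdatomic.h>

/* Assumed to mirror the patch: the top 10 bits of type_info are reserved. */
#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))
#define GFN_WIDTH     (BITS_PER_LONG - 10)
#define GFN_MASK      ((1UL << GFN_WIDTH) - 1)

/* Hypothetical stand-ins for INVALID_GFN and PGT_INVALID_XENHEAP_GFN. */
#define INVALID_GFN_RAW     (~0UL)
#define INVALID_XENHEAP_GFN GFN_MASK

/* Setter: replace the GFN field, leaving the upper 10 bits untouched. */
static inline void set_xenheap_gfn(_Atomic unsigned long *type_info,
                                   unsigned long gfn)
{
    unsigned long x = atomic_load(type_info), nx;

    if ( gfn == INVALID_GFN_RAW )
        gfn = INVALID_XENHEAP_GFN;

    do {
        nx = (x & ~GFN_MASK) | (gfn & GFN_MASK);
    } while ( !atomic_compare_exchange_weak(type_info, &x, nx) );
}

/* Getter: extract the field and map the sentinel back to INVALID_GFN. */
static inline unsigned long get_xenheap_gfn(_Atomic unsigned long *type_info)
{
    unsigned long gfn = atomic_load(type_info) & GFN_MASK;

    return gfn == INVALID_XENHEAP_GFN ? INVALID_GFN_RAW : gfn;
}
```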

> As an option this patch could be dropped and introduced with an introduction of grant tables.

That's up to you - you must have had a reason to include it here.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-06-30 15:59   ` Jan Beulich
@ 2025-07-03 11:02     ` Oleksii Kurochko
  2025-07-03 11:33       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-03 11:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


On 6/30/25 5:59 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
>> +{
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +    int rc;
>> +
>> +    p2m_write_lock(p2m);
>> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
>> +    p2m_write_unlock(p2m);
>> +
>> +    return rc;
>> +}
>> +
>> +int map_regions_p2mt(struct domain *d,
>> +                     gfn_t gfn,
>> +                     unsigned long nr,
>> +                     mfn_t mfn,
>> +                     p2m_type_t p2mt)
>> +{
>> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
>> +}
> What is this function doing here? The description says "for generic mapping
> purposes", which really may mean anything. Plus, if and when you need it, it
> wants to come with a name that fits with e.g. ...

These names are used across the common code and various architectures. Not all
architectures need to implement all of these functions.
I believe guest_physmap_add_page() (which internally calls guest_physmap_add_entry())
needs to be implemented for all architectures, while map_regions_p2mt() is used
by Arm and the common Dom0less-related code; since RISC-V is going to re-use the
common Dom0less code, it implements this function too.

>> +int guest_physmap_add_entry(struct domain *d,
>> +                            gfn_t gfn,
>> +                            mfn_t mfn,
>> +                            unsigned long page_order,
>> +                            p2m_type_t t)
> ... this one, to understand their relationship / difference.

Basically, the difference is only in API and where they are expected to be used:
- guest_physmap_add_entry() to map and set a specific p2m type for a page.
- map_regions_p2mt() to map a region (mostly MMIO) in the guest p2m with
   a specific p2m type.

I added both of them here as they are implemented in a similar way.
I will re-word commit subject and message:
   xen/riscv: implement functions to map memory in guest p2m

   Introduce guest_physmap_add_entry() to map a page and assign a specific
   p2m type, and map_regions_p2mt() to map a region (typically MMIO) in
   the guest p2m with a designated p2m type.

   Currently, this functionality is not fully operational, as p2m_set_entry()
   still returns -EOPNOTSUPP.

   Additionally, introduce p2m_write_(un)lock() to protect modifications to
   the p2m page tables, along with p2m TLB flush helpers to ensure proper
   TLB invalidation (if necessary) when the p2m lock is released.
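
The relationship between the two entry points can be sketched as follows. This is a simplified model, not the patch: locking and p2m types are left out, all names are hypothetical, and the order-to-count conversion in add_entry() is an assumption about how an order-based API would funnel into a count-based helper.

```c
/* Records the page count last requested, so the sketch can be checked. */
static unsigned long last_nr;

/* Hypothetical stand-in for the locked p2m_set_entry() path. */
static int insert_mapping(unsigned long gfn, unsigned long nr,
                          unsigned long mfn)
{
    (void)gfn;
    (void)mfn;
    last_nr = nr;
    return 0;
}

/* Region-style entry point: the caller passes a page count directly. */
static int map_region(unsigned long gfn, unsigned long nr, unsigned long mfn)
{
    return insert_mapping(gfn, nr, mfn);
}

/* Page-style entry point: the caller passes an order, i.e. 2^order pages. */
static int add_entry(unsigned long gfn, unsigned long mfn,
                     unsigned int page_order)
{
    return insert_mapping(gfn, 1UL << page_order, mfn);
}
```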

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-07-03 11:02     ` Oleksii Kurochko
@ 2025-07-03 11:33       ` Jan Beulich
  2025-07-03 11:54         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-03 11:33 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 03.07.2025 13:02, Oleksii Kurochko wrote:
> On 6/30/25 5:59 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
>>> +{
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +    int rc;
>>> +
>>> +    p2m_write_lock(p2m);
>>> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
>>> +    p2m_write_unlock(p2m);
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +int map_regions_p2mt(struct domain *d,
>>> +                     gfn_t gfn,
>>> +                     unsigned long nr,
>>> +                     mfn_t mfn,
>>> +                     p2m_type_t p2mt)
>>> +{
>>> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
>>> +}
>> What is this function doing here? The description says "for generic mapping
>> purposes", which really may mean anything. Plus, if and when you need it, it
>> wants to come with a name that fits with e.g. ...
> 
> These names are used across the common code and various architectures. Not all
> architectures need to implement all of these functions.
> I believe|guest_physmap_add_page()| (which internally calls|guest_physmap_add_entry()|)
> is needed to be implemented for all architectures, while|map_regions_p2mt()| is used
> by Arm and the common Dom0less-related code, and because of RISC-V is going to re-use
> common Dom0less code it is implementing this function too.

First, my comment was solely about this one function above. And then I didn't
even know Arm had such a function. It's not used from common code (except again
from dom0less code where it should have been better abstracted, imo). I'm also
not surprised I wasn't aware of it since, as can be implied from the above,
otherwise I would likely have complained about its name not fitting the general
scheme (which isn't all that good either).

>>> +int guest_physmap_add_entry(struct domain *d,
>>> +                            gfn_t gfn,
>>> +                            mfn_t mfn,
>>> +                            unsigned long page_order,
>>> +                            p2m_type_t t)
>> ... this one, to understand their relationship / difference.
> 
> Basically, the difference is only in API and where they are expected to be used:
> - guest_physmap_add_entry() to map and set a specific p2m type for a page.
> - map_regions_p2mt() to map a region (mostly MMIO) in the guest p2m with
>    a specific p2m type.

Sorry, from this description they still look basically identical to me. The
visible difference being that one takes a "nr" argument and the other a
"page_order" one. Which still makes them largely redundant, and which still
suggests that the earlier one's name doesn't really fit.
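
For illustration only, a minimal sketch of that overlap (not the code from the
patch): the types are simplified stand-ins, and insert_mapping(), map_regions()
and physmap_add_entry() are hypothetical names. A "page_order" interface reduces
to the "nr" one by turning the order into a page count.

```c
#include <assert.h>

/* Simplified stand-ins for Xen's gfn_t / mfn_t / p2m_type_t (assumptions). */
typedef unsigned long gfn_t;
typedef unsigned long mfn_t;
typedef int p2m_type_t;

/* Records the last request so the two entry points can be compared. */
static unsigned long last_gfn, last_nr, last_mfn;

/* Hypothetical common helper that both entry points funnel into. */
static int insert_mapping(gfn_t gfn, unsigned long nr, mfn_t mfn, p2m_type_t t)
{
    (void)t;
    last_gfn = gfn;
    last_nr = nr;
    last_mfn = mfn;
    return 0;
}

/* "nr"-based interface: maps nr pages starting at gfn. */
static int map_regions(gfn_t gfn, unsigned long nr, mfn_t mfn, p2m_type_t t)
{
    return insert_mapping(gfn, nr, mfn, t);
}

/* "page_order"-based interface: maps 2^page_order pages starting at gfn. */
static int physmap_add_entry(gfn_t gfn, mfn_t mfn, unsigned int page_order,
                             p2m_type_t t)
{
    assert(page_order < sizeof(unsigned long) * 8);
    return insert_mapping(gfn, 1UL << page_order, mfn, t);
}
```

With these stand-ins, an order-9 request and an nr=512 request end up as the
same call into the helper.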

> I added both of them here as they are implemented in a similar way.
> I will re-word commit subject and message:
>    xen/riscv: implement functions to map memory in guest p2m
> 
>    Introduce guest_physmap_add_entry() to map a page and assign a specific
>    p2m type, and map_regions_p2mt() to map a region (typically MMIO) in
>    the guest p2m with a designated p2m type.

I.e., as per above, two functions for basically the same purpose.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-07-03 11:33       ` Jan Beulich
@ 2025-07-03 11:54         ` Oleksii Kurochko
  2025-07-03 13:09           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-03 11:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 7/3/25 1:33 PM, Jan Beulich wrote:
> On 03.07.2025 13:02, Oleksii Kurochko wrote:
>> On 6/30/25 5:59 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
>>>> +{
>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +    int rc;
>>>> +
>>>> +    p2m_write_lock(p2m);
>>>> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
>>>> +    p2m_write_unlock(p2m);
>>>> +
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +int map_regions_p2mt(struct domain *d,
>>>> +                     gfn_t gfn,
>>>> +                     unsigned long nr,
>>>> +                     mfn_t mfn,
>>>> +                     p2m_type_t p2mt)
>>>> +{
>>>> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
>>>> +}
>>> What is this function doing here? The description says "for generic mapping
>>> purposes", which really may mean anything. Plus, if and when you need it, it
>>> wants to come with a name that fits with e.g. ...
>> These names are used across the common code and various architectures. Not all
>> architectures need to implement all of these functions.
>> I believe|guest_physmap_add_page()| (which internally calls|guest_physmap_add_entry()|)
>> is needed to be implemented for all architectures, while|map_regions_p2mt()| is used
>> by Arm and the common Dom0less-related code, and because of RISC-V is going to re-use
>> common Dom0less code it is implementing this function too.
> First, my comment was solely about this one function above. And then I didn't
> even know Arm had such a function. It's not used from common code (except again
> from dom0less code where it should have been better abstracted, imo). I'm also
> not surprised I wasn't aware of it since, as can be implied from the above,
> otherwise I would likely have complained about its name not fitting the general
> scheme (which isn't all that good either).

If I'm right, there is nothing similar to map_regions_p2mt() in the common headers.

Anyway, I think we could follow up with a patch to rename this function to
something more appropriate.

I was thinking about adding something like map_regions_to_guest(), map_p2m_regions(),
or map_p2m_memory() to xen/mm.h, along with proper renaming in the Arm code.

Does that make sense?

>
>>>> +int guest_physmap_add_entry(struct domain *d,
>>>> +                            gfn_t gfn,
>>>> +                            mfn_t mfn,
>>>> +                            unsigned long page_order,
>>>> +                            p2m_type_t t)
>>> ... this one, to understand their relationship / difference.
>> Basically, the difference is only in API and where they are expected to be used:
>> - guest_physmap_add_entry() to map and set a specific p2m type for a page.
>> - map_regions_p2mt() to map a region (mostly MMIO) in the guest p2m with
>>     a specific p2m type.
> Sorry, from this description they still look basically identical to me. The
> visible difference being that one takes a "nr" argument and the other a
> "page_order" one. Which still makes them largely redundant, and which still
> suggests that the earlier one's name doesn't really fit.
>
>> I added both of them here as they are implemented in a similar way.
>> I will re-word commit subject and message:
>>     xen/riscv: implement functions to map memory in guest p2m
>>
>>     Introduce guest_physmap_add_entry() to map a page and assign a specific
>>     p2m type, and map_regions_p2mt() to map a region (typically MMIO) in
>>     the guest p2m with a designated p2m type.
> I.e., as per above, two functions for basically the same purpose.

Generally, I agree that the purpose is the same.

I will then just drop guest_physmap_add_entry() and use only
map_regions_p2mt() (or whatever name we decide on).

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-07-03 11:54         ` Oleksii Kurochko
@ 2025-07-03 13:09           ` Jan Beulich
  2025-07-03 13:28             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-03 13:09 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 03.07.2025 13:54, Oleksii Kurochko wrote:
> 
> On 7/3/25 1:33 PM, Jan Beulich wrote:
>> On 03.07.2025 13:02, Oleksii Kurochko wrote:
>>> On 6/30/25 5:59 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
>>>>> +{
>>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>>> +    int rc;
>>>>> +
>>>>> +    p2m_write_lock(p2m);
>>>>> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
>>>>> +    p2m_write_unlock(p2m);
>>>>> +
>>>>> +    return rc;
>>>>> +}
>>>>> +
>>>>> +int map_regions_p2mt(struct domain *d,
>>>>> +                     gfn_t gfn,
>>>>> +                     unsigned long nr,
>>>>> +                     mfn_t mfn,
>>>>> +                     p2m_type_t p2mt)
>>>>> +{
>>>>> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
>>>>> +}
>>>> What is this function doing here? The description says "for generic mapping
>>>> purposes", which really may mean anything. Plus, if and when you need it, it
>>>> wants to come with a name that fits with e.g. ...
>>> These names are used across the common code and various architectures. Not all
>>> architectures need to implement all of these functions.
>>> I believe|guest_physmap_add_page()| (which internally calls|guest_physmap_add_entry()|)
>>> is needed to be implemented for all architectures, while|map_regions_p2mt()| is used
>>> by Arm and the common Dom0less-related code, and because of RISC-V is going to re-use
>>> common Dom0less code it is implementing this function too.
>> First, my comment was solely about this one function above. And then I didn't
>> even know Arm had such a function. It's not used from common code (except again
>> from dom0less code where it should have been better abstracted, imo). I'm also
>> not surprised I wasn't aware of it since, as can be implied from the above,
>> otherwise I would likely have complained about its name not fitting the general
>> scheme (which isn't all that good either).
> 
> If I'm right, there is nothing similar to|map_regions_p2mt()| in the common headers.
> 
> Anyway, I think we could follow up with a patch to rename this function to
> something more appropriate.
> 
> I was thinking about adding something like|map_regions_to_guest()|,|map_p2m_regions()|,
> or|map_p2m_memory()| to|xen/mm.h|, along with proper renaming in the Arm code.
> 
> Does that make sense?

Imo that seemingly redundant function (i.e. if it's really needed) would want
to be named guest_physmap_<whatever>().

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-07-03 13:09           ` Jan Beulich
@ 2025-07-03 13:28             ` Oleksii Kurochko
  2025-07-03 13:34               ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-03 13:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 7/3/25 3:09 PM, Jan Beulich wrote:
> On 03.07.2025 13:54, Oleksii Kurochko wrote:
>> On 7/3/25 1:33 PM, Jan Beulich wrote:
>>> On 03.07.2025 13:02, Oleksii Kurochko wrote:
>>>> On 6/30/25 5:59 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
>>>>>> +{
>>>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>>>> +    int rc;
>>>>>> +
>>>>>> +    p2m_write_lock(p2m);
>>>>>> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
>>>>>> +    p2m_write_unlock(p2m);
>>>>>> +
>>>>>> +    return rc;
>>>>>> +}
>>>>>> +
>>>>>> +int map_regions_p2mt(struct domain *d,
>>>>>> +                     gfn_t gfn,
>>>>>> +                     unsigned long nr,
>>>>>> +                     mfn_t mfn,
>>>>>> +                     p2m_type_t p2mt)
>>>>>> +{
>>>>>> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
>>>>>> +}
>>>>> What is this function doing here? The description says "for generic mapping
>>>>> purposes", which really may mean anything. Plus, if and when you need it, it
>>>>> wants to come with a name that fits with e.g. ...
>>>> These names are used across the common code and various architectures. Not all
>>>> architectures need to implement all of these functions.
>>>> I believe|guest_physmap_add_page()| (which internally calls|guest_physmap_add_entry()|)
>>>> is needed to be implemented for all architectures, while|map_regions_p2mt()| is used
>>>> by Arm and the common Dom0less-related code, and because of RISC-V is going to re-use
>>>> common Dom0less code it is implementing this function too.
>>> First, my comment was solely about this one function above. And then I didn't
>>> even know Arm had such a function. It's not used from common code (except again
>>> from dom0less code where it should have been better abstracted, imo). I'm also
>>> not surprised I wasn't aware of it since, as can be implied from the above,
>>> otherwise I would likely have complained about its name not fitting the general
>>> scheme (which isn't all that good either).
>> If I'm right, there is nothing similar to|map_regions_p2mt()| in the common headers.
>>
>> Anyway, I think we could follow up with a patch to rename this function to
>> something more appropriate.
>>
>> I was thinking about adding something like|map_regions_to_guest()|,|map_p2m_regions()|,
>> or|map_p2m_memory()| to|xen/mm.h|, along with proper renaming in the Arm code.
>>
>> Does that make sense?
> Imo that seemingly redundant function (i.e. if it's really needed) would want
> to be named guest_physmap_<whatever>().

If it is redundant, what is expected to be used instead of map_regions_p2mt() to map,
for example, MMIO to a guest: guest_physmap_add_page()? Judging by the comment above the
definition of that function, it is for RAM only: /* Untyped version for RAM only, for compatibility */

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs
  2025-07-03 13:28             ` Oleksii Kurochko
@ 2025-07-03 13:34               ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-03 13:34 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 03.07.2025 15:28, Oleksii Kurochko wrote:
> 
> On 7/3/25 3:09 PM, Jan Beulich wrote:
>> On 03.07.2025 13:54, Oleksii Kurochko wrote:
>>> On 7/3/25 1:33 PM, Jan Beulich wrote:
>>>> On 03.07.2025 13:02, Oleksii Kurochko wrote:
>>>>> On 6/30/25 5:59 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> +                              unsigned long nr, mfn_t mfn, p2m_type_t t)
>>>>>>> +{
>>>>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>>>>> +    int rc;
>>>>>>> +
>>>>>>> +    p2m_write_lock(p2m);
>>>>>>> +    rc = p2m_set_entry(p2m, start_gfn, nr, mfn, t, p2m->default_access);
>>>>>>> +    p2m_write_unlock(p2m);
>>>>>>> +
>>>>>>> +    return rc;
>>>>>>> +}
>>>>>>> +
>>>>>>> +int map_regions_p2mt(struct domain *d,
>>>>>>> +                     gfn_t gfn,
>>>>>>> +                     unsigned long nr,
>>>>>>> +                     mfn_t mfn,
>>>>>>> +                     p2m_type_t p2mt)
>>>>>>> +{
>>>>>>> +    return p2m_insert_mapping(d, gfn, nr, mfn, p2mt);
>>>>>>> +}
>>>>>> What is this function doing here? The description says "for generic mapping
>>>>>> purposes", which really may mean anything. Plus, if and when you need it, it
>>>>>> wants to come with a name that fits with e.g. ...
>>>>> These names are used across the common code and various architectures. Not all
>>>>> architectures need to implement all of these functions.
>>>>> I believe|guest_physmap_add_page()| (which internally calls|guest_physmap_add_entry()|)
>>>>> is needed to be implemented for all architectures, while|map_regions_p2mt()| is used
>>>>> by Arm and the common Dom0less-related code, and because of RISC-V is going to re-use
>>>>> common Dom0less code it is implementing this function too.
>>>> First, my comment was solely about this one function above. And then I didn't
>>>> even know Arm had such a function. It's not used from common code (except again
>>>> from dom0less code where it should have been better abstracted, imo). I'm also
>>>> not surprised I wasn't aware of it since, as can be implied from the above,
>>>> otherwise I would likely have complained about its name not fitting the general
>>>> scheme (which isn't all that good either).
>>> If I'm right, there is nothing similar to|map_regions_p2mt()| in the common headers.
>>>
>>> Anyway, I think we could follow up with a patch to rename this function to
>>> something more appropriate.
>>>
>>> I was thinking about adding something like|map_regions_to_guest()|,|map_p2m_regions()|,
>>> or|map_p2m_memory()| to|xen/mm.h|, along with proper renaming in the Arm code.
>>>
>>> Does that make sense?
>> Imo that seemingly redundant function (i.e. if it's really needed) would want
>> to be named guest_physmap_<whatever>().
> 
> If it is redundant what is expected to be used instead to map_regions_p2mt() to map MMIO,
> for example, to guest: guest_physmap_add_page()? Based on the comment above the definition
> of this function it is for RAM: /* Untyped version for RAM only, for compatibility */

But we're talking about guest_physmap_add_entry().

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-01 13:49   ` Jan Beulich
@ 2025-07-04 15:01     ` Oleksii Kurochko
  2025-07-07  7:20       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-04 15:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 7/1/25 3:49 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
>> RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
>> modifications.
>>
>> Key differences include:
>> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>>    break-before-make (BBM). As a result, the flushing logic is simplified.
>>    TLB invalidation can be deferred until p2m_write_unlock() is called.
>>    Consequently, the p2m->need_flush flag is always considered true and is
>>    removed.
>> - Page Table Traversal: The order of walking the page tables differs from Arm,
>>    and this implementation reflects that reversed traversal.
>> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>>    P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
>>
>> The main functionality is in __p2m_set_entry(), which handles mappings aligned
>> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>>
>> p2m_set_entry() breaks a region down into block-aligned mappings and calls
>> __p2m_set_entry() accordingly.
>>
>> Stub implementations (to be completed later) include:
>> - p2m_free_entry()
> What would a function of this name do?

It recursively visits all leaf PTEs of the sub-tree behind an entry and calls
put_page() on each (which frees the page if no references to it remain), then
frees the intermediate page table (after all its entries were freed) by removing
it from d->arch.paging.freelist, and removes the corresponding page of the
intermediate page table from the p2m->pages list.
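
A minimal sketch of that traversal order, under loud assumptions: this is a toy
model, not Xen code. struct toy_pte, struct toy_table and toy_free_subtree() are
made up for illustration, a plain refcount decrement stands in for put_page(),
and free() stands in for returning the table page to the p2m pool.

```c
#include <assert.h>
#include <stdlib.h>

#define ENTRIES 4  /* toy fan-out; a real table has 512 entries */

struct toy_table;

struct toy_pte {
    struct toy_table *next;  /* non-NULL: points to a lower-level table */
    int *page_refcnt;        /* non-NULL: leaf, points to the page's refcount */
};

struct toy_table {
    struct toy_pte e[ENTRIES];
};

static unsigned int tables_freed;

/* Drop the references held by the sub-tree behind pte, then free its tables. */
static void toy_free_subtree(struct toy_pte pte)
{
    if ( pte.page_refcnt )
    {
        assert(*pte.page_refcnt > 0);
        --*pte.page_refcnt;          /* stands in for put_page() */
        return;
    }

    if ( !pte.next )
        return;                      /* empty entry, nothing to do */

    for ( unsigned int i = 0; i < ENTRIES; i++ )
        toy_free_subtree(pte.next->e[i]);

    free(pte.next);                  /* the intermediate table goes last */
    tables_freed++;
}
```

The point of the sketch is only the ordering: leaves first, the table that
referenced them afterwards.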

> You can clear entries, but you can't
> free them, can you?

Is this a question about terminology? I can't free the entry itself, but the page
table or the page (if it is a leaf entry) that it points to can be freed.

>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -9,8 +9,13 @@
>>   #include <xen/rwlock.h>
>>   #include <xen/types.h>
>>   
>> +#include <asm/page.h>
>>   #include <asm/page-bits.h>
>>   
>> +#define P2M_ROOT_LEVEL  HYP_PT_ROOT_LEVEL
>> +#define P2M_ROOT_ORDER  XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL)
> This is confusing, as in patch 6 we see that p2m root table order is 2.
> Something needs doing about the naming, so the two sets of things can't
> be confused.

Agreed, that is confusing.

I will define P2M_ROOT_ORDER as get_order_from_bytes(GUEST_ROOT_PAGE_TABLE_SIZE)
(or declare a new variable to store this value).

Actually, the way it's currently defined was only needed for p2m_get_root_pointer()
to find the root page table by GFN, but XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL) is used
explicitly there, so I just missed doing a proper cleanup.
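
As a worked example of the arithmetic, here is a minimal sketch. It assumes
4 KiB pages and a 16 KiB root table (which matches the root table order of 2
mentioned above); order_from_bytes() is a stand-in written for illustration,
not Xen's get_order_from_bytes() itself.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Smallest order such that (1 << order) pages cover size bytes. */
static unsigned int order_from_bytes(size_t size)
{
    unsigned int order = 0;
    size_t pages;

    assert(size > 0);

    /* Round the byte count up to whole pages. */
    pages = (size + ((size_t)1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT;

    while ( ((size_t)1 << order) < pages )
        order++;

    return order;
}
```

Under these assumptions, order_from_bytes(16384) evaluates to 2, i.e. four
contiguous pages for the root table.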


>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -231,6 +231,8 @@ int p2m_init(struct domain *d)
>>       INIT_PAGE_LIST_HEAD(&p2m->pages);
>>   
>>       p2m->vmid = INVALID_VMID;
>> +    p2m->max_mapped_gfn = _gfn(0);
>> +    p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
>>   
>>       p2m->default_access = p2m_access_rwx;
>>   
>> @@ -325,6 +327,214 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>>       return 0;
>>   }
>>   
>> +/*
>> + * Find and map the root page table. The caller is responsible for
>> + * unmapping the table.
>> + *
>> + * The function will return NULL if the offset of the root table is
>> + * invalid.
> Don't you mean "offset into ..."?

If you hadn't suggested that, I would have thought the meanings of "of" and "into" were
pretty close. But semantically "into" does seem more accurate and better conveys the
intent of the code.

>> + */
>> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> +{
>> +    unsigned long root_table_indx;
>> +
>> +    root_table_indx = gfn_x(gfn) >> XEN_PT_LEVEL_ORDER(P2M_ROOT_LEVEL);
>> +    if ( root_table_indx >= P2M_ROOT_PAGES )
>> +        return NULL;
>> +
>> +    return __map_domain_page(p2m->root + root_table_indx);
>> +}
>> +
>> +static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
> The rule of thumb is to have inline functions only in header files, leaving
> decisions to the compiler elsewhere.

I am not sure what you mean in the second part (after the comma) of your sentence.

>> +{
>> +    panic("%s: isn't implemented for now\n", __func__);
>> +
>> +    return false;
>> +}
> For this function in particular, though: Besides the "p2me" in the name
> being somewhat odd (supposedly page table entries here are simply pte_t),
> how is this going to be different from pte_is_valid()?

pte_is_valid() checks a real bit of the PTE, whereas p2me_is_valid() checks
which type is stored in the radix tree (p2m->p2m_types):
   /*
    * In the case of the P2M, the valid bit is used for other purpose. Use
    * the type to check whether an entry is valid.
    */
   static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
   {
       return p2m_type_radix_get(p2m, pte) != p2m_invalid;
   }

This is done to track which pages were modified by a guest.
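
To illustrate the distinction, a minimal sketch (assumptions: the type is kept
in a plain struct field instead of a radix tree, and struct p2m_entry is a
made-up container; only the V bit being bit 0 of a RISC-V PTE is taken from the
architecture):

```c
#include <assert.h>
#include <stdbool.h>

#define PTE_VALID 0x1UL  /* bit 0 is the V bit of a RISC-V PTE */

typedef struct { unsigned long pte; } pte_t;
typedef enum { p2m_invalid = 0, p2m_ram_rw } p2m_type_t;

/* Hypothetical stand-in: the type sits next to the PTE, not in a radix tree. */
struct p2m_entry {
    pte_t pte;
    p2m_type_t type;
};

/* Hardware view: is the V bit set? */
static bool pte_is_valid(pte_t pte)
{
    return pte.pte & PTE_VALID;
}

/* P2M view: does the tracked type say that a mapping exists? */
static bool p2me_is_valid(const struct p2m_entry *e)
{
    assert(e);
    return e->type != p2m_invalid;
}
```

An entry whose V bit has been cleared so that accesses trap is then still
"valid" from the P2M point of view as long as its type is not p2m_invalid.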


>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>> +{
>> +    write_pte(p, pte);
>> +    if ( clean_pte )
>> +        clean_dcache_va_range(p, sizeof(*p));
>> +}
>> +
>> +static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
>> +{
>> +    pte_t pte;
>> +
>> +    memset(&pte, 0x00, sizeof(pte));
>> +    p2m_write_pte(p, pte, clean_pte);
>> +}
> May I suggest "clear" instead of "remove" and plain 0 instead of 0x00
> (or simply give the variable a trivial initializer)?

Sure, I will rename and use plain 0.

>
> As to the earlier function that I commented on: Seeing the names here,
> wouldn't p2m_pte_is_valid() be a more consistent name there?

Then all p2me_*() functions should be renamed to p2m_pte_*().

But the initial logic was that p2me = p2m entry = p2m page table entry.

We can probably just go back to the p2m_ prefix, as the arguments make it
clear that the function works with a P2M PTE.

>
>> +static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn,
>> +                                p2m_type_t t, p2m_access_t a)
>> +{
>> +    panic("%s: hasn't been implemented yet\n", __func__);
>> +
>> +    return (pte_t) { .pte = 0 };
>> +}
> And then perhaps p2m_pte_from_mfn() here?
>
>> +#define GUEST_TABLE_MAP_NONE 0
>> +#define GUEST_TABLE_MAP_NOMEM 1
>> +#define GUEST_TABLE_SUPER_PAGE 2
>> +#define GUEST_TABLE_NORMAL 3
> Is GUEST_ a good prefix? The guest doesn't control these tables, and the
> word could also mean the guest's own page tables.

Then a P2M_ prefix would be better.

>
>> +/*
>> + * Take the currently mapped table, find the corresponding GFN entry,
> That's not what you mean though, is it? It's more like "the entry
> corresponding to the GFN" (implying "at the given level").

That is clearer; I'll update the comment.

>
>> + * and map the next table, if available. The previous table will be
>> + * unmapped if the next level was mapped (e.g GUEST_TABLE_NORMAL
>> + * returned).
>> + *
>> + * `alloc_tbl` parameter indicates whether intermediate tables should
>> + * be allocated when not present.
>> + *
>> + * Return values:
>> + *  GUEST_TABLE_MAP_NONE: a table allocation isn't permitted.
>> + *  GUEST_TABLE_MAP_NOMEM: allocating a new page failed.
>> + *  GUEST_TABLE_SUPER_PAGE: next level or leaf mapped normally.
>> + *  GUEST_TABLE_NORMAL: The next entry points to a superpage.
>> + */
>> +static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>> +                          unsigned int level, pte_t **table,
>> +                          unsigned int offset)
>> +{
>> +    panic("%s: hasn't been implemented yet\n", __func__);
>> +
>> +    return GUEST_TABLE_MAP_NONE;
>> +}
>> +
>> +/* Free pte sub-tree behind an entry */
>> +static void p2m_free_entry(struct p2m_domain *p2m,
>> +                           pte_t entry, unsigned int level)
>> +{
>> +    panic("%s: hasn't been implemented yet\n", __func__);
>> +}
>> +
>> +/*
>> + * Insert an entry in the p2m. This should be called with a mapping
>> + * equal to a page/superpage.
>> + */
>> +static int __p2m_set_entry(struct p2m_domain *p2m,
> No double leading underscores, please. A single one is fine and will do.
>
>> +                           gfn_t sgfn,
>> +                           unsigned int page_order,
>> +                           mfn_t smfn,
> What are the "s" in "sgfn" and "smfn" indicating? Possibly "start", except
> that you don't process multiple GFNs here (unlike in the caller).

Yes, it stands for "start". I agree that the "s" prefix is not really necessary in
__p2m_set_entry(). I'll rename them for __p2m_set_entry().

>
>> +                           p2m_type_t t,
>> +                           p2m_access_t a)
>> +{
>> +    unsigned int level;
>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>> +    pte_t *entry, *table, orig_pte;
>> +    int rc;
>> +    /* A mapping is removed if the MFN is invalid. */
>> +    bool removing_mapping = mfn_eq(smfn, INVALID_MFN);
>> +    DECLARE_OFFSETS(offsets, gfn_to_gaddr(sgfn));
>> +
>> +    ASSERT(p2m_is_write_locked(p2m));
>> +
>> +    /*
>> +     * Check if the level target is valid: we only support
>> +     * 4K - 2M - 1G mapping.
>> +     */
>> +    ASSERT(target <= 2);
> No provisions towards the division that produced the value having left
> a remainder?

The way the order is initialized will always result in division without
a remainder.

If it makes sense, the ASSERT() could be updated to ensure that the order
is always a multiple of PAGETABLE_ORDER:
   ASSERT((target <= 2) && !(page_order % PAGETABLE_ORDER));
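
The intent of that check, i.e. that the order is a whole multiple of
PAGETABLE_ORDER and maps to a supported level, can be sketched as follows. This
is a minimal stand-alone sketch: order_to_target() is a hypothetical helper,
and PAGETABLE_ORDER is assumed to be 9 (4 KiB pages, 512 entries per table).

```c
#include <assert.h>
#include <stdbool.h>

#define PAGETABLE_ORDER 9  /* assumption: 512 entries per table */

/*
 * Returns true and stores the page-table level in *target when page_order
 * describes a 4K (level 0), 2M (level 1) or 1G (level 2) mapping.
 */
static bool order_to_target(unsigned int page_order, unsigned int *target)
{
    assert(target);

    /*
     * Plain modulo: 9 is not a power of two, so a mask-based alignment
     * test (which assumes a power-of-two alignment) would not fit here.
     */
    if ( page_order % PAGETABLE_ORDER )
        return false;

    *target = page_order / PAGETABLE_ORDER;

    return *target <= 2;
}
```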

>
>> +    table = p2m_get_root_pointer(p2m, sgfn);
>> +    if ( !table )
>> +        return -EINVAL;
>> +
>> +    for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>> +    {
>> +        /*
>> +         * Don't try to allocate intermediate page table if the mapping
>> +         * is about to be removed.
>> +         */
>> +        rc = p2m_next_level(p2m, !removing_mapping,
>> +                            level, &table, offsets[level]);
>> +        if ( (rc == GUEST_TABLE_MAP_NONE) || (rc == GUEST_TABLE_MAP_NOMEM) )
>> +        {
>> +            /*
>> +             * We are here because p2m_next_level has failed to map
>> +             * the intermediate page table (e.g the table does not exist
>> +             * and they p2m tree is read-only). It is a valid case
>> +             * when removing a mapping as it may not exist in the
>> +             * page table. In this case, just ignore it.
>> +             */
>> +            rc = removing_mapping ?  0 : -ENOENT;
> Shouldn't GUEST_TABLE_MAP_NOMEM be transformed to -ENOMEM?

Maybe, but I don't think it is really necessary to be that precise here. -ENOENT
could cover both GUEST_TABLE_MAP_NONE and GUEST_TABLE_MAP_NOMEM.
Anyway, for consistency I will change this code to:
             rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
             /*
              * We are here because p2m_next_level has failed to map
              * the intermediate page table (e.g the table does not exist
              * and the p2m tree is read-only). It is a valid case
              * when removing a mapping as it may not exist in the
              * page table. In this case, just ignore it.
              */
             rc = removing_mapping ?  0 : rc;
             goto out;
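
The resulting mapping from walk result to error code can be sketched on its own
as below. This is a minimal sketch: next_level_failure_to_errno() is a
hypothetical helper, and the two P2M_TABLE_MAP_* values are copied from the
GUEST_TABLE_MAP_* definitions quoted above with the new prefix.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define P2M_TABLE_MAP_NONE  0  /* table allocation was not permitted */
#define P2M_TABLE_MAP_NOMEM 1  /* allocating a new table failed */

/* Translate a failed page-table walk step into the value to return. */
static int next_level_failure_to_errno(int rc, bool removing_mapping)
{
    assert(rc == P2M_TABLE_MAP_NONE || rc == P2M_TABLE_MAP_NOMEM);

    /* Nothing to remove if the intermediate table does not exist. */
    if ( removing_mapping )
        return 0;

    return (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
}
```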

>> @@ -332,7 +542,55 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>>                            p2m_type_t t,
>>                            p2m_access_t a)
>>   {
>> -    return -EOPNOTSUPP;
>> +    int rc = 0;
>> +
>> +    /*
>> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
>> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
>> +     * be accessible after, we need to prevent mapping to be added when the
>> +     * domain is dying.
>> +     */
>> +    if ( unlikely(p2m->domain->is_dying) )
>> +        return -ENOMEM;
> Why ENOMEM?

I expect that when a domain is dying, there's no point in using its
memory, either because it's no longer available or because it has already been
freed. Basically, no memory.

>
>> +    while ( nr )
> Why's there a loop here? The function name uses singular, i.e. means to
> create exactly one entry.

I will rename the function to p2m_set_entries().

>
>> +    {
>> +        unsigned long mask;
>> +        unsigned long order = 0;
> unsigned int?
>
>> +        /* 1gb, 2mb, 4k mappings are supported */
>> +        unsigned int i = ( P2M_ROOT_LEVEL > 2 ) ? 2 : P2M_ROOT_LEVEL;
> Not (style): Excess blanks. Yet then aren't you open-coding min() here
> anyway?

Yes, it is open-coded version of min(). I will use min() instead.

> Plus isn't P2M_ROOT_LEVEL always >= 2?

For Sv32, P2M_ROOT_LEVEL is 1; for other modes it is really always >= 2.


>
>> +        /*
>> +         * Don't take into account the MFN when removing mapping (i.e
>> +         * MFN_INVALID) to calculate the correct target order.
>> +         *
>> +         * XXX: Support superpage mappings if nr is not aligned to a
>> +         * superpage size.
>> +         */
> Does this really need leaving as a to-do?

I think so, yes. It won't break the current workflow if nr isn't aligned;
a smaller order will simply be chosen.

>
>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>> +        mask |= gfn_x(sgfn) | nr;
>> +
>> +        for ( ; i != 0; i-- )
>> +        {
>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>> +            {
>> +                    order = XEN_PT_LEVEL_ORDER(i);
>> +                    break;
> Nit: Style.
>
>> +            }
>> +        }
>> +
>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>> +        if ( rc )
>> +            break;
>> +
>> +        sgfn = gfn_add(sgfn, (1 << order));
>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>> +           smfn = mfn_add(smfn, (1 << order));
>> +
>> +        nr -= (1 << order);
> Throughout maybe better be safe right away and use 1UL?
>
>> +    }
>> +
>> +    return rc;
>>   }
> How's the caller going to know how much of the range was successfully
> mapped?

There is no such option. Do other arches do that? I mean, do they somehow
return the number of successfully mapped (sgfn, smfn) pairs?

> That part may need undoing (if not here, then in the caller),
> or a caller may want to retry.

So if rc != 0, the caller can just undo the full range
(by using the same sgfn, nr, smfn).
Or, as an option, it can walk the range (sgfn, nr), look up each entry, and
clear it if it was mapped; otherwise it just stops.

Yes, it isn't optimal, but it should work.

Thanks.

~ Oleksii


* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-04 15:01     ` Oleksii Kurochko
@ 2025-07-07  7:20       ` Jan Beulich
  2025-07-07 11:46         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-07  7:20 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 04.07.2025 17:01, Oleksii Kurochko wrote:
> On 7/1/25 3:49 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
>>> RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
>>> modifications.
>>>
>>> Key differences include:
>>> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>>>    break-before-make (BBM). As a result, the flushing logic is simplified.
>>>    TLB invalidation can be deferred until p2m_write_unlock() is called.
>>>    Consequently, the p2m->need_flush flag is always considered true and is
>>>    removed.
>>> - Page Table Traversal: The order of walking the page tables differs from Arm,
>>>    and this implementation reflects that reversed traversal.
>>> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>>>    P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
>>>
>>> The main functionality is in __p2m_set_entry(), which handles mappings aligned
>>> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>>>
>>> p2m_set_entry() breaks a region down into block-aligned mappings and calls
>>> __p2m_set_entry() accordingly.
>>>
>>> Stub implementations (to be completed later) include:
>>> - p2m_free_entry()
>> What would a function of this name do?
> 
> Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
> put_page() (which will free if there is no any reference to this page),
> freeing intermediate page table (after all entries were freed) by removing
> it from d->arch.paging.freelist, and removes correspondent page of intermediate page
> table from p2m->pages list.
> 
>> You can clear entries, but you can't
>> free them, can you?
> 
> Is is a question regarding terminology?

Yes. If one sees a call to a function, it should be possible to at least
roughly know what it does without needing to go look at the implementation.

> I can't free entry itself, but a page table or
> a page (if it is a leaf entry) on which it points could free.

Then e.g. pte_free_subtree() or some such?

>>> +static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>> The rule of thumb is to have inline functions only in header files, leaving
>> decisions to the compiler elsewhere.
> 
> I am not sure what you mean in the second part (after coma) of your sentence.

The compiler does its own inlining decisions quite fine when it can see all
call sites (as is the case for static functions). Hence in general you want
to omit "inline" there. Except of course in header files, where non-inline
static-s are a problem.

>>> +{
>>> +    panic("%s: isn't implemented for now\n", __func__);
>>> +
>>> +    return false;
>>> +}
>> For this function in particular, though: Besides the "p2me" in the name
>> being somewhat odd (supposedly page table entries here are simply pte_t),
>> how is this going to be different from pte_is_valid()?
> 
> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
> what is a type stored in the radix tree (p2m->p2m_types):
>    /*
>     * In the case of the P2M, the valid bit is used for other purpose. Use
>     * the type to check whether an entry is valid.
>     */
>    static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>    {
>        return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>    }
> 
> It is done to track which page was modified by a guest.

But then (again) the name doesn't convey what the function does. Plus
can't a guest also arrange for an entry's type to move to p2m_invalid?
That's then still an entry that was modified by the guest.

Overall I think I'm lacking clarity what you mean to use this predicate
for.

>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>> +{
>>> +    write_pte(p, pte);
>>> +    if ( clean_pte )
>>> +        clean_dcache_va_range(p, sizeof(*p));
>>> +}
>>> +
>>> +static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
>>> +{
>>> +    pte_t pte;
>>> +
>>> +    memset(&pte, 0x00, sizeof(pte));
>>> +    p2m_write_pte(p, pte, clean_pte);
>>> +}
>> May I suggest "clear" instead of "remove" and plain 0 instead of 0x00
>> (or simply give the variable a trivial initializer)?
> 
> Sure, I will rename and use plain 0.
> 
>>
>> As to the earlier function that I commented on: Seeing the names here,
>> wouldn't p2m_pte_is_valid() be a more consistent name there?
> 
> Then all p2me_*() should be updated to p2m_pte_*().
> 
> But initial logic was that p2me = p2m entry = p2m page table entry.
> 
> Probably we can just return back to the prefix p2m_ as based on arguments
> it is clear that it is a function for working with P2M's PTE.

In the end it's up to you. Having thought about it some more, perhaps
p2me_*() is still quite helpful to separate from functions dealing with
P2Ms as a whole, and to also avoid the verbosity of p2m_pte_*().

>>> +{
>>> +    unsigned int level;
>>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>>> +    pte_t *entry, *table, orig_pte;
>>> +    int rc;
>>> +    /* A mapping is removed if the MFN is invalid. */
>>> +    bool removing_mapping = mfn_eq(smfn, INVALID_MFN);
>>> +    DECLARE_OFFSETS(offsets, gfn_to_gaddr(sgfn));
>>> +
>>> +    ASSERT(p2m_is_write_locked(p2m));
>>> +
>>> +    /*
>>> +     * Check if the level target is valid: we only support
>>> +     * 4K - 2M - 1G mapping.
>>> +     */
>>> +    ASSERT(target <= 2);
>> No provisions towards the division that produced the value having left
>> a remainder?
> 
> The way the order is initialized will always result in division without
> a remainder.
> 
> If it makes sense, the ASSERT() could be updated to ensure that the order
> is always a multiple of PAGETABLE_ORDER:
>    ASSERT((target <= 2) && !IS_ALIGNED(page_order, PAGETABLE_ORDER));

Except that the ! looks wrong here.
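
For illustration only, the intended check could be sketched as below. This is
a standalone sketch, not code from the patch: page_order_is_supported() and
MAX_TARGET_LEVEL are hypothetical names, and PAGETABLE_ORDER is assumed to be
9. Since 9 is not a power of two, the sketch tests the remainder directly
rather than relying on a mask-based alignment macro, which would normally
assume a power-of-two alignment.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 9 index bits per page-table level (4K pages, 512 entries). */
#define PAGETABLE_ORDER 9u

/* Highest level usable for a mapping in this sketch: 0 = 4K, 1 = 2M, 2 = 1G. */
#define MAX_TARGET_LEVEL 2u

/*
 * Hypothetical helper: true if page_order names exactly one supported
 * mapping size, i.e. it is a multiple of PAGETABLE_ORDER and the
 * resulting level is within range.
 */
static bool page_order_is_supported(unsigned int page_order)
{
    unsigned int target = page_order / PAGETABLE_ORDER;

    return (target <= MAX_TARGET_LEVEL) &&
           (page_order % PAGETABLE_ORDER) == 0;
}
```

With such a helper the assertion would read
ASSERT(page_order_is_supported(page_order)), which rejects both an
out-of-range level and an order that leaves a remainder.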

>>> @@ -332,7 +542,55 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>>>                            p2m_type_t t,
>>>                            p2m_access_t a)
>>>   {
>>> -    return -EOPNOTSUPP;
>>> +    int rc = 0;
>>> +
>>> +    /*
>>> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
>>> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
>>> +     * be accessible after, we need to prevent mapping to be added when the
>>> +     * domain is dying.
>>> +     */
>>> +    if ( unlikely(p2m->domain->is_dying) )
>>> +        return -ENOMEM;
>> Why ENOMEM?
> 
> I expect that when a domain is dying, it means there’s no point in using its
> memory—either because it's no longer available or it has already been freed.
> Basically, no memory.

That can end up odd for call sites. Please consider using e.g. EACCES.

>>> +    while ( nr )
>> Why's there a loop here? The function name uses singular, i.e. means to
>> create exactly one entry.
> 
> I will rename the function to  p2m_set_entries().

Or maybe p2m_set_range()?

>>> +        /*
>>> +         * Don't take into account the MFN when removing mapping (i.e
>>> +         * MFN_INVALID) to calculate the correct target order.
>>> +         *
>>> +         * XXX: Support superpage mappings if nr is not aligned to a
>>> +         * superpage size.
>>> +         */
>> Does this really need leaving as a to-do?
> 
> I think so, yes. It won’t break the current workflow if nr isn’t aligned,
> a smaller order will simply be chosen.

Well, my question was more like "Isn't it simple enough to cover the case
right away?"
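
One possible way to cover the case is sketched below, under stated
assumptions. The idea is to require alignment only of the start GFN/MFN and to
treat nr as a size limit, instead of folding nr into the alignment mask.
pick_order() and LEVEL_ORDER() are hypothetical stand-ins (not the patch's
XEN_PT_LEVEL_ORDER()), PAGETABLE_ORDER is assumed to be 9, and a removal is
modelled by passing mfn as 0 so that it does not affect the result.

```c
#include <assert.h>

/* Assumption: 9 index bits per page-table level (4K pages). */
#define PAGETABLE_ORDER 9u
#define LEVEL_ORDER(lvl) ((lvl) * PAGETABLE_ORDER)

/*
 * Hypothetical helper: pick the largest mapping order such that both start
 * frame numbers are aligned to it and at least that many pages remain.
 * max_level is the highest level allowed to hold a leaf (e.g. 2 for 1G).
 */
static unsigned int pick_order(unsigned long gfn, unsigned long mfn,
                               unsigned long nr, unsigned int max_level)
{
    unsigned int lvl;

    for ( lvl = max_level; lvl != 0; lvl-- )
    {
        unsigned long size = 1UL << LEVEL_ORDER(lvl);

        /* Start addresses must be aligned; nr only needs to be big enough. */
        if ( !((gfn | mfn) & (size - 1)) && nr >= size )
            return LEVEL_ORDER(lvl);
    }

    return 0; /* Fall back to a single 4K page. */
}
```

With nr folded into the mask, as in the quoted loop, a request such as gfn 0
with nr 513 would end up mapped entirely with 4K entries. In this sketch the
first 512 pages get order 9 and only the trailing page falls back to order 0.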

>>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>> +        mask |= gfn_x(sgfn) | nr;
>>> +
>>> +        for ( ; i != 0; i-- )
>>> +        {
>>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>> +            {
>>> +                    order = XEN_PT_LEVEL_ORDER(i);
>>> +                    break;
>> Nit: Style.
>>
>>> +            }
>>> +        }
>>> +
>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>> +        if ( rc )
>>> +            break;
>>> +
>>> +        sgfn = gfn_add(sgfn, (1 << order));
>>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>>> +           smfn = mfn_add(smfn, (1 << order));
>>> +
>>> +        nr -= (1 << order);
>> Throughout maybe better be safe right away and use 1UL?
>>
>>> +    }
>>> +
>>> +    return rc;
>>>   }
>> How's the caller going to know how much of the range was successfully
>> mapped?
> 
> There is no such option. Do other arches do that? I mean returns somehow
> the number of successfully mapped (sgfn,smfn).

On x86 we had to introduce some not very nice code to cover for the absence
of proper handling there. For a new port I think it wants at least seriously
considering not to repeat such a potentially unhelpful pattern.

>> That part may need undoing (if not here, then in the caller),
>> or a caller may want to retry.
> 
> So the caller in the case if rc != 0, can just undoing the full range
> (by using the same sgfn, nr, smfn).

Can it? How would it know what the original state was?

Jan



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-07  7:20       ` Jan Beulich
@ 2025-07-07 11:46         ` Oleksii Kurochko
  2025-07-07 12:53           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-07 11:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 7/7/25 9:20 AM, Jan Beulich wrote:
> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
>>>> RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
>>>> modifications.
>>>>
>>>> Key differences include:
>>>> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>>>>     break-before-make (BBM). As a result, the flushing logic is simplified.
>>>>     TLB invalidation can be deferred until p2m_write_unlock() is called.
>>>>     Consequently, the p2m->need_flush flag is always considered true and is
>>>>     removed.
>>>> - Page Table Traversal: The order of walking the page tables differs from Arm,
>>>>     and this implementation reflects that reversed traversal.
>>>> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>>>>     P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
>>>>
>>>> The main functionality is in __p2m_set_entry(), which handles mappings aligned
>>>> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>>>>
>>>> p2m_set_entry() breaks a region down into block-aligned mappings and calls
>>>> __p2m_set_entry() accordingly.
>>>>
>>>> Stub implementations (to be completed later) include:
>>>> - p2m_free_entry()
>>> What would a function of this name do?
>> Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
>> put_page() (which will free if there is no any reference to this page),
>> freeing intermediate page table (after all entries were freed) by removing
>> it from d->arch.paging.freelist, and removes correspondent page of intermediate page
>> table from p2m->pages list.
>>
>>> You can clear entries, but you can't
>>> free them, can you?
>> Is is a question regarding terminology?
> Yes. If one sees a call to a function, it should be possible to at least
> roughly know what it does without needing to go look at the implementation.
>
>> I can't free entry itself, but a page table or
>> a page (if it is a leaf entry) on which it points could free.
> Then e.g. pte_free_subtree() or some such?

That sounds fine to me. I'll use the suggested name.

I just want to note that other arches also have a function with the same
name for the same purpose.
Does it make sense then to change the name for all arches?

>
>>>> +static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>> The rule of thumb is to have inline functions only in header files, leaving
>>> decisions to the compiler elsewhere.
>> I am not sure what you mean in the second part (after coma) of your sentence.
> The compiler does its own inlining decisions quite fine when it can see all
> call sites (as is the case for static functions). Hence in general you want
> to omit "inline" there. Except of course in header files, where non-inline
> static-s are a problem.

Thanks, now it is clear what you meant.

>
>>>> +{
>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>> +
>>>> +    return false;
>>>> +}
>>> For this function in particular, though: Besides the "p2me" in the name
>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>> how is this going to be different from pte_is_valid()?
>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>> what is a type stored in the radix tree (p2m->p2m_types):
>>     /*
>>      * In the case of the P2M, the valid bit is used for other purpose. Use
>>      * the type to check whether an entry is valid.
>>      */
>>     static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>     {
>>         return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>     }
>>
>> It is done to track which page was modified by a guest.
> But then (again) the name doesn't convey what the function does.

Then p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would probably be better.

>   Plus
> can't a guest also arrange for an entry's type to move to p2m_invalid?
> That's then still an entry that was modified by the guest.

I am not sure that I fully understand the question.
Are you asking whether a guest can do something that leads to a call of
p2m_set_entry() with a p2m_invalid argument?
If so, it seems this will only happen via p2m_remove_mapping(), which means
that INVALID_MFN is passed alongside p2m_invalid, i.e. the entry isn't
expected to be used anymore.

> Overall I think I'm lacking clarity what you mean to use this predicate
> for.

By using the "p2me_" prefix I wanted to express that it is not the PTE's valid
bit that is checked, but the type saved in the radix tree.
As suggested above, it is probably better to drop the "e" too and just use
p2m_type_is_valid().

>
>>>> +static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
>>>> +{
>>>> +    write_pte(p, pte);
>>>> +    if ( clean_pte )
>>>> +        clean_dcache_va_range(p, sizeof(*p));
>>>> +}
>>>> +
>>>> +static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
>>>> +{
>>>> +    pte_t pte;
>>>> +
>>>> +    memset(&pte, 0x00, sizeof(pte));
>>>> +    p2m_write_pte(p, pte, clean_pte);
>>>> +}
>>> May I suggest "clear" instead of "remove" and plain 0 instead of 0x00
>>> (or simply give the variable a trivial initializer)?
>> Sure, I will rename and use plain 0.
>>
>>> As to the earlier function that I commented on: Seeing the names here,
>>> wouldn't p2m_pte_is_valid() be a more consistent name there?
>> Then all p2me_*() should be updated to p2m_pte_*().
>>
>> But initial logic was that p2me = p2m entry = p2m page table entry.
>>
>> Probably we can just return back to the prefix p2m_ as based on arguments
>> it is clear that it is a function for working with P2M's PTE.
> In the end it's up to you. Having thought about it some more, perhaps
> p2me_*() is still quite helpful to separate from functions dealing with
> P2Ms as a while, and to also avoid the verbosity of p2m_pte_*().
>
>>>> +{
>>>> +    unsigned int level;
>>>> +    unsigned int target = page_order / PAGETABLE_ORDER;
>>>> +    pte_t *entry, *table, orig_pte;
>>>> +    int rc;
>>>> +    /* A mapping is removed if the MFN is invalid. */
>>>> +    bool removing_mapping = mfn_eq(smfn, INVALID_MFN);
>>>> +    DECLARE_OFFSETS(offsets, gfn_to_gaddr(sgfn));
>>>> +
>>>> +    ASSERT(p2m_is_write_locked(p2m));
>>>> +
>>>> +    /*
>>>> +     * Check if the level target is valid: we only support
>>>> +     * 4K - 2M - 1G mapping.
>>>> +     */
>>>> +    ASSERT(target <= 2);
>>> No provisions towards the division that produced the value having left
>>> a remainder?
>> The way the order is initialized will always result in division without
>> a remainder.
>>
>> If it makes sense, the ASSERT() could be updated to ensure that the order
>> is always a multiple of PAGETABLE_ORDER:
>>     ASSERT((target <= 2) && !IS_ALIGNED(page_order, PAGETABLE_ORDER));
> Except that the ! looks wrong here.

Agree, it shouldn't be here. Thanks.

>
>>>> @@ -332,7 +542,55 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>>>>                             p2m_type_t t,
>>>>                             p2m_access_t a)
>>>>    {
>>>> -    return -EOPNOTSUPP;
>>>> +    int rc = 0;
>>>> +
>>>> +    /*
>>>> +     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
>>>> +     * be dropped in relinquish_p2m_mapping(). As the P2M will still
>>>> +     * be accessible after, we need to prevent mapping to be added when the
>>>> +     * domain is dying.
>>>> +     */
>>>> +    if ( unlikely(p2m->domain->is_dying) )
>>>> +        return -ENOMEM;
>>> Why ENOMEM?
>> I expect that when a domain is dying, it means there’s no point in using its
>> memory—either because it's no longer available or it has already been freed.
>> Basically, no memory.
> That can end up odd for call sites. Please consider using e.g. EACCES.
>
>>>> +    while ( nr )
>>> Why's there a loop here? The function name uses singular, i.e. means to
>>> create exactly one entry.
>> I will rename the function to  p2m_set_entries().
> Or maybe p2m_set_range()?

It is much better.

>
>>>> +        /*
>>>> +         * Don't take into account the MFN when removing mapping (i.e
>>>> +         * MFN_INVALID) to calculate the correct target order.
>>>> +         *
>>>> +         * XXX: Support superpage mappings if nr is not aligned to a
>>>> +         * superpage size.
>>>> +         */
>>> Does this really need leaving as a to-do?
>> I think so, yes. It won’t break the current workflow if nr isn’t aligned,
>> a smaller order will simply be chosen.
> Well, my question was more like "Isn't it simple enough to cover the case
> right away?"
>
>>>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>>> +        mask |= gfn_x(sgfn) | nr;
>>>> +
>>>> +        for ( ; i != 0; i-- )
>>>> +        {
>>>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>>> +            {
>>>> +                    order = XEN_PT_LEVEL_ORDER(i);
>>>> +                    break;
>>> Nit: Style.
>>>
>>>> +            }
>>>> +        }
>>>> +
>>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>>> +        if ( rc )
>>>> +            break;
>>>> +
>>>> +        sgfn = gfn_add(sgfn, (1 << order));
>>>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>>>> +           smfn = mfn_add(smfn, (1 << order));
>>>> +
>>>> +        nr -= (1 << order);
>>> Throughout maybe better be safe right away and use 1UL?
>>>
>>>> +    }
>>>> +
>>>> +    return rc;
>>>>    }
>>> How's the caller going to know how much of the range was successfully
>>> mapped?
>> There is no such option. Do other arches do that? I mean returns somehow
>> the number of successfully mapped (sgfn,smfn).
> On x86 we had to introduce some not very nice code to cover for the absence
> of proper handling there. For a new port I think it wants at least seriously
> considering not to repeat such a potentially unhelpful pattern.
>
>>> That part may need undoing (if not here, then in the caller),
>>> or a caller may want to retry.
>> So the caller in the case if rc != 0, can just undoing the full range
>> (by using the same sgfn, nr, smfn).
> Can it? How would it know what the original state was?

You're right: blindly unmapping the range assumes that no entries were valid
beforehand, and I missed that something valid could already have been mapped
before p2m_set_entry(sgfn,...,smfn) was called.
But then I don't really understand why this wouldn't still be an issue if the
caller knew how many GFNs were successfully mapped.
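
To make the concern concrete, here is a toy model, not Xen code, of what
"undoing" needs when some entries may already have been mapped: the previous
contents have to be recorded before they are overwritten. All names
(demo_p2m, demo_set_one(), demo_set_range_atomic(), fail_at) are hypothetical,
the table is a flat array, and reference counting and TLB flushing are
ignored entirely.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_NR_ENTRIES 8u

/* Toy P2M: index is the GFN, value is the MFN, 0 means "not mapped". */
struct demo_p2m {
    unsigned long entry[DEMO_NR_ENTRIES];
    unsigned long fail_at; /* GFN at which an error is simulated */
};

static int demo_set_one(struct demo_p2m *p2m, unsigned long gfn,
                        unsigned long mfn)
{
    if ( gfn == p2m->fail_at )
        return -ENOMEM;

    p2m->entry[gfn] = mfn;
    return 0;
}

/* Either updates the whole range or leaves the table as it was before. */
static int demo_set_range_atomic(struct demo_p2m *p2m, unsigned long gfn,
                                 unsigned long nr, unsigned long mfn)
{
    unsigned long saved[DEMO_NR_ENTRIES];
    unsigned long i;
    int rc = 0;

    if ( gfn >= DEMO_NR_ENTRIES || nr > DEMO_NR_ENTRIES - gfn )
        return -EINVAL;

    /* Snapshot the original state; without it, an undo can only guess. */
    memcpy(saved, &p2m->entry[gfn], nr * sizeof(saved[0]));

    for ( i = 0; i < nr; i++ )
    {
        rc = demo_set_one(p2m, gfn + i, mfn + i);
        if ( rc )
            break;
    }

    /* On failure, restore only the i entries that were actually changed. */
    if ( rc )
        memcpy(&p2m->entry[gfn], saved, i * sizeof(saved[0]));

    return rc;
}
```

A per-entry snapshot like this is unlikely to be practical for large ranges;
the sketch is only meant to show why clearing the range is not equivalent to
restoring it.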

~ Oleksii


* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-07 11:46         ` Oleksii Kurochko
@ 2025-07-07 12:53           ` Jan Beulich
  2025-07-07 15:00             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-07 12:53 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 07.07.2025 13:46, Oleksii Kurochko wrote:
> On 7/7/25 9:20 AM, Jan Beulich wrote:
>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
>>>>> RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
>>>>> modifications.
>>>>>
>>>>> Key differences include:
>>>>> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>>>>>     break-before-make (BBM). As a result, the flushing logic is simplified.
>>>>>     TLB invalidation can be deferred until p2m_write_unlock() is called.
>>>>>     Consequently, the p2m->need_flush flag is always considered true and is
>>>>>     removed.
>>>>> - Page Table Traversal: The order of walking the page tables differs from Arm,
>>>>>     and this implementation reflects that reversed traversal.
>>>>> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>>>>>     P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
>>>>>
>>>>> The main functionality is in __p2m_set_entry(), which handles mappings aligned
>>>>> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>>>>>
>>>>> p2m_set_entry() breaks a region down into block-aligned mappings and calls
>>>>> __p2m_set_entry() accordingly.
>>>>>
>>>>> Stub implementations (to be completed later) include:
>>>>> - p2m_free_entry()
>>>> What would a function of this name do?
>>> Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
>>> put_page() (which will free if there is no any reference to this page),
>>> freeing intermediate page table (after all entries were freed) by removing
>>> it from d->arch.paging.freelist, and removes correspondent page of intermediate page
>>> table from p2m->pages list.
>>>
>>>> You can clear entries, but you can't
>>>> free them, can you?
>>> Is is a question regarding terminology?
>> Yes. If one sees a call to a function, it should be possible to at least
>> roughly know what it does without needing to go look at the implementation.
>>
>>> I can't free entry itself, but a page table or
>>> a page (if it is a leaf entry) on which it points could free.
>> Then e.g. pte_free_subtree() or some such?
> 
> It sounds fine to me. I'll use suggested name.
> 
> Just want to notice that other arches also have the same function
> for the same purpose with the same name.

As to x86, it's not general P2M code which uses this odd (for the purpose)
name, but only p2m-pt.c.

> Does it make sense then to change a name for all arches?

I think so.

>>>>> +{
>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>> +
>>>>> +    return false;
>>>>> +}
>>>> For this function in particular, though: Besides the "p2me" in the name
>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>> how is this going to be different from pte_is_valid()?
>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>     /*
>>>      * In the case of the P2M, the valid bit is used for other purpose. Use
>>>      * the type to check whether an entry is valid.
>>>      */
>>>     static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>     {
>>>         return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>     }
>>>
>>> It is done to track which page was modified by a guest.
>> But then (again) the name doesn't convey what the function does.
> 
> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.

For P2M type checks please don't invent new naming, but use what both x86
and Arm are already using. Note how we already have p2m_is_valid() in that
set. Just that it's not doing what you want here.

>>   Plus
>> can't a guest also arrange for an entry's type to move to p2m_invalid?
>> That's then still an entry that was modified by the guest.
> 
> I am not really sure that I fully understand the question.
> Do you ask if a guest can do something which will lead to a call of p2m_set_entry()
> with p2m_invalid argument?

That I'm not asking, but rather stating. I.e. I expect such is possible.

> If yes, then it seems like it will be done only in case of p2m_remove_mapping() what
> will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what means
> this entry isn't expected to be used anymore.

Right. But such an entry would still have been "modified" by the guest.

>> Overall I think I'm lacking clarity what you mean to use this predicate
>> for.
> 
> By using of "p2me_" predicate I wanted to express that not PTE's valid bit will be
> checked, but the type saved in radix tree will be used.
> As suggested above probably it will be better drop "e" too and just use p2m_type_is_valid().

See above regarding that name.

>>>>> +        /*
>>>>> +         * Don't take into account the MFN when removing mapping (i.e
>>>>> +         * MFN_INVALID) to calculate the correct target order.
>>>>> +         *
>>>>> +         * XXX: Support superpage mappings if nr is not aligned to a
>>>>> +         * superpage size.
>>>>> +         */
>>>> Does this really need leaving as a to-do?
>>> I think so, yes. It won’t break the current workflow if nr isn’t aligned,
>>> a smaller order will simply be chosen.
>> Well, my question was more like "Isn't it simple enough to cover the case
>> right away?"
>>
>>>>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>>>> +        mask |= gfn_x(sgfn) | nr;
>>>>> +
>>>>> +        for ( ; i != 0; i-- )
>>>>> +        {
>>>>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>>>> +            {
>>>>> +                    order = XEN_PT_LEVEL_ORDER(i);
>>>>> +                    break;
>>>> Nit: Style.
>>>>
>>>>> +            }
>>>>> +        }
>>>>> +
>>>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>>>> +        if ( rc )
>>>>> +            break;
>>>>> +
>>>>> +        sgfn = gfn_add(sgfn, (1 << order));
>>>>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>>>>> +           smfn = mfn_add(smfn, (1 << order));
>>>>> +
>>>>> +        nr -= (1 << order);
>>>> Throughout maybe better be safe right away and use 1UL?
>>>>
>>>>> +    }
>>>>> +
>>>>> +    return rc;
>>>>>    }
>>>> How's the caller going to know how much of the range was successfully
>>>> mapped?
>>> There is no such option. Do other arches do that? I mean returns somehow
>>> the number of successfully mapped (sgfn,smfn).
>> On x86 we had to introduce some not very nice code to cover for the absence
>> of proper handling there. For a new port I think it wants at least seriously
>> considering not to repeat such a potentially unhelpful pattern.
>>
>>>> That part may need undoing (if not here, then in the caller),
>>>> or a caller may want to retry.
>>> So the caller in the case if rc != 0, can just undoing the full range
>>> (by using the same sgfn, nr, smfn).
>> Can it? How would it know what the original state was?
> 
> You're right — blindly unmapping the range assumes that no entries were valid
> beforehand and I missed that it could be that something valid was mapped before
> p2m_set_entry(sgfn,...,smfn) was called.
> But then I am not really understand why it won't be an issue if will know
> how many GFNs were successfully mapped.

The caller may know what that range's state was. But what I really wanted to
convey is: Updating multiple entries in one go is complicated in some of the
corner cases. You will want to think this through now, in order to avoid the
need to re-write everything later again.
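
As one illustration of the design space (a toy model, not a proposal for the
actual interface), a range update could report how far it got through an
output parameter, so that a caller which knows the range's prior state can
retry the remainder or undo just the processed prefix. All names here
(toy_p2m, toy_set_one(), toy_set_range(), fail_at, done) are hypothetical.

```c
#include <assert.h>
#include <errno.h>

#define TOY_P2M_SIZE 16u

/* Toy P2M: index is the GFN, value is the MFN, 0 means "not mapped". */
struct toy_p2m {
    unsigned long mfn[TOY_P2M_SIZE];
    unsigned long fail_at; /* GFN at which an error is simulated */
};

static int toy_set_one(struct toy_p2m *p2m, unsigned long gfn,
                       unsigned long mfn)
{
    if ( gfn >= TOY_P2M_SIZE )
        return -EINVAL;
    if ( gfn == p2m->fail_at )
        return -ENOMEM;

    p2m->mfn[gfn] = mfn;
    return 0;
}

/*
 * Map nr pages starting at (gfn, mfn). On return, *done holds the number
 * of pages processed successfully, whether or not an error occurred.
 */
static int toy_set_range(struct toy_p2m *p2m, unsigned long gfn,
                         unsigned long nr, unsigned long mfn,
                         unsigned long *done)
{
    int rc = 0;

    for ( *done = 0; *done < nr; ++*done )
    {
        rc = toy_set_one(p2m, gfn + *done, mfn + *done);
        if ( rc )
            break;
    }

    return rc;
}
```

A caller that sees an error can then resume at gfn + done with nr - done
pages left, or limit its cleanup to the first done pages.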

Jan



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-07 12:53           ` Jan Beulich
@ 2025-07-07 15:00             ` Oleksii Kurochko
  2025-07-07 15:15               ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-07 15:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 7/7/25 2:53 PM, Jan Beulich wrote:
> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
>>>>>> RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
>>>>>> modifications.
>>>>>>
>>>>>> Key differences include:
>>>>>> - TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
>>>>>>      break-before-make (BBM). As a result, the flushing logic is simplified.
>>>>>>      TLB invalidation can be deferred until p2m_write_unlock() is called.
>>>>>>      Consequently, the p2m->need_flush flag is always considered true and is
>>>>>>      removed.
>>>>>> - Page Table Traversal: The order of walking the page tables differs from Arm,
>>>>>>      and this implementation reflects that reversed traversal.
>>>>>> - Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
>>>>>>      P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.
>>>>>>
>>>>>> The main functionality is in __p2m_set_entry(), which handles mappings aligned
>>>>>> to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
>>>>>>
>>>>>> p2m_set_entry() breaks a region down into block-aligned mappings and calls
>>>>>> __p2m_set_entry() accordingly.
>>>>>>
>>>>>> Stub implementations (to be completed later) include:
>>>>>> - p2m_free_entry()
>>>>> What would a function of this name do?
>>>> Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
>>>> put_page() (which will free if there is no any reference to this page),
>>>> freeing intermediate page table (after all entries were freed) by removing
>>>> it from d->arch.paging.freelist, and removes correspondent page of intermediate page
>>>> table from p2m->pages list.
>>>>
>>>>> You can clear entries, but you can't
>>>>> free them, can you?
>>>> Is is a question regarding terminology?
>>> Yes. If one sees a call to a function, it should be possible to at least
>>> roughly know what it does without needing to go look at the implementation.
>>>
>>>> I can't free entry itself, but a page table or
>>>> a page (if it is a leaf entry) on which it points could free.
>>> Then e.g. pte_free_subtree() or some such?
>> It sounds fine to me. I'll use suggested name.
>>
>> Just want to notice that other arches also have the same function
>> for the same purpose with the same name.
> As to x86, it's not general P2M code which uses this odd (for the purpose)
> name, but only p2m-pt.c.
>
>> Does it make sense then to change a name for all arches?
> I think so.
>
>>>>>> +{
>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>> +
>>>>>> +    return false;
>>>>>> +}
>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>> how is this going to be different from pte_is_valid()?
>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>      /*
>>>>       * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>       * the type to check whether an entry is valid.
>>>>       */
>>>>      static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>      {
>>>>          return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>      }
>>>>
>>>> It is done to track which page was modified by a guest.
>>> But then (again) the name doesn't convey what the function does.
>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
> For P2M type checks please don't invent new naming, but use what both x86
> and Arm are already using. Note how we already have p2m_is_valid() in that
> set. Just that it's not doing what you want here.

Hm, why is it not doing what I want? p2m_is_valid() verifies whether a P2M entry is valid.
And here we check whether a P2M PTE is valid from the P2M point of view by checking
the type stored in the radix tree and/or in the PTE's reserved bits (as a reminder, we
have only 2 free bits for the type).
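
To illustrate the scheme described above, here is a minimal sketch, not code
from this series. It assumes the two free bits are the RSW bits (bits 8 and 9)
of the RISC-V PTE, uses a small array as a stand-in for the radix tree, and
all rsw_* names and the type list are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type list; the real RISC-V p2m_type_t may differ. */
typedef enum {
    rsw_p2m_invalid = 0,
    rsw_p2m_ram_rw,
    rsw_p2m_mmio_direct,
    rsw_p2m_ext_storage,   /* marker: the real type lives in the side table */
    rsw_p2m_grant_map_rw,  /* example of a type that does not fit in 2 bits */
} rsw_p2m_type_t;

/* Assumption: the two software-reserved (RSW) PTE bits hold the type. */
#define RSW_SHIFT 8
#define RSW_MASK  (UINT64_C(3) << RSW_SHIFT)

/* Stand-in for the radix tree (p2m->p2m_types), indexed by GFN. */
#define RSW_SIDE_TABLE_SIZE 16
static rsw_p2m_type_t rsw_side_table[RSW_SIDE_TABLE_SIZE];

/* Store the type in the PTE, spilling to the side table if it is too big. */
static uint64_t rsw_pte_set_type(uint64_t pte, unsigned long gfn,
                                 rsw_p2m_type_t t)
{
    unsigned int in_pte = t;

    assert(gfn < RSW_SIDE_TABLE_SIZE);

    if ( t >= rsw_p2m_ext_storage )
    {
        rsw_side_table[gfn] = t;
        in_pte = rsw_p2m_ext_storage;
    }

    return (pte & ~RSW_MASK) | ((uint64_t)in_pte << RSW_SHIFT);
}

/* Read the type back, consulting the side table only when needed. */
static rsw_p2m_type_t rsw_pte_get_type(uint64_t pte, unsigned long gfn)
{
    unsigned int in_pte = (pte & RSW_MASK) >> RSW_SHIFT;

    assert(gfn < RSW_SIDE_TABLE_SIZE);

    return in_pte == rsw_p2m_ext_storage ? rsw_side_table[gfn]
                                         : (rsw_p2m_type_t)in_pte;
}

/* The check discussed in this thread: "type != invalid". */
static int rsw_entry_type_is_valid(uint64_t pte, unsigned long gfn)
{
    return rsw_pte_get_type(pte, gfn) != rsw_p2m_invalid;
}
```

With this layout the common types cost no extra lookup, and only the
overflow types need the side table.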

>
>>>    Plus
>>> can't a guest also arrange for an entry's type to move to p2m_invalid?
>>> That's then still an entry that was modified by the guest.
>> I am not really sure that I fully understand the question.
>> Do you ask if a guest can do something which will lead to a call of p2m_set_entry()
>> with p2m_invalid argument?
> That I'm not asking, but rather stating. I.e. I expect such is possible.
>
>> If yes, then it seems like it will be done only in case of p2m_remove_mapping() what
>> will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what means
>> this entry isn't expected to be used anymore.
> Right. But such an entry would still have been "modified" by the guest.

Yes, but then nothing needs to be done with it. For example, if it is already invalid, there
is no point in flushing the page to RAM (as in this case the PTE's bit will be checked),
similar to what Arm does:
   https://elixir.bootlin.com/xen/v4.20.0/source/xen/arch/arm/p2m.c#L375

>>>>>> +        /*
>>>>>> +         * Don't take into account the MFN when removing mapping (i.e
>>>>>> +         * MFN_INVALID) to calculate the correct target order.
>>>>>> +         *
>>>>>> +         * XXX: Support superpage mappings if nr is not aligned to a
>>>>>> +         * superpage size.
>>>>>> +         */
>>>>> Does this really need leaving as a to-do?
>>>> I think so, yes. It won’t break the current workflow if nr isn’t aligned,
>>>> a smaller order will simply be chosen.
>>> Well, my question was more like "Isn't it simple enough to cover the case
>>> right away?"
>>>
>>>>>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>>>>> +        mask |= gfn_x(sgfn) | nr;
>>>>>> +
>>>>>> +        for ( ; i != 0; i-- )
>>>>>> +        {
>>>>>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>>>>> +            {
>>>>>> +                    order = XEN_PT_LEVEL_ORDER(i);
>>>>>> +                    break;
>>>>> Nit: Style.
>>>>>
>>>>>> +            }
>>>>>> +        }
>>>>>> +
>>>>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>>>>> +        if ( rc )
>>>>>> +            break;
>>>>>> +
>>>>>> +        sgfn = gfn_add(sgfn, (1 << order));
>>>>>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>>>>>> +           smfn = mfn_add(smfn, (1 << order));
>>>>>> +
>>>>>> +        nr -= (1 << order);
>>>>> Throughout maybe better be safe right away and use 1UL?
>>>>>
>>>>>> +    }
>>>>>> +
>>>>>> +    return rc;
>>>>>>     }
>>>>> How's the caller going to know how much of the range was successfully
>>>>> mapped?
>>>> There is no such option. Do other arches do that? I mean returns somehow
>>>> the number of successfully mapped (sgfn,smfn).
>>> On x86 we had to introduce some not very nice code to cover for the absence
>>> of proper handling there. For a new port I think it wants at least seriously
>>> considering not to repeat such a potentially unhelpful pattern.
>>>
>>>>> That part may need undoing (if not here, then in the caller),
>>>>> or a caller may want to retry.
>>>> So the caller in the case if rc != 0, can just undoing the full range
>>>> (by using the same sgfn, nr, smfn).
>>> Can it? How would it know what the original state was?
>> You're right — blindly unmapping the range assumes that no entries were valid
>> beforehand and I missed that it could be that something valid was mapped before
>> p2m_set_entry(sgfn,...,smfn) was called.
>> But then I am not really understand why it won't be an issue if will know
>> how many GFNs were successfully mapped.
> The caller may know what that range's state was. But what I really wanted to
> convey is: Updating multiple entries in one go is complicated in some of the
> corner cases. You will want to think this through now, in order to avoid the
> need to re-write everything later again.

I can add one more argument to return the number of successfully mapped GFNs.
Fortunately, that's very easy to do.
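
As a rough sketch of what such an extra argument could look like (this is a
toy model, not the real p2m_set_entry(): the p2m is a flat array, and every
toy_* name and the injected failure are assumptions made for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_P2M_SIZE    64
#define TOY_INVALID_MFN (~0UL)

/* Toy P2M: one MFN slot per GFN, TOY_INVALID_MFN meaning "not mapped". */
struct toy_p2m {
    unsigned long mfn[TOY_P2M_SIZE];
    unsigned long fail_at;           /* GFN at which to inject -ENOMEM */
};

static void toy_p2m_init(struct toy_p2m *p2m)
{
    unsigned long i;

    assert(p2m != NULL);

    for ( i = 0; i < TOY_P2M_SIZE; i++ )
        p2m->mfn[i] = TOY_INVALID_MFN;

    p2m->fail_at = TOY_P2M_SIZE;     /* no injected failure by default */
}

/* Stand-in for __p2m_set_entry() operating on a single 4K entry. */
static int toy_set_one(struct toy_p2m *p2m, unsigned long gfn,
                       unsigned long mfn)
{
    if ( gfn >= TOY_P2M_SIZE )
        return -EINVAL;
    if ( gfn == p2m->fail_at )
        return -ENOMEM;

    p2m->mfn[gfn] = mfn;

    return 0;
}

/*
 * Map nr entries starting at sgfn. On return, *done (if not NULL) holds
 * the number of entries that were updated before success or failure.
 */
static int toy_set_range(struct toy_p2m *p2m, unsigned long sgfn,
                         unsigned long nr, unsigned long smfn,
                         unsigned long *done)
{
    unsigned long i;
    int rc = 0;

    for ( i = 0; i < nr; i++ )
    {
        rc = toy_set_one(p2m, sgfn + i, smfn + i);
        if ( rc )
            break;
    }

    if ( done )
        *done = i;

    return rc;
}
```

A caller that gets rc != 0 then knows that exactly [sgfn, sgfn + *done) was
modified, which is the minimum it needs for either retrying or undoing.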

The problem for me is that I don't really understand what the caller is supposed
to do with that information. The only use case I can think of is that the caller
might try to map the remaining GFNs again. But that doesn't seem very useful:
if p2m_set_entry() wasn't able to map the full range, it likely indicates a serious
issue, and retrying would probably result in the same error.

The same applies to rolling back the state. It wouldn't be difficult to add a local
array to track all modified PTEs and then use it to revert the state if needed.
But again, what would the caller do after the rollback? At this point, it still seems
like the best option is simply to panic().

Basically, I don't see or understand the cases where knowing how many GFNs were
successfully mapped, or whether a rollback was performed, would really help, because
in most cases, I don't have a better option than just calling panic() at the end.

For example, if I call map_regions_p2mt() for an MMIO region described in a device
tree node, and the mapping fails partway through, I'm left with two options: either
ignore the device (if it's not essential for Xen or guest functionality) and continue
booting, in which case I'd need to perform a rollback, and simply knowing the number
of successfully mapped GFNs may not be enough; or, more likely, just panic.

Are there any realistic use cases where knowing the number of mapped GFNs or having
rollback support would actually allow us to avoid a panic?

Even more so, how would that information be used in the current call chain?
We have the following chain:
  map_regions_p2mt() → p2m_insert() → p2m_set_entry()

If p2m_set_entry() returns the number of successfully mapped GFNs, what should
p2m_insert() do with it: process it further, or just pass it along to
map_regions_p2mt()?

Thanks in advance for clarifications.
~ Oleksii




* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-07 15:00             ` Oleksii Kurochko
@ 2025-07-07 15:15               ` Jan Beulich
  2025-07-07 16:10                 ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-07 15:15 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 07.07.2025 17:00, Oleksii Kurochko wrote:
> On 7/7/25 2:53 PM, Jan Beulich wrote:
>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> +{
>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>> +
>>>>>>> +    return false;
>>>>>>> +}
>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>> how is this going to be different from pte_is_valid()?
>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>      /*
>>>>>       * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>       * the type to check whether an entry is valid.
>>>>>       */
>>>>>      static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>      {
>>>>>          return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>      }
>>>>>
>>>>> It is done to track which page was modified by a guest.
>>>> But then (again) the name doesn't convey what the function does.
>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>> For P2M type checks please don't invent new naming, but use what both x86
>> and Arm are already using. Note how we already have p2m_is_valid() in that
>> set. Just that it's not doing what you want here.
> 
> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
> And in here it is checked if P2M pte is valid from P2M point of view by checking
> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
> free bits for type).

Because this is how it's defined on x86:

#define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
                             (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))

I.e. more strict than simply "!= p2m_invalid". And I think such predicates
would better be uniform across architectures, such that in principle they
might also be usable in common code (as we already do with p2m_is_foreign()).
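
To make the difference concrete, here is a small self-contained sketch. The
p2m_is_valid() macro is the one quoted above; the type list is a cut-down,
illustrative subset with made-up numbering, and p2m_is_mapped() and
predicates_differ() are hypothetical helpers, not existing Xen functions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of P2M types; values are not the real x86 ones. */
typedef enum {
    p2m_invalid,
    p2m_ram_rw,
    p2m_ram_ro,
    p2m_mmio_direct,
    p2m_grant_map_rw,
    p2m_type_count,    /* sentinel, not a real type */
} p2m_type_t;

#define p2m_to_mask(_t)  (1UL << (_t))

/* Assumption: only two RAM types exist in this toy model. */
#define P2M_RAM_TYPES    (p2m_to_mask(p2m_ram_rw) | p2m_to_mask(p2m_ram_ro))

/* The x86-style predicate, as quoted above. */
#define p2m_is_valid(_t) (p2m_to_mask(_t) & \
                          (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))

/* Hypothetical looser predicate: anything except p2m_invalid. */
static bool p2m_is_mapped(p2m_type_t t)
{
    return t != p2m_invalid;
}

/* Returns true when the two predicates disagree for type t. */
static bool predicates_differ(p2m_type_t t)
{
    assert(t < p2m_type_count);

    return !!p2m_is_valid(t) != p2m_is_mapped(t);
}
```

In this model the two predicates agree for RAM, MMIO and invalid entries, and
disagree for the grant type, which is why reusing the p2m_is_valid() name for
a plain "!= p2m_invalid" check would be confusing in common code.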

>>>>    Plus
>>>> can't a guest also arrange for an entry's type to move to p2m_invalid?
>>>> That's then still an entry that was modified by the guest.
>>> I am not really sure that I fully understand the question.
>>> Do you ask if a guest can do something which will lead to a call of p2m_set_entry()
>>> with p2m_invalid argument?
>> That I'm not asking, but rather stating. I.e. I expect such is possible.
>>
>>> If yes, then it seems like it will be done only in case of p2m_remove_mapping() what
>>> will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what means
>>> this entry isn't expected to be used anymore.
>> Right. But such an entry would still have been "modified" by the guest.
> 
> Yes, but nothing then is needed to do with it.

I understand that. Maybe I'm overly picky, but all of the above was in response
to you saying "It is done to track which page was modified by a guest." And I'm
simply trying to get you to use precise wording, both in code comments and in
discussions. In a case like the one here I simply can't judge whether you simply
didn't express yourself clearly enough, or whether you indeed meant what you said.

>>>>>>> +        /*
>>>>>>> +         * Don't take into account the MFN when removing mapping (i.e
>>>>>>> +         * MFN_INVALID) to calculate the correct target order.
>>>>>>> +         *
>>>>>>> +         * XXX: Support superpage mappings if nr is not aligned to a
>>>>>>> +         * superpage size.
>>>>>>> +         */
>>>>>> Does this really need leaving as a to-do?
>>>>> I think so, yes. It won’t break the current workflow if nr isn’t aligned,
>>>>> a smaller order will simply be chosen.
>>>> Well, my question was more like "Isn't it simple enough to cover the case
>>>> right away?"
>>>>
>>>>>>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>>>>>> +        mask |= gfn_x(sgfn) | nr;
>>>>>>> +
>>>>>>> +        for ( ; i != 0; i-- )
>>>>>>> +        {
>>>>>>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>>>>>> +            {
>>>>>>> +                    order = XEN_PT_LEVEL_ORDER(i);
>>>>>>> +                    break;
>>>>>> Nit: Style.
>>>>>>
>>>>>>> +            }
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>>>>>> +        if ( rc )
>>>>>>> +            break;
>>>>>>> +
>>>>>>> +        sgfn = gfn_add(sgfn, (1 << order));
>>>>>>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>>>>>>> +           smfn = mfn_add(smfn, (1 << order));
>>>>>>> +
>>>>>>> +        nr -= (1 << order);
>>>>>> Throughout maybe better be safe right away and use 1UL?
>>>>>>
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return rc;
>>>>>>>     }
>>>>>> How's the caller going to know how much of the range was successfully
>>>>>> mapped?
>>>>> There is no such option. Do other arches do that? I mean returns somehow
>>>>> the number of successfully mapped (sgfn,smfn).
>>>> On x86 we had to introduce some not very nice code to cover for the absence
>>>> of proper handling there. For a new port I think it wants at least seriously
>>>> considering not to repeat such a potentially unhelpful pattern.
>>>>
>>>>>> That part may need undoing (if not here, then in the caller),
>>>>>> or a caller may want to retry.
>>>>> So the caller in the case if rc != 0, can just undoing the full range
>>>>> (by using the same sgfn, nr, smfn).
>>>> Can it? How would it know what the original state was?
>>> You're right — blindly unmapping the range assumes that no entries were valid
>>> beforehand and I missed that it could be that something valid was mapped before
>>> p2m_set_entry(sgfn,...,smfn) was called.
>>> But then I am not really understand why it won't be an issue if will know
>>> how many GFNs were successfully mapped.
>> The caller may know what that range's state was. But what I really wanted to
>> convey is: Updating multiple entries in one go is complicated in some of the
>> corner cases. You will want to think this through now, in order to avoid the
>> need to re-write everything later again.
> 
> I can add one more argument to return the number of successfully mapped GFNs.
> Fortunately, that's very easy to do.
> 
> The problem for me is that I don’t really understand what the caller is supposed
> to do with that information.

That's only the 2nd step to take. The first is: What behavior do you want, overall?

> The only use case I can think of is that the caller
> might try to map the remaining GFNs again. But that doesn’t seem very useful,
> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
> issue, and retrying would probably result in the same error.
>
> The same applies to rolling back the state. It wouldn’t be difficult to add a local
> array to track all modified PTEs and then use it to revert the state if needed.
> But again, what would the caller do after the rollback? At this point, it still seems
> like the best option is simply to panic().
>
> Basically, I don’t see or understand the cases where knowing how many GFNs were
> successfully mapped, or whether a rollback was performed, would really help, because
> in most cases, I don’t have a better option than just calling panic() at the end.

panic()-ing is of course only a last resort. Anything related to domain handling
would better crash only the domain in question. And even that only if suitable
error handling isn't possible.

> For example, if I call map_regions_p2mt() for an MMIO region described in a device
> tree node, and the mapping fails partway through, I’m left with two options: either
> ignore the device (if it's not essential for Xen or guest functionality) and continue
> booting; in which case I’d need to perform a rollback, and simply knowing the number
> of successfully mapped GFNs may not be enough or, more likely, just panic.

Well, no. For example, before even trying to map you could check that the range
of P2M entries covered is all empty. _Then_ you know how to correctly roll back.
And yes, doing so may not even require passing back information on how much of
a region was successfully mapped.
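
A minimal sketch of that approach, on a toy model rather than real Xen code
(the P2M is a flat array, the failure is injected, and all demo_* names are
hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_P2M_SIZE    64
#define DEMO_INVALID_MFN (~0UL)

/* Toy P2M: one MFN slot per GFN, DEMO_INVALID_MFN meaning "empty". */
struct demo_p2m {
    unsigned long mfn[DEMO_P2M_SIZE];
    unsigned long fail_at;            /* GFN at which to inject -ENOMEM */
};

static void demo_p2m_init(struct demo_p2m *p2m)
{
    unsigned long i;

    assert(p2m != NULL);

    for ( i = 0; i < DEMO_P2M_SIZE; i++ )
        p2m->mfn[i] = DEMO_INVALID_MFN;

    p2m->fail_at = DEMO_P2M_SIZE;     /* no injected failure by default */
}

/* Step 1: verify that every entry in the range is currently empty. */
static bool demo_range_is_empty(const struct demo_p2m *p2m,
                                unsigned long sgfn, unsigned long nr)
{
    unsigned long i;

    if ( sgfn >= DEMO_P2M_SIZE || nr > DEMO_P2M_SIZE - sgfn )
        return false;

    for ( i = 0; i < nr; i++ )
        if ( p2m->mfn[sgfn + i] != DEMO_INVALID_MFN )
            return false;

    return true;
}

/* Step 2: map; on failure, undo by clearing what was written so far. */
static int demo_map_range(struct demo_p2m *p2m, unsigned long sgfn,
                          unsigned long nr, unsigned long smfn)
{
    unsigned long i;

    if ( !demo_range_is_empty(p2m, sgfn, nr) )
        return -EBUSY;

    for ( i = 0; i < nr; i++ )
    {
        if ( sgfn + i == p2m->fail_at )
        {
            /* The range was empty before, so clearing restores it. */
            while ( i-- > 0 )
                p2m->mfn[sgfn + i] = DEMO_INVALID_MFN;

            return -ENOMEM;
        }

        p2m->mfn[sgfn + i] = smfn + i;
    }

    return 0;
}
```

Because the range is known to have been empty, the rollback needs no saved
state; clearing the whole requested range would be equally correct here. In
real code the check and the update would have to happen under the same P2M
lock, which this sketch leaves out.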

Jan



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-07 15:15               ` Jan Beulich
@ 2025-07-07 16:10                 ` Oleksii Kurochko
  2025-07-08  7:10                   ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-07 16:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/7/25 5:15 PM, Jan Beulich wrote:
> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>> +{
>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>> +
>>>>>>>> +    return false;
>>>>>>>> +}
>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>       /*
>>>>>>        * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>        * the type to check whether an entry is valid.
>>>>>>        */
>>>>>>       static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>       {
>>>>>>           return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>       }
>>>>>>
>>>>>> It is done to track which page was modified by a guest.
>>>>> But then (again) the name doesn't convey what the function does.
>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>> For P2M type checks please don't invent new naming, but use what both x86
>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>> set. Just that it's not doing what you want here.
>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>> free bits for type).
> Because this is how it's defined on x86:
>
> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>                               (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>
> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
> would better be uniform across architectures, such that in principle they
> might also be usable in common code (as we already do with p2m_is_foreign()).

Yeah, Arm isn't as strict in its definition of p2m_is_valid(), and it seems like
x86 and Arm have a different understanding of what is valid.

Apart from what is mentioned in the comment, that grant types aren't considered valid
for x86 (and shouldn't the same then hold for Arm?), it isn't clear why x86's
p2m_is_valid() is stricter than Arm's and whether other arches should also be
so strict.
It seems that, from the point of view of mapping/unmapping, it is enough just
to verify a "copy" of the PTE's valid bit (in terms of P2M, that is the p2m_invalid type).

>
>>>>>     Plus
>>>>> can't a guest also arrange for an entry's type to move to p2m_invalid?
>>>>> That's then still an entry that was modified by the guest.
>>>> I am not really sure that I fully understand the question.
>>>> Do you ask if a guest can do something which will lead to a call of p2m_set_entry()
>>>> with p2m_invalid argument?
>>> That I'm not asking, but rather stating. I.e. I expect such is possible.
>>>
>>>> If yes, then it seems like it will be done only in case of p2m_remove_mapping() what
>>>> will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what means
>>>> this entry isn't expected to be used anymore.
>>> Right. But such an entry would still have been "modified" by the guest.
>> Yes, but nothing then is needed to do with it.
> I understand that. Maybe I'm overly picky, but all of the above was in response
> to you saying "It is done to track which page was modified by a guest." And I'm
> simply trying to get you to use precise wording, both in code comments and in
> discussions. In a case like the one here I simply can't judge whether you simply
> expressed yourself not clear enough, or whether you indeed meant what you said.
>
>>>>>>>> +        /*
>>>>>>>> +         * Don't take into account the MFN when removing mapping (i.e
>>>>>>>> +         * MFN_INVALID) to calculate the correct target order.
>>>>>>>> +         *
>>>>>>>> +         * XXX: Support superpage mappings if nr is not aligned to a
>>>>>>>> +         * superpage size.
>>>>>>>> +         */
>>>>>>> Does this really need leaving as a to-do?
>>>>>> I think so, yes. It won’t break the current workflow if nr isn’t aligned,
>>>>>> a smaller order will simply be chosen.
>>>>> Well, my question was more like "Isn't it simple enough to cover the case
>>>>> right away?"
>>>>>
>>>>>>>> +        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
>>>>>>>> +        mask |= gfn_x(sgfn) | nr;
>>>>>>>> +
>>>>>>>> +        for ( ; i != 0; i-- )
>>>>>>>> +        {
>>>>>>>> +            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
>>>>>>>> +            {
>>>>>>>> +                    order = XEN_PT_LEVEL_ORDER(i);
>>>>>>>> +                    break;
>>>>>>> Nit: Style.
>>>>>>>
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
>>>>>>>> +        if ( rc )
>>>>>>>> +            break;
>>>>>>>> +
>>>>>>>> +        sgfn = gfn_add(sgfn, (1 << order));
>>>>>>>> +        if ( !mfn_eq(smfn, INVALID_MFN) )
>>>>>>>> +           smfn = mfn_add(smfn, (1 << order));
>>>>>>>> +
>>>>>>>> +        nr -= (1 << order);
>>>>>>> Throughout maybe better be safe right away and use 1UL?
>>>>>>>
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return rc;
>>>>>>>>      }
>>>>>>> How's the caller going to know how much of the range was successfully
>>>>>>> mapped?
>>>>>> There is no such option. Do other arches do that? I mean returns somehow
>>>>>> the number of successfully mapped (sgfn,smfn).
>>>>> On x86 we had to introduce some not very nice code to cover for the absence
>>>>> of proper handling there. For a new port I think it wants at least seriously
>>>>> considering not to repeat such a potentially unhelpful pattern.
>>>>>
>>>>>>> That part may need undoing (if not here, then in the caller),
>>>>>>> or a caller may want to retry.
>>>>>> So the caller in the case if rc != 0, can just undoing the full range
>>>>>> (by using the same sgfn, nr, smfn).
>>>>> Can it? How would it know what the original state was?
>>>> You're right — blindly unmapping the range assumes that no entries were valid
>>>> beforehand and I missed that it could be that something valid was mapped before
>>>> p2m_set_entry(sgfn,...,smfn) was called.
>>>> But then I am not really understand why it won't be an issue if will know
>>>> how many GFNs were successfully mapped.
>>> The caller may know what that range's state was. But what I really wanted to
>>> convey is: Updating multiple entries in one go is complicated in some of the
>>> corner cases. You will want to think this through now, in order to avoid the
>>> need to re-write everything later again.
>> I can add one more argument to return the number of successfully mapped GFNs.
>> Fortunately, that's very easy to do.
>>
>> The problem for me is that I don’t really understand what the caller is supposed
>> to do with that information.
> That's only the 2nd step to take. The first is: What behavior do you want, overall?

My initial idea was that if something went wrong ( rc != 0 ) then just panic(). But
based on your questions it seems like that isn't the best idea.

>
>> The only use case I can think of is that the caller
>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>> issue, and retrying would probably result in the same error.
>>
>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>> array to track all modified PTEs and then use it to revert the state if needed.
>> But again, what would the caller do after the rollback? At this point, it still seems
>> like the best option is simply to panic().
>>
>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>> successfully mapped, or whether a rollback was performed, would really help, because
>> in most cases, I don’t have a better option than just calling panic() at the end.
> panic()-ing is of course only a last resort. Anything related to domain handling
> would better crash only the domain in question. And even that only if suitable
> error handling isn't possible.

And what if there is still no runnable domain available, for example, when we are creating
a domain and some p2m mapping is requested? Will it be enough to simply not boot this domain?
If yes, then it is enough to return only an error code, without returning how many GFNs were
mapped or rolling back, as the domain won't be run anyway.
(Just to mention, I am not trying to convince you that rollback or returning the number
of GFNs isn't necessary; I am just trying to understand what the best way is to handle
the not-fully-mapped mappings you mentioned.)

>
>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>> tree node, and the mapping fails partway through, I’m left with two options: either
>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>> booting; in which case I’d need to perform a rollback, and simply knowing the number
>> of successfully mapped GFNs may not be enough or, more likely, just panic.
> Well, no. For example, before even trying to map you could check that the range
> of P2M entries covered is all empty.

Could it be that they aren't all empty? Then it seems like we have an overlap and we can't
just do the mapping, right?
Won't this procedure consume a lot of time, as it needs to walk the page
tables for each entry?


>   _Then_ you know how to correctly roll back.
> And yes, doing so may not even require passing back information on how much of
> a region was successfully mapped.

If the P2M entries were empty before the start of the mapping, then it is enough to just go
through the same range (sgfn,nr,smfn) and clear them, right?

Thanks.

~ Oleksii




* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-07 16:10                 ` Oleksii Kurochko
@ 2025-07-08  7:10                   ` Jan Beulich
  2025-07-08  9:01                     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-08  7:10 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 07.07.2025 18:10, Oleksii Kurochko wrote:
> On 7/7/25 5:15 PM, Jan Beulich wrote:
>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>> +{
>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>> +
>>>>>>>>> +    return false;
>>>>>>>>> +}
>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>       /*
>>>>>>>        * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>        * the type to check whether an entry is valid.
>>>>>>>        */
>>>>>>>       static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>       {
>>>>>>>           return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>       }
>>>>>>>
>>>>>>> It is done to track which page was modified by a guest.
>>>>>> But then (again) the name doesn't convey what the function does.
>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>> set. Just that it's not doing what you want here.
>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>> free bits for type).
>> Because this is how it's defined on x86:
>>
>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>                               (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>
>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>> would better be uniform across architectures, such that in principle they
>> might also be usable in common code (as we already do with p2m_is_foreign()).
> 
> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
> x86 and Arm have different understanding what is valid.
> 
> Except what mentioned in the comment that grant types aren't considered valid
> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
> p2m_is_valid() is stricter then Arm's one and if other arches should be also
> so strict.

Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
could also consider x86'es to require a better name). It's a local helper, not
a P2M type checking predicate. With that in mind, you may of course follow
Arm's model, but in the longer run we may need to do something about the name
collision then.
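
To make the difference between the two meanings of "valid" concrete, here is
a minimal sketch. The enum, the toy_ prefix and the set of types are
assumptions made for illustration only; they are not Xen's real definitions.
Only the mask-based shape follows the x86 macro quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced set of P2M types (not Xen's real enum). */
enum toy_p2m_type {
    toy_p2m_invalid = 0,
    toy_p2m_ram_rw,
    toy_p2m_ram_ro,
    toy_p2m_mmio_direct,
    toy_p2m_grant_map_rw,
    toy_p2m_nr_types
};

/* One bit per type, so that groups of types can be tested with one AND. */
#define toy_p2m_to_mask(t) (1UL << (t))

#define TOY_P2M_RAM_TYPES (toy_p2m_to_mask(toy_p2m_ram_rw) | \
                           toy_p2m_to_mask(toy_p2m_ram_ro))

/* x86-style predicate: only RAM types and direct MMIO count as valid. */
static bool toy_is_valid_strict(enum toy_p2m_type t)
{
    assert(t < toy_p2m_nr_types);

    return (toy_p2m_to_mask(t) &
            (TOY_P2M_RAM_TYPES | toy_p2m_to_mask(toy_p2m_mmio_direct))) != 0;
}

/* The looser check discussed for RISC-V: anything but "invalid". */
static bool toy_is_not_invalid(enum toy_p2m_type t)
{
    return t != toy_p2m_invalid;
}
```

In this sketch a grant mapping passes the looser check but not the strict
one, which is the mismatch a shared name would hide.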

>>> The only use case I can think of is that the caller
>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>> issue, and retrying would probably result in the same error.
>>>
>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>> array to track all modified PTEs and then use it to revert the state if needed.
>>> But again, what would the caller do after the rollback? At this point, it still seems
>>> like the best option is simply to panic().
>>>
>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>> successfully mapped, or whether a rollback was performed, would really help — because
>>> in most cases, I don’t have a better option than just calling panic() at the end.
>> panic()-ing is of course only a last resort. Anything related to domain handling
>> would better crash only the domain in question. And even that only if suitable
>> error handling isn't possible.
> 
> And if there is no still any runnable domain available, for example, we are creating
> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
> If yes, then it is enough to return only error code without returning how many GFNs were
> mapped or rollbacking as domain won't be ran anyway.

During domain creation all you need to do is return an error. But when you write a
generic function that's also (going to be) used at domain runtime, you need to
consider what to do there in case of partial success.

>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>    booting; in which case I’d need to perform a rollback, and simply knowing the number
>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>> Well, no. For example, before even trying to map you could check that the range
>> of P2M entries covered is all empty.
> 
> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
> just do a mapping, right?

Possibly that would simply mean to return an error, yes.

> Won't be this procedure consume a lot of time as it is needed to go through each page
> tables for each entry.

Well, you're free to suggest a clean alternative without doing so.

>>   _Then_ you know how to correctly roll back.
>> And yes, doing so may not even require passing back information on how much of
>> a region was successfully mapped.
> 
> If P2M entries were empty before start of the mapping then it is enough to just go
> through the same range (sgfn,nr,smfn) and just clean them, right?

Yes, what else would "roll back" mean in that case?
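
A minimal sketch of that "check first, then map, then clear on failure"
sequence follows. It models the P2M as a flat array with one slot per GFN,
which is an assumption made purely for illustration; the toy_ names, the
TOY_EMPTY encoding and the fail_at test hook are all hypothetical and do
not correspond to the real page-table walk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_GFNS 16
#define TOY_EMPTY   0UL /* hypothetical encoding of "no mapping" */

/* Toy stand-in for a P2M: one slot per GFN, holding an MFN or TOY_EMPTY. */
struct toy_p2m {
    unsigned long slot[TOY_NR_GFNS];
    size_t fail_at; /* test hook: GFN at which an insertion fails */
};

static bool toy_range_is_empty(const struct toy_p2m *p2m, size_t sgfn,
                               size_t nr)
{
    for ( size_t i = 0; i < nr; i++ )
        if ( p2m->slot[sgfn + i] != TOY_EMPTY )
            return false;

    return true;
}

static int toy_set_one(struct toy_p2m *p2m, size_t gfn, unsigned long mfn)
{
    assert(mfn != TOY_EMPTY); /* MFN 0 is reserved as "empty" in this toy */

    if ( gfn == p2m->fail_at )
        return -ENOMEM; /* simulated failure, e.g. no memory for a table */

    p2m->slot[gfn] = mfn;

    return 0;
}

static int toy_map_range(struct toy_p2m *p2m, size_t sgfn, size_t nr,
                         unsigned long smfn)
{
    if ( sgfn > TOY_NR_GFNS || nr > TOY_NR_GFNS - sgfn )
        return -EINVAL;

    /* Refuse to touch anything if the range overlaps an existing mapping. */
    if ( !toy_range_is_empty(p2m, sgfn, nr) )
        return -EBUSY;

    for ( size_t i = 0; i < nr; i++ )
    {
        int rc = toy_set_one(p2m, sgfn + i, smfn + i);

        if ( rc )
        {
            /* Everything was empty before, so rolling back means clearing. */
            while ( i-- > 0 )
                p2m->slot[sgfn + i] = TOY_EMPTY;

            return rc;
        }
    }

    return 0;
}
```

Because the range is known to be empty up front, the rollback needs neither
a count from the callee nor a saved copy of the old entries.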

Jan



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-08  7:10                   ` Jan Beulich
@ 2025-07-08  9:01                     ` Oleksii Kurochko
  2025-07-08 10:37                       ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-08  9:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/8/25 9:10 AM, Jan Beulich wrote:
> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>> +{
>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>> +
>>>>>>>>>> +    return false;
>>>>>>>>>> +}
>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>        /*
>>>>>>>>         * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>         * the type to check whether an entry is valid.
>>>>>>>>         */
>>>>>>>>        static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>        {
>>>>>>>>            return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>        }
>>>>>>>>
>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>> set. Just that it's not doing what you want here.
>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>> free bits for type).
>>> Because this is how it's defined on x86:
>>>
>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>                                (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>
>>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>>> would better be uniform across architectures, such that in principle they
>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>> x86 and Arm have different understanding what is valid.
>>
>> Except what mentioned in the comment that grant types aren't considered valid
>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>> p2m_is_valid() is stricter then Arm's one and if other arches should be also
>> so strict.
> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
> could also consider x86'es to require a better name). It's a local helper, not
> a P2M type checking predicate. With that in mind, you may of course follow
> Arm's model, but in the longer run we may need to do something about the name
> collision then.
>
>>>> The only use case I can think of is that the caller
>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>> issue, and retrying would probably result in the same error.
>>>>
>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>> like the best option is simply to panic().
>>>>
>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>> would better crash only the domain in question. And even that only if suitable
>>> error handling isn't possible.
>> And if there is no still any runnable domain available, for example, we are creating
>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>> If yes, then it is enough to return only error code without returning how many GFNs were
>> mapped or rollbacking as domain won't be ran anyway.
> During domain creation all you need to do is return an error. But when you write a
> generic function that's also (going to be) used at domain runtime, you need to
> consider what to do there in case of partial success.
>
>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>     booting; in which case I’d need to perform a rollback, and simply knowing the number
>>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>>> Well, no. For example, before even trying to map you could check that the range
>>> of P2M entries covered is all empty.
>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>> just do a mapping, right?
> Possibly that would simply mean to return an error, yes.
>
>> Won't be this procedure consume a lot of time as it is needed to go through each page
>> tables for each entry.
> Well, you're free to suggest a clean alternative without doing so.

I thought about dynamically allocating an array in p2m_set_entry() in which to save all changed PTEs,
and then using it to roll back if __p2m_set_entry() returns rc != 0 ...

>
>>>    _Then_ you know how to correctly roll back.
>>> And yes, doing so may not even require passing back information on how much of
>>> a region was successfully mapped.
>> If P2M entries were empty before start of the mapping then it is enough to just go
>> through the same range (sgfn,nr,smfn) and just clean them, right?
> Yes, what else would "roll back" mean in that case?

... If we know that the P2M entries were empty, then nothing else needs to be done beyond
cleaning the PTEs.
However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
case), then rolling back would mean restoring their original state, the state they
had before the P2M mapping procedure started.
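
As a rough sketch of the array-based idea, assuming a flat one-slot-per-GFN
model rather than real page tables (the toy_ names and the fail_at test
hook are hypothetical, and malloc()/free() stand in for whatever allocator
the hypervisor would use):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NR_GFNS 16

/* Toy stand-in for a P2M: one slot per GFN. */
struct toy_p2m {
    unsigned long slot[TOY_NR_GFNS];
    size_t fail_at; /* test hook: GFN at which an update fails */
};

static int toy_set_range_journaled(struct toy_p2m *p2m, size_t sgfn,
                                   size_t nr, unsigned long smfn)
{
    unsigned long *saved;

    assert(p2m != NULL);

    if ( sgfn > TOY_NR_GFNS || nr > TOY_NR_GFNS - sgfn )
        return -EINVAL;
    if ( nr == 0 )
        return 0;

    /* The journal itself can fail to allocate and grows with the range. */
    saved = malloc(nr * sizeof(*saved));
    if ( !saved )
        return -ENOMEM;

    /* Save the original state of every slot about to be overwritten. */
    memcpy(saved, &p2m->slot[sgfn], nr * sizeof(*saved));

    for ( size_t i = 0; i < nr; i++ )
    {
        if ( sgfn + i == p2m->fail_at )
        {
            /* Restore only the i slots that were actually modified. */
            memcpy(&p2m->slot[sgfn], saved, i * sizeof(*saved));
            free(saved);
            return -ENOMEM;
        }

        p2m->slot[sgfn + i] = smfn + i;
    }

    free(saved);

    return 0;
}
```

This restores pre-existing entries as well, but the toy model has no
intermediate page tables, so it says nothing about what must be saved when
replacing an entry frees memory.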

~ Oleksii



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-08  9:01                     ` Oleksii Kurochko
@ 2025-07-08 10:37                       ` Oleksii Kurochko
  2025-07-08 12:45                         ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-08 10:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/8/25 11:01 AM, Oleksii Kurochko wrote:
>
>
> On 7/8/25 9:10 AM, Jan Beulich wrote:
>> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>> +{
>>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>>> +
>>>>>>>>>>> +    return false;
>>>>>>>>>>> +}
>>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>>        /*
>>>>>>>>>         * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>>         * the type to check whether an entry is valid.
>>>>>>>>>         */
>>>>>>>>>        static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>>        {
>>>>>>>>>            return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>>        }
>>>>>>>>>
>>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>>> set. Just that it's not doing what you want here.
>>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>>> free bits for type).
>>>> Because this is how it's defined on x86:
>>>>
>>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>>                                (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>>
>>>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>>>> would better be uniform across architectures, such that in principle they
>>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>>> x86 and Arm have different understanding what is valid.
>>>
>>> Except what mentioned in the comment that grant types aren't considered valid
>>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>>> p2m_is_valid() is stricter then Arm's one and if other arches should be also
>>> so strict.
>> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
>> could also consider x86'es to require a better name). It's a local helper, not
>> a P2M type checking predicate. With that in mind, you may of course follow
>> Arm's model, but in the longer run we may need to do something about the name
>> collision then.
>>
>>>>> The only use case I can think of is that the caller
>>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>>> issue, and retrying would probably result in the same error.
>>>>>
>>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>>> like the best option is simply to panic().
>>>>>
>>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>>> would better crash only the domain in question. And even that only if suitable
>>>> error handling isn't possible.
>>> And if there is no still any runnable domain available, for example, we are creating
>>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>>> If yes, then it is enough to return only error code without returning how many GFNs were
>>> mapped or rollbacking as domain won't be ran anyway.
>> During domain creation all you need to do is return an error. But when you write a
>> generic function that's also (going to be) used at domain runtime, you need to
>> consider what to do there in case of partial success.
>>
>>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>>     booting; in which case I’d need to perform a rollback, and simply knowing the number
>>>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>>>> Well, no. For example, before even trying to map you could check that the range
>>>> of P2M entries covered is all empty.
>>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>>> just do a mapping, right?
>> Possibly that would simply mean to return an error, yes.
>>
>>> Won't be this procedure consume a lot of time as it is needed to go through each page
>>> tables for each entry.
>> Well, you're free to suggest a clean alternative without doing so.
> I thought about dynamically allocating an array in p2m_set_entry(), where to save all changed PTEs,
> and then use it to roll back if __p2m_set_entry() returns rc != 0 ...
>
>>>>    _Then_ you know how to correctly roll back.
>>>> And yes, doing so may not even require passing back information on how much of
>>>> a region was successfully mapped.
>>> If P2M entries were empty before start of the mapping then it is enough to just go
>>> through the same range (sgfn,nr,smfn) and just clean them, right?
>> Yes, what else would "roll back" mean in that case?
> ... If we know that the P2M entries were empty, then there's nothing else to be done, just
> clean PTE is needed to be done.
> However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
> case), then rolling back would mean restoring their original state, the state they
> had before the P2M mapping procedure started.

A possible rollback is harder to implement than expected, because there is a case where a
subtree could be freed:
     /*
      * Free the entry only if the original pte was valid and the base
      * is different (to avoid freeing when permission is changed).
      */
     if ( p2me_is_valid(p2m, orig_pte) &&
          !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
         p2m_free_subtree(p2m, orig_pte, level);
In that case the full subtree would need to be stored.

~ Oleksii



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-08 10:37                       ` Oleksii Kurochko
@ 2025-07-08 12:45                         ` Jan Beulich
  2025-07-08 15:42                           ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-08 12:45 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 08.07.2025 12:37, Oleksii Kurochko wrote:
> 
> On 7/8/25 11:01 AM, Oleksii Kurochko wrote:
>>
>>
>> On 7/8/25 9:10 AM, Jan Beulich wrote:
>>> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>>>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    return false;
>>>>>>>>>>>> +}
>>>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>>>        /*
>>>>>>>>>>         * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>>>         * the type to check whether an entry is valid.
>>>>>>>>>>         */
>>>>>>>>>>        static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>>>        {
>>>>>>>>>>            return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>>>        }
>>>>>>>>>>
>>>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>>>> set. Just that it's not doing what you want here.
>>>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>>>> free bits for type).
>>>>> Because this is how it's defined on x86:
>>>>>
>>>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>>>                                (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>>>
>>>>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>>>>> would better be uniform across architectures, such that in principle they
>>>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>>>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>>>> x86 and Arm have different understanding what is valid.
>>>>
>>>> Except what mentioned in the comment that grant types aren't considered valid
>>>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>>>> p2m_is_valid() is stricter then Arm's one and if other arches should be also
>>>> so strict.
>>> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
>>> could also consider x86'es to require a better name). It's a local helper, not
>>> a P2M type checking predicate. With that in mind, you may of course follow
>>> Arm's model, but in the longer run we may need to do something about the name
>>> collision then.
>>>
>>>>>> The only use case I can think of is that the caller
>>>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>>>> issue, and retrying would probably result in the same error.
>>>>>>
>>>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>>>> like the best option is simply to panic().
>>>>>>
>>>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>>>> would better crash only the domain in question. And even that only if suitable
>>>>> error handling isn't possible.
>>>> And if there is no still any runnable domain available, for example, we are creating
>>>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>>>> If yes, then it is enough to return only error code without returning how many GFNs were
>>>> mapped or rollbacking as domain won't be ran anyway.
>>> During domain creation all you need to do is return an error. But when you write a
>>> generic function that's also (going to be) used at domain runtime, you need to
>>> consider what to do there in case of partial success.
>>>
>>>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>>>     booting; in which case I’d need to perform a rollback, and simply knowing the number
>>>>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>>>>> Well, no. For example, before even trying to map you could check that the range
>>>>> of P2M entries covered is all empty.
>>>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>>>> just do a mapping, right?
>>> Possibly that would simply mean to return an error, yes.
>>>
>>>> Won't be this procedure consume a lot of time as it is needed to go through each page
>>>> tables for each entry.
>>> Well, you're free to suggest a clean alternative without doing so.
>> I thought about dynamically allocating an array in p2m_set_entry(), where to save all changed PTEs,
>> and then use it to roll back if __p2m_set_entry() returns rc != 0 ...

That's another possible source for failure, and such an allocation may end
up being a rather big one.

>>>>>    _Then_ you know how to correctly roll back.
>>>>> And yes, doing so may not even require passing back information on how much of
>>>>> a region was successfully mapped.
>>>> If P2M entries were empty before start of the mapping then it is enough to just go
>>>> through the same range (sgfn,nr,smfn) and just clean them, right?
>>> Yes, what else would "roll back" mean in that case?
>> ... If we know that the P2M entries were empty, then there's nothing else to be done, just
>> clean PTE is needed to be done.
>> However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
>> case), then rolling back would mean restoring their original state, the state they
>> had before the P2M mapping procedure started.
> 
> Possible roll back is harder to implement as expected because there is a case where subtree
> could be freed:
>      /*
>       * Free the entry only if the original pte was valid and the base
>       * is different (to avoid freeing when permission is changed).
>       */
>      if ( p2me_is_valid(p2m, orig_pte) &&
>           !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>          p2m_free_subtree(p2m, orig_pte, level);
> In this case then it will be needed to store the full subtree.

Right, which is why it may be desirable to limit the ability to update multiple
entries in one go. Or work from certain assumptions, violation of which would
cause the domain to be crashed.
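
A minimal sketch of the second option, combined with single-entry updates:
the function works from the assumption that the target slot is empty and
marks the domain as crashed if that assumption is violated. Everything here
is hypothetical (flat array model, toy_ names, a plain flag instead of
Xen's domain_crash()); it only illustrates the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_GFNS 16
#define TOY_EMPTY   0UL /* hypothetical encoding of "no mapping" */

/* Toy stand-in for a domain with a flat, one-slot-per-GFN P2M. */
struct toy_domain {
    unsigned long slot[TOY_NR_GFNS];
    bool crashed; /* stands in for crashing just this domain */
};

/* Map exactly one GFN, assuming the slot is currently empty. */
static int toy_map_one(struct toy_domain *d, size_t gfn, unsigned long mfn)
{
    assert(mfn != TOY_EMPTY); /* MFN 0 is reserved as "empty" in this toy */

    if ( d->crashed || gfn >= TOY_NR_GFNS )
        return -EINVAL;

    if ( d->slot[gfn] != TOY_EMPTY )
    {
        /* Assumption violated: give up on this domain, not on the host. */
        d->crashed = true;
        return -EBUSY;
    }

    d->slot[gfn] = mfn;

    return 0;
}
```

With one entry per call there is no partial success to report, and the old
mapping is left untouched when the assumption does not hold.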

Jan



* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-08 12:45                         ` Jan Beulich
@ 2025-07-08 15:42                           ` Oleksii Kurochko
  2025-07-08 16:04                             ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-08 15:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/8/25 2:45 PM, Jan Beulich wrote:
> On 08.07.2025 12:37, Oleksii Kurochko wrote:
>> On 7/8/25 11:01 AM, Oleksii Kurochko wrote:
>>>
>>> On 7/8/25 9:10 AM, Jan Beulich wrote:
>>>> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>>>>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>>>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    return false;
>>>>>>>>>>>>> +}
>>>>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>>>>         /*
>>>>>>>>>>>          * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>>>>          * the type to check whether an entry is valid.
>>>>>>>>>>>          */
>>>>>>>>>>>         static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>>>>         {
>>>>>>>>>>>             return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>>>>> set. Just that it's not doing what you want here.
>>>>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>>>>> free bits for type).
>>>>>> Because this is how it's defined on x86:
>>>>>>
>>>>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>>>>                                 (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>>>>
>>>>>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>>>>>> would better be uniform across architectures, such that in principle they
>>>>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>>>>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>>>>> x86 and Arm have different understanding what is valid.
>>>>>
>>>>> Except what mentioned in the comment that grant types aren't considered valid
>>>>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>>>>> p2m_is_valid() is stricter then Arm's one and if other arches should be also
>>>>> so strict.
>>>> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
>>>> could also consider x86'es to require a better name). It's a local helper, not
>>>> a P2M type checking predicate. With that in mind, you may of course follow
>>>> Arm's model, but in the longer run we may need to do something about the name
>>>> collision then.
>>>>
>>>>>>> The only use case I can think of is that the caller
>>>>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>>>>> issue, and retrying would probably result in the same error.
>>>>>>>
>>>>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>>>>> like the best option is simply to panic().
>>>>>>>
>>>>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>>>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>>>>> would better crash only the domain in question. And even that only if suitable
>>>>>> error handling isn't possible.
>>>>> And if there is no still any runnable domain available, for example, we are creating
>>>>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>>>>> If yes, then it is enough to return only error code without returning how many GFNs were
>>>>> mapped or rollbacking as domain won't be ran anyway.
>>>> During domain creation all you need to do is return an error. But when you write a
>>>> generic function that's also (going to be) used at domain runtime, you need to
>>>> consider what to do there in case of partial success.
>>>>
>>>>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>>>> booting; in which case I’d need to perform a rollback, and simply knowing the number
>>>>>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>>>>>> Well, no. For example, before even trying to map you could check that the range
>>>>>> of P2M entries covered is all empty.
>>>>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>>>>> just do a mapping, right?
>>>> Possibly that would simply mean to return an error, yes.
>>>>
>>>>> Won't be this procedure consume a lot of time as it is needed to go through each page
>>>>> tables for each entry.
>>>> Well, you're free to suggest a clean alternative without doing so.
>>> I thought about dynamically allocating an array in p2m_set_entry(), where to save all changed PTEs,
>>> and then use it to roll back if __p2m_set_entry() returns rc != 0 ...
> That's another possible source for failure, and such an allocation may end
> up being a rather big one.
>
>>>>>>     _Then_ you know how to correctly roll back.
>>>>>> And yes, doing so may not even require passing back information on how much of
>>>>>> a region was successfully mapped.
>>>>> If P2M entries were empty before start of the mapping then it is enough to just go
>>>>> through the same range (sgfn,nr,smfn) and just clean them, right?
>>>> Yes, what else would "roll back" mean in that case?
>>> ... If we know that the P2M entries were empty, then there's nothing else to be done, just
>>> clean PTE is needed to be done.
>>> However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
>>> case), then rolling back would mean restoring their original state, the state they
>>> had before the P2M mapping procedure started.
>> Possible roll back is harder to implement as expected because there is a case where subtree
>> could be freed:
>>       /*
>>        * Free the entry only if the original pte was valid and the base
>>        * is different (to avoid freeing when permission is changed).
>>        */
>>       if ( p2me_is_valid(p2m, orig_pte) &&
>>            !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>>           p2m_free_subtree(p2m, orig_pte, level);
>> In this case then it will be needed to store the full subtree.
> Right, which is why it may be desirable to limit the ability to update multiple
> entries in one go. Or work from certain assumptions, violation of which would
> cause the domain to be crashed.

It seems to me that the main issue with updating multiple entries in one go is the rollback
mechanism needed after a partial mapping failure. (Are there other issues? Perhaps that a
large mapping could take a long time, so something has to wait until it finishes?) In my
opinion, the rollback mechanism is quite complex to implement and could become a source of
further failures. For example, most of the cases where p2m_set_entry() could fail are due to
a failure to map a page table (to allow Xen to walk through it) or a failure to create a new
page table because of memory exhaustion. Rollback might itself require memory allocation, so
it could run into the same memory shortage.
And what should be done in that case?

In my opinion, the best option is to simply return from p2m_set_entry() the number of
successfully mapped GFNs (stored in rc which is returned by p2m_set_entry()) and let
the caller decide how to handle the partial mapping:
1. If a partial mapping occurs during domain creation, we could just report that this
    domain can't be created and continue without it if there are other domains to start;
    otherwise, panic.
2. If a partial mapping occurs during the lifetime of a domain, for example, if the domain
    requests to map some memory, we return the number of successfully mapped GFNs and let the
    domain decide what to do: either remove the mappings or retry mapping the remaining part.
    However, I think there's not much value in retrying, since p2m_set_entry() is likely to
    fail again. So, perhaps the best course of action is to stop the domain altogether.
Does that make sense?
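
To make the two cases concrete, here is a minimal sketch of the caller-side
decision. It is illustrative only: the enum, the function name and the
domain_running flag are hypothetical and are not part of the series. In real
code the runtime case would presumably end in something like domain_crash().

```c
#include <stdbool.h>

/* Hypothetical outcomes a caller could act on; not part of the series. */
enum map_action {
    MAP_DONE,          /* whole range mapped, nothing to do */
    MAP_FAIL_CREATE,   /* case 1: report that the domain can't be created */
    MAP_STOP_DOMAIN,   /* case 2: stop (crash) the already running domain */
};

/*
 * Decide what to do after a mapping attempt, given how many of the
 * nr requested GFNs were actually mapped.
 */
static enum map_action handle_partial_mapping(bool domain_running,
                                              unsigned long mapped,
                                              unsigned long nr)
{
    if ( mapped == nr )
        return MAP_DONE;

    return domain_running ? MAP_STOP_DOMAIN : MAP_FAIL_CREATE;
}
```

Whether to panic() is deliberately absent from the sketch: the function only
reports the outcome, and what follows from it is left to the caller.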

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-08 15:42                           ` Oleksii Kurochko
@ 2025-07-08 16:04                             ` Jan Beulich
  2025-07-09  8:24                               ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-08 16:04 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 08.07.2025 17:42, Oleksii Kurochko wrote:
> 
> On 7/8/25 2:45 PM, Jan Beulich wrote:
>> On 08.07.2025 12:37, Oleksii Kurochko wrote:
>>> On 7/8/25 11:01 AM, Oleksii Kurochko wrote:
>>>>
>>>> On 7/8/25 9:10 AM, Jan Beulich wrote:
>>>>> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>>>>>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>>>>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>>>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +    return false;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>>>>>         /*
>>>>>>>>>>>>          * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>>>>>          * the type to check whether an entry is valid.
>>>>>>>>>>>>          */
>>>>>>>>>>>>         static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>>>>>         {
>>>>>>>>>>>>             return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>>>>>> set. Just that it's not doing what you want here.
>>>>>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>>>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>>>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>>>>>> free bits for type).
>>>>>>> Because this is how it's defined on x86:
>>>>>>>
>>>>>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>>>>>                                 (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>>>>>
>>>>>>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>>>>>>> would better be uniform across architectures, such that in principle they
>>>>>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>>>>>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>>>>>> x86 and Arm have different understanding what is valid.
>>>>>>
>>>>>> Except what mentioned in the comment that grant types aren't considered valid
>>>>>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>>>>>> p2m_is_valid() is stricter then Arm's one and if other arches should be also
>>>>>> so strict.
>>>>> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
>>>>> could also consider x86'es to require a better name). It's a local helper, not
>>>>> a P2M type checking predicate. With that in mind, you may of course follow
>>>>> Arm's model, but in the longer run we may need to do something about the name
>>>>> collision then.
>>>>>
>>>>>>>> The only use case I can think of is that the caller
>>>>>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>>>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>>>>>> issue, and retrying would probably result in the same error.
>>>>>>>>
>>>>>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>>>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>>>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>>>>>> like the best option is simply to panic().
>>>>>>>>
>>>>>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>>>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>>>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>>>>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>>>>>> would better crash only the domain in question. And even that only if suitable
>>>>>>> error handling isn't possible.
>>>>>> And if there is no still any runnable domain available, for example, we are creating
>>>>>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>>>>>> If yes, then it is enough to return only error code without returning how many GFNs were
>>>>>> mapped or rollbacking as domain won't be ran anyway.
>>>>> During domain creation all you need to do is return an error. But when you write a
>>>>> generic function that's also (going to be) used at domain runtime, you need to
>>>>> consider what to do there in case of partial success.
>>>>>
>>>>>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>>>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>>>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>>>>> booting; in which case I’d need to perform a rollback, and simply knowing the number
>>>>>>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>>>>>>> Well, no. For example, before even trying to map you could check that the range
>>>>>>> of P2M entries covered is all empty.
>>>>>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>>>>>> just do a mapping, right?
>>>>> Possibly that would simply mean to return an error, yes.
>>>>>
>>>>>> Won't be this procedure consume a lot of time as it is needed to go through each page
>>>>>> tables for each entry.
>>>>> Well, you're free to suggest a clean alternative without doing so.
>>>> I thought about dynamically allocating an array in p2m_set_entry(), where to save all changed PTEs,
>>>> and then use it to roll back if __p2m_set_entry() returns rc != 0 ...
>> That's another possible source for failure, and such an allocation may end
>> up being a rather big one.
>>
>>>>>>>     _Then_ you know how to correctly roll back.
>>>>>>> And yes, doing so may not even require passing back information on how much of
>>>>>>> a region was successfully mapped.
>>>>>> If P2M entries were empty before start of the mapping then it is enough to just go
>>>>>> through the same range (sgfn,nr,smfn) and just clean them, right?
>>>>> Yes, what else would "roll back" mean in that case?
>>>> ... If we know that the P2M entries were empty, then there's nothing else to be done, just
>>>> clean PTE is needed to be done.
>>>> However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
>>>> case), then rolling back would mean restoring their original state, the state they
>>>> had before the P2M mapping procedure started.
>>> Possible roll back is harder to implement as expected because there is a case where subtree
>>> could be freed:
>>>       /*
>>>        * Free the entry only if the original pte was valid and the base
>>>        * is different (to avoid freeing when permission is changed).
>>>        */
>>>       if ( p2me_is_valid(p2m, orig_pte) &&
>>>            !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>>>           p2m_free_subtree(p2m, orig_pte, level);
>>> In this case then it will be needed to store the full subtree.
>> Right, which is why it may be desirable to limit the ability to update multiple
>> entries in one go. Or work from certain assumptions, violation of which would
>> cause the domain to be crashed.
> 
> It seems to me that the main issue with updating multiple entries in one go is the rollback
> mechanism in case of a partial mapping failure. (other issues? mapping could consume a lot
> of time so something should wait while allocation will end?) In my opinion, the rollback
> mechanism is quite complex to implement and could become a source of further failures.
> For example, most of the cases where p2m_set_entry() could fail are due to failure in
> mapping the page table (to allow Xen to walk through it) or failure in creating a new page
> table due to memory exhaustion. Then, during rollback, which might also require memory
> allocation, we could face the same memory shortage issue.
> And what should be done in that case?
> 
> In my opinion, the best option is to simply return from p2m_set_entry() the number of
> successfully mapped GFNs (stored in rc which is returned by p2m_set_entry()) and let
> the caller decide how to handle the partial mapping:
> 1. If a partial mapping occurs during domain creation, we could just report that this
>     domain can't be created and continue without it if there are other domains to start;
>     otherwise, panic.

I don't see how panic()-ing is relevant here. That's to be decided (far) up
the call stack.

> 2. If a partial mapping occurs during the lifetime of a domain, for example, if the domain
>     requests to map some memory, we return the number of successfully mapped GFNs and let the
>     domain decide what to do: either remove the mappings or retry mapping the remaining part.
>     However, I think there's not much value in retrying, since p2m_set_entry() is likely to
>     fail again. So, perhaps the best course of action is to stop the domain altogether.
> Does that make sense?

Sure, why not. Provided you actually have a way to communicate back how much
was mapped.
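
One possible shape for that, purely as a sketch: keep the int return value
for the error code and report the count through an out-parameter. map_one(),
map_range() and the budget parameter below are hypothetical stand-ins (budget
only simulates running out of memory after that many pages); none of this is
the series' code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-page mapper: succeeds for the first `budget` pages. */
static int map_one(unsigned long idx, unsigned long budget)
{
    return idx < budget ? 0 : -ENOMEM;
}

/*
 * Map nr pages. Returns 0 or a negative errno value; the number of pages
 * actually mapped is stored in *mapped when the pointer is non-NULL.
 */
static int map_range(unsigned long nr, unsigned long budget,
                     unsigned long *mapped)
{
    unsigned long done;
    int rc = 0;

    for ( done = 0; done < nr; done++ )
    {
        rc = map_one(done, budget);
        if ( rc )
            break;
    }

    if ( mapped != NULL )
        *mapped = done;

    return rc;
}
```

With this split the error code and the count cannot be confused with one
another, at the price of one more parameter on every call.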

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-08 16:04                             ` Jan Beulich
@ 2025-07-09  8:24                               ` Oleksii Kurochko
  2025-07-09  8:41                                 ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-09  8:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


On 7/8/25 6:04 PM, Jan Beulich wrote:
> On 08.07.2025 17:42, Oleksii Kurochko wrote:
>> On 7/8/25 2:45 PM, Jan Beulich wrote:
>>> On 08.07.2025 12:37, Oleksii Kurochko wrote:
>>>> On 7/8/25 11:01 AM, Oleksii Kurochko wrote:
>>>>> On 7/8/25 9:10 AM, Jan Beulich wrote:
>>>>>> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>>>>>>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>>>>>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>>>>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>>>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +    return false;
>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>>>>>>          /*
>>>>>>>>>>>>>           * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>>>>>>           * the type to check whether an entry is valid.
>>>>>>>>>>>>>           */
>>>>>>>>>>>>>          static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>>>>>>          {
>>>>>>>>>>>>>              return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>>>>>>          }
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>>>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>>>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>>>>>>> set. Just that it's not doing what you want here.
>>>>>>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>>>>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>>>>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>>>>>>> free bits for type).
>>>>>>>> Because this is how it's defined on x86:
>>>>>>>>
>>>>>>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>>>>>>                                  (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>>>>>>
>>>>>>>> I.e. more strict that simply "!= p2m_invalid". And I think such predicates
>>>>>>>> would better be uniform across architectures, such that in principle they
>>>>>>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>>>>>>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>>>>>>> x86 and Arm have different understanding what is valid.
>>>>>>>
>>>>>>> Except what mentioned in the comment that grant types aren't considered valid
>>>>>>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>>>>>>> p2m_is_valid() is stricter then Arm's one and if other arches should be also
>>>>>>> so strict.
>>>>>> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
>>>>>> could also consider x86'es to require a better name). It's a local helper, not
>>>>>> a P2M type checking predicate. With that in mind, you may of course follow
>>>>>> Arm's model, but in the longer run we may need to do something about the name
>>>>>> collision then.
>>>>>>
>>>>>>>>> The only use case I can think of is that the caller
>>>>>>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>>>>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>>>>>>> issue, and retrying would probably result in the same error.
>>>>>>>>>
>>>>>>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>>>>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>>>>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>>>>>>> like the best option is simply to panic().
>>>>>>>>>
>>>>>>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>>>>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>>>>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>>>>>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>>>>>>> would better crash only the domain in question. And even that only if suitable
>>>>>>>> error handling isn't possible.
>>>>>>> And if there is no still any runnable domain available, for example, we are creating
>>>>>>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>>>>>>> If yes, then it is enough to return only error code without returning how many GFNs were
>>>>>>> mapped or rollbacking as domain won't be ran anyway.
>>>>>> During domain creation all you need to do is return an error. But when you write a
>>>>>> generic function that's also (going to be) used at domain runtime, you need to
>>>>>> consider what to do there in case of partial success.
>>>>>>
>>>>>>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>>>>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>>>>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>>>>>> booting; in which case I’d need to perform a rollback, and simply knowing the number
>>>>>>>>> of successfully mapped GFNs may not be enough or, more likely, just panic.
>>>>>>>> Well, no. For example, before even trying to map you could check that the range
>>>>>>>> of P2M entries covered is all empty.
>>>>>>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>>>>>>> just do a mapping, right?
>>>>>> Possibly that would simply mean to return an error, yes.
>>>>>>
>>>>>>> Won't be this procedure consume a lot of time as it is needed to go through each page
>>>>>>> tables for each entry.
>>>>>> Well, you're free to suggest a clean alternative without doing so.
>>>>> I thought about dynamically allocating an array in p2m_set_entry(), where to save all changed PTEs,
>>>>> and then use it to roll back if __p2m_set_entry() returns rc != 0 ...
>>> That's another possible source for failure, and such an allocation may end
>>> up being a rather big one.
>>>
>>>>>>>>      _Then_ you know how to correctly roll back.
>>>>>>>> And yes, doing so may not even require passing back information on how much of
>>>>>>>> a region was successfully mapped.
>>>>>>> If P2M entries were empty before start of the mapping then it is enough to just go
>>>>>>> through the same range (sgfn,nr,smfn) and just clean them, right?
>>>>>> Yes, what else would "roll back" mean in that case?
>>>>> ... If we know that the P2M entries were empty, then there's nothing else to be done, just
>>>>> clean PTE is needed to be done.
>>>>> However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
>>>>> case), then rolling back would mean restoring their original state, the state they
>>>>> had before the P2M mapping procedure started.
>>>> Possible roll back is harder to implement as expected because there is a case where subtree
>>>> could be freed:
>>>>        /*
>>>>         * Free the entry only if the original pte was valid and the base
>>>>         * is different (to avoid freeing when permission is changed).
>>>>         */
>>>>        if ( p2me_is_valid(p2m, orig_pte) &&
>>>>             !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>>>>            p2m_free_subtree(p2m, orig_pte, level);
>>>> In this case then it will be needed to store the full subtree.
>>> Right, which is why it may be desirable to limit the ability to update multiple
>>> entries in one go. Or work from certain assumptions, violation of which would
>>> cause the domain to be crashed.
>> It seems to me that the main issue with updating multiple entries in one go is the rollback
>> mechanism in case of a partial mapping failure. (other issues? mapping could consume a lot
>> of time so something should wait while allocation will end?) In my opinion, the rollback
>> mechanism is quite complex to implement and could become a source of further failures.
>> For example, most of the cases where p2m_set_entry() could fail are due to failure in
>> mapping the page table (to allow Xen to walk through it) or failure in creating a new page
>> table due to memory exhaustion. Then, during rollback, which might also require memory
>> allocation, we could face the same memory shortage issue.
>> And what should be done in that case?
>>
>> In my opinion, the best option is to simply return from p2m_set_entry() the number of
>> successfully mapped GFNs (stored in rc which is returned by p2m_set_entry()) and let
>> the caller decide how to handle the partial mapping:
>> 1. If a partial mapping occurs during domain creation, we could just report that this
>>      domain can't be created and continue without it if there are other domains to start;
>>      otherwise, panic.
> I don't see how panic()-ing is relevant here. That's to be decided (far) up
> the call stack.

So it's just a question of whether the caller should panic() or propagate the return
value (error code) up the call stack.

For example, in the case of domain construction the return value is propagated almost to the
top of the stack:
   p2m_set_entry(p2m_access_t a, p2m_type_t t, mfn_t smfn, unsigned long nr, gfn_t sgfn, struct p2m_domain * p2m) (/run/media/ok/blue_disk//xen/xen/arch/riscv/p2m.c:1005)
   p2m_insert_mapping(struct domain * d, gfn_t start_gfn, unsigned long nr, mfn_t mfn, p2m_type_t t) (/run/media/ok/blue_disk//xen/xen/arch/riscv/p2m.c:1055)
   guest_physmap_add_entry(struct domain * d, gfn_t gfn, mfn_t mfn, unsigned long page_order, p2m_type_t t) (/run/media/ok/blue_disk//xen/xen/arch/riscv/p2m.c:1076)
   guest_physmap_add_page(unsigned int page_order, struct domain * d) (/run/media/ok/blue_disk//xen/xen/arch/riscv/include/asm/p2m.h:152)
   guest_map_pages(struct domain * d, struct page_info * pg, unsigned int order, void * extra) (/run/media/ok/blue_disk//xen/xen/common/device-tree/domain-build.c:63)
   allocate_domheap_memory(struct domain * d, paddr_t tot_size, alloc_domheap_mem_cb cb, void * extra) (/run/media/ok/blue_disk//xen/xen/common/device-tree/domain-build.c:47)
   allocate_bank_memory(struct kernel_info * kinfo, gfn_t sgfn, paddr_t tot_size) (/run/media/ok/blue_disk//xen/xen/common/device-tree/domain-build.c:99)
   allocate_memory(struct domain * d, struct kernel_info * kinfo) (/run/media/ok/blue_disk//xen/xen/include/xen/mm-frame.h:43)
   construct_domU(struct domain * d, const struct dt_device_node * node) (/run/media/ok/blue_disk//xen/xen/common/device-tree/dom0less-build.c:835)
   create_domUs() (/run/media/ok/blue_disk//xen/xen/common/device-tree/dom0less-build.c:1019)
   start_xen(unsigned long bootcpu_id, paddr_t dtb_addr) (/run/media/ok/blue_disk//xen/xen/arch/riscv/setup.c:296)
   start() (/run/media/ok/blue_disk//xen/xen/arch/riscv/riscv64/head.S:61)

And panic() almost at the end:
         rc = construct_domU(d, node);
         if ( rc )
             panic("Could not set up domain %s (rc = %d)\n",
                   dt_node_name(node), rc);

>
>> 2. If a partial mapping occurs during the lifetime of a domain, for example, if the domain
>>      requests to map some memory, we return the number of successfully mapped GFNs and let the
>>      domain decide what to do: either remove the mappings or retry mapping the remaining part.
>>      However, I think there's not much value in retrying, since p2m_set_entry() is likely to
>>      fail again. So, perhaps the best course of action is to stop the domain altogether.
>> Does that make sense?
> Sure, why not. Provided you actually have a way to communicate back how much
> was mapped.

I was thinking of simply returning it as the return value. This way, a return value of 0 would
indicate that everything was mapped successfully, while a value greater than 0 would indicate
how many GFNs were successfully mapped, and a negative value that nothing was mapped at all:
    static int p2m_set_entry(struct p2m_domain *p2m,
                             gfn_t sgfn,
                             unsigned long nr,
                             mfn_t smfn,
                             p2m_type_t t,
                             p2m_access_t a)
    {
        int rc = 0;
        unsigned int i;

        ...

        /* i counts the entries mapped so far. */
        for ( i = 0; i < nr; i++ )
        {
            ...
            rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
            if ( rc )
                break;
            ...
        }

        /* 0: everything mapped; > 0: partial count; < 0: nothing mapped. */
        return i == nr ? 0 : i ?: rc;
    }
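
The return convention described above (0 for full success, a positive count
for partial success, a negative error if nothing was mapped) can also be
written as a self-contained sketch with the counting made explicit. Everything
here is a stand-in: map_one() replaces __p2m_set_entry(), and the fail_at
parameter is a hypothetical way to inject a failure, not something from the
series.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for __p2m_set_entry(): fails with -ENOMEM when
 * asked to map page number fail_at; a negative fail_at never fails.
 */
static int map_one(unsigned long idx, long fail_at)
{
    return (fail_at >= 0 && idx == (unsigned long)fail_at) ? -ENOMEM : 0;
}

/*
 * Returns 0 if all nr pages were mapped, the positive number of mapped
 * pages on partial success, or a negative errno value if nothing was
 * mapped at all.
 */
static long set_entries(unsigned long nr, long fail_at)
{
    unsigned long done;
    int rc = 0;

    for ( done = 0; done < nr; done++ )
    {
        rc = map_one(done, fail_at);
        if ( rc )
            break;
    }

    if ( done == nr )
        return 0;

    return done ? (long)done : rc;
}
```

Note that folding the count into the return value limits it to what the
return type can hold, and on partial success the original error code is no
longer visible to the caller.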

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()
  2025-07-09  8:24                               ` Oleksii Kurochko
@ 2025-07-09  8:41                                 ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-09  8:41 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 09.07.2025 10:24, Oleksii Kurochko wrote:
> 
> On 7/8/25 6:04 PM, Jan Beulich wrote:
>> On 08.07.2025 17:42, Oleksii Kurochko wrote:
>>> On 7/8/25 2:45 PM, Jan Beulich wrote:
>>>> On 08.07.2025 12:37, Oleksii Kurochko wrote:
>>>>> On 7/8/25 11:01 AM, Oleksii Kurochko wrote:
>>>>>> On 7/8/25 9:10 AM, Jan Beulich wrote:
>>>>>>> On 07.07.2025 18:10, Oleksii Kurochko wrote:
>>>>>>>> On 7/7/25 5:15 PM, Jan Beulich wrote:
>>>>>>>>> On 07.07.2025 17:00, Oleksii Kurochko wrote:
>>>>>>>>>> On 7/7/25 2:53 PM, Jan Beulich wrote:
>>>>>>>>>>> On 07.07.2025 13:46, Oleksii Kurochko wrote:
>>>>>>>>>>>> On 7/7/25 9:20 AM, Jan Beulich wrote:
>>>>>>>>>>>>> On 04.07.2025 17:01, Oleksii Kurochko wrote:
>>>>>>>>>>>>>> On 7/1/25 3:49 PM, Jan Beulich wrote:
>>>>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +    panic("%s: isn't implemented for now\n", __func__);
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +    return false;
>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>> For this function in particular, though: Besides the "p2me" in the name
>>>>>>>>>>>>>>> being somewhat odd (supposedly page table entries here are simply pte_t),
>>>>>>>>>>>>>>> how is this going to be different from pte_is_valid()?
>>>>>>>>>>>>>> pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
>>>>>>>>>>>>>> what is a type stored in the radix tree (p2m->p2m_types):
>>>>>>>>>>>>>>          /*
>>>>>>>>>>>>>>           * In the case of the P2M, the valid bit is used for other purpose. Use
>>>>>>>>>>>>>>           * the type to check whether an entry is valid.
>>>>>>>>>>>>>>           */
>>>>>>>>>>>>>>          static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>>>>>>>          {
>>>>>>>>>>>>>>              return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is done to track which page was modified by a guest.
>>>>>>>>>>>>> But then (again) the name doesn't convey what the function does.
>>>>>>>>>>>> Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
>>>>>>>>>>> For P2M type checks please don't invent new naming, but use what both x86
>>>>>>>>>>> and Arm are already using. Note how we already have p2m_is_valid() in that
>>>>>>>>>>> set. Just that it's not doing what you want here.
>>>>>>>>>> Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
>>>>>>>>>> And in here it is checked if P2M pte is valid from P2M point of view by checking
>>>>>>>>>> the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
>>>>>>>>>> free bits for type).
>>>>>>>>> Because this is how it's defined on x86:
>>>>>>>>>
>>>>>>>>> #define p2m_is_valid(_t)    (p2m_to_mask(_t) & \
>>>>>>>>>                                  (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
>>>>>>>>>
>>>>>>>>> I.e. more strict than simply "!= p2m_invalid". And I think such predicates
>>>>>>>>> would better be uniform across architectures, such that in principle they
>>>>>>>>> might also be usable in common code (as we already do with p2m_is_foreign()).
>>>>>>>> Yeah, Arm isn't so strict in definition of p2m_is_valid() and it seems like
>>>>>>>> x86 and Arm have different understanding what is valid.
>>>>>>>>
>>>>>>>> Except what mentioned in the comment that grant types aren't considered valid
>>>>>>>> for x86 (and shouldn't be the same then for Arm?), it isn't clear why x86's
>>>>>>>> p2m_is_valid() is stricter than Arm's one and if other arches should be also
>>>>>>>> so strict.
>>>>>>> Arm's p2m_is_valid() is entirely different (and imo misnamed, but arguably one
>>>>>>> could also consider x86'es to require a better name). It's a local helper, not
>>>>>>> a P2M type checking predicate. With that in mind, you may of course follow
>>>>>>> Arm's model, but in the longer run we may need to do something about the name
>>>>>>> collision then.
>>>>>>>
>>>>>>>>>> The only use case I can think of is that the caller
>>>>>>>>>> might try to map the remaining GFNs again. But that doesn’t seem very useful,
>>>>>>>>>> if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
>>>>>>>>>> issue, and retrying would probably result in the same error.
>>>>>>>>>>
>>>>>>>>>> The same applies to rolling back the state. It wouldn’t be difficult to add a local
>>>>>>>>>> array to track all modified PTEs and then use it to revert the state if needed.
>>>>>>>>>> But again, what would the caller do after the rollback? At this point, it still seems
>>>>>>>>>> like the best option is simply to panic().
>>>>>>>>>>
>>>>>>>>>> Basically, I don’t see or understand the cases where knowing how many GFNs were
>>>>>>>>>> successfully mapped, or whether a rollback was performed, would really help — because
>>>>>>>>>> in most cases, I don’t have a better option than just calling panic() at the end.
>>>>>>>>> panic()-ing is of course only a last resort. Anything related to domain handling
>>>>>>>>> would better crash only the domain in question. And even that only if suitable
>>>>>>>>> error handling isn't possible.
>>>>>>>> And if there is no still any runnable domain available, for example, we are creating
>>>>>>>> domain and some p2m mapping is called? Will it be enough just ignore to boot this domain?
>>>>>>>> If yes, then it is enough to return only error code without returning how many GFNs were
>>>>>>>> mapped or rollbacking as domain won't be ran anyway.
>>>>>>> During domain creation all you need to do is return an error. But when you write a
>>>>>>> generic function that's also (going to be) used at domain runtime, you need to
>>>>>>> consider what to do there in case of partial success.
>>>>>>>
>>>>>>>>>> For example, if I call map_regions_p2mt() for an MMIO region described in a device
>>>>>>>>>> tree node, and the mapping fails partway through, I’m left with two options: either
>>>>>>>>>> ignore the device (if it's not essential for Xen or guest functionality) and continue
>>>>>>>>>> booting, in which case I’d need to perform a rollback, and simply knowing the number
>>>>>>>>>> of successfully mapped GFNs may not be enough; or, more likely, just panic.
>>>>>>>>> Well, no. For example, before even trying to map you could check that the range
>>>>>>>>> of P2M entries covered is all empty.
>>>>>>>> Could it be that they aren't all empty? Then it seems like we have overlapping and we can't
>>>>>>>> just do a mapping, right?
>>>>>>> Possibly that would simply mean to return an error, yes.
>>>>>>>
>>>>>>>> Won't be this procedure consume a lot of time as it is needed to go through each page
>>>>>>>> tables for each entry.
>>>>>>> Well, you're free to suggest a clean alternative without doing so.
>>>>>> I thought about dynamically allocating an array in p2m_set_entry(), where to save all changed PTEs,
>>>>>> and then use it to roll back if __p2m_set_entry() returns rc != 0 ...
>>>> That's another possible source for failure, and such an allocation may end
>>>> up being a rather big one.
>>>>
>>>>>>>>> _Then_ you know how to correctly roll back.
>>>>>>>>> And yes, doing so may not even require passing back information on how much of
>>>>>>>>> a region was successfully mapped.
>>>>>>>> If P2M entries were empty before start of the mapping then it is enough to just go
>>>>>>>> through the same range (sgfn,nr,smfn) and just clean them, right?
>>>>>>> Yes, what else would "roll back" mean in that case?
>>>>>> ... If we know that the P2M entries were empty, then there's nothing else to be done, just
>>>>>> clean PTE is needed to be done.
>>>>>> However, if the P2M entries weren’t empty (and I’m still not sure whether that’s a legal
>>>>>> case), then rolling back would mean restoring their original state, the state they
>>>>>> had before the P2M mapping procedure started.
>>>>> Possible roll back is harder to implement as expected because there is a case where subtree
>>>>> could be freed:
>>>>>        /*
>>>>>         * Free the entry only if the original pte was valid and the base
>>>>>         * is different (to avoid freeing when permission is changed).
>>>>>         */
>>>>>        if ( p2me_is_valid(p2m, orig_pte) &&
>>>>>             !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte)) )
>>>>>            p2m_free_subtree(p2m, orig_pte, level);
>>>>> In this case then it will be needed to store the full subtree.
>>>> Right, which is why it may be desirable to limit the ability to update multiple
>>>> entries in one go. Or work from certain assumptions, violation of which would
>>>> cause the domain to be crashed.
>>> It seems to me that the main issue with updating multiple entries in one go is the rollback
>>> mechanism in case of a partial mapping failure. (other issues? mapping could consume a lot
>>> of time so something should wait while allocation will end?) In my opinion, the rollback
>>> mechanism is quite complex to implement and could become a source of further failures.
>>> For example, most of the cases where p2m_set_entry() could fail are due to failure in
>>> mapping the page table (to allow Xen to walk through it) or failure in creating a new page
>>> table due to memory exhaustion. Then, during rollback, which might also require memory
>>> allocation, we could face the same memory shortage issue.
>>> And what should be done in that case?
>>>
>>> In my opinion, the best option is to simply return from p2m_set_entry() the number of
>>> successfully mapped GFNs (stored in rc which is returned by p2m_set_entry()) and let
>>> the caller decide how to handle the partial mapping:
>>> 1. If a partial mapping occurs during domain creation, we could just report that this
>>>      domain can't be created and continue without it if there are other domains to start;
>>>      otherwise, panic.
>> I don't see how panic()-ing is relevant here. That's to be decided (far) up
>> the call stack.
> 
> So it's just a question of whether the caller should panic() or propagate the return
> value (error code) up the call stack.
> 
> For example, in case of domain construction the return value is propagated almost to the top
> of the stack:
>    p2m_set_entry(p2m_access_t a, p2m_type_t t, mfn_t smfn, unsigned long nr, gfn_t sgfn, struct p2m_domain * p2m) (/run/media/ok/blue_disk//xen/xen/arch/riscv/p2m.c:1005)
>    p2m_insert_mapping(struct domain * d, gfn_t start_gfn, unsigned long nr, mfn_t mfn, p2m_type_t t) (/run/media/ok/blue_disk//xen/xen/arch/riscv/p2m.c:1055)
>    guest_physmap_add_entry(struct domain * d, gfn_t gfn, mfn_t mfn, unsigned long page_order, p2m_type_t t) (/run/media/ok/blue_disk//xen/xen/arch/riscv/p2m.c:1076)
>    guest_physmap_add_page(unsigned int page_order, struct domain * d) (/run/media/ok/blue_disk//xen/xen/arch/riscv/include/asm/p2m.h:152)
>    guest_map_pages(struct domain * d, struct page_info * pg, unsigned int order, void * extra) (/run/media/ok/blue_disk//xen/xen/common/device-tree/domain-build.c:63)
>    allocate_domheap_memory(struct domain * d, paddr_t tot_size, alloc_domheap_mem_cb cb, void * extra) (/run/media/ok/blue_disk//xen/xen/common/device-tree/domain-build.c:47)
>    allocate_bank_memory(struct kernel_info * kinfo, gfn_t sgfn, paddr_t tot_size) (/run/media/ok/blue_disk//xen/xen/common/device-tree/domain-build.c:99)
>    allocate_memory(struct domain * d, struct kernel_info * kinfo) (/run/media/ok/blue_disk//xen/xen/include/xen/mm-frame.h:43)
>    construct_domU(struct domain * d, const struct dt_device_node * node) (/run/media/ok/blue_disk//xen/xen/common/device-tree/dom0less-build.c:835)
>    create_domUs() (/run/media/ok/blue_disk//xen/xen/common/device-tree/dom0less-build.c:1019)
>    start_xen(unsigned long bootcpu_id, paddr_t dtb_addr) (/run/media/ok/blue_disk//xen/xen/arch/riscv/setup.c:296)
>    start() (/run/media/ok/blue_disk//xen/xen/arch/riscv/riscv64/head.S:61)
> 
> And panic() almost at the end:
>          rc = construct_domU(d, node);
>          if ( rc )
>              panic("Could not set up domain %s (rc = %d)\n",
>                    dt_node_name(node), rc);

Which is what is wanted, imo.

Jan



* Re: [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers
  2025-07-01 14:23   ` Jan Beulich
@ 2025-07-11 15:56     ` Oleksii Kurochko
  2025-07-14  7:15       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-11 15:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/1/25 4:23 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> This patch introduces a working implementation of p2m_free_entry() for RISC-V
>> based on ARM's implementation of p2m_free_entry(), enabling proper cleanup
>> of page table entries in the P2M (physical-to-machine) mapping.
>>
>> Only few things are changed:
>> - Use p2m_force_flush_sync() instead of p2m_tlb_flush_sync() as latter
>>    isn't implemented on RISC-V.
>> - Introduce and use p2m_type_radix_get() to get a type of p2m entry as
>>    RISC-V's PTE doesn't have enough space to store all necessary types so
>>    a type is stored in a radix tree.
>>
>> Key additions include:
>> - p2m_free_entry(): Recursively frees page table entries at all levels. It
>>    handles both regular and superpage mappings and ensures that TLB entries
>>    are flushed before freeing intermediate tables.
>> - p2m_put_page() and helpers:
>>    - p2m_put_4k_page(): Clears GFN from xenheap pages if applicable.
>>    - p2m_put_2m_superpage(): Releases foreign page references in a 2MB
>>      superpage.
>>    - p2m_type_radix_get(): Extracts the stored p2m_type from the radix tree
>>      using the PTE.
>> - p2m_free_page(): Returns a page either to the domain's freelist or to
>>    the domheap, depending on whether the domain is hardware-backed.
> What is "hardware-backed"?

It basically means the hardware domain, i.e. Dom0.

>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -345,11 +345,33 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>       return __map_domain_page(p2m->root + root_table_indx);
>>   }
>>   
>> +static p2m_type_t p2m_type_radix_get(struct p2m_domain *p2m, pte_t pte)
> Does it matter to callers that ...
>
>> +{
>> +    void *ptr;
>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>> +
>> +    ptr = radix_tree_lookup(&p2m->p2m_type, gfn_x(gfn));
>> +
>> +    if ( !ptr )
>> +        return p2m_invalid;
>> +
>> +    return radix_tree_ptr_to_int(ptr);
>> +}
> ... this is a radix tree lookup? IOW does "radix" need to be part of the
> function name? Also "get" may want to move forward in the name, to better
> match the naming of other functions.

Agreed, it doesn't really matter to callers, so I will rename it.

>> +/*
>> + * In the case of the P2M, the valid bit is used for other purpose. Use
>> + * the type to check whether an entry is valid.
>> + */
>>   static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>   {
>> -    panic("%s: isn't implemented for now\n", __func__);
>> +    return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>> +}
> No checking of the valid bit?

As mentioned in the comment, only the P2M type should be checked, since the
valid bit is used for other purposes we discussed earlier, for example, to
track whether pages were accessed by a guest domain, or to support certain
table invalidation optimizations (1) and (2).
So, in this case, we only need to consider whether the entry is invalid
from the P2M perspective.

(1)https://github.com/xen-project/xen/blob/19772b67/xen/arch/arm/mmu/p2m.c#L1245
(2)https://github.com/xen-project/xen/blob/19772b67/xen/arch/arm/mmu/p2m.c#L1386

>> @@ -404,11 +426,127 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>       return GUEST_TABLE_MAP_NONE;
>>   }
>>   
>> +static void p2m_put_foreign_page(struct page_info *pg)
>> +{
>> +    /*
>> +     * It's safe to do the put_page here because page_alloc will
>> +     * flush the TLBs if the page is reallocated before the end of
>> +     * this loop.
>> +     */
>> +    put_page(pg);
> Is the comment really true? The page allocator will flush the normal
> TLBs, but not the stage-2 ones. Yet those are what you care about here,
> aiui.

In alloc_heap_pages():
  ...
      if ( need_tlbflush )
         filtered_flush_tlb_mask(tlbflush_timestamp);
  ...
  
filtered_flush_tlb_mask() calls arch_flush_tlb_mask().

and arch_flush_tlb_mask(), at least on Arm (I haven't checked x86), is
implemented as:
   void arch_flush_tlb_mask(const cpumask_t *mask)
   {
       /* No need to IPI other processors on ARM, the processor takes care of it. */
       flush_all_guests_tlb();
   }

So it flushes the stage-2 TLBs as well.

>
>> +/* Put any references on the single 4K page referenced by mfn. */
>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>> +{
>> +    /* TODO: Handle other p2m types */
>> +
>> +    /* Detect the xenheap page and mark the stored GFN as invalid. */
>> +    if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
>> +        page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
> Is this a valid thing to do? How do you make sure the respective uses
> (in gnttab's shared and status page arrays) are / were also removed?

The grant table frame GFN is stored directly in struct page_info rather
than in standalone status/shared arrays, so there is no need for such
arrays.

>
>> +}
>> +
>> +/* Put any references on the superpage referenced by mfn. */
>> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>> +{
>> +    struct page_info *pg;
>> +    unsigned int i;
>> +
>> +    ASSERT(mfn_valid(mfn));
>> +
>> +    pg = mfn_to_page(mfn);
>> +
>> +    for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
>> +        p2m_put_foreign_page(pg);
>> +}
>> +
>> +/* Put any references on the page referenced by pte. */
>> +static void p2m_put_page(struct p2m_domain *p2m, const pte_t pte,
>> +                         unsigned int level)
>> +{
>> +    mfn_t mfn = pte_get_mfn(pte);
>> +    p2m_type_t p2m_type = p2m_type_radix_get(p2m, pte);
> This gives you the type of the 1st page. What guarantees that all other pages
> in a superpage are of the exact same type?

Doesn't a superpage mean that all the 4KB pages within it have the same
type and are contiguous in memory?
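
To make that assumption concrete, here is a minimal standalone sketch (not the
patch's code) of how many 4KB pages a single leaf entry covers at each level,
assuming 512 entries per table as with XEN_PT_ENTRIES on RV64. PT_ORDER and
pages_per_leaf() are names made up for this illustration:

```c
#include <assert.h>

/* Assumption: 9 index bits per level, i.e. 512 entries per page table. */
#define PT_ORDER 9U

/* Number of 4KB pages covered by one leaf PTE at the given level. */
static unsigned long pages_per_leaf(unsigned int level)
{
    /* Only levels 0 (4KB), 1 (2MB) and 2 (1GB) are considered here. */
    assert(level <= 2);

    return 1UL << (PT_ORDER * level);
}
```

A level-1 leaf would thus stand for 512 contiguous 4KB pages, so a single type
lookup would only be sufficient if the entry really is a superpage mapping.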

>
>> +    ASSERT(p2me_is_valid(p2m, pte));
>> +
>> +    /*
>> +     * TODO: Currently we don't handle level 2 super-page, Xen is not
>> +     * preemptible and therefore some work is needed to handle such
>> +     * superpages, for which at some point Xen might end up freeing memory
>> +     * and therefore for such a big mapping it could end up in a very long
>> +     * operation.
>> +     */
> This is pretty unsatisfactory. Imo, if you don't deal with that right away,
> you're setting yourself up for a significant re-write.

Arm has lived with that for a long time and it doesn't seem to be a big issue there.
And considering that the frametable supports only 4KB page granularity, such big mappings
could lead to long operations during memory freeing.
And 1GB mapping isn't used for

>
>> +    if ( level == 1 )
>> +        return p2m_put_2m_superpage(mfn, p2m_type);
>> +    else if ( level == 0 )
>> +        return p2m_put_4k_page(mfn, p2m_type);
> Use switch() right away?

It could be. I don't think it makes a big difference at the moment,
but I am okay with reworking it.
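
For reference, a switch()-based dispatch could look roughly like the standalone
sketch below. The helper names and call counters are stand-ins invented for
this example, not the functions from the patch:

```c
/* Hypothetical counters standing in for the real put helpers. */
static unsigned int put_4k_calls, put_2m_calls;

static void put_4k_page(void)      { put_4k_calls++; }
static void put_2m_superpage(void) { put_2m_calls++; }

/* Dispatch on the page table level; levels above 1 are not handled yet. */
static void put_page_at_level(unsigned int level)
{
    switch ( level )
    {
    case 0:
        put_4k_page();
        break;

    case 1:
        put_2m_superpage();
        break;

    default:
        /* Level 2 (1GB) superpages are left untouched, as in the TODO. */
        break;
    }
}
```

This keeps the unhandled level-2 case explicit in one place instead of letting
it fall silently out of an if/else chain.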

>
>> +}
>> +
>> +static void p2m_free_page(struct domain *d, struct page_info *pg)
>> +{
>> +    if ( is_hardware_domain(d) )
>> +        free_domheap_page(pg);
> Why's the hardware domain different here? It should have a pool just like
> all other domains have.

There should be no limit on the number of pages that can be allocated for
the hardware domain (dom0), so its p2m pages are allocated from the heap.

The idea of the p2m pool is to provide a way to put a clear limit on the
amount of p2m allocation.

>
>> +    else
>> +    {
>> +        spin_lock(&d->arch.paging.lock);
>> +        page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
>> +        spin_unlock(&d->arch.paging.lock);
>> +    }
>> +}
>> +
>>   /* Free pte sub-tree behind an entry */
>>   static void p2m_free_entry(struct p2m_domain *p2m,
>>                              pte_t entry, unsigned int level)
>>   {
>> -    panic("%s: hasn't been implemented yet\n", __func__);
>> +    unsigned int i;
>> +    pte_t *table;
>> +    mfn_t mfn;
>> +    struct page_info *pg;
>> +
>> +    /* Nothing to do if the entry is invalid. */
>> +    if ( !p2me_is_valid(p2m, entry) )
>> +        return;
> Does this actually apply to intermediate page tables (which you handle
> later in the function), when that's (only) a P2M type check?

Yes, any PTE should have the V bit set to 1, so from the P2M perspective its
type should also, at least, not be equal to p2m_invalid.

~ Oleksii


* Re: [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers
  2025-07-11 15:56     ` Oleksii Kurochko
@ 2025-07-14  7:15       ` Jan Beulich
  2025-07-14 16:01         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-14  7:15 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 11.07.2025 17:56, Oleksii Kurochko wrote:
> On 7/1/25 4:23 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> +/*
>>> + * In the case of the P2M, the valid bit is used for other purpose. Use
>>> + * the type to check whether an entry is valid.
>>> + */
>>>   static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>   {
>>> -    panic("%s: isn't implemented for now\n", __func__);
>>> +    return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>> +}
>> No checking of the valid bit?
> 
> As mentioned in the comment, only the P2M type should be checked, since the
> valid bit is used for other purposes we discussed earlier, for example, to
> track whether pages were accessed by a guest domain, or to support certain
> table invalidation optimizations (1) and (2).
> So, in this case, we only need to consider whether the entry is invalid
> from the P2M perspective.
> 
> (1)https://github.com/xen-project/xen/blob/19772b67/xen/arch/arm/mmu/p2m.c#L1245
> (2)https://github.com/xen-project/xen/blob/19772b67/xen/arch/arm/mmu/p2m.c#L1386

And there can be e.g. entries with the valid bit set and the type being
p2m_invalid? IOW there's no short-circuiting possible in any of the
possible cases, avoiding the radix tree lookup in at least some of the
cases?

>>> @@ -404,11 +426,127 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>>       return GUEST_TABLE_MAP_NONE;
>>>   }
>>>   
>>> +static void p2m_put_foreign_page(struct page_info *pg)
>>> +{
>>> +    /*
>>> +     * It's safe to do the put_page here because page_alloc will
>>> +     * flush the TLBs if the page is reallocated before the end of
>>> +     * this loop.
>>> +     */
>>> +    put_page(pg);
>> Is the comment really true? The page allocator will flush the normal
>> TLBs, but not the stage-2 ones. Yet those are what you care about here,
>> aiui.
> 
> In alloc_heap_pages():
>   ...
>       if ( need_tlbflush )
>          filtered_flush_tlb_mask(tlbflush_timestamp);
>   ...
>   
> filtered_flush_tlb_mask() calls arch_flush_tlb_mask().
> 
> and arch_flush_tlb_mask(), at least, on Arm (I haven't checked x86) is
> implemented as:
>    void arch_flush_tlb_mask(const cpumask_t *mask)
>    {
>        /* No need to IPI other processors on ARM, the processor takes care of it. */
>        flush_all_guests_tlb();
>    }
> 
> So it flushes stage-2 TLB.

Hmm, okay. And I take it you have the same plan on RISC-V? What I'd like to
ask for, though, is that the comment (also) mentions where that (guest)
flushing actually happens. That's not in page_alloc.c, and it also wasn't
originally intended for guest TLBs to also be flushed from there (as x86 is
where the flush avoidance machinery originates, which Arm and now also
RISC-V don't really use).

>>> +/* Put any references on the single 4K page referenced by mfn. */
>>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>>> +{
>>> +    /* TODO: Handle other p2m types */
>>> +
>>> +    /* Detect the xenheap page and mark the stored GFN as invalid. */
>>> +    if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
>>> +        page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
>> Is this a valid thing to do? How do you make sure the respective uses
>> (in gnttab's shared and status page arrays) are / were also removed?
> 
> As grant table frame GFN is stored directly in struct page_info instead
> of keeping it in standalone status/shared arrays, thereby there is no need
> for status/shared arrays.

I fear I don't follow. Looking at Arm's header (which I understand you
derive from), I see

#define gnttab_shared_page(t, i)   virt_to_page((t)->shared_raw[i])

#define gnttab_status_page(t, i)   virt_to_page((t)->status[i])

Are you intending to do things differently?

>>> +/* Put any references on the superpage referenced by mfn. */
>>> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>>> +{
>>> +    struct page_info *pg;
>>> +    unsigned int i;
>>> +
>>> +    ASSERT(mfn_valid(mfn));
>>> +
>>> +    pg = mfn_to_page(mfn);
>>> +
>>> +    for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
>>> +        p2m_put_foreign_page(pg);
>>> +}
>>> +
>>> +/* Put any references on the page referenced by pte. */
>>> +static void p2m_put_page(struct p2m_domain *p2m, const pte_t pte,
>>> +                         unsigned int level)
>>> +{
>>> +    mfn_t mfn = pte_get_mfn(pte);
>>> +    p2m_type_t p2m_type = p2m_type_radix_get(p2m, pte);
>> This gives you the type of the 1st page. What guarantees that all other pages
>> in a superpage are of the exact same type?
> 
> Doesn't superpage mean that all the 4KB pages within that superpage have the
> same type and contiguous in memory?

If the mapping is a super-page one - yes. Yet I see nothing super-page-ish
here.

>>> +    if ( level == 1 )
>>> +        return p2m_put_2m_superpage(mfn, p2m_type);
>>> +    else if ( level == 0 )
>>> +        return p2m_put_4k_page(mfn, p2m_type);
>> Use switch() right away?
> 
> It could be, I think that no big difference at the moment, at least.
> But I am okay to rework it.

If you don't want to use switch() here, then my other style nit would
need giving: Please avoid "else" in situations like this.

>>> +static void p2m_free_page(struct domain *d, struct page_info *pg)
>>> +{
>>> +    if ( is_hardware_domain(d) )
>>> +        free_domheap_page(pg);
>> Why's the hardware domain different here? It should have a pool just like
>> all other domains have.
> 
> Hardware domain (dom0) should be no limit in the number of pages that can
> be allocated, so allocate p2m pages for hardware domain is done from heap.
> 
> An idea of p2m pool is to provide a way how to put clear limit and amount
> to the p2m allocation.

Well, we had been there on another thread, and I outlined how I think
Dom0 may want handling.

>>>   /* Free pte sub-tree behind an entry */
>>>   static void p2m_free_entry(struct p2m_domain *p2m,
>>>                              pte_t entry, unsigned int level)
>>>   {
>>> -    panic("%s: hasn't been implemented yet\n", __func__);
>>> +    unsigned int i;
>>> +    pte_t *table;
>>> +    mfn_t mfn;
>>> +    struct page_info *pg;
>>> +
>>> +    /* Nothing to do if the entry is invalid. */
>>> +    if ( !p2me_is_valid(p2m, entry) )
>>> +        return;
>> Does this actually apply to intermediate page tables (which you handle
>> later in the function), when that's (only) a P2M type check?
> 
> Yes, any PTE should have V bit set to 1, so from P2M perspective it also
> should be, at least, not equal to p2m_invalid.

I don't follow. Where would that type be set? The radix tree being GFN-
indexed, you would need to "invent" a GFN for every intermediate page table,
just to be able to (legitimately) invoke the type retrieval function. Maybe
you mean to leverage that (now, i.e. post-v2) you encode some of the types
directly in the PTE, and p2m_invalid may be one of them. But that wasn't
the case in the v2 submission, and hence the code looked wrong to me. Which
in turn suggests that at least some better commentary is going to be needed,
maybe even some BUILD_BUG_ON().

Jan



* Re: [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers
  2025-07-14  7:15       ` Jan Beulich
@ 2025-07-14 16:01         ` Oleksii Kurochko
  2025-07-14 16:17           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-14 16:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/14/25 9:15 AM, Jan Beulich wrote:
> On 11.07.2025 17:56, Oleksii Kurochko wrote:
>> On 7/1/25 4:23 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> +/*
>>>> + * In the case of the P2M, the valid bit is used for other purpose. Use
>>>> + * the type to check whether an entry is valid.
>>>> + */
>>>>    static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>    {
>>>> -    panic("%s: isn't implemented for now\n", __func__);
>>>> +    return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>> +}
>>> No checking of the valid bit?
>> As mentioned in the comment, only the P2M type should be checked, since the
>> valid bit is used for other purposes we discussed earlier, for example, to
>> track whether pages were accessed by a guest domain, or to support certain
>> table invalidation optimizations (1) and (2).
>> So, in this case, we only need to consider whether the entry is invalid
>> from the P2M perspective.
>>
>> (1)https://github.com/xen-project/xen/blob/19772b67/xen/arch/arm/mmu/p2m.c#L1245
>> (2)https://github.com/xen-project/xen/blob/19772b67/xen/arch/arm/mmu/p2m.c#L1386
> And there can be e.g. entries with the valid bit set and the type being
> p2m_invalid?

It shouldn't be so; at least at the moment, I don't know of any such cases.

> IOW there's no short-circuiting possible in any of the
> possible cases, avoiding the radix tree lookup in at least some of the
> cases?

Yes, I’ve implemented such optimization. I started using two free bits
in the PTE for some “popular” types:
   static p2m_type_t p2m_get_type(struct p2m_domain *p2m, pte_t pte)
   {
       p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
   
       if ( type == p2m_device_tree_type )
       {
           ...
           ptr = radix_tree_lookup(&p2m->p2m_types, gfn_x(gfn));
           ...
           return radix_tree_ptr_to_int(ptr);
       }
   
       return type;
   }
   
   /*
    * In the case of the P2M, the valid bit is used for other purpose. Use
    * the type to check whether an entry is valid.
    */
   static inline bool p2m_is_valid(struct p2m_domain *p2m, pte_t pte)
   {
       return p2m_get_type(p2m, pte) != p2m_invalid;
   }


But thanks to your reply, I realized that in the case of p2m_is_valid(),
the implementation could be simplified further to:
   /*
    * In the case of the P2M, the valid bit is used for other purpose. Use
    * the type to check whether an entry is valid.
    */
   static inline bool p2m_is_valid(struct p2m_domain *p2m, pte_t pte)
   {
       return MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) != p2m_invalid;
   }

That is because here we only care whether the type is p2m_invalid or not;
we don’t need the specific type if it’s not p2m_invalid.
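
A standalone sketch of that check follows. It rests on several assumptions made
only for illustration: that the type lives in the two software-reserved PTE
bits 8 and 9, that the type numbering is as shown, and that p2m_ext_storage
(a made-up name) marks types kept outside the PTE. The static assertions are
in the spirit of the BUILD_BUG_ON() suggested earlier in the thread:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors Xen's MASK_EXTR(): extract the field selected by mask m. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))

/* Assumption: the two software-reserved PTE bits (8 and 9) hold the type. */
#define P2M_TYPE_PTE_BITS_MASK 0x300UL

/* Hypothetical numbering; the key property is p2m_invalid == 0. */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_mmio_direct,
    p2m_ext_storage,   /* marker: the real type is stored outside the PTE */
} p2m_type_t;

/* A zeroed PTE must decode as invalid, and in-PTE types must fit the field. */
_Static_assert(p2m_invalid == 0, "zeroed PTE must decode as p2m_invalid");
_Static_assert(p2m_ext_storage <=
               MASK_EXTR(P2M_TYPE_PTE_BITS_MASK, P2M_TYPE_PTE_BITS_MASK),
               "in-PTE types must fit in the reserved bits");

typedef struct { uint64_t pte; } pte_t;

/* Validity from the P2M point of view: the hardware V bit (bit 0) is ignored. */
static bool pte_type_is_valid(pte_t pte)
{
    return MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) != p2m_invalid;
}
```

With this layout no lookup in an external structure is needed for the validity
check, since even the external-storage marker is a non-zero in-PTE value.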

>
>>>> @@ -404,11 +426,127 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>>>        return GUEST_TABLE_MAP_NONE;
>>>>    }
>>>>    
>>>> +static void p2m_put_foreign_page(struct page_info *pg)
>>>> +{
>>>> +    /*
>>>> +     * It's safe to do the put_page here because page_alloc will
>>>> +     * flush the TLBs if the page is reallocated before the end of
>>>> +     * this loop.
>>>> +     */
>>>> +    put_page(pg);
>>> Is the comment really true? The page allocator will flush the normal
>>> TLBs, but not the stage-2 ones. Yet those are what you care about here,
>>> aiui.
>> In alloc_heap_pages():
>>    ...
>>        if ( need_tlbflush )
>>           filtered_flush_tlb_mask(tlbflush_timestamp);
>>    ...
>>    
>> filtered_flush_tlb_mask() calls arch_flush_tlb_mask().
>>
>> and arch_flush_tlb_mask(), at least, on Arm (I haven't checked x86) is
>> implemented as:
>>     void arch_flush_tlb_mask(const cpumask_t *mask)
>>     {
>>         /* No need to IPI other processors on ARM, the processor takes care of it. */
>>         flush_all_guests_tlb();
>>     }
>>
>> So it flushes stage-2 TLB.
> Hmm, okay. And I take it you have the same plan on RISC-V?

Yes, there is such a plan.

>   What I'd like to
> ask for, though, is that the comment (also) mentions where that (guest)
> flushing actually happens. That's not in page_alloc.c, and it also wasn't
> originally intended for guest TLBs to also be flushed from there (as x86 is
> where the flush avoidance machinery originates, which Arm and now also
> RISC-V don't really use).

Sure, it makes sense to update the comment.


>
>>>> +/* Put any references on the single 4K page referenced by mfn. */
>>>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>>>> +{
>>>> +    /* TODO: Handle other p2m types */
>>>> +
>>>> +    /* Detect the xenheap page and mark the stored GFN as invalid. */
>>>> +    if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
>>>> +        page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
>>> Is this a valid thing to do? How do you make sure the respective uses
>>> (in gnttab's shared and status page arrays) are / were also removed?
>> As grant table frame GFN is stored directly in struct page_info instead
>> of keeping it in standalone status/shared arrays, thereby there is no need
>> for status/shared arrays.
> I fear I don't follow. Looking at Arm's header (which I understand you
> derive from), I see
>
> #define gnttab_shared_page(t, i)   virt_to_page((t)->shared_raw[i])
>
> #define gnttab_status_page(t, i)   virt_to_page((t)->status[i])
>
> Are you intending to do things differently?

I missed these arrays... Arm had different arrays:
-    (gt)->arch.shared_gfn = xmalloc_array(gfn_t, ngf_);                  \
-    (gt)->arch.status_gfn = xmalloc_array(gfn_t, nsf_);                  \

I think I don't know the answer to your question, as I'm not deeply familiar
with grant tables and would need to do some additional investigation.

And just to be sure I understand your question correctly: are you asking
whether I marked a page as INVALID_GFN while a domain might still be using
it for grant table purposes?

>
>>>> +/* Put any references on the superpage referenced by mfn. */
>>>> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>>>> +{
>>>> +    struct page_info *pg;
>>>> +    unsigned int i;
>>>> +
>>>> +    ASSERT(mfn_valid(mfn));
>>>> +
>>>> +    pg = mfn_to_page(mfn);
>>>> +
>>>> +    for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
>>>> +        p2m_put_foreign_page(pg);
>>>> +}
>>>> +
>>>> +/* Put any references on the page referenced by pte. */
>>>> +static void p2m_put_page(struct p2m_domain *p2m, const pte_t pte,
>>>> +                         unsigned int level)
>>>> +{
>>>> +    mfn_t mfn = pte_get_mfn(pte);
>>>> +    p2m_type_t p2m_type = p2m_type_radix_get(p2m, pte);
>>> This gives you the type of the 1st page. What guarantees that all other pages
>>> in a superpage are of the exact same type?
>> Doesn't superpage mean that all the 4KB pages within that superpage have the
>> same type and contiguous in memory?
> If the mapping is a super-page one - yes. Yet I see nothing super-page-ish
> here.

Probably, I just misunderstood your reply, but there is a check below:
     if ( level == 2 )
         return p2m_put_l2_superpage(mfn, pte.p2m.type);
And I expect that if level == 2, it means it is a superpage, which means that
all the 4KB pages within that superpage share the same type and are contiguous
in memory.


>
>>>> +    if ( level == 1 )
>>>> +        return p2m_put_2m_superpage(mfn, p2m_type);
>>>> +    else if ( level == 0 )
>>>> +        return p2m_put_4k_page(mfn, p2m_type);
>>> Use switch() right away?
>> It could be, I think that no big difference at the moment, at least.
>> But I am okay to rework it.
> If you don't want to use switch() here, then my other style nit would
> need giving: Please avoid "else" in situations like this.
>
>>>> +static void p2m_free_page(struct domain *d, struct page_info *pg)
>>>> +{
>>>> +    if ( is_hardware_domain(d) )
>>>> +        free_domheap_page(pg);
>>> Why's the hardware domain different here? It should have a pool just like
>>> all other domains have.
>> Hardware domain (dom0) should be no limit in the number of pages that can
>> be allocated, so allocate p2m pages for hardware domain is done from heap.
>>
>> An idea of p2m pool is to provide a way how to put clear limit and amount
>> to the p2m allocation.
> Well, we had been there on another thread, and I outlined how I think
> Dom0 may want handling.

I don't think I remember. Could you please remind me which thread that was?
Do you perhaps mean this reply: https://lore.kernel.org/xen-devel/cover.1749555949.git.oleksii.kurochko@gmail.com/T/#m4789842aaae1653b91d3368f66cadb0ef87fb17e ?
But that one is not really about the Dom0 case.

>
>>>>    /* Free pte sub-tree behind an entry */
>>>>    static void p2m_free_entry(struct p2m_domain *p2m,
>>>>                               pte_t entry, unsigned int level)
>>>>    {
>>>> -    panic("%s: hasn't been implemented yet\n", __func__);
>>>> +    unsigned int i;
>>>> +    pte_t *table;
>>>> +    mfn_t mfn;
>>>> +    struct page_info *pg;
>>>> +
>>>> +    /* Nothing to do if the entry is invalid. */
>>>> +    if ( !p2me_is_valid(p2m, entry) )
>>>> +        return;
>>> Does this actually apply to intermediate page tables (which you handle
>>> later in the function), when that's (only) a P2M type check?
>> Yes, any PTE should have V bit set to 1, so from P2M perspective it also
>> should be, at least, not equal to p2m_invalid.
> I don't follow. Where would that type be set? The radix tree being GFN-
> indexed, you would need to "invent" a GFN for every intermediate page table,
> just to be able to (legitimately) invoke the type retrieval function.

Maybe it is incorrect, but in this patch series the type is set when
page_to_p2m_table() is called, which takes as an argument the page corresponding
to a table. The GFN is then calculated based on this MFN:
   static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
   {
       /*
        * Since this function generates a table entry, according to "Encoding
        * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
        * to point to the next level of the page table.
        * Therefore, to ensure that an entry is a page table entry,
        * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
        * which overrides whatever was passed as `p2m_type_t` and guarantees that
        * the entry is a page table entry by setting r = w = x = 0.
        */
       return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw,
                                 p2m_access_n2rwx);
   }
where:
   static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t, p2m_access_t a)
   {
       ...
   
       pte_set_mfn(&e, mfn);
   
       BUG_ON(p2m_type_radix_set(p2m, e, t));
   
       return e;
   }
   
and where:
   static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
   {
       int rc;
       gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
   
       rc = radix_tree_insert(&p2m->p2m_types, gfn_x(gfn),
                              radix_tree_int_to_ptr(t));
       ....
   }

But as you mentioned below ...

>   Maybe
> you mean to leverage that (now, i.e. post-v2) you encode some of the types
> directly in the PTE, and p2m_invalid may be one of them. But that wasn't
> the case in the v2 submission, and hence the code looked wrong to me. Which
> in turn suggests that at least some better commentary is going to be needed,
> maybe even some BUILD_BUG_ON().

... p2m_invalid type will be encoded directly in the PTE in the next patch version.

~ Oleksii


* Re: [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers
  2025-07-14 16:01         ` Oleksii Kurochko
@ 2025-07-14 16:17           ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-14 16:17 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 14.07.2025 18:01, Oleksii Kurochko wrote:
> On 7/14/25 9:15 AM, Jan Beulich wrote:
>> On 11.07.2025 17:56, Oleksii Kurochko wrote:
>>> On 7/1/25 4:23 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> +/* Put any references on the single 4K page referenced by mfn. */
>>>>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>>>>> +{
>>>>> +    /* TODO: Handle other p2m types */
>>>>> +
>>>>> +    /* Detect the xenheap page and mark the stored GFN as invalid. */
>>>>> +    if ( p2m_is_ram(type) && is_xen_heap_mfn(mfn) )
>>>>> +        page_set_xenheap_gfn(mfn_to_page(mfn), INVALID_GFN);
>>>> Is this a valid thing to do? How do you make sure the respective uses
>>>> (in gnttab's shared and status page arrays) are / were also removed?
>>> As grant table frame GFN is stored directly in struct page_info instead
>>> of keeping it in standalone status/shared arrays, thereby there is no need
>>> for status/shared arrays.
>> I fear I don't follow. Looking at Arm's header (which I understand you
>> derive from), I see
>>
>> #define gnttab_shared_page(t, i)   virt_to_page((t)->shared_raw[i])
>>
>> #define gnttab_status_page(t, i)   virt_to_page((t)->status[i])
>>
>> Are you intending to do things differently?
> 
> I missed these arrays... Arm had different arrays:
> -    (gt)->arch.shared_gfn = xmalloc_array(gfn_t, ngf_);                  \
> -    (gt)->arch.status_gfn = xmalloc_array(gfn_t, nsf_);                  \
> 
> I think I don't know the answer to your question, as I'm not deeply familiar
> with grant tables and would need to do some additional investigation.
> 
> And just to be sure I understand your question correctly: are you asking
> whether I marked a page as INVALID_GFN while a domain might still be using
> it for grant table purposes?

Not quite. I'm trying to indicate that you may leave stale information around
when you update the struct page_info instance without also updating one of the
array slots. IOW I think both updates need to happen in sync, or it needs to
be explained why not doing so is still okay.

>>>>> +/* Put any references on the superpage referenced by mfn. */
>>>>> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>>>>> +{
>>>>> +    struct page_info *pg;
>>>>> +    unsigned int i;
>>>>> +
>>>>> +    ASSERT(mfn_valid(mfn));
>>>>> +
>>>>> +    pg = mfn_to_page(mfn);
>>>>> +
>>>>> +    for ( i = 0; i < XEN_PT_ENTRIES; i++, pg++ )
>>>>> +        p2m_put_foreign_page(pg);
>>>>> +}
>>>>> +
>>>>> +/* Put any references on the page referenced by pte. */
>>>>> +static void p2m_put_page(struct p2m_domain *p2m, const pte_t pte,
>>>>> +                         unsigned int level)
>>>>> +{
>>>>> +    mfn_t mfn = pte_get_mfn(pte);
>>>>> +    p2m_type_t p2m_type = p2m_type_radix_get(p2m, pte);
>>>> This gives you the type of the 1st page. What guarantees that all other pages
>>>> in a superpage are of the exact same type?
>>> Doesn't superpage mean that all the 4KB pages within that superpage have the
>>> same type and contiguous in memory?
>> If the mapping is a super-page one - yes. Yet I see nothing super-page-ish
>> here.
> 
> Probably, I just misunderstood your reply, but there is a check below:
>      if ( level == 2 )
>          return p2m_put_l2_superpage(mfn, pte.p2m.type);
> And I expect that if level == 2, it means it is a superpage, which means that
> all the 4KB pages within that superpage share the same type and are contiguous
> in memory.

Let's hope that all of this is going to remain consistent then.

>>>>> +static void p2m_free_page(struct domain *d, struct page_info *pg)
>>>>> +{
>>>>> +    if ( is_hardware_domain(d) )
>>>>> +        free_domheap_page(pg);
>>>> Why's the hardware domain different here? It should have a pool just like
>>>> all other domains have.
>>> Hardware domain (dom0) should be no limit in the number of pages that can
>>> be allocated, so allocate p2m pages for hardware domain is done from heap.
>>>
>>> An idea of p2m pool is to provide a way how to put clear limit and amount
>>> to the p2m allocation.
>> Well, we had been there on another thread, and I outlined how I think
>> Dom0 may want handling.
> 
> I don't think I remember. Could you please remind me which thread that was?
> Do you perhaps mean this reply: https://lore.kernel.org/xen-devel/cover.1749555949.git.oleksii.kurochko@gmail.com/T/#m4789842aaae1653b91d3368f66cadb0ef87fb17e ?
> But that one is not really about the Dom0 case.

It would have been where the allocation counterpart to the freeing here is,
I expect.

Jan



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-01 15:08   ` Jan Beulich
@ 2025-07-15 14:47     ` Oleksii Kurochko
  2025-07-16 11:31       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-15 14:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/1/25 5:08 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>       return __map_domain_page(p2m->root + root_table_indx);
>>   }
>>   
>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
> See comments on the earlier patch regarding naming.
>
>> +{
>> +    int rc;
>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
> How does this work, when you record GFNs only for Xenheap pages?

I don't think I understand what the issue is. Could you please provide
some extra details?

> I don't
> think you can get around having the caller pass in the GFN. At which point
> the PTE probably doesn't need passing.

It’s an option. I think we still need the PTE argument because, as we discussed
earlier, some P2M types will be stored in PTE bits.

I’m also wondering whether the MFN could be used to identify the P2M PTE’s type,
or if, in general, it isn’t unique (since different GFNs can map to the same MFN),
meaning it can't reliably be used to determine the PTE’s type. Right?
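
To make the non-uniqueness concrete, here is a small standalone sketch in which
the caller supplies the GFN and the type is keyed by it. All names are
hypothetical and a fixed array stands in for the radix tree; it only shows that
two GFNs can share one MFN while carrying different types, so the MFN alone
cannot identify the type.

```c
#include <assert.h>
#include <stdint.h>

#define NR_GFNS 8

typedef enum { t_invalid = 0, t_ram_rw, t_ram_ro } sketch_type_t;

/* One slot per guest frame: stand-in for a GFN-indexed radix tree. */
struct sketch_p2m {
    uint64_t mfn[NR_GFNS];
    sketch_type_t type[NR_GFNS];
};

/* The caller passes the GFN in; it is never derived from the MFN. */
int sketch_set_entry(struct sketch_p2m *p2m, unsigned long gfn,
                     uint64_t mfn, sketch_type_t t)
{
    assert(p2m);

    if ( gfn >= NR_GFNS )
        return -1;

    p2m->mfn[gfn] = mfn;
    p2m->type[gfn] = t;

    return 0;
}

sketch_type_t sketch_get_type(const struct sketch_p2m *p2m, unsigned long gfn)
{
    assert(p2m);

    return gfn < NR_GFNS ? p2m->type[gfn] : t_invalid;
}
```

Mapping the same MFN at GFN 1 as read-write and at GFN 5 as read-only leaves
two distinct type records, which a lookup keyed by MFN could not represent.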

>
>> +    rc = radix_tree_insert(&p2m->p2m_type, gfn_x(gfn),
>> +                           radix_tree_int_to_ptr(t));
>> +    if ( rc == -EEXIST )
>> +    {
>> +        /* If a setting already exists, change it to the new one */
>> +        radix_tree_replace_slot(
>> +            radix_tree_lookup_slot(
>> +                &p2m->p2m_type, gfn_x(gfn)),
>> +            radix_tree_int_to_ptr(t));
>> +        rc = 0;
>> +    }
>> +
>> +    return rc;
>> +}
>> +
>>   static p2m_type_t p2m_type_radix_get(struct p2m_domain *p2m, pte_t pte)
>>   {
>>       void *ptr;
>> @@ -389,12 +409,87 @@ static inline void p2m_remove_pte(pte_t *p, bool clean_pte)
>>       p2m_write_pte(p, pte, clean_pte);
>>   }
>>   
>> -static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn,
>> -                                p2m_type_t t, p2m_access_t a)
>> +static void p2m_set_permission(pte_t *e, p2m_type_t t, p2m_access_t a)
>>   {
>> -    panic("%s: hasn't been implemented yet\n", __func__);
>> +    /* First apply type permissions */
>> +    switch ( t )
>> +    {
>> +    case p2m_ram_rw:
>> +        e->pte |= PTE_ACCESS_MASK;
>> +        break;
>> +
>> +    case p2m_mmio_direct_dev:
>> +        e->pte |= (PTE_READABLE | PTE_WRITABLE);
>> +        e->pte &= ~PTE_EXECUTABLE;
> What's wrong with code living in MMIO, e.g. in the ROM of a PCI device?
> Such code would want to be executable.

I think you are right, and there is nothing wrong with code living in MMIO.

According to the spec:
   I/O regions can specify which combinations of read, write, or execute accesses
   to which data widths are supported.

>> +        break;
>> +
>> +    case p2m_invalid:
>> +        e->pte &= ~PTE_ACCESS_MASK;
>> +        break;
>> +
>> +    default:
>> +        BUG();
>> +        break;
>> +    }
> I think you ought to handle all types that are defined right away. I also
> don't think you should BUG() in the default case (also in the other switch()
> below). ASSERT_UNREACHABLE() may be fine, along with clearing all permissions
> in the entry for release builds.
>
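
A minimal standalone sketch of that suggestion, with assert() standing in for
ASSERT_UNREACHABLE() and a reduced, hypothetical set of access types. The bit
positions are the standard RISC-V PTE ones (R = bit 1, W = bit 2, X = bit 3);
everything else is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* RISC-V PTE permission bits. */
#define PTE_R   (UINT64_C(1) << 1)
#define PTE_W   (UINT64_C(1) << 2)
#define PTE_X   (UINT64_C(1) << 3)
#define PTE_RWX (PTE_R | PTE_W | PTE_X)

/* Hypothetical, reduced set of access types. */
typedef enum { acc_n, acc_r, acc_rw, acc_rx, acc_rwx } sketch_access_t;

uint64_t sketch_restrict(uint64_t pte, sketch_access_t a)
{
    switch ( a )
    {
    case acc_rwx:
        break;

    case acc_rx:
        pte &= ~PTE_W;
        break;

    case acc_rw:
        pte &= ~PTE_X;
        break;

    case acc_r:
        pte &= ~(PTE_W | PTE_X);
        break;

    case acc_n:
        pte &= ~PTE_RWX;
        break;

    default:
        /* Debug builds stop here; with NDEBUG the entry loses all access. */
        assert(0 && "unexpected access type");
        pte &= ~PTE_RWX;
        break;
    }

    return pte;
}
```

The point of the default branch is that an unexpected value fails safe in
release builds instead of bringing the whole hypervisor down.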
>> +    /* Then restrict with access permissions */
>> +    switch ( a )
>> +    {
>> +    case p2m_access_rwx:
>> +        break;
>> +    case p2m_access_wx:
>> +        e->pte &= ~PTE_READABLE;
>> +        break;
>> +    case p2m_access_rw:
>> +        e->pte &= ~PTE_EXECUTABLE;
>> +        break;
>> +    case p2m_access_w:
>> +        e->pte &= ~(PTE_READABLE | PTE_EXECUTABLE);
>> +        e->pte &= ~PTE_EXECUTABLE;
>> +        break;
>> +    case p2m_access_rx:
>> +    case p2m_access_rx2rw:
>> +        e->pte &= ~PTE_WRITABLE;
>> +        break;
>> +    case p2m_access_x:
>> +        e->pte &= ~(PTE_READABLE | PTE_WRITABLE);
>> +        break;
>> +    case p2m_access_r:
>> +        e->pte &= ~(PTE_WRITABLE | PTE_EXECUTABLE);
>> +        break;
>> +    case p2m_access_n:
>> +    case p2m_access_n2rwx:
>> +        e->pte &= ~PTE_ACCESS_MASK;
>> +        break;
>> +    default:
>> +        BUG();
>> +        break;
>> +    }
> Nit: Blank lines between non-fall-through case blocks, please.
>
>> +static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t, p2m_access_t a)
>> +{
>> +    pte_t e = (pte_t) { 1 };
> What's the 1 doing here?

Set valid bit of PTE to 1.

>
>> +    switch ( t )
>> +    {
>> +    case p2m_mmio_direct_dev:
>> +        e.pte |= PTE_PBMT_IO;
>> +        break;
>> +
>> +    default:
>> +        break;
>> +    }
>> +
>> +    p2m_set_permission(&e, t, a);
>> +
>> +    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
>> +
>> +    pte_set_mfn(&e, mfn);
> Based on how things work on x86 (and how I would have expected them to also
> work on Arm), may I suggest that you set MFN ahead of permissions, so that
> the permissions setting function can use the MFN for e.g. a lookup in
> mmio_ro_ranges.

Sure, just a note that on Arm, the MFN is set last.

>
>> +    BUG_ON(p2m_type_radix_set(p2m, e, t));
> I'm not convinced of this error handling here either. Radix tree insertion
> _can_ fail, e.g. when there's no memory left. This must not bring down Xen,
> or we'll have an XSA right away. You could zap the PTE, or if need be you
> could crash the offending domain.

If I understand "zap the PTE" correctly, then I will do it this way:
     if ( p2m_set_type(p2m, e, t) )
         e.pte = 0;

But then it will lead to an MMU failure. How is that expected to be handled?
There’s no guarantee that, at the moment of handling this exception, enough
memory will be available to set a type for the PTE, and it is also not really
clear how to detect in the exception handler that all that is needed is to retry
setting the type. Or should we just call domain_crash()?
In that case, it seems more reasonable to call domain_crash() immediately in
p2m_pte_from_mfn().

>
> In this context (not sure if I asked before): With this use of a radix tree,
> how do you intend to bound the amount of memory that a domain can use, by
> making Xen insert very many entries?

I didn’t think about that. I assumed it would be enough to set the amount of
memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.

Also, it seems this would just lead to the issue you mentioned earlier: when
the memory runs out, domain_crash() will be called or the PTE will be zapped.

~ Oleksii


* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-15 14:47     ` Oleksii Kurochko
@ 2025-07-16 11:31       ` Jan Beulich
  2025-07-16 16:07         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-16 11:31 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 15.07.2025 16:47, Oleksii Kurochko wrote:
> On 7/1/25 5:08 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>       return __map_domain_page(p2m->root + root_table_indx);
>>>   }
>>>   
>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>> See comments on the earlier patch regarding naming.
>>
>>> +{
>>> +    int rc;
>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>> How does this work, when you record GFNs only for Xenheap pages?
> 
> I don't think I understand what the issue is. Could you please provide
> some extra details?

Counter question: The mfn_to_gfn() you currently have is only a stub. It only
works for 1:1 mapped domains. Can you show me the eventual final implementation
of the function, making it possible to use it here? Having such stubs, and not
even annotated in any way, is imo a problem: People may think they're fine to
use when really they aren't.

>>> +static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t, p2m_access_t a)
>>> +{
>>> +    pte_t e = (pte_t) { 1 };
>> What's the 1 doing here?
> 
> Set valid bit of PTE to 1.

But something like this isn't to be done using a plain, unannotated literal
number. Aiui you mean PTE_VALID here.

>>> +    switch ( t )
>>> +    {
>>> +    case p2m_mmio_direct_dev:
>>> +        e.pte |= PTE_PBMT_IO;
>>> +        break;
>>> +
>>> +    default:
>>> +        break;
>>> +    }
>>> +
>>> +    p2m_set_permission(&e, t, a);
>>> +
>>> +    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
>>> +
>>> +    pte_set_mfn(&e, mfn);
>> Based on how things work on x86 (and how I would have expected them to also
>> work on Arm), may I suggest that you set MFN ahead of permissions, so that
>> the permissions setting function can use the MFN for e.g. a lookup in
>> mmio_ro_ranges.
> 
> Sure, just a note that on Arm, the MFN is set last.

That's apparently because they (still) don't have mmio_ro_ranges. That's only
a latent issue (I hope) while they still don't have PCI support.

>>> +    BUG_ON(p2m_type_radix_set(p2m, e, t));
>> I'm not convinced of this error handling here either. Radix tree insertion
>> _can_ fail, e.g. when there's no memory left. This must not bring down Xen,
>> or we'll have an XSA right away. You could zap the PTE, or if need be you
>> could crash the offending domain.
> 
> If I understand "zap the PTE" correctly, then I will do it this way:
>      if ( p2m_set_type(p2m, e, t) )
>          e.pte = 0;
> 
> But then it will lead to an MMU failure. How is that expected to be handled?
> There’s no guarantee that, at the moment of handling this exception, enough
> memory will be available to set a type for the PTE, and it is also not really
> clear how to detect in the exception handler that all that is needed is to retry
> setting the type. Or should we just call domain_crash()?
> In that case, it seems more reasonable to call domain_crash() immediately in
> p2m_pte_from_mfn().

As said - crashing the domain in such an event is an option. The question
here is whether to do so right away, or whether to defer that in the hope
that the PTE may not actually be accessed (before being rewritten).
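
The two options can be sketched side by side. This is a standalone model under
stated assumptions, not the series' code: sketch_store_type() is a hypothetical
stand-in for the radix-tree insertion, and a "crashed" flag stands in for
calling domain_crash().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID     (UINT64_C(1) << 0)
#define PTE_PPN_SHIFT 10

typedef struct { uint64_t pte; } sketch_pte_t;

struct sketch_domain {
    unsigned int type_slots_left; /* models memory left for type records */
    bool crashed;                 /* models domain_crash() having been called */
};

/* Hypothetical stand-in for the type insertion; fails when out of slots. */
int sketch_store_type(struct sketch_domain *d)
{
    assert(d);

    if ( d->type_slots_left == 0 )
        return -ENOMEM;

    d->type_slots_left--;

    return 0;
}

/* Deferred variant: hand back an all-zero entry and let a later access fault. */
sketch_pte_t sketch_entry_or_zap(struct sketch_domain *d, uint64_t mfn)
{
    sketch_pte_t e = { PTE_VALID | (mfn << PTE_PPN_SHIFT) };

    if ( sketch_store_type(d) )
        e.pte = 0;

    return e;
}

/* Immediate variant: also mark the offending domain as crashed right away. */
sketch_pte_t sketch_entry_or_crash(struct sketch_domain *d, uint64_t mfn)
{
    sketch_pte_t e = sketch_entry_or_zap(d, mfn);

    if ( e.pte == 0 )
        d->crashed = true;

    return e;
}
```

In neither variant does the failure propagate beyond the domain whose entry
could not be created.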

>> In this context (not sure if I asked before): With this use of a radix tree,
>> how do you intend to bound the amount of memory that a domain can use, by
>> making Xen insert very many entries?
> 
> I didn’t think about that. I assumed it would be enough to set the amount of
> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.

Which would require these allocations to come from that pool.

> Also, it seems this would just lead to the issue you mentioned earlier: when
> the memory runs out, domain_crash() will be called or the PTE will be zapped.

Or one domain exhausting memory would cause another domain to fail. A domain
impacting just itself may be tolerable. But a domain affecting other domains
isn't.
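
As a rough illustration of that isolation property, a standalone sketch of
per-domain accounting. The structure and function names are hypothetical; the
only point is that every allocation is charged to the requesting domain's own
budget, so exhausting one pool leaves the others untouched.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-domain budget, filled once when the domain is built. */
struct sketch_pool {
    unsigned int free_pages;
};

/* Take one page from this domain's own pool; never from a shared heap. */
bool sketch_pool_alloc(struct sketch_pool *pool)
{
    assert(pool);

    if ( pool->free_pages == 0 )
        return false;   /* only the requesting domain sees the failure */

    pool->free_pages--;

    return true;
}
```

If the type records were allocated this way, a domain that makes Xen insert
very many entries would only run its own pool dry.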

Jan



* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-02  8:35   ` Jan Beulich
@ 2025-07-16 11:32     ` Oleksii Kurochko
  2025-07-16 11:43       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-16 11:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/2/25 10:35 AM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>       return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>   }
>>   
>> +/*
>> + * pte_is_* helpers are checking the valid bit set in the
>> + * PTE but we have to check p2m_type instead (look at the comment above
>> + * p2me_is_valid())
>> + * Provide our own overlay to check the valid bit.
>> + */
>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>> +{
>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>> +}
> Same question as on the earlier patch - does P2M type apply to intermediate
> page tables at all? (Conceptually it shouldn't.)

It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
to a page — PTE should be valid. Considering that in the current implementation
it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
of PTE.v.

>
>> @@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
>>       return e;
>>   }
>>   
>> +/* Generate table entry with correct attributes. */
>> +static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
>> +{
>> +    /*
>> +     * Since this function generates a table entry, according to "Encoding
>> +     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
>> +     * to point to the next level of the page table.
>> +     * Therefore, to ensure that an entry is a page table entry,
>> +     * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
>> +     * which overrides whatever was passed as `p2m_type_t` and guarantees that
>> +     * the entry is a page table entry by setting r = w = x = 0.
>> +     */
>> +    return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw, p2m_access_n2rwx);
> Similarly P2M access shouldn't apply to intermediate page tables. (Moot
> with that, but (ab)using p2m_access_n2rwx would also look wrong: You did
> read what it means, didn't you?)

p2m_access_n2rwx was chosen not really because of the description mentioned near
its declaration, but because it sets r=w=x=0, which RISC-V expects for a PTE that
points to the next-level page table.

Generally, I agree that P2M access shouldn't be applied to intermediate page tables.

What I can suggest in this case is to use p2m_access_rwx instead of p2m_access_n2rwx,
which will ensure that the P2M access type isn't applied when p2m_entry_from_mfn()
is called, and then, after calling p2m_entry_from_mfn(), simply set PTE.r,w,x=0.
So this function will look like:
     /* Generate table entry with correct attributes. */
     static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
     {
         /*
          * p2m_ram_rw is chosen for a table entry as a p2m table should be valid
          * from both the P2M and the hardware point of view.
          *
          * p2m_access_rwx is chosen so that no access restrictions are applied
          * to a table entry.
          */
         pte_t pte = p2m_pte_from_mfn(p2m, page_to_mfn(page), _gfn(0), p2m_ram_rw,
                                      p2m_access_rwx);

         /*
          * Since this function generates a table entry, according to "Encoding
          * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
          * to point to the next level of the page table.
          */
         pte.pte &= ~PTE_ACCESS_MASK;

         return pte;
     }

Does this make sense? Or would it be better to keep the current version of
page_to_p2m_table() and just improve the comment explaining why p2m_access_n2rwx
is used for a table entry?
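
For reference, the hardware rule being relied on can be shown in isolation. The
sketch below is standalone and uses only the standard RISC-V PTE layout (V in
bit 0, R/W/X in bits 1 to 3, PPN starting at bit 10); the helper names are made
up, and it deliberately keys off the hardware V bit rather than the P2M type
discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID     (UINT64_C(1) << 0)
#define PTE_R         (UINT64_C(1) << 1)
#define PTE_W         (UINT64_C(1) << 2)
#define PTE_X         (UINT64_C(1) << 3)
#define PTE_RWX       (PTE_R | PTE_W | PTE_X)
#define PTE_PPN_SHIFT 10

/* Pointer to the next-level table: valid, with R = W = X = 0. */
uint64_t sketch_table_pte(uint64_t mfn)
{
    uint64_t pte = PTE_VALID | (mfn << PTE_PPN_SHIFT);

    assert(!(pte & PTE_RWX));

    return pte;
}

/* Leaf mapping: valid, with at least one of R/W/X set. */
uint64_t sketch_leaf_pte(uint64_t mfn, uint64_t perms)
{
    assert(perms & PTE_RWX);

    return PTE_VALID | (mfn << PTE_PPN_SHIFT) | (perms & PTE_RWX);
}

bool sketch_is_table(uint64_t pte)
{
    return (pte & PTE_VALID) && !(pte & PTE_RWX);
}

bool sketch_is_mapping(uint64_t pte)
{
    return (pte & PTE_VALID) && (pte & PTE_RWX);
}
```

Building the table entry without going through any access-type handling, as in
sketch_table_pte(), is one way to avoid attaching P2M access semantics to
intermediate tables at all.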

>
>> +}
>> +
>> +static struct page_info *p2m_alloc_page(struct domain *d)
>> +{
>> +    struct page_info *pg;
>> +
>> +    /*
>> +     * For hardware domain, there should be no limit in the number of pages that
>> +     * can be allocated, so that the kernel may take advantage of the extended
>> +     * regions. Hence, allocate p2m pages for hardware domains from heap.
>> +     */
>> +    if ( is_hardware_domain(d) )
>> +    {
>> +        pg = alloc_domheap_page(d, MEMF_no_owner);
>> +        if ( pg == NULL )
>> +            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
>> +    }
> The comment looks to have been taken verbatim from Arm. Whatever "extended
> regions" are, does the same concept even exist on RISC-V?

Initially, I missed that it’s used only for Arm. Since it was mentioned in
doc/misc/xen-command-line.pandoc, I assumed it applied to all architectures.
But now I see that it’s Arm-specific: "### ext_regions (Arm)".

>
> Also, special casing Dom0 like this has benefits, but also comes with a
> pitfall: If the system's out of memory, allocations will fail. A pre-
> populated pool would avoid that (until exhausted, of course). If special-
> casing of Dom0 is needed, I wonder whether ...
>
>> +    else
>> +    {
>> +        spin_lock(&d->arch.paging.lock);
>> +        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>> +        spin_unlock(&d->arch.paging.lock);
>> +    }
> ... going this path but with a Dom0-only fallback to general allocation
> wouldn't be the better route.

IIUC, then it should be something like:
   static struct page_info *p2m_alloc_page(struct domain *d)
   {
       struct page_info *pg;
       
       spin_lock(&d->arch.paging.lock);
       pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
       spin_unlock(&d->arch.paging.lock);

       if ( !pg && is_hardware_domain(d) )
       {
           /* Need to allocate more memory from domheap */
           pg = alloc_domheap_page(d, MEMF_no_owner);
           if ( pg == NULL )
           {
               printk(XENLOG_ERR "Failed to allocate pages.\n");
               return pg;
           }
           ACCESS_ONCE(d->arch.paging.total_pages)++;
           page_list_add_tail(pg, &d->arch.paging.freelist);
       }

       return pg;
   }

And basically use d->arch.paging.freelist for both dom0less and dom0 domains,
with the only difference being that in the case of Dom0,
d->arch.paging.freelist could be extended.

Do I understand your idea correctly?

(Probably this is the reply you’re referring to:
   https://lore.kernel.org/xen-devel/43e89225-5e69-49a6-a8c8-bda6d120d8ff@suse.com/
At the moment, I can't find a better one.)


>
>> +    return pg;
>> +}
>> +
>> +/* Allocate a new page table page and hook it in via the given entry. */
>> +static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
>> +{
>> +    struct page_info *page;
>> +    pte_t *p;
>> +
>> +    ASSERT(!p2me_is_valid(p2m, *entry));
>> +
>> +    page = p2m_alloc_page(p2m->domain);
>> +    if ( page == NULL )
>> +        return -ENOMEM;
>> +
>> +    page_list_add(page, &p2m->pages);
>> +
>> +    p = __map_domain_page(page);
>> +    clear_page(p);
>> +
>> +    unmap_domain_page(p);
> clear_domain_page()? Or actually clear_and_clean_page()?

Agree, clear_and_clean_page() would be better here.

>
>> @@ -516,9 +591,33 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>                             unsigned int level, pte_t **table,
>>                             unsigned int offset)
>>   {
>> -    panic("%s: hasn't been implemented yet\n", __func__);
>> +    pte_t *entry;
>> +    int ret;
>> +    mfn_t mfn;
>> +
>> +    entry = *table + offset;
>> +
>> +    if ( !p2me_is_valid(p2m, *entry) )
>> +    {
>> +        if ( !alloc_tbl )
>> +            return GUEST_TABLE_MAP_NONE;
>> +
>> +        ret = p2m_create_table(p2m, entry);
>> +        if ( ret )
>> +            return GUEST_TABLE_MAP_NOMEM;
>> +    }
>> +
>> +    /* The function p2m_next_level() is never called at the last level */
>> +    ASSERT(level != 0);
> Logically you would perhaps better do this ahead of trying to allocate a
> page table. Calls here with level == 0 are invalid in all cases aiui, not
> just when you make it here.

It makes sense. I will move ASSERT() to the start of the function
p2m_next_level().

>> +    if ( p2me_is_mapping(p2m, *entry) )
>> +        return GUEST_TABLE_SUPER_PAGE;
>> +
>> +    mfn = mfn_from_pte(*entry);
>> +
>> +    unmap_domain_page(*table);
>> +    *table = map_domain_page(mfn);
> Just to mention it (may not need taking care of right away), there's an
> inefficiency here: In p2m_create_table() you map the page to clear it.
> Then you tear down that mapping, just to re-establish it here.

I will add:
     /*
      * TODO: There's an inefficiency here:
      *       In p2m_create_table(), the page is mapped to clear it.
      *       Then that mapping is torn down in p2m_create_table(),
      *       only to be re-established here.
      */
     *table = map_domain_page(mfn);

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-16 11:32     ` Oleksii Kurochko
@ 2025-07-16 11:43       ` Jan Beulich
  2025-07-16 15:53         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-16 11:43 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 16.07.2025 13:32, Oleksii Kurochko wrote:
> On 7/2/25 10:35 AM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>       return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>   }
>>>   
>>> +/*
>>> + * pte_is_* helpers are checking the valid bit set in the
>>> + * PTE but we have to check p2m_type instead (look at the comment above
>>> + * p2me_is_valid())
>>> + * Provide our own overlay to check the valid bit.
>>> + */
>>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>>> +{
>>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>>> +}
>> Same question as on the earlier patch - does P2M type apply to intermediate
>> page tables at all? (Conceptually it shouldn't.)
> 
> It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
> to a page — PTE should be valid. Considering that in the current implementation
> it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
> of PTE.v.

I'm confused by this reply. If you want to name 2nd level page table entries
P2M - fine (but unhelpful). But then for any memory access there's only one
of the two involved: A PTE (Xen accesses) or a P2M (guest accesses). Hence
how can there be "PTE.v = 0 but P2M.v = 1"?

An intermediate page table entry is something Xen controls entirely. Hence
it has no (guest induced) type.

>>> @@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
>>>       return e;
>>>   }
>>>   
>>> +/* Generate table entry with correct attributes. */
>>> +static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
>>> +{
>>> +    /*
>>> +     * Since this function generates a table entry, according to "Encoding
>>> +     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
>>> +     * to point to the next level of the page table.
>>> +     * Therefore, to ensure that an entry is a page table entry,
>>> +     * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
>>> +     * which overrides whatever was passed as `p2m_type_t` and guarantees that
>>> +     * the entry is a page table entry by setting r = w = x = 0.
>>> +     */
>>> +    return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw, p2m_access_n2rwx);
>> Similarly P2M access shouldn't apply to intermediate page tables. (Moot
>> with that, but (ab)using p2m_access_n2rwx would also look wrong: You did
>> read what it means, didn't you?)
> 
> |p2m_access_n2rwx| was chosen not really because of the description mentioned near
> its declaration, but because it sets r=w=x=0, which RISC-V expects for a PTE that
> points to the next-level page table.
> 
> Generally, I agree that P2M access shouldn't be applied to intermediate page tables.
> 
> What I can suggest in this case is to use|p2m_access_rwx| instead of|p2m_access_n2rwx|,

No. p2m_access_* shouldn't come into play here at all. Period. Just like P2M types
shouldn't. As per above - intermediate page tables are Xen internal constructs.

> which will ensure that the P2M access type isn't applied when|p2m_entry_from_mfn() |is called, and then, after calling|p2m_entry_from_mfn()|, simply set|PTE.r,w,x=0|.
> So this function will look like:
>      /* Generate table entry with correct attributes. */
>      static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
>      {
>          /*
>          * p2m_ram_rw is chosen for a table entry as p2m table should be valid
>          * from both P2M and hardware point of view.
>          *
>          * p2m_access_rwx is chosen to restrict access permissions, what mean
>          * do not apply access permission for a table entry
>          */
>          pte_t pte = p2m_pte_from_mfn(p2m, page_to_mfn(page), _gfn(0), p2m_ram_rw,
>                                      p2m_access_rwx);
> 
>          /*
>          * Since this function generates a table entry, according to "Encoding
>          * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
>          * to point to the next level of the page table.
>          */
>          pte.pte &= ~PTE_ACCESS_MASK;
> 
>          return pte;
>      }
> 
> Does this make sense? Or would it be better to keep the current version of
> |page_to_p2m_table()| and just improve the comment explaining why|p2m_access_n2rwx |is used for a table entry?

No to both, as per above.

>>> +static struct page_info *p2m_alloc_page(struct domain *d)
>>> +{
>>> +    struct page_info *pg;
>>> +
>>> +    /*
>>> +     * For hardware domain, there should be no limit in the number of pages that
>>> +     * can be allocated, so that the kernel may take advantage of the extended
>>> +     * regions. Hence, allocate p2m pages for hardware domains from heap.
>>> +     */
>>> +    if ( is_hardware_domain(d) )
>>> +    {
>>> +        pg = alloc_domheap_page(d, MEMF_no_owner);
>>> +        if ( pg == NULL )
>>> +            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
>>> +    }
>> The comment looks to have been taken verbatim from Arm. Whatever "extended
>> regions" are, does the same concept even exist on RISC-V?
> 
> Initially, I missed that it’s used only for Arm. Since it was mentioned in
> |doc/misc/xen-command-line.pandoc|, I assumed it applied to all architectures.
> But now I see that it’s Arm-specific:: ### ext_regions (Arm)
> 
>>
>> Also, special casing Dom0 like this has benefits, but also comes with a
>> pitfall: If the system's out of memory, allocations will fail. A pre-
>> populated pool would avoid that (until exhausted, of course). If special-
>> casing of Dom0 is needed, I wonder whether ...
>>
>>> +    else
>>> +    {
>>> +        spin_lock(&d->arch.paging.lock);
>>> +        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>> +        spin_unlock(&d->arch.paging.lock);
>>> +    }
>> ... going this path but with a Dom0-only fallback to general allocation
>> wouldn't be the better route.
> 
> IIUC, then it should be something like:
>    static struct page_info *p2m_alloc_page(struct domain *d)
>    {
>        struct page_info *pg;
>        
>        spin_lock(&d->arch.paging.lock);
>        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>        spin_unlock(&d->arch.paging.lock);
> 
>        if ( !pg && is_hardware_domain(d) )
>        {
>              /* Need to allocate more memory from domheap */
>              pg = alloc_domheap_page(d, MEMF_no_owner);
>              if ( pg == NULL )
>              {
>                  printk(XENLOG_ERR "Failed to allocate pages.\n");
>                  return pg;
>              }
>              ACCESS_ONCE(d->arch.paging.total_pages)++;
>              page_list_add_tail(pg, &d->arch.paging.freelist);
>        }
>     
>        return pg;
> }
> 
> And basically use|d->arch.paging.freelist| for both dom0less and dom0 domains,
> with the only difference being that in the case of Dom0,|d->arch.paging.freelist |could be extended.
> 
> Do I understand your idea correctly?

Broadly yes, but not in the details. For example, I don't think such a
page allocated from the general heap would want appending to freelist.
Commentary and the like also would want tidying.

And of course going forward, for split hardware and control domains the
latter may want similar treatment.
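
For illustration, one possible reading of the above (pool first, with a
hardware-domain-only fallback to the general heap, and without putting the
fallback page on the freelist) is sketched below. This is a standalone model
under stated assumptions, not Xen code: struct sk_page, struct sk_dom,
sk_pool_pop() and sk_alloc_page() are hypothetical names, calloc() merely
stands in for alloc_domheap_page(), and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; these are not the Xen types. */
struct sk_page {
    struct sk_page *next;
};

struct sk_dom {
    bool is_hwdom;             /* models is_hardware_domain(d) */
    struct sk_page *freelist;  /* models the pre-populated P2M pool */
    unsigned int heap_pages;   /* pages obtained via the fallback path */
};

/* Remove and return the head of the pool, or NULL if the pool is empty. */
static struct sk_page *sk_pool_pop(struct sk_dom *d)
{
    struct sk_page *pg = d->freelist;

    if ( pg )
    {
        d->freelist = pg->next;
        pg->next = NULL;
    }

    return pg;
}

/*
 * Pool first. Only the hardware domain may fall back to the general
 * allocator (modelled by calloc()). The fallback page is handed straight
 * to the caller; it is not appended to the freelist, since it is about to
 * be used rather than kept free.
 */
static struct sk_page *sk_alloc_page(struct sk_dom *d)
{
    struct sk_page *pg = sk_pool_pop(d);

    if ( !pg && d->is_hwdom )
    {
        pg = calloc(1, sizeof(*pg));
        if ( pg )
            d->heap_pages++;
    }

    return pg;
}
```

In this model a non-hardware domain simply gets NULL once its pool is empty,
while the hardware domain keeps going for as long as the general allocator
can satisfy requests.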

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-16 11:43       ` Jan Beulich
@ 2025-07-16 15:53         ` Oleksii Kurochko
  2025-07-16 16:12           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-16 15:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/16/25 1:43 PM, Jan Beulich wrote:
> On 16.07.2025 13:32, Oleksii Kurochko wrote:
>> On 7/2/25 10:35 AM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/p2m.c
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>        return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>    }
>>>>    
>>>> +/*
>>>> + * pte_is_* helpers are checking the valid bit set in the
>>>> + * PTE but we have to check p2m_type instead (look at the comment above
>>>> + * p2me_is_valid())
>>>> + * Provide our own overlay to check the valid bit.
>>>> + */
>>>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>>>> +{
>>>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>>>> +}
>>> Same question as on the earlier patch - does P2M type apply to intermediate
>>> page tables at all? (Conceptually it shouldn't.)
>> It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
>> to a page — PTE should be valid. Considering that in the current implementation
>> it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
>> of PTE.v.
> I'm confused by this reply. If you want to name 2nd level page table entries
> P2M - fine (but unhelpful). But then for any memory access there's only one
> of the two involved: A PTE (Xen accesses) or a P2M (guest accesses). Hence
> how can there be "PTE.v = 0 but P2M.v = 1"?

I think I understand your confusion; let me try to rephrase.

The reason for having both p2me_is_valid() and pte_is_valid() is that I want
to be able to use the P2M PTE valid bit to track which pages were accessed
by a vCPU, so that, for example, cleaning and invalidating RAM associated with
the guest vCPU won't be too expensive.
In this case, the P2M PTE valid bit will be set to 0, but the P2M PTE type bits
will be set to something other than p2m_invalid (even for table entries),
so when an MMU fault occurs, we can properly resolve it.

So, if the P2M PTE type (what p2me_is_valid() checks) is set to p2m_invalid, it
means that the valid bit (what pte_is_valid() checks) should be set to 0, so
the P2M PTE is genuinely invalid.

It could also be the case that the P2M PTE type isn't p2m_invalid (and the
P2M PTE valid bit is intentionally set to 0 so that we can track which pages
were accessed, for the reason given above); then, when an MMU fault occurs, we
can handle it properly and set the P2M PTE valid bit to 1...

>
> An intermediate page table entry is something Xen controls entirely. Hence
> it has no (guest induced) type.

... And that is actually the reason why a type needs to be set even for an
intermediate page table entry.

I hope it is now a little bit clearer what was done and why.
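
To make the intended states concrete, here is a minimal standalone sketch of
the scheme described above. It is only a model: struct tr_entry, enum tr_type
and the tr_*() helpers are hypothetical names, and the type is kept in a plain
struct field here purely for illustration. The only hardware fact relied on is
that the V bit is bit 0 of a RISC-V PTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TR_PTE_VALID UINT64_C(0x1)   /* V bit of a RISC-V PTE (bit 0) */

/* Hypothetical type values; only "invalid vs. anything else" matters here. */
enum tr_type { TR_INVALID = 0, TR_RAM_RW };

/* Model of one P2M entry: the hardware PTE plus a separately kept type. */
struct tr_entry {
    uint64_t pte;
    enum tr_type type;
};

/* What the hardware sees: the V bit. */
static bool tr_pte_is_valid(const struct tr_entry *e)
{
    return e->pte & TR_PTE_VALID;
}

/* What the P2M code considers valid: the recorded type. */
static bool tr_p2me_is_valid(const struct tr_entry *e)
{
    return e->type != TR_INVALID;
}

/* Clear V but keep the type, so the next guest access faults. */
static void tr_start_tracking(struct tr_entry *e)
{
    e->pte &= ~TR_PTE_VALID;
}

/*
 * Fault path: if the type says the entry is still valid, the fault was
 * caused by tracking, so set V again and report the fault as resolved.
 * Otherwise the entry is genuinely invalid.
 */
static bool tr_handle_fault(struct tr_entry *e)
{
    if ( !tr_p2me_is_valid(e) )
        return false;

    e->pte |= TR_PTE_VALID;

    return true;
}
```

The sketch only shows why the two checks can disagree in this design; it says
nothing about whether a type is the right thing to keep for table entries.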

>
>>>> @@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
>>>>        return e;
>>>>    }
>>>>    
>>>> +/* Generate table entry with correct attributes. */
>>>> +static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
>>>> +{
>>>> +    /*
>>>> +     * Since this function generates a table entry, according to "Encoding
>>>> +     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
>>>> +     * to point to the next level of the page table.
>>>> +     * Therefore, to ensure that an entry is a page table entry,
>>>> +     * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
>>>> +     * which overrides whatever was passed as `p2m_type_t` and guarantees that
>>>> +     * the entry is a page table entry by setting r = w = x = 0.
>>>> +     */
>>>> +    return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw, p2m_access_n2rwx);
>>> Similarly P2M access shouldn't apply to intermediate page tables. (Moot
>>> with that, but (ab)using p2m_access_n2rwx would also look wrong: You did
>>> read what it means, didn't you?)
>> |p2m_access_n2rwx| was chosen not really because of the description mentioned near
>> its declaration, but because it sets r=w=x=0, which RISC-V expects for a PTE that
>> points to the next-level page table.
>>
>> Generally, I agree that P2M access shouldn't be applied to intermediate page tables.
>>
>> What I can suggest in this case is to use|p2m_access_rwx| instead of|p2m_access_n2rwx|,
> No. p2m_access_* shouldn't come into play here at all.

Okay, then it seems like I just can't explicitly re-use p2m_pte_from_mfn() in
page_to_p2m_table() and have to open-code p2m_pte_from_mfn() or add another
argument, is_table, to decide if p2m_access_t and/or p2m_type_t should be applied.
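
A rough standalone sketch of the second option, with an extra is_table
argument, could look like the following. The names te_make_entry() and
te_is_table() and the TE_* macros are hypothetical and exist only for this
example; the bit positions (V = bit 0, R/W/X = bits 1..3, PPN starting at
bit 10) follow the RISC-V PTE layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RISC-V PTE bits used by this sketch. */
#define TE_PTE_VALID        (UINT64_C(1) << 0)
#define TE_PTE_R            (UINT64_C(1) << 1)
#define TE_PTE_W            (UINT64_C(1) << 2)
#define TE_PTE_X            (UINT64_C(1) << 3)
#define TE_PTE_ACCESS_MASK  (TE_PTE_R | TE_PTE_W | TE_PTE_X)
#define TE_PTE_PPN_SHIFT    10

/*
 * Build an entry for the given frame number. For a table entry no
 * permissions are applied at all, so R = W = X = 0 and the entry points
 * to the next level. For a leaf, the requested permissions are applied.
 */
static uint64_t te_make_entry(uint64_t mfn, uint64_t perms, bool is_table)
{
    uint64_t pte = TE_PTE_VALID | (mfn << TE_PTE_PPN_SHIFT);

    if ( !is_table )
        pte |= perms & TE_PTE_ACCESS_MASK;

    return pte;
}

/* A valid entry with R = W = X = 0 is a pointer to the next-level table. */
static bool te_is_table(uint64_t pte)
{
    return (pte & TE_PTE_VALID) && !(pte & TE_PTE_ACCESS_MASK);
}
```

With this shape, neither an access value nor a type has to be passed for
the table case at all, which is what the is_table flag is meant to express.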

>   Period. Just like P2M types
> shouldn't. As per above - intermediate page tables are Xen internal constructs.

Please see the explanation above of why P2M types are needed here, despite the
fact that logically they aren't really needed.

>
>> which will ensure that the P2M access type isn't applied when|p2m_entry_from_mfn() |is called, and then, after calling|p2m_entry_from_mfn()|, simply set|PTE.r,w,x=0|.
>> So this function will look like:
>>       /* Generate table entry with correct attributes. */
>>       static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
>>       {
>>           /*
>>           * p2m_ram_rw is chosen for a table entry as p2m table should be valid
>>           * from both P2M and hardware point of view.
>>           *
>>           * p2m_access_rwx is chosen to restrict access permissions, what mean
>>           * do not apply access permission for a table entry
>>           */
>>           pte_t pte = p2m_pte_from_mfn(p2m, page_to_mfn(page), _gfn(0), p2m_ram_rw,
>>                                       p2m_access_rwx);
>>
>>           /*
>>           * Since this function generates a table entry, according to "Encoding
>>           * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
>>           * to point to the next level of the page table.
>>           */
>>           pte.pte &= ~PTE_ACCESS_MASK;
>>
>>           return pte;
>>       }
>>
>> Does this make sense? Or would it be better to keep the current version of
>> |page_to_p2m_table()| and just improve the comment explaining why|p2m_access_n2rwx |is used for a table entry?
> No to both, as per above.
>
>>>> +static struct page_info *p2m_alloc_page(struct domain *d)
>>>> +{
>>>> +    struct page_info *pg;
>>>> +
>>>> +    /*
>>>> +     * For hardware domain, there should be no limit in the number of pages that
>>>> +     * can be allocated, so that the kernel may take advantage of the extended
>>>> +     * regions. Hence, allocate p2m pages for hardware domains from heap.
>>>> +     */
>>>> +    if ( is_hardware_domain(d) )
>>>> +    {
>>>> +        pg = alloc_domheap_page(d, MEMF_no_owner);
>>>> +        if ( pg == NULL )
>>>> +            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
>>>> +    }
>>> The comment looks to have been taken verbatim from Arm. Whatever "extended
>>> regions" are, does the same concept even exist on RISC-V?
>> Initially, I missed that it’s used only for Arm. Since it was mentioned in
>> |doc/misc/xen-command-line.pandoc|, I assumed it applied to all architectures.
>> But now I see that it’s Arm-specific:: ### ext_regions (Arm)
>>
>>> Also, special casing Dom0 like this has benefits, but also comes with a
>>> pitfall: If the system's out of memory, allocations will fail. A pre-
>>> populated pool would avoid that (until exhausted, of course). If special-
>>> casing of Dom0 is needed, I wonder whether ...
>>>
>>>> +    else
>>>> +    {
>>>> +        spin_lock(&d->arch.paging.lock);
>>>> +        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>> +        spin_unlock(&d->arch.paging.lock);
>>>> +    }
>>> ... going this path but with a Dom0-only fallback to general allocation
>>> wouldn't be the better route.
>> IIUC, then it should be something like:
>>     static struct page_info *p2m_alloc_page(struct domain *d)
>>     {
>>         struct page_info *pg;
>>         
>>         spin_lock(&d->arch.paging.lock);
>>         pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>         spin_unlock(&d->arch.paging.lock);
>>
>>         if ( !pg && is_hardware_domain(d) )
>>         {
>>               /* Need to allocate more memory from domheap */
>>               pg = alloc_domheap_page(d, MEMF_no_owner);
>>               if ( pg == NULL )
>>               {
>>                   printk(XENLOG_ERR "Failed to allocate pages.\n");
>>                   return pg;
>>               }
>>               ACCESS_ONCE(d->arch.paging.total_pages)++;
>>               page_list_add_tail(pg, &d->arch.paging.freelist);
>>         }
>>      
>>         return pg;
>> }
>>
>> And basically use|d->arch.paging.freelist| for both dom0less and dom0 domains,
>> with the only difference being that in the case of Dom0,|d->arch.paging.freelist |could be extended.
>>
>> Do I understand your idea correctly?
> Broadly yes, but not in the details. For example, I don't think such a
> page allocated from the general heap would want appending to freelist.
> Commentary and alike also would want tidying.

Could you please explain why it wouldn't want appending to freelist?

>
> And of course going forward, for split hardware and control domains the
> latter may want similar treatment.

Could you please clarify the difference between hardware and control
domains?
I thought they were the same. Or is this for the case when we have
dom0 (control domain), which runs domD (hardware domain) and a guest domain?

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-16 11:31       ` Jan Beulich
@ 2025-07-16 16:07         ` Oleksii Kurochko
  2025-07-16 16:18           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-16 16:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/16/25 1:31 PM, Jan Beulich wrote:
> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> --- a/xen/arch/riscv/p2m.c
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>        return __map_domain_page(p2m->root + root_table_indx);
>>>>    }
>>>>    
>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>> See comments on the earlier patch regarding naming.
>>>
>>>> +{
>>>> +    int rc;
>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>> How does this work, when you record GFNs only for Xenheap pages?


>> I think I don't understand what is an issue. Could you please provide
>> some extra details?
> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
> works for 1:1 mapped domains. Can you show me the eventual final implementation
> of the function, making it possible to use it here?

At the moment, I planned to support only 1:1 mapped domains, so it is the final
implementation.

I think that I understand your initial question now. So yes, at the moment, we
have only Xenheap pages, and since GFNs are stored for such pages, it will be
easy to recover the GFN for an MFN, and so it will be easy to implement
mfn_to_gfn() for Xenheap pages.


>   Having such stubs, and not
> even annotated in any way, is imo a problem: People may thing they're fine to
> use when really they aren't.

Then it will be more correct to pass the GFN as an argument, as you suggested
earlier (and I've already added such an argument).

I just initially made the incorrect assumption that it was up to the
implementation of mfn_to_gfn() to support any type of page.

>
>>>> +static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t, p2m_access_t a)
>>>> +{
>>>> +    pte_t e = (pte_t) { 1 };
>>> What's the 1 doing here?
>> Set valid bit of PTE to 1.
> But something like this isn't to be done using a plain, unannotated literal
> number. Aiui you mean PTE_VALID here.

Yes. I will use PTE_VALID instead.

>
>>>> +    switch ( t )
>>>> +    {
>>>> +    case p2m_mmio_direct_dev:
>>>> +        e.pte |= PTE_PBMT_IO;
>>>> +        break;
>>>> +
>>>> +    default:
>>>> +        break;
>>>> +    }
>>>> +
>>>> +    p2m_set_permission(&e, t, a);
>>>> +
>>>> +    ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK));
>>>> +
>>>> +    pte_set_mfn(&e, mfn);
>>> Based on how things work on x86 (and how I would have expected them to also
>>> work on Arm), may I suggest that you set MFN ahead of permissions, so that
>>> the permissions setting function can use the MFN for e.g. a lookup in
>>> mmio_ro_ranges.
>> Sure, just a note that on Arm, the MFN is set last.
> That's apparently because they (still) don't have mmio_ro_ranges. That's only
> a latent issue (I hope) while they still don't have PCI support.
>
>>>> +    BUG_ON(p2m_type_radix_set(p2m, e, t));
>>> I'm not convinced of this error handling here either. Radix tree insertion
>>> _can_ fail, e.g. when there's no memory left. This must not bring down Xen,
>>> or we'll have an XSA right away. You could zap the PTE, or if need be you
>>> could crash the offending domain.
>> IIUC what is "zap the PTE", then I will do in this way:
>>       if ( p2m_set_type(p2m, e, t) )
>>           e.pte = 0;
>>
>> But then it will lead to an MMU failure—how is that expected to be handled?
>> There’s no guarantee that, at the moment of handling this exception, enough
>> memory will be available to set a type for the PTE and also there is not really
>> clear how to detect in exception handler that it is needed just to re-try to
>> set a type. Or should we just call|domain_crash()|?
>> In that case, it seems more reasonable to call|domain_crash() |immediately in
>> |p2m_pte_from_mfn().|
> As said - crashing the domain in such an event is an option. The question
> here is whether to do so right away, or whether to defer that in the hope
> that the PTE may not actually be accessed (before being rewritten).
>
>>> In this context (not sure if I asked before): With this use of a radix tree,
>>> how do you intend to bound the amount of memory that a domain can use, by
>>> making Xen insert very many entries?
>> I didn’t think about that. I assumed it would be enough to set the amount of
>> memory a guest domain can use by specifying|xen,domain-p2m-mem-mb| in the DTS,
>> or using some predefined value if|xen,domain-p2m-mem-mb| isn’t explicitly set.
> Which would require these allocations to come from that pool.

Yes, and it is true only for non-hardware domains with the current implementation.

>
>> Also, it seems this would just lead to the issue you mentioned earlier: when
>> the memory runs out,|domain_crash()| will be called or PTE will be zapped.
> Or one domain exhausting memory would cause another domain to fail. A domain
> impacting just itself may be tolerable. But a domain affecting other domains
> isn't.

But it seems like this issue could happen in any implementation. It won't happen
only if we have just a pre-populated pool for every domain type (hardware,
control, guest domain), without the ability to extend it or to allocate extra
pages from domheap at runtime.
Otherwise, if allocation of extra pages is allowed, then we can't really do
anything about this issue.
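
As a rough model of the "pre-populated pool only" case, the sketch below
charges every piece of per-entry metadata to the owning domain's own budget
and zaps the PTE when that budget is exhausted, instead of bringing Xen down.
All names (struct pl_dom, pl_pool_take(), pl_set_entry()) are hypothetical,
and a simple counter stands in for the real pool.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-domain budget; a counter models the pre-populated pool. */
struct pl_dom {
    unsigned int pool_free;   /* units left in this domain's own pool */
};

/* Take one unit from the domain's own pool; never touches other domains. */
static bool pl_pool_take(struct pl_dom *d)
{
    if ( d->pool_free == 0 )
        return false;

    d->pool_free--;

    return true;
}

/*
 * Install an entry whose type metadata needs one unit of pool memory.
 * If the pool is exhausted, zap the PTE and report -ENOMEM; the caller
 * may then decide to crash the offending domain.
 */
static int pl_set_entry(struct pl_dom *d, uint64_t *pte, uint64_t val)
{
    if ( !pl_pool_take(d) )
    {
        *pte = 0;
        return -ENOMEM;
    }

    *pte = val;

    return 0;
}
```

Here one domain running out of its budget affects only its own entries;
another domain with pool memory left is still served.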


~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-16 15:53         ` Oleksii Kurochko
@ 2025-07-16 16:12           ` Jan Beulich
  2025-07-17  9:42             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-16 16:12 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 16.07.2025 17:53, Oleksii Kurochko wrote:
> On 7/16/25 1:43 PM, Jan Beulich wrote:
>> On 16.07.2025 13:32, Oleksii Kurochko wrote:
>>> On 7/2/25 10:35 AM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/p2m.c
>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>        return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>    }
>>>>>    
>>>>> +/*
>>>>> + * pte_is_* helpers are checking the valid bit set in the
>>>>> + * PTE but we have to check p2m_type instead (look at the comment above
>>>>> + * p2me_is_valid())
>>>>> + * Provide our own overlay to check the valid bit.
>>>>> + */
>>>>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>>>>> +{
>>>>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>>>>> +}
>>>> Same question as on the earlier patch - does P2M type apply to intermediate
>>>> page tables at all? (Conceptually it shouldn't.)
>>> It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
>>> to a page — PTE should be valid. Considering that in the current implementation
>>> it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
>>> of PTE.v.
>> I'm confused by this reply. If you want to name 2nd level page table entries
>> P2M - fine (but unhelpful). But then for any memory access there's only one
>> of the two involved: A PTE (Xen accesses) or a P2M (guest accesses). Hence
>> how can there be "PTE.v = 0 but P2M.v = 1"?
> 
> I think I understand your confusion, let me try to rephrase.
> 
> The reason for having both|p2m_is_valid()| and|pte_is_valid()| is that I want to
> have the ability to use the P2M PTE valid bit to track which pages were accessed
> by a vCPU, so that cleaning and invalidating RAM associated with the guest vCPU
> won't be too expensive, for example.

I don't know what you're talking about here.

> In this case, the P2M PTE valid bit will be set to 0, but the P2M PTE type bits
> will be set to something other than|p2m_invalid| (even for a table entries),
> so when an MMU fault occurs, we can properly resolve it.
> 
> So, if the P2M PTE type (what|p2m_is_valid()| checks) is set to|p2m_invalid|, it
> means that the valid bit (what|pte_is_valid()| checks) should be set to 0, so
> the P2M PTE is genuinely invalid.
> 
> It could also be the case that the P2M PTE type isn't|p2m_invalid (and P2M PTE valid will be intentionally set to 0 to have 
> ability to track which pages were accessed for the reason I wrote above)|, and when MMU fault occurs we could
> properly handle it and set to 1 P2M PTE valid bit to 1...
> 
>>
>> An intermediate page table entry is something Xen controls entirely. Hence
>> it has no (guest induced) type.
> 
> ... And actually it is a reason why it is needed to set a type even for an
> intermediate page table entry.
> 
> I hope now it is a lit bit clearer what and why was done.

Sadly not. I still don't see what use the P2M type of an intermediate page
table is going to be. It surely can't reliably describe all of the entries that
page table holds. Intermediate page tables and leaf pages are just too different
to share a concept like this, I think. That said, I'll be happy to be shown code
demonstrating the contrary.

>>>>> +static struct page_info *p2m_alloc_page(struct domain *d)
>>>>> +{
>>>>> +    struct page_info *pg;
>>>>> +
>>>>> +    /*
>>>>> +     * For hardware domain, there should be no limit in the number of pages that
>>>>> +     * can be allocated, so that the kernel may take advantage of the extended
>>>>> +     * regions. Hence, allocate p2m pages for hardware domains from heap.
>>>>> +     */
>>>>> +    if ( is_hardware_domain(d) )
>>>>> +    {
>>>>> +        pg = alloc_domheap_page(d, MEMF_no_owner);
>>>>> +        if ( pg == NULL )
>>>>> +            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
>>>>> +    }
>>>> The comment looks to have been taken verbatim from Arm. Whatever "extended
>>>> regions" are, does the same concept even exist on RISC-V?
>>> Initially, I missed that it’s used only for Arm. Since it was mentioned in
>>> |doc/misc/xen-command-line.pandoc|, I assumed it applied to all architectures.
>>> But now I see that it’s Arm-specific:: ### ext_regions (Arm)
>>>
>>>> Also, special casing Dom0 like this has benefits, but also comes with a
>>>> pitfall: If the system's out of memory, allocations will fail. A pre-
>>>> populated pool would avoid that (until exhausted, of course). If special-
>>>> casing of Dom0 is needed, I wonder whether ...
>>>>
>>>>> +    else
>>>>> +    {
>>>>> +        spin_lock(&d->arch.paging.lock);
>>>>> +        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>> +        spin_unlock(&d->arch.paging.lock);
>>>>> +    }
>>>> ... going this path but with a Dom0-only fallback to general allocation
>>>> wouldn't be the better route.
>>> IIUC, then it should be something like:
>>>     static struct page_info *p2m_alloc_page(struct domain *d)
>>>     {
>>>         struct page_info *pg;
>>>         
>>>         spin_lock(&d->arch.paging.lock);
>>>         pg = page_list_remove_head(&d->arch.paging.p2m_freelist);

Note this: Here you _remove_ from freelist, because you want to actually
use the page. Then clearly ...

>>>         spin_unlock(&d->arch.paging.lock);
>>>
>>>         if ( !pg && is_hardware_domain(d) )
>>>         {
>>>               /* Need to allocate more memory from domheap */
>>>               pg = alloc_domheap_page(d, MEMF_no_owner);
>>>               if ( pg == NULL )
>>>               {
>>>                   printk(XENLOG_ERR "Failed to allocate pages.\n");
>>>                   return pg;
>>>               }
>>>               ACCESS_ONCE(d->arch.paging.total_pages)++;
>>>               page_list_add_tail(pg, &d->arch.paging.freelist);
>>>         }
>>>      
>>>         return pg;
>>> }
>>>
>>> And basically use|d->arch.paging.freelist| for both dom0less and dom0 domains,
>>> with the only difference being that in the case of Dom0,|d->arch.paging.freelist |could be extended.
>>>
>>> Do I understand your idea correctly?
>> Broadly yes, but not in the details. For example, I don't think such a
>> page allocated from the general heap would want appending to freelist.
>> Commentary and alike also would want tidying.
> 
> Could you please explain why it wouldn't want appending to freelist?

... adding to freelist here is wrong: You want to use this separately
allocated page, too. Else once it is freed it'll be added to freelist
a 2nd time, leading to a corrupt list.
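
To illustrate the point about the freelist, here is a minimal, self-contained
sketch. It assumes simplified stand-in types (struct page, struct dom) and a
toy heap_alloc(); none of these are Xen's real definitions, and locking is
left out. The heap fallback hands the page straight to the caller, so a page
is either in use or on the freelist, never both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for Xen's types; all names here are illustrative. */
struct page {
    struct page *next;
};

struct dom {
    bool is_hwdom;           /* stand-in for is_hardware_domain(d) */
    struct page *freelist;   /* stand-in for the pre-populated P2M pool */
};

/* Toy "general heap": a small static array handed out page by page. */
static struct page heap_pages[4];
static size_t heap_used;

static struct page *heap_alloc(void)
{
    struct page *pg;

    if ( heap_used == sizeof(heap_pages) / sizeof(heap_pages[0]) )
        return NULL;

    pg = &heap_pages[heap_used++];
    pg->next = NULL;
    return pg;
}

/* Remove the head of the pool, because the caller is going to use it. */
static struct page *pool_pop(struct dom *d)
{
    struct page *pg = d->freelist;

    if ( pg )
    {
        d->freelist = pg->next;
        pg->next = NULL;
    }
    return pg;
}

static struct page *sketch_alloc_page(struct dom *d)
{
    struct page *pg = pool_pop(d);

    /* Hardware-domain-only fallback: the page goes to the caller directly. */
    if ( !pg && d->is_hwdom )
        pg = heap_alloc();

    return pg;
}

/* Freeing is the only place where a page is linked into the freelist. */
static void sketch_free_page(struct dom *d, struct page *pg)
{
    assert(pg->next == NULL);   /* must not already be linked somewhere */
    pg->next = d->freelist;
    d->freelist = pg;
}
```

With this shape, freeing a heap-allocated page links it into the list exactly
once, which avoids the double insertion described above.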

>> And of course going forward, for split hardware and control domains the
>> latter may want similar treatment.
> 
> Could you please clarify what is the difference between hardware and control
> domains?
> I thought that it is the same or is it for the case when we have
> dom0 (control domain) which runs domD (hardware domain) and guest domain?

That's the common case, yes, but conceptually the two can be separate.
And if you've followed recent discussions on the list you would also
have noticed that work is being done in that direction. (But this was
really a forward-looking comment; I didn't mean to make you cover that
case right away. Just wanted you to be aware.)

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-16 16:07         ` Oleksii Kurochko
@ 2025-07-16 16:18           ` Jan Beulich
  2025-07-17  8:56             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-16 16:18 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 16.07.2025 18:07, Oleksii Kurochko wrote:
> On 7/16/25 1:31 PM, Jan Beulich wrote:
>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> --- a/xen/arch/riscv/p2m.c
>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>        return __map_domain_page(p2m->root + root_table_indx);
>>>>>    }
>>>>>    
>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>> See comments on the earlier patch regarding naming.
>>>>
>>>>> +{
>>>>> +    int rc;
>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>> How does this work, when you record GFNs only for Xenheap pages?
> 
> 
>>> I think I don't understand what is an issue. Could you please provide
>>> some extra details?
>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>> of the function, making it possible to use it here?
> 
> At the moment, I planned to support only 1:1 mapped domains, so it is final
> implementation.

Isn't that an overly severe limitation?

>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>> making Xen insert very many entries?
>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>> memory a guest domain can use by specifying|xen,domain-p2m-mem-mb| in the DTS,
>>> or using some predefined value if|xen,domain-p2m-mem-mb| isn’t explicitly set.
>> Which would require these allocations to come from that pool.
> 
> Yes, and it is true only for non-hardware domains with the current implementation.

???

>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>> the memory runs out,|domain_crash()| will be called or PTE will be zapped.
>> Or one domain exhausting memory would cause another domain to fail. A domain
>> impacting just itself may be tolerable. But a domain affecting other domains
>> isn't.
> 
> But it seems like this issue could happen in any implementation. It won't happen only
> if we will have only pre-populated pool for any domain type (hardware, control, guest
> domain) without ability to extend them or allocate extra pages from domheap in runtime.
> Otherwise, if extra pages allocation is allowed then we can't really do something
> with this issue.

But that's why I brought this up: You simply have to. Or, as indicated, the
moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
This is the kind of thing you need to consider up front. Or at least mark with
a prominent FIXME annotation. All of which would need resolving before even
considering to mark code as supported.

Jan



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-16 16:18           ` Jan Beulich
@ 2025-07-17  8:56             ` Oleksii Kurochko
  2025-07-17 10:25               ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-17  8:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel


On 7/16/25 6:18 PM, Jan Beulich wrote:
> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>         return __map_domain_page(p2m->root + root_table_indx);
>>>>>>     }
>>>>>>     
>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>> See comments on the earlier patch regarding naming.
>>>>>
>>>>>> +{
>>>>>> +    int rc;
>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>
>>>> I think I don't understand what is an issue. Could you please provide
>>>> some extra details?
>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>> of the function, making it possible to use it here?
>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>> implementation.
> Isn't that an overly severe limitation?

I wouldn't say that it's a severe limitation, as it's just a matter of how
mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
mfn_to_gfn() can be implemented differently, while the code where it’s called
will likely remain unchanged.

What I meant in my reply is that, for the current state and current limitations,
this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
see the value in, or the need for, non-1:1 mapped domains; it's just that this
limitation simplifies development at the current stage of the RISC-V port.

>
>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>> making Xen insert very many entries?
>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>> memory a guest domain can use by specifying|xen,domain-p2m-mem-mb| in the DTS,
>>>> or using some predefined value if|xen,domain-p2m-mem-mb| isn’t explicitly set.
>>> Which would require these allocations to come from that pool.
>> Yes, and it is true only for non-hardware domains with the current implementation.
> ???

I meant that, at the moment, the pool is used only for non-hardware domains.

>
>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>> the memory runs out,|domain_crash()| will be called or PTE will be zapped.
>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>> impacting just itself may be tolerable. But a domain affecting other domains
>>> isn't.
>> But it seems like this issue could happen in any implementation. It won't happen only
>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>> Otherwise, if extra pages allocation is allowed then we can't really do something
>> with this issue.
> But that's why I brought this up: You simply have to. Or, as indicated, the
> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.

Why isn't it an XSA for other architectures? Arm, at least, should then have
such an XSA.
I don't understand why x86 wouldn't have the same issue. Memory is a limited
and shared resource, so if one domain uses too much memory, it could
happen that other domains won't have enough memory for their purposes...

> This is the kind of thing you need to consider up front. Or at least mark with
> a prominent FIXME annotation. All of which would need resolving before even
> considering to mark code as supported.

... At the moment, I’m trying to understand whether this issue can be solved
properly at all when a domain is allowed to request or map extra memory for its
own purposes.

The only solution I see is that each domain, regardless of its type, should have
its own pre-populated pool. This way, during construction, we’ll know whether
the domain can be created or whether we’ve run out of memory, which would mean
that no more domains can be launched.
And if, at runtime, a domain has no free pages left in its pre-populated pool,
then just stop the domain (or return an error to the domain saying that there is
no memory left, and let the domain decide what to do). Otherwise, if I start
allocating extra memory for a domain which has no free pages in its pool, it
could lead to the XSA issue you mentioned: one domain could exhaust memory, so
that another domain would, at the very least, be unable to allocate extra pages
(in case that other domain also has no free pages in its pool).
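
A rough sketch of the pre-populated pool idea described above, under stated
assumptions: struct sketch_pool and the pool_*() helpers are hypothetical names,
malloc()/free() merely stand in for the hypervisor heap, and locking is
omitted. Population either reserves every page up front or fails cleanly, and
an empty pool at runtime yields NULL instead of falling back to the shared
heap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool types; malloc()/free() stand in for the real heap. */
struct pool_page {
    struct pool_page *next;
};

struct sketch_pool {
    struct pool_page *free;
    unsigned int nr_free;
};

/* Return every page still in the pool to the (stand-in) heap. */
static void pool_teardown(struct sketch_pool *pool)
{
    while ( pool->free )
    {
        struct pool_page *pg = pool->free;

        pool->free = pg->next;
        free(pg);
    }
    pool->nr_free = 0;
}

/* Construction time: reserve all pages now, or report failure. */
static int pool_populate(struct sketch_pool *pool, unsigned int nr_pages)
{
    pool->free = NULL;
    pool->nr_free = 0;

    for ( unsigned int i = 0; i < nr_pages; i++ )
    {
        struct pool_page *pg = malloc(sizeof(*pg));

        if ( !pg )
        {
            pool_teardown(pool);
            return -ENOMEM;   /* the domain cannot be created */
        }
        pg->next = pool->free;
        pool->free = pg;
        pool->nr_free++;
    }
    return 0;
}

/* Runtime: only the domain's own pool is consulted, so usage is bounded. */
static struct pool_page *pool_alloc(struct sketch_pool *pool)
{
    struct pool_page *pg = pool->free;

    if ( !pg )
        return NULL;          /* caller decides: error out or stop the domain */

    pool->free = pg->next;
    pool->nr_free--;
    pg->next = NULL;
    return pg;
}

static void pool_free(struct sketch_pool *pool, struct pool_page *pg)
{
    assert(pg != NULL && pg->next == NULL);
    pg->next = pool->free;
    pool->free = pg;
    pool->nr_free++;
}
```

Because pool_alloc() never touches the shared heap, exhausting one domain's
pool affects only that domain in this sketch.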

~ Oleksii



* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-16 16:12           ` Jan Beulich
@ 2025-07-17  9:42             ` Oleksii Kurochko
  2025-07-17 10:37               ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-17  9:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/16/25 6:12 PM, Jan Beulich wrote:
> On 16.07.2025 17:53, Oleksii Kurochko wrote:
>> On 7/16/25 1:43 PM, Jan Beulich wrote:
>>> On 16.07.2025 13:32, Oleksii Kurochko wrote:
>>>> On 7/2/25 10:35 AM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>         return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>     }
>>>>>>     
>>>>>> +/*
>>>>>> + * pte_is_* helpers are checking the valid bit set in the
>>>>>> + * PTE but we have to check p2m_type instead (look at the comment above
>>>>>> + * p2me_is_valid())
>>>>>> + * Provide our own overlay to check the valid bit.
>>>>>> + */
>>>>>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>>>>>> +{
>>>>>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>>>>>> +}
>>>>> Same question as on the earlier patch - does P2M type apply to intermediate
>>>>> page tables at all? (Conceptually it shouldn't.)
>>>> It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
>>>> to a page — PTE should be valid. Considering that in the current implementation
>>>> it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
>>>> of PTE.v.
>>> I'm confused by this reply. If you want to name 2nd level page table entries
>>> P2M - fine (but unhelpful). But then for any memory access there's only one
>>> of the two involved: A PTE (Xen accesses) or a P2M (guest accesses). Hence
>>> how can there be "PTE.v = 0 but P2M.v = 1"?
>> I think I understand your confusion, let me try to rephrase.
>>
>> The reason for having both|p2m_is_valid()| and|pte_is_valid()| is that I want to
>> have the ability to use the P2M PTE valid bit to track which pages were accessed
>> by a vCPU, so that cleaning and invalidating RAM associated with the guest vCPU
>> won't be too expensive, for example.
> I don't know what you're talking about here.

https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c#L1649

>
>> In this case, the P2M PTE valid bit will be set to 0, but the P2M PTE type bits
>> will be set to something other than|p2m_invalid| (even for a table entries),
>> so when an MMU fault occurs, we can properly resolve it.
>>
>> So, if the P2M PTE type (what|p2m_is_valid()| checks) is set to|p2m_invalid|, it
>> means that the valid bit (what|pte_is_valid()| checks) should be set to 0, so
>> the P2M PTE is genuinely invalid.
>>
>> It could also be the case that the P2M PTE type isn't p2m_invalid (and the P2M
>> PTE valid bit is intentionally set to 0, to be able to track which pages were
>> accessed, for the reason I wrote above), and when an MMU fault occurs we can
>> properly handle it and set the P2M PTE valid bit to 1...
>>
>>> An intermediate page table entry is something Xen controls entirely. Hence
>>> it has no (guest induced) type.
>> ... And actually it is a reason why it is needed to set a type even for an
>> intermediate page table entry.
>>
>> I hope now it is a lit bit clearer what and why was done.
> Sadly not. I still don't see what use the P2M type of an intermediate page
> table is going to be. It surely can't reliably describe all of the entries that
> page table holds. Intermediate page tables and leaf pages are just too different
> to share a concept like this, I think. That said, I'll be happy to be shown code
> demonstrating the contrary.

Then a new p2m_type_t, p2m_table, needs to be introduced and used.
Would that be better?

I still need some type to be able to distinguish whether a p2m entry is valid or
not from both the p2m management and the hardware point of view.
If there is no need for such a distinction, why do all archs introduce
p2m_invalid? Isn't it enough just to use the P2M PTE valid bit?

>
>>>>>> +static struct page_info *p2m_alloc_page(struct domain *d)
>>>>>> +{
>>>>>> +    struct page_info *pg;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * For hardware domain, there should be no limit in the number of pages that
>>>>>> +     * can be allocated, so that the kernel may take advantage of the extended
>>>>>> +     * regions. Hence, allocate p2m pages for hardware domains from heap.
>>>>>> +     */
>>>>>> +    if ( is_hardware_domain(d) )
>>>>>> +    {
>>>>>> +        pg = alloc_domheap_page(d, MEMF_no_owner);
>>>>>> +        if ( pg == NULL )
>>>>>> +            printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
>>>>>> +    }
>>>>> The comment looks to have been taken verbatim from Arm. Whatever "extended
>>>>> regions" are, does the same concept even exist on RISC-V?
>>>> Initially, I missed that it’s used only for Arm. Since it was mentioned in
>>>> |doc/misc/xen-command-line.pandoc|, I assumed it applied to all architectures.
>>>> But now I see that it’s Arm-specific:: ### ext_regions (Arm)
>>>>
>>>>> Also, special casing Dom0 like this has benefits, but also comes with a
>>>>> pitfall: If the system's out of memory, allocations will fail. A pre-
>>>>> populated pool would avoid that (until exhausted, of course). If special-
>>>>> casing of Dom0 is needed, I wonder whether ...
>>>>>
>>>>>> +    else
>>>>>> +    {
>>>>>> +        spin_lock(&d->arch.paging.lock);
>>>>>> +        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
>>>>>> +        spin_unlock(&d->arch.paging.lock);
>>>>>> +    }
>>>>> ... going this path but with a Dom0-only fallback to general allocation
>>>>> wouldn't be the better route.
>>>> IIUC, then it should be something like:
>>>>      static struct page_info *p2m_alloc_page(struct domain *d)
>>>>      {
>>>>          struct page_info *pg;
>>>>          
>>>>          spin_lock(&d->arch.paging.lock);
>>>>          pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
> Note this: Here you _remove_ from freelist, because you want to actually
> use the page. Then clearly ...
>
>>>>          spin_unlock(&d->arch.paging.lock);
>>>>
>>>>          if ( !pg && is_hardware_domain(d) )
>>>>          {
>>>>                /* Need to allocate more memory from domheap */
>>>>                pg = alloc_domheap_page(d, MEMF_no_owner);
>>>>                if ( pg == NULL )
>>>>                {
>>>>                    printk(XENLOG_ERR "Failed to allocate pages.\n");
>>>>                    return pg;
>>>>                }
>>>>                ACCESS_ONCE(d->arch.paging.total_pages)++;
>>>>                page_list_add_tail(pg, &d->arch.paging.freelist);
>>>>          }
>>>>       
>>>>          return pg;
>>>> }
>>>>
>>>> And basically use|d->arch.paging.freelist| for both dom0less and dom0 domains,
>>>> with the only difference being that in the case of Dom0,|d->arch.paging.freelist |could be extended.
>>>>
>>>> Do I understand your idea correctly?
>>> Broadly yes, but not in the details. For example, I don't think such a
>>> page allocated from the general heap would want appending to freelist.
>>> Commentary and alike also would want tidying.
>> Could you please explain why it wouldn't want appending to freelist?
> ... adding to freelist here is wrong: You want to use this separately
> allocated page, too. Else once it is freed it'll be added to freelist
> a 2nd time, leading to a corrupt list.

Got it, I understand why it shouldn’t be added to the freelist.

Incrementing total_pages still makes sense, right?

>
>>> And of course going forward, for split hardware and control domains the
>>> latter may want similar treatment.
>> Could you please clarify what is the difference between hardware and control
>> domains?
>> I thought that it is the same or is it for the case when we have
>> dom0 (control domain) which runs domD (hardware domain) and guest domain?
> That's the common case, yes, but conceptually the two can be separate.
> And if you've followed recent discussions on the list you would also
> have noticed that work is being done in that direction. (But this was
> really a forward-looking comment; I didn't mean to make you cover that
> case right away. Just wanted you to be aware.)

Thanks.

~ Oleksii



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-17  8:56             ` Oleksii Kurochko
@ 2025-07-17 10:25               ` Jan Beulich
  2025-07-18  9:52                 ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-17 10:25 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 17.07.2025 10:56, Oleksii Kurochko wrote:
> On 7/16/25 6:18 PM, Jan Beulich wrote:
>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>         return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>     }
>>>>>>>     
>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>> See comments on the earlier patch regarding naming.
>>>>>>
>>>>>>> +{
>>>>>>> +    int rc;
>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>
>>>>> I think I don't understand what is an issue. Could you please provide
>>>>> some extra details?
>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>> of the function, making it possible to use it here?
>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>> implementation.
>> Isn't that an overly severe limitation?
> 
> I wouldn't say that it's a severe limitation, as it's just a matter of how
> |mfn_to_gfn()| is implemented. When non-1:1 mapped domains are supported,
> |mfn_to_gfn()| can be implemented differently, while the code where it’s called
> will likely remain unchanged.
> 
> What I meant in my reply is that, for the current state and current limitations,
> this is the final implementation of|mfn_to_gfn()|. But that doesn't mean I don't
> see the value in, or the need for, non-1:1 mapped domains—it's just that this
> limitation simplifies development at the current stage of the RISC-V port.

Simplification is fine in some cases, but not supporting the "normal" way of
domain construction looks like a pretty odd restriction. I'm also curious
how you envision to implement mfn_to_gfn() then, suitable for generic use like
the one here. Imo, current limitation or not, you simply want to avoid use of
that function outside of the special gnttab case.

>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>> making Xen insert very many entries?
>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>> memory a guest domain can use by specifying|xen,domain-p2m-mem-mb| in the DTS,
>>>>> or using some predefined value if|xen,domain-p2m-mem-mb| isn’t explicitly set.
>>>> Which would require these allocations to come from that pool.
>>> Yes, and it is true only for non-hardware domains with the current implementation.
>> ???
> 
> I meant that pool is used now only for non-hardware domains at the moment.

And how does this matter here? The memory required for the radix tree doesn't
come from that pool anyway.

>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>> the memory runs out,|domain_crash()| will be called or PTE will be zapped.
>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>> isn't.
>>> But it seems like this issue could happen in any implementation. It won't happen only
>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>> with this issue.
>> But that's why I brought this up: You simply have to. Or, as indicated, the
>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
> 
> Why it isn't XSA for other architectures? At least, Arm then should have such
> XSA.

Does Arm use a radix tree for storing types? It uses one for mem-access, but
it's not clear to me whether that's actually a supported feature.

> I don't understand why x86 won't have the same issue. Memory is the limited
> and shared resource, so if one of the domain will use to much memory then it could
> happen that other domains won't have enough memory for its purpose...

The question is whether allocations are bounded. With this use of a radix tree,
you give domains a way to have Xen allocate pretty much arbitrary amounts of
memory to populate that tree. That unbounded-ness is the problem, not memory
allocations in general.

Jan



* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-17  9:42             ` Oleksii Kurochko
@ 2025-07-17 10:37               ` Jan Beulich
  2025-07-18 11:19                 ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-17 10:37 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 17.07.2025 11:42, Oleksii Kurochko wrote:
> On 7/16/25 6:12 PM, Jan Beulich wrote:
>> On 16.07.2025 17:53, Oleksii Kurochko wrote:
>>> On 7/16/25 1:43 PM, Jan Beulich wrote:
>>>> On 16.07.2025 13:32, Oleksii Kurochko wrote:
>>>>> On 7/2/25 10:35 AM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>         return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>     }
>>>>>>>     
>>>>>>> +/*
>>>>>>> + * pte_is_* helpers are checking the valid bit set in the
>>>>>>> + * PTE but we have to check p2m_type instead (look at the comment above
>>>>>>> + * p2me_is_valid())
>>>>>>> + * Provide our own overlay to check the valid bit.
>>>>>>> + */
>>>>>>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>>>>>>> +{
>>>>>>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>>>>>>> +}
>>>>>> Same question as on the earlier patch - does P2M type apply to intermediate
>>>>>> page tables at all? (Conceptually it shouldn't.)
>>>>> It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
>>>>> to a page — PTE should be valid. Considering that in the current implementation
>>>>> it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
>>>>> of PTE.v.
>>>> I'm confused by this reply. If you want to name 2nd level page table entries
>>>> P2M - fine (but unhelpful). But then for any memory access there's only one
>>>> of the two involved: A PTE (Xen accesses) or a P2M (guest accesses). Hence
>>>> how can there be "PTE.v = 0 but P2M.v = 1"?
>>> I think I understand your confusion, let me try to rephrase.
>>>
>>> The reason for having both|p2m_is_valid()| and|pte_is_valid()| is that I want to
>>> have the ability to use the P2M PTE valid bit to track which pages were accessed
>>> by a vCPU, so that cleaning and invalidating RAM associated with the guest vCPU
>>> won't be too expensive, for example.
>> I don't know what you're talking about here.
> 
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c#L1649

How does that Arm function matter here? Aiui you don't need anything like that
in RISC-V, as there caches don't need disabling temporarily.

>>> In this case, the P2M PTE valid bit will be set to 0, but the P2M PTE type bits
>>> will be set to something other than|p2m_invalid| (even for a table entries),
>>> so when an MMU fault occurs, we can properly resolve it.
>>>
>>> So, if the P2M PTE type (what|p2m_is_valid()| checks) is set to|p2m_invalid|, it
>>> means that the valid bit (what|pte_is_valid()| checks) should be set to 0, so
>>> the P2M PTE is genuinely invalid.
>>>
>>> It could also be the case that the P2M PTE type isn't p2m_invalid (and the P2M
>>> PTE valid bit is intentionally set to 0, to be able to track which pages were
>>> accessed, for the reason I wrote above), and when an MMU fault occurs we can
>>> properly handle it and set the P2M PTE valid bit to 1...
>>>
>>>> An intermediate page table entry is something Xen controls entirely. Hence
>>>> it has no (guest induced) type.
>>> ... And actually it is a reason why it is needed to set a type even for an
>>> intermediate page table entry.
>>>
>>> I hope now it is a lit bit clearer what and why was done.
>> Sadly not. I still don't see what use the P2M type of an intermediate page
>> table is going to be. It surely can't reliably describe all of the entries that
>> page table holds. Intermediate page tables and leaf pages are just too different
>> to share a concept like this, I think. That said, I'll be happy to be shown code
>> demonstrating the contrary.
> 
> Then it is needed to introduce new p2m_type_t - p2m_table and use it.
> Would it be better?
> 
> I still need some type to have ability to distinguish if p2m is valid or not from
> p2m management and hardware point of view.
> If there is no need for such distinguish why all archs introduce p2m_invalid?
> Isn't enough just to use P2M PTE valid bit?

At least on x86 we don't tag intermediate page tables with P2M types. For
ordinary leaf entries the situation is different, as there may be varying
reasons why a PTE has its valid (on x86: present) bit cleared. Hence the
type is relevant there, just to know what to do when a page is accessed
through such a not-present PTE.
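
As a rough illustration of why the type matters for leaf entries whose valid
bit is clear, consider the sketch below. The enum values and the action names
are hypothetical and deliberately simplified; they are not the p2m_type_t
values of any architecture. The type alone decides what a fault on a
not-valid leaf entry means.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified types; not Xen's real p2m_type_t values. */
enum sketch_type {
    SKETCH_INVALID,     /* nothing is mapped here */
    SKETCH_RAM,         /* RAM whose valid bit was cleared on purpose */
    SKETCH_PAGED_OUT,   /* contents currently live elsewhere */
};

enum sketch_action {
    ACTION_INJECT_FAULT,   /* genuine fault: forward it to the guest */
    ACTION_REVALIDATE,     /* set the valid bit again and retry the access */
    ACTION_PAGE_IN,        /* bring the contents back before retrying */
};

struct sketch_leaf {
    bool valid;
    enum sketch_type type;
};

/* Decide what to do for a fault taken through a not-valid leaf entry. */
static enum sketch_action sketch_handle_fault(struct sketch_leaf *leaf)
{
    assert(!leaf->valid);   /* only reached for not-valid entries */

    switch ( leaf->type )
    {
    case SKETCH_RAM:
        leaf->valid = true;
        return ACTION_REVALIDATE;

    case SKETCH_PAGED_OUT:
        return ACTION_PAGE_IN;

    case SKETCH_INVALID:
    default:
        return ACTION_INJECT_FAULT;
    }
}
```

Nothing comparable is modelled for intermediate tables here: the sketch only
ever looks at leaf entries.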

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-07-02  9:25   ` Jan Beulich
@ 2025-07-17 16:37     ` Oleksii Kurochko
  2025-07-21 13:34       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-17 16:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 10752 bytes --]


On 7/2/25 11:25 AM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Add support for breaking down large memory mappings ("superpages") in the RISC-V
>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>> can be inserted into lower levels of the page table hierarchy.
>>
>> To implement that the following is done:
>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>    smaller page table entries down to the target level, preserving original
>>    permissions and attributes.
>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>    entries at lower levels within a superpage-mapped region.
>>
>> This implementation is based on the ARM code, with modifications to the part
>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>> not require BBM, so there is no need to invalidate the PTE and flush the
>> TLB before updating it with the newly created, split page table.
> But some flushing is going to be necessary. As long as you only ever do
> global flushes, the one after the individual PTE modification (within the
> split table) will do (if BBM isn't required, see below), but once you move
> to more fine-grained flushing, that's not going to be enough anymore. Not
> sure it's a good idea to leave such a pitfall.

I don't think I fully understand what the issue is.

>
> As to (no need for) BBM: I couldn't find anything to that effect in the
> privileged spec. Can you provide some pointer? What I found instead is e.g.
> this sentence: "To ensure that implicit reads observe writes to the same
> memory locations, an SFENCE.VMA instruction must be executed after the
> writes to flush the relevant cached translations." And this: "Accessing the
> same location using different cacheability attributes may cause loss of
> coherence." (This may not only occur when the same physical address is
> mapped twice at different VAs, but also after the shattering of a superpage
> when the new entry differs in cacheability.)

I also couldn't find the RISC-V spec mandating BBM explicitly the way Arm does.

What I meant is that on RISC-V one can do:
- Write new PTE
- Flush TLB

While on Arm it is almost always needed to do:
- Write zero to PTE
- Flush TLB
- Write new PTE

For example, the common CoW code path, where you copy from a read-only page to
a new page and then map that new page as writable, just works on RISC-V without
extra considerations, while on Arm it requires BBM.

It seems to me that BBM is mandated for Arm only because the TLB is shared
among cores, so there is no guarantee that no prior entry for the same VA
remains in the TLB. On RISC-V the TLB isn't shared, and after a flush it is
guaranteed that no prior entry for the same VA remains in the TLB.

At the same time, I guess there could be cases where BBM will be needed on
RISC-V too. Even the CoW case mentioned above will need some kind of BBM, but
nothing will make the CPU misbehave when CoW is done without BBM on RISC-V.
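
To make the two update sequences listed above concrete, here is a minimal C
sketch. It is only a model under stated assumptions: demo_pte_t,
demo_flush_tlb() and both update helpers are hypothetical names, and the
"flush" merely bumps a counter instead of executing a fence instruction. It
does not settle whether the privileged spec allows skipping BBM.

```c
#include <stdint.h>

/* Hypothetical, simplified model: one PTE slot and a flush counter. */
typedef struct {
    uint64_t raw;
} demo_pte_t;

static unsigned int demo_flush_count;

/* Stand-in for a TLB flush; real code would issue the relevant fence. */
static void demo_flush_tlb(void)
{
    demo_flush_count++;
}

/* Sequence described for RISC-V: write the new PTE, then flush. */
static void demo_update_direct(demo_pte_t *slot, demo_pte_t new_pte)
{
    slot->raw = new_pte.raw;
    demo_flush_tlb();
}

/* Break-before-make as described for Arm: zero, flush, write new PTE. */
static void demo_update_bbm(demo_pte_t *slot, demo_pte_t new_pte)
{
    slot->raw = 0;            /* break: no valid translation is visible */
    demo_flush_tlb();         /* drop any cached copy of the old entry */
    slot->raw = new_pte.raw;  /* make: install the new translation */
}
```

Both helpers end with the same PTE contents and one flush. The difference is
the window in demo_update_bbm() during which the slot holds no valid entry.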

>
>> Additionally, the page table walk logic has been adjusted, as ARM uses the
>> opposite walk order compared to RISC-V.
> I think you used some similar wording already in an earlier patch. I find
> this confusing: Walk order is, aiui, the same. It's merely the numbering
> of levels that is the opposite way round, isn't it?

Yes, the numbering of levels is different and I counted that as a different
walk order. If it is too confusing, I will reword it and use numbering of levels.

>
>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
>> ---
>> Changes in V2:
>>   - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>>     functionality" which was split into smaller ones.
>>   - Update the comment above the loop which creates the new page table, as
>>     RISC-V traverses page tables in the opposite order to ARM.
>>   - RISC-V doesn't require BBM, so there is no need for invalidating
>>     and TLB flushing before updating the PTE.
>> ---
>>   xen/arch/riscv/p2m.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 101 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
>> index 87dd636b80..79c4473f1f 100644
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -743,6 +743,77 @@ static void p2m_free_entry(struct p2m_domain *p2m,
>>       p2m_free_page(p2m->domain, pg);
>>   }
>>   
>> +static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> +                                unsigned int level, unsigned int target,
>> +                                const unsigned int *offsets)
>> +{
>> +    struct page_info *page;
>> +    unsigned int i;
>> +    pte_t pte, *table;
>> +    bool rv = true;
>> +
>> +    /* Convenience aliases */
>> +    mfn_t mfn = pte_get_mfn(*entry);
>> +    unsigned int next_level = level - 1;
>> +    unsigned int level_order = XEN_PT_LEVEL_ORDER(next_level);
>> +
>> +    /*
>> +     * This should only be called with target != level and the entry is
>> +     * a superpage.
>> +     */
>> +    ASSERT(level > target);
>> +    ASSERT(p2me_is_superpage(p2m, *entry, level));
>> +
>> +    page = p2m_alloc_page(p2m->domain);
>> +    if ( !page )
>> +        return false;
>> +
>> +    page_list_add(page, &p2m->pages);
> Is there a reason this list maintenance isn't done in p2m_alloc_page()?

No, there is no reason. I will move that inside p2m_alloc_page().

>
>> +    table = __map_domain_page(page);
>> +
>> +    /*
>> +     * We are either splitting a second level 1G page into 512 first level
>> +     * 2M pages, or a first level 2M page into 512 zero level 4K pages.
>> +     */
>> +    for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>> +    {
>> +        pte_t *new_entry = table + i;
>> +
>> +        /*
>> +         * Use the content of the superpage entry and override
>> +         * the necessary fields. So the correct permission are kept.
>> +         */
>> +        pte = *entry;
>> +        pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
> While okay as long as you only permit superpages up to 1G, this is another
> trap for someone to fall into: Imo i would better be unsigned long right
> away, considering that RISC-V permits large pages at all levels.
>
>> +        write_pte(new_entry, pte);
>> +    }
>> +
>> +    /*
>> +     * Shatter superpage in the page to the level we want to make the
>> +     * changes.
>> +     * This is done outside the loop to avoid checking the offset to
>> +     * know whether the entry should be shattered for every entry.
>> +     */
>> +    if ( next_level != target )
>> +        rv = p2m_split_superpage(p2m, table + offsets[next_level],
>> +                                 level - 1, target, offsets);
> I don't understand the comment: Under what conditions would every entry
> need (further) shattering? And where's that happening? Or is this merely
> a word ordering issue in the sentence, and "for every entry" wants
> moving ahead? (In that case I'm unconvinced this is in need of commenting
> upon.)

It is a wording question. It should be something like:
+    /*
+     * Shatter superpage in the page to the level we want to make the
+     * changes.
+     * This is done outside the loop to avoid checking the offset for every
+     * entry (of new page table) to know whether the entry should be shattered.
+     */


>
>> +    /* TODO: why it is necessary to have clean here? Not somewhere in the caller */
>> +    if ( p2m->clean_pte )
>> +        clean_dcache_va_range(table, PAGE_SIZE);
>> +
>> +    unmap_domain_page(table);
> Again likely not something that wants taking care of right away, but there
> again is an inefficiency here: The caller almost certainly wants to map
> the same page again, to update the one entry that caused the request to
> shatter the page.

I'll add a TODO.

>
>> +    /*
>> +     * Even if we failed, we should install the newly allocated PTE
>> +     * entry. The caller will be in charge to free the sub-tree.
>> +     */
>> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
> Why would it be wrong to free the page right here, vacating the entry at
> the same time (or leaving just that to the caller)? (IOW - if this is an
> implementation decision of yours, I think the word "should" would want
> dropping.) After all, the caller invoking p2m_free_entry() on the thus
> split PTE is less efficient (needs to iterate over all entries) than on
> the original one (where it's just a single superpage).

I don't think I got your idea.

>
>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>>        */
>>       if ( level > target )
> This condition is likely too strong, unless you actually mean to also
> split a superpage if it really wouldn't need splitting (new entry written
> still fitting with the superpage mapping, i.e. suitable MFN and same
> attributes).

I am not really sure that I fully understand.
My understanding is that if level != target then splitting is needed; I don't
really get the part "split a superpage if it really wouldn't need splitting".

>
>>       {
>> -        panic("Shattering isn't implemented\n");
>> +        /* We need to split the original page. */
>> +        pte_t split_pte = *entry;
>> +
>> +        ASSERT(p2me_is_superpage(p2m, *entry, level));
>> +
>> +        if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>> +        {
>> +            /* Free the allocated sub-tree */
>> +            p2m_free_entry(p2m, split_pte, level);
>> +
>> +            rc = -ENOMEM;
>> +            goto out;
>> +        }
>> +
>> +        p2m_write_pte(entry, split_pte, p2m->clean_pte);
>> +
>> +        /* Then move to the level we want to make real changes */
>> +        for ( ; level < target; level++ )
> Don't you mean to move downwards here? At which point I wonder: Did you test
> this code?

No, as a test for this case was missing. I will add one.
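
For reference, a minimal C sketch of the direction problem, assuming RISC-V
numbering (root has the highest level, leaf is level 0). The demo_* helpers
are hypothetical; the point where p2m_next_level() would be called is only
marked by recording the level.

```c
#include <stddef.h>

/* Loop as posted: with level > target it never executes its body. */
static size_t demo_ascending_iterations(unsigned int level,
                                        unsigned int target)
{
    size_t n = 0;

    for ( ; level < target; level++ )
        n++;

    return n;
}

/*
 * Downward walk written as do/while. Each recorded level is a point where
 * the real code would step to the next lower table. Returns the number of
 * steps taken; at most max_steps of them are recorded.
 */
static size_t demo_descend(unsigned int level, unsigned int target,
                           unsigned int *steps, size_t max_steps)
{
    size_t n = 0;

    /* The enclosing if() in the patch guarantees this; kept for safety. */
    if ( level <= target )
        return 0;

    do {
        if ( n < max_steps )
            steps[n] = level;
        n++;
        level--;
    } while ( level > target );

    return n;
}
```

For level 2 and target 0 the ascending loop runs zero times, while the
descending one steps at levels 2 and 1.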

>
>> +        {
>> +            rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
>> +
>> +            /*
>> +             * The entry should be found and either be a table
>> +             * or a superpage if level 0 is not targeted
>> +             */
>> +            ASSERT(rc == GUEST_TABLE_NORMAL ||
>> +                   (rc == GUEST_TABLE_SUPER_PAGE && target > 0));
>> +        }
> This, too, is inefficient (but likely good enough as a starting point): You walk
> tables twice - first when splitting, and then again when finding the target level.
>
> Considering the enclosing if(), this also again is a do/while() candidate.

I will add a TODO to make that part more efficient. And based on your reply regarding
the statement inside the if(), I'll likely use do/while().

Thanks.

~ Oleksii



^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-17 10:25               ` Jan Beulich
@ 2025-07-18  9:52                 ` Oleksii Kurochko
  2025-07-21 12:18                   ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-18  9:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5825 bytes --]


On 7/17/25 12:25 PM, Jan Beulich wrote:
> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>          return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>      }
>>>>>>>>      
>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>
>>>>>>>> +{
>>>>>>>> +    int rc;
>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>> I don't think I understand what the issue is. Could you please provide
>>>>>> some extra details?
>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>> of the function, making it possible to use it here?
>>>> At the moment, I planned to support only 1:1 mapped domains, so it is the final
>>>> implementation.
>>> Isn't that an overly severe limitation?
>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>> |mfn_to_gfn()| is implemented. When non-1:1 mapped domains are supported,
>> |mfn_to_gfn()| can be implemented differently, while the code where it’s called
>> will likely remain unchanged.
>>
>> What I meant in my reply is that, for the current state and current limitations,
>> this is the final implementation of |mfn_to_gfn()|. But that doesn't mean I don't
>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>> limitation simplifies development at the current stage of the RISC-V port.
> Simplification is fine in some cases, but not supporting the "normal" way of
> domain construction looks like a pretty odd restriction. I'm also curious
> how you envision to implement mfn_to_gfn() then, suitable for generic use like
> the one here. Imo, current limitation or not, you simply want to avoid use of
> that function outside of the special gnttab case.
>
>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>> making Xen insert very many entries?
>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>> memory a guest domain can use by specifying |xen,domain-p2m-mem-mb| in the DTS,
>>>>>> or using some predefined value if |xen,domain-p2m-mem-mb| isn’t explicitly set.
>>>>> Which would require these allocations to come from that pool.
>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>> ???
>> I meant that the pool is currently used only for non-hardware domains.
> And how does this matter here? The memory required for the radix tree doesn't
> come from that pool anyway.

I thought that it was possible to do that somehow, but looking at the code of
radix-tree.c it seems like the only way to allocate memory for the radix
tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).

Then either radix_tree_node_allocate(domain) needs to be introduced, or the radix
tree can't be used at all, for the security reason mentioned in the previous replies, no?


>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>> the memory runs out, |domain_crash()| will be called or PTE will be zapped.
>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>> isn't.
>>>> But it seems like this issue could happen in any implementation. It would be avoided only
>>>> if we had only pre-populated pools for every domain type (hardware, control, guest
>>>> domain), without the ability to extend them or allocate extra pages from domheap at runtime.
>>>> Otherwise, if extra page allocation is allowed, then we can't really do anything
>>>> about this issue.
>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>> Why isn't it an XSA for other architectures? At least Arm should then have such
>> an XSA.
> Does Arm use a radix tree for storing types? It uses one for mem-access, but
> it's not clear to me whether that's actually a supported feature.
>
>> I don't understand why x86 won't have the same issue. Memory is a limited
>> and shared resource, so if one of the domains uses too much memory then it could
>> happen that other domains won't have enough memory for their purposes...
> The question is whether allocations are bounded. With this use of a radix tree,
> you give domains a way to have Xen allocate pretty much arbitrary amounts of
> memory to populate that tree. That unbounded-ness is the problem, not memory
> allocations in general.

Isn't the radix tree key bounded by the number of GFNs given to a domain? We can't have
more keys than the max GFN number for a domain. So the potential amount of memory needed
for the radix tree is also bounded by the number of GFNs.

Anyway, IIUC I just can't use a radix tree for p2m types at all, right?
If yes, does it make sense to borrow 2 bits from struct page_info->type_info, as it
currently uses 9 bits for the count of a frame?
So we will have a 7-bit reference counter, and 2 bits for p2m types in type_info + 2 bits
in the PTE, which in total will give us 16 p2m types.
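
The arithmetic behind that proposal: 2 bits + 2 bits = 4 bits, i.e. 2^4 = 16
types, while a 7-bit counter tops out at 127. A minimal C sketch of such a
split encoding follows. The bit positions and all demo_* names are
hypothetical; the PTE position assumes the two software-reserved (RSW) bits,
and the type_info position is picked arbitrarily.

```c
#include <stdint.h>

#define DEMO_PTE_TYPE_SHIFT   8u    /* assumption: RSW bits of the PTE */
#define DEMO_INFO_TYPE_SHIFT  0u    /* hypothetical position in type_info */
#define DEMO_TYPE_FIELD_MASK  3ull  /* each field is 2 bits wide */

/* Stand-in for a PTE plus the type_info word of the page it maps. */
typedef struct {
    uint64_t pte;
    uint64_t type_info;
} demo_entry_t;

/* Store a 4-bit type: low 2 bits in the PTE, high 2 bits in type_info. */
static void demo_set_type(demo_entry_t *e, unsigned int type)
{
    uint64_t lo = type & DEMO_TYPE_FIELD_MASK;
    uint64_t hi = (type >> 2) & DEMO_TYPE_FIELD_MASK;

    e->pte = (e->pte & ~(DEMO_TYPE_FIELD_MASK << DEMO_PTE_TYPE_SHIFT)) |
             (lo << DEMO_PTE_TYPE_SHIFT);
    e->type_info = (e->type_info &
                    ~(DEMO_TYPE_FIELD_MASK << DEMO_INFO_TYPE_SHIFT)) |
                   (hi << DEMO_INFO_TYPE_SHIFT);
}

/* Reassemble the 4-bit type from the two 2-bit fields. */
static unsigned int demo_get_type(const demo_entry_t *e)
{
    unsigned int lo = (e->pte >> DEMO_PTE_TYPE_SHIFT) & DEMO_TYPE_FIELD_MASK;
    unsigned int hi = (e->type_info >> DEMO_INFO_TYPE_SHIFT) &
                      DEMO_TYPE_FIELD_MASK;

    return lo | (hi << 2);
}
```

One consequence of this layout is that the two halves live in different
objects, so reading or updating a type is no longer a single PTE access.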

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-17 10:37               ` Jan Beulich
@ 2025-07-18 11:19                 ` Oleksii Kurochko
  2025-07-21 13:14                   ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-18 11:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5316 bytes --]


On 7/17/25 12:37 PM, Jan Beulich wrote:
> On 17.07.2025 11:42, Oleksii Kurochko wrote:
>> On 7/16/25 6:12 PM, Jan Beulich wrote:
>>> On 16.07.2025 17:53, Oleksii Kurochko wrote:
>>>> On 7/16/25 1:43 PM, Jan Beulich wrote:
>>>>> On 16.07.2025 13:32, Oleksii Kurochko wrote:
>>>>>> On 7/2/25 10:35 AM, Jan Beulich wrote:
>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>> @@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
>>>>>>>>          return p2m_type_radix_get(p2m, pte) != p2m_invalid;
>>>>>>>>      }
>>>>>>>>      
>>>>>>>> +/*
>>>>>>>> + * pte_is_* helpers are checking the valid bit set in the
>>>>>>>> + * PTE but we have to check p2m_type instead (look at the comment above
>>>>>>>> + * p2me_is_valid())
>>>>>>>> + * Provide our own overlay to check the valid bit.
>>>>>>>> + */
>>>>>>>> +static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
>>>>>>>> +{
>>>>>>>> +    return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
>>>>>>>> +}
>>>>>>> Same question as on the earlier patch - does P2M type apply to intermediate
>>>>>>> page tables at all? (Conceptually it shouldn't.)
>>>>>> It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
>>>>>> to a page — PTE should be valid. Considering that in the current implementation
>>>>>> it’s possible for PTE.v = 0 but P2M.v = 1, it is better to check P2M.v instead
>>>>>> of PTE.v.
>>>>> I'm confused by this reply. If you want to name 2nd level page table entries
>>>>> P2M - fine (but unhelpful). But then for any memory access there's only one
>>>>> of the two involved: A PTE (Xen accesses) or a P2M (guest accesses). Hence
>>>>> how can there be "PTE.v = 0 but P2M.v = 1"?
>>>> I think I understand your confusion, let me try to rephrase.
>>>>
>>>> The reason for having both |p2m_is_valid()| and |pte_is_valid()| is that I want to
>>>> have the ability to use the P2M PTE valid bit to track which pages were accessed
>>>> by a vCPU, so that cleaning and invalidating RAM associated with the guest vCPU
>>>> won't be too expensive, for example.
>>> I don't know what you're talking about here.
>> https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c#L1649
> How does that Arm function matter here? Aiui you don't need anything like that
> in RISC-V, as there caches don't need disabling temporarily.

I thought that it could be needed not only in the case when the d-cache is disabled
temporarily, but it seems that I was just wrong and all other cases are
handled by the cache coherency protocol.

>
>>>> In this case, the P2M PTE valid bit will be set to 0, but the P2M PTE type bits
>>>> will be set to something other than |p2m_invalid| (even for table entries),
>>>> so when an MMU fault occurs, we can properly resolve it.
>>>>
>>>> So, if the P2M PTE type (what |p2m_is_valid()| checks) is set to |p2m_invalid|, it
>>>> means that the valid bit (what |pte_is_valid()| checks) should be set to 0, so
>>>> the P2M PTE is genuinely invalid.
>>>>
>>>> It could also be the case that the P2M PTE type isn't |p2m_invalid| (and the
>>>> P2M PTE valid bit is intentionally set to 0 to be able to track which pages
>>>> were accessed, for the reason I wrote above), and when an MMU fault occurs we
>>>> could properly handle it and set the P2M PTE valid bit to 1...
>>>>
>>>>> An intermediate page table entry is something Xen controls entirely. Hence
>>>>> it has no (guest induced) type.
>>>> ... And actually it is a reason why it is needed to set a type even for an
>>>> intermediate page table entry.
>>>>
>>>> I hope now it is a little bit clearer what was done and why.
>>> Sadly not. I still don't see what use the P2M type of an intermediate page
>>> table is going to be. It surely can't reliably describe all of the entries that
>>> page table holds. Intermediate page tables and leaf pages are just too different
>>> to share a concept like this, I think. That said, I'll be happy to be shown code
>>> demonstrating the contrary.
>> Then a new p2m_type_t value, p2m_table, needs to be introduced and used.
>> Would that be better?
>>
>> I still need some type to be able to distinguish whether a p2m entry is valid
>> or not from the p2m management and hardware points of view.
>> If there is no need for such a distinction, why do all archs introduce p2m_invalid?
>> Isn't it enough just to use the P2M PTE valid bit?
> At least on x86 we don't tag intermediate page tables with P2M types. For
> ordinary leaf entries the situation is different, as there may be varying
> reasons why a PTE has its valid (on x86: present) bit cleared. Hence the
> type is relevant there, just to know what to do when a page is accessed
> through such a not-present PTE.

I think that I got your idea now.

Does the following optimization make sense: when we have a 2MB memory range that
was mapped using 4k pages instead of one superpage, could it be useful to
invalidate just the L1 page table entry which corresponds to the start of
this 2MB memory range, instead of invalidating each entry on L0?
If it could be useful, then intermediate page tables should be tagged too. Arm has
such use cases:
   https://gitlab.com/xen-project/people/olkur/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c#L1286

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-02 10:28     ` Jan Beulich
@ 2025-07-18 14:37       ` Oleksii Kurochko
  2025-07-21 13:39         ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-18 14:37 UTC (permalink / raw)
  To: Jan Beulich, Stefano Stabellini
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 2261 bytes --]


On 7/2/25 12:28 PM, Jan Beulich wrote:
> On 02.07.2025 12:09, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>>   {
>>>       return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>>   }
>>> +
>>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>> +{
>>> +    ASSERT_UNREACHABLE();
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>>> +                                                      unsigned long nr)
>>> +{
>>> +    unsigned long x, y = page->count_info;
>>> +    struct domain *owner;
>>> +
>>> +    /* Restrict nr to avoid "double" overflow */
>>> +    if ( nr >= PGC_count_mask )
>>> +    {
>>> +        ASSERT_UNREACHABLE();
>>> +        return NULL;
>>> +    }
>> I question the validity of this, already in the Arm original: I can't spot
>> how the caller guarantees to stay below that limit. Without such an
>> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
>> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
>> any limit check.
>>
>>> +    do {
>>> +        x = y;
>>> +        /*
>>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>>> +         * Count == -1: Reference count would wrap, which is invalid.
>>> +         */
>> May I once again ask that you look carefully at comments (as much as at code)
>> you copy. Clearly this comment wasn't properly updated when the bumping by 1
>> was changed to bumping by nr.
>>
>>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>>> +            return NULL;
>>> +    }
>>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>>> +
>>> +    owner = page_get_owner(page);
>>> +    ASSERT(owner);
>>> +
>>> +    return owner;
>>> +}
> There also looks to be a dead code concern here (towards the "nr" parameters
> here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
> leave out Misra rule 2.2 entirely.

I don't think I got what the issue is when STATIC_SHM=n; the function is still
going to be called through page_get_owner_and_reference(), at least in page_alloc.c.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-02 10:09   ` Jan Beulich
  2025-07-02 10:28     ` Jan Beulich
  2025-07-02 12:52     ` Orzel, Michal
@ 2025-07-18 14:49     ` Oleksii Kurochko
  2025-07-21 13:42       ` Jan Beulich
  2025-07-21 13:53       ` Jan Beulich
  2 siblings, 2 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-18 14:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 7316 bytes --]


On 7/2/25 12:09 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Implement the mfn_valid() macro to verify whether a given MFN is valid by
>> checking that it falls within the range [start_page, max_page).
>> These bounds are initialized based on the start and end addresses of RAM.
>>
>> As part of this patch, start_page is introduced and initialized with the
>> PFN of the first RAM page.
>>
>> Also, after providing a non-stub implementation of the mfn_valid() macro,
>> the following compilation errors started to occur:
>>    riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>>    /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>>    riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>>    /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>>    riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>>    /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>>    riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>>    riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>>    /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>>    riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>>    riscv64-linux-gnu-ld: final link failed: bad value
>>    make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
>> To resolve these errors, the following functions have also been introduced,
>> based on their Arm counterparts:
>> - page_get_owner_and_reference() and its variant to safely acquire a
>>    reference to a page and retrieve its owner.
>> - put_page() and put_page_nr() to release page references and free the page
>>    when the count drops to zero.
>>    For put_page_nr(), code related to static memory configuration is wrapped
>>    with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
>>    common code. Therefore, PGC_static and free_domstatic_page() are not
>>    introduced for RISC-V. However, since this configuration could be useful
>>    in the future, the relevant code is retained and conditionally compiled.
>> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>>    unreachable, as RAM type checking is not yet implemented.
> How does this end up working when common code references the function?

Based on the following commit message:
     Callers are VT-d (so x86 specific) and various bits of page offlining
     support, which although it looks generic (and is in xen/common) does
     things like diving into page_info->count_info which is not generic.
     
     In any case on this is only reachable via XEN_SYSCTL_page_offline_op,
     which clearly shouldn't be called on ARM just yet.

There are no callers of page_is_ram_type() for Arm now, and I expect something similar
for RISC-V. As we don't even introduce hypercalls for RISC-V, we can just live
without it.

>
>> @@ -288,8 +289,12 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
>>   #define page_get_owner(p)    (p)->v.inuse.domain
>>   #define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
>>   
>> -/* TODO: implement */
>> -#define mfn_valid(mfn) ({ (void)(mfn); 0; })
>> +extern unsigned long start_page;
>> +
>> +#define mfn_valid(mfn) ({                                   \
>> +    unsigned long mfn__ = mfn_x(mfn);                       \
>> +    likely((mfn__ >= start_page) && (mfn__ < max_page));    \
>> +})
> I don't think you should try to be clever and avoid using __mfn_valid() here,
> at least not without an easily identifiable TODO. Surely you've seen that both
> Arm and x86 use it.

I overlooked that pdx.c is compiled unconditionally, so I thought that the common __mfn_valid()
implementation isn't available as, at the moment, RISC-V doesn't support PDX_COMPRESSION...

> Also, according to all I know, likely() doesn't work very well when used like
> this, except for architectures supporting conditionally executed insns (like
> Arm32 or IA-64, i.e. beyond conditional branches). I.e. if you want to use
> likely() here, I think you need two of them.

... I will update the mfn_valid() definition according to your recommendations.
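
A minimal sketch of the "two likely()" form suggested above, written as a
plain function so it can be compiled on its own. It only illustrates the hint
placement; it deliberately does not go through __mfn_valid(), whose signature
is not shown in this thread. demo_likely(), demo_start_page and demo_max_page
are hypothetical stand-ins, and __builtin_expect() assumes GCC or Clang.

```c
#include <stdbool.h>

/* Stand-in for likely(), built on the GCC/Clang builtin. */
#define demo_likely(x) __builtin_expect(!!(x), 1)

/* Stand-ins for start_page and max_page from the patch. */
static unsigned long demo_start_page;
static unsigned long demo_max_page;

/* Range check [start_page, max_page) with one hint per comparison. */
static bool demo_mfn_valid(unsigned long mfn)
{
    return demo_likely(mfn >= demo_start_page) &&
           demo_likely(mfn < demo_max_page);
}
```

Each comparison becomes its own conditional branch on architectures without
conditionally executed instructions, so each branch carries its own hint.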

>
>> @@ -525,6 +520,8 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
>>   #error setup_{directmap,frametable}_mapping() should be implemented for RV_32
>>   #endif
>>   
>> +unsigned long __read_mostly start_page;
> Memory hotplug question again: __read_mostly or __ro_after_init?
>
>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>   {
>>       return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>   }
>> +
>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>> +{
>> +    ASSERT_UNREACHABLE();
>> +
>> +    return 0;
>> +}
>> +
>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>> +                                                      unsigned long nr)
>> +{
>> +    unsigned long x, y = page->count_info;
>> +    struct domain *owner;
>> +
>> +    /* Restrict nr to avoid "double" overflow */
>> +    if ( nr >= PGC_count_mask )
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        return NULL;
>> +    }
> I question the validity of this, already in the Arm original: I can't spot
> how the caller guarantees to stay below that limit. Without such an
> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
> any limit check.

Agree, it could be really dropped.

>
>> +    do {
>> +        x = y;
>> +        /*
>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>> +         * Count == -1: Reference count would wrap, which is invalid.
>> +         */
> May I once again ask that you look carefully at comments (as much as at code)
> you copy. Clearly this comment wasn't properly updated when the bumping by 1
> was changed to bumping by nr.
>
>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>> +            return NULL;
>> +    }
>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>> +
>> +    owner = page_get_owner(page);
>> +    ASSERT(owner);
>> +
>> +    return owner;
>> +}
>> +
>> +struct domain *page_get_owner_and_reference(struct page_info *page)
>> +{
>> +    return page_get_owner_and_nr_reference(page, 1);
>> +}
>> +
>> +void put_page_nr(struct page_info *page, unsigned long nr)
>> +{
>> +    unsigned long nx, x, y = page->count_info;
>> +
>> +    do {
>> +        ASSERT((y & PGC_count_mask) >= nr);
>> +        x  = y;
>> +        nx = x - nr;
>> +    }
>> +    while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );
>> +
>> +    if ( unlikely((nx & PGC_count_mask) == 0) )
>> +    {
>> +#ifdef CONFIG_STATIC_MEMORY
>> +        if ( unlikely(nx & PGC_static) )
>> +            free_domstatic_page(page);
>> +        else
>> +#endif
> Such #ifdef-ed-out code is liable to go stale. Minimally use IS_ENABLED().
> Even better would imo be if you introduced a "stub" PGC_static, resolving
> to 0 (i.e. for now unconditionally).

Introducing a stub PGC_static would be better.
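
A rough sketch of what such a stub could look like, purely as an illustration. The bit layout (PGC_count_mask covering the low 16 bits), the enum and the helper name put_page_nr_path() are all hypothetical and only model which release path would be taken. The point is that the PGC_static branch stays compiled and type-checked while being constant-false.

```c
/* Hypothetical stub: no static-memory support yet, so the flag resolves to 0. */
#define PGC_static 0UL

/* Illustrative layout only: the low 16 bits hold the reference count. */
#define PGC_count_mask 0xffffUL

enum free_path { FREE_NONE, FREE_STATIC, FREE_HEAP };

/* Model which release path is taken once nr references are dropped. */
static enum free_path put_page_nr_path(unsigned long count_info,
                                       unsigned long nr)
{
    unsigned long nx = count_info - nr;

    if ( (nx & PGC_count_mask) != 0 )
        return FREE_NONE;

    /* Always compiled, but never true while PGC_static is 0. */
    if ( nx & PGC_static )
        return FREE_STATIC;

    return FREE_HEAP;
}
```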

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN
  2025-07-02 11:44   ` Jan Beulich
@ 2025-07-21  9:43     ` Oleksii Kurochko
  2025-07-21 14:06       ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-21  9:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/2/25 1:44 PM, Jan Beulich wrote:
> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>> Introduce helper functions for safely querying the P2M (physical-to-machine)
>> mapping:
>>   - add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
>>     P2M lock state.
>>   - Implement p2m_get_entry() to retrieve mapping details for a given GFN,
>>     including MFN, page order, and validity.
>>   - Add p2m_lookup() to encapsulate read-locked MFN retrieval.
>>   - Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
>>     pointer, acquiring a reference to the page if valid.
>>
>> Implementations are based on Arm's functions with some minor modifications:
>> - p2m_get_entry():
>>    - Reverse traversal of page tables, as RISC-V uses the opposite order
>>      compared to Arm.
>>    - Removed the return of p2m_access_t from p2m_get_entry() since
>>      mem_access_settings is not introduced for RISC-V.
> Didn't I see uses of p2m_access in earlier patches? If you don't mean to have
> that, then please consistently {every,no}where.

Yes, it was used. I think it would be better just usage of p2m_access from earlier
patches.

>>    - Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
>>      to Arm's THIRD_MASK.
>>    - Replaced open-coded bit shifts with the BIT() macro.
>>    - Other minor changes, such as using RISC-V-specific functions to validate
>>      P2M PTEs, and replacing Arm-specific GUEST_* macros with their RISC-V
>>      equivalents.
>> - p2m_get_page_from_gfn():
>>    - Removed p2m_is_foreign() and related logic, as this functionality is not
>>      implemented for RISC-V.
> Yet I expect you'll need this, sooner or later.

Then I'll add the corresponding code in this patch.

>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -1055,3 +1055,134 @@ int guest_physmap_add_entry(struct domain *d,
>>   {
>>       return p2m_insert_mapping(d, gfn, (1 << page_order), mfn, t);
>>   }
>> +
>> +/*
>> + * Get the details of a given gfn.
>> + *
>> + * If the entry is present, the associated MFN will be returned and the
>> + * access and type filled up. The page_order will correspond to the
> You removed p2m_access_t * from the parameters; you need to also update
> the comment then accordingly.
>
>> + * order of the mapping in the page table (i.e it could be a superpage).
>> + *
>> + * If the entry is not present, INVALID_MFN will be returned and the
>> + * page_order will be set according to the order of the invalid range.
>> + *
>> + * valid will contain the value of bit[0] (e.g valid bit) of the
>> + * entry.
>> + */
>> +static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> +                           p2m_type_t *t,
>> +                           unsigned int *page_order,
>> +                           bool *valid)
>> +{
>> +    paddr_t addr = gfn_to_gaddr(gfn);
>> +    unsigned int level = 0;
>> +    pte_t entry, *table;
>> +    int rc;
>> +    mfn_t mfn = INVALID_MFN;
>> +    p2m_type_t _t;
> Please no local variables with leading underscores. In x86 we commonly
> name such variables p2mt.
>
>> +    DECLARE_OFFSETS(offsets, addr);
> This is the sole use of "addr". Is such a local variable really worth having?

Not really, I'll drop it.

>> +    ASSERT(p2m_is_locked(p2m));
>> +    BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
>> +
>> +    /* Allow t to be NULL */
>> +    t = t ?: &_t;
>> +
>> +    *t = p2m_invalid;
>> +
>> +    if ( valid )
>> +        *valid = false;
>> +
>> +    /* XXX: Check if the mapping is lower than the mapped gfn */
>> +
>> +    /* This gfn is higher than the highest the p2m map currently holds */
>> +    if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
>> +    {
>> +        for ( level = P2M_ROOT_LEVEL; level ; level-- )
> Nit: Stray blank before the 2nd semicolon. (Again at least once below.)
>
>> +            if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
>> +                 gfn_x(p2m->max_mapped_gfn) )
>> +                break;
>> +
>> +        goto out;
>> +    }
>> +
>> +    table = p2m_get_root_pointer(p2m, gfn);
>> +
>> +    /*
>> +     * the table should always be non-NULL because the gfn is below
>> +     * p2m->max_mapped_gfn and the root table pages are always present.
>> +     */
>> +    if ( !table )
>> +    {
>> +        ASSERT_UNREACHABLE();
>> +        level = P2M_ROOT_LEVEL;
>> +        goto out;
>> +    }
>> +
>> +    for ( level = P2M_ROOT_LEVEL; level ; level-- )
>> +    {
>> +        rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
>> +        if ( (rc == GUEST_TABLE_MAP_NONE) && (rc != GUEST_TABLE_MAP_NOMEM) )
> This condition looks odd. As written the rhs of the && is redundant.

And it is wrong. It should be:
  if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )

>> +            goto out_unmap;
>> +        else if ( rc != GUEST_TABLE_NORMAL )
> As before, no real need for "else" in such cases.
>
>> +            break;
>> +    }
>> +
>> +    entry = table[offsets[level]];
>> +
>> +    if ( p2me_is_valid(p2m, entry) )
>> +    {
>> +        *t = p2m_type_radix_get(p2m, entry);
> If the incoming argument is NULL, the somewhat expensive radix tree lookup
> is unnecessary here.

Good point.

>> +        mfn = pte_get_mfn(entry);
>> +        /*
>> +         * The entry may point to a superpage. Find the MFN associated
>> +         * to the GFN.
>> +         */
>> +        mfn = mfn_add(mfn,
>> +                      gfn_x(gfn) & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1));
>> +
>> +        if ( valid )
>> +            *valid = pte_is_valid(entry);
> Interesting. Why not the P2M counterpart of the function? Yes, the comment
> ahead of the function says so, but I don't see why the valid bit suddenly
> is relevant here (besides the P2M type).

This valid variable is expected to be used by the caller (similar to what Arm does here
https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/p2m.c#L375) to check whether
flush_page_to_ram() is needed: if the valid bit in the PTE was set to 0, nothing
needs to be flushed, as the entry wasn't used since it was marked invalid.

>> +    }
>> +
>> +out_unmap:
>> +    unmap_domain_page(table);
>> +
>> +out:
> Nit: Style (bot labels).
>
>> +    if ( page_order )
>> +        *page_order = XEN_PT_LEVEL_ORDER(level);
>> +
>> +    return mfn;
>> +}
>> +
>> +static mfn_t p2m_lookup(struct domain *d, gfn_t gfn, p2m_type_t *t)
> pointer-to-const for the 1st arg? But again more likely struct p2m_domain *
> anyway?

struct p2m_domain would be better, but I’m not really sure that a
pointer-to-const can be used here. I expect that, in that case,
p2m_read_lock() would also need to accept a pointer-to-const, and since it
calls read_lock() internally, that could be a problem because read_lock() accepts an rwlock_t *l.

>> +{
>> +    mfn_t mfn;
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +
>> +    p2m_read_lock(p2m);
>> +    mfn = p2m_get_entry(p2m, gfn, t, NULL, NULL);
>> +    p2m_read_unlock(p2m);
>> +
>> +    return mfn;
>> +}
>> +
>> +struct page_info *p2m_get_page_from_gfn(struct domain *d, gfn_t gfn,
> Same here - likely you mean struct p2m_domain * instead.
>
>> +                                        p2m_type_t *t)
>> +{
>> +    p2m_type_t p2mt = {0};
> Why a compound initializer for something that isn't a compound object?
> And why plain 0 for something that is an enumerated type?

Agree, it shouldn't be a compound initializer. I'll drop it.

>
>> +    struct page_info *page;
>> +
>> +    mfn_t mfn = p2m_lookup(d, gfn, &p2mt);
>> +
>> +    if ( t )
>> +        *t = p2mt;
> What's wrong with passing t directly to p2m_lookup()?

It was needed before when the code of p2m_get_page_from_gfn() looked like:
   struct page_info *p2m_get_page_from_gfn(struct domain *d, gfn_t gfn,
                                           p2m_type_t *t)
   {
       struct page_info *page;
       p2m_type_t p2mt;
       mfn_t mfn = p2m_lookup(d, gfn, &p2mt);
   
       if ( t )
           *t = p2mt;
   
       if ( !p2m_is_any_ram(p2mt) )
           return NULL;
So it was needed to make sure that p2m_is_any_ram(*t) doesn't try to dereference
a NULL pointer.

Even with the current implementation a similar issue could arise if *t were used
instead of p2mt:
   struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
                                           p2m_type_t *t)
   {
       ...
       if ( p2m_is_foreign(p2mt) )
       {
           struct domain *fdom = page_get_owner_and_reference(page);

The second reason is that p2m_get_entry() (which is used inside
p2m_lookup()) could redirect `t` to point at a local variable:
   static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
                              p2m_type_t *t,
                              unsigned int *page_order,
                              bool *valid)
   {
       ...
       p2m_type_t p2mt;
       ...
       /* Allow t to be NULL */
       t = t ?: &p2mt;
       ...

Which looks wrong. I will remove this part of the code and pass
`t` directly to p2m_lookup().

After the p2m_lookup() call I will just check whether the t argument is non-NULL and,
if so, initialise p2mt from it:
    struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
                                            p2m_type_t *t)
    {
        struct page_info *page;
        p2m_type_t p2mt = p2m_invalid;
        mfn_t mfn = p2m_lookup(p2m, gfn, t);
    
        if ( !mfn_valid(mfn) )
            return NULL;
    
        if ( t )
            p2mt = *t;
    
        ...
    }
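
As a standalone illustration of the NULL-tolerant out-parameter pattern discussed above (not the actual Xen code), the sketch below keeps a local copy of the type so later checks never dereference a NULL `t`. The reduced p2m_type_t, the fake lookup rule (even GFNs are RAM, odd ones are unmapped) and the names lookup_sketch() and is_ram_sketch() are all hypothetical.

```c
#include <stddef.h>

/* Reduced, illustrative type enumeration. */
typedef enum { p2m_invalid, p2m_ram_rw } p2m_type_t;

/* Hypothetical lookup: even "GFNs" are RAM, odd ones are unmapped. */
static unsigned long lookup_sketch(unsigned long gfn, p2m_type_t *t)
{
    p2m_type_t type = (gfn & 1) ? p2m_invalid : p2m_ram_rw;

    /* Only write through t when the caller asked for the type. */
    if ( t )
        *t = type;

    return (type == p2m_invalid) ? ~0UL : gfn + 0x1000;
}

/* The caller keeps a local copy, so t may safely be NULL. */
static int is_ram_sketch(unsigned long gfn, p2m_type_t *t)
{
    p2m_type_t p2mt = p2m_invalid;
    unsigned long mfn = lookup_sketch(gfn, &p2mt);

    if ( t )
        *t = p2mt;

    return (mfn != ~0UL) && (p2mt == p2m_ram_rw);
}
```

In this shape the local p2mt always holds the looked-up type, regardless of whether the caller passed a pointer.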

Thanks.

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-18  9:52                 ` Oleksii Kurochko
@ 2025-07-21 12:18                   ` Jan Beulich
  2025-07-22 10:41                     ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 12:18 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 18.07.2025 11:52, Oleksii Kurochko wrote:
> 
> On 7/17/25 12:25 PM, Jan Beulich wrote:
>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>          return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>      }
>>>>>>>>>      
>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>
>>>>>>>>> +{
>>>>>>>>> +    int rc;
>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>> some extra details?
>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>> of the function, making it possible to use it here?
>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>> implementation.
>>>> Isn't that on overly severe limitation?
>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>> will likely remain unchanged.
>>>
>>> What I meant in my reply is that, for the current state and current limitations,
>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>> limitation simplifies development at the current stage of the RISC-V port.
>> Simplification is fine in some cases, but not supporting the "normal" way of
>> domain construction looks like a pretty odd restriction. I'm also curious
>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>> the one here. Imo, current limitation or not, you simply want to avoid use of
>> that function outside of the special gnttab case.
>>
>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>> making Xen insert very many entries?
>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>> Which would require these allocations to come from that pool.
>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>> ???
>>> I meant that pool is used now only for non-hardware domains at the moment.
>> And how does this matter here? The memory required for the radix tree doesn't
>> come from that pool anyway.
> 
> I thought that it is possible to do that somehow, but looking at the code of
> radix-tree.c it seems like the only way to allocate memory for the radix
> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
> 
> Then it is needed to introduce radix_tree_node_allocate(domain)

That would be a possibility, but you may have seen that less than half a
year ago we got rid of something along these lines. So it would require
some pretty good justification to re-introduce.

> or radix tree
> can't be used at all for mentioned in the previous replies security reason, no?

(Very) careful use may still be possible. But the downside of using this
(potentially long lookup times) would always remain.

>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>> the memory runs out, domain_crash() will be called or PTE will be zapped.
>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>> isn't.
>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>> with this issue.
>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>> XSA.
>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>> it's not clear to me whether that's actually a supported feature.
>>
>>> I don't understand why x86 won't have the same issue. Memory is the limited
>>> and shared resource, so if one of the domain will use to much memory then it could
>>> happen that other domains won't have enough memory for its purpose...
>> The question is whether allocations are bounded. With this use of a radix tree,
>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>> memory to populate that tree. That unbounded-ness is the problem, not memory
>> allocations in general.
> 
> Isn't radix tree key bounded to an amount of GFNs given for a domain? We can't have
> more keys then a max GFN number for a domain. So a potential amount of necessary memory
> for radix tree is also bounded to an amount of GFNs.

To some degree yes, hence why I said "pretty much arbitrary amounts".
But recall that "amount of GFNs" is a fuzzy term; I think you mean to
use it to describe the amount of memory pages given to the guest. GFNs
can be used for other purposes, though. Guests could e.g. grant
themselves access to their own memory, then map those grants at
otherwise unused GFNs.

> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
> is used 9-bits for count of a frame?

struct page_info describes MFNs, when you want to describe GFNs. As you
mentioned earlier, multiple GFNs can in principle map to the same MFN.
You would force them to all have the same properties, which would be in
direct conflict with e.g. the grant P2M types.

Just to mention one possible alternative to using radix trees: You could
maintain a 2nd set of intermediate "page tables", just that leaf entries
would hold meta data for the respective GFN. The memory for those "page
tables" could come from the normal P2M pool (and allocation would thus
only consume domain-specific resources). Of course in any model like
this the question of lookup times (as mentioned above) would remain.
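
To make the idea a little more concrete, here is a heavily simplified, standalone sketch. It is not Xen code: the two-level shape, the 512-entry tables, the fixed four-table pool standing in for the P2M pool, and all names (md_table, md_domain, md_set(), md_get()) are assumptions made for the example. The property it shows is that metadata allocations are bounded by a per-domain pool.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRIES   512u   /* entries per table (illustrative) */
#define POOL_SIZE 4u     /* hypothetical per-domain pool, counted in tables */

/* Leaf "page table": one metadata byte (e.g. a P2M type) per GFN. */
struct md_table { uint8_t type[ENTRIES]; };

struct md_domain {
    struct md_table pool[POOL_SIZE];  /* stand-in for the domain's P2M pool */
    unsigned int used;                /* tables handed out so far */
    struct md_table *l1[ENTRIES];     /* intermediate level: pointers to leaves */
};

/* Record a type for gfn; 0 on success, -1 once the domain's pool is exhausted. */
static int md_set(struct md_domain *d, unsigned long gfn, uint8_t type)
{
    unsigned int i1 = (gfn / ENTRIES) % ENTRIES, i0 = gfn % ENTRIES;

    if ( !d->l1[i1] )
    {
        if ( d->used == POOL_SIZE )
            return -1;                /* bounded: only this domain is affected */
        d->l1[i1] = &d->pool[d->used++];
    }

    d->l1[i1]->type[i0] = type;
    return 0;
}

/* Look up the recorded type; 0 stands for "nothing recorded". */
static uint8_t md_get(const struct md_domain *d, unsigned long gfn)
{
    const struct md_table *t = d->l1[(gfn / ENTRIES) % ENTRIES];

    return t ? t->type[gfn % ENTRIES] : 0;
}
```

A guest touching many sparse GFNs runs into its own pool limit here instead of consuming global memory, while each lookup remains a fixed number of table walks.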

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
  2025-07-18 11:19                 ` Oleksii Kurochko
@ 2025-07-21 13:14                   ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 13:14 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 18.07.2025 13:19, Oleksii Kurochko wrote:
> On 7/17/25 12:37 PM, Jan Beulich wrote:
>> On 17.07.2025 11:42, Oleksii Kurochko wrote:
>>> On 7/16/25 6:12 PM, Jan Beulich wrote:
>>>> On 16.07.2025 17:53, Oleksii Kurochko wrote:
>>>>> In this case, the P2M PTE valid bit will be set to 0, but the P2M PTE type bits
>>>>> will be set to something other than p2m_invalid (even for table entries),
>>>>> so when an MMU fault occurs, we can properly resolve it.
>>>>>
>>>>> So, if the P2M PTE type (what p2m_is_valid() checks) is set to p2m_invalid, it
>>>>> means that the valid bit (what pte_is_valid() checks) should be set to 0, so
>>>>> the P2M PTE is genuinely invalid.
>>>>>
>>>>> It could also be the case that the P2M PTE type isn't p2m_invalid (and the P2M PTE valid
>>>>> bit will be intentionally set to 0 to have the ability to track which pages were accessed,
>>>>> for the reason I wrote above), and when an MMU fault occurs we could properly handle it
>>>>> and set the P2M PTE valid bit to 1...
>>>>>
>>>>>> An intermediate page table entry is something Xen controls entirely. Hence
>>>>>> it has no (guest induced) type.
>>>>> ... And actually it is a reason why it is needed to set a type even for an
>>>>> intermediate page table entry.
>>>>>
>>>>> I hope now it is a lit bit clearer what and why was done.
>>>> Sadly not. I still don't see what use the P2M type in of an intermediate page
>>>> table is going to be. It surely can't reliably describe all of the entries that
>>>> page table holds. Intermediate page tables and leaf pages are just too different
>>>> to share a concept like this, I think. That said, I'll be happy to be shown code
>>>> demonstrating the contrary.
>>> Then it is needed to introduce new p2m_type_t - p2m_table and use it.
>>> Would it be better?
>>>
>>> I still need some type to have ability to distinguish if p2m is valid or not from
>>> p2m management and hardware point of view.
>>> If there is no need for such distinguish why all archs introduce p2m_invalid?
>>> Isn't enough just to use P2M PTE valid bit?
>> At least on x86 we don't tag intermediate page tables with P2M types. For
>> ordinary leaf entries the situation is different, as there may be varying
>> reasons why a PTE has its valid (on x86: present) bit cleared. Hence the
>> type is relevant there, just to know what to do when a page is accessed
>> through such a not-present PTE.
> 
> I think that I got your idea now.
> 
> Does it make sense to have such optimization when we have 2Mb memory range and
> it was mapped using 4k pages instead of 1 super-page, could it be useful to
> invalidate just the page table entry of L1 which corresponds to the start of
> this 2mb memory range, instead of invalidating each entry on L0?
> If it could be useful then intermediate page tables should be tagged too. Arm has
> such use cases:
>    https://gitlab.com/xen-project/people/olkur/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c#L1286

I don't currently see how that's related to the topic at hand.

Furthermore range-constrained TLB flushing is never at just an address, i.e.
"L1 which corresponds to the start of this 2mb memory range" isn't meaningful
here. It's always a range (typically expressed by address and size), and it
always needs to be the full range that is invalidated. This can be a solitary
low-level flush operation when you know a large page mapping would _not_ be
split. When splitting is done in software or when hardware may split behind
your back, you always need to invalidate the entire range. Or else, in your
example, 4k TLB entries may remain for any but the first page of the 2M
super-page. (Whether such a range can still be done in a single invalidation
operation is a separate question. But I don't see how maintaining the type
at the L1 level would help there.)
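
A small standalone sketch of the "always the full range" point, under stated assumptions: 9 translation bits per level (Sv39-style) and hypothetical names PT_LEVEL_ORDER(), struct flush_range and flush_range_for(). It only computes the range that would need invalidating; it issues no actual fence.

```c
/* Assumption for the example: 9 index bits per page table level. */
#define PT_LEVEL_ORDER(lvl) ((lvl) * 9)

struct flush_range {
    unsigned long start_gfn;  /* first GFN of the range to invalidate */
    unsigned long nr_pages;   /* number of 4k pages covered */
};

/*
 * Range covered by the mapping at 'level' that contains 'gfn'. After a
 * superpage may have been split, this whole range needs invalidating,
 * not merely its first page.
 */
static struct flush_range flush_range_for(unsigned long gfn, unsigned int level)
{
    unsigned long nr = 1UL << PT_LEVEL_ORDER(level);
    struct flush_range r = { gfn & ~(nr - 1), nr };

    return r;
}
```

For a GFN inside a 2M mapping (level 1 here) this yields a 512-page range aligned down to the superpage boundary.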

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-07-17 16:37     ` Oleksii Kurochko
@ 2025-07-21 13:34       ` Jan Beulich
  2025-07-22 14:57         ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 13:34 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 17.07.2025 18:37, Oleksii Kurochko wrote:
> On 7/2/25 11:25 AM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Add support for breaking down large memory mappings ("superpages") in the RISC-V
>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>> can be inserted into lower levels of the page table hierarchy.
>>>
>>> To implement that the following is done:
>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>    smaller page table entries down to the target level, preserving original
>>>    permissions and attributes.
>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>    entries at lower levels within a superpage-mapped region.
>>>
>>> This implementation is based on the ARM code, with modifications to the part
>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>> TLB before updating it with the newly created, split page table.
>> But some flushing is going to be necessary. As long as you only ever do
>> global flushes, the one after the individual PTE modification (within the
>> split table) will do (if BBM isn't required, see below), but once you move
>> to more fine-grained flushing, that's not going to be enough anymore. Not
>> sure it's a good idea to leave such a pitfall.
> 
> I think that I don't fully understand what is an issue.

Whether a flush is necessary after solely breaking up a superpage is
arch-defined. I don't know how many restrictions the spec puts on possible
hardware behavior for RISC-V. However, the eventual change of (at least) one
entry to fulfill the original request will surely require a flush. What I was
trying to say is that this required flush would better not also cover for
the flushes that may or may not be required by the spec. IOW if the spec
leaves any room for flushes to possibly be needed, those flushes would
better be explicit.

>> As to (no need for) BBM: I couldn't find anything to that effect in the
>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>> this sentence: "To ensure that implicit reads observe writes to the same
>> memory locations, an SFENCE.VMA instruction must be executed after the
>> writes to flush the relevant cached translations." And this: "Accessing the
>> same location using different cacheability attributes may cause loss of
>> coherence." (This may not only occur when the same physical address is
>> mapped twice at different VAs, but also after the shattering of a superpage
>> when the new entry differs in cacheability.)
> 
> I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.
> 
> What I meant that on RISC-V can do:
> - Write new PTE
> - Flush TLB
> 
> While on Arm it is almost always needed to do:
> - Write zero to PTE
> - Flush TLB
> - Write new PTE
> 
> For example, the common CoW code path where you copy from a read only page to
> a new page, then map that new page as writable just works on RISC-V without
> extra considerations and on Arm it requires BBM.

CoW is a specific sub-case with increasing privilege.

> It seems to me that BBM is mandated for Arm only because that TLB is shared
> among cores, so there is no any guarantee that no prior entry for same VA
> remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
> guaranteed that no prior entry for the same VA remains in the TLB.

Not sure that's the sole reason. But again the question is: Is this written
down explicitly anywhere? Generally there can be multiple levels of TLBs, and
while some of them may be per-core, others may be shared.

>>> +    /*
>>> +     * Even if we failed, we should install the newly allocated PTE
>>> +     * entry. The caller will be in charge to free the sub-tree.
>>> +     */
>>> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
>> Why would it be wrong to free the page right here, vacating the entry at
>> the same time (or leaving just that to the caller)? (IOW - if this is an
>> implementation decision of yours, I think the word "should" would want
>> dropping.) After all, the caller invoking p2m_free_entry() on the thus
>> split PTE is less efficient (needs to iterate over all entries) than on
>> the original one (where it's just a single superpage).
> 
> I think that I didn't get your idea.

Well, first and foremost this was a question to you, as it's not clear to
me whether "should" is the correct word to use here. It would be
appropriate if the spec mandated this behavior. It would seem less
appropriate if this arrangement was an implementation choice of yours.
And it looks to me as if the latter was the case here.

>>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>>>        */
>>>       if ( level > target )
>> This condition is likely too strong, unless you actually mean to also
>> split a superpage if it really wouldn't need splitting (new entry written
>> still fitting with the superpage mapping, i.e. suitable MFN and same
>> attributes).
> 
> I am not really sure that I fully understand.
> My understanding is if level != target then the splitting is needed, I am
> not really get the part "split a superpage if it really wouldn't need splitting".

Hmm, maybe I was wrong here. The caller determines at what level the
actual change needs to occur? In which case what you have may be right.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-18 14:37       ` Oleksii Kurochko
@ 2025-07-21 13:39         ` Jan Beulich
  2025-07-22 12:03           ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 13:39 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel, Stefano Stabellini

On 18.07.2025 16:37, Oleksii Kurochko wrote:
> 
> On 7/2/25 12:28 PM, Jan Beulich wrote:
>> On 02.07.2025 12:09, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>>>   {
>>>>       return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>>>   }
>>>> +
>>>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>>> +{
>>>> +    ASSERT_UNREACHABLE();
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>>>> +                                                      unsigned long nr)
>>>> +{
>>>> +    unsigned long x, y = page->count_info;
>>>> +    struct domain *owner;
>>>> +
>>>> +    /* Restrict nr to avoid "double" overflow */
>>>> +    if ( nr >= PGC_count_mask )
>>>> +    {
>>>> +        ASSERT_UNREACHABLE();
>>>> +        return NULL;
>>>> +    }
>>> I question the validity of this, already in the Arm original: I can't spot
>>> how the caller guarantees to stay below that limit. Without such an
>>> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
>>> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
>>> any limit check.
>>>
>>>> +    do {
>>>> +        x = y;
>>>> +        /*
>>>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>>>> +         * Count == -1: Reference count would wrap, which is invalid.
>>>> +         */
>>> May I once again ask that you look carefully at comments (as much as at code)
>>> you copy. Clearly this comment wasn't properly updated when the bumping by 1
>>> was changed to bumping by nr.
>>>
>>>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>>>> +            return NULL;
>>>> +    }
>>>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>>>> +
>>>> +    owner = page_get_owner(page);
>>>> +    ASSERT(owner);
>>>> +
>>>> +    return owner;
>>>> +}
>> There also looks to be a dead code concern here (towards the "nr" parameters
>> here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
>> leave out Misra rule 2.2 entirely.
> 
> I think I didn't get what the issue is when STATIC_SHM=n: the function is still
> going to be called through page_get_owner_and_reference(), at least in page_alloc.c.

Yes, but will "nr" ever be anything other than 1 then? IOW omitting the parameter
would be fine. And that's what "dead code" is about.
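
For illustration only (not the Xen code): a self-contained sketch of the
"take nr references unless the count is zero or would wrap" loop, using C11
atomics in place of Xen's cmpxchg(). COUNT_MASK is a hypothetical stand-in
for PGC_count_mask, and an out-of-range nr is simply rejected rather than
asserted unreachable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define COUNT_MASK 0xffUL    /* hypothetical stand-in for PGC_count_mask */

/*
 * Try to add 'nr' references to the counter held in the low bits of
 * *count_info. Fails if the current count is 0 (page not allocated) or
 * if adding 'nr' would wrap the count field.
 */
static bool get_nr_refs(_Atomic unsigned long *count_info, unsigned long nr)
{
    unsigned long x = atomic_load(count_info);

    /* Reject requests that could never fit in the count field. */
    if ( nr >= COUNT_MASK )
        return false;

    do {
        /*
         * Count == 0:         (0 + nr) & mask == nr   -> refuse.
         * Count + nr wraps:   result is below nr      -> refuse.
         */
        if ( ((x + nr) & COUNT_MASK) <= nr )
            return false;
        /* On failure, x is reloaded with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(count_info, &x, x + nr) );

    return true;
}

/* With only single-reference callers, 'nr' is always 1. */
static bool get_ref(_Atomic unsigned long *count_info)
{
    return get_nr_refs(count_info, 1);
}
```

If get_ref() is the only caller in a given configuration, the nr parameter
carries no information there, which is the "dead code" point above.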

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-18 14:49     ` Oleksii Kurochko
@ 2025-07-21 13:42       ` Jan Beulich
  2025-07-22 13:38         ` Oleksii Kurochko
  2025-07-21 13:53       ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 13:42 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 18.07.2025 16:49, Oleksii Kurochko wrote:
> On 7/2/25 12:09 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Implement the mfn_valid() macro to verify whether a given MFN is valid by
>>> checking that it falls within the range [start_page, max_page).
>>> These bounds are initialized based on the start and end addresses of RAM.
>>>
>>> As part of this patch, start_page is introduced and initialized with the
>>> PFN of the first RAM page.
>>>
>>> Also, after providing a non-stub implementation of the mfn_valid() macro,
>>> the following compilation errors started to occur:
>>>    riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>>>    /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>>>    riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>>>    /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>>>    riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>>>    /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>>>    riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>>>    riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>>>    /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>>>    riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>>>    riscv64-linux-gnu-ld: final link failed: bad value
>>>    make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
>>> To resolve these errors, the following functions have also been introduced,
>>> based on their Arm counterparts:
>>> - page_get_owner_and_reference() and its variant to safely acquire a
>>>    reference to a page and retrieve its owner.
>>> - put_page() and put_page_nr() to release page references and free the page
>>>    when the count drops to zero.
>>>    For put_page_nr(), code related to static memory configuration is wrapped
>>>    with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
>>>    common code. Therefore, PGC_static and free_domstatic_page() are not
>>>    introduced for RISC-V. However, since this configuration could be useful
>>>    in the future, the relevant code is retained and conditionally compiled.
>>> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>>>    unreachable, as RAM type checking is not yet implemented.
>> How does this end up working when common code references the function?
> 
> Based on the following commit message:
>      Callers are VT-d (so x86 specific) and various bits of page offlining
>      support, which although it looks generic (and is in xen/common) does
>      things like diving into page_info->count_info which is not generic.
>      
>      In any case on this is only reachable via XEN_SYSCTL_page_offline_op,
>      which clearly shouldn't be called on ARM just yet.

What commit message are you talking about? Nothing like the above is anywhere
in this patch.

> There are no callers of page_is_ram_type() on Arm now, and I expect something similar
> for RISC-V. As we don't even introduce hypercalls for RISC-V, we can just live
> without it.

If there's no caller, why the stub?

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-18 14:49     ` Oleksii Kurochko
  2025-07-21 13:42       ` Jan Beulich
@ 2025-07-21 13:53       ` Jan Beulich
  1 sibling, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 13:53 UTC (permalink / raw)
  To: Oleksii Kurochko, Julien Grall, Stefano Stabellini,
	Bertrand Marquis, Michal Orzel
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Roger Pau Monné, xen-devel

On 18.07.2025 16:49, Oleksii Kurochko wrote:
> On 7/2/25 12:09 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> Implement the mfn_valid() macro to verify whether a given MFN is valid by
>>> checking that it falls within the range [start_page, max_page).
>>> These bounds are initialized based on the start and end addresses of RAM.
>>>
>>> As part of this patch, start_page is introduced and initialized with the
>>> PFN of the first RAM page.
>>>
>>> Also, after providing a non-stub implementation of the mfn_valid() macro,
>>> the following compilation errors started to occur:
>>>    riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>>>    /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>>>    riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>>>    /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>>>    riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>>>    /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>>>    riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>>>    riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>>>    /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>>>    riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>>>    riscv64-linux-gnu-ld: final link failed: bad value
>>>    make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
>>> To resolve these errors, the following functions have also been introduced,
>>> based on their Arm counterparts:
>>> - page_get_owner_and_reference() and its variant to safely acquire a
>>>    reference to a page and retrieve its owner.
>>> - put_page() and put_page_nr() to release page references and free the page
>>>    when the count drops to zero.
>>>    For put_page_nr(), code related to static memory configuration is wrapped
>>>    with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
>>>    common code. Therefore, PGC_static and free_domstatic_page() are not
>>>    introduced for RISC-V. However, since this configuration could be useful
>>>    in the future, the relevant code is retained and conditionally compiled.
>>> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>>>    unreachable, as RAM type checking is not yet implemented.
>> How does this end up working when common code references the function?
> 
> Based on the following commit message:
>      Callers are VT-d (so x86 specific) and various bits of page offlining
>      support, which although it looks generic (and is in xen/common) does
>      things like diving into page_info->count_info which is not generic.
>      
>      In any case on this is only reachable via XEN_SYSCTL_page_offline_op,
>      which clearly shouldn't be called on ARM just yet.

Assuming this is from an old commit, I have to question this justification.
I see nothing preventing XEN_SYSCTL_page_offline_op from being invoked on an Arm
system. Hence (unless I'm overlooking something) ASSERT_UNREACHABLE() is simply
inappropriate (and wants fixing). Luckily, it being sysctl-s only, there's no
need for an XSA. In no case should known flawed code be copied into another
port.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN
  2025-07-21  9:43     ` Oleksii Kurochko
@ 2025-07-21 14:06       ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-21 14:06 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 21.07.2025 11:43, Oleksii Kurochko wrote:
> On 7/2/25 1:44 PM, Jan Beulich wrote:
>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -1055,3 +1055,134 @@ int guest_physmap_add_entry(struct domain *d,
>>>   {
>>>       return p2m_insert_mapping(d, gfn, (1 << page_order), mfn, t);
>>>   }
>>> +
>>> +/*
>>> + * Get the details of a given gfn.
>>> + *
>>> + * If the entry is present, the associated MFN will be returned and the
>>> + * access and type filled up. The page_order will correspond to the
>> You removed p2m_access_t * from the parameters; you need to also update
>> the comment then accordingly.
>>
>>> + * order of the mapping in the page table (i.e it could be a superpage).
>>> + *
>>> + * If the entry is not present, INVALID_MFN will be returned and the
>>> + * page_order will be set according to the order of the invalid range.
>>> + *
>>> + * valid will contain the value of bit[0] (e.g valid bit) of the
>>> + * entry.
>>> + */
>>> +static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>>> +                           p2m_type_t *t,
>>> +                           unsigned int *page_order,
>>> +                           bool *valid)
>>> +{
>>> +    paddr_t addr = gfn_to_gaddr(gfn);
>>> +    unsigned int level = 0;
>>> +    pte_t entry, *table;
>>> +    int rc;
>>> +    mfn_t mfn = INVALID_MFN;
>>> +    p2m_type_t _t;
>> Please no local variables with leading underscores. In x86 we commonly
>> name such variables p2mt.
>>
>>> +    DECLARE_OFFSETS(offsets, addr);
>> This is the sole use of "addr". Is such a local variable really worth having?
> 
> Not really, I'll drop it.
> 
>>> +    ASSERT(p2m_is_locked(p2m));
>>> +    BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);
>>> +
>>> +    /* Allow t to be NULL */
>>> +    t = t ?: &_t;
>>> +
>>> +    *t = p2m_invalid;
>>> +
>>> +    if ( valid )
>>> +        *valid = false;
>>> +
>>> +    /* XXX: Check if the mapping is lower than the mapped gfn */
>>> +
>>> +    /* This gfn is higher than the highest the p2m map currently holds */
>>> +    if ( gfn_x(gfn) > gfn_x(p2m->max_mapped_gfn) )
>>> +    {
>>> +        for ( level = P2M_ROOT_LEVEL; level ; level-- )
>> Nit: Stray blank before the 2nd semicolon. (Again at least once below.)
>>
>>> +            if ( (gfn_x(gfn) & (XEN_PT_LEVEL_MASK(level) >> PAGE_SHIFT)) >
>>> +                 gfn_x(p2m->max_mapped_gfn) )
>>> +                break;
>>> +
>>> +        goto out;
>>> +    }
>>> +
>>> +    table = p2m_get_root_pointer(p2m, gfn);
>>> +
>>> +    /*
>>> +     * the table should always be non-NULL because the gfn is below
>>> +     * p2m->max_mapped_gfn and the root table pages are always present.
>>> +     */
>>> +    if ( !table )
>>> +    {
>>> +        ASSERT_UNREACHABLE();
>>> +        level = P2M_ROOT_LEVEL;
>>> +        goto out;
>>> +    }
>>> +
>>> +    for ( level = P2M_ROOT_LEVEL; level ; level-- )
>>> +    {
>>> +        rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
>>> +        if ( (rc == GUEST_TABLE_MAP_NONE) && (rc != GUEST_TABLE_MAP_NOMEM) )
>> This condition looks odd. As written the rhs of the && is redundant.
> 
> And it is wrong. It should be:
>   if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
> 
>>> +            goto out_unmap;
>>> +        else if ( rc != GUEST_TABLE_NORMAL )
>> As before, no real need for "else" in such cases.
>>
>>> +            break;
>>> +    }
>>> +
>>> +    entry = table[offsets[level]];
>>> +
>>> +    if ( p2me_is_valid(p2m, entry) )
>>> +    {
>>> +        *t = p2m_type_radix_get(p2m, entry);
>> If the incoming argument is NULL, the somewhat expensive radix tree lookup
>> is unnecessary here.
> 
> Good point.
> 
>>> +        mfn = pte_get_mfn(entry);
>>> +        /*
>>> +         * The entry may point to a superpage. Find the MFN associated
>>> +         * to the GFN.
>>> +         */
>>> +        mfn = mfn_add(mfn,
>>> +                      gfn_x(gfn) & (BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1));
>>> +
>>> +        if ( valid )
>>> +            *valid = pte_is_valid(entry);
>> Interesting. Why not the P2M counterpart of the function? Yes, the comment
>> ahead of the function says so, but I don't see why the valid bit suddenly
>> is relevant here (besides the P2M type).
> 
> This valid variable is expected to be used in the caller (something like what Arm does here
> https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/p2m.c#L375) to check whether
> flush_page_to_ram() is needed: if the valid bit in the PTE was set to 0,
> nothing should be flushed, as the entry wasn't used while it was marked invalid.

I hope you see what a mess you get if you have two flavors of "valid" for a
PTE.
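
Separately from the valid-bit question, and for illustration only: the
superpage offset arithmetic in the hunk quoted above can be checked in
isolation. LEVEL_ORDER() is a hypothetical stand-in for
XEN_PT_LEVEL_ORDER(), assuming 9 index bits per level.

```c
#include <assert.h>
#include <stdint.h>

#define LEVEL_ORDER(l)  ((l) * 9u)    /* assumed: 9 index bits per level */

/*
 * Given the base MFN stored in a leaf entry found at 'level', return the
 * MFN backing 'gfn': the low LEVEL_ORDER(level) bits of the GFN are the
 * frame offset inside the (super)page.
 */
static uint64_t mfn_for_gfn(uint64_t base_mfn, uint64_t gfn,
                            unsigned int level)
{
    uint64_t mask = (UINT64_C(1) << LEVEL_ORDER(level)) - 1;

    return base_mfn + (gfn & mask);
}
```

At level 0 the mask is 0, so the stored MFN is returned unchanged; at
higher levels the offset selects a 4K frame within the superpage.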

>>> +    }
>>> +
>>> +out_unmap:
>>> +    unmap_domain_page(table);
>>> +
>>> +out:
>> Nit: Style (both labels).
>>
>>> +    if ( page_order )
>>> +        *page_order = XEN_PT_LEVEL_ORDER(level);
>>> +
>>> +    return mfn;
>>> +}
>>> +
>>> +static mfn_t p2m_lookup(struct domain *d, gfn_t gfn, p2m_type_t *t)
>> pointer-to-const for the 1st arg? But again more likely struct p2m_domain *
>> anyway?
> 
> struct p2m_domain would be better, but I’m not really sure that a
> pointer-to-const can be used here.

Note how I also didn't suggest const there.

> I expect that, in that case,
> p2m_read_lock() would also need to accept a pointer-to-const, and since it
> calls read_lock() internally, that could be a problem because read_lock() accepts a rwlock_t *l.

Correct.
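
For illustration only: a reduced sketch of why the lookup path cannot take
a pointer-to-const once it has to acquire a lock embedded in the structure.
rwlock_t, read_lock(), read_unlock() and struct p2m_sketch here are
simplified, hypothetical stand-ins, not the Xen definitions.

```c
#include <assert.h>

typedef struct { int readers; } rwlock_t;    /* hypothetical stand-in */

struct p2m_sketch {
    rwlock_t lock;
    unsigned long max_mapped_gfn;
};

/* Taking a read lock modifies the lock, hence the non-const parameter. */
static void read_lock(rwlock_t *l)   { l->readers++; }
static void read_unlock(rwlock_t *l) { l->readers--; }

/*
 * The parameter must be non-const: with 'const struct p2m_sketch *p2m',
 * '&p2m->lock' would have type 'const rwlock_t *', and passing it to
 * read_lock() would discard the qualifier, which compilers diagnose.
 */
static unsigned long lookup_max_gfn(struct p2m_sketch *p2m)
{
    unsigned long gfn;

    read_lock(&p2m->lock);
    gfn = p2m->max_mapped_gfn;
    read_unlock(&p2m->lock);

    return gfn;
}
```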

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-21 12:18                   ` Jan Beulich
@ 2025-07-22 10:41                     ` Oleksii Kurochko
  2025-07-22 11:34                       ` Oleksii Kurochko
  2025-07-22 11:54                       ` Jan Beulich
  0 siblings, 2 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 10:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/21/25 2:18 PM, Jan Beulich wrote:
> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>           return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>       }
>>>>>>>>>>       
>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>
>>>>>>>>>> +{
>>>>>>>>>> +    int rc;
>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>> some extra details?
>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>> of the function, making it possible to use it here?
>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>> implementation.
>>>>> Isn't that an overly severe limitation?
>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>>> will likely remain unchanged.
>>>>
>>>> What I meant in my reply is that, for the current state and current limitations,
>>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>>> see the value in, or the need for, non-1:1 mapped domains; it's just that this
>>>> limitation simplifies development at the current stage of the RISC-V port.
>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>> domain construction looks like a pretty odd restriction. I'm also curious
>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>> that function outside of the special gnttab case.
>>>
>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>> making Xen insert very many entries?
>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>>> Which would require these allocations to come from that pool.
>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>> ???
>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>> And how does this matter here? The memory required for the radix tree doesn't
>>> come from that pool anyway.
>> I thought that it is possible to do that somehow, but looking at the code of
>> radix-tree.c it seems like the only way to allocate memory for the radix
>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>
>> Then it is needed to introduce radix_tree_node_allocate(domain)
> That would be a possibility, but you may have seen that less than half a
> year ago we got rid of something along these lines. So it would require
> some pretty good justification to re-introduce.
>
>> or radix tree
>> can't be used at all for mentioned in the previous replies security reason, no?
> (Very) careful use may still be possible. But the downside of using this
> (potentially long lookup times) would always remain.

Could you please clarify what you mean here by "(Very) careful"?
I thought about introducing a limit on the number of possible keys in the radix tree, and
stopping the domain once this limit reaches 0. It is also unclear what the value of this limit
should be. You probably have a better idea.

But generally your idea below ...

>
>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>> the memory runs out, domain_crash() will be called or the PTE will be zapped.
>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>> isn't.
>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>> with this issue.
>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>> XSA.
>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>> it's not clear to me whether that's actually a supported feature.
>>>
>>>> I don't understand why x86 won't have the same issue. Memory is a limited
>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>> happen that other domains won't have enough memory for their purposes...
>>> The question is whether allocations are bounded. With this use of a radix tree,
>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>> allocations in general.
>> Isn't the number of radix tree keys bounded by the number of GFNs given to a domain? We can't have
>> more keys than the max GFN number for a domain. So the potential amount of memory needed
>> for the radix tree is also bounded by the number of GFNs.
> To some degree yes, hence why I said "pretty much arbitrary amounts".
> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
> use it to describe the amount of memory pages given to the guest. GFNs
> can be used for other purposes, though. Guests could e.g. grant
> themselves access to their own memory, then map those grants at
> otherwise unused GFNs.
>
>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>> is used 9-bits for count of a frame?
> struct page_info describes MFNs, when you want to describe GFNs. As you
> mentioned earlier, multiple GFNs can in principle map to the same MFN.
> You would force them to all have the same properties, which would be in
> direct conflict with e.g. the grant P2M types.
>
> Just to mention one possible alternative to using radix trees: You could
> maintain a 2nd set of intermediate "page tables", just that leaf entries
> would hold meta data for the respective GFN. The memory for those "page
> tables" could come from the normal P2M pool (and allocation would thus
> only consume domain-specific resources). Of course in any model like
> this the question of lookup times (as mentioned above) would remain.

...looks like an optimal option.

The only thing I worry about is that it will require some code duplication
(I will think about how to re-use the current code), as for example, when
setting/getting metadata, TLB flushing isn’t needed at all as we aren't
working with real P2M page tables.

Agreed that lookup time won't be the best, but nothing can be done about that
with such models.
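
To make sure I understood the idea, here is a rough sketch (illustration
only, not proposed code) of a second, page-table-like structure whose leaf
entries hold the per-GFN type, with all nodes taken from a bounded
per-domain pool. All names (meta_root, meta_leaf, meta_pool, meta_set(),
meta_get()) are hypothetical, and only two levels with 9 index bits each
are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define IDX_BITS  9u
#define ENTRIES   (1u << IDX_BITS)
#define IDX_MASK  (ENTRIES - 1u)

typedef uint8_t p2m_type_sketch_t;    /* 0 stands for "invalid" here */

struct meta_leaf { p2m_type_sketch_t type[ENTRIES]; };
struct meta_root { struct meta_leaf *leaf[ENTRIES]; };

/* Stand-in for the per-domain P2M pool: a budget of nodes. */
struct meta_pool { unsigned int free_nodes; };

static void *pool_alloc(struct meta_pool *pool, size_t size)
{
    if ( pool->free_nodes == 0 )
        return NULL;              /* only this domain's budget is exhausted */
    pool->free_nodes--;
    return calloc(1, size);
}

/* Record type 't' for 'gfn'; returns 0 on success, -1 on failure. */
static int meta_set(struct meta_root *root, struct meta_pool *pool,
                    uint64_t gfn, p2m_type_sketch_t t)
{
    unsigned int hi = (gfn >> IDX_BITS) & IDX_MASK;
    unsigned int lo = gfn & IDX_MASK;

    if ( gfn >> (2 * IDX_BITS) )
        return -1;                /* beyond what two levels can index */

    if ( !root->leaf[hi] )
    {
        root->leaf[hi] = pool_alloc(pool, sizeof(struct meta_leaf));
        if ( !root->leaf[hi] )
            return -1;
    }

    root->leaf[hi]->type[lo] = t;
    return 0;
}

/* Look up the type for 'gfn'; unpopulated ranges read as invalid (0). */
static p2m_type_sketch_t meta_get(const struct meta_root *root, uint64_t gfn)
{
    const struct meta_leaf *leaf;

    if ( gfn >> (2 * IDX_BITS) )
        return 0;

    leaf = root->leaf[(gfn >> IDX_BITS) & IDX_MASK];
    return leaf ? leaf->type[gfn & IDX_MASK] : 0;
}
```

No TLB flush appears anywhere in this path, since the hardware never walks
these tables; the cost is the extra walk on every type lookup.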

~ Oleksii


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 10:41                     ` Oleksii Kurochko
@ 2025-07-22 11:34                       ` Oleksii Kurochko
  2025-07-22 12:00                         ` Jan Beulich
  2025-07-22 11:54                       ` Jan Beulich
  1 sibling, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 11:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/22/25 12:41 PM, Oleksii Kurochko wrote:
>
>
> On 7/21/25 2:18 PM, Jan Beulich wrote:
>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>           return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>       }
>>>>>>>>>>>       
>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>
>>>>>>>>>>> +{
>>>>>>>>>>> +    int rc;
>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>> some extra details?
>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>> of the function, making it possible to use it here?
>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>> implementation.
>>>>>> Isn't that an overly severe limitation?
>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>>>> will likely remain unchanged.
>>>>>
>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>>>> see the value in, or the need for, non-1:1 mapped domains; it's just that this
>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>> that function outside of the special gnttab case.
>>>>
>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>> ???
>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>> come from that pool anyway.
>>> I thought that it is possible to do that somehow, but looking at the code of
>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>
>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>> That would be a possibility, but you may have seen that less than half a
>> year ago we got rid of something along these lines. So it would require
>> some pretty good justification to re-introduce.
>>
>>> or radix tree
>>> can't be used at all for mentioned in the previous replies security reason, no?
>> (Very) careful use may still be possible. But the downside of using this
>> (potentially long lookup times) would always remain.
> Could you please clarify what you mean here by "(Very) careful"?
> I thought about introducing a limit on the number of possible keys in the radix tree, and
> stopping the domain once this limit reaches 0. It is also unclear what the value of this limit
> should be. You probably have a better idea.
>
> But generally your idea below ...
>>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>>> the memory runs out, domain_crash() will be called or the PTE will be zapped.
>>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>>> isn't.
>>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>>> with this issue.
>>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>>> XSA.
>>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>>> it's not clear to me whether that's actually a supported feature.
>>>>
>>>>> I don't understand why x86 won't have the same issue. Memory is a limited
>>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>>> happen that other domains won't have enough memory for their purposes...
>>>> The question is whether allocations are bounded. With this use of a radix tree,
>>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>>> allocations in general.
>>> Isn't radix tree key bounded to an amount of GFNs given for a domain? We can't have
>>> more keys than a max GFN number for a domain. So a potential amount of necessary memory
>>> for radix tree is also bounded to an amount of GFNs.
>> To some degree yes, hence why I said "pretty much arbitrary amounts".
>> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
>> use it to describe the amount of memory pages given to the guest. GFNs
>> can be used for other purposes, though. Guests could e.g. grant
>> themselves access to their own memory, then map those grants at
>> otherwise unused GFNs.
>>
>>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>>> is used 9-bits for count of a frame?
>> struct page_info describes MFNs, when you want to describe GFNs. As you
>> mentioned earlier, multiple GFNs can in principle map to the same MFN.
>> You would force them to all have the same properties, which would be in
>> direct conflict with e.g. the grant P2M types.
>>
>> Just to mention one possible alternative to using radix trees: You could
>> maintain a 2nd set of intermediate "page tables", just that leaf entries
>> would hold meta data for the respective GFN. The memory for those "page
>> tables" could come from the normal P2M pool (and allocation would thus
>> only consume domain-specific resources). Of course in any model like
>> this the question of lookup times (as mentioned above) would remain.
> ...looks like an optimal option.
>
> The only thing I worry about is that it will require some code duplication
> (I will think how to re-use the current one code), as for example, when
> setting/getting metadata, TLB flushing isn’t needed at all as we aren't
> working with real P2M page tables.
> Agree that lookup won't be the best one, but nothing can be done with
> such models.

Probably, instead of having a second set of intermediate "page tables",
we could just allocate two consecutive pages within the real P2M page
tables for the intermediate page table. The first page would serve as
the actual page table to which the intermediate page table points,
and the second page would store metadata for each entry of the page
table that the intermediate page table references.

As we support only 1 GiB, 2 MiB and 4 KiB mappings, we could do a small
optimization and allocate these consecutive pages only for the PT levels
which correspond to 1 GiB, 2 MiB and 4 KiB mappings.
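
To make the layout concrete, here is a rough, standalone sketch of the
"two consecutive pages" idea. It is not Xen code: struct pt_pair,
pt_pair_alloc(), meta_set() and meta_get() are hypothetical names, one byte
of metadata per entry is an assumption, and aligned_alloc() merely stands in
for an order-1 allocation from the P2M pool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE     4096u
#define PTES_PER_PAGE (PAGE_SIZE / sizeof(uint64_t)) /* 512 eight-byte PTEs */

/*
 * Hypothetical layout: the first page holds the PTEs, the second page
 * holds one metadata byte (e.g. a P2M type) per PTE of the first page.
 */
struct pt_pair {
    uint64_t pte[PTES_PER_PAGE];
    uint8_t  meta[PAGE_SIZE];
};

/* Stand-in for an order-1 (two contiguous pages) P2M pool allocation. */
static struct pt_pair *pt_pair_alloc(void)
{
    struct pt_pair *t = aligned_alloc(2 * PAGE_SIZE, sizeof(*t));

    if ( t )
        memset(t, 0, sizeof(*t));

    return t;
}

/* The metadata slot is found with the same index as the PTE itself. */
static void meta_set(struct pt_pair *t, unsigned int idx, uint8_t type)
{
    assert(idx < PTES_PER_PAGE);
    t->meta[idx] = type;
}

static uint8_t meta_get(const struct pt_pair *t, unsigned int idx)
{
    assert(idx < PTES_PER_PAGE);
    return t->meta[idx];
}
```

With such a layout no extra walk is needed: once the walk reaches the table
holding the PTE, the metadata is at a fixed offset from it.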

Does it make sense?

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 10:41                     ` Oleksii Kurochko
  2025-07-22 11:34                       ` Oleksii Kurochko
@ 2025-07-22 11:54                       ` Jan Beulich
  1 sibling, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-22 11:54 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 22.07.2025 12:41, Oleksii Kurochko wrote:
> On 7/21/25 2:18 PM, Jan Beulich wrote:
>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>           return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>       }
>>>>>>>>>>>       
>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>
>>>>>>>>>>> +{
>>>>>>>>>>> +    int rc;
>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>> some extra details?
>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>> of the function, making it possible to use it here?
>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>> implementation.
>>>>>> Isn't that an overly severe limitation?
>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>>>> will likely remain unchanged.
>>>>>
>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>> that function outside of the special gnttab case.
>>>>
>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>> ???
>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>> come from that pool anyway.
>>> I thought it was possible to do that somehow, but looking at the code of
>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>
>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>> That would be a possibility, but you may have seen that less than half a
>> year ago we got rid of something along these lines. So it would require
>> some pretty good justification to re-introduce.
>>
>>> or radix tree
>>> can't be used at all for mentioned in the previous replies security reason, no?
>> (Very) careful use may still be possible. But the downside of using this
>> (potentially long lookup times) would always remain.
> 
> Could you please clarify what do you mean here by "(Very) careful"?
> I thought about an introduction of an amount of possible keys in radix tree and if this amount
> is 0 then stop domain. And it is also unclear what should be a value for this amount.
> Probably, you have better idea.

I had no particular idea in mind. I said "(very) careful" merely to clarify
that whatever model is chosen, it would need to satisfy certain needs.

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 11:34                       ` Oleksii Kurochko
@ 2025-07-22 12:00                         ` Jan Beulich
  2025-07-22 14:25                           ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-22 12:00 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 22.07.2025 13:34, Oleksii Kurochko wrote:
> 
> On 7/22/25 12:41 PM, Oleksii Kurochko wrote:
>>
>>
>> On 7/21/25 2:18 PM, Jan Beulich wrote:
>>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>>           return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       
>>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>>
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    int rc;
>>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>>> some extra details?
>>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>>> of the function, making it possible to use it here?
>>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>>> implementation.
>>>>>>> Isn't that an overly severe limitation?
>>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>>>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>>>>> will likely remain unchanged.
>>>>>>
>>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>>>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>>> that function outside of the special gnttab case.
>>>>>
>>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>>> ???
>>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>>> come from that pool anyway.
>>>> I thought it was possible to do that somehow, but looking at the code of
>>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>>
>>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>>> That would be a possibility, but you may have seen that less than half a
>>> year ago we got rid of something along these lines. So it would require
>>> some pretty good justification to re-introduce.
>>>
>>>> or radix tree
>>>> can't be used at all for mentioned in the previous replies security reason, no?
>>> (Very) careful use may still be possible. But the downside of using this
>>> (potentially long lookup times) would always remain.
>> Could you please clarify what do you mean here by "(Very) careful"?
>> I thought about an introduction of an amount of possible keys in radix tree and if this amount
>> is 0 then stop domain. And it is also unclear what should be a value for this amount.
>> Probably, you have better idea.
>>
>> But generally your idea below ...
>>>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>>>> the memory runs out, domain_crash() will be called or the PTE will be zapped.
>>>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>>>> isn't.
>>>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>>>> with this issue.
>>>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>>>> XSA.
>>>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>>>> it's not clear to me whether that's actually a supported feature.
>>>>>
>>>>>> I don't understand why x86 won't have the same issue. Memory is the limited
>>>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>>>> happen that other domains won't have enough memory for its purpose...
>>>>> The question is whether allocations are bounded. With this use of a radix tree,
>>>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>>>> allocations in general.
>>>> Isn't radix tree key bounded to an amount of GFNs given for a domain? We can't have
>>>> more keys than a max GFN number for a domain. So a potential amount of necessary memory
>>>> for radix tree is also bounded to an amount of GFNs.
>>> To some degree yes, hence why I said "pretty much arbitrary amounts".
>>> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
>>> use it to describe the amount of memory pages given to the guest. GFNs
>>> can be used for other purposes, though. Guests could e.g. grant
>>> themselves access to their own memory, then map those grants at
>>> otherwise unused GFNs.
>>>
>>>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>>>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>>>> is used 9-bits for count of a frame?
>>> struct page_info describes MFNs, when you want to describe GFNs. As you
>>> mentioned earlier, multiple GFNs can in principle map to the same MFN.
>>> You would force them to all have the same properties, which would be in
>>> direct conflict with e.g. the grant P2M types.
>>>
>>> Just to mention one possible alternative to using radix trees: You could
>>> maintain a 2nd set of intermediate "page tables", just that leaf entries
>>> would hold meta data for the respective GFN. The memory for those "page
>>> tables" could come from the normal P2M pool (and allocation would thus
>>> only consume domain-specific resources). Of course in any model like
>>> this the question of lookup times (as mentioned above) would remain.
>> ...looks like an optimal option.
>>
>> The only thing I worry about is that it will require some code duplication
>> (I will think how to re-use the current one code), as for example, when
>> setting/getting metadata, TLB flushing isn’t needed at all as we aren't
>> working with real P2M page tables.
>> Agree that lookup won't be the best one, but nothing can be done with
>> such models.
> 
> Probably, instead of having a second set of intermediate "page tables",
> we could just allocate two consecutive pages within the real P2M page
> tables for the intermediate page table. The first page would serve as
> the actual page table to which the intermediate page table points,
> and the second page would store metadata for each entry of the page
> table that the intermediate page table references.
> 
> As we support only 1 GiB, 2 MiB and 4 KiB mappings, we could do a small
> optimization and allocate these consecutive pages only for the PT levels
> which correspond to 1 GiB, 2 MiB and 4 KiB mappings.
> 
> Does it make sense?

I was indeed entertaining this idea, but I couldn't conclude for myself if
that would indeed be without any rough edges. Hence I didn't want to
suggest such. For example, the need to have adjacent pairs of pages could
result in a higher rate of allocation failures (while populating or
re-sizing the P2M pool). This would be possible to avoid by still using
entirely separate pages, and then merely linking them together via some
unused struct page_info fields (the "normal" linking fields can't be used,
afaict).
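
For illustration only, the "separate pages, linked together" variant could
look roughly like the standalone sketch below. Nothing here is existing Xen
code: struct page_desc is a hypothetical stand-in for struct page_info,
meta_link is an assumed spare field, and calloc() stands in for taking
single pages from the P2M pool.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u

/* Hypothetical stand-in for struct page_info. */
struct page_desc {
    void *va;                    /* contents of the page */
    struct page_desc *meta_link; /* assumed spare field: companion metadata page */
};

/* Stand-in for taking one (order-0) page from the P2M pool. */
static struct page_desc *page_desc_alloc(void)
{
    struct page_desc *pg = calloc(1, sizeof(*pg));

    if ( !pg )
        return NULL;

    pg->va = calloc(1, PAGE_SIZE);
    if ( !pg->va )
    {
        free(pg);
        return NULL;
    }

    return pg;
}

static void page_desc_free(struct page_desc *pg)
{
    if ( pg )
    {
        free(pg->va);
        free(pg);
    }
}

/* Two independent single-page allocations, tied together by the link. */
static struct page_desc *pt_alloc_with_meta(void)
{
    struct page_desc *pt = page_desc_alloc();
    struct page_desc *meta = page_desc_alloc();

    if ( !pt || !meta )
    {
        page_desc_free(pt);
        page_desc_free(meta);
        return NULL;
    }

    pt->meta_link = meta;

    return pt;
}

/* Metadata byte for PTE number idx of the given page table page. */
static uint8_t *pt_meta(struct page_desc *pt, unsigned int idx)
{
    assert(idx < PAGE_SIZE / sizeof(uint64_t));
    return (uint8_t *)pt->meta_link->va + idx;
}
```

The point of the sketch is only that neither allocation needs to be
contiguous with the other, so no order-1 allocation can fail.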

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-21 13:39         ` Jan Beulich
@ 2025-07-22 12:03           ` Oleksii Kurochko
  2025-07-22 12:05             ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 12:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 2779 bytes --]


On 7/21/25 3:39 PM, Jan Beulich wrote:
> On 18.07.2025 16:37, Oleksii Kurochko wrote:
>> On 7/2/25 12:28 PM, Jan Beulich wrote:
>>> On 02.07.2025 12:09, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>>>>    {
>>>>>        return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>>>>    }
>>>>> +
>>>>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>>>> +{
>>>>> +    ASSERT_UNREACHABLE();
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>>>>> +                                                      unsigned long nr)
>>>>> +{
>>>>> +    unsigned long x, y = page->count_info;
>>>>> +    struct domain *owner;
>>>>> +
>>>>> +    /* Restrict nr to avoid "double" overflow */
>>>>> +    if ( nr >= PGC_count_mask )
>>>>> +    {
>>>>> +        ASSERT_UNREACHABLE();
>>>>> +        return NULL;
>>>>> +    }
>>>> I question the validity of this, already in the Arm original: I can't spot
>>>> how the caller guarantees to stay below that limit. Without such an
>>>> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
>>>> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
>>>> any limit check.
>>>>
>>>>> +    do {
>>>>> +        x = y;
>>>>> +        /*
>>>>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>>>>> +         * Count == -1: Reference count would wrap, which is invalid.
>>>>> +         */
>>>> May I once again ask that you look carefully at comments (as much as at code)
>>>> you copy. Clearly this comment wasn't properly updated when the bumping by 1
>>>> was changed to bumping by nr.
>>>>
>>>>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>>>>> +            return NULL;
>>>>> +    }
>>>>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>>>>> +
>>>>> +    owner = page_get_owner(page);
>>>>> +    ASSERT(owner);
>>>>> +
>>>>> +    return owner;
>>>>> +}
>>> There also looks to be a dead code concern here (towards the "nr" parameters
>>> here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
>>> leave out Misra rule 2.2 entirely.
>> I think that I didn't get what is an issue when STATIC_SHM=n, functions is still
>> going to be called through page_get_owner_and_reference(), at least, in page_alloc.c .
> Yes, but will "nr" ever be anything other than 1 then? IOW omitting the parameter
> would be fine. And that's what "dead code" is about.

Got it.

So we don't have any SAF-x tag to mark this function as safe. What is the best
solution for now if the nr argument will be needed in the future for STATIC_SHM=y?

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-22 12:03           ` Oleksii Kurochko
@ 2025-07-22 12:05             ` Jan Beulich
  2025-07-29 13:47               ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-22 12:05 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel, Stefano Stabellini

On 22.07.2025 14:03, Oleksii Kurochko wrote:
> On 7/21/25 3:39 PM, Jan Beulich wrote:
>> On 18.07.2025 16:37, Oleksii Kurochko wrote:
>>> On 7/2/25 12:28 PM, Jan Beulich wrote:
>>>> On 02.07.2025 12:09, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>>>>>    {
>>>>>>        return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>>>>>    }
>>>>>> +
>>>>>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>>>>> +{
>>>>>> +    ASSERT_UNREACHABLE();
>>>>>> +
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>>>>>> +                                                      unsigned long nr)
>>>>>> +{
>>>>>> +    unsigned long x, y = page->count_info;
>>>>>> +    struct domain *owner;
>>>>>> +
>>>>>> +    /* Restrict nr to avoid "double" overflow */
>>>>>> +    if ( nr >= PGC_count_mask )
>>>>>> +    {
>>>>>> +        ASSERT_UNREACHABLE();
>>>>>> +        return NULL;
>>>>>> +    }
>>>>> I question the validity of this, already in the Arm original: I can't spot
>>>>> how the caller guarantees to stay below that limit. Without such an
>>>>> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
>>>>> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
>>>>> any limit check.
>>>>>
>>>>>> +    do {
>>>>>> +        x = y;
>>>>>> +        /*
>>>>>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>>>>>> +         * Count == -1: Reference count would wrap, which is invalid.
>>>>>> +         */
>>>>> May I once again ask that you look carefully at comments (as much as at code)
>>>>> you copy. Clearly this comment wasn't properly updated when the bumping by 1
>>>>> was changed to bumping by nr.
>>>>>
>>>>>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>>>>>> +            return NULL;
>>>>>> +    }
>>>>>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>>>>>> +
>>>>>> +    owner = page_get_owner(page);
>>>>>> +    ASSERT(owner);
>>>>>> +
>>>>>> +    return owner;
>>>>>> +}
>>>> There also looks to be a dead code concern here (towards the "nr" parameters
>>>> here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
>>>> leave out Misra rule 2.2 entirely.
>>> I think that I didn't get what is an issue when STATIC_SHM=n, functions is still
>>> going to be called through page_get_owner_and_reference(), at least, in page_alloc.c .
>> Yes, but will "nr" ever be anything other than 1 then? IOW omitting the parameter
>> would be fine. And that's what "dead code" is about.
> 
> Got it.
> 
> So we don't have any SAF-x tag to mark this function as safe. What is the best
> solution for now if the nr argument will be needed in the future for STATIC_SHM=y?

Add the parameter at that point. Just like was done for Arm.
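
As a rough, standalone illustration of what the helper reduces to without
"nr" (C11 atomics instead of Xen's cmpxchg(); COUNT_MASK and get_ref() are
made-up names, and the mask width is an arbitrary assumption, not the real
PGC_count_mask):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed for illustration: the low 8 bits hold the reference count. */
#define COUNT_MASK 0xffUL

/* Take exactly one reference; fail if the page is free or the count is full. */
static bool get_ref(_Atomic unsigned long *count_info)
{
    unsigned long x = atomic_load(count_info);

    do {
        /*
         * Count ==  0: page is not allocated, so no reference can be taken.
         * Count == -1: all count bits set, incrementing would wrap.
         */
        if ( ((x + 1) & COUNT_MASK) <= 1 )
            return false;
        /* On failure x is reloaded with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(count_info, &x, x + 1) );

    return true;
}
```

With the increment fixed at 1 the two-case comment is accurate again, and
there is no need for the separate "nr >= PGC_count_mask" check.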

Jan


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-21 13:42       ` Jan Beulich
@ 2025-07-22 13:38         ` Oleksii Kurochko
  0 siblings, 0 replies; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 13:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 3984 bytes --]


On 7/21/25 3:42 PM, Jan Beulich wrote:
> On 18.07.2025 16:49, Oleksii Kurochko wrote:
>> On 7/2/25 12:09 PM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> Implement the mfn_valid() macro to verify whether a given MFN is valid by
>>>> checking that it falls within the range [start_page, max_page).
>>>> These bounds are initialized based on the start and end addresses of RAM.
>>>>
>>>> As part of this patch, start_page is introduced and initialized with the
>>>> PFN of the first RAM page.
>>>>
>>>> Also, after providing a non-stub implementation of the mfn_valid() macro,
>>>> the following compilation errors started to occur:
>>>>     riscv64-linux-gnu-ld: prelink.o: in function `__next_node':
>>>>     /build/xen/./include/xen/nodemask.h:202: undefined reference to `page_is_ram_type'
>>>>     riscv64-linux-gnu-ld: prelink.o: in function `get_free_buddy':
>>>>     /build/xen/common/page_alloc.c:881: undefined reference to `page_is_ram_type'
>>>>     riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
>>>>     /build/xen/common/page_alloc.c:1043: undefined reference to `page_get_owner_and_reference'
>>>>     riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1098: undefined reference to `page_is_ram_type'
>>>>     riscv64-linux-gnu-ld: prelink.o: in function `ns16550_interrupt':
>>>>     /build/xen/drivers/char/ns16550.c:205: undefined reference to `get_page'
>>>>     riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_get_owner_and_reference' isn't defined
>>>>     riscv64-linux-gnu-ld: final link failed: bad value
>>>>     make[2]: *** [arch/riscv/Makefile:35: xen-syms] Error 1
>>>> To resolve these errors, the following functions have also been introduced,
>>>> based on their Arm counterparts:
>>>> - page_get_owner_and_reference() and its variant to safely acquire a
>>>>     reference to a page and retrieve its owner.
>>>> - put_page() and put_page_nr() to release page references and free the page
>>>>     when the count drops to zero.
>>>>     For put_page_nr(), code related to static memory configuration is wrapped
>>>>     with CONFIG_STATIC_MEMORY, as this configuration has not yet been moved to
>>>>     common code. Therefore, PGC_static and free_domstatic_page() are not
>>>>     introduced for RISC-V. However, since this configuration could be useful
>>>>     in the future, the relevant code is retained and conditionally compiled.
>>>> - A stub for page_is_ram_type() that currently always returns 0 and asserts
>>>>     unreachable, as RAM type checking is not yet implemented.
>>> How does this end up working when common code references the function?
>> Based on the following commit message:
>>       Callers are VT-d (so x86 specific) and various bits of page offlining
>>       support, which although it looks generic (and is in xen/common) does
>>       things like diving into page_info->count_info which is not generic.
>>       
>>       In any case on this is only reachable via XEN_SYSCTL_page_offline_op,
>>       which clearly shouldn't be called on ARM just yet.
> What commit message are you talking about? Nothing like the above is anywhere
> in this patch.

It's a pretty old commit:
   commit 214c4cd94a80bcaf042f25158eaa7d0e5bbc3b5b
   Author: Ian Campbell <ian.campbell@citrix.com>
   Date:   Wed Dec 19 14:16:23 2012 +0000

But page_is_ram_type() hasn't been changed for Arm since then.

>
>> There is no callers for page_is_ram_type() for Arm now, and I expect something similar
>> for RISC-V. As we don't even introduce hypercalls for RISC-V, we can just live
>> without it.
> If there's no caller, why the stub?

Because the parts of common code which use it aren't under #ifdef, for example this
function:
   int offline_page(mfn_t mfn, int broken, uint32_t *status)

And specifically this function is called when XEN_SYSCTL_page_offline_op is handled.

But I agree that it seems like nothing prevents XEN_SYSCTL_page_offline_op from being called.

~ Oleksii

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 12:00                         ` Jan Beulich
@ 2025-07-22 14:25                           ` Oleksii Kurochko
  2025-07-22 14:35                             ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 14:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 10146 bytes --]


On 7/22/25 2:00 PM, Jan Beulich wrote:
> On 22.07.2025 13:34, Oleksii Kurochko wrote:
>> On 7/22/25 12:41 PM, Oleksii Kurochko wrote:
>>>
>>> On 7/21/25 2:18 PM, Jan Beulich wrote:
>>>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>>>            return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>>>        }
>>>>>>>>>>>>>        
>>>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>>>
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    int rc;
>>>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>>>> some extra details?
>>>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>>>> of the function, making it possible to use it here?
>>>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>>>> implementation.
>>>>>>>> Isn't that an overly severe limitation?
>>>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>>>> |mfn_to_gfn()| is implemented. When non-1:1 mapped domains are supported,
>>>>>>> |mfn_to_gfn()| can be implemented differently, while the code where it’s called
>>>>>>> will likely remain unchanged.
>>>>>>>
>>>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>>>> this is the final implementation of |mfn_to_gfn()|. But that doesn't mean I don't
>>>>>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>>>> that function outside of the special gnttab case.
>>>>>>
>>>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>>>> memory a guest domain can use by specifying |xen,domain-p2m-mem-mb| in the DTS,
>>>>>>>>>>> or using some predefined value if |xen,domain-p2m-mem-mb| isn’t explicitly set.
>>>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>>>> ???
>>>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>>>> come from that pool anyway.
>>>>> I thought it was possible to do that somehow, but looking at the code of
>>>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>>>
>>>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>>>> That would be a possibility, but you may have seen that less than half a
>>>> year ago we got rid of something along these lines. So it would require
>>>> some pretty good justification to re-introduce.
>>>>
>>>>> or radix tree
>>>>> can't be used at all for mentioned in the previous replies security reason, no?
>>>> (Very) careful use may still be possible. But the downside of using this
>>>> (potentially long lookup times) would always remain.
>>> Could you please clarify what do you mean here by "(Very) careful"?
>>> I thought about an introduction of an amount of possible keys in radix tree and if this amount
>>> is 0 then stop domain. And it is also unclear what should be a value for this amount.
>>> Probably, you have better idea.
>>>
>>> But generally your idea below ...
>>>>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>>>>> the memory runs out, |domain_crash()| will be called or PTE will be zapped.
>>>>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>>>>> isn't.
>>>>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>>>>> with this issue.
>>>>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>>>>> XSA.
>>>>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>>>>> it's not clear to me whether that's actually a supported feature.
>>>>>>
>>>>>>> I don't understand why x86 won't have the same issue. Memory is the limited
>>>>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>>>>> happen that other domains won't have enough memory for its purpose...
>>>>>> The question is whether allocations are bounded. With this use of a radix tree,
>>>>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>>>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>>>>> allocations in general.
>>>>> Isn't radix tree key bounded to an amount of GFNs given for a domain? We can't have
>>>>> more keys than a max GFN number for a domain. So a potential amount of necessary memory
>>>>> for radix tree is also bounded to an amount of GFNs.
>>>> To some degree yes, hence why I said "pretty much arbitrary amounts".
>>>> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
>>>> use it to describe the amount of memory pages given to the guest. GFNs
>>>> can be used for other purposes, though. Guests could e.g. grant
>>>> themselves access to their own memory, then map those grants at
>>>> otherwise unused GFNs.
>>>>
>>>>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>>>>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>>>>> is used 9-bits for count of a frame?
>>>> struct page_info describes MFNs, when you want to describe GFNs. As you
>>>> mentioned earlier, multiple GFNs can in principle map to the same MFN.
>>>> You would force them to all have the same properties, which would be in
>>>> direct conflict with e.g. the grant P2M types.
>>>>
>>>> Just to mention one possible alternative to using radix trees: You could
>>>> maintain a 2nd set of intermediate "page tables", just that leaf entries
>>>> would hold meta data for the respective GFN. The memory for those "page
>>>> tables" could come from the normal P2M pool (and allocation would thus
>>>> only consume domain-specific resources). Of course in any model like
>>>> this the question of lookup times (as mentioned above) would remain.
>>> ...looks like an optimal option.
>>>
>>> The only thing I worry about is that it will require some code duplication
>>> (I will think how to re-use the current one code), as for example, when
>>> setting/getting metadata, TLB flushing isn’t needed at all as we aren't
>>> working with real P2M page tables.
>>> Agree that lookup won't be the best one, but nothing can be done with
>>> such models.
>> Probably, instead of having a second set of intermediate "page tables",
>> we could just allocate two consecutive pages within the real P2M page
>> tables for the intermediate page table. The first page would serve as
>> the actual page table to which the intermediate page table points,
>> and the second page would store metadata for each entry of the page
>> table that the intermediate page table references.
>>
>> As we are supporting only 1gb, 2mb and 4kb mappings we could do a little
>> optimization and start allocate these consecutive pages only for PT levels
>> which corresponds to 1gb, 2mb, 4kb mappings.
>>
>> Does it make sense?
> I was indeed entertaining this idea, but I couldn't conclude for myself if
> that would indeed be without any rough edges. Hence I didn't want to
> suggest such. For example, the need to have adjacent pairs of pages could
> result in a higher rate of allocation failures (while populating or
> re-sizing the P2M pool). This would be possible to avoid by still using
> entirely separate pages, and then merely linking them together via some
> unused struct page_info fields (the "normal" linking fields can't be used,
> afaict).

I think all the fields are already in use, so a new field would need to be
introduced, e.g. "struct page_list_entry metadata_list;".

Couldn't we introduce a new PGT_METADATA type, add the metadata page to
struct page_info->list, and then, whenever metadata is needed, just iterate
through page_info->list to find the page with the PGT_METADATA type?
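
To make the trade-off concrete, here is a minimal sketch (not Xen code)
contrasting the two lookups discussed here: scanning a page list for a page
tagged as metadata versus following a direct link kept in the page-table page.
struct toy_page, the TOY_PGT_* tags and both helpers are hypothetical
stand-ins; Xen's struct page_info has different fields, and whether a spare
field can be reused for such a link is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for Xen's struct page_info. */
struct toy_page {
    unsigned int type;          /* toy page type tag */
    struct toy_page *next;      /* toy list linkage */
    struct toy_page *metadata;  /* hypothetical direct link, NULL if none */
};

#define TOY_PGT_TABLE    1u
#define TOY_PGT_METADATA 2u

/* Option A: walk the list until a page tagged as metadata is found. */
static struct toy_page *find_metadata_by_scan(struct toy_page *head)
{
    for ( struct toy_page *pg = head; pg != NULL; pg = pg->next )
        if ( pg->type == TOY_PGT_METADATA )
            return pg;

    return NULL;
}

/* Option B: follow the link stored in the page-table page itself. */
static struct toy_page *find_metadata_by_link(const struct toy_page *table)
{
    assert(table != NULL);
    return table->metadata;
}
```

The scan costs time proportional to the list length and relies on a new page
type; the link is a single pointer read but needs a field that is free while
the page is used as a P2M table.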

~ Oleksii



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 14:25                           ` Oleksii Kurochko
@ 2025-07-22 14:35                             ` Jan Beulich
  2025-07-22 16:07                               ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-22 14:35 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 22.07.2025 16:25, Oleksii Kurochko wrote:
> On 7/22/25 2:00 PM, Jan Beulich wrote:
>> On 22.07.2025 13:34, Oleksii Kurochko wrote:
>>> On 7/22/25 12:41 PM, Oleksii Kurochko wrote:
>>>> On 7/21/25 2:18 PM, Jan Beulich wrote:
>>>>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>>>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>>>>            return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>>        
>>>>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    int rc;
>>>>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>>>>> some extra details?
>>>>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>>>>> of the function, making it possible to use it here?
>>>>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>>>>> implementation.
>>>>>>>>> Isn't that an overly severe limitation?
>>>>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>>>>> |mfn_to_gfn()| is implemented. When non-1:1 mapped domains are supported,
>>>>>>>> |mfn_to_gfn()| can be implemented differently, while the code where it’s called
>>>>>>>> will likely remain unchanged.
>>>>>>>>
>>>>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>>>>> this is the final implementation of |mfn_to_gfn()|. But that doesn't mean I don't
>>>>>>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>>>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>>>>> that function outside of the special gnttab case.
>>>>>>>
>>>>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>>>>> memory a guest domain can use by specifying |xen,domain-p2m-mem-mb| in the DTS,
>>>>>>>>>>>> or using some predefined value if |xen,domain-p2m-mem-mb| isn’t explicitly set.
>>>>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>>>>> ???
>>>>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>>>>> come from that pool anyway.
>>>>>> I thought it was possible to do that somehow, but looking at the code of
>>>>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>>>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>>>>
>>>>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>>>>> That would be a possibility, but you may have seen that less than half a
>>>>> year ago we got rid of something along these lines. So it would require
>>>>> some pretty good justification to re-introduce.
>>>>>
>>>>>> or radix tree
>>>>>> can't be used at all for mentioned in the previous replies security reason, no?
>>>>> (Very) careful use may still be possible. But the downside of using this
>>>>> (potentially long lookup times) would always remain.
>>>> Could you please clarify what do you mean here by "(Very) careful"?
>>>> I thought about an introduction of an amount of possible keys in radix tree and if this amount
>>>> is 0 then stop domain. And it is also unclear what should be a value for this amount.
>>>> Probably, you have better idea.
>>>>
>>>> But generally your idea below ...
>>>>>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>>>>>> the memory runs out, |domain_crash()| will be called or PTE will be zapped.
>>>>>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>>>>>> isn't.
>>>>>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>>>>>> with this issue.
>>>>>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>>>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>>>>>> XSA.
>>>>>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>>>>>> it's not clear to me whether that's actually a supported feature.
>>>>>>>
>>>>>>>> I don't understand why x86 won't have the same issue. Memory is the limited
>>>>>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>>>>>> happen that other domains won't have enough memory for its purpose...
>>>>>>> The question is whether allocations are bounded. With this use of a radix tree,
>>>>>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>>>>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>>>>>> allocations in general.
>>>>>> Isn't radix tree key bounded to an amount of GFNs given for a domain? We can't have
>>>>>> more keys than a max GFN number for a domain. So a potential amount of necessary memory
>>>>>> for radix tree is also bounded to an amount of GFNs.
>>>>> To some degree yes, hence why I said "pretty much arbitrary amounts".
>>>>> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
>>>>> use it to describe the amount of memory pages given to the guest. GFNs
>>>>> can be used for other purposes, though. Guests could e.g. grant
>>>>> themselves access to their own memory, then map those grants at
>>>>> otherwise unused GFNs.
>>>>>
>>>>>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>>>>>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>>>>>> is used 9-bits for count of a frame?
>>>>> struct page_info describes MFNs, when you want to describe GFNs. As you
>>>>> mentioned earlier, multiple GFNs can in principle map to the same MFN.
>>>>> You would force them to all have the same properties, which would be in
>>>>> direct conflict with e.g. the grant P2M types.
>>>>>
>>>>> Just to mention one possible alternative to using radix trees: You could
>>>>> maintain a 2nd set of intermediate "page tables", just that leaf entries
>>>>> would hold meta data for the respective GFN. The memory for those "page
>>>>> tables" could come from the normal P2M pool (and allocation would thus
>>>>> only consume domain-specific resources). Of course in any model like
>>>>> this the question of lookup times (as mentioned above) would remain.
>>>> ...looks like an optimal option.
>>>>
>>>> The only thing I worry about is that it will require some code duplication
>>>> (I will think how to re-use the current one code), as for example, when
>>>> setting/getting metadata, TLB flushing isn’t needed at all as we aren't
>>>> working with real P2M page tables.
>>>> Agree that lookup won't be the best one, but nothing can be done with
>>>> such models.
>>> Probably, instead of having a second set of intermediate "page tables",
>>> we could just allocate two consecutive pages within the real P2M page
>>> tables for the intermediate page table. The first page would serve as
>>> the actual page table to which the intermediate page table points,
>>> and the second page would store metadata for each entry of the page
>>> table that the intermediate page table references.
>>>
>>> As we are supporting only 1gb, 2mb and 4kb mappings we could do a little
>>> optimization and start allocate these consecutive pages only for PT levels
>>> which corresponds to 1gb, 2mb, 4kb mappings.
>>>
>>> Does it make sense?
>> I was indeed entertaining this idea, but I couldn't conclude for myself if
>> that would indeed be without any rough edges. Hence I didn't want to
>> suggest such. For example, the need to have adjacent pairs of pages could
>> result in a higher rate of allocation failures (while populating or
>> re-sizing the P2M pool). This would be possible to avoid by still using
>> entirely separate pages, and then merely linking them together via some
>> unused struct page_info fields (the "normal" linking fields can't be used,
>> afaict).
> 
> I think all the fields are already in use, so a new field would need to be
> introduced, e.g. "struct page_list_entry metadata_list;".

All the fields are used _somewhere_, sure. But once you have allocated a
page (and that page isn't assigned to a domain), you control what the
fields are used for. Or else enlisting pages on private lists wouldn't be
legitimate either.

> Couldn't we introduce a new PGT_METADATA type, add the metadata page to
> struct page_info->list, and then, whenever metadata is needed, just iterate
> through page_info->list to find the page with the PGT_METADATA type?

I'd be careful with the introduction of new page types. All handling of
page types everywhere in the (affected part of the) code base would then
need auditing.

Jan



* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-07-21 13:34       ` Jan Beulich
@ 2025-07-22 14:57         ` Oleksii Kurochko
  2025-07-22 16:02           ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 14:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/21/25 3:34 PM, Jan Beulich wrote:
> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>> Add support for breaking down large memory mappings ("superpages") in the RISC-V
>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>> can be inserted into lower levels of the page table hierarchy.
>>>>
>>>> To implement that the following is done:
>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>     smaller page table entries down to the target level, preserving original
>>>>     permissions and attributes.
>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>     entries at lower levels within a superpage-mapped region.
>>>>
>>>> This implementation is based on the ARM code, with modifications to the part
>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>> TLB before updating it with the newly created, split page table.
>>> But some flushing is going to be necessary. As long as you only ever do
>>> global flushes, the one after the individual PTE modification (within the
>>> split table) will do (if BBM isn't required, see below), but once you move
>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>> sure it's a good idea to leave such a pitfall.
>> I think that I don't fully understand what is an issue.
> Whether a flush is necessary after solely breaking up a superpage is
> arch-defined. I don't know how many restrictions the spec places on possible
> hardware behavior for RISC-V. However, the eventual change of (at least) one
> entry to fulfill the original request will surely require a flush. What I was
> trying to say is that this required flush would better not also cover for
> the flushes that may or may not be required by the spec. IOW if the spec
> leaves any room for flushes to possibly be needed, those flushes would
> better be explicit.

I still don't understand why what I wrote above works as long as only global
flushes are used, but would stop working once more fine-grained flushing is
used.

Inside p2m_split_superpage() no TLB flush is needed, because it allocates a
new page for the table and works only on that page; nothing can be using the
page at that point, so nothing can have cached it in the TLB.

The only question is whether BBM is needed before starting to use the split entry:
         ....
         if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
         {
             /* Free the allocated sub-tree */
             p2m_free_subtree(p2m, split_pte, level);

             rc = -ENOMEM;
             goto out;
         }

------> /* Should BBM be used here? */
         p2m_write_pte(entry, split_pte, p2m->clean_pte);

And I can't find anything in the spec that tells me to use BBM here, as Arm
does:
         /*
          * Follow the break-before-sequence to update the entry.
          * For more details see (D4.7.1 in ARM DDI 0487A.j).
          */
         p2m_remove_pte(entry, p2m->clean_pte);
         p2m_force_tlb_flush_sync(p2m);

         p2m_write_pte(entry, split_pte, p2m->clean_pte);

I agree that, even where BBM isn't explicitly mandated, it can sometimes be
useful, but here I am not sure what the point of a TLB flush before
p2m_write_pte() would be: nothing serious will happen if the old mapping stays
in use for some time, as we keep the same permissions for the smaller (split)
mappings.
Arm does p2m_remove_pte() & p2m_force_tlb_flush_sync() before updating an
entry here because the architecture doesn't guarantee that everything will be
okay otherwise, and the Arm documentation explicitly says:
  This situation can possibly break coherency and ordering guarantees, leading to
  inconsistent unwanted behavior in our Processing Element (PE).
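
For illustration only, the two update sequences being compared can be modelled
with a toy MMU that has one PTE slot and one cached translation. This is a
sketch under stated assumptions, not Xen code: struct toy_mmu and all toy_*
and update_* helpers are hypothetical, and the model does not settle which
sequence the RISC-V spec requires.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one in-memory PTE and one cached copy of it. */
struct toy_mmu {
    uint64_t pte;         /* entry in the page table, 0 means invalid */
    uint64_t tlb;         /* cached translation, 0 means not cached */
    unsigned int flushes; /* number of TLB flushes issued */
};

static void toy_flush(struct toy_mmu *m)
{
    m->tlb = 0;
    m->flushes++;
}

/* An access refills the cached translation from memory on a miss. */
static uint64_t toy_access(struct toy_mmu *m)
{
    if ( m->tlb == 0 )
        m->tlb = m->pte;

    return m->tlb;
}

/*
 * Write the new entry, then flush. Between the two steps a stale cached
 * translation may still be used.
 */
static void update_write_then_flush(struct toy_mmu *m, uint64_t new_pte)
{
    m->pte = new_pte;
    toy_flush(m);
}

/*
 * Break-before-make: invalidate, flush, then write. The old and the new
 * translation are never both live, at the cost of a window with no mapping.
 */
static void update_bbm(struct toy_mmu *m, uint64_t new_pte)
{
    m->pte = 0;
    toy_flush(m);
    m->pte = new_pte;
}
```

Both helpers end with the new entry visible; they differ only in what can be
observed in the window before the flush completes.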


>>> As to (no need for) BBM: I couldn't find anything to that effect in the
>>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>>> this sentence: "To ensure that implicit reads observe writes to the same
>>> memory locations, an SFENCE.VMA instruction must be executed after the
>>> writes to flush the relevant cached translations." And this: "Accessing the
>>> same location using different cacheability attributes may cause loss of
>>> coherence." (This may not only occur when the same physical address is
>>> mapped twice at different VAs, but also after the shattering of a superpage
>>> when the new entry differs in cacheability.)
>> I also couldn't find that RISC-V spec mandates BBM explicitly as Arm does it.
>>
>> What I meant that on RISC-V can do:
>> - Write new PTE
>> - Flush TLB
>>
>> While on Arm it is almost always needed to do:
>> - Write zero to PTE
>> - Flush TLB
>> - Write new PTE
>>
>> For example, the common CoW code path where you copy from a read only page to
>> a new page, then map that new page as writable just works on RISC-V without
>> extra considerations and on Arm it requires BBM.
> CoW is a specific sub-case with increasing privilege.

Could you please explain why an increase of privilege matters with regard to
TLB flushing and clearing the PTE before writing a new one?

>
>> It seems to me that BBM is mandated for Arm only because that TLB is shared
>> among cores, so there is no any guarantee that no prior entry for same VA
>> remains in TLB. In case of RISC-V's TLB isn't shared and after a flush it is
>> guaranteed that no prior entry for the same VA remains in the TLB.
> Not sure that's the sole reason. But again the question is: Is this written
> down explicitly anywhere? Generally there can be multiple levels of TLBs, and
> while some of them may be per-core, others may be shared.

Spec. mentions that:
   If a hart employs an address-translation cache, that cache must appear to be
   private to that hart.

>
>>>> +    /*
>>>> +     * Even if we failed, we should install the newly allocated PTE
>>>> +     * entry. The caller will be in charge to free the sub-tree.
>>>> +     */
>>>> +    p2m_write_pte(entry, page_to_p2m_table(p2m, page), p2m->clean_pte);
>>> Why would it be wrong to free the page right here, vacating the entry at
>>> the same time (or leaving just that to the caller)? (IOW - if this is an
>>> implementation decision of yours, I think the word "should" would want
>>> dropping.) After all, the caller invoking p2m_free_entry() on the thus
>>> split PTE is less efficient (needs to iterate over all entries) than on
>>> the original one (where it's just a single superpage).
>> I think that I didn't get your idea.
> Well, first and foremost this was a question to you, as it's not clear to
> me whether "should" is the correct word to use here. It would be
> appropriate if the spec mandated this behavior. It would seem less
> appropriate if this arrangement was an implementation choice of yours.
> And it looks to me as if the latter was the case here.

No, the spec doesn't mandate such behavior; it is just an implementation choice.
I can mention that in the comment.

>
>>>> @@ -806,7 +877,36 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
>>>>         */
>>>>        if ( level > target )
>>> This condition is likely too strong, unless you actually mean to also
>>> split a superpage if it really wouldn't need splitting (new entry written
>>> still fitting with the superpage mapping, i.e. suitable MFN and same
>>> attributes).
>> I am not really sure that I fully understand.
>> My understanding is if level != target then the splitting is needed, I am
>> not really get the part "split a superpage if it really wouldn't need splitting".
> Hmm, maybe I was wrong here. The caller determines at what level the
> actual change needs to occur? In which case what you have may be right.

Yes, the caller determines the expected level and then asks __p2m_set_entry() to
map at this level.
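
As a rough sketch of that contract (not the actual Xen code): the caller turns
the requested page order into a target level, and a split is needed only while
the level of the existing mapping is above that target. The level numbering
(0 = 4K, 1 = 2M, 2 = 1G), the 9 bits per level, and all toy_* names are
assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGETABLE_ORDER 9u  /* 512 entries per table */
#define TOY_MAX_LEVEL       2u  /* assumed: 0 = 4K, 1 = 2M, 2 = 1G */

/* Order (log2 of the number of 4K pages) covered by one entry at a level. */
static unsigned int toy_level_order(unsigned int level)
{
    return level * TOY_PAGETABLE_ORDER;
}

/* Caller side: pick the highest level whose entry size fits the order. */
static unsigned int toy_target_level(unsigned int page_order)
{
    unsigned int level = 0;

    while ( level < TOY_MAX_LEVEL && toy_level_order(level + 1) <= page_order )
        level++;

    return level;
}

/* Callee side: an existing mapping above the target has to be split. */
static bool toy_needs_split(unsigned int level, unsigned int target)
{
    assert(level <= TOY_MAX_LEVEL && target <= TOY_MAX_LEVEL);
    return level > target;
}
```

For example, a request with order 0 inside a region currently mapped by a 1G
entry gives target level 0 and existing level 2, so the entry is split.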

~ Oleksii




* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-07-22 14:57         ` Oleksii Kurochko
@ 2025-07-22 16:02           ` Jan Beulich
  2025-07-23 19:51             ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-22 16:02 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 22.07.2025 16:57, Oleksii Kurochko wrote:
> 
> On 7/21/25 3:34 PM, Jan Beulich wrote:
>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> Add support for breaking down large memory mappings ("superpages") in the RISC-V
>>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>>> can be inserted into lower levels of the page table hierarchy.
>>>>>
>>>>> To implement that the following is done:
>>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>>     smaller page table entries down to the target level, preserving original
>>>>>     permissions and attributes.
>>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>>     entries at lower levels within a superpage-mapped region.
>>>>>
>>>>> This implementation is based on the ARM code, with modifications to the part
>>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>>> TLB before updating it with the newly created, split page table.
>>>> But some flushing is going to be necessary. As long as you only ever do
>>>> global flushes, the one after the individual PTE modification (within the
>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>> sure it's a good idea to leave such a pitfall.
>>> I think that I don't fully understand what is an issue.
>> Whether a flush is necessary after solely breaking up a superpage is
>> arch-defined. I don't know how many restrictions the spec places on possible
>> hardware behavior for RISC-V. However, the eventual change of (at least) one
>> entry to fulfill the original request will surely require a flush. What I was
>> trying to say is that this required flush would better not also cover for
>> the flushes that may or may not be required by the spec. IOW if the spec
>> leaves any room for flushes to possibly be needed, those flushes would
>> better be explicit.
> 
> I still don't understand why what I wrote above works as long as only global
> flushes are used, but would stop working once more fine-grained flushing is
> used.
> 
> Inside p2m_split_superpage() no TLB flush is needed, because it allocates a
> new page for the table and works only on that page; nothing can be using the
> page at that point, so nothing can have cached it in the TLB.
> 
> The only question is that if it is needed BBM before staring using splitted entry:
>          ....
>          if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>          {
>              /* Free the allocated sub-tree */
>              p2m_free_subtree(p2m, split_pte, level);
> 
>              rc = -ENOMEM;
>              goto out;
>          }
> 
> ------> /* Should be BBM used here ? */
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
> 
> And I can't find anything in the spec that tells me to use BBM here like Arm
> does:

But what you need is a statement in the spec that you can get away without. Imo
at least.

>          /*
>           * Follow the break-before-sequence to update the entry.
>           * For more details see (D4.7.1 in ARM DDI 0487A.j).
>           */
>          p2m_remove_pte(entry, p2m->clean_pte);
>          p2m_force_tlb_flush_sync(p2m);
> 
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
> 
> I agree that even if BBM isn't mandated explicitly it could sometimes be useful, but
> here I am not really sure what the point is of doing a TLB flush before p2m_write_pte(),
> as nothing serious will happen if an old mapping is used for some time,
> since we are keeping the same rights for the smaller (split) mappings.
> The reason why Arm does p2m_remove_pte() & p2m_force_tlb_flush_sync() before updating
> an entry here is that Arm doesn't guarantee that everything will be okay otherwise, and
> they explicitly say:
>   This situation can possibly break coherency and ordering guarantees, leading to
>   inconsistent unwanted behavior in our Processing Element (PE).
> 
> 
>>>> As to (no need for) BBM: I couldn't find anything to that effect in the
>>>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>>>> this sentence: "To ensure that implicit reads observe writes to the same
>>>> memory locations, an SFENCE.VMA instruction must be executed after the
>>>> writes to flush the relevant cached translations." And this: "Accessing the
>>>> same location using different cacheability attributes may cause loss of
>>>> coherence." (This may not only occur when the same physical address is
>>>> mapped twice at different VAs, but also after the shattering of a superpage
>>>> when the new entry differs in cacheability.)
>>> I also couldn't find that the RISC-V spec mandates BBM explicitly as Arm does.
>>>
>>> What I meant is that on RISC-V one can do:
>>> - Write new PTE
>>> - Flush TLB
>>>
>>> While on Arm it is almost always needed to do:
>>> - Write zero to PTE
>>> - Flush TLB
>>> - Write new PTE
>>>
>>> For example, the common CoW code path where you copy from a read only page to
>>> a new page, then map that new page as writable just works on RISC-V without
>>> extra considerations and on Arm it requires BBM.
>> CoW is a specific sub-case with increasing privilege.
> 
> Could you please explain why increasing privilege matters in terms of TLB
> flushing and PTE clearing before writing a new PTE?

Some architectures automatically re-walk page tables when encountering a
permission violation based on TLB contents. Hence increasing privilege
can be a special case.

>>> It seems to me that BBM is mandated for Arm only because the TLB is shared
>>> among cores, so there is no guarantee that no prior entry for the same VA
>>> remains in the TLB. In the case of RISC-V the TLB isn't shared, and after a
>>> flush it is guaranteed that no prior entry for the same VA remains in the TLB.
>> Not sure that's the sole reason. But again the question is: Is this written
>> down explicitly anywhere? Generally there can be multiple levels of TLBs, and
>> while some of them may be per-core, others may be shared.
> 
> Spec. mentions that:
>    If a hart employs an address-translation cache, that cache must appear to be
>    private to that hart.

Hmm, okay, that indeed pretty much excludes shared TLBs.

Jan
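
The two PTE update orders contrasted in this thread, applied to a freshly
split superpage, can be sketched as below. This is a minimal, self-contained
model and not Xen code: the pte_t layout, write_pte(), tlb_flush(),
split_superpage() and the install_*() helpers are all hypothetical stand-ins,
and the TLB flush is only a counter. Whether the direct order is sufficient on
RISC-V is exactly the open question above; the sketch only shows the
mechanical difference between the two sequences.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified PTE model; this is not Xen's pte_t layout. */
#define PTE_VALID      (1ULL << 0)
#define PTE_PERM_MASK  (0x7ULL << 1)  /* R/W/X bits */
#define PTE_PPN_SHIFT  10
#define PT_ENTRIES     512

typedef struct { uint64_t bits; } pte_t;

static unsigned int tlb_flushes;  /* number of simulated TLB flushes */

static void tlb_flush(void) { tlb_flushes++; }
static void write_pte(pte_t *entry, pte_t val) { entry->bits = val.bits; }

/*
 * Fill @table with 512 leaf entries covering the same range as @super,
 * keeping the valid and permission bits of the original mapping.
 */
static void split_superpage(pte_t super, pte_t table[PT_ENTRIES])
{
    uint64_t base_ppn = super.bits >> PTE_PPN_SHIFT;
    uint64_t attrs = super.bits & (PTE_VALID | PTE_PERM_MASK);
    unsigned int i;

    assert((base_ppn % PT_ENTRIES) == 0);  /* superpage must be aligned */

    for ( i = 0; i < PT_ENTRIES; i++ )
        table[i].bits = ((base_ppn + i) << PTE_PPN_SHIFT) | attrs;
}

/* Non-leaf entry: valid, no R/W/X bits, pointing at the new table. */
static pte_t make_table_pte(uint64_t table_ppn)
{
    pte_t pte = { (table_ppn << PTE_PPN_SHIFT) | PTE_VALID };

    return pte;
}

/* Order proposed for RISC-V in the thread: write new PTE, then flush. */
static void install_direct(pte_t *entry, pte_t new_pte)
{
    write_pte(entry, new_pte);
    tlb_flush();
}

/* Break-before-make: zero the PTE, flush, then write the new PTE. */
static void install_bbm(pte_t *entry, pte_t new_pte)
{
    pte_t invalid = { 0 };

    write_pte(entry, invalid);
    tlb_flush();
    write_pte(entry, new_pte);
}
```

In both variants the split table is fully populated before the entry that
points to it becomes visible; the variants differ only in whether the old
mapping is torn down and flushed first.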



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 14:35                             ` Jan Beulich
@ 2025-07-22 16:07                               ` Oleksii Kurochko
  2025-07-23  9:46                                 ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-22 16:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/22/25 4:35 PM, Jan Beulich wrote:
> On 22.07.2025 16:25, Oleksii Kurochko wrote:
>> On 7/22/25 2:00 PM, Jan Beulich wrote:
>>> On 22.07.2025 13:34, Oleksii Kurochko wrote:
>>>> On 7/22/25 12:41 PM, Oleksii Kurochko wrote:
>>>>> On 7/21/25 2:18 PM, Jan Beulich wrote:
>>>>>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>>>>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>>>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>>>>>             return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>         
>>>>>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +    int rc;
>>>>>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>>>>>> some extra details?
>>>>>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>>>>>> of the function, making it possible to use it here?
>>>>>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>>>>>> implementation.
>>>>>>>>>> Isn't that an overly severe limitation?
>>>>>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>>>>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>>>>>>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>>>>>>>> will likely remain unchanged.
>>>>>>>>>
>>>>>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>>>>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>>>>>>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>>>>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>>>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>>>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>>>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>>>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>>>>>> that function outside of the special gnttab case.
>>>>>>>>
>>>>>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>>>>>> ???
>>>>>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>>>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>>>>>> come from that pool anyway.
>>>>>>> I thought that it is possible to do that somehow, but looking at the code of
>>>>>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>>>>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>>>>>
>>>>>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>>>>>> That would be a possibility, but you may have seen that less than half a
>>>>>> year ago we got rid of something along these lines. So it would require
>>>>>> some pretty good justification to re-introduce.
>>>>>>
>>>>>>> or radix tree
>>>>>>> can't be used at all for mentioned in the previous replies security reason, no?
>>>>>> (Very) careful use may still be possible. But the downside of using this
>>>>>> (potentially long lookup times) would always remain.
>>>>> Could you please clarify what do you mean here by "(Very) careful"?
>>>>> I thought about an introduction of an amount of possible keys in radix tree and if this amount
>>>>> is 0 then stop domain. And it is also unclear what should be a value for this amount.
>>>>> Probably, you have better idea.
>>>>>
>>>>> But generally your idea below ...
>>>>>>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>>>>>>> the memory runs out, domain_crash() will be called or the PTE will be zapped.
>>>>>>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>>>>>>> isn't.
>>>>>>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>>>>>>> with this issue.
>>>>>>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>>>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>>>>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>>>>>>> XSA.
>>>>>>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>>>>>>> it's not clear to me whether that's actually a supported feature.
>>>>>>>>
>>>>>>>>> I don't understand why x86 won't have the same issue. Memory is a limited
>>>>>>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>>>>>>> happen that other domains won't have enough memory for their purposes...
>>>>>>>> The question is whether allocations are bounded. With this use of a radix tree,
>>>>>>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>>>>>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>>>>>>> allocations in general.
>>>>>>> Isn't the number of radix tree keys bounded by the number of GFNs given to a domain? We can't
>>>>>>> have more keys than the max GFN number for a domain. So the potential amount of memory
>>>>>>> necessary for the radix tree is also bounded by the number of GFNs.
>>>>>> To some degree yes, hence why I said "pretty much arbitrary amounts".
>>>>>> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
>>>>>> use it to describe the amount of memory pages given to the guest. GFNs
>>>>>> can be used for other purposes, though. Guests could e.g. grant
>>>>>> themselves access to their own memory, then map those grants at
>>>>>> otherwise unused GFNs.
>>>>>>
>>>>>>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>>>>>>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>>>>>>> is used 9-bits for count of a frame?
>>>>>> struct page_info describes MFNs, when you want to describe GFNs. As you
>>>>>> mentioned earlier, multiple GFNs can in principle map to the same MFN.
>>>>>> You would force them to all have the same properties, which would be in
>>>>>> direct conflict with e.g. the grant P2M types.
>>>>>>
>>>>>> Just to mention one possible alternative to using radix trees: You could
>>>>>> maintain a 2nd set of intermediate "page tables", just that leaf entries
>>>>>> would hold meta data for the respective GFN. The memory for those "page
>>>>>> tables" could come from the normal P2M pool (and allocation would thus
>>>>>> only consume domain-specific resources). Of course in any model like
>>>>>> this the question of lookup times (as mentioned above) would remain.
>>>>> ...looks like an optimal option.
>>>>>
>>>>> The only thing I worry about is that it will require some code duplication
>>>>> (I will think how to re-use the current one code), as for example, when
>>>>> setting/getting metadata, TLB flushing isn’t needed at all as we aren't
>>>>> working with real P2M page tables.
>>>>> Agree that lookup won't be the best one, but nothing can be done with
>>>>> such models.
>>>> Probably, instead of having a second set of intermediate "page tables",
>>>> we could just allocate two consecutive pages within the real P2M page
>>>> tables for the intermediate page table. The first page would serve as
>>>> the actual page table to which the intermediate page table points,
>>>> and the second page would store metadata for each entry of the page
>>>> table that the intermediate page table references.
>>>>
>>>> As we are supporting only 1gb, 2mb and 4kb mappings we could do a little
>>>> optimization and start allocate these consecutive pages only for PT levels
>>>> which corresponds to 1gb, 2mb, 4kb mappings.
>>>>
>>>> Does it make sense?
>>> I was indeed entertaining this idea, but I couldn't conclude for myself if
>>> that would indeed be without any rough edges. Hence I didn't want to
>>> suggest such. For example, the need to have adjacent pairs of pages could
>>> result in a higher rate of allocation failures (while populating or
>>> re-sizing the P2M pool). This would be possible to avoid by still using
>>> entirely separate pages, and then merely linking them together via some
>>> unused struct page_info fields (the "normal" linking fields can't be used,
>>> afaict).
>> I think that all the fields are used, so it will be needed to introduce new
>> "struct page_list_entry metadata_list;".
> All the fields are used _somewhere_, sure. But once you have allocated a
> page (and that page isn't assigned to a domain), you control what the
> fields are used for.

I thought that the whole idea is to use the domain's pages from the P2M pool freelist,
pages for which are allocated by alloc_domheap_page(d, MEMF_no_owner), so an
allocated page is assigned to a domain.

I assume that in this case I have to take some pages for an intermediate page
table from the P2M pool freelist and set the owner domain to NULL (pg->inuse.domain = NULL).

Then in this case it isn't clear why pg->list can't be re-used to link several pages
for intermediate page table purposes + metadata. Is it because pg->list may be
non-empty? In that case it isn't clear whether I could use a page which has threaded pages.

page_info->count_info can't be re-used, as that would break the put_page_*() related code.
For a similar reason page_info->v.{...} can't be re-used, as page_get_owner()
would then be broken.
And page_info->tlbflush_timestamp is still needed by the common allocation code to decide
when to do a TLB flush.

So if I add something to page_info->v.{...} or page_info->u.{...}, then the functions
mentioned above can't be used anymore for pages which are used for intermediate
page tables.

>   Or else enlisting pages on private lists wouldn't be
> legitimate either.

Hmm, but I still have to link several pages somehow.
Or did you mean just having a field which stores the physical address of the metadata?
(which still looks like a list)


>
>> Can't we introduce new PGT_METADATA type and then add metadata page to
>> struct page_info->list and when a metadata will be needed just iterate through
>> page_info->list and find a page with PGT_METADATA type?
> I'd be careful with the introduction of new page types. All handling of
> page types everywhere in the (affected part of the) code base would then
> need auditing.

~ Oleksii
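
The "second set of page tables holding metadata" idea discussed above can be
sketched roughly as follows. Everything here is an assumption made for
illustration: struct pt_page, struct md_page, pool_alloc() and the
p2m_set_type()/p2m_get_type() helpers are hypothetical, the P2M pool is
modelled as a tiny static array, and the link to the metadata page is kept in
the table structure itself, whereas real code would more likely keep it in
struct page_info.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PT_ENTRIES     512
#define PT_LEVEL_BITS  9

/* Hypothetical subset of P2M types; p2m_invalid must be 0. */
typedef enum { p2m_invalid, p2m_ram_rw, p2m_mmio, p2m_grant_map_rw } p2m_type_t;

/* One metadata slot per entry of the page table it belongs to. */
struct md_page { uint8_t type[PT_ENTRIES]; };

struct pt_page {
    uint64_t pte[PT_ENTRIES];
    struct md_page *metadata;  /* separate page, linked rather than adjacent */
};

/* Stand-in for the per-domain P2M pool: a fixed, bounded set of pages. */
#define POOL_PAGES 2
static struct md_page pool[POOL_PAGES];
static unsigned int pool_used;

static struct md_page *pool_alloc(void)
{
    return pool_used < POOL_PAGES ? &pool[pool_used++] : NULL;
}

/* Index of @gfn within a table at @level (level 0 holds 4k entries). */
static unsigned int pt_index(uint64_t gfn, unsigned int level)
{
    return (gfn >> (level * PT_LEVEL_BITS)) & (PT_ENTRIES - 1);
}

/* Record the type for @gfn, allocating the metadata page on first use. */
static int p2m_set_type(struct pt_page *pt, uint64_t gfn,
                        unsigned int level, p2m_type_t t)
{
    if ( !pt->metadata )
    {
        pt->metadata = pool_alloc();
        if ( !pt->metadata )
            return -ENOMEM;  /* only this domain's pool is exhausted */
    }

    pt->metadata->type[pt_index(gfn, level)] = (uint8_t)t;

    return 0;
}

static p2m_type_t p2m_get_type(const struct pt_page *pt, uint64_t gfn,
                               unsigned int level)
{
    if ( !pt->metadata )
        return p2m_invalid;

    return (p2m_type_t)pt->metadata->type[pt_index(gfn, level)];
}
```

Because the metadata pages come from the same bounded pool as the page tables,
a guest creating many entries can exhaust only its own pool, which is the
property the radix tree approach lacks.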



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-22 16:07                               ` Oleksii Kurochko
@ 2025-07-23  9:46                                 ` Jan Beulich
  2025-07-28  8:52                                   ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-23  9:46 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 22.07.2025 18:07, Oleksii Kurochko wrote:
> 
> On 7/22/25 4:35 PM, Jan Beulich wrote:
>> On 22.07.2025 16:25, Oleksii Kurochko wrote:
>>> On 7/22/25 2:00 PM, Jan Beulich wrote:
>>>> On 22.07.2025 13:34, Oleksii Kurochko wrote:
>>>>> On 7/22/25 12:41 PM, Oleksii Kurochko wrote:
>>>>>> On 7/21/25 2:18 PM, Jan Beulich wrote:
>>>>>>> On 18.07.2025 11:52, Oleksii Kurochko wrote:
>>>>>>>> On 7/17/25 12:25 PM, Jan Beulich wrote:
>>>>>>>>> On 17.07.2025 10:56, Oleksii Kurochko wrote:
>>>>>>>>>> On 7/16/25 6:18 PM, Jan Beulich wrote:
>>>>>>>>>>> On 16.07.2025 18:07, Oleksii Kurochko wrote:
>>>>>>>>>>>> On 7/16/25 1:31 PM, Jan Beulich wrote:
>>>>>>>>>>>>> On 15.07.2025 16:47, Oleksii Kurochko wrote:
>>>>>>>>>>>>>> On 7/1/25 5:08 PM, Jan Beulich wrote:
>>>>>>>>>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>>>>>>>>>> --- a/xen/arch/riscv/p2m.c
>>>>>>>>>>>>>>>> +++ b/xen/arch/riscv/p2m.c
>>>>>>>>>>>>>>>> @@ -345,6 +345,26 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>>>>>>>>>>>>>>>>             return __map_domain_page(p2m->root + root_table_indx);
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>         
>>>>>>>>>>>>>>>> +static int p2m_type_radix_set(struct p2m_domain *p2m, pte_t pte, p2m_type_t t)
>>>>>>>>>>>>>>> See comments on the earlier patch regarding naming.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +    int rc;
>>>>>>>>>>>>>>>> +    gfn_t gfn = mfn_to_gfn(p2m->domain, mfn_from_pte(pte));
>>>>>>>>>>>>>>> How does this work, when you record GFNs only for Xenheap pages?
>>>>>>>>>>>>>> I think I don't understand what is an issue. Could you please provide
>>>>>>>>>>>>>> some extra details?
>>>>>>>>>>>>> Counter question: The mfn_to_gfn() you currently have is only a stub. It only
>>>>>>>>>>>>> works for 1:1 mapped domains. Can you show me the eventual final implementation
>>>>>>>>>>>>> of the function, making it possible to use it here?
>>>>>>>>>>>> At the moment, I planned to support only 1:1 mapped domains, so it is final
>>>>>>>>>>>> implementation.
>>>>>>>>>>> Isn't that an overly severe limitation?
>>>>>>>>>> I wouldn't say that it's a severe limitation, as it's just a matter of how
>>>>>>>>>> mfn_to_gfn() is implemented. When non-1:1 mapped domains are supported,
>>>>>>>>>> mfn_to_gfn() can be implemented differently, while the code where it’s called
>>>>>>>>>> will likely remain unchanged.
>>>>>>>>>>
>>>>>>>>>> What I meant in my reply is that, for the current state and current limitations,
>>>>>>>>>> this is the final implementation of mfn_to_gfn(). But that doesn't mean I don't
>>>>>>>>>> see the value in, or the need for, non-1:1 mapped domains—it's just that this
>>>>>>>>>> limitation simplifies development at the current stage of the RISC-V port.
>>>>>>>>> Simplification is fine in some cases, but not supporting the "normal" way of
>>>>>>>>> domain construction looks like a pretty odd restriction. I'm also curious
>>>>>>>>> how you envision to implement mfn_to_gfn() then, suitable for generic use like
>>>>>>>>> the one here. Imo, current limitation or not, you simply want to avoid use of
>>>>>>>>> that function outside of the special gnttab case.
>>>>>>>>>
>>>>>>>>>>>>>>> In this context (not sure if I asked before): With this use of a radix tree,
>>>>>>>>>>>>>>> how do you intend to bound the amount of memory that a domain can use, by
>>>>>>>>>>>>>>> making Xen insert very many entries?
>>>>>>>>>>>>>> I didn’t think about that. I assumed it would be enough to set the amount of
>>>>>>>>>>>>>> memory a guest domain can use by specifying xen,domain-p2m-mem-mb in the DTS,
>>>>>>>>>>>>>> or using some predefined value if xen,domain-p2m-mem-mb isn’t explicitly set.
>>>>>>>>>>>>> Which would require these allocations to come from that pool.
>>>>>>>>>>>> Yes, and it is true only for non-hardware domains with the current implementation.
>>>>>>>>>>> ???
>>>>>>>>>> I meant that pool is used now only for non-hardware domains at the moment.
>>>>>>>>> And how does this matter here? The memory required for the radix tree doesn't
>>>>>>>>> come from that pool anyway.
>>>>>>>> I thought that it is possible to do that somehow, but looking at the code of
>>>>>>>> radix-tree.c it seems like the only way to allocate memory for the radix
>>>>>>>> tree is radix_tree_node_alloc() -> xzalloc(struct rcu_node).
>>>>>>>>
>>>>>>>> Then it is needed to introduce radix_tree_node_allocate(domain)
>>>>>>> That would be a possibility, but you may have seen that less than half a
>>>>>>> year ago we got rid of something along these lines. So it would require
>>>>>>> some pretty good justification to re-introduce.
>>>>>>>
>>>>>>>> or radix tree
>>>>>>>> can't be used at all for mentioned in the previous replies security reason, no?
>>>>>>> (Very) careful use may still be possible. But the downside of using this
>>>>>>> (potentially long lookup times) would always remain.
>>>>>> Could you please clarify what do you mean here by "(Very) careful"?
>>>>>> I thought about an introduction of an amount of possible keys in radix tree and if this amount
>>>>>> is 0 then stop domain. And it is also unclear what should be a value for this amount.
>>>>>> Probably, you have better idea.
>>>>>>
>>>>>> But generally your idea below ...
>>>>>>>>>>>>>> Also, it seems this would just lead to the issue you mentioned earlier: when
>>>>>>>>>>>>>> the memory runs out, domain_crash() will be called or the PTE will be zapped.
>>>>>>>>>>>>> Or one domain exhausting memory would cause another domain to fail. A domain
>>>>>>>>>>>>> impacting just itself may be tolerable. But a domain affecting other domains
>>>>>>>>>>>>> isn't.
>>>>>>>>>>>> But it seems like this issue could happen in any implementation. It won't happen only
>>>>>>>>>>>> if we will have only pre-populated pool for any domain type (hardware, control, guest
>>>>>>>>>>>> domain) without ability to extend them or allocate extra pages from domheap in runtime.
>>>>>>>>>>>> Otherwise, if extra pages allocation is allowed then we can't really do something
>>>>>>>>>>>> with this issue.
>>>>>>>>>>> But that's why I brought this up: You simply have to. Or, as indicated, the
>>>>>>>>>>> moment you mark Xen security-supported on RISC-V, there will be an XSA needed.
>>>>>>>>>> Why it isn't XSA for other architectures? At least, Arm then should have such
>>>>>>>>>> XSA.
>>>>>>>>> Does Arm use a radix tree for storing types? It uses one for mem-access, but
>>>>>>>>> it's not clear to me whether that's actually a supported feature.
>>>>>>>>>
>>>>>>>>>> I don't understand why x86 won't have the same issue. Memory is a limited
>>>>>>>>>> and shared resource, so if one of the domains uses too much memory then it could
>>>>>>>>>> happen that other domains won't have enough memory for their purposes...
>>>>>>>>> The question is whether allocations are bounded. With this use of a radix tree,
>>>>>>>>> you give domains a way to have Xen allocate pretty much arbitrary amounts of
>>>>>>>>> memory to populate that tree. That unbounded-ness is the problem, not memory
>>>>>>>>> allocations in general.
>>>>>>>> Isn't the number of radix tree keys bounded by the number of GFNs given to a domain? We can't
>>>>>>>> have more keys than the max GFN number for a domain. So the potential amount of memory
>>>>>>>> necessary for the radix tree is also bounded by the number of GFNs.
>>>>>>> To some degree yes, hence why I said "pretty much arbitrary amounts".
>>>>>>> But recall that "amount of GFNs" is a fuzzy term; I think you mean to
>>>>>>> use it to describe the amount of memory pages given to the guest. GFNs
>>>>>>> can be used for other purposes, though. Guests could e.g. grant
>>>>>>> themselves access to their own memory, then map those grants at
>>>>>>> otherwise unused GFNs.
>>>>>>>
>>>>>>>> Anyway, IIUC I just can't use radix tree for p2m types at all, right?
>>>>>>>> If yes, does it make sense to borrow 2 bits from struct page_info->type_info as now it
>>>>>>>> is used 9-bits for count of a frame?
>>>>>>> struct page_info describes MFNs, when you want to describe GFNs. As you
>>>>>>> mentioned earlier, multiple GFNs can in principle map to the same MFN.
>>>>>>> You would force them to all have the same properties, which would be in
>>>>>>> direct conflict with e.g. the grant P2M types.
>>>>>>>
>>>>>>> Just to mention one possible alternative to using radix trees: You could
>>>>>>> maintain a 2nd set of intermediate "page tables", just that leaf entries
>>>>>>> would hold meta data for the respective GFN. The memory for those "page
>>>>>>> tables" could come from the normal P2M pool (and allocation would thus
>>>>>>> only consume domain-specific resources). Of course in any model like
>>>>>>> this the question of lookup times (as mentioned above) would remain.
>>>>>> ...looks like an optimal option.
>>>>>>
>>>>>> The only thing I worry about is that it will require some code duplication
>>>>>> (I will think how to re-use the current one code), as for example, when
>>>>>> setting/getting metadata, TLB flushing isn’t needed at all as we aren't
>>>>>> working with real P2M page tables.
>>>>>> Agree that lookup won't be the best one, but nothing can be done with
>>>>>> such models.
>>>>> Probably, instead of having a second set of intermediate "page tables",
>>>>> we could just allocate two consecutive pages within the real P2M page
>>>>> tables for the intermediate page table. The first page would serve as
>>>>> the actual page table to which the intermediate page table points,
>>>>> and the second page would store metadata for each entry of the page
>>>>> table that the intermediate page table references.
>>>>>
>>>>> As we are supporting only 1gb, 2mb and 4kb mappings we could do a little
>>>>> optimization and start allocate these consecutive pages only for PT levels
>>>>> which corresponds to 1gb, 2mb, 4kb mappings.
>>>>>
>>>>> Does it make sense?
>>>> I was indeed entertaining this idea, but I couldn't conclude for myself if
>>>> that would indeed be without any rough edges. Hence I didn't want to
>>>> suggest such. For example, the need to have adjacent pairs of pages could
>>>> result in a higher rate of allocation failures (while populating or
>>>> re-sizing the P2M pool). This would be possible to avoid by still using
>>>> entirely separate pages, and then merely linking them together via some
>>>> unused struct page_info fields (the "normal" linking fields can't be used,
>>>> afaict).
>>> I think that all the fields are used, so it will be needed to introduce new
>>> "struct page_list_entry metadata_list;".
>> All the fields are used _somewhere_, sure. But once you have allocated a
>> page (and that page isn't assigned to a domain), you control what the
>> fields are used for.
> 
> I thought that the whole idea is to use the domain's pages from the P2M pool freelist,
> pages for which are allocated by alloc_domheap_page(d, MEMF_no_owner), so an
> allocated page is assigned to a domain.

You did check what effect MEMF_no_owner has, didn't you? Such pages are _not_
assigned to the domain.

> I assume that in this case I have to take some pages for an intermediate page
> table from the P2M pool freelist and set the owner domain to NULL (pg->inuse.domain = NULL).
> 
> Then in this case it isn't clear why pg->list can't be re-used to link several pages
> for intermediate page table purposes + metadata. Is it because pg->list may be
> non-empty? In that case it isn't clear whether I could use a page which has threaded pages.

Actually looks like I was mis-remembering. Pages removed from freelist indeed
aren't put on any other list, so the linking fields are available for use. I
guess I had x86 shadow code in mind, where the linking fields are further used.

Jan
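
The conclusion above (a page taken off the pool freelist is on no other list,
so its linking field is free for reuse) can be sketched as follows. This is a
simplified model under stated assumptions: struct page is a hypothetical
stand-in for struct page_info with a single "next" pointer in place of Xen's
struct page_list_entry, and struct pool with its helpers stands in for the P2M
pool freelist handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_info. */
struct page {
    struct page *next;  /* linking field: freelist membership while free */
    unsigned int id;
};

/* Hypothetical stand-in for the per-domain P2M pool. */
struct pool {
    struct page *freelist;
    unsigned int nr_free;
};

static void pool_add(struct pool *p, struct page *pg)
{
    pg->next = p->freelist;
    p->freelist = pg;
    p->nr_free++;
}

static struct page *pool_take(struct pool *p)
{
    struct page *pg = p->freelist;

    if ( !pg )
        return NULL;

    p->freelist = pg->next;
    pg->next = NULL;  /* on no list anymore: the link can be reused */
    p->nr_free--;

    return pg;
}

/*
 * Take a page table page plus a metadata page (not necessarily adjacent)
 * and chain the metadata page to the table page via the linking field.
 */
static struct page *alloc_table_with_metadata(struct pool *p)
{
    struct page *tbl, *md;

    if ( p->nr_free < 2 )
        return NULL;  /* all or nothing */

    tbl = pool_take(p);
    md = pool_take(p);
    assert(tbl && md);

    tbl->next = md;

    return tbl;
}

/* Unlink the metadata page and return both pages to the pool. */
static void free_table_with_metadata(struct pool *p, struct page *tbl)
{
    struct page *md = tbl->next;

    tbl->next = NULL;
    pool_add(p, md);
    pool_add(p, tbl);
}
```

Using two independent pages avoids the need for physically adjacent pairs,
which was the allocation-failure concern raised earlier in the thread.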



* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-07-22 16:02           ` Jan Beulich
@ 2025-07-23 19:51             ` Oleksii Kurochko
  2025-07-24  7:58               ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-23 19:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/22/25 6:02 PM, Jan Beulich wrote:
> On 22.07.2025 16:57, Oleksii Kurochko wrote:
>> On 7/21/25 3:34 PM, Jan Beulich wrote:
>>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>> Add support for splitting large memory mappings ("superpages") in the RISC-V
>>>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>>>> can be inserted into lower levels of the page table hierarchy.
>>>>>>
>>>>>> To implement that the following is done:
>>>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>>>      smaller page table entries down to the target level, preserving original
>>>>>>      permissions and attributes.
>>>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>>>      entries at lower levels within a superpage-mapped region.
>>>>>>
>>>>>> This implementation is based on the ARM code, with modifications to the part
>>>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>>>> TLB before updating it with the newly created, split page table.
>>>>> But some flushing is going to be necessary. As long as you only ever do
>>>>> global flushes, the one after the individual PTE modification (within the
>>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>>> sure it's a good idea to leave such a pitfall.
>>>> I think that I don't fully understand what is an issue.
>>> Whether a flush is necessary after solely breaking up a superpage is arch-
>>> defined. I don't know how many restrictions the spec places on possible
>>> hardware behavior for RISC-V. However, the eventual change of (at least) one
>>> entry to fulfill the original request will surely require a flush. What I was
>>> trying to say is that this required flush would better not also cover for
>>> the flushes that may or may not be required by the spec. IOW if the spec
>>> leaves any room for flushes to possibly be needed, those flushes would
>>> better be explicit.
>> I think that I still don't understand why what I wrote above will work as long
>> as global flushes is working and will stop to work when more fine-grained flushing
>> is used.
>>
>> Inside p2m_split_superpage() we don't need any kind of TLB flush operation as
>> it is allocation a new page for a table and works with it, so no one could use
>> this page at the moment and cache it in TLB.
>>
>> The only question is that if it is needed BBM before staring using splitted entry:
>>           ....
>>           if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>>           {
>>               /* Free the allocated sub-tree */
>>               p2m_free_subtree(p2m, split_pte, level);
>>
>>               rc = -ENOMEM;
>>               goto out;
>>           }
>>
>> ------> /* Should be BBM used here ? */
>>           p2m_write_pte(entry, split_pte, p2m->clean_pte);
>>
>> And I can't find anything in the spec what tells me to use BBM here like Arm
>> does:
> But what you need is a statement in the spec that you can get away without. Imo
> at least.

The spec says:
   It is permitted for multiple address-translation cache entries to co-exist for the same
   address. This represents the fact that in a conventional TLB hierarchy, it is possible for
   multiple entries to match a single address if, for example, a page is upgraded to a
   superpage without first clearing the original non-leaf PTE’s valid bit and executing an
   SFENCE.VMA with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
   hierarchy. In this case, just as if an SFENCE.VMA is not executed between a write to the
   memory-management tables and subsequent implicit read of the same address: it is
   unpredictable whether the old non-leaf PTE or the new leaf PTE is used, but the behavior is
   otherwise well defined.
The phrase "but the behavior is otherwise well defined" emphasizes that even if the TLB sees
two versions (the old and the new), the architecture guarantees stability, and the behavior
remains safe, though it is unpredictable which translation will be used.
I think this unpredictability is okay, at least in the case of superpage splitting, and
therefore TLB flushing can be deferred, since the old pages (which are used for the old mapping)
still exist and the permissions of the new entries match those of the original ones.
Also, it seems that clearing the PTE before flushing the TLB isn't needed either.
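
To illustrate why it should not matter which of the two translations the TLB
picks, here is a toy model, not Xen code: struct toy_pte, walk_superpage(),
split_superpage() and split_is_transparent() are made-up names, and the PTE
layout is simplified to a frame number plus permission bits. It only checks
that a split table built from a superpage yields the same frame and the same
permissions for every 4K page as the superpage did.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 512u    /* entries per page table, as with Sv39-style tables */

/* Toy leaf PTE: frame number plus permission bits (hypothetical layout). */
struct toy_pte {
    uint64_t mfn;
    unsigned int perms;
};

/* Translation of the idx-th 4K page as seen through the superpage leaf. */
static struct toy_pte walk_superpage(const struct toy_pte *super,
                                     unsigned int idx)
{
    struct toy_pte r = { super->mfn + idx, super->perms };

    return r;
}

/* Fill a fresh table so that it maps exactly what the superpage mapped. */
static void split_superpage(const struct toy_pte *super,
                            struct toy_pte table[ENTRIES])
{
    for ( unsigned int i = 0; i < ENTRIES; i++ )
    {
        table[i].mfn = super->mfn + i;
        table[i].perms = super->perms;
    }
}

/* True if old and new translations agree for every 4K page in the range. */
static bool split_is_transparent(const struct toy_pte *super)
{
    static struct toy_pte table[ENTRIES];

    split_superpage(super, table);

    for ( unsigned int i = 0; i < ENTRIES; i++ )
    {
        struct toy_pte old = walk_superpage(super, i);

        if ( old.mfn != table[i].mfn || old.perms != table[i].perms )
            return false;
    }

    return true;
}
```

As long as this property holds, a stale superpage entry and a new 4K entry
resolve an address identically, so only the later modification of an
individual entry in the split table needs a flush.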

Does it make sense?


* Re: [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings
  2025-07-23 19:51             ` Oleksii Kurochko
@ 2025-07-24  7:58               ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-24  7:58 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 23.07.2025 21:51, Oleksii Kurochko wrote:
> 
> On 7/22/25 6:02 PM, Jan Beulich wrote:
>> On 22.07.2025 16:57, Oleksii Kurochko wrote:
>>> On 7/21/25 3:34 PM, Jan Beulich wrote:
>>>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> Add support for down large memory mappings ("superpages") in the RISC-V
>>>>>>> p2m mapping so that smaller, more precise mappings ("finer-grained entries")
>>>>>>> can be inserted into lower levels of the page table hierarchy.
>>>>>>>
>>>>>>> To implement that the following is done:
>>>>>>> - Introduce p2m_split_superpage(): Recursively shatters a superpage into
>>>>>>>      smaller page table entries down to the target level, preserving original
>>>>>>>      permissions and attributes.
>>>>>>> - __p2m_set_entry() updated to invoke superpage splitting when inserting
>>>>>>>      entries at lower levels within a superpage-mapped region.
>>>>>>>
>>>>>>> This implementation is based on the ARM code, with modifications to the part
>>>>>>> that follows the BBM (break-before-make) approach. Unlike ARM, RISC-V does
>>>>>>> not require BBM, so there is no need to invalidate the PTE and flush the
>>>>>>> TLB before updating it with the newly created, split page table.
>>>>>> But some flushing is going to be necessary. As long as you only ever do
>>>>>> global flushes, the one after the individual PTE modification (within the
>>>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>>>> sure it's a good idea to leave such a pitfall.
>>>>> I think that I don't fully understand what is an issue.
>>>> Whether a flush is necessary after solely breaking up a superpage is arch-
>>>> defined. I don't know how much restrictions the spec on possible hardware
>>>> behavior for RISC-V. However, the eventual change of (at least) one entry
>>>> of fulfill the original request will surely require a flush. What I was
>>>> trying to say is that this required flush would better not also cover for
>>>> the flushes that may or may not be required by the spec. IOW if the spec
>>>> leaves any room for flushes to possibly be needed, those flushes would
>>>> better be explicit.
>>> I think that I still don't understand why what I wrote above will work as long
>>> as global flushes is working and will stop to work when more fine-grained flushing
>>> is used.
>>>
>>> Inside p2m_split_superpage() we don't need any kind of TLB flush operation as
>>> it is allocation a new page for a table and works with it, so no one could use
>>> this page at the moment and cache it in TLB.
>>>
>>> The only question is that if it is needed BBM before staring using splitted entry:
>>>           ....
>>>           if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>>>           {
>>>               /* Free the allocated sub-tree */
>>>               p2m_free_subtree(p2m, split_pte, level);
>>>
>>>               rc = -ENOMEM;
>>>               goto out;
>>>           }
>>>
>>> ------> /* Should be BBM used here ? */
>>>           p2m_write_pte(entry, split_pte, p2m->clean_pte);
>>>
>>> And I can't find anything in the spec what tells me to use BBM here like Arm
>>> does:
>> But what you need is a statement in the spec that you can get away without. Imo
>> at least.
> 
> In the spec. it is mentioned that:
>    It is permitted for multiple address-translation cache entries to co-exist for the same
>    address. This represents the fact that in a conventional TLB hierarchy, it is possible for
>    multiple entries to match a single address if, for example, a page is upgraded to a
>    superpage without first clearing the original non-leaf PTE’s valid bit and executing an
>    SFENCE.VMA with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
>    hierarchy. In this case, just as if an SFENCE.VMA is not executed between a write to the
>    memory-management tables and subsequent implicit read of the same address: it is
>    unpredictable whether the old non-leaf PTE or the new leaf PTE is used, but the behavior is
>    otherwise well defined.
> The phrase*"but the behavior is otherwise well defined"* emphasizes that even if the TLB sees
> two versions (the old and the new), the architecture guarantees stability, and the behavior
> remains safe — though unpredictable in terms of which translation will be used.
> And I think that this unpredictability is okay, at least, in the case if superpage splitting
> and therefore TLB flushing can be deferred since the old pages (which are used for old mapping)
> still exist and the permissions of the new entries match those of the original ones.
> Also, it seems like there clearing PTE before TLB flushing isn't need too.
> 
> Does it make sense?

Yes, I think this indeed is sufficient as a spec requirement.

Jan



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-23  9:46                                 ` Jan Beulich
@ 2025-07-28  8:52                                   ` Oleksii Kurochko
  2025-07-28  9:09                                     ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-28  8:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel



On 7/23/25 11:46 AM, Jan Beulich wrote:
>> I assume that I have in this case to take some pages for an intermediate page
>> table from freelist P2M pool, set an owner domain to NULL (pg->inuse.domain = NULL).
>>
>> Then in this case it isn't clear why pg->list can't be re-used to link several pages
>> for intermediate page table purposes + metadata? Is it because pg->list can be not
>> empty? In this case it isn't clear if I could use a page, which has threaded pages.
> Actually looks like I was mis-remembering. Pages removed from freelist indeed
> aren't put on any other list, so the linking fields are available for use. I
> guess I had x86 shadow code in mind, where the linking fields are further used.

Perhaps I misunderstood what you meant by "linking fields", but it seems I can't reuse
struct page_info->list, as it is used by page_list_add(), which p2m_alloc_page() calls
when allocating page(s) for an intermediate page table:
    static inline void
    page_list_add(struct page_info *page, struct page_list_head *head)
    {
        list_add(&page->list, head);
    }

    struct page_info *paging_alloc_page(struct domain *d)
    {
        struct page_info *pg;

        spin_lock(&d->arch.paging.lock);
        pg = page_list_remove_head(&d->arch.paging.freelist);
        spin_unlock(&d->arch.paging.lock);

        INIT_LIST_HEAD(&pg->list);

        return pg;
    }

    static struct page_info *p2m_alloc_page(struct domain *d)
    {
        struct page_info *pg = paging_alloc_page(d);

        if ( pg )
            page_list_add(pg, &p2m_get_hostp2m(d)->pages);

        return pg;
    }

So I have to reuse another field of struct page_info. It seems it wouldn't be an
issue to add a new struct page_list_entry metadata_list to 'union v':
     union {
         /* Page is in use */
         struct {
             /* Owner of this page (NULL if page is anonymous). */
             struct domain *domain;
         } inuse;

         /* Page is on a free list. */
         struct {
             /* Order-size of the free chunk this page is the head of. */
             unsigned int order;
         } free;
+
+       struct page_list_entry metadata_list;
     } v;

Am I missing something?

~ Oleksii


* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-28  8:52                                   ` Oleksii Kurochko
@ 2025-07-28  9:09                                     ` Jan Beulich
  2025-07-28 11:37                                       ` Oleksii Kurochko
  0 siblings, 1 reply; 161+ messages in thread
From: Jan Beulich @ 2025-07-28  9:09 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, xen-devel

On 28.07.2025 10:52, Oleksii Kurochko wrote:
> On 7/23/25 11:46 AM, Jan Beulich wrote:
>>> I assume that I have in this case to take some pages for an intermediate page
>>> table from freelist P2M pool, set an owner domain to NULL (pg->inuse.domain = NULL).
>>>
>>> Then in this case it isn't clear why pg->list can't be re-used to link several pages
>>> for intermediate page table purposes + metadata? Is it because pg->list can be not
>>> empty? In this case it isn't clear if I could use a page, which has threaded pages.
>> Actually looks like I was mis-remembering. Pages removed from freelist indeed
>> aren't put on any other list, so the linking fields are available for use. I
>> guess I had x86 shadow code in mind, where the linking fields are further used.
> 
> Perhaps, I misunderstood you about "linking fields", but it seems like I can't reuse
> struct page_info->list as it is used by page_list_add() which is called by p2m_alloc_page()
> to allocate page(s) for an intermediate page table:
>     static inline void
>     page_list_add(struct page_info *page, struct page_list_head *head)
>     {
>          list_add(&page->list, head);
>     }
> 
>      struct page_info * paging_alloc_page(struct domain *d)
>      {
>          struct page_info *pg;
> 
>          spin_lock(&d->arch.paging.lock);
>          pg = page_list_remove_head(&d->arch.paging.freelist);
>          spin_unlock(&d->arch.paging.lock);
> 
>          INIT_LIST_HEAD(&pg->list);
> 
>          return pg;
>      }
> 
>      static struct page_info *p2m_alloc_page(struct domain *d)
>      {
>          struct page_info *pg = paging_alloc_page(d);
> 
>          if ( pg )
>              page_list_add(pg, &p2m_get_hostp2m(d)->pages);
> 
>          return pg;
>      }
> 
> So I have to reuse another field from struct page_info. It seems like it won't be an
> issue if to add a new struct page_list_entry metadata_list to 'union v':
>      union {
>          /* Page is in use */
>          struct {
>              /* Owner of this page (NULL if page is anonymous). */
>              struct domain *domain;
>          } inuse;
> 
>          /* Page is on a free list. */
>          struct {
>              /* Order-size of the free chunk this page is the head of. */
>              unsigned int order;
>          } free;
> +
> +       struct page_list_entry metadata_list;
>      } v;
> 
> Am I missing something?

Well, you're doubling the size of that union then, aren't you? As was mentioned
quite some time ago, struct page_info needs quite a bit of care when you mean
to add new fields there. Question is whether for the purpose here you actually
need a doubly-linked list. A single pointer would be fine to put there.
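
The size concern can be checked with a standalone sketch. It assumes that
struct page_list_entry boils down to a pair of pointers (as with a plain
list_head); struct mock_list_entry, union v_with_list and union v_with_ptr are
stand-ins invented for this illustration, not the real Xen definitions.

```c
#include <assert.h>
#include <stddef.h>

struct domain;       /* opaque for this sketch */
struct page_info;    /* opaque for this sketch */

/* Stand-in for a list entry made of two pointers (an assumption). */
struct mock_list_entry {
    struct mock_list_entry *next, *prev;
};

/* 'union v' with a doubly-linked list entry added. */
union v_with_list {
    struct { struct domain *domain; } inuse;
    struct { unsigned int order; } free;
    struct mock_list_entry metadata_list;
};

/* 'union v' with just a single pointer added. */
union v_with_ptr {
    struct { struct domain *domain; } inuse;
    struct { unsigned int order; } free;
    struct page_info *metadata;
};

/* A single pointer leaves the union at pointer size ... */
static_assert(sizeof(union v_with_ptr) == sizeof(void *),
              "single pointer should not grow the union");
/* ... whereas a two-pointer list entry makes it twice as large. */
static_assert(sizeof(union v_with_list) == 2 * sizeof(void *),
              "list entry doubles the union");
```

Since struct page_info exists once per page of memory, every extra word in it
is multiplied by the number of pages in the system.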

Jan



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-28  9:09                                     ` Jan Beulich
@ 2025-07-28 11:37                                       ` Oleksii Kurochko
  2025-07-28 11:49                                         ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-28 11:37 UTC (permalink / raw)
  To: xen-devel



On 7/28/25 11:09 AM, Jan Beulich wrote:
> On 28.07.2025 10:52, Oleksii Kurochko wrote:
>> On 7/23/25 11:46 AM, Jan Beulich wrote:
>>>> I assume that I have in this case to take some pages for an intermediate page
>>>> table from freelist P2M pool, set an owner domain to NULL (pg->inuse.domain = NULL).
>>>>
>>>> Then in this case it isn't clear why pg->list can't be re-used to link several pages
>>>> for intermediate page table purposes + metadata? Is it because pg->list can be not
>>>> empty? In this case it isn't clear if I could use a page, which has threaded pages.
>>> Actually looks like I was mis-remembering. Pages removed from freelist indeed
>>> aren't put on any other list, so the linking fields are available for use. I
>>> guess I had x86 shadow code in mind, where the linking fields are further used.
>> Perhaps, I misunderstood you about "linking fields", but it seems like I can't reuse
>> struct page_info->list as it is used by page_list_add() which is called by p2m_alloc_page()
>> to allocate page(s) for an intermediate page table:
>>      static inline void
>>      page_list_add(struct page_info *page, struct page_list_head *head)
>>      {
>>           list_add(&page->list, head);
>>      }
>>
>>       struct page_info * paging_alloc_page(struct domain *d)
>>       {
>>           struct page_info *pg;
>>
>>           spin_lock(&d->arch.paging.lock);
>>           pg = page_list_remove_head(&d->arch.paging.freelist);
>>           spin_unlock(&d->arch.paging.lock);
>>
>>           INIT_LIST_HEAD(&pg->list);
>>
>>           return pg;
>>       }
>>
>>       static struct page_info *p2m_alloc_page(struct domain *d)
>>       {
>>           struct page_info *pg = paging_alloc_page(d);
>>
>>           if ( pg )
>>               page_list_add(pg, &p2m_get_hostp2m(d)->pages);
>>
>>           return pg;
>>       }
>>
>> So I have to reuse another field from struct page_info. It seems like it won't be an
>> issue if to add a new struct page_list_entry metadata_list to 'union v':
>>       union {
>>           /* Page is in use */
>>           struct {
>>               /* Owner of this page (NULL if page is anonymous). */
>>               struct domain *domain;
>>           } inuse;
>>
>>           /* Page is on a free list. */
>>           struct {
>>               /* Order-size of the free chunk this page is the head of. */
>>               unsigned int order;
>>           } free;
>> +
>> +       struct page_list_entry metadata_list;
>>       } v;
>>
>> Am I missing something?
> Well, you're doubling the size of that union then, aren't you? As was mentioned
> quite some time ago, struct page_info needs quite a bit of care when you mean
> to add new fields there. Question is whether for the purpose here you actually
> need a doubly-linked list. A single pointer would be fine to put there.

Agreed, a single pointer will be more than enough.

I'm wondering whether something can be done about the case where someone tries
to use:
   #define page_get_owner(p)    (p)->v.inuse.domain
on a page which was allocated for metadata storage. Shouldn't I have a separate
list for such pages and a macro which checks whether a page is on this list?
Similar to the list we have for p2m pages in struct p2m_domain:
     ...
     /* Pages used to construct the p2m */
     struct page_list_head pages;
     ...

Of course, such pages are allocated by alloc_domheap_page(d, MEMF_no_owner),
so there is no owner. But if someone accidentally uses this macro on such
pages, it will be an issue, as ->domain likely won't be NULL anymore.

~ Oleksii



* Re: [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration
  2025-07-28 11:37                                       ` Oleksii Kurochko
@ 2025-07-28 11:49                                         ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-28 11:49 UTC (permalink / raw)
  To: Oleksii Kurochko; +Cc: xen-devel

On 28.07.2025 13:37, Oleksii Kurochko wrote:
> 
> On 7/28/25 11:09 AM, Jan Beulich wrote:
>> On 28.07.2025 10:52, Oleksii Kurochko wrote:
>>> On 7/23/25 11:46 AM, Jan Beulich wrote:
>>>>> I assume that I have in this case to take some pages for an intermediate page
>>>>> table from freelist P2M pool, set an owner domain to NULL (pg->inuse.domain = NULL).
>>>>>
>>>>> Then in this case it isn't clear why pg->list can't be re-used to link several pages
>>>>> for intermediate page table purposes + metadata? Is it because pg->list can be not
>>>>> empty? In this case it isn't clear if I could use a page, which has threaded pages.
>>>> Actually looks like I was mis-remembering. Pages removed from freelist indeed
>>>> aren't put on any other list, so the linking fields are available for use. I
>>>> guess I had x86 shadow code in mind, where the linking fields are further used.
>>> Perhaps, I misunderstood you about "linking fields", but it seems like I can't reuse
>>> struct page_info->list as it is used by page_list_add() which is called by p2m_alloc_page()
>>> to allocate page(s) for an intermediate page table:
>>>      static inline void
>>>      page_list_add(struct page_info *page, struct page_list_head *head)
>>>      {
>>>           list_add(&page->list, head);
>>>      }
>>>
>>>       struct page_info * paging_alloc_page(struct domain *d)
>>>       {
>>>           struct page_info *pg;
>>>
>>>           spin_lock(&d->arch.paging.lock);
>>>           pg = page_list_remove_head(&d->arch.paging.freelist);
>>>           spin_unlock(&d->arch.paging.lock);
>>>
>>>           INIT_LIST_HEAD(&pg->list);
>>>
>>>           return pg;
>>>       }
>>>
>>>       static struct page_info *p2m_alloc_page(struct domain *d)
>>>       {
>>>           struct page_info *pg = paging_alloc_page(d);
>>>
>>>           if ( pg )
>>>               page_list_add(pg, &p2m_get_hostp2m(d)->pages);
>>>
>>>           return pg;
>>>       }
>>>
>>> So I have to reuse another field from struct page_info. It seems like it won't be an
>>> issue if to add a new struct page_list_entry metadata_list to 'union v':
>>>       union {
>>>           /* Page is in use */
>>>           struct {
>>>               /* Owner of this page (NULL if page is anonymous). */
>>>               struct domain *domain;
>>>           } inuse;
>>>
>>>           /* Page is on a free list. */
>>>           struct {
>>>               /* Order-size of the free chunk this page is the head of. */
>>>               unsigned int order;
>>>           } free;
>>> +
>>> +       struct page_list_entry metadata_list;
>>>       } v;
>>>
>>> Am I missing something?
>> Well, you're doubling the size of that union then, aren't you? As was mentioned
>> quite some time ago, struct page_info needs quite a bit of care when you mean
>> to add new fields there. Question is whether for the purpose here you actually
>> need a doubly-linked list. A single pointer would be fine to put there.
> 
> Agree, a single pointer will be more then enough.
> 
> I'm thinking if it is possible to do something with the case if someone will try
> to use:
>    #define page_get_owner(p)    (p)->v.inuse.domain
> for a page which was allocated for metadata storage. Shouldn't I have a separate
> list for such pages and a macro which will check if a page is in this list?
> Similar a list which we have for p2m pages in struct p2m_domain:
>      ...
>      /* Pages used to construct the p2m */
>      struct page_list_head pages;
>      ...
> 
> Of course, such pages are allocated by alloc_domheap_page(d, MEMF_no_owner),
> so there is no owner. But if someone will accidentally use this macro for such
> pages then it will be an issue as ->domain likely won't be a NULL anymore.

It's the nature of using unions that such a risk exists. Take a look at x86's
structure, where several of the fields are re-purposed for shadow pages. It's
something similar you'd do here, in the end.
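
A small standalone sketch of that risk, under stated assumptions: the
structure is reduced to two union members, and PGC_metadata as well as
page_get_owner_checked() are hypothetical names used only to show one possible
way of telling the two uses apart. Nothing here is existing Xen code apart
from the shape of page_get_owner().

```c
#include <assert.h>
#include <stddef.h>

struct domain { int id; };

#define PGC_metadata 1UL    /* made-up flag: page holds p2m metadata */

struct page_info {
    unsigned long count_info;
    union {
        struct { struct domain *domain; } inuse;
        struct page_info *metadata;     /* re-purposed storage */
    } v;
};

/* Same shape as the macro quoted in the thread: no check of any kind. */
#define page_get_owner(p) ((p)->v.inuse.domain)

/* Hypothetical accessor that refuses to interpret re-purposed storage. */
static struct domain *page_get_owner_checked(const struct page_info *pg)
{
    return (pg->count_info & PGC_metadata) ? NULL : pg->v.inuse.domain;
}
```

With v.metadata set on a page, page_get_owner() returns the bits of that
pointer reinterpreted as a struct domain *, which is non-NULL and meaningless;
the checked variant returns NULL for the same page.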

Jan



* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-22 12:05             ` Jan Beulich
@ 2025-07-29 13:47               ` Oleksii Kurochko
  2025-07-29 14:48                 ` Jan Beulich
  0 siblings, 1 reply; 161+ messages in thread
From: Oleksii Kurochko @ 2025-07-29 13:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel, Stefano Stabellini



On 7/22/25 2:05 PM, Jan Beulich wrote:
> On 22.07.2025 14:03, Oleksii Kurochko wrote:
>> On 7/21/25 3:39 PM, Jan Beulich wrote:
>>> On 18.07.2025 16:37, Oleksii Kurochko wrote:
>>>> On 7/2/25 12:28 PM, Jan Beulich wrote:
>>>>> On 02.07.2025 12:09, Jan Beulich wrote:
>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>>>>>>     {
>>>>>>>         return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>>>>>>     }
>>>>>>> +
>>>>>>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>>>>>> +{
>>>>>>> +    ASSERT_UNREACHABLE();
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>>>>>>> +                                                      unsigned long nr)
>>>>>>> +{
>>>>>>> +    unsigned long x, y = page->count_info;
>>>>>>> +    struct domain *owner;
>>>>>>> +
>>>>>>> +    /* Restrict nr to avoid "double" overflow */
>>>>>>> +    if ( nr >= PGC_count_mask )
>>>>>>> +    {
>>>>>>> +        ASSERT_UNREACHABLE();
>>>>>>> +        return NULL;
>>>>>>> +    }
>>>>>> I question the validity of this, already in the Arm original: I can't spot
>>>>>> how the caller guarantees to stay below that limit. Without such an
>>>>>> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
>>>>>> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
>>>>>> any limit check.
>>>>>>
>>>>>>> +    do {
>>>>>>> +        x = y;
>>>>>>> +        /*
>>>>>>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>>>>>>> +         * Count == -1: Reference count would wrap, which is invalid.
>>>>>>> +         */
>>>>>> May I once again ask that you look carefully at comments (as much as at code)
>>>>>> you copy. Clearly this comment wasn't properly updated when the bumping by 1
>>>>>> was changed to bumping by nr.
>>>>>>
>>>>>>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>>>>>>> +            return NULL;
>>>>>>> +    }
>>>>>>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>>>>>>> +
>>>>>>> +    owner = page_get_owner(page);
>>>>>>> +    ASSERT(owner);
>>>>>>> +
>>>>>>> +    return owner;
>>>>>>> +}
>>>>> There also looks to be a dead code concern here (towards the "nr" parameters
>>>>> here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
>>>>> leave out Misra rule 2.2 entirely.
>>>> I think that I didn't get what is an issue when STATIC_SHM=n, functions is still
>>>> going to be called through page_get_owner_and_reference(), at least, in page_alloc.c .
>>> Yes, but will "nr" ever be anything other than 1 then? IOW omitting the parameter
>>> would be fine. And that's what "dead code" is about.
>> Got it.
>>
>> So we don't have any SAF-x tag to mark this function as safe. What is the best one
>> solution for now if nr argument will be needed in the future for STATIC_SHM=y?
> Add the parameter at that point. Just like was done for Arm.

Hmm, it seems like I am confusing something... Arm has the same definition and declaration
of page_get_owner_and_nr_reference().

~ Oleksii


* Re: [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
  2025-07-29 13:47               ` Oleksii Kurochko
@ 2025-07-29 14:48                 ` Jan Beulich
  0 siblings, 0 replies; 161+ messages in thread
From: Jan Beulich @ 2025-07-29 14:48 UTC (permalink / raw)
  To: Oleksii Kurochko
  Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
	xen-devel, Stefano Stabellini

On 29.07.2025 15:47, Oleksii Kurochko wrote:
> 
> On 7/22/25 2:05 PM, Jan Beulich wrote:
>> On 22.07.2025 14:03, Oleksii Kurochko wrote:
>>> On 7/21/25 3:39 PM, Jan Beulich wrote:
>>>> On 18.07.2025 16:37, Oleksii Kurochko wrote:
>>>>> On 7/2/25 12:28 PM, Jan Beulich wrote:
>>>>>> On 02.07.2025 12:09, Jan Beulich wrote:
>>>>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>>>>> @@ -613,3 +612,91 @@ void __iomem *ioremap(paddr_t pa, size_t len)
>>>>>>>>     {
>>>>>>>>         return ioremap_attr(pa, len, PAGE_HYPERVISOR_NOCACHE);
>>>>>>>>     }
>>>>>>>> +
>>>>>>>> +int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
>>>>>>>> +{
>>>>>>>> +    ASSERT_UNREACHABLE();
>>>>>>>> +
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static struct domain *page_get_owner_and_nr_reference(struct page_info *page,
>>>>>>>> +                                                      unsigned long nr)
>>>>>>>> +{
>>>>>>>> +    unsigned long x, y = page->count_info;
>>>>>>>> +    struct domain *owner;
>>>>>>>> +
>>>>>>>> +    /* Restrict nr to avoid "double" overflow */
>>>>>>>> +    if ( nr >= PGC_count_mask )
>>>>>>>> +    {
>>>>>>>> +        ASSERT_UNREACHABLE();
>>>>>>>> +        return NULL;
>>>>>>>> +    }
>>>>>>> I question the validity of this, already in the Arm original: I can't spot
>>>>>>> how the caller guarantees to stay below that limit. Without such an
>>>>>>> (attempted) guarantee, ASSERT_UNREACHABLE() is wrong to use. All I can see
>>>>>>> is process_shm_node() incrementing shmem_extra[].nr_shm_borrowers, without
>>>>>>> any limit check.
>>>>>>>
>>>>>>>> +    do {
>>>>>>>> +        x = y;
>>>>>>>> +        /*
>>>>>>>> +         * Count ==  0: Page is not allocated, so we cannot take a reference.
>>>>>>>> +         * Count == -1: Reference count would wrap, which is invalid.
>>>>>>>> +         */
>>>>>>> May I once again ask that you look carefully at comments (as much as at code)
>>>>>>> you copy. Clearly this comment wasn't properly updated when the bumping by 1
>>>>>>> was changed to bumping by nr.
>>>>>>>
>>>>>>>> +        if ( unlikely(((x + nr) & PGC_count_mask) <= nr) )
>>>>>>>> +            return NULL;
>>>>>>>> +    }
>>>>>>>> +    while ( (y = cmpxchg(&page->count_info, x, x + nr)) != x );
>>>>>>>> +
>>>>>>>> +    owner = page_get_owner(page);
>>>>>>>> +    ASSERT(owner);
>>>>>>>> +
>>>>>>>> +    return owner;
>>>>>>>> +}
>>>>>> There also looks to be a dead code concern here (towards the "nr" parameters
>>>>>> here and elsewhere, when STATIC_SHM=n). Just that apparently we decided to
>>>>>> leave out Misra rule 2.2 entirely.
>>>>> I think that I didn't get what is an issue when STATIC_SHM=n, functions is still
>>>>> going to be called through page_get_owner_and_reference(), at least, in page_alloc.c .
>>>> Yes, but will "nr" ever be anything other than 1 then? IOW omitting the parameter
>>>> would be fine. And that's what "dead code" is about.
>>> Got it.
>>>
>>> So we don't have any SAF-x tag to mark this function as safe. What is the best one
>>> solution for now if nr argument will be needed in the future for STATIC_SHM=y?
>> Add the parameter at that point. Just like was done for Arm.
> 
> Hmm, it seems like I am confusing something... Arm has the same defintion and declaration
> of page_get_owner_and_nr_reference().

But it didn't always have it. And there is at least one pending issue there.
Hence my request to use the simpler variant until someone actually makes
STATIC_SHM work on RISC-V. And hopefully by then the issue in Arm code is
sorted, and you can clone the code without raising questions.
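
For reference, a sketch of what such a simpler variant (bumping the count by
exactly 1, no "nr" parameter) could look like. This is a standalone model, not
the Xen implementation: it uses C11 atomics in place of cmpxchg(), an
8-bit-wide PGC_count_mask chosen arbitrarily, and an explicit owner field
instead of page_get_owner().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct domain { int id; };

#define PGC_count_mask 0xffUL   /* hypothetical width of the count field */

struct page_info {
    _Atomic unsigned long count_info;
    struct domain *owner;       /* stand-in for page_get_owner() */
};

static struct domain *page_get_owner_and_reference(struct page_info *page)
{
    unsigned long x = atomic_load(&page->count_info);

    do {
        /*
         * Count ==  0: page is not allocated, so no reference can be taken.
         * Count == -1: incrementing would wrap the counter.
         */
        if ( ((x + 1) & PGC_count_mask) <= 1 )
            return NULL;
        /* On failure x is reloaded with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(&page->count_info, &x, x + 1) );

    return page->owner;
}
```

With a fixed increment of 1 the two-case comment is accurate again, and the
separate range check on "nr" is no longer needed.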

Jan



end of thread, other threads:[~2025-07-29 14:48 UTC | newest]

Thread overview: 161+ messages
2025-06-10 13:05 [PATCH v2 00/17] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 01/17] xen/riscv: implement sbi_remote_hfence_gvma() Oleksii Kurochko
2025-06-18 15:15   ` Jan Beulich
2025-06-23 14:31     ` Oleksii Kurochko
2025-06-23 14:39       ` Jan Beulich
2025-06-23 14:45         ` Oleksii Kurochko
2025-06-24 10:33     ` Oleksii Kurochko
2025-06-24 10:48       ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 02/17] xen/riscv: introduce sbi_remote_hfence_gvma_vmid() Oleksii Kurochko
2025-06-18 15:20   ` Jan Beulich
2025-06-23 14:38     ` Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 03/17] xen/riscv: introduce guest domain's VMID allocation and manegement Oleksii Kurochko
2025-06-18 15:46   ` Jan Beulich
2025-06-24  9:46     ` Oleksii Kurochko
2025-06-24 10:44       ` Jan Beulich
2025-06-24 13:47         ` Oleksii Kurochko
2025-06-24 14:01           ` Jan Beulich
2025-06-24 15:32             ` Oleksii Kurochko
2025-06-26 10:05             ` Oleksii Kurochko
2025-06-26 10:41               ` Jan Beulich
2025-06-26 11:34                 ` Oleksii Kurochko
2025-06-26 11:43                   ` Juergen Gross
2025-06-26 12:05                     ` Oleksii Kurochko
2025-06-26 12:17                     ` Teddy Astie
2025-06-26 12:37                       ` Jan Beulich
2025-06-26 12:16                   ` Jan Beulich
2025-06-26 12:25                     ` Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 04/17] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
2025-06-18 15:53   ` Jan Beulich
2025-06-25 14:48     ` Oleksii Kurochko
2025-06-25 14:55       ` Jan Beulich
2025-07-01 13:04   ` Jan Beulich
2025-07-02 10:30     ` Oleksii Kurochko
2025-07-02 10:34       ` Jan Beulich
2025-07-02 11:17         ` Oleksii Kurochko
2025-07-02 11:48     ` Oleksii Kurochko
2025-07-02 11:56       ` Jan Beulich
2025-07-02 12:34         ` Oleksii Kurochko
2025-07-02 12:49           ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 05/17] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
2025-06-18 16:08   ` Jan Beulich
2025-06-25 15:31     ` Oleksii Kurochko
2025-06-25 15:53       ` Jan Beulich
2025-06-26  8:40         ` Oleksii Kurochko
2025-06-26 11:01           ` Jan Beulich
2025-06-26 11:55             ` Oleksii Kurochko
2025-06-10 13:05 ` [PATCH v2 06/17] xen/riscv: add root page table allocation Oleksii Kurochko
2025-06-30 15:22   ` Jan Beulich
2025-06-30 16:18     ` Oleksii Kurochko
2025-07-01  6:29       ` Jan Beulich
2025-07-01  9:44         ` Oleksii Kurochko
2025-07-01 10:27           ` Jan Beulich
2025-07-01 14:02             ` Oleksii Kurochko
2025-07-01 14:28               ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 07/17] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
2025-06-26 14:57   ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 08/17] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
2025-06-26 14:59   ` Jan Beulich
2025-06-30 14:33     ` Oleksii Kurochko
2025-06-30 14:38       ` Oleksii Kurochko
2025-06-30 14:45         ` Jan Beulich
2025-06-30 15:27           ` Oleksii Kurochko
2025-06-30 15:50             ` Jan Beulich
2025-07-02 10:13               ` Oleksii Kurochko
2025-07-02 10:36                 ` Jan Beulich
2025-06-30 14:42       ` Jan Beulich
2025-06-30 15:13         ` Oleksii Kurochko
2025-06-30 15:27           ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 09/17] xen/riscv: introduce page_set_xenheap_gfn() Oleksii Kurochko
2025-06-30 15:48   ` Jan Beulich
2025-07-02 15:59     ` Oleksii Kurochko
2025-07-03  5:59       ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 10/17] xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFNs Oleksii Kurochko
2025-06-30 15:59   ` Jan Beulich
2025-07-03 11:02     ` Oleksii Kurochko
2025-07-03 11:33       ` Jan Beulich
2025-07-03 11:54         ` Oleksii Kurochko
2025-07-03 13:09           ` Jan Beulich
2025-07-03 13:28             ` Oleksii Kurochko
2025-07-03 13:34               ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry() Oleksii Kurochko
2025-07-01 13:49   ` Jan Beulich
2025-07-04 15:01     ` Oleksii Kurochko
2025-07-07  7:20       ` Jan Beulich
2025-07-07 11:46         ` Oleksii Kurochko
2025-07-07 12:53           ` Jan Beulich
2025-07-07 15:00             ` Oleksii Kurochko
2025-07-07 15:15               ` Jan Beulich
2025-07-07 16:10                 ` Oleksii Kurochko
2025-07-08  7:10                   ` Jan Beulich
2025-07-08  9:01                     ` Oleksii Kurochko
2025-07-08 10:37                       ` Oleksii Kurochko
2025-07-08 12:45                         ` Jan Beulich
2025-07-08 15:42                           ` Oleksii Kurochko
2025-07-08 16:04                             ` Jan Beulich
2025-07-09  8:24                               ` Oleksii Kurochko
2025-07-09  8:41                                 ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 12/17] xen/riscv: Implement p2m_free_entry() and related helpers Oleksii Kurochko
2025-07-01 14:23   ` Jan Beulich
2025-07-11 15:56     ` Oleksii Kurochko
2025-07-14  7:15       ` Jan Beulich
2025-07-14 16:01         ` Oleksii Kurochko
2025-07-14 16:17           ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 13/17] xen/riscv: Implement p2m_entry_from_mfn() and support PBMT configuration Oleksii Kurochko
2025-07-01 15:08   ` Jan Beulich
2025-07-15 14:47     ` Oleksii Kurochko
2025-07-16 11:31       ` Jan Beulich
2025-07-16 16:07         ` Oleksii Kurochko
2025-07-16 16:18           ` Jan Beulich
2025-07-17  8:56             ` Oleksii Kurochko
2025-07-17 10:25               ` Jan Beulich
2025-07-18  9:52                 ` Oleksii Kurochko
2025-07-21 12:18                   ` Jan Beulich
2025-07-22 10:41                     ` Oleksii Kurochko
2025-07-22 11:34                       ` Oleksii Kurochko
2025-07-22 12:00                         ` Jan Beulich
2025-07-22 14:25                           ` Oleksii Kurochko
2025-07-22 14:35                             ` Jan Beulich
2025-07-22 16:07                               ` Oleksii Kurochko
2025-07-23  9:46                                 ` Jan Beulich
2025-07-28  8:52                                   ` Oleksii Kurochko
2025-07-28  9:09                                     ` Jan Beulich
2025-07-28 11:37                                       ` Oleksii Kurochko
2025-07-28 11:49                                         ` Jan Beulich
2025-07-22 11:54                       ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 14/17] xen/riscv: implement p2m_next_level() Oleksii Kurochko
2025-07-02  8:35   ` Jan Beulich
2025-07-16 11:32     ` Oleksii Kurochko
2025-07-16 11:43       ` Jan Beulich
2025-07-16 15:53         ` Oleksii Kurochko
2025-07-16 16:12           ` Jan Beulich
2025-07-17  9:42             ` Oleksii Kurochko
2025-07-17 10:37               ` Jan Beulich
2025-07-18 11:19                 ` Oleksii Kurochko
2025-07-21 13:14                   ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 15/17] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
2025-07-02  9:25   ` Jan Beulich
2025-07-17 16:37     ` Oleksii Kurochko
2025-07-21 13:34       ` Jan Beulich
2025-07-22 14:57         ` Oleksii Kurochko
2025-07-22 16:02           ` Jan Beulich
2025-07-23 19:51             ` Oleksii Kurochko
2025-07-24  7:58               ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 16/17] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
2025-07-02 10:09   ` Jan Beulich
2025-07-02 10:28     ` Jan Beulich
2025-07-18 14:37       ` Oleksii Kurochko
2025-07-21 13:39         ` Jan Beulich
2025-07-22 12:03           ` Oleksii Kurochko
2025-07-22 12:05             ` Jan Beulich
2025-07-29 13:47               ` Oleksii Kurochko
2025-07-29 14:48                 ` Jan Beulich
2025-07-02 12:52     ` Orzel, Michal
2025-07-18 14:49     ` Oleksii Kurochko
2025-07-21 13:42       ` Jan Beulich
2025-07-22 13:38         ` Oleksii Kurochko
2025-07-21 13:53       ` Jan Beulich
2025-06-10 13:05 ` [PATCH v2 17/17] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
2025-07-02 11:44   ` Jan Beulich
2025-07-21  9:43     ` Oleksii Kurochko
2025-07-21 14:06       ` Jan Beulich
