* [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-06 13:43 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
` (16 subsequent siblings)
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce gstage_mode_detect() and pre_gstage_init() to probe supported
G-stage paging modes at boot. gstage_mode_detect() iterates over the possible
HGATP modes (Sv32x4 on RV32, Sv39x4/Sv48x4/Sv57x4 on RV64) and selects
the first valid one by programming CSR_HGATP and reading it back.
The selected mode is stored in gstage_mode (marked __ro_after_init)
and reported via printk. If no supported mode is found, Xen panics
since Bare mode is not expected to be used.
Finally, CSR_HGATP is cleared and a local_hfence_gvma_all() is issued
to avoid any potential speculative pollution of the TLB, as required
by the RISC-V spec.
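The write/read-back probe described above can be sketched outside of Xen as
follows. This is only an illustration under stated assumptions: struct
fake_hart, fake_hgatp_write(), fake_hgatp_read() and probe_first_mode() are
hypothetical names, and the simulated register shows just one possible WARL
behaviour (an unimplemented MODE is simply not latched).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MODE encodings and shift for RV64; Bare (off) is encoding 0. */
#define HGATP_MODE_OFF     0UL
#define HGATP_MODE_SV39X4  8UL
#define HGATP_MODE_SV48X4  9UL
#define HGATP_MODE_SV57X4  10UL
#define HGATP_MODE_SHIFT   60

/* Hypothetical model of a hart; not a Xen structure. */
struct fake_hart {
    uint64_t hgatp;         /* simulated CSR contents */
    unsigned int supported; /* bit N set => MODE value N is implemented */
};

/*
 * Simulated WARL write: an unimplemented MODE is not latched, so a
 * read-back returns the previous (legal) value instead.
 */
static void fake_hgatp_write(struct fake_hart *h, uint64_t val)
{
    unsigned int mode = (unsigned int)(val >> HGATP_MODE_SHIFT);

    if ( mode == HGATP_MODE_OFF || (h->supported & (1U << mode)) )
        h->hgatp = val;
}

static uint64_t fake_hgatp_read(const struct fake_hart *h)
{
    return h->hgatp;
}

/* Return the first MODE that survives a write/read-back, or HGATP_MODE_OFF. */
static unsigned long probe_first_mode(struct fake_hart *h)
{
    static const unsigned long modes[] = {
        HGATP_MODE_SV39X4, HGATP_MODE_SV48X4, HGATP_MODE_SV57X4,
    };
    unsigned long found = HGATP_MODE_OFF;
    size_t i;

    assert(h != NULL);

    for ( i = 0; i < sizeof(modes) / sizeof(modes[0]); i++ )
    {
        fake_hgatp_write(h, (uint64_t)modes[i] << HGATP_MODE_SHIFT);

        if ( (fake_hgatp_read(h) >> HGATP_MODE_SHIFT) == modes[i] )
        {
            found = modes[i];
            break;
        }
    }

    /* Leave the register cleared, as the patch does after probing. */
    fake_hgatp_write(h, 0);

    return found;
}
```

On real hardware the two fake_hgatp_*() helpers would be csr_write() and
csr_read() on CSR_HGATP, and the final clear would be followed by the
hfence.gvma mentioned above.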
With these changes, the following build error starts to occur:
./<riscv>/asm/flushtlb.h:37:55: error: 'struct page_info' declared inside
parameter list will not be visible outside of this definition or
declaration [-Werror]
37 | static inline void page_set_tlbflush_timestamp(struct page_info *page)
To resolve it, a forward declaration of struct page_info is added to
<asm/flushtlb.h>.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Add static and __initconst for local variable modes[] in
gstage_mode_detect().
- Change type for gstage_mode from 'unsigned long' to 'unsigned char'.
- Update the comment inside the definition of the modes[] variable in
gstage_mode_detect():
- Add information about Bare mode.
- Drop "a paged virtual-memory scheme described in Section 10.3" as it isn't
relevant here.
- Drop printing of function name when chosen G-stage mode message is printed.
- Drop the call of gstage_mode_detect() from start_xen(). It will be added into
p2m_init() when the latter will be introduced.
- Introduce pre_gstage_init().
- make gstage_mode_detect() static.
---
Changes in V4:
- New patch.
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/flushtlb.h | 7 ++
xen/arch/riscv/include/asm/p2m.h | 4 +
xen/arch/riscv/include/asm/riscv_encoding.h | 5 ++
xen/arch/riscv/p2m.c | 96 +++++++++++++++++++++
xen/arch/riscv/setup.c | 3 +
6 files changed, 116 insertions(+)
create mode 100644 xen/arch/riscv/p2m.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index e2b8aa42c8..264e265699 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -7,6 +7,7 @@ obj-y += intc.o
obj-y += irq.o
obj-y += mm.o
obj-y += pt.o
+obj-y += p2m.o
obj-$(CONFIG_RISCV_64) += riscv64/
obj-y += sbi.o
obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index 51c8f753c5..e70badae0c 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -7,6 +7,13 @@
#include <asm/sbi.h>
+struct page_info;
+
+static inline void local_hfence_gvma_all(void)
+{
+    asm volatile ( "hfence.gvma zero, zero" ::: "memory" );
+}
+
/* Flush TLB of local processor for address va. */
static inline void flush_tlb_one_local(vaddr_t va)
{
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index e43c559e0c..3a5066f360 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -6,6 +6,8 @@
#include <asm/page-bits.h>
+extern unsigned char gstage_mode;
+
#define paddr_bits PADDR_BITS
/*
@@ -88,6 +90,8 @@ static inline bool arch_acquire_resource_check(struct domain *d)
return false;
}
+void pre_gstage_init(void);
+
#endif /* ASM__RISCV__P2M_H */
/*
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index 6cc8f4eb45..b15f5ad0b4 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -131,13 +131,16 @@
#define HGATP_MODE_SV32X4 _UL(1)
#define HGATP_MODE_SV39X4 _UL(8)
#define HGATP_MODE_SV48X4 _UL(9)
+#define HGATP_MODE_SV57X4 _UL(10)
#define HGATP32_MODE_SHIFT 31
+#define HGATP32_MODE_MASK _UL(0x80000000)
#define HGATP32_VMID_SHIFT 22
#define HGATP32_VMID_MASK _UL(0x1FC00000)
#define HGATP32_PPN _UL(0x003FFFFF)
#define HGATP64_MODE_SHIFT 60
+#define HGATP64_MODE_MASK _ULL(0xF000000000000000)
#define HGATP64_VMID_SHIFT 44
#define HGATP64_VMID_MASK _ULL(0x03FFF00000000000)
#define HGATP64_PPN _ULL(0x00000FFFFFFFFFFF)
@@ -170,6 +173,7 @@
#define HGATP_VMID_SHIFT HGATP64_VMID_SHIFT
#define HGATP_VMID_MASK HGATP64_VMID_MASK
#define HGATP_MODE_SHIFT HGATP64_MODE_SHIFT
+#define HGATP_MODE_MASK HGATP64_MODE_MASK
#else
#define MSTATUS_SD MSTATUS32_SD
#define SSTATUS_SD SSTATUS32_SD
@@ -181,6 +185,7 @@
#define HGATP_VMID_SHIFT HGATP32_VMID_SHIFT
#define HGATP_VMID_MASK HGATP32_VMID_MASK
#define HGATP_MODE_SHIFT HGATP32_MODE_SHIFT
+#define HGATP_MODE_MASK HGATP32_MODE_MASK
#endif
#define TOPI_IID_SHIFT 16
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
new file mode 100644
index 0000000000..00fe676089
--- /dev/null
+++ b/xen/arch/riscv/p2m.c
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/macros.h>
+#include <xen/sections.h>
+
+#include <asm/csr.h>
+#include <asm/flushtlb.h>
+#include <asm/riscv_encoding.h>
+
+unsigned char __ro_after_init gstage_mode;
+
+static void __init gstage_mode_detect(void)
+{
+    static const struct {
+        unsigned char mode;
+        unsigned int paging_levels;
+        const char name[8];
+    } modes[] __initconst = {
+        /*
+         * Based on the RISC-V spec:
+         *   Bare mode is always supported, regardless of SXLEN.
+         *   When SXLEN=32, the only other valid setting for MODE is Sv32.
+         *   When SXLEN=64, three paged virtual-memory schemes are defined:
+         *   Sv39, Sv48, and Sv57.
+         */
+#ifdef CONFIG_RISCV_32
+        { HGATP_MODE_SV32X4, 2, "Sv32x4" }
+#else
+        { HGATP_MODE_SV39X4, 3, "Sv39x4" },
+        { HGATP_MODE_SV48X4, 4, "Sv48x4" },
+        { HGATP_MODE_SV57X4, 5, "Sv57x4" },
+#endif
+    };
+
+    unsigned int mode_idx;
+
+    gstage_mode = HGATP_MODE_OFF;
+
+    for ( mode_idx = 0; mode_idx < ARRAY_SIZE(modes); mode_idx++ )
+    {
+        unsigned long mode = modes[mode_idx].mode;
+
+        csr_write(CSR_HGATP, MASK_INSR(mode, HGATP_MODE_MASK));
+
+        if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
+        {
+            gstage_mode = mode;
+            break;
+        }
+    }
+
+    if ( gstage_mode == HGATP_MODE_OFF )
+        panic("Xen expects that G-stage won't be Bare mode\n");
+
+    printk("G-stage mode is %s\n", modes[mode_idx].name);
+
+    csr_write(CSR_HGATP, 0);
+
+    /*
+     * From RISC-V spec:
+     *   Speculative executions of the address-translation algorithm behave as
+     *   non-speculative executions of the algorithm do, except that they must
+     *   not set the dirty bit for a PTE, they must not trigger an exception,
+     *   and they must not create address-translation cache entries if those
+     *   entries would have been invalidated by any SFENCE.VMA instruction
+     *   executed by the hart since the speculative execution of the algorithm
+     *   began.
+     * The quote above mention explicitly SFENCE.VMA, but I assume it is true
+     * for HFENCE.VMA.
+     *
+     * Also, despite of the fact here it is mentioned that when V=0 two-stage
+     * address translation is inactivated:
+     *   The current virtualization mode, denoted V, indicates whether the hart
+     *   is currently executing in a guest. When V=1, the hart is either in
+     *   virtual S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest
+     *   OS running in VS-mode. When V=0, the hart is either in M-mode, in
+     *   HS-mode, or in U-mode atop an OS running in HS-mode. The
+     *   virtualization mode also indicates whether two-stage address
+     *   translation is active (V=1) or inactive (V=0).
+     * But on the same side, writing to hgatp register activates it:
+     *   The hgatp register is considered active for the purposes of
+     *   the address-translation algorithm unless the effective privilege mode
+     *   is U and hstatus.HU=0.
+     *
+     * Thereby it leaves some room for speculation even in this stage of boot,
+     * so it could be that we polluted local TLB so flush all guest TLB.
+     */
+    local_hfence_gvma_all();
+}
+
+void __init pre_gstage_init(void)
+{
+    gstage_mode_detect();
+}
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 483cdd7e17..c4f7793151 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -22,6 +22,7 @@
#include <asm/early_printk.h>
#include <asm/fixmap.h>
#include <asm/intc.h>
+#include <asm/p2m.h>
#include <asm/sbi.h>
#include <asm/setup.h>
#include <asm/traps.h>
@@ -148,6 +149,8 @@ void __init noreturn start_xen(unsigned long bootcpu_id,
console_init_postirq();
+ pre_gstage_init();
+
printk("All set up\n");
machine_halt();
--
2.51.0
* Re: [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode
2025-10-20 15:57 ` [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode Oleksii Kurochko
@ 2025-11-06 13:43 ` Jan Beulich
2025-11-13 16:18 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-06 13:43 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> Changes in V5:
> - Add static and __initconst for local variable modes[] in
> gstage_mode_detect().
> - Change type for gstage_mode from 'unsigned long' to 'unsigned char'.
> - Update the comment inside the definition of the modes[] variable in
> gstage_mode_detect():
> - Add information about Bare mode.
> - Drop "a paged virtual-memory scheme described in Section 10.3" as it isn't
> relevant here.
> - Drop printing of function name when chosen G-stage mode message is printed.
> - Drop the call of gstage_mode_detect() from start_xen(). It will be added into
> p2m_init() when the latter will be introduced.
Well, thanks, but ...
> - Introduce pre_gstage_init().
... the same comment that I gave before now applies here: This doesn't look to
belong directly in start_xen(). In x86'es terms I'd say this is a tiny part of
paging_init().
> --- a/xen/arch/riscv/Makefile
> +++ b/xen/arch/riscv/Makefile
> @@ -7,6 +7,7 @@ obj-y += intc.o
> obj-y += irq.o
> obj-y += mm.o
> obj-y += pt.o
> +obj-y += p2m.o
Nit: Please keep things sorted (numbers sort before letters).
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -6,6 +6,8 @@
>
> #include <asm/page-bits.h>
>
> +extern unsigned char gstage_mode;
Better move down some, at the very least ...
> #define paddr_bits PADDR_BITS
... past more fundamental #define-s?
> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> @@ -131,13 +131,16 @@
> #define HGATP_MODE_SV32X4 _UL(1)
> #define HGATP_MODE_SV39X4 _UL(8)
> #define HGATP_MODE_SV48X4 _UL(9)
> +#define HGATP_MODE_SV57X4 _UL(10)
>
> #define HGATP32_MODE_SHIFT 31
> +#define HGATP32_MODE_MASK _UL(0x80000000)
Please can we avoid redundant (and then not even connected) #define-s? I
don't see why you would need HGATP32_MODE_SHIFT when you have
HGATP32_MODE_MASK. Similarly ...
> #define HGATP32_VMID_SHIFT 22
> #define HGATP32_VMID_MASK _UL(0x1FC00000)
... here, while ...
> #define HGATP32_PPN _UL(0x003FFFFF)
... here the constant isn't even suffixed with _MASK.
> --- /dev/null
> +++ b/xen/arch/riscv/p2m.c
> @@ -0,0 +1,96 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +#include <xen/macros.h>
> +#include <xen/sections.h>
> +
> +#include <asm/csr.h>
> +#include <asm/flushtlb.h>
> +#include <asm/riscv_encoding.h>
> +
> +unsigned char __ro_after_init gstage_mode;
> +
> +static void __init gstage_mode_detect(void)
> +{
> + static const struct {
> + unsigned char mode;
> + unsigned int paging_levels;
> + const char name[8];
> + } modes[] __initconst = {
> + /*
> + * Based on the RISC-V spec:
> + * Bare mode is always supported, regardless of SXLEN.
> + * When SXLEN=32, the only other valid setting for MODE is Sv32.
> + * When SXLEN=64, three paged virtual-memory schemes are defined:
> + * Sv39, Sv48, and Sv57.
> + */
> +#ifdef CONFIG_RISCV_32
> + { HGATP_MODE_SV32X4, 2, "Sv32x4" }
> +#else
> + { HGATP_MODE_SV39X4, 3, "Sv39x4" },
> + { HGATP_MODE_SV48X4, 4, "Sv48x4" },
> + { HGATP_MODE_SV57X4, 5, "Sv57x4" },
> +#endif
> + };
> +
> + unsigned int mode_idx;
> +
> + gstage_mode = HGATP_MODE_OFF;
Why is this not the variable's initializer?
> + for ( mode_idx = 0; mode_idx < ARRAY_SIZE(modes); mode_idx++ )
> + {
> + unsigned long mode = modes[mode_idx].mode;
> +
> + csr_write(CSR_HGATP, MASK_INSR(mode, HGATP_MODE_MASK));
> +
> + if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
> + {
> + gstage_mode = mode;
> + break;
> + }
> + }
I take it that using the first available mode is only transient. To support bigger
guests, you may need to pick 48x4 or even 57x4 no matter that 39x4 is available.
I wonder whether you wouldn't be better off recording all supported modes right
away.
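A rough sketch of what recording all supported modes could look like, so that
a mode can later be chosen per guest. This is not taken from the patch:
probe_all_modes(), pick_mode(), struct mode_desc and the simulated hart are
hypothetical, and the simulated register is only one possible WARL behaviour.
The guest-physical widths (41, 50 and 59 bits) follow from the x4 variants
adding two bits to Sv39, Sv48 and Sv57.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODE_OFF    0U
#define MODE_SHIFT  60

/* Hypothetical model of a hart; not a Xen structure. */
struct fake_hart {
    uint64_t hgatp;         /* simulated CSR contents */
    unsigned int supported; /* bit N set => MODE value N is implemented */
};

/* Simulated WARL write: an unimplemented MODE is not latched. */
static void fake_hgatp_write(struct fake_hart *h, uint64_t val)
{
    unsigned int mode = (unsigned int)(val >> MODE_SHIFT);

    if ( mode == MODE_OFF || (h->supported & (1U << mode)) )
        h->hgatp = val;
}

static uint64_t fake_hgatp_read(const struct fake_hart *h)
{
    return h->hgatp;
}

struct mode_desc {
    unsigned char mode;     /* hgatp.MODE encoding */
    unsigned char gpa_bits; /* guest-physical address width */
};

static const struct mode_desc modes[] = {
    {  8, 41 }, /* Sv39x4 */
    {  9, 50 }, /* Sv48x4 */
    { 10, 59 }, /* Sv57x4 */
};

#define NR_MODES (sizeof(modes) / sizeof(modes[0]))

/* Probe every mode and return a bitmap indexed by MODE encoding. */
static unsigned int probe_all_modes(struct fake_hart *h)
{
    unsigned int bitmap = 0;
    size_t i;

    assert(h != NULL);

    for ( i = 0; i < NR_MODES; i++ )
    {
        fake_hgatp_write(h, (uint64_t)modes[i].mode << MODE_SHIFT);

        if ( (fake_hgatp_read(h) >> MODE_SHIFT) == modes[i].mode )
            bitmap |= 1U << modes[i].mode;
    }

    fake_hgatp_write(h, 0);

    return bitmap;
}

/* Smallest supported mode covering gpa_bits_needed, or MODE_OFF if none. */
static unsigned char pick_mode(unsigned int bitmap,
                               unsigned int gpa_bits_needed)
{
    size_t i;

    for ( i = 0; i < NR_MODES; i++ )
        if ( (bitmap & (1U << modes[i].mode)) &&
             modes[i].gpa_bits >= gpa_bits_needed )
            return modes[i].mode;

    return MODE_OFF;
}
```

With such a bitmap, most guests would end up in Sv39x4, and only guests
needing more guest-physical address space would get a deeper page table.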
Jan
* Re: [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode
2025-11-06 13:43 ` Jan Beulich
@ 2025-11-13 16:18 ` Oleksii Kurochko
2025-11-13 16:32 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-13 16:18 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/6/25 2:43 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>> Changes in V5:
>> - Add static and __initconst for local variable modes[] in
>> gstage_mode_detect().
>> - Change type for gstage_mode from 'unsigned long' to 'unsigned char'.
>> - Update the comment inside the definition of the modes[] variable in
>> gstage_mode_detect():
>> - Add information about Bare mode.
>> - Drop "a paged virtual-memory scheme described in Section 10.3" as it isn't
>> relevant here.
>> - Drop printing of function name when chosen G-stage mode message is printed.
>> - Drop the call of gstage_mode_detect() from start_xen(). It will be added into
>> p2m_init() when the latter will be introduced.
> Well, thanks, but ...
>
>> - Introduce pre_gstage_init().
> ... the same comment that I gave before now applies here: This doesn't look to
> belong directly in start_xen(). In x86'es terms I'd say this is a tiny part of
> paging_init().
Is it only a question of function naming now?
IMO, ideally it would be nice to have it in p2m_init(), but it doesn't make much
sense to detect supported modes each time a domain is constructed. That is the
reason why I put it directly into start_xen().
>
>> --- a/xen/arch/riscv/include/asm/riscv_encoding.h
>> +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
>> @@ -131,13 +131,16 @@
>> #define HGATP_MODE_SV32X4 _UL(1)
>> #define HGATP_MODE_SV39X4 _UL(8)
>> #define HGATP_MODE_SV48X4 _UL(9)
>> +#define HGATP_MODE_SV57X4 _UL(10)
>>
>> #define HGATP32_MODE_SHIFT 31
>> +#define HGATP32_MODE_MASK _UL(0x80000000)
> Please can we avoid redundant (and then not even connected) #define-s? I
> don't see why you would need HGATP32_MODE_SHIFT when you have
> HGATP32_MODE_MASK. Similarly ...
>
>> #define HGATP32_VMID_SHIFT 22
>> #define HGATP32_VMID_MASK _UL(0x1FC00000)
> ... here, while ...
>
>> #define HGATP32_PPN _UL(0x003FFFFF)
> ... here the constant isn't even suffixed with _MASK.
I agree that it is enough to have only *_MASK.
I can do a separate cleanup patch (as the defines mentioned below were already
introduced earlier) which will drop:
HGATP32_MODE_SHIFT, HGATP32_VMID_SHIFT, HGATP64_MODE_SHIFT, HGATP64_VMID_SHIFT
and rename:
HGATP*_PPN to HGATP*_PPN_MASK
Does that make sense?
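To illustrate why the *_SHIFT constants become redundant, here is a small
standalone sketch in which every field access is derived from the mask alone.
The two macros are defined locally and modeled on the MASK_EXTR()/MASK_INSR()
helpers the patch already uses (the in-tree definitions may differ);
hgatp_make(), hgatp_mode() and hgatp_vmid() are hypothetical helpers, and
HGATP64_PPN_MASK is the proposed name, not an existing one.

```c
#include <assert.h>
#include <stdint.h>

/* (m) & -(m) isolates the lowest set bit of the mask, i.e. 1 << shift. */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

#define HGATP64_MODE_MASK  UINT64_C(0xF000000000000000)
#define HGATP64_VMID_MASK  UINT64_C(0x03FFF00000000000)
#define HGATP64_PPN_MASK   UINT64_C(0x00000FFFFFFFFFFF) /* proposed rename */

/* Compose an hgatp value from its three fields. */
static uint64_t hgatp_make(uint64_t mode, uint64_t vmid, uint64_t ppn)
{
    assert(mode <= 0xF);

    return MASK_INSR(mode, HGATP64_MODE_MASK) |
           MASK_INSR(vmid, HGATP64_VMID_MASK) |
           MASK_INSR(ppn, HGATP64_PPN_MASK);
}

static uint64_t hgatp_mode(uint64_t hgatp)
{
    return MASK_EXTR(hgatp, HGATP64_MODE_MASK);
}

static uint64_t hgatp_vmid(uint64_t hgatp)
{
    return MASK_EXTR(hgatp, HGATP64_VMID_MASK);
}
```

No shift amount appears anywhere, so the mask is the single source of truth
for each field's position and width.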
>
>> --- /dev/null
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -0,0 +1,96 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +
>> +#include <xen/init.h>
>> +#include <xen/lib.h>
>> +#include <xen/macros.h>
>> +#include <xen/sections.h>
>> +
>> +#include <asm/csr.h>
>> +#include <asm/flushtlb.h>
>> +#include <asm/riscv_encoding.h>
>> +
>> +unsigned char __ro_after_init gstage_mode;
>> +
>> +static void __init gstage_mode_detect(void)
>> +{
>> + static const struct {
>> + unsigned char mode;
>> + unsigned int paging_levels;
>> + const char name[8];
>> + } modes[] __initconst = {
>> + /*
>> + * Based on the RISC-V spec:
>> + * Bare mode is always supported, regardless of SXLEN.
>> + * When SXLEN=32, the only other valid setting for MODE is Sv32.
>> + * When SXLEN=64, three paged virtual-memory schemes are defined:
>> + * Sv39, Sv48, and Sv57.
>> + */
>> +#ifdef CONFIG_RISCV_32
>> + { HGATP_MODE_SV32X4, 2, "Sv32x4" }
>> +#else
>> + { HGATP_MODE_SV39X4, 3, "Sv39x4" },
>> + { HGATP_MODE_SV48X4, 4, "Sv48x4" },
>> + { HGATP_MODE_SV57X4, 5, "Sv57x4" },
>> +#endif
>> + };
>> +
>> + unsigned int mode_idx;
>> +
>> + gstage_mode = HGATP_MODE_OFF;
> Why is this not the variable's initializer?
Good point. It should be the variable's initializer.
>> + for ( mode_idx = 0; mode_idx < ARRAY_SIZE(modes); mode_idx++ )
>> + {
>> + unsigned long mode = modes[mode_idx].mode;
>> +
>> + csr_write(CSR_HGATP, MASK_INSR(mode, HGATP_MODE_MASK));
>> +
>> + if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
>> + {
>> + gstage_mode = mode;
>> + break;
>> + }
>> + }
> I take it that using the first available mode is only transient. To support bigger
> guests, you may need to pick 48x4 or even 57x4 no matter that 39x4 is available.
I considered traversing the modes[] array in the opposite order so that the largest
mode would be checked first. However, I decided that 39x4 is sufficiently large and
provides a good balance between the number of page tables and supported address
space, at least for now.
> I wonder whether you wouldn't be better off recording all supported modes right
> away.
What would be the use case for recording and storing all supported modes?
For example, would it be used to indicate which mode is preferable for a guest
domain via the device tree?
Also, I’d like to note that it probably doesn’t make much sense to record all
supported modes. If we traverse the modes[] array in the opposite order, checking
Sv57 first, then, according to the RISC-V specification:
- Implementations that support Sv57 must also support Sv48.
- Implementations that support Sv48 must also support Sv39.
So if Sv57 is supported, then the lower modes are supported too (except Sv32, which is RV32-only).
Based on this, it seems reasonable to start checking from Sv57, right?
Thanks.
~ Oleksii
* Re: [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode
2025-11-13 16:18 ` Oleksii Kurochko
@ 2025-11-13 16:32 ` Jan Beulich
2025-11-18 8:56 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-13 16:32 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 13.11.2025 17:18, Oleksii Kurochko wrote:
> On 11/6/25 2:43 PM, Jan Beulich wrote:
>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>> Changes in V5:
>>> - Add static and __initconst for local variable modes[] in
>>> gstage_mode_detect().
>>> - Change type for gstage_mode from 'unsigned long' to 'unsigned char'.
>>> - Update the comment inside the definition of the modes[] variable in
>>> gstage_mode_detect():
>>> - Add information about Bare mode.
>>> - Drop "a paged virtual-memory scheme described in Section 10.3" as it isn't
>>> relevant here.
>>> - Drop printing of function name when chosen G-stage mode message is printed.
>>> - Drop the call of gstage_mode_detect() from start_xen(). It will be added into
>>> p2m_init() when the latter will be introduced.
>> Well, thanks, but ...
>>
>>> - Introduce pre_gstage_init().
>> ... the same comment that I gave before now applies here: This doesn't look to
>> belong directly in start_xen(). In x86'es terms I'd say this is a tiny part of
>> paging_init().
>
> Is it only a question of function naming now?
Not just, no. My point is that you shouldn't pollute start_xen() with calls to
dozens of special-purpose functions. There wants to be one call dealing with
everything guest-mm related, I think.
> IMO, ideally it would be nice to have it in p2m_init(), but it doesn't make much
> sense to detect supported modes each time a domain is constructed. That is the
> reason why I put it directly into start_xen().
No per-domain function wants to be used for this, I agree. Hence why I pointed
you at x86'es paging_init().
>>> --- /dev/null
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -0,0 +1,96 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +
>>> +#include <xen/init.h>
>>> +#include <xen/lib.h>
>>> +#include <xen/macros.h>
>>> +#include <xen/sections.h>
>>> +
>>> +#include <asm/csr.h>
>>> +#include <asm/flushtlb.h>
>>> +#include <asm/riscv_encoding.h>
>>> +
>>> +unsigned char __ro_after_init gstage_mode;
>>> +
>>> +static void __init gstage_mode_detect(void)
>>> +{
>>> + static const struct {
>>> + unsigned char mode;
>>> + unsigned int paging_levels;
>>> + const char name[8];
>>> + } modes[] __initconst = {
>>> + /*
>>> + * Based on the RISC-V spec:
>>> + * Bare mode is always supported, regardless of SXLEN.
>>> + * When SXLEN=32, the only other valid setting for MODE is Sv32.
>>> + * When SXLEN=64, three paged virtual-memory schemes are defined:
>>> + * Sv39, Sv48, and Sv57.
>>> + */
>>> +#ifdef CONFIG_RISCV_32
>>> + { HGATP_MODE_SV32X4, 2, "Sv32x4" }
>>> +#else
>>> + { HGATP_MODE_SV39X4, 3, "Sv39x4" },
>>> + { HGATP_MODE_SV48X4, 4, "Sv48x4" },
>>> + { HGATP_MODE_SV57X4, 5, "Sv57x4" },
>>> +#endif
>>> + };
>>> +
>>> + unsigned int mode_idx;
>>> +
>>> + gstage_mode = HGATP_MODE_OFF;
>> Why is this not the variable's initializer?
>
> Good point. It should be the variable's initializer.
>
>>> + for ( mode_idx = 0; mode_idx < ARRAY_SIZE(modes); mode_idx++ )
>>> + {
>>> + unsigned long mode = modes[mode_idx].mode;
>>> +
>>> + csr_write(CSR_HGATP, MASK_INSR(mode, HGATP_MODE_MASK));
>>> +
>>> + if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
>>> + {
>>> + gstage_mode = mode;
>>> + break;
>>> + }
>>> + }
>> I take it that using the first available mode is only transient. To support bigger
>> guests, you may need to pick 48x4 or even 57x4 no matter that 39x4 is available.
>
> I considered traversing the modes[] array in the opposite order so that the largest
> mode would be checked first. However, I decided that 39x4 is sufficiently large and
> provides a good balance between the number of page tables and supported address
> space, at least for now.
>
>> I wonder whether you wouldn't be better off recording all supported modes right
>> away.
>
> What would be the use case for recording and storing all supported modes?
> For example, would it be used to indicate which mode is preferable for a guest
> domain via the device tree?
Why device tree? That's what's exposed to guests, isn't it? Here we talk about
what Xen uses to run guests. And that can vary from guest to guest.
> Also, I’d like to note that it probably doesn’t make much sense to record all
> supported modes. If we traverse the modes[] array in the opposite order, checking
> Sv57 first, then, according to the RISC-V specification:
> - Implementations that support Sv57 must also support Sv48.
> - Implementations that support Sv48 must also support Sv39.
> So if Sv57 is supported then lower modes are supported too. (except Sv32 for RV32)
>
> Based on this, it seems reasonable to start checking from Sv57, right?
No. Bigger guests want running in 48x4, huge ones in 57x4 (each: if available),
and most ones in 39x4. It doesn't matter what direction you do the checks, you
want to know what you have available.
Jan
* Re: [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode
2025-11-13 16:32 ` Jan Beulich
@ 2025-11-18 8:56 ` Oleksii Kurochko
2025-11-18 9:05 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-18 8:56 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/13/25 5:32 PM, Jan Beulich wrote:
> On 13.11.2025 17:18, Oleksii Kurochko wrote:
>> On 11/6/25 2:43 PM, Jan Beulich wrote:
>>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>>> Changes in V5:
>>>> - Add static and __initconst for local variable modes[] in
>>>> gstage_mode_detect().
>>>> - Change type for gstage_mode from 'unsigned long' to 'unsigned char'.
>>>> - Update the comment inside the definition of the modes[] variable in
>>>> gstage_mode_detect():
>>>> - Add information about Bare mode.
>>>> - Drop "a paged virtual-memory scheme described in Section 10.3" as it isn't
>>>> relevant here.
>>>> - Drop printing of function name when chosen G-stage mode message is printed.
>>>> - Drop the call of gstage_mode_detect() from start_xen(). It will be added into
>>>> p2m_init() when the latter will be introduced.
>>> Well, thanks, but ...
>>>
>>>> - Introduce pre_gstage_init().
>>> ... the same comment that I gave before now applies here: This doesn't look to
>>> belong directly in start_xen(). In x86'es terms I'd say this is a tiny part of
>>> paging_init().
>> Is it only a question of function naming now?
> Not just, no. My point is that you shouldn't pollute start_xen() with calls to
> dozens of special-purpose functions. There wants to be one call dealing with
> everything guest-mm related, I think.
I think I understand your point now. I’ll introduce guest_mm_init() and move
everything guest-mm related into that function (at the moment, this includes
the VMID initialization and G-stage mode detection).
>
>> IMO, ideally it would be nice to have it in p2m_init(), but it doesn't make much
>> sense to detect supported modes each time a domain is constructed. That is the
>> reason why I put it directly into start_xen().
> No per-domain function wants to be used for this, I agree. Hence why I pointed
> you at x86'es paging_init().
>
>>>> --- /dev/null
>>>> +++ b/xen/arch/riscv/p2m.c
>>>> @@ -0,0 +1,96 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>>> +
>>>> +#include <xen/init.h>
>>>> +#include <xen/lib.h>
>>>> +#include <xen/macros.h>
>>>> +#include <xen/sections.h>
>>>> +
>>>> +#include <asm/csr.h>
>>>> +#include <asm/flushtlb.h>
>>>> +#include <asm/riscv_encoding.h>
>>>> +
>>>> +unsigned char __ro_after_init gstage_mode;
>>>> +
>>>> +static void __init gstage_mode_detect(void)
>>>> +{
>>>> + static const struct {
>>>> + unsigned char mode;
>>>> + unsigned int paging_levels;
>>>> + const char name[8];
>>>> + } modes[] __initconst = {
>>>> + /*
>>>> + * Based on the RISC-V spec:
>>>> + * Bare mode is always supported, regardless of SXLEN.
>>>> + * When SXLEN=32, the only other valid setting for MODE is Sv32.
>>>> + * When SXLEN=64, three paged virtual-memory schemes are defined:
>>>> + * Sv39, Sv48, and Sv57.
>>>> + */
>>>> +#ifdef CONFIG_RISCV_32
>>>> + { HGATP_MODE_SV32X4, 2, "Sv32x4" }
>>>> +#else
>>>> + { HGATP_MODE_SV39X4, 3, "Sv39x4" },
>>>> + { HGATP_MODE_SV48X4, 4, "Sv48x4" },
>>>> + { HGATP_MODE_SV57X4, 5, "Sv57x4" },
>>>> +#endif
>>>> + };
>>>> +
>>>> + unsigned int mode_idx;
>>>> +
>>>> + gstage_mode = HGATP_MODE_OFF;
>>> Why is this not the variable's initializer?
>> Good point. It should be the variable's initializer.
>>
>>>> + for ( mode_idx = 0; mode_idx < ARRAY_SIZE(modes); mode_idx++ )
>>>> + {
>>>> + unsigned long mode = modes[mode_idx].mode;
>>>> +
>>>> + csr_write(CSR_HGATP, MASK_INSR(mode, HGATP_MODE_MASK));
>>>> +
>>>> + if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
>>>> + {
>>>> + gstage_mode = mode;
>>>> + break;
>>>> + }
>>>> + }
>>> I take it that using the first available mode is only transient. To support bigger
>>> guests, you may need to pick 48x4 or even 57x4 no matter that 39x4 is available.
>> I considered traversing the modes[] array in the opposite order so that the largest
>> mode would be checked first. However, I decided that 39x4 is sufficiently large and
>> provides a good balance between the number of page tables and supported address
>> space, at least for now.
>>
>>> I wonder whether you wouldn't be better off recording all supported modes right
>>> away.
>> What would be the use case for recording and storing all supported modes?
>> For example, would it be used to indicate which mode is preferable for a guest
>> domain via the device tree?
> Why device tree? That's what's exposed to guests, isn't it? Here we talk about
> what Xen uses to run guests. And that can vary from guest to guest.
At the same time, the bootloader also passes a device tree to Xen, and Xen
uses it, at least to determine the RAM addresses and sizes. I also referred to
the device tree because it can indicate the largest MMU mode supported on a hart:
  mmu-type:
    description:
      Identifies the largest MMU address translation mode supported by
      this hart. These values originate from the RISC-V Privileged
      Specification document, available from
      https://riscv.org/specifications/
    $ref: /schemas/types.yaml#/definitions/string
    enum:
      - riscv,sv32
      - riscv,sv39
      - riscv,sv48
      - riscv,sv57
      - riscv,none
So I thought that Xen could also reuse this information as the starting
value for mode detection. But considering how few modes the RISC-V spec
defines, it seems that it won't take long just to detect which ones are
supported the way it is done now.
>
>> Also, I’d like to note that it probably doesn’t make much sense to record all
>> supported modes. If we traverse the modes[] array in the opposite order, checking
>> Sv57 first, then, according to the RISC-V specification:
>> - Implementations that support Sv57 must also support Sv48.
>> - Implementations that support Sv48 must also support Sv39.
>> So if Sv57 is supported then lower modes are supported too. (except Sv32 for RV32)
>>
>> Based on this, it seems reasonable to start checking from Sv57, right?
> No. Bigger guests want running in 48x4, huge ones in 57x4 (each: if available),
> and most ones in 39x4. It doesn't matter what direction you do the checks, you
> want to know what you have available.
My point was that if we change the direction, then once we find the first (largest)
supported MMU mode, there is no need to check the others (lower modes) as according
to the RISC-V specification, the lower modes must be supported automatically.
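In code, the idea could look roughly like the sketch below: probe from the
largest mode down and, on the first hit, mark that mode and every smaller one
as supported without probing them. All names here (struct probe_ctx,
fake_probe(), supported_from_largest()) are hypothetical, the callback stands
in for the CSR write/read-back test, and the sketch assumes the implication
rule quoted above carries over to the x4 variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe context: largest implemented MODE and a call counter. */
struct probe_ctx {
    unsigned char largest;
    unsigned int calls;
};

/* Stand-in for the hgatp write/read-back check on real hardware. */
static bool fake_probe(unsigned char mode, void *opaque)
{
    struct probe_ctx *ctx = opaque;

    ctx->calls++;

    return mode <= ctx->largest;
}

/*
 * Probe from the largest mode down. The first mode that works implies, by
 * the rule quoted above, that all smaller ones work too, so they are added
 * to the bitmap without being probed.
 */
static unsigned int supported_from_largest(bool (*probe)(unsigned char, void *),
                                           void *opaque)
{
    /* hgatp.MODE encodings: Sv57x4, Sv48x4, Sv39x4. */
    static const unsigned char modes[] = { 10, 9, 8 };
    const size_t nr = sizeof(modes) / sizeof(modes[0]);
    unsigned int bitmap = 0;
    size_t i, j;

    assert(probe != NULL);

    for ( i = 0; i < nr; i++ )
    {
        if ( !probe(modes[i], opaque) )
            continue;

        for ( j = i; j < nr; j++ )
            bitmap |= 1U << modes[j];

        break;
    }

    return bitmap;
}
```

The result is still the full set of available modes, which is what is needed
for a per-guest choice, while the number of CSR accesses is minimal when the
largest mode is implemented.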
~ Oleksii
* Re: [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode
2025-11-18 8:56 ` Oleksii Kurochko
@ 2025-11-18 9:05 ` Jan Beulich
0 siblings, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-18 9:05 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 18.11.2025 09:56, Oleksii Kurochko wrote:
> On 11/13/25 5:32 PM, Jan Beulich wrote:
>> On 13.11.2025 17:18, Oleksii Kurochko wrote:
>>> Also, I’d like to note that it probably doesn’t make much sense to record all
>>> supported modes. If we traverse the modes[] array in the opposite order, checking
>>> Sv57 first, then, according to the RISC-V specification:
>>> - Implementations that support Sv57 must also support Sv48.
>>> - Implementations that support Sv48 must also support Sv39.
>>> So if Sv57 is supported then lower modes are supported too. (except Sv32 for RV32)
>>>
>>> Based on this, it seems reasonable to start checking from Sv57, right?
>> No. Bigger guests want running in 48x4, huge ones in 57x4 (each: if available),
>> and most ones in 39x4. It doesn't matter what direction you do the checks, you
>> want to know what you have available.
>
> My point was that if we change the direction, then once we find the first (largest)
> supported MMU mode, there is no need to check the others (lower modes) as according
> to the RISC-V specification, the lower modes must be supported automatically.
Oh, I see, makes sense.
Jan
* [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-06 14:05 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 03/18] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
` (15 subsequent siblings)
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
The current implementation is based on x86's way of allocating VMIDs:
VMIDs partition the physical TLB. In the current implementation VMIDs are
introduced to reduce the number of TLB flushes. Each time a guest-physical
address space changes, instead of flushing the TLB, a new VMID is
assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
The biggest advantage is that hot parts of the hypervisor's code and data
remain in the TLB.
VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
VMIDs are assigned in a round-robin scheme. To minimize the overhead of
VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
64-bit generation. Only on a generation overflow does the code need to
invalidate all VMID information stored at the VCPUs which are run on the
specific physical processor. When this overflow occurs, VMID usage is
disabled to retain correctness.
Only minor changes are made compared to the x86 implementation.
These include using RISC-V-specific terminology, adding a check to ensure
the type used for storing the VMID has enough bits to hold VMIDLEN,
introducing a new function vmidlen_detect() to determine the VMIDLEN
value, and renaming everything connected to VMID enable/disable to "VMID use
enable/disable".
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Rename opt_vmid_use_enabled with opt_vmid to be in sync with command line
option.
- Invert the expression for data->used = ... and swap "dis" and "en". Also,
invert usage of data->used elsewhere.
- s/vcpu_vmid_flush_vcpu/vmid_flush_vcpu.
- Add prototypes to asm/vmid.h which could be used outside vmid.c.
- Update the comment in vmidlen_detect(): instead of Section 3.7 ->
Section "Physical Memory Protection".
- Move vmid_init() call to pre_gstage_init().
---
Changes in V4:
- s/guest's virtual/guest-physical in the comment inside vmid.c
and in commit message.
- Drop x86-related numbers in the comment about "Sketch of the Implementation".
- s/__read_only/__ro_after_init in declaration of opt_vmid_enabled.
- s/hart_vmid_generation/generation.
- Update vmidlen_detect() to work with unsigned int type for vmid_bits
variable.
- Drop the 'old' variable in vmidlen_detect(); there seems to be no reason
to restore the old value of hgatp with no guest running on a hart yet.
- Update the comment above local_hfence_gvma_all() in vmidlen_detect().
- s/max_availalbe_bits/max_available_bits.
- use BITS_PER_BYTE, instead of "<< 3".
- Add BUILD_BUG_ON() instead of a run-time check that the number of set bits
can be held in vmid_data->max_vmid.
- Apply changes from the patch "x86/HVM: polish hvm_asid_init() a little" here
(changes connected to g_disabled) with the following minor changes:
Update the printk() message to "VMIDs use is...".
Rename g_disabled to g_vmid_used.
- Rename member 'disabled' of vmid_data structure to used.
- Use gstage_mode to properly detect VMIDLEN.
---
Changes in V3:
- Reimplement VMID allocation similar to what x86 has implemented.
---
Changes in V2:
- New patch.
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/domain.h | 6 +
xen/arch/riscv/include/asm/vmid.h | 14 ++
xen/arch/riscv/p2m.c | 3 +
xen/arch/riscv/vmid.c | 193 ++++++++++++++++++++++++++++
5 files changed, 217 insertions(+)
create mode 100644 xen/arch/riscv/include/asm/vmid.h
create mode 100644 xen/arch/riscv/vmid.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 264e265699..e2499210c8 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -17,6 +17,7 @@ obj-y += smpboot.o
obj-y += stubs.o
obj-y += time.o
obj-y += traps.o
+obj-y += vmid.o
obj-y += vm_event.o
$(TARGET): $(TARGET)-syms
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index c3d965a559..aac1040658 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -5,6 +5,11 @@
#include <xen/xmalloc.h>
#include <public/hvm/params.h>
+struct vcpu_vmid {
+ uint64_t generation;
+ uint16_t vmid;
+};
+
struct hvm_domain
{
uint64_t params[HVM_NR_PARAMS];
@@ -14,6 +19,7 @@ struct arch_vcpu_io {
};
struct arch_vcpu {
+ struct vcpu_vmid vmid;
};
struct arch_domain {
diff --git a/xen/arch/riscv/include/asm/vmid.h b/xen/arch/riscv/include/asm/vmid.h
new file mode 100644
index 0000000000..1c500c4aff
--- /dev/null
+++ b/xen/arch/riscv/include/asm/vmid.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef ASM_RISCV_VMID_H
+#define ASM_RISCV_VMID_H
+
+struct vcpu;
+struct vcpu_vmid;
+
+void vmid_init(void);
+bool vmid_handle_vmenter(struct vcpu_vmid *vmid);
+void vmid_flush_vcpu(struct vcpu *v);
+void vmid_flush_hart(void);
+
+#endif /* ASM_RISCV_VMID_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 00fe676089..d8027a270f 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -8,6 +8,7 @@
#include <asm/csr.h>
#include <asm/flushtlb.h>
#include <asm/riscv_encoding.h>
+#include <asm/vmid.h>
unsigned char __ro_after_init gstage_mode;
@@ -93,4 +94,6 @@ static void __init gstage_mode_detect(void)
void __init pre_gstage_init(void)
{
gstage_mode_detect();
+
+ vmid_init();
}
diff --git a/xen/arch/riscv/vmid.c b/xen/arch/riscv/vmid.c
new file mode 100644
index 0000000000..885d177e9f
--- /dev/null
+++ b/xen/arch/riscv/vmid.c
@@ -0,0 +1,193 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include <xen/domain.h>
+#include <xen/init.h>
+#include <xen/sections.h>
+#include <xen/lib.h>
+#include <xen/param.h>
+#include <xen/percpu.h>
+
+#include <asm/atomic.h>
+#include <asm/csr.h>
+#include <asm/flushtlb.h>
+#include <asm/p2m.h>
+
+/* Xen command-line option to enable VMIDs */
+static bool __ro_after_init opt_vmid = true;
+boolean_param("vmid", opt_vmid);
+
+/*
+ * VMIDs partition the physical TLB. In the current implementation VMIDs are
+ * introduced to reduce the number of TLB flushes. Each time a guest-physical
+ * address space changes, instead of flushing the TLB, a new VMID is
+ * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
+ * The biggest advantage is that hot parts of the hypervisor's code and data
+ * retain in the TLB.
+ *
+ * Sketch of the Implementation:
+ *
+ * VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
+ * VMIDs are assigned in a round-robin scheme. To minimize the overhead of
+ * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
+ * 64-bit generation. Only on a generation overflow the code needs to
+ * invalidate all VMID information stored at the VCPUs with are run on the
+ * specific physical processor. When this overflow appears VMID usage is
+ * disabled to retain correctness.
+ */
+
+/* Per-Hart VMID management. */
+struct vmid_data {
+ uint64_t generation;
+ uint16_t next_vmid;
+ uint16_t max_vmid;
+ bool used;
+};
+
+static DEFINE_PER_CPU(struct vmid_data, vmid_data);
+
+static unsigned int vmidlen_detect(void)
+{
+ unsigned int vmid_bits;
+
+ /*
+ * According to the RISC-V Privileged Architecture Spec:
+ * When MODE=Bare, guest physical addresses are equal to supervisor
+ * physical addresses, and there is no further memory protection
+ * for a guest virtual machine beyond the physical memory protection
+ * scheme described in Section "Physical Memory Protection".
+ * In this case, the remaining fields in hgatp must be set to zeros.
+ * Thereby it is necessary to set gstage_mode not equal to Bare.
+ */
+ ASSERT(gstage_mode != HGATP_MODE_OFF);
+ csr_write(CSR_HGATP,
+ MASK_INSR(gstage_mode, HGATP_MODE_MASK) | HGATP_VMID_MASK);
+ vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
+ vmid_bits = flsl(vmid_bits);
+ csr_write(CSR_HGATP, _AC(0, UL));
+
+ /*
+ * From RISC-V spec:
+ * Speculative executions of the address-translation algorithm behave as
+ * non-speculative executions of the algorithm do, except that they must
+ * not set the dirty bit for a PTE, they must not trigger an exception,
+ * and they must not create address-translation cache entries if those
+ * entries would have been invalidated by any SFENCE.VMA instruction
+ * executed by the hart since the speculative execution of the algorithm
+ * began.
+ *
+ * Also, despite of the fact here it is mentioned that when V=0 two-stage
+ * address translation is inactivated:
+ * The current virtualization mode, denoted V, indicates whether the hart
+ * is currently executing in a guest. When V=1, the hart is either in
+ * virtual S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest
+ * OS running in VS-mode. When V=0, the hart is either in M-mode, in
+ * HS-mode, or in U-mode atop an OS running in HS-mode. The
+ * virtualization mode also indicates whether two-stage address
+ * translation is active (V=1) or inactive (V=0).
+ * But on the same side, writing to hgatp register activates it:
+ * The hgatp register is considered active for the purposes of
+ * the address-translation algorithm unless the effective privilege mode
+ * is U and hstatus.HU=0.
+ *
+ * Thereby it leaves some room for speculation even in this stage of boot,
+ * so it could be that we polluted local TLB so flush all guest TLB.
+ */
+ local_hfence_gvma_all();
+
+ return vmid_bits;
+}
+
+void vmid_init(void)
+{
+ static int8_t g_vmid_used = -1;
+
+ unsigned int vmid_len = vmidlen_detect();
+ struct vmid_data *data = &this_cpu(vmid_data);
+
+ BUILD_BUG_ON((HGATP_VMID_MASK >> HGATP_VMID_SHIFT) >
+ (BIT((sizeof(data->max_vmid) * BITS_PER_BYTE), UL) - 1));
+
+ data->max_vmid = BIT(vmid_len, U) - 1;
+ data->used = opt_vmid && (vmid_len > 1);
+
+ if ( g_vmid_used < 0 )
+ {
+ g_vmid_used = data->used;
+ printk("VMIDs use is %sabled\n", data->used ? "en" : "dis");
+ }
+ else if ( g_vmid_used != data->used )
+ printk("CPU%u: VMIDs use is %sabled\n", smp_processor_id(),
+ data->used ? "en" : "dis");
+
+ /* Zero indicates 'invalid generation', so we start the count at one. */
+ data->generation = 1;
+
+ /* Zero indicates 'VMIDs use disabled', so we start the count at one. */
+ data->next_vmid = 1;
+}
+
+void vmid_flush_vcpu(struct vcpu *v)
+{
+ write_atomic(&v->arch.vmid.generation, 0);
+}
+
+void vmid_flush_hart(void)
+{
+ struct vmid_data *data = &this_cpu(vmid_data);
+
+ if ( !data->used )
+ return;
+
+ if ( likely(++data->generation != 0) )
+ return;
+
+ /*
+ * VMID generations are 64 bit. Overflow of generations never happens.
+ * For safety, we simply disable ASIDs, so correctness is established; it
+ * only runs a bit slower.
+ */
+ printk("%s: VMID generation overrun. Disabling VMIDs.\n", __func__);
+ data->used = false;
+}
+
+bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
+{
+ struct vmid_data *data = &this_cpu(vmid_data);
+
+ /* Test if VCPU has valid VMID. */
+ if ( read_atomic(&vmid->generation) == data->generation )
+ return 0;
+
+ /* If there are no free VMIDs, need to go to a new generation. */
+ if ( unlikely(data->next_vmid > data->max_vmid) )
+ {
+ vmid_flush_hart();
+ data->next_vmid = 1;
+ if ( !data->used )
+ goto disabled;
+ }
+
+ /* Now guaranteed to be a free VMID. */
+ vmid->vmid = data->next_vmid++;
+ write_atomic(&vmid->generation, data->generation);
+
+ /*
+ * When we assign VMID 1, flush all TLB entries as we are starting a new
+ * generation, and all old VMID allocations are now stale.
+ */
+ return (vmid->vmid == 1);
+
+ disabled:
+ vmid->vmid = 0;
+ return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
--
2.51.0
* Re: [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement
2025-10-20 15:57 ` [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
@ 2025-11-06 14:05 ` Jan Beulich
2025-11-14 9:27 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-06 14:05 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> --- /dev/null
> +++ b/xen/arch/riscv/vmid.c
> @@ -0,0 +1,193 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include <xen/domain.h>
> +#include <xen/init.h>
> +#include <xen/sections.h>
> +#include <xen/lib.h>
> +#include <xen/param.h>
> +#include <xen/percpu.h>
> +
> +#include <asm/atomic.h>
> +#include <asm/csr.h>
> +#include <asm/flushtlb.h>
> +#include <asm/p2m.h>
> +
> +/* Xen command-line option to enable VMIDs */
> +static bool __ro_after_init opt_vmid = true;
> +boolean_param("vmid", opt_vmid);
Command line options, btw, want documenting in docs/misc/xen-command-line.pandoc.
> +/*
> + * VMIDs partition the physical TLB. In the current implementation VMIDs are
> + * introduced to reduce the number of TLB flushes. Each time a guest-physical
> + * address space changes, instead of flushing the TLB, a new VMID is
> + * assigned. This reduces the number of TLB flushes to at most 1/#VMIDs.
> + * The biggest advantage is that hot parts of the hypervisor's code and data
> + * retain in the TLB.
> + *
> + * Sketch of the Implementation:
> + *
> + * VMIDs are a hart-local resource. As preemption of VMIDs is not possible,
> + * VMIDs are assigned in a round-robin scheme. To minimize the overhead of
> + * VMID invalidation, at the time of a TLB flush, VMIDs are tagged with a
> + * 64-bit generation. Only on a generation overflow the code needs to
> + * invalidate all VMID information stored at the VCPUs with are run on the
> + * specific physical processor. When this overflow appears VMID usage is
> + * disabled to retain correctness.
> + */
> +
> +/* Per-Hart VMID management. */
> +struct vmid_data {
> + uint64_t generation;
> + uint16_t next_vmid;
> + uint16_t max_vmid;
> + bool used;
> +};
> +
> +static DEFINE_PER_CPU(struct vmid_data, vmid_data);
> +
> +static unsigned int vmidlen_detect(void)
> +{
> + unsigned int vmid_bits;
> +
> + /*
> + * According to the RISC-V Privileged Architecture Spec:
> + * When MODE=Bare, guest physical addresses are equal to supervisor
> + * physical addresses, and there is no further memory protection
> + * for a guest virtual machine beyond the physical memory protection
> + * scheme described in Section "Physical Memory Protection".
> + * In this case, the remaining fields in hgatp must be set to zeros.
> + * Thereby it is necessary to set gstage_mode not equal to Bare.
> + */
> + ASSERT(gstage_mode != HGATP_MODE_OFF);
> + csr_write(CSR_HGATP,
> + MASK_INSR(gstage_mode, HGATP_MODE_MASK) | HGATP_VMID_MASK);
> + vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
> + vmid_bits = flsl(vmid_bits);
> + csr_write(CSR_HGATP, _AC(0, UL));
> +
> + /*
> + * From RISC-V spec:
> + * Speculative executions of the address-translation algorithm behave as
> + * non-speculative executions of the algorithm do, except that they must
> + * not set the dirty bit for a PTE, they must not trigger an exception,
> + * and they must not create address-translation cache entries if those
> + * entries would have been invalidated by any SFENCE.VMA instruction
> + * executed by the hart since the speculative execution of the algorithm
> + * began.
> + *
> + * Also, despite of the fact here it is mentioned that when V=0 two-stage
> + * address translation is inactivated:
> + * The current virtualization mode, denoted V, indicates whether the hart
> + * is currently executing in a guest. When V=1, the hart is either in
> + * virtual S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest
> + * OS running in VS-mode. When V=0, the hart is either in M-mode, in
> + * HS-mode, or in U-mode atop an OS running in HS-mode. The
> + * virtualization mode also indicates whether two-stage address
> + * translation is active (V=1) or inactive (V=0).
> + * But on the same side, writing to hgatp register activates it:
> + * The hgatp register is considered active for the purposes of
> + * the address-translation algorithm unless the effective privilege mode
> + * is U and hstatus.HU=0.
> + *
> + * Thereby it leaves some room for speculation even in this stage of boot,
> + * so it could be that we polluted local TLB so flush all guest TLB.
> + */
> + local_hfence_gvma_all();
That's a lot of redundancy with gstage_mode_detect(). The function call here
actually renders the one there redundant, afaict. Did you consider putting a
single instance at the end of it in pre_gstage_init()? Otherwise at least
don't repeat the comment here, but merely point at the other one?
> + return vmid_bits;
> +}
> +
> +void vmid_init(void)
This (and its helper) isn't __init because you intend to also call it during
bringup of secondary processors?
> +{
> + static int8_t g_vmid_used = -1;
Now that you're getting closer to the x86 original - __ro_after_init?
> + unsigned int vmid_len = vmidlen_detect();
> + struct vmid_data *data = &this_cpu(vmid_data);
> +
> + BUILD_BUG_ON((HGATP_VMID_MASK >> HGATP_VMID_SHIFT) >
> + (BIT((sizeof(data->max_vmid) * BITS_PER_BYTE), UL) - 1));
> +
> + data->max_vmid = BIT(vmid_len, U) - 1;
> + data->used = opt_vmid && (vmid_len > 1);
> +
> + if ( g_vmid_used < 0 )
> + {
> + g_vmid_used = data->used;
> + printk("VMIDs use is %sabled\n", data->used ? "en" : "dis");
> + }
> + else if ( g_vmid_used != data->used )
> + printk("CPU%u: VMIDs use is %sabled\n", smp_processor_id(),
> + data->used ? "en" : "dis");
> +
> + /* Zero indicates 'invalid generation', so we start the count at one. */
> + data->generation = 1;
> +
> + /* Zero indicates 'VMIDs use disabled', so we start the count at one. */
> + data->next_vmid = 1;
> +}
> +
> +void vmid_flush_vcpu(struct vcpu *v)
> +{
> + write_atomic(&v->arch.vmid.generation, 0);
> +}
> +
> +void vmid_flush_hart(void)
> +{
> + struct vmid_data *data = &this_cpu(vmid_data);
> +
> + if ( !data->used )
> + return;
> +
> + if ( likely(++data->generation != 0) )
> + return;
> +
> + /*
> + * VMID generations are 64 bit. Overflow of generations never happens.
> + * For safety, we simply disable ASIDs, so correctness is established; it
> + * only runs a bit slower.
> + */
> + printk("%s: VMID generation overrun. Disabling VMIDs.\n", __func__);
Is logging of the function name of any value here? Also, despite the x86
original having it like this - generally no full stops please in log
messages. "VMID generation overrun; disabling VMIDs\n" would do.
> +bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
> +{
> + struct vmid_data *data = &this_cpu(vmid_data);
> +
> + /* Test if VCPU has valid VMID. */
x86 has a ->disabled check up from here; why do you not check ->used?
> + if ( read_atomic(&vmid->generation) == data->generation )
> + return 0;
> +
> + /* If there are no free VMIDs, need to go to a new generation. */
> + if ( unlikely(data->next_vmid > data->max_vmid) )
> + {
> + vmid_flush_hart();
> + data->next_vmid = 1;
> + if ( !data->used )
> + goto disabled;
> + }
> +
> + /* Now guaranteed to be a free VMID. */
> + vmid->vmid = data->next_vmid++;
> + write_atomic(&vmid->generation, data->generation);
> +
> + /*
> + * When we assign VMID 1, flush all TLB entries as we are starting a new
> + * generation, and all old VMID allocations are now stale.
> + */
> + return (vmid->vmid == 1);
Minor: Parentheses aren't really needed here.
Jan
* Re: [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement
2025-11-06 14:05 ` Jan Beulich
@ 2025-11-14 9:27 ` Oleksii Kurochko
2025-11-17 8:35 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-14 9:27 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/6/25 3:05 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>
>> +static unsigned int vmidlen_detect(void)
>> +{
>> + unsigned int vmid_bits;
>> +
>> + /*
>> + * According to the RISC-V Privileged Architecture Spec:
>> + * When MODE=Bare, guest physical addresses are equal to supervisor
>> + * physical addresses, and there is no further memory protection
>> + * for a guest virtual machine beyond the physical memory protection
>> + * scheme described in Section "Physical Memory Protection".
>> + * In this case, the remaining fields in hgatp must be set to zeros.
>> + * Thereby it is necessary to set gstage_mode not equal to Bare.
>> + */
>> + ASSERT(gstage_mode != HGATP_MODE_OFF);
>> + csr_write(CSR_HGATP,
>> + MASK_INSR(gstage_mode, HGATP_MODE_MASK) | HGATP_VMID_MASK);
>> + vmid_bits = MASK_EXTR(csr_read(CSR_HGATP), HGATP_VMID_MASK);
>> + vmid_bits = flsl(vmid_bits);
>> + csr_write(CSR_HGATP, _AC(0, UL));
>> +
>> + /*
>> + * From RISC-V spec:
>> + * Speculative executions of the address-translation algorithm behave as
>> + * non-speculative executions of the algorithm do, except that they must
>> + * not set the dirty bit for a PTE, they must not trigger an exception,
>> + * and they must not create address-translation cache entries if those
>> + * entries would have been invalidated by any SFENCE.VMA instruction
>> + * executed by the hart since the speculative execution of the algorithm
>> + * began.
>> + *
>> + * Also, despite of the fact here it is mentioned that when V=0 two-stage
>> + * address translation is inactivated:
>> + * The current virtualization mode, denoted V, indicates whether the hart
>> + * is currently executing in a guest. When V=1, the hart is either in
>> + * virtual S-mode (VS-mode), or in virtual U-mode (VU-mode) atop a guest
>> + * OS running in VS-mode. When V=0, the hart is either in M-mode, in
>> + * HS-mode, or in U-mode atop an OS running in HS-mode. The
>> + * virtualization mode also indicates whether two-stage address
>> + * translation is active (V=1) or inactive (V=0).
>> + * But on the same side, writing to hgatp register activates it:
>> + * The hgatp register is considered active for the purposes of
>> + * the address-translation algorithm unless the effective privilege mode
>> + * is U and hstatus.HU=0.
>> + *
>> + * Thereby it leaves some room for speculation even in this stage of boot,
>> + * so it could be that we polluted local TLB so flush all guest TLB.
>> + */
>> + local_hfence_gvma_all();
> That's a lot of redundancy with gstage_mode_detect(). The function call here
> actually renders the one there redundant, afaict. Did you consider putting a
> single instance at the end of it in pre_gstage_init()? Otherwise at least
> don't repeat the comment here, but merely point at the other one?
Agree, it could be moved to the end of pre_gstage_init().
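A minimal sketch of that arrangement, under stated assumptions: the mock_* functions, tlb_maybe_polluted and fence_count are hypothetical stand-ins (the real probes write CSR_HGATP and the real flush is local_hfence_gvma_all()), so this only illustrates the ordering, not the actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state: set by a probe, cleared by the flush. */
static bool tlb_maybe_polluted;
static unsigned int fence_count;

/* Stand-in for local_hfence_gvma_all(). */
static void mock_hfence_gvma_all(void)
{
    tlb_maybe_polluted = false;
    fence_count++;
}

/* Stand-in for gstage_mode_detect(): probes MODE, issues no fence. */
static void mock_gstage_mode_detect(void)
{
    tlb_maybe_polluted = true;
}

/* Stand-in for vmid_init(): probes VMIDLEN, issues no fence. */
static void mock_vmid_init(void)
{
    tlb_maybe_polluted = true;
}

static void sketch_pre_gstage_init(void)
{
    mock_gstage_mode_detect();
    mock_vmid_init();

    /* One flush covers whatever either probe may have left in the TLB. */
    mock_hfence_gvma_all();
    assert(!tlb_maybe_polluted);
}
```

If vmid_init() ends up being called on secondary harts outside pre_gstage_init(), that path would presumably need its own flush.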
>> + return vmid_bits;
>> +}
>> +
>> +void vmid_init(void)
> This (and its helper) isn't __init because you intend to also call it during
> bringup of secondary processors?
Yes, I wasn't able to find any guarantee that VMIDLEN is the same on all
harts.
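To illustrate why the probe is per hart, here is a standalone sketch of the write-ones-and-read-back technique against a mocked register. struct mock_hart, mock_hgatp_write() and fls64_sketch() are hypothetical; the VMID field position is the RV64 one (bits 57:44), and unlike the patch the sketch leaves the MODE field out of the written value.

```c
#include <assert.h>
#include <stdint.h>

/* RV64 hgatp.VMID occupies bits 57:44 (14 bits), per the privileged spec. */
#define VMID_SHIFT 44
#define VMID_MASK  (UINT64_C(0x3fff) << VMID_SHIFT)

/* Hypothetical per-hart hardware model. */
struct mock_hart {
    unsigned int vmidlen;   /* number of implemented VMID bits */
    uint64_t hgatp;
};

/* WARL: only the implemented low-order VMID bits hold a written value. */
static void mock_hgatp_write(struct mock_hart *h, uint64_t val)
{
    uint64_t impl;

    assert(h->vmidlen <= 14);
    impl = ((UINT64_C(1) << h->vmidlen) - 1) << VMID_SHIFT;

    h->hgatp = (val & ~VMID_MASK) | (val & impl);
}

/* Index of the highest set bit plus one; 0 for x == 0 (like flsl()). */
static unsigned int fls64_sketch(uint64_t x)
{
    unsigned int n = 0;

    while ( x )
    {
        n++;
        x >>= 1;
    }

    return n;
}

static unsigned int vmidlen_detect_sketch(struct mock_hart *h)
{
    unsigned int bits;

    mock_hgatp_write(h, VMID_MASK);     /* try to set every VMID bit */
    bits = fls64_sketch((h->hgatp & VMID_MASK) >> VMID_SHIFT);
    mock_hgatp_write(h, 0);             /* leave the register cleared */

    return bits;
}
```

Two mocked harts with different vmidlen values return different results, which is the case a boot-hart-only probe would miss.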
>> + unsigned int vmid_len = vmidlen_detect();
>> + struct vmid_data *data = &this_cpu(vmid_data);
>> +
>> + BUILD_BUG_ON((HGATP_VMID_MASK >> HGATP_VMID_SHIFT) >
>> + (BIT((sizeof(data->max_vmid) * BITS_PER_BYTE), UL) - 1));
>> +
>> + data->max_vmid = BIT(vmid_len, U) - 1;
>> + data->used = opt_vmid && (vmid_len > 1);
>> +
>> + if ( g_vmid_used < 0 )
>> + {
>> + g_vmid_used = data->used;
>> + printk("VMIDs use is %sabled\n", data->used ? "en" : "dis");
>> + }
>> + else if ( g_vmid_used != data->used )
>> + printk("CPU%u: VMIDs use is %sabled\n", smp_processor_id(),
>> + data->used ? "en" : "dis");
>> +
>> + /* Zero indicates 'invalid generation', so we start the count at one. */
>> + data->generation = 1;
>> +
>> + /* Zero indicates 'VMIDs use disabled', so we start the count at one. */
>> + data->next_vmid = 1;
>> +}
>> +
>> +void vmid_flush_vcpu(struct vcpu *v)
>> +{
>> + write_atomic(&v->arch.vmid.generation, 0);
>> +}
>> +
>> +void vmid_flush_hart(void)
>> +{
>> + struct vmid_data *data = &this_cpu(vmid_data);
>> +
>> + if ( !data->used )
>> + return;
>> +
>> + if ( likely(++data->generation != 0) )
>> + return;
>> +
>> + /*
>> + * VMID generations are 64 bit. Overflow of generations never happens.
>> + * For safety, we simply disable ASIDs, so correctness is established; it
>> + * only runs a bit slower.
>> + */
>> + printk("%s: VMID generation overrun. Disabling VMIDs.\n", __func__);
> Is logging of the function name of any value here?
Agreed, there is no point in logging the function name.
> Also, despite the x86
> original having it like this - generally no full stops please in log
> messages. "VMID generation overrun; disabling VMIDs\n" would do.
Sure, I will drop it and will try not to add one in such cases. But could you
please remind me (in case I asked before) why a full stop shouldn't be
present in such cases?
>> +bool vmid_handle_vmenter(struct vcpu_vmid *vmid)
>> +{
>> + struct vmid_data *data = &this_cpu(vmid_data);
>> +
>> + /* Test if VCPU has valid VMID. */
> x86 has a ->disabled check up from here; why do you not check ->used?
The x86 comment confused me, at first I thought the check was related to
erratum #170, but now I see that it might actually be useful here, so I'll add:
if ( !data->used )
goto disabled;
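For illustration only, a standalone sketch of the allocator with that early check in place. It is plain C with the per-hart state passed explicitly instead of this_cpu(), without the atomic accessors, and with the generation bump inlined rather than calling vmid_flush_hart(); the _sketch name is hypothetical and this is not the next version of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vcpu_vmid {
    uint64_t generation;
    uint16_t vmid;
};

struct vmid_data {
    uint64_t generation;
    uint16_t next_vmid;
    uint16_t max_vmid;
    bool used;
};

/* Returns true when the caller must flush the guest TLB entries. */
static bool vmid_handle_vmenter_sketch(struct vmid_data *data,
                                       struct vcpu_vmid *vmid)
{
    /* Early exit when VMIDs are not in use on this hart. */
    if ( !data->used )
        goto disabled;

    /* The vCPU already holds a VMID from the current generation. */
    if ( vmid->generation == data->generation )
        return false;

    /* Out of VMIDs: start a new generation and restart numbering at 1. */
    if ( data->next_vmid > data->max_vmid )
    {
        if ( ++data->generation == 0 )
        {
            /* Generation wrapped: stop using VMIDs altogether. */
            data->used = false;
            goto disabled;
        }
        data->next_vmid = 1;
    }

    /* VMID 0 is reserved for the "disabled" case. */
    assert(data->next_vmid != 0);
    vmid->vmid = data->next_vmid++;
    vmid->generation = data->generation;

    /* Handing out VMID 1 marks a new generation: old entries are stale. */
    return vmid->vmid == 1;

 disabled:
    vmid->vmid = 0;
    return false;
}
```

With max_vmid = 3, the fourth distinct vCPU triggers a new generation and gets VMID 1 together with a flush request, while a vCPU still holding a VMID from the current generation returns early without touching the allocator.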
Thanks.
~ Oleksii
* Re: [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement
2025-11-14 9:27 ` Oleksii Kurochko
@ 2025-11-17 8:35 ` Jan Beulich
0 siblings, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-17 8:35 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 14.11.2025 10:27, Oleksii Kurochko wrote:
> On 11/6/25 3:05 PM, Jan Beulich wrote:
>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>> +void vmid_flush_hart(void)
>>> +{
>>> + struct vmid_data *data = &this_cpu(vmid_data);
>>> +
>>> + if ( !data->used )
>>> + return;
>>> +
>>> + if ( likely(++data->generation != 0) )
>>> + return;
>>> +
>>> + /*
>>> + * VMID generations are 64 bit. Overflow of generations never happens.
>>> + * For safety, we simply disable ASIDs, so correctness is established; it
>>> + * only runs a bit slower.
>>> + */
>>> + printk("%s: VMID generation overrun. Disabling VMIDs.\n", __func__);
>> Is logging of the function name of any value here?
>
> Agreed, there is no point in logging the function name.
>
>> Also, despite the x86
>> original having it like this - generally no full stops please in log
>> messages. "VMID generation overrun; disabling VMIDs\n" would do.
>
> Sure, I will drop it and will try not to add one in such cases. But could you
> please remind me (in case I asked before) why a full stop shouldn't be
> present in such cases?
First: Consistency across the code base. Second: Meaningless characters
needlessly consume serial line bandwidth.
Jan
* [for 4.22 v5 03/18] xen/riscv: introduce things necessary for p2m initialization
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 01/18] xen/riscv: detect and initialize G-stage mode Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 02/18] xen/riscv: introduce VMID allocation and manegement Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 04/18] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
` (14 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce the following things:
- Update p2m_domain structure, which describes per p2m-table state, with:
- lock to protect updates to p2m.
- pool with pages used to construct p2m.
- back pointer to domain structure.
- p2m_init() to initialize members introduced in p2m_domain structure.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V4:
- Move an introduction of clean_pte member of p2m_domain structure to the
patch where it is started to be used:
xen/riscv: add root page table allocation
- Add prototype of p2m_init() to asm/p2m.h.
---
Changes in V3:
- s/p2m_type/p2m_types.
- Drop init. of p2m->clean_pte in p2m_init() as CONFIG_HAS_PASSTHROUGH is
going to be selected unconditionally. Plus CONFIG_HAS_PASSTHROUGH isn't
ready to be used for RISC-V.
Add compilation error to not forget to init p2m->clean_pte.
- Move definition of p2m->domain up in p2m_init().
- Add iommu_use_hap_pt() when p2m->clean_pte is initialized.
- Add the comment above p2m_types member of p2m_domain struct.
- Add need_flush member to p2m_domain structure.
- Move introduction of p2m_write_(un)lock() and p2m_tlb_flush_sync()
to the patch where they are really used:
xen/riscv: implement guest_physmap_add_entry() for mapping GFNs to MFN
- Add p2m member to arch_domain structure.
- Drop p2m_types from struct p2m_domain as P2M type for PTE will be stored
differently.
- Drop default_access as it isn't going to be used for now.
- Move definition of p2m_is_write_locked() to "implement function to map memory
in guest p2m" where it is really used.
---
Changes in V2:
- Use the earlier introduced sbi_remote_hfence_gvma_vmid() for proper implementation
of p2m_force_tlb_flush_sync() as TLB flushing needs to happen for each pCPU
which potentially has cached a mapping, which is tracked by d->dirty_cpumask.
- Drop unnecessary blanks.
- Fix code style for # of pre-processor directive.
- Drop max_mapped_gfn and lowest_mapped_gfn as they aren't used now.
- [p2m_init()] Set p2m->clean_pte=false if CONFIG_HAS_PASSTHROUGH=n.
- [p2m_init()] Update the comment above p2m->domain = d;
- Drop p2m->need_flush as it seems to be always true for RISC-V and as a
consequence drop p2m_tlb_flush_sync().
- Move to separate patch an introduction of root page table allocation.
---
xen/arch/riscv/include/asm/domain.h | 5 +++++
xen/arch/riscv/include/asm/p2m.h | 33 +++++++++++++++++++++++++++++
xen/arch/riscv/p2m.c | 20 +++++++++++++++++
3 files changed, 58 insertions(+)
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index aac1040658..e688980efa 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -5,6 +5,8 @@
#include <xen/xmalloc.h>
#include <public/hvm/params.h>
+#include <asm/p2m.h>
+
struct vcpu_vmid {
uint64_t generation;
uint16_t vmid;
@@ -24,6 +26,9 @@ struct arch_vcpu {
struct arch_domain {
struct hvm_domain hvm;
+
+ /* Virtual MMU */
+ struct p2m_domain p2m;
};
#include <xen/sched.h>
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 3a5066f360..a129ed8392 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -3,6 +3,9 @@
#define ASM__RISCV__P2M_H
#include <xen/errno.h>
+#include <xen/mm.h>
+#include <xen/rwlock.h>
+#include <xen/types.h>
#include <asm/page-bits.h>
@@ -10,6 +13,34 @@ extern unsigned char gstage_mode;
#define paddr_bits PADDR_BITS
+/* Get host p2m table */
+#define p2m_get_hostp2m(d) (&(d)->arch.p2m)
+
+/* Per-p2m-table state */
+struct p2m_domain {
+ /*
+ * Lock that protects updates to the p2m.
+ */
+ rwlock_t lock;
+
+ /* Pages used to construct the p2m */
+ struct page_list_head pages;
+
+ /* Back pointer to domain */
+ struct domain *domain;
+
+ /*
+ * P2M updates may require TLBs to be flushed (invalidated).
+ *
+ * Flushes may be deferred by setting 'need_flush' and then flushing
+ * when the p2m write lock is released.
+ *
+ * If an immediate flush is required (e.g., if a super page is
+ * shattered), call p2m_tlb_flush_sync().
+ */
+ bool need_flush;
+};
+
/*
* List of possible type for each page in the p2m entry.
* The number of available bit per page in the pte for this purpose is 2 bits.
@@ -92,6 +123,8 @@ static inline bool arch_acquire_resource_check(struct domain *d)
void pre_gstage_init(void);
+int p2m_init(struct domain *d);
+
#endif /* ASM__RISCV__P2M_H */
/*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index d8027a270f..1b5fc7ffff 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -3,6 +3,10 @@
#include <xen/init.h>
#include <xen/lib.h>
#include <xen/macros.h>
+#include <xen/mm.h>
+#include <xen/paging.h>
+#include <xen/rwlock.h>
+#include <xen/sched.h>
#include <xen/sections.h>
#include <asm/csr.h>
@@ -97,3 +101,19 @@ void __init pre_gstage_init(void)
vmid_init();
}
+
+int p2m_init(struct domain *d)
+{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+ /*
+ * "Trivial" initialisation is now complete. Set the backpointer so the
+ * users of p2m could get an access to domain structure.
+ */
+ p2m->domain = d;
+
+ rwlock_init(&p2m->lock);
+ INIT_PAGE_LIST_HEAD(&p2m->pages);
+
+ return 0;
+}
--
2.51.0
* [for 4.22 v5 04/18] xen/riscv: construct the P2M pages pool for guests
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (2 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 03/18] xen/riscv: introduce things necessary for p2m initialization Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 05/18] xen/riscv: add root page table allocation Oleksii Kurochko
` (13 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement p2m_set_allocation() to construct the p2m pages pool for guests
based on the required number of pages.
This is implemented by:
- Adding a `struct paging_domain` to `struct arch_domain`. It contains a
freelist, a counter variable and a spinlock, which track the free p2m
pages and the total number of pages in the p2m pages pool.
- Adding a helper `p2m_set_allocation` to set the p2m pages pool
size. This helper should be called before allocating memory for
a guest and is called from domain_p2m_set_allocation(), the latter
is a part of common dom0less code.
- Adding implementation of paging_freelist_adjust() and
paging_domain_init().
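As a rough illustration of the pool sizing described above, here is a minimal
user-space sketch. It is not the Xen implementation: struct page, struct pool
and pool_adjust() are hypothetical stand-ins, malloc()/free() replace the
domheap allocator, a simple LIFO list replaces the page list, and locking and
preemption checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page tracked by the pool. */
struct page {
    struct page *next;
};

/* Hypothetical stand-in for struct paging_domain (no lock in this model). */
struct pool {
    struct page *freelist;      /* free pages currently in the pool */
    unsigned long total_pages;  /* number of pages owned by the pool */
};

/* Grow or shrink the pool one page at a time until it holds 'pages'. */
static int pool_adjust(struct pool *p, unsigned long pages)
{
    assert(p != NULL);

    while ( p->total_pages != pages )
    {
        if ( p->total_pages < pages )
        {
            /* Pool too small: take one more page from the allocator. */
            struct page *pg = malloc(sizeof(*pg));

            if ( !pg )
                return -ENOMEM;

            pg->next = p->freelist;
            p->freelist = pg;
            p->total_pages++;
        }
        else
        {
            /* Pool too large: only pages on the freelist can be returned. */
            struct page *pg = p->freelist;

            if ( !pg )
                return -ENOMEM;

            p->freelist = pg->next;
            p->total_pages--;
            free(pg);
        }
    }

    return 0;
}
```

Calling pool_adjust() with a larger target grows the pool, and with a smaller
target shrinks it; shrinking fails once the freelist is empty, which mirrors
the "freelist is empty" error path in the patch.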
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Nothing changed. Only rebase.
---
Changes in V4:
- s/paging_freelist_init/paging_freelist_adjust.
- Add empty line between definition of paging_freelist_adjust()
and paging_domain_init().
- Update commit message.
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in v3:
- Drop usage of p2m_ prefix inside struct paging_domain().
- Introduce paging_domain_init() to init paging struct.
---
Changes in v2:
- Drop the comment above inclusion of <xen/event.h> in riscv/p2m.c.
- Use ACCESS_ONCE() for lhs and rhs for the expressions in
p2m_set_allocation().
---
xen/arch/riscv/Makefile | 1 +
xen/arch/riscv/include/asm/Makefile | 1 -
xen/arch/riscv/include/asm/domain.h | 12 ++++++
xen/arch/riscv/include/asm/paging.h | 13 ++++++
xen/arch/riscv/p2m.c | 18 ++++++++
xen/arch/riscv/paging.c | 65 +++++++++++++++++++++++++++++
6 files changed, 109 insertions(+), 1 deletion(-)
create mode 100644 xen/arch/riscv/include/asm/paging.h
create mode 100644 xen/arch/riscv/paging.c
diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index e2499210c8..6b912465b9 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -6,6 +6,7 @@ obj-y += imsic.o
obj-y += intc.o
obj-y += irq.o
obj-y += mm.o
+obj-y += paging.o
obj-y += pt.o
obj-y += p2m.o
obj-$(CONFIG_RISCV_64) += riscv64/
diff --git a/xen/arch/riscv/include/asm/Makefile b/xen/arch/riscv/include/asm/Makefile
index bfdf186c68..3824f31c39 100644
--- a/xen/arch/riscv/include/asm/Makefile
+++ b/xen/arch/riscv/include/asm/Makefile
@@ -6,7 +6,6 @@ generic-y += hardirq.h
generic-y += hypercall.h
generic-y += iocap.h
generic-y += irq-dt.h
-generic-y += paging.h
generic-y += percpu.h
generic-y += perfc_defn.h
generic-y += random.h
diff --git a/xen/arch/riscv/include/asm/domain.h b/xen/arch/riscv/include/asm/domain.h
index e688980efa..316e7c6c84 100644
--- a/xen/arch/riscv/include/asm/domain.h
+++ b/xen/arch/riscv/include/asm/domain.h
@@ -2,6 +2,8 @@
#ifndef ASM__RISCV__DOMAIN_H
#define ASM__RISCV__DOMAIN_H
+#include <xen/mm.h>
+#include <xen/spinlock.h>
#include <xen/xmalloc.h>
#include <public/hvm/params.h>
@@ -24,11 +26,21 @@ struct arch_vcpu {
struct vcpu_vmid vmid;
};
+struct paging_domain {
+ spinlock_t lock;
+ /* Free pages from the pre-allocated pool */
+ struct page_list_head freelist;
+ /* Number of pages from the pre-allocated pool */
+ unsigned long total_pages;
+};
+
struct arch_domain {
struct hvm_domain hvm;
/* Virtual MMU */
struct p2m_domain p2m;
+
+ struct paging_domain paging;
};
#include <xen/sched.h>
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
new file mode 100644
index 0000000000..98d8b06d45
--- /dev/null
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -0,0 +1,13 @@
+#ifndef ASM_RISCV_PAGING_H
+#define ASM_RISCV_PAGING_H
+
+#include <asm-generic/paging.h>
+
+struct domain;
+
+int paging_domain_init(struct domain *d);
+
+int paging_freelist_adjust(struct domain *d, unsigned long pages,
+ bool *preempted);
+
+#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 1b5fc7ffff..d670e7612a 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -11,6 +11,7 @@
#include <asm/csr.h>
#include <asm/flushtlb.h>
+#include <asm/paging.h>
#include <asm/riscv_encoding.h>
#include <asm/vmid.h>
@@ -112,8 +113,25 @@ int p2m_init(struct domain *d)
*/
p2m->domain = d;
+ paging_domain_init(d);
+
rwlock_init(&p2m->lock);
INIT_PAGE_LIST_HEAD(&p2m->pages);
return 0;
}
+
+/*
+ * Set the pool of pages to the required number of pages.
+ * Returns 0 for success, non-zero for failure.
+ * Call with d->arch.paging.lock held.
+ */
+int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
+{
+ int rc;
+
+ if ( (rc = paging_freelist_adjust(d, pages, preempted)) )
+ return rc;
+
+ return 0;
+}
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
new file mode 100644
index 0000000000..2df8de033b
--- /dev/null
+++ b/xen/arch/riscv/paging.c
@@ -0,0 +1,65 @@
+#include <xen/event.h>
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/sched.h>
+#include <xen/spinlock.h>
+
+int paging_freelist_adjust(struct domain *d, unsigned long pages,
+ bool *preempted)
+{
+ struct page_info *pg;
+
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ for ( ; ; )
+ {
+ if ( d->arch.paging.total_pages < pages )
+ {
+ /* Need to allocate more memory from domheap */
+ pg = alloc_domheap_page(d, MEMF_no_owner);
+ if ( pg == NULL )
+ {
+ printk(XENLOG_ERR "Failed to allocate pages.\n");
+ return -ENOMEM;
+ }
+ ACCESS_ONCE(d->arch.paging.total_pages)++;
+ page_list_add_tail(pg, &d->arch.paging.freelist);
+ }
+ else if ( d->arch.paging.total_pages > pages )
+ {
+ /* Need to return memory to domheap */
+ pg = page_list_remove_head(&d->arch.paging.freelist);
+ if ( pg )
+ {
+ ACCESS_ONCE(d->arch.paging.total_pages)--;
+ free_domheap_page(pg);
+ }
+ else
+ {
+ printk(XENLOG_ERR
+ "Failed to free pages, freelist is empty.\n");
+ return -ENOMEM;
+ }
+ }
+ else
+ break;
+
+ /* Check to see if we need to yield and try again */
+ if ( preempted && general_preempt_check() )
+ {
+ *preempted = true;
+ return -ERESTART;
+ }
+ }
+
+ return 0;
+}
+
+/* Domain paging struct initialization. */
+int paging_domain_init(struct domain *d)
+{
+ spin_lock_init(&d->arch.paging.lock);
+ INIT_PAGE_LIST_HEAD(&d->arch.paging.freelist);
+
+ return 0;
+}
--
2.51.0
* [for 4.22 v5 05/18] xen/riscv: add root page table allocation
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (3 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 04/18] xen/riscv: construct the P2M pages pool for guests Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-06 14:25 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 06/18] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
` (12 subsequent siblings)
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce support for allocating and initializing the root page table
required for RISC-V stage-2 address translation.
To implement root page table allocation the following is introduced:
- clear_and_clean_page() and p2m_alloc_root_table() helpers to allocate
and zero a 16 KiB root page table, as mandated by the RISC-V privileged
specification for Sv32x4/Sv39x4/Sv48x4/Sv57x4 modes.
- P2M_ROOT_ORDER and P2M_ROOT_PAGES to describe the size of the root page
table.
- maddr_to_page() and page_to_maddr() macros for easier address
manipulation.
- paging_ret_to_domheap() to return some pages to the domheap before
allocating the 16 KiB root page table, and paging_refill_from_domheap()
to take them back if that allocation fails.
- Allocation of the root p2m table after the p2m pool is initialized.
- construct_hgatp() to construct the hgatp register value based on
p2m->root, gstage_mode and VMID.
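To make the root table sizing and the hgatp composition concrete, here is a
small stand-alone sketch. It assumes the RV64 hgatp layout (MODE in bits
63:60, VMID in bits 57:44, PPN in bits 43:0) and the Sv39x4 mode encoding 8.
The mask values, ilog2_u64() and make_hgatp() are written out here for
illustration only and are not taken from the Xen headers; MASK_INSR() follows
the usual "multiply by the lowest set bit of the mask" idiom.

```c
#include <assert.h>
#include <stdint.h>

/* Insert value v into the field described by mask m. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* Assumed RV64 hgatp layout: MODE[63:60], VMID[57:44], PPN[43:0]. */
#define HGATP_MODE_MASK   0xF000000000000000ULL
#define HGATP_VMID_MASK   0x03FFF00000000000ULL
#define HGATP_PPN_MASK    0x00000FFFFFFFFFFFULL

#define HGATP_MODE_SV39X4 8ULL

#define PAGE_SHIFT                  12
#define GSTAGE_ROOT_PAGE_TABLE_SIZE (16 * 1024)

/* Integer log2 for a non-zero value (illustrative helper). */
static unsigned int ilog2_u64(uint64_t v)
{
    unsigned int r = 0;

    while ( v >>= 1 )
        r++;

    return r;
}

/* A 16 KiB root table built from 4 KiB pages: order 2, i.e. 4 pages. */
#define P2M_ROOT_ORDER (ilog2_u64(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
#define P2M_ROOT_PAGES (1U << P2M_ROOT_ORDER)

/* Compose an hgatp value from the root table address, mode and VMID. */
static uint64_t make_hgatp(uint64_t root_maddr, uint64_t mode, uint16_t vmid)
{
    /* The root page table must be aligned to its own 16 KiB size. */
    assert((root_maddr & (GSTAGE_ROOT_PAGE_TABLE_SIZE - 1)) == 0);

    return MASK_INSR(root_maddr >> PAGE_SHIFT, HGATP_PPN_MASK) |
           MASK_INSR(mode, HGATP_MODE_MASK) |
           MASK_INSR((uint64_t)vmid, HGATP_VMID_MASK);
}
```

Since the VMID is passed in rather than stored in the p2m, the same root
table can be combined with whatever VMID is current on a given pCPU.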
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Update proto of construct_hgatp(): make first argument pointer-to-const.
- Code style fixes.
- s/paging_ret_pages_to_freelist/paging_refill_from_domheap.
- s/paging_ret_pages_to_domheap/paging_ret_to_domheap.
- s/paging_ret_page_to_freelist/paging_add_page_to_freelist.
- Drop ACCESS_ONCE() as all the cases where it is used are used under spinlock() hence ACCESS_ONCE() is redundant.
---
Changes in V4:
- Drop hgatp_mode from p2m_domain as gstage_mode was introduced and
initialized in an earlier patch. So use gstage_mode instead.
- s/GUEST_ROOT_PAGE_TABLE_SIZE/GSTAGE_ROOT_PAGE_TABLE_SIZE.
- Drop p2m_root_order and re-define P2M_ROOT_ORDER:
#define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
- Update implementation of construct_hgatp(): use introduced gstage_mode
and use MASK_INSR() to construct ppn value.
- Drop nr_root_pages variable inside p2m_alloc_root_table().
- Update the printk's message inside paging_ret_pages_to_domheap().
- Add an introduction of clean_pte member of p2m_domain structure to this
patch as it is started to be used here.
Rename clean_pte to clean_dcache.
- Drop p2m_allocate_root() function as it is going to be used only in one
place.
- Propagate rc from p2m_alloc_root_table() in p2m_set_allocation().
- Return P2M_ROOT_PAGES to the freelist if allocation of the root page
table fails.
- Add allocated root tables pages to p2m->pages pool so a usage of pages
could be properly taken into account.
---
Changes in v3:
- Drop inserting of p2m->vmid in hgatp_from_page() as now vmid is allocated
per-CPU, not per-domain, so it will be inserted later somewhere in
context_switch or before returning control to a guest.
- Use BIT() to init nr_pages in p2m_allocate_root() instead of open-coding
the BIT() macro.
- Fix order in clear_and_clean_page().
- s/panic("Specify more xen,domain-p2m-mem-mb\n")/return NULL.
- Use lock around a procedure of returning back pages necessary for p2m
root table.
- Update the comment about allocation of page for root page table.
- Update an argument of hgatp_from_page() to "struct page_info *p2m_root_page"
to be consistent with the function name.
- Use p2m_get_hostp2m(d) instead of open-coding it.
- Update the comment above the call of p2m_alloc_root_table().
- Update the comments in p2m_allocate_root().
- Move part which returns some page to domheap before root page table allocation
to paging.c.
- Pass p2m_domain * instead of struct domain * for p2m_alloc_root_table().
- Introduce construct_hgatp() instead of hgatp_from_page().
- Add vmid and hgatp_mode member of struct p2m_domain.
- Add explanatory comment above clean_dcache_va_range() in
clear_and_clean_page().
- Introduce P2M_ROOT_ORDER and P2M_ROOT_PAGES.
- Drop vmid member from p2m_domain as now we are using per-pCPU
VMID allocation.
- Update a declaration of construct_hgatp() to receive VMID as it
isn't per-VM anymore.
- Drop hgatp member of p2m_domain struct as with the new VMID scheme
allocation construction of hgatp will be needed more often.
- Drop is_hardware_domain() case in p2m_allocate_root(), just always
allocate root using p2m pool pages.
- Refactor p2m_alloc_root_table() and p2m_alloc_table().
---
Changes in v2:
- This patch was created from "xen/riscv: introduce things necessary for p2m
initialization" with the following changes:
- [clear_and_clean_page()] Add missed call of clean_dcache_va_range().
- Drop p2m_get_clean_page() as it is going to be used only once to allocate
root page table. Open-code it explicitly in p2m_allocate_root(). Also,
it will help avoid duplication of the code connected to order and nr_pages
of p2m root page table.
- Instead of using order 2 for alloc_domheap_pages(), use
get_order_from_bytes(KB(16)).
- Clear and clean a proper amount of allocated pages in p2m_allocate_root().
- Drop _info from the function name hgatp_from_page_info() and its argument
page_info.
- Introduce HGATP_MODE_MASK and use MASK_INSR() instead of shift to calculate
value of hgatp.
- Drop unnecessary parentheses in definition of page_to_maddr().
- Add support of VMID.
- Drop TLB flushing in p2m_alloc_root_table() and do that once when VMID
is re-used. [Look at p2m_alloc_vmid()]
- Allocate p2m root table after p2m pool is fully initialized: first
return pages to p2m pool then allocate p2m root table.
---
xen/arch/riscv/include/asm/mm.h | 4 +
xen/arch/riscv/include/asm/p2m.h | 15 +++
xen/arch/riscv/include/asm/paging.h | 3 +
xen/arch/riscv/include/asm/riscv_encoding.h | 2 +
xen/arch/riscv/p2m.c | 90 +++++++++++++++-
xen/arch/riscv/paging.c | 110 +++++++++++++++-----
6 files changed, 195 insertions(+), 29 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 9283616c02..dd8cdc9782 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -167,6 +167,10 @@ extern struct page_info *frametable_virt_start;
#define mfn_to_page(mfn) (frametable_virt_start + mfn_x(mfn))
#define page_to_mfn(pg) _mfn((pg) - frametable_virt_start)
+/* Convert between machine addresses and page-info structures. */
+#define maddr_to_page(ma) mfn_to_page(maddr_to_mfn(ma))
+#define page_to_maddr(pg) mfn_to_maddr(page_to_mfn(pg))
+
static inline void *page_to_virt(const struct page_info *pg)
{
return mfn_to_virt(mfn_x(page_to_mfn(pg)));
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index a129ed8392..85e67516c4 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -2,6 +2,7 @@
#ifndef ASM__RISCV__P2M_H
#define ASM__RISCV__P2M_H
+#include <xen/bitops.h>
#include <xen/errno.h>
#include <xen/mm.h>
#include <xen/rwlock.h>
@@ -11,6 +12,9 @@
extern unsigned char gstage_mode;
+#define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
+#define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
+
#define paddr_bits PADDR_BITS
/* Get host p2m table */
@@ -26,6 +30,9 @@ struct p2m_domain {
/* Pages used to construct the p2m */
struct page_list_head pages;
+ /* The root of the p2m tree. May be concatenated */
+ struct page_info *root;
+
/* Back pointer to domain */
struct domain *domain;
@@ -39,6 +46,12 @@ struct p2m_domain {
* shattered), call p2m_tlb_flush_sync().
*/
bool need_flush;
+
+ /*
+ * Indicate if it is required to clean the cache when writing an entry or
+ * when a page is needed to be fully cleared and cleaned.
+ */
+ bool clean_dcache;
};
/*
@@ -125,6 +138,8 @@ void pre_gstage_init(void);
int p2m_init(struct domain *d);
+unsigned long construct_hgatp(const struct p2m_domain *p2m, uint16_t vmid);
+
#endif /* ASM__RISCV__P2M_H */
/*
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index 98d8b06d45..01be45528f 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -10,4 +10,7 @@ int paging_domain_init(struct domain *d);
int paging_freelist_adjust(struct domain *d, unsigned long pages,
bool *preempted);
+int paging_ret_to_domheap(struct domain *d, unsigned int nr_pages);
+int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/include/asm/riscv_encoding.h b/xen/arch/riscv/include/asm/riscv_encoding.h
index b15f5ad0b4..8890b903e1 100644
--- a/xen/arch/riscv/include/asm/riscv_encoding.h
+++ b/xen/arch/riscv/include/asm/riscv_encoding.h
@@ -188,6 +188,8 @@
#define HGATP_MODE_MASK HGATP32_MODE_MASK
#endif
+#define GSTAGE_ROOT_PAGE_TABLE_SIZE KB(16)
+
#define TOPI_IID_SHIFT 16
#define TOPI_IID_MASK 0xfff
#define TOPI_IPRIO_MASK 0xff
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index d670e7612a..c9ffad393f 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -3,6 +3,7 @@
#include <xen/init.h>
#include <xen/lib.h>
#include <xen/macros.h>
+#include <xen/domain_page.h>
#include <xen/mm.h>
#include <xen/paging.h>
#include <xen/rwlock.h>
@@ -103,6 +104,70 @@ void __init pre_gstage_init(void)
vmid_init();
}
+static void clear_and_clean_page(struct page_info *page, bool clean_dcache)
+{
+ clear_domain_page(page_to_mfn(page));
+
+ /*
+ * If the IOMMU doesn't support coherent walks and the p2m tables are
+ * shared between the CPU and IOMMU, it is necessary to clean the
+ * d-cache.
+ */
+ if ( clean_dcache )
+ clean_dcache_va_range(page, PAGE_SIZE);
+}
+
+unsigned long construct_hgatp(const struct p2m_domain *p2m, uint16_t vmid)
+{
+ return MASK_INSR(mfn_x(page_to_mfn(p2m->root)), HGATP_PPN) |
+ MASK_INSR(gstage_mode, HGATP_MODE_MASK) |
+ MASK_INSR(vmid, HGATP_VMID_MASK);
+}
+
+static int p2m_alloc_root_table(struct p2m_domain *p2m)
+{
+ struct domain *d = p2m->domain;
+ struct page_info *page;
+ int rc;
+
+ /*
+ * Return back P2M_ROOT_PAGES to assure the root table memory is also
+ * accounted against the P2M pool of the domain.
+ */
+ if ( (rc = paging_ret_to_domheap(d, P2M_ROOT_PAGES)) )
+ return rc;
+
+ /*
+ * As mentioned in the Privileged Architecture Spec (version 20240411)
+ * in Section 18.5.1, for the paged virtual-memory schemes (Sv32x4,
+ * Sv39x4, Sv48x4, and Sv57x4), the root page table is 16 KiB and must
+ * be aligned to a 16-KiB boundary.
+ */
+ page = alloc_domheap_pages(d, P2M_ROOT_ORDER, MEMF_no_owner);
+ if ( !page )
+ {
+ /*
+ * If allocation of root table pages fails, the pages acquired above
+ * must be returned to the freelist to maintain proper freelist
+ * balance.
+ */
+ paging_refill_from_domheap(d, P2M_ROOT_PAGES);
+
+ return -ENOMEM;
+ }
+
+ for ( unsigned int i = 0; i < P2M_ROOT_PAGES; i++ )
+ {
+ clear_and_clean_page(page + i, p2m->clean_dcache);
+
+ page_list_add(page + i, &p2m->pages);
+ }
+
+ p2m->root = page;
+
+ return 0;
+}
+
int p2m_init(struct domain *d)
{
struct p2m_domain *p2m = p2m_get_hostp2m(d);
@@ -118,6 +183,19 @@ int p2m_init(struct domain *d)
rwlock_init(&p2m->lock);
INIT_PAGE_LIST_HEAD(&p2m->pages);
+ /*
+ * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+ * is not ready for RISC-V support.
+ *
+ * When CONFIG_HAS_PASSTHROUGH=y, p2m->clean_dcache must be properly
+ * initialized.
+ * At the moment, it defaults to false because the p2m structure is
+ * zero-initialized.
+ */
+#ifdef CONFIG_HAS_PASSTHROUGH
+# error "Add init of p2m->clean_dcache"
+#endif
+
return 0;
}
@@ -128,10 +206,20 @@ int p2m_init(struct domain *d)
*/
int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
int rc;
if ( (rc = paging_freelist_adjust(d, pages, preempted)) )
return rc;
- return 0;
+ /*
+ * First, initialize p2m pool. Then allocate the root
+ * table so that the necessary pages can be returned from the p2m pool,
+ * since the root table must be allocated using alloc_domheap_pages(...)
+ * to meet its specific requirements.
+ */
+ if ( !p2m->root )
+ rc = p2m_alloc_root_table(p2m);
+
+ return rc;
}
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index 2df8de033b..c87e9b7f7f 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -4,46 +4,67 @@
#include <xen/sched.h>
#include <xen/spinlock.h>
+static int paging_ret_page_to_domheap(struct domain *d)
+{
+ struct page_info *page;
+
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ /* Return memory to domheap. */
+ page = page_list_remove_head(&d->arch.paging.freelist);
+ if( page )
+ {
+ d->arch.paging.total_pages--;
+ free_domheap_page(page);
+ }
+ else
+ {
+ printk(XENLOG_ERR
+ "Failed to free P2M pages, P2M freelist is empty.\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int paging_add_page_to_freelist(struct domain *d)
+{
+ struct page_info *page;
+
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ /* Need to allocate more memory from domheap */
+ page = alloc_domheap_page(d, MEMF_no_owner);
+ if ( page == NULL )
+ {
+ printk(XENLOG_ERR "Failed to allocate pages.\n");
+ return -ENOMEM;
+ }
+ d->arch.paging.total_pages++;
+ page_list_add_tail(page, &d->arch.paging.freelist);
+
+ return 0;
+}
+
int paging_freelist_adjust(struct domain *d, unsigned long pages,
bool *preempted)
{
- struct page_info *pg;
-
ASSERT(spin_is_locked(&d->arch.paging.lock));
for ( ; ; )
{
+ int rc = 0;
+
if ( d->arch.paging.total_pages < pages )
- {
- /* Need to allocate more memory from domheap */
- pg = alloc_domheap_page(d, MEMF_no_owner);
- if ( pg == NULL )
- {
- printk(XENLOG_ERR "Failed to allocate pages.\n");
- return -ENOMEM;
- }
- ACCESS_ONCE(d->arch.paging.total_pages)++;
- page_list_add_tail(pg, &d->arch.paging.freelist);
- }
+ rc = paging_add_page_to_freelist(d);
else if ( d->arch.paging.total_pages > pages )
- {
- /* Need to return memory to domheap */
- pg = page_list_remove_head(&d->arch.paging.freelist);
- if ( pg )
- {
- ACCESS_ONCE(d->arch.paging.total_pages)--;
- free_domheap_page(pg);
- }
- else
- {
- printk(XENLOG_ERR
- "Failed to free pages, freelist is empty.\n");
- return -ENOMEM;
- }
- }
+ rc = paging_ret_page_to_domheap(d);
else
break;
+ if ( rc )
+ return rc;
+
/* Check to see if we need to yield and try again */
if ( preempted && general_preempt_check() )
{
@@ -55,6 +76,39 @@ int paging_freelist_adjust(struct domain *d, unsigned long pages,
return 0;
}
+int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages)
+{
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ for ( unsigned int i = 0; i < nr_pages; i++ )
+ {
+ int rc = paging_add_page_to_freelist(d);
+
+ if ( rc )
+ return rc;
+ }
+
+ return 0;
+}
+
+int paging_ret_to_domheap(struct domain *d, unsigned int nr_pages)
+{
+ ASSERT(spin_is_locked(&d->arch.paging.lock));
+
+ if ( d->arch.paging.total_pages < nr_pages )
+ return false;
+
+ for ( unsigned int i = 0; i < nr_pages; i++ )
+ {
+ int rc = paging_ret_page_to_domheap(d);
+
+ if ( rc )
+ return rc;
+ }
+
+ return 0;
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
--
2.51.0
* Re: [for 4.22 v5 05/18] xen/riscv: add root page table allocation
2025-10-20 15:57 ` [for 4.22 v5 05/18] xen/riscv: add root page table allocation Oleksii Kurochko
@ 2025-11-06 14:25 ` Jan Beulich
2025-11-14 10:53 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-06 14:25 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -3,6 +3,7 @@
> #include <xen/init.h>
> #include <xen/lib.h>
> #include <xen/macros.h>
> +#include <xen/domain_page.h>
> #include <xen/mm.h>
> #include <xen/paging.h>
> #include <xen/rwlock.h>
> @@ -103,6 +104,70 @@ void __init pre_gstage_init(void)
> vmid_init();
> }
>
> +static void clear_and_clean_page(struct page_info *page, bool clean_dcache)
> +{
> + clear_domain_page(page_to_mfn(page));
> +
> + /*
> + * If the IOMMU doesn't support coherent walks and the p2m tables are
> + * shared between the CPU and IOMMU, it is necessary to clean the
> + * d-cache.
> + */
> + if ( clean_dcache )
> + clean_dcache_va_range(page, PAGE_SIZE);
This cleans part of frame_table[], but not the memory page in question.
> --- a/xen/arch/riscv/paging.c
> +++ b/xen/arch/riscv/paging.c
> @@ -4,46 +4,67 @@
> #include <xen/sched.h>
> #include <xen/spinlock.h>
>
> +static int paging_ret_page_to_domheap(struct domain *d)
> +{
> + struct page_info *page;
> +
> + ASSERT(spin_is_locked(&d->arch.paging.lock));
> +
> + /* Return memory to domheap. */
> + page = page_list_remove_head(&d->arch.paging.freelist);
> + if( page )
> + {
> + d->arch.paging.total_pages--;
> + free_domheap_page(page);
> + }
> + else
> + {
> + printk(XENLOG_ERR
> + "Failed to free P2M pages, P2M freelist is empty.\n");
Nit: See earlier remark regarding full stops in log messages. The double
"P2M" also looks unnecessary to me.
> +static int paging_add_page_to_freelist(struct domain *d)
> +{
> + struct page_info *page;
> +
> + ASSERT(spin_is_locked(&d->arch.paging.lock));
> +
> + /* Need to allocate more memory from domheap */
> + page = alloc_domheap_page(d, MEMF_no_owner);
> + if ( page == NULL )
> + {
> + printk(XENLOG_ERR "Failed to allocate pages.\n");
Again. (Also log messages typically wouldn't start with a capital letter,
unless of course it's e.g. an acronym.)
> @@ -55,6 +76,39 @@ int paging_freelist_adjust(struct domain *d, unsigned long pages,
> return 0;
> }
>
> +int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages)
> +{
> + ASSERT(spin_is_locked(&d->arch.paging.lock));
> +
> + for ( unsigned int i = 0; i < nr_pages; i++ )
> + {
> + int rc = paging_add_page_to_freelist(d);
The anomaly is more pronounced here, with the other function name in context:
paging_refill_from_domheap() doesn't suggest there's a page (or several) being
handed to it. paging_add_page_to_freelist() suggests one of its parameters
would want to be struct page_info *. Within the naming model you chose, maybe
paging_refill_from_domheap_one() or paging_refill_one_from_domheap()? Or
simply _paging_refill_from_domheap()?
> + if ( rc )
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +int paging_ret_to_domheap(struct domain *d, unsigned int nr_pages)
> +{
> + ASSERT(spin_is_locked(&d->arch.paging.lock));
> +
> + if ( d->arch.paging.total_pages < nr_pages )
> + return false;
> +
> + for ( unsigned int i = 0; i < nr_pages; i++ )
> + {
> + int rc = paging_ret_page_to_domheap(d);
Somewhat similarly here. Maybe simply insert "one" in the name?
Jan
* Re: [for 4.22 v5 05/18] xen/riscv: add root page table allocation
2025-11-06 14:25 ` Jan Beulich
@ 2025-11-14 10:53 ` Oleksii Kurochko
2025-11-17 8:43 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-14 10:53 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/6/25 3:25 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -3,6 +3,7 @@
>> #include <xen/init.h>
>> #include <xen/lib.h>
>> #include <xen/macros.h>
>> +#include <xen/domain_page.h>
>> #include <xen/mm.h>
>> #include <xen/paging.h>
>> #include <xen/rwlock.h>
>> @@ -103,6 +104,70 @@ void __init pre_gstage_init(void)
>> vmid_init();
>> }
>>
>> +static void clear_and_clean_page(struct page_info *page, bool clean_dcache)
>> +{
>> + clear_domain_page(page_to_mfn(page));
>> +
>> + /*
>> + * If the IOMMU doesn't support coherent walks and the p2m tables are
>> + * shared between the CPU and IOMMU, it is necessary to clean the
>> + * d-cache.
>> + */
>> + if ( clean_dcache )
>> + clean_dcache_va_range(page, PAGE_SIZE);
> This cleans part of frame_table[], but not the memory page in question.
Oh, right, we need to map the domain page first.
Would it make sense to avoid using clear_domain_page() in order to prevent
calling map_domain_page() twice (once inside clear_domain_page() and once
before clean_dcache_va_range()), and instead do it like this:
    void *p = __map_domain_page(page);
    clear_page(p);
    /*
     * If the IOMMU doesn't support coherent walks and the p2m tables are
     * shared between the CPU and IOMMU, it is necessary to clean the
     * d-cache.
     */
    if ( clean_dcache )
        clean_dcache_va_range(p, PAGE_SIZE);
    unmap_domain_page(p);
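The distinction behind the review remark can be shown with a tiny user-space
model. Everything here is hypothetical: frame_table[] and memory[] are plain
arrays, and map_page() merely stands in for mapping a page to get at its
contents. The point is only that a struct page_info pointer addresses the
bookkeeping entry, while the mapped pointer addresses the page contents, so a
range operation has to be given the latter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define NR_PAGES  4

/* Per-page metadata, analogous in spirit to an entry of a frame table. */
struct page_info {
    unsigned long count_info;
};

static struct page_info frame_table[NR_PAGES];  /* metadata array */
static uint8_t memory[NR_PAGES][PAGE_SIZE];     /* the pages themselves */

/* Hypothetical mapping helper: metadata pointer -> page contents. */
static void *map_page(const struct page_info *pg)
{
    return memory[pg - frame_table];
}

/* True if p points into the metadata array. */
static bool in_frame_table(const void *p)
{
    uintptr_t a = (uintptr_t)p;

    return a >= (uintptr_t)frame_table &&
           a < (uintptr_t)(frame_table + NR_PAGES);
}

/* True if p points into the page contents. */
static bool in_memory(const void *p)
{
    uintptr_t a = (uintptr_t)p;

    return a >= (uintptr_t)memory &&
           a < (uintptr_t)memory + sizeof(memory);
}
```

In this model, passing &frame_table[i] to a cache-maintenance routine would
touch metadata only; passing map_page(&frame_table[i]) reaches the page that
was actually cleared.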
>> @@ -55,6 +76,39 @@ int paging_freelist_adjust(struct domain *d, unsigned long pages,
>> return 0;
>> }
>>
>> +int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages)
>> +{
>> + ASSERT(spin_is_locked(&d->arch.paging.lock));
>> +
>> + for ( unsigned int i = 0; i < nr_pages; i++ )
>> + {
>> + int rc = paging_add_page_to_freelist(d);
> The anomaly is more pronounced here, with the other function name in context:
> paging_refill_from_domheap() doesn't suggest there's a page (or several) being
> handed to it. paging_add_page_to_freelist() suggests one of its parameters
> would want to be struct page_info *. Within the naming model you chose, maybe
> paging_refill_from_domheap_one() or paging_refill_one_from_domheap()? Or
> simply _paging_refill_from_domheap()?
Thanks for the suggestions. I like the option with "_*" as it more clearly marks it
as an internal helper without introducing a "_one" suffix. I will use the same approach
for paging_ret_page_to_domheap(): s/paging_ret_page_to_domheap/_paging_ret_to_domheap/.
Shouldn't we use "__*" instead of "_*", or is "__*" reserved for something else? "__*" is
used quite frequently in the Xen code base.
~ Oleksii
* Re: [for 4.22 v5 05/18] xen/riscv: add root page table allocation
2025-11-14 10:53 ` Oleksii Kurochko
@ 2025-11-17 8:43 ` Jan Beulich
0 siblings, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-17 8:43 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 14.11.2025 11:53, Oleksii Kurochko wrote:
> On 11/6/25 3:25 PM, Jan Beulich wrote:
>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -3,6 +3,7 @@
>>> #include <xen/init.h>
>>> #include <xen/lib.h>
>>> #include <xen/macros.h>
>>> +#include <xen/domain_page.h>
>>> #include <xen/mm.h>
>>> #include <xen/paging.h>
>>> #include <xen/rwlock.h>
>>> @@ -103,6 +104,70 @@ void __init pre_gstage_init(void)
>>> vmid_init();
>>> }
>>>
>>> +static void clear_and_clean_page(struct page_info *page, bool clean_dcache)
>>> +{
>>> + clear_domain_page(page_to_mfn(page));
>>> +
>>> + /*
>>> + * If the IOMMU doesn't support coherent walks and the p2m tables are
>>> + * shared between the CPU and IOMMU, it is necessary to clean the
>>> + * d-cache.
>>> + */
>>> + if ( clean_dcache )
>>> + clean_dcache_va_range(page, PAGE_SIZE);
>> This cleans part of frame_table[], but not the memory page in question.
>
> Oh, right, we need to map the domain page first.
>
> Would it make sense to avoid using clear_domain_page() in order to prevent
> calling map_domain_page() twice (once inside clear_domain_page() and once
> before clean_dcache_va_range()), and instead do it like this:
> void *p = __map_domain_page(page);
>
> clear_page(p);
>
> /*
> * If the IOMMU doesn't support coherent walks and the p2m tables are
> * shared between the CPU and IOMMU, it is necessary to clean the
> * d-cache.
> */
> if ( clean_dcache )
> clean_dcache_va_range(p, PAGE_SIZE);
>
> unmap_domain_page(p);
Certainly.
>>> @@ -55,6 +76,39 @@ int paging_freelist_adjust(struct domain *d, unsigned long pages,
>>> return 0;
>>> }
>>>
>>> +int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages)
>>> +{
>>> + ASSERT(spin_is_locked(&d->arch.paging.lock));
>>> +
>>> + for ( unsigned int i = 0; i < nr_pages; i++ )
>>> + {
>>> + int rc = paging_add_page_to_freelist(d);
>> The anomaly is more pronounced here, with the other function name in context:
>> paging_refill_from_domheap() doesn't suggest there's a page (or several) being
>> handed to it. paging_add_page_to_freelist() suggests one of its parameters
>> would want to be struct page_info *. Within the naming model you chose, maybe
>> paging_refill_from_domheap_one() or paging_refill_one_from_domheap()? Or
>> simply _paging_refill_from_domheap()?
>
> Thanks for the suggestions. I like the option with "_*" as it more clearly marks it
> as an internal helper without introducing a "_one" suffix. I will use the same approach
> for paging_ret_page_to_domheap(): s/paging_ret_page_to_domheap/_paging_ret_to_domheap().
>
> Shouldn't we use "__*" instead of "_*", or is "__*" reserved for something else? "__*" is
> used quite frequently in the Xen code base.
And wrongly so. "__*" are reserved to the implementation (i.e. compiler / library).
Whereas "_*" (with the letter following the _ not being an upper-case one) is
dedicated to file scope identifiers. (That's mandated by the library part of the
spec, but imo we're well-advised to follow that, because even if we don't link to
any libraries, the compiler using certain symbols [e.g. __builtin_*()] is still
[potentially] getting in our way.)
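To illustrate the naming convention described above, here is a minimal sketch.
It is not the actual Xen code: struct freelist, its fields and both function
names are made up for the example. The file-scope helper gets a single leading
underscore followed by a lower-case letter, while the externally visible
function has no leading underscore at all.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-domain paging freelist. */
struct freelist {
    unsigned int nr_free;
    unsigned int capacity;
};

/*
 * File-scope helper: single leading underscore followed by a lower-case
 * letter, with internal linkage via 'static'.
 * Adds one page; returns 0 on success, -1 if the list is already full.
 */
static int _refill_from_domheap(struct freelist *fl)
{
    if ( fl->nr_free >= fl->capacity )
        return -1;

    fl->nr_free++;

    return 0;
}

/* Externally visible function: no leading underscore. */
int refill_from_domheap(struct freelist *fl, unsigned int nr_pages)
{
    assert(fl);

    for ( unsigned int i = 0; i < nr_pages; i++ )
    {
        int rc = _refill_from_domheap(fl);

        if ( rc )
            return rc;
    }

    return 0;
}
```

A name starting with two underscores would instead fall into the set that is
reserved to the implementation, which is what the sketch avoids.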
Jan
* [for 4.22 v5 06/18] xen/riscv: introduce pte_{set,get}_mfn()
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (4 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 05/18] xen/riscv: add root page table allocation Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 07/18] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
` (11 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce helpers pte_{set,get}_mfn() to simplify setting and getting
of mfn.
Also, introduce PTE_PPN_MASK and add BUILD_BUG_ON() to be sure that
PTE_PPN_MASK remains the same for all MMU modes except Sv32.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V4-V5:
- Nothing changed. Only Rebase.
---
Changes in V3:
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V2:
- Patch "[PATCH v1 4/6] xen/riscv: define pt_t and pt_walk_t structures" was
renamed to "xen/riscv: introduce pte_{set,get}_mfn()" because, after dropping the
bitfields for the PTE structure, this patch introduces only pte_{set,get}_mfn().
- As pt_t and pt_walk_t were dropped, update the implementation of
pte_{set,get}_mfn() to use bit operations and shifts instead of bitfields.
- Introduce PTE_PPN_MASK to be able to use MASK_INSR for setting/getting PPN.
- Add BUILD_BUG_ON(RV_STAGE1_MODE > SATP_MODE_SV57) to be sure that when
a new MMU mode is added, someone checks that PPN is still bits 53:10.
---
xen/arch/riscv/include/asm/page.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index ddcc4da0a3..66cb192316 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -112,6 +112,30 @@ typedef struct {
#endif
} pte_t;
+#if RV_STAGE1_MODE != SATP_MODE_SV32
+#define PTE_PPN_MASK _UL(0x3FFFFFFFFFFC00)
+#else
+#define PTE_PPN_MASK _U(0xFFFFFC00)
+#endif
+
+static inline void pte_set_mfn(pte_t *p, mfn_t mfn)
+{
+ /*
+ * At the moment spec provides Sv32 - Sv57.
+ * If one day new MMU mode will be added it will be needed
+ * to check that PPN mask still continue to cover bits 53:10.
+ */
+ BUILD_BUG_ON(RV_STAGE1_MODE > SATP_MODE_SV57);
+
+ p->pte &= ~PTE_PPN_MASK;
+ p->pte |= MASK_INSR(mfn_x(mfn), PTE_PPN_MASK);
+}
+
+static inline mfn_t pte_get_mfn(pte_t p)
+{
+ return _mfn(MASK_EXTR(p.pte, PTE_PPN_MASK));
+}
+
static inline bool pte_is_valid(pte_t p)
{
return p.pte & PTE_VALID;
--
2.51.0
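For readers unfamiliar with the mask helpers, the following is a minimal,
self-contained sketch of how pte_set_mfn() and pte_get_mfn() behave for the
non-Sv32 case. MASK_EXTR() and MASK_INSR() are re-defined locally for the
sketch and are assumed to match what Xen provides; mfn_t is replaced by a
plain uint64_t, and PTE_VALID is assumed to be bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Local re-definitions for this sketch (assumed to mirror Xen's macros). */
#define MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* PPN occupies bits 53:10 of an RV64 PTE. */
#define PTE_PPN_MASK UINT64_C(0x3FFFFFFFFFFC00)
#define PTE_VALID    UINT64_C(0x1)

typedef struct {
    uint64_t pte;
} pte_t;

/* Replace the PPN field while leaving all other PTE bits untouched. */
static void pte_set_mfn(pte_t *p, uint64_t mfn)
{
    /* The frame number must fit into the 44-bit PPN field. */
    assert(mfn <= (PTE_PPN_MASK >> 10));

    p->pte &= ~PTE_PPN_MASK;
    p->pte |= MASK_INSR(mfn, PTE_PPN_MASK);
}

/* Extract the PPN field and shift it down to bit 0. */
static uint64_t pte_get_mfn(pte_t p)
{
    return MASK_EXTR(p.pte, PTE_PPN_MASK);
}
```

The expression (m) & -(m) isolates the lowest set bit of the mask (0x400
here), so multiplying or dividing by it is equivalent to shifting by 10.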
* [for 4.22 v5 07/18] xen/riscv: add new p2m types and helper macros for type classification
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (5 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 06/18] xen/riscv: introduce pte_{set,get}_mfn() Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 08/18] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings Oleksii Kurochko
` (10 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
- Extend p2m_type_t with additional types: p2m_mmio_direct_io,
p2m_ext_storage.
- Add a macro to classify memory types: P2M_RAM_TYPES.
- Introduce helper predicates: p2m_is_ram(), p2m_is_any_ram().
- Introduce arch_dt_passthrough_p2m_type() to tell handle_passthrough_prop()
from common code how to map device memory.
- Introduce p2m_first_external so that relational comparisons can detect
p2m types that are stored outside the P2M's PTE bits.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Drop underscores for arguments of p2m_is_ram() and p2m_is_any_ram().
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V4:
- Drop the underscore in p2m_to_mask()'s argument and in other similar helpers.
- Introduce arch_dt_passthrough_p2m_type() instead of p2m_mmio_direct.
- Drop grant-table-related stuff for the moment, as it isn't going to be used in the near future.
---
Changes in V3:
- Drop p2m_ram_ro.
- Rename p2m_mmio_direct_dev to p2m_mmio_direct_io to make it more RISC-V specific.
- s/p2m_mmio_direct_dev/p2m_mmio_direct_io.
---
Changes in V2:
- Drop stuff connected to foreign mapping as it isn't necessary for RISC-V
right now.
---
xen/arch/riscv/include/asm/p2m.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 85e67516c4..46ee0b93f2 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -64,8 +64,29 @@ struct p2m_domain {
typedef enum {
p2m_invalid = 0, /* Nothing mapped here */
p2m_ram_rw, /* Normal read/write domain RAM */
+ p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
+ PTE_PBMT_IO will be used for such mappings */
+ p2m_ext_storage, /* Following types'll be stored outside PTE bits: */
+
+ /* Sentinel — not a real type, just a marker for comparison */
+ p2m_first_external = p2m_ext_storage,
} p2m_type_t;
+static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
+{
+ return p2m_mmio_direct_io;
+}
+
+/* We use bitmaps and mask to handle groups of types */
+#define p2m_to_mask(t) BIT(t, UL)
+
+/* RAM types, which map to real machine frames */
+#define P2M_RAM_TYPES (p2m_to_mask(p2m_ram_rw))
+
+/* Useful predicates */
+#define p2m_is_ram(t) (p2m_to_mask(t) & P2M_RAM_TYPES)
+#define p2m_is_any_ram(t) (p2m_to_mask(t) & P2M_RAM_TYPES)
+
#include <xen/p2m-common.h>
static inline int get_page_and_type(struct page_info *page,
--
2.51.0
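The mask-based classification introduced by this patch can be tried out in
isolation. The sketch below copies the enum and predicates from the diff
(with BIT(t, UL) spelled out as a shift) and adds one hypothetical helper,
p2m_type_is_external(), to show the kind of relational comparison that
p2m_first_external is meant to enable. That helper is an assumption about the
intended use and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    p2m_invalid = 0,     /* Nothing mapped here */
    p2m_ram_rw,          /* Normal read/write domain RAM */
    p2m_mmio_direct_io,  /* Read/write mapping of a device MMIO area */
    p2m_ext_storage,     /* Types from here on live outside the PTE bits */

    /* Sentinel: not a real type, just a marker for comparison */
    p2m_first_external = p2m_ext_storage,
} p2m_type_t;

/* One bit per type, so that groups of types can be tested with one AND. */
#define p2m_to_mask(t) (1UL << (t))

/* RAM types, which map to real machine frames */
#define P2M_RAM_TYPES  (p2m_to_mask(p2m_ram_rw))

#define p2m_is_ram(t)  (p2m_to_mask(t) & P2M_RAM_TYPES)

/* Hypothetical helper: is the type stored outside the PTE bits? */
static bool p2m_type_is_external(p2m_type_t t)
{
    return t >= p2m_first_external;
}
```

Adding a further RAM type later would only require OR-ing its mask into
P2M_RAM_TYPES; the predicates themselves stay unchanged.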
* [for 4.22 v5 08/18] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (6 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 07/18] xen/riscv: add new p2m types and helper macros for type classification Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 09/18] xen/riscv: implement function to map memory in guest p2m Oleksii Kurochko
` (9 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Jan Beulich
Introduce arch_dt_passthrough_p2m_type() and use it instead of
`p2m_mmio_direct_dev` to avoid leaking Arm-specific naming into
common Xen code, such as dom0less passthrough property handling.
This helps reduce platform-specific terminology in shared logic and
improves clarity for future non-Arm ports (e.g. RISC-V or PowerPC).
No functional changes — the definition is preserved via a static inline
function for Arm.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Nothing changed. Only rebase.
---
Changes in V4:
- Introduce arch_dt_passthrough_p2m_type() instead of re-defining of
p2m_mmio_direct.
---
Changes in V3:
- New patch.
---
xen/arch/arm/include/asm/p2m.h | 5 +++++
xen/common/device-tree/dom0less-build.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/xen/arch/arm/include/asm/p2m.h b/xen/arch/arm/include/asm/p2m.h
index ef98bc5f4d..010ce8c9eb 100644
--- a/xen/arch/arm/include/asm/p2m.h
+++ b/xen/arch/arm/include/asm/p2m.h
@@ -137,6 +137,11 @@ typedef enum {
p2m_max_real_type, /* Types after this won't be store in the p2m */
} p2m_type_t;
+static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
+{
+ return p2m_mmio_direct_dev;
+}
+
/* We use bitmaps and mask to handle groups of types */
#define p2m_to_mask(_t) (1UL << (_t))
diff --git a/xen/common/device-tree/dom0less-build.c b/xen/common/device-tree/dom0less-build.c
index 9fd004c42a..8214a6639f 100644
--- a/xen/common/device-tree/dom0less-build.c
+++ b/xen/common/device-tree/dom0less-build.c
@@ -185,7 +185,7 @@ static int __init handle_passthrough_prop(struct kernel_info *kinfo,
gaddr_to_gfn(gstart),
PFN_DOWN(size),
maddr_to_mfn(mstart),
- p2m_mmio_direct_dev);
+ arch_dt_passthrough_p2m_type());
if ( res < 0 )
{
printk(XENLOG_ERR
--
2.51.0
* [for 4.22 v5 09/18] xen/riscv: implement function to map memory in guest p2m
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (7 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 08/18] xen/dom0less: abstract Arm-specific p2m type name for device MMIO mappings Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range() Oleksii Kurochko
` (8 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement map_regions_p2mt() to map a region in the guest p2m with
a specific p2m type. The memory attributes will be derived from the
p2m type. This function is used in dom0less common
code.
To implement it, introduce:
- p2m_write_(un)lock() to ensure safe concurrent updates to the P2M.
As part of this change, introduce p2m_tlb_flush_sync() and
p2m_tlb_flush().
- A stub for p2m_set_range() to map a range of GFNs to MFNs.
- p2m_is_write_locked().
Drop guest_physmap_add_entry() and call map_regions_p2mt() directly
from guest_physmap_add_page(), making guest_physmap_add_entry()
unnecessary.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Put "p2m->need_flush = false;" before TLB flush.
- Correct the comment above p2m_write_unlock().
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V4:
- Update the comment above declaration of map_regions_p2mt():
s/guest p2m/guest's hostp2m.
- Add const for p2m_force_tlb_flush_sync()'s local variable `d`.
- Drop the stray 'w' in the comment inside p2m_write_unlock().
- Drop p2m_insert_mapping() and leave only map_regions_p2mt(), as it
merely re-used insert_mapping().
- Rename p2m_force_tlb_flush_sync() to p2m_tlb_flush().
- Update prototype of p2m_is_write_locked() to return bool instead of
int.
---
Changes in v3:
- Introduce p2m_write_lock() and p2m_is_write_locked().
- Introduce p2m_force_tlb_flush_sync() and p2m_flush_tlb() to flush TLBs
after p2m table update.
- Change an argument of p2m_insert_mapping() from struct domain *d to
p2m_domain *p2m.
- Drop guest_physmap_add_entry() and use map_regions_p2mt() to define
guest_physmap_add_page().
- Add declaration of map_regions_p2mt() to asm/p2m.h.
- Rewrite commit message and subject.
- Drop p2m_access_t related stuff.
- Add definition of p2m_is_write_locked().
---
Changes in v2:
- These changes were part of "xen/riscv: implement p2m mapping functionality".
No additional significant changes were made.
---
xen/arch/riscv/include/asm/p2m.h | 31 ++++++++++++-----
xen/arch/riscv/p2m.c | 60 ++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+), 9 deletions(-)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 46ee0b93f2..4fafb26e1e 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -122,21 +122,22 @@ static inline int guest_physmap_mark_populate_on_demand(struct domain *d,
return -EOPNOTSUPP;
}
-static inline int guest_physmap_add_entry(struct domain *d,
- gfn_t gfn, mfn_t mfn,
- unsigned long page_order,
- p2m_type_t t)
-{
- BUG_ON("unimplemented");
- return -EINVAL;
-}
+/*
+ * Map a region in the guest's hostp2m with a specific p2m type.
+ * The memory attributes will be derived from the p2m type.
+ */
+int map_regions_p2mt(struct domain *d,
+ gfn_t gfn,
+ unsigned long nr,
+ mfn_t mfn,
+ p2m_type_t p2mt);
/* Untyped version for RAM only, for compatibility */
static inline int __must_check
guest_physmap_add_page(struct domain *d, gfn_t gfn, mfn_t mfn,
unsigned int page_order)
{
- return guest_physmap_add_entry(d, gfn, mfn, page_order, p2m_ram_rw);
+ return map_regions_p2mt(d, gfn, BIT(page_order, UL), mfn, p2m_ram_rw);
}
static inline mfn_t gfn_to_mfn(struct domain *d, gfn_t gfn)
@@ -159,6 +160,18 @@ void pre_gstage_init(void);
int p2m_init(struct domain *d);
+static inline void p2m_write_lock(struct p2m_domain *p2m)
+{
+ write_lock(&p2m->lock);
+}
+
+void p2m_write_unlock(struct p2m_domain *p2m);
+
+static inline bool p2m_is_write_locked(struct p2m_domain *p2m)
+{
+ return rw_is_write_locked(&p2m->lock);
+}
+
unsigned long construct_hgatp(const struct p2m_domain *p2m, uint16_t vmid);
#endif /* ASM__RISCV__P2M_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index c9ffad393f..e571257022 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -104,6 +104,41 @@ void __init pre_gstage_init(void)
vmid_init();
}
+/*
+ * Force a synchronous P2M TLB flush.
+ *
+ * Must be called with the p2m lock held.
+ */
+static void p2m_tlb_flush(struct p2m_domain *p2m)
+{
+ const struct domain *d = p2m->domain;
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ p2m->need_flush = false;
+
+ sbi_remote_hfence_gvma(d->dirty_cpumask, 0, 0);
+}
+
+void p2m_tlb_flush_sync(struct p2m_domain *p2m)
+{
+ if ( p2m->need_flush )
+ p2m_tlb_flush(p2m);
+}
+
+/* Unlock the P2M and do a P2M TLB flush if necessary */
+void p2m_write_unlock(struct p2m_domain *p2m)
+{
+ /*
+ * The final flush is done with the P2M write lock taken to avoid
+ * someone else modifying the P2M before the TLB invalidation has
+ * completed.
+ */
+ p2m_tlb_flush_sync(p2m);
+
+ write_unlock(&p2m->lock);
+}
+
static void clear_and_clean_page(struct page_info *page, bool clean_dcache)
{
clear_domain_page(page_to_mfn(page));
@@ -223,3 +258,28 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
return rc;
}
+
+static int p2m_set_range(struct p2m_domain *p2m,
+ gfn_t sgfn,
+ unsigned long nr,
+ mfn_t smfn,
+ p2m_type_t t)
+{
+ return -EOPNOTSUPP;
+}
+
+int map_regions_p2mt(struct domain *d,
+ gfn_t gfn,
+ unsigned long nr,
+ mfn_t mfn,
+ p2m_type_t p2mt)
+{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
+ int rc;
+
+ p2m_write_lock(p2m);
+ rc = p2m_set_range(p2m, gfn, nr, mfn, p2mt);
+ p2m_write_unlock(p2m);
+
+ return rc;
+}
--
2.51.0
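The locking and flushing pattern of this patch (set need_flush while holding
the write lock, flush at most once on unlock, and clear the flag before the
flush) can be modelled without any Xen infrastructure. The sketch below is
such a model only: struct p2m_model, the model_*() functions and the flush
counter are all hypothetical, a plain bool stands in for the rwlock, and a
counter increment stands in for sbi_remote_hfence_gvma().

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct p2m_domain; not the Xen definition. */
struct p2m_model {
    bool write_locked;     /* models the rwlock being held for writing */
    bool need_flush;       /* set whenever an entry has been changed */
    unsigned int flushes;  /* counts modelled TLB flushes */
};

static void model_write_lock(struct p2m_model *p2m)
{
    assert(!p2m->write_locked);
    p2m->write_locked = true;
}

static void model_tlb_flush(struct p2m_model *p2m)
{
    assert(p2m->write_locked);

    /* Clear the flag before flushing, as the patch does. */
    p2m->need_flush = false;
    p2m->flushes++;
}

static void model_write_unlock(struct p2m_model *p2m)
{
    /* Flush while the lock is still held, then release it. */
    if ( p2m->need_flush )
        model_tlb_flush(p2m);

    p2m->write_locked = false;
}

/* Hypothetical update path: any change marks the P2M as needing a flush. */
static void model_set_range(struct p2m_model *p2m, unsigned long nr)
{
    assert(p2m->write_locked);

    if ( nr )
        p2m->need_flush = true;
}
```

Several updates made under one lock acquisition therefore cost a single
flush, and an unlock without any prior update costs none.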
* [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (8 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 09/18] xen/riscv: implement function to map memory in guest p2m Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-10 14:53 ` Jan Beulich
2025-11-10 14:53 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
` (7 subsequent siblings)
17 siblings, 2 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
This patch introduces p2m_set_range() and its core helper p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.
The main changes are:
- Simplification of the Break-Before-Make (BBM) approach, as according to the
RISC-V spec:
It is permitted for multiple address-translation cache entries to co-exist
for the same address. This represents the fact that in a conventional
TLB hierarchy, it is possible for multiple entries to match a single
address if, for example, a page is upgraded to a superpage without first
clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
hierarchy. In this case, just as if an SFENCE.VMA is not executed between
a write to the memory-management tables and subsequent implicit read of the
same address: it is unpredictable whether the old non-leaf PTE or the new
leaf PTE is used, but the behavior is otherwise well defined.
In contrast to the Arm architecture, where BBM is mandatory and failing to
use it in some cases can lead to CPU instability, RISC-V guarantees
stability, and the behavior remains safe — though unpredictable in terms of
which translation will be used.
- Unlike Arm, the valid bit is not repurposed for other uses in this
implementation. Instead, entry validity is determined based solely on P2M
PTE's valid bit.
The main functionality is in p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).
p2m_set_range() breaks a region down into block-aligned mappings and calls
p2m_set_entry() accordingly.
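As a rough illustration of that breakdown, the sketch below picks, for each
step, the largest block order to which both the GFN and the MFN are aligned
and which still fits in the remaining range, then advances by that block.
It is a simplified model under stated assumptions: only the 4KB, 2MB and 1GB
sizes are considered, frame numbers are plain unsigned long values, and the
names mapping_order() and count_blocks() are hypothetical rather than taken
from the patch.

```c
#include <assert.h>

#define PAGETABLE_ORDER 9   /* 512 entries per 4KB page table */

/*
 * Hypothetical helper: return the largest block order (18, 9 or 0, i.e.
 * 1GB, 2MB or 4KB) such that gfn and mfn are both aligned to it and at
 * least one whole block of that size remains to be mapped.
 */
static unsigned int mapping_order(unsigned long gfn, unsigned long mfn,
                                  unsigned long nr)
{
    for ( unsigned int lvl = 2; lvl > 0; lvl-- )
    {
        unsigned int order = lvl * PAGETABLE_ORDER;
        unsigned long mask = (1UL << order) - 1;

        if ( !((gfn | mfn) & mask) && nr >= (1UL << order) )
            return order;
    }

    return 0;
}

/* Count the block-aligned mappings needed to cover [gfn, gfn + nr). */
static unsigned int count_blocks(unsigned long gfn, unsigned long mfn,
                                 unsigned long nr)
{
    unsigned int blocks = 0;

    while ( nr )
    {
        unsigned long step = 1UL << mapping_order(gfn, mfn, nr);

        /* A real implementation would call p2m_set_entry() here. */
        gfn += step;
        mfn += step;
        nr -= step;
        blocks++;
    }

    return blocks;
}
```

For example, a 514-page range starting at frame 511 splits into one 4KB
page, one 2MB block and one more 4KB page.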
Stub implementations (to be completed later) include:
- p2m_free_subtree()
- p2m_next_level()
- p2m_pte_from_mfn()
Note: Support for shattering block entries is not implemented in this
patch and will be added separately.
Additionally, some straightforward helper functions are now implemented:
- p2m_write_pte()
- p2m_clean_pte()
- p2m_get_root_pointer()
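The root-table indexing that p2m_get_root_pointer() and calc_offset() deal
with can be sketched on its own. The code below only follows the worked
example given in the patch comments (entry 512 of the 2048-entry root table
is index 0 of the second 4KB page); it assumes 8-byte PTEs, hence 512 entries
per 4KB page, and uses its own hypothetical names (struct root_slot,
root_slot_for_gfn()) instead of the macros from the patch.

```c
#include <assert.h>

#define PAGETABLE_ORDER   9
#define ENTRIES_PER_PAGE  (1UL << PAGETABLE_ORDER)              /* 512 */
#define ROOT_EXTRA_BITS   2                                     /* "x4" */
#define ROOT_ENTRIES      (1UL << (PAGETABLE_ORDER + ROOT_EXTRA_BITS))
#define ROOT_PAGES        (ROOT_ENTRIES / ENTRIES_PER_PAGE)     /* 4 */

/* Which of the four root pages to use, and the entry within that page. */
struct root_slot {
    unsigned int page;
    unsigned int index;
};

/*
 * root_level is the level number of the root table (e.g. 2 for Sv39x4).
 * Returns 0 on success, -1 if the GFN lies beyond the root table.
 */
static int root_slot_for_gfn(unsigned long gfn, unsigned int root_level,
                             struct root_slot *slot)
{
    /* Index into the widened, 2048-entry root table. */
    unsigned long offset = gfn >> (root_level * PAGETABLE_ORDER);

    if ( offset >= ROOT_ENTRIES )
        return -1;

    slot->page = offset / ENTRIES_PER_PAGE;
    slot->index = offset % ENTRIES_PER_PAGE;

    return 0;
}
```

With root_level equal to 2, a GFN of 512 << 18 yields page 1, index 0, which
matches the example in the comments.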
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Update the comment above p2m_get_root_pointer().
- Fix the indentation of p2m_set_entry()'s arguments.
- Update the comment in p2m_set_entry() where the lookup happens.
- Drop part of the comment above p2m_set_entry() as it is not really
needed anymore.
- Introduce P2M_DECLARE_OFFSETS() to use it instead of
DECLARE_OFFSETS() as the latter could have an issue with P2M code.
- Update p2m_get_root_pointer() to work only with P2M root properties.
- Update the comment inside p2m_set_entry() for the case when
p2m_next_level() returns P2M_TABLE_MAP_{NONE,NOMEM}.
- Slightly simplify the condition for calling p2m_free_subtree() by
removing the case where removing && mfn(0) are checked explicitly.
---
Changes in V4:
- Introduce gstage_root_level and use it for the definition of P2M_ROOT_LEVEL.
- Introduce the P2M_LEVEL_ORDER() and P2M_PAGETABLE_ENTRIES() macros.
- Add the TODO comment in p2m_write_pte() about a possible performance
optimization.
- Use compound literal for `pte` variable inside p2m_clean_pte().
- Fix the comment above p2m_next_level().
- Update ASSERT() inside p2m_set_entry() and leave only a check of the
target, as p2m_mapping_order() ensures that page_order will be correctly
aligned.
- Update the comment above the declaration of `removing_mapping` in
p2m_set_entry().
- Drop stray blanks.
- Handle a possible overflow of the number of unmapped GFNs in case of
some failure in p2m_set_range().
- Handle the case when MFN is 0 and such an MFN is being removed in
p2m_set_entry().
- Fix p2m_get_root_pointer() to return the correct pointer to the root page table.
---
Changes in V3:
- Drop p2m_access_t connected stuff as it isn't going to be used, at least
now.
- Move definition of P2M_ROOT_ORDER and P2M_ROOT_PAGES to earlier patches.
- Update the comment above lowest_mapped_gfn declaration.
- Update the comment above p2m_get_root_pointer(): s/"...offset of the root
table"/"...offset into root table".
- s/p2m_remove_pte/p2m_clean_pte.
- Use plain 0 instead of 0x00 in p2m_clean_pte().
- s/p2m_entry_from_mfn/p2m_pte_from_mfn.
- s/GUEST_TABLE_*/P2M_TABLE_*.
- Update the comment above p2m_next_level(): "GFN entry" -> "the entry
corresponding to the GFN".
- s/__p2m_set_entry/_p2m_set_entry.
- drop "s" for sgfn and smfn prefixes of _p2m_set_entry()'s arguments
as this function work only with one GFN and one MFN.
- Return the correct return code when p2m_next_level() failed in _p2m_set_entry();
also drop "else" and just handle the case (rc != P2M_TABLE_NORMAL) separately.
- Code style fixes.
- Use unsigned int for "order" in p2m_set_entry().
- s/p2m_set_entry/p2m_free_subtree.
- Update ASSERT() in __p2m_set_entry() to check that page_order is properly
aligned.
- Return -EACCES instead of -ENOMEM in the case when the domain is dying and
someone called p2m_set_entry().
- s/p2m_set_entry/p2m_set_range.
- s/__p2m_set_entry/p2m_set_entry
- s/p2me_is_valid/p2m_is_valid()
- Return a number of successfully mapped GFNs in case if not all were mapped
in p2m_set_range().
- Use BIT(order, UL) instead of 1 << order.
- Drop IOMMU flushing code from p2m_set_entry().
- set p2m->need_flush=true when entry in p2m_set_entry() is changed.
- Introduce p2m_mapping_order() to support superpages.
- Drop p2m_is_valid() and use pte_is_valid() instead as there is no tricks
with copying of valid bit anymore.
- Update p2m_pte_from_mfn() prototype: drop p2m argument.
---
Changes in V2:
- New patch. It was part of the big patch "xen/riscv: implement p2m mapping
functionality", which was split into smaller ones.
- Update the way the p2m TLB is flushed:
- RISC-V doesn't require BBM, so there is no need to remove the PTE before
writing a new one; drop 'if /*pte_is_valid(orig_pte) */' and remove the PTE
only when removal has been requested.
- Drop p2m->need_flush |= !!pte_is_valid(orig_pte); for the case when a PTE
is being removed, as RISC-V could cache an invalid PTE and thereby
requires a flush each time, regardless of whether the PTE is valid
at the moment it is removed.
- Drop the check whether the PTE is valid when a PTE is modified: as mentioned
above, BBM isn't required, so TLB flushing can be deferred and there is
no need to do it before modifying the PTE.
- Drop p2m->need_flush as it seems like it will be always true.
- Drop foreign mapping things as it isn't necessary for RISC-V right now.
- s/p2m_is_valid/p2me_is_valid.
- Move definition and initialization of p2m->{max_mapped_gfn,lowest_mapped_gfn}
to this patch.
---
xen/arch/riscv/include/asm/p2m.h | 43 ++++
xen/arch/riscv/p2m.c | 331 ++++++++++++++++++++++++++++++-
2 files changed, 373 insertions(+), 1 deletion(-)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 4fafb26e1e..ce8bcb944f 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -8,12 +8,45 @@
#include <xen/rwlock.h>
#include <xen/types.h>
+#include <asm/page.h>
#include <asm/page-bits.h>
extern unsigned char gstage_mode;
+extern unsigned int gstage_root_level;
#define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
#define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL gstage_root_level
+
+/*
+ * According to the RISC-V spec:
+ * When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
+ * or Sv57x4, G-stage address translation is a variation on the usual
+ * page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
+ * Sv57, respectively. In each case, the size of the incoming address is
+ * widened by 2 bits (to 34, 41, 50, or 59 bits).
+ *
+ * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
+ * the index for this level of the P2M page table starts. The extra 2
+ * bits added by the "x4" schemes only affect the root page table width.
+ *
+ * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
+ * levels: the extra 2 bits do not change the indices of lower levels.
+ *
+ * The extra 2 bits are only relevant if one tried to address beyond the
+ * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
+ * invalid.
+ */
+#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
+
+#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
+
+#define P2M_PAGETABLE_ENTRIES(lvl) \
+ (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
+
+#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
+
+#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
#define paddr_bits PADDR_BITS
@@ -52,6 +85,16 @@ struct p2m_domain {
* when a page is needed to be fully cleared and cleaned.
*/
bool clean_dcache;
+
+ /* Highest guest frame that's ever been mapped in the p2m */
+ gfn_t max_mapped_gfn;
+
+ /*
+ * Lowest mapped gfn in the p2m. When releasing mapped gfn's in a
+ * preemptible manner this is updated to track where to resume
+ * the search. Apart from during teardown this can only decrease.
+ */
+ gfn_t lowest_mapped_gfn;
};
/*
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index e571257022..f13458712a 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,7 @@
#include <xen/rwlock.h>
#include <xen/sched.h>
#include <xen/sections.h>
+#include <xen/xvmalloc.h>
#include <asm/csr.h>
#include <asm/flushtlb.h>
@@ -17,6 +18,43 @@
#include <asm/vmid.h>
unsigned char __ro_after_init gstage_mode;
+unsigned int __ro_after_init gstage_root_level;
+
+/*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, P2M root page
+ * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ */
+#define ENTRIES_PER_ROOT_PAGE \
+ (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
+
+static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
+{
+ unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
+
+ /*
+ * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
+ * page table spans 4 consecutive 4KB pages.
+ * We want to return an index within one of these 4 pages.
+ * The specific page to use is determined by `p2m_get_root_pointer()`.
+ *
+ * Example: if `offset == 512`:
+ * - A single 4KB page holds 512 entries.
+ * - Therefore, entry 512 corresponds to index 0 of the second page.
+ *
+ * At all other levels, only one page is allocated, and `offset` is
+ * always in the range 0 to 511, since the VPN is 9 bits long.
+ */
+ return offset % ENTRIES_PER_ROOT_PAGE;
+}
+
+#define P2M_MAX_ROOT_LEVEL 4
+
+#define P2M_DECLARE_OFFSETS(var, addr) \
+ unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
+ for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
+ var[i] = calc_offset(i, addr);
static void __init gstage_mode_detect(void)
{
@@ -54,6 +92,14 @@ static void __init gstage_mode_detect(void)
if ( MASK_EXTR(csr_read(CSR_HGATP), HGATP_MODE_MASK) == mode )
{
gstage_mode = mode;
+ gstage_root_level = modes[mode_idx].paging_levels - 1;
+ /*
+ * The highest supported mode at the moment is Sv57, where L4
+ * is the root page table.
+ * If this changes in the future, P2M_MAX_ROOT_LEVEL must be
+ * updated accordingly.
+ */
+ ASSERT(gstage_root_level <= P2M_MAX_ROOT_LEVEL);
break;
}
}
@@ -218,6 +264,9 @@ int p2m_init(struct domain *d)
rwlock_init(&p2m->lock);
INIT_PAGE_LIST_HEAD(&p2m->pages);
+ p2m->max_mapped_gfn = _gfn(0);
+ p2m->lowest_mapped_gfn = _gfn(ULONG_MAX);
+
/*
* Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
* is not ready for RISC-V support.
@@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
return rc;
}
+/*
+ * Map one of the four root pages of the P2M root page table.
+ *
+ * The P2M root page table is larger than normal (16KB instead of 4KB),
+ * so it is allocated as four consecutive 4KB pages. This function selects
+ * the appropriate 4KB page based on the given GFN and returns a mapping
+ * to it.
+ *
+ * The caller is responsible for unmapping the page after use.
+ *
+ * Returns NULL if the calculated offset into the root table is invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+ unsigned long root_table_indx;
+
+ root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
+ if ( root_table_indx >= P2M_ROOT_PAGES )
+ return NULL;
+
+ /*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, p2m->root is
+ * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ *
+ * Initially, `root_table_indx` is derived directly from `va`.
+ * To locate the correct entry within a single 4KB page,
+ * we rescale the offset so it falls within one of the 4 pages.
+ *
+ * Example: if `root_table_indx == 512`
+ * - A 4KB page holds 512 entries.
+ * - Thus, entry 512 corresponds to index 0 of the second page.
+ */
+ root_table_indx /= ENTRIES_PER_ROOT_PAGE;
+
+ return __map_domain_page(p2m->root + root_table_indx);
+}
+
+static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
+{
+ write_pte(p, pte);
+
+ /*
+ * TODO: if multiple adjacent PTEs are written without releasing
+ * the lock, then the redundant cache flushing can be a
+ * performance issue.
+ */
+ if ( clean_pte )
+ clean_dcache_va_range(p, sizeof(*p));
+}
+
+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
+{
+ pte_t pte = { .pte = 0 };
+
+ p2m_write_pte(p, pte, clean_pte);
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+{
+ panic("%s: hasn't been implemented yet\n", __func__);
+
+ return (pte_t) { .pte = 0 };
+}
+
+#define P2M_TABLE_MAP_NONE 0
+#define P2M_TABLE_MAP_NOMEM 1
+#define P2M_TABLE_SUPER_PAGE 2
+#define P2M_TABLE_NORMAL 3
+
+/*
+ * Take the currently mapped table, find the entry corresponding to the GFN,
+ * and map the next-level table if available. The previous table will be
+ * unmapped if the next level was mapped (e.g., when P2M_TABLE_NORMAL is
+ * returned).
+ *
+ * `alloc_tbl` parameter indicates whether intermediate tables should
+ * be allocated when not present.
+ *
+ * Return values:
+ * P2M_TABLE_MAP_NONE: a table allocation isn't permitted.
+ * P2M_TABLE_MAP_NOMEM: allocating a new page failed.
+ * P2M_TABLE_SUPER_PAGE: The next entry points to a superpage.
+ * P2M_TABLE_NORMAL: next level or leaf mapped normally.
+ */
+static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
+ unsigned int level, pte_t **table,
+ unsigned int offset)
+{
+ panic("%s: hasn't been implemented yet\n", __func__);
+
+ return P2M_TABLE_MAP_NONE;
+}
+
+/* Free pte sub-tree behind an entry */
+static void p2m_free_subtree(struct p2m_domain *p2m,
+ pte_t entry, unsigned int level)
+{
+ panic("%s: hasn't been implemented yet\n", __func__);
+}
+
+/* Insert an entry in the p2m */
+static int p2m_set_entry(struct p2m_domain *p2m,
+ gfn_t gfn,
+ unsigned long page_order,
+ mfn_t mfn,
+ p2m_type_t t)
+{
+ unsigned int level;
+ unsigned int target = page_order / PAGETABLE_ORDER;
+ pte_t *entry, *table, orig_pte;
+ int rc;
+ /*
+ * A mapping is removed only if the MFN is explicitly set to INVALID_MFN.
+ * Other MFNs that are considered invalid by mfn_valid() (e.g., MMIO)
+ * are still allowed.
+ */
+ bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
+ P2M_DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
+
+ ASSERT(p2m_is_write_locked(p2m));
+
+ /*
+ * Check if the level target is valid: we only support
+ * 4K - 2M - 1G mapping.
+ */
+ ASSERT(target <= 2);
+
+ table = p2m_get_root_pointer(p2m, gfn);
+ if ( !table )
+ return -EINVAL;
+
+ for ( level = P2M_ROOT_LEVEL; level > target; level-- )
+ {
+ /*
+ * Don't try to allocate intermediate page table if the mapping
+ * is about to be removed.
+ */
+ rc = p2m_next_level(p2m, !removing_mapping,
+ level, &table, offsets[level]);
+ if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
+ {
+ rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
+ /*
+ * We are here because p2m_next_level has failed to map
+ * the intermediate page table (e.g the table does not exist
+ * and none should be allocated). It is a valid case
+ * when removing a mapping as it may not exist in the
+ * page table. In this case, just ignore lookup failure.
+ */
+ rc = removing_mapping ? 0 : rc;
+ goto out;
+ }
+
+ if ( rc != P2M_TABLE_NORMAL )
+ break;
+ }
+
+ entry = table + offsets[level];
+
+ /*
+ * If we are here with level > target, we must be at a leaf node,
+ * and we need to break up the superpage.
+ */
+ if ( level > target )
+ {
+ panic("Shattering isn't implemented\n");
+ }
+
+ /*
+ * We should always be there with the correct level because all the
+ * intermediate tables have been installed if necessary.
+ */
+ ASSERT(level == target);
+
+ orig_pte = *entry;
+
+ if ( removing_mapping )
+ p2m_clean_pte(entry, p2m->clean_dcache);
+ else
+ {
+ pte_t pte = p2m_pte_from_mfn(mfn, t);
+
+ p2m_write_pte(entry, pte, p2m->clean_dcache);
+
+ p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
+ gfn_add(gfn, BIT(page_order, UL) - 1));
+ p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
+ }
+
+ p2m->need_flush = true;
+
+ /*
+ * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
+ * is not ready for RISC-V support.
+ *
+ * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
+ * here.
+ */
+#ifdef CONFIG_HAS_PASSTHROUGH
+# error "add code to flush IOMMU TLB"
+#endif
+
+ rc = 0;
+
+ /*
+ * In case of a VALID -> INVALID transition, the original PTE should
+ * always be freed.
+ *
+ * In case of a VALID -> VALID transition, the original PTE should be
+ * freed only if the MFNs are different. If the MFNs are the same
+ * (i.e., only permissions differ), there is no need to free the
+ * original PTE.
+ */
+ if ( pte_is_valid(orig_pte) &&
+ (!pte_is_valid(*entry) ||
+ !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
+ p2m_free_subtree(p2m, orig_pte, level);
+
+ out:
+ unmap_domain_page(table);
+
+ return rc;
+}
+
+/* Return mapping order for given gfn, mfn and nr */
+static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
+{
+ unsigned long mask;
+ /* 1gb, 2mb, 4k mappings are supported */
+ unsigned int level = min(P2M_ROOT_LEVEL, _AC(2, U));
+ unsigned long order = 0;
+
+ mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
+ mask |= gfn_x(gfn);
+
+ for ( ; level != 0; level-- )
+ {
+ if ( !(mask & (BIT(P2M_LEVEL_ORDER(level), UL) - 1)) &&
+ (nr >= BIT(P2M_LEVEL_ORDER(level), UL)) )
+ {
+ order = P2M_LEVEL_ORDER(level);
+ break;
+ }
+ }
+
+ return order;
+}
+
static int p2m_set_range(struct p2m_domain *p2m,
gfn_t sgfn,
unsigned long nr,
mfn_t smfn,
p2m_type_t t)
{
- return -EOPNOTSUPP;
+ int rc = 0;
+ unsigned long left = nr;
+
+ /*
+ * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+ * be dropped in relinquish_p2m_mapping(). As the P2M will still
+ * be accessible after, we need to prevent mapping to be added when the
+ * domain is dying.
+ */
+ if ( unlikely(p2m->domain->is_dying) )
+ return -EACCES;
+
+ while ( left )
+ {
+ unsigned long order = p2m_mapping_order(sgfn, smfn, left);
+
+ rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
+ if ( rc )
+ break;
+
+ sgfn = gfn_add(sgfn, BIT(order, UL));
+ if ( !mfn_eq(smfn, INVALID_MFN) )
+ smfn = mfn_add(smfn, BIT(order, UL));
+
+ left -= BIT(order, UL);
+ }
+
+ if ( left > INT_MAX )
+ rc = -EOVERFLOW;
+
+ return !left ? rc : left;
}
int map_regions_p2mt(struct domain *d,
--
2.51.0
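For readers following the loop in p2m_set_range(), the way a range gets carved
into 1GB, 2MB and 4KB mappings can be sketched outside of Xen roughly as
follows. This is a simplified standalone model, not the patch's code: the
helper names (level_order, mapping_order, count_entries), the
MAX_SUPERPAGE_LEVEL constant and the plain uint64_t frame numbers are
assumptions made for illustration, PAGETABLE_ORDER is taken to be 9, and the
INVALID_MFN special case is left out.

```c
#include <stdint.h>

#define PAGETABLE_ORDER     9  /* assumed: 512 entries per 4KB table */
#define MAX_SUPERPAGE_LEVEL 2  /* hypothetical name; level 2 maps 1GB */

/* log2 of the number of 4KB pages covered by one entry at `level`. */
unsigned int level_order(unsigned int level)
{
    return level * PAGETABLE_ORDER;
}

/*
 * Largest supported order for which both frame numbers are aligned and
 * at least that many pages remain to be mapped.
 */
unsigned int mapping_order(uint64_t gfn, uint64_t mfn, uint64_t nr)
{
    uint64_t mask = gfn | mfn;
    unsigned int level;

    for ( level = MAX_SUPERPAGE_LEVEL; level != 0; level-- )
    {
        uint64_t pages = UINT64_C(1) << level_order(level);

        if ( !(mask & (pages - 1)) && nr >= pages )
            return level_order(level);
    }

    return 0;
}

/* Number of page table entries written when mapping `nr` pages. */
unsigned int count_entries(uint64_t gfn, uint64_t mfn, uint64_t nr)
{
    unsigned int entries = 0;

    while ( nr )
    {
        /* One entry covers 2^order pages; order 0 always fits. */
        uint64_t pages = UINT64_C(1) << mapping_order(gfn, mfn, nr);

        gfn += pages;
        mfn += pages;
        nr -= pages;
        entries++;
    }

    return entries;
}
```

Under these assumptions, mapping 514 pages starting at frame 511 takes three
entries: one 4KB page to reach 2MB alignment, one 2MB superpage, and one
trailing 4KB page.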
* Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-10-20 15:57 ` [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range() Oleksii Kurochko
@ 2025-11-10 14:53 ` Jan Beulich
2025-11-14 17:04 ` Oleksii Kurochko
2025-11-10 14:53 ` Jan Beulich
1 sibling, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-10 14:53 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -8,12 +8,45 @@
> #include <xen/rwlock.h>
> #include <xen/types.h>
>
> +#include <asm/page.h>
> #include <asm/page-bits.h>
>
> extern unsigned char gstage_mode;
> +extern unsigned int gstage_root_level;
>
> #define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
> #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
> +#define P2M_ROOT_LEVEL gstage_root_level
> +
> +/*
> + * According to the RISC-V spec:
> + * When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
> + * or Sv57x4, G-stage address translation is a variation on the usual
> + * page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
> + * Sv57, respectively. In each case, the size of the incoming address is
> + * widened by 2 bits (to 34, 41, 50, or 59 bits).
> + *
> + * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
> + * the index for this level of the P2M page table starts. The extra 2
> + * bits added by the "x4" schemes only affect the root page table width.
> + *
> + * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
> + * levels: the extra 2 bits do not change the indices of lower levels.
> + *
> + * The extra 2 bits are only relevant if one tried to address beyond the
> + * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
> + * invalid.
> + */
> +#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.
> +#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
> +
> +#define P2M_PAGETABLE_ENTRIES(lvl) \
> + (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
> +
> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
That's not really something you can apply to a GFN, unlike the name
suggests.
> +#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -9,6 +9,7 @@
> #include <xen/rwlock.h>
> #include <xen/sched.h>
> #include <xen/sections.h>
> +#include <xen/xvmalloc.h>
>
> #include <asm/csr.h>
> #include <asm/flushtlb.h>
> @@ -17,6 +18,43 @@
> #include <asm/vmid.h>
>
> unsigned char __ro_after_init gstage_mode;
> +unsigned int __ro_after_init gstage_root_level;
Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).
> +/*
> + * The P2M root page table is extended by 2 bits, making its size 16KB
> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
> + * only allocates 4KB pages).
> + */
> +#define ENTRIES_PER_ROOT_PAGE \
> + (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
> +
> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
Where would a vaddr_t come from here? Your input are guest-physical addresses,
if I'm not mistaken.
> +{
> + unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
> +
> + /*
> + * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
> + * page table spans 4 consecutive 4KB pages.
> + * We want to return an index within one of these 4 pages.
> + * The specific page to use is determined by `p2m_get_root_pointer()`.
> + *
> + * Example: if `offset == 512`:
> + * - A single 4KB page holds 512 entries.
> + * - Therefore, entry 512 corresponds to index 0 of the second page.
> + *
> + * At all other levels, only one page is allocated, and `offset` is
> + * always in the range 0 to 511, since the VPN is 9 bits long.
> + */
> + return offset % ENTRIES_PER_ROOT_PAGE;
Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply
return offset & ((1U << PAGETABLE_ORDER) - 1);
?
> +}
> +
> +#define P2M_MAX_ROOT_LEVEL 4
> +
> +#define P2M_DECLARE_OFFSETS(var, addr) \
> + unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
> + for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
> + var[i] = calc_offset(i, addr);
This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.
> @@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
> return rc;
> }
>
> +/*
> + * Map one of the four root pages of the P2M root page table.
> + *
> + * The P2M root page table is larger than normal (16KB instead of 4KB),
> + * so it is allocated as four consecutive 4KB pages. This function selects
> + * the appropriate 4KB page based on the given GFN and returns a mapping
> + * to it.
> + *
> + * The caller is responsible for unmapping the page after use.
> + *
> + * Returns NULL if the calculated offset into the root table is invalid.
> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> + unsigned long root_table_indx;
> +
> + root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.
> + if ( root_table_indx >= P2M_ROOT_PAGES )
> + return NULL;
> +
> + /*
> + * The P2M root page table is extended by 2 bits, making its size 16KB
> + * (instead of 4KB for non-root page tables). Therefore, p2m->root is
> + * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
> + * only allocates 4KB pages).
> + *
> + * Initially, `root_table_indx` is derived directly from `va`.
There's no 'va' here.
> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
"clean_pte" as a parameter name of a function of this name is, well, odd.
> +/* Insert an entry in the p2m */
> +static int p2m_set_entry(struct p2m_domain *p2m,
> + gfn_t gfn,
> + unsigned long page_order,
> + mfn_t mfn,
> + p2m_type_t t)
> +{
> + unsigned int level;
> + unsigned int target = page_order / PAGETABLE_ORDER;
> + pte_t *entry, *table, orig_pte;
> + int rc;
> + /*
> + * A mapping is removed only if the MFN is explicitly set to INVALID_MFN.
> + * Other MFNs that are considered invalid by mfn_valid() (e.g., MMIO)
> + * are still allowed.
> + */
> + bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
> + P2M_DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
> +
> + ASSERT(p2m_is_write_locked(p2m));
> +
> + /*
> + * Check if the level target is valid: we only support
> + * 4K - 2M - 1G mapping.
> + */
> + ASSERT(target <= 2);
> +
> + table = p2m_get_root_pointer(p2m, gfn);
> + if ( !table )
> + return -EINVAL;
> +
> + for ( level = P2M_ROOT_LEVEL; level > target; level-- )
> + {
> + /*
> + * Don't try to allocate intermediate page table if the mapping
> + * is about to be removed.
> + */
> + rc = p2m_next_level(p2m, !removing_mapping,
> + level, &table, offsets[level]);
> + if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
> + {
> + rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
> + /*
> + * We are here because p2m_next_level has failed to map
> + * the intermediate page table (e.g the table does not exist
> + * and none should be allocated). It is a valid case
> + * when removing a mapping as it may not exist in the
> + * page table. In this case, just ignore lookup failure.
> + */
> + rc = removing_mapping ? 0 : rc;
> + goto out;
> + }
> +
> + if ( rc != P2M_TABLE_NORMAL )
> + break;
> + }
> +
> + entry = table + offsets[level];
> +
> + /*
> + * If we are here with level > target, we must be at a leaf node,
> + * and we need to break up the superpage.
> + */
> + if ( level > target )
> + {
> + panic("Shattering isn't implemented\n");
> + }
> +
> + /*
> + * We should always be there with the correct level because all the
> + * intermediate tables have been installed if necessary.
> + */
> + ASSERT(level == target);
> +
> + orig_pte = *entry;
> +
> + if ( removing_mapping )
> + p2m_clean_pte(entry, p2m->clean_dcache);
> + else
> + {
> + pte_t pte = p2m_pte_from_mfn(mfn, t);
> +
> + p2m_write_pte(entry, pte, p2m->clean_dcache);
> +
> + p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
> + gfn_add(gfn, BIT(page_order, UL) - 1));
> + p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
> + }
> +
> + p2m->need_flush = true;
> +
> + /*
> + * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
> + * is not ready for RISC-V support.
> + *
> + * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
> + * here.
> + */
> +#ifdef CONFIG_HAS_PASSTHROUGH
> +# error "add code to flush IOMMU TLB"
> +#endif
> +
> + rc = 0;
> +
> + /*
> + * In case of a VALID -> INVALID transition, the original PTE should
> + * always be freed.
> + *
> + * In case of a VALID -> VALID transition, the original PTE should be
> + * freed only if the MFNs are different. If the MFNs are the same
> + * (i.e., only permissions differ), there is no need to free the
> + * original PTE.
> + */
> + if ( pte_is_valid(orig_pte) &&
> + (!pte_is_valid(*entry) ||
> + !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
Besides my continued impression of this condition being more complex than it
ought to be, indentation is off by one on the last of the three lines.
(Since, otoh, I can't suggest any simpler expression (for now), this isn't a
request to further change it.)
> +/* Return mapping order for given gfn, mfn and nr */
> +static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
> +{
> + unsigned long mask;
> + /* 1gb, 2mb, 4k mappings are supported */
> + unsigned int level = min(P2M_ROOT_LEVEL, _AC(2, U));
Further up you have such a literal 2 already - please make a constant, so all
instances can easily be associated with one another.
> + unsigned long order = 0;
> +
> + mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> + mask |= gfn_x(gfn);
> +
> + for ( ; level != 0; level-- )
> + {
> + if ( !(mask & (BIT(P2M_LEVEL_ORDER(level), UL) - 1)) &&
> + (nr >= BIT(P2M_LEVEL_ORDER(level), UL)) )
> + {
> + order = P2M_LEVEL_ORDER(level);
> + break;
I'm pretty sure I did complain about the too deep indentation here already.
> + }
> + }
> +
> + return order;
> +}
> +
> static int p2m_set_range(struct p2m_domain *p2m,
> gfn_t sgfn,
> unsigned long nr,
> mfn_t smfn,
> p2m_type_t t)
> {
> - return -EOPNOTSUPP;
> + int rc = 0;
> + unsigned long left = nr;
> +
> + /*
> + * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> + * be dropped in relinquish_p2m_mapping(). As the P2M will still
> + * be accessible after, we need to prevent mapping to be added when the
> + * domain is dying.
> + */
> + if ( unlikely(p2m->domain->is_dying) )
> + return -EACCES;
> +
> + while ( left )
> + {
> + unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> + rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> + if ( rc )
> + break;
> +
> + sgfn = gfn_add(sgfn, BIT(order, UL));
> + if ( !mfn_eq(smfn, INVALID_MFN) )
> + smfn = mfn_add(smfn, BIT(order, UL));
Off-by-1 indentation again.
Jan
* Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-11-10 14:53 ` Jan Beulich
@ 2025-11-14 17:04 ` Oleksii Kurochko
2025-11-17 8:56 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-14 17:04 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/10/25 3:53 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -8,12 +8,45 @@
>> #include <xen/rwlock.h>
>> #include <xen/types.h>
>>
>> +#include <asm/page.h>
>> #include <asm/page-bits.h>
>>
>> extern unsigned char gstage_mode;
>> +extern unsigned int gstage_root_level;
>>
>> #define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
>> #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
>> +#define P2M_ROOT_LEVEL gstage_root_level
>> +
>> +/*
>> + * According to the RISC-V spec:
>> + * When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
>> + * or Sv57x4, G-stage address translation is a variation on the usual
>> + * page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
>> + * Sv57, respectively. In each case, the size of the incoming address is
>> + * widened by 2 bits (to 34, 41, 50, or 59 bits).
>> + *
>> + * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
>> + * the index for this level of the P2M page table starts. The extra 2
>> + * bits added by the "x4" schemes only affect the root page table width.
>> + *
>> + * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
>> + * levels: the extra 2 bits do not change the indices of lower levels.
>> + *
>> + * The extra 2 bits are only relevant if one tried to address beyond the
>> + * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
>> + * invalid.
>> + */
>> +#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
> Is the last paragraph of the comment really needed? It talks about something
> absurd / impossible only.
Agree, it isn't really needed, let's drop it.
>
>> +#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
>> +
>> +#define P2M_PAGETABLE_ENTRIES(lvl) \
>> + (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
>> +
>> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
> If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
I'm not sure I fully understand you here. With the current implementation,
it returns a bitmask that corresponds to the number of index bits used
at each level. So, if P2M_ROOT_LEVEL = 2, then:
  GFN_MASK(0) = 0x1ff (9-bit GFN width for level 0)
  GFN_MASK(1) = 0x1ff (9-bit GFN width for level 1)
  GFN_MASK(2) = 0x7ff (11-bit GFN width for level 2)
Or do you mean that GFN_MASK(lvl) should return something like this:
  GFN_MASK_(0) = 0x1FF000 (0x1ff << 0xc)
  GFN_MASK_(1) = 0x3FE00000 (GFN_MASK_(0) << 9)
  GFN_MASK_(2) = 0x1FFC0000000 (GFN_MASK_(1) << 9 + extra 2 bits)
And then here ...
> That's not really something you can apply to a GFN, unlike the name
> suggests.
That is why virtual address should be properly shifted before, something
like it is done in calc_offset():
(va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
...
(va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
In this option more shifts will be needed.
Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or,
maybe, even just P2M_INDEX_MASK().
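To make the two mask conventions discussed above concrete, here is a minimal
standalone sketch (not Xen code). It assumes Sv39x4, i.e. a root level of 2,
PAGE_SHIFT of 12 and 9 index bits per non-root level; all function names
(index_bits, level_shift, index_mask, inplace_mask, pt_index,
pt_index_inplace) are hypothetical and only serve the illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT      12
#define PAGETABLE_ORDER 9
#define ROOT_LEVEL      2   /* assumed: Sv39x4 */

/* Index width at a level: the root gets the 2 extra "x4" bits. */
unsigned int index_bits(unsigned int lvl)
{
    return PAGETABLE_ORDER + (lvl == ROOT_LEVEL ? 2 : 0);
}

/* Bit position, within a guest physical address, where the index starts. */
unsigned int level_shift(unsigned int lvl)
{
    return lvl * PAGETABLE_ORDER + PAGE_SHIFT;
}

/* Convention A: a low-bits mask, applied after shifting the address. */
uint64_t index_mask(unsigned int lvl)
{
    return (UINT64_C(1) << index_bits(lvl)) - 1;
}

unsigned int pt_index(uint64_t gpa, unsigned int lvl)
{
    return (unsigned int)((gpa >> level_shift(lvl)) & index_mask(lvl));
}

/* Convention B: a mask positioned in place, applied before shifting. */
uint64_t inplace_mask(unsigned int lvl)
{
    return index_mask(lvl) << level_shift(lvl);
}

unsigned int pt_index_inplace(uint64_t gpa, unsigned int lvl)
{
    return (unsigned int)((gpa & inplace_mask(lvl)) >> level_shift(lvl));
}
```

Both conventions yield the same index; they differ only in what the mask can
meaningfully be applied to, which is why the naming matters.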
>
>> +#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
> Whereas here the macro name doesn't make clear what is shifted: An
> address or a GFN. (It's the former, aiui.)
Yes, it is expected to be used to shift gfn.
Similarly to the above, would it be better to rename P2M_LEVEL_SHIFT() to
P2M_GFN_LEVEL_SHIFT()?
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -9,6 +9,7 @@
>> #include <xen/rwlock.h>
>> #include <xen/sched.h>
>> #include <xen/sections.h>
>> +#include <xen/xvmalloc.h>
>>
>> #include <asm/csr.h>
>> #include <asm/flushtlb.h>
>> @@ -17,6 +18,43 @@
>> #include <asm/vmid.h>
>>
>> unsigned char __ro_after_init gstage_mode;
>> +unsigned int __ro_after_init gstage_root_level;
> Like for mode, I'm unconvinced of this being a global (and not per-P2M /
> per-domain).
The question is then if we really will (or want to) have cases when gstage
mode will be different per-domain/per-p2m?
>
>> +/*
>> + * The P2M root page table is extended by 2 bits, making its size 16KB
>> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
>> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
>> + * only allocates 4KB pages).
>> + */
>> +#define ENTRIES_PER_ROOT_PAGE \
>> + (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
>> +
>> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
> Where would a vaddr_t come from here? Your input are guest-physical addresses,
> if I'm not mistaken.
You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t supposed
to be used only with machine physical addresses?
>
>> +{
>> + unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
>> +
>> + /*
>> + * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
>> + * page table spans 4 consecutive 4KB pages.
>> + * We want to return an index within one of these 4 pages.
>> + * The specific page to use is determined by `p2m_get_root_pointer()`.
>> + *
>> + * Example: if `offset == 512`:
>> + * - A single 4KB page holds 512 entries.
>> + * - Therefore, entry 512 corresponds to index 0 of the second page.
>> + *
>> + * At all other levels, only one page is allocated, and `offset` is
>> + * always in the range 0 to 511, since the VPN is 9 bits long.
>> + */
>> + return offset % ENTRIES_PER_ROOT_PAGE;
> Seeing something "root" used here (when this is for all levels) is pretty odd,
> despite all the commentary. Given all the commentary, why not simply
>
> return offset & ((1U << PAGETABLE_ORDER) - 1);
>
> ?
It works for all levels where lvl < P2M_ROOT_LEVEL, because in those cases the GFN
bit length is equal to PAGETABLE_ORDER. However, at the root level the GFN bit length
is 2 bits larger. So something like the following is needed:
  offset & ((1U << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1);
This still returns an offset within a single 16 KB page, but in the case of the P2M
root we actually have four consecutive 4 KB pages, so the intention was to return
an offset inside one of those four 4 KB pages.
While writing the above, I started thinking whether calc_offset() could be implemented
much more simply. Since the root page table consists of four consecutive pages, it seems
acceptable to have the offset in the range [0, 2^11) instead of doing all the extra
manipulation to determine which of the four pages is used and the offset within that
specific page:
  static inline unsigned int calc_offset(unsigned int lvl, paddr_t gpa)
  {
      return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
  }
  static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
  {
      return __map_domain_page(p2m->root);
  }
It probably still makes sense for p2m_get_root_pointer() to check that the root GFN
index is not larger than 2^11.
Am I missing something?
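The arithmetic behind the two approaches can be checked with a small
standalone sketch. It is not Xen code: the names (struct root_slot,
split_root_index, flat_root_index) are hypothetical, and it assumes 512
entries per 4KB page and four root pages, so that a root index lies in
[0, 2048). Whether a single mapping can actually cover all four pages is the
open question here; the sketch only shows that the (page, offset) pair and
the flat index name the same entry.

```c
#define PAGETABLE_ENTRIES 512U  /* assumed: entries in one 4KB table page */
#define ROOT_PAGES        4U    /* assumed: 16KB root = four 4KB pages */

/* Position of a root entry expressed as (page, offset within page). */
struct root_slot {
    unsigned int page;
    unsigned int offset;
};

/* Split a flat root index (0..2047) into page number and in-page offset. */
struct root_slot split_root_index(unsigned int idx)
{
    struct root_slot slot = {
        .page = idx / PAGETABLE_ENTRIES,
        .offset = idx % PAGETABLE_ENTRIES,
    };

    return slot;
}

/* Recombine (page, offset) into the flat index over contiguous pages. */
unsigned int flat_root_index(struct root_slot slot)
{
    return slot.page * PAGETABLE_ENTRIES + slot.offset;
}
```

For instance, flat index 512 splits into page 1, offset 0, matching the
example in the calc_offset() comment.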
>
>> +}
>> +
>> +#define P2M_MAX_ROOT_LEVEL 4
>> +
>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>> + unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>> + for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>> + var[i] = calc_offset(i, addr);
> This surely is more than just "declare", and it's dealing with all levels no
> matter whether you actually will use all offsets.
I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().
But how can I know which offset I will actually need to use?
If we take the following loop as an example:
  for ( level = P2M_ROOT_LEVEL; level > target; level-- )
  {
      /*
       * Don't try to allocate intermediate page tables if the mapping
       * is about to be removed.
       */
      rc = p2m_next_level(p2m, !removing_mapping,
                          level, &table, offsets[level]);
      ...
  }
It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
updated), because Sv39 only uses 3 page table levels, then yes, in theory it could be
smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
few extra unused entries.
If preferred, I could allocate the array dynamically based on gstage_root_level.
Would that be better?
Would that be better?
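One possible shape for this, shown only as a hedged standalone sketch and not
as the planned implementation, is a plain function filling a fixed-size array.
The names build_level_offsets, MAX_LEVELS and UNUSED_OFFSET are hypothetical;
the sketch assumes PAGE_SHIFT of 12, 9 index bits per non-root level, and five
slots so that levels 0..4 fit. It also marks every unused slot explicitly,
since in C an initializer such as `= {-1}` sets only the first element and
zero-initializes the rest.

```c
#include <stdint.h>

#define PAGE_SHIFT      12
#define PAGETABLE_ORDER 9
#define MAX_LEVELS      5      /* assumed: Sv57x4 uses levels 0..4 */
#define UNUSED_OFFSET   (~0U)  /* hypothetical marker for unused levels */

/*
 * Fill offsets[lvl] with the page table index of `gpa` at each level up to
 * and including `root_level`; levels above the root are marked unused.
 */
void build_level_offsets(unsigned int offsets[MAX_LEVELS],
                         unsigned int root_level, uint64_t gpa)
{
    for ( unsigned int lvl = 0; lvl < MAX_LEVELS; lvl++ )
    {
        /* The root level carries the 2 extra bits of the "x4" schemes. */
        unsigned int bits = PAGETABLE_ORDER + (lvl == root_level ? 2 : 0);
        unsigned int shift = lvl * PAGETABLE_ORDER + PAGE_SHIFT;

        if ( lvl > root_level )
            offsets[lvl] = UNUSED_OFFSET;
        else
            offsets[lvl] = (unsigned int)((gpa >> shift) &
                                          ((UINT64_C(1) << bits) - 1));
    }
}
```

A fixed array of this size costs only a handful of stack bytes, so dynamic
allocation would likely add more complexity than it saves.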
>
>> @@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
>> return rc;
>> }
>>
>> +/*
>> + * Map one of the four root pages of the P2M root page table.
>> + *
>> + * The P2M root page table is larger than normal (16KB instead of 4KB),
>> + * so it is allocated as four consecutive 4KB pages. This function selects
>> + * the appropriate 4KB page based on the given GFN and returns a mapping
>> + * to it.
>> + *
>> + * The caller is responsible for unmapping the page after use.
>> + *
>> + * Returns NULL if the calculated offset into the root table is invalid.
>> + */
>> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> +{
>> + unsigned long root_table_indx;
>> +
>> + root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
> With the variable name shortened (to e.g. idx) this could be its initializer
> without ending up with too long a line. The root_table_ prefix isn't really
> adding much value in the context of this function.
>
>> + if ( root_table_indx >= P2M_ROOT_PAGES )
>> + return NULL;
>> +
>> + /*
>> + * The P2M root page table is extended by 2 bits, making its size 16KB
>> + * (instead of 4KB for non-root page tables). Therefore, p2m->root is
>> + * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
>> + * only allocates 4KB pages).
>> + *
>> + * Initially, `root_table_indx` is derived directly from `va`.
> There's no 'va' here.
Should be gfn.
>
>> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> "clean_pte" as a parameter name of a function of this name is, well, odd.
clean_cache should be better:
p2m_clean_pte(pte_t *p, bool clean_cache)
I suppose it would be nice to rename everywhere clean_pte to clean_cache.
Thanks.
~ Oleksii
* Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-11-14 17:04 ` Oleksii Kurochko
@ 2025-11-17 8:56 ` Jan Beulich
2025-11-18 15:28 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-17 8:56 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 14.11.2025 18:04, Oleksii Kurochko wrote:
> On 11/10/25 3:53 PM, Jan Beulich wrote:
>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
>> If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
>
> I'm not sure I fully understand you here. With the current implementation,
> it returns a bitmask that corresponds to the number of index bits used
> at each level. So, if P2M_ROOT_LEVEL = 2, then:
>   GFN_MASK(0) = 0x1ff (9-bit GFN width for level 0)
>   GFN_MASK(1) = 0x1ff (9-bit GFN width for level 1)
>   GFN_MASK(2) = 0x7ff (11-bit GFN width for level 2)
Oh, sorry, 9 and 11 bits is what I meant.
> Or do you mean that GFN_MASK(lvl) should return something like this:
>   GFN_MASK_(0) = 0x1FF000 (0x1ff << 0xc)
>   GFN_MASK_(1) = 0x3FE00000 (GFN_MASK_(0) << 9)
>   GFN_MASK_(2) = 0x1FFC0000000 (GFN_MASK_(1) << 9 + extra 2 bits)
Yes.
> And then here ...
>
>> That's not really something you can apply to a GFN, unlike the name
>> suggests.
>
> That is why virtual address should be properly shifted before, something
> like it is done in calc_offset():
Please can we stop calling guest physical addresses "virtual address"?
> (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
>
> ...
> (va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
> In this option more shifts will be needed.
It's okay to try to limit the number of shifts needed, but the macros need
naming accordingly.
> Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or,
> maybe, even just P2M_INDEX_MASK().
Perhaps. I would recommend though that you take a look at other ports'
naming. In x86, for example, we have l<N>_table_offset().
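For illustration only, the two index-extraction variants discussed above can be compared in a small standalone sketch. Everything here is an assumption for the example (a fixed root level of 2, 4KB pages, 9 index bits per level plus 2 extra at the root, and the helper names); none of it is taken from the Xen tree.

```c
#include <stdint.h>

/* Assumed constants for an Sv39x4-like layout (illustrative only). */
#define PAGE_SHIFT      12
#define PAGETABLE_ORDER 9
#define ROOT_LEVEL      2   /* hypothetical fixed root level */

/* Number of index bits at a level: 9, plus 2 extra bits at the root. */
static unsigned int index_bits(unsigned int lvl)
{
    return PAGETABLE_ORDER + (lvl == ROOT_LEVEL ? 2 : 0);
}

/* Bit position in the guest-physical address where this level's index starts. */
static unsigned int level_shift(unsigned int lvl)
{
    return lvl * PAGETABLE_ORDER + PAGE_SHIFT;
}

/* Variant 1: shift the address first, then apply a low-bits index mask. */
static unsigned int index_shift_then_mask(uint64_t gpa, unsigned int lvl)
{
    uint64_t index_mask = (UINT64_C(1) << index_bits(lvl)) - 1;

    return (unsigned int)((gpa >> level_shift(lvl)) & index_mask);
}

/* Variant 2: apply an in-place address mask first, then shift. */
static unsigned int index_mask_then_shift(uint64_t gpa, unsigned int lvl)
{
    uint64_t addr_mask = ((UINT64_C(1) << index_bits(lvl)) - 1)
                         << level_shift(lvl);

    return (unsigned int)((gpa & addr_mask) >> level_shift(lvl));
}
```

Both variants yield the same index. The difference is only in what the mask can be applied to: the low-bits mask of variant 1 applies to an already shifted value, which is why a name along the lines of "index mask" describes it better than "GFN mask".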
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -9,6 +9,7 @@
>>> #include <xen/rwlock.h>
>>> #include <xen/sched.h>
>>> #include <xen/sections.h>
>>> +#include <xen/xvmalloc.h>
>>>
>>> #include <asm/csr.h>
>>> #include <asm/flushtlb.h>
>>> @@ -17,6 +18,43 @@
>>> #include <asm/vmid.h>
>>>
>>> unsigned char __ro_after_init gstage_mode;
>>> +unsigned int __ro_after_init gstage_root_level;
>> Like for mode, I'm unconvinced of this being a global (and not per-P2M /
>> per-domain).
>
> The question is then if we really will (or want to) have cases when gstage
> mode will be different per-domain/per-p2m?
Can you explain to me why you think we wouldn't want that, sooner or later?
>>> +/*
>>> + * The P2M root page table is extended by 2 bits, making its size 16KB
>>> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
>>> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
>>> + * only allocates 4KB pages).
>>> + */
>>> +#define ENTRIES_PER_ROOT_PAGE \
>>> + (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
>>> +
>>> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
>> Where would a vaddr_t come from here? Your inputs are guest-physical addresses,
>> if I'm not mistaken.
>
> You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t supposed to be
> used only with machine physical addresses?
In x86 we use paddr_t in such cases. Arm iirc additionally has gaddr_t.
>>> +#define P2M_MAX_ROOT_LEVEL 4
>>> +
>>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>>> + unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>>> + for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>>> + var[i] = calc_offset(i, addr);
>> This surely is more than just "declare", and it's dealing with all levels no
>> matter whether you actually will use all offsets.
>
> I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().
>
> But how can I know which offset I will actually need to use?
> If we take the following loop as an example:
>     for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>     {
>         /*
>          * Don't try to allocate intermediate page tables if the mapping
>          * is about to be removed.
>          */
>         rc = p2m_next_level(p2m, !removing_mapping,
>                             level, &table, offsets[level]);
>         ...
>     }
> It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
>
> If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
> an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
> updated), because Sv39 only uses 3 page table levels, then yes, in theory it could be
> smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
> few extra unused entries.
>
> If preferred, I could allocate the array dynamically based on gstage_root_level.
> Would that be better?
Having a few unused entries isn't a big deal imo. What I'm not happy with here is
that you may _initialize_ more entries than actually needed. I have no good
suggestion within the conceptual framework you use for page walking (the same
issue iirc exists in host page table walks, just that the calculations there are
cheaper).
Jan
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-11-17 8:56 ` Jan Beulich
@ 2025-11-18 15:28 ` Oleksii Kurochko
2025-11-18 16:30 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-18 15:28 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/17/25 9:56 AM, Jan Beulich wrote:
>>>> +#define P2M_MAX_ROOT_LEVEL 4
>>>> +
>>>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>>>> + unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>>>> + for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>>>> + var[i] = calc_offset(i, addr);
>>> This surely is more than just "declare", and it's dealing with all levels no
>>> matter whether you actually will use all offsets.
>> I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().
>>
>> But how can I know which offset I will actually need to use?
>> If we take the following loop as an example:
>>     for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>>     {
>>         /*
>>          * Don't try to allocate intermediate page tables if the mapping
>>          * is about to be removed.
>>          */
>>         rc = p2m_next_level(p2m, !removing_mapping,
>>                             level, &table, offsets[level]);
>>         ...
>>     }
>> It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
>>
>> If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
>> an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
>> updated), because Sv39 only uses 3 page table levels, then yes, in theory it could be
>> smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
>> few extra unused entries.
>>
>> If preferred, I could allocate the array dynamically based on gstage_root_level.
>> Would that be better?
> Having a few unused entries isn't a big deal imo. What I'm not happy with here is
> that you may _initialize_ more entries than actually needed. I have no good
> suggestion within the conceptual framework you use for page walking (the same
> issue iirc exists in host page table walks, just that the calculations there are
> cheaper).
The loop inside P2M_DECLARE_OFFSETS() uses gstage_root_level, so only the entries that
are actually going to be used are initialized.
You probably mean that it’s possible we don’t need to walk all the tables because a
leaf page-table entry appears earlier than the L0 page table for some gfns? IMO, it’s not
really a big deal, because at worst we just spend some time calculating something that
isn’t actually needed, but considering that it will be just extra 2 calls in the worst case
(when mapping is 1g, for no reason we calculated offsets for L1 and L0) of calc_offset()
it won't affect performance too much.
~ Oleksii
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-11-18 15:28 ` Oleksii Kurochko
@ 2025-11-18 16:30 ` Jan Beulich
0 siblings, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-18 16:30 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 18.11.2025 16:28, Oleksii Kurochko wrote:
>
> On 11/17/25 9:56 AM, Jan Beulich wrote:
>>>>> +#define P2M_MAX_ROOT_LEVEL 4
>>>>> +
>>>>> +#define P2M_DECLARE_OFFSETS(var, addr) \
>>>>> + unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
>>>>> + for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
>>>>> + var[i] = calc_offset(i, addr);
>>>> This surely is more than just "declare", and it's dealing with all levels no
>>>> matter whether you actually will use all offsets.
>>> I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().
>>>
>>> But how can I know which offset I will actually need to use?
>>> If we take the following loop as an example:
>>>     for ( level = P2M_ROOT_LEVEL; level > target; level-- )
>>>     {
>>>         /*
>>>          * Don't try to allocate intermediate page tables if the mapping
>>>          * is about to be removed.
>>>          */
>>>         rc = p2m_next_level(p2m, !removing_mapping,
>>>                             level, &table, offsets[level]);
>>>         ...
>>>     }
>>> It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
>>>
>>> If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
>>> an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
>>> updated), because Sv39 only uses 3 page table levels, then yes, in theory it could be
>>> smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
>>> few extra unused entries.
>>>
>>> If preferred, I could allocate the array dynamically based on gstage_root_level.
>>> Would that be better?
>> Having a few unused entries isn't a big deal imo. What I'm not happy with here is
>> that you may _initialize_ more entries than actually needed. I have no good
>> suggestion within the conceptual framework you use for page walking (the same
>> issue iirc exists in host page table walks, just that the calculations there are
>> cheaper).
>
> The loop inside P2M_DECLARE_OFFSETS() uses gstage_root_level, so only the entries that
> are actually going to be used are initialized.
>
> You probably mean that it’s possible we don’t need to walk all the tables because a
> leaf page-table entry appears earlier than the L0 page table for some gfns?
Yes.
> IMO, it’s not
> really a big deal, because at worst we just spend some time calculating something that
> isn’t actually needed, but considering that it will be just extra 2 calls in the worst case
> (when mapping is 1g, for no reason we calculated offsets for L1 and L0) of calc_offset()
> it won't affect performance too much.
Well, it's your call in the end.
Jan
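To put a number on the trade-off discussed in this sub-thread, here is a minimal standalone sketch that counts index computations for an eager fill (all levels up front, as P2M_DECLARE_OFFSETS() does) versus a lazy one (only the levels the walk visits). The lazy variant is a hypothetical alternative written for this illustration, not something proposed in the thread, and calc_index() is a simplified stand-in that ignores the 2 extra root bits.

```c
#include <stdint.h>

#define PAGE_SHIFT      12
#define PAGETABLE_ORDER 9
#define MAX_LEVELS      5   /* enough for a 5-level (Sv57x4-style) walk */

static unsigned int calc_calls;   /* counts index computations */

/* Simplified per-level index; the 2 extra root bits are ignored here. */
static unsigned int calc_index(unsigned int lvl, uint64_t gpa)
{
    calc_calls++;
    return (unsigned int)((gpa >> (lvl * PAGETABLE_ORDER + PAGE_SHIFT)) & 0x1ff);
}

/* Eager: compute offsets for levels 0..root, then visit root..target. */
static unsigned int eager_calls(unsigned int root, unsigned int target,
                                uint64_t gpa)
{
    unsigned int offsets[MAX_LEVELS] = { 0 };
    unsigned int sink = 0;

    calc_calls = 0;
    for ( unsigned int i = 0; i <= root; i++ )
        offsets[i] = calc_index(i, gpa);

    /* Visits levels root, root - 1, ..., target (requires root >= target). */
    for ( unsigned int lvl = root + 1; lvl-- > target; )
        sink += offsets[lvl];

    (void)sink;
    return calc_calls;
}

/* Lazy: compute an offset only when the walk reaches that level. */
static unsigned int lazy_calls(unsigned int root, unsigned int target,
                               uint64_t gpa)
{
    unsigned int sink = 0;

    calc_calls = 0;
    for ( unsigned int lvl = root + 1; lvl-- > target; )
        sink += calc_index(lvl, gpa);

    (void)sink;
    return calc_calls;
}
```

With root level 2 and a 1G mapping (target level 2), the eager variant performs 3 computations and the lazy one performs 1, which matches the "extra 2 calls" estimate given above. For a 4K mapping (target level 0) both perform the same number.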
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
2025-10-20 15:57 ` [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range() Oleksii Kurochko
2025-11-10 14:53 ` Jan Beulich
@ 2025-11-10 14:53 ` Jan Beulich
1 sibling, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-10 14:53 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -8,12 +8,45 @@
> #include <xen/rwlock.h>
> #include <xen/types.h>
>
> +#include <asm/page.h>
> #include <asm/page-bits.h>
>
> extern unsigned char gstage_mode;
> +extern unsigned int gstage_root_level;
>
> #define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
> #define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
> +#define P2M_ROOT_LEVEL gstage_root_level
> +
> +/*
> + * According to the RISC-V spec:
> + * When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
> + * or Sv57x4, G-stage address translation is a variation on the usual
> + * page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
> + * Sv57, respectively. In each case, the size of the incoming address is
> + * widened by 2 bits (to 34, 41, 50, or 59 bits).
> + *
> + * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
> + * the index for this level of the P2M page table starts. The extra 2
> + * bits added by the "x4" schemes only affect the root page table width.
> + *
> + * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
> + * levels: the extra 2 bits do not change the indices of lower levels.
> + *
> + * The extra 2 bits are only relevant if one tried to address beyond the
> + * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
> + * invalid.
> + */
> +#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.
> +#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
> +
> +#define P2M_PAGETABLE_ENTRIES(lvl) \
> + (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
> +
> +#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
That's not really something you can apply to a GFN, unlike the name
suggests.
> +#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -9,6 +9,7 @@
> #include <xen/rwlock.h>
> #include <xen/sched.h>
> #include <xen/sections.h>
> +#include <xen/xvmalloc.h>
>
> #include <asm/csr.h>
> #include <asm/flushtlb.h>
> @@ -17,6 +18,43 @@
> #include <asm/vmid.h>
>
> unsigned char __ro_after_init gstage_mode;
> +unsigned int __ro_after_init gstage_root_level;
Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).
> +/*
> + * The P2M root page table is extended by 2 bits, making its size 16KB
> + * (instead of 4KB for non-root page tables). Therefore, P2M root page
> + * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
> + * only allocates 4KB pages).
> + */
> +#define ENTRIES_PER_ROOT_PAGE \
> + (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
> +
> +static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
Where would a vaddr_t come from here? Your inputs are guest-physical addresses,
if I'm not mistaken.
> +{
> + unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
> +
> + /*
> + * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
> + * page table spans 4 consecutive 4KB pages.
> + * We want to return an index within one of these 4 pages.
> + * The specific page to use is determined by `p2m_get_root_pointer()`.
> + *
> + * Example: if `offset == 512`:
> + * - A single 4KB page holds 512 entries.
> + * - Therefore, entry 512 corresponds to index 0 of the second page.
> + *
> + * At all other levels, only one page is allocated, and `offset` is
> + * always in the range 0 to 511, since the VPN is 9 bits long.
> + */
> + return offset % ENTRIES_PER_ROOT_PAGE;
Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply
return offset & ((1U << PAGETABLE_ORDER) - 1);
?
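As a worked example of the suggestion above, the following standalone sketch splits a root-level index into "which 4KB page of the root table" and "index within that page" using only PAGETABLE_ORDER. The constants (16KB root, root level 2) and helper names are assumptions for the illustration, not code from the series.

```c
#include <stdint.h>

#define PAGE_SHIFT       12
#define PAGETABLE_ORDER  9
#define ROOT_LEVEL       2                                /* assumed */
#define ROOT_TABLE_SIZE  (16 * 1024)                      /* assumed 16KB root */
#define ROOT_PAGES       (ROOT_TABLE_SIZE >> PAGE_SHIFT)  /* 4 pages */
#define ENTRIES_PER_PAGE (1U << PAGETABLE_ORDER)          /* 512 entries */

/* Full index at a level: 9 bits, or 11 bits at the root. */
static unsigned int full_index(unsigned int lvl, uint64_t gpa)
{
    unsigned int bits = PAGETABLE_ORDER + (lvl == ROOT_LEVEL ? 2 : 0);

    return (unsigned int)((gpa >> (lvl * PAGETABLE_ORDER + PAGE_SHIFT)) &
                          ((1U << bits) - 1));
}

/* Index within a single 4KB table page; valid for every level. */
static unsigned int in_page_index(unsigned int lvl, uint64_t gpa)
{
    return full_index(lvl, gpa) & (ENTRIES_PER_PAGE - 1);
}

/* Which of the consecutive root pages holds the entry. */
static unsigned int root_page(uint64_t gpa)
{
    return full_index(ROOT_LEVEL, gpa) >> PAGETABLE_ORDER;
}
```

For a root index of 512 this gives page 1, in-page index 0, as in the comment's example. Because 512 is a power of two, the mask and `% 512` are interchangeable. As an arithmetic aside under the same 16KB assumption: the root order is 2 while the root page count is 4, so 2048 / 4 = 512 entries per page, whereas 2048 / 2 would give 1024.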
> +}
> +
> +#define P2M_MAX_ROOT_LEVEL 4
> +
> +#define P2M_DECLARE_OFFSETS(var, addr) \
> + unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
> + for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
> + var[i] = calc_offset(i, addr);
This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.
> @@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
> return rc;
> }
>
> +/*
> + * Map one of the four root pages of the P2M root page table.
> + *
> + * The P2M root page table is larger than normal (16KB instead of 4KB),
> + * so it is allocated as four consecutive 4KB pages. This function selects
> + * the appropriate 4KB page based on the given GFN and returns a mapping
> + * to it.
> + *
> + * The caller is responsible for unmapping the page after use.
> + *
> + * Returns NULL if the calculated offset into the root table is invalid.
> + */
> +static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> +{
> + unsigned long root_table_indx;
> +
> + root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.
> + if ( root_table_indx >= P2M_ROOT_PAGES )
> + return NULL;
> +
> + /*
> + * The P2M root page table is extended by 2 bits, making its size 16KB
> + * (instead of 4KB for non-root page tables). Therefore, p2m->root is
> + * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
> + * only allocates 4KB pages).
> + *
> + * Initially, `root_table_indx` is derived directly from `va`.
There's no 'va' here.
> +static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
"clean_pte" as a parameter name of a function of this name is, well, odd.
> +/* Insert an entry in the p2m */
> +static int p2m_set_entry(struct p2m_domain *p2m,
> + gfn_t gfn,
> + unsigned long page_order,
> + mfn_t mfn,
> + p2m_type_t t)
> +{
> + unsigned int level;
> + unsigned int target = page_order / PAGETABLE_ORDER;
> + pte_t *entry, *table, orig_pte;
> + int rc;
> + /*
> + * A mapping is removed only if the MFN is explicitly set to INVALID_MFN.
> + * Other MFNs that are considered invalid by mfn_valid() (e.g., MMIO)
> + * are still allowed.
> + */
> + bool removing_mapping = mfn_eq(mfn, INVALID_MFN);
> + P2M_DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
> +
> + ASSERT(p2m_is_write_locked(p2m));
> +
> + /*
> + * Check if the level target is valid: we only support
> + * 4K - 2M - 1G mapping.
> + */
> + ASSERT(target <= 2);
> +
> + table = p2m_get_root_pointer(p2m, gfn);
> + if ( !table )
> + return -EINVAL;
> +
> + for ( level = P2M_ROOT_LEVEL; level > target; level-- )
> + {
> + /*
> + * Don't try to allocate intermediate page table if the mapping
> + * is about to be removed.
> + */
> + rc = p2m_next_level(p2m, !removing_mapping,
> + level, &table, offsets[level]);
> + if ( (rc == P2M_TABLE_MAP_NONE) || (rc == P2M_TABLE_MAP_NOMEM) )
> + {
> + rc = (rc == P2M_TABLE_MAP_NONE) ? -ENOENT : -ENOMEM;
> + /*
> + * We are here because p2m_next_level has failed to map
> + * the intermediate page table (e.g the table does not exist
> + * and none should be allocated). It is a valid case
> + * when removing a mapping as it may not exist in the
> + * page table. In this case, just ignore lookup failure.
> + */
> + rc = removing_mapping ? 0 : rc;
> + goto out;
> + }
> +
> + if ( rc != P2M_TABLE_NORMAL )
> + break;
> + }
> +
> + entry = table + offsets[level];
> +
> + /*
> + * If we are here with level > target, we must be at a leaf node,
> + * and we need to break up the superpage.
> + */
> + if ( level > target )
> + {
> + panic("Shattering isn't implemented\n");
> + }
> +
> + /*
> + * We should always be there with the correct level because all the
> + * intermediate tables have been installed if necessary.
> + */
> + ASSERT(level == target);
> +
> + orig_pte = *entry;
> +
> + if ( removing_mapping )
> + p2m_clean_pte(entry, p2m->clean_dcache);
> + else
> + {
> + pte_t pte = p2m_pte_from_mfn(mfn, t);
> +
> + p2m_write_pte(entry, pte, p2m->clean_dcache);
> +
> + p2m->max_mapped_gfn = gfn_max(p2m->max_mapped_gfn,
> + gfn_add(gfn, BIT(page_order, UL) - 1));
> + p2m->lowest_mapped_gfn = gfn_min(p2m->lowest_mapped_gfn, gfn);
> + }
> +
> + p2m->need_flush = true;
> +
> + /*
> + * Currently, the infrastructure required to enable CONFIG_HAS_PASSTHROUGH
> + * is not ready for RISC-V support.
> + *
> + * When CONFIG_HAS_PASSTHROUGH=y, iommu_iotlb_flush() should be done
> + * here.
> + */
> +#ifdef CONFIG_HAS_PASSTHROUGH
> +# error "add code to flush IOMMU TLB"
> +#endif
> +
> + rc = 0;
> +
> + /*
> + * In case of a VALID -> INVALID transition, the original PTE should
> + * always be freed.
> + *
> + * In case of a VALID -> VALID transition, the original PTE should be
> + * freed only if the MFNs are different. If the MFNs are the same
> + * (i.e., only permissions differ), there is no need to free the
> + * original PTE.
> + */
> + if ( pte_is_valid(orig_pte) &&
> + (!pte_is_valid(*entry) ||
> + !mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
Besides my continued impression of this condition being more complex than it
ought to be, indentation is off by one on the last of the three lines.
(Since, otoh, I can't suggest any simpler expression (for now), this isn't a
request to further change it.)
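For readers following along, the condition under discussion reduces to a small truth table. The sketch below models a PTE as just a valid flag and an MFN; the struct and function names are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE model: only the two properties the condition looks at. */
struct pte_sketch {
    bool valid;
    uint64_t mfn;
};

/*
 * Free what the original entry referenced if it was valid and either the
 * entry became invalid, or it now points at a different MFN.
 */
static bool should_free_orig(struct pte_sketch orig, struct pte_sketch cur)
{
    return orig.valid && (!cur.valid || cur.mfn != orig.mfn);
}
```

The four cases are: invalid original (nothing to free), valid to invalid (free), valid to valid with the same MFN (keep, only permissions changed), and valid to valid with a different MFN (free).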
> +/* Return mapping order for given gfn, mfn and nr */
> +static unsigned long p2m_mapping_order(gfn_t gfn, mfn_t mfn, unsigned long nr)
> +{
> + unsigned long mask;
> + /* 1gb, 2mb, 4k mappings are supported */
> + unsigned int level = min(P2M_ROOT_LEVEL, _AC(2, U));
Further up you have such a literal 2 already - please make a constant, so all
instances can easily be associated with one another.
> + unsigned long order = 0;
> +
> + mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> + mask |= gfn_x(gfn);
> +
> + for ( ; level != 0; level-- )
> + {
> + if ( !(mask & (BIT(P2M_LEVEL_ORDER(level), UL) - 1)) &&
> + (nr >= BIT(P2M_LEVEL_ORDER(level), UL)) )
> + {
> + order = P2M_LEVEL_ORDER(level);
> + break;
I'm pretty sure I did complain about the too deep indentation here already.
> + }
> + }
> +
> + return order;
> +}
> +
> static int p2m_set_range(struct p2m_domain *p2m,
> gfn_t sgfn,
> unsigned long nr,
> mfn_t smfn,
> p2m_type_t t)
> {
> - return -EOPNOTSUPP;
> + int rc = 0;
> + unsigned long left = nr;
> +
> + /*
> + * Any reference taken by the P2M mappings (e.g. foreign mapping) will
> + * be dropped in relinquish_p2m_mapping(). As the P2M will still
> + * be accessible after, we need to prevent mapping to be added when the
> + * domain is dying.
> + */
> + if ( unlikely(p2m->domain->is_dying) )
> + return -EACCES;
> +
> + while ( left )
> + {
> + unsigned long order = p2m_mapping_order(sgfn, smfn, left);
> +
> + rc = p2m_set_entry(p2m, sgfn, order, smfn, t);
> + if ( rc )
> + break;
> +
> + sgfn = gfn_add(sgfn, BIT(order, UL));
> + if ( !mfn_eq(smfn, INVALID_MFN) )
> + smfn = mfn_add(smfn, BIT(order, UL));
Off-by-1 indentation again.
Jan
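The order selection and range loop quoted in this message can be exercised in isolation with the sketch below. It uses plain integers for GFN/MFN, fixes the largest mapping level at 2 (1G), and assumes the remaining count is decremented by the mapped size on each iteration; that last step is cut off in the quote above, so it is an assumption here. INVALID_MFN_SKETCH and the function names are hypothetical.

```c
#include <stdint.h>

#define PAGETABLE_ORDER    9
#define MAX_MAPPING_LEVEL  2            /* 4K, 2M and 1G mappings only */
#define INVALID_MFN_SKETCH UINT64_MAX   /* hypothetical stand-in */

static unsigned int level_order(unsigned int lvl)
{
    return lvl * PAGETABLE_ORDER;
}

/* Largest order such that gfn and mfn are aligned and nr covers the mapping. */
static unsigned int mapping_order(uint64_t gfn, uint64_t mfn, uint64_t nr)
{
    uint64_t mask = (mfn != INVALID_MFN_SKETCH ? mfn : 0) | gfn;

    for ( unsigned int lvl = MAX_MAPPING_LEVEL; lvl != 0; lvl-- )
    {
        uint64_t pages = UINT64_C(1) << level_order(lvl);

        if ( !(mask & (pages - 1)) && nr >= pages )
            return level_order(lvl);
    }

    return 0;
}

/* Count how many entries a range would be split into. */
static unsigned int count_entries(uint64_t gfn, uint64_t mfn, uint64_t nr)
{
    unsigned int entries = 0;

    while ( nr )
    {
        uint64_t pages = UINT64_C(1) << mapping_order(gfn, mfn, nr);

        gfn += pages;
        if ( mfn != INVALID_MFN_SKETCH )
            mfn += pages;
        nr -= pages;   /* assumed; not visible in the quoted hunk */
        entries++;
    }

    return entries;
}
```

For example, a range of 0x202 pages starting at GFN/MFN 0x1ff is split into three entries: one 4K page to reach 2M alignment, one 2M superpage, and one trailing 4K page.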
^ permalink raw reply [flat|nested] 54+ messages in thread
* [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (9 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range() Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-10 15:29 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
` (6 subsequent siblings)
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
This patch introduces a working implementation of p2m_free_subtree() for RISC-V
based on ARM's implementation of p2m_free_entry(), enabling proper cleanup
of page table entries in the P2M (physical-to-machine) mapping.
Only a few things are changed:
- Introduce and use p2m_get_type() to get a type of p2m entry as
RISC-V's PTE doesn't have enough space to store all necessary types so
a type is stored outside PTE. But, at the moment, handle only types
which fit into PTE's bits.
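As a rough illustration of keeping a type in the PTE itself: the patch below uses mask 0x300, i.e. bits 8 and 9, which leaves room for four values. The helpers here are a standalone sketch with made-up names; the patch itself uses MASK_EXTR() for the extraction.

```c
#include <stdint.h>

#define TYPE_MASK  UINT64_C(0x300)   /* bits 8 and 9, as in the patch */
#define TYPE_SHIFT 8

/* Extract the 2-bit type field from a raw PTE value. */
static unsigned int pte_get_type_sketch(uint64_t pte)
{
    return (unsigned int)((pte & TYPE_MASK) >> TYPE_SHIFT);
}

/* Return the PTE with its type field replaced; other bits are untouched. */
static uint64_t pte_set_type_sketch(uint64_t pte, unsigned int type)
{
    return (pte & ~TYPE_MASK) | (((uint64_t)type << TYPE_SHIFT) & TYPE_MASK);
}
```

With only two bits available, any type beyond the values that fit has to live elsewhere, which is why the series reserves one value to mean "stored outside the PTE".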
Key additions include:
- p2m_free_subtree(): Recursively frees page table entries at all levels. It
handles both regular and superpage mappings and ensures that TLB entries
are flushed before freeing intermediate tables.
- p2m_put_page() and helpers:
- p2m_put_4k_page(): Clears GFN from xenheap pages if applicable.
- p2m_put_2m_superpage(): Releases foreign page references in a 2MB
superpage.
- p2m_get_type(): Extracts the stored p2m_type from the PTE bits.
- p2m_free_page(): Returns a page to a domain's freelist.
- Introduce p2m_is_foreign() and related helpers.
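The recursion described above can be sketched on a toy page-table tree. This is a simplified model for illustration only: the fan-out is 4 instead of 512, tables are heap allocations rather than domain pages, and the struct and function names are invented. The real function also flushes the TLB before freeing an intermediate table.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ENTRIES 4   /* tiny fan-out for illustration; real tables hold 512 */

struct table;

struct entry {
    bool valid;
    bool leaf;            /* true for a 4K page or a superpage mapping */
    struct table *next;   /* next-level table when !leaf */
};

struct table {
    struct entry e[ENTRIES];
};

struct stats {
    unsigned int leaves_put;
    unsigned int tables_freed;
};

/* Free everything reachable from 'ent', which lives at 'level'. */
static void free_subtree(struct entry ent, unsigned int level,
                         struct stats *st)
{
    /* Nothing to do for an invalid entry. */
    if ( !ent.valid )
        return;

    /* A leaf only needs its page reference dropped. */
    if ( level == 0 || ent.leaf )
    {
        st->leaves_put++;
        return;
    }

    /* Recurse into every entry of the next-level table. */
    for ( unsigned int i = 0; i < ENTRIES; i++ )
        free_subtree(ent.next->e[i], level - 1, st);

    /* Real code would flush the TLB here, before the table is freed. */
    free(ent.next);
    st->tables_freed++;
}
```

The recursion depth is bounded by the starting level, which is why the patch asserts that the level does not exceed the supported maximum.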
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Rewrite the comment inside p2m_put_foreign_page().
- s/assert_failed/ASSERT_UNREACHABLE.
- Use p2mt variable when p2m_free_subtree() is called instead of
p2m_get_type(entry).
- Update the commit message: drop info about definition of XEN_PT_ENTRIES
- Drop also definition of XEN_PT_ENTRIES as the macro isn't used anymore.
- Drop ACCESS_ONCE() for paging_free_page() as it is redundant in the
case when a code is wrapped by a spinlock.
---
Changes in V4:
- Stray blanks.
- Implement arch_flush_tlb_mask() to make the comment in p2m_put_foreign()
clear and explicit.
- Update the comment above p2m_is_ram() in p2m_put_4k_page() with an explanation
why p2m_is_ram() is used.
- Add a type check inside p2m_put_2m_superpage().
- Swap two conditions around in p2m_free_subtree():
if ( (level == 0) || pte_is_superpage(entry, level) )
- Add ASSERT() inside p2m_free_subtree() to check that level is <= 2; otherwise,
the recursion could consume a lot of time and memory.
- Drop page_list_del() before p2m_free_page() as page_list_del() is called
inside p2m_free_page().
- Update p2m_freelist's total_pages when a page is added to p2m_freelist in
paging_free_page().
- Introduce P2M_SUPPORTED_LEVEL_MAPPING and use it in ASSERTs() which check
supported level.
- Use P2M_PAGETABLE_ENTRIES as XEN_PT_ENTRIES
doesn't take into account that the G-stage root page table is
extended by 2 bits.
- Update prototype of p2m_put_page() to not have unnecessary changes later.
---
Changes in V3:
- Use p2m_tlb_flush_sync(p2m) instead of p2m_force_tlb_flush_sync() in
p2m_free_subtree().
- Drop p2m_is_valid() implementation as pte_is_valid() is going to be used
instead.
- Drop p2m_is_superpage() and introduce pte_is_superpage() instead.
- s/p2m_free_entry/p2m_free_subtree.
- s/p2m_type_radix_get/p2m_get_type.
- Update implementation of p2m_get_type() to get the type from PTE bits;
other cases will be covered in a separate patch. This requires an
introduction of new P2M_TYPE_PTE_BITS_MASK macros.
- Drop p2m argument of p2m_get_type() as it isn't needed anymore.
- Put cheapest checks first in p2m_is_superpage().
- Use switch() in p2m_put_page().
- Update the comment in p2m_put_foreign_page().
- Code style fixes.
- Move p2m_foreign stuff to this commit.
- Drop p2m argument of p2m_put_page() as it isn't used anymore.
---
Changes in V2:
- New patch. It was part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- s/p2m_is_superpage/p2me_is_superpage.
---
xen/arch/riscv/include/asm/flushtlb.h | 6 +-
xen/arch/riscv/include/asm/p2m.h | 15 +++
xen/arch/riscv/include/asm/page.h | 5 +
xen/arch/riscv/include/asm/paging.h | 2 +
xen/arch/riscv/p2m.c | 152 +++++++++++++++++++++++++-
xen/arch/riscv/paging.c | 8 ++
xen/arch/riscv/stubs.c | 5 -
7 files changed, 184 insertions(+), 9 deletions(-)
diff --git a/xen/arch/riscv/include/asm/flushtlb.h b/xen/arch/riscv/include/asm/flushtlb.h
index e70badae0c..ab32311568 100644
--- a/xen/arch/riscv/include/asm/flushtlb.h
+++ b/xen/arch/riscv/include/asm/flushtlb.h
@@ -41,8 +41,10 @@ static inline void page_set_tlbflush_timestamp(struct page_info *page)
BUG_ON("unimplemented");
}
-/* Flush specified CPUs' TLBs */
-void arch_flush_tlb_mask(const cpumask_t *mask);
+static inline void arch_flush_tlb_mask(const cpumask_t *mask)
+{
+ sbi_remote_hfence_gvma(mask, 0, 0);
+}
#endif /* ASM__RISCV__FLUSHTLB_H */
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index ce8bcb944f..6a17cd52fc 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -110,6 +110,8 @@ typedef enum {
p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
PTE_PBMT_IO will be used for such mappings */
p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
+ p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
+ p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
/* Sentinel — not a real type, just a marker for comparison */
p2m_first_external = p2m_ext_storage,
@@ -120,15 +122,28 @@ static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
return p2m_mmio_direct_io;
}
+/*
+ * Bits 8 and 9 are reserved for use by supervisor software;
+ * the implementation shall ignore this field.
+ * We use these bits to store frequently used types, avoiding the need to
+ * get/set such a type from the radix tree.
+ */
+#define P2M_TYPE_PTE_BITS_MASK 0x300
+
/* We use bitmaps and mask to handle groups of types */
#define p2m_to_mask(t) BIT(t, UL)
/* RAM types, which map to real machine frames */
#define P2M_RAM_TYPES (p2m_to_mask(p2m_ram_rw))
+/* Foreign mappings types */
+#define P2M_FOREIGN_TYPES (p2m_to_mask(p2m_map_foreign_rw) | \
+ p2m_to_mask(p2m_map_foreign_ro))
+
/* Useful predicates */
#define p2m_is_ram(t) (p2m_to_mask(t) & P2M_RAM_TYPES)
#define p2m_is_any_ram(t) (p2m_to_mask(t) & P2M_RAM_TYPES)
+#define p2m_is_foreign(t) (p2m_to_mask(t) & P2M_FOREIGN_TYPES)
#include <xen/p2m-common.h>
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 66cb192316..78e53981ac 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -182,6 +182,11 @@ static inline bool pte_is_mapping(pte_t p)
return (p.pte & PTE_VALID) && (p.pte & PTE_ACCESS_MASK);
}
+static inline bool pte_is_superpage(pte_t p, unsigned int level)
+{
+ return (level > 0) && pte_is_mapping(p);
+}
+
static inline int clean_and_invalidate_dcache_va_range(const void *p,
unsigned long size)
{
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index 01be45528f..fe462be223 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -13,4 +13,6 @@ int paging_freelist_adjust(struct domain *d, unsigned long pages,
int paging_ret_to_domheap(struct domain *d, unsigned int nr_pages);
int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages);
+void paging_free_page(struct domain *d, struct page_info *pg);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index f13458712a..71b211410b 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -17,6 +17,8 @@
#include <asm/riscv_encoding.h>
#include <asm/vmid.h>
+#define P2M_SUPPORTED_LEVEL_MAPPING 2
+
unsigned char __ro_after_init gstage_mode;
unsigned int __ro_after_init gstage_root_level;
@@ -347,6 +349,16 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
+static p2m_type_t p2m_get_type(const pte_t pte)
+{
+ p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
+
+ if ( type == p2m_ext_storage )
+ panic("unimplemented\n");
+
+ return type;
+}
+
static inline void p2m_write_pte(pte_t *p, pte_t pte, bool clean_pte)
{
write_pte(p, pte);
@@ -403,11 +415,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
return P2M_TABLE_MAP_NONE;
}
+static void p2m_put_foreign_page(struct page_info *pg)
+{
+ /*
+ * It’s safe to call put_page() here because arch_flush_tlb_mask()
+ * will be invoked if the page is reallocated, which will trigger a
+ * flush of the guest TLBs.
+ */
+ put_page(pg);
+}
+
+/* Put any references on the single 4K page referenced by mfn. */
+static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
+{
+ /* TODO: Handle other p2m types */
+
+ if ( p2m_is_foreign(type) )
+ {
+ ASSERT(mfn_valid(mfn));
+ p2m_put_foreign_page(mfn_to_page(mfn));
+ }
+}
+
+/* Put any references on the superpage referenced by mfn. */
+static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
+{
+ struct page_info *pg;
+ unsigned int i;
+
+ /*
+ * TODO: Handle other p2m types, but be aware that any changes to handle
+ * different types should require an update on the relinquish code to
+ * handle preemption.
+ */
+ if ( !p2m_is_foreign(type) )
+ return;
+
+ ASSERT(mfn_valid(mfn));
+
+ pg = mfn_to_page(mfn);
+
+ for ( i = 0; i < P2M_PAGETABLE_ENTRIES(1); i++, pg++ )
+ p2m_put_foreign_page(pg);
+}
+
+/* Put any references on the page referenced by pte. */
+static void p2m_put_page(const pte_t pte, unsigned int level, p2m_type_t p2mt)
+{
+ mfn_t mfn = pte_get_mfn(pte);
+
+ ASSERT(pte_is_valid(pte));
+
+ /*
+ * TODO: Currently we don't handle level 2 super-page, Xen is not
+ * preemptible and therefore some work is needed to handle such
+ * superpages, for which at some point Xen might end up freeing memory
+ * and therefore for such a big mapping it could end up in a very long
+ * operation.
+ */
+ switch ( level )
+ {
+ case 1:
+ return p2m_put_2m_superpage(mfn, p2mt);
+
+ case 0:
+ return p2m_put_4k_page(mfn, p2mt);
+
+ default:
+ ASSERT_UNREACHABLE();
+ break;
+ }
+}
+
+static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg)
+{
+ page_list_del(pg, &p2m->pages);
+
+ paging_free_page(p2m->domain, pg);
+}
+
/* Free pte sub-tree behind an entry */
static void p2m_free_subtree(struct p2m_domain *p2m,
pte_t entry, unsigned int level)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ unsigned int i;
+ pte_t *table;
+ mfn_t mfn;
+ struct page_info *pg;
+
+ /*
+ * Check if the level is valid: only 4K - 2M - 1G mappings are supported.
+ * To support levels > 2, the implementation of p2m_free_subtree() would
+ * need to be updated, as the current recursive approach could consume
+ * excessive time and memory.
+ */
+ ASSERT(level <= P2M_SUPPORTED_LEVEL_MAPPING);
+
+ /* Nothing to do if the entry is invalid. */
+ if ( !pte_is_valid(entry) )
+ return;
+
+ if ( (level == 0) || pte_is_superpage(entry, level) )
+ {
+ p2m_type_t p2mt = p2m_get_type(entry);
+
+#ifdef CONFIG_IOREQ_SERVER
+ /*
+ * If this gets called then either the entry was replaced by an entry
+ * with a different base (valid case) or the shattering of a superpage
+ * has failed (error case).
+ * So, at worst, the spurious mapcache invalidation might be sent.
+ */
+ if ( p2m_is_ram(p2mt) &&
+ domain_has_ioreq_server(p2m->domain) )
+ ioreq_request_mapcache_invalidate(p2m->domain);
+#endif
+
+ p2m_put_page(entry, level, p2mt);
+
+ return;
+ }
+
+ table = map_domain_page(pte_get_mfn(entry));
+ for ( i = 0; i < P2M_PAGETABLE_ENTRIES(level); i++ )
+ p2m_free_subtree(p2m, table[i], level - 1);
+
+ unmap_domain_page(table);
+
+ /*
+ * Make sure all the references in the TLB have been removed before
+ * freeing the intermediate page table.
+ * XXX: Should we defer the free of the page table to avoid the
+ * flush?
+ */
+ p2m_tlb_flush_sync(p2m);
+
+ mfn = pte_get_mfn(entry);
+ ASSERT(mfn_valid(mfn));
+
+ pg = mfn_to_page(mfn);
+
+ p2m_free_page(p2m, pg);
}
/* Insert an entry in the p2m */
@@ -435,7 +583,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
* Check if the level target is valid: we only support
* 4K - 2M - 1G mapping.
*/
- ASSERT(target <= 2);
+ ASSERT(target <= P2M_SUPPORTED_LEVEL_MAPPING);
table = p2m_get_root_pointer(p2m, gfn);
if ( !table )
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index c87e9b7f7f..773c737ab5 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -109,6 +109,14 @@ int paging_ret_to_domheap(struct domain *d, unsigned int nr_pages)
return 0;
}
+void paging_free_page(struct domain *d, struct page_info *pg)
+{
+ spin_lock(&d->arch.paging.lock);
+ page_list_add_tail(pg, &d->arch.paging.freelist);
+ d->arch.paging.total_pages++;
+ spin_unlock(&d->arch.paging.lock);
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
diff --git a/xen/arch/riscv/stubs.c b/xen/arch/riscv/stubs.c
index 1a8c86cd8d..ad6fdbf501 100644
--- a/xen/arch/riscv/stubs.c
+++ b/xen/arch/riscv/stubs.c
@@ -65,11 +65,6 @@ int arch_monitor_domctl_event(struct domain *d,
/* smp.c */
-void arch_flush_tlb_mask(const cpumask_t *mask)
-{
- BUG_ON("unimplemented");
-}
-
void smp_send_event_check_mask(const cpumask_t *mask)
{
BUG_ON("unimplemented");
--
2.51.0
* Re: [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-10-20 15:57 ` [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
@ 2025-11-10 15:29 ` Jan Beulich
2025-11-17 11:36 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-10 15:29 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/p2m.h
> +++ b/xen/arch/riscv/include/asm/p2m.h
> @@ -110,6 +110,8 @@ typedef enum {
> p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
> PTE_PBMT_IO will be used for such mappings */
> p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
> + p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
> + p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
>
> /* Sentinel — not a real type, just a marker for comparison */
> p2m_first_external = p2m_ext_storage,
> @@ -120,15 +122,28 @@ static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
> return p2m_mmio_direct_io;
> }
>
> +/*
> + * Bits 8 and 9 are reserved for use by supervisor software;
> + * the implementation shall ignore this field.
> + * We are going to use to save in these bits frequently used types to avoid
> + * get/set of a type from radix tree.
> + */
> +#define P2M_TYPE_PTE_BITS_MASK 0x300
Better use PTE_RSW in place of the raw number?
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -17,6 +17,8 @@
> #include <asm/riscv_encoding.h>
> #include <asm/vmid.h>
>
> +#define P2M_SUPPORTED_LEVEL_MAPPING 2
I fear without a comment it's left unclear what this is / represents.
> @@ -403,11 +415,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
> return P2M_TABLE_MAP_NONE;
> }
>
> +static void p2m_put_foreign_page(struct page_info *pg)
> +{
> + /*
> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
> + * will be invoked if the page is reallocated, which will trigger a
> + * flush of the guest TLBs.
> + */
> + put_page(pg);
> +}
> +
> +/* Put any references on the single 4K page referenced by mfn. */
To me this and ...
> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
> +{
> + /* TODO: Handle other p2m types */
> +
> + if ( p2m_is_foreign(type) )
> + {
> + ASSERT(mfn_valid(mfn));
> + p2m_put_foreign_page(mfn_to_page(mfn));
> + }
> +}
> +
> +/* Put any references on the superpage referenced by mfn. */
... to a lesser degree this comment are potentially misleading. Down here at
least there is something plural-ish (the 4k pages that the 2M one consists
of), but especially for the single page case above "any" could easily mean
"anything that's still outstanding, anywhere". I'm also not quite sure "on"
is really what you mean (I'm not a native speaker, so my gut feeling may be
wrong here).
> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
> +{
> + struct page_info *pg;
> + unsigned int i;
> +
> + /*
> + * TODO: Handle other p2m types, but be aware that any changes to handle
> + * different types should require an update on the relinquish code to
> + * handle preemption.
> + */
I guess if I was to address this TODO, I wouldn't know what the latter part
of the sentence is warning me of.
> + if ( !p2m_is_foreign(type) )
> + return;
Are super-page foreign mappings actually intended to be permitted, conceptually?
> /* Free pte sub-tree behind an entry */
> static void p2m_free_subtree(struct p2m_domain *p2m,
> pte_t entry, unsigned int level)
> {
> - panic("%s: hasn't been implemented yet\n", __func__);
> + unsigned int i;
> + pte_t *table;
> + mfn_t mfn;
> + struct page_info *pg;
> +
> + /*
> + * Check if the level is valid: only 4K - 2M - 1G mappings are supported.
> + * To support levels > 2, the implementation of p2m_free_subtree() would
> + * need to be updated, as the current recursive approach could consume
> + * excessive time and memory.
> + */
> + ASSERT(level <= P2M_SUPPORTED_LEVEL_MAPPING);
> +
> + /* Nothing to do if the entry is invalid. */
> + if ( !pte_is_valid(entry) )
> + return;
> +
> + if ( (level == 0) || pte_is_superpage(entry, level) )
Considering what pte_is_superpage() expands to, simply pte_is_mapping()?
> + {
> + p2m_type_t p2mt = p2m_get_type(entry);
> +
> +#ifdef CONFIG_IOREQ_SERVER
> + /*
> + * If this gets called then either the entry was replaced by an entry
> + * with a different base (valid case) or the shattering of a superpage
> + * has failed (error case).
> + * So, at worst, the spurious mapcache invalidation might be sent.
> + */
> + if ( p2m_is_ram(p2mt) &&
> + domain_has_ioreq_server(p2m->domain) )
> + ioreq_request_mapcache_invalidate(p2m->domain);
> +#endif
> +
> + p2m_put_page(entry, level, p2mt);
> +
> + return;
> + }
> +
> + table = map_domain_page(pte_get_mfn(entry));
> + for ( i = 0; i < P2M_PAGETABLE_ENTRIES(level); i++ )
> + p2m_free_subtree(p2m, table[i], level - 1);
> +
> + unmap_domain_page(table);
Please can the use of blank lines in such cases be symmetric: Either have them
ahead of and after the loop, or have them nowhere?
> @@ -435,7 +583,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
> * Check if the level target is valid: we only support
> * 4K - 2M - 1G mapping.
> */
> - ASSERT(target <= 2);
> + ASSERT(target <= P2M_SUPPORTED_LEVEL_MAPPING);
Ah, this is where that constant comes into play. It wants moving to the earlier
patch, and with this being the purpose I guess it also wants to include MAX in
its name.
Jan
* Re: [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-11-10 15:29 ` Jan Beulich
@ 2025-11-17 11:36 ` Oleksii Kurochko
2025-11-17 14:22 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-17 11:36 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/10/25 4:29 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/include/asm/p2m.h
>> +++ b/xen/arch/riscv/include/asm/p2m.h
>> @@ -110,6 +110,8 @@ typedef enum {
>> p2m_mmio_direct_io, /* Read/write mapping of genuine Device MMIO area,
>> PTE_PBMT_IO will be used for such mappings */
>> p2m_ext_storage, /* Following types'll be stored outsude PTE bits: */
>> + p2m_map_foreign_rw, /* Read/write RAM pages from foreign domain */
>> + p2m_map_foreign_ro, /* Read-only RAM pages from foreign domain */
>>
>> /* Sentinel — not a real type, just a marker for comparison */
>> p2m_first_external = p2m_ext_storage,
>> @@ -120,15 +122,28 @@ static inline p2m_type_t arch_dt_passthrough_p2m_type(void)
>> return p2m_mmio_direct_io;
>> }
>>
>> +/*
>> + * Bits 8 and 9 are reserved for use by supervisor software;
>> + * the implementation shall ignore this field.
>> + * We are going to use to save in these bits frequently used types to avoid
>> + * get/set of a type from radix tree.
>> + */
>> +#define P2M_TYPE_PTE_BITS_MASK 0x300
> Better use PTE_RSW in place of the raw number?
It would be better, thanks.
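For illustration, a minimal sketch of how a 2-bit type could be packed into and read back from the RSW field (bits 8 and 9). The EX_-prefixed names are hypothetical and exist only for this example; the two macros mirror the idea behind Xen's MASK_INSR()/MASK_EXTR(). Unlike p2m_set_type() in the patch, which only ORs the type in, this helper clears the field first:

```c
#include <assert.h>
#include <stdint.h>

/* Bits 8 and 9 of a RISC-V PTE (the RSW field), reserved for software. */
#define EX_PTE_RSW UINT64_C(0x300)

/* (m & -m) isolates the lowest set bit of the mask, i.e. the field's unit. */
#define EX_MASK_EXTR(v, m) (((v) & (m)) / ((m) & -(m)))
#define EX_MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/* Store a type (0..3) in the RSW bits, leaving all other PTE bits alone. */
static uint64_t ex_set_type(uint64_t pte, unsigned int type)
{
    assert(type <= 3); /* only four values fit into two bits */

    return (pte & ~EX_PTE_RSW) | EX_MASK_INSR((uint64_t)type, EX_PTE_RSW);
}

/* Read the type back from the RSW bits. */
static unsigned int ex_get_type(uint64_t pte)
{
    return (unsigned int)EX_MASK_EXTR(pte, EX_PTE_RSW);
}
```

Only four types fit this way, which is why the remaining ones need external storage.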
>
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -17,6 +17,8 @@
>> #include <asm/riscv_encoding.h>
>> #include <asm/vmid.h>
>>
>> +#define P2M_SUPPORTED_LEVEL_MAPPING 2
> I fear without a comment it's left unclear what this is / represents.
Probably just renaming it to P2M_MAX_SUPPORTED_LEVEL_MAPPING would make it clearer,
wouldn’t it?
Otherwise, I can add the following comment:
/*
* At the moment, only 4K, 2M, and 1G mappings are supported for G-stage
* translation. Therefore, the maximum supported page-table level is 2,
* which corresponds to 1G mappings.
*/
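As a quick sanity check of the level-to-size relationship stated in that comment, here is a standalone sketch. It assumes a 4K base page and 9 index bits per level; the EX_-prefixed names are made up for the example and are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4K base pages, 9 index bits per page-table level. */
#define EX_PAGE_SHIFT          12u
#define EX_PT_LEVEL_BITS       9u
/* Hypothetical stand-in for the renamed constant discussed above. */
#define EX_MAX_SUPPORTED_LEVEL 2u

/* Size in bytes of a leaf mapping installed at the given level. */
static uint64_t ex_level_size(unsigned int level)
{
    assert(level <= EX_MAX_SUPPORTED_LEVEL);

    return UINT64_C(1) << (EX_PAGE_SHIFT + level * EX_PT_LEVEL_BITS);
}
```

Under these assumptions level 0 gives 4K, level 1 gives 2M and level 2 gives 1G.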
>
>> @@ -403,11 +415,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>> return P2M_TABLE_MAP_NONE;
>> }
>>
>> +static void p2m_put_foreign_page(struct page_info *pg)
>> +{
>> + /*
>> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
>> + * will be invoked if the page is reallocated, which will trigger a
>> + * flush of the guest TLBs.
>> + */
>> + put_page(pg);
>> +}
>> +
>> +/* Put any references on the single 4K page referenced by mfn. */
> To me this and ...
>
>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>> +{
>> + /* TODO: Handle other p2m types */
>> +
>> + if ( p2m_is_foreign(type) )
>> + {
>> + ASSERT(mfn_valid(mfn));
>> + p2m_put_foreign_page(mfn_to_page(mfn));
>> + }
>> +}
>> +
>> +/* Put any references on the superpage referenced by mfn. */
> ... to a lesser degree this comment are potentially misleading. Down here at
> least there is something plural-ish (the 4k pages that the 2M one consists
> of), but especially for the single page case above "any" could easily mean
> "anything that's still outstanding, anywhere". I'm also not quite sure "on"
> is really what you mean (I'm not a native speaker, so my gut feeling may be
> wrong here).
Then I could suggest the following instead:
/* Put the reference associated with the 4K page identified by mfn. */
and
/* Put the references associated with the superpage identified by mfn. */
I think the comments could be omitted, since the function names already make
this clear.
>
>> +static void p2m_put_2m_superpage(mfn_t mfn, p2m_type_t type)
>> +{
>> + struct page_info *pg;
>> + unsigned int i;
>> +
>> + /*
>> + * TODO: Handle other p2m types, but be aware that any changes to handle
>> + * different types should require an update on the relinquish code to
>> + * handle preemption.
>> + */
> I guess if I was to address this TODO, I wouldn't know what the latter part
> of the sentence is warning me of.
It refers to code that hasn't been introduced yet, something like what Arm has
in relinquish_p2m_mapping():
https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/p2m.c#L1588
I am not 100% sure that this comment is useful now (as relinquish_p2m_mapping()
isn't introduced yet), so I am okay with just dropping it and adding it back
when relinquish_p2m_mapping() is introduced.
>
>> + if ( !p2m_is_foreign(type) )
>> + return;
> Are super-page foreign mappings actually intended to be permitted, conceptually?
Good question. Conceptually, yes (and I thought that was the reason why ARM has
code to handle such cases and so I decided to have the same for RISC-V), but in
reality, it will be 4 KB pages, as I can see in the current codebase for other
architectures.
>
>> /* Free pte sub-tree behind an entry */
>> static void p2m_free_subtree(struct p2m_domain *p2m,
>> pte_t entry, unsigned int level)
>> {
>> - panic("%s: hasn't been implemented yet\n", __func__);
>> + unsigned int i;
>> + pte_t *table;
>> + mfn_t mfn;
>> + struct page_info *pg;
>> +
>> + /*
>> + * Check if the level is valid: only 4K - 2M - 1G mappings are supported.
>> + * To support levels > 2, the implementation of p2m_free_subtree() would
>> + * need to be updated, as the current recursive approach could consume
>> + * excessive time and memory.
>> + */
>> + ASSERT(level <= P2M_SUPPORTED_LEVEL_MAPPING);
>> +
>> + /* Nothing to do if the entry is invalid. */
>> + if ( !pte_is_valid(entry) )
>> + return;
>> +
>> + if ( (level == 0) || pte_is_superpage(entry, level) )
> Considering what pte_is_superpage() expands to, simply pte_is_mapping()?
Makes sense, we can really just have:
if ( pte_is_mapping(entry) )
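To double-check the simplification, a small standalone model may help. It assumes that pte_is_superpage(p, level) is equivalent to (level > 0) && pte_is_mapping(p), and that a mapping is a valid PTE with at least one of R/W/X set; all EX_/ex_ names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* RISC-V PTE bits: V = bit 0, R = bit 1, W = bit 2, X = bit 3. */
#define EX_PTE_VALID       (UINT64_C(1) << 0)
#define EX_PTE_ACCESS_MASK (UINT64_C(7) << 1)

typedef struct { uint64_t pte; } ex_pte_t;

/* A leaf: valid and at least one of R/W/X set. */
static bool ex_pte_is_mapping(ex_pte_t p)
{
    return (p.pte & EX_PTE_VALID) && (p.pte & EX_PTE_ACCESS_MASK);
}

/* Assumed expansion of pte_is_superpage(). */
static bool ex_pte_is_superpage(ex_pte_t p, unsigned int level)
{
    return (level > 0) && ex_pte_is_mapping(p);
}

/* The condition as written in the patch. */
static bool ex_is_leaf_old(ex_pte_t p, unsigned int level)
{
    assert(level <= 2);

    return (level == 0) || ex_pte_is_superpage(p, level);
}

/* The simplified condition. */
static bool ex_is_leaf_new(ex_pte_t p)
{
    return ex_pte_is_mapping(p);
}
```

In this model the two conditions differ only for a valid level-0 entry with R=W=X=0, which would be a table pointer at the last level and shouldn't exist in a well-formed p2m.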
>
>> + {
>> + p2m_type_t p2mt = p2m_get_type(entry);
>> +
>> +#ifdef CONFIG_IOREQ_SERVER
>> + /*
>> + * If this gets called then either the entry was replaced by an entry
>> + * with a different base (valid case) or the shattering of a superpage
>> + * has failed (error case).
>> + * So, at worst, the spurious mapcache invalidation might be sent.
>> + */
>> + if ( p2m_is_ram(p2mt) &&
>> + domain_has_ioreq_server(p2m->domain) )
>> + ioreq_request_mapcache_invalidate(p2m->domain);
>> +#endif
>> +
>> + p2m_put_page(entry, level, p2mt);
>> +
>> + return;
>> + }
>> +
>> + table = map_domain_page(pte_get_mfn(entry));
>> + for ( i = 0; i < P2M_PAGETABLE_ENTRIES(level); i++ )
>> + p2m_free_subtree(p2m, table[i], level - 1);
>> +
>> + unmap_domain_page(table);
> Please can the use of blank lines in such cases be symmetric: Either have them
> ahead of and after the loop, or have them nowhere?
>
>> @@ -435,7 +583,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>> * Check if the level target is valid: we only support
>> * 4K - 2M - 1G mapping.
>> */
>> - ASSERT(target <= 2);
>> + ASSERT(target <= P2M_SUPPORTED_LEVEL_MAPPING);
> Ah, this is where that constant comes into play. It wants moving to the earlier
> patch, and with this being the purpose I guess it also wants to include MAX in
> its name.
Regarding MAX, that is what I arrived at in my reply above, so let's
just add "MAX".
~ Oleksii
* Re: [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers
2025-11-17 11:36 ` Oleksii Kurochko
@ 2025-11-17 14:22 ` Jan Beulich
0 siblings, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-17 14:22 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 17.11.2025 12:36, Oleksii Kurochko wrote:
> On 11/10/25 4:29 PM, Jan Beulich wrote:
>> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>>> --- a/xen/arch/riscv/p2m.c
>>> +++ b/xen/arch/riscv/p2m.c
>>> @@ -17,6 +17,8 @@
>>> #include <asm/riscv_encoding.h>
>>> #include <asm/vmid.h>
>>>
>>> +#define P2M_SUPPORTED_LEVEL_MAPPING 2
>> I fear without a comment it's left unclear what this is / represents.
>
> Probably just renaming it to|P2M_MAX_SUPPORTED_LEVEL_MAPPING| would make it clearer,
> wouldn’t it?
> Otherwise, I can add the following comment:
> /*
> * At the moment, only 4K, 2M, and 1G mappings are supported for G-stage
> * translation. Therefore, the maximum supported page-table level is 2,
> * which corresponds to 1G mappings.
> */
Both the name change and the comment, if you ask me.
>>> @@ -403,11 +415,147 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
>>> return P2M_TABLE_MAP_NONE;
>>> }
>>>
>>> +static void p2m_put_foreign_page(struct page_info *pg)
>>> +{
>>> + /*
>>> + * It’s safe to call put_page() here because arch_flush_tlb_mask()
>>> + * will be invoked if the page is reallocated, which will trigger a
>>> + * flush of the guest TLBs.
>>> + */
>>> + put_page(pg);
>>> +}
>>> +
>>> +/* Put any references on the single 4K page referenced by mfn. */
>> To me this and ...
>>
>>> +static void p2m_put_4k_page(mfn_t mfn, p2m_type_t type)
>>> +{
>>> + /* TODO: Handle other p2m types */
>>> +
>>> + if ( p2m_is_foreign(type) )
>>> + {
>>> + ASSERT(mfn_valid(mfn));
>>> + p2m_put_foreign_page(mfn_to_page(mfn));
>>> + }
>>> +}
>>> +
>>> +/* Put any references on the superpage referenced by mfn. */
>> ... to a lesser degree this comment are potentially misleading. Down here at
>> least there is something plural-ish (the 4k pages that the 2M one consists
>> of), but especially for the single page case above "any" could easily mean
>> "anything that's still outstanding, anywhere". I'm also not quite sure "on"
>> is really what you mean (I'm not a native speaker, so my gut feeling may be
>> wrong here).
>
> Then I could suggest the following instead:
> /* Put the reference associated with the 4K page identified by mfn. */
> and
> /* Put the references associated with the superpage identified by mfn. */
>
> I think the comments could be omitted, since the function names already make
> this clear.
Okay with me.
Jan
* [for 4.22 v5 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (10 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 11/18] xen/riscv: Implement p2m_free_subtree() and related helpers Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-10 16:22 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 13/18] xen/riscv: implement p2m_next_level() Oleksii Kurochko
` (5 subsequent siblings)
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
This patch adds the initial logic for constructing PTEs from MFNs in the RISC-V
p2m subsystem. It includes:
- Implementation of p2m_pte_from_mfn(): Generates a valid PTE using the
given MFN, p2m_type_t, including permission encoding and PBMT attribute
setup.
- New helper p2m_set_permission(): Encodes access rights (r, w, x) into the
PTE based on both p2m type and access permissions.
- p2m_set_type(): Stores the p2m type in PTE's bits. The storage of types,
which don't fit PTE bits, will be implemented separately later.
- Add detection of Svade extension to properly handle a possible page-fault
if A and D bits aren't set.
PBMT type encoding support:
- Introduces an enum pbmt_type to represent the PBMT field values.
- Maps p2m_mmio_direct_io to PTE_PBMT_IO; other types keep the default
PBMT encoding (pbmt_pma).
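As an illustrative sketch only (not the code in this patch), the PBMT values can be placed into a PTE as follows. It assumes the Svpbmt layout, where the field occupies PTE bits 62:61 and the encodings 0..3 mean PMA, NC, IO and reserved; the ex_/EX_ names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Same ordering as the PBMT encodings: 0 = PMA, 1 = NC, 2 = IO, 3 = reserved. */
enum ex_pbmt_type {
    ex_pbmt_pma,
    ex_pbmt_nc,
    ex_pbmt_io,
    ex_pbmt_rsvd,
};

#define EX_PTE_PBMT_SHIFT 61u
#define EX_PTE_PBMT_MASK  (UINT64_C(3) << EX_PTE_PBMT_SHIFT)

/* Replace the PBMT field of a PTE with the given memory type. */
static uint64_t ex_pte_set_pbmt(uint64_t pte, enum ex_pbmt_type t)
{
    assert(t != ex_pbmt_rsvd); /* the reserved encoding must not be used */

    return (pte & ~EX_PTE_PBMT_MASK) | ((uint64_t)t << EX_PTE_PBMT_SHIFT);
}

/* Read the memory type back from a PTE. */
static enum ex_pbmt_type ex_pte_get_pbmt(uint64_t pte)
{
    return (enum ex_pbmt_type)((pte & EX_PTE_PBMT_MASK) >> EX_PTE_PBMT_SHIFT);
}
```

With this layout an I/O mapping sets bit 62, and leaving the field at 0 selects the default PMA attributes.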
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Moved setting of p2m_mmio_direct_io inside (!is_table) case in p2m_pte_from_mfn().
- Extend comment about the place of setting A/D bits with explanation
why it is done in this way for now.
---
Changes in V4:
- p2m_set_permission() updates:
- Update permissions for p2m_ram_rw case, make it also executable.
- Add permissions setting for p2m_map_foreign_* types.
- Drop setting permissions for p2m_ext_storage.
- Only turn off PTE_VALID bit for p2m_invalid, don't touch other bits.
- p2m_pte_from_mfn() updates:
- Update ASSERT(), add a check that mfn isn't INVALID_MFN (1)
explicitly to avoid the case when PADDR_MASK isn't narrow enough to
catch the case (1).
- Drop unnecessary check around call of p2m_set_type() as this check
is already included inside p2m_set_type().
- Introduce new p2m type p2m_first_external to detect that passed type
is stored in external storage.
- Add handling of PTE's A and D bits in p2m_set_permission(). Also, set
PTE_USER bit. For this, cpufeature.{h,c} were updated to be able
to detect availability of Svade extension.
- Drop grant table related code as it isn't going to be used at the moment.
---
Changes in V3:
- s/p2m_entry_from_mfn/p2m_pte_from_mfn.
- s/pbmt_type_t/pbmt_type.
- s/pbmt_max/pbmt_count.
- s/p2m_type_radix_set/p2m_set_type.
- Rework p2m_set_type() to handle only types which fit into the PTE's bits.
Other types will be covered separately.
Update arguments of p2m_set_type(): there is no longer any reason to pass p2m.
- p2m_set_permission() updates:
- Update the code in p2m_set_permission() for cases p2m_ram_rw and
p2m_mmio_direct_io to set proper type permissions.
- Add cases for p2m_grant_map_rw and p2m_grant_map_ro.
- Use ASSERT_UNREACHABLE() instead of BUG() in switch cases of
p2m_set_permission().
- Add blank lines between non-fall-through case blocks in switch cases.
- Set MFN before permissions are set in p2m_pte_from_mfn().
- Update prototype of p2m_entry_from_mfn().
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
---
xen/arch/riscv/cpufeature.c | 1 +
xen/arch/riscv/include/asm/cpufeature.h | 1 +
xen/arch/riscv/include/asm/page.h | 8 ++
xen/arch/riscv/p2m.c | 112 +++++++++++++++++++++++-
4 files changed, 118 insertions(+), 4 deletions(-)
diff --git a/xen/arch/riscv/cpufeature.c b/xen/arch/riscv/cpufeature.c
index b846a106a3..02b68aeaa4 100644
--- a/xen/arch/riscv/cpufeature.c
+++ b/xen/arch/riscv/cpufeature.c
@@ -138,6 +138,7 @@ const struct riscv_isa_ext_data __initconst riscv_isa_ext[] = {
RISCV_ISA_EXT_DATA(zbs),
RISCV_ISA_EXT_DATA(smaia),
RISCV_ISA_EXT_DATA(ssaia),
+ RISCV_ISA_EXT_DATA(svade),
RISCV_ISA_EXT_DATA(svpbmt),
};
diff --git a/xen/arch/riscv/include/asm/cpufeature.h b/xen/arch/riscv/include/asm/cpufeature.h
index 768b84b769..5f756c76db 100644
--- a/xen/arch/riscv/include/asm/cpufeature.h
+++ b/xen/arch/riscv/include/asm/cpufeature.h
@@ -37,6 +37,7 @@ enum riscv_isa_ext_id {
RISCV_ISA_EXT_zbs,
RISCV_ISA_EXT_smaia,
RISCV_ISA_EXT_ssaia,
+ RISCV_ISA_EXT_svade,
RISCV_ISA_EXT_svpbmt,
RISCV_ISA_EXT_MAX
};
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
index 78e53981ac..4b6baeaaf2 100644
--- a/xen/arch/riscv/include/asm/page.h
+++ b/xen/arch/riscv/include/asm/page.h
@@ -73,6 +73,14 @@
#define PTE_SMALL BIT(10, UL)
#define PTE_POPULATE BIT(11, UL)
+enum pbmt_type {
+ pbmt_pma,
+ pbmt_nc,
+ pbmt_io,
+ pbmt_rsvd,
+ pbmt_count,
+};
+
#define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
#define PTE_PBMT_MASK (PTE_PBMT_NOCACHE | PTE_PBMT_IO)
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 71b211410b..f4658e2560 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -11,6 +11,7 @@
#include <xen/sections.h>
#include <xen/xvmalloc.h>
+#include <asm/cpufeature.h>
#include <asm/csr.h>
#include <asm/flushtlb.h>
#include <asm/paging.h>
@@ -349,6 +350,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
+static int p2m_set_type(pte_t *pte, p2m_type_t t)
+{
+ int rc = 0;
+
+ if ( t > p2m_first_external )
+ panic("unimplemented\n");
+ else
+ pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
+
+ return rc;
+}
+
static p2m_type_t p2m_get_type(const pte_t pte)
{
p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
@@ -379,11 +392,102 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
p2m_write_pte(p, pte, clean_pte);
}
-static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
+static void p2m_set_permission(pte_t *e, p2m_type_t t)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ e->pte &= ~PTE_ACCESS_MASK;
+
+ e->pte |= PTE_USER;
+
+ /*
+ * Two schemes to manage the A and D bits are defined:
+ * • The Svade extension: when a virtual page is accessed and the A bit
+ * is clear, or is written and the D bit is clear, a page-fault
+ * exception is raised.
+ * • When the Svade extension is not implemented, the following scheme
+ * applies.
+ * When a virtual page is accessed and the A bit is clear, the PTE is
+ * updated to set the A bit. When the virtual page is written and the
+ * D bit is clear, the PTE is updated to set the D bit. When G-stage
+ * address translation is in use and is not Bare, the G-stage virtual
+ * pages may be accessed or written by implicit accesses to VS-level
+ * memory management data structures, such as page tables.
+ * Therefore, to avoid a page fault when Svade is available, it is
+ * necessary to set the A and D bits.
+ *
+ * TODO: For now, it’s fine to simply set the A/D bits, since OpenSBI
+ * delegates page faults to a lower privilege mode and so OpenSBI
+ * isn't expected to handle page faults that occurred in lower modes.
+ * By setting the A/D bits here, page faults that would otherwise
+ * be generated due to unset A/D bits will not occur in Xen.
+ *
+ * Currently, Xen on RISC-V does not make use of the information
+ * that could be obtained from handling such page faults, which
+ * could otherwise be useful for several use cases such as demand
+ * paging, cache-flushing optimizations, memory access tracking, etc.
+ *
+ * To support the more general case and the optimizations mentioned
+ * above, it would be better to stop setting the A/D bits here and
+ * instead handle page faults that occur due to unset A/D bits.
+ */
+ if ( riscv_isa_extension_available(NULL, RISCV_ISA_EXT_svade) )
+ e->pte |= PTE_ACCESSED | PTE_DIRTY;
+
+ switch ( t )
+ {
+ case p2m_map_foreign_rw:
+ case p2m_mmio_direct_io:
+ e->pte |= PTE_READABLE | PTE_WRITABLE;
+ break;
+
+ case p2m_ram_rw:
+ e->pte |= PTE_ACCESS_MASK;
+ break;
+
+ case p2m_invalid:
+ e->pte &= ~PTE_VALID;
+ break;
+
+ case p2m_map_foreign_ro:
+ e->pte |= PTE_READABLE;
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ break;
+ }
+}
+
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
+{
+ pte_t e = (pte_t) { PTE_VALID };
+
+ pte_set_mfn(&e, mfn);
+
+ ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
+
+ if ( !is_table )
+ {
+ switch ( t )
+ {
+ case p2m_mmio_direct_io:
+ e.pte |= PTE_PBMT_IO;
+ break;
+
+ default:
+ break;
+ }
+
+ p2m_set_permission(&e, t);
+ p2m_set_type(&e, t);
+ }
+ else
+ /*
+ * According to the spec and table "Encoding of PTE R/W/X fields":
+ * X=W=R=0 -> Pointer to next level of page table.
+ */
+ e.pte &= ~PTE_ACCESS_MASK;
- return (pte_t) { .pte = 0 };
+ return e;
}
#define P2M_TABLE_MAP_NONE 0
@@ -638,7 +742,7 @@ static int p2m_set_entry(struct p2m_domain *p2m,
p2m_clean_pte(entry, p2m->clean_dcache);
else
{
- pte_t pte = p2m_pte_from_mfn(mfn, t);
+ pte_t pte = p2m_pte_from_mfn(mfn, t, false);
p2m_write_pte(entry, pte, p2m->clean_dcache);
--
2.51.0
* Re: [for 4.22 v5 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-10-20 15:57 ` [for 4.22 v5 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
@ 2025-11-10 16:22 ` Jan Beulich
2025-11-17 12:12 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-10 16:22 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> This patch adds the initial logic for constructing PTEs from MFNs in the RISC-V
> p2m subsystem. It includes:
> - Implementation of p2m_pte_from_mfn(): Generates a valid PTE using the
> given MFN, p2m_type_t, including permission encoding and PBMT attribute
> setup.
> - New helper p2m_set_permission(): Encodes access rights (r, w, x) into the
> PTE based on both p2m type and access permissions.
> - p2m_set_type(): Stores the p2m type in PTE's bits. The storage of types,
> which don't fit PTE bits, will be implemented separately later.
> - Add detection of Svade extension to properly handle a possible page-fault
> if A and D bits aren't set.
>
> PBMT type encoding support:
> - Introduces an enum pbmt_type_t to represent the PBMT field values.
> - Maps types like p2m_mmio_direct_dev to p2m_mmio_direct_io, others default
> to pbmt_pma.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
> Changes in V5:
> - Moved setting of p2m_mmio_direct_io inside (!is_table) case in p2m_pte_from_mfn().
> - Extend comment about the place of setting A/D bits with explanation
> why it is done in this way for now.
> ---
> Changes in V4:
> - p2m_set_permission() updates:
> - Update permissions for p2m_ram_rw case, make it also executable.
> - Add pernissions setting for p2m_map_foreign_* types.
> - Drop setting peromissions for p2m_ext_storage.
> - Only turn off PTE_VALID bit for p2m_invalid, don't touch other bits.
> - p2m_pte_from_mfn() updates:
> - Update ASSERT(), add a check that mfn isn't INVALID_MFN (1)
> explicitly to avoid the case when PADDR_MASK isn't narrow enough to
> catch the case (1).
> - Drop unnessary check around call of p2m_set_type() as this check
> is already included inside p2m_set_type().
> - Introduce new p2m type p2m_first_external to detect that passed type
> is stored in external storage.
> - Add handling of PTE's A and D bits in pm2_set_permission. Also, set
> PTE_USER bit. For this cpufeatures.{h and c} were updated to be able
> to detect availability of Svade extension.
> - Drop grant table related code as it isn't going to be used at the moment.
> ---
> Changes in V3:
> - s/p2m_entry_from_mfn/p2m_pte_from_mfn.
> - s/pbmt_type_t/pbmt_type.
> - s/pbmt_max/pbmt_count.
> - s/p2m_type_radix_set/p2m_set_type.
> - Rework p2m_set_type() to handle only types which are fited into PTEs bits.
> Other types will be covered separately.
> Update arguments of p2m_set_type(): there is no any reason for p2m anymore.
> - p2m_set_permissions() updates:
> - Update the code in p2m_set_permission() for cases p2m_raw_rw and
> p2m_mmio_direct_io to set proper type permissions.
> - Add cases for p2m_grant_map_rw and p2m_grant_map_ro.
> - Use ASSERT_UNREACHABLE() instead of BUG() in switch cases of
> p2m_set_permissions.
> - Add blank lines between non-fall-through case blocks in switch cases.
> - Set MFN before permissions are set in p2m_pte_from_mfn().
> - Update prototype of p2m_entry_from_mfn().
> ---
> Changes in V2:
> - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
> functionality" which was split into smaller ones.
> ---
> xen/arch/riscv/cpufeature.c | 1 +
> xen/arch/riscv/include/asm/cpufeature.h | 1 +
> xen/arch/riscv/include/asm/page.h | 8 ++
> xen/arch/riscv/p2m.c | 112 +++++++++++++++++++++++-
> 4 files changed, 118 insertions(+), 4 deletions(-)
>
> diff --git a/xen/arch/riscv/cpufeature.c b/xen/arch/riscv/cpufeature.c
> index b846a106a3..02b68aeaa4 100644
> --- a/xen/arch/riscv/cpufeature.c
> +++ b/xen/arch/riscv/cpufeature.c
> @@ -138,6 +138,7 @@ const struct riscv_isa_ext_data __initconst riscv_isa_ext[] = {
> RISCV_ISA_EXT_DATA(zbs),
> RISCV_ISA_EXT_DATA(smaia),
> RISCV_ISA_EXT_DATA(ssaia),
> + RISCV_ISA_EXT_DATA(svade),
> RISCV_ISA_EXT_DATA(svpbmt),
> };
>
> diff --git a/xen/arch/riscv/include/asm/cpufeature.h b/xen/arch/riscv/include/asm/cpufeature.h
> index 768b84b769..5f756c76db 100644
> --- a/xen/arch/riscv/include/asm/cpufeature.h
> +++ b/xen/arch/riscv/include/asm/cpufeature.h
> @@ -37,6 +37,7 @@ enum riscv_isa_ext_id {
> RISCV_ISA_EXT_zbs,
> RISCV_ISA_EXT_smaia,
> RISCV_ISA_EXT_ssaia,
> + RISCV_ISA_EXT_svade,
> RISCV_ISA_EXT_svpbmt,
> RISCV_ISA_EXT_MAX
> };
> diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
> index 78e53981ac..4b6baeaaf2 100644
> --- a/xen/arch/riscv/include/asm/page.h
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -73,6 +73,14 @@
> #define PTE_SMALL BIT(10, UL)
> #define PTE_POPULATE BIT(11, UL)
>
> +enum pbmt_type {
> + pbmt_pma,
> + pbmt_nc,
> + pbmt_io,
> + pbmt_rsvd,
> + pbmt_count,
> +};
> +
> #define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
>
> #define PTE_PBMT_MASK (PTE_PBMT_NOCACHE | PTE_PBMT_IO)
> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
> index 71b211410b..f4658e2560 100644
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -11,6 +11,7 @@
> #include <xen/sections.h>
> #include <xen/xvmalloc.h>
>
> +#include <asm/cpufeature.h>
> #include <asm/csr.h>
> #include <asm/flushtlb.h>
> #include <asm/paging.h>
> @@ -349,6 +350,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
> return __map_domain_page(p2m->root + root_table_indx);
> }
>
> +static int p2m_set_type(pte_t *pte, p2m_type_t t)
> +{
> + int rc = 0;
> +
> + if ( t > p2m_first_external )
> + panic("unimplemeted\n");
> + else
> + pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
> +
> + return rc;
> +}
> +
> static p2m_type_t p2m_get_type(const pte_t pte)
> {
> p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
> @@ -379,11 +392,102 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
> p2m_write_pte(p, pte, clean_pte);
> }
>
> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
> {
> - panic("%s: hasn't been implemented yet\n", __func__);
> + e->pte &= ~PTE_ACCESS_MASK;
> +
> + e->pte |= PTE_USER;
> +
> + /*
> + * Two schemes to manage the A and D bits are defined:
> + * • The Svade extension: when a virtual page is accessed and the A bit
> + * is clear, or is written and the D bit is clear, a page-fault
> + * exception is raised.
> + * • When the Svade extension is not implemented, the following scheme
> + * applies.
> + * When a virtual page is accessed and the A bit is clear, the PTE is
> + * updated to set the A bit. When the virtual page is written and the
> + * D bit is clear, the PTE is updated to set the D bit. When G-stage
> + * address translation is in use and is not Bare, the G-stage virtual
> + * pages may be accessed or written by implicit accesses to VS-level
> + * memory management data structures, such as page tables.
Can you point me at the part of the spec where this behavior is described? If
things indeed work like this, ...
> + * Thereby to avoid a page-fault in case of Svade is available, it is
> + * necesssary to set A and D bits.
... I'd then agree with the "necessary" here. (Nit: note the extra 's' in your
spelling.)
Jan
* Re: [for 4.22 v5 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration
2025-11-10 16:22 ` Jan Beulich
@ 2025-11-17 12:12 ` Oleksii Kurochko
0 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-17 12:12 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/10/25 5:22 PM, Jan Beulich wrote:
> On 20.10.2025 17:57, Oleksii Kurochko wrote:
>> This patch adds the initial logic for constructing PTEs from MFNs in the RISC-V
>> p2m subsystem. It includes:
>> - Implementation of p2m_pte_from_mfn(): Generates a valid PTE using the
>> given MFN, p2m_type_t, including permission encoding and PBMT attribute
>> setup.
>> - New helper p2m_set_permission(): Encodes access rights (r, w, x) into the
>> PTE based on both p2m type and access permissions.
>> - p2m_set_type(): Stores the p2m type in PTE's bits. The storage of types,
>> which don't fit PTE bits, will be implemented separately later.
>> - Add detection of Svade extension to properly handle a possible page-fault
>> if A and D bits aren't set.
>>
>> PBMT type encoding support:
>> - Introduces an enum pbmt_type_t to represent the PBMT field values.
>> - Maps types like p2m_mmio_direct_dev to p2m_mmio_direct_io, others default
>> to pbmt_pma.
>>
>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
>> ---
>> Changes in V5:
>> - Moved setting of p2m_mmio_direct_io inside (!is_table) case in p2m_pte_from_mfn().
>> - Extend comment about the place of setting A/D bits with explanation
>> why it is done in this way for now.
>> ---
>> Changes in V4:
>> - p2m_set_permission() updates:
>> - Update permissions for p2m_ram_rw case, make it also executable.
>> - Add permissions setting for p2m_map_foreign_* types.
>> - Drop setting permissions for p2m_ext_storage.
>> - Only turn off PTE_VALID bit for p2m_invalid, don't touch other bits.
>> - p2m_pte_from_mfn() updates:
>> - Update ASSERT(), add a check that mfn isn't INVALID_MFN (1)
>> explicitly to avoid the case when PADDR_MASK isn't narrow enough to
>> catch the case (1).
>> - Drop unnecessary check around call of p2m_set_type() as this check
>> is already included inside p2m_set_type().
>> - Introduce new p2m type p2m_first_external to detect that passed type
>> is stored in external storage.
>> - Add handling of PTE's A and D bits in p2m_set_permission. Also, set
>> PTE_USER bit. For this cpufeatures.{h and c} were updated to be able
>> to detect availability of Svade extension.
>> - Drop grant table related code as it isn't going to be used at the moment.
>> ---
>> Changes in V3:
>> - s/p2m_entry_from_mfn/p2m_pte_from_mfn.
>> - s/pbmt_type_t/pbmt_type.
>> - s/pbmt_max/pbmt_count.
>> - s/p2m_type_radix_set/p2m_set_type.
>> - Rework p2m_set_type() to handle only types which fit into the PTE's bits.
>> Other types will be covered separately.
>> Update arguments of p2m_set_type(): there is no longer any reason to pass p2m.
>> - p2m_set_permissions() updates:
>> - Update the code in p2m_set_permission() for cases p2m_raw_rw and
>> p2m_mmio_direct_io to set proper type permissions.
>> - Add cases for p2m_grant_map_rw and p2m_grant_map_ro.
>> - Use ASSERT_UNREACHABLE() instead of BUG() in switch cases of
>> p2m_set_permissions.
>> - Add blank lines between non-fall-through case blocks in switch cases.
>> - Set MFN before permissions are set in p2m_pte_from_mfn().
>> - Update prototype of p2m_entry_from_mfn().
>> ---
>> Changes in V2:
>> - New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
>> functionality" which was split into smaller ones.
>> ---
>> xen/arch/riscv/cpufeature.c | 1 +
>> xen/arch/riscv/include/asm/cpufeature.h | 1 +
>> xen/arch/riscv/include/asm/page.h | 8 ++
>> xen/arch/riscv/p2m.c | 112 +++++++++++++++++++++++-
>> 4 files changed, 118 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/riscv/cpufeature.c b/xen/arch/riscv/cpufeature.c
>> index b846a106a3..02b68aeaa4 100644
>> --- a/xen/arch/riscv/cpufeature.c
>> +++ b/xen/arch/riscv/cpufeature.c
>> @@ -138,6 +138,7 @@ const struct riscv_isa_ext_data __initconst riscv_isa_ext[] = {
>> RISCV_ISA_EXT_DATA(zbs),
>> RISCV_ISA_EXT_DATA(smaia),
>> RISCV_ISA_EXT_DATA(ssaia),
>> + RISCV_ISA_EXT_DATA(svade),
>> RISCV_ISA_EXT_DATA(svpbmt),
>> };
>>
>> diff --git a/xen/arch/riscv/include/asm/cpufeature.h b/xen/arch/riscv/include/asm/cpufeature.h
>> index 768b84b769..5f756c76db 100644
>> --- a/xen/arch/riscv/include/asm/cpufeature.h
>> +++ b/xen/arch/riscv/include/asm/cpufeature.h
>> @@ -37,6 +37,7 @@ enum riscv_isa_ext_id {
>> RISCV_ISA_EXT_zbs,
>> RISCV_ISA_EXT_smaia,
>> RISCV_ISA_EXT_ssaia,
>> + RISCV_ISA_EXT_svade,
>> RISCV_ISA_EXT_svpbmt,
>> RISCV_ISA_EXT_MAX
>> };
>> diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
>> index 78e53981ac..4b6baeaaf2 100644
>> --- a/xen/arch/riscv/include/asm/page.h
>> +++ b/xen/arch/riscv/include/asm/page.h
>> @@ -73,6 +73,14 @@
>> #define PTE_SMALL BIT(10, UL)
>> #define PTE_POPULATE BIT(11, UL)
>>
>> +enum pbmt_type {
>> + pbmt_pma,
>> + pbmt_nc,
>> + pbmt_io,
>> + pbmt_rsvd,
>> + pbmt_count,
>> +};
>> +
>> #define PTE_ACCESS_MASK (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
>>
>> #define PTE_PBMT_MASK (PTE_PBMT_NOCACHE | PTE_PBMT_IO)
>> diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
>> index 71b211410b..f4658e2560 100644
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -11,6 +11,7 @@
>> #include <xen/sections.h>
>> #include <xen/xvmalloc.h>
>>
>> +#include <asm/cpufeature.h>
>> #include <asm/csr.h>
>> #include <asm/flushtlb.h>
>> #include <asm/paging.h>
>> @@ -349,6 +350,18 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
>> return __map_domain_page(p2m->root + root_table_indx);
>> }
>>
>> +static int p2m_set_type(pte_t *pte, p2m_type_t t)
>> +{
>> + int rc = 0;
>> +
>> + if ( t > p2m_first_external )
>> + panic("unimplemeted\n");
>> + else
>> + pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>> +
>> + return rc;
>> +}
>> +
>> static p2m_type_t p2m_get_type(const pte_t pte)
>> {
>> p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
>> @@ -379,11 +392,102 @@ static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
>> p2m_write_pte(p, pte, clean_pte);
>> }
>>
>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t)
>> +static void p2m_set_permission(pte_t *e, p2m_type_t t)
>> {
>> - panic("%s: hasn't been implemented yet\n", __func__);
>> + e->pte &= ~PTE_ACCESS_MASK;
>> +
>> + e->pte |= PTE_USER;
>> +
>> + /*
>> + * Two schemes to manage the A and D bits are defined:
>> + * • The Svade extension: when a virtual page is accessed and the A bit
>> + * is clear, or is written and the D bit is clear, a page-fault
>> + * exception is raised.
>> + * • When the Svade extension is not implemented, the following scheme
>> + * applies.
>> + * When a virtual page is accessed and the A bit is clear, the PTE is
>> + * updated to set the A bit. When the virtual page is written and the
>> + * D bit is clear, the PTE is updated to set the D bit. When G-stage
>> + * address translation is in use and is not Bare, the G-stage virtual
>> + * pages may be accessed or written by implicit accesses to VS-level
>> + * memory management data structures, such as page tables.
> Can you point me at the part of the spec where this behavior is described?
Sure, it is mentioned here:
https://github.com/riscv/riscv-isa-manual/blob/98ea4b5a409456ee28749748e1eafa8533c463bd/src/supervisor.adoc?plain=1#L1453
> If
> things indeed work like this, ...
>
>> + * Thereby to avoid a page-fault in case of Svade is available, it is
>> + * necesssary to set A and D bits.
> ... I'd then agree with the "necessary" here. (Nit: note the extra 's' in your
> spelling.)
Oh, right.
Thanks.
~ Oleksii
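
For readers following the A/D-bit discussion above, here is a minimal standalone sketch (not code from the patch) of pre-setting the A and D bits on a leaf PTE so that hardware following the Svade scheme does not raise a page fault on first access or first write. The bit positions are those of the RISC-V privileged specification; the sketch_* helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PTE flag bits as laid out by the RISC-V privileged specification. */
#define PTE_VALID      (UINT64_C(1) << 0)
#define PTE_READABLE   (UINT64_C(1) << 1)
#define PTE_WRITABLE   (UINT64_C(1) << 2)
#define PTE_EXECUTABLE (UINT64_C(1) << 3)
#define PTE_USER       (UINT64_C(1) << 4)
#define PTE_ACCESSED   (UINT64_C(1) << 6)
#define PTE_DIRTY      (UINT64_C(1) << 7)

/* Hypothetical helper: a leaf PTE is valid and has at least one of R/W/X. */
static bool sketch_pte_is_leaf(uint64_t pte)
{
    return (pte & PTE_VALID) &&
           (pte & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE));
}

/*
 * Hypothetical helper: pre-set A and D on leaf entries only. Non-leaf
 * (table) entries are returned unchanged, since A and D are reserved there.
 */
static uint64_t sketch_preset_ad(uint64_t pte)
{
    if ( sketch_pte_is_leaf(pte) )
        pte |= PTE_ACCESSED | PTE_DIRTY;

    return pte;
}
```

With both bits already set, neither of the two schemes quoted in the comment has anything left to do for the entry: Svade has no reason to fault and the hardware-update scheme has nothing to write back.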
* [for 4.22 v5 13/18] xen/riscv: implement p2m_next_level()
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (11 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 12/18] xen/riscv: Implement p2m_pte_from_mfn() and support PBMT configuration Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-11-10 16:25 ` Jan Beulich
2025-10-20 15:57 ` [for 4.22 v5 14/18] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
` (4 subsequent siblings)
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement the p2m_next_level() function, which enables traversal and dynamic
allocation of intermediate levels (if necessary) in the RISC-V
p2m (physical-to-machine) page table hierarchy.
To support this, the following helpers are introduced:
- page_to_p2m_table(): Constructs non-leaf PTEs pointing to next-level page
tables with correct attributes.
- p2m_alloc_page(): Allocates page table pages, supporting both hardware and
guest domains.
- p2m_create_table(): Allocates and initializes a new page table page and
installs it into the hierarchy.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Drop more stray blanks after * in declarations of functions.
- Correct the comment above p2m_create_table() as metadata pages aren't
allocated anymore in this function.
- Move call of clear_and_clean_page(page, p2m->clean_dcache); from
p2m_create_table() to p2m_alloc_page().
- Drop ACCESS_ONCE() in paging_alloc_page().
---
Changes in V4:
- make `page` argument of page_to_p2m_table pointer-to-const.
- Move p2m_next_level()'s local variable `ret` to the more narrow space where
it is really used.
- Drop stale ASSERT() in p2m_next_level().
- Drop stray blank after * in declaration of paging_alloc_page().
- Decrease p2m_freelist.total_pages when a page is taken from the p2m freelist.
---
Changes in V3:
- s/p2me_is_mapping/p2m_is_mapping to be in sync with other p2m_is_*() functions.
- clear_and_clean_page() in p2m_create_table() instead of clear_page() to be
sure that page is cleared and d-cache is flushed for it.
- Move ASSERT(level != 0) in p2m_next_level() ahead of trying to allocate a
page table.
- Update p2m_create_table() to allocate metadata page to store p2m type in it
for each entry of page table.
- Introduce paging_alloc_page() and use it inside p2m_alloc_page().
- Add allocated page to p2m->pages list in p2m_alloc_page() to simplify
a caller code a little bit.
- Drop p2m_is_mapping() and use pte_is_mapping() instead as P2M PTE's valid
bit doesn't have another purpose anymore.
- Update the implementation and prototype of page_to_p2m_table(); it is enough
to pass only a page as an argument.
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- s/p2m_is_mapping/p2m_is_mapping.
---
xen/arch/riscv/include/asm/paging.h | 2 +
xen/arch/riscv/p2m.c | 77 ++++++++++++++++++++++++++++-
xen/arch/riscv/paging.c | 12 +++++
3 files changed, 89 insertions(+), 2 deletions(-)
diff --git a/xen/arch/riscv/include/asm/paging.h b/xen/arch/riscv/include/asm/paging.h
index fe462be223..c1d225d02b 100644
--- a/xen/arch/riscv/include/asm/paging.h
+++ b/xen/arch/riscv/include/asm/paging.h
@@ -15,4 +15,6 @@ int paging_refill_from_domheap(struct domain *d, unsigned int nr_pages);
void paging_free_page(struct domain *d, struct page_info *pg);
+struct page_info *paging_alloc_page(struct domain *d);
+
#endif /* ASM_RISCV_PAGING_H */
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index f4658e2560..6018cac336 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -350,6 +350,19 @@ static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
return __map_domain_page(p2m->root + root_table_indx);
}
+static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
+{
+ struct page_info *pg = paging_alloc_page(p2m->domain);
+
+ if ( pg )
+ {
+ page_list_add(pg, &p2m->pages);
+ clear_and_clean_page(pg, p2m->clean_dcache);
+ }
+
+ return pg;
+}
+
static int p2m_set_type(pte_t *pte, p2m_type_t t)
{
int rc = 0;
@@ -490,6 +503,33 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
return e;
}
+/* Generate table entry with correct attributes. */
+static pte_t page_to_p2m_table(const struct page_info *page)
+{
+ /*
+ * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
+ * set to true and p2m_type_t shouldn't be applied for PTEs which
+ * describe an intermidiate table.
+ */
+ return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
+}
+
+/* Allocate a new page table page and hook it in via the given entry. */
+static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
+{
+ struct page_info *page;
+
+ ASSERT(!pte_is_valid(*entry));
+
+ page = p2m_alloc_page(p2m);
+ if ( page == NULL )
+ return -ENOMEM;
+
+ p2m_write_pte(entry, page_to_p2m_table(page), p2m->clean_dcache);
+
+ return 0;
+}
+
#define P2M_TABLE_MAP_NONE 0
#define P2M_TABLE_MAP_NOMEM 1
#define P2M_TABLE_SUPER_PAGE 2
@@ -514,9 +554,42 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
unsigned int level, pte_t **table,
unsigned int offset)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ pte_t *entry;
+ mfn_t mfn;
+
+ /* The function p2m_next_level() is never called at the last level */
+ ASSERT(level != 0);
+
+ entry = *table + offset;
+
+ if ( !pte_is_valid(*entry) )
+ {
+ int ret;
+
+ if ( !alloc_tbl )
+ return P2M_TABLE_MAP_NONE;
+
+ ret = p2m_create_table(p2m, entry);
+ if ( ret )
+ return P2M_TABLE_MAP_NOMEM;
+ }
+
+ if ( pte_is_mapping(*entry) )
+ return P2M_TABLE_SUPER_PAGE;
+
+ mfn = mfn_from_pte(*entry);
+
+ unmap_domain_page(*table);
+
+ /*
+ * TODO: There's an inefficiency here:
+ * In p2m_create_table(), the page is mapped to clear it.
+ * Then that mapping is torn down in p2m_create_table(),
+ * only to be re-established here.
+ */
+ *table = map_domain_page(mfn);
- return P2M_TABLE_MAP_NONE;
+ return P2M_TABLE_NORMAL;
}
static void p2m_put_foreign_page(struct page_info *pg)
diff --git a/xen/arch/riscv/paging.c b/xen/arch/riscv/paging.c
index 773c737ab5..162557dec4 100644
--- a/xen/arch/riscv/paging.c
+++ b/xen/arch/riscv/paging.c
@@ -117,6 +117,18 @@ void paging_free_page(struct domain *d, struct page_info *pg)
spin_unlock(&d->arch.paging.lock);
}
+struct page_info *paging_alloc_page(struct domain *d)
+{
+ struct page_info *pg;
+
+ spin_lock(&d->arch.paging.lock);
+ pg = page_list_remove_head(&d->arch.paging.freelist);
+ d->arch.paging.total_pages--;
+ spin_unlock(&d->arch.paging.lock);
+
+ return pg;
+}
+
/* Domain paging struct initialization. */
int paging_domain_init(struct domain *d)
{
--
2.51.0
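
The control flow of p2m_next_level() in the patch above can be summarised by a small standalone model. This is only a sketch under stated assumptions: the sk_* names are hypothetical, the entry is passed by value, table allocation is reduced to a boolean, and the numeric value chosen for the "normal table" result is an assumption (only the first three P2M_TABLE_* values are visible in the diff).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PTE_VALID (UINT64_C(1) << 0)
#define SK_PTE_RWX   (UINT64_C(7) << 1)   /* R, W and X bits */

/* The first three values mirror the patch; SK_NORMAL's value is assumed. */
enum { SK_MAP_NONE = 0, SK_MAP_NOMEM = 1, SK_SUPER_PAGE = 2, SK_NORMAL = 3 };

/*
 * Model of the decision made for one entry while walking down a level:
 *  - invalid entry, no allocation requested  -> nothing mapped
 *  - invalid entry, allocation fails         -> out of memory
 *  - valid entry with any of R/W/X set       -> leaf (superpage) mapping
 *  - valid entry with R=W=X=0                -> pointer to next-level table
 */
static int sk_next_level(uint64_t entry, bool alloc_tbl, bool alloc_ok)
{
    if ( !(entry & SK_PTE_VALID) )
    {
        if ( !alloc_tbl )
            return SK_MAP_NONE;

        if ( !alloc_ok )
            return SK_MAP_NOMEM;

        /* Pretend a freshly cleared table was installed in the entry. */
        entry = SK_PTE_VALID;
    }

    if ( entry & SK_PTE_RWX )
        return SK_SUPER_PAGE;

    return SK_NORMAL;
}
```

A newly created table entry has R=W=X=0, so the allocation path always falls through to the "normal table" result, which matches the order of checks in the patch.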
* Re: [for 4.22 v5 13/18] xen/riscv: implement p2m_next_level()
2025-10-20 15:57 ` [for 4.22 v5 13/18] xen/riscv: implement p2m_next_level() Oleksii Kurochko
@ 2025-11-10 16:25 ` Jan Beulich
0 siblings, 0 replies; 54+ messages in thread
From: Jan Beulich @ 2025-11-10 16:25 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:57, Oleksii Kurochko wrote:
> Implement the p2m_next_level() function, which enables traversal and dynamic
> allocation of intermediate levels (if necessary) in the RISC-V
> p2m (physical-to-machine) page table hierarchy.
>
> To support this, the following helpers are introduced:
> - page_to_p2m_table(): Constructs non-leaf PTEs pointing to next-level page
> tables with correct attributes.
> - p2m_alloc_page(): Allocates page table pages, supporting both hardware and
> guest domains.
> - p2m_create_table(): Allocates and initializes a new page table page and
> installs it into the hierarchy.
>
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Just one further nit:
> @@ -490,6 +503,33 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> return e;
> }
>
> +/* Generate table entry with correct attributes. */
> +static pte_t page_to_p2m_table(const struct page_info *page)
> +{
> + /*
> + * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
> + * set to true and p2m_type_t shouldn't be applied for PTEs which
> + * describe an intermidiate table.
intermediate
Jan
* [for 4.22 v5 14/18] xen/riscv: Implement superpage splitting for p2m mappings
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (12 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 13/18] xen/riscv: implement p2m_next_level() Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 15/18] xen/riscv: implement put_page() Oleksii Kurochko
` (3 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Add support for breaking down large memory mappings ("superpages") in the RISC-V
p2m mapping so that smaller, more precise mappings ("finer-grained entries")
can be inserted into lower levels of the page table hierarchy.
To implement that, the following is done:
- Introduce p2m_split_superpage(): Recursively shatters a superpage into
smaller page table entries down to the target level, preserving original
permissions and attributes.
- p2m_set_entry() updated to invoke superpage splitting when inserting
entries at lower levels within a superpage-mapped region.
This implementation is based on the Arm code, with modifications to the part
that follows the BBM (break-before-make) approach; some parts are simplified,
as according to the RISC-V spec:
It is permitted for multiple address-translation cache entries to co-exist
for the same address. This represents the fact that in a conventional
TLB hierarchy, it is possible for multiple entries to match a single
address if, for example, a page is upgraded to a superpage without first
clearing the original non-leaf PTE’s valid bit and executing an SFENCE.VMA
with rs1=x0, or if multiple TLBs exist in parallel at a given level of the
hierarchy. In this case, just as if an SFENCE.VMA is not executed between
a write to the memory-management tables and subsequent implicit read of the
same address: it is unpredictable whether the old non-leaf PTE or the new
leaf PTE is used, but the behavior is otherwise well defined.
In contrast to the Arm architecture, where BBM is mandatory and failing to
use it in some cases can lead to CPU instability, RISC-V guarantees
stability, and the behavior remains safe — though unpredictable in terms of
which translation will be used.
Additionally, the page table walk logic has been adjusted, as ARM uses the
opposite level numbering compared to RISC-V.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
- use next_level when p2m_split_superpage() is recursively called
instead of using "level-1".
---
Changes in V4:
- s/number of levels/level numbering in the commit message.
- s/permissions/attributes.
- Remove redundant comment in p2m_split_superpage() about page
splitting.
- Use P2M_PAGETABLE_ENTRIES as XEN_PT_ENTRIES
doesn't take into account that the G-stage root page table is
extended by 2 bits.
- Use earlier introduced P2M_LEVEL_ORDER().
---
Changes in V3:
- Move page_list_add(page, &p2m->pages) inside p2m_alloc_page().
- Use 'unsigned long' for local variable 'i' in p2m_split_superpage().
- Update the comment above if ( next_level != target ) in p2m_split_superpage().
- Reverse cycle to iterate through page table levels in p2m_set_entry().
- Update p2m_split_superpage() with the same changes which are done in the
patch "P2M: Don't try to free the existing PTE if we can't allocate a new table".
---
Changes in V2:
- New patch. It was a part of a big patch "xen/riscv: implement p2m mapping
functionality" which was split into smaller ones.
- Update the comment above the loop which creates the new page table, as
RISC-V traverses page tables in the opposite order to Arm.
- RISC-V doesn't require BBM, so there is no need for invalidating
and TLB flushing before updating the PTE.
---
xen/arch/riscv/p2m.c | 116 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 114 insertions(+), 2 deletions(-)
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 6018cac336..383047580a 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -735,7 +735,88 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
p2m_free_page(p2m, pg);
}
-/* Insert an entry in the p2m */
+static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
+ unsigned int level, unsigned int target,
+ const unsigned int *offsets)
+{
+ struct page_info *page;
+ unsigned long i;
+ pte_t pte, *table;
+ bool rv = true;
+
+ /* Convenience aliases */
+ mfn_t mfn = pte_get_mfn(*entry);
+ unsigned int next_level = level - 1;
+ unsigned int level_order = P2M_LEVEL_ORDER(next_level);
+
+ /*
+ * This should only be called with target != level and the entry is
+ * a superpage.
+ */
+ ASSERT(level > target);
+ ASSERT(pte_is_superpage(*entry, level));
+
+ page = p2m_alloc_page(p2m);
+ if ( !page )
+ {
+ /*
+ * The caller is in charge to free the sub-tree.
+ * As we didn't manage to allocate anything, just tell the
+ * caller there is nothing to free by invalidating the PTE.
+ */
+ memset(entry, 0, sizeof(*entry));
+ return false;
+ }
+
+ table = __map_domain_page(page);
+
+ for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
+ {
+ pte_t *new_entry = table + i;
+
+ /*
+ * Use the content of the superpage entry and override
+ * the necessary fields. So the correct attributes are kept.
+ */
+ pte = *entry;
+ pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+
+ write_pte(new_entry, pte);
+ }
+
+ /*
+ * Shatter superpage in the page to the level we want to make the
+ * changes.
+ * This is done outside the loop to avoid checking the offset
+ * for every entry to know whether the entry should be shattered.
+ */
+ if ( next_level != target )
+ rv = p2m_split_superpage(p2m, table + offsets[next_level],
+ next_level, target, offsets);
+
+ if ( p2m->clean_dcache )
+ clean_dcache_va_range(table, PAGE_SIZE);
+
+ /*
+ * TODO: an inefficiency here: the caller almost certainly wants to map
+ * the same page again, to update the one entry that caused the
+ * request to shatter the page.
+ */
+ unmap_domain_page(table);
+
+ /*
+ * Even if we failed, we should (according to the current implemetation
+ * of a way how sub-tree is freed if p2m_split_superpage hasn't been
+ * finished fully) install the newly allocated PTE
+ * entry.
+ * The caller will be in charge to free the sub-tree.
+ */
+ p2m_write_pte(entry, page_to_p2m_table(page), p2m->clean_dcache);
+
+ return rv;
+}
+
+/* Insert an entry in the p2m. */
static int p2m_set_entry(struct p2m_domain *p2m,
gfn_t gfn,
unsigned long page_order,
@@ -800,7 +881,38 @@ static int p2m_set_entry(struct p2m_domain *p2m,
*/
if ( level > target )
{
- panic("Shattering isn't implemented\n");
+ /* We need to split the original page. */
+ pte_t split_pte = *entry;
+
+ ASSERT(pte_is_superpage(*entry, level));
+
+ if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+ {
+ /* Free the allocated sub-tree */
+ p2m_free_subtree(p2m, split_pte, level);
+
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ p2m_write_pte(entry, split_pte, p2m->clean_dcache);
+
+ p2m->need_flush = true;
+
+ /* Then move to the level we want to make real changes */
+ for ( ; level > target; level-- )
+ {
+ rc = p2m_next_level(p2m, true, level, &table, offsets[level]);
+
+ /*
+ * The entry should be found and either be a table
+ * or a superpage if level 0 is not targeted
+ */
+ ASSERT(rc == P2M_TABLE_NORMAL ||
+ (rc == P2M_TABLE_SUPER_PAGE && target > 0));
+ }
+
+ entry = table + offsets[level];
}
/*
--
2.51.0
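
The per-entry MFN arithmetic in the splitting loop above (mfn_add(mfn, i << level_order)) can be checked in isolation with a small sketch. The assumptions are loud ones: each non-root table is taken to have 512 entries, so the order of a level is modelled as 9 * level, which stands in for P2M_LEVEL_ORDER(); the sk_* names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512 entries per non-root page table (9 index bits per level). */
#define SK_PAGETABLE_ORDER  9
#define SK_LEVEL_ORDER(lvl) ((lvl) * SK_PAGETABLE_ORDER)

/*
 * MFN of the i-th child entry created when a superpage mapped at `level`
 * (which must be > 0) is split one level down. Each child covers
 * 2^SK_LEVEL_ORDER(level - 1) base pages, so consecutive children are
 * that many MFNs apart.
 */
static uint64_t sk_child_mfn(uint64_t base_mfn, unsigned int level,
                             unsigned long i)
{
    unsigned int next_level = level - 1;

    return base_mfn + ((uint64_t)i << SK_LEVEL_ORDER(next_level));
}
```

For example, splitting a level-1 superpage yields children one MFN apart (4 KiB pages), while splitting a level-2 superpage yields children 512 MFNs apart, each of which is itself a level-1 superpage that may be split again by the recursive call.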
* [for 4.22 v5 15/18] xen/riscv: implement put_page()
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (13 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 14/18] xen/riscv: Implement superpage splitting for p2m mappings Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:57 ` [for 4.22 v5 16/18] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
` (2 subsequent siblings)
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement put_page(), as it will be used by p2m_put_*-related code.
Although CONFIG_STATIC_MEMORY has not yet been introduced for RISC-V,
a stub for PGC_static is added to avoid cluttering the code of
put_page() with #ifdefs.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Correct code style of do-while loop in put_page().
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V4:
- Update the commit message:
s/p2m_put_code/p2m_put_*-related code.
s/put_page_nr/put_page.
---
xen/arch/riscv/include/asm/mm.h | 7 +++++++
xen/arch/riscv/mm.c | 24 +++++++++++++++++++-----
2 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index dd8cdc9782..0503c92e6c 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -264,6 +264,13 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
/* Page is Xen heap? */
#define _PGC_xen_heap PG_shift(2)
#define PGC_xen_heap PG_mask(1, 2)
+#ifdef CONFIG_STATIC_MEMORY
+/* Page is static memory */
+#define _PGC_static PG_shift(3)
+#define PGC_static PG_mask(1, 3)
+#else
+#define PGC_static 0
+#endif
/* Page is broken? */
#define _PGC_broken PG_shift(7)
#define PGC_broken PG_mask(1, 7)
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 1ef015f179..2e42293986 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -362,11 +362,6 @@ unsigned long __init calc_phys_offset(void)
return phys_offset;
}
-void put_page(struct page_info *page)
-{
- BUG_ON("unimplemented");
-}
-
void arch_dump_shared_mem_info(void)
{
BUG_ON("unimplemented");
@@ -627,3 +622,22 @@ void flush_page_to_ram(unsigned long mfn, bool sync_icache)
if ( sync_icache )
invalidate_icache();
}
+
+void put_page(struct page_info *page)
+{
+ unsigned long nx, x, y = page->count_info;
+
+ do {
+ ASSERT((y & PGC_count_mask) >= 1);
+ x = y;
+ nx = x - 1;
+ } while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );
+
+ if ( unlikely((nx & PGC_count_mask) == 0) )
+ {
+ if ( unlikely(nx & PGC_static) )
+ free_domstatic_page(page);
+ else
+ free_domheap_page(page);
+ }
+}
--
2.51.0
* [for 4.22 v5 16/18] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (14 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 15/18] xen/riscv: implement put_page() Oleksii Kurochko
@ 2025-10-20 15:57 ` Oleksii Kurochko
2025-10-20 15:58 ` [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
2025-10-20 15:58 ` [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
17 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:57 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Implement the mfn_valid() macro to verify whether a given MFN is valid by
checking that it falls within the range [start_page, max_page).
These bounds are initialized based on the start and end addresses of RAM.
As part of this patch, start_page is introduced and initialized with the
PFN of the first RAM page.
Also, initialize pdx_group_valid() by calling set_pdx_range() when
memory banks are being mapped.
Also, after providing a non-stub implementation of the mfn_valid() macro,
the following compilation errors started to occur:
riscv64-linux-gnu-ld: prelink.o: in function `alloc_heap_pages':
/build/xen/common/page_alloc.c:1054: undefined reference to `page_is_offlinable'
riscv64-linux-gnu-ld: /build/xen/common/page_alloc.c:1035: undefined reference to `page_is_offlinable'
riscv64-linux-gnu-ld: prelink.o: in function `reserve_offlined_page':
/build/xen/common/page_alloc.c:1151: undefined reference to `page_is_offlinable'
riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `page_is_offlinable' isn't defined
riscv64-linux-gnu-ld: final link failed: bad value
make[2]: *** [arch/riscv/Makefile:28: xen-syms] Error 1
To resolve these errors, the following functions have also been introduced,
based on their Arm counterparts:
- page_get_owner_and_reference() and its variant to safely acquire a
reference to a page and retrieve its owner.
- Implement page_is_offlinable() to return false for RISC-V.
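The reference-taking pattern described above can be sketched outside of Xen as follows. This is only an illustration under stated assumptions: it uses C11 atomics instead of Xen's cmpxchg(), and struct page, COUNT_MASK, try_get_ref() and put_ref() are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: the low 8 bits hold the count, upper bits hold flags. */
#define COUNT_MASK 0xffUL

struct page {
    _Atomic unsigned long count_info;
};

/* Try to take one reference; fail if the page is free or the count is saturated. */
static bool try_get_ref(struct page *pg)
{
    unsigned long x = atomic_load(&pg->count_info);

    do {
        /* (x + 1) & mask is 1 when count == 0 and 0 when count == mask. */
        if ( ((x + 1) & COUNT_MASK) <= 1 )
            return false;
        /* On failure x is refreshed with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x, x + 1) );

    return true;
}

/* Drop one reference; returns true when the last reference went away. */
static bool put_ref(struct page *pg)
{
    unsigned long x = atomic_load(&pg->count_info);

    do {
        /* Dropping a reference that was never taken is a caller bug. */
        assert((x & COUNT_MASK) >= 1);
    } while ( !atomic_compare_exchange_weak(&pg->count_info, &x, x - 1) );

    return ((x - 1) & COUNT_MASK) == 0;
}
```

The single "<= 1" comparison covers both rejected cases at once, which is why the loop needs no separate overflow check.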
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes in V5:
- Move declaration/definition of page_is_offlinable() before put_page() to have
get_ and put_ functions together.
- Correct code style of do-while loop.
- Add Acked-by: Jan Beulich <jbeulich@suse.com>.
---
Changes in V4:
- Rebase the patch on top of patch series "[PATCH v2 0/2] constrain page_is_ram_type() to x86".
- Add implementation of page_is_offlinable() instead of page_is_ram().
- Update the commit message.
---
Changes in V3:
- Update definition of mfn_valid().
- Use __ro_after_init for variable start_page.
- Drop ASSERT_UNREACHABLE() in page_get_owner_and_nr_reference().
- Update the comment inside do/while in page_get_owner_and_nr_reference().
- Define _PGC_static and drop "#ifdef CONFIG_STATIC_MEMORY" in put_page_nr().
- Initialize pdx_group_valid() by calling set_pdx_range() when memory banks are mapped.
- Drop page_get_owner_and_nr_reference() and implement page_get_owner_and_reference()
without reusing of a page_get_owner_and_nr_reference() to avoid potential dead code.
- Move definition of get_page() to "xen/riscv: add support of page lookup by GFN", where
it is really used.
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/mm.h | 9 +++++++--
xen/arch/riscv/mm.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 0503c92e6c..1b16809749 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -5,6 +5,7 @@
#include <public/xen.h>
#include <xen/bug.h>
+#include <xen/compiler.h>
#include <xen/const.h>
#include <xen/mm-frame.h>
#include <xen/pdx.h>
@@ -300,8 +301,12 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
#define page_get_owner(p) (p)->v.inuse.domain
#define page_set_owner(p, d) ((p)->v.inuse.domain = (d))
-/* TODO: implement */
-#define mfn_valid(mfn) ({ (void)(mfn); 0; })
+extern unsigned long start_page;
+
+#define mfn_valid(mfn) ({ \
+ unsigned long tmp_mfn = mfn_x(mfn); \
+ likely((tmp_mfn >= start_page)) && likely(__mfn_valid(tmp_mfn)); \
+})
#define domain_set_alloc_bitsize(d) ((void)(d))
#define domain_clamp_alloc_bitsize(d, b) ((void)(d), (b))
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 2e42293986..e25f995b72 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -521,6 +521,8 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
#error setup_{directmap,frametable}_mapping() should be implemented for RV_32
#endif
+unsigned long __ro_after_init start_page;
+
/*
* Setup memory management
*
@@ -570,9 +572,13 @@ void __init setup_mm(void)
ram_end = max(ram_end, bank_end);
setup_directmap_mappings(PFN_DOWN(bank_start), PFN_DOWN(bank_size));
+
+ set_pdx_range(paddr_to_pfn(bank_start), paddr_to_pfn(bank_end));
}
setup_frametable_mappings(ram_start, ram_end);
+
+ start_page = PFN_DOWN(ram_start);
max_page = PFN_DOWN(ram_end);
}
@@ -623,6 +629,11 @@ void flush_page_to_ram(unsigned long mfn, bool sync_icache)
invalidate_icache();
}
+bool page_is_offlinable(mfn_t mfn)
+{
+ return false;
+}
+
void put_page(struct page_info *page)
{
unsigned long nx, x, y = page->count_info;
@@ -641,3 +652,24 @@ void put_page(struct page_info *page)
free_domheap_page(page);
}
}
+
+struct domain *page_get_owner_and_reference(struct page_info *page)
+{
+ unsigned long x, y = page->count_info;
+ struct domain *owner;
+
+ do {
+ x = y;
+ /*
+ * Count == 0: Page is not allocated, so we cannot take a reference.
+ * Count == -1: Reference count would wrap, which is invalid.
+ */
+ if ( unlikely(((x + 1) & PGC_count_mask) <= 1) )
+ return NULL;
+ } while ( (y = cmpxchg(&page->count_info, x, x + 1)) != x );
+
+ owner = page_get_owner(page);
+ ASSERT(owner);
+
+ return owner;
+}
--
2.51.0
* [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (15 preceding siblings ...)
2025-10-20 15:57 ` [for 4.22 v5 16/18] xen/riscv: implement mfn_valid() and page reference, ownership handling helpers Oleksii Kurochko
@ 2025-10-20 15:58 ` Oleksii Kurochko
2025-11-10 16:46 ` Jan Beulich
2025-10-20 15:58 ` [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
Introduce helper functions for safely querying the P2M (physical-to-machine)
mapping:
- add p2m_read_lock(), p2m_read_unlock(), and p2m_is_locked() for managing
P2M lock state.
- Implement p2m_get_entry() to retrieve mapping details for a given GFN,
including MFN, page order, and validity.
- Introduce p2m_get_page_from_gfn() to convert a GFN into a page_info
pointer, acquiring a reference to the page if valid.
- Introduce get_page().
Implementations are based on Arm's functions with some minor modifications:
- p2m_get_entry():
- Reverse traversal of page tables, as RISC-V uses the opposite level
numbering compared to Arm.
- Removed the return of p2m_access_t from p2m_get_entry() since
mem_access_settings is not introduced for RISC-V.
- Updated BUILD_BUG_ON() to check using the level 0 mask, which corresponds
to Arm's THIRD_MASK.
- Replaced open-coded bit shifts with the BIT() macro.
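As a rough standalone illustration of the walk direction and of the superpage handling (not the code from this patch), the sketch below assumes 9 index bits per level and ignores the two extra index bits of the G-stage root table. PAGETABLE_ORDER, LEVEL_ORDER(), level_index() and leaf_mfn() are hypothetical names.

```c
#include <assert.h>

#define PAGETABLE_ORDER 9   /* assumed: 512 entries per table */
#define LEVEL_ORDER(lvl) ((lvl) * PAGETABLE_ORDER)

/* Table index used at 'level' while walking from the root level down to 0. */
static unsigned long level_index(unsigned long gfn, unsigned int level)
{
    return (gfn >> LEVEL_ORDER(level)) & ((1UL << PAGETABLE_ORDER) - 1);
}

/* MFN backing 'gfn' when a leaf found at 'level' starts at 'base_mfn'. */
static unsigned long leaf_mfn(unsigned long base_mfn, unsigned long gfn,
                              unsigned int level)
{
    unsigned long offset_mask = (1UL << LEVEL_ORDER(level)) - 1;

    /* A leaf's base frame must be aligned to the size it maps. */
    assert(!(base_mfn & offset_mask));

    /* For a superpage, add the GFN's offset inside the mapped block. */
    return base_mfn + (gfn & offset_mask);
}
```

With these assumptions, level 0 maps a single 4k frame, so leaf_mfn() returns base_mfn unchanged there.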
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Use P2M_DECLARE_OFFSETS(), introduced in earlier patches, instead of
DECLARE_OFFSETS().
- Drop blank line before check_outside_boundary().
- Use more readable version of if statements inside check_outside_boundary().
- Accumulate mask in check_outside_boundary() instead of re-writing it for
each page table level to have correct gfns for comparison.
- Set argument `t` of p2m_get_entry() to p2m_invalid by default.
- Drop checking of (rc == P2M_TABLE_MAP_NOMEM ) when p2m_next_level(...,false,...)
is called.
- Add ASSERT(!(mfn_x(mfn) & (BIT(P2M_LEVEL_ORDER(level), UL) - 1))); in
p2m_get_entry() to be sure that the received `mfn` has its lowest bits cleared.
- Drop `valid` argument from p2m_get_entry(), it is not needed anymore.
- Drop p2m_lookup(), use p2m_get_entry() explicitly inside p2m_get_page_from_gfn().
- Update the commit message.
---
Changes in V4:
- Update prototype of p2m_is_locked() to return bool and accept pointer-to-const.
- Correct the comment above p2m_get_entry().
- Drop the check "BUILD_BUG_ON(XEN_PT_LEVEL_MAP_MASK(0) != PAGE_MASK);" inside
p2m_get_entry() as it is stale: it was needed to be sure that 4k page(s) are
used on L3 (in Arm terms), which is true for RISC-V (if no special extensions
are used). There was another reason for Arm to have it (and I copied it to RISC-V),
but it isn't true for RISC-V (some details can be found in the response to the
patch).
- Style fixes.
- Add explanatory comment what the loop inside "gfn is higher then the highest
p2m mapping" does. Move this loop to separate function check_outside_boundary()
to cover both boundaries (lower_mapped_gfn and max_mapped_gfn).
- There is no need to allocate a page table as it is expected that
p2m_get_entry() normally would be called after a corresponding p2m_set_entry()
was called. So change 'true' to 'false' in a page table walking loop inside
p2m_get_entry().
- Correct handling of p2m_is_foreign case inside p2m_get_page_from_gfn().
- Introduce and use P2M_LEVEL_MASK instead of XEN_PT_LEVEL_MASK as the latter
doesn't take into account two extra bits for root table in case of P2M.
- Drop stale item from "change in v3" - Add is_p2m_foreign() macro and connected stuff.
- Add p2m_read_(un)lock().
---
Changes in V3:
- Change struct domain *d argument of p2m_get_page_from_gfn() to
struct p2m_domain.
- Update the comment above p2m_get_entry().
- s/_t/p2mt for local variable in p2m_get_entry().
- Drop local variable addr in p2m_get_entry() and use gfn_to_gaddr(gfn)
to define offsets array.
- Code style fixes.
- Update a check of rc code from p2m_next_level() in p2m_get_entry()
and drop "else" case.
- Do not call p2m_get_type() if p2m_get_entry()'s t argument is NULL.
- Use struct p2m_domain instead of struct domain for p2m_lookup() and
p2m_get_page_from_gfn().
- Move definition of get_page() from "xen/riscv: implement mfn_valid() and page reference, ownership handling helpers"
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/p2m.h | 20 ++++
xen/arch/riscv/mm.c | 13 +++
xen/arch/riscv/p2m.c | 175 +++++++++++++++++++++++++++++++
3 files changed, 208 insertions(+)
diff --git a/xen/arch/riscv/include/asm/p2m.h b/xen/arch/riscv/include/asm/p2m.h
index 6a17cd52fc..39cfc1fd9e 100644
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -48,6 +48,8 @@ extern unsigned int gstage_root_level;
#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
+#define P2M_LEVEL_MASK(lvl) (GFN_MASK(lvl) << P2M_LEVEL_SHIFT(lvl))
+
#define paddr_bits PADDR_BITS
/* Get host p2m table */
@@ -232,6 +234,24 @@ static inline bool p2m_is_write_locked(struct p2m_domain *p2m)
unsigned long construct_hgatp(const struct p2m_domain *p2m, uint16_t vmid);
+static inline void p2m_read_lock(struct p2m_domain *p2m)
+{
+ read_lock(&p2m->lock);
+}
+
+static inline void p2m_read_unlock(struct p2m_domain *p2m)
+{
+ read_unlock(&p2m->lock);
+}
+
+static inline bool p2m_is_locked(const struct p2m_domain *p2m)
+{
+ return rw_is_locked(&p2m->lock);
+}
+
+struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
+ p2m_type_t *t);
+
#endif /* ASM__RISCV__P2M_H */
/*
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index e25f995b72..e9ce182d06 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -673,3 +673,16 @@ struct domain *page_get_owner_and_reference(struct page_info *page)
return owner;
}
+
+bool get_page(struct page_info *page, const struct domain *domain)
+{
+ const struct domain *owner = page_get_owner_and_reference(page);
+
+ if ( likely(owner == domain) )
+ return true;
+
+ if ( owner != NULL )
+ put_page(page);
+
+ return false;
+}
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 383047580a..785d11aaff 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -1049,3 +1049,178 @@ int map_regions_p2mt(struct domain *d,
return rc;
}
+
+/*
+ * p2m_get_entry() should always return the correct order value, even if an
+ * entry is not present (i.e. the GFN is outside the range):
+ * [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn]). (1)
+ *
+ * This ensures that callers of p2m_get_entry() can determine what range of
+ * address space would be altered by a corresponding p2m_set_entry().
+ * Also, it would help to avoid cost page walks for GFNs outside range (1).
+ *
+ * Therefore, this function returns true for GFNs outside range (1), and in
+ * that case the corresponding level is returned via the level_out argument.
+ * Otherwise, it returns false and p2m_get_entry() performs a page walk to
+ * find the proper entry.
+ */
+static bool check_outside_boundary(gfn_t gfn, gfn_t boundary, bool is_lower,
+ unsigned int *level_out)
+{
+ unsigned int level;
+
+ if ( is_lower ? gfn_x(gfn) < gfn_x(boundary)
+ : gfn_x(gfn) > gfn_x(boundary) )
+ {
+ unsigned long mask = 0;
+
+ for ( level = P2M_ROOT_LEVEL; level; level-- )
+ {
+ unsigned long masked_gfn;
+
+ mask |= PFN_DOWN(P2M_LEVEL_MASK(level));
+ masked_gfn = gfn_x(gfn) & mask;
+
+ if ( is_lower ? masked_gfn < gfn_x(boundary)
+ : masked_gfn > gfn_x(boundary) )
+ {
+ *level_out = level;
+ return true;
+ }
+ }
+ }
+
+ return false;
+}
+
+/*
+ * Get the details of a given gfn.
+ *
+ * If the entry is present, the associated MFN will be returned and the
+ * p2m type of the mapping.
+ * The page_order will correspond to the order of the mapping in the page
+ * table (i.e it could be a superpage).
+ *
+ * If the entry is not present, INVALID_MFN will be returned and the
+ * page_order will be set according to the order of the invalid range.
+ */
+static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
+ p2m_type_t *t,
+ unsigned int *page_order)
+{
+ unsigned int level = 0;
+ pte_t entry, *table;
+ int rc;
+ mfn_t mfn = INVALID_MFN;
+ P2M_DECLARE_OFFSETS(offsets, gfn_to_gaddr(gfn));
+
+ ASSERT(p2m_is_locked(p2m));
+
+ if ( t )
+ *t = p2m_invalid;
+
+ if ( check_outside_boundary(gfn, p2m->lowest_mapped_gfn, true, &level) )
+ goto out;
+
+ if ( check_outside_boundary(gfn, p2m->max_mapped_gfn, false, &level) )
+ goto out;
+
+ table = p2m_get_root_pointer(p2m, gfn);
+
+ /*
+ * The table should always be non-NULL because the gfn is below
+ * p2m->max_mapped_gfn and the root table pages are always present.
+ */
+ if ( !table )
+ {
+ ASSERT_UNREACHABLE();
+ level = P2M_ROOT_LEVEL;
+ goto out;
+ }
+
+ for ( level = P2M_ROOT_LEVEL; level; level-- )
+ {
+ rc = p2m_next_level(p2m, false, level, &table, offsets[level]);
+ if ( rc == P2M_TABLE_MAP_NONE )
+ goto out_unmap;
+
+ if ( rc != P2M_TABLE_NORMAL )
+ break;
+ }
+
+ entry = table[offsets[level]];
+
+ if ( pte_is_valid(entry) )
+ {
+ if ( t )
+ *t = p2m_get_type(entry);
+
+ mfn = pte_get_mfn(entry);
+
+ ASSERT(!(mfn_x(mfn) & (BIT(P2M_LEVEL_ORDER(level), UL) - 1)));
+
+ /*
+ * The entry may point to a superpage. Find the MFN associated
+ * to the GFN.
+ */
+ mfn = mfn_add(mfn,
+ gfn_x(gfn) & (BIT(P2M_LEVEL_ORDER(level), UL) - 1));
+ }
+
+ out_unmap:
+ unmap_domain_page(table);
+
+ out:
+ if ( page_order )
+ *page_order = P2M_LEVEL_ORDER(level);
+
+ return mfn;
+}
+
+struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
+ p2m_type_t *t)
+{
+ struct page_info *page;
+ p2m_type_t p2mt = p2m_invalid;
+ mfn_t mfn;
+
+ p2m_read_lock(p2m);
+ mfn = p2m_get_entry(p2m, gfn, t, NULL);
+
+ if ( !mfn_valid(mfn) )
+ {
+ p2m_read_unlock(p2m);
+ return NULL;
+ }
+
+ if ( t )
+ p2mt = *t;
+
+ page = mfn_to_page(mfn);
+
+ /*
+ * get_page won't work on foreign mapping because the page doesn't
+ * belong to the current domain.
+ */
+ if ( unlikely(p2m_is_foreign(p2mt)) )
+ {
+ const struct domain *fdom = page_get_owner_and_reference(page);
+
+ p2m_read_unlock(p2m);
+
+ if ( fdom )
+ {
+ if ( likely(fdom != p2m->domain) )
+ return page;
+
+ ASSERT_UNREACHABLE();
+ put_page(page);
+ }
+
+ return NULL;
+ }
+
+ p2m_read_unlock(p2m);
+
+ return get_page(page, p2m->domain) ? page : NULL;
+}
--
2.51.0
* Re: [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN
2025-10-20 15:58 ` [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
@ 2025-11-10 16:46 ` Jan Beulich
2025-11-17 15:52 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-10 16:46 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -1049,3 +1049,178 @@ int map_regions_p2mt(struct domain *d,
>
> return rc;
> }
> +
> +/*
> + * p2m_get_entry() should always return the correct order value, even if an
> + * entry is not present (i.e. the GFN is outside the range):
> + * [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn]). (1)
There's one closing parenthesis too many here (likely the one before the colon).
> + * This ensures that callers of p2m_get_entry() can determine what range of
> + * address space would be altered by a corresponding p2m_set_entry().
> + * Also, it would help to avoid cost page walks for GFNs outside range (1).
DYM "costly"?
> + * Therefore, this function returns true for GFNs outside range (1), and in
> + * that case the corresponding level is returned via the level_out argument.
> + * Otherwise, it returns false and p2m_get_entry() performs a page walk to
> + * find the proper entry.
> + */
> +static bool check_outside_boundary(gfn_t gfn, gfn_t boundary, bool is_lower,
> + unsigned int *level_out)
> +{
> + unsigned int level;
> +
> + if ( is_lower ? gfn_x(gfn) < gfn_x(boundary)
> + : gfn_x(gfn) > gfn_x(boundary) )
> + {
> + unsigned long mask = 0;
> +
> + for ( level = P2M_ROOT_LEVEL; level; level-- )
> + {
> + unsigned long masked_gfn;
> +
> + mask |= PFN_DOWN(P2M_LEVEL_MASK(level));
> + masked_gfn = gfn_x(gfn) & mask;
> +
> + if ( is_lower ? masked_gfn < gfn_x(boundary)
> + : masked_gfn > gfn_x(boundary) )
> + {
> + *level_out = level;
For this to be correct in the is_lower case, don't you need to fill the
bottom bits of masked_gfn with all 1s, rather than with all 0s? Otherwise
the tail of the range may be above boundary.
> +struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
> + p2m_type_t *t)
> +{
> + struct page_info *page;
> + p2m_type_t p2mt = p2m_invalid;
> + mfn_t mfn;
> +
> + p2m_read_lock(p2m);
> + mfn = p2m_get_entry(p2m, gfn, t, NULL);
> +
> + if ( !mfn_valid(mfn) )
> + {
> + p2m_read_unlock(p2m);
> + return NULL;
> + }
> +
> + if ( t )
> + p2mt = *t;
Doesn't it need to be the other way around? The way you have it, when a caller
passes NULL for t, p2m_get_entry() won't give you a type, and you'll do all
further work with p2m_invalid.
Also, might this better move ahead of the earlier if()? Callers might be able
to do still something based on the type, when they get back NULL as function
return value. (Practically this might only become of interest once you add
something like PoD, paging, or sharing.)
Jan
* Re: [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN
2025-11-10 16:46 ` Jan Beulich
@ 2025-11-17 15:52 ` Oleksii Kurochko
2025-11-17 16:00 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-17 15:52 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/10/25 5:46 PM, Jan Beulich wrote:
> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -1049,3 +1049,178 @@ int map_regions_p2mt(struct domain *d,
>>
>> return rc;
>> }
>> +
>> +/*
>> + * p2m_get_entry() should always return the correct order value, even if an
>> + * entry is not present (i.e. the GFN is outside the range):
>> + * [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn]). (1)
> There's one closing parenthesis too many here (likely the one before the colon).
You are right, ')' should be dropped. I think that "." could be dropped too.
>
>> + * This ensures that callers of p2m_get_entry() can determine what range of
>> + * address space would be altered by a corresponding p2m_set_entry().
>> + * Also, it would help to avoid cost page walks for GFNs outside range (1).
> DYM "costly"?
Agree, costly would be better here.
>
>> + * Therefore, this function returns true for GFNs outside range (1), and in
>> + * that case the corresponding level is returned via the level_out argument.
>> + * Otherwise, it returns false and p2m_get_entry() performs a page walk to
>> + * find the proper entry.
>> + */
>> +static bool check_outside_boundary(gfn_t gfn, gfn_t boundary, bool is_lower,
>> + unsigned int *level_out)
>> +{
>> + unsigned int level;
>> +
>> + if ( is_lower ? gfn_x(gfn) < gfn_x(boundary)
>> + : gfn_x(gfn) > gfn_x(boundary) )
>> + {
>> + unsigned long mask = 0;
>> +
>> + for ( level = P2M_ROOT_LEVEL; level; level-- )
>> + {
>> + unsigned long masked_gfn;
>> +
>> + mask |= PFN_DOWN(P2M_LEVEL_MASK(level));
>> + masked_gfn = gfn_x(gfn) & mask;
>> +
>> + if ( is_lower ? masked_gfn < gfn_x(boundary)
>> + : masked_gfn > gfn_x(boundary) )
>> + {
>> + *level_out = level;
> For this to be correct in the is_lower case, don't you need to fill the
> bottom bits of masked_gfn with all 1s, rather than with all 0s? Otherwise
> the tail of the range may be above boundary.
I think that I didn't get what you mean by "the range" here and so I can't understand
what is "the tail of the range".
Could you please clarify?
>
>> +struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
>> + p2m_type_t *t)
>> +{
>> + struct page_info *page;
>> + p2m_type_t p2mt = p2m_invalid;
>> + mfn_t mfn;
>> +
>> + p2m_read_lock(p2m);
>> + mfn = p2m_get_entry(p2m, gfn, t, NULL);
>> +
>> + if ( !mfn_valid(mfn) )
>> + {
>> + p2m_read_unlock(p2m);
>> + return NULL;
>> + }
>> +
>> + if ( t )
>> + p2mt = *t;
> Doesn't it need to be the other way around? The way you have it, when a caller
> passes NULL for t, p2m_get_entry() won't give you a type, and you'll do all
> further work with p2m_invalid.
IIUC, then the following should resolve the mentioned issue:
@@ -1344,11 +1344,14 @@ struct page_info *p2m_get_page_from_gfn(struct p2m_domain *p2m, gfn_t gfn,
p2m_type_t *t)
{
struct page_info *page;
- p2m_type_t p2mt = p2m_invalid;
+ p2m_type_t p2mt;
mfn_t mfn;
p2m_read_lock(p2m);
- mfn = p2m_get_entry(p2m, gfn, t, NULL);
+ mfn = p2m_get_entry(p2m, gfn, &p2mt, NULL);
>
> Also, might this better move ahead of the earlier if()? Callers might be able
> to do still something based on the type, when they get back NULL as function
> return value. (Practically this might only become of interest once you add
> something like PoD, paging, or sharing.)
Agree with that, it should be moved before "if ( !mfn_valid(mfn) )"
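The two points agreed on here (always query the type into a local, and report it to the caller before any early return) can be sketched in isolation like this. It is a hedged toy model, not the planned Xen change: type_t, lookup() and get_frame() are hypothetical, and the return codes are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_INVALID, T_RAM, T_FOREIGN } type_t;   /* hypothetical types */

/* Hypothetical lookup: fills *t only when t is non-NULL. */
static long lookup(unsigned long gfn, type_t *t)
{
    if ( t )
        *t = T_INVALID;

    if ( gfn >= 16 )            /* pretend only GFNs 0..15 are mapped */
        return -1;

    if ( t )
        *t = (gfn & 1) ? T_FOREIGN : T_RAM;

    return (long)(gfn + 100);
}

/* Returns the frame, -1 if unmapped, -2 if the mapping is foreign. */
static long get_frame(unsigned long gfn, type_t *t)
{
    type_t local;
    long mfn = lookup(gfn, &local);   /* never pass the caller's pointer */

    /* Report the type first, so callers see it even on failure. */
    if ( t )
        *t = local;

    if ( mfn < 0 )
        return -1;

    /* A successful lookup must have produced a real type. */
    assert(local != T_INVALID);

    return (local == T_FOREIGN) ? -2 : mfn;
}
```

Because the decision is made on the local, a caller passing NULL gets the same result as one passing a valid pointer.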
Thanks.
~ Oleksii
* Re: [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN
2025-11-17 15:52 ` Oleksii Kurochko
@ 2025-11-17 16:00 ` Jan Beulich
2025-11-19 17:11 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-17 16:00 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 17.11.2025 16:52, Oleksii Kurochko wrote:
> On 11/10/25 5:46 PM, Jan Beulich wrote:
>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>> +static bool check_outside_boundary(gfn_t gfn, gfn_t boundary, bool is_lower,
>>> + unsigned int *level_out)
>>> +{
>>> + unsigned int level;
>>> +
>>> + if ( is_lower ? gfn_x(gfn) < gfn_x(boundary)
>>> + : gfn_x(gfn) > gfn_x(boundary) )
>>> + {
>>> + unsigned long mask = 0;
>>> +
>>> + for ( level = P2M_ROOT_LEVEL; level; level-- )
>>> + {
>>> + unsigned long masked_gfn;
>>> +
>>> + mask |= PFN_DOWN(P2M_LEVEL_MASK(level));
>>> + masked_gfn = gfn_x(gfn) & mask;
>>> +
>>> + if ( is_lower ? masked_gfn < gfn_x(boundary)
>>> + : masked_gfn > gfn_x(boundary) )
>>> + {
>>> + *level_out = level;
>> For this to be correct in the is_lower case, don't you need to fill the
>> bottom bits of masked_gfn with all 1s, rather than with all 0s? Otherwise
>> the tail of the range may be above boundary.
>
> I think that I didn't get what you mean by "the range" here and so I can't understand
> what is "the tail of the range".
> Could you please clarify?
By applying "mask" you effectively produce a range (with "gfn" somewhere in
the middle). For the level (which you return to the caller) to be correct,
the entire range must be matching "gfn" in being below or above of the
boundary. My impression is that this isn't the case when is_lower is true.
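A small standalone sketch of this point, assuming 9 index bits per level (block_outside() and PAGETABLE_ORDER are hypothetical names, and this is not the code under review): the block selected by a level is [gfn & ~low_bits, gfn | low_bits], so the lower-boundary test has to look at the block's last GFN and the upper-boundary test at its first GFN.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGETABLE_ORDER 9   /* assumed: 512 entries per table */

/*
 * True when the whole level-sized block containing 'gfn' lies strictly
 * below 'boundary' (is_lower) or strictly above it (!is_lower).
 */
static bool block_outside(unsigned long gfn, unsigned long boundary,
                          unsigned int level, bool is_lower)
{
    unsigned long low_bits, first, last;

    /* Keep the shift count within the width of unsigned long. */
    assert(level * PAGETABLE_ORDER < 8 * sizeof(unsigned long));

    low_bits = (1UL << (level * PAGETABLE_ORDER)) - 1;
    first = gfn & ~low_bits;    /* bottom bits all 0s */
    last = gfn | low_bits;      /* bottom bits all 1s */

    return is_lower ? last < boundary : first > boundary;
}
```

For example, with boundary 0x300 and gfn 0x2ff, the level 1 block spans 0x200..0x3ff and crosses the boundary, so only level 0 may be reported.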
Jan
* Re: [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN
2025-11-17 16:00 ` Jan Beulich
@ 2025-11-19 17:11 ` Oleksii Kurochko
0 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-19 17:11 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/17/25 5:00 PM, Jan Beulich wrote:
> On 17.11.2025 16:52, Oleksii Kurochko wrote:
>> On 11/10/25 5:46 PM, Jan Beulich wrote:
>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>> +static bool check_outside_boundary(gfn_t gfn, gfn_t boundary, bool is_lower,
>>>> + unsigned int *level_out)
>>>> +{
>>>> + unsigned int level;
>>>> +
>>>> + if ( is_lower ? gfn_x(gfn) < gfn_x(boundary)
>>>> + : gfn_x(gfn) > gfn_x(boundary) )
>>>> + {
>>>> + unsigned long mask = 0;
>>>> +
>>>> + for ( level = P2M_ROOT_LEVEL; level; level-- )
>>>> + {
>>>> + unsigned long masked_gfn;
>>>> +
>>>> + mask |= PFN_DOWN(P2M_LEVEL_MASK(level));
>>>> + masked_gfn = gfn_x(gfn) & mask;
>>>> +
>>>> + if ( is_lower ? masked_gfn < gfn_x(boundary)
>>>> + : masked_gfn > gfn_x(boundary) )
>>>> + {
>>>> + *level_out = level;
>>> For this to be correct in the is_lower case, don't you need to fill the
>>> bottom bits of masked_gfn with all 1s, rather than with all 0s? Otherwise
>>> the tail of the range may be above boundary.
>> I think that I didn't get what you mean by "the range" here and so I can't understand
>> what is "the tail of the range".
>> Could you please clarify?
> By applying "mask" you effectively produce a range (with "gfn" somewhere in
> the middle). For the level (which you return to the caller) to be correct,
> the entire range must be matching "gfn" in being below or above of the
> boundary. My impression is that this isn't the case when is_lower is true.
Oh, got it. Then I agree that when is_lower is true we really need to fill the bottom
bits of masked_gfn with all 1s.
Thanks for clarifying.
~ Oleksii
* [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-10-20 15:57 [for 4.22 v5 00/18] xen/riscv: introduce p2m functionality Oleksii Kurochko
` (16 preceding siblings ...)
2025-10-20 15:58 ` [for 4.22 v5 17/18] xen/riscv: add support of page lookup by GFN Oleksii Kurochko
@ 2025-10-20 15:58 ` Oleksii Kurochko
2025-11-12 11:49 ` Jan Beulich
17 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-10-20 15:58 UTC (permalink / raw)
To: xen-devel
Cc: Oleksii Kurochko, Alistair Francis, Bob Eshleman, Connor Davis,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Roger Pau Monné, Stefano Stabellini
RISC-V's PTE has only two available bits that can be used to store the P2M
type. This is insufficient to represent all the current RISC-V P2M types.
Therefore, some P2M types must be stored outside the PTE bits.
To address this, a metadata table is introduced to store P2M types that
cannot fit in the PTE itself. Not all P2M types are stored in the
metadata table—only those that require it.
The metadata table is linked to the intermediate page table via the
`struct page_info`'s v.md.metadata field of the corresponding intermediate
page.
Such pages are allocated with MEMF_no_owner, which allows us to use
the v field for the purpose of storing the metadata table.
To simplify the allocation and linking of intermediate and metadata page
tables, `p2m_{alloc,free}_table()` functions are implemented.
These changes impact `p2m_split_superpage()`, since when a superpage is
split, it is necessary to update the metadata table of the new
intermediate page table — if the entry being split has its P2M type set
to `p2m_ext_storage` in its `P2M_TYPES` bits. In addition to updating
the metadata of the new intermediate page table, the corresponding entry
in the metadata for the original superpage is invalidated.
Also, update p2m_{get,set}_type to work with P2M types which don't fit
into PTE bits.
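The idea can be modelled in a few lines of standalone C. This is a sketch under assumptions, not the patch's implementation: the type numbering, struct table, set_type() and get_type() are hypothetical, calloc() stands in for Xen's page allocation, and the two software bits are taken to be the RSW field (PTE bits 9:8).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_ENTRIES 512
#define TYPE_SHIFT 8                            /* RSW field: PTE bits 9:8 */
#define TYPE_MASK  (UINT64_C(3) << TYPE_SHIFT)

/* Hypothetical numbering: values below EXT_STORAGE fit into the PTE. */
enum ptype { PT_INVALID, PT_RAM_RW, PT_MMIO, EXT_STORAGE,
             PT_FOREIGN_RW, PT_FOREIGN_RO };

struct table {
    uint64_t pte[NR_ENTRIES];
    uint8_t *md;                /* metadata, allocated on first use */
};

static int set_type(struct table *tbl, unsigned int idx, enum ptype t)
{
    uint64_t in_pte = t;

    assert(idx < NR_ENTRIES && t != EXT_STORAGE);

    if ( t > EXT_STORAGE )
    {
        /* The type does not fit: store it out of line, mark the PTE. */
        if ( !tbl->md && !(tbl->md = calloc(NR_ENTRIES, 1)) )
            return -1;
        tbl->md[idx] = (uint8_t)t;
        in_pte = EXT_STORAGE;
    }
    else if ( tbl->md )
        tbl->md[idx] = PT_INVALID;   /* drop any stale out-of-line type */

    tbl->pte[idx] = (tbl->pte[idx] & ~TYPE_MASK) | (in_pte << TYPE_SHIFT);

    return 0;
}

static enum ptype get_type(const struct table *tbl, unsigned int idx)
{
    enum ptype t = (enum ptype)((tbl->pte[idx] & TYPE_MASK) >> TYPE_SHIFT);

    /* Only consult the metadata when the PTE says the type lives there. */
    return (t == EXT_STORAGE) ? (enum ptype)tbl->md[idx] : t;
}
```

Allocating the metadata lazily means tables holding only in-PTE types never pay for the extra page.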
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
- Rename metadata member of struct md inside struct page_info to pg.
- Drop stray blank in the declaration of p2m_alloc_table().
- Use "<" instead of "<=" in ASSERT() in p2m_set_type().
- Move the check that ctx is provided to an earlier point in
p2m_set_type().
- Set `md_pg` after ASSERT() in p2m_set_type().
- Add BUG_ON() instead of ASSERT_UNREACHABLE() in p2m_set_type().
- Drop a check that metadata isn't NULL before unmap_domain_page() is
being called.
- Make const `md` variable in p2m_get_type().
- unmap correct domain's page in p2m_get_type: use `md` instead of
ctx->pt_page->v.md.pg.
- Add description of how p2m and p2m_pte_ctx is expected to be used
in p2m_pte_from_mfn() and drop a comment from page_to_p2m_table().
- Drop the stale part of the comment above p2m_alloc_table().
- Drop ASSERT(tbl_pg->v.md.pg) from p2m_free_table() as tbl_pg->v.md.pg
is created conditionally now.
- Drop an introduction of p2m_alloc_table(), update p2m_alloc_page()
correspondingly and use it instead.
- Add missing blank in definition of level member for tmp_ctx variable
in p2m_free_subtree(). Also, add the comma at the end.
- Initialize old_type once before for-loop in p2m_split_superpage() as
old type will be used for all newly created PTEs.
- Properly initialize p2m_pte_ctx.level with next_level instead of
level when p2m_set_type() is going to be called for new PTEs.
- Fix indentations.
- Move ASSERT(p2m) on top of p2m_set_type() to be sure that NULL isn't
passed for p2m argument of p2m_set_type().
- s/virt_to_page(table)/mfn_to_page(domain_page_map_to_mfn(table))
to receive correct page for a table which is mapped by domain_page_map().
- Add "return;" after domain_crash() in p2m_set_type() to avoid potential
NULL pointer dereference of md_pg.
---
Changes in V4:
- Add Suggested-by: Jan Beulich <jbeulich@suse.com>.
- Update the comment above declaration of md structure inside struct page_info to:
"Page is used as an intermediate P2M page table".
- Allocate metadata table on demand to save some memory. (1)
- Rework p2m_set_type():
- Add allocation of metadata page only if needed.
- Move a check what kind of type we are handling inside p2m_set_type().
- Move mapping of metadata page inside p2m_get_type() as it is needed only
if PTE's type is equal to p2m_ext_storage.
- Add some description to p2m_get_type() function.
- Drop blank after return type of p2m_alloc_table().
- Drop allocation of metadata page inside p2m_alloc_table() because of (1).
- Fix p2m_free_table() to free metadata page only if it was allocated.
---
Changes in V3:
- Add is_p2m_foreign() macro and connected stuff.
- Change struct domain *d argument of p2m_get_page_from_gfn() to
struct p2m_domain.
- Update the comment above p2m_get_entry().
- s/_t/p2mt for local variable in p2m_get_entry().
- Drop local variable addr in p2m_get_entry() and use gfn_to_gaddr(gfn)
to define offsets array.
- Code style fixes.
- Update a check of rc code from p2m_next_level() in p2m_get_entry()
and drop "else" case.
- Do not call p2m_get_type() if p2m_get_entry()'s t argument is NULL.
- Use struct p2m_domain instead of struct domain for p2m_lookup() and
p2m_get_page_from_gfn().
- Move definition of get_page() from "xen/riscv: implement mfn_valid() and page reference, ownership handling helpers"
---
Changes in V2:
- New patch.
---
xen/arch/riscv/include/asm/mm.h | 9 ++
xen/arch/riscv/p2m.c | 223 +++++++++++++++++++++++++++-----
2 files changed, 198 insertions(+), 34 deletions(-)
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index 1b16809749..b18892e4fc 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -149,6 +149,15 @@ struct page_info
/* Order-size of the free chunk this page is the head of. */
unsigned int order;
} free;
+
+ /* Page is used as an intermediate P2M page table */
+ struct {
+ /*
+ * Pointer to a page which store metadata for an intermediate page
+ * table.
+ */
+ struct page_info *pg;
+ } md;
} v;
union {
diff --git a/xen/arch/riscv/p2m.c b/xen/arch/riscv/p2m.c
index 785d11aaff..c8112faacb 100644
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -20,6 +20,16 @@
#define P2M_SUPPORTED_LEVEL_MAPPING 2
+/*
+ * P2M PTE context is used only when a PTE's P2M type is p2m_ext_storage.
+ * In this case, the P2M type is stored separately in the metadata page.
+ */
+struct p2m_pte_ctx {
+ struct page_info *pt_page; /* Page table page containing the PTE. */
+ unsigned int index; /* Index of the PTE within that page. */
+ unsigned int level; /* Paging level at which the PTE resides. */
+};
+
unsigned char __ro_after_init gstage_mode;
unsigned int __ro_after_init gstage_root_level;
@@ -363,24 +373,89 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
return pg;
}
-static int p2m_set_type(pte_t *pte, p2m_type_t t)
+/*
+ * `pte` – PTE entry for which the type `t` will be stored.
+ *
+ * If `t` is `p2m_ext_storage`, both `ctx` and `p2m` must be provided;
+ * otherwise, only p2m may be NULL.
+ */
+static void p2m_set_type(pte_t *pte, const p2m_type_t t,
+ struct p2m_pte_ctx *ctx,
+ struct p2m_domain *p2m)
{
- int rc = 0;
+ struct page_info **md_pg;
+ pte_t *metadata = NULL;
- if ( t > p2m_first_external )
- panic("unimplemeted\n");
- else
+ ASSERT(p2m);
+
+ /* Be sure that an index correspondent to page level is passed. */
+ ASSERT(ctx && ctx->index < P2M_PAGETABLE_ENTRIES(ctx->level));
+
+ /*
+ * For the root page table (16 KB in size), we need to select the correct
+ * metadata table, since allocations are 4 KB each. In total, there are
+ * 4 tables of 4 KB each.
+ * For none-root page table index of ->pt_page[] will be always 0 as
+ * index won't be higher then 511. ASSERT() above verifies that.
+ */
+ md_pg = &ctx->pt_page[ctx->index / PAGETABLE_ENTRIES].v.md.pg;
+
+ if ( !*md_pg && (t >= p2m_first_external) )
+ {
+ BUG_ON(ctx->level > P2M_SUPPORTED_LEVEL_MAPPING);
+
+ if ( ctx->level <= P2M_SUPPORTED_LEVEL_MAPPING )
+ {
+ struct domain *d = p2m->domain;
+
+ *md_pg = p2m_alloc_page(p2m);
+ if ( !*md_pg )
+ {
+ printk("%s: can't allocate extra memory for dom%d\n",
+ __func__, d->domain_id);
+ domain_crash(d);
+
+ return;
+ }
+ }
+ }
+
+ if ( *md_pg )
+ metadata = __map_domain_page(*md_pg);
+
+ if ( t < p2m_first_external )
+ {
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
- return rc;
+ if ( metadata )
+ metadata[ctx->index].pte = p2m_invalid;
+ }
+ else
+ {
+ pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
+
+ metadata[ctx->index].pte = t;
+ }
+
+ unmap_domain_page(metadata);
}
-static p2m_type_t p2m_get_type(const pte_t pte)
+/*
+ * `pte` -> PTE entry that stores the PTE's type.
+ *
+ * If the PTE's type is `p2m_ext_storage`, `ctx` should be provided;
+ * otherwise it could be NULL.
+ */
+static p2m_type_t p2m_get_type(const pte_t pte, const struct p2m_pte_ctx *ctx)
{
p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
if ( type == p2m_ext_storage )
- panic("unimplemented\n");
+ {
+ const pte_t *md = __map_domain_page(ctx->pt_page->v.md.pg);
+ type = md[ctx->index].pte;
+ unmap_domain_page(md);
+ }
return type;
}
@@ -470,7 +545,15 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
}
}
-static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
+/*
+ * If p2m_pte_from_mfn() is called with p2m_pte_ctx = NULL and p2m = NULL,
+ * it means the function is working with a page table for which the `t`
+ * should not be applicable. Otherwise, the function is handling a leaf PTE
+ * for which `t` is applicable.
+ */
+static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
+ struct p2m_pte_ctx *p2m_pte_ctx,
+ struct p2m_domain *p2m)
{
pte_t e = (pte_t) { PTE_VALID };
@@ -478,7 +561,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
- if ( !is_table )
+ if ( p2m_pte_ctx && p2m )
{
switch ( t )
{
@@ -491,7 +574,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
}
p2m_set_permission(&e, t);
- p2m_set_type(&e, t);
+ p2m_set_type(&e, t, p2m_pte_ctx, p2m);
}
else
/*
@@ -506,12 +589,19 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
/* Generate table entry with correct attributes. */
static pte_t page_to_p2m_table(const struct page_info *page)
{
- /*
- * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
- * set to true and p2m_type_t shouldn't be applied for PTEs which
- * describe an intermidiate table.
- */
- return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
+ return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, NULL);
+}
+
+static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
+
+/*
+ * Free page table's page and metadata page linked to page table's page.
+ */
+static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
+{
+ if ( tbl_pg->v.md.pg )
+ p2m_free_page(p2m, tbl_pg->v.md.pg);
+ p2m_free_page(p2m, tbl_pg);
}
/* Allocate a new page table page and hook it in via the given entry. */
@@ -673,12 +763,14 @@ static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg)
/* Free pte sub-tree behind an entry */
static void p2m_free_subtree(struct p2m_domain *p2m,
- pte_t entry, unsigned int level)
+ pte_t entry,
+ const struct p2m_pte_ctx *p2m_pte_ctx)
{
unsigned int i;
pte_t *table;
mfn_t mfn;
struct page_info *pg;
+ unsigned int level = p2m_pte_ctx->level;
/*
* Check if the level is valid: only 4K - 2M - 1G mappings are supported.
@@ -694,7 +786,7 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
if ( (level == 0) || pte_is_superpage(entry, level) )
{
- p2m_type_t p2mt = p2m_get_type(entry);
+ p2m_type_t p2mt = p2m_get_type(entry, p2m_pte_ctx);
#ifdef CONFIG_IOREQ_SERVER
/*
@@ -713,9 +805,21 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
return;
}
- table = map_domain_page(pte_get_mfn(entry));
+ mfn = pte_get_mfn(entry);
+ ASSERT(mfn_valid(mfn));
+ table = map_domain_page(mfn);
+ pg = mfn_to_page(mfn);
+
for ( i = 0; i < P2M_PAGETABLE_ENTRIES(level); i++ )
- p2m_free_subtree(p2m, table[i], level - 1);
+ {
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = pg,
+ .index = i,
+ .level = level - 1,
+ };
+
+ p2m_free_subtree(p2m, table[i], &tmp_ctx);
+ }
unmap_domain_page(table);
@@ -727,17 +831,13 @@ static void p2m_free_subtree(struct p2m_domain *p2m,
*/
p2m_tlb_flush_sync(p2m);
- mfn = pte_get_mfn(entry);
- ASSERT(mfn_valid(mfn));
-
- pg = mfn_to_page(mfn);
-
- p2m_free_page(p2m, pg);
+ p2m_free_table(p2m, pg);
}
static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
unsigned int level, unsigned int target,
- const unsigned int *offsets)
+ const unsigned int *offsets,
+ struct page_info *tbl_pg)
{
struct page_info *page;
unsigned long i;
@@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
unsigned int next_level = level - 1;
unsigned int level_order = P2M_LEVEL_ORDER(next_level);
+ struct p2m_pte_ctx p2m_pte_ctx;
+ /* Init with p2m_invalid just to make compiler happy. */
+ p2m_type_t old_type = p2m_invalid;
+
/*
* This should only be called with target != level and the entry is
* a superpage.
@@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
table = __map_domain_page(page);
+ if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
+ {
+ p2m_pte_ctx.pt_page = tbl_pg;
+ p2m_pte_ctx.index = offsets[level];
+ /*
+ * It doesn't really matter what is a value for a level as
+ * p2m_get_type() doesn't need it, so it is initialized just in case.
+ */
+ p2m_pte_ctx.level = level;
+
+ old_type = p2m_get_type(*entry, &p2m_pte_ctx);
+ }
+
for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
{
pte_t *new_entry = table + i;
@@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
pte = *entry;
pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
+ if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
+ {
+ p2m_pte_ctx.pt_page = page;
+ p2m_pte_ctx.index = i;
+ p2m_pte_ctx.level = next_level;
+
+ p2m_set_type(&pte, old_type, &p2m_pte_ctx, p2m);
+ }
+
write_pte(new_entry, pte);
}
@@ -792,7 +918,7 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
*/
if ( next_level != target )
rv = p2m_split_superpage(p2m, table + offsets[next_level],
- next_level, target, offsets);
+ next_level, target, offsets, page);
if ( p2m->clean_dcache )
clean_dcache_va_range(table, PAGE_SIZE);
@@ -883,13 +1009,21 @@ static int p2m_set_entry(struct p2m_domain *p2m,
{
/* We need to split the original page. */
pte_t split_pte = *entry;
+ struct page_info *tbl_pg = mfn_to_page(domain_page_map_to_mfn(table));
ASSERT(pte_is_superpage(*entry, level));
- if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
+ if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets,
+ tbl_pg) )
{
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = tbl_pg,
+ .index = offsets[level],
+ .level = level,
+ };
+
/* Free the allocated sub-tree */
- p2m_free_subtree(p2m, split_pte, level);
+ p2m_free_subtree(p2m, split_pte, &tmp_ctx);
rc = -ENOMEM;
goto out;
@@ -927,7 +1061,13 @@ static int p2m_set_entry(struct p2m_domain *p2m,
p2m_clean_pte(entry, p2m->clean_dcache);
else
{
- pte_t pte = p2m_pte_from_mfn(mfn, t, false);
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
+ .index = offsets[level],
+ .level = level,
+ };
+
+ pte_t pte = p2m_pte_from_mfn(mfn, t, &tmp_ctx, p2m);
p2m_write_pte(entry, pte, p2m->clean_dcache);
@@ -963,7 +1103,15 @@ static int p2m_set_entry(struct p2m_domain *p2m,
if ( pte_is_valid(orig_pte) &&
(!pte_is_valid(*entry) ||
!mfn_eq(pte_get_mfn(*entry), pte_get_mfn(orig_pte))) )
- p2m_free_subtree(p2m, orig_pte, level);
+ {
+ struct p2m_pte_ctx tmp_ctx = {
+ .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
+ .index = offsets[level],
+ .level = level,
+ };
+
+ p2m_free_subtree(p2m, orig_pte, &tmp_ctx);
+ }
out:
unmap_domain_page(table);
@@ -1153,7 +1301,14 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
if ( pte_is_valid(entry) )
{
if ( t )
- *t = p2m_get_type(entry);
+ {
+ struct p2m_pte_ctx p2m_pte_ctx = {
+ .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
+ .index = offsets[level],
+ };
+
+ *t = p2m_get_type(entry, &p2m_pte_ctx);
+ }
mfn = pte_get_mfn(entry);
--
2.51.0
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-10-20 15:58 ` [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type Oleksii Kurochko
@ 2025-11-12 11:49 ` Jan Beulich
2025-11-17 19:51 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-12 11:49 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.10.2025 17:58, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/p2m.c
> +++ b/xen/arch/riscv/p2m.c
> @@ -20,6 +20,16 @@
>
> #define P2M_SUPPORTED_LEVEL_MAPPING 2
>
> +/*
> + * P2M PTE context is used only when a PTE's P2M type is p2m_ext_storage.
> + * In this case, the P2M type is stored separately in the metadata page.
> + */
> +struct p2m_pte_ctx {
> + struct page_info *pt_page; /* Page table page containing the PTE. */
> + unsigned int index; /* Index of the PTE within that page. */
> + unsigned int level; /* Paging level at which the PTE resides. */
> +};
> +
> unsigned char __ro_after_init gstage_mode;
> unsigned int __ro_after_init gstage_root_level;
>
> @@ -363,24 +373,89 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
> return pg;
> }
>
> -static int p2m_set_type(pte_t *pte, p2m_type_t t)
> +/*
> + * `pte` – PTE entry for which the type `t` will be stored.
> + *
> + * If `t` is `p2m_ext_storage`, both `ctx` and `p2m` must be provided;
> + * otherwise, only p2m may be NULL.
> + */
> +static void p2m_set_type(pte_t *pte, const p2m_type_t t,
> + struct p2m_pte_ctx *ctx,
> + struct p2m_domain *p2m)
> {
> - int rc = 0;
> + struct page_info **md_pg;
> + pte_t *metadata = NULL;
I'm not convinced it is a good idea to re-use pte_t for this purpose. If you used
a separate type, and if then you defined that as a bitfield with only a few bits
dedicated to type, future changes (additions) may be quite a bit easier.
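A minimal sketch of what such a dedicated metadata entry type could look like. The name `struct md_entry`, the 4-bit width and the `reserved` field are assumptions for illustration only, not something taken from the series:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical layout; field names and widths are illustrative only. */
struct md_entry {
    unsigned int type     : 4;   /* P2M type kept outside the PTE */
    unsigned int reserved : 28;  /* bits left free for later additions */
};

/* With GCC/Clang this packs into one word; checked at build time. */
_Static_assert(sizeof(struct md_entry) == sizeof(unsigned int),
               "metadata entry is expected to occupy one word");

/* Read the type stored for the PTE at position idx. */
static unsigned int md_get_type(const struct md_entry *md, unsigned int idx)
{
    return md[idx].type;
}

/* Store a type for the PTE at position idx; other bits are untouched. */
static void md_set_type(struct md_entry *md, unsigned int idx, unsigned int t)
{
    md[idx].type = t;
}
```

New fields would then be carved out of `reserved` without touching the users of `type`.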
> - if ( t > p2m_first_external )
> - panic("unimplemeted\n");
> - else
> + ASSERT(p2m);
> +
> + /* Be sure that an index correspondent to page level is passed. */
> + ASSERT(ctx && ctx->index < P2M_PAGETABLE_ENTRIES(ctx->level));
> +
> + /*
> + * For the root page table (16 KB in size), we need to select the correct
> + * metadata table, since allocations are 4 KB each. In total, there are
> + * 4 tables of 4 KB each.
> + * For none-root page table index of ->pt_page[] will be always 0 as
> + * index won't be higher then 511. ASSERT() above verifies that.
> + */
> + md_pg = &ctx->pt_page[ctx->index / PAGETABLE_ENTRIES].v.md.pg;
> +
> + if ( !*md_pg && (t >= p2m_first_external) )
> + {
> + BUG_ON(ctx->level > P2M_SUPPORTED_LEVEL_MAPPING);
> +
> + if ( ctx->level <= P2M_SUPPORTED_LEVEL_MAPPING )
> + {
> + struct domain *d = p2m->domain;
This is (if at all) needed only ...
> + *md_pg = p2m_alloc_page(p2m);
> + if ( !*md_pg )
> + {
... in this more narrow scope.
> + printk("%s: can't allocate extra memory for dom%d\n",
> + __func__, d->domain_id);
The logging text isn't specific enough for my taste. For ordinary printk()s I'd
also recommend against use of __func__ (that's fine for dprintk()).
Also please use %pd in such cases.
> + domain_crash(d);
> +
> + return;
> + }
> + }
> + }
> +
> + if ( *md_pg )
> + metadata = __map_domain_page(*md_pg);
> +
> + if ( t < p2m_first_external )
> + {
> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>
> - return rc;
> + if ( metadata )
> + metadata[ctx->index].pte = p2m_invalid;
Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
p2m_alloc_page()'s clearing of the page won't have the intended effect?
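The dependency can be sketched as follows, assuming a freshly allocated metadata page is zero-filled. The `sk_` names are hypothetical, and `_Static_assert` stands in for Xen's BUILD_BUG_ON():

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum; only the "invalid is zero" property matters here. */
enum sk_p2m_type { sk_p2m_invalid = 0, sk_p2m_ram_rw, sk_p2m_ext_storage };

/* Build fails if the "invalid" type ever stops being zero. */
_Static_assert(sk_p2m_invalid == 0,
               "zero-filled metadata must decode as the invalid type");

#define SK_ENTRIES 512

/* Models the clearing done when a metadata page is allocated. */
static void sk_clear_metadata(unsigned char *md)
{
    memset(md, 0, SK_ENTRIES);
}

/* Returns 1 if every entry decodes as sk_p2m_invalid, 0 otherwise. */
static int sk_all_invalid(const unsigned char *md)
{
    unsigned int i;

    for ( i = 0; i < SK_ENTRIES; i++ )
        if ( md[i] != sk_p2m_invalid )
            return 0;

    return 1;
}
```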
> + }
> + else
> + {
> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
> +
> + metadata[ctx->index].pte = t;
If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
> + }
... here, covering both cases. Overall this may then be easier as
if ( t >= p2m_first_external )
metadata[ctx->index].pte = t;
else if ( metadata )
metadata[ctx->index].pte = p2m_invalid;
pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
Then raising the question whether it couldn't still be the real type that's
stored in metadata[] even for t < p2m_first_external. That would further
reduce conditionals.
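For illustration, a self-contained sketch of that restructuring, under the assumption that `t` is replaced by the marker once the real type has been written to the metadata. All `SK_` constants, including the bit position of the type field, are hypothetical stand-ins for the real definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in the series' headers. */
#define SK_INVALID        0u
#define SK_RAM_RW         1u
#define SK_EXT_STORAGE    3u
#define SK_FOREIGN_RW     4u
#define SK_FIRST_EXTERNAL SK_EXT_STORAGE
#define SK_TYPE_SHIFT     8   /* assumed position of the PTE type bits */

/* metadata may be NULL only when t fits into the PTE bits. */
static void sk_set_type(uint64_t *pte, uint8_t *metadata,
                        unsigned int idx, unsigned int t)
{
    if ( t >= SK_FIRST_EXTERNAL )
    {
        metadata[idx] = t;      /* real type goes to the side table */
        t = SK_EXT_STORAGE;     /* the PTE carries only the marker */
    }
    else if ( metadata )
        metadata[idx] = SK_INVALID;

    /* Single PTE update covering both cases. */
    *pte |= (uint64_t)t << SK_TYPE_SHIFT;
}
```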
> + unmap_domain_page(metadata);
> }
>
> -static p2m_type_t p2m_get_type(const pte_t pte)
> +/*
> + * `pte` -> PTE entry that stores the PTE's type.
> + *
> + * If the PTE's type is `p2m_ext_storage`, `ctx` should be provided;
> + * otherwise it could be NULL.
> + */
> +static p2m_type_t p2m_get_type(const pte_t pte, const struct p2m_pte_ctx *ctx)
> {
> p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
>
> if ( type == p2m_ext_storage )
> - panic("unimplemented\n");
> + {
> + const pte_t *md = __map_domain_page(ctx->pt_page->v.md.pg);
> + type = md[ctx->index].pte;
> + unmap_domain_page(md);
Nit (style): Blank line please between declaration(s) and statement(s).
> @@ -470,7 +545,15 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
> }
> }
>
> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> +/*
> + * If p2m_pte_from_mfn() is called with p2m_pte_ctx = NULL and p2m = NULL,
> + * it means the function is working with a page table for which the `t`
> + * should not be applicable. Otherwise, the function is handling a leaf PTE
> + * for which `t` is applicable.
> + */
> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
> + struct p2m_pte_ctx *p2m_pte_ctx,
> + struct p2m_domain *p2m)
> {
> pte_t e = (pte_t) { PTE_VALID };
>
> @@ -478,7 +561,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>
> ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
>
> - if ( !is_table )
> + if ( p2m_pte_ctx && p2m )
> {
Maybe better
if ( p2m_pte_ctx )
{
ASSERT(p2m);
...
(if you really think the 2nd check is needed)?
> @@ -506,12 +589,19 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
> /* Generate table entry with correct attributes. */
> static pte_t page_to_p2m_table(const struct page_info *page)
> {
> - /*
> - * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
> - * set to true and p2m_type_t shouldn't be applied for PTEs which
> - * describe an intermidiate table.
> - */
> - return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
> + return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, NULL);
> +}
How come the comment is dropped? If you deem it unnecessary, why was it added
earlier in this same series?
> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
> +
> +/*
> + * Free page table's page and metadata page linked to page table's page.
> + */
> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
> +{
> + if ( tbl_pg->v.md.pg )
> + p2m_free_page(p2m, tbl_pg->v.md.pg);
To play safe, maybe better also clear tbl_pg->v.md.pg?
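One possible reading of this, sketched with hypothetical `sk_` types: the pointer is reset once the metadata page has been handed back. The sketch assumes the page descriptor outlives the page itself, so a stale pointer could otherwise be observed later:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page descriptor; only the fields needed here. */
struct sk_page_info {
    int allocated;                 /* 1 while the page is in use */
    struct sk_page_info *md_pg;    /* optional metadata page */
};

/* Models returning a page to the allocator. */
static void sk_free_page(struct sk_page_info *pg)
{
    pg->allocated = 0;
}

/* Free the metadata page (if any), drop the link, then free the table. */
static void sk_free_table(struct sk_page_info *tbl_pg)
{
    if ( tbl_pg->md_pg )
    {
        sk_free_page(tbl_pg->md_pg);
        tbl_pg->md_pg = NULL;      /* no dangling reference left behind */
    }

    sk_free_page(tbl_pg);
}
```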
> @@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> unsigned int next_level = level - 1;
> unsigned int level_order = P2M_LEVEL_ORDER(next_level);
>
> + struct p2m_pte_ctx p2m_pte_ctx;
I think this would better be one variable instance per scope where it's needed,
and then using an initializer. Or else ...
> + /* Init with p2m_invalid just to make compiler happy. */
> + p2m_type_t old_type = p2m_invalid;
> +
> /*
> * This should only be called with target != level and the entry is
> * a superpage.
> @@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>
> table = __map_domain_page(page);
>
> + if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
> + {
> + p2m_pte_ctx.pt_page = tbl_pg;
> + p2m_pte_ctx.index = offsets[level];
> + /*
> + * It doesn't really matter what is a value for a level as
> + * p2m_get_type() doesn't need it, so it is initialized just in case.
> + */
> + p2m_pte_ctx.level = level;
> +
> + old_type = p2m_get_type(*entry, &p2m_pte_ctx);
> + }
> +
> for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
> {
> pte_t *new_entry = table + i;
> @@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
> pte = *entry;
> pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>
> + if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
> + {
> + p2m_pte_ctx.pt_page = page;
> + p2m_pte_ctx.index = i;
> + p2m_pte_ctx.level = next_level;
... why are the loop-invariant fields not filled ahead of the loop here?
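A small sketch of the second option, with the invariant fields set once through an initializer and only the index updated per iteration. `struct sk_ctx` and the helpers are hypothetical and only mirror the shape of the code under review:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the context structure under discussion. */
struct sk_ctx {
    const void *pt_page;   /* table page holding the entries */
    unsigned int index;    /* entry index, changes per iteration */
    unsigned int level;    /* paging level, constant in the loop */
};

/* Models storing a type for the entry described by ctx. */
static void sk_set_type(unsigned char *md, const struct sk_ctx *ctx,
                        unsigned char t)
{
    md[ctx->index] = t;
}

/* Propagate old_type to all entries of a newly created table. */
static void sk_propagate_type(unsigned char *md, const void *page,
                              unsigned int next_level, unsigned int entries,
                              unsigned char old_type)
{
    /* Loop-invariant fields are filled once, ahead of the loop. */
    struct sk_ctx ctx = { .pt_page = page, .level = next_level };
    unsigned int i;

    for ( i = 0; i < entries; i++ )
    {
        ctx.index = i;
        sk_set_type(md, &ctx, old_type);
    }
}
```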
> @@ -927,7 +1061,13 @@ static int p2m_set_entry(struct p2m_domain *p2m,
> p2m_clean_pte(entry, p2m->clean_dcache);
> else
> {
> - pte_t pte = p2m_pte_from_mfn(mfn, t, false);
> + struct p2m_pte_ctx tmp_ctx = {
> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
> + .index = offsets[level],
Nit: Stray blank.
> @@ -1153,7 +1301,14 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
> if ( pte_is_valid(entry) )
> {
> if ( t )
> - *t = p2m_get_type(entry);
> + {
> + struct p2m_pte_ctx p2m_pte_ctx = {
> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
> + .index = offsets[level],
> + };
.level not being set here?
Jan
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-12 11:49 ` Jan Beulich
@ 2025-11-17 19:51 ` Oleksii Kurochko
2025-11-18 6:58 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-17 19:51 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/12/25 12:49 PM, Jan Beulich wrote:
> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>> --- a/xen/arch/riscv/p2m.c
>> +++ b/xen/arch/riscv/p2m.c
>> @@ -20,6 +20,16 @@
>>
>> #define P2M_SUPPORTED_LEVEL_MAPPING 2
>>
>> +/*
>> + * P2M PTE context is used only when a PTE's P2M type is p2m_ext_storage.
>> + * In this case, the P2M type is stored separately in the metadata page.
>> + */
>> +struct p2m_pte_ctx {
>> + struct page_info *pt_page; /* Page table page containing the PTE. */
>> + unsigned int index; /* Index of the PTE within that page. */
>> + unsigned int level; /* Paging level at which the PTE resides. */
>> +};
>> +
>> unsigned char __ro_after_init gstage_mode;
>> unsigned int __ro_after_init gstage_root_level;
>>
>> @@ -363,24 +373,89 @@ static struct page_info *p2m_alloc_page(struct p2m_domain *p2m)
>> return pg;
>> }
>>
>> -static int p2m_set_type(pte_t *pte, p2m_type_t t)
>> +/*
>> + * `pte` – PTE entry for which the type `t` will be stored.
>> + *
>> + * If `t` is `p2m_ext_storage`, both `ctx` and `p2m` must be provided;
>> + * otherwise, only p2m may be NULL.
>> + */
>> +static void p2m_set_type(pte_t *pte, const p2m_type_t t,
>> + struct p2m_pte_ctx *ctx,
>> + struct p2m_domain *p2m)
>> {
>> - int rc = 0;
>> + struct page_info **md_pg;
>> + pte_t *metadata = NULL;
> I'm not convinced it is a good idea to re-use pte_t for this purpose. If you used
> a separate type, and if then you defined that as a bitfield with only a few bits
> dedicated to type, future changes (additions) may be quite a bit easier.
Makes sense, then let's go with the following structure:
struct md_t {
/*
* Describes a type stored outside the PTE.
* Look at the comment above definition of enum p2m_type_t.
*/
p2m_type_t type : 4;
};
>
>> - if ( t > p2m_first_external )
>> - panic("unimplemeted\n");
>> - else
>> + ASSERT(p2m);
>> +
>> + /* Be sure that an index correspondent to page level is passed. */
>> + ASSERT(ctx && ctx->index < P2M_PAGETABLE_ENTRIES(ctx->level));
>> +
>> + /*
>> + * For the root page table (16 KB in size), we need to select the correct
>> + * metadata table, since allocations are 4 KB each. In total, there are
>> + * 4 tables of 4 KB each.
>> + * For none-root page table index of ->pt_page[] will be always 0 as
>> + * index won't be higher then 511. ASSERT() above verifies that.
>> + */
>> + md_pg = &ctx->pt_page[ctx->index / PAGETABLE_ENTRIES].v.md.pg;
>> +
>> + if ( !*md_pg && (t >= p2m_first_external) )
>> + {
>> + BUG_ON(ctx->level > P2M_SUPPORTED_LEVEL_MAPPING);
>> +
>> + if ( ctx->level <= P2M_SUPPORTED_LEVEL_MAPPING )
>> + {
>> + struct domain *d = p2m->domain;
> This is (if at all) needed only ...
>
>> + *md_pg = p2m_alloc_page(p2m);
>> + if ( !*md_pg )
>> + {
> ... in this more narrow scope.
>
>> + printk("%s: can't allocate extra memory for dom%d\n",
>> + __func__, d->domain_id);
> The logging text isn't specific enough for my taste. For ordinary printk()s I'd
> also recommend against use of __func__ (that's fine for dprintk()).
I will update the message to:
printk("%pd: can't allocate metadata page\n", p2m->domain);
>
> Also please use %pd in such cases.
>
>> + domain_crash(d);
>> +
>> + return;
>> + }
>> + }
>> + }
>> +
>> + if ( *md_pg )
>> + metadata = __map_domain_page(*md_pg);
>> +
>> + if ( t < p2m_first_external )
>> + {
>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>
>> - return rc;
>> + if ( metadata )
>> + metadata[ctx->index].pte = p2m_invalid;
> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
> p2m_alloc_page()'s clearing of the page won't have the intended effect?
I think that, at least at the moment, we always set the p2m type explicitly and
do not rely on 0 == p2m_invalid being the default.
Just to be safe, I will add the suggested BUILD_BUG_ON(p2m_invalid) after
"if ( metadata )":
if ( metadata )
metadata[ctx->index].type = p2m_invalid;
/*
* metadata.type is expected to be p2m_invalid (0) after the page is
* allocated and zero-initialized in p2m_alloc_page().
*/
BUILD_BUG_ON(p2m_invalid);
...
>
>> + }
>> + else
>> + {
>> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>> +
>> + metadata[ctx->index].pte = t;
> If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
't' shouldn't be passed as 'p2m_ext_storage'.
Otherwise the metadata page would hold the type p2m_ext_storage and pte->pte
would also have its type set to p2m_ext_storage, so we would end up with the
real type not being stored anywhere.
Moreover, metadata.pte shouldn't be used to store p2m_ext_storage at all, only
p2m_invalid and the types listed in enum p2m_type_t after p2m_ext_storage.
>
>> + }
> ... here, covering both cases. Overall this may then be easier as
>
> if ( t >= p2m_first_external )
> metadata[ctx->index].pte = t;
> else if ( metadata )
> metadata[ctx->index].pte = p2m_invalid;
>
> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>
> Then raising the question whether it couldn't still be the real type that's
> stored in metadata[] even for t < p2m_first_external. That would further
> reduce conditionals.
It would be nice, but I think that at the moment we can’t do that. As I explained
above, 't' should not normally be passed as p2m_ext_storage. If we want to
handle this properly, I would need to update the code to:
if (!*md_pg && (t > p2m_first_external))
Alternatively, we could set p2m_first_external = p2m_map_foreign_rw instead of
p2m_ext_storage, since p2m_ext_storage is technically just a marker indicating
that the type is stored elsewhere.
We should also add a BUG_ON(t == p2m_ext_storage) before the if-condition
mentioned above.
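A rough sketch of that alternative, where the marker is kept distinct from the first externally stored type. The enum below is hypothetical, and `assert()` stands in for BUG_ON():

```c
#include <assert.h>

/* Hypothetical ordering; only the relative positions matter here. */
enum sk_p2m_type {
    sk_p2m_invalid = 0,
    sk_p2m_ram_rw,
    sk_p2m_mmio,
    sk_p2m_ext_storage,        /* marker only, never a real type */
    sk_p2m_map_foreign_rw,
    sk_p2m_map_foreign_ro,

    /* First type whose value lives in the metadata page. */
    sk_p2m_first_external = sk_p2m_map_foreign_rw,
};

/* Returns 1 if t has to be kept in the metadata page, 0 otherwise. */
static int sk_needs_metadata(enum sk_p2m_type t)
{
    /* Callers must never pass the marker itself as a type. */
    assert(t != sk_p2m_ext_storage);

    return t >= sk_p2m_first_external;
}
```

With this ordering the allocation condition can stay `t >= first_external`, since the marker no longer falls into the external range.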
>
>> @@ -470,7 +545,15 @@ static void p2m_set_permission(pte_t *e, p2m_type_t t)
>> }
>> }
>>
>> -static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> +/*
>> + * If p2m_pte_from_mfn() is called with p2m_pte_ctx = NULL and p2m = NULL,
>> + * it means the function is working with a page table for which the `t`
>> + * should not be applicable. Otherwise, the function is handling a leaf PTE
>> + * for which `t` is applicable.
>> + */
>> +static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t,
>> + struct p2m_pte_ctx *p2m_pte_ctx,
>> + struct p2m_domain *p2m)
>> {
>> pte_t e = (pte_t) { PTE_VALID };
>>
>> @@ -478,7 +561,7 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>>
>> ASSERT(!(mfn_to_maddr(mfn) & ~PADDR_MASK) || mfn_eq(mfn, INVALID_MFN));
>>
>> - if ( !is_table )
>> + if ( p2m_pte_ctx && p2m )
>> {
> Maybe better
>
> if ( p2m_pte_ctx )
> {
> ASSERT(p2m);
> ...
>
> (if you really think the 2nd check is needed)?
It seems like we don't really need it, as p2m_set_type() has the same ASSERT() at the start.
I will double-check why I added it and drop it if there was no specific reason.
>
>> @@ -506,12 +589,19 @@ static pte_t p2m_pte_from_mfn(mfn_t mfn, p2m_type_t t, bool is_table)
>> /* Generate table entry with correct attributes. */
>> static pte_t page_to_p2m_table(const struct page_info *page)
>> {
>> - /*
>> - * p2m_invalid will be ignored inside p2m_pte_from_mfn() as is_table is
>> - * set to true and p2m_type_t shouldn't be applied for PTEs which
>> - * describe an intermidiate table.
>> - */
>> - return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, true);
>> + return p2m_pte_from_mfn(page_to_mfn(page), p2m_invalid, NULL, NULL);
>> +}
> How come the comment is dropped? If you deem it unnecessary, why was it added
> earlier in this same series?
It is still relevant. Something went wrong during the rebase and conflict resolution. Thanks for
finding that.
>
>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>> +
>> +/*
>> + * Free page table's page and metadata page linked to page table's page.
>> + */
>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>> +{
>> + if ( tbl_pg->v.md.pg )
>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
> To play safe, maybe better also clear tbl_pg->v.md.pg?
I thought it would be enough to clear it during allocation in p2m_alloc_page(),
since I'm not sure it is critical if md.pg data were somehow leaked and read.
But to be safer, we can add this here:
clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
>
>> @@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> unsigned int next_level = level - 1;
>> unsigned int level_order = P2M_LEVEL_ORDER(next_level);
>>
>> + struct p2m_pte_ctx p2m_pte_ctx;
> I think this would better be one variable instance per scope where it's needed,
> and then using an initializer. Or else ...
>
>> + /* Init with p2m_invalid just to make compiler happy. */
>> + p2m_type_t old_type = p2m_invalid;
>> +
>> /*
>> * This should only be called with target != level and the entry is
>> * a superpage.
>> @@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>
>> table = __map_domain_page(page);
>>
>> + if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>> + {
>> + p2m_pte_ctx.pt_page = tbl_pg;
>> + p2m_pte_ctx.index = offsets[level];
>> + /*
>> + * It doesn't really matter what is a value for a level as
>> + * p2m_get_type() doesn't need it, so it is initialized just in case.
>> + */
>> + p2m_pte_ctx.level = level;
>> +
>> + old_type = p2m_get_type(*entry, &p2m_pte_ctx);
>> + }
>> +
>> for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
>> {
>> pte_t *new_entry = table + i;
>> @@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>> pte = *entry;
>> pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>>
>> + if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>> + {
>> + p2m_pte_ctx.pt_page = page;
>> + p2m_pte_ctx.index = i;
>> + p2m_pte_ctx.level = next_level;
> ... why are the loop-invariat fields not filled ahead of the loop here?
Actually, they could be filled before the loop. If I move the initialization of
p2m_pte_ctx.pt_page and p2m_pte_ctx.level ahead of the loop, does it still make
sense to have a separate variable inside
"if (MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage)"?
>
>> @@ -927,7 +1061,13 @@ static int p2m_set_entry(struct p2m_domain *p2m,
>> p2m_clean_pte(entry, p2m->clean_dcache);
>> else
>> {
>> - pte_t pte = p2m_pte_from_mfn(mfn, t, false);
>> + struct p2m_pte_ctx tmp_ctx = {
>> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
>> + .index = offsets[level],
> Nit: Stray blank.
>
>> @@ -1153,7 +1301,14 @@ static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> if ( pte_is_valid(entry) )
>> {
>> if ( t )
>> - *t = p2m_get_type(entry);
>> + {
>> + struct p2m_pte_ctx p2m_pte_ctx = {
>> + .pt_page = mfn_to_page(domain_page_map_to_mfn(table)),
>> + .index = offsets[level],
>> + };
> .level not being set here?
It isn't used in the case when p2m_get_type() is called, but just for consistency and to
be sure that nothing will break if the implementation of p2m_get_type() changes,
I will add:
.level = level,
Thanks.
~ Oleksii
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-17 19:51 ` Oleksii Kurochko
@ 2025-11-18 6:58 ` Jan Beulich
2025-11-20 15:38 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-18 6:58 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 17.11.2025 20:51, Oleksii Kurochko wrote:
> On 11/12/25 12:49 PM, Jan Beulich wrote:
>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>> + if ( *md_pg )
>>> + metadata = __map_domain_page(*md_pg);
>>> +
>>> + if ( t < p2m_first_external )
>>> + {
>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>> - return rc;
>>> + if ( metadata )
>>> + metadata[ctx->index].pte = p2m_invalid;
>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>
> I think that, at least, at the moment we are always explicitly set p2m type and
> do not rely on that by default 0==p2m_invalid.
You don't, and ...
> Just to be safe, I will add after "if ( metadata )" suggested
> BUILD_BUG_ON(p2m_invalid):
> if ( metadata )
> metadata[ctx->index].type = p2m_invalid;
> /*
> * metadata.type is expected to be p2m_invalid (0) after the page is
> * allocated and zero-initialized in p2m_alloc_page().
> */
> BUILD_BUG_ON(p2m_invalid);
> ...
... this leaves me with the impression that you didn't read my reply correctly.
p2m_alloc_page() clears the page, thus _implicitly_ setting all entries to
p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
ftaod).
>>> + }
>>> + else
>>> + {
>>> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>>> +
>>> + metadata[ctx->index].pte = t;
>> If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
>
> 't' shouldn't be passed as 'p2m_ext_storage'.
Of course not. I said "set", not "pass". I suggested to set t to p2m_ext_storage
right after the assignment above. I notice though that I missed ...
> For example, in this case we will have that in metadata page we will have type
> equal to p2m_ext_storage and then in pte->pte will have the type set to
> p2m_ext_storage, and the we end that we don't have a real type stored somewhere.
> Even more, metadata.pte shouldn't be used to store p2m_ext_storage, only
> p2m_invalid and types mentioned in enum p2m_t after p2m_ext_storage.
>
>>
>>> + }
>> ... here, covering both cases. Overally this may then be easier as
>>
>> if ( t >= p2m_first_external )
>> metadata[ctx->index].pte = t;
... the respective line (and the curly braces which are then needed) here:
t = p2m_ext_storage;
>> else if ( metadata )
>> metadata[ctx->index].pte = p2m_invalid;
>>
>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>
>> Then raising the question whether it couldn't still be the real type that's
>> stored in metadata[] even for t < p2m_first_external. That woiuld further
>> reduce conditionals.
>
> It would be nice, but I think that at the moment we can’t do that. As I explained
> above, 't' should not normally be passed as p2m_ext_storage.
Of course not, but how's that relevant to storing the _real_ type in the
metadata page even when it's one which can also be stored in the PTE?
As said, for a frequently used path it may help to reduce the number of
conditionals here.
>>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>>> +
>>> +/*
>>> + * Free page table's page and metadata page linked to page table's page.
>>> + */
>>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>>> +{
>>> + if ( tbl_pg->v.md.pg )
>>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
>> To play safe, maybe better also clear tbl_pg->v.md.pg?
>
> I thought it would be enough to clear it during allocation in p2m_alloc_page(),
> since I'm not sure it is critical if md.pg data were somehow leaked and read.
> But to be safer, we can add this here:
> clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
I didn't say clear what tbl_pg->v.md.pg points to, though. I suggested to clear
the struct field itself.
>>> @@ -749,6 +849,10 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> unsigned int next_level = level - 1;
>>> unsigned int level_order = P2M_LEVEL_ORDER(next_level);
>>> + struct p2m_pte_ctx p2m_pte_ctx;
>> I think this would better be one variable instance per scope where it's needed,
>> and then using an initzializer. Or else ...
>>
>>> + /* Init with p2m_invalid just to make compiler happy. */
>>> + p2m_type_t old_type = p2m_invalid;
>>> +
>>> /*
>>> * This should only be called with target != level and the entry is
>>> * a superpage.
>>> @@ -770,6 +874,19 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> table = __map_domain_page(page);
>>> + if ( MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>>> + {
>>> + p2m_pte_ctx.pt_page = tbl_pg;
>>> + p2m_pte_ctx.index = offsets[level];
>>> + /*
>>> + * It doesn't really matter what is a value for a level as
>>> + * p2m_get_type() doesn't need it, so it is initialized just in case.
>>> + */
>>> + p2m_pte_ctx.level = level;
>>> +
>>> + old_type = p2m_get_type(*entry, &p2m_pte_ctx);
>>> + }
>>> +
>>> for ( i = 0; i < P2M_PAGETABLE_ENTRIES(next_level); i++ )
>>> {
>>> pte_t *new_entry = table + i;
>>> @@ -781,6 +898,15 @@ static bool p2m_split_superpage(struct p2m_domain *p2m, pte_t *entry,
>>> pte = *entry;
>>> pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>>> + if ( MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage )
>>> + {
>>> + p2m_pte_ctx.pt_page = page;
>>> + p2m_pte_ctx.index = i;
>>> + p2m_pte_ctx.level = next_level;
>> ... why are the loop-invariat fields not filled ahead of the loop here?
>
> Actually, they could be filled before the loop. If I move the initialization of
> p2m_pte_ctx.pt_page and p2m_pte_ctx.level ahead of the loop, does it still make
> sense to have a separate variable inside
> "if (MASK_EXTR(entry->pte, P2M_TYPE_PTE_BITS_MASK) == p2m_ext_storage)"?
No, it's one of the two: a scope-limited variable within the loop, or a wider-scope
variable with the loop-invariant fields set ahead of the loop.
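
For illustration, the two shapes can be sketched as follows. This is only a
minimal, self-contained sketch and not the actual patch: struct page_info and
struct p2m_pte_ctx are reduced stand-ins, and NR_ENTRIES, use_ctx(),
split_scoped() and split_hoisted() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the Xen types discussed in the thread. */
struct page_info { int unused; };

struct p2m_pte_ctx {
    struct page_info *pt_page;
    unsigned int index;
    unsigned int level;
};

/* Hypothetical stand-in for P2M_PAGETABLE_ENTRIES(next_level). */
#define NR_ENTRIES 512u

/* Hypothetical consumer of the context; returns the index it was given. */
static unsigned int use_ctx(const struct p2m_pte_ctx *ctx)
{
    assert(ctx->pt_page != NULL);
    return ctx->index;
}

/* Option 1: scope-limited variable, fully initialized inside the loop. */
static unsigned long split_scoped(struct page_info *page,
                                  unsigned int next_level)
{
    unsigned long sum = 0;

    for ( unsigned int i = 0; i < NR_ENTRIES; i++ )
    {
        struct p2m_pte_ctx ctx = {
            .pt_page = page,
            .index = i,
            .level = next_level,
        };

        sum += use_ctx(&ctx);
    }

    return sum;
}

/* Option 2: wider-scope variable, loop-invariant fields set once up front. */
static unsigned long split_hoisted(struct page_info *page,
                                   unsigned int next_level)
{
    struct p2m_pte_ctx ctx = { .pt_page = page, .level = next_level };
    unsigned long sum = 0;

    for ( unsigned int i = 0; i < NR_ENTRIES; i++ )
    {
        ctx.index = i; /* only the per-iteration field changes */
        sum += use_ctx(&ctx);
    }

    return sum;
}
```

Both variants hand the same context to the consumer on every iteration; what
should be avoided is the mix, i.e. a wide-scope variable whose invariant
fields are nevertheless re-assigned inside the loop.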
Jan
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-18 6:58 ` Jan Beulich
@ 2025-11-20 15:38 ` Oleksii Kurochko
2025-11-20 15:47 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-20 15:38 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/18/25 7:58 AM, Jan Beulich wrote:
> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>> + if ( *md_pg )
>>>> + metadata = __map_domain_page(*md_pg);
>>>> +
>>>> + if ( t < p2m_first_external )
>>>> + {
>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>> - return rc;
>>>> + if ( metadata )
>>>> + metadata[ctx->index].pte = p2m_invalid;
>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>> I think that, at least, at the moment we are always explicitly set p2m type and
>> do not rely on that by default 0==p2m_invalid.
> You don't, and ...
>
>> Just to be safe, I will add after "if ( metadata )" suggested
>> BUILD_BUG_ON(p2m_invalid):
>> if ( metadata )
>> metadata[ctx->index].type = p2m_invalid;
>> /*
>> * metadata.type is expected to be p2m_invalid (0) after the page is
>> * allocated and zero-initialized in p2m_alloc_page().
>> */
>> BUILD_BUG_ON(p2m_invalid);
>> ...
> ... this leaves me with the impression that you didn't read my reply correctly.
> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
> ftaod).
I think I still don’t fully understand what the issue would be if p2m_invalid were
ever equal to 1 instead of 0 in the context of a metadata page.
Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
metadata page before it was assigned any type. They would find a value of 0, which
corresponds to a valid type rather than to p2m_invalid, as one might expect.
However, I’m not sure I currently see a scenario in which the metadata page would
be read before being initialized.
But just to be safe in case such a scenario does occur, I am okay with putting
BUILD_BUG_ON(p2m_invalid) before p2m_alloc_page() in the p2m_set_type() function.
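
To spell out what the check protects, here is a minimal sketch. It is not the
Xen code: calloc() stands in for the zero-filling done by p2m_alloc_page(),
BUILD_BUG_ON() is approximated with C11 _Static_assert, the enum is a
hypothetical subset, and MD_ENTRIES, alloc_metadata_page() and
metadata_type() are made-up names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical subset of the type enumeration; the values are assumptions. */
typedef enum { p2m_invalid = 0, p2m_ram_rw } p2m_type_t;

typedef struct { unsigned long pte; } pte_t;

/* Hypothetical number of entries in one metadata page. */
#define MD_ENTRIES 512u

/* Approximation of Xen's BUILD_BUG_ON(): fail the build if cond is non-zero. */
#define BUILD_BUG_ON(cond) _Static_assert(!(cond), "BUILD_BUG_ON: " #cond)

/* Stand-in for the allocation call site: returns a zero-filled page. */
static pte_t *alloc_metadata_page(void)
{
    /*
     * A zero-filled page only means "every entry is p2m_invalid" as long
     * as p2m_invalid is 0, so the assumption is checked where it is made.
     */
    BUILD_BUG_ON(p2m_invalid);

    return calloc(MD_ENTRIES, sizeof(pte_t));
}

/* Read the type recorded for one entry of a metadata page. */
static p2m_type_t metadata_type(const pte_t *md, unsigned int index)
{
    assert(md != NULL && index < MD_ENTRIES);

    return (p2m_type_t)md[index].pte;
}
```

With this in place, an entry that was never written reads back as
p2m_invalid, and changing the enum so that p2m_invalid is non-zero breaks the
build instead of silently changing behaviour.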
>
>>>> + }
>>>> + else
>>>> + {
>>>> + pte->pte |= MASK_INSR(p2m_ext_storage, P2M_TYPE_PTE_BITS_MASK);
>>>> +
>>>> + metadata[ctx->index].pte = t;
>>> If you set t to p2m_ext_storage here, the pte->pte updating could be moved ...
>> 't' shouldn't be passed as 'p2m_ext_storage'.
> Of course not. I said "set", not "pass". I suggested to set t to p2m_ext_storage
> right after the assignment above. I notice though that I missed ...
Now I see, then ...
>
>> For example, in this case we will have that in metadata page we will have type
>> equal to p2m_ext_storage and then in pte->pte will have the type set to
>> p2m_ext_storage, and the we end that we don't have a real type stored somewhere.
>> Even more, metadata.pte shouldn't be used to store p2m_ext_storage, only
>> p2m_invalid and types mentioned in enum p2m_t after p2m_ext_storage.
>>
>>>> + }
>>> ... here, covering both cases. Overally this may then be easier as
>>>
>>> if ( t >= p2m_first_external )
>>> metadata[ctx->index].pte = t;
> ... the respective line (and the figure braces which are the needed) here:
>
> t = p2m_ext_storage;
... (what was suggested above) will work.
>
>>> else if ( metadata )
>>> metadata[ctx->index].pte = p2m_invalid;
>>>
>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>
>>> Then raising the question whether it couldn't still be the real type that's
>>> stored in metadata[] even for t < p2m_first_external. That woiuld further
>>> reduce conditionals.
>> It would be nice, but I think that at the moment we can’t do that. As I explained
>> above, 't' should not normally be passed as p2m_ext_storage.
> Of course not, but how's that relevant to storing the _real_ type in the
> metadata page even when it's one which can also can be stored in the PTE?
> As said, for a frequently used path it may help to reduce the number of
> conditionals here.
IIUC, you are asking whether, if pte->pte stores a type < p2m_ext_storage,
it would still be possible for metadata[].pte to contain any real type?
If yes, then the answer is that it could be done, because in the p2m_get_type()
function the value stored in pte->pte is checked first. If it isn't p2m_ext_storage,
then metadata[].pte will not be checked at all. So technically, it could contain
whatever we want when the type in pte->pte != p2m_ext_storage.
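
A minimal sketch of that lookup order, under stated assumptions: the enum
values, the 2-bit type field at bit 8 (TYPE_SHIFT/TYPE_MASK) and the helper
names set_type()/get_type() are made up for illustration, and the PTE field
is cleared before being set, whereas the patch uses |= with MASK_INSR().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified type enumeration; the values are assumptions. */
typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,
    p2m_ext_storage,      /* marker: the real type lives in the metadata */
    p2m_mmio_direct_io,   /* from here on, types do not fit in the PTE */
    p2m_grant_map_rw,
} p2m_type_t;

#define p2m_first_external p2m_ext_storage

typedef struct { unsigned long pte; } pte_t;

/* Assumed position and width of the in-PTE type field. */
#define TYPE_SHIFT 8
#define TYPE_MASK  (0x3UL << TYPE_SHIFT)

/* Always record the real type in the metadata when a metadata page exists. */
static void set_type(pte_t *pte, pte_t *metadata, unsigned int index,
                     p2m_type_t t)
{
    assert(t != p2m_ext_storage); /* the marker is never a caller's type */

    if ( metadata )
        metadata[index].pte = t;

    if ( t >= p2m_first_external )
    {
        assert(metadata != NULL); /* external types need a metadata page */
        t = p2m_ext_storage;
    }

    pte->pte = (pte->pte & ~TYPE_MASK) | ((unsigned long)t << TYPE_SHIFT);
}

/* The PTE bits are checked first; metadata is read only for the marker. */
static p2m_type_t get_type(pte_t pte, const pte_t *metadata,
                           unsigned int index)
{
    p2m_type_t type = (p2m_type_t)((pte.pte & TYPE_MASK) >> TYPE_SHIFT);

    if ( type == p2m_ext_storage )
        type = (p2m_type_t)metadata[index].pte;

    return type;
}
```

Since get_type() consults the metadata only when the PTE carries
p2m_ext_storage, whatever the metadata entry holds for an in-PTE type has no
effect on the result.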
But will it really reduce the number of conditionals? It seems like we still need one
condition to check whether metadata is mapped and one condition to set 't' to p2m_ext_storage:
    if ( metadata )
        metadata[ctx->index].pte = t;
    if ( t >= p2m_first_external )
        t = p2m_ext_storage;
    pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
We can do:
    if ( metadata )
    {
        metadata[ctx->index].pte = t;
        if ( t >= p2m_first_external )
            t = p2m_ext_storage;
    }
    pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
It will reduce the number of conditionals if metadata wasn't used/allocated, but I think you
have a different idea, don't you?
>
>>>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>>>> +
>>>> +/*
>>>> + * Free page table's page and metadata page linked to page table's page.
>>>> + */
>>>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>>>> +{
>>>> + if ( tbl_pg->v.md.pg )
>>>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
>>> To play safe, maybe better also clear tbl_pg->v.md.pg?
>> I thought it would be enough to clear it during allocation in p2m_alloc_page(),
>> since I'm not sure it is critical if md.pg data were somehow leaked and read.
>> But to be safer, we can add this here:
>> clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
> I didn't say clear what tbl_pg->v.md.pg points to, though. I suggested to clear
> the struct field itself.
Wouldn't just tbl_pg->v.md.pg = NULL; be enough?
Thanks!
~ Oleksii
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-20 15:38 ` Oleksii Kurochko
@ 2025-11-20 15:47 ` Jan Beulich
2025-11-20 16:52 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-20 15:47 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.11.2025 16:38, Oleksii Kurochko wrote:
> On 11/18/25 7:58 AM, Jan Beulich wrote:
>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>> + if ( *md_pg )
>>>>> + metadata = __map_domain_page(*md_pg);
>>>>> +
>>>>> + if ( t < p2m_first_external )
>>>>> + {
>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>> - return rc;
>>>>> + if ( metadata )
>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>> do not rely on that by default 0==p2m_invalid.
>> You don't, and ...
>>
>>> Just to be safe, I will add after "if ( metadata )" suggested
>>> BUILD_BUG_ON(p2m_invalid):
>>> if ( metadata )
>>> metadata[ctx->index].type = p2m_invalid;
>>> /*
>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>> * allocated and zero-initialized in p2m_alloc_page().
>>> */
>>> BUILD_BUG_ON(p2m_invalid);
>>> ...
>> ... this leaves me with the impression that you didn't read my reply correctly.
>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>> ftaod).
>
> I think I still don’t fully understand what the issue would be if p2m_invalid were
> ever equal to 1 instead of 0 in the context of a metadata page.
>
> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
> metadata page before it was assigned any type. They would find a value of 0, which
> corresponds to a valid type rather than to p2m_invalid, as one might expect.
> However, I’m not sure I currently see a scenario in which the metadata page would
> be read before being initialized.
Are you sure walks can only happen for GFNs that were set up? What you need to
do walks on is under guest control, after all.
> But just to be safe when such case will occur I am okay with putting
> BUILD_BUG_ON(p2m_invalid) before p2m_alloc_page() in p2m_set_type() function.
Thanks.
>>>>> +static void p2m_free_page(struct p2m_domain *p2m, struct page_info *pg);
>>>>> +
>>>>> +/*
>>>>> + * Free page table's page and metadata page linked to page table's page.
>>>>> + */
>>>>> +static void p2m_free_table(struct p2m_domain *p2m, struct page_info *tbl_pg)
>>>>> +{
>>>>> + if ( tbl_pg->v.md.pg )
>>>>> + p2m_free_page(p2m, tbl_pg->v.md.pg);
>>>> To play safe, maybe better also clear tbl_pg->v.md.pg?
>>> I thought it would be enough to clear it during allocation in p2m_alloc_page(),
>>> since I'm not sure it is critical if md.pg data were somehow leaked and read.
>>> But to be safer, we can add this here:
>>> clear_and_clean_page(tbl_pg->v.md.pg, p2m->clean_dcache);
>> I didn't say clear what tbl_pg->v.md.pg points to, though. I suggested to clear
>> the struct field itself.
>
> Won't be enough just tbl_pg->v.md.pg = NULL; ?
Exactly that, yes.
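
In other words, something along these lines. This is a sketch with stand-in
types rather than the real struct page_info; free_page_stub(), nr_freed and
free_table_metadata() are hypothetical names, with free() taking the place of
p2m_free_page().

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-in: only the field under discussion is modelled. */
struct page_info {
    struct {
        struct {
            struct page_info *pg; /* metadata page linked to this table */
        } md;
    } v;
};

/* Hypothetical counter, so that a double free would be noticeable. */
static unsigned int nr_freed;

/* Hypothetical stand-in for p2m_free_page(). */
static void free_page_stub(struct page_info *pg)
{
    free(pg);
    nr_freed++;
}

/* Free the linked metadata page and clear the field pointing to it. */
static void free_table_metadata(struct page_info *tbl_pg)
{
    assert(tbl_pg != NULL);

    if ( tbl_pg->v.md.pg )
    {
        free_page_stub(tbl_pg->v.md.pg);
        tbl_pg->v.md.pg = NULL; /* no dangling pointer is left behind */
    }
}
```

With the field cleared, a second pass over the same table page finds nothing
to free, rather than handing a stale pointer to the allocator again.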
Jan
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-20 15:47 ` Jan Beulich
@ 2025-11-20 16:52 ` Oleksii Kurochko
2025-11-21 7:35 ` Jan Beulich
0 siblings, 1 reply; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-20 16:52 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/20/25 4:47 PM, Jan Beulich wrote:
> On 20.11.2025 16:38, Oleksii Kurochko wrote:
>> On 11/18/25 7:58 AM, Jan Beulich wrote:
>>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>>> + if ( *md_pg )
>>>>>> + metadata = __map_domain_page(*md_pg);
>>>>>> +
>>>>>> + if ( t < p2m_first_external )
>>>>>> + {
>>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>>> - return rc;
>>>>>> + if ( metadata )
>>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>>> do not rely on that by default 0==p2m_invalid.
>>> You don't, and ...
>>>
>>>> Just to be safe, I will add after "if ( metadata )" suggested
>>>> BUILD_BUG_ON(p2m_invalid):
>>>> if ( metadata )
>>>> metadata[ctx->index].type = p2m_invalid;
>>>> /*
>>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>>> * allocated and zero-initialized in p2m_alloc_page().
>>>> */
>>>> BUILD_BUG_ON(p2m_invalid);
>>>> ...
>>> ... this leaves me with the impression that you didn't read my reply correctly.
>>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>>> ftaod).
>> I think I still don’t fully understand what the issue would be if p2m_invalid were
>> ever equal to 1 instead of 0 in the context of a metadata page.
>>
>> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
>> metadata page before it was assigned any type. They would find a value of 0, which
>> corresponds to a valid type rather than to p2m_invalid, as one might expect.
>> However, I’m not sure I currently see a scenario in which the metadata page would
>> be read before being initialized.
> Are you sure walks can only happen for GFNs that were set up? What you need to
> do walks on is under guest control, after all.
If a GFN lies within the range [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn], then
p2m_set_entry() must already have been called for this GFN. This means that either
- a metadata page has been created and its entry filled with the appropriate type, or
- no metadata page was needed and the type was stored directly in pte->pte
For a GFN outside the range (p2m->lowest_mapped_gfn, p2m->max_mapped_gfn),
p2m_get_entry() will not even attempt a walk because of the boundary checks:
    static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
                               p2m_type_t *t,
                               unsigned int *page_order)
    ...
        if ( check_outside_boundary(p2m, gfn, p2m->lowest_mapped_gfn, true,
                                    &level) )
            goto out;
        if ( check_outside_boundary(p2m, gfn, p2m->max_mapped_gfn, false, &level) )
            goto out;
If I am misunderstanding something and there are other cases where a walk can occur for
GFNs that were never set up, then such GFNs would have neither an allocated metadata
page nor a type stored in pte->pte, in which case we would be in trouble.
~ Oleksii
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-20 16:52 ` Oleksii Kurochko
@ 2025-11-21 7:35 ` Jan Beulich
2025-11-21 9:37 ` Oleksii Kurochko
0 siblings, 1 reply; 54+ messages in thread
From: Jan Beulich @ 2025-11-21 7:35 UTC (permalink / raw)
To: Oleksii Kurochko
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 20.11.2025 17:52, Oleksii Kurochko wrote:
>
> On 11/20/25 4:47 PM, Jan Beulich wrote:
>> On 20.11.2025 16:38, Oleksii Kurochko wrote:
>>> On 11/18/25 7:58 AM, Jan Beulich wrote:
>>>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>>>> + if ( *md_pg )
>>>>>>> + metadata = __map_domain_page(*md_pg);
>>>>>>> +
>>>>>>> + if ( t < p2m_first_external )
>>>>>>> + {
>>>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>>>> - return rc;
>>>>>>> + if ( metadata )
>>>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>>>> do not rely on that by default 0==p2m_invalid.
>>>> You don't, and ...
>>>>
>>>>> Just to be safe, I will add after "if ( metadata )" suggested
>>>>> BUILD_BUG_ON(p2m_invalid):
>>>>> if ( metadata )
>>>>> metadata[ctx->index].type = p2m_invalid;
>>>>> /*
>>>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>>>> * allocated and zero-initialized in p2m_alloc_page().
>>>>> */
>>>>> BUILD_BUG_ON(p2m_invalid);
>>>>> ...
>>>> ... this leaves me with the impression that you didn't read my reply correctly.
>>>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>>>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>>>> ftaod).
>>> I think I still don’t fully understand what the issue would be if p2m_invalid were
>>> ever equal to 1 instead of 0 in the context of a metadata page.
>>>
>>> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
>>> metadata page before it was assigned any type. They would find a value of 0, which
>>> corresponds to a valid type rather than to p2m_invalid, as one might expect.
>>> However, I’m not sure I currently see a scenario in which the metadata page would
>>> be read before being initialized.
>> Are you sure walks can only happen for GFNs that were set up? What you need to
>> do walks on is under guest control, after all.
>
> If a GFN lies within the range [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn], then
> p2m_set_entry() must already have been called for this GFN.
No. All you know from the pre-condition is that p2m_set_entry() was called for the
lowest and highest GFNs in that range.
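
A toy model of the problem, purely illustrative: struct toy_p2m, NR_GFNS and
the toy_*() helpers are hypothetical, and the initial bounds are assumptions
chosen so that the first mapping sets both of them.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical size of the toy guest address space. */
#define NR_GFNS 16u

/* Toy P2M: a per-GFN "was set up" flag plus the two tracked bounds. */
struct toy_p2m {
    bool mapped[NR_GFNS];
    unsigned int lowest_mapped_gfn;
    unsigned int max_mapped_gfn;
};

/* Start with an empty range: the lowest bound is above the highest one. */
static void toy_init(struct toy_p2m *p2m)
{
    for ( unsigned int i = 0; i < NR_GFNS; i++ )
        p2m->mapped[i] = false;

    p2m->lowest_mapped_gfn = UINT_MAX;
    p2m->max_mapped_gfn = 0;
}

/* Map one GFN and widen the tracked bounds as needed. */
static void toy_set_entry(struct toy_p2m *p2m, unsigned int gfn)
{
    assert(gfn < NR_GFNS);

    p2m->mapped[gfn] = true;

    if ( gfn < p2m->lowest_mapped_gfn )
        p2m->lowest_mapped_gfn = gfn;
    if ( gfn > p2m->max_mapped_gfn )
        p2m->max_mapped_gfn = gfn;
}

/* The boundary check alone: true if a walk would be attempted. */
static bool toy_in_range(const struct toy_p2m *p2m, unsigned int gfn)
{
    return gfn >= p2m->lowest_mapped_gfn && gfn <= p2m->max_mapped_gfn;
}
```

After mapping only GFNs 3 and 9, GFN 5 passes the boundary check although it
was never set up, so a walk for it has to cope with untouched entries.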
Jan
> This means that either
> - a metadata page has been created and its entry filled with the appropriate type, or
> - no metadata page was needed and the type was stored directly in pte->pte
>
> For a GFN outside the range (p2m->lowest_mapped_gfn, p2m->max_mapped_gfn),
> p2m_get_entry() will not even attempt a walk because of the boundary checks:
> static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
> p2m_type_t *t,
> unsigned int *page_order)
> ...
> if ( check_outside_boundary(p2m, gfn, p2m->lowest_mapped_gfn, true,
> &level) )
> goto out;
>
> if ( check_outside_boundary(p2m, gfn, p2m->max_mapped_gfn, false, &level) )
> goto out;
>
> If I am misunderstanding something and there are other cases where a walk can occur for
> GFNs that were never set up, then such GFNs would have neither an allocated metadata
> page nor a type stored in pte->pte, which looks like we are in trouble.
>
> ~ Oleksii
>
* Re: [for 4.22 v5 18/18] xen/riscv: introduce metadata table to store P2M type
2025-11-21 7:35 ` Jan Beulich
@ 2025-11-21 9:37 ` Oleksii Kurochko
0 siblings, 0 replies; 54+ messages in thread
From: Oleksii Kurochko @ 2025-11-21 9:37 UTC (permalink / raw)
To: Jan Beulich
Cc: Alistair Francis, Bob Eshleman, Connor Davis, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Roger Pau Monné,
Stefano Stabellini, xen-devel
On 11/21/25 8:35 AM, Jan Beulich wrote:
> On 20.11.2025 17:52, Oleksii Kurochko wrote:
>> On 11/20/25 4:47 PM, Jan Beulich wrote:
>>> On 20.11.2025 16:38, Oleksii Kurochko wrote:
>>>> On 11/18/25 7:58 AM, Jan Beulich wrote:
>>>>> On 17.11.2025 20:51, Oleksii Kurochko wrote:
>>>>>> On 11/12/25 12:49 PM, Jan Beulich wrote:
>>>>>>> On 20.10.2025 17:58, Oleksii Kurochko wrote:
>>>>>>>> + if ( *md_pg )
>>>>>>>> + metadata = __map_domain_page(*md_pg);
>>>>>>>> +
>>>>>>>> + if ( t < p2m_first_external )
>>>>>>>> + {
>>>>>>>> pte->pte |= MASK_INSR(t, P2M_TYPE_PTE_BITS_MASK);
>>>>>>>> - return rc;
>>>>>>>> + if ( metadata )
>>>>>>>> + metadata[ctx->index].pte = p2m_invalid;
>>>>>>> Shouldn't this be accompanied with a BUILD_BUG_ON(p2m_invalid), as otherwise
>>>>>>> p2m_alloc_page()'s clearing of the page won't have the intended effect?
>>>>>> I think that, at least, at the moment we are always explicitly set p2m type and
>>>>>> do not rely on that by default 0==p2m_invalid.
>>>>> You don't, and ...
>>>>>
>>>>>> Just to be safe, I will add after "if ( metadata )" suggested
>>>>>> BUILD_BUG_ON(p2m_invalid):
>>>>>> if ( metadata )
>>>>>> metadata[ctx->index].type = p2m_invalid;
>>>>>> /*
>>>>>> * metadata.type is expected to be p2m_invalid (0) after the page is
>>>>>> * allocated and zero-initialized in p2m_alloc_page().
>>>>>> */
>>>>>> BUILD_BUG_ON(p2m_invalid);
>>>>>> ...
>>>>> ... this leaves me with the impression that you didn't read my reply correctly.
>>>>> p2m_alloc_page() clear the page, thus _implicitly_ setting all entries to
>>>>> p2m_invalid. That's where the BUILD_BUG_ON() would want to go (the call site,
>>>>> ftaod).
>>>> I think I still don’t fully understand what the issue would be if p2m_invalid were
>>>> ever equal to 1 instead of 0 in the context of a metadata page.
>>>>
>>>> Yes, if p2m_invalid were 1, there would be a problem if someone tried to read this
>>>> metadata page before it was assigned any type. They would find a value of 0, which
>>>> corresponds to a valid type rather than to p2m_invalid, as one might expect.
>>>> However, I’m not sure I currently see a scenario in which the metadata page would
>>>> be read before being initialized.
>>> Are you sure walks can only happen for GFNs that were set up? What you need to
>>> do walks on is under guest control, after all.
>> If a GFN lies within the range [p2m->lowest_mapped_gfn, p2m->max_mapped_gfn], then
>> p2m_set_entry() must already have been called for this GFN.
> No. All you know from the pre-condition is that p2m_set_entry() was called for the
> lowest and highest GFNs in that range.
>
Oh, right. There could still be some GFNs inside the range for which p2m_set_entry() has
not yet been called. Then it probably makes sense to add a BUILD_BUG_ON here as well, before
"if ( type == p2m_ext_storage )":
    static p2m_type_t p2m_get_type(const pte_t pte, const struct p2m_pte_ctx *ctx)
    {
        p2m_type_t type = MASK_EXTR(pte.pte, P2M_TYPE_PTE_BITS_MASK);
        if ( type == p2m_ext_storage )
        {
            const pte_t *md = __map_domain_page(ctx->pt_page->v.md.pg);
            type = md[ctx->index].pte;
            unmap_domain_page(md);
        }
        return type;
    }
I would expect that if p2m_set_entry() has not been called for a GFN, then p2m_get_entry()
(p2m_get_type() will be called somewhere inside p2m_get_entry(), for example) should return
the p2m_invalid type. I think we want to have the same check (as the one before the call to
p2m_alloc_page()), placed before the condition:
    BUILD_BUG_ON(p2m_invalid);
~ Oleksii
>
>> This means that either
>> - a metadata page has been created and its entry filled with the appropriate type, or
>> - no metadata page was needed and the type was stored directly in pte->pte
>>
>> For a GFN outside the range (p2m->lowest_mapped_gfn, p2m->max_mapped_gfn),
>> p2m_get_entry() will not even attempt a walk because of the boundary checks:
>> static mfn_t p2m_get_entry(struct p2m_domain *p2m, gfn_t gfn,
>> p2m_type_t *t,
>> unsigned int *page_order)
>> ...
>> if ( check_outside_boundary(p2m, gfn, p2m->lowest_mapped_gfn, true,
>> &level) )
>> goto out;
>>
>> if ( check_outside_boundary(p2m, gfn, p2m->max_mapped_gfn, false, &level) )
>> goto out;
>>
>> If I am misunderstanding something and there are other cases where a walk can occur for
>> GFNs that were never set up, then such GFNs would have neither an allocated metadata
>> page nor a type stored in pte->pte, which looks like we are in trouble.
>>
>> ~ Oleksii
>>