* [PATCH v3 1/8] kconfig: turn PDX compression into a choice
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 2/8] pdx: provide a unified set of unit functions Roger Pau Monne
` (6 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
Anthony PERARD, Jan Beulich
Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
and make it part of the PDX compression choice block, in preparation for
adding further PDX compression algorithms.
The PDX compression defaults should still be the same for all
architectures, however the choice block cannot be protected under EXPERT
and still have a default choice being unconditionally selected. As a
result, the new "PDX (Page inDeX) compression" item will be unconditionally
visible in Kconfig, even on architectures like x86 that previously had no
way to enable PDX compression.
As part of this preparation work to introduce new PDX compressions, adjust
some of the comments on pdx.h to note they apply to a specific PDX
compression. Also shuffle function prototypes and dummy implementations
around to make it easier to introduce a new PDX compression. Note all
PDX compression implementations are expected to provide a
pdx_is_region_compressible() that takes the same set of arguments.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
xen/arch/arm/setup.c | 2 +-
xen/common/Kconfig | 18 +++++++++++++++---
xen/common/pdx.c | 4 ++--
xen/include/xen/pdx.h | 32 +++++++++++++++++++-------------
4 files changed, 37 insertions(+), 19 deletions(-)
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 27bd3f5a6ea3..ab30250bbf25 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -258,7 +258,7 @@ void __init init_pdx(void)
paddr_t bank_start, bank_size, bank_end, ram_end = 0;
int bank;
-#ifdef CONFIG_PDX_COMPRESSION
+#ifndef CONFIG_PDX_NONE
/*
* Arm does not have any restrictions on the bits to compress. Pass 0 to
* let the common code further restrict the mask.
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 2539a635f111..0014f9b3d6d7 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -57,9 +57,10 @@ config EVTCHN_FIFO
If unsure, say Y.
-config PDX_COMPRESSION
- bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
- default ARM || PPC
+choice
+ prompt "PDX (Page inDeX) compression"
+ default PDX_MASK_COMPRESSION if !X86 && !RISCV
+ default PDX_NONE
help
PDX compression is a technique designed to reduce the memory
overhead of physical memory management on platforms with sparse RAM
@@ -72,6 +73,17 @@ config PDX_COMPRESSION
If your platform does not have sparse RAM banks, do not enable PDX
compression.
+config PDX_MASK_COMPRESSION
+ bool "Mask compression"
+ help
+ Compression relying on all RAM addresses sharing a zeroed bit region.
+
+config PDX_NONE
+ bool "None"
+ help
+ No compression
+endchoice
+
config ALTERNATIVE_CALL
bool
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index b8384e6189df..00aa7e43006d 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -34,7 +34,7 @@ bool __mfn_valid(unsigned long mfn)
{
bool invalid = mfn >= max_page;
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
invalid |= mfn & pfn_hole_mask;
#endif
@@ -55,7 +55,7 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
__set_bit(idx, pdx_group_valid);
}
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
/*
* Diagram to make sense of the following variables. The masks and shifts
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index c1423d64a95b..8e373cac8b87 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -25,7 +25,7 @@
* this by keeping a bitmap of the ranges in the frame table containing
* invalid entries and not allocating backing memory for them.
*
- * ## PDX compression
+ * ## PDX mask compression
*
* This is a technique to avoid wasting memory on machines known to have
* split their machine address space in several big discontinuous and highly
@@ -101,22 +101,13 @@ bool __mfn_valid(unsigned long mfn);
#define paddr_to_pdx(pa) pfn_to_pdx(paddr_to_pfn(pa))
#define pdx_to_paddr(px) pfn_to_paddr(pdx_to_pfn(px))
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
extern unsigned long pfn_pdx_bottom_mask, ma_va_bottom_mask;
extern unsigned int pfn_pdx_hole_shift;
extern unsigned long pfn_hole_mask;
extern unsigned long pfn_top_mask, ma_top_mask;
-/**
- * Validate a region's compatibility with the current compression runtime
- *
- * @param base Base address of the region
- * @param npages Number of PAGE_SIZE-sized pages in the region
- * @return True iff the region can be used with the current compression
- */
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
-
/**
* Calculates a mask covering "moving" bits of all addresses of a region
*
@@ -209,7 +200,9 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
*/
void pfn_pdx_hole_setup(unsigned long mask);
-#else /* !CONFIG_PDX_COMPRESSION */
+#endif /* CONFIG_PDX_MASK_COMPRESSION */
+
+#ifdef CONFIG_PDX_NONE
/* Without PDX compression we can skip some computations */
@@ -241,7 +234,20 @@ static inline void pfn_pdx_hole_setup(unsigned long mask)
{
}
-#endif /* CONFIG_PDX_COMPRESSION */
+#else /* !CONFIG_PDX_NONE */
+
+/* Shared functions implemented by all PDX compressions. */
+
+/**
+ * Validate a region's compatibility with the current compression runtime
+ *
+ * @param base Base address of the region
+ * @param npages Number of PAGE_SIZE-sized pages in the region
+ * @return True iff the region can be used with the current compression
+ */
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
+
+#endif /* !CONFIG_PDX_NONE */
#endif /* __XEN_PDX_H__ */
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH v3 2/8] pdx: provide a unified set of unit functions
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 1/8] kconfig: turn PDX compression into a choice Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-29 13:06 ` Jan Beulich
2025-07-24 11:04 ` [PATCH v3 3/8] pdx: introduce command line compression toggle Roger Pau Monne
` (5 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
Anthony PERARD, Jan Beulich
The current setup (pdx_init_mask() and pdx_region_mask()) and init
(pfn_pdx_hole_setup()) PDX compression functions are tailored to the
existing PDX compression algorithm.
In preparation for introducing a new compression algorithm convert the
setup and init functions to more generic interfaces that aren't tied to the
compression in-use. To accomplish this introduce a function that registers
all the PFN RAM ranges, plus an init function.
This has the downside of requiring a static array to store such ranges
ahead of being processed by the setup function, however it's the only way
to pass all the possible information to the different compression setup
functions without using per-compression specific setup functions.
Per-arch compression setup also need to be adjusted to use the new
interface. There's a slight ordering adjustment, in that after PDX
compression setup the caller will check whether all the RAM regions are
properly covered by the newly setup compression, otherwise compression is
disabled by resetting to the initial values.
No functional change intended in the resulting PDX compression values.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
- Fix indentation.
Changes since v1:
- s/nr/nr_ranges/
---
xen/arch/arm/setup.c | 34 ++++++------
xen/arch/x86/srat.c | 28 ++++++----
xen/common/pdx.c | 122 ++++++++++++++++++++++++++++++++++++------
xen/include/xen/pdx.h | 73 +++++++++----------------
4 files changed, 166 insertions(+), 91 deletions(-)
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index ab30250bbf25..1d7cf2492fd6 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -256,9 +256,11 @@ void __init init_pdx(void)
{
const struct membanks *mem = bootinfo_get_mem();
paddr_t bank_start, bank_size, bank_end, ram_end = 0;
- int bank;
+ unsigned int bank;
#ifndef CONFIG_PDX_NONE
+ for ( bank = 0 ; bank < mem->nr_banks; bank++ )
+ pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
/*
* Arm does not have any restrictions on the bits to compress. Pass 0 to
* let the common code further restrict the mask.
@@ -266,26 +268,24 @@ void __init init_pdx(void)
* If the logic changes in pfn_pdx_hole_setup we might have to
* update this function too.
*/
- uint64_t mask = pdx_init_mask(0x0);
-
- for ( bank = 0 ; bank < mem->nr_banks; bank++ )
- {
- bank_start = mem->bank[bank].start;
- bank_size = mem->bank[bank].size;
-
- mask |= bank_start | pdx_region_mask(bank_start, bank_size);
- }
+ pfn_pdx_compression_setup(0);
for ( bank = 0 ; bank < mem->nr_banks; bank++ )
{
- bank_start = mem->bank[bank].start;
- bank_size = mem->bank[bank].size;
-
- if (~mask & pdx_region_mask(bank_start, bank_size))
- mask = 0;
+ if ( !pdx_is_region_compressible(
+ mem->bank[bank].start,
+ PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
+ PFN_DOWN(mem->bank[bank].start)) )
+ {
+ pfn_pdx_compression_reset();
+ printk(XENLOG_WARNING
+ "PFN compression disabled, RAM region [%#" PRIpaddr ", %#"
+ PRIpaddr "] not covered\n",
+ mem->bank[bank].start,
+ mem->bank[bank].start + mem->bank[bank].size - 1);
+ break;
+ }
}
-
- pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
#endif
for ( bank = 0 ; bank < mem->nr_banks; bank++ )
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 688f410287d4..747607439fb4 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -261,8 +261,6 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
void __init acpi_numa_arch_fixup(void) {}
-static uint64_t __initdata srat_region_mask;
-
static int __init cf_check srat_parse_region(
struct acpi_subtable_header *header, const unsigned long end)
{
@@ -282,15 +280,13 @@ static int __init cf_check srat_parse_region(
printk(KERN_INFO "SRAT: %013"PRIx64"-%013"PRIx64"\n",
ma->base_address, ma->base_address + ma->length - 1);
- srat_region_mask |= ma->base_address |
- pdx_region_mask(ma->base_address, ma->length);
+ pfn_pdx_add_region(ma->base_address, ma->length);
return 0;
}
void __init srat_parse_regions(paddr_t addr)
{
- u64 mask;
unsigned int i;
if (acpi_disabled || acpi_numa < 0 ||
@@ -299,19 +295,29 @@ void __init srat_parse_regions(paddr_t addr)
/* Set "PXM" as early as feasible. */
numa_fw_nid_name = "PXM";
- srat_region_mask = pdx_init_mask(addr);
acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
srat_parse_region, 0);
- for (mask = srat_region_mask, i = 0; mask && i < e820.nr_map; i++) {
+ pfn_pdx_compression_setup(addr);
+
+ /* Ensure all RAM ranges in the e820 are covered. */
+ for (i = 0; i < e820.nr_map; i++) {
if (e820.map[i].type != E820_RAM)
continue;
- if (~mask & pdx_region_mask(e820.map[i].addr, e820.map[i].size))
- mask = 0;
+ if (!pdx_is_region_compressible(
+ e820.map[i].addr,
+ PFN_UP(e820.map[i].addr + e820.map[i].size) -
+ PFN_DOWN(e820.map[i].addr))) {
+ pfn_pdx_compression_reset();
+ printk(XENLOG_WARNING
+ "PFN compression disabled, RAM region [%#" PRIx64
+ ", %#" PRIx64 "] not covered\n",
+ e820.map[i].addr,
+ e820.map[i].addr + e820.map[i].size - 1);
+ return;
+ }
}
-
- pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
}
unsigned int numa_node_to_arch_nid(nodeid_t n)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 00aa7e43006d..49c9dbf9e5fa 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -19,6 +19,7 @@
#include <xen/mm.h>
#include <xen/bitops.h>
#include <xen/nospec.h>
+#include <xen/pfn.h>
#include <xen/sections.h>
/**
@@ -55,6 +56,44 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
__set_bit(idx, pdx_group_valid);
}
+#ifndef CONFIG_PDX_NONE
+
+#ifdef CONFIG_X86
+# include <asm/e820.h>
+# define MAX_PFN_RANGES E820MAX
+#elif defined(CONFIG_HAS_DEVICE_TREE_DISCOVERY)
+# include <xen/bootinfo.h>
+# define MAX_PFN_RANGES NR_MEM_BANKS
+#endif
+
+#ifndef MAX_PFN_RANGES
+# error "Missing architecture maximum number of RAM ranges"
+#endif
+
+/* Generic PFN compression helpers. */
+static struct pfn_range {
+ unsigned long base, size;
+} ranges[MAX_PFN_RANGES] __initdata;
+static unsigned int __initdata nr_ranges;
+
+void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
+{
+ if ( !size )
+ return;
+
+ if ( nr_ranges >= ARRAY_SIZE(ranges) )
+ {
+ ASSERT((nr_ranges + 1) > nr_ranges);
+ nr_ranges++;
+ return;
+ }
+
+ ranges[nr_ranges].base = PFN_DOWN(base);
+ ranges[nr_ranges++].size = PFN_UP(base + size) - PFN_DOWN(base);
+}
+
+#endif /* !CONFIG_PDX_NONE */
+
#ifdef CONFIG_PDX_MASK_COMPRESSION
/*
@@ -115,20 +154,25 @@ static uint64_t fill_mask(uint64_t mask)
return mask;
}
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
-{
- return !(paddr_to_pfn(base) & pfn_hole_mask) &&
- !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
-}
-
-/* We don't want to compress the low MAX_ORDER bits of the addresses. */
-uint64_t __init pdx_init_mask(uint64_t base_addr)
-{
- return fill_mask(max(base_addr,
- (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
-}
-
-uint64_t pdx_region_mask(uint64_t base, uint64_t len)
+/**
+ * Calculates a mask covering "moving" bits of all addresses of a region
+ *
+ * The i-th bit of the mask must be set if there's 2 different addresses
+ * in the region that have different j-th bits. where j >= i.
+ *
+ * e.g:
+ * base=0x1B00000000
+ * len+base=0x1B00042000
+ *
+ * ought to return 0x000007FFFF, which implies that every bit position
+ * with a zero in the mask remains unchanged in every address of the
+ * region.
+ *
+ * @param base Base address of the region
+ * @param len Size in octets of the region
+ * @return Mask of moving bits at the bottom of all the region addresses
+ */
+static uint64_t pdx_region_mask(uint64_t base, uint64_t len)
{
/*
* We say a bit "moves" in a range if there exist 2 addresses in that
@@ -143,9 +187,45 @@ uint64_t pdx_region_mask(uint64_t base, uint64_t len)
return fill_mask(base ^ (base + len - 1));
}
-void __init pfn_pdx_hole_setup(unsigned long mask)
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+ return !(paddr_to_pfn(base) & pfn_hole_mask) &&
+ !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
+}
+
+/**
+ * Creates the mask to start from when calculating non-compressible bits
+ *
+ * This function is intimately related to pdx_region_mask(), and together
+ * they are meant to calculate the mask of non-compressible bits given the
+ * current memory map.
+ *
+ * @param base_addr Address of the first maddr in the system
+ * @return An integer of the form 2^n - 1
+ */
+static uint64_t __init pdx_init_mask(uint64_t base_addr)
+{
+ return fill_mask(max(base_addr,
+ /* Don't compress the low MAX_ORDER bits. */
+ (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
+}
+
+void __init pfn_pdx_compression_setup(paddr_t base)
{
unsigned int i, j, bottom_shift = 0, hole_shift = 0;
+ unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT;
+
+ if ( nr_ranges > ARRAY_SIZE(ranges) )
+ {
+ printk(XENLOG_WARNING
+ "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
+ nr_ranges, ARRAY_SIZE(ranges));
+ return;
+ }
+
+ for ( i = 0; i < nr_ranges; i++ )
+ mask |= ranges[i].base |
+ pdx_region_mask(ranges[i].base, ranges[i].size);
/*
* We skip the first MAX_ORDER bits, as we never want to compress them.
@@ -184,6 +264,18 @@ void __init pfn_pdx_hole_setup(unsigned long mask)
ma_top_mask = pfn_top_mask << PAGE_SHIFT;
}
+void __init pfn_pdx_compression_reset(void)
+{
+ pfn_pdx_bottom_mask = ~0UL;
+ ma_va_bottom_mask = ~0UL;
+ pfn_top_mask = 0;
+ ma_top_mask = 0;
+ pfn_hole_mask = 0;
+ pfn_pdx_hole_shift = 0;
+
+ nr_ranges = 0;
+}
+
#endif /* CONFIG_PDX_COMPRESSION */
/*
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 8e373cac8b87..10153da98bf1 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -108,38 +108,6 @@ extern unsigned int pfn_pdx_hole_shift;
extern unsigned long pfn_hole_mask;
extern unsigned long pfn_top_mask, ma_top_mask;
-/**
- * Calculates a mask covering "moving" bits of all addresses of a region
- *
- * The i-th bit of the mask must be set if there's 2 different addresses
- * in the region that have different j-th bits. where j >= i.
- *
- * e.g:
- * base=0x1B00000000
- * len+base=0x1B00042000
- *
- * ought to return 0x000007FFFF, which implies that every bit position
- * with a zero in the mask remains unchanged in every address of the
- * region.
- *
- * @param base Base address of the region
- * @param len Size in octets of the region
- * @return Mask of moving bits at the bottom of all the region addresses
- */
-uint64_t pdx_region_mask(uint64_t base, uint64_t len);
-
-/**
- * Creates the mask to start from when calculating non-compressible bits
- *
- * This function is intimately related to pdx_region_mask(), and together
- * they are meant to calculate the mask of non-compressible bits given the
- * current memory map.
- *
- * @param base_addr Address of the first maddr in the system
- * @return An integer of the form 2^n - 1
- */
-uint64_t pdx_init_mask(uint64_t base_addr);
-
/**
* Map pfn to its corresponding pdx
*
@@ -189,17 +157,6 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
(offset & ma_va_bottom_mask));
}
-/**
- * Initializes global variables with information about the compressible
- * range of the current memory regions.
- *
- * @param mask This mask is the biggest pdx_mask of every region in the
- * system ORed with all base addresses of every region in the
- * system. This results in a mask where every zero in a bit
- * position marks a potentially compressible bit.
- */
-void pfn_pdx_hole_setup(unsigned long mask);
-
#endif /* CONFIG_PDX_MASK_COMPRESSION */
#ifdef CONFIG_PDX_NONE
@@ -220,17 +177,15 @@ static inline bool pdx_is_region_compressible(paddr_t base,
return true;
}
-static inline uint64_t pdx_init_mask(uint64_t base_addr)
+static inline void pfn_pdx_add_region(paddr_t base, paddr_t size)
{
- return 0;
}
-static inline uint64_t pdx_region_mask(uint64_t base, uint64_t len)
+static inline void pfn_pdx_compression_setup(paddr_t base)
{
- return 0;
}
-static inline void pfn_pdx_hole_setup(unsigned long mask)
+static inline void pfn_pdx_compression_reset(void)
{
}
@@ -247,6 +202,28 @@ static inline void pfn_pdx_hole_setup(unsigned long mask)
*/
bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
+/**
+ * Register a RAM region with the PFN compression logic.
+ *
+ * @param base Start of the region in bytes.
+ * @param size Length of the region in bytes.
+ */
+void pfn_pdx_add_region(paddr_t base, paddr_t size);
+
+/**
+ * Initializes global variables with information about the compressible
+ * range of the current memory regions.
+ *
+ * @param base address to start compression from.
+ */
+void pfn_pdx_compression_setup(paddr_t base);
+
+/**
+ * Reset the global variables to it's default values, thus disabling PFN
+ * compression.
+ */
+void pfn_pdx_compression_reset(void);
+
#endif /* !CONFIG_PDX_NONE */
#endif /* __XEN_PDX_H__ */
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH v3 2/8] pdx: provide a unified set of unit functions
2025-07-24 11:04 ` [PATCH v3 2/8] pdx: provide a unified set of unit functions Roger Pau Monne
@ 2025-07-29 13:06 ` Jan Beulich
0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2025-07-29 13:06 UTC (permalink / raw)
To: Roger Pau Monne
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel
On 24.07.2025 13:04, Roger Pau Monne wrote:
> The current setup (pdx_init_mask() and pdx_region_mask()) and init
> (pfn_pdx_hole_setup()) PDX compression functions are tailored to the
> existing PDX compression algorithm.
>
> In preparation for introducing a new compression algorithm convert the
> setup and init functions to more generic interfaces that aren't tied to the
> compression in-use. To accomplish this introduce a function that registers
> all the PFN RAM ranges, plus an init function.
>
> This has the downside of requiring a static array to store such ranges
> ahead of being processed by the setup function, however it's the only way
> to pass all the possible information to the different compression setup
> functions without using per-compression specific setup functions.
> Per-arch compression setup also need to be adjusted to use the new
> interface. There's a slight ordering adjustment, in that after PDX
> compression setup the caller will check whether all the RAM regions are
> properly covered by the newly setup compression, otherwise compression is
> disabled by resetting to the initial values.
>
> No functional change intended in the resulting PDX compression values.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v3 3/8] pdx: introduce command line compression toggle
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 1/8] kconfig: turn PDX compression into a choice Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 2/8] pdx: provide a unified set of unit functions Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
` (4 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
Jan Beulich, Julien Grall, Stefano Stabellini,
Roger Pau Monné
Introduce a command line option to allow disabling PDX compression. The
disabling is done by turning pfn_pdx_add_region() into a no-op, so when
attempting to initialize the selected compression algorithm the array of
ranges to compress is empty.
Signed-off-by: Roger Pau Monné <roger.pau@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Changes since v1:
- New in this version.
---
docs/misc/xen-command-line.pandoc | 9 +++++++++
xen/common/pdx.c | 14 +++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 6865a61220ca..5dd2ab82b7aa 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2072,6 +2072,15 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for
those not subject to XPTI (`no-xpti`). The feature is used only in case
INVPCID is supported and not disabled via `invpcid=false`.
+### pdx-compress
+> `= <boolean>`
+
+> Default: `true` if CONFIG_PDX_NONE is unset
+
+Only relevant when the hypervisor is build with PFN PDX compression. Controls
+whether Xen will engage in PFN compression. The algorithm used for PFN
+compression is selected at build time from Kconfig.
+
### ple_gap
> `= <integer>`
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 49c9dbf9e5fa..ad7871ad90be 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -19,6 +19,7 @@
#include <xen/mm.h>
#include <xen/bitops.h>
#include <xen/nospec.h>
+#include <xen/param.h>
#include <xen/pfn.h>
#include <xen/sections.h>
@@ -76,9 +77,13 @@ static struct pfn_range {
} ranges[MAX_PFN_RANGES] __initdata;
static unsigned int __initdata nr_ranges;
+static bool __initdata pdx_compress = true;
+boolean_param("pdx-compress", pdx_compress);
+
void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
{
- if ( !size )
+ /* Without ranges there's no PFN compression. */
+ if ( !size || !pdx_compress )
return;
if ( nr_ranges >= ARRAY_SIZE(ranges) )
@@ -215,6 +220,13 @@ void __init pfn_pdx_compression_setup(paddr_t base)
unsigned int i, j, bottom_shift = 0, hole_shift = 0;
unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT;
+ if ( !nr_ranges )
+ {
+ printk(XENLOG_DEBUG "PFN compression disabled%s\n",
+ pdx_compress ? ": no ranges provided" : "");
+ return;
+ }
+
if ( nr_ranges > ARRAY_SIZE(ranges) )
{
printk(XENLOG_WARNING
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
` (2 preceding siblings ...)
2025-07-24 11:04 ` [PATCH v3 3/8] pdx: introduce command line compression toggle Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-29 13:20 ` Jan Beulich
2025-07-29 14:54 ` Andrew Cooper
2025-07-24 11:04 ` [PATCH v3 5/8] test/pdx: add PDX compression unit tests Roger Pau Monne
` (3 subsequent siblings)
7 siblings, 2 replies; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
Anthony PERARD, Jan Beulich, Shawn Anastasio, Alistair Francis,
Bob Eshleman, Connor Davis, Oleksii Kurochko
There are four performance critical PDX conversion helpers that do the PFN
to/from PDX and the physical addresses to/from directmap offsets
translations.
In the absence of an active PDX compression, those functions would still do
the calculations needed, just to return the same input value as no
translation is in place and hence PFN and PDX spaces are identity mapped.
To reduce the overhead of having to do the pointless calculations allow
architectures to implement the translation helpers in a per-arch header.
Rename the existing conversion functions to add a trailing _xlate suffix,
so that the per-arch headers can define the non suffixed versions.
Currently only x86 implements meaningful custom handlers to short circuit
the translation when not active, using asm goto. Other architectures use a
generic header that maps the non-xlate to the xlate variants to keep the
previous behavior.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
- Consume the skip label as parameter in the PDX optimize macro.
Changes since v1:
- Pull return out of OPTIMIZE_PDX macro.
- undef OPTIMIZE_PDX.
---
Would it make sense to move the x86 implementation to the common pdx.h
header and let architectures define PDX_ASM_GOTO_SKIP instead?
---
xen/arch/arm/include/asm/Makefile | 1 +
xen/arch/ppc/include/asm/Makefile | 1 +
xen/arch/riscv/include/asm/Makefile | 1 +
xen/arch/x86/include/asm/cpufeatures.h | 1 +
xen/arch/x86/include/asm/pdx.h | 75 ++++++++++++++++++++++++++
xen/arch/x86/srat.c | 6 ++-
xen/common/pdx.c | 10 ++--
xen/include/asm-generic/pdx.h | 24 +++++++++
xen/include/xen/pdx.h | 22 +++++---
9 files changed, 130 insertions(+), 11 deletions(-)
create mode 100644 xen/arch/x86/include/asm/pdx.h
create mode 100644 xen/include/asm-generic/pdx.h
diff --git a/xen/arch/arm/include/asm/Makefile b/xen/arch/arm/include/asm/Makefile
index 4565baca6a4d..cec13c889dab 100644
--- a/xen/arch/arm/include/asm/Makefile
+++ b/xen/arch/arm/include/asm/Makefile
@@ -5,6 +5,7 @@ generic-y += hardirq.h
generic-y += iocap.h
generic-y += irq-dt.h
generic-y += paging.h
+generic-y += pdx.h
generic-y += percpu.h
generic-y += random.h
generic-y += softirq.h
diff --git a/xen/arch/ppc/include/asm/Makefile b/xen/arch/ppc/include/asm/Makefile
index c989a7f89b34..0ad45133baac 100644
--- a/xen/arch/ppc/include/asm/Makefile
+++ b/xen/arch/ppc/include/asm/Makefile
@@ -6,6 +6,7 @@ generic-y += hardirq.h
generic-y += hypercall.h
generic-y += iocap.h
generic-y += paging.h
+generic-y += pdx.h
generic-y += percpu.h
generic-y += perfc_defn.h
generic-y += random.h
diff --git a/xen/arch/riscv/include/asm/Makefile b/xen/arch/riscv/include/asm/Makefile
index bfdf186c682f..de04daf68df3 100644
--- a/xen/arch/riscv/include/asm/Makefile
+++ b/xen/arch/riscv/include/asm/Makefile
@@ -7,6 +7,7 @@ generic-y += hypercall.h
generic-y += iocap.h
generic-y += irq-dt.h
generic-y += paging.h
+generic-y += pdx.h
generic-y += percpu.h
generic-y += perfc_defn.h
generic-y += random.h
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index bc108c3819e8..71308d9dafc8 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -42,6 +42,7 @@ XEN_CPUFEATURE(XEN_IBT, X86_SYNTH(27)) /* Xen uses CET Indirect Branch
XEN_CPUFEATURE(IBPB_ENTRY_PV, X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
XEN_CPUFEATURE(IBPB_ENTRY_HVM, X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
+XEN_CPUFEATURE(PDX_COMPRESSION, X86_SYNTH(31)) /* PDX compression */
/* Bug words follow the synthetic words. */
#define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/pdx.h b/xen/arch/x86/include/asm/pdx.h
new file mode 100644
index 000000000000..00666e49831a
--- /dev/null
+++ b/xen/arch/x86/include/asm/pdx.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef X86_PDX_H
+#define X86_PDX_H
+
+#ifndef CONFIG_PDX_NONE
+
+#include <asm/alternative.h>
+
+/*
+ * Introduce a macro to avoid repeating the same asm goto block in each helper.
+ * Note the macro is strictly tied to the code in the helpers.
+ */
+#define PDX_ASM_GOTO(label) \
+ asm_inline goto ( \
+ ALTERNATIVE( \
+ "", \
+ "jmp %l0", \
+ ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \
+ : : : : label )
+
+static inline unsigned long pfn_to_pdx(unsigned long pfn)
+{
+ PDX_ASM_GOTO(skip);
+
+ return pfn_to_pdx_xlate(pfn);
+
+ skip:
+ return pfn;
+}
+
+static inline unsigned long pdx_to_pfn(unsigned long pdx)
+{
+ PDX_ASM_GOTO(skip);
+
+ return pdx_to_pfn_xlate(pdx);
+
+ skip:
+ return pdx;
+}
+
+static inline unsigned long maddr_to_directmapoff(paddr_t ma)
+{
+ PDX_ASM_GOTO(skip);
+
+ return maddr_to_directmapoff_xlate(ma);
+
+ skip:
+ return ma;
+}
+
+static inline paddr_t directmapoff_to_maddr(unsigned long offset)
+{
+ PDX_ASM_GOTO(skip);
+
+ return directmapoff_to_maddr_xlate(offset);
+
+ skip:
+ return offset;
+}
+
+#undef PDX_ASM_GOTO_SKIP
+
+#endif /* !CONFIG_PDX_NONE */
+
+#endif /* X86_PDX_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 747607439fb4..42ccb0c0f3ae 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -298,7 +298,8 @@ void __init srat_parse_regions(paddr_t addr)
acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
srat_parse_region, 0);
- pfn_pdx_compression_setup(addr);
+ if (!pfn_pdx_compression_setup(addr))
+ return;
/* Ensure all RAM ranges in the e820 are covered. */
for (i = 0; i < e820.nr_map; i++) {
@@ -318,6 +319,9 @@ void __init srat_parse_regions(paddr_t addr)
return;
}
}
+
+ /* If we got this far compression is working as expected. */
+ setup_force_cpu_cap(X86_FEATURE_PDX_COMPRESSION);
}
unsigned int numa_node_to_arch_nid(nodeid_t n)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index ad7871ad90be..f4ee2198841d 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -215,7 +215,7 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr)
(uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
}
-void __init pfn_pdx_compression_setup(paddr_t base)
+bool __init pfn_pdx_compression_setup(paddr_t base)
{
unsigned int i, j, bottom_shift = 0, hole_shift = 0;
unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT;
@@ -224,7 +224,7 @@ void __init pfn_pdx_compression_setup(paddr_t base)
{
printk(XENLOG_DEBUG "PFN compression disabled%s\n",
pdx_compress ? ": no ranges provided" : "");
- return;
+ return false;
}
if ( nr_ranges > ARRAY_SIZE(ranges) )
@@ -232,7 +232,7 @@ void __init pfn_pdx_compression_setup(paddr_t base)
printk(XENLOG_WARNING
"Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
nr_ranges, ARRAY_SIZE(ranges));
- return;
+ return false;
}
for ( i = 0; i < nr_ranges; i++ )
@@ -263,7 +263,7 @@ void __init pfn_pdx_compression_setup(paddr_t base)
}
}
if ( !hole_shift )
- return;
+ return false;
printk(KERN_INFO "PFN compression on bits %u...%u\n",
bottom_shift, bottom_shift + hole_shift - 1);
@@ -274,6 +274,8 @@ void __init pfn_pdx_compression_setup(paddr_t base)
pfn_hole_mask = ((1UL << hole_shift) - 1) << bottom_shift;
pfn_top_mask = ~(pfn_pdx_bottom_mask | pfn_hole_mask);
ma_top_mask = pfn_top_mask << PAGE_SHIFT;
+
+ return true;
}
void __init pfn_pdx_compression_reset(void)
diff --git a/xen/include/asm-generic/pdx.h b/xen/include/asm-generic/pdx.h
new file mode 100644
index 000000000000..4dea2b97c3e5
--- /dev/null
+++ b/xen/include/asm-generic/pdx.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef GENERIC_PDX_H
+#define GENERIC_PDX_H
+
+#ifndef CONFIG_PDX_NONE
+
+#define pdx_to_pfn pdx_to_pfn_xlate
+#define pfn_to_pdx pfn_to_pdx_xlate
+#define maddr_to_directmapoff maddr_to_directmapoff_xlate
+#define directmapoff_to_maddr directmapoff_to_maddr_xlate
+
+#endif /* !CONFIG_PDX_NONE */
+
+#endif /* GENERIC_PDX_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 10153da98bf1..91fc32370f21 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -114,7 +114,7 @@ extern unsigned long pfn_top_mask, ma_top_mask;
* @param pfn Frame number
* @return Obtained pdx after compressing the pfn
*/
-static inline unsigned long pfn_to_pdx(unsigned long pfn)
+static inline unsigned long pfn_to_pdx_xlate(unsigned long pfn)
{
return (pfn & pfn_pdx_bottom_mask) |
((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
@@ -126,7 +126,7 @@ static inline unsigned long pfn_to_pdx(unsigned long pfn)
* @param pdx Page index
* @return Obtained pfn after decompressing the pdx
*/
-static inline unsigned long pdx_to_pfn(unsigned long pdx)
+static inline unsigned long pdx_to_pfn_xlate(unsigned long pdx)
{
return (pdx & pfn_pdx_bottom_mask) |
((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
@@ -139,7 +139,7 @@ static inline unsigned long pdx_to_pfn(unsigned long pdx)
* @return Offset on the direct map where that
* machine address can be accessed
*/
-static inline unsigned long maddr_to_directmapoff(paddr_t ma)
+static inline unsigned long maddr_to_directmapoff_xlate(paddr_t ma)
{
return (((ma & ma_top_mask) >> pfn_pdx_hole_shift) |
(ma & ma_va_bottom_mask));
@@ -151,7 +151,7 @@ static inline unsigned long maddr_to_directmapoff(paddr_t ma)
* @param offset Offset into the direct map
* @return Corresponding machine address of that virtual location
*/
-static inline paddr_t directmapoff_to_maddr(unsigned long offset)
+static inline paddr_t directmapoff_to_maddr_xlate(unsigned long offset)
{
return ((((paddr_t)offset << pfn_pdx_hole_shift) & ma_top_mask) |
(offset & ma_va_bottom_mask));
@@ -159,6 +159,14 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
#endif /* CONFIG_PDX_MASK_COMPRESSION */
+/*
+ * Allow each architecture to define it's (possibly optimized) versions of the
+ * translation functions.
+ *
+ * Do not use _xlate suffixed functions, always use the non _xlate variants.
+ */
+#include <asm/pdx.h>
+
#ifdef CONFIG_PDX_NONE
/* Without PDX compression we can skip some computations */
@@ -181,8 +189,9 @@ static inline void pfn_pdx_add_region(paddr_t base, paddr_t size)
{
}
-static inline void pfn_pdx_compression_setup(paddr_t base)
+static inline bool pfn_pdx_compression_setup(paddr_t base)
{
+ return false;
}
static inline void pfn_pdx_compression_reset(void)
@@ -215,8 +224,9 @@ void pfn_pdx_add_region(paddr_t base, paddr_t size);
* range of the current memory regions.
*
* @param base address to start compression from.
+ * @return True if PDX compression has been enabled.
*/
-void pfn_pdx_compression_setup(paddr_t base);
+bool pfn_pdx_compression_setup(paddr_t base);
/**
* Reset the global variables to it's default values, thus disabling PFN
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers
2025-07-24 11:04 ` [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
@ 2025-07-29 13:20 ` Jan Beulich
2025-07-29 14:54 ` Andrew Cooper
1 sibling, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2025-07-29 13:20 UTC (permalink / raw)
To: Roger Pau Monne
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, Shawn Anastasio,
Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko,
xen-devel
On 24.07.2025 13:04, Roger Pau Monne wrote:
> There are four performance critical PDX conversion helpers that do the PFN
> to/from PDX and the physical addresses to/from directmap offsets
> translations.
>
> In the absence of an active PDX compression, those functions would still do
> the calculations needed, just to return the same input value as no
> translation is in place and hence PFN and PDX spaces are identity mapped.
>
> To reduce the overhead of having to do the pointless calculations allow
> architectures to implement the translation helpers in a per-arch header.
> Rename the existing conversion functions to add a trailing _xlate suffix,
> so that the per-arch headers can define the non suffixed versions.
>
> Currently only x86 implements meaningful custom handlers to short circuit
> the translation when not active, using asm goto. Other architectures use a
> generic header that maps the non-xlate to the xlate variants to keep the
> previous behavior.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
> --- /dev/null
> +++ b/xen/arch/x86/include/asm/pdx.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#ifndef X86_PDX_H
> +#define X86_PDX_H
> +
> +#ifndef CONFIG_PDX_NONE
> +
> +#include <asm/alternative.h>
> +
> +/*
> + * Introduce a macro to avoid repeating the same asm goto block in each helper.
> + * Note the macro is strictly tied to the code in the helpers.
> + */
> +#define PDX_ASM_GOTO(label) \
> + asm_inline goto ( \
> + ALTERNATIVE( \
> + "", \
> + "jmp %l0", \
> + ALT_NOT(X86_FEATURE_PDX_COMPRESSION)) \
> + : : : : label )
Considering the uses of the macro, it seems likely that in some of the cases the
label will be within reach of a short JMP (opcode 0xEB), while the assembler will
be unable to encode it like that (for living in a separate section, far off from
the place the code will be copied to). That's likely okay for now, but we may
want to eliminate this minor ineffeciency later on.
Jan
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers
2025-07-24 11:04 ` [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
2025-07-29 13:20 ` Jan Beulich
@ 2025-07-29 14:54 ` Andrew Cooper
2025-08-04 10:29 ` Roger Pau Monné
1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2025-07-29 14:54 UTC (permalink / raw)
To: Roger Pau Monne, xen-devel
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Anthony PERARD, Jan Beulich, Shawn Anastasio,
Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko
On 24/07/2025 12:04 pm, Roger Pau Monne wrote:
> diff --git a/xen/arch/arm/include/asm/Makefile b/xen/arch/arm/include/asm/Makefile
> index 4565baca6a4d..cec13c889dab 100644
> --- a/xen/arch/arm/include/asm/Makefile
> +++ b/xen/arch/arm/include/asm/Makefile
> @@ -5,6 +5,7 @@ generic-y += hardirq.h
> generic-y += iocap.h
> generic-y += irq-dt.h
> generic-y += paging.h
> +generic-y += pdx.h
> generic-y += percpu.h
> generic-y += random.h
> generic-y += softirq.h
Please could I talk you into using __has_include__, as I'm slowly trying
to purge asm-generic.
> diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
> index 10153da98bf1..91fc32370f21 100644
> --- a/xen/include/xen/pdx.h
> +++ b/xen/include/xen/pdx.h
> @@ -159,6 +159,14 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
>
> #endif /* CONFIG_PDX_MASK_COMPRESSION */
>
> +/*
> + * Allow each architecture to define it's (possibly optimized) versions of the
> + * translation functions.
> + *
> + * Do not use _xlate suffixed functions, always use the non _xlate variants.
> + */
> +#include <asm/pdx.h>
> +
You want:
#if __has_include__(<asm/pdx.h>)
# include <asm/pdx.h>
#else
// common defaults.
#endif
here. This lets you implement the x86 special case only, and keeps all
the common logic in one file.
~Andrew
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers
2025-07-29 14:54 ` Andrew Cooper
@ 2025-08-04 10:29 ` Roger Pau Monné
0 siblings, 0 replies; 19+ messages in thread
From: Roger Pau Monné @ 2025-08-04 10:29 UTC (permalink / raw)
To: Andrew Cooper
Cc: xen-devel, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Anthony PERARD, Jan Beulich,
Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
Oleksii Kurochko
On Tue, Jul 29, 2025 at 03:54:24PM +0100, Andrew Cooper wrote:
> On 24/07/2025 12:04 pm, Roger Pau Monne wrote:
> > diff --git a/xen/arch/arm/include/asm/Makefile b/xen/arch/arm/include/asm/Makefile
> > index 4565baca6a4d..cec13c889dab 100644
> > --- a/xen/arch/arm/include/asm/Makefile
> > +++ b/xen/arch/arm/include/asm/Makefile
> > @@ -5,6 +5,7 @@ generic-y += hardirq.h
> > generic-y += iocap.h
> > generic-y += irq-dt.h
> > generic-y += paging.h
> > +generic-y += pdx.h
> > generic-y += percpu.h
> > generic-y += random.h
> > generic-y += softirq.h
>
> Please could I talk you into using __has_include__, as I'm slowly trying
> to purge asm-generic.
>
> > diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
> > index 10153da98bf1..91fc32370f21 100644
> > --- a/xen/include/xen/pdx.h
> > +++ b/xen/include/xen/pdx.h
> > @@ -159,6 +159,14 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
> >
> > #endif /* CONFIG_PDX_MASK_COMPRESSION */
> >
> > +/*
> > + * Allow each architecture to define it's (possibly optimized) versions of the
> > + * translation functions.
> > + *
> > + * Do not use _xlate suffixed functions, always use the non _xlate variants.
> > + */
> > +#include <asm/pdx.h>
> > +
>
> You want:
>
> #if __has_include__(<asm/pdx.h>)
> # include <asm/pdx.h>
> #else
>
> // common defaults.
>
> #endif
>
> here. This lets you implement the x86 special case only, and keeps all
> the common logic in one file.
Sure, I wasn't aware of this functionality.
Thanks, Roger.
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v3 5/8] test/pdx: add PDX compression unit tests
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
` (3 preceding siblings ...)
2025-07-24 11:04 ` [PATCH v3 4/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 6/8] pdx: move some helpers in preparation for new compression Roger Pau Monne
` (2 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Anthony PERARD, Andrew Cooper, Michal Orzel,
Jan Beulich, Julien Grall, Stefano Stabellini
Introduce a set of unit tests for PDX compression. The unit tests contains
both real and crafted memory maps that are then compressed using the
selected PDX algorithm. Note the build system for the unit tests has been
done in a way to support adding new compression algorithms easily. That
requires generating a new test-pdx-<compress> executable that's build with
the selected PDX compression enabled.
Currently the only generated executable is test-pdx-mask that tests PDX
mask compression.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
- Use set -e when running the tests.
- Drop stray removal of pdx.c.
- Use * instead of ? for header blank space capture.
- Add some more memory maps.
Changes since v1:
- New in this version (partially pulled out from a different patch).
---
tools/tests/Makefile | 1 +
tools/tests/pdx/.gitignore | 2 +
tools/tests/pdx/Makefile | 49 +++++++
tools/tests/pdx/harness.h | 89 +++++++++++++
tools/tests/pdx/test-pdx.c | 267 +++++++++++++++++++++++++++++++++++++
xen/common/pdx.c | 4 +
6 files changed, 412 insertions(+)
create mode 100644 tools/tests/pdx/.gitignore
create mode 100644 tools/tests/pdx/Makefile
create mode 100644 tools/tests/pdx/harness.h
create mode 100644 tools/tests/pdx/test-pdx.c
diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 36928676a666..97ba2a13894d 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -9,6 +9,7 @@ ifneq ($(clang),y)
SUBDIRS-$(CONFIG_X86) += x86_emulator
endif
SUBDIRS-y += xenstore
+SUBDIRS-y += pdx
SUBDIRS-y += rangeset
SUBDIRS-y += vpci
SUBDIRS-y += paging-mempool
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
new file mode 100644
index 000000000000..a32c7db4de79
--- /dev/null
+++ b/tools/tests/pdx/.gitignore
@@ -0,0 +1,2 @@
+/pdx.h
+/test-pdx-mask
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
new file mode 100644
index 000000000000..b3734afde686
--- /dev/null
+++ b/tools/tests/pdx/Makefile
@@ -0,0 +1,49 @@
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGETS := test-pdx-mask
+
+.PHONY: all
+all: $(TARGETS)
+
+.PHONY: run
+run: $(TARGETS)
+ifeq ($(CC),$(HOSTCC))
+ set -e; \
+ for test in $? ; do \
+ ./$$test ; \
+ done
+else
+ $(warning HOSTCC != CC, will not run test)
+endif
+
+.PHONY: clean
+clean:
+ $(RM) -- *.o $(TARGETS) $(DEPS_RM) pdx.h
+
+.PHONY: distclean
+distclean: clean
+ $(RM) -- *~
+
+.PHONY: install
+install: all
+ $(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests
+ $(INSTALL_PROG) $(TARGETS) $(DESTDIR)$(LIBEXEC)/tests
+
+.PHONY: uninstall
+uninstall:
+ $(RM) -- $(patsubst %,$(DESTDIR)$(LIBEXEC)/tests/%,$(TARGETS))
+
+pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h
+ sed -E -e '/^#[[:space:]]*include/d' <$< >$@
+
+CFLAGS += -D__XEN_TOOLS__
+CFLAGS += $(APPEND_CFLAGS)
+CFLAGS += $(CFLAGS_xeninclude)
+
+test-pdx-mask: CFLAGS += -DCONFIG_PDX_MASK_COMPRESSION
+
+test-pdx-%: test-pdx.c pdx.h
+ $(CC) $(CPPFLAGS) $(CFLAGS) $(CFLAGS_$*.o) -o $@ $< $(APPEND_CFLAGS)
+
+-include $(DEPS_INCLUDE)
diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
new file mode 100644
index 000000000000..891b4de0bbb7
--- /dev/null
+++ b/tools/tests/pdx/harness.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#ifndef _TEST_HARNESS_
+#define _TEST_HARNESS_
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <xen-tools/common-macros.h>
+
+#define __init
+#define __initdata
+#define __ro_after_init
+#define cf_check
+
+#define printk printf
+#define XENLOG_INFO
+#define XENLOG_DEBUG
+#define XENLOG_WARNING
+#define KERN_INFO
+
+#define BITS_PER_LONG (sizeof(unsigned long) * 8)
+
+#define PAGE_SHIFT 12
+/* Some libcs define PAGE_SIZE in limits.h. */
+#undef PAGE_SIZE
+#define PAGE_SIZE (1 << PAGE_SHIFT)
+#define MAX_ORDER 18 /* 2 * PAGETABLE_ORDER (9) */
+
+#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
+#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
+
+#define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT)
+#define paddr_to_pfn(pa) ((unsigned long)((pa) >> PAGE_SHIFT))
+
+#define MAX_RANGES 16
+#define MAX_PFN_RANGES MAX_RANGES
+
+#define ASSERT assert
+
+#define CONFIG_DEBUG
+
+static inline unsigned int find_next(
+ const unsigned long *addr, unsigned int size, unsigned int off, bool value)
+{
+ unsigned int i;
+
+ ASSERT(size <= BITS_PER_LONG);
+
+ for ( i = off; i < size; i++ )
+ if ( !!(*addr & (1UL << i)) == value )
+ return i;
+
+ return size;
+}
+
+#define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
+#define find_next_bit(a, s, o) find_next(a, s, o, true)
+
+#define boolean_param(name, func)
+
+#define pdx_to_pfn pdx_to_pfn_xlate
+#define pfn_to_pdx pfn_to_pdx_xlate
+#define maddr_to_directmapoff maddr_to_directmapoff_xlate
+#define directmapoff_to_maddr directmapoff_to_maddr_xlate
+
+typedef uint64_t paddr_t;
+
+#include "pdx.h"
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/tests/pdx/test-pdx.c b/tools/tests/pdx/test-pdx.c
new file mode 100644
index 000000000000..0798ccee359b
--- /dev/null
+++ b/tools/tests/pdx/test-pdx.c
@@ -0,0 +1,267 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#include "harness.h"
+
+#include "../../xen/common/pdx.c"
+
+struct range {
+ /* Ranges are defined as [start, end). */
+ unsigned long start, end;
+};
+
+static void print_ranges(const struct range *r)
+{
+ unsigned int i;
+
+ printf("Ranges:\n");
+
+ for ( i = 0; i < MAX_RANGES; i++ )
+ {
+ if ( !r[i].start && !r[i].end )
+ break;
+
+ printf(" %013lx-%013lx\n", r[i].start, r[i].end);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ static const struct {
+ struct range ranges[MAX_RANGES];
+ bool compress;
+ } tests[] = {
+#ifdef __LP64__
+ /*
+ * Only for targets where unsigned long is 64bits, otherwise compiler
+ * will complain about truncation from 'long long' -> 'long' conversion.
+ *
+ * Real memory map from a 4s Intel GNR. Not compressible using PDX
+ * mask compression.
+ */
+ {
+ .ranges = {
+ { .start = 0, .end = 0x80000UL },
+ { .start = 0x0100000UL, .end = 0x8080000UL },
+ { .start = 0x63e80000UL, .end = 0x6be80000UL },
+ { .start = 0xc7e80000UL, .end = 0xcfe80000UL },
+ { .start = 0x12be80000UL, .end = 0x133e80000UL },
+ },
+ .compress = false,
+ },
+ /* Simple hole. */
+ {
+ .ranges = {
+ { .start = 0,
+ .end = (1UL << MAX_ORDER) * 1 },
+ { .start = (1UL << (MAX_ORDER * 2)) | 0,
+ .end = (1UL << (MAX_ORDER * 2)) | (1UL << MAX_ORDER) * 1 },
+ },
+ .compress = true,
+ },
+ /* Simple hole, unsorted ranges. */
+ {
+ .ranges = {
+ { .start = (1UL << (MAX_ORDER * 2)) | 0,
+ .end = (1UL << (MAX_ORDER * 2)) | (1UL << MAX_ORDER) * 1 },
+ { .start = 0,
+ .end = (1UL << MAX_ORDER) * 1 },
+ },
+ .compress = true,
+ },
+ /* PDX compression, 2 ranges covered by the lower mask. */
+ {
+ .ranges = {
+ { .start = 0,
+ .end = (1 << MAX_ORDER) * 1 },
+ { .start = (1 << MAX_ORDER) * 2,
+ .end = (1 << MAX_ORDER) * 3 },
+ { .start = (1 << MAX_ORDER) * 20,
+ .end = (1 << MAX_ORDER) * 22 },
+ },
+ .compress = true,
+ },
+ /* Single range not starting at 0. */
+ {
+ .ranges = {
+ { .start = (1 << MAX_ORDER) * 10,
+ .end = (1 << MAX_ORDER) * 11 },
+ },
+ .compress = true,
+ },
+ /* Resulting PDX region size leads to no compression. */
+ {
+ .ranges = {
+ { .start = 0,
+ .end = (1 << MAX_ORDER) * 1 },
+ { .start = (1 << MAX_ORDER) * 2,
+ .end = (1 << MAX_ORDER) * 3 },
+ { .start = (1 << MAX_ORDER) * 4,
+ .end = (1 << MAX_ORDER) * 7 },
+ { .start = (1 << MAX_ORDER) * 8,
+ .end = (1 << MAX_ORDER) * 12 },
+ },
+ .compress = false,
+ },
+ /* AMD Versal Gen 2 ARM board. */
+ {
+ .ranges = {
+ { .start = 0, .end = 0x80000UL },
+ { .start = 0x800000UL, .end = 0x880000UL },
+ { .start = 0x50000000UL, .end = 0x50080000UL },
+ { .start = 0x60000000UL, .end = 0x60080000UL },
+ { .start = 0x70000000UL, .end = 0x70080000UL },
+ },
+ .compress = true,
+ },
+ /* Unsorted ranges, lower one not starting at 0. */
+ {
+ .ranges = {
+ { .start = (1UL << (35 - PAGE_SHIFT)) + (1 << MAX_ORDER) * 2,
+ .end = (1UL << (35 - PAGE_SHIFT)) + (1 << MAX_ORDER) * 3 },
+ { .start = (1 << MAX_ORDER) * 2,
+ .end = (1 << MAX_ORDER) * 3 },
+ },
+ .compress = true,
+ },
+ /* Two ranges with the same high bit set. */
+ {
+ .ranges = {
+ { .start = (1UL << (51 - PAGE_SHIFT)) + (1 << MAX_ORDER) * 0,
+ .end = (1UL << (51 - PAGE_SHIFT)) + (1 << MAX_ORDER) * 1 },
+ { .start = (1UL << (51 - PAGE_SHIFT)) + (1 << MAX_ORDER) * 3,
+ .end = (1UL << (51 - PAGE_SHIFT)) + (1 << MAX_ORDER) * 4 },
+ },
+ .compress = true,
+ },
+#endif
+ /* AMD Naples Epyc 7281 2 sockets, 8 NUMA nodes. */
+ {
+ .ranges = {
+ { .start = 0, .end = 0xa0UL },
+ { .start = 0x100UL, .end = 0xb0000UL },
+ { .start = 0x100000UL, .end = 0x430000UL },
+ { .start = 0x430000UL, .end = 0x830000UL },
+ { .start = 0x830000UL, .end = 0xc30000UL },
+ { .start = 0xc30000UL, .end = 0x1030000UL },
+ { .start = 0x1030000UL, .end = 0x1430000UL },
+ { .start = 0x1430000UL, .end = 0x1830000UL },
+ { .start = 0x1830000UL, .end = 0x1c30000UL },
+ { .start = 0x1c30000UL, .end = 0x2030000UL },
+ },
+ .compress = false,
+ },
+ /* 2-node 2GB per-node QEMU layout. */
+ {
+ .ranges = {
+ { .start = 0, .end = 0x80000UL },
+ { .start = 0x100000UL, .end = 0x180000UL },
+ },
+ .compress = true,
+ },
+ /* Not compressible, smaller than MAX_ORDER. */
+ {
+ .ranges = {
+ { .start = 0, .end = 1 },
+ { .start = 0x100UL, .end = 0x101UL },
+ },
+ .compress = false,
+ },
+ /* Compressible, requires adjusting size to (1 << MAX_ORDER). */
+ {
+ .ranges = {
+ { .start = 0, .end = 1 },
+ { .start = 0x100000UL, .end = 0x100001UL },
+ },
+ .compress = true,
+ },
+ /* 2s Intel CLX with contiguous ranges, no compression. */
+ {
+ .ranges = {
+ { .start = 0 , .end = 0x180000UL },
+ { .start = 0x180000UL, .end = 0x3040000UL },
+ },
+ .compress = false,
+ },
+ };
+ int ret_code = EXIT_SUCCESS;
+
+ for ( unsigned int i = 0 ; i < ARRAY_SIZE(tests); i++ )
+ {
+ unsigned int j;
+
+ pfn_pdx_compression_reset();
+
+ for ( j = 0; j < ARRAY_SIZE(tests[i].ranges); j++ )
+ {
+ unsigned long size = tests[i].ranges[j].end -
+ tests[i].ranges[j].start;
+
+ if ( !tests[i].ranges[j].start && !tests[i].ranges[j].end )
+ break;
+
+ pfn_pdx_add_region(tests[i].ranges[j].start << PAGE_SHIFT,
+ size << PAGE_SHIFT);
+ }
+
+ if ( pfn_pdx_compression_setup(0) != tests[i].compress )
+ {
+ printf("PFN compression diverge, expected %scompressible\n",
+ tests[i].compress ? "" : "un");
+ print_ranges(tests[i].ranges);
+
+ ret_code = EXIT_FAILURE;
+ continue;
+ }
+
+ if ( !tests[i].compress )
+ continue;
+
+ for ( j = 0; j < ARRAY_SIZE(tests[i].ranges); j++ )
+ {
+ unsigned long start = tests[i].ranges[j].start;
+ unsigned long end = tests[i].ranges[j].end;
+
+ if ( !start && !end )
+ break;
+
+ if ( !pdx_is_region_compressible(start << PAGE_SHIFT, 1) ||
+ !pdx_is_region_compressible((end - 1) << PAGE_SHIFT, 1) )
+ {
+ printf(
+ "PFN compression invalid, pages %#lx and %#lx should be compressible\n",
+ start, end - 1);
+ print_ranges(tests[i].ranges);
+ ret_code = EXIT_FAILURE;
+ }
+
+ if ( start != pdx_to_pfn(pfn_to_pdx(start)) ||
+ end - 1 != pdx_to_pfn(pfn_to_pdx(end - 1)) )
+ {
+ printf("Compression is not bi-directional:\n");
+ printf(" PFN %#lx -> PDX %#lx -> PFN %#lx\n",
+ start, pfn_to_pdx(start), pdx_to_pfn(pfn_to_pdx(start)));
+ printf(" PFN %#lx -> PDX %#lx -> PFN %#lx\n",
+ end - 1, pfn_to_pdx(end - 1),
+ pdx_to_pfn(pfn_to_pdx(end - 1)));
+ print_ranges(tests[i].ranges);
+ ret_code = EXIT_FAILURE;
+ }
+ }
+ }
+
+ return ret_code;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index f4ee2198841d..e58002e59db4 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -15,6 +15,8 @@
* along with this program; If not, see <http://www.gnu.org/licenses/>.
*/
+/* Trim content when built for the test harness. */
+#ifdef __XEN__
#include <xen/init.h>
#include <xen/mm.h>
#include <xen/bitops.h>
@@ -57,6 +59,8 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
__set_bit(idx, pdx_group_valid);
}
+#endif /* __XEN__ */
+
#ifndef CONFIG_PDX_NONE
#ifdef CONFIG_X86
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH v3 6/8] pdx: move some helpers in preparation for new compression
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
` (4 preceding siblings ...)
2025-07-24 11:04 ` [PATCH v3 5/8] test/pdx: add PDX compression unit tests Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
2025-07-24 11:04 ` [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space Roger Pau Monne
7 siblings, 0 replies; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
Jan Beulich, Julien Grall, Stefano Stabellini
Move fill_mask(), pdx_region_mask() and pdx_init_mask() to the
!CONFIG_PDX_NONE section in preparation of them also being used by a newly
added PDX compression.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
git is not very helpful when generating the diff here, and it ends up
moving everything around the functions instead of the functions themselves.
---
xen/common/pdx.c | 118 +++++++++++++++++++++++------------------------
1 file changed, 59 insertions(+), 59 deletions(-)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index e58002e59db4..cfe2a3fd344b 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -101,59 +101,6 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
ranges[nr_ranges++].size = PFN_UP(base + size) - PFN_DOWN(base);
}
-#endif /* !CONFIG_PDX_NONE */
-
-#ifdef CONFIG_PDX_MASK_COMPRESSION
-
-/*
- * Diagram to make sense of the following variables. The masks and shifts
- * are done on mfn values in order to convert to/from pdx:
- *
- * pfn_hole_mask
- * pfn_pdx_hole_shift (mask bitsize)
- * |
- * |---------|
- * | |
- * V V
- * --------------------------
- * |HHHHHHH|000000000|LLLLLL| <--- mfn
- * --------------------------
- * ^ ^ ^ ^
- * | | |------|
- * | | |
- * | | pfn_pdx_bottom_mask
- * | |
- * |-------|
- * |
- * pfn_top_mask
- *
- * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
- * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
- * ones.
- */
-
-/** Mask for the lower non-compressible bits of an mfn */
-unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL;
-
-/** Mask for the lower non-compressible bits of an maddr or vaddr */
-unsigned long __ro_after_init ma_va_bottom_mask = ~0UL;
-
-/** Mask for the higher non-compressible bits of an mfn */
-unsigned long __ro_after_init pfn_top_mask = 0;
-
-/** Mask for the higher non-compressible bits of an maddr or vaddr */
-unsigned long __ro_after_init ma_top_mask = 0;
-
-/**
- * Mask for a pdx compression bit slice.
- *
- * Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
- */
-unsigned long __ro_after_init pfn_hole_mask = 0;
-
-/** Number of bits of the "compressible" bit slice of an mfn */
-unsigned int __ro_after_init pfn_pdx_hole_shift = 0;
-
/* Sets all bits from the most-significant 1-bit down to the LSB */
static uint64_t fill_mask(uint64_t mask)
{
@@ -196,12 +143,6 @@ static uint64_t pdx_region_mask(uint64_t base, uint64_t len)
return fill_mask(base ^ (base + len - 1));
}
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
-{
- return !(paddr_to_pfn(base) & pfn_hole_mask) &&
- !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
-}
-
/**
* Creates the mask to start from when calculating non-compressible bits
*
@@ -219,6 +160,65 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr)
(uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
}
+#endif /* !CONFIG_PDX_NONE */
+
+#ifdef CONFIG_PDX_MASK_COMPRESSION
+
+/*
+ * Diagram to make sense of the following variables. The masks and shifts
+ * are done on mfn values in order to convert to/from pdx:
+ *
+ * pfn_hole_mask
+ * pfn_pdx_hole_shift (mask bitsize)
+ * |
+ * |---------|
+ * | |
+ * V V
+ * --------------------------
+ * |HHHHHHH|000000000|LLLLLL| <--- mfn
+ * --------------------------
+ * ^ ^ ^ ^
+ * | | |------|
+ * | | |
+ * | | pfn_pdx_bottom_mask
+ * | |
+ * |-------|
+ * |
+ * pfn_top_mask
+ *
+ * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
+ * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
+ * ones.
+ */
+
+/** Mask for the lower non-compressible bits of an mfn */
+unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL;
+
+/** Mask for the lower non-compressible bits of an maddr or vaddr */
+unsigned long __ro_after_init ma_va_bottom_mask = ~0UL;
+
+/** Mask for the higher non-compressible bits of an mfn */
+unsigned long __ro_after_init pfn_top_mask = 0;
+
+/** Mask for the higher non-compressible bits of an maddr or vaddr */
+unsigned long __ro_after_init ma_top_mask = 0;
+
+/**
+ * Mask for a pdx compression bit slice.
+ *
+ * Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
+ */
+unsigned long __ro_after_init pfn_hole_mask = 0;
+
+/** Number of bits of the "compressible" bit slice of an mfn */
+unsigned int __ro_after_init pfn_pdx_hole_shift = 0;
+
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+ return !(paddr_to_pfn(base) & pfn_hole_mask) &&
+ !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
+}
+
bool __init pfn_pdx_compression_setup(paddr_t base)
{
unsigned int i, j, bottom_shift = 0, hole_shift = 0;
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
` (5 preceding siblings ...)
2025-07-24 11:04 ` [PATCH v3 6/8] pdx: move some helpers in preparation for new compression Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-29 14:16 ` Jan Beulich
2025-07-24 11:04 ` [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space Roger Pau Monne
7 siblings, 1 reply; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Oleksii Kurochko, Community Manager,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Stefano Stabellini
With the appearance of Intel Sierra Forest and Granite Rapids it's now
possible to get a production x86 host with the following memory map:
SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
SRAT: Node 0 PXM 0 [0000000100000000, 000000807fffffff]
SRAT: Node 1 PXM 1 [0000063e80000000, 000006be7fffffff]
SRAT: Node 2 PXM 2 [00000c7e80000000, 00000cfe7fffffff]
SRAT: Node 3 PXM 3 [000012be80000000, 0000133e7fffffff]
This is from a four socket Granite Rapids system, with each node having
512GB of memory. The total amount of RAM on the system is 2TB, but without
enabling CONFIG_BIGMEM the last range is not accessible, as it's above the
16TB boundary covered by the frame table. Sierra Forest and Granite Rapids
are socket compatible, however Sierra Forest only supports 2 socket
configurations, while Granite Rapids can go up to 8 sockets.
Note that while the memory map is very sparse, it couldn't be compressed
using the current PDX_MASK compression algorithm, which relies on all
ranges having a shared zeroed region of bits that can be removed.
The memory map presented above has the property of all regions being
similarly spaced between each other, and all having also a similar size.
Use a lookup table to store the offsets to translate from/to PFN and PDX
spaces. Such table is indexed based on the input PFN or PDX to translated.
The example PFN layout about would get compressed using the following:
PFN compression using PFN lookup table shift 29 and PDX region size 0x10000000
range 0 [0000000000000, 0x0000807ffff] PFN IDX 0 : 0000000000000
range 1 [0x00063e80000, 0x0006be7ffff] PFN IDX 3 : 0x00053e80000
range 2 [0x000c7e80000, 0x000cfe7ffff] PFN IDX 6 : 0x000a7e80000
range 3 [0x0012be80000, 0x00133e7ffff] PFN IDX 9 : 0x000fbe80000
Note how the tow ranges belonging to node 0 get merged into a single PDX
region by the compression algorithm.
The default size of lookup tables currently set in Kconfig is 64 entries,
and the example memory map consumes 10 entries. Such memory map is from a
4 socket Granite Rapids host, which in theory supports up to 8 sockets
according to Intel documentation. Assuming the layout of a 8 socket system
is similar to the 4 socket one, it would require 21 lookup table entries to
support it, way below the current default of 64 entries.
The valid range of lookup table size is currently restricted from 1 to 512
elements in Kconfig.
An extra array is used to keep track of the base PFN for each translated
range. Non used slots are set to ~0UL, so that in mfn_valid() the mfn <
base check always fails, thus reporting the mfn as invalid.
Introduce __init_or_pdx_mask and use it on some shared functions between
PDX mask and offset compression, as otherwise some code becomes unreachable
after boot if PDX offset compression is used. Mark the code as __init in
that case, so it's pruned after boot.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
- s/PDX_OFFSET_TLB_ORDER/PDX_OFFSET_TBL_ORDER/.
- Fix changelog mention of Sapphire Rapids.
- Misc fixes in the test harness.
- Use SWAP() macro.
- Introduce an extra table to keep the bases of the valid ranges.
Changes since v1:
- Use a lookup table with the offsets.
- Split the adding of the test to a pre-patch.
- Amend diagram to also show possible padding after compression.
---
CHANGELOG.md | 3 +
tools/tests/pdx/.gitignore | 1 +
tools/tests/pdx/Makefile | 3 +-
tools/tests/pdx/harness.h | 14 ++
tools/tests/pdx/test-pdx.c | 4 +
xen/common/Kconfig | 21 ++-
xen/common/pdx.c | 255 ++++++++++++++++++++++++++++++++++++-
xen/include/xen/pdx.h | 87 ++++++++++++-
8 files changed, 382 insertions(+), 6 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5f31ca08fe3f..f9ef893f4851 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
grant table or foreign memory.
### Added
+ - Introduce new PDX compression algorithm to cope with Intel Sierra Forest and
+ Granite Rapids having sparse memory maps.
+
- On x86:
- Option to attempt to fixup p2m page-faults on PVH dom0.
- Resizable BARs is supported for PVH dom0.
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
index a32c7db4de79..1202a531a7fd 100644
--- a/tools/tests/pdx/.gitignore
+++ b/tools/tests/pdx/.gitignore
@@ -1,2 +1,3 @@
/pdx.h
/test-pdx-mask
+/test-pdx-offset
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
index b3734afde686..10b354f0cefd 100644
--- a/tools/tests/pdx/Makefile
+++ b/tools/tests/pdx/Makefile
@@ -1,7 +1,7 @@
XEN_ROOT=$(CURDIR)/../../..
include $(XEN_ROOT)/tools/Rules.mk
-TARGETS := test-pdx-mask
+TARGETS := test-pdx-mask test-pdx-offset
.PHONY: all
all: $(TARGETS)
@@ -42,6 +42,7 @@ CFLAGS += $(APPEND_CFLAGS)
CFLAGS += $(CFLAGS_xeninclude)
test-pdx-mask: CFLAGS += -DCONFIG_PDX_MASK_COMPRESSION
+test-pdx-offset: CFLAGS += -DCONFIG_PDX_OFFSET_COMPRESSION
test-pdx-%: test-pdx.c pdx.h
$(CC) $(CPPFLAGS) $(CFLAGS) $(CFLAGS_$*.o) -o $@ $< $(APPEND_CFLAGS)
diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
index 891b4de0bbb7..18935b283f71 100644
--- a/tools/tests/pdx/harness.h
+++ b/tools/tests/pdx/harness.h
@@ -44,8 +44,10 @@
#define MAX_RANGES 16
#define MAX_PFN_RANGES MAX_RANGES
+#define CONFIG_PDX_OFFSET_TBL_ORDER 6
#define ASSERT assert
+#define ASSERT_UNREACHABLE() assert(0)
#define CONFIG_DEBUG
@@ -66,6 +68,9 @@ static inline unsigned int find_next(
#define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
#define find_next_bit(a, s, o) find_next(a, s, o, true)
+#define flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)
+#define ffsl(x) __builtin_ffsl(x)
+
#define boolean_param(name, func)
#define pdx_to_pfn pdx_to_pfn_xlate
@@ -75,6 +80,15 @@ static inline unsigned int find_next(
typedef uint64_t paddr_t;
+#define SWAP(a, b) \
+ do { typeof(a) t_ = (a); (a) = (b); (b) = t_; } while ( 0 )
+
+#define sort(elem, nr, size, cmp, swp) ({ \
+ /* Consume swp() so compiler doesn't complain it's unused. */ \
+ (void)swp; \
+ qsort(elem, nr, size, cmp); \
+})
+
#include "pdx.h"
#endif
diff --git a/tools/tests/pdx/test-pdx.c b/tools/tests/pdx/test-pdx.c
index 0798ccee359b..eefd54c76815 100644
--- a/tools/tests/pdx/test-pdx.c
+++ b/tools/tests/pdx/test-pdx.c
@@ -51,7 +51,11 @@ int main(int argc, char **argv)
{ .start = 0xc7e80000UL, .end = 0xcfe80000UL },
{ .start = 0x12be80000UL, .end = 0x133e80000UL },
},
+#ifdef CONFIG_PDX_OFFSET_COMPRESSION
+ .compress = true,
+#else
.compress = false,
+#endif
},
/* Simple hole. */
{
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 0014f9b3d6d7..1bfd1e49c8b9 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -59,7 +59,8 @@ config EVTCHN_FIFO
choice
prompt "PDX (Page inDeX) compression"
- default PDX_MASK_COMPRESSION if !X86 && !RISCV
+ default PDX_OFFSET_COMPRESSION if X86
+ default PDX_MASK_COMPRESSION if !RISCV
default PDX_NONE
help
PDX compression is a technique designed to reduce the memory
@@ -78,12 +79,30 @@ config PDX_MASK_COMPRESSION
help
Compression relying on all RAM addresses sharing a zeroed bit region.
+config PDX_OFFSET_COMPRESSION
+ bool "Offset compression"
+ help
+ Compression relying on size and distance between RAM regions being
+ compressible using an offset lookup table.
+
config PDX_NONE
bool "None"
help
No compression
endchoice
+config PDX_OFFSET_TBL_ORDER
+ int "PDX offset compression lookup table order" if EXPERT
+ depends on PDX_OFFSET_COMPRESSION
+ default 6
+ range 0 9
+ help
+ Order of the PFN to PDX and PDX to PFN translation lookup tables.
+ Number of table entries is calculated as 2^N.
+
+ Size of the tables can be adjusted from 1 entry (order 0) to 512
+ entries (order 9).
+
config ALTERNATIVE_CALL
bool
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index cfe2a3fd344b..34ce48170122 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -24,6 +24,7 @@
#include <xen/param.h>
#include <xen/pfn.h>
#include <xen/sections.h>
+#include <xen/sort.h>
/**
* Maximum (non-inclusive) usable pdx. Must be
@@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn)
#ifdef CONFIG_PDX_MASK_COMPRESSION
invalid |= mfn & pfn_hole_mask;
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
+{
+ unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)];
+
+ invalid |= mfn < base || mfn >= base + pdx_region_size;
+}
#endif
if ( unlikely(evaluate_nospec(invalid)) )
@@ -75,6 +82,13 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
# error "Missing architecture maximum number of RAM ranges"
#endif
+/* Some functions should be init when not using PDX mask compression. */
+#ifndef CONFIG_PDX_MASK_COMPRESSION
+# define __init_or_pdx_mask __init
+#else
+# define __init_or_pdx_mask
+#endif
+
/* Generic PFN compression helpers. */
static struct pfn_range {
unsigned long base, size;
@@ -102,7 +116,7 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
}
/* Sets all bits from the most-significant 1-bit down to the LSB */
-static uint64_t fill_mask(uint64_t mask)
+static uint64_t __init_or_pdx_mask fill_mask(uint64_t mask)
{
while (mask & (mask + 1))
mask |= mask + 1;
@@ -128,7 +142,7 @@ static uint64_t fill_mask(uint64_t mask)
* @param len Size in octets of the region
* @return Mask of moving bits at the bottom of all the region addresses
*/
-static uint64_t pdx_region_mask(uint64_t base, uint64_t len)
+static uint64_t __init_or_pdx_mask pdx_region_mask(uint64_t base, uint64_t len)
{
/*
* We say a bit "moves" in a range if there exist 2 addresses in that
@@ -294,7 +308,242 @@ void __init pfn_pdx_compression_reset(void)
nr_ranges = 0;
}
-#endif /* CONFIG_PDX_COMPRESSION */
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
+
+unsigned int __ro_after_init pfn_index_shift;
+unsigned int __ro_after_init pdx_index_shift;
+
+unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP];
+unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP];
+unsigned long __ro_after_init pfn_bases[CONFIG_PDX_NR_LOOKUP];
+unsigned long __ro_after_init pdx_region_size = ~0UL;
+
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+ unsigned long pfn = PFN_DOWN(base);
+ unsigned long pfn_base = pfn_bases[PFN_TBL_IDX(pfn)];
+
+ return pfn >= pfn_base &&
+ pfn + npages <= pfn_base + pdx_region_size;
+}
+
+static int __init cf_check cmp_node(const void *a, const void *b)
+{
+ const struct pfn_range *l = a;
+ const struct pfn_range *r = b;
+
+ if ( l->base > r->base )
+ return 1;
+ if ( l->base < r->base )
+ return -1;
+
+ return 0;
+}
+
+static void __init cf_check swp_node(void *a, void *b, size_t size)
+{
+ SWAP(a, b);
+}
+
+static bool __init pfn_offset_sanitize_ranges(void)
+{
+ unsigned int i = 0;
+
+ if ( nr_ranges == 1 )
+ {
+ ASSERT(PFN_TBL_IDX(ranges[0].base) ==
+ PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
+ return true;
+ }
+
+ while ( i + 1 < nr_ranges )
+ {
+ /*
+ * Ensure ranges [start, end] use the same offset table index. Should
+ * be guaranteed by the logic that calculates the pfn shift.
+ */
+ if ( PFN_TBL_IDX(ranges[i].base) !=
+ PFN_TBL_IDX(ranges[i].base + ranges[i].size - 1) ||
+ PFN_TBL_IDX(ranges[i + 1].base) !=
+ PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
+ {
+ ASSERT_UNREACHABLE();
+ return false;
+ }
+
+ if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
+ {
+ i++;
+ continue;
+ }
+
+ /* Merge ranges with the same table index. */
+ ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
+ ranges[i].base;
+ memmove(&ranges[i + 1], &ranges[i + 2],
+ (nr_ranges - (i + 2)) * sizeof(ranges[0]));
+ nr_ranges--;
+ }
+
+ return true;
+}
+
+bool __init pfn_pdx_compression_setup(paddr_t base)
+{
+ unsigned long mask = PFN_DOWN(pdx_init_mask(base)), idx_mask = 0;
+ unsigned long size = 0;
+ unsigned int i;
+
+ if ( !nr_ranges )
+ {
+ printk(XENLOG_DEBUG "PFN compression disabled%s\n",
+ pdx_compress ? ": no ranges provided" : "");
+ return false;
+ }
+
+ if ( nr_ranges > ARRAY_SIZE(ranges) )
+ {
+ printk(XENLOG_WARNING
+ "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
+ nr_ranges, ARRAY_SIZE(ranges));
+ return false;
+ }
+
+ /* Sort ranges by start address. */
+ sort(ranges, nr_ranges, sizeof(*ranges), cmp_node, swp_node);
+
+ for ( i = 0; i < nr_ranges; i++ )
+ {
+ unsigned long start = ranges[i].base;
+
+ /*
+ * Align range base to MAX_ORDER. This is required so the PDX offset
+ * for the bits below MAX_ORDER matches the MFN offset, and pages
+ * greater than the minimal order can be used to populate the
+ * directmap.
+ */
+ ranges[i].base = ranges[i].base & ~((1UL << MAX_ORDER) - 1);
+ ranges[i].size = start + ranges[i].size - ranges[i].base;
+
+ /*
+ * Only merge overlapped regions now, leave adjacent regions separated.
+ * They would be merged later if both use the same index into the
+ * lookup table.
+ */
+ if ( !i || ranges[i].base >= (ranges[i - 1].base + ranges[i - 1].size) )
+ {
+ /*
+ * We might parse the region at position 0 more than once, as for
+ * index 0 we don't attempt to merge to keep this simple.
+ */
+ mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
+ continue;
+ }
+
+ ranges[i - 1].size = ranges[i].base + ranges[i].size -
+ ranges[i - 1].base;
+
+ if ( i + 1 < nr_ranges )
+ memmove(&ranges[i], &ranges[i + 1],
+ (nr_ranges - (i + 1)) * sizeof(ranges[0]));
+ else /* last range */
+ mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
+ nr_ranges--;
+ i--;
+ }
+
+ /*
+ * Populate a mask with the non-equal bits of the different ranges, do this
+ * to calculate the maximum PFN shift to use as the lookup table index.
+ */
+ for ( i = 0; i < nr_ranges; i++ )
+ for ( unsigned int j = 0; j < nr_ranges; j++ )
+ idx_mask |= (ranges[i].base & ~mask) ^ (ranges[j].base & ~mask);
+
+ if ( !idx_mask )
+ /* Single region case. */
+ pfn_index_shift = flsl(mask);
+ else if ( flsl(idx_mask) - ffsl(idx_mask) < CONFIG_PDX_OFFSET_TBL_ORDER )
+ /* The changed mask fits in the table index width. */
+ pfn_index_shift = ffsl(idx_mask) - 1;
+ else
+ /* Changed mask is wider than array size, use most significant bits. */
+ pfn_index_shift = flsl(idx_mask) - CONFIG_PDX_OFFSET_TBL_ORDER;
+
+ /*
+ * Ensure correctness of the calculated values, plus merge ranges if they
+ * use the same lookup table index.
+ */
+ if ( !pfn_offset_sanitize_ranges() )
+ {
+ printk(XENLOG_DEBUG "PFN compression is invalid, disabling\n");
+ pfn_pdx_compression_reset();
+ return false;
+ }
+
+ /*
+ * Find the maximum PFN range size after having merged ranges with same
+ * index. The rounded up region size will be the base for the PDX region
+ * size and shift.
+ */
+ for ( i = 0; i < nr_ranges; i++ )
+ size = max(size, ranges[i].size);
+
+ /* pdx_init_mask() already takes MAX_ORDER into account. */
+ mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
+ pdx_index_shift = flsl(mask);
+
+ /* Avoid compression if there's no gain. */
+ if ( (mask + 1) * (nr_ranges - 1) >= ranges[nr_ranges - 1].base )
+ {
+ printk(XENLOG_DEBUG
+ "PFN compression yields no space gain, disabling\n");
+ pfn_pdx_compression_reset();
+ return false;
+ }
+
+ /*
+ * Set all entries in the bases table to ~0 to force both mfn_valid() and
+ * pdx_is_region_compressible() to return false for non-handled pfns.
+ */
+ memset(pfn_bases, ~0, sizeof(pfn_bases));
+
+ pdx_region_size = mask + 1;
+
+ printk(XENLOG_INFO
+ "PFN compression using lookup table shift %u and region size %#lx\n",
+ pfn_index_shift, pdx_region_size);
+
+ for ( i = 0; i < nr_ranges; i++ )
+ {
+ unsigned int idx = PFN_TBL_IDX(ranges[i].base);
+
+ pfn_pdx_lookup[idx] = ranges[i].base - (mask + 1) * i;
+ pdx_pfn_lookup[i] = pfn_pdx_lookup[idx];
+ pfn_bases[idx] = ranges[i].base;
+
+ printk(XENLOG_DEBUG
+ " range %3u [%013lx, %013lx] PFN IDX %3u : %013lx\n",
+ i, ranges[i].base, ranges[i].base + ranges[i].size - 1,
+ idx, pfn_pdx_lookup[idx]);
+ }
+
+ return true;
+}
+
+void __init pfn_pdx_compression_reset(void)
+{
+ memset(pfn_pdx_lookup, 0, sizeof(pfn_pdx_lookup));
+ memset(pdx_pfn_lookup, 0, sizeof(pfn_pdx_lookup));
+ memset(pfn_bases, 0, sizeof(pfn_bases));
+ pfn_index_shift = 0;
+ pdx_index_shift = 0;
+ pdx_region_size = ~0UL;
+
+ nr_ranges = 0;
+}
+
+#endif /* CONFIG_PDX_OFFSET_COMPRESSION */
/*
* Local variables:
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 91fc32370f21..57e47e6daa11 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -65,6 +65,46 @@
* This scheme also holds for multiple regions, where HHHHHHH acts as
* the region identifier and LLLLLL fully contains the span of every
* region involved.
+ *
+ * ## PDX offset compression
+ *
+ * Alternative compression mechanism that relies on RAM ranges having a similar
+ * size and offset between them:
+ *
+ * PFN address space:
+ * ┌────────┬──────────┬────────┬──────────┐ ┌────────┬──────────┐
+ * │ RAM 0 │ │ RAM 1 │ │...│ RAM N │ │
+ * ├────────┼──────────┼────────┴──────────┘ └────────┴──────────┘
+ * │<------>│ │
+ * │ size │
+ * │<----------------->│
+ * offset
+ *
+ * The compression reduces the holes between RAM regions:
+ *
+ * PDX address space:
+ * ┌────────┬───┬────────┬───┐ ┌─┬────────┐
+ * │ RAM 0 │ │ RAM 1 │ │...│ │ RAM N │
+ * ├────────┴───┼────────┴───┘ └─┴────────┘
+ * │<---------->│
+ * pdx region size
+ *
+ * The offsets to convert from PFN to PDX and from PDX to PFN are stored in a
+ * pair of lookup tables, and the index into those tables to find the offset
+ * for each PFN or PDX is obtained by shifting the to be translated address by
+ * a specific value calculated at boot:
+ *
+ * pdx = pfn - pfn_lookup_table[pfn >> pfn_shift]
+ * pfn = pdx + pdx_lookup_table[pdx >> pdx_shift]
+ *
+ * Note the indexes into the lookup tables are masked to avoid out of bounds
+ * accesses.
+ *
+ * This compression requires the PFN ranges to contain a non-equal most
+ * significant part that's smaller than the lookup table size, so that a valid
+ * shift value can be found to differentiate between PFN regions. The setup
+ * algorithm might merge otherwise separate PFN ranges to use the same lookup
+ * table entry.
*/
extern unsigned long max_pdx;
@@ -157,7 +197,52 @@ static inline paddr_t directmapoff_to_maddr_xlate(unsigned long offset)
(offset & ma_va_bottom_mask));
}
-#endif /* CONFIG_PDX_MASK_COMPRESSION */
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
+
+#include <xen/page-size.h>
+
+#define CONFIG_PDX_NR_LOOKUP (1UL << CONFIG_PDX_OFFSET_TBL_ORDER)
+#define PDX_TBL_MASK (CONFIG_PDX_NR_LOOKUP - 1)
+
+#define PFN_TBL_IDX(pfn) \
+ (((pfn) >> pfn_index_shift) & PDX_TBL_MASK)
+#define PDX_TBL_IDX(pdx) \
+ (((pdx) >> pdx_index_shift) & PDX_TBL_MASK)
+#define MADDR_TBL_IDX(ma) \
+ (((ma) >> (pfn_index_shift + PAGE_SHIFT)) & PDX_TBL_MASK)
+#define DMAPOFF_TBL_IDX(off) \
+ (((off) >> (pdx_index_shift + PAGE_SHIFT)) & PDX_TBL_MASK)
+
+extern unsigned int pfn_index_shift;
+extern unsigned int pdx_index_shift;
+
+extern unsigned long pfn_pdx_lookup[];
+extern unsigned long pdx_pfn_lookup[];
+extern unsigned long pfn_bases[];
+extern unsigned long pdx_region_size;
+
+static inline unsigned long pfn_to_pdx_xlate(unsigned long pfn)
+{
+ return pfn - pfn_pdx_lookup[PFN_TBL_IDX(pfn)];
+}
+
+static inline unsigned long pdx_to_pfn_xlate(unsigned long pdx)
+{
+ return pdx + pdx_pfn_lookup[PDX_TBL_IDX(pdx)];
+}
+
+static inline unsigned long maddr_to_directmapoff_xlate(paddr_t ma)
+{
+ return ma - ((paddr_t)pfn_pdx_lookup[MADDR_TBL_IDX(ma)] << PAGE_SHIFT);
+}
+
+static inline paddr_t directmapoff_to_maddr_xlate(unsigned long offset)
+{
+ return offset + ((paddr_t)pdx_pfn_lookup[DMAPOFF_TBL_IDX(offset)] <<
+ PAGE_SHIFT);
+}
+
+#endif /* CONFIG_PDX_OFFSET_COMPRESSION */
/*
* Allow each architecture to define it's (possibly optimized) versions of the
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets
2025-07-24 11:04 ` [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
@ 2025-07-29 14:16 ` Jan Beulich
2025-08-04 13:08 ` Roger Pau Monné
0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2025-07-29 14:16 UTC (permalink / raw)
To: Roger Pau Monne
Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
xen-devel
On 24.07.2025 13:04, Roger Pau Monne wrote:
> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -24,6 +24,7 @@
> #include <xen/param.h>
> #include <xen/pfn.h>
> #include <xen/sections.h>
> +#include <xen/sort.h>
>
> /**
> * Maximum (non-inclusive) usable pdx. Must be
> @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn)
>
> #ifdef CONFIG_PDX_MASK_COMPRESSION
> invalid |= mfn & pfn_hole_mask;
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> +{
> + unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)];
> +
> + invalid |= mfn < base || mfn >= base + pdx_region_size;
Leveraging wrapping, this could be simplified to
invalid |= mfn - base >= pdx_region_size;
I think. Considering it's a frequently used path, doing so may be worthwhile.
> @@ -75,6 +82,13 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
> # error "Missing architecture maximum number of RAM ranges"
> #endif
>
> +/* Some functions should be init when not using PDX mask compression. */
> +#ifndef CONFIG_PDX_MASK_COMPRESSION
> +# define __init_or_pdx_mask __init
> +#else
> +# define __init_or_pdx_mask
> +#endif
Considering this is local to the CU, "pdx" in the name isn't very meaningful.
Instead it being a compression type may be, so how about __init_or_mask_compr
or some such? (If we gain further compression types, this may be getting
unwieldy anyway.)
> +static void __init cf_check swp_node(void *a, void *b, size_t size)
> +{
> + SWAP(a, b);
This doesn't look right - you swap a and b, not what they point to.
> +static bool __init pfn_offset_sanitize_ranges(void)
> +{
> + unsigned int i = 0;
> +
> + if ( nr_ranges == 1 )
> + {
> + ASSERT(PFN_TBL_IDX(ranges[0].base) ==
> + PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
I think this points out a naming issue in patch 2: "base" and "size" look
as if these were address / byte count, when they're PFN / page count.
Which caught my eye because of the values being passed to something that
actually wants a PFN (and hence looked wrong at the first glance).
> + return true;
> + }
> +
> + while ( i + 1 < nr_ranges )
> + {
> + /*
> + * Ensure ranges [start, end] use the same offset table index. Should
> + * be guaranteed by the logic that calculates the pfn shift.
> + */
> + if ( PFN_TBL_IDX(ranges[i].base) !=
> + PFN_TBL_IDX(ranges[i].base + ranges[i].size - 1) ||
> + PFN_TBL_IDX(ranges[i + 1].base) !=
> + PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
It feels a little odd to re-do the 2nd half of the checking here on the next
iteration, when the table wouldn't have changed when ...
> + {
> + ASSERT_UNREACHABLE();
> + return false;
> + }
> +
> + if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
> + {
> + i++;
> + continue;
... taking this path. Could I talk you into moving the latter half ...
> + }
... here? If that's needed at all, as ...
> + /* Merge ranges with the same table index. */
> + ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
> + ranges[i].base;
... the new range should also fulfill the requirement. Just that the last
such range then wouldn't be checked, unless ...
> + memmove(&ranges[i + 1], &ranges[i + 2],
> + (nr_ranges - (i + 2)) * sizeof(ranges[0]));
> + nr_ranges--;
> + }
... that checking was put past the loop. Which then would allow to remove
the special casing of nr_ranges == 1 at the top of the function: You'd
uniformly check the ranges[nr_ranges - 1] here.
> + return true;
> +}
> +
> +bool __init pfn_pdx_compression_setup(paddr_t base)
> +{
> + unsigned long mask = PFN_DOWN(pdx_init_mask(base)), idx_mask = 0;
> + unsigned long size = 0;
> + unsigned int i;
> +
> + if ( !nr_ranges )
> + {
> + printk(XENLOG_DEBUG "PFN compression disabled%s\n",
> + pdx_compress ? ": no ranges provided" : "");
> + return false;
> + }
> +
> + if ( nr_ranges > ARRAY_SIZE(ranges) )
> + {
> + printk(XENLOG_WARNING
> + "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
> + nr_ranges, ARRAY_SIZE(ranges));
> + return false;
> + }
> +
> + /* Sort ranges by start address. */
> + sort(ranges, nr_ranges, sizeof(*ranges), cmp_node, swp_node);
> +
> + for ( i = 0; i < nr_ranges; i++ )
> + {
> + unsigned long start = ranges[i].base;
> +
> + /*
> + * Align range base to MAX_ORDER. This is required so the PDX offset
> + * for the bits below MAX_ORDER matches the MFN offset, and pages
> + * greater than the minimal order can be used to populate the
> + * directmap.
> + */
> + ranges[i].base = ranges[i].base & ~((1UL << MAX_ORDER) - 1);
Nit: Use "start" here?
> + ranges[i].size = start + ranges[i].size - ranges[i].base;
> +
> + /*
> + * Only merge overlapped regions now, leave adjacent regions separated.
> + * They would be merged later if both use the same index into the
> + * lookup table.
> + */
> + if ( !i || ranges[i].base >= (ranges[i - 1].base + ranges[i - 1].size) )
> + {
> + /*
> + * We might parse the region at position 0 more than once, as for
> + * index 0 we don't attempt to merge to keep this simple.
> + */
> + mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
> + continue;
In how far is it relevant here that slot 0 may be looked at more than once?
> + }
> +
> + ranges[i - 1].size = ranges[i].base + ranges[i].size -
> + ranges[i - 1].base;
> +
> + if ( i + 1 < nr_ranges )
> + memmove(&ranges[i], &ranges[i + 1],
> + (nr_ranges - (i + 1)) * sizeof(ranges[0]));
> + else /* last range */
> + mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
> + nr_ranges--;
> + i--;
> + }
> +
> + /*
> + * Populate a mask with the non-equal bits of the different ranges, do this
> + * to calculate the maximum PFN shift to use as the lookup table index.
> + */
> + for ( i = 0; i < nr_ranges; i++ )
> + for ( unsigned int j = 0; j < nr_ranges; j++ )
> + idx_mask |= (ranges[i].base & ~mask) ^ (ranges[j].base & ~mask);
> +
> + if ( !idx_mask )
> + /* Single region case. */
> + pfn_index_shift = flsl(mask);
> + else if ( flsl(idx_mask) - ffsl(idx_mask) < CONFIG_PDX_OFFSET_TBL_ORDER )
> + /* The changed mask fits in the table index width. */
> + pfn_index_shift = ffsl(idx_mask) - 1;
> + else
> + /* Changed mask is wider than array size, use most significant bits. */
> + pfn_index_shift = flsl(idx_mask) - CONFIG_PDX_OFFSET_TBL_ORDER;
Perhaps emit a log message here to indicate that bigger PDX_OFFSET_TBL_ORDER
may yield better results?
> + /*
> + * Ensure correctness of the calculated values, plus merge ranges if they
> + * use the same lookup table index.
> + */
> + if ( !pfn_offset_sanitize_ranges() )
> + {
> + printk(XENLOG_DEBUG "PFN compression is invalid, disabling\n");
> + pfn_pdx_compression_reset();
> + return false;
> + }
> +
> + /*
> + * Find the maximum PFN range size after having merged ranges with same
> + * index. The rounded up region size will be the base for the PDX region
> + * size and shift.
> + */
> + for ( i = 0; i < nr_ranges; i++ )
> + size = max(size, ranges[i].size);
> +
> + /* pdx_init_mask() already takes MAX_ORDER into account. */
> + mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
We're in arch-neutral code here, so I think you need to cast size to paddr_t
before shifting.
Jan
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets
2025-07-29 14:16 ` Jan Beulich
@ 2025-08-04 13:08 ` Roger Pau Monné
2025-08-04 13:40 ` Jan Beulich
0 siblings, 1 reply; 19+ messages in thread
From: Roger Pau Monné @ 2025-08-04 13:08 UTC (permalink / raw)
To: Jan Beulich
Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
xen-devel
On Tue, Jul 29, 2025 at 04:16:15PM +0200, Jan Beulich wrote:
> On 24.07.2025 13:04, Roger Pau Monne wrote:
> > --- a/xen/common/pdx.c
> > +++ b/xen/common/pdx.c
> > @@ -24,6 +24,7 @@
> > #include <xen/param.h>
> > #include <xen/pfn.h>
> > #include <xen/sections.h>
> > +#include <xen/sort.h>
> >
> > /**
> > * Maximum (non-inclusive) usable pdx. Must be
> > @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn)
> >
> > #ifdef CONFIG_PDX_MASK_COMPRESSION
> > invalid |= mfn & pfn_hole_mask;
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> > +{
> > + unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)];
> > +
> > + invalid |= mfn < base || mfn >= base + pdx_region_size;
>
> Leveraging wrapping, this could be simplified to
>
> invalid |= mfn - base >= pdx_region_size;
>
> I think. Considering it's a frequently used path, doing so may be worthwhile.
I don't think that would work for all cases, take the following
example:
PFN compression using lookup table shift 18 and region size 0x40000
range 0 [0000000280000, 00000002bffff] PFN IDX 10 : 0000000280000
If you pass mfn 0 to mfn_valid() with your proposed adjustment, the
result of the subtraction would be:
0 - ~0UL == 1
Which wouldn't satisfy the >= condition, and hence pfn 0 would be
reported as a valid mfn. I think we need to keep both sides of the
check.
> > @@ -75,6 +82,13 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
> > # error "Missing architecture maximum number of RAM ranges"
> > #endif
> >
> > +/* Some functions should be init when not using PDX mask compression. */
> > +#ifndef CONFIG_PDX_MASK_COMPRESSION
> > +# define __init_or_pdx_mask __init
> > +#else
> > +# define __init_or_pdx_mask
> > +#endif
>
> Considering this is local to the CU, "pdx" in the name isn't very meaningful.
> Instead it being a compression type may be, so how about __init_or_mask_compr
> or some such? (If we gain further compression types, this may be getting
> unwieldy anyway.)
I'm fine with your proposal. The current naming was derived from the
CONFIG_ define.
> > +static void __init cf_check swp_node(void *a, void *b, size_t size)
> > +{
> > + SWAP(a, b);
>
> This doesn't look right - you swap a and b, not what they point to.
>
> > +static bool __init pfn_offset_sanitize_ranges(void)
> > +{
> > + unsigned int i = 0;
> > +
> > + if ( nr_ranges == 1 )
> > + {
> > + ASSERT(PFN_TBL_IDX(ranges[0].base) ==
> > + PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
>
> I think this points out a naming issue in patch 2: "base" and "size" look
> as if these were address / byte count, when they're PFN / page count.
> Which caught my eye because of the values being passed to something that
> actually wants a PFN (and hence looked wrong at the first glance).
The struct name is pfn_range, I could rename the fields to base_pfn
and pages or similar, but my impression was that the struct name was
enough of a pointer that those are PFN ranges. Do you have any
alternative proposal about how to name those?
> > + return true;
> > + }
> > +
> > + while ( i + 1 < nr_ranges )
> > + {
> > + /*
> > + * Ensure ranges [start, end] use the same offset table index. Should
> > + * be guaranteed by the logic that calculates the pfn shift.
> > + */
> > + if ( PFN_TBL_IDX(ranges[i].base) !=
> > + PFN_TBL_IDX(ranges[i].base + ranges[i].size - 1) ||
> > + PFN_TBL_IDX(ranges[i + 1].base) !=
> > + PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
>
> It feels a little odd to re-do the 2nd half of the checking here on the next
> iteration, when the table wouldn't have changed when ...
>
> > + {
> > + ASSERT_UNREACHABLE();
> > + return false;
> > + }
> > +
> > + if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
> > + {
> > + i++;
> > + continue;
>
> ... taking this path. Could I talk you into moving the latter half ...
>
> > + }
>
> ... here? If that's needed at all, as ...
>
> > + /* Merge ranges with the same table index. */
> > + ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
> > + ranges[i].base;
>
> ... the new range should also fulfill the requirement. Just that the last
> such range then wouldn't be checked, unless ...
>
> > + memmove(&ranges[i + 1], &ranges[i + 2],
> > + (nr_ranges - (i + 2)) * sizeof(ranges[0]));
> > + nr_ranges--;
> > + }
>
> ... that checking was put past the loop. Which then would allow to remove
> the special casing of nr_ranges == 1 at the top of the function: You'd
> uniformly check the ranges[nr_ranges - 1] here.
What about doing it like:
static bool __init pfn_offset_sanitize_ranges(void)
{
unsigned int i = 0;
if ( PFN_TBL_IDX(ranges[0].base) !=
PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1) )
{
ASSERT_UNREACHABLE();
return false;
}
while ( i + 1 < nr_ranges )
{
/*
* Ensure ranges [start, end] use the same offset table index. Should
* be guaranteed by the logic that calculates the pfn shift.
*/
if ( PFN_TBL_IDX(ranges[i + 1].base) !=
PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
{
ASSERT_UNREACHABLE();
return false;
}
if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
{
i++;
continue;
}
/* Merge ranges with the same table index. */
ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
ranges[i].base;
memmove(&ranges[i + 1], &ranges[i + 2],
(nr_ranges - (i + 2)) * sizeof(ranges[0]));
nr_ranges--;
}
return true;
}
I've pulled the index 0 check ahead of the loop, which then covers for
the case where nr_ranges == 1. There's also no duplicate checking of
the ranges, since the range at i + 1 will always be a non-checked one;
either because the array has been shifted as a result of a range
merging, or the index has been advanced.
> > + ranges[i].size = start + ranges[i].size - ranges[i].base;
> > +
> > + /*
> > + * Only merge overlapped regions now, leave adjacent regions separated.
> > + * They would be merged later if both use the same index into the
> > + * lookup table.
> > + */
> > + if ( !i || ranges[i].base >= (ranges[i - 1].base + ranges[i - 1].size) )
> > + {
> > + /*
> > + * We might parse the region at position 0 more than once, as for
> > + * index 0 we don't attempt to merge to keep this simple.
> > + */
> > + mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
> > + continue;
>
> In how far is it relevant here that slot 0 may be looked at more than once?
Not really, just that someone might wonder about the possibly
inefficiency of this. I will remove the comment then.
> > + }
> > +
> > + ranges[i - 1].size = ranges[i].base + ranges[i].size -
> > + ranges[i - 1].base;
> > +
> > + if ( i + 1 < nr_ranges )
> > + memmove(&ranges[i], &ranges[i + 1],
> > + (nr_ranges - (i + 1)) * sizeof(ranges[0]));
> > + else /* last range */
> > + mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
> > + nr_ranges--;
> > + i--;
> > + }
> > +
> > + /*
> > + * Populate a mask with the non-equal bits of the different ranges, do this
> > + * to calculate the maximum PFN shift to use as the lookup table index.
> > + */
> > + for ( i = 0; i < nr_ranges; i++ )
> > + for ( unsigned int j = 0; j < nr_ranges; j++ )
> > + idx_mask |= (ranges[i].base & ~mask) ^ (ranges[j].base & ~mask);
> > +
> > + if ( !idx_mask )
> > + /* Single region case. */
> > + pfn_index_shift = flsl(mask);
> > + else if ( flsl(idx_mask) - ffsl(idx_mask) < CONFIG_PDX_OFFSET_TBL_ORDER )
> > + /* The changed mask fits in the table index width. */
> > + pfn_index_shift = ffsl(idx_mask) - 1;
> > + else
> > + /* Changed mask is wider than array size, use most significant bits. */
> > + pfn_index_shift = flsl(idx_mask) - CONFIG_PDX_OFFSET_TBL_ORDER;
>
> Perhaps emit a log message here to indicate that bigger PDX_OFFSET_TBL_ORDER
> may yield better results?
What about:
printk(XENLOG_DEBUG
"PFN compression table index truncated, requires order %u\n",
flsl(idx_mask) - ffsl(idx_mask) + 1);
> > + /*
> > + * Ensure correctness of the calculated values, plus merge ranges if they
> > + * use the same lookup table index.
> > + */
> > + if ( !pfn_offset_sanitize_ranges() )
> > + {
> > + printk(XENLOG_DEBUG "PFN compression is invalid, disabling\n");
> > + pfn_pdx_compression_reset();
> > + return false;
> > + }
> > +
> > + /*
> > + * Find the maximum PFN range size after having merged ranges with same
> > + * index. The rounded up region size will be the base for the PDX region
> > + * size and shift.
> > + */
> > + for ( i = 0; i < nr_ranges; i++ )
> > + size = max(size, ranges[i].size);
> > +
> > + /* pdx_init_mask() already takes MAX_ORDER into account. */
> > + mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
>
> We're in arch-neutral code here, so I think you need to cast size to paddr_t
> before shifting.
Indeed, size is unsigned long here.
Thanks, Roger.
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets
2025-08-04 13:08 ` Roger Pau Monné
@ 2025-08-04 13:40 ` Jan Beulich
0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2025-08-04 13:40 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
xen-devel
On 04.08.2025 15:08, Roger Pau Monné wrote:
> On Tue, Jul 29, 2025 at 04:16:15PM +0200, Jan Beulich wrote:
>> On 24.07.2025 13:04, Roger Pau Monne wrote:
>>> --- a/xen/common/pdx.c
>>> +++ b/xen/common/pdx.c
>>> @@ -24,6 +24,7 @@
>>> #include <xen/param.h>
>>> #include <xen/pfn.h>
>>> #include <xen/sections.h>
>>> +#include <xen/sort.h>
>>>
>>> /**
>>> * Maximum (non-inclusive) usable pdx. Must be
>>> @@ -40,6 +41,12 @@ bool __mfn_valid(unsigned long mfn)
>>>
>>> #ifdef CONFIG_PDX_MASK_COMPRESSION
>>> invalid |= mfn & pfn_hole_mask;
>>> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
>>> +{
>>> + unsigned long base = pfn_bases[PFN_TBL_IDX(mfn)];
>>> +
>>> + invalid |= mfn < base || mfn >= base + pdx_region_size;
>>
>> Leveraging wrapping, this could be simplified to
>>
>> invalid |= mfn - base >= pdx_region_size;
>>
>> I think. Considering it's a frequently used path, doing so may be worthwhile.
>
> I don't think that would work for all cases, take the following
> example:
>
> PFN compression using lookup table shift 18 and region size 0x40000
> range 0 [0000000280000, 00000002bffff] PFN IDX 10 : 0000000280000
>
> If you pass mfn 0 to mfn_valid() with your proposed adjustment, the
> result of the subtraction would be:
>
> 0 - ~0UL == 1
>
> Which wouldn't satisfy the >= condition, and hence pfn 0 would be
> reported as a valid mfn. I think we need to keep both sides of the
> check.
Hmm, right - I keep forgetting that the start of a pfn_bases[x] isn't necessarily
a valid page itself.
>>> +static void __init cf_check swp_node(void *a, void *b, size_t size)
>>> +{
>>> + SWAP(a, b);
>>
>> This doesn't look right - you swap a and b, not what they point to.
>>
>>> +static bool __init pfn_offset_sanitize_ranges(void)
>>> +{
>>> + unsigned int i = 0;
>>> +
>>> + if ( nr_ranges == 1 )
>>> + {
>>> + ASSERT(PFN_TBL_IDX(ranges[0].base) ==
>>> + PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
>>
>> I think this points out a naming issue in patch 2: "base" and "size" look
>> as if these were address / byte count, when they're PFN / page count.
>> Which caught my eye because of the values being passed to something that
>> actually wants a PFN (and hence looked wrong at the first glance).
>
> The struct name is pfn_range, I could rename the fields to base_pfn
> and pages or similar, but my impression was that the struct name was
> enough of a pointer that those are PFN ranges.
Problem being that the struct name isn't anywhere in sight here.
> Do you have any
> alternative proposal about how to name those?
Your suggested new naming looks good to me.
>>> + return true;
>>> + }
>>> +
>>> + while ( i + 1 < nr_ranges )
>>> + {
>>> + /*
>>> + * Ensure ranges [start, end] use the same offset table index. Should
>>> + * be guaranteed by the logic that calculates the pfn shift.
>>> + */
>>> + if ( PFN_TBL_IDX(ranges[i].base) !=
>>> + PFN_TBL_IDX(ranges[i].base + ranges[i].size - 1) ||
>>> + PFN_TBL_IDX(ranges[i + 1].base) !=
>>> + PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
>>
>> It feels a little odd to re-do the 2nd half of the checking here on the next
>> iteration, when the table wouldn't have changed when ...
>>
>>> + {
>>> + ASSERT_UNREACHABLE();
>>> + return false;
>>> + }
>>> +
>>> + if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
>>> + {
>>> + i++;
>>> + continue;
>>
>> ... taking this path. Could I talk you into moving the latter half ...
>>
>>> + }
>>
>> ... here? If that's needed at all, as ...
>>
>>> + /* Merge ranges with the same table index. */
>>> + ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
>>> + ranges[i].base;
>>
>> ... the new range should also fulfill the requirement. Just that the last
>> such range then wouldn't be checked, unless ...
>>
>>> + memmove(&ranges[i + 1], &ranges[i + 2],
>>> + (nr_ranges - (i + 2)) * sizeof(ranges[0]));
>>> + nr_ranges--;
>>> + }
>>
>> ... that checking was put past the loop. Which then would allow to remove
>> the special casing of nr_ranges == 1 at the top of the function: You'd
>> uniformly check the ranges[nr_ranges - 1] here.
>
> What about doing it like:
>
> static bool __init pfn_offset_sanitize_ranges(void)
> {
> unsigned int i = 0;
>
> if ( PFN_TBL_IDX(ranges[0].base) !=
> PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1) )
> {
> ASSERT_UNREACHABLE();
> return false;
> }
>
> while ( i + 1 < nr_ranges )
> {
> /*
> * Ensure ranges [start, end] use the same offset table index. Should
> * be guaranteed by the logic that calculates the pfn shift.
> */
> if ( PFN_TBL_IDX(ranges[i + 1].base) !=
> PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
> {
> ASSERT_UNREACHABLE();
> return false;
> }
>
> if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
> {
> i++;
> continue;
> }
>
> /* Merge ranges with the same table index. */
> ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
> ranges[i].base;
> memmove(&ranges[i + 1], &ranges[i + 2],
> (nr_ranges - (i + 2)) * sizeof(ranges[0]));
> nr_ranges--;
> }
>
> return true;
> }
>
> I've pulled the index 0 check ahead of the loop, which then covers for
> the case where nr_ranges == 1. There's also no duplicate checking of
> the ranges, since the range at i + 1 will always be a non-checked one;
> either because the array has been shifted as a result of a range
> merging, or the index has been advanced.
Looks good, thanks.
>>> + ranges[i].size = start + ranges[i].size - ranges[i].base;
>>> + ranges[i - 1].size = ranges[i].base + ranges[i].size -
>>> + ranges[i - 1].base;
>>> +
>>> + if ( i + 1 < nr_ranges )
>>> + memmove(&ranges[i], &ranges[i + 1],
>>> + (nr_ranges - (i + 1)) * sizeof(ranges[0]));
>>> + else /* last range */
>>> + mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
>>> + nr_ranges--;
>>> + i--;
>>> + }
>>> +
>>> + /*
>>> + * Populate a mask with the non-equal bits of the different ranges, do this
>>> + * to calculate the maximum PFN shift to use as the lookup table index.
>>> + */
>>> + for ( i = 0; i < nr_ranges; i++ )
>>> + for ( unsigned int j = 0; j < nr_ranges; j++ )
>>> + idx_mask |= (ranges[i].base & ~mask) ^ (ranges[j].base & ~mask);
>>> +
>>> + if ( !idx_mask )
>>> + /* Single region case. */
>>> + pfn_index_shift = flsl(mask);
>>> + else if ( flsl(idx_mask) - ffsl(idx_mask) < CONFIG_PDX_OFFSET_TBL_ORDER )
>>> + /* The changed mask fits in the table index width. */
>>> + pfn_index_shift = ffsl(idx_mask) - 1;
>>> + else
>>> + /* Changed mask is wider than array size, use most significant bits. */
>>> + pfn_index_shift = flsl(idx_mask) - CONFIG_PDX_OFFSET_TBL_ORDER;
>>
>> Perhaps emit a log message here to indicate that bigger PDX_OFFSET_TBL_ORDER
>> may yield better results?
>
> What about:
>
> printk(XENLOG_DEBUG
> "PFN compression table index truncated, requires order %u\n",
> flsl(idx_mask) - ffsl(idx_mask) + 1);
Again, fine with me.
Jan
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
2025-07-24 11:04 [PATCH v3 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
` (6 preceding siblings ...)
2025-07-24 11:04 ` [PATCH v3 7/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
@ 2025-07-24 11:04 ` Roger Pau Monne
2025-07-29 14:33 ` Jan Beulich
7 siblings, 1 reply; 19+ messages in thread
From: Roger Pau Monne @ 2025-07-24 11:04 UTC (permalink / raw)
To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper
There's a loop in arch_init_memory() that iterates over holes and non-RAM
regions to possibly mark any page_info structures matching those addresses
as IO. The looping there is done over the PFN space.
PFNs not covered by the PDX space will always fail the mfn_valid() check,
hence re-write the loop to iterate over the PDX space and avoid checking
any holes that are not covered by the PDX translation.
On a system with a ~6TiB hole this change together with using PDX
compression reduces boot time in approximately 20 seconds. Xen boot time
without the change is ~50s, with the change it's ~30s.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
- New in this version.
---
xen/arch/x86/mm.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7fd56c7ce90..647bf0b41db6 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
void __init arch_init_memory(void)
{
- unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
+ unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
/*
* Basic guest-accessible flags:
@@ -328,9 +328,14 @@ void __init arch_init_memory(void)
destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
(unsigned long)mfn_to_virt(ioend_pfn));
- /* Mark as I/O up to next RAM region. */
- for ( ; pfn < rstart_pfn; pfn++ )
+ /*
+ * Mark as I/O up to next RAM region. Iterate over the PDX space to
+ * skip holes which would always fail the mfn_valid() check.
+ */
+ for ( pdx = pfn_to_pdx(pfn); pdx < pfn_to_pdx(rstart_pfn); pdx++ )
{
+ pfn = pdx_to_pfn(pdx);
+
if ( !mfn_valid(_mfn(pfn)) )
continue;
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
2025-07-24 11:04 ` [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space Roger Pau Monne
@ 2025-07-29 14:33 ` Jan Beulich
2025-07-29 16:54 ` Roger Pau Monné
0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2025-07-29 14:33 UTC (permalink / raw)
To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel
On 24.07.2025 13:04, Roger Pau Monne wrote:
> There's a loop in arch_init_memory() that iterates over holes and non-RAM
> regions to possibly mark any page_info structures matching those addresses
> as IO. The looping there is done over the PFN space.
>
> PFNs not covered by the PDX space will always fail the mfn_valid() check,
> hence re-write the loop to iterate over the PDX space and avoid checking
> any holes that are not covered by the PDX translation.
>
> On a system with a ~6TiB hole this change together with using PDX
> compression reduces boot time in approximately 20 seconds. Xen boot time
> without the change is ~50s, with the change it's ~30s.
That's nice, and I agree what we currently do isn't very efficient, but ...
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
>
> void __init arch_init_memory(void)
> {
> - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
> + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
>
> /*
> * Basic guest-accessible flags:
> @@ -328,9 +328,14 @@ void __init arch_init_memory(void)
> destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
> (unsigned long)mfn_to_virt(ioend_pfn));
>
> - /* Mark as I/O up to next RAM region. */
> - for ( ; pfn < rstart_pfn; pfn++ )
> + /*
> + * Mark as I/O up to next RAM region. Iterate over the PDX space to
> + * skip holes which would always fail the mfn_valid() check.
> + */
> + for ( pdx = pfn_to_pdx(pfn); pdx < pfn_to_pdx(rstart_pfn); pdx++ )
... pfn_to_pdx() isn't well-defined for a non-RAM PFN, or more precisely for any
PFN that fails the mfn_valid() check. That is, I think, particularly noticeable
with the new offset compression you introduce.
Jan
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
2025-07-29 14:33 ` Jan Beulich
@ 2025-07-29 16:54 ` Roger Pau Monné
2025-07-30 5:31 ` Jan Beulich
0 siblings, 1 reply; 19+ messages in thread
From: Roger Pau Monné @ 2025-07-29 16:54 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, xen-devel
On Tue, Jul 29, 2025 at 04:33:53PM +0200, Jan Beulich wrote:
> On 24.07.2025 13:04, Roger Pau Monne wrote:
> > There's a loop in arch_init_memory() that iterates over holes and non-RAM
> > regions to possibly mark any page_info structures matching those addresses
> > as IO. The looping there is done over the PFN space.
> >
> > PFNs not covered by the PDX space will always fail the mfn_valid() check,
> > hence re-write the loop to iterate over the PDX space and avoid checking
> > any holes that are not covered by the PDX translation.
> >
> > On a system with a ~6TiB hole this change together with using PDX
> > compression reduces boot time in approximately 20 seconds. Xen boot time
> > without the change is ~50s, with the change it's ~30s.
>
> That's nice, and I agree what we currently do isn't very efficient, but ...
>
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
> >
> > void __init arch_init_memory(void)
> > {
> > - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
> > + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
> >
> > /*
> > * Basic guest-accessible flags:
> > @@ -328,9 +328,14 @@ void __init arch_init_memory(void)
> > destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
> > (unsigned long)mfn_to_virt(ioend_pfn));
> >
> > - /* Mark as I/O up to next RAM region. */
> > - for ( ; pfn < rstart_pfn; pfn++ )
> > + /*
> > + * Mark as I/O up to next RAM region. Iterate over the PDX space to
> > + * skip holes which would always fail the mfn_valid() check.
> > + */
> > + for ( pdx = pfn_to_pdx(pfn); pdx < pfn_to_pdx(rstart_pfn); pdx++ )
>
> ... pfn_to_pdx() isn't well-defined for a non-RAM PFN, or more precisely for any
> PFN that fails the mfn_valid() check. That is, I think, particularly noticeable
> with the new offset compression you introduce.
rstart_pfn will always point to the start of the next RAM region (or
the end of the current region if it's the last one). So for that case
pfn_to_pdx() is always provided a RAM PFN as input parameter.
However for the pfn parameter, we would need to do pfn_to_pdx(pfn -
1), as that's the last address in the previous RAM range. The loop
would then possibly be:
for ( pdx = pfn_to_pdx((pfn ?: 1) - 1) + 1; pdx < pfn_to_pdx(rstart_pfn); pdx++ )
{
...
This also assumes that PFN 0 will always have a valid PDX translation,
regardless of whether it's RAM or not (which is the case given the PDX
code currently used).
Thanks, Roger.
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v3 8/8] x86/mm: adjust loop in arch_init_memory() to iterate over the PDX space
2025-07-29 16:54 ` Roger Pau Monné
@ 2025-07-30 5:31 ` Jan Beulich
0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2025-07-30 5:31 UTC (permalink / raw)
To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel
On 29.07.2025 18:54, Roger Pau Monné wrote:
> On Tue, Jul 29, 2025 at 04:33:53PM +0200, Jan Beulich wrote:
>> On 24.07.2025 13:04, Roger Pau Monne wrote:
>>> There's a loop in arch_init_memory() that iterates over holes and non-RAM
>>> regions to possibly mark any page_info structures matching those addresses
>>> as IO. The looping there is done over the PFN space.
>>>
>>> PFNs not covered by the PDX space will always fail the mfn_valid() check,
>>> hence re-write the loop to iterate over the PDX space and avoid checking
>>> any holes that are not covered by the PDX translation.
>>>
>>> On a system with a ~6TiB hole this change together with using PDX
>>> compression reduces boot time in approximately 20 seconds. Xen boot time
>>> without the change is ~50s, with the change it's ~30s.
>>
>> That's nice, and I agree what we currently do isn't very efficient, but ...
>>
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -275,7 +275,7 @@ static void __init assign_io_page(struct page_info *page)
>>>
>>> void __init arch_init_memory(void)
>>> {
>>> - unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
>>> + unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn, pdx;
>>>
>>> /*
>>> * Basic guest-accessible flags:
>>> @@ -328,9 +328,14 @@ void __init arch_init_memory(void)
>>> destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
>>> (unsigned long)mfn_to_virt(ioend_pfn));
>>>
>>> - /* Mark as I/O up to next RAM region. */
>>> - for ( ; pfn < rstart_pfn; pfn++ )
>>> + /*
>>> + * Mark as I/O up to next RAM region. Iterate over the PDX space to
>>> + * skip holes which would always fail the mfn_valid() check.
>>> + */
>>> + for ( pdx = pfn_to_pdx(pfn); pdx < pfn_to_pdx(rstart_pfn); pdx++ )
>>
>> ... pfn_to_pdx() isn't well-defined for a non-RAM PFN, or more precisely for any
>> PFN that fails the mfn_valid() check. That is, I think, particularly noticeable
>> with the new offset compression you introduce.
>
> rstart_pfn will always point to the start of the next RAM region (or
> the end of the current region if it's the last one). So for that case
> pfn_to_pdx() is always provided a RAM PFN as input parameter.
>
> However for the pfn parameter, we would need to do pfn_to_pdx(pfn -
> 1), as that's the last address in the previous RAM range. The loop
> would then possibly be:
>
> for ( pdx = pfn_to_pdx((pfn ?: 1) - 1) + 1; pdx < pfn_to_pdx(rstart_pfn); pdx++ )
> {
> ...
>
> This also assumes that PFN 0 will always have a valid PDX translation,
> regardless of whether it's RAM or not (which is the case given the PDX
> code currently used).
Looks good to me. The caveat may then want mentioning in the comment as
well.
Jan
^ permalink raw reply [flat|nested] 19+ messages in thread