[PATCH 0/8] pdx: introduce a new compression algorithm

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/8] pdx: introduce a new compression algorithm
@ 2025-06-11 17:16 Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Oleksii Kurochko, Community Manager

Hello,

This series implements a new PDX compression algorithm to cope with the
spare memory maps found on the Intel Sapphire/Granite Rapids.

Patches 1 to 5 prepare the existing code to make it easier to introduce
a new PDX compression, including generalizing the initialization and
setup functions so they are both usable by the existing and the new
compression.

Patches 6 to 8 introduce the new compression, with patch 6 adding most
of the code.  Such patch also adds a unit test to exercise the logic
easily in user-space.  The new compression is only enabled by default on
x86, other architectures are left with their previous defaults.

Thanks, Roger.

Roger Pau Monne (8):
  x86/pdx: simplify calculation of domain struct allocation boundary
  pdx: introduce function to calculate max PFN based on PDX compression
  kconfig: turn PDX compression into a choice
  pdx: provide a unified set of unit functions
  pdx: allow optimizing PDX conversion helpers
  pdx: introduce a new compression algorithm based on offsets between
    regions
  pdx: introduce translation helpers for offset compression
  pdx: introduce a command line option for offset compression

 CHANGELOG.md                           |   3 +
 docs/misc/xen-command-line.pandoc      |  22 ++
 tools/tests/Makefile                   |   1 +
 tools/tests/pdx/.gitignore             |   3 +
 tools/tests/pdx/Makefile               |  54 ++++
 tools/tests/pdx/harness.h              |  73 +++++
 tools/tests/pdx/test-pdx-offset.c      | 320 +++++++++++++++++++
 xen/arch/arm/setup.c                   |  34 +-
 xen/arch/x86/domain.c                  |  35 +--
 xen/arch/x86/include/asm/cpufeatures.h |   1 +
 xen/arch/x86/setup.c                   |  19 +-
 xen/arch/x86/srat.c                    |  30 +-
 xen/common/Kconfig                     |  25 +-
 xen/common/pdx.c                       | 419 +++++++++++++++++++++++--
 xen/include/xen/pdx.h                  | 216 +++++++++----
 15 files changed, 1092 insertions(+), 163 deletions(-)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx-offset.c

-- 
2.49.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-11 17:58   ` Andrew Cooper
  2025-06-12  9:03   ` Jan Beulich
  2025-06-11 17:16 ` [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression Roger Pau Monne
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

When not using CONFIG_BIGMEM there are some restrictions in the address
width for allocations of the domain structure, as it's PDX truncated to
32bits it's stashed into page_info structure for domain allocated pages.

The current logic to calculate this limit is based on the internals of the
PDX compression used, which is not strictly required.  Instead simplify the
logic to rely on the existing PDX to PFN conversion helpers used elsewhere.

This has the added benefit of allowing alternative PDX compression
algorithms to be implemented without requiring to change the calculation of
the domain structure allocation boundary.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c | 35 ++++++-----------------------------
 1 file changed, 6 insertions(+), 29 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 7536b6c8717e..2f438ce367cf 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -461,30 +461,6 @@ void domain_cpu_policy_changed(struct domain *d)
     }
 }
 
-#if !defined(CONFIG_BIGMEM) && defined(CONFIG_PDX_COMPRESSION)
-/*
- * The hole may be at or above the 44-bit boundary, so we need to determine
- * the total bit count until reaching 32 significant (not squashed out) bits
- * in PFN representations.
- * Note that the way "bits" gets initialized/updated/bounds-checked guarantees
- * that the function will never return zero, and hence will never be called
- * more than once (which is important due to it being deliberately placed in
- * .init.text).
- */
-static unsigned int __init noinline _domain_struct_bits(void)
-{
-    unsigned int bits = 32 + PAGE_SHIFT;
-    unsigned int sig = hweight32(~pfn_hole_mask);
-    unsigned int mask = pfn_hole_mask >> 32;
-
-    for ( ; bits < BITS_PER_LONG && sig < 32; ++bits, mask >>= 1 )
-        if ( !(mask & 1) )
-            ++sig;
-
-    return bits;
-}
-#endif
-
 struct domain *alloc_domain_struct(void)
 {
     struct domain *d;
@@ -498,14 +474,15 @@ struct domain *alloc_domain_struct(void)
      * On systems with CONFIG_BIGMEM there's no packing, and so there's no
      * such restriction.
      */
-#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
-    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
-                                                          32 + PAGE_SHIFT;
+#if defined(CONFIG_BIGMEM)
+    const unsigned int bits = 0;
 #else
-    static unsigned int __read_mostly bits;
+    static unsigned int __ro_after_init bits;
 
     if ( unlikely(!bits) )
-         bits = _domain_struct_bits();
+         bits = flsl(pfn_to_paddr(pdx_to_pfn(
+             1UL << (sizeof(((struct page_info *)NULL)->v.inuse._domain) * 8))))
+             - 1;
 #endif
 
     BUILD_BUG_ON(sizeof(*d) > PAGE_SIZE);
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-12  9:11   ` Jan Beulich
  2025-06-11 17:16 ` [PATCH 3/8] kconfig: turn PDX compression into a choice Roger Pau Monne
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini

This is the code already present and used by x86 in setup_max_pdx(), which
takes into account the current PDX compression, plus the limitation of the
virtual memory layout to return the maximum usable PFN in the system,
possibly truncating the input PFN provided by the caller.

This helper will be used by upcoming PDX related changes that introduce a
new compression algorithm.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/setup.c  | 19 ++-----------------
 xen/common/pdx.c      | 25 +++++++++++++++++++++++++
 xen/include/xen/pdx.h |  8 ++++++++
 3 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 1f5cb67bd0ee..ea670567cbf7 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -721,23 +721,8 @@ static uint64_t __init consider_modules(
 
 static void __init setup_max_pdx(unsigned long top_page)
 {
-    max_pdx = pfn_to_pdx(top_page - 1) + 1;
-
-    if ( max_pdx > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
-        max_pdx = DIRECTMAP_SIZE >> PAGE_SHIFT;
-
-    if ( max_pdx > FRAMETABLE_NR )
-        max_pdx = FRAMETABLE_NR;
-
-    if ( max_pdx > MPT_VIRT_SIZE / sizeof(unsigned long) )
-        max_pdx = MPT_VIRT_SIZE / sizeof(unsigned long);
-
-#ifdef PAGE_LIST_NULL
-    if ( max_pdx >= PAGE_LIST_NULL )
-        max_pdx = PAGE_LIST_NULL - 1;
-#endif
-
-    max_page = pdx_to_pfn(max_pdx - 1) + 1;
+    max_page = get_max_pfn(top_page);
+    max_pdx = pfn_to_pdx(max_page - 1) + 1;
 }
 
 /* A temporary copy of the e820 map that we can mess with during bootstrap. */
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index b8384e6189df..3004c5f28bdd 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -55,6 +55,31 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
         __set_bit(idx, pdx_group_valid);
 }
 
+unsigned long get_max_pfn(unsigned long top_pfn)
+{
+    unsigned long pdx = pfn_to_pdx(top_pfn - 1) + 1;
+
+#ifdef DIRECTMAP_SIZE
+    if ( pdx > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
+        pdx = DIRECTMAP_SIZE >> PAGE_SHIFT;
+#endif
+
+    if ( pdx > FRAMETABLE_NR )
+        pdx = FRAMETABLE_NR;
+
+#ifdef MPT_VIRT_SIZE
+    if ( pdx > MPT_VIRT_SIZE / sizeof(unsigned long) )
+        pdx = MPT_VIRT_SIZE / sizeof(unsigned long);
+#endif
+
+#ifdef PAGE_LIST_NULL
+    if ( pdx >= PAGE_LIST_NULL )
+        pdx = PAGE_LIST_NULL - 1;
+#endif
+
+    return pdx_to_pfn(pdx - 1) + 1;
+}
+
 #ifdef CONFIG_PDX_COMPRESSION
 
 /*
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 9faeea3ac9f2..0f580853cb2e 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -92,6 +92,14 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn);
  */
 bool __mfn_valid(unsigned long mfn);
 
+/**
+ * Get maximum usable PFN given the virtual address space restrictions.
+ *
+ * @param pdx Maximum PFN
+ * @return Possibly truncated maximum PFN
+ */
+unsigned long get_max_pfn(unsigned long top_pfn);
+
 #define page_to_pdx(pg)  ((pg) - frame_table)
 #define pdx_to_page(pdx) gcc11_wrap(frame_table + (pdx))
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/8] kconfig: turn PDX compression into a choice
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-12  8:28   ` Jan Beulich
  2025-06-11 17:16 ` [PATCH 4/8] pdx: provide a unified set of unit functions Roger Pau Monne
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
and make it part of the PDX compression choice block, in preparation for
adding further PDX compression algorithms.

No functional change intended as the PDX compression defaults should still
be the same for all architectures, however the choice block cannot be
protected under EXPERT and still have a default choice being
unconditionally selected.  As a result, the new "PDX (Page inDeX)
compression" item will be unconditionally visible in Kconfig.

As part of this preparation work to introduce new PDX compressions, adjust
some of the comments on pdx.h to note they apply to a specific PDX
compression.  Also shuffle function prototypes and dummy implementations
around to make it easier to introduce a new PDX compression.  Note all
PDX compression implementations are expected to provide a
pdx_is_region_compressible() that takes the same set of arguments.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/common/Kconfig    | 18 +++++++++++++++---
 xen/common/pdx.c      |  4 ++--
 xen/include/xen/pdx.h | 32 +++++++++++++++++++-------------
 3 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index eece1370a3cc..7ffd9d7d9003 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -52,9 +52,10 @@ config EVTCHN_FIFO
 
 	  If unsure, say Y.
 
-config PDX_COMPRESSION
-	bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
-	default ARM || PPC
+choice
+	prompt "PDX (Page inDeX) compression"
+	default PDX_MASK_COMPRESSION if !X86 && !RISCV
+	default PDX_NONE
 	help
 	  PDX compression is a technique designed to reduce the memory
 	  overhead of physical memory management on platforms with sparse RAM
@@ -67,6 +68,17 @@ config PDX_COMPRESSION
 	  If your platform does not have sparse RAM banks, do not enable PDX
 	  compression.
 
+config PDX_MASK_COMPRESSION
+	bool "Mask compression"
+	help
+	  Compression relying on all RAM addresses sharing a zeroed bit region.
+
+config PDX_NONE
+	bool "None"
+	help
+	  No compression
+endchoice
+
 config ALTERNATIVE_CALL
 	bool
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 3004c5f28bdd..4843630bee7f 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -34,7 +34,7 @@ bool __mfn_valid(unsigned long mfn)
 {
     bool invalid = mfn >= max_page;
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
     invalid |= mfn & pfn_hole_mask;
 #endif
 
@@ -80,7 +80,7 @@ unsigned long get_max_pfn(unsigned long top_pfn)
     return pdx_to_pfn(pdx - 1) + 1;
 }
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 
 /*
  * Diagram to make sense of the following variables. The masks and shifts
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 0f580853cb2e..ec0827936c2f 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -25,7 +25,7 @@
  * this by keeping a bitmap of the ranges in the frame table containing
  * invalid entries and not allocating backing memory for them.
  *
- * ## PDX compression
+ * ## PDX mask compression
  *
  * This is a technique to avoid wasting memory on machines known to have
  * split their machine address space in several big discontinuous and highly
@@ -108,22 +108,13 @@ unsigned long get_max_pfn(unsigned long top_pfn);
 
 #define paddr_to_pdx(pa) pfn_to_pdx(paddr_to_pfn(pa))
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 
 extern unsigned long pfn_pdx_bottom_mask, ma_va_bottom_mask;
 extern unsigned int pfn_pdx_hole_shift;
 extern unsigned long pfn_hole_mask;
 extern unsigned long pfn_top_mask, ma_top_mask;
 
-/**
- * Validate a region's compatibility with the current compression runtime
- *
- * @param base Base address of the region
- * @param npages Number of PAGE_SIZE-sized pages in the region
- * @return True iff the region can be used with the current compression
- */
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
-
 /**
  * Calculates a mask covering "moving" bits of all addresses of a region
  *
@@ -216,7 +207,9 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
  */
 void pfn_pdx_hole_setup(unsigned long mask);
 
-#else /* !CONFIG_PDX_COMPRESSION */
+#endif /* CONFIG_PDX_MASK_COMPRESSION */
+
+#ifdef CONFIG_PDX_NONE
 
 /* Without PDX compression we can skip some computations */
 
@@ -248,7 +241,20 @@ static inline void pfn_pdx_hole_setup(unsigned long mask)
 {
 }
 
-#endif /* CONFIG_PDX_COMPRESSION */
+#else /* !CONFIG_PDX_NONE */
+
+/* Shared functions implemented by all PDX compressions. */
+
+/**
+ * Validate a region's compatibility with the current compression runtime
+ *
+ * @param base Base address of the region
+ * @param npages Number of PAGE_SIZE-sized pages in the region
+ * @return True iff the region can be used with the current compression
+ */
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
+
+#endif /* !CONFIG_PDX_NONE */
 #endif /* __XEN_PDX_H__ */
 
 /*
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/8] pdx: provide a unified set of unit functions
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (2 preceding siblings ...)
  2025-06-11 17:16 ` [PATCH 3/8] kconfig: turn PDX compression into a choice Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-12  8:32   ` Jan Beulich
  2025-06-12  8:36   ` Jan Beulich
  2025-06-11 17:16 ` [PATCH 5/8] pdx: allow optimizing PDX conversion helpers Roger Pau Monne
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
	Anthony PERARD, Jan Beulich

The current setup (pdx_init_mask() and pdx_region_mask()) and init
(pfn_pdx_hole_setup()) PDX compression functions are tailored to the
existing PDX compression algorithm.

In preparation for introducing a new compression algorithm convert the
setup and init functions to more generic interfaces that aren't tied to the
compression in-use.  To accomplish this introduce a function that registers
all the PFN RAM ranges, plus an init function.

This has the downside of requiring a static array to store such ranges
ahead of being processed by the setup function, however it's the only way
to pass all the possible information to the different compression setup
functions without using per-compression specific setup functions.
Per-arch compression setup also need to be adjusted to use the new
interface.  There's a slight ordering adjustment, in that after PDX
compression setup the caller will check whether all the RAM regions are
properly covered by the newly setup compression, otherwise compression is
disabled by resetting to the initial values.

No functional change intended in the resulting PDX compression values.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/arm/setup.c  |  34 ++++++-------
 xen/arch/x86/srat.c   |  28 ++++++----
 xen/common/pdx.c      | 116 ++++++++++++++++++++++++++++++++++++------
 xen/include/xen/pdx.h |  73 +++++++++-----------------
 4 files changed, 160 insertions(+), 91 deletions(-)

diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 734e23da4408..93ebfc29635e 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -255,6 +255,10 @@ void __init init_pdx(void)
 {
     const struct membanks *mem = bootinfo_get_mem();
     paddr_t bank_start, bank_size, bank_end;
+    unsigned int bank;
+
+    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
+        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
 
     /*
      * Arm does not have any restrictions on the bits to compress. Pass 0 to
@@ -263,28 +267,24 @@ void __init init_pdx(void)
      * If the logic changes in pfn_pdx_hole_setup we might have to
      * update this function too.
      */
-    uint64_t mask = pdx_init_mask(0x0);
-    int bank;
+    pfn_pdx_compression_setup(0);
 
     for ( bank = 0 ; bank < mem->nr_banks; bank++ )
     {
-        bank_start = mem->bank[bank].start;
-        bank_size = mem->bank[bank].size;
-
-        mask |= bank_start | pdx_region_mask(bank_start, bank_size);
-    }
-
-    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
-    {
-        bank_start = mem->bank[bank].start;
-        bank_size = mem->bank[bank].size;
-
-        if (~mask & pdx_region_mask(bank_start, bank_size))
-            mask = 0;
+        if ( !pdx_is_region_compressible(mem->bank[bank].start,
+                 PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
+                 PFN_DOWN(mem->bank[bank].start)) )
+        {
+            pfn_pdx_compression_reset();
+            printk(XENLOG_WARNING
+                   "PFN compression disabled, RAM region [%#" PRIpaddr ", %#"
+                   PRIpaddr "] not covered\n",
+                   mem->bank[bank].start,
+                   mem->bank[bank].start + mem->bank[bank].size - 1);
+            break;
+        }
     }
 
-    pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
-
     for ( bank = 0 ; bank < mem->nr_banks; bank++ )
     {
         bank_start = mem->bank[bank].start;
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 688f410287d4..7042fd3c3d88 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -261,8 +261,6 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 
 void __init acpi_numa_arch_fixup(void) {}
 
-static uint64_t __initdata srat_region_mask;
-
 static int __init cf_check srat_parse_region(
     struct acpi_subtable_header *header, const unsigned long end)
 {
@@ -282,15 +280,13 @@ static int __init cf_check srat_parse_region(
 		printk(KERN_INFO "SRAT: %013"PRIx64"-%013"PRIx64"\n",
 		       ma->base_address, ma->base_address + ma->length - 1);
 
-	srat_region_mask |= ma->base_address |
-			    pdx_region_mask(ma->base_address, ma->length);
+	pfn_pdx_add_region(ma->base_address, ma->length);
 
 	return 0;
 }
 
 void __init srat_parse_regions(paddr_t addr)
 {
-	u64 mask;
 	unsigned int i;
 
 	if (acpi_disabled || acpi_numa < 0 ||
@@ -299,19 +295,29 @@ void __init srat_parse_regions(paddr_t addr)
 
 	/* Set "PXM" as early as feasible. */
 	numa_fw_nid_name = "PXM";
-	srat_region_mask = pdx_init_mask(addr);
 	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 			      srat_parse_region, 0);
 
-	for (mask = srat_region_mask, i = 0; mask && i < e820.nr_map; i++) {
+	pfn_pdx_compression_setup(addr);
+
+	/* Ensure all ranges in the e820 are covered. */
+	for (i = 0; i < e820.nr_map; i++) {
 		if (e820.map[i].type != E820_RAM)
 			continue;
 
-		if (~mask & pdx_region_mask(e820.map[i].addr, e820.map[i].size))
-			mask = 0;
+		if (!pdx_is_region_compressible(e820.map[i].addr,
+		    PFN_UP(e820.map[i].addr + e820.map[i].size) -
+		    PFN_DOWN(e820.map[i].addr)))
+		{
+			pfn_pdx_compression_reset();
+			printk(XENLOG_WARNING
+			       "PFN compression disabled, RAM region [%#" PRIx64
+			       ", %#" PRIx64 "] not covered\n",
+			       e820.map[i].addr,
+			       e820.map[i].addr + e820.map[i].size - 1);
+			return;
+		}
 	}
-
-	pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
 }
 
 unsigned int numa_node_to_arch_nid(nodeid_t n)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 4843630bee7f..65b337860d52 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -19,6 +19,7 @@
 #include <xen/mm.h>
 #include <xen/bitops.h>
 #include <xen/nospec.h>
+#include <xen/pfn.h>
 #include <xen/sections.h>
 
 /**
@@ -80,6 +81,39 @@ unsigned long get_max_pfn(unsigned long top_pfn)
     return pdx_to_pfn(pdx - 1) + 1;
 }
 
+#ifndef CONFIG_PDX_NONE
+
+#ifdef CONFIG_X86
+# include <asm/e820.h>
+# define MAX_PFN_RANGES E820MAX
+#elif defined(CONFIG_HAS_DEVICE_TREE)
+# include <xen/bootfdt.h>
+# define MAX_PFN_RANGES NR_MEM_BANKS
+#else
+# error "Missing architecture maximum number of RAM ranges"
+#endif
+
+/* Generic PFN compression helpers. */
+static struct pfn_range {
+    unsigned long base, size;
+} ranges[MAX_PFN_RANGES] __initdata;
+static unsigned int __initdata nr;
+
+void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
+{
+    if ( nr >= ARRAY_SIZE(ranges) )
+    {
+        ASSERT((nr + 1) > nr);
+        nr++;
+        return;
+    }
+
+    ranges[nr].base = PFN_DOWN(base);
+    ranges[nr++].size = PFN_UP(base + size) - PFN_DOWN(base);
+}
+
+#endif /* !CONFIG_PDX_NONE */
+
 #ifdef CONFIG_PDX_MASK_COMPRESSION
 
 /*
@@ -140,20 +174,25 @@ static uint64_t fill_mask(uint64_t mask)
     return mask;
 }
 
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
-{
-    return !(paddr_to_pfn(base) & pfn_hole_mask) &&
-           !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
-}
-
-/* We don't want to compress the low MAX_ORDER bits of the addresses. */
-uint64_t __init pdx_init_mask(uint64_t base_addr)
-{
-    return fill_mask(max(base_addr,
-                         (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
-}
-
-uint64_t pdx_region_mask(uint64_t base, uint64_t len)
+/**
+ * Calculates a mask covering "moving" bits of all addresses of a region
+ *
+ * The i-th bit of the mask must be set if there's 2 different addresses
+ * in the region that have different j-th bits. where j >= i.
+ *
+ * e.g:
+ *       base=0x1B00000000
+ *   len+base=0x1B00042000
+ *
+ *   ought to return 0x000007FFFF, which implies that every bit position
+ *   with a zero in the mask remains unchanged in every address of the
+ *   region.
+ *
+ * @param base Base address of the region
+ * @param len  Size in octets of the region
+ * @return Mask of moving bits at the bottom of all the region addresses
+ */
+static uint64_t pdx_region_mask(uint64_t base, uint64_t len)
 {
     /*
      * We say a bit "moves" in a range if there exist 2 addresses in that
@@ -168,9 +207,46 @@ uint64_t pdx_region_mask(uint64_t base, uint64_t len)
     return fill_mask(base ^ (base + len - 1));
 }
 
-void __init pfn_pdx_hole_setup(unsigned long mask)
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+    return !(paddr_to_pfn(base) & pfn_hole_mask) &&
+           !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
+}
+
+/**
+ * Creates the mask to start from when calculating non-compressible bits
+ *
+ * This function is intimately related to pdx_region_mask(), and together
+ * they are meant to calculate the mask of non-compressible bits given the
+ * current memory map.
+ *
+ * @param base_addr Address of the first maddr in the system
+ * @return An integer of the form 2^n - 1
+ */
+static uint64_t __init pdx_init_mask(uint64_t base_addr)
+{
+    return fill_mask(max(base_addr,
+                         /* Don't compress the low MAX_ORDER bits. */
+                         (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
+}
+
+void __init pfn_pdx_compression_setup(paddr_t base)
 {
     unsigned int i, j, bottom_shift = 0, hole_shift = 0;
+    unsigned long mask = pdx_init_mask(base);
+
+    if ( nr > ARRAY_SIZE(ranges) )
+    {
+        printk(XENLOG_WARNING
+               "Too many PFN ranges (%u), not attempting PFN compression\n",
+               nr);
+        return;
+    }
+
+    for ( i = 0; i < nr; i++ )
+        mask |= pfn_to_paddr(ranges[i].base) |
+                pdx_region_mask(pfn_to_paddr(ranges[i].base),
+                                pfn_to_paddr(ranges[i].size));
 
     /*
      * We skip the first MAX_ORDER bits, as we never want to compress them.
@@ -209,6 +285,16 @@ void __init pfn_pdx_hole_setup(unsigned long mask)
     ma_top_mask         = pfn_top_mask << PAGE_SHIFT;
 }
 
+void __init pfn_pdx_compression_reset(void)
+{
+    pfn_pdx_bottom_mask = ~0UL;
+    ma_va_bottom_mask = ~0UL;
+    pfn_top_mask = 0;
+    ma_top_mask = 0;
+    pfn_hole_mask = 0;
+    pfn_pdx_hole_shift = 0;
+}
+
 #endif /* CONFIG_PDX_COMPRESSION */
 
 /*
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index ec0827936c2f..43ce36fcbb56 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -115,38 +115,6 @@ extern unsigned int pfn_pdx_hole_shift;
 extern unsigned long pfn_hole_mask;
 extern unsigned long pfn_top_mask, ma_top_mask;
 
-/**
- * Calculates a mask covering "moving" bits of all addresses of a region
- *
- * The i-th bit of the mask must be set if there's 2 different addresses
- * in the region that have different j-th bits. where j >= i.
- *
- * e.g:
- *       base=0x1B00000000
- *   len+base=0x1B00042000
- *
- *   ought to return 0x000007FFFF, which implies that every bit position
- *   with a zero in the mask remains unchanged in every address of the
- *   region.
- *
- * @param base Base address of the region
- * @param len  Size in octets of the region
- * @return Mask of moving bits at the bottom of all the region addresses
- */
-uint64_t pdx_region_mask(uint64_t base, uint64_t len);
-
-/**
- * Creates the mask to start from when calculating non-compressible bits
- *
- * This function is intimately related to pdx_region_mask(), and together
- * they are meant to calculate the mask of non-compressible bits given the
- * current memory map.
- *
- * @param base_addr Address of the first maddr in the system
- * @return An integer of the form 2^n - 1
- */
-uint64_t pdx_init_mask(uint64_t base_addr);
-
 /**
  * Map pfn to its corresponding pdx
  *
@@ -196,17 +164,6 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
             (offset & ma_va_bottom_mask));
 }
 
-/**
- * Initializes global variables with information about the compressible
- * range of the current memory regions.
- *
- * @param mask This mask is the biggest pdx_mask of every region in the
- *             system ORed with all base addresses of every region in the
- *             system. This results in a mask where every zero in a bit
- *             position marks a potentially compressible bit.
- */
-void pfn_pdx_hole_setup(unsigned long mask);
-
 #endif /* CONFIG_PDX_MASK_COMPRESSION */
 
 #ifdef CONFIG_PDX_NONE
@@ -227,17 +184,15 @@ static inline bool pdx_is_region_compressible(paddr_t base,
     return true;
 }
 
-static inline uint64_t pdx_init_mask(uint64_t base_addr)
+static inline void pfn_pdx_add_region(paddr_t base, paddr_t size)
 {
-    return 0;
 }
 
-static inline uint64_t pdx_region_mask(uint64_t base, uint64_t len)
+static inline void pfn_pdx_compression_setup(paddr_t base)
 {
-    return 0;
 }
 
-static inline void pfn_pdx_hole_setup(unsigned long mask)
+static inline void pfn_pdx_compression_reset(void)
 {
 }
 
@@ -254,6 +209,28 @@ static inline void pfn_pdx_hole_setup(unsigned long mask)
  */
 bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
 
+/**
+ * Register a RAM region with the PFN compression logic.
+ *
+ * @param base Start of the region in bytes.
+ * @param size Length of the region in bytes.
+ */
+void pfn_pdx_add_region(paddr_t base, paddr_t size);
+
+/**
+ * Initializes global variables with information about the compressible
+ * range of the current memory regions.
+ *
+ * @param base address to start compression from.
+ */
+void pfn_pdx_compression_setup(paddr_t base);
+
+/**
+ * Reset the global variables to it's default values, thus disabling PFN
+ * compression.
+ */
+void pfn_pdx_compression_reset(void);
+
 #endif /* !CONFIG_PDX_NONE */
 #endif /* __XEN_PDX_H__ */
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/8] pdx: allow optimizing PDX conversion helpers
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (3 preceding siblings ...)
  2025-06-11 17:16 ` [PATCH 4/8] pdx: provide a unified set of unit functions Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions Roger Pau Monne
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini

There are four performance critical PDX conversion helpers that do the PFN
to/from PDX and the physical addresses to/from directmap offsets
translations.

In the absence of an active PDX compression, those functions would still do
the calculations needed, just to return the same input value as no
translation is in place and hence PFN and PDX spaces are identity mapped.

To reduce the overhead of having to do the pointless calculations allow
architectures to short-circuit the logic by providing a macro.  Implement
such short-circuiting for x86 using asm goto and alternatives.  This
results in the optimized case (when PDX compression is not enabled) doing
an unconditional jump to a label, while the non-optimized case is left
as-is.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/cpufeatures.h |  1 +
 xen/arch/x86/srat.c                    |  6 +++-
 xen/common/pdx.c                       |  8 +++--
 xen/include/xen/pdx.h                  | 46 ++++++++++++++++++++------
 4 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 9e3ed21c026d..85e1a6f0a055 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -43,6 +43,7 @@ XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
 XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
+XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 7042fd3c3d88..96a87bbce35b 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -298,7 +298,8 @@ void __init srat_parse_regions(paddr_t addr)
 	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 			      srat_parse_region, 0);
 
-	pfn_pdx_compression_setup(addr);
+	if (!pfn_pdx_compression_setup(addr))
+		return;
 
 	/* Ensure all ranges in the e820 are covered. */
 	for (i = 0; i < e820.nr_map; i++) {
@@ -318,6 +319,9 @@ void __init srat_parse_regions(paddr_t addr)
 			return;
 		}
 	}
+
+	/* If we got this far compression is working as expected. */
+	setup_force_cpu_cap(X86_FEATURE_PDX_COMPRESSION);
 }
 
 unsigned int numa_node_to_arch_nid(nodeid_t n)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 65b337860d52..7d14100224fe 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -230,7 +230,7 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr)
                          (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
 }
 
-void __init pfn_pdx_compression_setup(paddr_t base)
+bool __init pfn_pdx_compression_setup(paddr_t base)
 {
     unsigned int i, j, bottom_shift = 0, hole_shift = 0;
     unsigned long mask = pdx_init_mask(base);
@@ -240,7 +240,7 @@ void __init pfn_pdx_compression_setup(paddr_t base)
         printk(XENLOG_WARNING
                "Too many PFN ranges (%u), not attempting PFN compression\n",
                nr);
-        return;
+        return false;
     }
 
     for ( i = 0; i < nr; i++ )
@@ -272,7 +272,7 @@ void __init pfn_pdx_compression_setup(paddr_t base)
         }
     }
     if ( !hole_shift )
-        return;
+        return false;
 
     printk(KERN_INFO "PFN compression on bits %u...%u\n",
            bottom_shift, bottom_shift + hole_shift - 1);
@@ -283,6 +283,8 @@ void __init pfn_pdx_compression_setup(paddr_t base)
     pfn_hole_mask       = ((1UL << hole_shift) - 1) << bottom_shift;
     pfn_top_mask        = ~(pfn_pdx_bottom_mask | pfn_hole_mask);
     ma_top_mask         = pfn_top_mask << PAGE_SHIFT;
+
+    return true;
 }
 
 void __init pfn_pdx_compression_reset(void)
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 43ce36fcbb56..6cc0f54cff83 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -67,6 +67,26 @@
  * region involved.
  */
 
+/* Macro defined per-arch to skip PDX logic when there's no compression. */
+#ifdef CONFIG_X86
+# include <asm/alternative.h>
+
+# define OPTIMIZE_PDX(transform, raw)                   \
+    asm_inline goto (                                   \
+        ALTERNATIVE(                                    \
+            "",                                         \
+            "jmp %l[skip]",                             \
+            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))       \
+        : : : : skip );                                 \
+    return (transform);                                 \
+                                                        \
+ skip:                                                  \
+    return (raw)
+#else
+# define OPTIMIZE_PDX(transform, raw)                   \
+     return (transform)
+#endif
+
 extern unsigned long max_pdx;
 
 #define PDX_GROUP_COUNT ((1 << PDX_GROUP_SHIFT) / \
@@ -123,8 +143,9 @@ extern unsigned long pfn_top_mask, ma_top_mask;
  */
 static inline unsigned long pfn_to_pdx(unsigned long pfn)
 {
-    return (pfn & pfn_pdx_bottom_mask) |
-           ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
+    OPTIMIZE_PDX((pfn & pfn_pdx_bottom_mask) |
+                 ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift),
+                 pfn);
 }
 
 /**
@@ -135,8 +156,9 @@ static inline unsigned long pfn_to_pdx(unsigned long pfn)
  */
 static inline unsigned long pdx_to_pfn(unsigned long pdx)
 {
-    return (pdx & pfn_pdx_bottom_mask) |
-           ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
+    OPTIMIZE_PDX((pdx & pfn_pdx_bottom_mask) |
+                 ((pdx << pfn_pdx_hole_shift) & pfn_top_mask),
+                 pdx);
 }
 
 /**
@@ -148,8 +170,9 @@ static inline unsigned long pdx_to_pfn(unsigned long pdx)
  */
 static inline unsigned long maddr_to_directmapoff(paddr_t ma)
 {
-    return (((ma & ma_top_mask) >> pfn_pdx_hole_shift) |
-            (ma & ma_va_bottom_mask));
+    OPTIMIZE_PDX(((ma & ma_top_mask) >> pfn_pdx_hole_shift) |
+                  (ma & ma_va_bottom_mask),
+                  ma);
 }
 
 /**
@@ -160,8 +183,9 @@ static inline unsigned long maddr_to_directmapoff(paddr_t ma)
  */
 static inline paddr_t directmapoff_to_maddr(unsigned long offset)
 {
-    return ((((paddr_t)offset << pfn_pdx_hole_shift) & ma_top_mask) |
-            (offset & ma_va_bottom_mask));
+    OPTIMIZE_PDX(((((paddr_t)offset << pfn_pdx_hole_shift) & ma_top_mask) |
+                 (offset & ma_va_bottom_mask)),
+                 offset);
 }
 
 #endif /* CONFIG_PDX_MASK_COMPRESSION */
@@ -188,8 +212,9 @@ static inline void pfn_pdx_add_region(paddr_t base, paddr_t size)
 {
 }
 
-static inline void pfn_pdx_compression_setup(paddr_t base)
+static inline bool pfn_pdx_compression_setup(paddr_t base)
 {
+    return false;
 }
 
 static inline void pfn_pdx_compression_reset(void)
@@ -222,8 +247,9 @@ void pfn_pdx_add_region(paddr_t base, paddr_t size);
  * range of the current memory regions.
  *
  * @param base address to start compression from.
+ * @return True if PDX compression has been enabled.
  */
-void pfn_pdx_compression_setup(paddr_t base);
+bool pfn_pdx_compression_setup(paddr_t base);
 
 /**
  * Reset the global variables to it's default values, thus disabling PFN
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (4 preceding siblings ...)
  2025-06-11 17:16 ` [PATCH 5/8] pdx: allow optimizing PDX conversion helpers Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-11 19:33   ` Andrew Cooper
                     ` (2 more replies)
  2025-06-11 17:16 ` [PATCH 7/8] pdx: introduce translation helpers for offset compression Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 8/8] pdx: introduce a command line option " Roger Pau Monne
  7 siblings, 3 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Anthony PERARD, Andrew Cooper, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk

With the appearance of Intel Sierra Forest and Granite Rapids it's not
possible to get a production x86 host wit the following memory map:

SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]

This is from a four socket system, with each node having 256GB of memory.
The total amount of RAM on the system is 1TB, but without enabling
CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB
boundary covered by the frame table.

Note that while the memory map is very sparse, it won't be compressible
using the current algorithm that relies on all ranges having a shared
zeroed region of bits that can be removed.

The memory map presented above has the property of all regions being
similarly spaced between each other, and all having also a similar size.
This allows to compress them using the following formula:

 pdx = (pfn % offset) + ((pfn / offset) * size)

Where offset and size are two static coefficients calculated at
initialization.

Obtaining the optimum offset and size coefficients is the complicated part.
In this patch I introduce two different algorithms, a fast one that works
correctly when the offset and size between ranges is mostly equal.  If such
fast algorithm doesn't work, or the resulting compression is not enough to
avoid truncation of the maximum usable page, it's possible to attempt a
brute force approach for calculating the coefficients.  This is also
implemented in this patch as the slow variant.  I've attempted to restrict
the number of iterations in the slow approach so it can exit early if no
better coefficients can be found due to the input constrains (minimum
region size).

The patch here focuses on introducing the logic to calculate the
compression coefficients, plus adding a unit test to exercise the logic
easily from user-space in order to test different layouts and possibly
improve the generation of the coefficients.  The added unit tests only
covers the newly added compression, but could also be extended to cover the
existing PDX mask compression.

Note the translation functions (pfn to pdx, maddr to direct map offset),
are not implemented as part of this patch, an identity set of macros are
added to satisfy the build requirements.  The patch is already long enough
without those.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
We can discuss whether we want both the fast and the slow variants.  The
slow (brute force) was added as a result of me playing with weird region
layouts where the fast one didn't manage to compress, or the resulting
coefficients had a poor compression ratio.  However at this point the
slow variant has only proven helpful in synthetic cases, I haven't (yet?)
seen a real host memory layout that would benefit from it.
---
 tools/tests/Makefile              |   1 +
 tools/tests/pdx/.gitignore        |   3 +
 tools/tests/pdx/Makefile          |  54 ++++++
 tools/tests/pdx/harness.h         |  73 +++++++
 tools/tests/pdx/test-pdx-offset.c | 310 ++++++++++++++++++++++++++++++
 xen/arch/arm/setup.c              |   2 +-
 xen/arch/x86/srat.c               |   2 +-
 xen/common/Kconfig                |   9 +-
 xen/common/pdx.c                  | 232 +++++++++++++++++++++-
 xen/include/xen/pdx.h             |  46 ++++-
 10 files changed, 724 insertions(+), 8 deletions(-)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx-offset.c

diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 36928676a666..97ba2a13894d 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -9,6 +9,7 @@ ifneq ($(clang),y)
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 endif
 SUBDIRS-y += xenstore
+SUBDIRS-y += pdx
 SUBDIRS-y += rangeset
 SUBDIRS-y += vpci
 SUBDIRS-y += paging-mempool
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
new file mode 100644
index 000000000000..21b60480415a
--- /dev/null
+++ b/tools/tests/pdx/.gitignore
@@ -0,0 +1,3 @@
+/pdx-offset.c
+/pdx.h
+/test-pdx-offset
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
new file mode 100644
index 000000000000..141ae6aab56f
--- /dev/null
+++ b/tools/tests/pdx/Makefile
@@ -0,0 +1,54 @@
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGET := test-pdx-offset
+
+.PHONY: all
+all: $(TARGET)
+
+.PHONY: run
+run: $(TARGET)
+ifeq ($(CC),$(HOSTCC))
+	./$<
+else
+	$(warning HOSTCC != CC, will not run test)
+endif
+
+.PHONY: clean
+clean:
+	$(RM) -- *.o $(TARGET) $(DEPS_RM) pdx.c pdx.h
+
+.PHONY: distclean
+distclean: clean
+	$(RM) -- *~
+
+.PHONY: install
+install: all
+	$(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests
+	$(INSTALL_PROG) $(TARGET) $(DESTDIR)$(LIBEXEC)/tests
+
+.PHONY: uninstall
+uninstall:
+	$(RM) -- $(DESTDIR)$(LIBEXEC)/tests/$(TARGET)
+
+pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h
+	sed -E -e '/^#[[:space:]]?include/d' <$< >$@
+
+pdx-offset.c: $(XEN_ROOT)/xen/common/pdx.c
+	# Remove includes/errors and add the test harness header
+	sed -E -e '/#[[:space:]]?include/d' -e '/^#[[:space:]]?error/d' \
+	       -e '1s/^/#include "harness.h"/' <$< >$@
+
+CFLAGS += -D__XEN_TOOLS__
+CFLAGS += $(APPEND_CFLAGS)
+CFLAGS += $(CFLAGS_xeninclude)
+
+LDFLAGS += $(APPEND_LDFLAGS)
+
+test-pdx-offset.o pdx-offset.o: pdx.h
+test-pdx-offset.o pdx-offset.o: CFLAGS += -DCONFIG_PDX_OFFSET_COMPRESSION
+
+test-pdx-offset: pdx-offset.o test-pdx-offset.o
+	$(CC) $^ -o $@ $(LDFLAGS)
+
+-include $(DEPS_INCLUDE)
diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
new file mode 100644
index 000000000000..3d31cf488daf
--- /dev/null
+++ b/tools/tests/pdx/harness.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#ifndef _TEST_HARNESS_
+#define _TEST_HARNESS_
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <strings.h>
+
+#include <xen-tools/common-macros.h>
+
+void pdx_reset(void);
+
+#define __init
+#define __initdata
+#define __ro_after_init
+#define cf_check
+
+#define printk printf
+#define dprintk(lvl, ...) printf(__VA_ARGS__)
+#define XENLOG_INFO
+#define XENLOG_DEBUG
+#define XENLOG_WARNING
+
+#define PAGE_SHIFT    12
+/* Some libcs define PAGE_SIZE in limits.h. */
+#undef  PAGE_SIZE
+#define PAGE_SIZE     (1 << PAGE_SHIFT)
+#define MAX_ORDER     18 /* 2 * PAGETABLE_ORDER (9) */
+
+#define PFN_DOWN(x)   ((x) >> PAGE_SHIFT)
+#define PFN_UP(x)     (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
+
+#define MAX_RANGES 8
+#define MAX_PFN_RANGES MAX_RANGES
+
+#define ASSERT assert
+#define ASSERT_UNREACHABLE() assert(0);
+
+#define STATIC
+typedef uint64_t paddr_t;
+
+bool pfn_offset_calculate_fast(unsigned long base);
+bool pfn_offset_calculate_slow(unsigned long base);
+void pfn_offset_sanitize_ranges(void);
+
+#define sort(elem, nr, size, cmp, swp) {                                \
+    /* Consume swp() so compiler doesn't complain it's unused. */       \
+    swp(&elem[0], &elem[0], size);                                      \
+    qsort(elem, nr, size, cmp);                                         \
+}
+
+#include "pdx.h"
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/tests/pdx/test-pdx-offset.c b/tools/tests/pdx/test-pdx-offset.c
new file mode 100644
index 000000000000..0a561f02d197
--- /dev/null
+++ b/tools/tests/pdx/test-pdx-offset.c
@@ -0,0 +1,310 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX offset compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#include "harness.h"
+
+struct range {
+    /* Ranges are defined as [start, end). */
+    unsigned long start, end;
+};
+
+static void print_ranges(const struct range *r)
+{
+    unsigned int i;
+
+    printf("Ranges:\n");
+
+    for ( i = 0; i < MAX_RANGES; i++ )
+    {
+        if ( !r[i].start && !r[i].end )
+            break;
+
+        printf(" %013lx-%013lx\n", r[i].start, r[i].end);
+    }
+}
+
+int main(int argc, char **argv)
+{
+    static const struct {
+        struct range ranges[MAX_RANGES];
+        struct result {
+            bool compress;
+            unsigned long offset, size;
+        } results[2];
+#define FAST_IDX 0
+#define SLOW_IDX 1
+    } tests[] = {
+#ifdef __LP64__
+        /*
+         * Only for targets where unsigned long is 64bits, otherwise compiler
+         * will complain about truncation from 'long long' -> 'long' conversion.
+         *
+         * Real memory map from a 4s Intel GNR.
+         */
+        {
+            .ranges = {
+                { .start =           0,   .end =   0x8080000UL },
+                { .start =  0x63e80000UL, .end =  0x6be80000UL },
+                { .start =  0xc7e80000UL, .end =  0xcfe80000UL },
+                { .start = 0x12be80000UL, .end = 0x133e80000UL },
+            },
+            .results = {
+                /* Same result with both fast and slow algorithms. */
+                [FAST_IDX ... SLOW_IDX] = {
+                    .compress = true,
+                    .offset = 0x63e80000UL,
+                    .size = 0x8300000UL,
+                },
+            },
+        },
+#endif
+        /* 2-node 2GB per-node QEMU layout. */
+        {
+            .ranges = {
+                { .start =        0,   .end =  0x80000UL },
+                { .start = 0x100000UL, .end = 0x180000UL },
+            },
+            .results = {
+                /* Same result with both fast and slow algorithms. */
+                [FAST_IDX ... SLOW_IDX] = {
+                    .compress = true,
+                    .offset = 0x100000UL,
+                    .size = 0x80000UL,
+                },
+            },
+        },
+        /* Not compressible, offset < size. */
+        {
+            .ranges = {
+                { .start = 0, .end =    1   },
+                { .start = 1, .end = 0x10UL },
+            },
+            .results = {
+                /* Same result with both fast and slow algorithms. */
+                [FAST_IDX ... SLOW_IDX] = {
+                    .compress = false,
+                },
+            },
+        },
+        /* Not compressible, offset < (1 << MAX_ORDER). */
+        {
+            .ranges = {
+                { .start =     0,   .end =     1   },
+                { .start = 0x100UL, .end = 0x101UL },
+            },
+            .results = {
+                /* Same result with both fast and slow algorithms. */
+                [FAST_IDX ... SLOW_IDX] = {
+                    .compress = false,
+                },
+            },
+        },
+        /* Compressible, requires adjusting size to (1 << MAX_ORDER). */
+        {
+            .ranges = {
+                { .start =        0,   .end =        1   },
+                { .start = 0x100000UL, .end = 0x100001UL },
+            },
+            .results = {
+                /* Same result with both fast and slow algorithms. */
+                [FAST_IDX ... SLOW_IDX] = {
+                    .compress = true,
+                    .offset = 0x100000UL,
+                    .size = (1UL << MAX_ORDER),
+                },
+            },
+        },
+        /* 2s Intel CLX with contiguous ranges, no compression. */
+        {
+            .ranges = {
+                { .start =        0  , .end =  0x180000UL },
+                { .start = 0x180000UL, .end = 0x3040000UL },
+            },
+            .results = {
+                /* Same result with both fast and slow algorithms. */
+                [FAST_IDX ... SLOW_IDX] = {
+                    .compress = false,
+                },
+            },
+        },
+        /* Middle range is bigger than first and last */
+        {
+            .ranges = {
+                { .start =         0,   .end =         1   },
+                { .start =  0x100000UL, .end =  0x180000UL },
+                { .start = 0x1000000UL, .end = 0x1000001UL },
+            },
+            .results = {
+                [FAST_IDX] = {
+                    .compress = true,
+                    .offset = 0x100000UL,
+                    .size = 0x80000UL,
+                },
+                [SLOW_IDX] = {
+                    .compress = true,
+                    .offset = 0xf00000UL,
+                    .size = 0x180000UL,
+                },
+            },
+        },
+        /* Test divergence between fast and slow algorithms. */
+        {
+            .ranges = {
+                { .start =                    0,
+                  .end   = (1 << MAX_ORDER) * 1 },
+                { .start = (1 << MAX_ORDER) * 11,
+                  .end   = (1 << MAX_ORDER) * 12 },
+                { .start = (1 << MAX_ORDER) * 20,
+                  .end   = (1 << MAX_ORDER) * 21 },
+            },
+            .results = {
+                [FAST_IDX] = { /* 66% */
+                    .compress = true,
+                    .offset = (1 << MAX_ORDER) * 9,
+                    .size = (1 << MAX_ORDER) * 3,
+                },
+                [SLOW_IDX] = { /* 80% */
+                    .compress = true,
+                    .offset = (1 << MAX_ORDER) * 10,
+                    .size = (1 << MAX_ORDER) * 2,
+                },
+            },
+        },
+        /* Test divergence between fast and slow algorithms. */
+        {
+            .ranges = {
+                { .start =                    0,
+                  .end   = (1 << MAX_ORDER) * 1 },
+                { .start = (1 << MAX_ORDER) * 11,
+                  .end   = (1 << MAX_ORDER) * 12 },
+                { .start = (1 << MAX_ORDER) * 30,
+                  .end   = (1 << MAX_ORDER) * 31 },
+            },
+            .results = {
+                [FAST_IDX] = { /* 18% */
+                    .compress = true,
+                    .offset = (1 << MAX_ORDER) * 11,
+                    .size = (1 << MAX_ORDER) * 9,
+                },
+                [SLOW_IDX] = { /* 80% */
+                    .compress = true,
+                    .offset = (1 << MAX_ORDER) * 10,
+                    .size = (1 << MAX_ORDER) * 2,
+                },
+            },
+        },
+        /* Test incompressible using fast vs compressible using slow. */
+        {
+            .ranges = {
+                { .start =                    0,
+                  .end   = (1 << MAX_ORDER) * 1 },
+                { .start = (1 << MAX_ORDER) * 2,
+                  .end   = (1 << MAX_ORDER) * 3 },
+                { .start = (1 << MAX_ORDER) * 20,
+                  .end   = (1 << MAX_ORDER) * 22 },
+            },
+            .results = {
+                [FAST_IDX] = {
+                    .compress = false,
+                },
+                [SLOW_IDX] = {
+                    .compress = true,
+                    .offset = (1 << MAX_ORDER) * 18,
+                    .size = (1 << MAX_ORDER) * 4,
+                },
+            },
+        },
+    };
+    int ret_code = EXIT_SUCCESS;
+
+    for ( unsigned int i = 0 ; i < ARRAY_SIZE(tests); i++ )
+    {
+        for ( unsigned int use_slow = 0;
+              use_slow < ARRAY_SIZE(tests[i].results); use_slow++ )
+        {
+            const struct result *result = &tests[i].results[use_slow];
+            unsigned int j;
+
+            pfn_pdx_compression_reset();
+
+            for ( j = 0; j < ARRAY_SIZE(tests[i].ranges); j++ )
+            {
+                unsigned long size = tests[i].ranges[j].end -
+                                     tests[i].ranges[j].start;
+
+                if ( !tests[i].ranges[j].start && !tests[i].ranges[j].end )
+                    break;
+
+                pfn_pdx_add_region(tests[i].ranges[j].start << PAGE_SHIFT,
+                                   size << PAGE_SHIFT, j);
+            }
+
+            pfn_offset_sanitize_ranges();
+
+            if ( result->compress != (use_slow ? pfn_offset_calculate_slow(0)
+                                               : pfn_offset_calculate_fast(0)) )
+            {
+                printf("PFN %s compression diverge, expected %scompressible\n",
+                       use_slow ? "slow" : "fast",
+                       result->compress ? "" : "un");
+                print_ranges(tests[i].ranges);
+
+                ret_code = EXIT_FAILURE;
+                continue;
+            }
+
+            if ( !result->compress )
+                continue;
+
+            if ( result->offset != pdx_offset || result->size != pdx_size )
+            {
+                printf("PFN %s compression result diverge, expected:\n",
+                       use_slow ? "slow" : "fast");
+                printf(" offset %013lx size %013lx (%lu%%)\n",
+                       result->offset, result->size,
+                       ((result->offset - result->size) * 100) / result->offset);
+                printf("got:\n offset %013lx size %013lx (%lu%%)\n",
+                       pdx_offset, pdx_size,
+                       ((pdx_offset - pdx_size) * 100) / pdx_offset);
+                print_ranges(tests[i].ranges);
+
+                ret_code = EXIT_FAILURE;
+                continue;
+            }
+
+            for ( j = 0; j < ARRAY_SIZE(tests[i].ranges); j++ )
+            {
+                unsigned long start = tests[i].ranges[j].start;
+                unsigned long end = tests[i].ranges[j].end;
+
+                if ( !start && !end )
+                    break;
+
+                if ( !pdx_is_region_compressible(start << PAGE_SHIFT, 1) ||
+                     !pdx_is_region_compressible((end - 1) << PAGE_SHIFT, 1) )
+                {
+                    printf(
+    "PFN %s compression invalid, pages %#lx and %#lx should be compressible\n",
+                           use_slow ? "slow" : "fast", start, end - 1);
+                    print_ranges(tests[i].ranges);
+                    ret_code = EXIT_FAILURE;
+                }
+            }
+        }
+    }
+
+    return ret_code;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 93ebfc29635e..e71908b99c14 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -258,7 +258,7 @@ void __init init_pdx(void)
     unsigned int bank;
 
     for ( bank = 0 ; bank < mem->nr_banks; bank++ )
-        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
+        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size, bank);
 
     /*
      * Arm does not have any restrictions on the bits to compress. Pass 0 to
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 96a87bbce35b..dffe81e067e9 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -280,7 +280,7 @@ static int __init cf_check srat_parse_region(
 		printk(KERN_INFO "SRAT: %013"PRIx64"-%013"PRIx64"\n",
 		       ma->base_address, ma->base_address + ma->length - 1);
 
-	pfn_pdx_add_region(ma->base_address, ma->length);
+	pfn_pdx_add_region(ma->base_address, ma->length, ma->proximity_domain);
 
 	return 0;
 }
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 7ffd9d7d9003..17afa9fe5f5c 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -54,7 +54,8 @@ config EVTCHN_FIFO
 
 choice
 	prompt "PDX (Page inDeX) compression"
-	default PDX_MASK_COMPRESSION if !X86 && !RISCV
+	default PDX_OFFSET_COMPRESSION if X86
+	default PDX_MASK_COMPRESSION if !RISCV
 	default PDX_NONE
 	help
 	  PDX compression is a technique designed to reduce the memory
@@ -73,6 +74,12 @@ config PDX_MASK_COMPRESSION
 	help
 	  Compression relying on all RAM addresses sharing a zeroed bit region.
 
+config PDX_OFFSET_COMPRESSION
+	bool "Offset compression"
+	help
+	  Compression relying on size and distance between RAM regions being
+	  constant.
+
 config PDX_NONE
 	bool "None"
 	help
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 7d14100224fe..f2cf60bbc3f8 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -21,6 +21,15 @@
 #include <xen/nospec.h>
 #include <xen/pfn.h>
 #include <xen/sections.h>
+#include <xen/sort.h>
+
+#ifdef __XEN__ /* For building the file in user-space. */
+
+/*
+ * Use a define for the static keyword, we want to export some otherwise static
+ * functions for the unit tests.
+ */
+#define STATIC static
 
 /**
  * Maximum (non-inclusive) usable pdx. Must be
@@ -80,6 +89,7 @@ unsigned long get_max_pfn(unsigned long top_pfn)
 
     return pdx_to_pfn(pdx - 1) + 1;
 }
+#endif /* __XEN__ */
 
 #ifndef CONFIG_PDX_NONE
 
@@ -96,10 +106,11 @@ unsigned long get_max_pfn(unsigned long top_pfn)
 /* Generic PFN compression helpers. */
 static struct pfn_range {
     unsigned long base, size;
+    unsigned int id;
 } ranges[MAX_PFN_RANGES] __initdata;
 static unsigned int __initdata nr;
 
-void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
+void __init pfn_pdx_add_region(paddr_t base, paddr_t size, unsigned int id)
 {
     if ( nr >= ARRAY_SIZE(ranges) )
     {
@@ -108,6 +119,7 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
         return;
     }
 
+    ranges[nr].id = id;
     ranges[nr].base = PFN_DOWN(base);
     ranges[nr++].size = PFN_UP(base + size) - PFN_DOWN(base);
 }
@@ -297,7 +309,223 @@ void __init pfn_pdx_compression_reset(void)
     pfn_pdx_hole_shift = 0;
 }
 
-#endif /* CONFIG_PDX_COMPRESSION */
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
+
+unsigned long __ro_after_init pdx_offset = ~0UL;
+unsigned long __ro_after_init pdx_size = ~0UL;
+
+static unsigned long __initdata top_pfn;
+
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+    return !pdx_size ? true
+                     : (PFN_DOWN(base) % pdx_offset) + npages <= pdx_size;
+}
+
+STATIC bool __init pfn_offset_calculate_fast(unsigned long base)
+{
+    unsigned long size = max((1UL << MAX_ORDER), base);
+    unsigned long offset = ~0UL;
+    unsigned int i;
+
+    if ( nr <= 1 )
+        return false;
+
+    /* Calculate minimal offset between regions. */
+    for ( i = 1; i < nr; i++ )
+        offset = min(offset, ranges[i].base - ranges[i - 1].base);
+
+    /* Return early if offset is smaller than the minimum size. */
+    if ( size >= offset )
+        return false;
+
+    /* Calculate size so it covers all regions based on the minimal offset. */
+    for ( i = 0; i < nr; i++ )
+        size = max(size, ranges[i].base % offset + ranges[i].size);
+
+    if ( size >= offset )
+        return false;
+
+    pdx_offset = offset;
+    pdx_size = size;
+
+    return true;
+}
+
+STATIC bool __init pfn_offset_calculate_slow(unsigned long base)
+{
+    unsigned long min_size = max((1UL << MAX_ORDER), base);
+    unsigned long offset, max_offset = 0;
+    unsigned int i, best_ratio = 0;
+
+    if ( nr <= 1 )
+        return false;
+
+    for ( i = 0; i < nr; i++ )
+    {
+        /* Minimal size required to cover the bigger region in the set. */
+        min_size = max(min_size, ranges[i].size);
+        if ( !i )
+            continue;
+
+        max_offset = max(max_offset, ranges[i].base - ranges[i - 1].base);
+    }
+
+    if ( min_size >= max_offset )
+        return false;
+
+    for ( offset = max_offset; offset > min_size; offset-- )
+    {
+        unsigned long size = min_size;
+        unsigned int ratio;
+
+        /*
+         * Terminate loop if it's impossible to get a better ratio given the
+         * decreasing offset and the minimal required region size.
+         */
+        if ( best_ratio >= ((offset - size) * 100) / offset )
+            break;
+
+        for ( i = 0; i < nr; i++ )
+            size = max(size, (ranges[i].base % offset) + ranges[i].size);
+
+        if ( size >= offset )
+            continue;
+
+        ratio = ((offset - size) * 100) / offset;
+        if ( ratio > best_ratio )
+        {
+            best_ratio = ratio;
+            pdx_offset = offset;
+            pdx_size = size;
+        }
+    }
+
+    return best_ratio;
+}
+
+static int __init cf_check cmp_node(const void *a, const void *b)
+{
+    const struct pfn_range *l = a;
+    const struct pfn_range *r = b;
+
+    if ( l->base > r->base )
+        return 1;
+    if ( l->base < r->base )
+        return -1;
+
+    ASSERT_UNREACHABLE();
+    return 0;
+}
+
+static void __init cf_check swp_node(void *a, void *b, size_t size)
+{
+    struct pfn_range *l = a;
+    struct pfn_range *r = b;
+    struct pfn_range tmp = *l;
+
+    *l = *r;
+    *r = tmp;
+}
+
+STATIC void __init pfn_offset_sanitize_ranges(void)
+{
+    unsigned int i = 0;
+
+    /* Sort nodes by start address. */
+    sort(ranges, nr, sizeof(struct pfn_range), cmp_node, swp_node);
+
+    /* Merge ranges if possible. */
+    while ( i + 1 < nr )
+    {
+        if ( ranges[i].id == ranges[i + 1].id )
+        {
+            /* Merge ranges with the same ID. */
+            ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
+                             ranges[i].base;
+        }
+        else if ( ranges[i].base + ranges[i].size == ranges[i + 1].base )
+        {
+            /* Merge ranges if contiguous. */
+            ranges[i].size += ranges[i + 1].size;
+        }
+        else
+        {
+            i++;
+            continue;
+        }
+
+        /* Merge ranges. */
+        memmove(&ranges[i + 1], &ranges[i + 2],
+                (nr - (i + 2)) * sizeof(ranges[0]));
+        nr--;
+    }
+}
+
+#ifdef __XEN__
+bool __init pfn_pdx_compression_setup(paddr_t base)
+{
+    bool use_slow = false;
+
+    if ( nr <= 1 )
+        return false;
+
+    if ( nr > ARRAY_SIZE(ranges) )
+    {
+        printk(XENLOG_WARNING
+               "Too many NUMA nodes (%u), not attempting PFN compression\n",
+               nr);
+        return false;
+    }
+
+    pfn_offset_sanitize_ranges();
+
+    if ( nr <= 1 )
+        return false;
+
+    top_pfn = ranges[nr - 1].base + ranges[nr - 1].size;
+
+    for ( ; ; use_slow = true )
+    {
+        if ( use_slow ? pfn_offset_calculate_slow(PFN_DOWN(base))
+                      : pfn_offset_calculate_fast(PFN_DOWN(base)) )
+        {
+            if ( top_pfn != get_max_pfn(top_pfn) )
+                dprintk(XENLOG_DEBUG,
+                        "PFN %s compression coefficients truncate address space\n",
+                        use_slow ? "slow" : "fast");
+            else
+                break;
+        }
+        else
+        {
+            dprintk(XENLOG_DEBUG,
+                    "Find PFN compression coefficients using %s algorithm failed\n",
+                    use_slow ? "slow" : "fast");
+            if ( use_slow )
+                return false;
+        }
+
+        if ( use_slow )
+            break;
+    }
+
+    printk(XENLOG_INFO "PFN compression using offset %#lx size %#lx (%lu%%)\n",
+           pdx_offset, pdx_size, ((pdx_offset - pdx_size) * 100) / pdx_offset);
+
+    return true;
+}
+#endif /* __XEN__ */
+
+void __init pfn_pdx_compression_reset(void)
+{
+    pdx_size = ~0UL;
+    pdx_offset = ~0UL;
+    nr = 0;
+    top_pfn = 0;
+}
+
+#endif /* CONFIG_PDX_OFFSET_COMPRESSION */
 
 /*
  * Local variables:
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 6cc0f54cff83..88f446f4ddd9 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -65,6 +65,31 @@
  * This scheme also holds for multiple regions, where HHHHHHH acts as
  * the region identifier and LLLLLL fully contains the span of every
  * region involved.
+ *
+ * ## PDX offset compression
+ *
+ * Alternative compression mechanism that relies on RAM ranges having a similar
+ * size and offset between them:
+ *
+ * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
+ * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
+ * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
+ * │<------>│          │
+ * │  size             │
+ * │<----------------->│
+ *         offset
+ *
+ * The compression removes the holes between RAM regions:
+ *
+ * ┌────────┬────────┐   ┌────────┐
+ * │ RAM 0  │ RAM 1  │...│ RAM N  │
+ * ├────────┼────────┘   └────────┘
+ * │<------>│
+ *    size
+ *
+ * The compressed index is calculated as:
+ *
+ * index = (pfn % offset) + ((pfn / offset) * size)
  */
 
 /* Macro defined per-arch to skip PDX logic when there's no compression. */
@@ -188,7 +213,20 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
                  offset);
 }
 
-#endif /* CONFIG_PDX_MASK_COMPRESSION */
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
+
+extern unsigned long pdx_offset;
+extern unsigned long pdx_size;
+
+/* pdx<->pfn == identity */
+#define pdx_to_pfn(x) (x)
+#define pfn_to_pdx(x) (x)
+
+/* directmap is indexed by maddr */
+#define maddr_to_directmapoff(x) (x)
+#define directmapoff_to_maddr(x) (x)
+
+#endif /* CONFIG_PDX_OFFSET_COMPRESSION */
 
 #ifdef CONFIG_PDX_NONE
 
@@ -208,7 +246,8 @@ static inline bool pdx_is_region_compressible(paddr_t base,
     return true;
 }
 
-static inline void pfn_pdx_add_region(paddr_t base, paddr_t size)
+static inline void pfn_pdx_add_region(paddr_t base, paddr_t size,
+                                      unsigned int id)
 {
 }
 
@@ -239,8 +278,9 @@ bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
  *
  * @param base Start of the region in bytes.
  * @param size Length of the region in bytes.
+ * @param id Range group identifier (for example NUMA proximity domain ID).
  */
-void pfn_pdx_add_region(paddr_t base, paddr_t size);
+void pfn_pdx_add_region(paddr_t base, paddr_t size, unsigned int id);
 
 /**
  * Initializes global variables with information about the compressible
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/8] pdx: introduce translation helpers for offset compression
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (5 preceding siblings ...)
  2025-06-11 17:16 ` [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-11 17:16 ` [PATCH 8/8] pdx: introduce a command line option " Roger Pau Monne
  7 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Anthony PERARD, Andrew Cooper, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Implement the helpers to translate from pfns or physical addresses into the
offset compressed index space.  Add a further check in the PDX testing to
ensure conversion resulting from the added functions is bi-directional.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 tools/tests/pdx/test-pdx-offset.c | 10 +++++++++
 xen/common/pdx.c                  | 10 +++++++++
 xen/include/xen/pdx.h             | 35 +++++++++++++++++++++++++------
 3 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/tools/tests/pdx/test-pdx-offset.c b/tools/tests/pdx/test-pdx-offset.c
index 0a561f02d197..a3507c36deb7 100644
--- a/tools/tests/pdx/test-pdx-offset.c
+++ b/tools/tests/pdx/test-pdx-offset.c
@@ -293,6 +293,16 @@ int main(int argc, char **argv)
                     print_ranges(tests[i].ranges);
                     ret_code = EXIT_FAILURE;
                 }
+
+                if ( start != pdx_to_pfn(pfn_to_pdx(start)) ||
+                     end - 1 != pdx_to_pfn(pfn_to_pdx(end - 1)) )
+                {
+                    printf(
+    "PDX %s compression invalid, conversion of %#lx or %#lx is not bidirectional\n",
+                           use_slow ? "slow" : "fast", start, end - 1);
+                    print_ranges(tests[i].ranges);
+                    ret_code = EXIT_FAILURE;
+                }
             }
         }
     }
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index f2cf60bbc3f8..feabdcded804 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -46,6 +46,8 @@ bool __mfn_valid(unsigned long mfn)
 
 #ifdef CONFIG_PDX_MASK_COMPRESSION
     invalid |= mfn & pfn_hole_mask;
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
+    invalid |= (mfn % pdx_offset) >= pdx_size;
 #endif
 
     if ( unlikely(evaluate_nospec(invalid)) )
@@ -314,6 +316,9 @@ void __init pfn_pdx_compression_reset(void)
 unsigned long __ro_after_init pdx_offset = ~0UL;
 unsigned long __ro_after_init pdx_size = ~0UL;
 
+paddr_t __ro_after_init pdx_paddr_offset = ~0UL;
+paddr_t __ro_after_init pdx_paddr_size = ~0UL;
+
 static unsigned long __initdata top_pfn;
 
 bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
@@ -513,6 +518,9 @@ bool __init pfn_pdx_compression_setup(paddr_t base)
     printk(XENLOG_INFO "PFN compression using offset %#lx size %#lx (%lu%%)\n",
            pdx_offset, pdx_size, ((pdx_offset - pdx_size) * 100) / pdx_offset);
 
+    pdx_paddr_offset = (paddr_t)pdx_offset << PAGE_SHIFT;
+    pdx_paddr_size = (paddr_t)pdx_size << PAGE_SHIFT;
+
     return true;
 }
 #endif /* __XEN__ */
@@ -520,7 +528,9 @@ bool __init pfn_pdx_compression_setup(paddr_t base)
 void __init pfn_pdx_compression_reset(void)
 {
     pdx_size = ~0UL;
+    pdx_paddr_size = ~(paddr_t)0;
     pdx_offset = ~0UL;
+    pdx_paddr_offset = ~(paddr_t)0;
     nr = 0;
     top_pfn = 0;
 }
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 88f446f4ddd9..5f9e5bc7ab62 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -217,14 +217,37 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
 
 extern unsigned long pdx_offset;
 extern unsigned long pdx_size;
+extern paddr_t pdx_paddr_offset;
+extern paddr_t pdx_paddr_size;
 
-/* pdx<->pfn == identity */
-#define pdx_to_pfn(x) (x)
-#define pfn_to_pdx(x) (x)
+void pdx_region_offset(unsigned long base, unsigned long size);
+bool pfn_pdx_offset_setup(void);
 
-/* directmap is indexed by maddr */
-#define maddr_to_directmapoff(x) (x)
-#define directmapoff_to_maddr(x) (x)
+static inline unsigned long pdx_to_pfn(unsigned long pdx)
+{
+    OPTIMIZE_PDX((pdx % pdx_size) + ((pdx / pdx_size) * pdx_offset),
+                 pdx);
+}
+
+static inline unsigned long pfn_to_pdx(unsigned long pfn)
+{
+    OPTIMIZE_PDX((pfn % pdx_offset) + ((pfn / pdx_offset) * pdx_size),
+                 pfn);
+}
+
+static inline unsigned long maddr_to_directmapoff(paddr_t ma)
+{
+    OPTIMIZE_PDX((ma % pdx_paddr_offset) +
+                 ((ma / pdx_paddr_offset) * pdx_paddr_size),
+                 ma);
+}
+
+static inline paddr_t directmapoff_to_maddr(unsigned long off)
+{
+    OPTIMIZE_PDX((off % pdx_paddr_size) +
+                 ((off / pdx_paddr_size) * pdx_paddr_offset),
+                 off);
+}
 
 #endif /* CONFIG_PDX_OFFSET_COMPRESSION */
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/8] pdx: introduce a command line option for offset compression
  2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (6 preceding siblings ...)
  2025-06-11 17:16 ` [PATCH 7/8] pdx: introduce translation helpers for offset compression Roger Pau Monne
@ 2025-06-11 17:16 ` Roger Pau Monne
  2025-06-17  7:09   ` Oleksii Kurochko
  2025-06-18 13:36   ` Jan Beulich
  7 siblings, 2 replies; 34+ messages in thread
From: Roger Pau Monne @ 2025-06-11 17:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Oleksii Kurochko, Community Manager,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini

Allow controlling whether to attempt PDX compression, and which algorithm
to use to calculate the coefficients.  Document the option and also add a
CHANGELOG entry for the newly added feature.

Note the work has been originally done to cope with the new Intel
Sapphire/Granite Rapids, however the compression is not explicitly tied to
Intel or x86, and hence could be helpful on other architectures.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 CHANGELOG.md                      |  3 +++
 docs/misc/xen-command-line.pandoc | 22 ++++++++++++++++++
 xen/common/pdx.c                  | 38 +++++++++++++++++++++++++++----
 3 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 23215a8cc1e6..48327f03078f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
      table or foreign memory.
 
 ### Added
+ - Introduce new PDX compression algorithm to cope with Intel Sapphire and
+   Granite Rapids having sparse memory maps.
+
  - On x86:
    - Option to attempt to fixup p2m page-faults on PVH dom0.
    - Resizable BARs is supported for PVH dom0.
diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index b0eadd2c5d58..06819576a06b 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2072,6 +2072,28 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for
 those not subject to XPTI (`no-xpti`). The feature is used only in case
 INVPCID is supported and not disabled via `invpcid=false`.
 
+### pdx-compress
+> `= <boolean> | auto | fast | slow`
+
+> Default: `auto`
+
+Only relevant when hypervisor is build with PFN PDX offset compression
+`CONFIG_PDX_OFFSET_COMPRESSION`.
+
+Controls whether Xen will engage in PFN compression, and which algorithm will
+be used to calculate the compression coefficients:
+
+* `auto`: default choice, attempt fast calculation of compression
+  coefficients, if that's not successful fallback to slow calculation.
+
+* `fast`: only attempt fast calculation of coefficients, if it fails PFN PDX
+  compression will be disabled.
+
+* `slow`: only attempt slow calculation of coefficients, if it fails PFN PDX
+  compression will be disabled.
+
+Note `pdx-compress=true` is equivalent to `pdx-compress=auto`.
+
 ### ple_gap
 > `= <integer>`
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index feabdcded804..5fd01305a7be 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -19,6 +19,7 @@
 #include <xen/mm.h>
 #include <xen/bitops.h>
 #include <xen/nospec.h>
+#include <xen/param.h>
 #include <xen/pfn.h>
 #include <xen/sections.h>
 #include <xen/sort.h>
@@ -468,11 +469,40 @@ STATIC void __init pfn_offset_sanitize_ranges(void)
 }
 
 #ifdef __XEN__
+static enum {
+    PDX_AUTO, /* Fast first, fallback to slow if fast is not successful. */
+    PDX_NO,   /* Do not attempt compression. */
+    PDX_FAST, /* Only attempt fast calculation of compression parameters. */
+    PDX_SLOW, /* Only attempt slow calculation of compression parameters. */
+} compress_mode __initdata;
+
+static int __init cf_check parse_pdx_param(const char *arg)
+{
+    int val;
+
+    if ( !arg )
+        return -EINVAL;
+
+    if ( (val = parse_bool(arg, NULL)) != -1 )
+        compress_mode = val ? PDX_AUTO : PDX_NO;
+    else if ( !strcmp(arg, "auto") )
+        compress_mode = PDX_AUTO;
+    else if ( !strcmp(arg, "fast") )
+        compress_mode = PDX_FAST;
+    else if ( !strcmp(arg, "slow") )
+        compress_mode = PDX_SLOW;
+    else
+        return -EINVAL;
+
+    return 0;
+}
+custom_param("pdx-compress", parse_pdx_param);
+
 bool __init pfn_pdx_compression_setup(paddr_t base)
 {
-    bool use_slow = false;
+    bool use_slow = compress_mode == PDX_SLOW;
 
-    if ( nr <= 1 )
+    if ( nr <= 1 || compress_mode == PDX_NO )
         return false;
 
     if ( nr > ARRAY_SIZE(ranges) )
@@ -507,11 +537,11 @@ bool __init pfn_pdx_compression_setup(paddr_t base)
             dprintk(XENLOG_DEBUG,
                     "Find PFN compression coefficients using %s algorithm failed\n",
                     use_slow ? "slow" : "fast");
-            if ( use_slow )
+            if ( compress_mode == PDX_FAST || use_slow )
                 return false;
         }
 
-        if ( use_slow )
+        if ( compress_mode == PDX_FAST || use_slow )
             break;
     }
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-11 17:16 ` [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
@ 2025-06-11 17:58   ` Andrew Cooper
  2025-06-12  6:57     ` Roger Pau Monné
  2025-06-12  9:03   ` Jan Beulich
  1 sibling, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2025-06-11 17:58 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich

On 11/06/2025 6:16 pm, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 7536b6c8717e..2f438ce367cf 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -461,30 +461,6 @@ void domain_cpu_policy_changed(struct domain *d)
>      }
>  }
>  
> -#if !defined(CONFIG_BIGMEM) && defined(CONFIG_PDX_COMPRESSION)
> -/*
> - * The hole may be at or above the 44-bit boundary, so we need to determine
> - * the total bit count until reaching 32 significant (not squashed out) bits
> - * in PFN representations.
> - * Note that the way "bits" gets initialized/updated/bounds-checked guarantees
> - * that the function will never return zero, and hence will never be called
> - * more than once (which is important due to it being deliberately placed in
> - * .init.text).
> - */
> -static unsigned int __init noinline _domain_struct_bits(void)
> -{
> -    unsigned int bits = 32 + PAGE_SHIFT;
> -    unsigned int sig = hweight32(~pfn_hole_mask);
> -    unsigned int mask = pfn_hole_mask >> 32;
> -
> -    for ( ; bits < BITS_PER_LONG && sig < 32; ++bits, mask >>= 1 )
> -        if ( !(mask & 1) )
> -            ++sig;
> -
> -    return bits;
> -}
> -#endif
> -

I'm very happy to see this disappear.  Both because of a non-__init
function calling an __init function, and because this internal is just
horrible.

>  struct domain *alloc_domain_struct(void)
>  {
>      struct domain *d;
> @@ -498,14 +474,15 @@ struct domain *alloc_domain_struct(void)
>       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
>       * such restriction.
>       */
> -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
> -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
> -                                                          32 + PAGE_SHIFT;
> +#if defined(CONFIG_BIGMEM)
> +    const unsigned int bits = 0;
>  #else
> -    static unsigned int __read_mostly bits;
> +    static unsigned int __ro_after_init bits;
>  
>      if ( unlikely(!bits) )
> -         bits = _domain_struct_bits();
> +         bits = flsl(pfn_to_paddr(pdx_to_pfn(
> +             1UL << (sizeof(((struct page_info *)NULL)->v.inuse._domain) * 8))))
> +             - 1;

I think this would benefit greatly by not being a oneliner.  There's
sizeof_field() which helps a little.

But, isn't this UB with CONFIG_BIGMEM?  You're shifting 1UL by 64.

When __pdx_t is unsigned long, there's no bits restriction necessary. 
Therefore, don't you want !bits && sizeof_field(...) < BYTES_PER_LONG as
the entry criteria?

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-11 17:16 ` [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions Roger Pau Monne
@ 2025-06-11 19:33   ` Andrew Cooper
  2025-06-12  7:26     ` Roger Pau Monné
  2025-06-12  8:27   ` Jan Beulich
  2025-06-18 13:02   ` Jan Beulich
  2 siblings, 1 reply; 34+ messages in thread
From: Andrew Cooper @ 2025-06-11 19:33 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Anthony PERARD, Michal Orzel, Jan Beulich, Julien Grall,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk

On 11/06/2025 6:16 pm, Roger Pau Monne wrote:
> With the appearance of Intel Sierra Forest and Granite Rapids it's not

s/not/now ?

The problem here is that it's very possible to get such a system.

It might be worth nothing that SRF and GNR are socket compatible, in
Birch Stream platforms, which is relevant to why they're similar in this
regard.

> possible to get a production x86 host wit the following memory map:
>
> SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
> SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
> SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
> SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]
>
> This is from a four socket system, with each node having 256GB of memory.
> The total amount of RAM on the system is 1TB, but without enabling
> CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB
> boundary covered by the frame table.
>
> Note that while the memory map is very sparse, it won't be compressible
> using the current algorithm that relies on all ranges having a shared
> zeroed region of bits that can be removed.

", it couldn't be compressed using the current PDX_MASK compression
algorithm, which relies ..."


>
> The memory map presented above has the property of all regions being
> similarly spaced between each other, and all having also a similar size.
> This allows to compress them using the following formula:
>
>  pdx = (pfn % offset) + ((pfn / offset) * size)
>
> Where offset and size are two static coefficients calculated at
> initialization.
>
> Obtaining the optimum offset and size coefficients is the complicated part.
> In this patch I introduce two different algorithms, a fast one that works
> correctly when the offset and size between ranges is mostly equal.  If such
> fast algorithm doesn't work, or the resulting compression is not enough to
> avoid truncation of the maximum usable page, it's possible to attempt a
> brute force approach for calculating the coefficients.  This is also
> implemented in this patch as the slow variant.  I've attempted to restrict
> the number of iterations in the slow approach so it can exit early if no
> better coefficients can be found due to the input constrains (minimum
> region size).
>
> The patch here focuses on introducing the logic to calculate the
> compression coefficients, plus adding a unit test to exercise the logic
> easily from user-space in order to test different layouts and possibly
> improve the generation of the coefficients.  The added unit tests only
> covers the newly added compression, but could also be extended to cover the
> existing PDX mask compression.

Is it possible to split out the userspace harness into an earlier patch,
and e.g. do some token testing of PDX_MASK ?

That halves the size of this patch.

>
> Note the translation functions (pfn to pdx, maddr to direct map offset),
> are not implemented as part of this patch, an identity set of macros are
> added to satisfy the build requirements.  The patch is already long enough
> without those.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> We can discuss whether we want both the fast and the slow variants.  The
> slow (brute force) was added as a result of me playing with weird region
> layouts where the fast one didn't manage to compress, or the resulting
> coefficients had a poor compression ratio.  However at this point the
> slow variant has only proven helpful in synthetic cases, I haven't (yet?)
> seen a real host memory layout that would benefit from it.

I'm going to hold off on opinions until I've read the rest of the series.

One question through.  Can we round offset up to the next power of two,
so we can replace the divide with a shift?

size is not a nice power of two, but I guarantee you that hardware is
not doing this routing with a divide.

It would result in some holes in PDX space, but it is almost certainly
faster.

> diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
> new file mode 100644
> index 000000000000..3d31cf488daf
> --- /dev/null
> +++ b/tools/tests/pdx/harness.h
> @@ -0,0 +1,73 @@
> ...
> +#define sort(elem, nr, size, cmp, swp) {                                \
> +    /* Consume swp() so compiler doesn't complain it's unused. */       \
> +    swp(&elem[0], &elem[0], size);                                      \

(void)swp;   ?

> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> index 93ebfc29635e..e71908b99c14 100644
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -258,7 +258,7 @@ void __init init_pdx(void)
>      unsigned int bank;
>  
>      for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> -        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
> +        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size, bank);
>  

I'd suggest plumbing bank down in a previous patch.

> diff --git a/xen/common/pdx.c b/xen/common/pdx.c
> index 7d14100224fe..f2cf60bbc3f8 100644
> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -21,6 +21,15 @@
>  #include <xen/nospec.h>
>  #include <xen/pfn.h>
>  #include <xen/sections.h>
> +#include <xen/sort.h>
> +
> +#ifdef __XEN__ /* For building the file in user-space. */
> +
> +/*
> + * Use a define for the static keyword, we want to export some otherwise static
> + * functions for the unit tests.
> + */
> +#define STATIC static

Most unit testing gets around this problem with the test harness itself
doing

#include "path/to/pdx.c"

If you do this right, the only thing needed is some #ifndef
_TEST_HARNESS around the includes at a the top.

> +static int __init cf_check cmp_node(const void *a, const void *b)
> +{
> +    const struct pfn_range *l = a;
> +    const struct pfn_range *r = b;
> +
> +    if ( l->base > r->base )
> +        return 1;
> +    if ( l->base < r->base )
> +        return -1;
> +
> +    ASSERT_UNREACHABLE();

I'm not sure if this is appropriate.  It's perfectly reachable if both
->base's are the same, and it may interfere with the inlining heuristics
for sort().

What you mean is "there shouldn't be two nodes that look like this", and
I'm not sure that the middle of sort() is the place to check this properly.

AFAICT, The real property you want is "len[i] && base[i] + len[i] <=
base[i+1]".

~Andrew


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-11 17:58   ` Andrew Cooper
@ 2025-06-12  6:57     ` Roger Pau Monné
  0 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12  6:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jan Beulich

On Wed, Jun 11, 2025 at 06:58:31PM +0100, Andrew Cooper wrote:
> On 11/06/2025 6:16 pm, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index 7536b6c8717e..2f438ce367cf 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -461,30 +461,6 @@ void domain_cpu_policy_changed(struct domain *d)
> >      }
> >  }
> >  
> > -#if !defined(CONFIG_BIGMEM) && defined(CONFIG_PDX_COMPRESSION)
> > -/*
> > - * The hole may be at or above the 44-bit boundary, so we need to determine
> > - * the total bit count until reaching 32 significant (not squashed out) bits
> > - * in PFN representations.
> > - * Note that the way "bits" gets initialized/updated/bounds-checked guarantees
> > - * that the function will never return zero, and hence will never be called
> > - * more than once (which is important due to it being deliberately placed in
> > - * .init.text).
> > - */
> > -static unsigned int __init noinline _domain_struct_bits(void)
> > -{
> > -    unsigned int bits = 32 + PAGE_SHIFT;
> > -    unsigned int sig = hweight32(~pfn_hole_mask);
> > -    unsigned int mask = pfn_hole_mask >> 32;
> > -
> > -    for ( ; bits < BITS_PER_LONG && sig < 32; ++bits, mask >>= 1 )
> > -        if ( !(mask & 1) )
> > -            ++sig;
> > -
> > -    return bits;
> > -}
> > -#endif
> > -
> 
> I'm very happy to see this disappear.  Both because of a non-__init
> function calling an __init function, and because this internal is just
> horrible.
> 
> >  struct domain *alloc_domain_struct(void)
> >  {
> >      struct domain *d;
> > @@ -498,14 +474,15 @@ struct domain *alloc_domain_struct(void)
> >       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
> >       * such restriction.
> >       */
> > -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
> > -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
> > -                                                          32 + PAGE_SHIFT;
> > +#if defined(CONFIG_BIGMEM)
> > +    const unsigned int bits = 0;
> >  #else
> > -    static unsigned int __read_mostly bits;
> > +    static unsigned int __ro_after_init bits;
> >  
> >      if ( unlikely(!bits) )
> > -         bits = _domain_struct_bits();
> > +         bits = flsl(pfn_to_paddr(pdx_to_pfn(
> > +             1UL << (sizeof(((struct page_info *)NULL)->v.inuse._domain) * 8))))
> > +             - 1;
> 
> I think this would benefit greatly by not being a oneliner.  There's
> sizeof_field() which helps a little.

Oh, I missed that.  I was expecting to find something like
member_size() or similar.

> But, isn't this UB with CONFIG_BIGMEM?  You're shifting 1UL by 64.

CONFIG_BIGMEM doesn't get here (see the #fidef above).

> When __pdx_t is unsigned long, there's no bits restriction necessary. 
> Therefore, don't you want !bits && sizeof_field(...) < BYTES_PER_LONG as
> the entry criteria?

__pdx_t being unsigned long can only happen for CONFIG_BIGMEM, and
that's still handled separately using a #if defined(CONFIG_BIGMEM) ...
(which I should have converted to #ifdef instead).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-11 19:33   ` Andrew Cooper
@ 2025-06-12  7:26     ` Roger Pau Monné
  2025-06-12  7:59       ` Roger Pau Monné
  0 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12  7:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk

On Wed, Jun 11, 2025 at 08:33:55PM +0100, Andrew Cooper wrote:
> On 11/06/2025 6:16 pm, Roger Pau Monne wrote:
> > With the appearance of Intel Sierra Forest and Granite Rapids it's not
> 
> s/not/now ?
> 
> The problem here is that it's very possible to get such a system.
> 
> It might be worth nothing that SRF and GNR are socket compatible, in
> Birch Stream platforms, which is relevant to why they're similar in this
> regard.
> 
> > possible to get a production x86 host wit the following memory map:
> >
> > SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> > SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
> > SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
> > SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
> > SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]
> >
> > This is from a four socket system, with each node having 256GB of memory.
> > The total amount of RAM on the system is 1TB, but without enabling
> > CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB
> > boundary covered by the frame table.
> >
> > Note that while the memory map is very sparse, it won't be compressible
> > using the current algorithm that relies on all ranges having a shared
> > zeroed region of bits that can be removed.
> 
> ", it couldn't be compressed using the current PDX_MASK compression
> algorithm, which relies ..."
> 
> 
> >
> > The memory map presented above has the property of all regions being
> > similarly spaced between each other, and all having also a similar size.
> > This allows to compress them using the following formula:
> >
> >  pdx = (pfn % offset) + ((pfn / offset) * size)
> >
> > Where offset and size are two static coefficients calculated at
> > initialization.
> >
> > Obtaining the optimum offset and size coefficients is the complicated part.
> > In this patch I introduce two different algorithms, a fast one that works
> > correctly when the offset and size between ranges is mostly equal.  If such
> > fast algorithm doesn't work, or the resulting compression is not enough to
> > avoid truncation of the maximum usable page, it's possible to attempt a
> > brute force approach for calculating the coefficients.  This is also
> > implemented in this patch as the slow variant.  I've attempted to restrict
> > the number of iterations in the slow approach so it can exit early if no
> > better coefficients can be found due to the input constrains (minimum
> > region size).
> >
> > The patch here focuses on introducing the logic to calculate the
> > compression coefficients, plus adding a unit test to exercise the logic
> > easily from user-space in order to test different layouts and possibly
> > improve the generation of the coefficients.  The added unit tests only
> > covers the newly added compression, but could also be extended to cover the
> > existing PDX mask compression.
> 
> Is it possible to split out the userspace harness into an earlier patch,
> and e.g. do some token testing of PDX_MASK ?

It would need a different testing harness IMO, as the current testing
harness is tied to the offset implementation internals (as it wants to
compare the results of both the fast and slow coefficient
calculations).  I could add a test harness for the mask compression,
but it would be a different file, with slightly different logic, and
hence I don't think it would reduce much the size of this patch.

> That halves the size of this patch.
> 
> >
> > Note the translation functions (pfn to pdx, maddr to direct map offset),
> > are not implemented as part of this patch, an identity set of macros are
> > added to satisfy the build requirements.  The patch is already long enough
> > without those.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > We can discuss whether we want both the fast and the slow variants.  The
> > slow (brute force) was added as a result of me playing with weird region
> > layouts where the fast one didn't manage to compress, or the resulting
> > coefficients had a poor compression ratio.  However at this point the
> > slow variant has only proven helpful in synthetic cases, I haven't (yet?)
> > seen a real host memory layout that would benefit from it.
> 
> I'm going to hold off on opinions until I've read the rest of the series.
> 
> One question through.  Can we round offset up to the next power of two,
> so we can replace the divide with a shift?

I've tried to round up both offset and size, but that resulted in no
compression in some cases.  I can try to maybe round just one of
those?

Note that the divide is done once with offset and once with size,
depending on the direction of the translation.  I can explore this a
bit.

> size is not a nice power of two, but I guarantee you that hardware is
> not doing this routing with a divide.
> 
> It would result in some holes in PDX space, but it is almost certainly
> faster.

On my TGL NUC the cost of the conversion in pfn_to_pdx() measured in
TSC cycles is:

    N           Min           Max        Median           Avg        Stddev
x 2811            26           271            29     39.224831     33.516395

It's from the user-space harness, so might not be fully accurate.  I
the average time for the operation is ~14ns on my specific system
(2800Mhz nominal frequency).

> > diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
> > new file mode 100644
> > index 000000000000..3d31cf488daf
> > --- /dev/null
> > +++ b/tools/tests/pdx/harness.h
> > @@ -0,0 +1,73 @@
> > ...
> > +#define sort(elem, nr, size, cmp, swp) {                                \
> > +    /* Consume swp() so compiler doesn't complain it's unused. */       \
> > +    swp(&elem[0], &elem[0], size);                                      \
> 
> (void)swp;   ?

Hm, yes, that might be enough to make the compiler happy about swp()
being unused, I will try it.

> 
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 93ebfc29635e..e71908b99c14 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -258,7 +258,7 @@ void __init init_pdx(void)
> >      unsigned int bank;
> >  
> >      for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> > -        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
> > +        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size, bank);
> >  
> 
> I'd suggest plumbing bank down in a previous patch.
> 
> > diff --git a/xen/common/pdx.c b/xen/common/pdx.c
> > index 7d14100224fe..f2cf60bbc3f8 100644
> > --- a/xen/common/pdx.c
> > +++ b/xen/common/pdx.c
> > @@ -21,6 +21,15 @@
> >  #include <xen/nospec.h>
> >  #include <xen/pfn.h>
> >  #include <xen/sections.h>
> > +#include <xen/sort.h>
> > +
> > +#ifdef __XEN__ /* For building the file in user-space. */
> > +
> > +/*
> > + * Use a define for the static keyword, we want to export some otherwise static
> > + * functions for the unit tests.
> > + */
> > +#define STATIC static
> 
> Most unit testing gets around this problem with the test harness itself
> doing
> 
> #include "path/to/pdx.c"

I see.  I've been kind of doing that with __XEN__, but it might be
clearer to instead use _TEST_HARNESS_, I could avoid the sed with
that.  I need to see how it looks.

My original idea was that the pdx object could be used by the PDX mask
testing also, but given the algorithm is selected at build time we
would need to generate two different object files from pdx.c anyway.

> If you do this right, the only thing needed is some #ifndef
> _TEST_HARNESS around the includes at a the top.
> 
> > +static int __init cf_check cmp_node(const void *a, const void *b)
> > +{
> > +    const struct pfn_range *l = a;
> > +    const struct pfn_range *r = b;
> > +
> > +    if ( l->base > r->base )
> > +        return 1;
> > +    if ( l->base < r->base )
> > +        return -1;
> > +
> > +    ASSERT_UNREACHABLE();
> 
> I'm not sure if this is appropriate.  It's perfectly reachable if both
> ->base's are the same, and it may interfere with the inlining heuristics
> for sort().
> 
> What you mean is "there shouldn't be two nodes that look like this", and
> I'm not sure that the middle of sort() is the place to check this properly.
> 
> AFAICT, The real property you want is "len[i] && base[i] + len[i] <=
> base[i+1]".

I could possibly do the filtering from the sanitize function.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-12  7:26     ` Roger Pau Monné
@ 2025-06-12  7:59       ` Roger Pau Monné
  0 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12  7:59 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk

On Thu, Jun 12, 2025 at 09:26:25AM +0200, Roger Pau Monné wrote:
> On Wed, Jun 11, 2025 at 08:33:55PM +0100, Andrew Cooper wrote:
> > On 11/06/2025 6:16 pm, Roger Pau Monne wrote:
> > > We can discuss whether we want both the fast and the slow variants.  The
> > > slow (brute force) was added as a result of me playing with weird region
> > > layouts where the fast one didn't manage to compress, or the resulting
> > > coefficients had a poor compression ratio.  However at this point the
> > > slow variant has only proven helpful in synthetic cases, I haven't (yet?)
> > > seen a real host memory layout that would benefit from it.
> > 
> > I'm going to hold off on opinions until I've read the rest of the series.
> > 
> > One question through.  Can we round offset up to the next power of two,
> > so we can replace the divide with a shift?
> 
> I've tried to round up both offset and size, but that resulted in no
> compression in some cases.  I can try to maybe round just one of
> those?
> 
> Note that the divide is done once with offset and once with size,
> depending on the direction of the translation.  I can explore this a
> bit.

So for the 4s GNR example, rounding the offset to the nearest (lower)
power of two will result in:

PFN fast compression result diverge, expected:
 offset 0000063e80000 size 0000008300000 (91%)
got:
 offset 0000040000000 size 0000033e80000 (18%)
Ranges:
 0000000000000-0000008080000
 0000063e80000-000006be80000
 00000c7e80000-00000cfe80000
 000012be80000-0000133e80000

And that's just rounding the offset, if we then also round the size
there's no compression possible, as size would become 0x40000000 and
thus offset == size.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-11 17:16 ` [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions Roger Pau Monne
  2025-06-11 19:33   ` Andrew Cooper
@ 2025-06-12  8:27   ` Jan Beulich
  2025-06-12 14:03     ` Roger Pau Monné
  2025-06-18 13:02   ` Jan Beulich
  2 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-12  8:27 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> With the appearance of Intel Sierra Forest and Granite Rapids it's not
> possible to get a production x86 host wit the following memory map:
> 
> SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
> SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
> SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
> SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]
> 
> This is from a four socket system, with each node having 256GB of memory.
> The total amount of RAM on the system is 1TB, but without enabling
> CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB
> boundary covered by the frame table.
> 
> Note that while the memory map is very sparse, it won't be compressible
> using the current algorithm that relies on all ranges having a shared
> zeroed region of bits that can be removed.
> 
> The memory map presented above has the property of all regions being
> similarly spaced between each other, and all having also a similar size.
> This allows to compress them using the following formula:
> 
>  pdx = (pfn % offset) + ((pfn / offset) * size)
> 
> Where offset and size are two static coefficients calculated at
> initialization.

What I would find useful here in addition would be offset and size values
resulting from the example memory map above. In particular, without looking
at the code in detail, it doesn't become quite clear how the two ranges on
node 0 are being dealt with. For what follows I'll assume they'd be folded
into a single range covering all of node 0.

Along the lines of Andrew's concern regarding the division (and modulo)
involved, I wonder whether there might be an alternative with a lookup
array, holding bias values (e.g.) for each node. Main question there would
be how to quickly determine the array index to use, both from an incoming
MFN and an incoming PDX. If such an array wouldn't have too many entries,
such a lookup may end up being faster (on average) than a division.

Taking the example above, such an array could be:

[0x00] = 0,
[0x06] = 0x061e80000 - 1 * 0x5000000,
[0x0c] = 0x0c3e80000 - 2 * 0x5000000,
[0x12] = 0x125e80000 - 3 * 0x5000000,

indexed by the top-so-many bits of the MFN. For the reverse array some
gap would need to be left between ranges (i.e. the 0x5000000 above would
perhaps need doubling; maybe a little less than that would suffice), such
that the array slot to use could be determined easily there as well.

Jan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/8] kconfig: turn PDX compression into a choice
  2025-06-11 17:16 ` [PATCH 3/8] kconfig: turn PDX compression into a choice Roger Pau Monne
@ 2025-06-12  8:28   ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2025-06-12  8:28 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
> and make it part of the PDX compression choice block, in preparation for
> adding further PDX compression algorithms.
> 
> No functional change intended as the PDX compression defaults should still
> be the same for all architectures, however the choice block cannot be
> protected under EXPERT and still have a default choice being
> unconditionally selected.  As a result, the new "PDX (Page inDeX)
> compression" item will be unconditionally visible in Kconfig.
> 
> As part of this preparation work to introduce new PDX compressions, adjust
> some of the comments on pdx.h to note they apply to a specific PDX
> compression.  Also shuffle function prototypes and dummy implementations
> around to make it easier to introduce a new PDX compression.  Note all
> PDX compression implementations are expected to provide a
> pdx_is_region_compressible() that takes the same set of arguments.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] pdx: provide a unified set of unit functions
  2025-06-11 17:16 ` [PATCH 4/8] pdx: provide a unified set of unit functions Roger Pau Monne
@ 2025-06-12  8:32   ` Jan Beulich
  2025-06-12 10:50     ` Roger Pau Monné
  2025-06-12  8:36   ` Jan Beulich
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-12  8:32 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -255,6 +255,10 @@ void __init init_pdx(void)
>  {
>      const struct membanks *mem = bootinfo_get_mem();
>      paddr_t bank_start, bank_size, bank_end;
> +    unsigned int bank;
> +
> +    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> +        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);

In patch 6 you touch all these call sites again - why not add the 3rd
parameter / argument in this patch right away?

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] pdx: provide a unified set of unit functions
  2025-06-11 17:16 ` [PATCH 4/8] pdx: provide a unified set of unit functions Roger Pau Monne
  2025-06-12  8:32   ` Jan Beulich
@ 2025-06-12  8:36   ` Jan Beulich
  2025-06-12 10:51     ` Roger Pau Monné
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-12  8:36 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> @@ -80,6 +81,39 @@ unsigned long get_max_pfn(unsigned long top_pfn)
>      return pdx_to_pfn(pdx - 1) + 1;
>  }
>  
> +#ifndef CONFIG_PDX_NONE
> +
> +#ifdef CONFIG_X86
> +# include <asm/e820.h>
> +# define MAX_PFN_RANGES E820MAX
> +#elif defined(CONFIG_HAS_DEVICE_TREE)
> +# include <xen/bootfdt.h>
> +# define MAX_PFN_RANGES NR_MEM_BANKS
> +#else
> +# error "Missing architecture maximum number of RAM ranges"
> +#endif
> +
> +/* Generic PFN compression helpers. */
> +static struct pfn_range {
> +    unsigned long base, size;
> +} ranges[MAX_PFN_RANGES] __initdata;
> +static unsigned int __initdata nr;

One other remark / nit: For my taste, "nr" isn't a suitable name for a static.
It's too short, and hence the risk is too high that some function would add a
local aliasing this one.

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-11 17:16 ` [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
  2025-06-11 17:58   ` Andrew Cooper
@ 2025-06-12  9:03   ` Jan Beulich
  2025-06-12 10:46     ` Roger Pau Monné
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-12  9:03 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> @@ -498,14 +474,15 @@ struct domain *alloc_domain_struct(void)
>       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
>       * such restriction.
>       */
> -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
> -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
> -                                                          32 + PAGE_SHIFT;
> +#if defined(CONFIG_BIGMEM)
> +    const unsigned int bits = 0;
>  #else
> -    static unsigned int __read_mostly bits;
> +    static unsigned int __ro_after_init bits;
>  
>      if ( unlikely(!bits) )
> -         bits = _domain_struct_bits();
> +         bits = flsl(pfn_to_paddr(pdx_to_pfn(
> +             1UL << (sizeof(((struct page_info *)NULL)->v.inuse._domain) * 8))))
> +             - 1;

While Andrew did point you at sizeof_field(), we can have this even less verbose
by utilizing that frame_table is of the right type and (almost) globally in scope.

Further, why use pfn_to_paddr()?

         bits = flsl(pdx_to_pfn(1UL << 
                                (sizeof(frame_table->v.inuse._domain) * 8))) +
                PAGE_SHIFT - 1;

However, it further feels like this was off by one; we had similar issues over
time in several places. There potentially being a gap between one less than
the PDX used here and that very PDX, don't we need to calculate based on the
"one less" value here? Hmm, there being a gap means no allocation would
succeed for the precise value of "bits" (in the mask-compression scheme), so
functionally all would be fine. Yet just to avoid setting a bad precedent I
think we'd still be better off using

         bits = flsl(pdx_to_pfn((1UL << 
                                 (sizeof(frame_table->v.inuse._domain) * 8)) -
                                1)) + PAGE_SHIFT;

If one would log the value of bits, the result would then also be less
confusing in (at least) the mask-compression scheme.

Jan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression
  2025-06-11 17:16 ` [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression Roger Pau Monne
@ 2025-06-12  9:11   ` Jan Beulich
  2025-06-12 10:48     ` Roger Pau Monné
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-12  9:11 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> This is the code already present and used by x86 in setup_max_pdx(), which
> takes into account the current PDX compression, plus the limitation of the
> virtual memory layout to return the maximum usable PFN in the system,
> possibly truncating the input PFN provided by the caller.
> 
> This helper will be used by upcoming PDX related changes that introduce a
> new compression algorithm.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/setup.c  | 19 ++-----------------
>  xen/common/pdx.c      | 25 +++++++++++++++++++++++++
>  xen/include/xen/pdx.h |  8 ++++++++
>  3 files changed, 35 insertions(+), 17 deletions(-)

This is all fine for x86, but on Arm you introduce unreachable code, which
Misra dislikes. Yet then it feels like it's wrong anyway that the function
isn't used there.

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-12  9:03   ` Jan Beulich
@ 2025-06-12 10:46     ` Roger Pau Monné
  2025-06-12 12:00       ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12 10:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Thu, Jun 12, 2025 at 11:03:21AM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > @@ -498,14 +474,15 @@ struct domain *alloc_domain_struct(void)
> >       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
> >       * such restriction.
> >       */
> > -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
> > -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
> > -                                                          32 + PAGE_SHIFT;
> > +#if defined(CONFIG_BIGMEM)
> > +    const unsigned int bits = 0;
> >  #else
> > -    static unsigned int __read_mostly bits;
> > +    static unsigned int __ro_after_init bits;
> >  
> >      if ( unlikely(!bits) )
> > -         bits = _domain_struct_bits();
> > +         bits = flsl(pfn_to_paddr(pdx_to_pfn(
> > +             1UL << (sizeof(((struct page_info *)NULL)->v.inuse._domain) * 8))))
> > +             - 1;
> 
> While Andrew did point you at sizeof_field(), we can have this even less verbose
> by utilizing that frame_table is of the right type and (almost) globally in scope.
> 
> Further, why use pfn_to_paddr()?
> 
>          bits = flsl(pdx_to_pfn(1UL << 
>                                 (sizeof(frame_table->v.inuse._domain) * 8))) +
>                 PAGE_SHIFT - 1;

I've introduced and used pdx_to_paddr(), which I think it's more
natural.  We already had a paddr_to_pdx() which was missing it's
bidirectional equivalent.  It's now:

         bits = flsl(pdx_to_paddr(1UL << (sizeof_field(struct page_info,
                                                       v.inuse._domain) * 8)))
                - 1;

> However, it further feels like this was off by one; we had similar issues over
> time in several places. There potentially being a gap between one less than
> the PDX used here and that very PDX, don't we need to calculate based on the
> "one less" value here? Hmm, there being a gap means no allocation would
> succeed for the precise value of "bits" (in the mask-compression scheme), so
> functionally all would be fine. Yet just to avoid setting a bad precedent I
> think we'd still be better off using
> 
>          bits = flsl(pdx_to_pfn((1UL << 
>                                  (sizeof(frame_table->v.inuse._domain) * 8)) -
>                                 1)) + PAGE_SHIFT;
> 
> If one would log the value of bits, the result would then also be less
> confusing in (at least) the mask-compression scheme.


Is the above correct tough?

Take for example the hypothetical case where pdx_to_pfn() returns
0x10.  Then flsl() will return 5 (let's leave the PAGE_SHIFT
adjustment out for the example here).  The allocation bit width would
be off-by-one, because allocating using a bit width of 5 would also
allow 0x11 to be allocated, and that won't be correct.

I think we need to get the bit width of the next pdx (so the
non-inclusive end of the range), and then subtract 1 from it,
otherwise the allocation bit width is possibly off-by-one.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression
  2025-06-12  9:11   ` Jan Beulich
@ 2025-06-12 10:48     ` Roger Pau Monné
  2025-06-12 12:02       ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12 10:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Thu, Jun 12, 2025 at 11:11:14AM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > This is the code already present and used by x86 in setup_max_pdx(), which
> > takes into account the current PDX compression, plus the limitation of the
> > virtual memory layout to return the maximum usable PFN in the system,
> > possibly truncating the input PFN provided by the caller.
> > 
> > This helper will be used by upcoming PDX related changes that introduce a
> > new compression algorithm.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/setup.c  | 19 ++-----------------
> >  xen/common/pdx.c      | 25 +++++++++++++++++++++++++
> >  xen/include/xen/pdx.h |  8 ++++++++
> >  3 files changed, 35 insertions(+), 17 deletions(-)
> 
> This is all fine for x86, but on Arm you introduce unreachable code, which
> Misra dislikes. Yet then it feels like it's wrong anyway that the function
> isn't used there.

I was also concerned regarding why ARM doesn't have an equivalent
function.  Is the frametable there supposed to cover up to the maximum
physical address?  In which case there's likely no need for any PDX
compression in the first place?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] pdx: provide a unified set of unit functions
  2025-06-12  8:32   ` Jan Beulich
@ 2025-06-12 10:50     ` Roger Pau Monné
  0 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12 10:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On Thu, Jun 12, 2025 at 10:32:17AM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -255,6 +255,10 @@ void __init init_pdx(void)
> >  {
> >      const struct membanks *mem = bootinfo_get_mem();
> >      paddr_t bank_start, bank_size, bank_end;
> > +    unsigned int bank;
> > +
> > +    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> > +        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
> 
> In patch 6 you touch all these call sites again - why not add the 3rd
> parameter / argument in this patch right away?

It's not consumed for the PDX mask compression, but I can indeed add
the extra parameter here.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] pdx: provide a unified set of unit functions
  2025-06-12  8:36   ` Jan Beulich
@ 2025-06-12 10:51     ` Roger Pau Monné
  2025-06-12 12:03       ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12 10:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On Thu, Jun 12, 2025 at 10:36:36AM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > @@ -80,6 +81,39 @@ unsigned long get_max_pfn(unsigned long top_pfn)
> >      return pdx_to_pfn(pdx - 1) + 1;
> >  }
> >  
> > +#ifndef CONFIG_PDX_NONE
> > +
> > +#ifdef CONFIG_X86
> > +# include <asm/e820.h>
> > +# define MAX_PFN_RANGES E820MAX
> > +#elif defined(CONFIG_HAS_DEVICE_TREE)
> > +# include <xen/bootfdt.h>
> > +# define MAX_PFN_RANGES NR_MEM_BANKS
> > +#else
> > +# error "Missing architecture maximum number of RAM ranges"
> > +#endif
> > +
> > +/* Generic PFN compression helpers. */
> > +static struct pfn_range {
> > +    unsigned long base, size;
> > +} ranges[MAX_PFN_RANGES] __initdata;
> > +static unsigned int __initdata nr;
> 
> One other remark / nit: For my taste, "nr" isn't a suitable name for a static.
> It's too short, and hence the risk is too high that some function would add a
> local aliasing this one.

Is nr_ranges enough to avoid aliasing?  Otherwise I could rename
ranges to pfn_ranges, and nr to nr_pfn_ranges.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-12 10:46     ` Roger Pau Monné
@ 2025-06-12 12:00       ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2025-06-12 12:00 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 12.06.2025 12:46, Roger Pau Monné wrote:
> On Thu, Jun 12, 2025 at 11:03:21AM +0200, Jan Beulich wrote:
>> On 11.06.2025 19:16, Roger Pau Monne wrote:
>>> @@ -498,14 +474,15 @@ struct domain *alloc_domain_struct(void)
>>>       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
>>>       * such restriction.
>>>       */
>>> -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
>>> -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
>>> -                                                          32 + PAGE_SHIFT;
>>> +#if defined(CONFIG_BIGMEM)
>>> +    const unsigned int bits = 0;
>>>  #else
>>> -    static unsigned int __read_mostly bits;
>>> +    static unsigned int __ro_after_init bits;
>>>  
>>>      if ( unlikely(!bits) )
>>> -         bits = _domain_struct_bits();
>>> +         bits = flsl(pfn_to_paddr(pdx_to_pfn(
>>> +             1UL << (sizeof(((struct page_info *)NULL)->v.inuse._domain) * 8))))
>>> +             - 1;
>>
>> While Andrew did point you at sizeof_field(), we can have this even less verbose
>> by utilizing that frame_table is of the right type and (almost) globally in scope.
>>
>> Further, why use pfn_to_paddr()?
>>
>>          bits = flsl(pdx_to_pfn(1UL << 
>>                                 (sizeof(frame_table->v.inuse._domain) * 8))) +
>>                 PAGE_SHIFT - 1;
> 
> I've introduced and used pdx_to_paddr(), which I think it's more
> natural.  We already had a paddr_to_pdx() which was missing it's
> bidirectional equivalent.  It's now:
> 
>          bits = flsl(pdx_to_paddr(1UL << (sizeof_field(struct page_info,
>                                                        v.inuse._domain) * 8)))
>                 - 1;

Textually this is better, yes. I won't insist on the other variant, while
still noting that your way there's an extra shift whereas my way there's
merely an extra add.

>> However, it further feels like this was off by one; we had similar issues over
>> time in several places. There potentially being a gap between one less than
>> the PDX used here and that very PDX, don't we need to calculate based on the
>> "one less" value here? Hmm, there being a gap means no allocation would
>> succeed for the precise value of "bits" (in the mask-compression scheme), so
>> functionally all would be fine. Yet just to avoid setting a bad precedent I
>> think we'd still be better off using
>>
>>          bits = flsl(pdx_to_pfn((1UL << 
>>                                  (sizeof(frame_table->v.inuse._domain) * 8)) -
>>                                 1)) + PAGE_SHIFT;
>>
>> If one would log the value of bits, the result would then also be less
>> confusing in (at least) the mask-compression scheme.
> 
> 
> Is the above correct tough?
> 
> Take for example the hypothetical case where pdx_to_pfn() returns
> 0x10.

Hmm, yes - while impossible in the mask-compression scheme, it is in
principle possible with other schemes (like the offset one).

>  Then flsl() will return 5 (let's leave the PAGE_SHIFT
> adjustment out for the example here).  The allocation bit width would
> be off-by-one, because allocating using a bit width of 5 would also
> allow 0x11 to be allocated, and that won't be correct.
> 
> I think we need to get the bit width of the next pdx (so the
> non-inclusive end of the range), and then subtract 1 from it,
> otherwise the allocation bit width is possibly off-by-one.

I think you're right, and I can't really see how to (easily) get the more
precise value for the mask-compression scheme then. I would therefore
like to ask that you attach a comment clarifying the slight oddity.

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression
  2025-06-12 10:48     ` Roger Pau Monné
@ 2025-06-12 12:02       ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2025-06-12 12:02 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini, Michal Orzel
  Cc: Andrew Cooper, Anthony PERARD, xen-devel, Roger Pau Monné

On 12.06.2025 12:48, Roger Pau Monné wrote:
> On Thu, Jun 12, 2025 at 11:11:14AM +0200, Jan Beulich wrote:
>> On 11.06.2025 19:16, Roger Pau Monne wrote:
>>> This is the code already present and used by x86 in setup_max_pdx(), which
>>> takes into account the current PDX compression, plus the limitation of the
>>> virtual memory layout to return the maximum usable PFN in the system,
>>> possibly truncating the input PFN provided by the caller.
>>>
>>> This helper will be used by upcoming PDX related changes that introduce a
>>> new compression algorithm.
>>>
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>>  xen/arch/x86/setup.c  | 19 ++-----------------
>>>  xen/common/pdx.c      | 25 +++++++++++++++++++++++++
>>>  xen/include/xen/pdx.h |  8 ++++++++
>>>  3 files changed, 35 insertions(+), 17 deletions(-)
>>
>> This is all fine for x86, but on Arm you introduce unreachable code, which
>> Misra dislikes. Yet then it feels like it's wrong anyway that the function
>> isn't used there.
> 
> I was also concerned regarding why ARM doesn't have an equivalent
> function.  Is the frametable there supposed to cover up to the maximum
> physical address?  In which case there's likely no need for any PDX
> compression in the first place?

Question goes to Arm folks.

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] pdx: provide a unified set of unit functions
  2025-06-12 10:51     ` Roger Pau Monné
@ 2025-06-12 12:03       ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2025-06-12 12:03 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 12.06.2025 12:51, Roger Pau Monné wrote:
> On Thu, Jun 12, 2025 at 10:36:36AM +0200, Jan Beulich wrote:
>> On 11.06.2025 19:16, Roger Pau Monne wrote:
>>> @@ -80,6 +81,39 @@ unsigned long get_max_pfn(unsigned long top_pfn)
>>>      return pdx_to_pfn(pdx - 1) + 1;
>>>  }
>>>  
>>> +#ifndef CONFIG_PDX_NONE
>>> +
>>> +#ifdef CONFIG_X86
>>> +# include <asm/e820.h>
>>> +# define MAX_PFN_RANGES E820MAX
>>> +#elif defined(CONFIG_HAS_DEVICE_TREE)
>>> +# include <xen/bootfdt.h>
>>> +# define MAX_PFN_RANGES NR_MEM_BANKS
>>> +#else
>>> +# error "Missing architecture maximum number of RAM ranges"
>>> +#endif
>>> +
>>> +/* Generic PFN compression helpers. */
>>> +static struct pfn_range {
>>> +    unsigned long base, size;
>>> +} ranges[MAX_PFN_RANGES] __initdata;
>>> +static unsigned int __initdata nr;
>>
>> One other remark / nit: For my taste, "nr" isn't a suitable name for a static.
>> It's too short, and hence the risk is too high that some function would add a
>> local aliasing this one.
> 
> Is nr_ranges enough to avoid aliasing?

Yes, at least as far as I'm concerned.

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-12  8:27   ` Jan Beulich
@ 2025-06-12 14:03     ` Roger Pau Monné
  2025-06-16  7:50       ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-12 14:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	xen-devel

On Thu, Jun 12, 2025 at 10:27:03AM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > With the appearance of Intel Sierra Forest and Granite Rapids it's not
> > possible to get a production x86 host wit the following memory map:
> > 
> > SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> > SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
> > SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
> > SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
> > SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]
> > 
> > This is from a four socket system, with each node having 256GB of memory.
> > The total amount of RAM on the system is 1TB, but without enabling
> > CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB
> > boundary covered by the frame table.
> > 
> > Note that while the memory map is very sparse, it won't be compressible
> > using the current algorithm that relies on all ranges having a shared
> > zeroed region of bits that can be removed.
> > 
> > The memory map presented above has the property of all regions being
> > similarly spaced between each other, and all having also a similar size.
> > This allows to compress them using the following formula:
> > 
> >  pdx = (pfn % offset) + ((pfn / offset) * size)
> > 
> > Where offset and size are two static coefficients calculated at
> > initialization.
> 
> What I would find useful here in addition would be offset and size values
> resulting from the example memory map above. In particular, without looking
> at the code in detail, it doesn't become quite clear how the two ranges on
> node 0 are being dealt with. For what follows I'll assume they'd be folded
> into a single range covering all of node 0.

Indeed, they are folded into a single range, that's why the function
to register ranges takes an ID, so that for this algorithm ranges with
the same ID are folded together.

For the above example the offset (pfn based) is 0x63e80000 and the
size 0x8300000.  You can see those (and for all the other examples) on
the test-pdx-offset.c file.

> Along the lines of Andrew's concern regarding the division (and modulo)
> involved, I wonder whether there might be an alternative with a lookup
> array, holding bias values (e.g.) for each node. Main question there would
> be how to quickly determine the array index to use, both from an incoming
> MFN and an incoming PDX. If such an array wouldn't have too many entries,
> such a lookup may end up being faster (on average) than a division.
> 
> Taking the example above, such an array could be:
> 
> [0x00] = 0,
> [0x06] = 0x061e80000 - 1 * 0x5000000,
> [0x0c] = 0x0c3e80000 - 2 * 0x5000000,
> [0x12] = 0x125e80000 - 3 * 0x5000000,
> 
> indexed by the top-so-many bits of the MFN. For the reverse array some
> gap would need to be left between ranges (i.e. the 0x5000000 above would
> perhaps need doubling; maybe a little less than that would suffice), such
> that the array slot to use could be determined easily there as well.

I've assumed that any kind of lookups like this would end up being
slower than arithmetic transformations.  I had the (maybe wrong)
impression that having to fetch the adjustment from an array based on
a calculated index would result in slower code that using constant
coefficients.

I was also worried about the extra memory consumption of this
approach, but overall we can use a full page for the lookup table,
which would allow up to 512 entries and that should be more than
enough.

I can try to code this suggestion.  However it's hard to benchmark
those algorithms, as the cost of rdtsc shadows the cost of the
operation.  Then running the translation in a tight loop and averaging
the result also doesn't seem very realistic, as the cache is hot in
that case.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-12 14:03     ` Roger Pau Monné
@ 2025-06-16  7:50       ` Jan Beulich
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2025-06-16  7:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	xen-devel

On 12.06.2025 16:03, Roger Pau Monné wrote:
> On Thu, Jun 12, 2025 at 10:27:03AM +0200, Jan Beulich wrote:
>> On 11.06.2025 19:16, Roger Pau Monne wrote:
>>> With the appearance of Intel Sierra Forest and Granite Rapids it's not
>>> possible to get a production x86 host wit the following memory map:
>>>
>>> SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
>>> SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
>>> SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
>>> SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
>>> SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]
>>>
>>> This is from a four socket system, with each node having 256GB of memory.
>>> The total amount of RAM on the system is 1TB, but without enabling
>>> CONFIG_BIGMEM the last range is not accessible, as it's above the 16TB
>>> boundary covered by the frame table.
>>>
>>> Note that while the memory map is very sparse, it won't be compressible
>>> using the current algorithm that relies on all ranges having a shared
>>> zeroed region of bits that can be removed.
>>>
>>> The memory map presented above has the property of all regions being
>>> similarly spaced between each other, and all having also a similar size.
>>> This allows to compress them using the following formula:
>>>
>>>  pdx = (pfn % offset) + ((pfn / offset) * size)
>>>
>>> Where offset and size are two static coefficients calculated at
>>> initialization.
>>
>> What I would find useful here in addition would be offset and size values
>> resulting from the example memory map above. In particular, without looking
>> at the code in detail, it doesn't become quite clear how the two ranges on
>> node 0 are being dealt with. For what follows I'll assume they'd be folded
>> into a single range covering all of node 0.
> 
> Indeed, they are folded into a single range, that's why the function
> to register ranges takes an ID, so that for this algorithm ranges with
> the same ID are folded together.
> 
> For the above example the offset (pfn based) is 0x63e80000 and the
> size 0x8300000.  You can see those (and for all the other examples) on
> the test-pdx-offset.c file.

Oh, okay; didn't think of looking at the numbers in the test.

>> Along the lines of Andrew's concern regarding the division (and modulo)
>> involved, I wonder whether there might be an alternative with a lookup
>> array, holding bias values (e.g.) for each node. Main question there would
>> be how to quickly determine the array index to use, both from an incoming
>> MFN and an incoming PDX. If such an array wouldn't have too many entries,
>> such a lookup may end up being faster (on average) than a division.
>>
>> Taking the example above, such an array could be:
>>
>> [0x00] = 0,
>> [0x06] = 0x061e80000 - 1 * 0x5000000,
>> [0x0c] = 0x0c3e80000 - 2 * 0x5000000,
>> [0x12] = 0x125e80000 - 3 * 0x5000000,
>>
>> indexed by the top-so-many bits of the MFN. For the reverse array some
>> gap would need to be left between ranges (i.e. the 0x5000000 above would
>> perhaps need doubling; maybe a little less than that would suffice), such
>> that the array slot to use could be determined easily there as well.
> 
> I've assumed that any kind of lookups like this would end up being
> slower than arithmetic transformations.  I had the (maybe wrong)
> impression that having to fetch the adjustment from an array based on
> a calculated index would result in slower code that using constant
> coefficients.

Latency and throughput of DIV are quite a bit higher than those of memory
reads, assuming such reads come from a relatively hot cacheline. Then
again comparing such merely from spelled out numbers in some docs usually
doesn't work overly well.

> I was also worried about the extra memory consumption of this
> approach, but overall we can use a full page for the lookup table,
> which would allow up to 512 entries and that should be more than
> enough.

In the example above far less than a page should be needed. In general I'd
expect one array slot per (contiguous chunk on a) node.

> I can try to code this suggestion.  However it's hard to benchmark
> those algorithms, as the cost of rdtsc shadows the cost of the
> operation.  Then running the translation in a tight loop and averaging
> the result also doesn't seem very realistic, as the cache is hot in
> that case.

Except that, the fewer entries such an array would have, the hotter the
cacheline(s) can be expected to be anyway. But yes, gaining a clear
picture may be difficult. Then again please recall that my earlier
patching attempt (using BMI2 insns) was also rejected mainly on the
basis that the insns chosen are known to not perform well on some
hardware, without having taken any actual numbers (which again would
have been difficult to obtain in a representable way) into account.

Overall I'm not sure this alternative if worth trying out. I merely
wanted to point out there possibly is such an alternative, given the
concern Andrew had voiced. In the end there may be similar concerns
here ...

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/8] pdx: introduce a command line option for offset compression
  2025-06-11 17:16 ` [PATCH 8/8] pdx: introduce a command line option " Roger Pau Monne
@ 2025-06-17  7:09   ` Oleksii Kurochko
  2025-06-18 13:36   ` Jan Beulich
  1 sibling, 0 replies; 34+ messages in thread
From: Oleksii Kurochko @ 2025-06-17  7:09 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Community Manager, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

[-- Attachment #1: Type: text/plain, Size: 4877 bytes --]


On 6/11/25 7:16 PM, Roger Pau Monne wrote:
> Allow controlling whether to attempt PDX compression, and which algorithm
> to use to calculate the coefficients.  Document the option and also add a
> CHANGELOG entry for the newly added feature.
>
> Note the work has been originally done to cope with the new Intel
> Sapphire/Granite Rapids, however the compression is not explicitly tied to
> Intel or x86, and hence could be helpful on other architectures.
>
> Signed-off-by: Roger Pau Monné<roger.pau@citrix.com>
> ---
>   CHANGELOG.md                      |  3 +++
>   docs/misc/xen-command-line.pandoc | 22 ++++++++++++++++++
>   xen/common/pdx.c                  | 38 +++++++++++++++++++++++++++----
>   3 files changed, 59 insertions(+), 4 deletions(-)

LGTM: Reviewed-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>

Thanks.

~ Oleksii

>
> diff --git a/CHANGELOG.md b/CHANGELOG.md
> index 23215a8cc1e6..48327f03078f 100644
> --- a/CHANGELOG.md
> +++ b/CHANGELOG.md
> @@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
>        table or foreign memory.
>   
>   ### Added
> + - Introduce new PDX compression algorithm to cope with Intel Sapphire and
> +   Granite Rapids having sparse memory maps.
> +
>    - On x86:
>      - Option to attempt to fixup p2m page-faults on PVH dom0.
>      - Resizable BARs is supported for PVH dom0.
> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index b0eadd2c5d58..06819576a06b 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2072,6 +2072,28 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for
>   those not subject to XPTI (`no-xpti`). The feature is used only in case
>   INVPCID is supported and not disabled via `invpcid=false`.
>   
> +### pdx-compress
> +> `= <boolean> | auto | fast | slow`
> +
> +> Default: `auto`
> +
> +Only relevant when hypervisor is build with PFN PDX offset compression
> +`CONFIG_PDX_OFFSET_COMPRESSION`.
> +
> +Controls whether Xen will engage in PFN compression, and which algorithm will
> +be used to calculate the compression coefficients:
> +
> +* `auto`: default choice, attempt fast calculation of compression
> +  coefficients, if that's not successful fallback to slow calculation.
> +
> +* `fast`: only attempt fast calculation of coefficients, if it fails PFN PDX
> +  compression will be disabled.
> +
> +* `slow`: only attempt slow calculation of coefficients, if it fails PFN PDX
> +  compression will be disabled.
> +
> +Note `pdx-compress=true` is equivalent to `pdx-compress=auto`.
> +
>   ### ple_gap
>   > `= <integer>`
>   
> diff --git a/xen/common/pdx.c b/xen/common/pdx.c
> index feabdcded804..5fd01305a7be 100644
> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -19,6 +19,7 @@
>   #include <xen/mm.h>
>   #include <xen/bitops.h>
>   #include <xen/nospec.h>
> +#include <xen/param.h>
>   #include <xen/pfn.h>
>   #include <xen/sections.h>
>   #include <xen/sort.h>
> @@ -468,11 +469,40 @@ STATIC void __init pfn_offset_sanitize_ranges(void)
>   }
>   
>   #ifdef __XEN__
> +static enum {
> +    PDX_AUTO, /* Fast first, fallback to slow if fast is not successful. */
> +    PDX_NO,   /* Do not attempt compression. */
> +    PDX_FAST, /* Only attempt fast calculation of compression parameters. */
> +    PDX_SLOW, /* Only attempt slow calculation of compression parameters. */
> +} compress_mode __initdata;
> +
> +static int __init cf_check parse_pdx_param(const char *arg)
> +{
> +    int val;
> +
> +    if ( !arg )
> +        return -EINVAL;
> +
> +    if ( (val = parse_bool(arg, NULL)) != -1 )
> +        compress_mode = val ? PDX_AUTO : PDX_NO;
> +    else if ( !strcmp(arg, "auto") )
> +        compress_mode = PDX_AUTO;
> +    else if ( !strcmp(arg, "fast") )
> +        compress_mode = PDX_FAST;
> +    else if ( !strcmp(arg, "slow") )
> +        compress_mode = PDX_SLOW;
> +    else
> +        return -EINVAL;
> +
> +    return 0;
> +}
> +custom_param("pdx-compress", parse_pdx_param);
> +
>   bool __init pfn_pdx_compression_setup(paddr_t base)
>   {
> -    bool use_slow = false;
> +    bool use_slow = compress_mode == PDX_SLOW;
>   
> -    if ( nr <= 1 )
> +    if ( nr <= 1 || compress_mode == PDX_NO )
>           return false;
>   
>       if ( nr > ARRAY_SIZE(ranges) )
> @@ -507,11 +537,11 @@ bool __init pfn_pdx_compression_setup(paddr_t base)
>               dprintk(XENLOG_DEBUG,
>                       "Find PFN compression coefficients using %s algorithm failed\n",
>                       use_slow ? "slow" : "fast");
> -            if ( use_slow )
> +            if ( compress_mode == PDX_FAST || use_slow )
>                   return false;
>           }
>   
> -        if ( use_slow )
> +        if ( compress_mode == PDX_FAST || use_slow )
>               break;
>       }
>   

[-- Attachment #2: Type: text/html, Size: 5590 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-11 17:16 ` [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions Roger Pau Monne
  2025-06-11 19:33   ` Andrew Cooper
  2025-06-12  8:27   ` Jan Beulich
@ 2025-06-18 13:02   ` Jan Beulich
  2025-06-18 14:11     ` Roger Pau Monné
  2 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-18 13:02 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> With the appearance of Intel Sierra Forest and Granite Rapids it's not
> possible to get a production x86 host wit the following memory map:
> 
> SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
> SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
> SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
> SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]

Looks like these aren't the numbers that the test harness uses. The main
properties (relevant for the algorithms) look to be the same, though.

> ---
> We can discuss whether we want both the fast and the slow variants.  The
> slow (brute force) was added as a result of me playing with weird region
> layouts where the fast one didn't manage to compress, or the resulting
> coefficients had a poor compression ratio.  However at this point the
> slow variant has only proven helpful in synthetic cases, I haven't (yet?)
> seen a real host memory layout that would benefit from it.

I'm not convinced we need the slow variant right away. What exactly (if
anything) is going to be wanted/needed on future hardware we'll only really
know when such arrives. However, see also my comment on xen/pdx.h.

> @@ -297,7 +309,223 @@ void __init pfn_pdx_compression_reset(void)
>      pfn_pdx_hole_shift = 0;
>  }
>  
> -#endif /* CONFIG_PDX_COMPRESSION */
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
> +
> +unsigned long __ro_after_init pdx_offset = ~0UL;
> +unsigned long __ro_after_init pdx_size = ~0UL;
> +
> +static unsigned long __initdata top_pfn;
> +
> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> +{
> +    return !pdx_size ? true
> +                     : (PFN_DOWN(base) % pdx_offset) + npages <= pdx_size;
> +}
> +
> +STATIC bool __init pfn_offset_calculate_fast(unsigned long base)
> +{
> +    unsigned long size = max((1UL << MAX_ORDER), base);

Since pfn_pdx_compression_setup(), where "base" originally comes from, has no
caller (yet), it's hard to follow what "base" is and why it would affect "size".

> +    unsigned long offset = ~0UL;
> +    unsigned int i;
> +
> +    if ( nr <= 1 )
> +        return false;
> +
> +    /* Calculate minimal offset between regions. */
> +    for ( i = 1; i < nr; i++ )
> +        offset = min(offset, ranges[i].base - ranges[i - 1].base);
> +
> +    /* Return early if offset is smaller than the minimum size. */
> +    if ( size >= offset )
> +        return false;

Comment and code are off by one with one another. I actually wonder whether, for
the scheme to be beneficial, there shouldn't be some threshold below which we
would also go without doing any compression.

> --- a/xen/include/xen/pdx.h
> +++ b/xen/include/xen/pdx.h
> @@ -65,6 +65,31 @@
>   * This scheme also holds for multiple regions, where HHHHHHH acts as
>   * the region identifier and LLLLLL fully contains the span of every
>   * region involved.
> + *
> + * ## PDX offset compression
> + *
> + * Alternative compression mechanism that relies on RAM ranges having a similar
> + * size and offset between them:
> + *
> + * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
> + * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
> + * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
> + * │<------>│          │
> + * │  size             │
> + * │<----------------->│
> + *         offset
> + *
> + * The compression removes the holes between RAM regions:
> + *
> + * ┌────────┬────────┐   ┌────────┐
> + * │ RAM 0  │ RAM 1  │...│ RAM N  │
> + * ├────────┼────────┘   └────────┘
> + * │<------>│
> + *    size
> + *
> + * The compressed index is calculated as:
> + *
> + * index = (pfn % offset) + ((pfn / offset) * size)
>   */

Would be nice imo if the presentation here wouldn't give the impression that
the offsets are all identical, and hence the compressed map ends up being
entirely without holes. In fact I can't help the impression that with enough
nodes (but otherwise the same properties, i.e. only node 0 being special) at
least the "fast" calculation will fail. Which in turn would be an argument
to keep the "slow" one.

In fact, the alternative approach we discussed - avoiding division, but
using a table of offsets - would seem to avoid such a weakness, because the
offsets can then vary (to some degree; it still needs to be possible to
easily determine the table indexes).

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/8] pdx: introduce a command line option for offset compression
  2025-06-11 17:16 ` [PATCH 8/8] pdx: introduce a command line option " Roger Pau Monne
  2025-06-17  7:09   ` Oleksii Kurochko
@ 2025-06-18 13:36   ` Jan Beulich
  2025-06-18 14:12     ` Roger Pau Monné
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2025-06-18 13:36 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On 11.06.2025 19:16, Roger Pau Monne wrote:
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2072,6 +2072,28 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for
>  those not subject to XPTI (`no-xpti`). The feature is used only in case
>  INVPCID is supported and not disabled via `invpcid=false`.
>  
> +### pdx-compress
> +> `= <boolean> | auto | fast | slow`
> +
> +> Default: `auto`
> +
> +Only relevant when hypervisor is build with PFN PDX offset compression
> +`CONFIG_PDX_OFFSET_COMPRESSION`.

Which, however, the name of the command line option doesn't reflect at all.
Question is what your thoughts were towards what if another compression
method also wanted a command line control.

Jan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions
  2025-06-18 13:02   ` Jan Beulich
@ 2025-06-18 14:11     ` Roger Pau Monné
  0 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-18 14:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Anthony PERARD, Andrew Cooper, Michal Orzel, Julien Grall,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	xen-devel

On Wed, Jun 18, 2025 at 03:02:34PM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > With the appearance of Intel Sierra Forest and Granite Rapids it's not
> > possible to get a production x86 host wit the following memory map:
> > 
> > SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> > SRAT: Node 0 PXM 0 [0000000100000000, 000000407fffffff]
> > SRAT: Node 1 PXM 1 [0000061e80000000, 0000065e7fffffff]
> > SRAT: Node 2 PXM 2 [00000c3e80000000, 00000c7e7fffffff]
> > SRAT: Node 3 PXM 3 [0000125e80000000, 0000129e7fffffff]
> 
> Looks like these aren't the numbers that the test harness uses. The main
> properties (relevant for the algorithms) look to be the same, though.

Yeah, seems like the report we got has two different memory setups,
one with 1TB, and one with 2TB of total RAM.  The layout however
(offsets) are the same.

> > ---
> > We can discuss whether we want both the fast and the slow variants.  The
> > slow (brute force) was added as a result of me playing with weird region
> > layouts where the fast one didn't manage to compress, or the resulting
> > coefficients had a poor compression ratio.  However at this point the
> > slow variant has only proven helpful in synthetic cases, I haven't (yet?)
> > seen a real host memory layout that would benefit from it.
> 
> I'm not convinced we need the slow variant right away. What exactly (if
> anything) is going to be wanted/needed on future hardware we'll only really
> know when such arrives. However, see also my comment on xen/pdx.h.

I've implemented the lookup table as you suggested, and I think that's
more efficient than the current div plus mod.  With the lookup table
implementation there's no longer a fast vs slow variants.

> > @@ -297,7 +309,223 @@ void __init pfn_pdx_compression_reset(void)
> >      pfn_pdx_hole_shift = 0;
> >  }
> >  
> > -#endif /* CONFIG_PDX_COMPRESSION */
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
> > +
> > +unsigned long __ro_after_init pdx_offset = ~0UL;
> > +unsigned long __ro_after_init pdx_size = ~0UL;
> > +
> > +static unsigned long __initdata top_pfn;
> > +
> > +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> > +{
> > +    return !pdx_size ? true
> > +                     : (PFN_DOWN(base) % pdx_offset) + npages <= pdx_size;
> > +}
> > +
> > +STATIC bool __init pfn_offset_calculate_fast(unsigned long base)
> > +{
> > +    unsigned long size = max((1UL << MAX_ORDER), base);
> 
> Since pfn_pdx_compression_setup(), where "base" originally comes from, has no
> caller (yet), it's hard to follow what "base" is and why it would affect "size".

pfn_pdx_compression_setup() has existing callers from patch 4: "pdx:
provide a unified set of unit functions".

> > +    unsigned long offset = ~0UL;
> > +    unsigned int i;
> > +
> > +    if ( nr <= 1 )
> > +        return false;
> > +
> > +    /* Calculate minimal offset between regions. */
> > +    for ( i = 1; i < nr; i++ )
> > +        offset = min(offset, ranges[i].base - ranges[i - 1].base);
> > +
> > +    /* Return early if offset is smaller than the minimum size. */
> > +    if ( size >= offset )
> > +        return false;
> 
> Comment and code are off by one with one another. I actually wonder whether, for
> the scheme to be beneficial, there shouldn't be some threshold below which we
> would also go without doing any compression.

Yeah, I've wondered the same, but I didn't know where to put the
threshold.

> > --- a/xen/include/xen/pdx.h
> > +++ b/xen/include/xen/pdx.h
> > @@ -65,6 +65,31 @@
> >   * This scheme also holds for multiple regions, where HHHHHHH acts as
> >   * the region identifier and LLLLLL fully contains the span of every
> >   * region involved.
> > + *
> > + * ## PDX offset compression
> > + *
> > + * Alternative compression mechanism that relies on RAM ranges having a similar
> > + * size and offset between them:
> > + *
> > + * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
> > + * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
> > + * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
> > + * │<------>│          │
> > + * │  size             │
> > + * │<----------------->│
> > + *         offset
> > + *
> > + * The compression removes the holes between RAM regions:
> > + *
> > + * ┌────────┬────────┐   ┌────────┐
> > + * │ RAM 0  │ RAM 1  │...│ RAM N  │
> > + * ├────────┼────────┘   └────────┘
> > + * │<------>│
> > + *    size
> > + *
> > + * The compressed index is calculated as:
> > + *
> > + * index = (pfn % offset) + ((pfn / offset) * size)
> >   */
> 
> Would be nice imo if the presentation here wouldn't give the impression that
> the offsets are all identical, and hence the compressed map ends up being
> entirely without holes.

OK, I've made this too nice I guess.

> In fact I can't help the impression that with enough
> nodes (but otherwise the same properties, i.e. only node 0 being special) at
> least the "fast" calculation will fail. Which in turn would be an argument
> to keep the "slow" one.

It depends.  Fast calculation assumes the minimal offset and then just
adjusts the size.  If the offset is not the same between ranges this
leads to size being expanded to ensure the selected offset works with
all ranges.  If there are enough nodes, and spread enough it's
possible for the algorithm to converge in offset == size.

> In fact, the alternative approach we discussed - avoiding division, but
> using a table of offsets - would seem to avoid such a weakness, because the
> offsets can then vary (to some degree; it still needs to be possible to
> easily determine the table indexes).

Yeah, that's what I'm preparing to send.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/8] pdx: introduce a command line option for offset compression
  2025-06-18 13:36   ` Jan Beulich
@ 2025-06-18 14:12     ` Roger Pau Monné
  0 siblings, 0 replies; 34+ messages in thread
From: Roger Pau Monné @ 2025-06-18 14:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On Wed, Jun 18, 2025 at 03:36:57PM +0200, Jan Beulich wrote:
> On 11.06.2025 19:16, Roger Pau Monne wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -2072,6 +2072,28 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for
> >  those not subject to XPTI (`no-xpti`). The feature is used only in case
> >  INVPCID is supported and not disabled via `invpcid=false`.
> >  
> > +### pdx-compress
> > +> `= <boolean> | auto | fast | slow`
> > +
> > +> Default: `auto`
> > +
> > +Only relevant when hypervisor is build with PFN PDX offset compression
> > +`CONFIG_PDX_OFFSET_COMPRESSION`.
> 
> Which, however, the name of the command line option doesn't reflect at all.
> Question is what your thoughts were towards what if another compression
> method also wanted a command line control.

This will become a boolean option that applies to all PDX
compressions: `pdx-compress = <boolean>` to enabled or disable PDX
compression.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-06-18 14:13 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11 17:16 [PATCH 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
2025-06-11 17:16 ` [PATCH 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
2025-06-11 17:58   ` Andrew Cooper
2025-06-12  6:57     ` Roger Pau Monné
2025-06-12  9:03   ` Jan Beulich
2025-06-12 10:46     ` Roger Pau Monné
2025-06-12 12:00       ` Jan Beulich
2025-06-11 17:16 ` [PATCH 2/8] pdx: introduce function to calculate max PFN based on PDX compression Roger Pau Monne
2025-06-12  9:11   ` Jan Beulich
2025-06-12 10:48     ` Roger Pau Monné
2025-06-12 12:02       ` Jan Beulich
2025-06-11 17:16 ` [PATCH 3/8] kconfig: turn PDX compression into a choice Roger Pau Monne
2025-06-12  8:28   ` Jan Beulich
2025-06-11 17:16 ` [PATCH 4/8] pdx: provide a unified set of unit functions Roger Pau Monne
2025-06-12  8:32   ` Jan Beulich
2025-06-12 10:50     ` Roger Pau Monné
2025-06-12  8:36   ` Jan Beulich
2025-06-12 10:51     ` Roger Pau Monné
2025-06-12 12:03       ` Jan Beulich
2025-06-11 17:16 ` [PATCH 5/8] pdx: allow optimizing PDX conversion helpers Roger Pau Monne
2025-06-11 17:16 ` [PATCH 6/8] pdx: introduce a new compression algorithm based on offsets between regions Roger Pau Monne
2025-06-11 19:33   ` Andrew Cooper
2025-06-12  7:26     ` Roger Pau Monné
2025-06-12  7:59       ` Roger Pau Monné
2025-06-12  8:27   ` Jan Beulich
2025-06-12 14:03     ` Roger Pau Monné
2025-06-16  7:50       ` Jan Beulich
2025-06-18 13:02   ` Jan Beulich
2025-06-18 14:11     ` Roger Pau Monné
2025-06-11 17:16 ` [PATCH 7/8] pdx: introduce translation helpers for offset compression Roger Pau Monne
2025-06-11 17:16 ` [PATCH 8/8] pdx: introduce a command line option " Roger Pau Monne
2025-06-17  7:09   ` Oleksii Kurochko
2025-06-18 13:36   ` Jan Beulich
2025-06-18 14:12     ` Roger Pau Monné

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.