[PATCH v2 0/8] pdx: introduce a new compression algorithm

* [PATCH v2 0/8] pdx: introduce a new compression algorithm
@ 2025-06-20 11:11 Roger Pau Monne
  2025-06-20 11:11 ` [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
                   ` (8 more replies)
  0 siblings, 9 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager

Hello,

This series implements a new PDX compression algorithm to cope with the
sparse memory maps found on Intel Sapphire Rapids and Granite Rapids systems.

Patches 1 to 7 prepare the existing code to make it easier to introduce
a new PDX compression, including generalizing the initialization and
setup functions and adding a unit test for PDX compression.

Patch 8 introduces the new compression.  The new compression is only
enabled by default on x86; other architectures are left with their
previous defaults.

Thanks, Roger.

Roger Pau Monne (8):
  x86/pdx: simplify calculation of domain struct allocation boundary
  kconfig: turn PDX compression into a choice
  pdx: provide a unified set of unit functions
  pdx: introduce command line compression toggle
  pdx: allow per-arch optimization of PDX conversion helpers
  test/pdx: add PDX compression unit tests
  pdx: move some helpers in preparation for new compression
  pdx: introduce a new compression algorithm based on region offsets

 CHANGELOG.md                           |   3 +
 docs/misc/xen-command-line.pandoc      |   9 +
 tools/tests/Makefile                   |   1 +
 tools/tests/pdx/.gitignore             |   3 +
 tools/tests/pdx/Makefile               |  49 ++++
 tools/tests/pdx/harness.h              |  99 +++++++
 tools/tests/pdx/test-pdx.c             | 224 +++++++++++++++
 xen/arch/arm/include/asm/Makefile      |   1 +
 xen/arch/arm/setup.c                   |  34 +--
 xen/arch/ppc/include/asm/Makefile      |   1 +
 xen/arch/riscv/include/asm/Makefile    |   1 +
 xen/arch/x86/domain.c                  |  40 +--
 xen/arch/x86/include/asm/cpufeatures.h |   1 +
 xen/arch/x86/include/asm/pdx.h         |  75 +++++
 xen/arch/x86/srat.c                    |  30 +-
 xen/common/Kconfig                     |  37 ++-
 xen/common/pdx.c                       | 379 ++++++++++++++++++++++---
 xen/include/asm-generic/pdx.h          |  24 ++
 xen/include/xen/pdx.h                  | 201 +++++++++----
 19 files changed, 1056 insertions(+), 156 deletions(-)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx.c
 create mode 100644 xen/arch/x86/include/asm/pdx.h
 create mode 100644 xen/include/asm-generic/pdx.h

-- 
2.49.0




* [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 13:05   ` Jan Beulich
  2025-06-20 11:11 ` [PATCH v2 2/8] kconfig: turn PDX compression into a choice Roger Pau Monne
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini

When not using CONFIG_BIGMEM there are some restrictions on the address
width for allocations of the domain structure, because its PDX, truncated to
32 bits, is stashed into the page_info structure for domain allocated pages.

The current logic to calculate this limit is based on the internals of the
PDX compression used, which is not strictly required.  Instead simplify the
logic to rely on the existing PDX to PFN conversion helpers used elsewhere.

This has the added benefit of allowing alternative PDX compression
algorithms to be implemented without having to change the calculation of
the domain structure allocation boundary.

As a side effect, introduce the pdx_to_paddr() conversion macro and use it.
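
As a rough illustration of the new calculation, here is a standalone sketch
rather than the patch code.  It assumes a 4-byte PDX field, PAGE_SHIFT of 12
and a made-up mask compression that squashes PFN bits 20 to 23; every name
ending in _sketch, as well as domain_struct_bits(), is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical compression state: PFN bits [20, 24) are squashed out. */
#define BOTTOM_SHIFT 20
#define HOLE_SHIFT   4

/* 1-based position of the most significant set bit, 0 when x is 0. */
static unsigned int fls64_sketch(uint64_t x)
{
    unsigned int r = 0;

    for ( ; x; x >>= 1 )
        ++r;

    return r;
}

/* Mask-style PDX to PFN conversion: reinsert the hole above the low bits. */
static uint64_t pdx_to_pfn_sketch(uint64_t pdx, bool compressed)
{
    const uint64_t bottom_mask = (UINT64_C(1) << BOTTOM_SHIFT) - 1;
    const uint64_t hole_mask =
        ((UINT64_C(1) << HOLE_SHIFT) - 1) << BOTTOM_SHIFT;
    const uint64_t top_mask = ~(bottom_mask | hole_mask);

    if ( !compressed )
        return pdx;

    return (pdx & bottom_mask) | ((pdx << HOLE_SHIFT) & top_mask);
}

/*
 * Width of the first address whose PDX no longer fits in the field,
 * minus one, so allocations stay below the PDX field limit.
 */
static unsigned int domain_struct_bits(unsigned int field_bytes,
                                       bool compressed)
{
    uint64_t pdx = UINT64_C(1) << (field_bytes * 8);
    uint64_t paddr = pdx_to_pfn_sketch(pdx, compressed) << PAGE_SHIFT;

    return fls64_sketch(paddr) - 1;
}
```

Under these assumptions domain_struct_bits(4, false) gives 44, matching the
previous 32 + PAGE_SHIFT value, and domain_struct_bits(4, true) gives 48
because the four squashed bits widen the usable address range.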

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Use sizeof_field().
 - Introduce and use pdx_to_paddr().
 - Add comment.
---
 xen/arch/x86/domain.c | 40 +++++++++++-----------------------------
 xen/include/xen/pdx.h |  1 +
 2 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d025befe3d8e..14a0f6dda791 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -461,30 +461,6 @@ void domain_cpu_policy_changed(struct domain *d)
     }
 }
 
-#if !defined(CONFIG_BIGMEM) && defined(CONFIG_PDX_COMPRESSION)
-/*
- * The hole may be at or above the 44-bit boundary, so we need to determine
- * the total bit count until reaching 32 significant (not squashed out) bits
- * in PFN representations.
- * Note that the way "bits" gets initialized/updated/bounds-checked guarantees
- * that the function will never return zero, and hence will never be called
- * more than once (which is important due to it being deliberately placed in
- * .init.text).
- */
-static unsigned int __init noinline _domain_struct_bits(void)
-{
-    unsigned int bits = 32 + PAGE_SHIFT;
-    unsigned int sig = hweight32(~pfn_hole_mask);
-    unsigned int mask = pfn_hole_mask >> 32;
-
-    for ( ; bits < BITS_PER_LONG && sig < 32; ++bits, mask >>= 1 )
-        if ( !(mask & 1) )
-            ++sig;
-
-    return bits;
-}
-#endif
-
 struct domain *alloc_domain_struct(void)
 {
     struct domain *d;
@@ -498,14 +474,20 @@ struct domain *alloc_domain_struct(void)
      * On systems with CONFIG_BIGMEM there's no packing, and so there's no
      * such restriction.
      */
-#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
-    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
-                                                          32 + PAGE_SHIFT;
+#if defined(CONFIG_BIGMEM)
+    const unsigned int bits = 0;
 #else
-    static unsigned int __read_mostly bits;
+    static unsigned int __ro_after_init bits;
 
     if ( unlikely(!bits) )
-         bits = _domain_struct_bits();
+         /*
+          * Get the width for the next pfn, and unconditionally subtract one
+          * from it to ensure the used width will not allocate past the PDX
+          * field limit.
+          */
+         bits = flsl(pdx_to_paddr(1UL << (sizeof_field(struct page_info,
+                                                       v.inuse._domain) * 8)))
+                - 1;
 #endif
 
     BUILD_BUG_ON(sizeof(*d) > PAGE_SIZE);
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 9faeea3ac9f2..c1423d64a95b 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -99,6 +99,7 @@ bool __mfn_valid(unsigned long mfn);
 #define pdx_to_mfn(pdx) _mfn(pdx_to_pfn(pdx))
 
 #define paddr_to_pdx(pa) pfn_to_pdx(paddr_to_pfn(pa))
+#define pdx_to_paddr(px) pfn_to_paddr(pdx_to_pfn(px))
 
 #ifdef CONFIG_PDX_COMPRESSION
 
-- 
2.49.0




* [PATCH v2 2/8] kconfig: turn PDX compression into a choice
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
  2025-06-20 11:11 ` [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 13:13   ` Jan Beulich
  2025-06-20 11:11 ` [PATCH v2 4/8] pdx: introduce command line compression toggle Roger Pau Monne
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
and make it part of the PDX compression choice block, in preparation for
adding further PDX compression algorithms.

No functional change intended, as the PDX compression defaults should still
be the same for all architectures.  However, the choice block cannot be
protected under EXPERT and still have a default choice being
unconditionally selected.  As a result, the new "PDX (Page inDeX)
compression" item will be unconditionally visible in Kconfig.

As part of this preparation work to introduce new PDX compressions, adjust
some of the comments on pdx.h to note they apply to a specific PDX
compression.  Also shuffle function prototypes and dummy implementations
around to make it easier to introduce a new PDX compression.  Note all
PDX compression implementations are expected to provide a
pdx_is_region_compressible() that takes the same set of arguments.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/common/Kconfig    | 18 +++++++++++++++---
 xen/common/pdx.c      |  4 ++--
 xen/include/xen/pdx.h | 32 +++++++++++++++++++-------------
 3 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 867710134ae5..de3e01d6320e 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -52,9 +52,10 @@ config EVTCHN_FIFO
 
 	  If unsure, say Y.
 
-config PDX_COMPRESSION
-	bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
-	default ARM || PPC
+choice
+	prompt "PDX (Page inDeX) compression"
+	default PDX_MASK_COMPRESSION if !X86 && !RISCV
+	default PDX_NONE
 	help
 	  PDX compression is a technique designed to reduce the memory
 	  overhead of physical memory management on platforms with sparse RAM
@@ -67,6 +68,17 @@ config PDX_COMPRESSION
 	  If your platform does not have sparse RAM banks, do not enable PDX
 	  compression.
 
+config PDX_MASK_COMPRESSION
+	bool "Mask compression"
+	help
+	  Compression relying on all RAM addresses sharing a zeroed bit region.
+
+config PDX_NONE
+	bool "None"
+	help
+	  No compression
+endchoice
+
 config ALTERNATIVE_CALL
 	bool
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index b8384e6189df..00aa7e43006d 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -34,7 +34,7 @@ bool __mfn_valid(unsigned long mfn)
 {
     bool invalid = mfn >= max_page;
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
     invalid |= mfn & pfn_hole_mask;
 #endif
 
@@ -55,7 +55,7 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
         __set_bit(idx, pdx_group_valid);
 }
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 
 /*
  * Diagram to make sense of the following variables. The masks and shifts
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index c1423d64a95b..8e373cac8b87 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -25,7 +25,7 @@
  * this by keeping a bitmap of the ranges in the frame table containing
  * invalid entries and not allocating backing memory for them.
  *
- * ## PDX compression
+ * ## PDX mask compression
  *
  * This is a technique to avoid wasting memory on machines known to have
  * split their machine address space in several big discontinuous and highly
@@ -101,22 +101,13 @@ bool __mfn_valid(unsigned long mfn);
 #define paddr_to_pdx(pa) pfn_to_pdx(paddr_to_pfn(pa))
 #define pdx_to_paddr(px) pfn_to_paddr(pdx_to_pfn(px))
 
-#ifdef CONFIG_PDX_COMPRESSION
+#ifdef CONFIG_PDX_MASK_COMPRESSION
 
 extern unsigned long pfn_pdx_bottom_mask, ma_va_bottom_mask;
 extern unsigned int pfn_pdx_hole_shift;
 extern unsigned long pfn_hole_mask;
 extern unsigned long pfn_top_mask, ma_top_mask;
 
-/**
- * Validate a region's compatibility with the current compression runtime
- *
- * @param base Base address of the region
- * @param npages Number of PAGE_SIZE-sized pages in the region
- * @return True iff the region can be used with the current compression
- */
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
-
 /**
  * Calculates a mask covering "moving" bits of all addresses of a region
  *
@@ -209,7 +200,9 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
  */
 void pfn_pdx_hole_setup(unsigned long mask);
 
-#else /* !CONFIG_PDX_COMPRESSION */
+#endif /* CONFIG_PDX_MASK_COMPRESSION */
+
+#ifdef CONFIG_PDX_NONE
 
 /* Without PDX compression we can skip some computations */
 
@@ -241,7 +234,20 @@ static inline void pfn_pdx_hole_setup(unsigned long mask)
 {
 }
 
-#endif /* CONFIG_PDX_COMPRESSION */
+#else /* !CONFIG_PDX_NONE */
+
+/* Shared functions implemented by all PDX compressions. */
+
+/**
+ * Validate a region's compatibility with the current compression runtime
+ *
+ * @param base Base address of the region
+ * @param npages Number of PAGE_SIZE-sized pages in the region
+ * @return True iff the region can be used with the current compression
+ */
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages);
+
+#endif /* !CONFIG_PDX_NONE */
 #endif /* __XEN_PDX_H__ */
 
 /*
-- 
2.49.0




* [PATCH v2 4/8] pdx: introduce command line compression toggle
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
  2025-06-20 11:11 ` [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
  2025-06-20 11:11 ` [PATCH v2 2/8] kconfig: turn PDX compression into a choice Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 13:40   ` Jan Beulich
  2025-06-20 11:11 ` [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini,
	Roger Pau Monné

Introduce a command line option to allow disabling PDX compression.  The
disabling is done by turning pfn_pdx_add_region() into a no-op, so when
attempting to initialize the selected compression algorithm the array of
ranges to compress is empty.
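
The effect of the toggle can be sketched in isolation as below.  This is a
simplified model, not the code in xen/common/pdx.c: add_region() and
compression_setup() are hypothetical stand-ins for pfn_pdx_add_region() and
pfn_pdx_compression_setup(), the setup helper returns a bool only so the
outcome is observable, and a plain variable replaces the boolean_param()
registration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PFN_RANGES 8

static struct {
    uint64_t base, size;
} ranges[MAX_PFN_RANGES];
static unsigned int nr_ranges;

/* Stand-in for the value parsed from the "pdx-compress" option. */
static bool pdx_compress = true;

static void add_region(uint64_t base, uint64_t size)
{
    /* With compression disabled no range is ever recorded. */
    if ( !size || !pdx_compress )
        return;

    if ( nr_ranges < MAX_PFN_RANGES )
    {
        ranges[nr_ranges].base = base;
        ranges[nr_ranges].size = size;
    }

    /* Count overflowing ranges too, so setup can notice the overflow. */
    nr_ranges++;
}

static bool compression_setup(void)
{
    /* An empty (or overflowed) range array means nothing to compress. */
    if ( !nr_ranges || nr_ranges > MAX_PFN_RANGES )
        return false;

    /* A real implementation would derive the compression from ranges[]. */
    return true;
}
```

Booting with pdx-compress=false corresponds to clearing pdx_compress before
any region is added, so the setup step sees nr_ranges == 0 and bails out.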

Signed-off-by: Roger Pau Monné <roger.pau@cloud.com>
---
Changes since v1:
 - New in this version.
---
 docs/misc/xen-command-line.pandoc |  9 +++++++++
 xen/common/pdx.c                  | 10 +++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index b0eadd2c5d58..c747a326be86 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2072,6 +2072,15 @@ for all of them (`true`), only for those subject to XPTI (`xpti`) or for
 those not subject to XPTI (`no-xpti`). The feature is used only in case
 INVPCID is supported and not disabled via `invpcid=false`.
 
+### pdx-compress
+> `= <boolean>`
+
+> Default: `true` if CONFIG_PDX_NONE is unset
+
+Only relevant when the hypervisor is built with PFN PDX compression. Controls
+whether Xen will engage in PFN compression.  The algorithm used for PFN
+compression is selected at build time from Kconfig.
+
 ### ple_gap
 > `= <integer>`
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 6f488366e5a9..8c107676da59 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -19,6 +19,7 @@
 #include <xen/mm.h>
 #include <xen/bitops.h>
 #include <xen/nospec.h>
+#include <xen/param.h>
 #include <xen/pfn.h>
 #include <xen/sections.h>
 
@@ -76,9 +77,13 @@ static struct pfn_range {
 } ranges[MAX_PFN_RANGES] __initdata;
 static unsigned int __initdata nr_ranges;
 
+static bool __initdata pdx_compress = true;
+boolean_param("pdx-compress", pdx_compress);
+
 void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
 {
-    if ( !size )
+    /* Without ranges there's no PFN compression. */
+    if ( !size || !pdx_compress )
         return;
 
     if ( nr_ranges >= ARRAY_SIZE(ranges) )
@@ -215,6 +220,9 @@ void __init pfn_pdx_compression_setup(paddr_t base)
     unsigned int i, j, bottom_shift = 0, hole_shift = 0;
     unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT;
 
+    if ( !nr_ranges )
+        return;
+
     if ( nr_ranges > ARRAY_SIZE(ranges) )
     {
         printk(XENLOG_WARNING
-- 
2.49.0




* [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (2 preceding siblings ...)
  2025-06-20 11:11 ` [PATCH v2 4/8] pdx: introduce command line compression toggle Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 13:51   ` Jan Beulich
  2025-06-20 11:11 ` [PATCH v2 6/8] test/pdx: add PDX compression unit tests Roger Pau Monne
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
	Anthony PERARD, Jan Beulich, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko

There are four performance critical PDX conversion helpers that do the PFN
to/from PDX and the physical addresses to/from directmap offsets
translations.

In the absence of an active PDX compression, those functions would still do
the calculations needed, just to return the same input value as no
translation is in place and hence PFN and PDX spaces are identity mapped.

To reduce the overhead of doing these pointless calculations, allow
architectures to implement the translation helpers in a per-arch header.
Rename the existing conversion functions to add a trailing _xlate suffix,
so that the per-arch headers can define the non-suffixed versions.

Currently only x86 implements meaningful custom handlers to short circuit
the translation when not active, using asm goto.  Other architectures use a
generic header that maps the non-xlate to the xlate variants to keep the
previous behavior.
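
The split between the _xlate helpers and their wrappers can be modelled in
portable C as follows.  This is only a sketch: a plain boolean (pdx_active)
stands in for the alternatives-patched asm goto jump, and sketch_enable() is
a hypothetical helper that fills the mask variables the same way the mask
compression setup does.

```c
#include <assert.h>
#include <stdbool.h>

/* Mask compression state; the defaults describe an identity mapping. */
static unsigned long pfn_pdx_bottom_mask = ~0UL;
static unsigned long pfn_top_mask;
static unsigned int pfn_pdx_hole_shift;

/* Stand-in for the synthetic X86_FEATURE_PDX_COMPRESSION feature bit. */
static bool pdx_active;

static unsigned long pfn_to_pdx_xlate(unsigned long pfn)
{
    return (pfn & pfn_pdx_bottom_mask) |
           ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
}

static unsigned long pdx_to_pfn_xlate(unsigned long pdx)
{
    return (pdx & pfn_pdx_bottom_mask) |
           ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
}

/* Wrappers: skip the translation entirely when compression is inactive. */
static unsigned long pfn_to_pdx(unsigned long pfn)
{
    return pdx_active ? pfn_to_pdx_xlate(pfn) : pfn;
}

static unsigned long pdx_to_pfn(unsigned long pdx)
{
    return pdx_active ? pdx_to_pfn_xlate(pdx) : pdx;
}

/* Hypothetical setup: squash hole_shift bits starting at bottom_shift. */
static void sketch_enable(unsigned int bottom_shift, unsigned int hole_shift)
{
    unsigned long hole_mask = ((1UL << hole_shift) - 1) << bottom_shift;

    pfn_pdx_bottom_mask = (1UL << bottom_shift) - 1;
    pfn_pdx_hole_shift = hole_shift;
    pfn_top_mask = ~(pfn_pdx_bottom_mask | hole_mask);
    pdx_active = true;
}
```

Before sketch_enable() is called the wrappers return their input unchanged.
After sketch_enable(20, 4), PFN 0x1000123 maps to PDX 0x100123 and back.  The
boolean test costs a load and a conditional branch on every call, which is
the overhead the patched jump in the x86 version is meant to avoid.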

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Pull return out of OPTIMIZE_PDX macro.
 - undef OPTIMIZE_PDX.
---
Would it make sense to move the x86 implementation to the common pdx.h
header and let architectures define PDX_ASM_GOTO_SKIP instead?
---
 xen/arch/arm/include/asm/Makefile      |  1 +
 xen/arch/ppc/include/asm/Makefile      |  1 +
 xen/arch/riscv/include/asm/Makefile    |  1 +
 xen/arch/x86/include/asm/cpufeatures.h |  1 +
 xen/arch/x86/include/asm/pdx.h         | 75 ++++++++++++++++++++++++++
 xen/arch/x86/srat.c                    |  6 ++-
 xen/common/pdx.c                       | 10 ++--
 xen/include/asm-generic/pdx.h          | 24 +++++++++
 xen/include/xen/pdx.h                  | 22 +++++---
 9 files changed, 130 insertions(+), 11 deletions(-)
 create mode 100644 xen/arch/x86/include/asm/pdx.h
 create mode 100644 xen/include/asm-generic/pdx.h

diff --git a/xen/arch/arm/include/asm/Makefile b/xen/arch/arm/include/asm/Makefile
index 87c882142148..6283307cb0c4 100644
--- a/xen/arch/arm/include/asm/Makefile
+++ b/xen/arch/arm/include/asm/Makefile
@@ -6,6 +6,7 @@ generic-y += hardirq.h
 generic-y += iocap.h
 generic-y += irq-dt.h
 generic-y += paging.h
+generic-y += pdx.h
 generic-y += percpu.h
 generic-y += random.h
 generic-y += softirq.h
diff --git a/xen/arch/ppc/include/asm/Makefile b/xen/arch/ppc/include/asm/Makefile
index c989a7f89b34..0ad45133baac 100644
--- a/xen/arch/ppc/include/asm/Makefile
+++ b/xen/arch/ppc/include/asm/Makefile
@@ -6,6 +6,7 @@ generic-y += hardirq.h
 generic-y += hypercall.h
 generic-y += iocap.h
 generic-y += paging.h
+generic-y += pdx.h
 generic-y += percpu.h
 generic-y += perfc_defn.h
 generic-y += random.h
diff --git a/xen/arch/riscv/include/asm/Makefile b/xen/arch/riscv/include/asm/Makefile
index bfdf186c682f..de04daf68df3 100644
--- a/xen/arch/riscv/include/asm/Makefile
+++ b/xen/arch/riscv/include/asm/Makefile
@@ -7,6 +7,7 @@ generic-y += hypercall.h
 generic-y += iocap.h
 generic-y += irq-dt.h
 generic-y += paging.h
+generic-y += pdx.h
 generic-y += percpu.h
 generic-y += perfc_defn.h
 generic-y += random.h
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 9e3ed21c026d..85e1a6f0a055 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -43,6 +43,7 @@ XEN_CPUFEATURE(XEN_IBT,           X86_SYNTH(27)) /* Xen uses CET Indirect Branch
 XEN_CPUFEATURE(IBPB_ENTRY_PV,     X86_SYNTH(28)) /* MSR_PRED_CMD used by Xen for PV */
 XEN_CPUFEATURE(IBPB_ENTRY_HVM,    X86_SYNTH(29)) /* MSR_PRED_CMD used by Xen for HVM */
 XEN_CPUFEATURE(USE_VMCALL,        X86_SYNTH(30)) /* Use VMCALL instead of VMMCALL */
+XEN_CPUFEATURE(PDX_COMPRESSION,   X86_SYNTH(31)) /* PDX compression */
 
 /* Bug words follow the synthetic words. */
 #define X86_NR_BUG 1
diff --git a/xen/arch/x86/include/asm/pdx.h b/xen/arch/x86/include/asm/pdx.h
new file mode 100644
index 000000000000..b09b44ceaf4a
--- /dev/null
+++ b/xen/arch/x86/include/asm/pdx.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef X86_PDX_H
+#define X86_PDX_H
+
+#ifndef CONFIG_PDX_NONE
+
+#include <asm/alternative.h>
+
+/*
+ * Introduce a macro to avoid repeating the same asm goto block in each helper.
+ * Note the macro is strictly tied to the code in the helpers.
+ */
+#define PDX_ASM_GOTO_SKIP                           \
+    asm_inline goto (                               \
+        ALTERNATIVE(                                \
+            "",                                     \
+            "jmp %l[skip]",                         \
+            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
+        : : : : skip )
+
+static inline unsigned long pfn_to_pdx(unsigned long pfn)
+{
+    PDX_ASM_GOTO_SKIP;
+
+    return pfn_to_pdx_xlate(pfn);
+
+ skip:
+    return pfn;
+}
+
+static inline unsigned long pdx_to_pfn(unsigned long pdx)
+{
+    PDX_ASM_GOTO_SKIP;
+
+    return pdx_to_pfn_xlate(pdx);
+
+ skip:
+    return pdx;
+}
+
+static inline unsigned long maddr_to_directmapoff(paddr_t ma)
+{
+    PDX_ASM_GOTO_SKIP;
+
+    return maddr_to_directmapoff_xlate(ma);
+
+ skip:
+    return ma;
+}
+
+static inline paddr_t directmapoff_to_maddr(unsigned long offset)
+{
+    PDX_ASM_GOTO_SKIP;
+
+    return directmapoff_to_maddr_xlate(offset);
+
+ skip:
+    return offset;
+}
+
+#undef PDX_ASM_GOTO_SKIP
+
+#endif /* !CONFIG_PDX_NONE */
+
+#endif /* X86_PDX_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 2a678e744e7c..516db1b5bfa8 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -298,7 +298,8 @@ void __init srat_parse_regions(paddr_t addr)
 	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 			      srat_parse_region, 0);
 
-	pfn_pdx_compression_setup(addr);
+	if (!pfn_pdx_compression_setup(addr))
+		return;
 
 	/* Ensure all RAM ranges in the e820 are covered. */
 	for (i = 0; i < e820.nr_map; i++) {
@@ -318,6 +319,9 @@ void __init srat_parse_regions(paddr_t addr)
 			return;
 		}
 	}
+
+	/* If we got this far compression is working as expected. */
+	setup_force_cpu_cap(X86_FEATURE_PDX_COMPRESSION);
 }
 
 unsigned int numa_node_to_arch_nid(nodeid_t n)
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 8c107676da59..86e2dc7c6bb6 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -215,20 +215,20 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr)
                          (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
 }
 
-void __init pfn_pdx_compression_setup(paddr_t base)
+bool __init pfn_pdx_compression_setup(paddr_t base)
 {
     unsigned int i, j, bottom_shift = 0, hole_shift = 0;
     unsigned long mask = pdx_init_mask(base) >> PAGE_SHIFT;
 
     if ( !nr_ranges )
-        return;
+        return false;
 
     if ( nr_ranges > ARRAY_SIZE(ranges) )
     {
         printk(XENLOG_WARNING
                "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
                nr_ranges, ARRAY_SIZE(ranges));
-        return;
+        return false;
     }
 
     for ( i = 0; i < nr_ranges; i++ )
@@ -259,7 +259,7 @@ void __init pfn_pdx_compression_setup(paddr_t base)
         }
     }
     if ( !hole_shift )
-        return;
+        return false;
 
     printk(KERN_INFO "PFN compression on bits %u...%u\n",
            bottom_shift, bottom_shift + hole_shift - 1);
@@ -270,6 +270,8 @@ void __init pfn_pdx_compression_setup(paddr_t base)
     pfn_hole_mask       = ((1UL << hole_shift) - 1) << bottom_shift;
     pfn_top_mask        = ~(pfn_pdx_bottom_mask | pfn_hole_mask);
     ma_top_mask         = pfn_top_mask << PAGE_SHIFT;
+
+    return true;
 }
 
 void __init pfn_pdx_compression_reset(void)
diff --git a/xen/include/asm-generic/pdx.h b/xen/include/asm-generic/pdx.h
new file mode 100644
index 000000000000..4dea2b97c3e5
--- /dev/null
+++ b/xen/include/asm-generic/pdx.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef GENERIC_PDX_H
+#define GENERIC_PDX_H
+
+#ifndef CONFIG_PDX_NONE
+
+#define pdx_to_pfn pdx_to_pfn_xlate
+#define pfn_to_pdx pfn_to_pdx_xlate
+#define maddr_to_directmapoff maddr_to_directmapoff_xlate
+#define directmapoff_to_maddr directmapoff_to_maddr_xlate
+
+#endif /* !CONFIG_PDX_NONE */
+
+#endif /* GENERIC_PDX_H */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 10153da98bf1..91fc32370f21 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -114,7 +114,7 @@ extern unsigned long pfn_top_mask, ma_top_mask;
  * @param pfn Frame number
  * @return Obtained pdx after compressing the pfn
  */
-static inline unsigned long pfn_to_pdx(unsigned long pfn)
+static inline unsigned long pfn_to_pdx_xlate(unsigned long pfn)
 {
     return (pfn & pfn_pdx_bottom_mask) |
            ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
@@ -126,7 +126,7 @@ static inline unsigned long pfn_to_pdx(unsigned long pfn)
  * @param pdx Page index
  * @return Obtained pfn after decompressing the pdx
  */
-static inline unsigned long pdx_to_pfn(unsigned long pdx)
+static inline unsigned long pdx_to_pfn_xlate(unsigned long pdx)
 {
     return (pdx & pfn_pdx_bottom_mask) |
            ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
@@ -139,7 +139,7 @@ static inline unsigned long pdx_to_pfn(unsigned long pdx)
  * @return Offset on the direct map where that
  *         machine address can be accessed
  */
-static inline unsigned long maddr_to_directmapoff(paddr_t ma)
+static inline unsigned long maddr_to_directmapoff_xlate(paddr_t ma)
 {
     return (((ma & ma_top_mask) >> pfn_pdx_hole_shift) |
             (ma & ma_va_bottom_mask));
@@ -151,7 +151,7 @@ static inline unsigned long maddr_to_directmapoff(paddr_t ma)
  * @param offset Offset into the direct map
  * @return Corresponding machine address of that virtual location
  */
-static inline paddr_t directmapoff_to_maddr(unsigned long offset)
+static inline paddr_t directmapoff_to_maddr_xlate(unsigned long offset)
 {
     return ((((paddr_t)offset << pfn_pdx_hole_shift) & ma_top_mask) |
             (offset & ma_va_bottom_mask));
@@ -159,6 +159,14 @@ static inline paddr_t directmapoff_to_maddr(unsigned long offset)
 
 #endif /* CONFIG_PDX_MASK_COMPRESSION */
 
+/*
+ * Allow each architecture to define its (possibly optimized) versions of the
+ * translation functions.
+ *
+ * Do not use _xlate suffixed functions, always use the non _xlate variants.
+ */
+#include <asm/pdx.h>
+
 #ifdef CONFIG_PDX_NONE
 
 /* Without PDX compression we can skip some computations */
@@ -181,8 +189,9 @@ static inline void pfn_pdx_add_region(paddr_t base, paddr_t size)
 {
 }
 
-static inline void pfn_pdx_compression_setup(paddr_t base)
+static inline bool pfn_pdx_compression_setup(paddr_t base)
 {
+    return false;
 }
 
 static inline void pfn_pdx_compression_reset(void)
@@ -215,8 +224,9 @@ void pfn_pdx_add_region(paddr_t base, paddr_t size);
  * range of the current memory regions.
  *
  * @param base address to start compression from.
+ * @return True if PDX compression has been enabled.
  */
-void pfn_pdx_compression_setup(paddr_t base);
+bool pfn_pdx_compression_setup(paddr_t base);
 
 /**
  * Reset the global variables to it's default values, thus disabling PFN
-- 
2.49.0




* [PATCH v2 6/8] test/pdx: add PDX compression unit tests
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (3 preceding siblings ...)
  2025-06-20 11:11 ` [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 13:37   ` Anthony PERARD
  2025-06-20 11:11 ` [PATCH v2 7/8] pdx: move some helpers in preparation for new compression Roger Pau Monne
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Anthony PERARD, Andrew Cooper, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Introduce a set of unit tests for PDX compression.  The unit tests contain
both real and crafted memory maps that are then compressed using the
selected PDX algorithm.  Note the build system for the unit tests is
structured so that new compression algorithms can be added easily.  Doing so
requires generating a new test-pdx-<compress> executable that's built with
the selected PDX compression enabled.

Currently the only generated executable is test-pdx-mask that tests PDX
mask compression.
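
To give an idea of what such a test checks, the sketch below derives a mask
compression from a crafted memory map and verifies that the boundary PFNs of
every range survive a round trip.  It is a simplified stand-in, not the code
in test-pdx.c: struct pfn_range, used_bits(), find_hole() and
check_round_trip() are hypothetical names, and the low-bit protection that
the hypervisor applies via MAX_ORDER is left out.

```c
#include <assert.h>
#include <stddef.h>

struct pfn_range {
    unsigned long start, end; /* PFNs, end is exclusive */
};

/* Set every bit below the highest set bit of mask. */
static unsigned long fill_mask(unsigned long mask)
{
    while ( mask & (mask + 1) )
        mask |= mask >> 1;

    return mask;
}

/* Bits that can be non-zero in some PFN of the given ranges. */
static unsigned long used_bits(const struct pfn_range *r, size_t nr)
{
    unsigned long mask = 0;

    for ( size_t i = 0; i < nr; i++ )
        mask |= r[i].start | fill_mask(r[i].start ^ (r[i].end - 1));

    return mask;
}

/* Longest run of always-zero bits below the top used bit. */
static unsigned int find_hole(unsigned long mask, unsigned int *bottom)
{
    unsigned int best = 0, run = 0;

    *bottom = 0;
    for ( unsigned int bit = 0;
          bit < sizeof(mask) * 8 && (mask >> bit); bit++ )
    {
        if ( mask & (1UL << bit) )
            run = 0;
        else if ( ++run > best )
        {
            best = run;
            *bottom = bit + 1 - run;
        }
    }

    return best;
}

/* Returns the hole width in bits, or -1 if a PFN fails the round trip. */
static int check_round_trip(const struct pfn_range *r, size_t nr)
{
    unsigned int bottom, hole = find_hole(used_bits(r, nr), &bottom);
    unsigned long bottom_mask = (1UL << bottom) - 1;
    unsigned long top_mask = ~(bottom_mask | (((1UL << hole) - 1) << bottom));

    for ( size_t i = 0; i < nr; i++ )
    {
        unsigned long pfns[] = { r[i].start, r[i].end - 1 };

        for ( size_t j = 0; j < 2; j++ )
        {
            unsigned long pdx = (pfns[j] & bottom_mask) |
                                ((pfns[j] & top_mask) >> hole);
            unsigned long pfn = (pdx & bottom_mask) |
                                ((pdx << hole) & top_mask);

            if ( pfn != pfns[j] )
                return -1;
        }
    }

    return hole;
}
```

For a crafted map with two 2GiB banks at PFN 0 and PFN 0x1000000 (4KiB
pages), bits 19 to 23 are never set, so check_round_trip() reports a 5-bit
hole.  A single contiguous range yields a hole of 0, that is, no compression.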

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - New in this version (partially pulled out from a different patch).
---
 tools/tests/Makefile       |   1 +
 tools/tests/pdx/.gitignore |   2 +
 tools/tests/pdx/Makefile   |  48 ++++++++
 tools/tests/pdx/harness.h  |  89 +++++++++++++++
 tools/tests/pdx/test-pdx.c | 220 +++++++++++++++++++++++++++++++++++++
 xen/common/pdx.c           |   4 +
 6 files changed, 364 insertions(+)
 create mode 100644 tools/tests/pdx/.gitignore
 create mode 100644 tools/tests/pdx/Makefile
 create mode 100644 tools/tests/pdx/harness.h
 create mode 100644 tools/tests/pdx/test-pdx.c

diff --git a/tools/tests/Makefile b/tools/tests/Makefile
index 36928676a666..97ba2a13894d 100644
--- a/tools/tests/Makefile
+++ b/tools/tests/Makefile
@@ -9,6 +9,7 @@ ifneq ($(clang),y)
 SUBDIRS-$(CONFIG_X86) += x86_emulator
 endif
 SUBDIRS-y += xenstore
+SUBDIRS-y += pdx
 SUBDIRS-y += rangeset
 SUBDIRS-y += vpci
 SUBDIRS-y += paging-mempool
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
new file mode 100644
index 000000000000..a32c7db4de79
--- /dev/null
+++ b/tools/tests/pdx/.gitignore
@@ -0,0 +1,2 @@
+/pdx.h
+/test-pdx-mask
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
new file mode 100644
index 000000000000..99867b71c438
--- /dev/null
+++ b/tools/tests/pdx/Makefile
@@ -0,0 +1,48 @@
+XEN_ROOT=$(CURDIR)/../../..
+include $(XEN_ROOT)/tools/Rules.mk
+
+TARGETS := test-pdx-mask
+
+.PHONY: all
+all: $(TARGETS)
+
+.PHONY: run
+run: $(TARGETS)
+ifeq ($(CC),$(HOSTCC))
+	for test in $? ; do \
+		./$$test ;  \
+	done
+else
+	$(warning HOSTCC != CC, will not run test)
+endif
+
+.PHONY: clean
+clean:
+	$(RM) -- *.o $(TARGETS) $(DEPS_RM) pdx.c pdx.h
+
+.PHONY: distclean
+distclean: clean
+	$(RM) -- *~
+
+.PHONY: install
+install: all
+	$(INSTALL_DIR) $(DESTDIR)$(LIBEXEC)/tests
+	$(INSTALL_PROG) $(TARGETS) $(DESTDIR)$(LIBEXEC)/tests
+
+.PHONY: uninstall
+uninstall:
+	$(RM) -- $(patsubst %,$(DESTDIR)$(LIBEXEC)/tests/%,$(TARGETS))
+
+pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h
+	sed -E -e '/^#[[:space:]]?include/d' <$< >$@
+
+CFLAGS += -D__XEN_TOOLS__
+CFLAGS += $(APPEND_CFLAGS)
+CFLAGS += $(CFLAGS_xeninclude)
+
+test-pdx-mask: CFLAGS += -DCONFIG_PDX_MASK_COMPRESSION
+
+test-pdx-%: test-pdx.c pdx.h
+	$(CC) $(CPPFLAGS) $(CFLAGS) $(CFLAGS_$*.o) -o $@ $< $(APPEND_CFLAGS)
+
+-include $(DEPS_INCLUDE)
diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
new file mode 100644
index 000000000000..64ec09f5e281
--- /dev/null
+++ b/tools/tests/pdx/harness.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#ifndef _TEST_HARNESS_
+#define _TEST_HARNESS_
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <xen-tools/common-macros.h>
+
+#define __init
+#define __initdata
+#define __ro_after_init
+#define cf_check
+
+#define printk printf
+#define XENLOG_INFO
+#define XENLOG_DEBUG
+#define XENLOG_WARNING
+#define KERN_INFO
+
+#define BITS_PER_LONG (sizeof(unsigned long) * 8)
+
+#define PAGE_SHIFT    12
+/* Some libcs define PAGE_SIZE in limits.h. */
+#undef  PAGE_SIZE
+#define PAGE_SIZE     (1 << PAGE_SHIFT)
+#define MAX_ORDER     18 /* 2 * PAGETABLE_ORDER (9) */
+
+#define PFN_DOWN(x)   ((x) >> PAGE_SHIFT)
+#define PFN_UP(x)     (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
+
+#define pfn_to_paddr(pfn) ((paddr_t)(pfn) << PAGE_SHIFT)
+#define paddr_to_pfn(pa)  ((unsigned long)((pa) >> PAGE_SHIFT))
+
+#define MAX_RANGES 8
+#define MAX_PFN_RANGES MAX_RANGES
+
+#define ASSERT assert
+
+#define CONFIG_DEBUG
+
+static inline unsigned int find_next(
+    const unsigned long *addr, unsigned int size, unsigned int off, bool value)
+{
+    unsigned int i;
+
+    ASSERT(size <= BITS_PER_LONG);
+
+    for ( i = off; i < size; i++ )
+        if ( !!(*addr & (1UL << i)) == value )
+            return i;
+
+    return size;
+}
+
+#define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
+#define find_next_bit(a, s, o)      find_next(a, s, o, true)
+
+#define boolean_param(name, func)
+
+#define pdx_to_pfn pdx_to_pfn_xlate
+#define pfn_to_pdx pfn_to_pdx_xlate
+#define maddr_to_directmapoff maddr_to_directmapoff_xlate
+#define directmapoff_to_maddr directmapoff_to_maddr_xlate
+
+typedef uint64_t paddr_t;
+
+#include "pdx.h"
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/tests/pdx/test-pdx.c b/tools/tests/pdx/test-pdx.c
new file mode 100644
index 000000000000..b717cae00711
--- /dev/null
+++ b/tools/tests/pdx/test-pdx.c
@@ -0,0 +1,220 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Unit tests for PDX compression.
+ *
+ * Copyright (C) 2025 Cloud Software Group
+ */
+
+#include "harness.h"
+
+#include "../../xen/common/pdx.c"
+
+struct range {
+    /* Ranges are defined as [start, end). */
+    unsigned long start, end;
+};
+
+static void print_ranges(const struct range *r)
+{
+    unsigned int i;
+
+    printf("Ranges:\n");
+
+    for ( i = 0; i < MAX_RANGES; i++ )
+    {
+        if ( !r[i].start && !r[i].end )
+            break;
+
+        printf(" %013lx-%013lx\n", r[i].start, r[i].end);
+    }
+}
+
+int main(int argc, char **argv)
+{
+    static const struct {
+        struct range ranges[MAX_RANGES];
+        bool compress;
+    } tests[] = {
+#ifdef __LP64__
+        /*
+         * Only for targets where unsigned long is 64bits, otherwise compiler
+         * will complain about truncation from 'long long' -> 'long' conversion.
+         *
+         * Real memory map from a 4s Intel GNR.  Not compressible using PDX
+         * mask compression.
+         */
+        {
+            .ranges = {
+                { .start =           0,   .end =     0x80000UL },
+                { .start =   0x0100000UL, .end =   0x8080000UL },
+                { .start =  0x63e80000UL, .end =  0x6be80000UL },
+                { .start =  0xc7e80000UL, .end =  0xcfe80000UL },
+                { .start = 0x12be80000UL, .end = 0x133e80000UL },
+            },
+            .compress = false,
+        },
+        /* Simple hole. */
+        {
+            .ranges = {
+                { .start =                                                 0,
+                  .end   =                            (1UL << MAX_ORDER) * 1 },
+                { .start = (1UL << (MAX_ORDER * 2)) |                      0,
+                  .end   = (1UL << (MAX_ORDER * 2)) | (1UL << MAX_ORDER) * 1 },
+            },
+            .compress = true,
+        },
+        /* Simple hole, unsorted ranges. */
+        {
+            .ranges = {
+                { .start = (1UL << (MAX_ORDER * 2)) |                      0,
+                  .end   = (1UL << (MAX_ORDER * 2)) | (1UL << MAX_ORDER) * 1 },
+                { .start =                                                 0,
+                  .end   =                            (1UL << MAX_ORDER) * 1 },
+            },
+            .compress = true,
+        },
+        /* PDX compression, 2 ranges covered by the lower mask. */
+        {
+            .ranges = {
+                { .start =                    0,
+                  .end   = (1 << MAX_ORDER) * 1 },
+                { .start = (1 << MAX_ORDER) * 2,
+                  .end   = (1 << MAX_ORDER) * 3 },
+                { .start = (1 << MAX_ORDER) * 20,
+                  .end   = (1 << MAX_ORDER) * 22 },
+            },
+            .compress = true,
+        },
+        /* Single range not starting at 0. */
+        {
+            .ranges = {
+                { .start = (1 << MAX_ORDER) * 10,
+                  .end   = (1 << MAX_ORDER) * 11 },
+            },
+            .compress = true,
+        },
+        /* Resulting PDX region size leads to no compression. */
+        {
+            .ranges = {
+                { .start =                    0,
+                  .end   = (1 << MAX_ORDER) * 1 },
+                { .start = (1 << MAX_ORDER) * 2,
+                  .end   = (1 << MAX_ORDER) * 3 },
+                { .start = (1 << MAX_ORDER) * 4,
+                  .end   = (1 << MAX_ORDER) * 7 },
+                { .start = (1 << MAX_ORDER) * 8,
+                  .end   = (1 << MAX_ORDER) * 12 },
+            },
+            .compress = false,
+        },
+#endif
+        /* 2-node 2GB per-node QEMU layout. */
+        {
+            .ranges = {
+                { .start =        0,   .end =  0x80000UL },
+                { .start = 0x100000UL, .end = 0x180000UL },
+            },
+            .compress = true,
+        },
+        /* Not compressible, smaller than MAX_ORDER. */
+        {
+            .ranges = {
+                { .start =     0,   .end =     1   },
+                { .start = 0x100UL, .end = 0x101UL },
+            },
+            .compress = false,
+        },
+        /* Compressible, requires adjusting size to (1 << MAX_ORDER). */
+        {
+            .ranges = {
+                { .start =        0,   .end =        1   },
+                { .start = 0x100000UL, .end = 0x100001UL },
+            },
+            .compress = true,
+        },
+        /* 2s Intel CLX with contiguous ranges, no compression. */
+        {
+            .ranges = {
+                { .start =        0  , .end =  0x180000UL },
+                { .start = 0x180000UL, .end = 0x3040000UL },
+            },
+            .compress = false,
+        },
+    };
+    int ret_code = EXIT_SUCCESS;
+
+    for ( unsigned int i = 0 ; i < ARRAY_SIZE(tests); i++ )
+    {
+        unsigned int j;
+
+        pfn_pdx_compression_reset();
+
+        for ( j = 0; j < ARRAY_SIZE(tests[i].ranges); j++ )
+        {
+            unsigned long size = tests[i].ranges[j].end -
+                                 tests[i].ranges[j].start;
+
+            if ( !tests[i].ranges[j].start && !tests[i].ranges[j].end )
+                break;
+
+            pfn_pdx_add_region(tests[i].ranges[j].start << PAGE_SHIFT,
+                               size << PAGE_SHIFT);
+        }
+
+        if ( pfn_pdx_compression_setup(0) != tests[i].compress )
+        {
+            printf("PFN compression diverge, expected %scompressible\n",
+                   tests[i].compress ? "" : "un");
+            print_ranges(tests[i].ranges);
+
+            ret_code = EXIT_FAILURE;
+            continue;
+        }
+
+        if ( !tests[i].compress )
+            continue;
+
+        for ( j = 0; j < ARRAY_SIZE(tests[i].ranges); j++ )
+        {
+            unsigned long start = tests[i].ranges[j].start;
+            unsigned long end = tests[i].ranges[j].end;
+
+            if ( !start && !end )
+                break;
+
+            if ( !pdx_is_region_compressible(start << PAGE_SHIFT, 1) ||
+                 !pdx_is_region_compressible((end - 1) << PAGE_SHIFT, 1) )
+            {
+                printf(
+    "PFN compression invalid, pages %#lx and %#lx should be compressible\n",
+                       start, end - 1);
+                print_ranges(tests[i].ranges);
+                ret_code = EXIT_FAILURE;
+            }
+
+            if ( start != pdx_to_pfn(pfn_to_pdx(start)) ||
+                 end - 1 != pdx_to_pfn(pfn_to_pdx(end - 1)) )
+            {
+                printf("Compression is not bi-directional:\n");
+                printf(" PFN %#lx -> PDX %#lx -> PFN %#lx\n",
+                       start, pfn_to_pdx(start), pdx_to_pfn(pfn_to_pdx(start)));
+                printf(" PFN %#lx -> PDX %#lx -> PFN %#lx\n",
+                       end - 1, pfn_to_pdx(end - 1),
+                       pdx_to_pfn(pfn_to_pdx(end - 1)));
+                print_ranges(tests[i].ranges);
+                ret_code = EXIT_FAILURE;
+            }
+        }
+    }
+
+    return ret_code;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 86e2dc7c6bb6..5cfec591f993 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -15,6 +15,8 @@
  * along with this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+/* Trim content when built for the test harness. */
+#ifdef __XEN__
 #include <xen/init.h>
 #include <xen/mm.h>
 #include <xen/bitops.h>
@@ -57,6 +59,8 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
         __set_bit(idx, pdx_group_valid);
 }
 
+#endif /* __XEN__ */
+
 #ifndef CONFIG_PDX_NONE
 
 #ifdef CONFIG_X86
-- 
2.49.0




* [PATCH v2 7/8] pdx: move some helpers in preparation for new compression
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (4 preceding siblings ...)
  2025-06-20 11:11 ` [PATCH v2 6/8] test/pdx: add PDX compression unit tests Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 13:52   ` Jan Beulich
  2025-06-20 11:11 ` [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Move fill_mask(), pdx_region_mask() and pdx_init_mask() to the
!CONFIG_PDX_NONE section in preparation for them also being used by a newly
added PDX compression.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
git is not very helpful when generating the diff here, and it ends up
moving everything around the functions instead of the functions themselves.
---
 xen/common/pdx.c | 118 +++++++++++++++++++++++------------------------
 1 file changed, 59 insertions(+), 59 deletions(-)

diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 5cfec591f993..d5e469baffe2 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -101,59 +101,6 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
     ranges[nr_ranges++].size = PFN_UP(base + size) - PFN_DOWN(base);
 }
 
-#endif /* !CONFIG_PDX_NONE */
-
-#ifdef CONFIG_PDX_MASK_COMPRESSION
-
-/*
- * Diagram to make sense of the following variables. The masks and shifts
- * are done on mfn values in order to convert to/from pdx:
- *
- *                      pfn_hole_mask
- *                      pfn_pdx_hole_shift (mask bitsize)
- *                      |
- *                 |---------|
- *                 |         |
- *                 V         V
- *         --------------------------
- *         |HHHHHHH|000000000|LLLLLL| <--- mfn
- *         --------------------------
- *         ^       ^         ^      ^
- *         |       |         |------|
- *         |       |             |
- *         |       |             pfn_pdx_bottom_mask
- *         |       |
- *         |-------|
- *             |
- *             pfn_top_mask
- *
- * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
- * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
- * ones.
- */
-
-/** Mask for the lower non-compressible bits of an mfn */
-unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL;
-
-/** Mask for the lower non-compressible bits of an maddr or vaddr */
-unsigned long __ro_after_init ma_va_bottom_mask = ~0UL;
-
-/** Mask for the higher non-compressible bits of an mfn */
-unsigned long __ro_after_init pfn_top_mask = 0;
-
-/** Mask for the higher non-compressible bits of an maddr or vaddr */
-unsigned long __ro_after_init ma_top_mask = 0;
-
-/**
- * Mask for a pdx compression bit slice.
- *
- *  Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
- */
-unsigned long __ro_after_init pfn_hole_mask = 0;
-
-/** Number of bits of the "compressible" bit slice of an mfn */
-unsigned int __ro_after_init pfn_pdx_hole_shift = 0;
-
 /* Sets all bits from the most-significant 1-bit down to the LSB */
 static uint64_t fill_mask(uint64_t mask)
 {
@@ -196,12 +143,6 @@ static uint64_t pdx_region_mask(uint64_t base, uint64_t len)
     return fill_mask(base ^ (base + len - 1));
 }
 
-bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
-{
-    return !(paddr_to_pfn(base) & pfn_hole_mask) &&
-           !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
-}
-
 /**
  * Creates the mask to start from when calculating non-compressible bits
  *
@@ -219,6 +160,65 @@ static uint64_t __init pdx_init_mask(uint64_t base_addr)
                          (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
 }
 
+#endif /* !CONFIG_PDX_NONE */
+
+#ifdef CONFIG_PDX_MASK_COMPRESSION
+
+/*
+ * Diagram to make sense of the following variables. The masks and shifts
+ * are done on mfn values in order to convert to/from pdx:
+ *
+ *                      pfn_hole_mask
+ *                      pfn_pdx_hole_shift (mask bitsize)
+ *                      |
+ *                 |---------|
+ *                 |         |
+ *                 V         V
+ *         --------------------------
+ *         |HHHHHHH|000000000|LLLLLL| <--- mfn
+ *         --------------------------
+ *         ^       ^         ^      ^
+ *         |       |         |------|
+ *         |       |             |
+ *         |       |             pfn_pdx_bottom_mask
+ *         |       |
+ *         |-------|
+ *             |
+ *             pfn_top_mask
+ *
+ * ma_{top,va_bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask,
+ * where ma_top_mask has zeroes shifted in while ma_va_bottom_mask has
+ * ones.
+ */
+
+/** Mask for the lower non-compressible bits of an mfn */
+unsigned long __ro_after_init pfn_pdx_bottom_mask = ~0UL;
+
+/** Mask for the lower non-compressible bits of an maddr or vaddr */
+unsigned long __ro_after_init ma_va_bottom_mask = ~0UL;
+
+/** Mask for the higher non-compressible bits of an mfn */
+unsigned long __ro_after_init pfn_top_mask = 0;
+
+/** Mask for the higher non-compressible bits of an maddr or vaddr */
+unsigned long __ro_after_init ma_top_mask = 0;
+
+/**
+ * Mask for a pdx compression bit slice.
+ *
+ *  Invariant: valid(mfn) implies (mfn & pfn_hole_mask) == 0
+ */
+unsigned long __ro_after_init pfn_hole_mask = 0;
+
+/** Number of bits of the "compressible" bit slice of an mfn */
+unsigned int __ro_after_init pfn_pdx_hole_shift = 0;
+
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+    return !(paddr_to_pfn(base) & pfn_hole_mask) &&
+           !(pdx_region_mask(base, npages * PAGE_SIZE) & ~ma_va_bottom_mask);
+}
+
 bool __init pfn_pdx_compression_setup(paddr_t base)
 {
     unsigned int i, j, bottom_shift = 0, hole_shift = 0;
-- 
2.49.0




* [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (5 preceding siblings ...)
  2025-06-20 11:11 ` [PATCH v2 7/8] pdx: move some helpers in preparation for new compression Roger Pau Monne
@ 2025-06-20 11:11 ` Roger Pau Monne
  2025-06-24 16:16   ` Jan Beulich
  2025-06-30  6:34   ` Jan Beulich
       [not found] ` <20250620111130.29057-4-roger.pau@citrix.com>
  2025-06-28  2:08 ` [PATCH v2 0/8] pdx: introduce a new compression algorithm Stefano Stabellini
  8 siblings, 2 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-06-20 11:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Oleksii Kurochko, Community Manager,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini

With the appearance of Intel Sierra Forest and Granite Rapids it's now
possible to get a production x86 host with the following memory map:

SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
SRAT: Node 0 PXM 0 [0000000100000000, 000000807fffffff]
SRAT: Node 1 PXM 1 [0000063e80000000, 000006be7fffffff]
SRAT: Node 2 PXM 2 [00000c7e80000000, 00000cfe7fffffff]
SRAT: Node 3 PXM 3 [000012be80000000, 0000133e7fffffff]

This is from a four socket Granite Rapids system, with each node having
512GB of memory.  The total amount of RAM on the system is 2TB, but without
enabling CONFIG_BIGMEM the last range is not accessible, as it's above the
16TB boundary covered by the frame table.  Sierra Forest and Granite Rapids
are socket compatible; however, Sierra Forest only supports 2 socket
configurations, while Granite Rapids can go up to 8 sockets.

Note that while the memory map is very sparse, it couldn't be compressed
using the current PDX_MASK compression algorithm, which relies on all
ranges having a shared zeroed region of bits that can be removed.

The memory map presented above has the property that all regions are
similarly spaced and of a similar size.
Use a lookup table to store the offsets to translate from/to PFN and PDX
spaces.  Such a table is indexed based on the input PFN or PDX to be
translated.  The example PFN layout above would get compressed using the
following:

PFN compression using PFN lookup table shift 29 and PDX region size 0x10000000
 range 0 [0000000000000, 0x0000807ffff] PFN IDX  0 : 0000000000000
 range 1 [0x00063e80000, 0x0006be7ffff] PFN IDX  3 : 0x00053e80000
 range 2 [0x000c7e80000, 0x000cfe7ffff] PFN IDX  6 : 0x000a7e80000
 range 3 [0x0012be80000, 0x00133e7ffff] PFN IDX  9 : 0x000fbe80000

Note how the two ranges belonging to node 0 get merged into a single PDX
region by the compression algorithm.

The default size of lookup tables currently set in Kconfig is 64 entries,
and the example memory map consumes 10 entries.  Such a memory map is from a
4 socket Granite Rapids host, which in theory supports up to 8 sockets
according to Intel documentation.  Assuming the layout of an 8 socket system
is similar to the 4 socket one, it would require 21 lookup table entries to
support it, well below the current default of 64 entries.

The valid range of lookup table size is currently restricted from 1 to 512
elements in Kconfig.

Unused lookup table entries are set to all ones (~0UL), so that we can
detect whether a pfn or pdx is valid just by checking whether its
translation is bi-directional.  The saturated offsets will prevent the
translation from being bidirectional if the lookup table entry is not
valid.

Introduce __init_or_pdx_mask and use it on some functions shared between
PDX mask and offset compression, as otherwise some code becomes unreachable
after boot if PDX offset compression is used.  Mark the code as __init in
that case, so it's pruned after boot.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Use a lookup table with the offsets.
 - Split the adding of the test to a pre-patch.
 - Amend diagram to also show possible padding after compression.
---
 CHANGELOG.md               |   3 +
 tools/tests/pdx/.gitignore |   1 +
 tools/tests/pdx/Makefile   |   3 +-
 tools/tests/pdx/harness.h  |  10 ++
 tools/tests/pdx/test-pdx.c |   4 +
 xen/common/Kconfig         |  21 +++-
 xen/common/pdx.c           | 209 ++++++++++++++++++++++++++++++++++++-
 xen/include/xen/pdx.h      |  85 ++++++++++++++-
 8 files changed, 330 insertions(+), 6 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5f31ca08fe3f..7023820b38c1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
      grant table or foreign memory.
 
 ### Added
+ - Introduce new PDX compression algorithm to cope with Intel Sapphire and
+   Granite Rapids having sparse memory maps.
+
  - On x86:
    - Option to attempt to fixup p2m page-faults on PVH dom0.
    - Resizable BARs is supported for PVH dom0.
diff --git a/tools/tests/pdx/.gitignore b/tools/tests/pdx/.gitignore
index a32c7db4de79..1202a531a7fd 100644
--- a/tools/tests/pdx/.gitignore
+++ b/tools/tests/pdx/.gitignore
@@ -1,2 +1,3 @@
 /pdx.h
 /test-pdx-mask
+/test-pdx-offset
diff --git a/tools/tests/pdx/Makefile b/tools/tests/pdx/Makefile
index 99867b71c438..ba1724bb6616 100644
--- a/tools/tests/pdx/Makefile
+++ b/tools/tests/pdx/Makefile
@@ -1,7 +1,7 @@
 XEN_ROOT=$(CURDIR)/../../..
 include $(XEN_ROOT)/tools/Rules.mk
 
-TARGETS := test-pdx-mask
+TARGETS := test-pdx-mask test-pdx-offset
 
 .PHONY: all
 all: $(TARGETS)
@@ -41,6 +41,7 @@ CFLAGS += $(APPEND_CFLAGS)
 CFLAGS += $(CFLAGS_xeninclude)
 
 test-pdx-mask: CFLAGS += -DCONFIG_PDX_MASK_COMPRESSION
+test-pdx-offset: CFLAGS += -DCONFIG_PDX_OFFSET_COMPRESSION
 
 test-pdx-%: test-pdx.c pdx.h
 	$(CC) $(CPPFLAGS) $(CFLAGS) $(CFLAGS_$*.o) -o $@ $< $(APPEND_CFLAGS)
diff --git a/tools/tests/pdx/harness.h b/tools/tests/pdx/harness.h
index 64ec09f5e281..c58a6f27ad03 100644
--- a/tools/tests/pdx/harness.h
+++ b/tools/tests/pdx/harness.h
@@ -44,8 +44,10 @@
 
 #define MAX_RANGES 8
 #define MAX_PFN_RANGES MAX_RANGES
+#define CONFIG_PDX_OFFSET_TLB_ORDER 6
 
 #define ASSERT assert
+#define ASSERT_UNREACHABLE() assert(0);
 
 #define CONFIG_DEBUG
 
@@ -66,6 +68,8 @@ static inline unsigned int find_next(
 #define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
 #define find_next_bit(a, s, o)      find_next(a, s, o, true)
 
+#define flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)
+
 #define boolean_param(name, func)
 
 #define pdx_to_pfn pdx_to_pfn_xlate
@@ -75,6 +79,12 @@ static inline unsigned int find_next(
 
 typedef uint64_t paddr_t;
 
+#define sort(elem, nr, size, cmp, swp) {                                \
+    /* Consume swp() so compiler doesn't complain it's unused. */       \
+    (void)swp;                                                          \
+    qsort(elem, nr, size, cmp);                                         \
+}
+
 #include "pdx.h"
 
 #endif
diff --git a/tools/tests/pdx/test-pdx.c b/tools/tests/pdx/test-pdx.c
index b717cae00711..5041228a383c 100644
--- a/tools/tests/pdx/test-pdx.c
+++ b/tools/tests/pdx/test-pdx.c
@@ -51,7 +51,11 @@ int main(int argc, char **argv)
                 { .start =  0xc7e80000UL, .end =  0xcfe80000UL },
                 { .start = 0x12be80000UL, .end = 0x133e80000UL },
             },
+#ifdef CONFIG_PDX_OFFSET_COMPRESSION
+            .compress = true,
+#else
             .compress = false,
+#endif
         },
         /* Simple hole. */
         {
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index de3e01d6320e..6d49ef535f0c 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -54,7 +54,8 @@ config EVTCHN_FIFO
 
 choice
 	prompt "PDX (Page inDeX) compression"
-	default PDX_MASK_COMPRESSION if !X86 && !RISCV
+	default PDX_OFFSET_COMPRESSION if X86
+	default PDX_MASK_COMPRESSION if !RISCV
 	default PDX_NONE
 	help
 	  PDX compression is a technique designed to reduce the memory
@@ -73,12 +74,30 @@ config PDX_MASK_COMPRESSION
 	help
 	  Compression relying on all RAM addresses sharing a zeroed bit region.
 
+config PDX_OFFSET_COMPRESSION
+	bool "Offset compression"
+	help
+	  Compression relying on size and distance between RAM regions being
+	  compressible using an offset lookup table.
+
 config PDX_NONE
 	bool "None"
 	help
 	  No compression
 endchoice
 
+config PDX_OFFSET_TLB_ORDER
+	int "PDX offset compression lookup table order" if EXPERT
+	depends on PDX_OFFSET_COMPRESSION
+	default 6
+	range 0 9
+	help
+	  Order of the PFN to PDX and PDX to PFN translation lookup tables.
+	  Number of table entries is calculated as 2^N.
+
+	  Size of the tables can be adjusted from 1 entry (order 0) to 512
+	  entries (order 9).
+
 config ALTERNATIVE_CALL
 	bool
 
diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index d5e469baffe2..ff3534122c72 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -24,6 +24,7 @@
 #include <xen/param.h>
 #include <xen/pfn.h>
 #include <xen/sections.h>
+#include <xen/sort.h>
 
 /**
  * Maximum (non-inclusive) usable pdx. Must be
@@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
 
 #ifdef CONFIG_PDX_MASK_COMPRESSION
     invalid |= mfn & pfn_hole_mask;
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
+    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));
 #endif
 
     if ( unlikely(evaluate_nospec(invalid)) )
@@ -75,6 +78,13 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
 # error "Missing architecture maximum number of RAM ranges"
 #endif
 
+/* Some functions should be init when not using PDX mask compression. */
+#ifndef CONFIG_PDX_MASK_COMPRESSION
+# define __init_or_pdx_mask __init
+#else
+# define __init_or_pdx_mask
+#endif
+
 /* Generic PFN compression helpers. */
 static struct pfn_range {
     unsigned long base, size;
@@ -102,7 +112,7 @@ void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
 }
 
 /* Sets all bits from the most-significant 1-bit down to the LSB */
-static uint64_t fill_mask(uint64_t mask)
+static uint64_t __init_or_pdx_mask fill_mask(uint64_t mask)
 {
     while (mask & (mask + 1))
         mask |= mask + 1;
@@ -128,7 +138,7 @@ static uint64_t fill_mask(uint64_t mask)
  * @param len  Size in octets of the region
  * @return Mask of moving bits at the bottom of all the region addresses
  */
-static uint64_t pdx_region_mask(uint64_t base, uint64_t len)
+static uint64_t __init_or_pdx_mask pdx_region_mask(uint64_t base, uint64_t len)
 {
     /*
      * We say a bit "moves" in a range if there exist 2 addresses in that
@@ -290,7 +300,200 @@ void __init pfn_pdx_compression_reset(void)
     nr_ranges = 0;
 }
 
-#endif /* CONFIG_PDX_COMPRESSION */
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
+
+unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP];
+unsigned int __ro_after_init pfn_index_shift;
+
+unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP];
+unsigned int __ro_after_init pdx_index_shift;
+
+bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
+{
+    unsigned long pfn = PFN_DOWN(base);
+
+    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);
+}
+
+static int __init cf_check cmp_node(const void *a, const void *b)
+{
+    const struct pfn_range *l = a;
+    const struct pfn_range *r = b;
+
+    if ( l->base > r->base )
+        return 1;
+    if ( l->base < r->base )
+        return -1;
+
+    return 0;
+}
+
+static void __init cf_check swp_node(void *a, void *b, size_t size)
+{
+    struct pfn_range *l = a;
+    struct pfn_range *r = b;
+    struct pfn_range tmp = *l;
+
+    *l = *r;
+    *r = tmp;
+}
+
+static bool __init pfn_offset_sanitize_ranges(void)
+{
+    unsigned int i = 0;
+
+    if ( nr_ranges == 1 )
+    {
+        ASSERT(PFN_TBL_IDX_VALID(ranges[0].base));
+        ASSERT(PFN_TBL_IDX(ranges[0].base) ==
+               PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
+        return true;
+    }
+
+    /* Sort nodes by start address. */
+    sort(ranges, nr_ranges, sizeof(struct pfn_range), cmp_node, swp_node);
+
+    /* Sanitize and merge ranges if possible. */
+    while ( i + 1 < nr_ranges )
+    {
+        /* No overlap between ranges. */
+        if ( ranges[i].base + ranges[i].size > ranges[i + 1].base )
+        {
+            printk(XENLOG_WARNING
+"Invalid ranges for PDX compression: [%#lx, %#lx] overlaps [%#lx, %#lx]\n",
+                   ranges[i].base, ranges[i].base + ranges[i].size - 1,
+                   ranges[i + 1].base,
+                   ranges[i + 1].base + ranges[i + 1].size - 1);
+            return false;
+        }
+
+        /* Ensure lookup indexes don't overflow table size. */
+        if ( !PFN_TBL_IDX_VALID(ranges[i].base) ||
+             !PFN_TBL_IDX_VALID(ranges[i].base + ranges[i].size - 1) ||
+             !PFN_TBL_IDX_VALID(ranges[i + 1].base) ||
+             !PFN_TBL_IDX_VALID(ranges[i + 1].base + ranges[i + 1].size - 1) )
+            return false;
+
+        /*
+         * Ensure ranges [start, end] use the same offset table index.  Should
+         * be guaranteed by the logic that calculates the pfn shift.
+         */
+        if ( PFN_TBL_IDX(ranges[i].base) !=
+             PFN_TBL_IDX(ranges[i].base + ranges[i].size - 1) ||
+             PFN_TBL_IDX(ranges[i + 1].base) !=
+             PFN_TBL_IDX(ranges[i + 1].base + ranges[i + 1].size - 1) )
+        {
+            ASSERT_UNREACHABLE();
+            return false;
+        }
+
+        if ( PFN_TBL_IDX(ranges[i].base) != PFN_TBL_IDX(ranges[i + 1].base) )
+        {
+            i++;
+            continue;
+        }
+
+        /* Merge ranges with the same table index. */
+        ranges[i].size = ranges[i + 1].base + ranges[i + 1].size -
+                         ranges[i].base;
+        memmove(&ranges[i + 1], &ranges[i + 2],
+                (nr_ranges - (i + 2)) * sizeof(ranges[0]));
+        nr_ranges--;
+    }
+
+    return true;
+}
+
+bool __init pfn_pdx_compression_setup(paddr_t base)
+{
+    unsigned long size = 0, mask = PFN_DOWN(pdx_init_mask(base));
+    unsigned int i;
+
+    if ( !nr_ranges )
+        return false;
+
+    if ( nr_ranges > ARRAY_SIZE(ranges) )
+    {
+        printk(XENLOG_WARNING
+               "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
+               nr_ranges, ARRAY_SIZE(ranges));
+        return false;
+    }
+
+    for ( i = 0; i < nr_ranges; i++ )
+        mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
+
+    pfn_index_shift = flsl(mask);
+
+    /*
+     * Increase the shift as much as possible, removing bits that are equal in
+     * all regions, as this allows the usage of smaller indexes, and in turn
+     * smaller lookup tables.
+     */
+    for ( pfn_index_shift = flsl(mask); pfn_index_shift < sizeof(mask) * 8 - 1;
+          pfn_index_shift++ )
+    {
+        const unsigned long bit = ranges[0].base & (1UL << pfn_index_shift);
+
+        for ( i = 1; i < nr_ranges; i++ )
+            if ( bit != (ranges[i].base & (1UL << pfn_index_shift)) )
+                break;
+        if ( i != nr_ranges )
+            break;
+    }
+
+    /* Sort and sanitize ranges. */
+    if ( !pfn_offset_sanitize_ranges() )
+        return false;
+
+    /* Calculate PDX region size. */
+    for ( i = 0; i < nr_ranges; i++ )
+        size = max(size, ranges[i].size);
+
+    mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
+    pdx_index_shift = flsl(mask);
+
+    /* Avoid compression if there's no gain. */
+    if ( (mask + 1) * (nr_ranges - 1) >= ranges[nr_ranges - 1].base )
+        return false;
+
+    /* Poison all lookup table entries ahead of setting them. */
+    memset(pfn_pdx_lookup, ~0, sizeof(pfn_pdx_lookup));
+    memset(pdx_pfn_lookup, ~0, sizeof(pfn_pdx_lookup));
+
+    for ( i = 0; i < nr_ranges; i++ )
+    {
+        unsigned int idx = PFN_TBL_IDX(ranges[i].base);
+
+        pfn_pdx_lookup[idx] = ranges[i].base - (mask + 1) * i;
+        pdx_pfn_lookup[i] = pfn_pdx_lookup[idx];
+    }
+
+    printk(XENLOG_INFO
+           "PFN compression using PFN lookup table shift %u and PDX region size %#lx\n",
+           pfn_index_shift, mask + 1);
+
+    for ( i = 0; i < nr_ranges; i++ )
+        printk(XENLOG_DEBUG
+               " range %u [%#013lx, %#013lx] PFN IDX %3lu : %#013lx\n",
+               i, ranges[i].base, ranges[i].base + ranges[i].size - 1,
+               PFN_TBL_IDX(ranges[i].base),
+               pfn_pdx_lookup[PFN_TBL_IDX(ranges[i].base)]);
+
+    return true;
+}
+
+void __init pfn_pdx_compression_reset(void)
+{
+    memset(pfn_pdx_lookup, 0, sizeof(pfn_pdx_lookup));
+    memset(pdx_pfn_lookup, 0, sizeof(pfn_pdx_lookup));
+    pfn_index_shift = 0;
+    pdx_index_shift = 0;
+
+    nr_ranges = 0;
+}
+
+#endif /* CONFIG_PDX_OFFSET_COMPRESSION */
 
 /*
  * Local variables:
diff --git a/xen/include/xen/pdx.h b/xen/include/xen/pdx.h
index 91fc32370f21..450e07de2764 100644
--- a/xen/include/xen/pdx.h
+++ b/xen/include/xen/pdx.h
@@ -65,6 +65,43 @@
  * This scheme also holds for multiple regions, where HHHHHHH acts as
  * the region identifier and LLLLLL fully contains the span of every
  * region involved.
+ *
+ * ## PDX offset compression
+ *
+ * Alternative compression mechanism that relies on RAM ranges having a similar
+ * size and offset between them:
+ *
+ * PFN address space:
+ * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
+ * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
+ * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
+ * │<------>│          │
+ * │  size             │
+ * │<----------------->│
+ *         offset
+ *
+ * The compression reduces the holes between RAM regions:
+ *
+ * PDX address space:
+ * ┌────────┬───┬────────┬───┐   ┌─┬────────┐
+ * │ RAM 0  │   │ RAM 1  │   │...│ │ RAM N  │
+ * ├────────┴───┼────────┴───┘   └─┴────────┘
+ * │<---------->│
+ *   pdx region size
+ *
+ * The offsets to convert from PFN to PDX and from PDX to PFN are stored in a
+ * pair of lookup tables, and the index into those tables to find the offset
+ * for each PFN or PDX is obtained by shifting the address to be translated by
+ * a specific value calculated at boot:
+ *
+ * pdx = pfn - pfn_lookup_table[pfn >> pfn_shift]
+ * pfn = pdx + pdx_lookup_table[pdx >> pdx_shift]
+ *
+ * This compression requires the PFN ranges to contain a non-equal most
+ * significant part that's smaller than the lookup table size, so that a valid
+ * shift value can be found to differentiate between PFN regions.  The setup
+ * algorithm might merge otherwise separate PFN ranges to use the same lookup
+ * table entry.
  */
 
 extern unsigned long max_pdx;
@@ -157,7 +194,53 @@ static inline paddr_t directmapoff_to_maddr_xlate(unsigned long offset)
             (offset & ma_va_bottom_mask));
 }
 
-#endif /* CONFIG_PDX_MASK_COMPRESSION */
+#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
+
+#include <xen/page-size.h>
+
+#define CONFIG_PDX_NR_LOOKUP (1UL << CONFIG_PDX_OFFSET_TLB_ORDER)
+#define PDX_TBL_MASK (CONFIG_PDX_NR_LOOKUP - 1)
+
+#define PFN_TBL_IDX_VALID(pfn) \
+    !(((pfn) >> pfn_index_shift) & ~PDX_TBL_MASK)
+
+#define PFN_TBL_IDX(pfn) \
+    (((pfn) >> pfn_index_shift) & PDX_TBL_MASK)
+#define PDX_TBL_IDX(pdx) \
+    (((pdx) >> pdx_index_shift) & PDX_TBL_MASK)
+#define MADDR_TBL_IDX(ma) \
+    (((ma) >> (pfn_index_shift + PAGE_SHIFT)) & PDX_TBL_MASK)
+#define DMAPOFF_TBL_IDX(off) \
+    (((off) >> (pdx_index_shift + PAGE_SHIFT)) & PDX_TBL_MASK)
+
+extern unsigned long pfn_pdx_lookup[];
+extern unsigned int pfn_index_shift;
+
+extern unsigned long pdx_pfn_lookup[];
+extern unsigned int pdx_index_shift;
+
+static inline unsigned long pfn_to_pdx_xlate(unsigned long pfn)
+{
+    return pfn - pfn_pdx_lookup[PFN_TBL_IDX(pfn)];
+}
+
+static inline unsigned long pdx_to_pfn_xlate(unsigned long pdx)
+{
+    return pdx + pdx_pfn_lookup[PDX_TBL_IDX(pdx)];
+}
+
+static inline unsigned long maddr_to_directmapoff_xlate(paddr_t ma)
+{
+    return ma - ((paddr_t)pfn_pdx_lookup[MADDR_TBL_IDX(ma)] << PAGE_SHIFT);
+}
+
+static inline paddr_t directmapoff_to_maddr_xlate(unsigned long offset)
+{
+    return offset + ((paddr_t)pdx_pfn_lookup[DMAPOFF_TBL_IDX(offset)] <<
+                     PAGE_SHIFT);
+}
+
+#endif /* CONFIG_PDX_OFFSET_COMPRESSION */
 
 /*
  * Allow each architecture to define its (possibly optimized) versions of the
-- 
2.49.0




* Re: [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-20 11:11 ` [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
@ 2025-06-24 13:05   ` Jan Beulich
  2025-06-25 15:14     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 13:05 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> When not using CONFIG_BIGMEM there are some restrictions in the address
> width for allocations of the domain structure, as its PDX, truncated to 32
> bits, is stashed into the page_info structure for domain allocated pages.
> 
> The current logic to calculate this limit is based on the internals of the
> PDX compression used, which is not strictly required.  Instead simplify the
> logic to rely on the existing PDX to PFN conversion helpers used elsewhere.
> 
> This has the added benefit of allowing alternative PDX compression
> algorithms to be implemented without requiring changes to the calculation of
> the domain structure allocation boundary.
> 
> As a side effect introduce pdx_to_paddr() conversion macro and use it.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

> @@ -498,14 +474,20 @@ struct domain *alloc_domain_struct(void)
>       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
>       * such restriction.
>       */
> -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
> -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
> -                                                          32 + PAGE_SHIFT;
> +#if defined(CONFIG_BIGMEM)
> +    const unsigned int bits = 0;
>  #else
> -    static unsigned int __read_mostly bits;
> +    static unsigned int __ro_after_init bits;
>  
>      if ( unlikely(!bits) )
> -         bits = _domain_struct_bits();
> +         /*
> +          * Get the width for the next pfn, and unconditionally subtract one
> +          * from it to ensure the used width will not allocate past the PDX
> +          * field limit.
> +          */
> +         bits = flsl(pdx_to_paddr(1UL << (sizeof_field(struct page_info,
> +                                                       v.inuse._domain) * 8)))

You didn't like the slightly shorter sizeof(frame_table->v.inuse._domain) then?

Jan



* Re: [PATCH v2 2/8] kconfig: turn PDX compression into a choice
  2025-06-20 11:11 ` [PATCH v2 2/8] kconfig: turn PDX compression into a choice Roger Pau Monne
@ 2025-06-24 13:13   ` Jan Beulich
  2025-06-26  7:49     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 13:13 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
> and make it part of the PDX compression choice block, in preparation for
> adding further PDX compression algorithms.
> 
> No functional change intended as the PDX compression defaults should still
> be the same for all architectures, however the choice block cannot be
> protected under EXPERT and still have a default choice being
> unconditionally selected.  As a result, the new "PDX (Page inDeX)
> compression" item will be unconditionally visible in Kconfig.

Just to mention it: Afaict there is a functional change, but one I actually
appreciate, at least in part. So far ...

> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -52,9 +52,10 @@ config EVTCHN_FIFO
>  
>  	  If unsure, say Y.
>  
> -config PDX_COMPRESSION
> -	bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
> -	default ARM || PPC

... for x86 (and RISC-V) this option couldn't be selected. Whereas ...

> @@ -67,6 +68,17 @@ config PDX_COMPRESSION
>  	  If your platform does not have sparse RAM banks, do not enable PDX
>  	  compression.
>  
> +config PDX_MASK_COMPRESSION
> +	bool "Mask compression"
> +	help
> +	  Compression relying on all RAM addresses sharing a zeroed bit region.

... this option is now available, as the prior !X86 && !RISCV doesn't
re-appear here. (As the description mentions it, that dependency clearly
can't appear on the enclosing choice itself.) Since x86 actually still
should have mask compression implemented properly, that's fine (from my
pov; iirc I even asked that it would have remained available when the
earlier change was done), whereas I think for RISC-V it's not quite right
to offer the option. It also did escape me why the option was made
available for PPC, which I'm pretty sure also lacks the logic to determine
a suitable mask.

Jan


* Re: [PATCH v2 3/8] pdx: provide a unified set of unit functions
       [not found] ` <20250620111130.29057-4-roger.pau@citrix.com>
@ 2025-06-24 13:32   ` Jan Beulich
  2025-06-25 15:32     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 13:32 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -255,6 +255,10 @@ void __init init_pdx(void)
>  {
>      const struct membanks *mem = bootinfo_get_mem();
>      paddr_t bank_start, bank_size, bank_end;
> +    unsigned int bank;
> +
> +    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> +        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
>  
>      /*
>       * Arm does not have any restrictions on the bits to compress. Pass 0 to
> @@ -263,28 +267,24 @@ void __init init_pdx(void)
>       * If the logic changes in pfn_pdx_hole_setup we might have to
>       * update this function too.
>       */
> -    uint64_t mask = pdx_init_mask(0x0);
> -    int bank;
> +    pfn_pdx_compression_setup(0);
>  
>      for ( bank = 0 ; bank < mem->nr_banks; bank++ )
>      {
> -        bank_start = mem->bank[bank].start;
> -        bank_size = mem->bank[bank].size;
> -
> -        mask |= bank_start | pdx_region_mask(bank_start, bank_size);
> -    }
> -
> -    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> -    {
> -        bank_start = mem->bank[bank].start;
> -        bank_size = mem->bank[bank].size;
> -
> -        if (~mask & pdx_region_mask(bank_start, bank_size))
> -            mask = 0;
> +        if ( !pdx_is_region_compressible(mem->bank[bank].start,
> +                 PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
> +                 PFN_DOWN(mem->bank[bank].start)) )

Nit: This, according to my understanding, is an "impossible" style. It wants
to either be

        if ( !pdx_is_region_compressible(
                  mem->bank[bank].start,
                  PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
                  PFN_DOWN(mem->bank[bank].start)) )

or ...

> +        {
> +            pfn_pdx_compression_reset();
> +            printk(XENLOG_WARNING
> +                   "PFN compression disabled, RAM region [%#" PRIpaddr ", %#"
> +                   PRIpaddr "] not covered\n",
> +                   mem->bank[bank].start,
> +                   mem->bank[bank].start + mem->bank[bank].size - 1);

... like this. But it's not written down anywhere, so I guess I shouldn't
insist.

And then - isn't the use of PFN_UP() and PFN_DOWN() the wrong way round?
Partial pages aren't usable anyway, so the smaller range is what matters
for every individual bank. However, for two contiguous banks (no idea
whether Arm would fold such into a single one, like we do with same-type
E820 regions on x86) this gets more complicated then.
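
To make the rounding question concrete, here is a minimal sketch with a 4KiB page size. pages_outer() mirrors the expression in the hunk above, while pages_inner() is the swapped variant that only counts whole pages. Both helper names are hypothetical, and the PFN_UP()/PFN_DOWN() definitions are assumed to match the usual ones rather than copied from the tree.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Assumed definitions: round an address down / up to a page frame number. */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* As in the hunk: every page the bank touches, partial pages included. */
static unsigned long pages_outer(paddr_t start, paddr_t size)
{
    assert(start + size >= start); /* no address wrap */

    return PFN_UP(start + size) - PFN_DOWN(start);
}

/* Swapped rounding: only pages fully contained in the bank. */
static unsigned long pages_inner(paddr_t start, paddr_t size)
{
    paddr_t first = PFN_UP(start), end = PFN_DOWN(start + size);

    assert(start + size >= start); /* no address wrap */

    /* A bank smaller than one page may contain no whole page at all. */
    return end > first ? end - first : 0;
}
```

For a made-up bank covering [0x1800, 0x4800), pages_outer() yields 4 (PFNs 1 to 4) whereas pages_inner() yields 2 (PFNs 2 and 3), which is the difference being asked about.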

> @@ -299,19 +295,29 @@ void __init srat_parse_regions(paddr_t addr)
>  
>  	/* Set "PXM" as early as feasible. */
>  	numa_fw_nid_name = "PXM";
> -	srat_region_mask = pdx_init_mask(addr);
>  	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>  			      srat_parse_region, 0);
>  
> -	for (mask = srat_region_mask, i = 0; mask && i < e820.nr_map; i++) {
> +	pfn_pdx_compression_setup(addr);
> +
> +	/* Ensure all RAM ranges in the e820 are covered. */
> +	for (i = 0; i < e820.nr_map; i++) {
>  		if (e820.map[i].type != E820_RAM)
>  			continue;
>  
> -		if (~mask & pdx_region_mask(e820.map[i].addr, e820.map[i].size))
> -			mask = 0;
> +		if (!pdx_is_region_compressible(e820.map[i].addr,
> +		    PFN_UP(e820.map[i].addr + e820.map[i].size) -
> +		    PFN_DOWN(e820.map[i].addr)))

Indentation is off here in any event, i.e. irrespective of my earlier
remark.

> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -19,6 +19,7 @@
>  #include <xen/mm.h>
>  #include <xen/bitops.h>
>  #include <xen/nospec.h>
> +#include <xen/pfn.h>
>  #include <xen/sections.h>
>  
>  /**
> @@ -55,6 +56,44 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
>          __set_bit(idx, pdx_group_valid);
>  }
>  
> +#ifndef CONFIG_PDX_NONE
> +
> +#ifdef CONFIG_X86
> +# include <asm/e820.h>
> +# define MAX_PFN_RANGES E820MAX
> +#elif defined(CONFIG_HAS_DEVICE_TREE)
> +# include <xen/bootfdt.h>
> +# define MAX_PFN_RANGES NR_MEM_BANKS
> +#endif
> +
> +#ifndef MAX_PFN_RANGES
> +# error "Missing architecture maximum number of RAM ranges"
> +#endif
> +
> +/* Generic PFN compression helpers. */
> +static struct pfn_range {
> +    unsigned long base, size;
> +} ranges[MAX_PFN_RANGES] __initdata;
> +static unsigned int __initdata nr_ranges;
> +
> +void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
> +{
> +    if ( !size )
> +        return;
> +
> +    if ( nr_ranges >= ARRAY_SIZE(ranges) )
> +    {
> +        ASSERT((nr_ranges + 1) > nr_ranges);

This looks overly pessimistic to me. (I won't outright insist on its removal,
though.)

> +        nr_ranges++;

This requires pretty careful use of the variable as an upper bound of loops.
It's fine in pfn_pdx_compression_setup(), but it feels a little risky.

Jan



* Re: [PATCH v2 6/8] test/pdx: add PDX compression unit tests
  2025-06-20 11:11 ` [PATCH v2 6/8] test/pdx: add PDX compression unit tests Roger Pau Monne
@ 2025-06-24 13:37   ` Anthony PERARD
  2025-06-25 15:55     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Anthony PERARD @ 2025-06-24 13:37 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: xen-devel, Andrew Cooper, Michal Orzel, Jan Beulich, Julien Grall,
	Stefano Stabellini

On Fri, Jun 20, 2025 at 01:11:28PM +0200, Roger Pau Monne wrote:
> +.PHONY: run
> +run: $(TARGETS)
> +ifeq ($(CC),$(HOSTCC))
> +	for test in $? ; do \
> +		./$$test ;  \
> +	done

You need to add `set -e` or the exit value from the tested binary might
be ignored. This `run` target only fails if the last test binary returns
a failure.

> +else
> +	$(warning HOSTCC != CC, will not run test)
> +endif
> +
> +.PHONY: clean
> +clean:
> +	$(RM) -- *.o $(TARGETS) $(DEPS_RM) pdx.c pdx.h

Is this "pdx.c" left over from a previous version? It doesn't seem to be
generated by this makefile.

> +
> +pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h
> +	sed -E -e '/^#[[:space:]]?include/d' <$< >$@

Why allow only zero or one space characters between '#' and
"include"? Why not use '*' instead of '?'?


Thanks,

-- 
Anthony PERARD



* Re: [PATCH v2 4/8] pdx: introduce command line compression toggle
  2025-06-20 11:11 ` [PATCH v2 4/8] pdx: introduce command line compression toggle Roger Pau Monne
@ 2025-06-24 13:40   ` Jan Beulich
  2025-06-25 15:46     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 13:40 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Roger Pau Monné, xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> Introduce a command line option to allow disabling PDX compression.  The
> disabling is done by turning pfn_pdx_add_region() into a no-op, so when
> attempting to initialize the selected compression algorithm the array of
> ranges to compress is empty.

While neat, this also feels fragile. It's not obvious that for any
algorithm pfn_pdx_compression_setup() would leave compression disabled
when there are zero ranges. In principle, if it was written differently
for mask compression, there being no ranges could result in compression
simply squeezing out all of the address bits. Yet as long as we think
we're going to keep this in mind ...

> Signed-off-by: Roger Pau Monné <roger.pau@cloud.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



* Re: [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers
  2025-06-20 11:11 ` [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
@ 2025-06-24 13:51   ` Jan Beulich
  2025-06-25 15:51     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 13:51 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, Shawn Anastasio,
	Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko,
	xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> --- /dev/null
> +++ b/xen/arch/x86/include/asm/pdx.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#ifndef X86_PDX_H
> +#define X86_PDX_H
> +
> +#ifndef CONFIG_PDX_NONE
> +
> +#include <asm/alternative.h>
> +
> +/*
> + * Introduce a macro to avoid repeating the same asm goto block in each helper.
> + * Note the macro is strictly tied to the code in the helpers.
> + */
> +#define PDX_ASM_GOTO_SKIP                           \
> +    asm_inline goto (                               \
> +        ALTERNATIVE(                                \
> +            "",                                     \
> +            "jmp %l[skip]",                         \
> +            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
> +        : : : : skip )

Did you consider passing the label name as argument to the macro? That way ...

> +static inline unsigned long pfn_to_pdx(unsigned long pfn)
> +{
> +    PDX_ASM_GOTO_SKIP;
> +
> +    return pfn_to_pdx_xlate(pfn);
> +
> + skip:
> +    return pfn;
> +}

... the labels here and below then wouldn't look unused.

The other slight anomaly with this is that we're wasting 2 or 5 bytes of
code space. Yet I guess that's an acceptable price to pay for keeping the
actual translation code in C (rather than in assembly).

Jan



* Re: [PATCH v2 7/8] pdx: move some helpers in preparation for new compression
  2025-06-20 11:11 ` [PATCH v2 7/8] pdx: move some helpers in preparation for new compression Roger Pau Monne
@ 2025-06-24 13:52   ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 13:52 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> Move fill_mask(), pdx_region_mask() and pdx_init_mask() to the
> !CONFIG_PDX_NONE section in preparation of them also being used by a newly
> added PDX compression.
> 
> No functional change intended.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-20 11:11 ` [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
@ 2025-06-24 16:16   ` Jan Beulich
  2025-06-25 16:24     ` Roger Pau Monné
  2025-06-30  6:34   ` Jan Beulich
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-24 16:16 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> With the appearance of Intel Sierra Forest and Granite Rapids it's now
> possible to get a production x86 host with the following memory map:
> 
> SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> SRAT: Node 0 PXM 0 [0000000100000000, 000000807fffffff]
> SRAT: Node 1 PXM 1 [0000063e80000000, 000006be7fffffff]
> SRAT: Node 2 PXM 2 [00000c7e80000000, 00000cfe7fffffff]
> SRAT: Node 3 PXM 3 [000012be80000000, 0000133e7fffffff]
> 
> This is from a four socket Granite Rapids system, with each node having
> 512GB of memory.  The total amount of RAM on the system is 2TB, but without
> enabling CONFIG_BIGMEM the last range is not accessible, as it's above the
> 16TB boundary covered by the frame table. Sierra Forest and Granite Rapids
> are socket compatible, however Sierra Forest only supports 2 socket
> configurations, while Granite Rapids can go up to 8 sockets.
> 
> Note that while the memory map is very sparse, it couldn't be compressed
> using the current PDX_MASK compression algorithm, which relies on all
> ranges having a shared zeroed region of bits that can be removed.
> 
> The memory map presented above has the property of all regions being
> similarly spaced between each other, and all having also a similar size.
> Use a lookup table to store the offsets to translate from/to PFN and PDX
> spaces.  Such table is indexed based on the input PFN or PDX to be translated.
> The example PFN layout above would get compressed using the following:
> 
> PFN compression using PFN lookup table shift 29 and PDX region size 0x10000000
>  range 0 [0000000000000, 0x0000807ffff] PFN IDX  0 : 0000000000000
>  range 1 [0x00063e80000, 0x0006be7ffff] PFN IDX  3 : 0x00053e80000
>  range 2 [0x000c7e80000, 0x000cfe7ffff] PFN IDX  6 : 0x000a7e80000
>  range 3 [0x0012be80000, 0x00133e7ffff] PFN IDX  9 : 0x000fbe80000
> 
> Note how the two ranges belonging to node 0 get merged into a single PDX
> region by the compression algorithm.
> 
> The default size of lookup tables currently set in Kconfig is 64 entries,
> and the example memory map consumes 10 entries.  Such memory map is from a
> 4 socket Granite Rapids host, which in theory supports up to 8 sockets
> according to Intel documentation.  Assuming the layout of an 8 socket system
> is similar to the 4 socket one, it would require 21 lookup table entries to
> support it, way below the current default of 64 entries.
> 
> The valid range of lookup table size is currently restricted from 1 to 512
> elements in Kconfig.
> 
> Unused lookup table entries are set to all ones (~0UL), so that we can
> detect whether a pfn or pdx is valid just by checking whether its
> translation is bi-directional.  The saturated offsets will prevent the
> translation from being bidirectional if the lookup table entry is not
> valid.

Right, yet with the sad effect of still leaving almost half the space unused.
I guess that's pretty much unavoidable though in this scheme, as long as we
want the backwards translation to also be "simple" (and in particular not
involving a loop of any kind).
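
For reference, a standalone sketch of the two table lookups, filled with the shifts and offsets from the example output quoted above. setup_example() and round_trips() are hypothetical helpers, the table size assumes the default order of 6, and a 64-bit unsigned long is required. It is an illustration under those assumptions, not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_LOOKUP 64UL               /* assumed: default table order of 6 */
#define TBL_MASK  (NR_LOOKUP - 1)

static unsigned long pfn_pdx_lookup[NR_LOOKUP];
static unsigned long pdx_pfn_lookup[NR_LOOKUP];
static const unsigned int pfn_index_shift = 29; /* from the example output */
static const unsigned int pdx_index_shift = 28; /* region size 0x10000000 */

static unsigned long pfn_to_pdx(unsigned long pfn)
{
    return pfn - pfn_pdx_lookup[(pfn >> pfn_index_shift) & TBL_MASK];
}

static unsigned long pdx_to_pfn(unsigned long pdx)
{
    return pdx + pdx_pfn_lookup[(pdx >> pdx_index_shift) & TBL_MASK];
}

/* Hypothetical helper: does the PFN survive PFN -> PDX -> PFN? */
static bool round_trips(unsigned long pfn)
{
    return pdx_to_pfn(pfn_to_pdx(pfn)) == pfn;
}

/* Hypothetical helper: fill both tables with the example layout. */
static void setup_example(void)
{
    static const unsigned long base[] = {
        0x0UL, 0x63e80000UL, 0xc7e80000UL, 0x12be80000UL,
    };
    unsigned int i;

    assert(sizeof(unsigned long) == 8);

    /* Poison every slot before setting the ones in use. */
    for ( i = 0; i < NR_LOOKUP; i++ )
        pfn_pdx_lookup[i] = pdx_pfn_lookup[i] = ~0UL;

    for ( i = 0; i < sizeof(base) / sizeof(base[0]); i++ )
    {
        /* Region i starts at PDX i * region size. */
        unsigned long off = base[i] - ((1UL << pdx_index_shift) * i);

        pfn_pdx_lookup[(base[i] >> pfn_index_shift) & TBL_MASK] = off;
        pdx_pfn_lookup[i] = off;
    }
}
```

After setup_example(), PFN 0x63e80000 becomes PDX 0x10000000 and translates back. PFN 0x20000000 hits poisoned PFN slot 1, becomes PDX 0x20000001, and comes back through PDX slot 2 as 0xc7e80001, so the round trip fails. In this sketch PFN 0x40000000 still round-trips, because both its PFN slot (2) and the resulting PDX slot (4) are poisoned and the two ~0UL offsets cancel out. The sketch should therefore not be read as a complete validity check on its own.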

> --- a/CHANGELOG.md
> +++ b/CHANGELOG.md
> @@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
>       grant table or foreign memory.
>  
>  ### Added
> + - Introduce new PDX compression algorithm to cope with Intel Sapphire and
> +   Granite Rapids having sparse memory maps.

In the description you updated to mention Sierra Forest instead, but here you
didn't.

> --- a/tools/tests/pdx/harness.h
> +++ b/tools/tests/pdx/harness.h
> @@ -44,8 +44,10 @@
>  
>  #define MAX_RANGES 8
>  #define MAX_PFN_RANGES MAX_RANGES
> +#define CONFIG_PDX_OFFSET_TLB_ORDER 6
>  
>  #define ASSERT assert
> +#define ASSERT_UNREACHABLE() assert(0);

Nit: Stray semicolon.

> @@ -66,6 +68,8 @@ static inline unsigned int find_next(
>  #define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
>  #define find_next_bit(a, s, o)      find_next(a, s, o, true)
>  
> +#define flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)

While this is perhaps indeed good enough for a testing utility, ...

> @@ -75,6 +79,12 @@ static inline unsigned int find_next(
>  
>  typedef uint64_t paddr_t;
>  
> +#define sort(elem, nr, size, cmp, swp) {                                \
> +    /* Consume swp() so compiler doesn't complain it's unused. */       \
> +    (void)swp;                                                          \
> +    qsort(elem, nr, size, cmp);                                         \
> +}

... this I think wants to use either do/while or ({ }).
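
A minimal sketch of the do/while variant of the harness macro, assuming the wrapper only needs to behave like a single statement. cmp_ulong() and sort_if_needed() are hypothetical names used purely to exercise the macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Statement-like wrapper around qsort().  The do/while form swallows the
 * caller's trailing semicolon, so it can sit in an unbraced if/else.
 * With a bare { ... } body, "if ( c ) sort(...); else ..." would not
 * compile, as the semicolon ends the if statement before the else.
 */
#define sort(elem, nr, size, cmp, swp) do {                             \
    /* Consume swp so the compiler doesn't complain it's unused. */     \
    (void)(swp);                                                        \
    qsort(elem, nr, size, cmp);                                         \
} while ( 0 )

static int cmp_ulong(const void *a, const void *b)
{
    unsigned long l = *(const unsigned long *)a;
    unsigned long r = *(const unsigned long *)b;

    return (l > r) - (l < r);
}

/* Hypothetical caller: uses the macro in an unbraced if/else. */
static bool sort_if_needed(unsigned long *v, size_t nr)
{
    if ( nr > 1 )
        sort(v, nr, sizeof(*v), cmp_ulong, NULL);
    else
        return false;

    return true;
}
```

The ({ }) form would work as well, but relies on a GNU extension, while do/while is plain C.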

> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -54,7 +54,8 @@ config EVTCHN_FIFO
>  
>  choice
>  	prompt "PDX (Page inDeX) compression"
> -	default PDX_MASK_COMPRESSION if !X86 && !RISCV
> +	default PDX_OFFSET_COMPRESSION if X86
> +	default PDX_MASK_COMPRESSION if !RISCV
>  	default PDX_NONE
>  	help
>  	  PDX compression is a technique designed to reduce the memory
> @@ -73,12 +74,30 @@ config PDX_MASK_COMPRESSION
>  	help
>  	  Compression relying on all RAM addresses sharing a zeroed bit region.
>  
> +config PDX_OFFSET_COMPRESSION
> +	bool "Offset compression"
> +	help
> +	  Compression relying on size and distance between RAM regions being
> +	  compressible using an offset lookup table.
> +
>  config PDX_NONE
>  	bool "None"
>  	help
>  	  No compression
>  endchoice
>  
> +config PDX_OFFSET_TLB_ORDER

Please can we avoid the term "TLB" in the name? What we commonly call a TLB
is somewhat different. In fact is there anything wrong with just
PDX_OFFSET_ORDER?

> +	int "PDX offset compression lookup table order" if EXPERT
> +	depends on PDX_OFFSET_COMPRESSION
> +	default 6
> +	range 0 9

Is 0 really a sensible lower bound? There's not going to be any compression
then, I suppose?

> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -24,6 +24,7 @@
>  #include <xen/param.h>
>  #include <xen/pfn.h>
>  #include <xen/sections.h>
> +#include <xen/sort.h>
>  
>  /**
>   * Maximum (non-inclusive) usable pdx. Must be
> @@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
>  
>  #ifdef CONFIG_PDX_MASK_COMPRESSION
>      invalid |= mfn & pfn_hole_mask;
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> +    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));

Hmm, that's pretty expensive already. Involving two (presumably back-to-back)
JMPs when compression isn't enabled.

> @@ -290,7 +300,200 @@ void __init pfn_pdx_compression_reset(void)
>      nr_ranges = 0;
>  }
>  
> -#endif /* CONFIG_PDX_COMPRESSION */
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
> +
> +unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP];
> +unsigned int __ro_after_init pfn_index_shift;
> +
> +unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP];
> +unsigned int __ro_after_init pdx_index_shift;

For slightly better cache locality when only a few array indexes are in
use, may I suggest to put the indexes ahead of the arrays? Perhaps even
together, as they both take up a single unsigned long slot.

> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> +{
> +    unsigned long pfn = PFN_DOWN(base);
> +
> +    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);

Aiui for this to be correct, there need to be gaps between the ranges
covered by individual lookup table slots. In the setup logic you have a
check commented "Avoid compression if there's no gain", but that doesn't
look to guarantee gaps everywhere (nor would pfn_offset_sanitize_ranges()
appear to)?

> +static void __init cf_check swp_node(void *a, void *b, size_t size)
> +{
> +    struct pfn_range *l = a;
> +    struct pfn_range *r = b;
> +    struct pfn_range tmp = *l;
> +
> +    *l = *r;
> +    *r = tmp;
> +}

Any reason you effectively open-code SWAP() here?

> +static bool __init pfn_offset_sanitize_ranges(void)
> +{
> +    unsigned int i = 0;
> +
> +    if ( nr_ranges == 1 )
> +    {
> +        ASSERT(PFN_TBL_IDX_VALID(ranges[0].base));
> +        ASSERT(PFN_TBL_IDX(ranges[0].base) ==
> +               PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
> +        return true;
> +    }
> +
> +    /* Sort nodes by start address. */
> +    sort(ranges, nr_ranges, sizeof(struct pfn_range), cmp_node, swp_node);

Better sizeof(*ranges) or sizeof(ranges[0])?

> +bool __init pfn_pdx_compression_setup(paddr_t base)
> +{
> +    unsigned long size = 0, mask = PFN_DOWN(pdx_init_mask(base));
> +    unsigned int i;
> +
> +    if ( !nr_ranges )
> +        return false;

Also bail if there's just a single range?

> +    if ( nr_ranges > ARRAY_SIZE(ranges) )
> +    {
> +        printk(XENLOG_WARNING
> +               "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
> +               nr_ranges, ARRAY_SIZE(ranges));
> +        return false;
> +    }
> +
> +    for ( i = 0; i < nr_ranges; i++ )
> +        mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
> +
> +    pfn_index_shift = flsl(mask);

With this ...

> +    /*
> +     * Increase the shift as much as possible, removing bits that are equal in
> +     * all regions, as this allows the usage of smaller indexes, and in turn
> +     * smaller lookup tables.
> +     */
> +    for ( pfn_index_shift = flsl(mask); pfn_index_shift < sizeof(mask) * 8 - 1;

... you don't need to do this here another time.

Also - why the subtraction of 1 in what the shift is compared against? Logic
below should in principle guarantee we never exit the loop because of the
conditional above, but if we made it that far it looks like we could as well
also look at the top bit.
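
As a rough illustration of the shift calculation with a single initialisation of the shift, here is a standalone sketch. The bodies of fill_mask() and region_mask() are assumptions about what the real fill_mask() and pdx_region_mask() helpers compute (they are not part of the quoted hunk), calc_pfn_index_shift() is a hypothetical name, and the initial mask is passed in by the caller rather than derived from pdx_init_mask().

```c
#include <assert.h>

#define BITS_PER_LONG (sizeof(unsigned long) * 8)
/* Same fallback as the test harness: 1-based index of the highest set bit. */
#define flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)

struct pfn_range {
    unsigned long base, size;
};

/* Assumption: set every bit below the highest set bit of mask. */
static unsigned long fill_mask(unsigned long mask)
{
    while ( mask & (mask + 1) )
        mask |= mask + 1;

    return mask;
}

/* Assumption: bits that can differ between first and last PFN of a range. */
static unsigned long region_mask(unsigned long base, unsigned long size)
{
    return fill_mask(base ^ (base + size - 1));
}

/* Hypothetical standalone version of the shift calculation. */
static unsigned int calc_pfn_index_shift(const struct pfn_range *r,
                                         unsigned int nr,
                                         unsigned long mask)
{
    unsigned int shift, i;

    assert(nr);

    for ( i = 0; i < nr; i++ )
        mask |= region_mask(r[i].base, r[i].size);

    /* The shift is initialised once, in the loop header only. */
    for ( shift = flsl(mask); shift < BITS_PER_LONG - 1; shift++ )
    {
        const unsigned long bit = r[0].base & (1UL << shift);

        /* Stop at the first bit that is not identical in all bases. */
        for ( i = 1; i < nr; i++ )
            if ( bit != (r[i].base & (1UL << shift)) )
                break;
        if ( i != nr )
            break;
    }

    return shift;
}
```

Fed with the five SRAT ranges from the description expressed as PFNs (on a 64-bit build), this returns 29, which agrees with the "PFN lookup table shift 29" in the quoted boot output.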

> +          pfn_index_shift++ )
> +    {
> +        const unsigned long bit = ranges[0].base & (1UL << pfn_index_shift);
> +
> +        for ( i = 1; i < nr_ranges; i++ )
> +            if ( bit != (ranges[i].base & (1UL << pfn_index_shift)) )
> +                break;
> +        if ( i != nr_ranges )
> +            break;
> +    }
> +
> +    /* Sort and sanitize ranges. */
> +    if ( !pfn_offset_sanitize_ranges() )
> +        return false;
> +
> +    /* Calculate PDX region size. */
> +    for ( i = 0; i < nr_ranges; i++ )
> +        size = max(size, ranges[i].size);
> +
> +    mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
> +    pdx_index_shift = flsl(mask);
> +
> +    /* Avoid compression if there's no gain. */
> +    if ( (mask + 1) * (nr_ranges - 1) >= ranges[nr_ranges - 1].base )
> +        return false;
> +
> +    /* Poison all lookup table entries ahead of setting them. */
> +    memset(pfn_pdx_lookup, ~0, sizeof(pfn_pdx_lookup));
> +    memset(pdx_pfn_lookup, ~0, sizeof(pfn_pdx_lookup));

Have the arrays have initializers instead?

> +    for ( i = 0; i < nr_ranges; i++ )
> +    {
> +        unsigned int idx = PFN_TBL_IDX(ranges[i].base);
> +
> +        pfn_pdx_lookup[idx] = ranges[i].base - (mask + 1) * i;
> +        pdx_pfn_lookup[i] = pfn_pdx_lookup[idx];
> +    }
> +
> +    printk(XENLOG_INFO
> +           "PFN compression using PFN lookup table shift %u and PDX region size %#lx\n",

I'd drop PFN and the latter PDX from this format string.

> +           pfn_index_shift, mask + 1);
> +
> +    for ( i = 0; i < nr_ranges; i++ )
> +        printk(XENLOG_DEBUG
> +               " range %u [%#013lx, %#013lx] PFN IDX %3lu : %#013lx\n",
> +               i, ranges[i].base, ranges[i].base + ranges[i].size - 1,
> +               PFN_TBL_IDX(ranges[i].base),
> +               pfn_pdx_lookup[PFN_TBL_IDX(ranges[i].base)]);

Do you really mean this to stay active also in release builds?

Also the outcome of the earlier loop isn't used by the intermediate printk().
Perhaps join both loops, thus allowing idx to be re-used here?

> +    return true;
> +}
> +
> +void __init pfn_pdx_compression_reset(void)
> +{
> +    memset(pfn_pdx_lookup, 0, sizeof(pfn_pdx_lookup));
> +    memset(pdx_pfn_lookup, 0, sizeof(pfn_pdx_lookup));

Why not ~0?

> --- a/xen/include/xen/pdx.h
> +++ b/xen/include/xen/pdx.h
> @@ -65,6 +65,43 @@
>   * This scheme also holds for multiple regions, where HHHHHHH acts as
>   * the region identifier and LLLLLL fully contains the span of every
>   * region involved.
> + *
> + * ## PDX offset compression
> + *
> + * Alternative compression mechanism that relies on RAM ranges having a similar
> + * size and offset between them:
> + *
> + * PFN address space:
> + * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
> + * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
> + * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
> + * │<------>│          │
> + * │  size             │
> + * │<----------------->│
> + *         offset
> + *
> + * The compression reduces the holes between RAM regions:
> + *
> + * PDX address space:
> + * ┌────────┬───┬────────┬───┐   ┌─┬────────┐
> + * │ RAM 0  │   │ RAM 1  │   │...│ │ RAM N  │
> + * ├────────┴───┼────────┴───┘   └─┴────────┘
> + * │<---------->│
> + *   pdx region size
> + *
> + * The offsets to convert from PFN to PDX and from PDX to PFN are stored in a
> + * pair of lookup tables, and the index into those tables to find the offset
> + * for each PFN or PDX is obtained by shifting the to be translated address by
> + * a specific value calculated at boot:
> + *
> + * pdx = pfn - pfn_lookup_table[pfn >> pfn_shift]
> + * pfn = pdx + pdx_lookup_table[pdx >> pdx_shift]

I assume it's intentional (for simplicity) that you omit the index masking
here?

Jan
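
As an illustration of the two formulas in the header comment with the index masking made explicit, here is a minimal sketch. Everything in it is an assumption made for the example: the tbl_* names, the 4-entry table size and the toy layout (two 0x100-page RAM ranges at PFN 0 and PFN 0x1000) do not come from the patch.

```c
#define TOY_NR_LOOKUP 4u   /* hypothetical table size, must be a power of 2 */

/* Table index for a PFN or PDX: shift, then mask to stay inside the table. */
#define TOY_TBL_IDX(val, shift) (((val) >> (shift)) & (TOY_NR_LOOKUP - 1))

struct tbl_state {
    unsigned long pfn_lookup[TOY_NR_LOOKUP]; /* indexed by PFN, stores offset */
    unsigned long pdx_lookup[TOY_NR_LOOKUP]; /* indexed by PDX, stores offset */
    unsigned int pfn_shift, pdx_shift;
};

/* Toy layout: RAM at PFNs [0, 0xff] and [0x1000, 0x10ff], PDX region 0x100. */
static const struct tbl_state toy_tables = {
    .pfn_lookup = { [0] = 0, [1] = 0x1000 - 0x100 },
    .pdx_lookup = { [0] = 0, [1] = 0x1000 - 0x100 },
    .pfn_shift = 12,
    .pdx_shift = 8,
};

/* pdx = pfn - pfn_lookup_table[masked index] */
static inline unsigned long tbl_pfn_to_pdx(const struct tbl_state *t,
                                           unsigned long pfn)
{
    return pfn - t->pfn_lookup[TOY_TBL_IDX(pfn, t->pfn_shift)];
}

/* pfn = pdx + pdx_lookup_table[masked index] */
static inline unsigned long tbl_pdx_to_pfn(const struct tbl_state *t,
                                           unsigned long pdx)
{
    return pdx + t->pdx_lookup[TOY_TBL_IDX(pdx, t->pdx_shift)];
}
```

The mask only matters for inputs beyond the populated part of the address space: it keeps the array access in bounds, at the cost of such inputs aliasing onto a lower slot.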



* Re: [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary
  2025-06-24 13:05   ` Jan Beulich
@ 2025-06-25 15:14     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 15:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Tue, Jun 24, 2025 at 03:05:11PM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > When not using CONFIG_BIGMEM there are some restrictions in the address
> > width for allocations of the domain structure, as its PDX, truncated to 32
> > bits, is stashed into the page_info structure for domain allocated pages.
> > 
> > The current logic to calculate this limit is based on the internals of the
> > PDX compression used, which is not strictly required.  Instead simplify the
> > logic to rely on the existing PDX to PFN conversion helpers used elsewhere.
> > 
> > This has the added benefit of allowing alternative PDX compression
> > algorithms to be implemented without requiring to change the calculation of
> > the domain structure allocation boundary.
> > 
> > As a side effect introduce pdx_to_paddr() conversion macro and use it.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks.

> > @@ -498,14 +474,20 @@ struct domain *alloc_domain_struct(void)
> >       * On systems with CONFIG_BIGMEM there's no packing, and so there's no
> >       * such restriction.
> >       */
> > -#if defined(CONFIG_BIGMEM) || !defined(CONFIG_PDX_COMPRESSION)
> > -    const unsigned int bits = IS_ENABLED(CONFIG_BIGMEM) ? 0 :
> > -                                                          32 + PAGE_SHIFT;
> > +#if defined(CONFIG_BIGMEM)
> > +    const unsigned int bits = 0;
> >  #else
> > -    static unsigned int __read_mostly bits;
> > +    static unsigned int __ro_after_init bits;
> >  
> >      if ( unlikely(!bits) )
> > -         bits = _domain_struct_bits();
> > +         /*
> > +          * Get the width for the next pfn, and unconditionally subtract one
> > +          * from it to ensure the used width will not allocate past the PDX
> > +          * field limit.
> > +          */
> > +         bits = flsl(pdx_to_paddr(1UL << (sizeof_field(struct page_info,
> > +                                                       v.inuse._domain) * 8)))
> 
> You didn't like the slightly shorter sizeof(frame_table->v.inuse._domain) then?

No strong opinion really.  I have the impression, however, that using the
struct type itself would be less fragile in case we ever change the
frame_table variable name (which is very unlikely).

Roger.
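
The two spellings compared above can be put side by side in a standalone sketch. The structure below is a hypothetical stand-in for struct page_info (only the 32-bit _domain field matters here), and sizeof_field() is defined locally in the usual null-pointer-member form so that the example does not depend on Xen headers.

```c
#include <stdint.h>

/* Local definition for the sketch; the member is never evaluated. */
#define sizeof_field(type, member) sizeof(((type *)0)->member)

/* Hypothetical, heavily reduced stand-in for struct page_info. */
struct sketch_page_info {
    unsigned long count_info;
    union {
        struct {
            uint32_t _domain; /* PDX of the owner, truncated to 32 bits */
        } inuse;
        unsigned long pad;
    } v;
};

/* Only declared: sizeof() does not evaluate its operand. */
extern struct sketch_page_info *sketch_frame_table;

/* Width via the type: independent of any variable name. */
static inline unsigned int domain_bits_by_type(void)
{
    return sizeof_field(struct sketch_page_info, v.inuse._domain) * 8;
}

/* Width via the variable: shorter, but tied to the variable's name. */
static inline unsigned int domain_bits_by_var(void)
{
    return sizeof(sketch_frame_table->v.inuse._domain) * 8;
}
```

Both forms are compile-time constants and yield the same width, so the choice is purely about which name the expression should depend on.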



* Re: [PATCH v2 3/8] pdx: provide a unified set of unit functions
  2025-06-24 13:32   ` [PATCH v2 3/8] pdx: provide a unified set of unit functions Jan Beulich
@ 2025-06-25 15:32     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 15:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On Tue, Jun 24, 2025 at 03:32:23PM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -255,6 +255,10 @@ void __init init_pdx(void)
> >  {
> >      const struct membanks *mem = bootinfo_get_mem();
> >      paddr_t bank_start, bank_size, bank_end;
> > +    unsigned int bank;
> > +
> > +    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> > +        pfn_pdx_add_region(mem->bank[bank].start, mem->bank[bank].size);
> >  
> >      /*
> >       * Arm does not have any restrictions on the bits to compress. Pass 0 to
> > @@ -263,28 +267,24 @@ void __init init_pdx(void)
> >       * If the logic changes in pfn_pdx_hole_setup we might have to
> >       * update this function too.
> >       */
> > -    uint64_t mask = pdx_init_mask(0x0);
> > -    int bank;
> > +    pfn_pdx_compression_setup(0);
> >  
> >      for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> >      {
> > -        bank_start = mem->bank[bank].start;
> > -        bank_size = mem->bank[bank].size;
> > -
> > -        mask |= bank_start | pdx_region_mask(bank_start, bank_size);
> > -    }
> > -
> > -    for ( bank = 0 ; bank < mem->nr_banks; bank++ )
> > -    {
> > -        bank_start = mem->bank[bank].start;
> > -        bank_size = mem->bank[bank].size;
> > -
> > -        if (~mask & pdx_region_mask(bank_start, bank_size))
> > -            mask = 0;
> > +        if ( !pdx_is_region_compressible(mem->bank[bank].start,
> > +                 PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
> > +                 PFN_DOWN(mem->bank[bank].start)) )
> 
> Nit: This, according to my understanding, is an "impossible" style. It wants
> to either be
> 
>         if ( !pdx_is_region_compressible(
>                   mem->bank[bank].start,
>                   PFN_UP(mem->bank[bank].start + mem->bank[bank].size) -
>                   PFN_DOWN(mem->bank[bank].start)) )
> 
> or ...

I will switch to the example above, thanks.

> > +        {
> > +            pfn_pdx_compression_reset();
> > +            printk(XENLOG_WARNING
> > +                   "PFN compression disabled, RAM region [%#" PRIpaddr ", %#"
> > +                   PRIpaddr "] not covered\n",
> > +                   mem->bank[bank].start,
> > +                   mem->bank[bank].start + mem->bank[bank].size - 1);
> 
> ... like this. But it's not written down anywhere, so I guess I shouldn't
> insist.
> 
> And then - isn't the use of PFN_UP() and PFN_DOWN() the wrong way round?
> Partial pages aren't usable anyway, so the smaller range is what matters
> for every individual bank. However, for two contiguous banks (no idea
> whether Arm would fold such into a single one, like we do with same-type
> E820 regions on x86) this gets more complicated then.

I think it's safer to always attempt to cover the wider range, even if
the first and last pages are not fully covered, and shouldn't be used
as RAM.  Like you said it will get more complicated if ranges are
contiguous but the start and end are not page aligned.
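
To make the difference between the two roundings concrete, here is a small sketch. The pages_outer() and pages_inner() helpers are hypothetical names introduced for the example, and a 4K page size is assumed; PFN_UP() and PFN_DOWN() are redefined locally so the snippet stands alone.

```c
#include <stdint.h>

#define PAGE_SHIFT  12                      /* assumption: 4K pages */
#define PAGE_SIZE   (UINT64_C(1) << PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

typedef uint64_t paddr_t;

/* Wider range: every page touched by [start, start + size). */
static inline unsigned long pages_outer(paddr_t start, paddr_t size)
{
    return PFN_UP(start + size) - PFN_DOWN(start);
}

/* Narrower range: only pages fully contained in [start, start + size). */
static inline unsigned long pages_inner(paddr_t start, paddr_t size)
{
    const paddr_t first = PFN_UP(start), last = PFN_DOWN(start + size);

    return last > first ? last - first : 0;
}
```

For a bank at 0x1800 of size 0x2000 the outer count is 3 pages and the inner count is 1. If a second bank follows directly at 0x3800 with size 0x1800, rounding each bank inwards on its own yields 1 + 1 pages, while the merged range [0x1800, 0x5000) contains 3 full pages: the page shared by the two banks is lost. That is the complication with contiguous unaligned banks mentioned above, and covering the wider range sidesteps it.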

> > @@ -299,19 +295,29 @@ void __init srat_parse_regions(paddr_t addr)
> >  
> >  	/* Set "PXM" as early as feasible. */
> >  	numa_fw_nid_name = "PXM";
> > -	srat_region_mask = pdx_init_mask(addr);
> >  	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> >  			      srat_parse_region, 0);
> >  
> > -	for (mask = srat_region_mask, i = 0; mask && i < e820.nr_map; i++) {
> > +	pfn_pdx_compression_setup(addr);
> > +
> > +	/* Ensure all RAM ranges in the e820 are covered. */
> > +	for (i = 0; i < e820.nr_map; i++) {
> >  		if (e820.map[i].type != E820_RAM)
> >  			continue;
> >  
> > -		if (~mask & pdx_region_mask(e820.map[i].addr, e820.map[i].size))
> > -			mask = 0;
> > +		if (!pdx_is_region_compressible(e820.map[i].addr,
> > +		    PFN_UP(e820.map[i].addr + e820.map[i].size) -
> > +		    PFN_DOWN(e820.map[i].addr)))
> 
> Indentation is off here in any event, i.e. irrespective of my earlier
> remark.

Hm, yes, I've made a mess with indentation here.

> 
> > --- a/xen/common/pdx.c
> > +++ b/xen/common/pdx.c
> > @@ -19,6 +19,7 @@
> >  #include <xen/mm.h>
> >  #include <xen/bitops.h>
> >  #include <xen/nospec.h>
> > +#include <xen/pfn.h>
> >  #include <xen/sections.h>
> >  
> >  /**
> > @@ -55,6 +56,44 @@ void set_pdx_range(unsigned long smfn, unsigned long emfn)
> >          __set_bit(idx, pdx_group_valid);
> >  }
> >  
> > +#ifndef CONFIG_PDX_NONE
> > +
> > +#ifdef CONFIG_X86
> > +# include <asm/e820.h>
> > +# define MAX_PFN_RANGES E820MAX
> > +#elif defined(CONFIG_HAS_DEVICE_TREE)
> > +# include <xen/bootfdt.h>
> > +# define MAX_PFN_RANGES NR_MEM_BANKS
> > +#endif
> > +
> > +#ifndef MAX_PFN_RANGES
> > +# error "Missing architecture maximum number of RAM ranges"
> > +#endif
> > +
> > +/* Generic PFN compression helpers. */
> > +static struct pfn_range {
> > +    unsigned long base, size;
> > +} ranges[MAX_PFN_RANGES] __initdata;
> > +static unsigned int __initdata nr_ranges;
> > +
> > +void __init pfn_pdx_add_region(paddr_t base, paddr_t size)
> > +{
> > +    if ( !size )
> > +        return;
> > +
> > +    if ( nr_ranges >= ARRAY_SIZE(ranges) )
> > +    {
> > +        ASSERT((nr_ranges + 1) > nr_ranges);
> 
> This looks overly pessimistic to me. (I won't outright insist on its removal,
> though.)

TBH I've added this later, I don't have a strong opinion either.  I
don't think we usually check for overflows, so I understand this might
look odd.

> > +        nr_ranges++;
> 
> This requires pretty careful use of the variable as an upper bound of loops.
> It's fine in pfn_pdx_compression_setup(), but it feels a little risky.

It does require careful handling in pfn_pdx_compression_setup(), but
it also has the benefit of letting the error message report how many
ranges would be needed for PDX compression to be usable.

Thanks, Roger.
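
The "keep counting past the array bound" pattern discussed above can be sketched as follows. This is an illustration only: the sketch_* names and the bound of 4 are hypothetical, and the patch itself uses nr_ranges together with ARRAY_SIZE(ranges).

```c
#include <stdbool.h>

#define SKETCH_MAX_RANGES 4u /* hypothetical bound for the example */

struct sketch_range {
    unsigned long base, size;
};

struct sketch_range_list {
    struct sketch_range r[SKETCH_MAX_RANGES];
    unsigned int nr; /* counts every request, so it may exceed the bound */
};

/* Record a range; past the bound only the counter advances. */
static inline void sketch_add_range(struct sketch_range_list *l,
                                    unsigned long base, unsigned long size)
{
    if ( !size )
        return;

    if ( l->nr < SKETCH_MAX_RANGES )
    {
        l->r[l->nr].base = base;
        l->r[l->nr].size = size;
    }

    l->nr++;
}

/* Consumers must check this before using nr as a loop bound. */
static inline bool sketch_overflowed(const struct sketch_range_list *l)
{
    return l->nr > SKETCH_MAX_RANGES;
}

/* Number of entries that are actually safe to index. */
static inline unsigned int sketch_stored(const struct sketch_range_list *l)
{
    return sketch_overflowed(l) ? SKETCH_MAX_RANGES : l->nr;
}
```

The trade-off is the one described in the exchange: the counter doubles as the "required size" for the warning, but any loop that uses it unchecked would run past the array.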



* Re: [PATCH v2 4/8] pdx: introduce command line compression toggle
  2025-06-24 13:40   ` Jan Beulich
@ 2025-06-25 15:46     ` Roger Pau Monné
  2025-06-25 16:00       ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 15:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Roger Pau Monné, xen-devel

On Tue, Jun 24, 2025 at 03:40:16PM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > Introduce a command line option to allow disabling PDX compression.  The
> > disabling is done by turning pfn_pdx_add_region() into a no-op, so when
> > attempting to initialize the selected compression algorithm the array of
> > ranges to compress is empty.
> 
> While neat, this also feels fragile. It's not obvious that for any
> algorithm pfn_pdx_compression_setup() would leave compression disabled
> when there are zero ranges. In principle, if it was written differently
> for mask compression, there being no ranges could result in compression
> simply squeezing out all of the address bits. Yet as long as we think
> we're going to keep this in mind ...

It seemed to me that nr_ranges == 0 (so no ranges reported) should
result in no compression, for example on x86 this means there's no
SRAT.

> > Signed-off-by: Roger Pau Monné <roger.pau@cloud.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks, Roger.



* Re: [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers
  2025-06-24 13:51   ` Jan Beulich
@ 2025-06-25 15:51     ` Roger Pau Monné
  2025-06-25 16:04       ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 15:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, Shawn Anastasio,
	Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko,
	xen-devel

On Tue, Jun 24, 2025 at 03:51:09PM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > --- /dev/null
> > +++ b/xen/arch/x86/include/asm/pdx.h
> > @@ -0,0 +1,75 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +
> > +#ifndef X86_PDX_H
> > +#define X86_PDX_H
> > +
> > +#ifndef CONFIG_PDX_NONE
> > +
> > +#include <asm/alternative.h>
> > +
> > +/*
> > + * Introduce a macro to avoid repeating the same asm goto block in each helper.
> > + * Note the macro is strictly tied to the code in the helpers.
> > + */
> > +#define PDX_ASM_GOTO_SKIP                           \
> > +    asm_inline goto (                               \
> > +        ALTERNATIVE(                                \
> > +            "",                                     \
> > +            "jmp %l[skip]",                         \
> > +            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
> > +        : : : : skip )
> 
> Did you consider passing the label name as argument to the macro? That way ...
> 
> > +static inline unsigned long pfn_to_pdx(unsigned long pfn)
> > +{
> > +    PDX_ASM_GOTO_SKIP;
> > +
> > +    return pfn_to_pdx_xlate(pfn);
> > +
> > + skip:
> > +    return pfn;
> > +}
> 
> ... the labels here and below then wouldn't look unused.

Yes - that's why I've added the "Note the macro is strictly tied to
the code in the helpers" comment ahead of the macro, and named it as
"GOTO_SKIP" to explicitly reference the label name.  I could pass the
label name however if that's preferred, i.e. PDX_ASM_GOTO(skip).  IMO
it seems a bit redundant since all callers will pass the same label
name.

> The other slight anomaly with this is that we're wasting 2 or 5 bytes of
> code space. Yet I guess that's an acceptable price to pay for keeping the
> actual translation code in C (rather than in assembly).

I wanted to avoid doing the translation in assembly, so I think it's a
fair price to pay.

Thanks, Roger.



* Re: [PATCH v2 6/8] test/pdx: add PDX compression unit tests
  2025-06-24 13:37   ` Anthony PERARD
@ 2025-06-25 15:55     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 15:55 UTC (permalink / raw)
  To: Anthony PERARD
  Cc: xen-devel, Andrew Cooper, Michal Orzel, Jan Beulich, Julien Grall,
	Stefano Stabellini

On Tue, Jun 24, 2025 at 03:37:34PM +0200, Anthony PERARD wrote:
> On Fri, Jun 20, 2025 at 01:11:28PM +0200, Roger Pau Monne wrote:
> > +.PHONY: run
> > +run: $(TARGETS)
> > +ifeq ($(CC),$(HOSTCC))
> > +	for test in $? ; do \
> > +		./$$test ;  \
> > +	done
> 
> You need to add `set -e` or the exit value from the tested binary might
> be ignored. This `run` target only fails if the last test binary returns
> a failure.

Oh, I did see it failing, but with both tests failing at the same time,
which is why I didn't notice that a failure in only the first one
wouldn't be reported.

> > +else
> > +	$(warning HOSTCC != CC, will not run test)
> > +endif
> > +
> > +.PHONY: clean
> > +clean:
> > +	$(RM) -- *.o $(TARGETS) $(DEPS_RM) pdx.c pdx.h
> 
> Is this "pdx.c" left over from a previous version? It doesn't seem to be
> generated by this makefile.

Yeah, it's a leftover from the previous version where I was also
making a local copy of pdx.c.

> > +
> > +pdx.h: $(XEN_ROOT)/xen/include/xen/pdx.h
> > +	sed -E -e '/^#[[:space:]]?include/d' <$< >$@
> 
> Why allow only zero or one space character between '#' and
> "include"? Why not use '*' instead of '?'?

Because that's all we currently use in the header, but I can indeed
switch to * just in case we gain includes with more spaces.

Thanks, Roger.
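
The difference between the two patterns can be checked outside of sed with the POSIX regex API, which implements the same extended regular expression syntax that `sed -E` uses. ere_matches() is a hypothetical helper written for this sketch; regcomp(), regexec() and regfree() are the standard <regex.h> calls.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if 'line' matches the extended regular expression 'pattern'. */
static bool ere_matches(const char *pattern, const char *line)
{
    regex_t re;
    bool hit;

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if ( regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) )
        return false;

    hit = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);

    return hit;
}
```

With "^#[[:space:]]?include" a line such as "#  include <xen/lib.h>" (two spaces) is not matched and would survive the filter, while "^#[[:space:]]*include" matches it regardless of how much whitespace follows the '#'.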



* Re: [PATCH v2 4/8] pdx: introduce command line compression toggle
  2025-06-25 15:46     ` Roger Pau Monné
@ 2025-06-25 16:00       ` Jan Beulich
  2025-06-25 17:45         ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-25 16:00 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Roger Pau Monné, xen-devel

On 25.06.2025 17:46, Roger Pau Monné wrote:
> On Tue, Jun 24, 2025 at 03:40:16PM +0200, Jan Beulich wrote:
>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>> Introduce a command line option to allow disabling PDX compression.  The
>>> disabling is done by turning pfn_pdx_add_region() into a no-op, so when
>>> attempting to initialize the selected compression algorithm the array of
>>> ranges to compress is empty.
>>
>> While neat, this also feels fragile. It's not obvious that for any
>> algorithm pfn_pdx_compression_setup() would leave compression disabled
>> when there are zero ranges. In principle, if it was written differently
>> for mask compression, there being no ranges could result in compression
>> simply squeezing out all of the address bits. Yet as long as we think
>> we're going to keep this in mind ...
> 
> It seemed to me that nr_ranges == 0 (so no ranges reported) should
> result in no compression, for example on x86 this means there's no
> SRAT.

Just to mention it: While in the pfn_pdx_compression_setup() flavor in
patch 3 there's no explicit check (hence the logic is assumed to be
coping with that situation), the one introduced in the last patch does
have such an explicit check. Apparently there the logic doesn't cleanly
cover that case all by itself.

Jan



* Re: [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers
  2025-06-25 15:51     ` Roger Pau Monné
@ 2025-06-25 16:04       ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-06-25 16:04 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, Shawn Anastasio,
	Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko,
	xen-devel

On 25.06.2025 17:51, Roger Pau Monné wrote:
> On Tue, Jun 24, 2025 at 03:51:09PM +0200, Jan Beulich wrote:
>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/x86/include/asm/pdx.h
>>> @@ -0,0 +1,75 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +
>>> +#ifndef X86_PDX_H
>>> +#define X86_PDX_H
>>> +
>>> +#ifndef CONFIG_PDX_NONE
>>> +
>>> +#include <asm/alternative.h>
>>> +
>>> +/*
>>> + * Introduce a macro to avoid repeating the same asm goto block in each helper.
>>> + * Note the macro is strictly tied to the code in the helpers.
>>> + */
>>> +#define PDX_ASM_GOTO_SKIP                           \
>>> +    asm_inline goto (                               \
>>> +        ALTERNATIVE(                                \
>>> +            "",                                     \
>>> +            "jmp %l[skip]",                         \
>>> +            ALT_NOT(X86_FEATURE_PDX_COMPRESSION))   \
>>> +        : : : : skip )
>>
>> Did you consider passing the label name as argument to the macro? That way ...
>>
>>> +static inline unsigned long pfn_to_pdx(unsigned long pfn)
>>> +{
>>> +    PDX_ASM_GOTO_SKIP;
>>> +
>>> +    return pfn_to_pdx_xlate(pfn);
>>> +
>>> + skip:
>>> +    return pfn;
>>> +}
>>
>> ... the labels here and below then wouldn't look unused.
> 
> Yes - that's why I've added the "Note the macro is strictly tied to
> the code in the helpers" comment ahead of the macro, and named it as
> "GOTO_SKIP" to explicitly reference the label name.  I could pass the
> label name however if that's preferred, i.e. PDX_ASM_GOTO(skip).  IMO
> it seems a bit redundant since all callers will pass the same label
> name.

Well, that comment isn't necessarily "in sight" when looking at the
functions using the macro. Personally I'd favor passing the label as
an argument; indeed I think we would better try to do away with other
such macros where inputs are implicit. Yes, there may be cases where
that's hard or getting unwieldy. But the one here isn't one of them.

Jan
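
The shape of a helper using a macro that takes the label as an explicit argument can be sketched in portable C. Note the substitution: the asm goto plus ALTERNATIVE() patching from the patch is replaced here by a plain run-time flag, purely so the example compiles anywhere; all goto_* and GOTO_* names are hypothetical.

```c
#include <stdbool.h>

/* Stand-in for the patched-in jump: false means "compression not in use". */
static bool goto_compression_enabled;

/*
 * The jump target is an explicit argument, so the label used by each caller
 * is visible at the call site instead of being implied by the macro name.
 */
#define GOTO_IF_NO_COMPRESSION(label) do {  \
    if ( !goto_compression_enabled )        \
        goto label;                         \
} while ( 0 )

/* Hypothetical translation, standing in for pfn_to_pdx_xlate(). */
static unsigned long goto_xlate(unsigned long pfn)
{
    return pfn - 0x1000;
}

static unsigned long goto_pfn_to_pdx(unsigned long pfn)
{
    GOTO_IF_NO_COMPRESSION(skip);

    return goto_xlate(pfn);

 skip:
    return pfn;
}
```

A reader of goto_pfn_to_pdx() can see where "skip" is referenced without having to look up the macro definition, which is the readability point made above.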



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-24 16:16   ` Jan Beulich
@ 2025-06-25 16:24     ` Roger Pau Monné
  2025-06-26  7:35       ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 16:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On Tue, Jun 24, 2025 at 06:16:15PM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > With the appearance of Intel Sierra Forest and Granite Rapids it's now
> > possible to get a production x86 host with the following memory map:
> > 
> > SRAT: Node 0 PXM 0 [0000000000000000, 000000007fffffff]
> > SRAT: Node 0 PXM 0 [0000000100000000, 000000807fffffff]
> > SRAT: Node 1 PXM 1 [0000063e80000000, 000006be7fffffff]
> > SRAT: Node 2 PXM 2 [00000c7e80000000, 00000cfe7fffffff]
> > SRAT: Node 3 PXM 3 [000012be80000000, 0000133e7fffffff]
> > 
> > This is from a four socket Granite Rapids system, with each node having
> > 512GB of memory.  The total amount of RAM on the system is 2TB, but without
> > enabling CONFIG_BIGMEM the last range is not accessible, as it's above the
> > 16TB boundary covered by the frame table. Sierra Forest and Granite Rapids
> > are socket compatible, however Sierra Forest only supports 2 socket
> > configurations, while Granite Rapids can go up to 8 sockets.
> > 
> > Note that while the memory map is very sparse, it couldn't be compressed
> > using the current PDX_MASK compression algorithm, which relies on all
> > ranges having a shared zeroed region of bits that can be removed.
> > 
> > The memory map presented above has the property of all regions being
> > similarly spaced between each other, and all having also a similar size.
> > Use a lookup table to store the offsets to translate from/to PFN and PDX
> > spaces.  Such a table is indexed based on the input PFN or PDX to be
> > translated.  The example PFN layout above would get compressed using the
> > following:
> > 
> > PFN compression using PFN lookup table shift 29 and PDX region size 0x10000000
> >  range 0 [0000000000000, 0x0000807ffff] PFN IDX  0 : 0000000000000
> >  range 1 [0x00063e80000, 0x0006be7ffff] PFN IDX  3 : 0x00053e80000
> >  range 2 [0x000c7e80000, 0x000cfe7ffff] PFN IDX  6 : 0x000a7e80000
> >  range 3 [0x0012be80000, 0x00133e7ffff] PFN IDX  9 : 0x000fbe80000
> > 
> > Note how the two ranges belonging to node 0 get merged into a single PDX
> > region by the compression algorithm.
> > 
> > The default size of lookup tables currently set in Kconfig is 64 entries,
> > and the example memory map consumes 10 entries.  Such memory map is from a
> > 4 socket Granite Rapids host, which in theory supports up to 8 sockets
> > according to Intel documentation.  Assuming the layout of an 8 socket system
> > is similar to the 4 socket one, it would require 21 lookup table entries to
> > support it, way below the current default of 64 entries.
> > 
> > The valid range of lookup table size is currently restricted from 1 to 512
> > elements in Kconfig.
> > 
> > Unused lookup table entries are set to all ones (~0UL), so that we can
> > detect whether a pfn or pdx is valid just by checking whether its
> > translation is bi-directional.  The saturated offsets will prevent the
> > translation from being bidirectional if the lookup table entry is not
> > valid.
> 
> Right, yet with the sad effect of still leaving almost half the space unused.
> I guess that's pretty much unavoidable though in this scheme, as long as we
> want the backwards translation to also be "simple" (and in particular not
> involving a loop of any kind).
> 
> > --- a/CHANGELOG.md
> > +++ b/CHANGELOG.md
> > @@ -20,6 +20,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
> >       grant table or foreign memory.
> >  
> >  ### Added
> > + - Introduce new PDX compression algorithm to cope with Intel Sapphire and
> > +   Granite Rapids having sparse memory maps.
> 
> In the description you updated to mention Sierra Forest instead, but here you
> didn't.

Bah, my bad.  It's Sierra Forest and Granite Rapids, not Sapphire.
I've got confused with the names.

> > --- a/tools/tests/pdx/harness.h
> > +++ b/tools/tests/pdx/harness.h
> > @@ -44,8 +44,10 @@
> >  
> >  #define MAX_RANGES 8
> >  #define MAX_PFN_RANGES MAX_RANGES
> > +#define CONFIG_PDX_OFFSET_TLB_ORDER 6
> >  
> >  #define ASSERT assert
> > +#define ASSERT_UNREACHABLE() assert(0);
> 
> Nit: Stray semicolon.
> 
> > @@ -66,6 +68,8 @@ static inline unsigned int find_next(
> >  #define find_next_zero_bit(a, s, o) find_next(a, s, o, false)
> >  #define find_next_bit(a, s, o)      find_next(a, s, o, true)
> >  
> > +#define flsl(x) ((x) ? BITS_PER_LONG - __builtin_clzl(x) : 0)
> 
> While this is perhaps indeed good enough for a testing utility, ...
> 
> > @@ -75,6 +79,12 @@ static inline unsigned int find_next(
> >  
> >  typedef uint64_t paddr_t;
> >  
> > +#define sort(elem, nr, size, cmp, swp) {                                \
> > +    /* Consume swp() so compiler doesn't complain it's unused. */       \
> > +    (void)swp;                                                          \
> > +    qsort(elem, nr, size, cmp);                                         \
> > +}
> 
> ... this I think wants to use either do/while or ({ }).

OK.  Given its limited, test-only usage I assumed it was fine like
this, but I certainly don't mind adjusting.
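
A do/while variant of the wrapper could look like the sketch below. The names (sketch_sort, cmp_ulong, swp_unused, sort_if_needed) are made up for the example; only qsort() is a real library call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Wrapped in do { } while ( 0 ) so that "sketch_sort(...);" is one statement.
 * With bare braces, the ';' after the invocation becomes a separate empty
 * statement and detaches a following 'else' from its 'if'.
 */
#define sketch_sort(elem, nr, size, cmp, swp) do {                  \
    /* Consume swp so the compiler doesn't complain it's unused. */ \
    (void)(swp);                                                    \
    qsort(elem, nr, size, cmp);                                     \
} while ( 0 )

static int cmp_ulong(const void *a, const void *b)
{
    const unsigned long l = *(const unsigned long *)a;
    const unsigned long r = *(const unsigned long *)b;

    return (l > r) - (l < r);
}

/* Never called: qsort() does its own swapping. */
static void swp_unused(void *a, void *b, size_t size)
{
    (void)a;
    (void)b;
    (void)size;
}

/* Only compiles because the macro expands to a single statement. */
static bool sort_if_needed(unsigned long *v, size_t nr)
{
    if ( nr > 1 )
        sketch_sort(v, nr, sizeof(*v), cmp_ulong, swp_unused);
    else
        return false;

    return true;
}
```

The if/else in sort_if_needed() is exactly the construct that fails to build with the brace-only form of the macro.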

> > --- a/xen/common/Kconfig
> > +++ b/xen/common/Kconfig
> > @@ -54,7 +54,8 @@ config EVTCHN_FIFO
> >  
> >  choice
> >  	prompt "PDX (Page inDeX) compression"
> > -	default PDX_MASK_COMPRESSION if !X86 && !RISCV
> > +	default PDX_OFFSET_COMPRESSION if X86
> > +	default PDX_MASK_COMPRESSION if !RISCV
> >  	default PDX_NONE
> >  	help
> >  	  PDX compression is a technique designed to reduce the memory
> > @@ -73,12 +74,30 @@ config PDX_MASK_COMPRESSION
> >  	help
> >  	  Compression relying on all RAM addresses sharing a zeroed bit region.
> >  
> > +config PDX_OFFSET_COMPRESSION
> > +	bool "Offset compression"
> > +	help
> > +	  Compression relying on size and distance between RAM regions being
> > +	  compressible using an offset lookup table.
> > +
> >  config PDX_NONE
> >  	bool "None"
> >  	help
> >  	  No compression
> >  endchoice
> >  
> > +config PDX_OFFSET_TLB_ORDER
> 
> Please can we avoid the term "TLB" in the name? What we commonly call a TLB

It should have been TBL_ORDER, not TLB.  My finger memory is too used
to typing TLB I think.

> is somewhat different. In fact is there anything wrong with just
> PDX_OFFSET_ORDER?

I've assumed that would be seen as too short and not descriptive
enough.  If that's fine I will switch it.

> > +	int "PDX offset compression lookup table order" if EXPERT
> > +	depends on PDX_OFFSET_COMPRESSION
> > +	default 6
> > +	range 0 9
> 
> Is 0 really a sensible lower bound? There's not going to be any compression
> then, I suppose?

No, you can still compress a single range if its start is offset from 0.
See the following example in the test file:

/* Single range not starting at 0. */
{
    .ranges = {
        { .start = (1 << MAX_ORDER) * 10,
          .end   = (1 << MAX_ORDER) * 11 },
    },
    .compress = true,
},

Which results in:

PFN compression using PFN lookup table shift 63 and PDX region size 0x40000
 range 0 [0x00000280000, 0x000002bffff] PFN IDX   0 : 0x00000280000

> > --- a/xen/common/pdx.c
> > +++ b/xen/common/pdx.c
> > @@ -24,6 +24,7 @@
> >  #include <xen/param.h>
> >  #include <xen/pfn.h>
> >  #include <xen/sections.h>
> > +#include <xen/sort.h>
> >  
> >  /**
> >   * Maximum (non-inclusive) usable pdx. Must be
> > @@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
> >  
> >  #ifdef CONFIG_PDX_MASK_COMPRESSION
> >      invalid |= mfn & pfn_hole_mask;
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> > +    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));
> 
> Hmm, that's pretty expensive already. Involving two (presumably back-to-back)
> JMPs when compression isn't enabled.

There's a conditional with evaluate_nospec() below, so I think the
JMPs are unlikely to make much difference?  Otherwise I would need to
check the index in the lookup table, and possibly introduce a new
variable to store the PDX region size to ensure it also fits in there.

Overall I think it's more complex for possibly little benefit given
the current code in mfn_valid() anyway.
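
For illustration, the round-trip check can be tried against the example layout from the commit message (lookup shift 29, PDX region size 0x10000000, four merged ranges). This is a sketch under assumptions, not the patch code: the rt_* names are hypothetical, the PDX shift of 28 is derived here from the region size, and the table size of 64 follows the default mentioned in the description.

```c
#include <stdbool.h>
#include <stdint.h>

#define GNR_NR_LOOKUP 64u  /* default table size per the description */
#define GNR_PFN_SHIFT 29   /* from the example log */
#define GNR_PDX_SHIFT 28   /* assumption: log2 of PDX region size 0x10000000 */
#define GNR_REGION    (UINT64_C(1) << GNR_PDX_SHIFT)
#define GNR_IDX(v, s) (((v) >> (s)) & (GNR_NR_LOOKUP - 1))

static uint64_t rt_pfn_lookup[GNR_NR_LOOKUP], rt_pdx_lookup[GNR_NR_LOOKUP];

static void rt_setup(void)
{
    /* First PFN of each merged range in the example log. */
    static const uint64_t bases[] = {
        0x0, 0x63e80000, 0xc7e80000, 0x12be80000,
    };
    unsigned int i;

    /* Poison every slot, then fill the ones backed by RAM. */
    for ( i = 0; i < GNR_NR_LOOKUP; i++ )
        rt_pfn_lookup[i] = rt_pdx_lookup[i] = ~UINT64_C(0);

    for ( i = 0; i < sizeof(bases) / sizeof(bases[0]); i++ )
    {
        const uint64_t off = bases[i] - GNR_REGION * i;

        rt_pfn_lookup[GNR_IDX(bases[i], GNR_PFN_SHIFT)] = off;
        rt_pdx_lookup[i] = off;
    }
}

/* A PFN passes if translating to PDX and back returns the same PFN. */
static bool rt_pfn_round_trips(uint64_t pfn)
{
    const uint64_t pdx = pfn - rt_pfn_lookup[GNR_IDX(pfn, GNR_PFN_SHIFT)];
    const uint64_t back = pdx + rt_pdx_lookup[GNR_IDX(pdx, GNR_PDX_SHIFT)];

    return back == pfn;
}
```

The offsets this computes for slots 3, 6 and 9 match the ones in the example log. One caveat about the sketch itself: a PFN for which both the forward and the reverse lookup hit poisoned slots round-trips unchanged here (adding 1 and then subtracting 1), so this check is not sufficient on its own; in the quoted hunk the result is OR-ed into "invalid" alongside the other checks in __mfn_valid().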

> > @@ -290,7 +300,200 @@ void __init pfn_pdx_compression_reset(void)
> >      nr_ranges = 0;
> >  }
> >  
> > -#endif /* CONFIG_PDX_COMPRESSION */
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION) /* CONFIG_PDX_MASK_COMPRESSION */
> > +
> > +unsigned long __ro_after_init pfn_pdx_lookup[CONFIG_PDX_NR_LOOKUP];
> > +unsigned int __ro_after_init pfn_index_shift;
> > +
> > +unsigned long __ro_after_init pdx_pfn_lookup[CONFIG_PDX_NR_LOOKUP];
> > +unsigned int __ro_after_init pdx_index_shift;
> 
> For slightly better cache locality when only a few array indexes are in
> use, may I suggest to put the indexes ahead of the arrays? Perhaps even
> together, as they both take up a single unsigned long slot.

Can do, yes.

> > +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> > +{
> > +    unsigned long pfn = PFN_DOWN(base);
> > +
> > +    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);
> 
> Aiui for this to be correct, there need to be gaps between the ranges
> covered by individual lookup table slots. In the setup logic you have a
> check commented "Avoid compression if there's no gain", but that doesn't
> look to guarantee gaps everywhere (nor would pfn_offset_sanitize_ranges()
> appear to)?

But if there are no gaps, the full region is covered correctly, and
hence it's compressible?

Maybe I'm missing something, could you maybe provide an example that
would exhibit this issue?

> > +static void __init cf_check swp_node(void *a, void *b, size_t size)
> > +{
> > +    struct pfn_range *l = a;
> > +    struct pfn_range *r = b;
> > +    struct pfn_range tmp = *l;
> > +
> > +    *l = *r;
> > +    *r = tmp;
> > +}
> 
> Any reason you effectively open-code SWAP() here?

Lack of knowledge :).

> > +static bool __init pfn_offset_sanitize_ranges(void)
> > +{
> > +    unsigned int i = 0;
> > +
> > +    if ( nr_ranges == 1 )
> > +    {
> > +        ASSERT(PFN_TBL_IDX_VALID(ranges[0].base));
> > +        ASSERT(PFN_TBL_IDX(ranges[0].base) ==
> > +               PFN_TBL_IDX(ranges[0].base + ranges[0].size - 1));
> > +        return true;
> > +    }
> > +
> > +    /* Sort nodes by start address. */
> > +    sort(ranges, nr_ranges, sizeof(struct pfn_range), cmp_node, swp_node);
> 
> Better sizeof(*ranges) or sizeof(ranges[0])?
> 
> > +bool __init pfn_pdx_compression_setup(paddr_t base)
> > +{
> > +    unsigned long size = 0, mask = PFN_DOWN(pdx_init_mask(base));
> > +    unsigned int i;
> > +
> > +    if ( !nr_ranges )
> > +        return false;
> 
> Also bail if there's just a single range?

No, you can still compress (and thus reduce the PDX space) if there's
a single range.

> > +    if ( nr_ranges > ARRAY_SIZE(ranges) )
> > +    {
> > +        printk(XENLOG_WARNING
> > +               "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
> > +               nr_ranges, ARRAY_SIZE(ranges));
> > +        return false;
> > +    }
> > +
> > +    for ( i = 0; i < nr_ranges; i++ )
> > +        mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
> > +
> > +    pfn_index_shift = flsl(mask);
> 
> With this ...
> 
> > +    /*
> > +     * Increase the shift as much as possible, removing bits that are equal in
> > +     * all regions, as this allows the usage of smaller indexes, and in turn
> > +     * smaller lookup tables.
> > +     */
> > +    for ( pfn_index_shift = flsl(mask); pfn_index_shift < sizeof(mask) * 8 - 1;
> 
> ... you don't need to do this here another time.

Oh, good catch.  This was ordered differently, and I didn't realize
the duplication after the code movement.

> Also - why the subtraction of 1 in what the shift is compared against? Logic
> below should in principle guarantee we never exit the loop because of the
> conditional above, but if we made it that far it looks like we could as well
> also look at the top bit.

Because for a single range this would otherwise end up with
pfn_index_shift == 64, and thus lead to undefined behavior.

> > +          pfn_index_shift++ )
> > +    {
> > +        const unsigned long bit = ranges[0].base & (1UL << pfn_index_shift);
> > +
> > +        for ( i = 1; i < nr_ranges; i++ )
> > +            if ( bit != (ranges[i].base & (1UL << pfn_index_shift)) )
> > +                break;
> > +        if ( i != nr_ranges )
> > +            break;
> > +    }
> > +
> > +    /* Sort and sanitize ranges. */
> > +    if ( !pfn_offset_sanitize_ranges() )
> > +        return false;
> > +
> > +    /* Calculate PDX region size. */
> > +    for ( i = 0; i < nr_ranges; i++ )
> > +        size = max(size, ranges[i].size);
> > +
> > +    mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
> > +    pdx_index_shift = flsl(mask);
> > +
> > +    /* Avoid compression if there's no gain. */
> > +    if ( (mask + 1) * (nr_ranges - 1) >= ranges[nr_ranges - 1].base )
> > +        return false;
> > +
> > +    /* Poison all lookup table entries ahead of setting them. */
> > +    memset(pfn_pdx_lookup, ~0, sizeof(pfn_pdx_lookup));
> > +    memset(pdx_pfn_lookup, ~0, sizeof(pfn_pdx_lookup));
> 
> Have the arrays have initializers instead?

No, because otherwise early use (before the initialization done here)
of the translation functions would give bogus results.

> > +    for ( i = 0; i < nr_ranges; i++ )
> > +    {
> > +        unsigned int idx = PFN_TBL_IDX(ranges[i].base);
> > +
> > +        pfn_pdx_lookup[idx] = ranges[i].base - (mask + 1) * i;
> > +        pdx_pfn_lookup[i] = pfn_pdx_lookup[idx];
> > +    }
> > +
> > +    printk(XENLOG_INFO
> > +           "PFN compression using PFN lookup table shift %u and PDX region size %#lx\n",
> 
> I'd drop PFN and the latter PDX from this format string.

Ack.

> > +           pfn_index_shift, mask + 1);
> > +
> > +    for ( i = 0; i < nr_ranges; i++ )
> > +        printk(XENLOG_DEBUG
> > +               " range %u [%#013lx, %#013lx] PFN IDX %3lu : %#013lx\n",
> > +               i, ranges[i].base, ranges[i].base + ranges[i].size - 1,
> > +               PFN_TBL_IDX(ranges[i].base),
> > +               pfn_pdx_lookup[PFN_TBL_IDX(ranges[i].base)]);
> 
> Do you really mean this to stay active also in release builds?

I had it guarded with #ifdef CONFIG_DEBUG initially, but later decided
it was worth giving the possibility to print it in release builds if
debug log level is selected.

> Also the outcome of the earlier loop isn't used by the intermediate printk().
> Perhaps join both loops, thus allowing idx to be re-used here?

Hm, yes.  I wanted to first print the message about enabling PFN
compression, and later the compression specific information.  I can
move the message about enabling PFN compression ahead of the loop.

> > +    return true;
> > +}
> > +
> > +void __init pfn_pdx_compression_reset(void)
> > +{
> > +    memset(pfn_pdx_lookup, 0, sizeof(pfn_pdx_lookup));
> > +    memset(pdx_pfn_lookup, 0, sizeof(pfn_pdx_lookup));
> 
> Why not ~0?

Because translation needs to work, if I poison all entries with ~0
translation won't work.

> > --- a/xen/include/xen/pdx.h
> > +++ b/xen/include/xen/pdx.h
> > @@ -65,6 +65,43 @@
> >   * This scheme also holds for multiple regions, where HHHHHHH acts as
> >   * the region identifier and LLLLLL fully contains the span of every
> >   * region involved.
> > + *
> > + * ## PDX offset compression
> > + *
> > + * Alternative compression mechanism that relies on RAM ranges having a similar
> > + * size and offset between them:
> > + *
> > + * PFN address space:
> > + * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
> > + * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
> > + * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
> > + * │<------>│          │
> > + * │  size             │
> > + * │<----------------->│
> > + *         offset
> > + *
> > + * The compression reduces the holes between RAM regions:
> > + *
> > + * PDX address space:
> > + * ┌────────┬───┬────────┬───┐   ┌─┬────────┐
> > + * │ RAM 0  │   │ RAM 1  │   │...│ │ RAM N  │
> > + * ├────────┴───┼────────┴───┘   └─┴────────┘
> > + * │<---------->│
> > + *   pdx region size
> > + *
> > + * The offsets to convert from PFN to PDX and from PDX to PFN are stored in a
> > + * pair of lookup tables, and the index into those tables to find the offset
> > + * for each PFN or PDX is obtained by shifting the to be translated address by
> > + * a specific value calculated at boot:
> > + *
> > + * pdx = pfn - pfn_lookup_table[pfn >> pfn_shift]
> > + * pfn = pdx + pdx_lookup_table[pdx >> pdx_shift]
> 
> I assume it's intentional (for simplicity) that you omit the index masking
> here?

Indeed.  I can add it, but I think the point here is to explain the
algorithm used in a clear way, without implementation details.  I would
consider the masking one of such implementation details.

Thanks, Roger.



* Re: [PATCH v2 4/8] pdx: introduce command line compression toggle
  2025-06-25 16:00       ` Jan Beulich
@ 2025-06-25 17:45         ` Roger Pau Monné
  2025-06-26  6:17           ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-25 17:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Roger Pau Monné, xen-devel

On Wed, Jun 25, 2025 at 06:00:48PM +0200, Jan Beulich wrote:
> On 25.06.2025 17:46, Roger Pau Monné wrote:
> > On Tue, Jun 24, 2025 at 03:40:16PM +0200, Jan Beulich wrote:
> >> On 20.06.2025 13:11, Roger Pau Monne wrote:
> >>> Introduce a command line option to allow disabling PDX compression.  The
> >>> disabling is done by turning pfn_pdx_add_region() into a no-op, so when
> >>> attempting to initialize the selected compression algorithm the array of
> >>> ranges to compress is empty.
> >>
> >> While neat, this also feels fragile. It's not obvious that for any
> >> algorithm pfn_pdx_compression_setup() would leave compression disabled
> >> when there are zero ranges. In principle, if it was written differently
> >> for mask compression, there being no ranges could result in compression
> >> simply squeezing out all of the address bits. Yet as long as we think
> >> we're going to keep this in mind ...
> > 
> > It seemed to me that nr_ranges == 0 (so no ranges reported) should
> > result in no compression, for example on x86 this means there's no
> > SRAT.
> 
> Just to mention it: While in the pfn_pdx_compression_setup() flavor in
> patch 3 there's no explicit check (hence the logic is assumed to be
> coping with that situation),

If you prefer I can leave the pfn_pdx_compression_setup() as-is in
patch 3, as AFAICT that implementation does cope with nr_ranges == 0,
that would result in a mask with just the low bits set, and hence
hole_shift will be 0.

> the one introduced in the last patch does
> have such an explicit check. Apparently there the logic doesn't cleanly
> cover that case all by itself.

No, I don't think the logic in patch 8 will cope nicely with nr_ranges
== 0, it seems to me at least the flsl() against a 0 pdx size mask
would result in an invalid pdx_index_shift given the current logic.

IMO it's best to short-circuit the nr_ranges == 0 case early in the
function, as that avoids complexity.

Thanks, Roger.



* Re: [PATCH v2 4/8] pdx: introduce command line compression toggle
  2025-06-25 17:45         ` Roger Pau Monné
@ 2025-06-26  6:17           ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-06-26  6:17 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Roger Pau Monné, xen-devel

On 25.06.2025 19:45, Roger Pau Monné wrote:
> On Wed, Jun 25, 2025 at 06:00:48PM +0200, Jan Beulich wrote:
>> On 25.06.2025 17:46, Roger Pau Monné wrote:
>>> On Tue, Jun 24, 2025 at 03:40:16PM +0200, Jan Beulich wrote:
>>>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>>>> Introduce a command line option to allow disabling PDX compression.  The
>>>>> disabling is done by turning pfn_pdx_add_region() into a no-op, so when
>>>>> attempting to initialize the selected compression algorithm the array of
>>>>> ranges to compress is empty.
>>>>
>>>> While neat, this also feels fragile. It's not obvious that for any
>>>> algorithm pfn_pdx_compression_setup() would leave compression disabled
>>>> when there are zero ranges. In principle, if it was written differently
>>>> for mask compression, there being no ranges could result in compression
>>>> simply squeezing out all of the address bits. Yet as long as we think
>>>> we're going to keep this in mind ...
>>>
>>> It seemed to me that nr_ranges == 0 (so no ranges reported) should
>>> result in no compression, for example on x86 this means there's no
>>> SRAT.
>>
>> Just to mention it: While in the pfn_pdx_compression_setup() flavor in
>> patch 3 there's no explicit check (hence the logic is assumed to be
>> coping with that situation),
> 
> If you prefer I can leave the pfn_pdx_compression_setup() as-is in
> patch 3, as AFAICT that implementation does cope with nr_ranges == 0,
> that would result in a mask with just the low bits set, and hence
> hole_shift will be 0.
> 
>> the one introduced in the last patch does
>> have such an explicit check. Apparently there the logic doesn't cleanly
>> cover that case all by itself.
> 
> No, I don't think the logic in patch 8 will cope nicely with nr_ranges
> == 0, it seems to me at least the flsl() against a 0 pdx size mask
> would result in an invalid pdx_index_shift given the current logic.
> 
> IMO it's best to short-circuit the nr_ranges == 0 case early in the
> function, as that avoids complexity.

FTAOD - I didn't mean to ask for any change. I merely meant to point out
that already within this series the special use of setting nr_ranges to
zero requires (a tiny bit of) extra care. But yes, since nr_ranges can
also end up being zero for other reasons, that bit of care is necessary
anyway.

Jan



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-25 16:24     ` Roger Pau Monné
@ 2025-06-26  7:35       ` Jan Beulich
  2025-06-27 14:51         ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-26  7:35 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On 25.06.2025 18:24, Roger Pau Monné wrote:
> On Tue, Jun 24, 2025 at 06:16:15PM +0200, Jan Beulich wrote:
>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>> --- a/xen/common/Kconfig
>>> +++ b/xen/common/Kconfig
>>> @@ -54,7 +54,8 @@ config EVTCHN_FIFO
>>>  
>>>  choice
>>>  	prompt "PDX (Page inDeX) compression"
>>> -	default PDX_MASK_COMPRESSION if !X86 && !RISCV
>>> +	default PDX_OFFSET_COMPRESSION if X86
>>> +	default PDX_MASK_COMPRESSION if !RISCV
>>>  	default PDX_NONE
>>>  	help
>>>  	  PDX compression is a technique designed to reduce the memory
>>> @@ -73,12 +74,30 @@ config PDX_MASK_COMPRESSION
>>>  	help
>>>  	  Compression relying on all RAM addresses sharing a zeroed bit region.
>>>  
>>> +config PDX_OFFSET_COMPRESSION
>>> +	bool "Offset compression"
>>> +	help
>>> +	  Compression relying on size and distance between RAM regions being
>>> +	  compressible using an offset lookup table.
>>> +
>>>  config PDX_NONE
>>>  	bool "None"
>>>  	help
>>>  	  No compression
>>>  endchoice
>>>  
>>> +config PDX_OFFSET_TLB_ORDER
>>
>> Please can we avoid the term "TLB" in the name? What we commonly call a TLB
> 
> > It should have been TBL_ORDER, not TLB.  My finger memory is too used
> > to typing TLB I think.
> 
>> is somewhat different. In fact is there anything wrong with just
>> PDX_OFFSET_ORDER?
> 
> I've assumed that would be seen as too short and not descriptive
> enough.  If that's fine I will switch it.

Oh, TBL is fine with me. And perhaps indeed better, for being more precise.

>>> +	int "PDX offset compression lookup table order" if EXPERT
>>> +	depends on PDX_OFFSET_COMPRESSION
>>> +	default 6
>>> +	range 0 9
>>
>> Is 0 really a sensible lower bound? There's not going to be any compression
>> then, I suppose?
> 
> > No, you can still compress a single range if its start is offset from 0.
> See the following example in the test file:
> 
> /* Single range not starting at 0. */
> {
>     .ranges = {
>         { .start = (1 << MAX_ORDER) * 10,
>           .end   = (1 << MAX_ORDER) * 11 },
>     },
>     .compress = true,
> },
> 
> Which results in:
> 
> PFN compression using PFN lookup table shift 63 and PDX region size 0x40000
>  range 0 [0x00000280000, 0x000002bffff] PFN IDX   0 : 0x00000280000

Oh, indeed. But: Does this actually work? I'm not only slightly concerned
about PDX 0 (that may indeed be fine), but more so about mfn_valid(). An MFN
below the start of that region will still use index 0, aiui. (With the resulting
underflow, a huge PDX will result.) With PDX_TBL_MASK being zero, and with
there being only a single entry in both tables, the reverse translation will
use that single entry, simply undoing the underflowed subtraction. Hence
mfn_valid() would wrongly return true, afaict. (Thinking about it, the same
issue would appear to occur for MFNs above the sum of that range's start and
the region size.)
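
To make the concern concrete, here is a minimal sketch of the single-range
case from the test output above. It is a model, not the patch code: the
helper names mirror the patch, but the tables are hard-coded with one slot,
TBL_MASK stands in for the zero index mask, pdx_shift 18 is derived from the
0x40000 region size in the log, and only the round-trip term of the
mfn_valid() check is modeled (any other checks in __mfn_valid() are left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Single-slot tables: the index mask is zero, so every lookup hits slot 0. */
#define TBL_MASK 0u

/* Values taken from the "Single range not starting at 0" log output. */
static const uint64_t pfn_lookup[1] = { 0x280000 };
static const uint64_t pdx_lookup[1] = { 0x280000 };
static const unsigned int pfn_shift = 63; /* shift printed in the log */
static const unsigned int pdx_shift = 18; /* assumed: region size 0x40000 */

static uint64_t pfn_to_pdx(uint64_t pfn)
{
    /* Unsigned subtraction wraps around for PFNs below the range start. */
    return pfn - pfn_lookup[(pfn >> pfn_shift) & TBL_MASK];
}

static uint64_t pdx_to_pfn(uint64_t pdx)
{
    /* With a zero mask the same offset is added back, undoing the wrap. */
    return pdx + pdx_lookup[(pdx >> pdx_shift) & TBL_MASK];
}

/* Models only the "mfn ^ pdx_to_pfn(pfn_to_pdx(mfn))" term of the check. */
static bool round_trip_matches(uint64_t mfn)
{
    return (mfn ^ pdx_to_pfn(pfn_to_pdx(mfn))) == 0;
}

/* Self-check: in-range, below-range and above-range PFNs all round-trip. */
static void round_trip_example(void)
{
    assert(round_trip_matches(0x280000)); /* first page of the range */
    assert(round_trip_matches(0x1000));   /* below the range start */
    assert(round_trip_matches(0x2c0000)); /* past the end of the region */
}
```

In this model the round trip alone cannot tell an MFN outside the range
from one inside it, so something else would have to reject such MFNs.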

>>> --- a/xen/common/pdx.c
>>> +++ b/xen/common/pdx.c
>>> @@ -24,6 +24,7 @@
>>>  #include <xen/param.h>
>>>  #include <xen/pfn.h>
>>>  #include <xen/sections.h>
>>> +#include <xen/sort.h>
>>>  
>>>  /**
>>>   * Maximum (non-inclusive) usable pdx. Must be
>>> @@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
>>>  
>>>  #ifdef CONFIG_PDX_MASK_COMPRESSION
>>>      invalid |= mfn & pfn_hole_mask;
>>> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
>>> +    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));
>>
>> Hmm, that's pretty expensive already. Involving two (presumably back-to-back)
>> JMPs when compression isn't enabled.
> 
> There's a conditional with evaluate_nospec() below, so I think the
> JMPs are unlikely to make much difference?

Hard to tell. They still take up decode bandwidth and at least some execution
resources, aiui. But perhaps you're right and that's indeed negligible, or at
least acceptable enough. Especially since, ...

>  Otherwise I would need to
> check the index in the lookup table, and possibly introduce a new
> variable to store the PDX region size to ensure it also fits in there.
> 
> Overall I think it's more complex for possibly little benefit given
> the current code in mfn_valid() anyway.

... as you say, complexity would grow.

>>> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
>>> +{
>>> +    unsigned long pfn = PFN_DOWN(base);
>>> +
>>> +    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);
>>
>> Aiui for this to be correct, there need to be gaps between the ranges
>> covered by individual lookup table slots. In the setup logic you have a
>> check commented "Avoid compression if there's no gain", but that doesn't
>> look to guarantee gaps everywhere (nor would pfn_offset_sanitize_ranges()
>> appear to)?
> 
> But if there are no gaps, the full region is covered correctly, and
> hence it's compressible?

If there's a guarantee that such ranges would be folded into a single one,
all would be fine.

> Maybe I'm missing something, could you maybe provide an example that
> would exhibit this issue?

My understanding is that when there's no gap between regions, and when
[base, base + npages) crosses a region boundary, then the expression
above will yield true when, because of crossing a region boundary, it
ought to be false. Or did I simply misunderstand the purpose of the
pdx_is_region_compressible() invocations?

>>> +    if ( nr_ranges > ARRAY_SIZE(ranges) )
>>> +    {
>>> +        printk(XENLOG_WARNING
>>> +               "Too many PFN ranges (%u > %zu), not attempting PFN compression\n",
>>> +               nr_ranges, ARRAY_SIZE(ranges));
>>> +        return false;
>>> +    }
>>> +
>>> +    for ( i = 0; i < nr_ranges; i++ )
>>> +        mask |= pdx_region_mask(ranges[i].base, ranges[i].size);
>>> +
>>> +    pfn_index_shift = flsl(mask);
>>
>> With this ...
>>
>>> +    /*
>>> +     * Increase the shift as much as possible, removing bits that are equal in
>>> +     * all regions, as this allows the usage of smaller indexes, and in turn
>>> +     * smaller lookup tables.
>>> +     */
>>> +    for ( pfn_index_shift = flsl(mask); pfn_index_shift < sizeof(mask) * 8 - 1;
>>
>> ... you don't need to do this here another time.
> 
> Oh, good catch.  This was ordered differently, and I didn't realize
> the duplication after the code movement.
> 
>> Also - why the subtraction of 1 in what the shift is compared against? Logic
>> below should in principle guarantee we never exit the loop because of the
>> conditional above, but if we made it that far it looks like we could as well
>> also look at the top bit.
> 
> Because for a single range this would otherwise end up with
> pfn_index_shift == 64, and thus lead to undefined behavior.

Hmm, right. Yet then isn't this another reason you need at least two array slots?
At least in the (theoretical) case of paddr_bits == 64? Which raises the question
whether the loop wouldn't better be bounded by paddr_bits anyway.

>>> +          pfn_index_shift++ )
>>> +    {
>>> +        const unsigned long bit = ranges[0].base & (1UL << pfn_index_shift);
>>> +
>>> +        for ( i = 1; i < nr_ranges; i++ )
>>> +            if ( bit != (ranges[i].base & (1UL << pfn_index_shift)) )
>>> +                break;
>>> +        if ( i != nr_ranges )
>>> +            break;
>>> +    }
>>> +
>>> +    /* Sort and sanitize ranges. */
>>> +    if ( !pfn_offset_sanitize_ranges() )
>>> +        return false;
>>> +
>>> +    /* Calculate PDX region size. */
>>> +    for ( i = 0; i < nr_ranges; i++ )
>>> +        size = max(size, ranges[i].size);
>>> +
>>> +    mask = PFN_DOWN(pdx_init_mask(size << PAGE_SHIFT));
>>> +    pdx_index_shift = flsl(mask);
>>> +
>>> +    /* Avoid compression if there's no gain. */
>>> +    if ( (mask + 1) * (nr_ranges - 1) >= ranges[nr_ranges - 1].base )
>>> +        return false;
>>> +
>>> +    /* Poison all lookup table entries ahead of setting them. */
>>> +    memset(pfn_pdx_lookup, ~0, sizeof(pfn_pdx_lookup));
>>> +    memset(pdx_pfn_lookup, ~0, sizeof(pfn_pdx_lookup));
>>
>> Have the arrays have initializers instead?
> 
> No, because otherwise early use (before the initialization done here)
> of the translation functions would give bogus results.

Isn't that true anyway? Before making it here, PDX and PFN have a fixed
(and hence wrong) relationship for the entire number space. And mfn_valid()
would yield true for any input no matter what the initializer, afaict. IOW
early uses look to be invalid anyway.

(Later) Oh, wait - your comment on pfn_pdx_compression_reset() made me
understand: We rely on identity mapping PDX <-> PFN if compression is
disabled, at least until alternatives patching arranges to bypass the
calculations.

>>> +           pfn_index_shift, mask + 1);
>>> +
>>> +    for ( i = 0; i < nr_ranges; i++ )
>>> +        printk(XENLOG_DEBUG
>>> +               " range %u [%#013lx, %#013lx] PFN IDX %3lu : %#013lx\n",
>>> +               i, ranges[i].base, ranges[i].base + ranges[i].size - 1,
>>> +               PFN_TBL_IDX(ranges[i].base),
>>> +               pfn_pdx_lookup[PFN_TBL_IDX(ranges[i].base)]);
>>
>> Do you really mean this to stay active also in release builds?
> 
> I had it guarded with #ifdef CONFIG_DEBUG initially, but later decided
> it was worth giving the possibility to print it in release builds if
> debug log level is selected.
> 
>> Also the outcome of the earlier loop isn't used by the intermediate printk().
>> Perhaps join both loops, thus allowing idx to be re-used here?
> 
> Hm, yes.  I wanted to first print the message about enabling PFN
> compression, and later the compression specific information.  I can
> move the message about enabling PFN compression ahead of the loop.

But that's not what I meant. You can move the body of the earlier loop
into the later one, since - as said - the printk() between the two loops
doesn't use what the first loop does.

>>> --- a/xen/include/xen/pdx.h
>>> +++ b/xen/include/xen/pdx.h
>>> @@ -65,6 +65,43 @@
>>>   * This scheme also holds for multiple regions, where HHHHHHH acts as
>>>   * the region identifier and LLLLLL fully contains the span of every
>>>   * region involved.
>>> + *
>>> + * ## PDX offset compression
>>> + *
>>> + * Alternative compression mechanism that relies on RAM ranges having a similar
>>> + * size and offset between them:
>>> + *
>>> + * PFN address space:
>>> + * ┌────────┬──────────┬────────┬──────────┐   ┌────────┬──────────┐
>>> + * │ RAM 0  │          │ RAM 1  │          │...│ RAM N  │          │
>>> + * ├────────┼──────────┼────────┴──────────┘   └────────┴──────────┘
>>> + * │<------>│          │
>>> + * │  size             │
>>> + * │<----------------->│
>>> + *         offset
>>> + *
>>> + * The compression reduces the holes between RAM regions:
>>> + *
>>> + * PDX address space:
>>> + * ┌────────┬───┬────────┬───┐   ┌─┬────────┐
>>> + * │ RAM 0  │   │ RAM 1  │   │...│ │ RAM N  │
>>> + * ├────────┴───┼────────┴───┘   └─┴────────┘
>>> + * │<---------->│
>>> + *   pdx region size
>>> + *
>>> + * The offsets to convert from PFN to PDX and from PDX to PFN are stored in a
>>> + * pair of lookup tables, and the index into those tables to find the offset
>>> + * for each PFN or PDX is obtained by shifting the to be translated address by
>>> + * a specific value calculated at boot:
>>> + *
>>> + * pdx = pfn - pfn_lookup_table[pfn >> pfn_shift]
>>> + * pfn = pdx + pdx_lookup_table[pdx >> pdx_shift]
>>
>> I assume it's intentional (for simplicity) that you omit the index masking
>> here?
> 
> Indeed.  I can add it, but I think the point here is to explain the
> algorithm used in a clear way, without implementation details.  I would
> consider the masking one of such implementation details.

I see. It's the balancing between simplicity and making the reader (wrongly)
suspect a possible array overrun here (as it happened in my case). Maybe
keep the expressions as they are, but add a few words towards masking in the
text?
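
As an illustration of what the two expressions look like with the index
masking spelled out, a minimal sketch follows. The table size, shifts and
range layout are invented for the example (two ranges of 0x1000 pages at
PFNs 0x10000 and 0x30000); only the offset formula
"base - region_size * i" is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table geometry, chosen only for this illustration. */
#define TBL_ORDER 1
#define TBL_SIZE  (1u << TBL_ORDER)
#define TBL_MASK  (TBL_SIZE - 1)

/* Region size 0x1000: range i gets offset base - 0x1000 * i. */
static const uint64_t pfn_lookup[TBL_SIZE] = { 0x10000, 0x2f000 };
static const uint64_t pdx_lookup[TBL_SIZE] = { 0x10000, 0x2f000 };
static const unsigned int pfn_shift = 17; /* lowest bit where bases differ */
static const unsigned int pdx_shift = 12; /* log2 of the region size */

static uint64_t pfn_to_pdx(uint64_t pfn)
{
    /* Masking keeps the index inside the table for any input. */
    return pfn - pfn_lookup[(pfn >> pfn_shift) & TBL_MASK];
}

static uint64_t pdx_to_pfn(uint64_t pdx)
{
    return pdx + pdx_lookup[(pdx >> pdx_shift) & TBL_MASK];
}

static void translation_example(void)
{
    /* Range 0 maps to PDX [0, 0x1000), range 1 to PDX [0x1000, 0x2000). */
    assert(pfn_to_pdx(0x10000) == 0);
    assert(pfn_to_pdx(0x30000) == 0x1000);
    assert(pdx_to_pfn(pfn_to_pdx(0x30abc)) == 0x30abc);
}
```

The mask only guarantees that the table access stays in bounds; it does not
by itself make the result meaningful for addresses outside the ranges.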

Jan



* Re: [PATCH v2 2/8] kconfig: turn PDX compression into a choice
  2025-06-24 13:13   ` Jan Beulich
@ 2025-06-26  7:49     ` Roger Pau Monné
  2025-06-26 12:33       ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-26  7:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Tue, Jun 24, 2025 at 03:13:27PM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
> > and make it part of the PDX compression choice block, in preparation for
> > adding further PDX compression algorithms.
> > 
> > No functional change intended as the PDX compression defaults should still
> > be the same for all architectures, however the choice block cannot be
> > protected under EXPERT and still have a default choice being
> > unconditionally selected.  As a result, the new "PDX (Page inDeX)
> > compression" item will be unconditionally visible in Kconfig.
> 
> Just to mention it: Afaict there is a functional change, but one I actually
> appreciate, at least in part. So far ...
> 
> > --- a/xen/common/Kconfig
> > +++ b/xen/common/Kconfig
> > @@ -52,9 +52,10 @@ config EVTCHN_FIFO
> >  
> >  	  If unsure, say Y.
> >  
> > -config PDX_COMPRESSION
> > -	bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
> > -	default ARM || PPC
> 
> ... for x86 (and RISC-V) this option couldn't be selected. Whereas ...
> 
> > @@ -67,6 +68,17 @@ config PDX_COMPRESSION
> >  	  If your platform does not have sparse RAM banks, do not enable PDX
> >  	  compression.
> >  
> > +config PDX_MASK_COMPRESSION
> > +	bool "Mask compression"
> > +	help
> > +	  Compression relying on all RAM addresses sharing a zeroed bit region.
> 
> ... this option is now available, as the prior !X86 && !RISCV doesn't
> re-appear here. (As the description mentions it, that dependency clearly
> can't appear on the enclosing choice itself.) Since x86 actually still
> should have mask compression implemented properly, that's fine (from my
> pov; iirc I even asked that it would have remained available when the
> earlier change was done), whereas I think for RISC-V it's not quite right
> to offer the option. It also did escape me why the option was made
> available for PPC, which I'm pretty sure also lacks the logic to determine
> a suitable mask.

Yes, the only architectures that have functional PDX compression are
x86 and ARM, as neither RISC-V nor PowerPC call the initialization
functions.  AFAICT this is harmless apart from giving the wrong
impression to the user that PDX compression might be implemented.

Would you prefer for me to introduce a new HAS_PDX config option
that's selected by x86 and ARM, and is used to enable the choice PDX
config?

Thanks, Roger.



* Re: [PATCH v2 2/8] kconfig: turn PDX compression into a choice
  2025-06-26  7:49     ` Roger Pau Monné
@ 2025-06-26 12:33       ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-06-26 12:33 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 26.06.2025 09:49, Roger Pau Monné wrote:
> On Tue, Jun 24, 2025 at 03:13:27PM +0200, Jan Beulich wrote:
>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>> Rename the current CONFIG_PDX_COMPRESSION to CONFIG_PDX_MASK_COMPRESSION,
>>> and make it part of the PDX compression choice block, in preparation for
>>> adding further PDX compression algorithms.
>>>
>>> No functional change intended as the PDX compression defaults should still
>>> be the same for all architectures, however the choice block cannot be
>>> protected under EXPERT and still have a default choice being
>>> unconditionally selected.  As a result, the new "PDX (Page inDeX)
>>> compression" item will be unconditionally visible in Kconfig.
>>
>> Just to mention it: Afaict there is a functional change, but one I actually
>> appreciate, at least in part. So far ...
>>
>>> --- a/xen/common/Kconfig
>>> +++ b/xen/common/Kconfig
>>> @@ -52,9 +52,10 @@ config EVTCHN_FIFO
>>>  
>>>  	  If unsure, say Y.
>>>  
>>> -config PDX_COMPRESSION
>>> -	bool "PDX (Page inDeX) compression" if EXPERT && !X86 && !RISCV
>>> -	default ARM || PPC
>>
>> ... for x86 (and RISC-V) this option couldn't be selected. Whereas ...
>>
>>> @@ -67,6 +68,17 @@ config PDX_COMPRESSION
>>>  	  If your platform does not have sparse RAM banks, do not enable PDX
>>>  	  compression.
>>>  
>>> +config PDX_MASK_COMPRESSION
>>> +	bool "Mask compression"
>>> +	help
>>> +	  Compression relying on all RAM addresses sharing a zeroed bit region.
>>
>> ... this option is now available, as the prior !X86 && !RISCV doesn't
>> re-appear here. (As the description mentions it, that dependency clearly
>> can't appear on the enclosing choice itself.) Since x86 actually still
>> should have mask compression implemented properly, that's fine (from my
>> pov; iirc I even asked that it would have remained available when the
>> earlier change was done), whereas I think for RISC-V it's not quite right
>> to offer the option. It also did escape me why the option was made
>> available for PPC, which I'm pretty sure also lacks the logic to determine
>> a suitable mask.
> 
> Yes, the only architectures that have functional PDX compression are
> x86 and ARM, as neither RISC-V nor PowerPC call the initialization
> functions.  AFAICT this is harmless apart from giving the wrong
> impression to the user that PDX compression might be implemented.
> 
> Would you prefer for me to introduce a new HAS_PDX config option
> that's selected by x86 and ARM, and is used to enable the choice PDX
> config?

Hmm, no, I don't think I want you to make any change to the code. I'm
actually happy with the slight relaxation for x86 (and RISC-V), and
aiui you don't alter behavior for PPC. The fact that behavior there
(and for RISC-V) doesn't look quite right isn't an effect of your
change.

A change may be wanted to the description, to avoid giving the wrong
(afaict) impression of this being "no functional change". Considering
how things ended up the way they are prior to this series, this
becoming explicit may cause _others_ to want you to make changes,
though. Hence I simply wanted to raise that aspect, to give others a
hint that they may need to chime in.

For the record, below is what I think would represent original
behavior ("help" parts omitted), albeit still leaving out the EXPERT
aspect (as it's not clear to me what a condition on a prompt means in
a choice element):

choice
	prompt "PDX (Page inDeX) compression"
	default PDX_MASK_COMPRESSION if !X86 && !RISCV
	default PDX_NONE

config PDX_MASK_COMPRESSION
	bool "Mask compression"
	depends on !X86 && !RISCV

config PDX_NONE
	bool "None"
endchoice

But again, specifically for x86 I'd prefer if PDX_MASK_COMPRESSION
became available (again), so the above is not a suggestion to change
the code, unless others insisted on restoring prior behavior.

Jan



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-26  7:35       ` Jan Beulich
@ 2025-06-27 14:51         ` Roger Pau Monné
  2025-06-29 14:36           ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-27 14:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On Thu, Jun 26, 2025 at 09:35:04AM +0200, Jan Beulich wrote:
> On 25.06.2025 18:24, Roger Pau Monné wrote:
> > On Tue, Jun 24, 2025 at 06:16:15PM +0200, Jan Beulich wrote:
> >> On 20.06.2025 13:11, Roger Pau Monne wrote:
> >>> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> >>> +{
> >>> +    unsigned long pfn = PFN_DOWN(base);
> >>> +
> >>> +    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);
> >>
> >> Aiui for this to be correct, there need to be gaps between the ranges
> >> covered by individual lookup table slots. In the setup logic you have a
> >> check commented "Avoid compression if there's no gain", but that doesn't
> >> look to guarantee gaps everywhere (nor would pfn_offset_sanitize_ranges()
> >> appear to)?
> > 
> > But if there are no gaps, the full region is covered correctly, and
> > hence it's compressible?
> 
> If there's a guarantee that such ranges would be folded into a single one,
> all would be fine.
> 
> > Maybe I'm missing something, could you maybe provide an example that
> > would exhibit this issue?
> 
> My understanding is that when there's no gap between regions, and when
> [base, base + npages) crosses a region boundary, then the expression
> above will yield true when, because of crossing a region boundary, it
> ought to be false. Or did I simply misunderstand the purpose of the
> pdx_is_region_compressible() invocations?

If there's no gap between the regions it's IMO intended for
pdx_is_region_compressible() to return true, as the whole region is
continuous in both the PFN and PDX spaces, and hence compressible
(even if it spans multiple regions).

But maybe I'm not understanding your point correctly, could you maybe
provide an example if you disagree with my reply above?  Sorry if I'm
being dull, with this compression stuff it's sometimes hard for me to
visualize the case you are trying to make without a concrete
example.

Thanks, Roger.



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
                   ` (7 preceding siblings ...)
       [not found] ` <20250620111130.29057-4-roger.pau@citrix.com>
@ 2025-06-28  2:08 ` Stefano Stabellini
  2025-06-30 15:02   ` Roger Pau Monné
  2025-07-03  8:42   ` Roger Pau Monné
  8 siblings, 2 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-06-28  2:08 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager,
	sstabellini

Hi Roger,

We have an ARM board with the following memory layout:

0x0-0x80000000, 0, 2GB
0x800000000-0x880000000, 32GB, 2GB
0x50000000000-0x50080000000, 5TB, 2GB
0x60000000000-0x60080000000, 6TB, 2GB
0x70000000000-0x70080000000, 7TB, 2GB

It looks like your PDX series is exactly what we need.  However, I tried
to use it and it doesn't seem to be hooked properly on ARM yet. I spent
some time trying to fix it but I was unsuccessful.

As far as I can tell the following functions need to be adjusted but I
am not sure the list is comprehensive:

xen/arch/arm/include/asm/mmu/mm.h:maddr_to_virt
xen/arch/arm/mmu/mm.c:setup_frametable_mappings
xen/arch/arm/setup.c:init_pdx

Cheers,

Stefano

On Fri, 20 Jun 2025, Roger Pau Monne wrote:
> Hello,
> 
> This series implements a new PDX compression algorithm to cope with the
> sparse memory maps found on the Intel Sapphire/Granite Rapids.
> 
> Patches 1 to 7 prepare the existing code to make it easier to introduce
> a new PDX compression, including generalizing the initialization and
> setup functions and adding a unit test for PDX compression.
> 
> Patch 8 introduces the new compression.  The new compression is only
> enabled by default on x86, other architectures are left with their
> previous defaults.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (8):
>   x86/pdx: simplify calculation of domain struct allocation boundary
>   kconfig: turn PDX compression into a choice
>   pdx: provide a unified set of unit functions
>   pdx: introduce command line compression toggle
>   pdx: allow per-arch optimization of PDX conversion helpers
>   test/pdx: add PDX compression unit tests
>   pdx: move some helpers in preparation for new compression
>   pdx: introduce a new compression algorithm based on region offsets
> 
>  CHANGELOG.md                           |   3 +
>  docs/misc/xen-command-line.pandoc      |   9 +
>  tools/tests/Makefile                   |   1 +
>  tools/tests/pdx/.gitignore             |   3 +
>  tools/tests/pdx/Makefile               |  49 ++++
>  tools/tests/pdx/harness.h              |  99 +++++++
>  tools/tests/pdx/test-pdx.c             | 224 +++++++++++++++
>  xen/arch/arm/include/asm/Makefile      |   1 +
>  xen/arch/arm/setup.c                   |  34 +--
>  xen/arch/ppc/include/asm/Makefile      |   1 +
>  xen/arch/riscv/include/asm/Makefile    |   1 +
>  xen/arch/x86/domain.c                  |  40 +--
>  xen/arch/x86/include/asm/cpufeatures.h |   1 +
>  xen/arch/x86/include/asm/pdx.h         |  75 +++++
>  xen/arch/x86/srat.c                    |  30 +-
>  xen/common/Kconfig                     |  37 ++-
>  xen/common/pdx.c                       | 379 ++++++++++++++++++++++---
>  xen/include/asm-generic/pdx.h          |  24 ++
>  xen/include/xen/pdx.h                  | 201 +++++++++----
>  19 files changed, 1056 insertions(+), 156 deletions(-)
>  create mode 100644 tools/tests/pdx/.gitignore
>  create mode 100644 tools/tests/pdx/Makefile
>  create mode 100644 tools/tests/pdx/harness.h
>  create mode 100644 tools/tests/pdx/test-pdx.c
>  create mode 100644 xen/arch/x86/include/asm/pdx.h
>  create mode 100644 xen/include/asm-generic/pdx.h
> 
> -- 
> 2.49.0
> 



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-27 14:51         ` Roger Pau Monné
@ 2025-06-29 14:36           ` Jan Beulich
  2025-07-01  7:26             ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-29 14:36 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On 27.06.2025 16:51, Roger Pau Monné wrote:
> On Thu, Jun 26, 2025 at 09:35:04AM +0200, Jan Beulich wrote:
>> On 25.06.2025 18:24, Roger Pau Monné wrote:
>>> On Tue, Jun 24, 2025 at 06:16:15PM +0200, Jan Beulich wrote:
>>>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>>>> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
>>>>> +{
>>>>> +    unsigned long pfn = PFN_DOWN(base);
>>>>> +
>>>>> +    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);
>>>>
>>>> Aiui for this to be correct, there need to be gaps between the ranges
>>>> covered by individual lookup table slots. In the setup logic you have a
>>>> check commented "Avoid compression if there's no gain", but that doesn't
>>>> look to guarantee gaps everywhere (nor would pfn_offset_sanitize_ranges()
>>>> appear to)?
>>>
>>> But if there are no gaps, the full region is covered correctly, and
>>> hence it's compressible?
>>
>> If there's a guarantee that such ranges would be folded into a single one,
>> all would be fine.
>>
>>> Maybe I'm missing something, could you maybe provide an example that
>>> would exhibit this issue?
>>
>> My understanding is that when there's no gap between regions, and when
>> [base, base + npages) crosses a region boundary, then the expression
>> above will yield true when, because of crossing a region boundary, it
>> ought to be false. Or did I simply misunderstand the purpose of the
>> pdx_is_region_compressible() invocations?
> 
> If there's no gap between the regions it's IMO intended for
> pdx_is_region_compressible() to return true, as the whole region is
> continuous in both the PFN and PDX spaces, and hence compressible
> (even if it spans multiple regions).

My problem is that I can't make the connection between that function
returning true and regions getting concatenated. When the function is
invoked, concatenation (or not) has happened already, aiui.

> But maybe I'm not understanding your point correctly, could you maybe
> provide an example if you disagree with my reply above?  Sorry if I'm
> being dull, with this compression stuff it's sometimes hard for me to
> visualize the case you are trying to make without a concrete
> example.

What I think I didn't take into consideration is that from two pages
being contiguous in MFN space, it ought to follow they're also
contiguous in PDX space. Hence [base, base + npages) crossing a region
boundary (if, contrary to what you say, this was possible in the first
place) would still not be encountering a discontinuity. So overall not
an issue, irrespective of what pdx_is_region_compressible() means
towards (non-)contiguity.
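
To make the round trip concrete, here is a minimal sketch in C of offset-based
translation and of the compressibility check. It uses the numbers from the
lookup-table example in the patch 8 description (PFN lookup shift 29, PDX
region size 0x10000000). The table sizes, the index masking, the zero offsets
in slots that hold no RAM, and all names other than pfn_to_pdx()/pdx_to_pfn()
are assumptions made for illustration, not the series' actual code; the
function also takes a PFN directly instead of a paddr_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFN_SHIFT 29   /* PFN lookup table shift from the example */
#define PDX_SHIFT 28   /* log2 of the PDX region size 0x10000000 */

/*
 * Offset subtracted from a PFN, indexed by (pfn >> PFN_SHIFT).
 * Only the slots that hold RAM in the example are populated; leaving
 * the others at 0 and using 16 slots is an assumption of this sketch.
 */
static const uint64_t pfn_offs[16] = {
    [0] = 0x0ULL,
    [3] = 0x53e80000ULL,
    [6] = 0xa7e80000ULL,
    [9] = 0xfbe80000ULL,
};

/* The same offsets, indexed by (pdx >> PDX_SHIFT). */
static const uint64_t pdx_offs[4] = {
    0x0ULL, 0x53e80000ULL, 0xa7e80000ULL, 0xfbe80000ULL,
};

static uint64_t pfn_to_pdx(uint64_t pfn)
{
    /* Masking the index only keeps the sketch within array bounds. */
    return pfn - pfn_offs[(pfn >> PFN_SHIFT) & 15];
}

static uint64_t pdx_to_pfn(uint64_t pdx)
{
    return pdx + pdx_offs[(pdx >> PDX_SHIFT) & 3];
}

/*
 * Hypothetical stand-in for pdx_is_region_compressible(): the range is
 * compressible when walking npages forward in PDX space lands on the
 * same frame as walking npages forward in PFN space.
 */
static bool region_compressible(uint64_t pfn, uint64_t npages)
{
    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == pfn + npages - 1;
}
```

With these tables, range 1 as a whole (PFN 0x63e80000, 0x8000000 pages) passes
the check, while a range starting inside range 0 and running into the
following hole (PFN 0x8000000, 0x10000000 pages) does not.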

Jan



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-20 11:11 ` [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
  2025-06-24 16:16   ` Jan Beulich
@ 2025-06-30  6:34   ` Jan Beulich
  2025-07-01 15:49     ` Roger Pau Monné
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-06-30  6:34 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On 20.06.2025 13:11, Roger Pau Monne wrote:
> @@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
>  
>  #ifdef CONFIG_PDX_MASK_COMPRESSION
>      invalid |= mfn & pfn_hole_mask;
> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> +    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));
>  #endif
>  
>      if ( unlikely(evaluate_nospec(invalid)) )

In the chat you mentioned that you would add a check against max_pdx here. While
that feels sufficient, I couldn't quite convince myself of this formally. Hence
an alternative proposal for consideration, which imo is more clearly achieving
the effect of allowing for no false-positive results. In particular, how about
adding another array holding the PDX upper bounds for the respective region?
When naming the existing two arrays moffs[] and poffs[] for brevity, the new
one would be plimit[], but indexed by the MFN index. Then we'd end up with

	p = mfn - moffs[midx]; /* Open-coded pfn_to_pdx() */
	invalid |= p >= plimit[midx] || p < plimit[midx - 1];

Of course this would need massaging to deal with the midx == 0 case, perhaps by
making the array one slot larger and incrementing the indexes by 1. The
downside compared to the max_pdx variant is that while it's the same number of
memory accesses (and the same number of comparisons [or replacements thereof,
like the ^ in context above]), cache locality is worse (simply because of the
fact that it's another array).

For the example in the description, i.e.

PFN compression using PFN lookup table shift 29 and PDX region size 0x10000000
 range 0 [0000000000000, 0x0000807ffff] PFN IDX  0 : 0000000000000
 range 1 [0x00063e80000, 0x0006be7ffff] PFN IDX  3 : 0x00053e80000
 range 2 [0x000c7e80000, 0x000cfe7ffff] PFN IDX  6 : 0x000a7e80000
 range 3 [0x0012be80000, 0x00133e7ffff] PFN IDX  9 : 0x000fbe80000

we'd end up with plimit[] holding

0, 0x10000000, 0x10000000, 0x10000000, 0x20000000, 0x20000000, 0x20000000,
0x30000000, 0x30000000, 0x30000000, 0x40000000, 0x40000000, 0x40000000.

For this example the 2nd of the comparisons could even be omitted afaict, but
I couldn't convince myself that this would hold for the general case.
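
A minimal sketch in C of the plimit[] check applied to the example above,
with the array made one slot larger and the indexes incremented by 1 as
suggested. The moffs[] values for slots without RAM (reusing the offset of
the preceding range), the NR_SLOTS bound and the function name are
assumptions made for illustration, not taken from the patch. The check only
says whether an MFN falls inside the PDX region of its lookup slot; holes
inside a region would still have to be rejected by other means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOOKUP_SHIFT 29   /* PFN lookup table shift from the example */
#define NR_SLOTS     12   /* number of slots implied by the plimit[] list */

/* PFN offset per lookup slot (filler slots are an assumption). */
static const uint64_t moffs[NR_SLOTS] = {
    0x0ULL,        0x0ULL,        0x0ULL,
    0x53e80000ULL, 0x53e80000ULL, 0x53e80000ULL,
    0xa7e80000ULL, 0xa7e80000ULL, 0xa7e80000ULL,
    0xfbe80000ULL, 0xfbe80000ULL, 0xfbe80000ULL,
};

/*
 * PDX bounds shifted by one slot: plimit[i] is the inclusive lower
 * bound and plimit[i + 1] the exclusive upper bound for slot i.
 */
static const uint64_t plimit[NR_SLOTS + 1] = {
    0x0ULL,
    0x10000000ULL, 0x10000000ULL, 0x10000000ULL,
    0x20000000ULL, 0x20000000ULL, 0x20000000ULL,
    0x30000000ULL, 0x30000000ULL, 0x30000000ULL,
    0x40000000ULL, 0x40000000ULL, 0x40000000ULL,
};

/* Hypothetical helper: true if mfn maps into the PDX region of its slot. */
static bool mfn_in_pdx_region(uint64_t mfn)
{
    uint64_t midx = mfn >> LOOKUP_SHIFT;
    uint64_t p;

    if (midx >= NR_SLOTS)
        return false;

    p = mfn - moffs[midx];   /* open-coded pfn_to_pdx() */

    return p >= plimit[midx] && p < plimit[midx + 1];
}
```

For slots 1, 2, 4, 5 and so on the lower and upper bounds are equal, so every
MFN indexing such a slot is rejected. An MFN such as 0x60000000, which sits in
slot 3 but below the start of range 1, is rejected by the lower-bound
comparison in this sketch.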

Jan


* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-06-28  2:08 ` [PATCH v2 0/8] pdx: introduce a new compression algorithm Stefano Stabellini
@ 2025-06-30 15:02   ` Roger Pau Monné
  2025-07-01  1:50     ` Stefano Stabellini
  2025-07-03  8:42   ` Roger Pau Monné
  1 sibling, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-06-30 15:02 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

On Fri, Jun 27, 2025 at 07:08:29PM -0700, Stefano Stabellini wrote:
> Hi Roger,
> 
> We have an ARM board with the following memory layout:
> 
> 0x0-0x80000000, 0, 2GB
> 0x800000000-0x880000000, 32GB, 2GB
> 0x50000000000-0x50080000000, 5TB, 2GB
> 0x60000000000-0x60080000000, 6TB, 2GB
> 0x70000000000-0x70080000000, 7TB, 2GB

With the current PDX mask compression you could compress 4 bits AFAICT.
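
As a rough check of that figure, the sketch below ORs together every PFN bit
that is set anywhere inside one of the banks listed above and then looks for
the longest run of bits that are always zero. This is a simplified model of
mask compression written for illustration only: the helper names, the struct
and the 4K page shift are assumptions, and the real setup code applies further
constraints (for example on the low-order bits) that do not matter for this
layout.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumed 4K pages */

struct bank {
    uint64_t start, end;   /* physical addresses, end exclusive */
};

/* Memory layout quoted above. */
static const struct bank banks[] = {
    { 0x0ULL,           0x80000000ULL },
    { 0x800000000ULL,   0x880000000ULL },
    { 0x50000000000ULL, 0x50080000000ULL },
    { 0x60000000000ULL, 0x60080000000ULL },
    { 0x70000000000ULL, 0x70080000000ULL },
};

/* Set every bit below the highest set bit. */
static uint64_t fill_mask(uint64_t mask)
{
    while (mask & (mask + 1))
        mask |= mask + 1;

    return mask;
}

/* Bits that are set in at least one PFN of at least one bank. */
static uint64_t used_pfn_bits(const struct bank *b, unsigned int nr)
{
    uint64_t used = 0;

    for (unsigned int i = 0; i < nr; i++) {
        uint64_t first = b[i].start >> PAGE_SHIFT;
        uint64_t last = (b[i].end - 1) >> PAGE_SHIFT;

        /* Bits of the base, plus all bits that vary inside the bank. */
        used |= first | fill_mask(first ^ last);
    }

    return used;
}

/* Longest run of zero bits below the highest set bit. */
static unsigned int longest_zero_run(uint64_t used)
{
    unsigned int best = 0, run = 0;

    for (; used; used >>= 1) {
        if (used & 1)
            run = 0;
        else if (++run > best)
            best = run;
    }

    return best;
}
```

For this layout the used PFN bits come out as 0x7087ffff, which leaves two
holes of 4 bits each (bits 19-22 and 24-27), so a single contiguous hole
yields 4 bits.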

> It looks like your PDX series is exactly what we need.  However, I tried
> to use it and it doesn't seem to be hooked properly on ARM yet. I spent
> some time trying to fix it but I was unsuccessful.

Hm, weird.  It shouldn't need any special hooking, unless assumptions
about the existing PDX mask compression have leaked into ARM code.

> As far as I can tell the following functions need to be adjusted but I
> am not sure the list is comprehensive:
> 
> xen/arch/arm/include/asm/mmu/mm.h:maddr_to_virt

At least for CONFIG_ARM_64 this seems to be implemented correctly, as
it's using maddr_to_directmapoff() which should have the correct
translation between paddr -> directmap virt.

Also given the memory map above the adjustments done in ARM to remove
any initial memory map offset should be no-ops, since I expect
base_mfn == 0 in setup_directmap_mappings() in that particular case,
and then directmap_mfn_start = directmap_base_pdx = 0 and
directmap_virt_start = DIRECTMAP_VIRT_START.  FWIW, if ARM uses offset
compression the special casing about removing the initial gap can be
removed, as the compression should already take care of that.

> xen/arch/arm/mmu/mm.c:setup_frametable_mappings
> xen/arch/arm/setup.c:init_pdx

I've attempted to adjust init_pdx() myself so it works with the new
generic PDX compression setup, it seemed to work fine on the CI, but I
don't have any real ARM machines to test myself.

Is there a way I could reproduce the issue(s) you are seeing with
QEMU?

I'm already working on v3, as this version's implementation of
mfn_valid() is buggy.  Maybe that's what you are hitting?

Regards, Roger.


* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-06-30 15:02   ` Roger Pau Monné
@ 2025-07-01  1:50     ` Stefano Stabellini
  2025-07-01  3:33       ` Stefano Stabellini
  2025-07-01  6:05       ` Jan Beulich
  0 siblings, 2 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-07-01  1:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, xen-devel, Jan Beulich, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager

[-- Attachment #1: Type: text/plain, Size: 10315 bytes --]

On Mon, 30 Jun 2025, Roger Pau Monné wrote:
> On Fri, Jun 27, 2025 at 07:08:29PM -0700, Stefano Stabellini wrote:
> > Hi Roger,
> > 
> > We have an ARM board with the following memory layout:
> > 
> > 0x0-0x80000000, 0, 2GB
> > 0x800000000-0x880000000, 32GB, 2GB
> > 0x50000000000-0x50080000000, 5TB, 2GB
> > 0x60000000000-0x60080000000, 6TB, 2GB
> > 0x70000000000-0x70080000000, 7TB, 2GB
> 
> With the current PDX mask compression you could compress 4 bits AFAICT.
> 
> > It looks like your PDX series is exactly what we need.  However, I tried
> > to use it and it doesn't seem to be hooked properly on ARM yet. I spent
> > some time trying to fix it but I was unsuccessful.
> 
> Hm, weird.  It shouldn't need any special hooking, unless assumptions
> about the existing PDX mask compression have leaked into ARM code.
> 
> > As far as I can tell the following functions need to be adjusted but I
> > am not sure the list is comprehensive:
> > 
> > xen/arch/arm/include/asm/mmu/mm.h:maddr_to_virt
> 
> At least for CONFIG_ARM_64 this seems to be implemented correctly, as
> it's using maddr_to_directmapoff() which should have the correct
> translation between paddr -> directmap virt.
> 
> Also given the memory map above the adjustments done in ARM to remove
> any initial memory map offset should be no-ops, since I expect
> base_mfn == 0 in setup_directmap_mappings() in that particular case,
> and then directmap_mfn_start = directmap_base_pdx = 0 and
> directmap_virt_start = DIRECTMAP_VIRT_START.  FWIW, if ARM uses offset
> compression the special casing about removing the initial gap can be
> removed, as the compression should already take care of that.
> 
> > xen/arch/arm/mmu/mm.c:setup_frametable_mappings
> > xen/arch/arm/setup.c:init_pdx
> 
> I've attempted to adjust init_pdx() myself so it works with the new
> generic PDX compression setup, it seemed to work fine on the CI, but I
> don't have any real ARM machines to test myself.
 
> Is there a way I could reproduce the issue(s) you are seeing with
> QEMU?

Maybe. You can see how we run QEMU from gitlab-ci, but I don't know off
the top of my head how to force QEMU to emulate multiple RAM banks at
specific addresses.


> I'm already working on v3, as this version's implementation of
> mfn_valid() is buggy.  Maybe that's what you are hitting?
> 

This is the error:

(XEN) [0000000179e5f96b] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
(XEN) [0000000179e90619] ----[ Xen-4.21-unstable  arm64  debug=y  Not tainted ]----
(XEN) [0000000179e9ee58] CPU:    0
(XEN) [0000000179eac907] PC:     00000a00002da5fc setup_mm+0x174/0x200
(XEN) [0000000179ed3ed0] LR:     00000a00002da580
(XEN) [0000000179edc486] SP:     00000a0000327e10
(XEN) [0000000179ee6b3a] CPSR:   00000000200003c9 MODE:64-bit EL2h (Hypervisor, handler)
(XEN) [0000000179ef5b4f]      X0: 0000050000000000  X1: 0000000050000000  X2: 0000000000080000
(XEN) [0000000179f05de3]      X3: 0000000000000017  X4: 0000000000000000  X5: 0000000050000000
(XEN) [0000000179f19396]      X6: 000000004fffffff  X7: 0000000000000000  X8: 0000000000020400
(XEN) [0000000179f2d797]      X9: 000000000001b808 X10: 0000000000000080 X11: 00000000000186de
(XEN) [0000000179f3d492]     X12: 000000000001a7df X13: 000000000001214f X14: 0000000000017275
(XEN) [0000000179f50f4c]     X15: 00000a00002b48bc X16: 00000a0000291478 X17: 0000000000000000
(XEN) [0000000179f60902]     X18: 000000007be9bbe0 X19: 0000000000000002 X20: 0000000000000000
(XEN) [0000000179f6fde5]     X21: 0000050080000000 X22: 00000a00002f8008 X23: 00000a00002b5c90
(XEN) [0000000179f7eeea]     X24: 0000000180000000 X25: 00000a00002b5e90 X26: 0000000000000000
(XEN) [0000000179f8ee55]     X27: 0000000000000000 X28: 000000007bff2f70  FP: 00000a0000327e10
(XEN) [0000000179fa6deb] 
(XEN) [0000000179fadf84]   VTCR_EL2: 0000000000000000
(XEN) [0000000179fb9994]  VTTBR_EL2: 0000000000000000
(XEN) [0000000179fc689d] 
(XEN) [0000000179fcc1a0]  SCTLR_EL2: 0000000030cd183d
(XEN) [0000000179fd95e3]    HCR_EL2: 0000000000000038
(XEN) [0000000179fe7082]  TTBR0_EL2: 0000000022148000
(XEN) [0000000179ff0d00] 
(XEN) [0000000179ff6d07]    ESR_EL2: 00000000f2000001
(XEN) [000000017a0003fe]  HPFAR_EL2: 0000000000000000
(XEN) [000000017a00c8f4]    FAR_EL2: 0000000000000000
(XEN) [000000017a018511] 
(XEN) [000000017a01fbe5] Xen stack trace from sp=00000a0000327e10:
(XEN) [000000017a02aa88]    00000a0000327e60 00000a00002e40c4 0000000022200000 000000000000f000
(XEN) [000000017a03e578]    00000a0000c0a5c0 00000a0000332000 00000a0000a00000 0000000000000000
(XEN) [000000017a04e676]    0000000000000000 0000000000000000 000000007be89ea0 00000a00002001a4
(XEN) [000000017a0636e1]    0000000022000000 fffff60021e00000 0000000022200000 0000000000001710
(XEN) [000000017a072ae0]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a084bf8]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a097ced]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a0a6829]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a0b8e71]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a0cdb4b]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a0e44b9]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a0f6a2b]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a1074a2]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a1178b3]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a128463]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [000000017a13a015]    0000000000000000 0000000000000000
(XEN) [000000017a144d66] Xen call trace:
(XEN) [000000017a14bcee]    [<00000a00002da5fc>] setup_mm+0x174/0x200 (PC)
(XEN) [000000017a15db0a]    [<00000a00002da580>] setup_mm+0xf8/0x200 (LR)
(XEN) [000000017a167dbb]    [<00000a00002e40c4>] start_xen+0x118/0x9d0
(XEN) [000000017a171724]    [<00000a00002001a4>] arch/arm/arm64/head.o#primary_switched+0x4/0x24
(XEN) [000000017a18abb4] 
(XEN) [000000017a19a465] 
(XEN) [000000017a19ffed] ****************************************
(XEN) [000000017a1aad66] Panic on CPU 0:
(XEN) [000000017a1b2757] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
(XEN) [000000017a1daedf] ****************************************
(XEN) [000000017a1eb0a9] 
(XEN) [000000017a1f2b27] Reboot in five seconds...


If I remove the ASSERT:

(XEN) [00000003bc65c616] parameter "debug" unknown!
(XEN) [00000003bc70915a] 
(XEN) [00000003bc70fd14] ****************************************
(XEN) [00000003bc71afec] Panic on CPU 0:
(XEN) [00000003bc724d03] The frametable cannot cover the physical region 0000000000000000 - 0x00070080000000
(XEN) [00000003bc73786c] ****************************************
(XEN) [00000003bc741a19] 
(XEN) [00000003bc747833] Reboot in five seconds...


I think the issue (or one issue) is the implementation of
setup_frametable_mappings on ARM which is ignoring the pdx_group_valid
bitmap. I am attaching a work-in-progress patch from Michal to add
support for it for your reference. Remove commit fe6a12a08 to apply the
patch without conflict.

With Michal's patch, I can boot *without* your patches on the
problematic board.

I still cannot boot with your patches, even with Michal's patch. I still
hit the same ASSERT. If I remove the ASSERT I go further and hit:

(XEN) [00000001bccbd3ab] Panic on CPU 0:
(XEN) [00000001bccc4c3e] Frametable too small

I added some debug messages (see attached stefano-debug.patch). Something
seems to be wrong with the pdx_group_valid bitmap after 0x880000, as we
start getting MFN ranges such as 0x254c0000-0x25500000 which don't make
any sense to me.

(XEN) [00000001563012a8] DEBUG init_pdx 294 start=0 end=80000000
(XEN) [000000015630d6d9] DEBUG init_pdx 294 start=800000000 end=880000000
(XEN) [000000015631c73c] DEBUG init_pdx 294 start=50000000000 end=50080000000
(XEN) [000000015632947b] DEBUG init_pdx 294 start=60000000000 end=60080000000
(XEN) [00000001563365a8] DEBUG init_pdx 294 start=70000000000 end=70080000000
(XEN) [000000015637c6aa] DEBUG init_frametable 65 start=0 end=80000
(XEN) [00000001563898e1] DEBUG init_frametable_chunk 28 virt=a0800000000 base_mfn=7007e000 pfn_start=0 pfn_end=80000
(XEN) [000000015692ed1f] DEBUG init_frametable 65 start=800000 end=880000
(XEN) [00000001569399fe] DEBUG init_frametable_chunk 28 virt=a081c000000 base_mfn=7007c000 pfn_start=800000 pfn_end=880000
(XEN) [00000001573bad45] DEBUG init_frametable 65 start=254c0000 end=25500000
(XEN) [00000001573dee6a] DEBUG init_frametable_chunk 28 virt=a1028a00000 base_mfn=7007a000 pfn_start=254c0000 pfn_end=25500000
(XEN) [00000001578ad5c2] DEBUG init_frametable 65 start=25700000 end=257c0000
(XEN) [00000001578b841d] DEBUG init_frametable_chunk 28 virt=a1030800000 base_mfn=70076000 pfn_start=25700000 pfn_end=257c0000
(XEN) [000000015853b121] DEBUG init_frametable 65 start=27400000 end=27440000
(XEN) [00000001585470fe] DEBUG init_frametable_chunk 28 virt=a1096000000 base_mfn=70074000 pfn_start=27400000 pfn_end=27440000
(XEN) [0000000158880a59] DEBUG init_frametable 65 start=27480000 end=27500000
(XEN) [000000015888d583] DEBUG init_frametable_chunk 28 virt=a1097c00000 base_mfn=70072000 pfn_start=27480000 pfn_end=27500000
(XEN) [0000000158eacf55] DEBUG init_frametable 65 start=27580000 end=27a40000
(XEN) [0000000158eb7f8e] DEBUG init_frametable_chunk 28 virt=a109b400000 base_mfn=70060000 pfn_start=27580000 pfn_end=27a40000
(XEN) [000000015cac7416] DEBUG init_frametable 65 start=27a80000 end=27ac0000
(XEN) [000000015cad6818] DEBUG init_frametable_chunk 28 virt=a10acc00000 base_mfn=7005e000 pfn_start=27a80000 pfn_end=27ac0000
(XEN) [000000015cb26b99] arch/arm/mmu/pt.c:360: Changing MFN for a valid entry is not allowed (0x70071800 -> 0x7005e000).
(XEN) [000000015cb80f94] Xen WARN at arch/arm/mmu/pt.c:360
(XEN) [000000015cbabedc] ----[ Xen-4.21-unstable  arm64  debug=y  Not tainted ]----

[-- Attachment #2: Type: text/x-diff; NAME=stefano-debug.patch, Size: 2639 bytes --]

diff --git a/xen/arch/arm/include/asm/mmu/mm.h b/xen/arch/arm/include/asm/mmu/mm.h
index 7f4d59137d..819fe923a4 100644
--- a/xen/arch/arm/include/asm/mmu/mm.h
+++ b/xen/arch/arm/include/asm/mmu/mm.h
@@ -69,8 +69,6 @@ static inline void *maddr_to_virt(paddr_t ma)
  */
 static inline void *maddr_to_virt(paddr_t ma)
 {
-    ASSERT((mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) <
-           (DIRECTMAP_SIZE >> PAGE_SHIFT));
     return (void *)(XENHEAP_VIRT_START -
                     (directmap_base_pdx << PAGE_SHIFT) +
                     maddr_to_directmapoff(ma));
diff --git a/xen/arch/arm/mmu/mm.c b/xen/arch/arm/mmu/mm.c
index 69617a4986..e24fca8c70 100644
--- a/xen/arch/arm/mmu/mm.c
+++ b/xen/arch/arm/mmu/mm.c
@@ -25,6 +25,7 @@ init_frametable_chunk(unsigned long pdx_s, unsigned long pdx_e)
     base_mfn = alloc_boot_pages(chunk_size >> PAGE_SHIFT, 32 << (20 - 12));
 
     virt = (unsigned long)pdx_to_page(pdx_s);
+    printk("DEBUG %s %d virt=%lx base_mfn=%lx pfn_start=%lx pfn_end=%lx\n",__func__,__LINE__,(unsigned long)virt,mfn_x(base_mfn),mfn_x(pdx_to_mfn(pdx_s)),mfn_x(pdx_to_mfn(pdx_e)));
     rc = map_pages_to_xen(virt, base_mfn, chunk_size >> PAGE_SHIFT,
                           PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
     if ( rc )
@@ -51,12 +52,9 @@ void __init init_frametable(void)
 
     max_pdx = pfn_to_pdx(max_page - 1) + 1;
 
-    if ( max_pdx > FRAMETABLE_NR )
-        panic("Frametable too small\n");
-
     max_idx = DIV_ROUND_UP(max_pdx, PDX_GROUP_COUNT);
 
-    for ( sidx = (frametable_base_pdx / PDX_GROUP_COUNT); ; sidx = nidx )
+    for ( sidx = 0; ; sidx = nidx )
     {
         eidx = find_next_zero_bit(pdx_group_valid, max_idx, sidx);
         nidx = find_next_bit(pdx_group_valid, max_idx, eidx);
@@ -64,6 +62,7 @@ void __init init_frametable(void)
         if ( nidx >= max_idx )
             break;
 
+        printk("DEBUG %s %d start=%lx end=%lx\n",__func__,__LINE__,mfn_x(pdx_to_mfn(sidx * PDX_GROUP_COUNT)),mfn_x(pdx_to_mfn(eidx * PDX_GROUP_COUNT)));
         init_frametable_chunk(sidx * PDX_GROUP_COUNT, eidx * PDX_GROUP_COUNT);
     }
 
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index c9ad6bbab6..1f5c1866c4 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -291,6 +291,7 @@ void __init init_pdx(void)
         bank_size = mem->bank[bank].size;
         bank_end = bank_start + bank_size;
 
+        printk("DEBUG %s %d start=%lx end=%lx\n",__func__,__LINE__,bank_start,bank_end);
         set_pdx_range(paddr_to_pfn(bank_start),
                       paddr_to_pfn(bank_end));
     }

[-- Attachment #3: Type: text/x-diff; NAME=pdx-groups.patch, Size: 6906 bytes --]

From michal.orzel@amd.com Mon Jun 30 02:22:46 2025
Date: Mon, 30 Jun 2025 11:22:27 +0200
From: Michal Orzel <michal.orzel@amd.com>
Subject: [PATCH] xen/arm: Take into account PDX grouping for setting up frametable

At the moment we don't really take into account pdx_group_valid bitmap
containing ranges with valid RAM ranges. We populate the bitmap using
set_pdx_range() but we set up frametable to cover all RAM including
holes wasting a lot of memory (even gigabytes on some platforms with
large holes). Take example from x86 where this bitmap is used to init
frametable for valid RAM ranges in chunks. On Arm we also apply offset
(similar as with direct map), where the starting index for the bitmap
comes from frametable_base_pdx. Mapping size remains the same as before
i.e. 2MB or 32MB.

Signed-off-by: Michal Orzel <michal.orzel@amd.com>
---
 xen/arch/arm/arm32/mmu/mm.c   |  4 +-
 xen/arch/arm/arm64/mmu/mm.c   |  3 +-
 xen/arch/arm/include/asm/mm.h |  4 +-
 xen/arch/arm/mmu/mm.c         | 69 ++++++++++++++++++++++-------------
 4 files changed, 50 insertions(+), 30 deletions(-)

diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
index 956693232a1b..80b3572e0041 100644
--- a/xen/arch/arm/arm32/mmu/mm.c
+++ b/xen/arch/arm/arm32/mmu/mm.c
@@ -188,10 +188,10 @@ void __init setup_mm(void)
 
     setup_directmap_mappings(mfn_x(directmap_mfn_start), xenheap_pages);
 
-    /* Frame table covers all of RAM region, including holes */
-    setup_frametable_mappings(ram_start, ram_end);
     max_page = PFN_DOWN(ram_end);
 
+    init_frametable();
+
     /*
      * The allocators may need to use map_domain_page() (such as for
      * scrubbing pages). So we need to prepare the domheap area first.
diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index c1efa1348aee..8bfa263be91e 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -277,9 +277,10 @@ void __init setup_mm(void)
     directmap_mfn_start = maddr_to_mfn(ram_start);
     directmap_mfn_end = maddr_to_mfn(ram_end);
 
-    setup_frametable_mappings(ram_start, ram_end);
     max_page = PFN_DOWN(ram_end);
 
+    init_frametable();
+
     init_staticmem_pages();
     init_sharedmem_pages();
 }
diff --git a/xen/arch/arm/include/asm/mm.h b/xen/arch/arm/include/asm/mm.h
index a0d8e5afe977..5f41da4b1c32 100644
--- a/xen/arch/arm/include/asm/mm.h
+++ b/xen/arch/arm/include/asm/mm.h
@@ -211,8 +211,8 @@ extern void *early_fdt_map(paddr_t fdt_paddr);
 extern void remove_early_mappings(void);
 /* Prepare the memory subystem to bring-up the given secondary CPU */
 extern int prepare_secondary_mm(int cpu);
-/* Map a frame table to cover physical addresses ps through pe */
-extern void setup_frametable_mappings(paddr_t ps, paddr_t pe);
+/* Map a frame table */
+void init_frametable(void);
 /* map a physical range in virtual memory */
 void __iomem *ioremap_attr(paddr_t start, size_t len, unsigned int attributes);
 
diff --git a/xen/arch/arm/mmu/mm.c b/xen/arch/arm/mmu/mm.c
index 9c50479c6373..69617a4986a5 100644
--- a/xen/arch/arm/mmu/mm.c
+++ b/xen/arch/arm/mmu/mm.c
@@ -10,16 +10,35 @@
 
 unsigned long frametable_virt_end __read_mostly;
 
-/* Map a frame table to cover physical addresses ps through pe */
-void __init setup_frametable_mappings(paddr_t ps, paddr_t pe)
+static void __init
+init_frametable_chunk(unsigned long pdx_s, unsigned long pdx_e)
 {
-    unsigned long nr_pdxs = mfn_to_pdx(mfn_add(maddr_to_mfn(pe), -1)) -
-                            mfn_to_pdx(maddr_to_mfn(ps)) + 1;
-    unsigned long frametable_size = nr_pdxs * sizeof(struct page_info);
-    mfn_t base_mfn;
-    const unsigned long mapping_size = frametable_size < MB(32) ? MB(2)
-                                                                : MB(32);
+    unsigned long nr_pdxs = pdx_e - pdx_s;
+    unsigned long chunk_size = nr_pdxs * sizeof(struct page_info);
+    const unsigned long mapping_size = chunk_size < MB(32) ? MB(2) : MB(32);
+    unsigned long virt;
     int rc;
+    mfn_t base_mfn;
+
+    /* Round up to 2M or 32M boundary, as appropriate. */
+    chunk_size = ROUNDUP(chunk_size, mapping_size);
+    base_mfn = alloc_boot_pages(chunk_size >> PAGE_SHIFT, 32 << (20 - 12));
+
+    virt = (unsigned long)pdx_to_page(pdx_s);
+    rc = map_pages_to_xen(virt, base_mfn, chunk_size >> PAGE_SHIFT,
+                          PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
+    if ( rc )
+        panic("Unable to setup the frametable mappings\n");
+
+    memset(&frame_table[pdx_s], 0, nr_pdxs * sizeof(struct page_info));
+    memset(&frame_table[pdx_e], -1,
+           chunk_size - nr_pdxs * sizeof(struct page_info));
+}
+
+void __init init_frametable(void)
+{
+    unsigned int sidx, eidx, nidx;
+    unsigned int max_idx;
 
     /*
      * The size of paddr_t should be sufficient for the complete range of
@@ -28,27 +47,27 @@ void __init setup_frametable_mappings(paddr_t ps, paddr_t pe)
     BUILD_BUG_ON((sizeof(paddr_t) * BITS_PER_BYTE) < PADDR_BITS);
     BUILD_BUG_ON(sizeof(struct page_info) != PAGE_INFO_SIZE);
 
-    if ( frametable_size > FRAMETABLE_SIZE )
-        panic("The frametable cannot cover the physical region %#"PRIpaddr" - %#"PRIpaddr"\n",
-              ps, pe);
+    frametable_base_pdx = mfn_to_pdx(directmap_mfn_start);
 
-    frametable_base_pdx = mfn_to_pdx(maddr_to_mfn(ps));
-    /* Round up to 2M or 32M boundary, as appropriate. */
-    frametable_size = ROUNDUP(frametable_size, mapping_size);
-    base_mfn = alloc_boot_pages(frametable_size >> PAGE_SHIFT, 32<<(20-12));
+    max_pdx = pfn_to_pdx(max_page - 1) + 1;
 
-    rc = map_pages_to_xen(FRAMETABLE_VIRT_START, base_mfn,
-                          frametable_size >> PAGE_SHIFT,
-                          PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
-    if ( rc )
-        panic("Unable to setup the frametable mappings.\n");
+    if ( max_pdx > FRAMETABLE_NR )
+        panic("Frametable too small\n");
+
+    max_idx = DIV_ROUND_UP(max_pdx, PDX_GROUP_COUNT);
+
+    for ( sidx = (frametable_base_pdx / PDX_GROUP_COUNT); ; sidx = nidx )
+    {
+        eidx = find_next_zero_bit(pdx_group_valid, max_idx, sidx);
+        nidx = find_next_bit(pdx_group_valid, max_idx, eidx);
+
+        if ( nidx >= max_idx )
+            break;
 
-    memset(&frame_table[0], 0, nr_pdxs * sizeof(struct page_info));
-    memset(&frame_table[nr_pdxs], -1,
-           frametable_size - (nr_pdxs * sizeof(struct page_info)));
+        init_frametable_chunk(sidx * PDX_GROUP_COUNT, eidx * PDX_GROUP_COUNT);
+    }
 
-    frametable_virt_end = FRAMETABLE_VIRT_START + (nr_pdxs *
-                                                   sizeof(struct page_info));
+    init_frametable_chunk(sidx * PDX_GROUP_COUNT, max_pdx);
 }
 
 /*
-- 
2.25.1
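
The chunked walk over pdx_group_valid described in the commit message can
be sketched in isolation. This is only an illustration, not the patch's
code: the bool array stands in for the real bitmap, and find_next(),
struct run and collect_valid_runs() are hypothetical names. Each run of
consecutive valid groups would get its own frametable chunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Half-open range [start, end) of group indexes (hypothetical type). */
struct run {
    size_t start;
    size_t end;
};

/* Return the first index >= from whose entry equals want, or n if none. */
static size_t find_next(const bool *valid, size_t n, size_t from, bool want)
{
    while ( from < n && valid[from] != want )
        from++;

    return from;
}

/*
 * Collect every run of consecutive valid groups. Returns the number of
 * runs stored in out (at most max_out).
 */
static size_t collect_valid_runs(const bool *valid, size_t n,
                                 struct run *out, size_t max_out)
{
    size_t count = 0;
    size_t s = find_next(valid, n, 0, true);

    while ( s < n && count < max_out )
    {
        /* The run ends at the next invalid group (or at n). */
        size_t e = find_next(valid, n, s, false);

        out[count].start = s;
        out[count].end = e;
        count++;

        /* Skip the hole and continue from the next valid group. */
        s = find_next(valid, n, e, true);
    }

    return count;
}
```

Only the groups inside each run would need backing memory; the holes
between runs stay unmapped, which is where the saving comes from.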


* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-01  1:50     ` Stefano Stabellini
@ 2025-07-01  3:33       ` Stefano Stabellini
  2025-07-01  6:05       ` Jan Beulich
  1 sibling, 0 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-07-01  3:33 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Roger Pau Monné, xen-devel, Jan Beulich, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager

On Mon, 30 Jun 2025, Stefano Stabellini wrote:
> I added some debug messages (see
> attached stefano-debug.patch). Something seems to be wrong with the
> pdx_group_valid bitmap after 0x880000, as we start getting MFN ranges
> such as 0x254c0000-0x25500000 which don't make any sense to me.

From what I can see the first time setup_directmap_mappings is called
with base_mfn=50000000, __mfn_to_virt goes wrong and triggers the ASSERT
in maddr_to_virt.
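
A rough sketch of why that ASSERT would fire, under two assumptions that
are not confirmed in this mail: that the PDX for MFN 0x50000000 is not
compressed (so it equals the MFN), and that the direct map is 5 TiB in
this build. ASSUMED_DIRECTMAP_SIZE and directmap_covers() are
hypothetical names for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Assumption: a 5 TiB direct map; the real DIRECTMAP_SIZE may differ. */
#define ASSUMED_DIRECTMAP_SIZE (UINT64_C(5) << 40)

/* Same shape as the condition in the failing assertion. */
static bool directmap_covers(uint64_t pdx, uint64_t directmap_base_pdx)
{
    return (pdx - directmap_base_pdx) <
           (ASSUMED_DIRECTMAP_SIZE >> PAGE_SHIFT);
}
```

Under those assumptions a PDX of 0x800000 with directmap_base_pdx == 0
passes, while 0x50000000 sits exactly at the limit and fails.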



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-01  1:50     ` Stefano Stabellini
  2025-07-01  3:33       ` Stefano Stabellini
@ 2025-07-01  6:05       ` Jan Beulich
  2025-07-01 20:46         ` Stefano Stabellini
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-07-01  6:05 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager, Roger Pau Monné

On 01.07.2025 03:50, Stefano Stabellini wrote:
> On Mon, 30 Jun 2025, Roger Pau Monné wrote:
>> On Fri, Jun 27, 2025 at 07:08:29PM -0700, Stefano Stabellini wrote:
>>> Hi Roger,
>>>
>>> We have an ARM board with the following memory layout:
>>>
>>> 0x0-0x80000000               0,    2GB
>>> 0x800000000-0x880000000      32GB, 2GB
>>> 0x50000000000-0x50080000000  5TB,  2GB
>>> 0x60000000000-0x60080000000  6TB,  2GB
>>> 0x70000000000-0x70080000000  7TB,  2GB
>>
>> With the current PDX mask compression you could compress 4 bits AFAICT.
>>
>>> It looks like your PDX series is exactly what we need.  However, I tried
>>> to use it and it doesn't seem to be hooked properly on ARM yet. I spent
>>> some time trying to fix it but I was unsuccessful.
>>
>> Hm, weird.  It shouldn't need any special hooking, unless assumptions
>> about the existing PDX mask compression have leaked into ARM code.
>>
>>> As far as I can tell the following functions need to be adjusted but I
>>> am not sure the list is comprehensive:
>>>
>>> xen/arch/arm/include/asm/mmu/mm.h:maddr_to_virt
>>
>> At least for CONFIG_ARM_64 this seems to be implemented correctly, as
>> it's using maddr_to_directmapoff() which should have the correct
>> translation between paddr -> directmap virt.
>>
>> Also given the memory map above the adjustments done in ARM to remove
>> any initial memory map offset should be no-ops, since I expect
>> base_mfn == 0 in setup_directmap_mappings() in that particular case,
>> and then directmap_mfn_start = directmap_base_pdx = 0 and
>> directmap_virt_start = DIRECTMAP_VIRT_START.  FWIW, if ARM uses offset
>> compression the special casing about removing the initial gap can be
>> removed, as the compression should already take care of that.
>>
>>> xen/arch/arm/mmu/mm.c:setup_frametable_mappings
>>> xen/arch/arm/setup.c:init_pdx
>>
>> I've attempted to adjust init_pdx() myself so it works with the new
>> generic PDX compression setup, it seemed to work fine on the CI, but I
>> don't have any real ARM machines to test myself.
>  
>> Is there a way I could reproduce the issue(s) you are seeing with
>> QEMU?
> 
> Maybe. You can see how we run QEMU from gitlab-ci, but I don't know off
> the top of my head how to force QEMU to emulate multiple RAM banks at
> specific addresses.
> 
> 
>> I'm already working on v3, as this version's implementation of
>> mfn_valid() is buggy.  Maybe that's what you are hitting?
>>
> 
> This is the error:
> 
> (XEN) [0000000179e5f96b] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> (XEN) [0000000179e90619] ----[ Xen-4.21-unstable  arm64  debug=y  Not tainted ]----
> (XEN) [0000000179e9ee58] CPU:    0
> (XEN) [0000000179eac907] PC:     00000a00002da5fc setup_mm+0x174/0x200
> (XEN) [0000000179ed3ed0] LR:     00000a00002da580
> (XEN) [0000000179edc486] SP:     00000a0000327e10
> (XEN) [0000000179ee6b3a] CPSR:   00000000200003c9 MODE:64-bit EL2h (Hypervisor, handler)
> (XEN) [0000000179ef5b4f]      X0: 0000050000000000  X1: 0000000050000000  X2: 0000000000080000
> (XEN) [0000000179f05de3]      X3: 0000000000000017  X4: 0000000000000000  X5: 0000000050000000
> (XEN) [0000000179f19396]      X6: 000000004fffffff  X7: 0000000000000000  X8: 0000000000020400
> (XEN) [0000000179f2d797]      X9: 000000000001b808 X10: 0000000000000080 X11: 00000000000186de
> (XEN) [0000000179f3d492]     X12: 000000000001a7df X13: 000000000001214f X14: 0000000000017275
> (XEN) [0000000179f50f4c]     X15: 00000a00002b48bc X16: 00000a0000291478 X17: 0000000000000000
> (XEN) [0000000179f60902]     X18: 000000007be9bbe0 X19: 0000000000000002 X20: 0000000000000000
> (XEN) [0000000179f6fde5]     X21: 0000050080000000 X22: 00000a00002f8008 X23: 00000a00002b5c90
> (XEN) [0000000179f7eeea]     X24: 0000000180000000 X25: 00000a00002b5e90 X26: 0000000000000000
> (XEN) [0000000179f8ee55]     X27: 0000000000000000 X28: 000000007bff2f70  FP: 00000a0000327e10
> (XEN) [0000000179fa6deb] 
> (XEN) [0000000179fadf84]   VTCR_EL2: 0000000000000000
> (XEN) [0000000179fb9994]  VTTBR_EL2: 0000000000000000
> (XEN) [0000000179fc689d] 
> (XEN) [0000000179fcc1a0]  SCTLR_EL2: 0000000030cd183d
> (XEN) [0000000179fd95e3]    HCR_EL2: 0000000000000038
> (XEN) [0000000179fe7082]  TTBR0_EL2: 0000000022148000
> (XEN) [0000000179ff0d00] 
> (XEN) [0000000179ff6d07]    ESR_EL2: 00000000f2000001
> (XEN) [000000017a0003fe]  HPFAR_EL2: 0000000000000000
> (XEN) [000000017a00c8f4]    FAR_EL2: 0000000000000000
> (XEN) [000000017a018511] 
> (XEN) [000000017a01fbe5] Xen stack trace from sp=00000a0000327e10:
> (XEN) [000000017a02aa88]    00000a0000327e60 00000a00002e40c4 0000000022200000 000000000000f000
> (XEN) [000000017a03e578]    00000a0000c0a5c0 00000a0000332000 00000a0000a00000 0000000000000000
> (XEN) [000000017a04e676]    0000000000000000 0000000000000000 000000007be89ea0 00000a00002001a4
> (XEN) [000000017a0636e1]    0000000022000000 fffff60021e00000 0000000022200000 0000000000001710
> (XEN) [000000017a072ae0]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a084bf8]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a097ced]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a0a6829]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a0b8e71]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a0cdb4b]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a0e44b9]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a0f6a2b]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a1074a2]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a1178b3]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a128463]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN) [000000017a13a015]    0000000000000000 0000000000000000
> (XEN) [000000017a144d66] Xen call trace:
> (XEN) [000000017a14bcee]    [<00000a00002da5fc>] setup_mm+0x174/0x200 (PC)
> (XEN) [000000017a15db0a]    [<00000a00002da580>] setup_mm+0xf8/0x200 (LR)
> (XEN) [000000017a167dbb]    [<00000a00002e40c4>] start_xen+0x118/0x9d0
> (XEN) [000000017a171724]    [<00000a00002001a4>] arch/arm/arm64/head.o#primary_switched+0x4/0x24
> (XEN) [000000017a18abb4] 
> (XEN) [000000017a19a465] 
> (XEN) [000000017a19ffed] ****************************************
> (XEN) [000000017a1aad66] Panic on CPU 0:
> (XEN) [000000017a1b2757] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> (XEN) [000000017a1daedf] ****************************************
> (XEN) [000000017a1eb0a9] 
> (XEN) [000000017a1f2b27] Reboot in five seconds...
> 
> 
> If I remove the ASSERT:
> 
> (XEN) [00000003bc65c616] parameter "debug" unknown!
> (XEN) [00000003bc70915a] 
> (XEN) [00000003bc70fd14] ****************************************
> (XEN) [00000003bc71afec] Panic on CPU 0:
> (XEN) [00000003bc724d03] The frametable cannot cover the physical region 0000000000000000 - 0x00070080000000
> (XEN) [00000003bc73786c] ****************************************
> (XEN) [00000003bc741a19] 
> (XEN) [00000003bc747833] Reboot in five seconds...
> 
> 
> I think the issue (or one issue) is the implementation of
> setup_frametable_mappings on ARM which is ignoring the pdx_group_valid
> bitmap. I am attaching a work-in-progress patch from Michal to add
> support for it for your reference. Remove commit fe6a12a08 to apply the
> patch without conflict.
> 
> With Michal's patch, I can boot *without* your patches on the
> problematic board.
> 
> I still cannot boot with your patches, even with Michal's patch. I still
> hit the same ASSERT. If I remove the ASSERT I go further and hit:
> 
> (XEN) [00000001bccbd3ab] Panic on CPU 0:
> (XEN) [00000001bccc4c3e] Frametable too small
> 
> I added some debug messages (see
> attached stefano-debug.patch). Something seems to be wrong with the
> pdx_group_valid bitmap after 0x880000, as we start getting MFN ranges
> such as 0x254c0000-0x25500000 which don't make any sense to me.

But in pdx_group_valid it would want to be PDXes.

> (XEN) [00000001563012a8] DEBUG init_pdx 294 start=0 end=80000000
> (XEN) [000000015630d6d9] DEBUG init_pdx 294 start=800000000 end=880000000
> (XEN) [000000015631c73c] DEBUG init_pdx 294 start=50000000000 end=50080000000
> (XEN) [000000015632947b] DEBUG init_pdx 294 start=60000000000 end=60080000000
> (XEN) [00000001563365a8] DEBUG init_pdx 294 start=70000000000 end=70080000000
> (XEN) [000000015637c6aa] DEBUG init_frametable 65 start=0 end=80000
> (XEN) [00000001563898e1] DEBUG init_frametable_chunk 28 virt=a0800000000 base_mfn=7007e000 pfn_start=0 pfn_end=80000
> (XEN) [000000015692ed1f] DEBUG init_frametable 65 start=800000 end=880000
> (XEN) [00000001569399fe] DEBUG init_frametable_chunk 28 virt=a081c000000 base_mfn=7007c000 pfn_start=800000 pfn_end=880000
> (XEN) [00000001573bad45] DEBUG init_frametable 65 start=254c0000 end=25500000
> (XEN) [00000001573dee6a] DEBUG init_frametable_chunk 28 virt=a1028a00000 base_mfn=7007a000 pfn_start=254c0000 pfn_end=25500000
> (XEN) [00000001578ad5c2] DEBUG init_frametable 65 start=25700000 end=257c0000
> (XEN) [00000001578b841d] DEBUG init_frametable_chunk 28 virt=a1030800000 base_mfn=70076000 pfn_start=25700000 pfn_end=257c0000
> (XEN) [000000015853b121] DEBUG init_frametable 65 start=27400000 end=27440000
> (XEN) [00000001585470fe] DEBUG init_frametable_chunk 28 virt=a1096000000 base_mfn=70074000 pfn_start=27400000 pfn_end=27440000
> (XEN) [0000000158880a59] DEBUG init_frametable 65 start=27480000 end=27500000
> (XEN) [000000015888d583] DEBUG init_frametable_chunk 28 virt=a1097c00000 base_mfn=70072000 pfn_start=27480000 pfn_end=27500000
> (XEN) [0000000158eacf55] DEBUG init_frametable 65 start=27580000 end=27a40000
> (XEN) [0000000158eb7f8e] DEBUG init_frametable_chunk 28 virt=a109b400000 base_mfn=70060000 pfn_start=27580000 pfn_end=27a40000
> (XEN) [000000015cac7416] DEBUG init_frametable 65 start=27a80000 end=27ac0000
> (XEN) [000000015cad6818] DEBUG init_frametable_chunk 28 virt=a10acc00000 base_mfn=7005e000 pfn_start=27a80000 pfn_end=27ac0000
> (XEN) [000000015cb26b99] arch/arm/mmu/pt.c:360: Changing MFN for a valid entry is not allowed (0x70071800 -> 0x7005e000).
> (XEN) [000000015cb80f94] Xen WARN at arch/arm/mmu/pt.c:360
> (XEN) [000000015cbabedc] ----[ Xen-4.21-unstable  arm64  debug=y  Not tainted ]----

Sadly from this you omitted the output from the setup of the offsets
arrays. Considering also your later reply, I'd be curious to know what
mfn_to_pdx(0x50000000) is.

Jan



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-29 14:36           ` Jan Beulich
@ 2025-07-01  7:26             ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-01  7:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On Sun, Jun 29, 2025 at 04:36:25PM +0200, Jan Beulich wrote:
> On 27.06.2025 16:51, Roger Pau Monné wrote:
> > On Thu, Jun 26, 2025 at 09:35:04AM +0200, Jan Beulich wrote:
> >> On 25.06.2025 18:24, Roger Pau Monné wrote:
> >>> On Tue, Jun 24, 2025 at 06:16:15PM +0200, Jan Beulich wrote:
> >>>> On 20.06.2025 13:11, Roger Pau Monne wrote:
> >>>>> +bool pdx_is_region_compressible(paddr_t base, unsigned long npages)
> >>>>> +{
> >>>>> +    unsigned long pfn = PFN_DOWN(base);
> >>>>> +
> >>>>> +    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == (pfn + npages - 1);
> >>>>
> >>>> Aiui for this to be correct, there need to be gaps between the ranges
> >>>> covered by individual lookup table slots. In the setup logic you have a
> >>>> check commented "Avoid compression if there's no gain", but that doesn't
> >>>> look to guarantee gaps everywhere (nor would pfn_offset_sanitize_ranges()
> >>>> appear to)?
> >>>
> >>> But if there are no gaps, the full region is covered correctly, and
> >>> hence it's compressible?
> >>
> >> If there's a guarantee that such ranges would be folded into a single one,
> >> all would be fine.
> >>
> >>> Maybe I'm missing something, could you maybe provide an example that
> >>> would exhibit this issue?
> >>
> >> My understanding is that when there's no gap between regions, and when
> >> [base, base + npages) crosses a region boundary, then the expression
> >> above will yield true when, because of crossing a region boundary, it
> >> ought to be false. Or did I simply misunderstand the purpose of the
> >> pdx_is_region_compressible() invocations?
> > 
> > If there's no gap between the regions it's IMO intended for
> > pdx_is_region_compressible() to return true, as the whole region is
> > continuous in both the PFN and PDX spaces, and hence compressible
> > (even if it spans multiple regions).
> 
> My problem is that I can't make the connection between that function
> returning true and regions getting concatenated. When the function is
> invoked, concatenation (or not) has happened already, aiui.

According to my understanding, a region is compressible if there's a
contiguous PDX translation that covers the whole region.  And I agree,
concatenation or not doesn't really matter here.

> > But maybe I'm not understanding your point correctly, could you maybe
> > provide an example if you disagree with my reply above?  Sorry if I'm
> > being dull, with this compression stuff it's sometimes hard for me to
> > visualize the case you are trying to make without a concrete
> > example.
> 
> What I think I didn't take into consideration is that from two pages
> being contiguous in MFN space, it ought to follow they're also
> contiguous in PDX space. Hence [base, base + npages) crossing a region
> boundary (if, contrary to what you say, this was possible in the first
> place) would still not be encountering a discontinuity. So overall not
> an issue, irrespective of what pdx_is_region_compressible() means
> towards (non-)contiguity.

OK, so I think we are in agreement that region crossing in
pdx_is_region_compressible() is not an issue, as long as regions are
contiguous.
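
To make that concrete, here is a toy model of the offset translation.
It is a sketch only: the 16-frame slots, table sizes and offsets are made
up for illustration, and region_compressible() just mirrors the
expression quoted above. PFN slots 0 and 1 are adjacent RAM with the same
offset, slots 2 and 3 are a hole, and slot 4 is RAM moved down by 0x20.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy parameters (hypothetical): 16 frames per lookup slot. */
#define IDX_SHIFT    4
#define NR_PFN_SLOTS 8
#define NR_PDX_SLOTS 4

/* PFN -> PDX offsets: only slot 4 (PFN 0x40..0x4f) is shifted down. */
static const uint64_t pfn_offsets[NR_PFN_SLOTS] = { [4] = 0x20 };
/* PDX -> PFN offsets: PDX slot 2 maps back to PFN 0x40..0x4f. */
static const uint64_t pdx_offsets[NR_PDX_SLOTS] = { [2] = 0x20 };

static uint64_t pfn_to_pdx(uint64_t pfn)
{
    return pfn - pfn_offsets[(pfn >> IDX_SHIFT) % NR_PFN_SLOTS];
}

static uint64_t pdx_to_pfn(uint64_t pdx)
{
    return pdx + pdx_offsets[(pdx >> IDX_SHIFT) % NR_PDX_SLOTS];
}

/* True if [pfn, pfn + npages) has a contiguous PDX translation. */
static bool region_compressible(uint64_t pfn, uint64_t npages)
{
    return pdx_to_pfn(pfn_to_pdx(pfn) + npages - 1) == pfn + npages - 1;
}
```

In this model region_compressible(0x08, 0x10) is true even though the
range crosses from slot 0 into slot 1, because both slots share an
offset. region_compressible(0x18, 0x10) is false, because the range runs
into the hole and the round trip lands on PFN 0x47 instead of 0x27.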

Thanks, Roger.



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-06-30  6:34   ` Jan Beulich
@ 2025-07-01 15:49     ` Roger Pau Monné
  2025-07-01 16:01       ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-01 15:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On Mon, Jun 30, 2025 at 08:34:52AM +0200, Jan Beulich wrote:
> On 20.06.2025 13:11, Roger Pau Monne wrote:
> > @@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
> >  
> >  #ifdef CONFIG_PDX_MASK_COMPRESSION
> >      invalid |= mfn & pfn_hole_mask;
> > +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
> > +    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));
> >  #endif
> >  
> >      if ( unlikely(evaluate_nospec(invalid)) )
> 
> In the chat you mentioned that you would add a check against max_pdx here. While
> that feels sufficient, I couldn't quite convince myself of this formally. Hence
> an alternative proposal for consideration, which imo is more clearly achieving
> the effect of allowing for no false-positive results. In particular, how about
> adding another array holding the PDX upper bounds for the respective region.
> When naming the existing two arrays moffs[] and poffs[] for brevity, the new
> one would be plimit[], but indexed by the MFN index. Then we'd end up with
> 
> 	p = mfn - moffs[midx]; /* Open-coded pfn_to_pdx() */
> 	invalid |= p >= plimit[midx] || p < plimit[midx - 1];
> 
> Of course this would need massaging to deal with the midx == 0 case, perhaps by
> making the array one slot larger and incrementing the indexes by 1. The
> downside compared to the max_pdx variant is that while it's the same number of
> memory accesses (and the same number of comparisons [or replacements thereof,
> like the ^ in context above]), cache locality is worse (simply because of the
> fact that it's another array).

I've got an alternative proposal, that also uses an extra array but is
IMO simpler.  Introduce an array to hold the PFN bases for the
different ranges that are covered by the translation.  Following the
same example, this would be:

PFN compression using lookup table shift 29 and region size 0x10000000
 range 0 [0000000000000, 000000807ffff] PFN IDX   0 : 0000000000000
 range 1 [0000063e80000, 000006be7ffff] PFN IDX   3 : 0000053e80000
 range 2 [00000c7e80000, 00000cfe7ffff] PFN IDX   6 : 00000a7e80000
 range 3 [000012be80000, 0000133e7ffff] PFN IDX   9 : 00000fbe80000

pfn_bases[] = { [0] =          0, [3] =  0x63e80000,
                [6] = 0xc7e80000, [9] = 0x12be80000 };

With the rest of the entries poisoned to ~0UL.

The checking would then be:

base = pfn_bases[PFN_TBL_IDX(mfn)];
invalid |= mfn < base || mfn >= base + (1UL << pdx_index_shift);

I think the above is clearer and avoids the weirdness of using IDX +
1 for the array indexes.  This relies on the fact that we can obtain
the PDX region size from the PDX shift itself.
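
As a self-contained sketch of that check with the example values above
(not the final code: NR_SLOTS, POISON and mfn_in_region() are
hypothetical names, and the out-of-range index guard is an addition for
the sketch):

```c
#include <stdbool.h>
#include <stdint.h>

#define PFN_TBL_SHIFT   29          /* lookup table shift from the example */
#define PDX_INDEX_SHIFT 28          /* region size 0x10000000 == 1 << 28 */
#define NR_SLOTS        10          /* hypothetical table size */
#define POISON          UINT64_MAX  /* stands in for ~0UL */

static const uint64_t pfn_bases[NR_SLOTS] = {
    [0] = 0x0,         [1] = POISON, [2] = POISON,
    [3] = 0x63e80000,  [4] = POISON, [5] = POISON,
    [6] = 0xc7e80000,  [7] = POISON, [8] = POISON,
    [9] = 0x12be80000,
};

/* True if mfn falls inside the region whose base is stored for its slot. */
static bool mfn_in_region(uint64_t mfn)
{
    uint64_t idx = mfn >> PFN_TBL_SHIFT;
    uint64_t base;

    if ( idx >= NR_SLOTS )
        return false;

    base = pfn_bases[idx];

    /* For a poisoned slot mfn < base always holds, so it is rejected. */
    return !(mfn < base || mfn >= base + (UINT64_C(1) << PDX_INDEX_SHIFT));
}
```

With these tables 0x63e80000 is accepted, while 0x63e7ffff (below the
base of its slot) and 0x40000000 (poisoned slot) are rejected.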

Thanks, Roger.



* Re: [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets
  2025-07-01 15:49     ` Roger Pau Monné
@ 2025-07-01 16:01       ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-07-01 16:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksii Kurochko, Community Manager, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	xen-devel

On 01.07.2025 17:49, Roger Pau Monné wrote:
> On Mon, Jun 30, 2025 at 08:34:52AM +0200, Jan Beulich wrote:
>> On 20.06.2025 13:11, Roger Pau Monne wrote:
>>> @@ -40,6 +41,8 @@ bool __mfn_valid(unsigned long mfn)
>>>  
>>>  #ifdef CONFIG_PDX_MASK_COMPRESSION
>>>      invalid |= mfn & pfn_hole_mask;
>>> +#elif defined(CONFIG_PDX_OFFSET_COMPRESSION)
>>> +    invalid |= mfn ^ pdx_to_pfn(pfn_to_pdx(mfn));
>>>  #endif
>>>  
>>>      if ( unlikely(evaluate_nospec(invalid)) )
>>
>> In the chat you mentioned that you would add a check against max_pdx here. While
>> that feels sufficient, I couldn't quite convince myself of this formally. Hence
>> an alternative proposal for consideration, which imo is more clearly achieving
>> the effect of allowing for no false-positive results. In particular, how about
>> adding another array holding the PDX upper bounds for the respective region.
>> When naming the existing two arrays moffs[] and poffs[] for brevity, the new
>> one would be plimit[], but indexed by the MFN index. Then we'd end up with
>>
>> 	p = mfn - moffs[midx]; /* Open-coded pfn_to_pdx() */
>> 	invalid |= p >= plimit[midx] || p < plimit[midx - 1];
>>
>> Of course this would need massaging to deal with the midx == 0 case, perhaps by
>> making the array one slot larger and incrementing the indexes by 1. The
>> downside compared to the max_pdx variant is that while it's the same number of
>> memory accesses (and the same number of comparisons [or replacements thereof,
>> like the ^ in context above]), cache locality is worse (simply because of the
>> fact that it's another array).
> 
> I've got an alternative proposal, that also uses an extra array but is
> IMO simpler.  Introduce an array to hold the PFN bases for the
> different ranges that are covered by the translation.  Following the
> same example, this would be:
> 
> PFN compression using lookup table shift 29 and region size 0x10000000
>  range 0 [0000000000000, 000000807ffff] PFN IDX   0 : 0000000000000
>  range 1 [0000063e80000, 000006be7ffff] PFN IDX   3 : 0000053e80000
>  range 2 [00000c7e80000, 00000cfe7ffff] PFN IDX   6 : 00000a7e80000
>  range 3 [000012be80000, 0000133e7ffff] PFN IDX   9 : 00000fbe80000
> 
> pfn_bases[] = { [0] =          0, [3] =  0x63e80000,
>                 [6] = 0xc7e80000, [9] = 0x12be80000 };
> 
> With the rest of the entries poisoned to ~0UL.
> 
> The checking would then be:
> 
> base = pfn_bases[PFN_TBL_IDX(mfn)];
> invalid |= mfn < base || mfn >= base + (1UL << pdx_index_shift);
> 
> I think the above is clearer and avoids the weirdness of using IDX +
> 1 for the array indexes.  This relies on the fact that we can obtain
> the PDX region size from the PDX shift itself.

Sounds okay to me.

Jan



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-01  6:05       ` Jan Beulich
@ 2025-07-01 20:46         ` Stefano Stabellini
  2025-07-02  6:08           ` Jan Beulich
                             ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-07-01 20:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, xen-devel, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 7508 bytes --]

On Tue, 1 Jul 2025, Jan Beulich wrote:
> Sadly from this you omitted the output from the setup of the offsets
> arrays. Considering also your later reply, I'd be curious to know what
> mfn_to_pdx(0x50000000) is.
 
Full logs here, and debug patch in attachment.

(XEN) Checking for initrd in /chosen
(XEN) RAM: 0000000000000000 - 000000007fffffff
(XEN) RAM: 0000000800000000 - 000000087fffffff
(XEN) RAM: 0000050000000000 - 000005007fffffff
(XEN) RAM: 0000060000000000 - 000006007fffffff
(XEN) RAM: 0000070000000000 - 000007007fffffff
(XEN) 
(XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
(XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
(XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
(XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
(XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
(XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
(XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
(XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
(XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
(XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
(XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
(XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
(XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
(XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
(XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
(XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
(XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
(XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
(XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
(XEN) 
(XEN) 
(XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
(XEN) [00000006bfc302ec] parameter "debug" unknown!
(XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
(XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
(XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
(XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
(XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
(XEN) [00000006bfd1444f] DEBUG setup_mm 252
(XEN) [00000006bfd3dc6f] DEBUG setup_mm 273 start=0 size=80000000 ram_end=80000000 directmap_base_pdx=0
(XEN) [00000006bfd5616e] DEBUG setup_directmap_mappings 229 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=0
(XEN) [00000006bfd7d38a] DEBUG setup_directmap_mappings 237 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0
(XEN) [00000006bfd92728] DEBUG setup_mm 273 start=800000000 size=80000000 ram_end=880000000 directmap_base_pdx=0
(XEN) [00000006bfdaba3b] DEBUG setup_directmap_mappings 229 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=800000
(XEN) [00000006bfdcd79c] DEBUG setup_directmap_mappings 237 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0
(XEN) [00000006bfde4d82] DEBUG setup_mm 273 start=50000000000 size=80000000 ram_end=50080000000 directmap_base_pdx=0
(XEN) [00000006bfdfaef0] DEBUG setup_directmap_mappings 229 base_mfn=50000000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=50000000
(XEN) [00000006bfe35249] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
(XEN) [00000006bfe68507] ----[ Xen-4.21-unstable  arm64  debug=y  Not tainted ]----
(XEN) [00000006bfe766bf] CPU:    0
(XEN) [00000006bfe832e0] PC:     00000a00002da70c setup_mm+0x284/0x308
(XEN) [00000006bfea5b1a] LR:     00000a00002da6b0
(XEN) [00000006bfeb1032] SP:     00000a0000327e00
(XEN) [00000006bfebf403] CPSR:   00000000200003c9 MODE:64-bit EL2h (Hypervisor, handler)
(XEN) [00000006bfed4634]      X0: 0000000000000017  X1: 0000000000000000  X2: 0000000050000000
(XEN) [00000006bfee4d11]      X3: 000000004fffffff  X4: 0000000000000020  X5: 0000000000000000
(XEN) [00000006bfef48cf]      X6: 0000000000000000  X7: 0000000000000000  X8: ffffffffffffffff
(XEN) [00000006bff047ac]      X9: fefefefefefeff09 X10: 0000000000000080 X11: 0101010101010101
(XEN) [00000006bff153b4]     X12: 0000000000000008 X13: 0000000000000009 X14: 0000000000000030
(XEN) [00000006bff2620d]     X15: 00000a0000a00000 X16: 00000a0000291478 X17: 0000000000000000
(XEN) [00000006bff35c41]     X18: 000000007be9bbe0 X19: 00000a0000292c40 X20: 00000a00002ade68
(XEN) [00000006bff465a5]     X21: 0000050080000000 X22: 0000000000000000 X23: 0000000180000000
(XEN) [00000006bff57a51]     X24: 0000000000000002 X25: 00000a0000292c50 X26: 0000000050000000
(XEN) [00000006bff67d91]     X27: 0000000000080000 X28: 0000050000000000  FP: 00000a0000327e00
(XEN) [00000006bff76ebe] 
(XEN) [00000006bff7c3e3]   VTCR_EL2: 0000000000000000
(XEN) [00000006bff8501a]  VTTBR_EL2: 0000000000000000
(XEN) [00000006bff8f616] 
(XEN) [00000006bff94c4a]  SCTLR_EL2: 0000000030cd183d
(XEN) [00000006bff9e3f7]    HCR_EL2: 0000000000000038
(XEN) [00000006bffaac9c]  TTBR0_EL2: 0000000022148000
(XEN) [00000006bffb6794] 
(XEN) [00000006bffbc972]    ESR_EL2: 00000000f2000001
(XEN) [00000006bffcb424]  HPFAR_EL2: 0000000000000000
(XEN) [00000006bffd7c69]    FAR_EL2: 0000000000000000
(XEN) [00000006bffe3719] 
(XEN) [00000006bffecd4b] Xen stack trace from sp=00000a0000327e00:
(XEN) [00000006bfff9321]    00000a0000327e60 00000a00002e4378 0000000022200000 000000000000f000
(XEN) [00000006c000e3e1]    00000a0000c0a5c0 00000a0000332000 00000a0000a00000 0000000000000000
(XEN) [00000006c001f69c]    0000000000000000 0000000000000000 0000000000000000 000000007bff2f70
(XEN) [00000006c0031b91]    000000007be89ea0 00000a00002001a4 0000000022000000 fffff60021e00000
(XEN) [00000006c0041c20]    0000000022200000 0000000000001710 0000000000000000 0000000000000000
(XEN) [00000006c0052629]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c0065bde]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00752d1]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00858cc]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c0096b34]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00a72f3]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00b8357]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00ce60f]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00e2ee4]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c00f53e7]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c01091f3]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [00000006c011cd30] Xen call trace:
(XEN) [00000006c01264b7]    [<00000a00002da70c>] setup_mm+0x284/0x308 (PC)
(XEN) [00000006c01348a8]    [<00000a00002da6b0>] setup_mm+0x228/0x308 (LR)
(XEN) [00000006c0144263]    [<00000a00002e4378>] start_xen+0x118/0x9d0
(XEN) [00000006c01529c3]    [<00000a00002001a4>] arch/arm/arm64/head.o#primary_switched+0x4/0x24
(XEN) [00000006c0165f60] 
(XEN) [00000006c0176bd8] 
(XEN) [00000006c017c5cf] ****************************************
(XEN) [00000006c018964c] Panic on CPU 0:
(XEN) [00000006c0190b79] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
(XEN) [00000006c01af78d] ****************************************

[-- Attachment #2: Type: text/x-diff; name=debug.patch, Size: 3988 bytes --]

diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index 27327b11db..c1eb13219c 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -226,6 +226,7 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
             (base_mfn - mfn_gb) * PAGE_SIZE;
     }
 
+    printk("DEBUG %s %d base_mfn=%lx nr_mfns=%lx directmap_base_pdx=%lx mfn_to_pdx=%lx\n",__func__,__LINE__,base_mfn,nr_mfns,directmap_base_pdx,(unsigned long)mfn_to_pdx(_mfn(base_mfn)));
     if ( base_mfn < mfn_x(directmap_mfn_start) )
         panic("cannot add directmap mapping at %lx below heap start %lx\n",
               base_mfn, mfn_x(directmap_mfn_start));
@@ -233,6 +234,7 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
     rc = map_pages_to_xen((vaddr_t)__mfn_to_virt(base_mfn),
                           _mfn(base_mfn), nr_mfns,
                           PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
+    printk("DEBUG %s %d base_mfn=%lx nr_mfns=%lx directmap_base_pdx=%lx\n",__func__,__LINE__,base_mfn,nr_mfns,directmap_base_pdx);
     if ( rc )
         panic("Unable to setup the directmap mappings.\n");
 }
@@ -247,6 +249,7 @@ void __init setup_mm(void)
 
     init_pdx();
 
+    printk("DEBUG %s %d\n",__func__,__LINE__);
     /*
      * We need some memory to allocate the page-tables used for the directmap
      * mappings. But some regions may contain memory already allocated
@@ -267,19 +270,24 @@ void __init setup_mm(void)
         ram_start = min(ram_start, bank->start);
         ram_end = max(ram_end, bank_end);
 
+        printk("DEBUG %s %d start=%lx size=%lx ram_end=%lx directmap_base_pdx=%lx\n",__func__,__LINE__,bank->start,bank->size,ram_end, directmap_base_pdx);
         setup_directmap_mappings(PFN_DOWN(bank->start),
                                  PFN_DOWN(bank->size));
     }
+    printk("DEBUG %s %d\n",__func__,__LINE__);
 
     total_pages += ram_size >> PAGE_SHIFT;
 
     directmap_virt_end = XENHEAP_VIRT_START + ram_end - ram_start;
     directmap_mfn_start = maddr_to_mfn(ram_start);
     directmap_mfn_end = maddr_to_mfn(ram_end);
+    printk("DEBUG %s %d\n",__func__,__LINE__);
 
     max_page = PFN_DOWN(ram_end);
 
+    printk("DEBUG %s %d\n",__func__,__LINE__);
     init_frametable();
+    printk("DEBUG %s %d\n",__func__,__LINE__);
 
     init_staticmem_pages();
     init_sharedmem_pages();
diff --git a/xen/arch/arm/mmu/mm.c b/xen/arch/arm/mmu/mm.c
index 69617a4986..c31ef3255b 100644
--- a/xen/arch/arm/mmu/mm.c
+++ b/xen/arch/arm/mmu/mm.c
@@ -25,6 +25,7 @@ init_frametable_chunk(unsigned long pdx_s, unsigned long pdx_e)
     base_mfn = alloc_boot_pages(chunk_size >> PAGE_SHIFT, 32 << (20 - 12));
 
     virt = (unsigned long)pdx_to_page(pdx_s);
+    printk("DEBUG %s %d virt=%lx base_mfn=%lx pfn_start=%lx pfn_end=%lx\n",__func__,__LINE__,(unsigned long)virt,mfn_x(base_mfn),mfn_x(pdx_to_mfn(pdx_s)),mfn_x(pdx_to_mfn(pdx_e)));
     rc = map_pages_to_xen(virt, base_mfn, chunk_size >> PAGE_SHIFT,
                           PAGE_HYPERVISOR_RW | _PAGE_BLOCK);
     if ( rc )
@@ -64,6 +65,7 @@ void __init init_frametable(void)
         if ( nidx >= max_idx )
             break;
 
+        printk("DEBUG %s %d start=%lx end=%lx\n",__func__,__LINE__,mfn_x(pdx_to_mfn(sidx * PDX_GROUP_COUNT)),mfn_x(pdx_to_mfn(eidx * PDX_GROUP_COUNT)));
         init_frametable_chunk(sidx * PDX_GROUP_COUNT, eidx * PDX_GROUP_COUNT);
     }
 
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index c9ad6bbab6..1f5c1866c4 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -291,6 +291,7 @@ void __init init_pdx(void)
         bank_size = mem->bank[bank].size;
         bank_end = bank_start + bank_size;
 
+        printk("DEBUG %s %d start=%lx end=%lx\n",__func__,__LINE__,bank_start,bank_end);
         set_pdx_range(paddr_to_pfn(bank_start),
                       paddr_to_pfn(bank_end));
     }


* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-01 20:46         ` Stefano Stabellini
@ 2025-07-02  6:08           ` Jan Beulich
  2025-07-02  6:32           ` Jan Beulich
  2025-07-02  7:00           ` Roger Pau Monné
  2 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-07-02  6:08 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager, Roger Pau Monné

On 01.07.2025 22:46, Stefano Stabellini wrote:
> On Tue, 1 Jul 2025, Jan Beulich wrote:
>> Sadly from this you omitted the output from the setup of the offsets
>> arrays. Considering also your later reply, I'd be curious to know what
>> mfn_to_pdx(0x50000000) is.
>  
> Full logs here, and debug patch in attachment.

Interesting. Up to ...

> (XEN) Checking for initrd in /chosen
> (XEN) RAM: 0000000000000000 - 000000007fffffff
> (XEN) RAM: 0000000800000000 - 000000087fffffff
> (XEN) RAM: 0000050000000000 - 000005007fffffff
> (XEN) RAM: 0000060000000000 - 000006007fffffff
> (XEN) RAM: 0000070000000000 - 000007007fffffff
> (XEN) 
> (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
> (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
> (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
> (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
> (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
> (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
> (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
> (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
> (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
> (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
> (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
> (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
> (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
> (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
> (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
> (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
> (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
> (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
> (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
> (XEN) 
> (XEN) 
> (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
> (XEN) [00000006bfc302ec] parameter "debug" unknown!
> (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
> (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
> (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
> (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
> (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
> (XEN) [00000006bfd1444f] DEBUG setup_mm 252
> (XEN) [00000006bfd3dc6f] DEBUG setup_mm 273 start=0 size=80000000 ram_end=80000000 directmap_base_pdx=0
> (XEN) [00000006bfd5616e] DEBUG setup_directmap_mappings 229 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=0
> (XEN) [00000006bfd7d38a] DEBUG setup_directmap_mappings 237 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0
> (XEN) [00000006bfd92728] DEBUG setup_mm 273 start=800000000 size=80000000 ram_end=880000000 directmap_base_pdx=0
> (XEN) [00000006bfdaba3b] DEBUG setup_directmap_mappings 229 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=800000
> (XEN) [00000006bfdcd79c] DEBUG setup_directmap_mappings 237 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0
> (XEN) [00000006bfde4d82] DEBUG setup_mm 273 start=50000000000 size=80000000 ram_end=50080000000 directmap_base_pdx=0
> (XEN) [00000006bfdfaef0] DEBUG setup_directmap_mappings 229 base_mfn=50000000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=50000000
> (XEN) [00000006bfe35249] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72

... here there's no sign of PDX compression actually being set up; all that's
there are the init_pdx() messages. Do you perhaps have an ordering problem on
Arm? The register values ...

> (XEN) [00000006bfe68507] ----[ Xen-4.21-unstable  arm64  debug=y  Not tainted ]----
> (XEN) [00000006bfe766bf] CPU:    0
> (XEN) [00000006bfe832e0] PC:     00000a00002da70c setup_mm+0x284/0x308
> (XEN) [00000006bfea5b1a] LR:     00000a00002da6b0
> (XEN) [00000006bfeb1032] SP:     00000a0000327e00
> (XEN) [00000006bfebf403] CPSR:   00000000200003c9 MODE:64-bit EL2h (Hypervisor, handler)
> (XEN) [00000006bfed4634]      X0: 0000000000000017  X1: 0000000000000000  X2: 0000000050000000
> (XEN) [00000006bfee4d11]      X3: 000000004fffffff  X4: 0000000000000020  X5: 0000000000000000
> (XEN) [00000006bfef48cf]      X6: 0000000000000000  X7: 0000000000000000  X8: ffffffffffffffff
> (XEN) [00000006bff047ac]      X9: fefefefefefeff09 X10: 0000000000000080 X11: 0101010101010101
> (XEN) [00000006bff153b4]     X12: 0000000000000008 X13: 0000000000000009 X14: 0000000000000030
> (XEN) [00000006bff2620d]     X15: 00000a0000a00000 X16: 00000a0000291478 X17: 0000000000000000
> (XEN) [00000006bff35c41]     X18: 000000007be9bbe0 X19: 00000a0000292c40 X20: 00000a00002ade68
> (XEN) [00000006bff465a5]     X21: 0000050080000000 X22: 0000000000000000 X23: 0000000180000000
> (XEN) [00000006bff57a51]     X24: 0000000000000002 X25: 00000a0000292c50 X26: 0000000050000000
> (XEN) [00000006bff67d91]     X27: 0000000000080000 X28: 0000050000000000  FP: 00000a0000327e00

... also suggest (x2, x3, and x26 in particular) that offsets are still all
zero, i.e. PDX == MFN. And aiui DIRECTMAP_SIZE is 5TiB.
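
For reference, the boundary arithmetic behind the assertion can be checked
in isolation. This is only a minimal sketch: it assumes 4KiB pages
(PAGE_SHIFT of 12), the DIRECTMAP_SIZE value mentioned above, and
PDX == MFN (no compression active). directmap_covers() is a hypothetical
helper written for this illustration, not a Xen function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: 4KiB pages and a 5TiB directmap, as discussed above. */
#define PAGE_SHIFT      12
#define DIRECTMAP_SIZE  (UINT64_C(5) << 40)

/*
 * Hypothetical helper mirroring the failed assertion with PDX == MFN:
 * the PDX offset from the directmap base must be strictly below the
 * number of pages the directmap can cover.
 */
static bool directmap_covers(uint64_t pdx, uint64_t directmap_base_pdx)
{
    return (pdx - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT);
}
```

Under these assumptions DIRECTMAP_SIZE >> PAGE_SHIFT is 0x50000000, so
directmap_covers(0x4fffffff, 0) holds while directmap_covers(0x50000000, 0)
does not; 0x50000000 is the base_mfn of the third bank in the log.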

Jan



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-01 20:46         ` Stefano Stabellini
  2025-07-02  6:08           ` Jan Beulich
@ 2025-07-02  6:32           ` Jan Beulich
  2025-07-02  6:53             ` Roger Pau Monné
  2025-07-02  7:00           ` Roger Pau Monné
  2 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-07-02  6:32 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager, Roger Pau Monné

On 01.07.2025 22:46, Stefano Stabellini wrote:
> On Tue, 1 Jul 2025, Jan Beulich wrote:
>> Sadly from this you omitted the output from the setup of the offsets
>> arrays. Considering also your later reply, I'd be curious to know what
>> mfn_to_pdx(0x50000000) is.
>  
> Full logs here, and debug patch in attachment.
> 
> (XEN) Checking for initrd in /chosen
> (XEN) RAM: 0000000000000000 - 000000007fffffff
> (XEN) RAM: 0000000800000000 - 000000087fffffff
> (XEN) RAM: 0000050000000000 - 000005007fffffff
> (XEN) RAM: 0000060000000000 - 000006007fffffff
> (XEN) RAM: 0000070000000000 - 000007007fffffff
> (XEN) 
> (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
> (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
> (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
> (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
> (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
> (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
> (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
> (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
> (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
> (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
> (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
> (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
> (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
> (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
> (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
> (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
> (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
> (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
> (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
> (XEN) 
> (XEN) 
> (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
> (XEN) [00000006bfc302ec] parameter "debug" unknown!
> (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
> (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
> (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
> (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
> (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
> (XEN) [00000006bfd1444f] DEBUG setup_mm 252

This one is immediately after init_pdx(), i.e. by here the log messages from
Roger's patch (out of pfn_pdx_compression_setup()) should have appeared.
Which at least falsifies my earlier suspicion about there being an ordering
issue. You do have PDX_OFFSET_COMPRESSION=y in your .config, don't you? Are
we perhaps taking the only "return false" path in pfn_offset_sanitize_ranges()
that doesn't issue a log message? I can't see how we could plausibly take the
"Avoid compression if there's no gain" path in pfn_pdx_compression_setup()
itself.

Jan



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  6:32           ` Jan Beulich
@ 2025-07-02  6:53             ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-02  6:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, xen-devel, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

On Wed, Jul 02, 2025 at 08:32:27AM +0200, Jan Beulich wrote:
> On 01.07.2025 22:46, Stefano Stabellini wrote:
> > On Tue, 1 Jul 2025, Jan Beulich wrote:
> >> Sadly from this you omitted the output from the setup of the offsets
> >> arrays. Considering also your later reply, I'd be curious to know what
> >> mfn_to_pdx(0x50000000) is.
> >  
> > Full logs here, and debug patch in attachment.
> > 
> > (XEN) Checking for initrd in /chosen
> > (XEN) RAM: 0000000000000000 - 000000007fffffff
> > (XEN) RAM: 0000000800000000 - 000000087fffffff
> > (XEN) RAM: 0000050000000000 - 000005007fffffff
> > (XEN) RAM: 0000060000000000 - 000006007fffffff
> > (XEN) RAM: 0000070000000000 - 000007007fffffff
> > (XEN) 
> > (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
> > (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
> > (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
> > (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
> > (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
> > (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
> > (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
> > (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
> > (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
> > (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
> > (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
> > (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
> > (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
> > (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
> > (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
> > (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
> > (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
> > (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
> > (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
> > (XEN) 
> > (XEN) 
> > (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
> > (XEN) [00000006bfc302ec] parameter "debug" unknown!
> > (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
> > (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
> > (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
> > (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
> > (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
> > (XEN) [00000006bfd1444f] DEBUG setup_mm 252
> 
> This one is immediately after init_pdx(), i.e. by here the log messages from
> Roger's patch (out of pfn_pdx_compression_setup()) should have appeared.
> Which at least falsifies my earlier suspicion about there being an ordering
> issue. You do have PDX_OFFSET_COMPRESSION=y in your .config, don't you? Are
> we perhaps taking the only "return false" path in pfn_offset_sanitize_ranges()
> that doesn't issue a log message?

Sorry, should have posted this yesterday.  With the current offset
compression algorithm the memory map provided by Stefano is not
compressible, as the calculated PFN shift leads to lookup table
indexes that overflow the default table size.
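
To illustrate the kind of overflow meant here, below is a rough sketch.
The table order and the shift values are assumptions picked for the
example, and tbl_index() / index_fits() are hypothetical helpers; none
of this is taken from the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameters, chosen for illustration only. */
#define TBL_ORDER    6                       /* assumed: 64-entry table */
#define TBL_ENTRIES  (UINT64_C(1) << TBL_ORDER)

/* Hypothetical: derive a lookup table index from a PFN and a shift. */
static uint64_t tbl_index(uint64_t pfn, unsigned int shift)
{
    return pfn >> shift;
}

/* Hypothetical: true if the index for this PFN fits in the table. */
static bool index_fits(uint64_t pfn, unsigned int shift)
{
    return tbl_index(pfn, shift) < TBL_ENTRIES;
}
```

With a shift of 19 (matching the 0x80000-page banks in Stefano's map),
the bank at PFN 0x70000000 would get index 0xe00, far beyond 64 entries.
A shift of 25 would make every index fit, but then PFNs 0 and 0x800000
share index 0, i.e. the first two banks end up in the same slot.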

I'm working on an improved version that attempts to always preserve
the most significant bits in the lookup table index, even if that
leads to merging regions.

Thanks, Roger.



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-01 20:46         ` Stefano Stabellini
  2025-07-02  6:08           ` Jan Beulich
  2025-07-02  6:32           ` Jan Beulich
@ 2025-07-02  7:00           ` Roger Pau Monné
  2025-07-02  7:52             ` Orzel, Michal
  2 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-02  7:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Jan Beulich, xen-devel, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

On Tue, Jul 01, 2025 at 01:46:19PM -0700, Stefano Stabellini wrote:
> On Tue, 1 Jul 2025, Jan Beulich wrote:
> > Sadly from this you omitted the output from the setup of the offsets
> > arrays. Considering also your later reply, I'd be curious to know what
> > mfn_to_pdx(0x50000000) is.
>  
> Full logs here, and debug patch in attachment.
> 
> (XEN) Checking for initrd in /chosen
> (XEN) RAM: 0000000000000000 - 000000007fffffff
> (XEN) RAM: 0000000800000000 - 000000087fffffff
> (XEN) RAM: 0000050000000000 - 000005007fffffff
> (XEN) RAM: 0000060000000000 - 000006007fffffff
> (XEN) RAM: 0000070000000000 - 000007007fffffff
> (XEN) 
> (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
> (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
> (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
> (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
> (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
> (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
> (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
> (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
> (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
> (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
> (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
> (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
> (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
> (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
> (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
> (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
> (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
> (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
> (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
> (XEN) 
> (XEN) 
> (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
> (XEN) [00000006bfc302ec] parameter "debug" unknown!
> (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
> (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
> (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
> (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
> (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
> (XEN) [00000006bfd1444f] DEBUG setup_mm 252
> (XEN) [00000006bfd3dc6f] DEBUG setup_mm 273 start=0 size=80000000 ram_end=80000000 directmap_base_pdx=0
> (XEN) [00000006bfd5616e] DEBUG setup_directmap_mappings 229 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=0
> (XEN) [00000006bfd7d38a] DEBUG setup_directmap_mappings 237 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0
> (XEN) [00000006bfd92728] DEBUG setup_mm 273 start=800000000 size=80000000 ram_end=880000000 directmap_base_pdx=0
> (XEN) [00000006bfdaba3b] DEBUG setup_directmap_mappings 229 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=800000
> (XEN) [00000006bfdcd79c] DEBUG setup_directmap_mappings 237 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0
> (XEN) [00000006bfde4d82] DEBUG setup_mm 273 start=50000000000 size=80000000 ram_end=50080000000 directmap_base_pdx=0
> (XEN) [00000006bfdfaef0] DEBUG setup_directmap_mappings 229 base_mfn=50000000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=50000000
> (XEN) [00000006bfe35249] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72

As said in the other reply, the issue here is that with the v2 PDX
offset compression logic your memory map is not compressible, and this
leads to an overflow, as anything above 5TiB won't fit in the
directmap AFAICT.  We already discussed with Jan that ARM seems to be
missing any logic to account for the max addressable page:

https://lore.kernel.org/xen-devel/9074f1a6-a605-43f4-97f3-d0a626252d3f@suse.com/

x86 has setup_max_pdx() that truncates the maximum addressable MFN
based on the active PDX compression and the virtual memory map
restrictions.  ARM needs similar logic to account for these
restrictions.
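
As a rough idea of what such truncation could look like, here is a
minimal sketch. It assumes 4KiB pages and a 5TiB directmap, and
clamp_max_pdx() is a hypothetical helper made up for this example; it is
not the x86 setup_max_pdx() implementation.

```c
#include <stdint.h>

/* Assumed values: 4KiB pages and a 5TiB directmap. */
#define PAGE_SHIFT      12
#define DIRECTMAP_SIZE  (UINT64_C(5) << 40)

/*
 * Hypothetical helper: limit max_pdx (one past the last usable PDX) so
 * that everything below it fits in the directmap starting at
 * directmap_base_pdx.
 */
static uint64_t clamp_max_pdx(uint64_t max_pdx, uint64_t directmap_base_pdx)
{
    uint64_t limit = directmap_base_pdx + (DIRECTMAP_SIZE >> PAGE_SHIFT);

    return max_pdx > limit ? limit : max_pdx;
}
```

For the uncompressed map above, clamp_max_pdx(0x70080000, 0) would give
0x50000000, leaving the three banks at or above 5TiB unaddressable.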

Thanks, Roger.



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  7:00           ` Roger Pau Monné
@ 2025-07-02  7:52             ` Orzel, Michal
  2025-07-02  8:26               ` Roger Pau Monné
  2025-07-02  8:45               ` Julien Grall
  0 siblings, 2 replies; 55+ messages in thread
From: Orzel, Michal @ 2025-07-02  7:52 UTC (permalink / raw)
  To: Roger Pau Monné, Stefano Stabellini
  Cc: Jan Beulich, xen-devel, Andrew Cooper, Anthony PERARD,
	Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager



On 02/07/2025 09:00, Roger Pau Monné wrote:
> On Tue, Jul 01, 2025 at 01:46:19PM -0700, Stefano Stabellini wrote:
>> On Tue, 1 Jul 2025, Jan Beulich wrote:
>>> Sadly from this you omitted the output from the setup of the offsets
>>> arrays. Considering also your later reply, I'd be curious to know what
>>> mfn_to_pdx(0x50000000) is.
>>  
>> Full logs here, and debug patch in attachment.
>>
>> (XEN) Checking for initrd in /chosen
>> (XEN) RAM: 0000000000000000 - 000000007fffffff
>> (XEN) RAM: 0000000800000000 - 000000087fffffff
>> (XEN) RAM: 0000050000000000 - 000005007fffffff
>> (XEN) RAM: 0000060000000000 - 000006007fffffff
>> (XEN) RAM: 0000070000000000 - 000007007fffffff
>> (XEN) 
>> (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
>> (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
>> (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
>> (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
>> (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
>> (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
>> (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
>> (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
>> (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
>> (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
>> (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
>> (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
>> (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
>> (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
>> (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
>> (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
>> (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
>> (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
>> (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
>> (XEN) 
>> (XEN) 
>> (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
>> (XEN) [00000006bfc302ec] parameter "debug" unknown!
>> (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
>> (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
>> (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
>> (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
>> (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
>> (XEN) [00000006bfd1444f] DEBUG setup_mm 252
>> (XEN) [00000006bfd3dc6f] DEBUG setup_mm 273 start=0 size=80000000 ram_end=80000000 directmap_base_pdx=0
>> (XEN) [00000006bfd5616e] DEBUG setup_directmap_mappings 229 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=0
>> (XEN) [00000006bfd7d38a] DEBUG setup_directmap_mappings 237 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0
>> (XEN) [00000006bfd92728] DEBUG setup_mm 273 start=800000000 size=80000000 ram_end=880000000 directmap_base_pdx=0
>> (XEN) [00000006bfdaba3b] DEBUG setup_directmap_mappings 229 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=800000
>> (XEN) [00000006bfdcd79c] DEBUG setup_directmap_mappings 237 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0
>> (XEN) [00000006bfde4d82] DEBUG setup_mm 273 start=50000000000 size=80000000 ram_end=50080000000 directmap_base_pdx=0
>> (XEN) [00000006bfdfaef0] DEBUG setup_directmap_mappings 229 base_mfn=50000000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=50000000
>> (XEN) [00000006bfe35249] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> 
> As said on the other reply, the issue here is that with the v2 PDX
> offset compression logic your memory map is not compressible, and this
> leads to an overflow, as anything above 5TiB won't fit in the
> directmap AFAICT.  We already discussed with Jan that ARM seems to be
> missing any logic to account for the max addressable page:
> 
> https://lore.kernel.org/xen-devel/9074f1a6-a605-43f4-97f3-d0a626252d3f@suse.com/
> 
> x86 has setup_max_pdx() that truncates the maximum addressable MFN
> based on the active PDX compression and the virtual memory map
> restrictions.  ARM needs similar logic to account for these
> restrictions.

We have a few issues on Arm. First, we don't check whether the direct map is
big enough for max_pdx, which we don't set at all. Second, we don't really use
PDX grouping (which can also be used without compression). My patch (that
Stefano attached previously) fixes the second issue (Alejandro will take it
over to come up with a common solution). For the first issue, we need to know
max_page (at the moment we calculate it at the very end of setup_mm(), but we
could do it in init_pdx() to know it ahead of setting up the direct map) and
the PDX offset (on x86 there is no offset). I also think that on Arm we should
just panic if the direct map is too small.

The issue can also be reproduced with PDX compression disabled, so it is not
specific to Roger's patch.

@Julien, I'm thinking of something like this:

diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
index 4d22f35618aa..e6d9b49acd3c 100644
--- a/xen/arch/arm/arm32/mmu/mm.c
+++ b/xen/arch/arm/arm32/mmu/mm.c
@@ -190,7 +190,6 @@ void __init setup_mm(void)

     /* Frame table covers all of RAM region, including holes */
     setup_frametable_mappings(ram_start, ram_end);
-    max_page = PFN_DOWN(ram_end);

     /*
      * The allocators may need to use map_domain_page() (such as for
diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
index a0a2dd8cc762..3e64be6ae664 100644
--- a/xen/arch/arm/arm64/mmu/mm.c
+++ b/xen/arch/arm/arm64/mmu/mm.c
@@ -224,6 +224,9 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
          */
         directmap_virt_start = DIRECTMAP_VIRT_START +
             (base_mfn - mfn_gb) * PAGE_SIZE;
+
+        if ( (max_pdx - directmap_base_pdx) > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
+            panic("Direct map is too small\n");
     }

     if ( base_mfn < mfn_x(directmap_mfn_start) )
@@ -278,7 +281,6 @@ void __init setup_mm(void)
     directmap_mfn_end = maddr_to_mfn(ram_end);

     setup_frametable_mappings(ram_start, ram_end);
-    max_page = PFN_DOWN(ram_end);

     init_staticmem_pages();
     init_sharedmem_pages();
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 58acc2d0d4b8..e047225eb413 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -265,6 +265,7 @@ void __init init_pdx(void)
      */
     uint64_t mask = pdx_init_mask(0x0);
     int bank;
+    paddr_t ram_end = 0;

     for ( bank = 0 ; bank < mem->nr_banks; bank++ )
     {
@@ -290,10 +291,14 @@ void __init init_pdx(void)
         bank_start = mem->bank[bank].start;
         bank_size = mem->bank[bank].size;
         bank_end = bank_start + bank_size;
+        ram_end = max(ram_end, bank_end);

         set_pdx_range(paddr_to_pfn(bank_start),
                       paddr_to_pfn(bank_end));
     }
+
+    max_page = PFN_DOWN(ram_end);
+    max_pdx = pfn_to_pdx(max_page - 1) + 1;
 }

 size_t __read_mostly dcache_line_bytes;
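
A minimal standalone sketch of the check the diff above adds, run against the
RAM banks from the boot log quoted in this thread. It assumes a 5 TiB direct
map (taken from the "above 5TiB" remark, not from the Xen headers) and an
identity pfn_to_pdx(), i.e. PDX compression disabled. struct bank,
compute_max_pdx() and directmap_fits() are illustrative names, not Xen
functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
/* Assumption: 5 TiB direct map, per the "above 5TiB" remark in this thread. */
#define DIRECTMAP_SIZE (5ULL << 40)

struct bank {
    uint64_t start, size;
};

/* RAM banks copied from the "(XEN) RAM:" lines of the quoted boot log. */
static const struct bank banks[] = {
    { 0x0000000000000000ULL, 0x80000000ULL },
    { 0x0000000800000000ULL, 0x80000000ULL },
    { 0x0000050000000000ULL, 0x80000000ULL },
    { 0x0000060000000000ULL, 0x80000000ULL },
    { 0x0000070000000000ULL, 0x80000000ULL },
};

/* Identity translation: models a build with PDX compression disabled. */
static uint64_t pfn_to_pdx(uint64_t pfn)
{
    return pfn;
}

/* One past the highest PDX, mirroring the max_page/max_pdx lines in the diff. */
static uint64_t compute_max_pdx(const struct bank *b, unsigned int nr)
{
    uint64_t ram_end = 0;

    assert(nr > 0);
    for ( unsigned int i = 0; i < nr; i++ )
    {
        uint64_t end = b[i].start + b[i].size;

        if ( end > ram_end )
            ram_end = end;
    }

    return pfn_to_pdx((ram_end >> PAGE_SHIFT) - 1) + 1;
}

/* True when every PDX in [base_pdx, max_pdx) has a slot in the direct map. */
static bool directmap_fits(uint64_t max_pdx, uint64_t base_pdx)
{
    return (max_pdx - base_pdx) <= (DIRECTMAP_SIZE >> PAGE_SHIFT);
}
```

With these banks compute_max_pdx() gives 0x70080000, while the assumed direct
map covers only 0x50000000 pages, so directmap_fits() returns false. That is
the condition under which the diff would panic.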

~Michal




* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  7:52             ` Orzel, Michal
@ 2025-07-02  8:26               ` Roger Pau Monné
  2025-07-02  8:49                 ` Julien Grall
  2025-07-02  8:54                 ` Orzel, Michal
  2025-07-02  8:45               ` Julien Grall
  1 sibling, 2 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-02  8:26 UTC (permalink / raw)
  To: Orzel, Michal
  Cc: Stefano Stabellini, Jan Beulich, xen-devel, Andrew Cooper,
	Anthony PERARD, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

On Wed, Jul 02, 2025 at 09:52:45AM +0200, Orzel, Michal wrote:
> 
> 
> On 02/07/2025 09:00, Roger Pau Monné wrote:
> > On Tue, Jul 01, 2025 at 01:46:19PM -0700, Stefano Stabellini wrote:
> >> On Tue, 1 Jul 2025, Jan Beulich wrote:
> >>> Sadly from this you omitted the output from the setup of the offsets
> >>> arrays. Considering also your later reply, I'd be curious to know what
> >>> mfn_to_pdx(0x50000000) is.
> >>  
> >> Full logs here, and debug patch in attachment.
> >>
> >> (XEN) Checking for initrd in /chosen
> >> (XEN) RAM: 0000000000000000 - 000000007fffffff
> >> (XEN) RAM: 0000000800000000 - 000000087fffffff
> >> (XEN) RAM: 0000050000000000 - 000005007fffffff
> >> (XEN) RAM: 0000060000000000 - 000006007fffffff
> >> (XEN) RAM: 0000070000000000 - 000007007fffffff
> >> (XEN) 
> >> (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
> >> (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
> >> (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
> >> (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
> >> (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
> >> (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
> >> (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
> >> (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
> >> (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
> >> (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
> >> (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
> >> (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
> >> (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
> >> (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
> >> (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
> >> (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
> >> (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
> >> (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
> >> (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
> >> (XEN) 
> >> (XEN) 
> >> (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
> >> (XEN) [00000006bfc302ec] parameter "debug" unknown!
> >> (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
> >> (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
> >> (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
> >> (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
> >> (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
> >> (XEN) [00000006bfd1444f] DEBUG setup_mm 252
> >> (XEN) [00000006bfd3dc6f] DEBUG setup_mm 273 start=0 size=80000000 ram_end=80000000 directmap_base_pdx=0
> >> (XEN) [00000006bfd5616e] DEBUG setup_directmap_mappings 229 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=0
> >> (XEN) [00000006bfd7d38a] DEBUG setup_directmap_mappings 237 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0
> >> (XEN) [00000006bfd92728] DEBUG setup_mm 273 start=800000000 size=80000000 ram_end=880000000 directmap_base_pdx=0
> >> (XEN) [00000006bfdaba3b] DEBUG setup_directmap_mappings 229 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=800000
> >> (XEN) [00000006bfdcd79c] DEBUG setup_directmap_mappings 237 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0
> >> (XEN) [00000006bfde4d82] DEBUG setup_mm 273 start=50000000000 size=80000000 ram_end=50080000000 directmap_base_pdx=0
> >> (XEN) [00000006bfdfaef0] DEBUG setup_directmap_mappings 229 base_mfn=50000000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=50000000
> >> (XEN) [00000006bfe35249] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> > 
> > As said on the other reply, the issue here is that with the v2 PDX
> > offset compression logic your memory map is not compressible, and this
> > leads to an overflow, as anything above 5TiB won't fit in the
> > directmap AFAICT.  We already discussed with Jan that ARM seems to be
> > missing any logic to account for the max addressable page:
> > 
> > https://lore.kernel.org/xen-devel/9074f1a6-a605-43f4-97f3-d0a626252d3f@suse.com/
> > 
> > x86 has setup_max_pdx() that truncates the maximum addressable MFN
> > based on the active PDX compression and the virtual memory map
> > restrictions.  ARM needs similar logic to account for this
> > restrictions.
> 
> We have a few issues on Arm. First, we don't check whether direct map is big
> enough provided max_pdx that we don't set at all. Second, we don't really use
> PDX grouping (can be also used without compression). My patch (that Stefano
> attached previously) fixes the second issue (Allejandro will take it over to
> come up with common solution).

You probably can handle those as different issues, as PDX grouping is
completely disjoint from PDX compression.  It might be helpful if
we could split the PDX grouping into a separate file from the PDX
compression.

One weirdness I've noticed with ARM is the addition of start offsets
to the existing PDX compression, by using directmap_base_pdx,
directmap_mfn_start &c.  I'm not sure whether this will
interfere with the PDX compression, but it looks like a bodge.  This
should be part of the generic PDX compression implementation, not an
extra added on a per-arch basis.

FWIW, PDX offset translation should already compress any gaps from 0
to the first RAM range, and hence this won't be needed (in fact it
would just make ARM translations slower by doing an extra unneeded
operation).  My recommendation would be to move this initial offset
compression inside the PDX mask translation.

> For the first issue, we need to know max_page (at
> the moment we calculate it in setup_mm() at the very end but we could do it in
> init_pdx() to know it ahead of setting direct map) and PDX offset (on x86 there
> is no offset). I also think that on Arm we should just panic if direct map is
> too small.

Hm, that's up to the ARM folks, but my opinion is that you should
simply ignore memory above the threshold.  Panicking should IMO be a
last resort option when there's no way to work around the issue.
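
The "ignore memory above the threshold" option can be sketched as clipping
the memory map before it is consumed. This is only an illustration of the
idea: clip_banks() and struct bank are hypothetical names, and the code is not
taken from x86's setup_max_pdx().

```c
#include <assert.h>
#include <stdint.h>

struct bank {
    uint64_t start, size;
};

/*
 * Keep only the memory below 'limit'.  Banks that start at or above the
 * limit are dropped, a bank that straddles it is truncated.  The array is
 * compacted in place and the number of banks kept is returned.
 */
static unsigned int clip_banks(struct bank *b, unsigned int nr, uint64_t limit)
{
    unsigned int kept = 0;

    assert(b || nr == 0);
    for ( unsigned int i = 0; i < nr; i++ )
    {
        uint64_t end = b[i].start + b[i].size;

        if ( b[i].start >= limit )
            continue;               /* entirely unreachable: ignore it */
        if ( end > limit )
            end = limit;            /* partially reachable: truncate it */

        b[kept].start = b[i].start;
        b[kept].size = end - b[i].start;
        kept++;
    }

    return kept;
}
```

Applied to the five banks from the log with a 5 TiB limit, this would keep the
two low banks and drop the three at 5, 6 and 7 TiB, so the system would boot
with 4 GiB instead of 10 GiB.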

> The issue can be reproduced by disabling PDX compression, so not only with
> Roger's patch.
> 
> @Julien, I'm thinking of something like this:
> 
> diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
> index 4d22f35618aa..e6d9b49acd3c 100644
> --- a/xen/arch/arm/arm32/mmu/mm.c
> +++ b/xen/arch/arm/arm32/mmu/mm.c
> @@ -190,7 +190,6 @@ void __init setup_mm(void)
> 
>      /* Frame table covers all of RAM region, including holes */
>      setup_frametable_mappings(ram_start, ram_end);
> -    max_page = PFN_DOWN(ram_end);
> 
>      /*
>       * The allocators may need to use map_domain_page() (such as for
> diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
> index a0a2dd8cc762..3e64be6ae664 100644
> --- a/xen/arch/arm/arm64/mmu/mm.c
> +++ b/xen/arch/arm/arm64/mmu/mm.c
> @@ -224,6 +224,9 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
>           */
>          directmap_virt_start = DIRECTMAP_VIRT_START +
>              (base_mfn - mfn_gb) * PAGE_SIZE;
> +
> +        if ( (max_pdx - directmap_base_pdx) > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
> +            panic("Direct map is too small\n");

As said above - I would avoid propagating the usage of those offsets
into generic memory management code, its usage should be confined
inside the translation functions.

Here you probably want to use maddr_to_virt() or similar.

You can maybe pick up:

https://lore.kernel.org/xen-devel/20250611171636.5674-3-roger.pau@citrix.com/

And attempt to hook it into ARM?

I don't think it would be that difficult to reduce the consumption of
memory map ranges to what Xen can handle.

Thanks, Roger.



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  7:52             ` Orzel, Michal
  2025-07-02  8:26               ` Roger Pau Monné
@ 2025-07-02  8:45               ` Julien Grall
  1 sibling, 0 replies; 55+ messages in thread
From: Julien Grall @ 2025-07-02  8:45 UTC (permalink / raw)
  To: Orzel, Michal, Roger Pau Monné, Stefano Stabellini
  Cc: Jan Beulich, xen-devel, Andrew Cooper, Anthony PERARD,
	Bertrand Marquis, Volodymyr Babchuk, Shawn Anastasio,
	Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko,
	Community Manager

Hi Michal,

On 02/07/2025 08:52, Orzel, Michal wrote:
> We have a few issues on Arm. First, we don't check whether direct map is big
> enough provided max_pdx that we don't set at all. Second, we don't really use
> PDX grouping (can be also used without compression). My patch (that Stefano
> attached previously) fixes the second issue (Allejandro will take it over to
> come up with common solution). For the first issue, we need to know max_page (at
> the moment we calculate it in setup_mm() at the very end but we could do it in
> init_pdx() to know it ahead of setting direct map) and PDX offset (on x86 there
> is no offset). I also think that on Arm we should just panic if direct map is
> too small.
> 
> The issue can be reproduced by disabling PDX compression, so not only with
> Roger's patch.
> 
> @Julien, I'm thinking of something like this:

The change below looks good to me.

Cheers,

-- 
Julien Grall




* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  8:26               ` Roger Pau Monné
@ 2025-07-02  8:49                 ` Julien Grall
  2025-07-02  8:54                 ` Orzel, Michal
  1 sibling, 0 replies; 55+ messages in thread
From: Julien Grall @ 2025-07-02  8:49 UTC (permalink / raw)
  To: Roger Pau Monné, Orzel, Michal
  Cc: Stefano Stabellini, Jan Beulich, xen-devel, Andrew Cooper,
	Anthony PERARD, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

Hi Roger,

On 02/07/2025 09:26, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 09:52:45AM +0200, Orzel, Michal wrote:
>> We have a few issues on Arm. First, we don't check whether direct map is big
>> enough provided max_pdx that we don't set at all. Second, we don't really use
>> PDX grouping (can be also used without compression). My patch (that Stefano
>> attached previously) fixes the second issue (Allejandro will take it over to
>> come up with common solution).
> 
> You probably can handle those as different issues, as PDX grouping is
> completely disjoint from PDX compression.  It might be helpful if
> we could split the PDX grouping into a separate file from the PDX
> compression.
> 
> One weirdness I've noticed with ARM is the addition of start offsets
> to the existing PDX compression, by using directmap_base_pdx,
> directmap_mfn_start, directmap_base_pdx &c.  I'm not sure whether this will
> interfere with the PDX compression, but it looks like a bodge.  This
> should be part of the generic PDX compression implementation, not an
> extra added on a per-arch basis.

They were introduced right at the beginning of the ARM port because we 
have quite a few platforms where the memory doesn't start at 0 and there 
was still a fairly large hole between two banks. IIRC until this series 
we would have been able to handle the hole but not the offset.

If this can be handled in common code, then I would be happy with that.
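
The hole-versus-offset distinction can be illustrated with a simplified model
of mask-style compression followed by a base offset. The layout, the struct
and the function names below are made up for the example and are not the Xen
code; the model only removes one run of PFN bits that is zero in every RAM
page.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: drop one run of PFN bits that is zero in all RAM pages. */
struct mask_comp {
    unsigned int hole_shift;  /* width of the removed run, in bits */
    uint64_t bottom_mask;     /* PFN bits below the run, kept in place */
    uint64_t top_mask;        /* PFN bits above the run, shifted down */
};

static uint64_t mask_pfn_to_pdx(const struct mask_comp *c, uint64_t pfn)
{
    return (pfn & c->bottom_mask) | ((pfn & c->top_mask) >> c->hole_shift);
}

/* Extra per-arch step: also subtract the index of the first RAM page. */
static uint64_t offset_pfn_to_idx(const struct mask_comp *c, uint64_t pfn,
                                  uint64_t base_pdx)
{
    uint64_t pdx = mask_pfn_to_pdx(c, pfn);

    assert(pdx >= base_pdx);
    return pdx - base_pdx;
}

/*
 * Hypothetical layout: RAM at PFNs [0x80000, 0x100000) and
 * [0x880000, 0x900000).  PFN bits 20-22 are zero in both banks, so they
 * form the removable hole; bit 19 is set in every page and cannot be
 * removed by the mask.
 */
static const struct mask_comp example = {
    .hole_shift = 3,
    .bottom_mask = 0xfffffULL,
    .top_mask = ~0x7fffffULL,
};
```

In this model the mask step alone maps the first RAM page to 0x80000 and the
last one to 0x1fffff. Only the subtraction of the base brings the first RAM
page down to index 0.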

> 
> FWIW, PDX offset translation should already compress any gaps from 0
> to the first RAM range, and hence this won't be needed (in fact it
> would just make ARM translations slower by doing an extra unneeded
> operation).  My recommendation would be to move this initial offset
> compression inside the PDX mask translation.
> 
>> For the first issue, we need to know max_page (at
>> the moment we calculate it in setup_mm() at the very end but we could do it in
>> init_pdx() to know it ahead of setting direct map) and PDX offset (on x86 there
>> is no offset). I also think that on Arm we should just panic if direct map is
>> too small.
> 
> Hm, that's up to the ARM folks, but my opinion is that you should
> simply ignore memory above the threshold.  Panicking should IMO be a
> last resort option when there's no way to workaround the issue.

This follows the existing pattern within the Arm port. We want to fail
early with a clear error rather than booting a half-broken system.

Cheers,

-- 
Julien Grall




* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  8:26               ` Roger Pau Monné
  2025-07-02  8:49                 ` Julien Grall
@ 2025-07-02  8:54                 ` Orzel, Michal
  2025-07-02  9:45                   ` Roger Pau Monné
  2025-07-03  0:19                   ` Stefano Stabellini
  1 sibling, 2 replies; 55+ messages in thread
From: Orzel, Michal @ 2025-07-02  8:54 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Jan Beulich, xen-devel, Andrew Cooper,
	Anthony PERARD, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager



On 02/07/2025 10:26, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 09:52:45AM +0200, Orzel, Michal wrote:
>> We have a few issues on Arm. First, we don't check whether direct map is big
>> enough provided max_pdx that we don't set at all. Second, we don't really use
>> PDX grouping (can be also used without compression). My patch (that Stefano
>> attached previously) fixes the second issue (Allejandro will take it over to
>> come up with common solution).
> 
> You probably can handle those as different issues, as PDX grouping is
> completely disjoint from PDX compression.  It might be helpful if
> we could split the PDX grouping into a separate file from the PDX
> compression.
> 
> One weirdness I've noticed with ARM is the addition of start offsets
> to the existing PDX compression, by using directmap_base_pdx,
> directmap_mfn_start, directmap_base_pdx &c.  I'm not sure whether this will
> interfere with the PDX compression, but it looks like a bodge.  This
> should be part of the generic PDX compression implementation, not an
> extra added on a per-arch basis.
> 
> FWIW, PDX offset translation should already compress any gaps from 0
> to the first RAM range, and hence this won't be needed (in fact it
> would just make ARM translations slower by doing an extra unneeded
> operation).  My recommendation would be to move this initial offset
> compression inside the PDX mask translation.
> 
>> For the first issue, we need to know max_page (at
>> the moment we calculate it in setup_mm() at the very end but we could do it in
>> init_pdx() to know it ahead of setting direct map) and PDX offset (on x86 there
>> is no offset). I also think that on Arm we should just panic if direct map is
>> too small.
> 
> Hm, that's up to the ARM folks, but my opinion is that you should
> simply ignore memory above the threshold.  Panicking should IMO be a
> last resort option when there's no way to workaround the issue.
On Arm we usually handle user errors and suspicious behavior as panics, as
opposed to x86, which is more liberal in that regard. We want to fail as soon
as possible.

> 
>> The issue can be reproduced by disabling PDX compression, so not only with
>> Roger's patch.
>>
>> @Julien, I'm thinking of something like this:
>>
>> diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
>> index 4d22f35618aa..e6d9b49acd3c 100644
>> --- a/xen/arch/arm/arm32/mmu/mm.c
>> +++ b/xen/arch/arm/arm32/mmu/mm.c
>> @@ -190,7 +190,6 @@ void __init setup_mm(void)
>>
>>      /* Frame table covers all of RAM region, including holes */
>>      setup_frametable_mappings(ram_start, ram_end);
>> -    max_page = PFN_DOWN(ram_end);
>>
>>      /*
>>       * The allocators may need to use map_domain_page() (such as for
>> diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
>> index a0a2dd8cc762..3e64be6ae664 100644
>> --- a/xen/arch/arm/arm64/mmu/mm.c
>> +++ b/xen/arch/arm/arm64/mmu/mm.c
>> @@ -224,6 +224,9 @@ static void __init setup_directmap_mappings(unsigned long base_mfn,
>>           */
>>          directmap_virt_start = DIRECTMAP_VIRT_START +
>>              (base_mfn - mfn_gb) * PAGE_SIZE;
>> +
>> +        if ( (max_pdx - directmap_base_pdx) > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
>> +            panic("Direct map is too small\n");
> 
> As said above - I would avoid propagating the usage of those offsets
> into generic memory management code, it's usage should be confined
> inside the translation functions.
directmap_base_pdx is set a few lines above, so I would not call it propagation.

> 
> Here you probably want to use maddr_to_virt() or similar.
I can't because maddr_to_virt() has the ASSERT with similar check.
> 
> You can maybe pickup:
> 
> https://lore.kernel.org/xen-devel/20250611171636.5674-3-roger.pau@citrix.com/
> 
> And attempt to hook it into ARM?
As said above, we have different ways to approach setting max_pdx. On Arm we
want to panic, on x86 you want to limit the max_pdx.

> 
> I don't think it would that difficult to reduce the consumption of
> memory map ranges to what Xen can handle.
> 
> Thanks, Roger.

The diff I sent fixes the issue for the direct map. We can take it now if we
want to solve the issue. If we instead want to wait for frametable fixes (wrt
grouping) and possible PDX changes (making offsets common) to be done first, I
can simply park this patch.

~Michal




* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  8:54                 ` Orzel, Michal
@ 2025-07-02  9:45                   ` Roger Pau Monné
  2025-07-03  0:22                     ` Stefano Stabellini
  2025-07-03  0:19                   ` Stefano Stabellini
  1 sibling, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-02  9:45 UTC (permalink / raw)
  To: Orzel, Michal
  Cc: Stefano Stabellini, Jan Beulich, xen-devel, Andrew Cooper,
	Anthony PERARD, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

On Wed, Jul 02, 2025 at 10:54:24AM +0200, Orzel, Michal wrote:
> 
> 
> On 02/07/2025 10:26, Roger Pau Monné wrote:
> > On Wed, Jul 02, 2025 at 09:52:45AM +0200, Orzel, Michal wrote:
> >> We have a few issues on Arm. First, we don't check whether direct map is big
> >> enough provided max_pdx that we don't set at all. Second, we don't really use
> >> PDX grouping (can be also used without compression). My patch (that Stefano
> >> attached previously) fixes the second issue (Allejandro will take it over to
> >> come up with common solution).
> > 
> > You probably can handle those as different issues, as PDX grouping is
> > completely disjoint from PDX compression.  It might be helpful if
> > we could split the PDX grouping into a separate file from the PDX
> > compression.
> > 
> > One weirdness I've noticed with ARM is the addition of start offsets
> > to the existing PDX compression, by using directmap_base_pdx,
> > directmap_mfn_start, directmap_base_pdx &c.  I'm not sure whether this will
> > interfere with the PDX compression, but it looks like a bodge.  This
> > should be part of the generic PDX compression implementation, not an
> > extra added on a per-arch basis.
> > 
> > FWIW, PDX offset translation should already compress any gaps from 0
> > to the first RAM range, and hence this won't be needed (in fact it
> > would just make ARM translations slower by doing an extra unneeded
> > operation).  My recommendation would be to move this initial offset
> > compression inside the PDX mask translation.
> > 
> >> For the first issue, we need to know max_page (at
> >> the moment we calculate it in setup_mm() at the very end but we could do it in
> >> init_pdx() to know it ahead of setting direct map) and PDX offset (on x86 there
> >> is no offset). I also think that on Arm we should just panic if direct map is
> >> too small.
> > 
> > Hm, that's up to the ARM folks, but my opinion is that you should
> > simply ignore memory above the threshold.  Panicking should IMO be a
> > last resort option when there's no way to workaround the issue.
> On Arm we handle user errors and suspicious behavior usually as panics as oppose
> to x86 which is more liberal in that regard. We want to fail as soon as possible.
> 
> > 
> >> The issue can be reproduced by disabling PDX compression, so not only with
> >> Roger's patch.
> >>
> >> @Julien, I'm thinking of something like this:
> >>
> >> diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
> >> index 4d22f35618aa..e6d9b49acd3c 100644
> >> --- a/xen/arch/arm/arm32/mmu/mm.c
> >> +++ b/xen/arch/arm/arm32/mmu/mm.c
> >> @@ -190,7 +190,6 @@ void __init setup_mm(void)
> >>
> >>      /* Frame table covers all of RAM region, including holes */
> >>      setup_frametable_mappings(ram_start, ram_end);
> >> -    max_page = PFN_DOWN(ram_end);
> >>
> >>      /*
> >>       * The allocators may need to use map_domain_page() (such as for
> >> diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
> >> index a0a2dd8cc762..3e64be6ae664 100644
> >> --- a/xen/arch/arm/arm64/mmu/mm.c
> >> +++ b/xen/arch/arm/arm64/mmu/mm.c
> >> @@ -224,6 +224,9 @@ static void __init setup_directmap_mappings(unsigned long
> >> base_mfn,
> >>           */
> >>          directmap_virt_start = DIRECTMAP_VIRT_START +
> >>              (base_mfn - mfn_gb) * PAGE_SIZE;
> >> +
> >> +        if ( (max_pdx - directmap_base_pdx) > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
> >> +            panic("Direct map is too small\n");
> > 
> > As said above - I would avoid propagating the usage of those offsets
> > into generic memory management code; its usage should be confined
> > inside the translation functions.
> directmap_base_pdx is set a few lines above, so I would not call it propagation.
> 
> > 
> > Here you probably want to use maddr_to_virt() or similar.
> I can't because maddr_to_virt() has the ASSERT with similar check.
> > 
> > You can maybe pickup:
> > 
> > https://lore.kernel.org/xen-devel/20250611171636.5674-3-roger.pau@citrix.com/
> > 
> > And attempt to hook it into ARM?
> As said above, we have different ways to approach setting max_pdx. On Arm we
> want to panic, on x86 you want to limit the max_pdx.
> 
> > 
> > I don't think it would be that difficult to reduce the consumption of
> > memory map ranges to what Xen can handle.
> > 
> > Thanks, Roger.
> 
> The diff I sent fixes the issue for direct map now. We can take it now if we
> want to solve the issue. If we instead want to wait for frametable fixes (wrt
> grouping) and possible PDX changes (making offsets common) to be done first, I
> can simply park this patch.

No please, don't park it just because of my opinions.  I think Julien
is OK with it, so don't hold back because of my x86 based opinion on
how to handle errors.

Regards, Roger.



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  8:54                 ` Orzel, Michal
  2025-07-02  9:45                   ` Roger Pau Monné
@ 2025-07-03  0:19                   ` Stefano Stabellini
  1 sibling, 0 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-07-03  0:19 UTC (permalink / raw)
  To: Orzel, Michal
  Cc: Roger Pau Monné, Stefano Stabellini, Jan Beulich, xen-devel,
	Andrew Cooper, Anthony PERARD, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager

On Wed, 2 Jul 2025, Orzel, Michal wrote:
> > Hm, that's up to the ARM folks, but my opinion is that you should
> > simply ignore memory above the threshold.  Panicking should IMO be a
> > last resort option when there's no way to workaround the issue.
> On Arm we usually handle user errors and suspicious behavior as panics, as opposed
> to x86 which is more liberal in that regard. We want to fail as soon as possible.

If we think about it, this is natural because Xen on ARM was mostly
aimed at embedded developers configuring an embedded system. Embedded
developers might not be Xen experts but they are typically engineers.
These people would definitely want to know if part of the memory was
ignored, and might be able to write a fix.

On the other hand Xen on x86 was aimed at non-expert users -- people
apt-get'ing Xen on a Debian system. These people wouldn't know how to
read a panic so we would certainly want to boot anyway even with only
partial resources.

This has worked well so far, but now we are getting x86 in embedded and
ARM on servers, so I think we should discuss and agree on a common
pattern or a configurable pattern to handle this kind of situation.
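
To make the "configurable pattern" idea concrete, here is a minimal sketch
of a single limit check whose outcome depends on a policy value.  Everything
in it is hypothetical (the enum, the function name and the way the policy
would be selected); it is not an existing Xen interface, only an
illustration of the two behaviours discussed above.

```c
#include <stdbool.h>

/* Hypothetical policy knob; not an existing Xen option. */
enum oversize_policy {
    POLICY_PANIC,     /* Arm-style: refuse to boot on a suspicious setup */
    POLICY_TRUNCATE,  /* x86-style: ignore what cannot be handled */
};

/*
 * Hypothetical helper: check the highest page index against a limit.
 *
 * Returns true when boot may continue (with *max_pdx clamped to the limit
 * if the policy is POLICY_TRUNCATE), false when the caller should panic.
 */
static bool apply_limit(unsigned long *max_pdx, unsigned long limit,
                        enum oversize_policy policy)
{
    if ( *max_pdx <= limit )
        return true;            /* Everything fits, nothing to do. */

    if ( policy == POLICY_PANIC )
        return false;           /* Caller is expected to panic(). */

    *max_pdx = limit;           /* Drop the memory above the limit. */
    return true;
}
```

A caller would pass the limit imposed by the direct map (or any other
virtual memory restriction) and either panic or carry on with the clamped
value, depending on the return value.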


* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-02  9:45                   ` Roger Pau Monné
@ 2025-07-03  0:22                     ` Stefano Stabellini
  0 siblings, 0 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-07-03  0:22 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Orzel, Michal, Stefano Stabellini, Jan Beulich, xen-devel,
	Andrew Cooper, Anthony PERARD, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager


On Wed, 2 Jul 2025, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 10:54:24AM +0200, Orzel, Michal wrote:
> > 
> > 
> > On 02/07/2025 10:26, Roger Pau Monné wrote:
> > > On Wed, Jul 02, 2025 at 09:52:45AM +0200, Orzel, Michal wrote:
> > >>
> > >>
> > >> On 02/07/2025 09:00, Roger Pau Monné wrote:
> > >>> On Tue, Jul 01, 2025 at 01:46:19PM -0700, Stefano Stabellini wrote:
> > >>>> On Tue, 1 Jul 2025, Jan Beulich wrote:
> > >>>>> Sadly from this you omitted the output from the setup of the offsets
> > >>>>> arrays. Considering also your later reply, I'd be curious to know what
> > >>>>> mfn_to_pdx(0x50000000) is.
> > >>>>  
> > >>>> Full logs here, and debug patch in attachment.
> > >>>>
> > >>>> (XEN) Checking for initrd in /chosen
> > >>>> (XEN) RAM: 0000000000000000 - 000000007fffffff
> > >>>> (XEN) RAM: 0000000800000000 - 000000087fffffff
> > >>>> (XEN) RAM: 0000050000000000 - 000005007fffffff
> > >>>> (XEN) RAM: 0000060000000000 - 000006007fffffff
> > >>>> (XEN) RAM: 0000070000000000 - 000007007fffffff
> > >>>> (XEN) 
> > >>>> (XEN) MODULE[0]: 0000000022000000 - 0000000022172fff Xen         
> > >>>> (XEN) MODULE[1]: 0000000022200000 - 000000002220efff Device Tree 
> > >>>> (XEN) MODULE[2]: 0000000020400000 - 0000000021e2ffff Kernel      
> > >>>> (XEN)  RESVD[0]: 0000000000000000 - 0000000000ffffff
> > >>>> (XEN)  RESVD[1]: 0000000001000000 - 00000000015fffff
> > >>>> (XEN)  RESVD[2]: 0000000001600000 - 00000000017fffff
> > >>>> (XEN)  RESVD[3]: 0000000001800000 - 00000000097fffff
> > >>>> (XEN)  RESVD[4]: 0000000009800000 - 000000000bffffff
> > >>>> (XEN)  RESVD[5]: 0000000011126000 - 000000001114dfff
> > >>>> (XEN)  RESVD[6]: 000000001114e000 - 000000001214efff
> > >>>> (XEN)  RESVD[7]: 0000000017275000 - 000000001729cfff
> > >>>> (XEN)  RESVD[8]: 000000001729d000 - 000000001829dfff
> > >>>> (XEN)  RESVD[9]: 000000001a7df000 - 000000001a806fff
> > >>>> (XEN)  RESVD[10]: 000000001a807000 - 000000001b807fff
> > >>>> (XEN)  RESVD[11]: 000000001d908000 - 000000001d92ffff
> > >>>> (XEN)  RESVD[12]: 000000001d930000 - 000000001e930fff
> > >>>> (XEN)  RESVD[13]: 000000001829e000 - 000000001869dfff
> > >>>> (XEN)  RESVD[14]: 000000001869e000 - 00000000186ddfff
> > >>>> (XEN)  RESVD[15]: 0000000800000000 - 000000083fffffff
> > >>>> (XEN) 
> > >>>> (XEN) 
> > >>>> (XEN) Command line: console=dtuart dom0_mem=2048M console_timestamps=boot debug bootscrub=0 vwfi=native sched=null
> > >>>> (XEN) [00000006bfc302ec] parameter "debug" unknown!
> > >>>> (XEN) [00000006bfcc0476] DEBUG init_pdx 294 start=0 end=80000000
> > >>>> (XEN) [00000006bfcd2400] DEBUG init_pdx 294 start=800000000 end=880000000
> > >>>> (XEN) [00000006bfce29ec] DEBUG init_pdx 294 start=50000000000 end=50080000000
> > >>>> (XEN) [00000006bfcf1768] DEBUG init_pdx 294 start=60000000000 end=60080000000
> > >>>> (XEN) [00000006bfd015a4] DEBUG init_pdx 294 start=70000000000 end=70080000000
> > >>>> (XEN) [00000006bfd1444f] DEBUG setup_mm 252
> > >>>> (XEN) [00000006bfd3dc6f] DEBUG setup_mm 273 start=0 size=80000000 ram_end=80000000 directmap_base_pdx=0
> > >>>> (XEN) [00000006bfd5616e] DEBUG setup_directmap_mappings 229 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=0
> > >>>> (XEN) [00000006bfd7d38a] DEBUG setup_directmap_mappings 237 base_mfn=0 nr_mfns=80000 directmap_base_pdx=0
> > >>>> (XEN) [00000006bfd92728] DEBUG setup_mm 273 start=800000000 size=80000000 ram_end=880000000 directmap_base_pdx=0
> > >>>> (XEN) [00000006bfdaba3b] DEBUG setup_directmap_mappings 229 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=800000
> > >>>> (XEN) [00000006bfdcd79c] DEBUG setup_directmap_mappings 237 base_mfn=800000 nr_mfns=80000 directmap_base_pdx=0
> > >>>> (XEN) [00000006bfde4d82] DEBUG setup_mm 273 start=50000000000 size=80000000 ram_end=50080000000 directmap_base_pdx=0
> > >>>> (XEN) [00000006bfdfaef0] DEBUG setup_directmap_mappings 229 base_mfn=50000000 nr_mfns=80000 directmap_base_pdx=0 mfn_to_pdx=50000000
> > >>>> (XEN) [00000006bfe35249] Assertion '(mfn_to_pdx(maddr_to_mfn(ma)) - directmap_base_pdx) < (DIRECTMAP_SIZE >> PAGE_SHIFT)' failed at ./arch/arm/include/asm/mmu/mm.h:72
> > >>>
> > >>> As said on the other reply, the issue here is that with the v2 PDX
> > >>> offset compression logic your memory map is not compressible, and this
> > >>> leads to an overflow, as anything above 5TiB won't fit in the
> > >>> directmap AFAICT.  We already discussed with Jan that ARM seems to be
> > >>> missing any logic to account for the max addressable page:
> > >>>
> > >>> https://lore.kernel.org/xen-devel/9074f1a6-a605-43f4-97f3-d0a626252d3f@suse.com/
> > >>>
> > >>> x86 has setup_max_pdx() that truncates the maximum addressable MFN
> > >>> based on the active PDX compression and the virtual memory map
> > >>> restrictions.  ARM needs similar logic to account for these
> > >>> restrictions.
> > >>
> > >> We have a few issues on Arm. First, we don't check whether the direct map is
> > >> big enough for max_pdx, which we don't set at all. Second, we don't really use
> > >> PDX grouping (which can also be used without compression). My patch (that Stefano
> > >> attached previously) fixes the second issue (Alejandro will take it over to
> > >> come up with a common solution).
> > > 
> > > You probably can handle those as different issues, as PDX grouping is
> > > completely disjoint from PDX compression.  It might be helpful if
> > > we could split the PDX grouping into a separate file from the PDX
> > > compression.
> > > 
> > > One weirdness I've noticed with ARM is the addition of start offsets
> > > to the existing PDX compression, by using directmap_base_pdx,
> > > directmap_mfn_start, directmap_base_pdx &c.  I'm not sure whether this will
> > > interfere with the PDX compression, but it looks like a bodge.  This
> > > should be part of the generic PDX compression implementation, not an
> > > extra added on a per-arch basis.
> > > 
> > > FWIW, PDX offset translation should already compress any gaps from 0
> > > to the first RAM range, and hence this won't be needed (in fact it
> > > would just make ARM translations slower by doing an extra unneeded
> > > operation).  My recommendation would be to move this initial offset
> > > compression inside the PDX mask translation.
> > > 
> > >> For the first issue, we need to know max_page (at
> > >> the moment we calculate it in setup_mm() at the very end but we could do it in
> > >> init_pdx() to know it ahead of setting direct map) and PDX offset (on x86 there
> > >> is no offset). I also think that on Arm we should just panic if direct map is
> > >> too small.
> > > 
> > > Hm, that's up to the ARM folks, but my opinion is that you should
> > > simply ignore memory above the threshold.  Panicking should IMO be a
> > > last resort option when there's no way to workaround the issue.
> > On Arm we usually handle user errors and suspicious behavior as panics, as opposed
> > to x86 which is more liberal in that regard. We want to fail as soon as possible.
> > 
> > > 
> > >> The issue can be reproduced by disabling PDX compression, so not only with
> > >> Roger's patch.
> > >>
> > >> @Julien, I'm thinking of something like this:
> > >>
> > >> diff --git a/xen/arch/arm/arm32/mmu/mm.c b/xen/arch/arm/arm32/mmu/mm.c
> > >> index 4d22f35618aa..e6d9b49acd3c 100644
> > >> --- a/xen/arch/arm/arm32/mmu/mm.c
> > >> +++ b/xen/arch/arm/arm32/mmu/mm.c
> > >> @@ -190,7 +190,6 @@ void __init setup_mm(void)
> > >>
> > >>      /* Frame table covers all of RAM region, including holes */
> > >>      setup_frametable_mappings(ram_start, ram_end);
> > >> -    max_page = PFN_DOWN(ram_end);
> > >>
> > >>      /*
> > >>       * The allocators may need to use map_domain_page() (such as for
> > >> diff --git a/xen/arch/arm/arm64/mmu/mm.c b/xen/arch/arm/arm64/mmu/mm.c
> > >> index a0a2dd8cc762..3e64be6ae664 100644
> > >> --- a/xen/arch/arm/arm64/mmu/mm.c
> > >> +++ b/xen/arch/arm/arm64/mmu/mm.c
> > >> @@ -224,6 +224,9 @@ static void __init setup_directmap_mappings(unsigned long
> > >> base_mfn,
> > >>           */
> > >>          directmap_virt_start = DIRECTMAP_VIRT_START +
> > >>              (base_mfn - mfn_gb) * PAGE_SIZE;
> > >> +
> > >> +        if ( (max_pdx - directmap_base_pdx) > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
> > >> +            panic("Direct map is too small\n");
> > > 
> > > As said above - I would avoid propagating the usage of those offsets
> > > into generic memory management code; its usage should be confined
> > > inside the translation functions.
> > directmap_base_pdx is set a few lines above, so I would not call it propagation.
> > 
> > > 
> > > Here you probably want to use maddr_to_virt() or similar.
> > I can't because maddr_to_virt() has the ASSERT with similar check.
> > > 
> > > You can maybe pickup:
> > > 
> > > https://lore.kernel.org/xen-devel/20250611171636.5674-3-roger.pau@citrix.com/
> > > 
> > > And attempt to hook it into ARM?
> > As said above, we have different ways to approach setting max_pdx. On Arm we
> > want to panic, on x86 you want to limit the max_pdx.
> > 
> > > 
> > > I don't think it would be that difficult to reduce the consumption of
> > > memory map ranges to what Xen can handle.
> > > 
> > > Thanks, Roger.
> > 
> > The diff I sent fixes the issue for direct map now. We can take it now if we
> > want to solve the issue. If we instead want to wait for frametable fixes (wrt
> > grouping) and possible PDX changes (making offsets common) to be done first, I
> > can simply park this patch.
> 
> No please, don't park it just because of my opinions.  I think Julien
> is OK with it, so don't hold back because of my x86 based opinion on
> how to handle errors.

I would also rather have the small improvement now than later.
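
For reference, a minimal sketch of the arithmetic behind the check in the
quoted diff.  It assumes 4KiB pages, no PDX compression (so pdx == mfn) and
a 5TiB direct map, the figure mentioned earlier in this thread; the helper
name is hypothetical and the constants are stand-ins, not the real Xen
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT      12                    /* assumption: 4KiB pages */
#define DIRECTMAP_SIZE  (UINT64_C(5) << 40)   /* assumption: 5TiB, per the thread */

/*
 * Hypothetical helper mirroring the condition in the proposed diff:
 * does the PDX span [base_pdx, max_pdx) fit in the direct map?
 */
static bool directmap_fits(uint64_t max_pdx, uint64_t base_pdx)
{
    return (max_pdx - base_pdx) <= (DIRECTMAP_SIZE >> PAGE_SHIFT);
}
```

Under those assumptions the direct map covers 0x50000000 pages.  The first
two RAM ranges of the board end at pdx 0x880000 and fit, while the range
starting at 5TiB ends at pdx 0x50080000 and does not, which matches the
point where the assertion in the log triggers.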


* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-06-28  2:08 ` [PATCH v2 0/8] pdx: introduce a new compression algorithm Stefano Stabellini
  2025-06-30 15:02   ` Roger Pau Monné
@ 2025-07-03  8:42   ` Roger Pau Monné
  2025-07-03 18:04     ` Stefano Stabellini
  1 sibling, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-07-03  8:42 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Community Manager

On Fri, Jun 27, 2025 at 07:08:29PM -0700, Stefano Stabellini wrote:
> Hi Roger,
> 
> We have an ARM board with the following memory layout:
> 
> 0x0-0x80000000,              0,    2GB
> 0x800000000-0x880000000,     32GB, 2GB
> 0x50000000000-0x50080000000, 5TB,  2GB
> 0x60000000000-0x60080000000, 6TB,  2GB
> 0x70000000000-0x70080000000, 7TB,  2GB

I would like to add this memory map to the PDX unit testing, do you
have a name I could use as a reference?  For example for the Intel
sparse map I'm using: "Real memory map from a 4s Intel GNR.".  I
currently have yours listed as: "Stefano's ARM board.", but that's not
a very descriptive name :).
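
As an illustration only, this is roughly how such a memory map could be
written down as a test vector.  The struct, the array name and the helper
are hypothetical and do not reflect the layout used by the actual test
harness; only the address ranges come from the report above.

```c
#include <stdint.h>

/* Hypothetical test-vector layout; the real harness may differ. */
struct ram_range {
    uint64_t start, end;   /* Physical addresses, end exclusive. */
};

/* Memory map reported in this thread: five 2GiB ranges. */
static const struct ram_range board_map[] = {
    { 0x00000000000ULL, 0x00080000000ULL },
    { 0x00800000000ULL, 0x00880000000ULL },
    { 0x50000000000ULL, 0x50080000000ULL },
    { 0x60000000000ULL, 0x60080000000ULL },
    { 0x70000000000ULL, 0x70080000000ULL },
};

/* Sum the RAM actually present in a map, ignoring the holes. */
static uint64_t total_ram(const struct ram_range *map, unsigned int nr)
{
    uint64_t sum = 0;

    for ( unsigned int i = 0; i < nr; i++ )
        sum += map[i].end - map[i].start;

    return sum;
}
```

The map holds 10GiB of RAM spread over an address span that ends just
above 7TiB, which is what makes it an interesting case for compression.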

Thanks, Roger.



* Re: [PATCH v2 0/8] pdx: introduce a new compression algorithm
  2025-07-03  8:42   ` Roger Pau Monné
@ 2025-07-03 18:04     ` Stefano Stabellini
  0 siblings, 0 replies; 55+ messages in thread
From: Stefano Stabellini @ 2025-07-03 18:04 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, xen-devel, Jan Beulich, Andrew Cooper,
	Anthony PERARD, Michal Orzel, Julien Grall, Bertrand Marquis,
	Volodymyr Babchuk, Shawn Anastasio, Alistair Francis,
	Bob Eshleman, Connor Davis, Oleksii Kurochko, Community Manager


On Thu, 3 Jul 2025, Roger Pau Monné wrote:
> On Fri, Jun 27, 2025 at 07:08:29PM -0700, Stefano Stabellini wrote:
> > Hi Roger,
> > 
> > We have an ARM board with the following memory layout:
> > 
> > 0x0-0x80000000,              0,    2GB
> > 0x800000000-0x880000000,     32GB, 2GB
> > 0x50000000000-0x50080000000, 5TB,  2GB
> > 0x60000000000-0x60080000000, 6TB,  2GB
> > 0x70000000000-0x70080000000, 7TB,  2GB
> 
> I would like to add this memory map to the PDX unit testing, do you
> have a name I could use as a reference?  For example for the Intel
> sparse map I'm using: "Real memory map from a 4s Intel GNR.".  I
> currently have yours listed as: "Stefano's ARM board.", but that's not
> a very descriptive name :).

The name of the board is AMD "Versal Gen 2"


end of thread, other threads:[~2025-07-03 18:05 UTC | newest]

Thread overview: 55+ messages
2025-06-20 11:11 [PATCH v2 0/8] pdx: introduce a new compression algorithm Roger Pau Monne
2025-06-20 11:11 ` [PATCH v2 1/8] x86/pdx: simplify calculation of domain struct allocation boundary Roger Pau Monne
2025-06-24 13:05   ` Jan Beulich
2025-06-25 15:14     ` Roger Pau Monné
2025-06-20 11:11 ` [PATCH v2 2/8] kconfig: turn PDX compression into a choice Roger Pau Monne
2025-06-24 13:13   ` Jan Beulich
2025-06-26  7:49     ` Roger Pau Monné
2025-06-26 12:33       ` Jan Beulich
2025-06-20 11:11 ` [PATCH v2 4/8] pdx: introduce command line compression toggle Roger Pau Monne
2025-06-24 13:40   ` Jan Beulich
2025-06-25 15:46     ` Roger Pau Monné
2025-06-25 16:00       ` Jan Beulich
2025-06-25 17:45         ` Roger Pau Monné
2025-06-26  6:17           ` Jan Beulich
2025-06-20 11:11 ` [PATCH v2 5/8] pdx: allow per-arch optimization of PDX conversion helpers Roger Pau Monne
2025-06-24 13:51   ` Jan Beulich
2025-06-25 15:51     ` Roger Pau Monné
2025-06-25 16:04       ` Jan Beulich
2025-06-20 11:11 ` [PATCH v2 6/8] test/pdx: add PDX compression unit tests Roger Pau Monne
2025-06-24 13:37   ` Anthony PERARD
2025-06-25 15:55     ` Roger Pau Monné
2025-06-20 11:11 ` [PATCH v2 7/8] pdx: move some helpers in preparation for new compression Roger Pau Monne
2025-06-24 13:52   ` Jan Beulich
2025-06-20 11:11 ` [PATCH v2 8/8] pdx: introduce a new compression algorithm based on region offsets Roger Pau Monne
2025-06-24 16:16   ` Jan Beulich
2025-06-25 16:24     ` Roger Pau Monné
2025-06-26  7:35       ` Jan Beulich
2025-06-27 14:51         ` Roger Pau Monné
2025-06-29 14:36           ` Jan Beulich
2025-07-01  7:26             ` Roger Pau Monné
2025-06-30  6:34   ` Jan Beulich
2025-07-01 15:49     ` Roger Pau Monné
2025-07-01 16:01       ` Jan Beulich
     [not found] ` <20250620111130.29057-4-roger.pau@citrix.com>
2025-06-24 13:32   ` [PATCH v2 3/8] pdx: provide a unified set of unit functions Jan Beulich
2025-06-25 15:32     ` Roger Pau Monné
2025-06-28  2:08 ` [PATCH v2 0/8] pdx: introduce a new compression algorithm Stefano Stabellini
2025-06-30 15:02   ` Roger Pau Monné
2025-07-01  1:50     ` Stefano Stabellini
2025-07-01  3:33       ` Stefano Stabellini
2025-07-01  6:05       ` Jan Beulich
2025-07-01 20:46         ` Stefano Stabellini
2025-07-02  6:08           ` Jan Beulich
2025-07-02  6:32           ` Jan Beulich
2025-07-02  6:53             ` Roger Pau Monné
2025-07-02  7:00           ` Roger Pau Monné
2025-07-02  7:52             ` Orzel, Michal
2025-07-02  8:26               ` Roger Pau Monné
2025-07-02  8:49                 ` Julien Grall
2025-07-02  8:54                 ` Orzel, Michal
2025-07-02  9:45                   ` Roger Pau Monné
2025-07-03  0:22                     ` Stefano Stabellini
2025-07-03  0:19                   ` Stefano Stabellini
2025-07-02  8:45               ` Julien Grall
2025-07-03  8:42   ` Roger Pau Monné
2025-07-03 18:04     ` Stefano Stabellini
