All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/7] NUMA: Add per-node domain-memory claims
@ 2025-09-07 16:15 Bernhard Kaindl
  2025-09-07 16:15 ` [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables Bernhard Kaindl
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Stefano Stabellini,
	Julien Grall, Bertrand Marquis, Michal Orzel, Volodymyr Babchuk,
	Andrew Cooper, Anthony PERARD, Jan Beulich, Roger Pau Monné,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Tamas K Lengyel, Daniel P. Smith, Juergen Gross,
	Christian Lindig, David Scott

XEN_DOMCTL_claim_memory - New Hypercall to claim memory for a domain
to improve NUMA awareness when allocating its system memory.

In tests with AMD Genoa, we achived 22% higer VM density compared
to spreading memory across all NUMA nodes for the same Speedometer
web application benchmark score, so this can enable significant
savings for server hosting (more details below).

The author of v1 is Alejandro Vallejo (he moved to AMD since).
Six months have passed, and the last review comment that I found
for it was 2 months ago.

General introduction:
---------------------

Xen supports claiming an amount of memory for a domain ahead of
allocating it to ensure that it is available for allocation.

On NUMA hosts, the same assurance is needed on a per-NUMA-node basis
to ensure optimal placement of domain memory on the correct NUMA node:

Performance test results:
-------------------------

Using "bootstorm" tests, when large VMs are booted in parallel.
Unless carefully planned, memory may be allocated on remote NUMA nodes.
It increases the memory latency experienced by applications and
degrades their performance.

NUMA claims allow for ensuring that all memory for a domain can be
allocated on the claimed NUMA node. We achieved a 15% improvement
in Speedometer performance tests and a 22% increase in VMs on AMD
Genoa while maintaining the same Speedometer score compared to
spreading the system memory of the domains across all NUMA nodes.

One out of 5 to 7 servers is not needed and could serve extra capacity.
Server and server room upgrades can be delayed, and money paid
for hosting and/or running servers can be saved.

Principle of operation:
-----------------------

Besides the NUMA node claim, host-wide exist already
and are implemented in libxl and libxenguest as well:

1. Call domain_create(); the claim is associated with this domain only.

2. Claim the needed amount of memory

   domain_set_outstanding_pages():

   - Sets d->outstanding_claims to the claimed memory
     (and with this series, also sets d->claim_node to the node)

   - Adds the new claim to per_node(outstanding_claims), with this series
   - Adds the new claim to the host-wide outstanding_claims
  
   - This prevents get_free_buddy() from allocating from NUMA nodes.
     When the amount of unclaimed memory is lower than the given request
     unless the memory is allocated for a domain with sufficient claim

3. Allocate for the domain

   alloc_heap_pages() and get_free_buddy():

   - If d->outstanding_claims is sufficient for the allocation
     (and with this series, d->claim_node matches the node the alloc from).
     Then, the allocation may continue on the node.

     domain_adjust_tot_pages() consumes part of the allocated amount:

     - Reduces d->outstanding_claims
     - Reduces per_node(outstanding_claims), with this series
     - Reduces the host-wide "outstanding_claims" variable
  
4. Cancel a possible leftover claim
5. Finish building the domain and unpause it to let it boot

We will implement multi-node claims as well, and I updated the design
to be more flexible to prepare for multi-node claims. This new hypercall
API supports multi-node claims, but the internal changes needed are
beyond what is feasible for this implementation to introduce node claims.

Overview the changes since v1:
------------------------------

Following the review's suggestion, patches should be consolidated
by the functionality they implement and not split into preparatory
changes without any function.

I agree with this change:

It makes the progression of the patches more logical to follow
as each patch serves a tangible purpose. Yes, this makes comparing
previous review comments more difficult, but the benefit of a more
consolidated series outweighs that of course.

I used Patchew (links below) to find any review comments as as some
comments were only posted 2 months ago, while the series was posted
6 months ago.

Having undergone this refactor, it may be more appropriate
to consider this submission for warranting fresh review.

More details on the changes in commits:
---------------------------------------

- #1 is new: Implemented the suggestion from review for per_node()
- #2 was new as v2#1 (moved it as here as #1 is more important)
- #3 has only minor adjustments from review and do use per_node()
- #4, has many changes and expanded comments to answer
      and explain questions that were raised while reviewing it.
      A small hunk from it was moved to #6, as it forms the basis
      of the rewritten 6/7.
- #6 was refactored with new code from v2 to fix an issue.
- #7 is unchanged after adding it in v2 as the new hypercall.

Where the old code moved:

- v1#1 is removed as the review said to remove it.
  (The #define was moved to where it is used)
- v1#2 is merged into #4 to consolidate the patches for the same code.
- v1#3 is split into #4 and #5 as per the review suggestion to move code.
- v1#4 received the parts of #5 related to staking NUMA claims.
- v1#5 was split into #3 and #4 and got the changes for adjust_tot_pages()
- v1#6 was refactored with code to fix an issue to protect the claims
- v1#7 is removed as setting the d->node_affinity
  caused Xen panics due to a locking issue (diagnosed by Roger).
  Setting d->node_affinity does not claim pages that should not have been included in the submitted series.
- v1#8 is removed as I switched to the new hypercall requested by Roger.
- v1#9-11 are removed for the same reason:

  For NUMA-node claims, we no longer pass a single NUMA node
  when we want to consume the claimed memory. Instead,
  d->node_affinity mask is already used when allocating
  by get_free_buddy(). Likewise, there is also no further
  use for claim_on_node in xl.cfg

I hope that this gives a good overview of the changes.
These are the Patchew links I used to check for review comments:

v1: https://patchew.org/Xen/20250314172502.53498-1-alejandro.vallejo@cloud.com/
v2: https://patchew.org/Xen/cover.1755341947.git.bernhard.kaindl@cloud.com/

Personal message:
-----------------

As I haven't posted any "hello" message yet, I think it is necessary that
I also write about myself: I worked on the Linux kernel and other things
like the SLES for S/390 and zSeries (IBM mainframe) for S.u.S.E.
Afterwards, I ported Linux (including the kernel and bootloaders) to a tested,
certified and assessed safety infrastructure that ensures your safety when
travelling by rail on tracks with track-side infrastructure built by one of
the two largest rail infrastructure companies worldwide.


Bernhard Kaindl (7):
  xen/numa: Add per_node() variables paralleling per_cpu() variables
  xen/page_alloc: Simplify domain_adjust_tot_pages() further
  xen/page_alloc: Add and track per_node(avail_pages)
  xen/page_alloc: Add staking a NUMA node claim for a domain
  xen/page_alloc: Pass node to adjust_tot_pages and check it
  xen/page_alloc: Protect claimed memory against other allocations
  xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory

 tools/flask/policy/modules/dom0.te  |   1 +
 tools/flask/policy/modules/xen.if   |   1 +
 tools/include/xenctrl.h             |   4 +
 tools/libs/ctrl/xc_domain.c         |  42 +++++++
 tools/ocaml/libs/xc/xenctrl.ml      |   9 ++
 tools/ocaml/libs/xc/xenctrl.mli     |   9 ++
 tools/ocaml/libs/xc/xenctrl_stubs.c |  21 ++++
 xen/arch/arm/xen.lds.S              |   1 +
 xen/arch/ppc/xen.lds.S              |   1 +
 xen/arch/riscv/xen.lds.S            |   1 +
 xen/arch/x86/mm.c                   |   3 +-
 xen/arch/x86/mm/mem_sharing.c       |   4 +-
 xen/arch/x86/xen.lds.S              |   1 +
 xen/common/domain.c                 |  31 +++++-
 xen/common/domctl.c                 |   8 ++
 xen/common/grant_table.c            |   4 +-
 xen/common/memory.c                 |  18 ++-
 xen/common/numa.c                   |  53 ++++++++-
 xen/common/page_alloc.c             | 163 ++++++++++++++++++++--------
 xen/include/public/domctl.h         |  17 +++
 xen/include/xen/domain.h            |   2 +
 xen/include/xen/mm.h                |   5 +-
 xen/include/xen/numa.h              |  15 +++
 xen/include/xen/sched.h             |   1 +
 xen/include/xen/xen.lds.h           |   8 ++
 xen/xsm/flask/hooks.c               |   3 +
 xen/xsm/flask/policy/access_vectors |   2 +
 27 files changed, 370 insertions(+), 58 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-09-08 14:11   ` Jan Beulich
  2025-09-07 16:15 ` [PATCH v3 2/7] xen/page_alloc: Simplify domain_adjust_tot_pages() further Bernhard Kaindl
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Stefano Stabellini,
	Julien Grall, Bertrand Marquis, Michal Orzel, Volodymyr Babchuk,
	Andrew Cooper, Anthony PERARD, Roger Pau Monné,
	Shawn Anastasio, Alistair Francis, Bob Eshleman, Connor Davis,
	Oleksii Kurochko, Jan Beulich

During the review of the 3rd commit of the NUMA claims v1 series, it
was found to be concerning (performance-wise) add add another array
like this that randomly written from all nodes:

+/* Per-node counts of free pages */
+static unsigned long pernode_avail_pages[MAX_NUMNODES];

As solution, it was suggested to introduce per_node() paralleling
per_cpu(), or (less desirable) to make sure one particular cache
line would only ever be written from a single node.

It was mentioned that node_need_scrub[] could/should use it, and
I assume others may benefit too.

per_cpu() is a simple standard blueprint that is easy to copy, add
per_node(), paralleling per_cpu() as the preferred suggestion:

It is entirely derived from per_cpu(), with a few differences:

- No add/remove callback: Nodes are onlined on boot and never offlined.

- As per_node(avail_pages) and pernode(outstanding_claims) are used by
  the buddy allocator itself, and the buddy allocator is used to alloc
  the per_node() memory from the local NUMA node, there is a catch:

  per_node() must already be working to have a working buddy allocator:

  - Init per_node() before the buddy allocator is ready as it needs
    to be setup before its use, e.g. to init per_node(avail_pages)!

  Use an early static __initdata array during early boot and migrate
  it to the NUMA-node-local xenheap before we enable the secondary CPUs.

Cc: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>

---
Changes:
- This is patch is new in v3 to resolve the the suggestion from the review.
- The previous patch #2 is removed from the series as not required,
  which is best visualized by how claims are used:

  - Claim needed memory
  - Allocate all domain memory
  - Cancel a possible leftover claim
  - Finish building the domain and unpause it.

  As it makes no sense to repeat "Claim needed memory" at any time,
  the change made had no practical significance.  It can be applied
  later as a tiny, not important cleanup, e.g. with multi-node claims.

Implementation note on this patch (not needed for the commit message):

Instead of the __initdata array, I tried to alloc bootmem, but it
caused paging_init() to panic with not enough memory for p2m on a
very large 4-Socket, 480 pCPU, 4TiB RAM host (or it caused boot to
hang after the microcode updates of the 480 pCPUs)

The static __initdata array is freed after init and does not affect
bootmem allocation.

PPS: Yes, node_need_scrub[] should use it too, do it after this series.
---
 xen/arch/arm/xen.lds.S    |  1 +
 xen/arch/ppc/xen.lds.S    |  1 +
 xen/arch/riscv/xen.lds.S  |  1 +
 xen/arch/x86/xen.lds.S    |  1 +
 xen/common/numa.c         | 53 ++++++++++++++++++++++++++++++++++++++-
 xen/include/xen/numa.h    | 15 +++++++++++
 xen/include/xen/xen.lds.h |  8 ++++++
 7 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index db17ff1efa..d296a95dd3 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -176,6 +176,7 @@ SECTIONS
        *(.bss.stack_aligned)
        *(.bss.page_aligned)
        PERCPU_BSS
+       PERNODE_BSS
        *(.bss .bss.*)
        . = ALIGN(POINTER_ALIGN);
        __bss_end = .;
diff --git a/xen/arch/ppc/xen.lds.S b/xen/arch/ppc/xen.lds.S
index 1de0b77fc6..29d1b5da58 100644
--- a/xen/arch/ppc/xen.lds.S
+++ b/xen/arch/ppc/xen.lds.S
@@ -151,6 +151,7 @@ SECTIONS
         *(.bss.stack_aligned)
         *(.bss.page_aligned)
         PERCPU_BSS
+        PERNODE_BSS
         *(.bss .bss.*)
         . = ALIGN(POINTER_ALIGN);
         __bss_end = .;
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index edcadff90b..e154427353 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -146,6 +146,7 @@ SECTIONS
         *(.bss.stack_aligned)
         *(.bss.page_aligned)
         PERCPU_BSS
+        PERNODE_BSS
         *(.sbss .sbss.* .bss .bss.*)
         . = ALIGN(POINTER_ALIGN);
         __bss_end = .;
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 966e514f20..95040cd516 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -327,6 +327,7 @@ SECTIONS
        __bss_start = .;
        *(.bss.page_aligned*)
        PERCPU_BSS
+       PERNODE_BSS
        *(.bss .bss.*)
        . = ALIGN(POINTER_ALIGN);
        __bss_end = .;
diff --git a/xen/common/numa.c b/xen/common/numa.c
index ad75955a16..5e66471159 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -320,6 +320,51 @@ static bool __init nodes_cover_memory(void)
     return true;
 }
 
+/* Defined on the BSS in xen.lds.S, used for area sizes and relative offsets */
+extern const char __pernode_start[];
+extern const char __pernode_end[];
+
+unsigned long __read_mostly __pernode_offset[MAX_NUMNODES];
+
+#define EARLY_PERNODE_AREA_SIZE (SMP_CACHE_BYTES)
+
+static char early_pernode_area[MAX_NUMNODES][EARLY_PERNODE_AREA_SIZE]
+    __initdata __cacheline_aligned;
+
+/* per_node() needs to be ready before the first alloc call using the heap */
+static void __init early_init_pernode_areas(void)
+{
+    unsigned int node;
+
+    if (__pernode_end - __pernode_start > EARLY_PERNODE_AREA_SIZE)
+        panic("per_node() area too small, increase EARLY_PERNODE_AREA_SIZE");
+
+    for_each_online_node(node)
+        __pernode_offset[node] = early_pernode_area[node] - __pernode_start;
+}
+
+/* Before going SMP, migrate the per_node memory areas to their NUMA nodes */
+static int __init init_pernode_areas(void)
+{
+    const int pernode_size = __pernode_end - __pernode_start;  /* size in BSS */
+    unsigned int node;
+
+    for_each_online_node(node)
+    {
+        char *p = alloc_xenheap_pages(get_order_from_bytes(pernode_size),
+                                      MEMF_node(node));
+
+        if ( !p )
+            return -ENOMEM;
+        /* migrate the pernode data from the bootmem area to the xenheap */
+        memcpy(p, early_pernode_area[node], SMP_CACHE_BYTES);
+        __pernode_offset[node] = p - __pernode_start;
+    }
+    return 0;
+}
+
+presmp_initcall(init_pernode_areas);
+
 /* Use discovered information to actually set up the nodes. */
 static bool __init numa_process_nodes(paddr_t start, paddr_t end)
 {
@@ -617,7 +662,7 @@ static int __init numa_emulation(unsigned long start_pfn,
 }
 #endif
 
-void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
+static void __init init_nodes(unsigned long start_pfn, unsigned long end_pfn)
 {
     unsigned int i;
     paddr_t start = pfn_to_paddr(start_pfn);
@@ -656,6 +701,12 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
     setup_node_bootmem(0, start, end);
 }
 
+void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
+{
+    init_nodes(start_pfn, end_pfn);
+    early_init_pernode_areas(); /* With all nodes registered, init per_node() */
+}
+
 void numa_add_cpu(unsigned int cpu)
 {
     cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index f6c1f27ca1..729c400d64 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -152,4 +152,19 @@ static inline nodeid_t mfn_to_nid(mfn_t mfn)
 
 #define page_to_nid(pg) mfn_to_nid(page_to_mfn(pg))
 
+/* Per NUMA node data area handling based on per-cpu data area handling. */
+extern unsigned long __pernode_offset[];
+
+#define DECLARE_PER_NODE(type, name) \
+    extern __typeof__(type) pernode__ ## name
+
+#define __DEFINE_PER_NODE(attr, type, name) \
+    attr __typeof__(type) pernode_ ## name
+
+#define DEFINE_PER_NODE(type, name) \
+    __DEFINE_PER_NODE(__section(".bss.pernode"), type, _ ## name)
+
+#define per_node(var, node)  \
+    (*RELOC_HIDE(&pernode__##var, __pernode_offset[node]))
+
 #endif /* _XEN_NUMA_H */
diff --git a/xen/include/xen/xen.lds.h b/xen/include/xen/xen.lds.h
index b126dfe887..a32423dcec 100644
--- a/xen/include/xen/xen.lds.h
+++ b/xen/include/xen/xen.lds.h
@@ -174,6 +174,14 @@
 #define LOCK_PROFILE_DATA
 #endif
 
+/* Per-node BSS for declaring per_node vars, based on per_cpu, but simpler */
+#define PERNODE_BSS                \
+       . = ALIGN(PAGE_SIZE);       \
+       __pernode_start = .;        \
+       *(.bss.pernode)             \
+       . = ALIGN(SMP_CACHE_BYTES); \
+       __pernode_end = .;          \
+
 #define PERCPU_BSS                 \
        . = ALIGN(PAGE_SIZE);       \
        __per_cpu_start = .;        \
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/7] xen/page_alloc: Simplify domain_adjust_tot_pages() further
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
  2025-09-07 16:15 ` [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-09-23 15:49   ` Jan Beulich
  2025-09-07 16:15 ` [PATCH v3 3/7] xen/page_alloc: Add and track per_node(avail_pages) Bernhard Kaindl
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, Jan Beulich

When domain memory is allocated, domain_adjust_tot_pages(),
also reduces the outstanding claim.

Replace the checks to not over-reduce the claim beyond 0
by using min() which prevents the claim to become negative
(and also not be over-conumed for the node and globally)

Cc: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>

---
Changes:
- Was added as 2/7 in v2, the review by Jan Beulich is applied.
---
 xen/common/page_alloc.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 1f67b88a89..e056624583 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -510,8 +510,15 @@ static unsigned long avail_heap_pages(
     return free_pages;
 }
 
+/*
+ * Update the total number of pages and outstanding claims of a domain.
+ * - When pages were freed, we do not increase outstanding claims.
+ * - On a domain's claims update, global outstanding_claims are updated as well.
+ */
 unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
 {
+    unsigned long adjustment;
+
     ASSERT(rspin_is_locked(&d->page_alloc_lock));
     d->tot_pages += pages;
 
@@ -519,23 +526,22 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
      * can test d->outstanding_pages race-free because it can only change
      * if d->page_alloc_lock and heap_lock are both held, see also
      * domain_set_outstanding_pages below
+     *
+     * If the domain has no outstanding claims (or we freed pages instead),
+     * we don't update outstanding claims and skip the claims adjustment.
      */
     if ( !d->outstanding_pages || pages <= 0 )
         goto out;
 
     spin_lock(&heap_lock);
     BUG_ON(outstanding_claims < d->outstanding_pages);
-    if ( d->outstanding_pages < pages )
-    {
-        /* `pages` exceeds the domain's outstanding count. Zero it out. */
-        outstanding_claims -= d->outstanding_pages;
-        d->outstanding_pages = 0;
-    }
-    else
-    {
-        outstanding_claims -= pages;
-        d->outstanding_pages -= pages;
-    }
+    /*
+     * Reduce claims by outstanding claims or pages (whichever is smaller):
+     * If allocated > outstanding, reduce the claims only by outstanding pages.
+     */
+    adjustment = min(d->outstanding_pages + 0UL, pages + 0UL);
+    d->outstanding_pages -= adjustment;
+    outstanding_claims -= adjustment;
     spin_unlock(&heap_lock);
 
 out:
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 3/7] xen/page_alloc: Add and track per_node(avail_pages)
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
  2025-09-07 16:15 ` [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables Bernhard Kaindl
  2025-09-07 16:15 ` [PATCH v3 2/7] xen/page_alloc: Simplify domain_adjust_tot_pages() further Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-09-23 15:55   ` Jan Beulich
  2025-09-07 16:15 ` [PATCH v3 4/7] xen/page_alloc: Add staking a NUMA node claim for a domain Bernhard Kaindl
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

From: Alejandro Vallejo <alejandro.garciavallejo@amd.com>

The static per-NUMA-node count of free pages is the sum of free memory
in all zones of a node. It's an optimisation to avoid doing that operation
frequently in the following patches that introduce per-NUMA-node claims.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>

---
Changes in v2:
- Added ASSERT(per_node(avail_pages, node) >= request) as requested
  during review by Roger: Comment by me: As we have

  ASSERT(avail[node][zone] >= request);

  directly before it, the request is already valid, so this checks
  that per_node(avail_pages, node) is not mis-accounted too low.

Changes in v3:
- Converted from static array to use per_node(avail_pages, node)
---
 xen/common/page_alloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index e056624583..b8acb500da 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -486,6 +486,10 @@ static unsigned long node_need_scrub[MAX_NUMNODES];
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
 
+/* Per-NUMA-node counts of free pages */
+DECLARE_PER_NODE(unsigned long, avail_pages);
+DEFINE_PER_NODE(unsigned long, avail_pages);
+
 static DEFINE_SPINLOCK(heap_lock);
 static long outstanding_claims; /* total outstanding claims by all domains */
 
@@ -1074,6 +1078,8 @@ static struct page_info *alloc_heap_pages(
 
     ASSERT(avail[node][zone] >= request);
     avail[node][zone] -= request;
+    ASSERT(per_node(avail_pages, node) >= request);
+    per_node(avail_pages, node) -= request;
     total_avail_pages -= request;
     ASSERT(total_avail_pages >= 0);
 
@@ -1234,6 +1240,8 @@ static int reserve_offlined_page(struct page_info *head)
             continue;
 
         avail[node][zone]--;
+        ASSERT(per_node(avail_pages, node) > 0);
+        per_node(avail_pages, node)--;
         total_avail_pages--;
         ASSERT(total_avail_pages >= 0);
 
@@ -1558,6 +1566,7 @@ static void free_heap_pages(
     }
 
     avail[node][zone] += 1 << order;
+    per_node(avail_pages, node) += 1 << order;
     total_avail_pages += 1 << order;
     if ( need_scrub )
     {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 4/7] xen/page_alloc: Add staking a NUMA node claim for a domain
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
                   ` (2 preceding siblings ...)
  2025-09-07 16:15 ` [PATCH v3 3/7] xen/page_alloc: Add and track per_node(avail_pages) Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-09-23 16:09   ` Jan Beulich
  2025-09-07 16:15 ` [PATCH v3 5/7] xen/page_alloc: Pass node to adjust_tot_pages and check it Bernhard Kaindl
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

Update domain_set_outstanding_pages() to domain_claim_pages() for
staking claims for domains on NUMA nodes:

domain_claim_pages() is a handler for claiming pages, where its former
name suggested that it just sets the domain's outstanding claims.

Actually, three different code locations do perform just this task:

Fix this using a helper to avoid repeating yourself (an anti-pattern)
for just only updating the domain's outstanding pages is added as well:

It removes the need to repeat the same sequence of operations at three
diffent places and helps to have a single location for adding multi-node
claims. It also makes the code much shorter and easier to follow.

Fix the meaning of the claims argument of domain_claim_pages()
for NUMA-node claims:

- For NUMA-node claims, we need to claim defined amounts of memory
  on different NUMA nodes. Previously, the argument was a "reservation"
  and the claim was made on the difference between d->tot_pages and
  the reservations. Of course, the argument needed to be > d->tot_pages.

  This interacs badly with NUMA claims:
  NUMA node claims are not related to potentially already allocated
  memory and reducing the claim by already allocated memory would
  not work in case d->tot_pages already has some amount of pages.

- Fix this by simply claiming the given amount of pages.

- Update the legacy caller of domain_claim_pages() accordingly by
  moving the reduction of the claim by d->tot_pages to it:

  No change for the users of the legacy hypercall, and a usable
  interface for staking NUMA claims.

Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>

---
Changes in v3:

- Renamed domain_set_outstanding_pages() and add check from review.
- Reorganized v3, v4 and v5 as per review to avoid non-functional
  changes:
  - Combined patch v2#2 with v2#5 into a consolidated patch.
  - Moved the unrelated changes for domain_adjust_tot_pages() to #5.
---
 xen/common/domain.c     |  2 +-
 xen/common/memory.c     | 15 ++++++-
 xen/common/page_alloc.c | 93 ++++++++++++++++++++++++++++-------------
 xen/include/xen/mm.h    |  3 +-
 xen/include/xen/sched.h |  1 +
 5 files changed, 81 insertions(+), 33 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 775c339285..6ee9f23b10 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1247,7 +1247,7 @@ int domain_kill(struct domain *d)
         rspin_barrier(&d->domain_lock);
         argo_destroy(d);
         vnuma_destroy(d->vnuma);
-        domain_set_outstanding_pages(d, 0);
+        domain_claim_pages(d, NUMA_NO_NODE, 0);
         /* fallthrough */
     case DOMDYING_dying:
         rc = domain_teardown(d);
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 3688e6dd50..3371edec11 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1682,7 +1682,20 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         rc = xsm_claim_pages(XSM_PRIV, d);
 
         if ( !rc )
-            rc = domain_set_outstanding_pages(d, reservation.nr_extents);
+        {
+            unsigned long new_claim = reservation.nr_extents;
+
+            /*
+             * For backwards compatibility, keep the meaning of nr_extents:
+             * it is the target number of pages for the domain.
+             * In case memory for the domain was allocated before, we must
+             * substract the already allocated pages from the reservation.
+             */
+            if ( new_claim )
+                new_claim -= domain_tot_pages(d);
+
+            rc = domain_claim_pages(d, NUMA_NO_NODE, new_claim);
+        }
 
         rcu_unlock_domain(d);
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index b8acb500da..bbb34994b7 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -492,6 +492,30 @@ DEFINE_PER_NODE(unsigned long, avail_pages);
 
 static DEFINE_SPINLOCK(heap_lock);
 static long outstanding_claims; /* total outstanding claims by all domains */
+DECLARE_PER_NODE(long, outstanding_claims);
+DEFINE_PER_NODE(long, outstanding_claims);
+
+#define domain_has_node_claim(d) (d->claim_node != NUMA_NO_NODE)
+
+static inline bool insufficient_memory(unsigned long request, nodeid_t node)
+{
+    return per_node(avail_pages, node) -
+           per_node(outstanding_claims, node) < request;
+}
+
+/*
+ * Adjust the claim of a domain host-wide and if set, for the claimed node
+ *
+ * All callers already hold d->page_alloc_lock and the heap_lock.
+ */
+static inline void domain_adjust_outstanding_claim(struct domain *d, long pages)
+{
+    outstanding_claims += pages;   /* Update the host-wide-outstanding claims */
+    d->outstanding_pages += pages; /* Update the domain's outstanding claims */
+
+    if ( domain_has_node_claim(d) ) /* Update the claims of that node */
+        per_node(outstanding_claims, d->claim_node) += pages;
+}
 
 static unsigned long avail_heap_pages(
     unsigned int zone_lo, unsigned int zone_hi, unsigned int node)
@@ -529,7 +553,7 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
     /*
      * can test d->outstanding_pages race-free because it can only change
      * if d->page_alloc_lock and heap_lock are both held, see also
-     * domain_set_outstanding_pages below
+     * domain_claim_pages below
      *
      * If the domain has no outstanding claims (or we freed pages instead),
      * we don't update outstanding claims and skip the claims adjustment.
@@ -544,18 +568,37 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
      * If allocated > outstanding, reduce the claims only by outstanding pages.
      */
     adjustment = min(d->outstanding_pages + 0UL, pages + 0UL);
-    d->outstanding_pages -= adjustment;
-    outstanding_claims -= adjustment;
+
+    domain_adjust_outstanding_claim(d, -adjustment);
     spin_unlock(&heap_lock);
 
 out:
     return d->tot_pages;
 }
 
-int domain_set_outstanding_pages(struct domain *d, unsigned long pages)
+/*
+ * Stake claim for memory for future allocations of a domain.
+ *
+ * The claim is an abstract stake on future memory allocations,
+ * no actual memory is allocated at this point. Instead, it guarantees
+ * that future allocations up to the claim's size will succeed.
+ *
+ * If node == NUMA_NO_NODE, the claim is host-wide.
+ * Otherwise, it is local to the specific NUMA node defined by d->claim_node.
+ *
+ * It should normally only ever be before allocating the memory of the domain.
+ * When libxenguest code has finished populating the memory of the domain, it
+ * cleans up any remaining by passing of 0 to release any outstanding claims.
+ *
+ * Returns 0 on success, -EINVAL if the request is invalid,
+ * or -ENOMEM if the claim cannot be satisfied in available memory.
+ */
+int domain_claim_pages(struct domain *d, nodeid_t node, unsigned long claim)
 {
-    int ret = -ENOMEM;
-    unsigned long claim, avail_pages;
+    int ret = -EINVAL;
+
+    if ( node != NUMA_NO_NODE && !node_online(node) )
+        goto out; /* passed node is not valid */
 
     /*
      * take the domain's page_alloc_lock, else all d->tot_page adjustments
@@ -565,45 +608,35 @@ int domain_set_outstanding_pages(struct domain *d, unsigned long pages)
     nrspin_lock(&d->page_alloc_lock);
     spin_lock(&heap_lock);
 
-    /* pages==0 means "unset" the claim. */
-    if ( pages == 0 )
+    /* claim==0 means "unset" the claim. */
+    if ( claim == 0 )
     {
-        outstanding_claims -= d->outstanding_pages;
-        d->outstanding_pages = 0;
+        domain_adjust_outstanding_claim(d, -d->outstanding_pages);
         ret = 0;
         goto out;
     }
 
     /* only one active claim per domain please */
     if ( d->outstanding_pages )
-    {
-        ret = -EINVAL;
         goto out;
-    }
 
-    /* disallow a claim not exceeding domain_tot_pages() or above max_pages */
-    if ( (pages <= domain_tot_pages(d)) || (pages > d->max_pages) )
-    {
-        ret = -EINVAL;
+    /* If we allocated for the domain already, the claim is on top of that. */
+    if ( (domain_tot_pages(d) + claim) > d->max_pages )
         goto out;
-    }
 
-    /* how much memory is available? */
-    avail_pages = total_avail_pages;
-
-    avail_pages -= outstanding_claims;
+    ret = -ENOMEM;
+    /* Check if the host-wide available memory is sufficent for this claim */
+    if ( claim > total_avail_pages - outstanding_claims )
+        goto out;
 
-    /*
-     * Note, if domain has already allocated memory before making a claim
-     * then the claim must take domain_tot_pages() into account
-     */
-    claim = pages - domain_tot_pages(d);
-    if ( claim > avail_pages )
+    /* Check if the node's available memory is insufficient for this claim */
+    if ( node != NUMA_NO_NODE && insufficient_memory(node, claim) )
         goto out;
 
     /* yay, claim fits in available memory, stake the claim, success! */
-    d->outstanding_pages = claim;
-    outstanding_claims += d->outstanding_pages;
+    d->claim_node = node;
+    domain_adjust_outstanding_claim(d, claim);
+
     ret = 0;
 
 out:
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index b968f47b87..52c12c5783 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -65,6 +65,7 @@
 #include <xen/compiler.h>
 #include <xen/mm-frame.h>
 #include <xen/mm-types.h>
+#include <xen/numa.h>
 #include <xen/types.h>
 #include <xen/list.h>
 #include <xen/spinlock.h>
@@ -131,7 +132,7 @@ int populate_pt_range(unsigned long virt, unsigned long nr_mfns);
 /* Claim handling */
 unsigned long __must_check domain_adjust_tot_pages(struct domain *d,
     long pages);
-int domain_set_outstanding_pages(struct domain *d, unsigned long pages);
+int domain_claim_pages(struct domain *d, nodeid_t node, unsigned long pages);
 void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages);
 
 /* Domain suballocator. These functions are *not* interrupt-safe.*/
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 02bdc256ce..9b91261f20 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -405,6 +405,7 @@ struct domain
     unsigned int     outstanding_pages; /* pages claimed but not possessed */
     unsigned int     max_pages;         /* maximum value for domain_tot_pages() */
     unsigned int     extra_pages;       /* pages not included in domain_tot_pages() */
+    nodeid_t         claim_node;        /* NUMA_NO_NODE for host-wide claims */
 
 #ifdef CONFIG_MEM_SHARING
     atomic_t         shr_pages;         /* shared pages */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 5/7] xen/page_alloc: Pass node to adjust_tot_pages and check it
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
                   ` (3 preceding siblings ...)
  2025-09-07 16:15 ` [PATCH v3 4/7] xen/page_alloc: Add staking a NUMA node claim for a domain Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-11-13 14:13   ` Jan Beulich
  2025-09-07 16:15 ` [PATCH v3 6/7] xen/page_alloc: Protect claimed memory against other allocations Bernhard Kaindl
  2025-09-07 16:15 ` [PATCH v3 7/7] xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory Bernhard Kaindl
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Tamas K Lengyel

domain_adjust_tot_pages() consumes remaining claims as pages
are allocated, now also from the claimed node.

Update it to skip consuming the outstanding claims when the page
was allocated from a different NUMA node.

This in itself would not be critically needed as the page should
only be allocated from a different NUMA node in case the target
node has no available memory, but for multi-node claims, we need
to reduce the outstanding claims only on the NUMA node the page
was allocated from.

For this, we need to pass the NUMA node of the allocated page,
so we can use it to perform this check (and in the future update
the claim only on the NUMA node the page was allocated from)

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>

---
- Reorganized v3, v4 and v5 as per review to avoid non-functional
  changes:
  - Split from patch v2#3 and merged the related changed from v2#5
    into a consolidated patch.
---
 xen/arch/x86/mm.c             |  3 ++-
 xen/arch/x86/mm/mem_sharing.c |  4 ++--
 xen/common/grant_table.c      |  4 ++--
 xen/common/memory.c           |  3 ++-
 xen/common/page_alloc.c       | 21 ++++++++++++++++-----
 xen/include/xen/mm.h          |  2 +-
 6 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b929d15d00..b0f654e02e 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4442,7 +4442,8 @@ int steal_page(
     page_list_del(page, &d->page_list);
 
     /* Unlink from original owner. */
-    if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
+    if ( !(memflags & MEMF_no_refcount) &&
+         !domain_adjust_tot_pages(d, NUMA_NO_NODE, -1) )
         drop_dom_ref = true;
 
     nrspin_unlock(&d->page_alloc_lock);
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 4787b27964..15b8a3a9d9 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -720,7 +720,7 @@ static int page_make_sharable(struct domain *d,
     if ( !validate_only )
     {
         page_set_owner(page, dom_cow);
-        drop_dom_ref = !domain_adjust_tot_pages(d, -1);
+        drop_dom_ref = !domain_adjust_tot_pages(d, NUMA_NO_NODE, -1);
         page_list_del(page, &d->page_list);
     }
 
@@ -766,7 +766,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
     ASSERT(page_get_owner(page) == dom_cow);
     page_set_owner(page, d);
 
-    if ( domain_adjust_tot_pages(d, 1) == 1 )
+    if ( domain_adjust_tot_pages(d, page_to_nid(page), 1) == 1 )
         get_knownalive_domain(d);
     page_list_add_tail(page, &d->page_list);
     nrspin_unlock(&d->page_alloc_lock);
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index cf131c43a1..8fea75dbb2 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -2405,7 +2405,7 @@ gnttab_transfer(
         }
 
         /* Okay, add the page to 'e'. */
-        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )
+        if ( unlikely(domain_adjust_tot_pages(e, page_to_nid(page), 1) == 1) )
             get_knownalive_domain(e);
 
         /*
@@ -2431,7 +2431,7 @@ gnttab_transfer(
              * page in the page total
              */
             nrspin_lock(&e->page_alloc_lock);
-            drop_dom_ref = !domain_adjust_tot_pages(e, -1);
+            drop_dom_ref = !domain_adjust_tot_pages(e, NUMA_NO_NODE, -1);
             nrspin_unlock(&e->page_alloc_lock);
 
             if ( okay /* i.e. e->is_dying due to the surrounding if() */ )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 3371edec11..4c54ce5ede 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -775,7 +775,8 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
 
                 nrspin_lock(&d->page_alloc_lock);
                 drop_dom_ref = (dec_count &&
-                                !domain_adjust_tot_pages(d, -dec_count));
+                                !domain_adjust_tot_pages(d, NUMA_NO_NODE,
+                                                         -dec_count));
                 nrspin_unlock(&d->page_alloc_lock);
 
                 if ( drop_dom_ref )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index bbb34994b7..ebf41a1b33 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -542,8 +542,11 @@ static unsigned long avail_heap_pages(
  * Update the total number of pages and outstanding claims of a domain.
  * - When pages were freed, we do not increase outstanding claims.
  * - On a domain's claims update, global outstanding_claims are updated as well.
+ * - If the domain's claim is on a NUMA node, we only update outstanding claims
+ *   of the domain and the node, when the allocation is from the same NUMA node.
  */
-unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
+unsigned long domain_adjust_tot_pages(struct domain *d, nodeid_t node,
+                                      long pages)
 {
     unsigned long adjustment;
 
@@ -557,8 +560,12 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
      *
      * If the domain has no outstanding claims (or we freed pages instead),
      * we don't update outstanding claims and skip the claims adjustment.
+     *
+     * Else, a page was allocated: But if the domain has a node_claim and
+     * the page was allocated from a different node, don't update claims.
      */
-    if ( !d->outstanding_pages || pages <= 0 )
+    if ( !d->outstanding_pages || pages <= 0 ||
+         (domain_has_node_claim(d) && d->claim_node != node) )
         goto out;
 
     spin_lock(&heap_lock);
@@ -2662,6 +2669,8 @@ int assign_pages(
 
     if ( !(memflags & MEMF_no_refcount) )
     {
+        nodeid_t node = page_to_nid(&pg[0]);
+
         if ( unlikely(d->tot_pages + nr < nr) )
         {
             gprintk(XENLOG_INFO,
@@ -2672,8 +2681,9 @@ int assign_pages(
             rc = -E2BIG;
             goto out;
         }
+        ASSERT(node == page_to_nid(&pg[nr - 1]));
 
-        if ( unlikely(domain_adjust_tot_pages(d, nr) == nr) )
+        if ( unlikely(domain_adjust_tot_pages(d, node, nr) == nr) )
             get_knownalive_domain(d);
     }
 
@@ -2806,7 +2816,8 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
                 }
             }
 
-            drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
+            drop_dom_ref = !domain_adjust_tot_pages(d, NUMA_NO_NODE,
+                                                    -(1 << order));
 
             rspin_unlock(&d->page_alloc_lock);
 
@@ -3012,7 +3023,7 @@ void free_domstatic_page(struct page_info *page)
 
     arch_free_heap_page(d, page);
 
-    drop_dom_ref = !domain_adjust_tot_pages(d, -1);
+    drop_dom_ref = !domain_adjust_tot_pages(d, NUMA_NO_NODE, -1);
 
     unprepare_staticmem_pages(page, 1, scrub_debug);
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 52c12c5783..5a5252fc69 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -131,7 +131,7 @@ mfn_t xen_map_to_mfn(unsigned long va);
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns);
 /* Claim handling */
 unsigned long __must_check domain_adjust_tot_pages(struct domain *d,
-    long pages);
+    nodeid_t node, long pages);
 int domain_claim_pages(struct domain *d, nodeid_t node, unsigned long pages);
 void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages);
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 6/7] xen/page_alloc: Protect claimed memory against other allocations
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
                   ` (4 preceding siblings ...)
  2025-09-07 16:15 ` [PATCH v3 5/7] xen/page_alloc: Pass node to adjust_tot_pages and check it Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-11-13 14:18   ` Jan Beulich
  2025-09-07 16:15 ` [PATCH v3 7/7] xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory Bernhard Kaindl
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

Extend get_free_buddy() to only allocate from nodes with enough
unclaimed memory left, unless the allocation is made by a domain
with sufficient claims on this node to cover the allocation.

Signed-off-by: Marcus Granado <marcus.granado@cloud.com>
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>

---
Changes in v3:

Rewritten based on a check by Marcus Granado which needs to be inside
the NUMA node loop of get_free_buddy() to only allow it to allocating
from NUMA nodes with enough unclaimed memory.

It was originally only intented for when looping over all NUMA nodes,
but the check also needs to be done when falling back to other nodes:

I updated the check to be generic: Now, it used for all requests by
integrating the check of the claim of the domain from Alejandro's
can_alloc() helper into it.

This fixes the issue that when falling back from a nodemask to allocate
from (based on MEMF_get_node(memflags) or from d->node_affinity):

When falling back to other NUMA nodes, still only allocate from nodes
with enough unclaimed memory left, unless the allocation is made by
a domain with sufficient claims on this node to cover the allocation.

This makes the can_alloc() helper function obsolete, as the needed
checks are done for the NUMA nodes as they are considered, not only
for the orignally requested NUMA node (not just before searching).
---
 xen/common/page_alloc.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index ebf41a1b33..b866076b78 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -980,9 +980,19 @@ static struct page_info *get_free_buddy(unsigned int zone_lo,
     {
         zone = zone_hi;
         do {
-            /* Check if target node can support the allocation. */
-            if ( !avail[node] || (avail[node][zone] < (1UL << order)) )
-                continue;
+            unsigned long request = 1UL << order;
+            /*
+             * Check if this node is currently suitable for this allocation.
+             * 1. It has sufficient memory in the requested zone and the
+             * 2. request must fit in the unclaimed memory of the node minus
+             *    outstanding claims, unless the allocation is made by a domain
+             *    with sufficient node-claimed memory to cover the allocation.
+             */
+            if ( !avail[node] || (avail[node][zone] < request) ||
+                 (insufficient_memory(node, request) &&
+                  (!d || node != d->claim_node ||     /* a domain with claims */
+                   request > d->outstanding_pages)) ) /* claim covers request */
+                continue;  /* next zone/node if insufficient memory or claims */
 
             /* Find smallest order which can satisfy the request. */
             for ( j = order; j <= MAX_ORDER; j++ )
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 7/7] xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory
  2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
                   ` (5 preceding siblings ...)
  2025-09-07 16:15 ` [PATCH v3 6/7] xen/page_alloc: Protect claimed memory against other allocations Bernhard Kaindl
@ 2025-09-07 16:15 ` Bernhard Kaindl
  2025-11-13 14:28   ` Jan Beulich
  6 siblings, 1 reply; 15+ messages in thread
From: Bernhard Kaindl @ 2025-09-07 16:15 UTC (permalink / raw)
  To: xen-devel
  Cc: Alejandro Vallejo, Bernhard Kaindl, Daniel P. Smith,
	Anthony PERARD, Andrew Cooper, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini,
	Juergen Gross, Christian Lindig, David Scott

Add the new hypercall requested during the review of the v1 series
do not require changing the API for multi-node claims.

The hypercall receives a number of claims, intented to be one claim per
NUMA node, and limited to one claim for now. The changes to update the
NUMA claims management to handle updating the claims for multiple
NUMA nodes of a domain at once are deferred to the next series.

Cc: Alejandro Vallejo <alejandro.garciavallejo@amd.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>

---
Changes in v3:
- As a review to check nr_claims to be > 0 (but uint = superflous),
  but to avoid a raised eyebrow, add "> 0", which the compiler will
  optimise away anyway.
---
 tools/flask/policy/modules/dom0.te  |  1 +
 tools/flask/policy/modules/xen.if   |  1 +
 tools/include/xenctrl.h             |  4 +++
 tools/libs/ctrl/xc_domain.c         | 42 +++++++++++++++++++++++++++++
 tools/ocaml/libs/xc/xenctrl.ml      |  9 +++++++
 tools/ocaml/libs/xc/xenctrl.mli     |  9 +++++++
 tools/ocaml/libs/xc/xenctrl_stubs.c | 21 +++++++++++++++
 xen/common/domain.c                 | 29 ++++++++++++++++++++
 xen/common/domctl.c                 |  8 ++++++
 xen/include/public/domctl.h         | 17 ++++++++++++
 xen/include/xen/domain.h            |  2 ++
 xen/xsm/flask/hooks.c               |  3 +++
 xen/xsm/flask/policy/access_vectors |  2 ++
 13 files changed, 148 insertions(+)

diff --git a/tools/flask/policy/modules/dom0.te b/tools/flask/policy/modules/dom0.te
index ad2b4f9ea7..8801cb24f2 100644
--- a/tools/flask/policy/modules/dom0.te
+++ b/tools/flask/policy/modules/dom0.te
@@ -105,6 +105,7 @@ allow dom0_t dom0_t:domain2 {
 	get_cpu_policy
 	dt_overlay
 	get_domain_state
+	claim_memory
 };
 allow dom0_t dom0_t:resource {
 	add
diff --git a/tools/flask/policy/modules/xen.if b/tools/flask/policy/modules/xen.if
index ef7d8f438c..8e2dceb505 100644
--- a/tools/flask/policy/modules/xen.if
+++ b/tools/flask/policy/modules/xen.if
@@ -98,6 +98,7 @@ define(`create_domain_common', `
 		vuart_op
 		set_llc_colors
 		get_domain_state
+		claim_memory
 	};
 	allow $1 $2:security check_context;
 	allow $1 $2:shadow enable;
diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 965d3b585a..43ece3f2a7 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -2660,6 +2660,10 @@ int xc_domain_set_llc_colors(xc_interface *xch, uint32_t domid,
                              const uint32_t *llc_colors,
                              uint32_t num_llc_colors);
 
+int xc_domain_claim_memory(xc_interface *xch, uint32_t domid,
+                           uint32_t nr_claims,
+                           const memory_claim_t *claims);
+
 #if defined(__arm__) || defined(__aarch64__)
 int xc_dt_overlay(xc_interface *xch, void *overlay_fdt,
                   uint32_t overlay_fdt_size, uint8_t overlay_op);
diff --git a/tools/libs/ctrl/xc_domain.c b/tools/libs/ctrl/xc_domain.c
index 2ddc3f4f42..e022b76430 100644
--- a/tools/libs/ctrl/xc_domain.c
+++ b/tools/libs/ctrl/xc_domain.c
@@ -2229,6 +2229,48 @@ out:
 
     return ret;
 }
+
+/*
+ * Claim memory for a domain. A Domain can only have one type of claim:
+ *
+ * If the number of claims is 0, existing claims are cancelled.
+ * Updating claims is not supported, cancel the existing claim first.
+ *
+ * Memory allocations consume the outstanding claim and if not enough memory is
+ * free, the allocation must be satisfied from the remaining outstanding claim.
+ */
+int xc_domain_claim_memory(xc_interface *xch, uint32_t domid,
+                           uint32_t nr_claims,
+                           const memory_claim_t *claims)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_claim_memory,
+        .domain = domid,
+        .u.claim_memory.nr_claims = nr_claims,
+    };
+    int ret;
+    DECLARE_HYPERCALL_BUFFER(struct xen_domctl_claim_memory, buffer);
+
+    /* Use an array to not need changes for multi-node claims in the future */
+    if ( nr_claims > 0 )
+    {
+        size_t bytes = sizeof(memory_claim_t) * nr_claims;
+
+        buffer = xc_hypercall_buffer_alloc(xch, buffer, bytes);
+        if ( buffer == NULL )
+        {
+            PERROR("Could not allocate memory for xc_domain_claim_memory");
+            return -1;
+        }
+        memcpy(buffer, claims, bytes);
+        set_xen_guest_handle(domctl.u.claim_memory.claims, buffer);
+    }
+
+    ret = do_domctl(xch, &domctl);
+    xc_hypercall_buffer_free(xch, buffer);
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/ocaml/libs/xc/xenctrl.ml b/tools/ocaml/libs/xc/xenctrl.ml
index 97108b9d86..c8692fb169 100644
--- a/tools/ocaml/libs/xc/xenctrl.ml
+++ b/tools/ocaml/libs/xc/xenctrl.ml
@@ -370,6 +370,15 @@ external domain_deassign_device: handle -> domid -> (int * int * int * int) -> u
 external domain_test_assign_device: handle -> domid -> (int * int * int * int) -> bool
   = "stub_xc_domain_test_assign_device"
 
+type claim =
+  {
+    node: int;
+    nr_pages: int64;
+  }
+
+external domain_claim_memory: handle -> domid -> int -> claim array -> unit
+  = "stub_xc_domain_claim_memory"
+
 external version: handle -> version = "stub_xc_version_version"
 external version_compile_info: handle -> compile_info
   = "stub_xc_version_compile_info"
diff --git a/tools/ocaml/libs/xc/xenctrl.mli b/tools/ocaml/libs/xc/xenctrl.mli
index 9fccb2c2c2..82d59fc80d 100644
--- a/tools/ocaml/libs/xc/xenctrl.mli
+++ b/tools/ocaml/libs/xc/xenctrl.mli
@@ -297,6 +297,15 @@ external domain_deassign_device: handle -> domid -> (int * int * int * int) -> u
 external domain_test_assign_device: handle -> domid -> (int * int * int * int) -> bool
   = "stub_xc_domain_test_assign_device"
 
+type claim =
+  {
+    node: int;
+    nr_pages: int64;
+  }
+
+external domain_claim_memory: handle -> domid -> int -> claim array -> unit
+  = "stub_xc_domain_claim_memory"
+
 external version : handle -> version = "stub_xc_version_version"
 external version_compile_info : handle -> compile_info
   = "stub_xc_version_compile_info"
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c
index ac2a7537d6..53f56c5437 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -1435,6 +1435,27 @@ CAMLprim value stub_xc_watchdog(value xch_val, value domid, value timeout)
 	CAMLreturn(Val_int(ret));
 }
 
+/* Claim memory for a domain. See xc_domain_claim_memory() for details. */
+CAMLprim value stub_xc_domain_claim_memory(value xch_val, value domid,
+                                           value num_claims, value desc)
+{
+	CAMLparam4(xch_val, domid, num_claims, desc);
+	xc_interface *xch = xch_of_val(xch_val);
+	int i, retval, nr_claims = Int_val(num_claims);
+	memory_claim_t claim[nr_claims];
+
+	for (i = 0; i < nr_claims; i++) {
+		claim[i].node = Int_val(Field(desc, i*2));
+		claim[i].nr_pages = Int64_val(Field(desc, i*2 + 1));
+	}
+
+	retval = xc_domain_claim_memory(xch, Int_val(domid), nr_claims, claim);
+	if (retval < 0)
+		failwith_xc(xch);
+
+	CAMLreturn(Val_unit);
+}
+
 /*
  * Local variables:
  *  indent-tabs-mode: t
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 6ee9f23b10..39f1c3718c 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -267,6 +267,35 @@ int get_domain_state(struct xen_domctl_get_domain_state *info, struct domain *d,
     return rc;
 }
 
+/* XEN_DOMCTL_claim_memory: Claim an amount of memory for a domain */
+int claim_memory(struct domain *d, const struct xen_domctl_claim_memory *uinfo)
+{
+    memory_claim_t claim;
+    int rc;
+
+    switch ( uinfo->nr_claims )
+    {
+        case 0:
+            /* Cancel existing claim. */
+            rc = domain_claim_pages(d, 0, 0);
+            break;
+
+        case 1:
+            /* Only single node claims supported at the moment. */
+            if ( copy_from_guest(&claim, uinfo->claims, 1) )
+                return -EFAULT;
+
+            rc = domain_claim_pages(d, claim.node, claim.nr_pages);
+            break;
+
+        default:
+            rc = -EOPNOTSUPP;
+            break;
+    }
+
+    return rc;
+}
+
 static void __domain_finalise_shutdown(struct domain *d)
 {
     struct vcpu *v;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 71e712c1f3..cf9537b02c 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -863,6 +863,14 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         ret = get_domain_state(&op->u.get_domain_state, d, &op->domain);
         break;
 
+    case XEN_DOMCTL_claim_memory:
+        ret = xsm_claim_pages(XSM_PRIV, d);
+        if ( ret )
+            break;
+
+        ret = claim_memory(d, &op->u.claim_memory);
+        break;
+
     default:
         ret = arch_do_domctl(op, d, u_domctl);
         break;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 8f6708c0a7..1cebbb878e 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -1276,6 +1276,21 @@ struct xen_domctl_get_domain_state {
     uint64_t unique_id;      /* Unique domain identifier. */
 };
 
+struct xen_memory_claim {
+    unsigned int node;      /* NUMA node, XC_NUMA_NO_NODE for a host claim */
+    unsigned long nr_pages; /* Number of pages to claim */
+};
+typedef struct xen_memory_claim memory_claim_t;
+DEFINE_XEN_GUEST_HANDLE(memory_claim_t);
+
+/* XEN_DOMCTL_claim_memory: Claim an amount of memory for a domain */
+struct xen_domctl_claim_memory {
+    /* IN: array of memory claims */
+    XEN_GUEST_HANDLE_64(memory_claim_t) claims;
+    /* IN: number of claims */
+    unsigned int nr_claims;
+};
+
 struct xen_domctl {
 /* Stable domctl ops: interface_version is required to be 0.  */
     uint32_t cmd;
@@ -1368,6 +1383,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_gsi_permission                88
 #define XEN_DOMCTL_set_llc_colors                89
 #define XEN_DOMCTL_get_domain_state              90 /* stable interface */
+#define XEN_DOMCTL_claim_memory                  91
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1436,6 +1452,7 @@ struct xen_domctl {
 #endif
         struct xen_domctl_set_llc_colors    set_llc_colors;
         struct xen_domctl_get_domain_state  get_domain_state;
+        struct xen_domctl_claim_memory      claim_memory;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index 8aab05ae93..cd3e933fbf 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -195,4 +195,6 @@ extern bool vmtrace_available;
 
 extern bool vpmu_is_available;
 
+int claim_memory(struct domain *d, const struct xen_domctl_claim_memory *uinfo);
+
 #endif /* __XEN_DOMAIN_H__ */
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index b0308e1b26..6b2535b666 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -853,6 +853,9 @@ static int cf_check flask_domctl(struct domain *d, unsigned int cmd,
     case XEN_DOMCTL_set_llc_colors:
         return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__SET_LLC_COLORS);
 
+    case XEN_DOMCTL_claim_memory:
+        return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__CLAIM_MEMORY);
+
     default:
         return avc_unknown_permission("domctl", cmd);
     }
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 51a1577a66..87338b5c2a 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -259,6 +259,8 @@ class domain2
     set_llc_colors
 # XEN_DOMCTL_get_domain_state
     get_domain_state
+# XEN_DOMCTL_claim_memory
+    claim_memory
 }
 
 # Similar to class domain, but primarily contains domctls related to HVM domains
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables
  2025-09-07 16:15 ` [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables Bernhard Kaindl
@ 2025-09-08 14:11   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-09-08 14:11 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
	Anthony PERARD, Roger Pau Monné, Shawn Anastasio,
	Alistair Francis, Bob Eshleman, Connor Davis, Oleksii Kurochko,
	xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> During the review of the 3rd commit of the NUMA claims v1 series, it
> was found to be concerning (performance-wise) add add another array
> like this that randomly written from all nodes:
> 
> +/* Per-node counts of free pages */
> +static unsigned long pernode_avail_pages[MAX_NUMNODES];
> 
> As solution, it was suggested to introduce per_node() paralleling
> per_cpu(), or (less desirable) to make sure one particular cache
> line would only ever be written from a single node.
> 
> It was mentioned that node_need_scrub[] could/should use it, and
> I assume others may benefit too.
> 
> per_cpu() is a simple standard blueprint that is easy to copy, add
> per_node(), paralleling per_cpu() as the preferred suggestion:
> 
> It is entirely derived from per_cpu(), with a few differences:
> 
> - No add/remove callback: Nodes are onlined on boot and never offlined.
> 
> - As per_node(avail_pages) and pernode(outstanding_claims) are used by
>   the buddy allocator itself, and the buddy allocator is used to alloc
>   the per_node() memory from the local NUMA node, there is a catch:
> 
>   per_node() must already be working to have a working buddy allocator:
> 
>   - Init per_node() before the buddy allocator is ready as it needs
>     to be setup before its use, e.g. to init per_node(avail_pages)!
> 
>   Use an early static __initdata array during early boot and migrate
>   it to the NUMA-node-local xenheap before we enable the secondary CPUs.

Hmm, this is awkward, especially the need to update the offsets. See
comment further down. This aspect may put under question whether the
underlying idea was actually a good one.

> Cc: Jan Beulich <jbeulich@suse.com>

Cc: here is a little odd, for my taste at least. This may want to be
Requested-by:.

> Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>
> 
> ---
> Changes:
> - This is patch is new in v3 to resolve the the suggestion from the review.
> - The previous patch #2 is removed from the series as not required,
>   which is best visualized by how claims are used:
> 
>   - Claim needed memory
>   - Allocate all domain memory
>   - Cancel a possible leftover claim
>   - Finish building the domain and unpause it.
> 
>   As it makes no sense to repeat "Claim needed memory" at any time,
>   the change made had no practical significance.  It can be applied
>   later as a tiny, not important cleanup, e.g. with multi-node claims.
> 
> Implementation note on this patch (not needed for the commit message):
> 
> Instead of the __initdata array, I tried to alloc bootmem, but it
> caused paging_init() to panic with not enough memory for p2m on a
> very large 4-Socket, 480 pCPU, 4TiB RAM host (or it caused boot to
> hang after the microcode updates of the 480 pCPUs)

That's odd, but without any details it's hard to make a suggestion.

> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -320,6 +320,51 @@ static bool __init nodes_cover_memory(void)
>      return true;
>  }
>  
> +/* Defined on the BSS in xen.lds.S, used for area sizes and relative offsets */
> +extern const char __pernode_start[];
> +extern const char __pernode_end[];

May I suggest to use unsigned char throughout?

> +unsigned long __read_mostly __pernode_offset[MAX_NUMNODES];

With what you say in the description as to differences from per-CPU data,
this can be __ro_after_init, I expect?

> +#define EARLY_PERNODE_AREA_SIZE (SMP_CACHE_BYTES)

On what basis was this chosen? And how would we, at build time, notice when
this became too small?

> +static char early_pernode_area[MAX_NUMNODES][EARLY_PERNODE_AREA_SIZE]
> +    __initdata __cacheline_aligned;
> +
> +/* per_node() needs to be ready before the first alloc call using the heap */
> +static void __init early_init_pernode_areas(void)
> +{
> +    unsigned int node;
> +
> +    if (__pernode_end - __pernode_start > EARLY_PERNODE_AREA_SIZE)

Nit: Style.

> +        panic("per_node() area too small, increase EARLY_PERNODE_AREA_SIZE");
> +
> +    for_each_online_node(node)
> +        __pernode_offset[node] = early_pernode_area[node] - __pernode_start;

percpu_init_areas() poisons all slots, i.e. unused ones will remain poisoned.
I think something like that is wanted here, too.

> +}
> +
> +/* Before going SMP, migrate the per_node memory areas to their NUMA nodes */
> +static int __init init_pernode_areas(void)

This lacks cf_check, for its use with presmp_initcall() below.

> +{
> +    const int pernode_size = __pernode_end - __pernode_start;  /* size in BSS */

Why plain int? The value can't go negative.

> +    unsigned int node;
> +
> +    for_each_online_node(node)
> +    {
> +        char *p = alloc_xenheap_pages(get_order_from_bytes(pernode_size),
> +                                      MEMF_node(node));

Is per-node data really in need of being page granular? (Question also
applies to the new linker script construct.) Then again I realize we still
don't really have NUMA-aware sub-page-allocator functions. So a comment
here may have to do for now.

Further, like done for CPU0, imo node 0 will want to use the .bss area,
rather than having something allocated for it. That would partly address
the non-NUMA related comment given elsewhere.

> +        if ( !p )
> +            return -ENOMEM;

How's this going to work? Initcalls don't really have their return values
checked, iirc.

> +        /* migrate the pernode data from the bootmem area to the xenheap */

Nit (style): Capital letter at start of comment please.

> +        memcpy(p, early_pernode_area[node], SMP_CACHE_BYTES);

This SMP_CACHE_BYTES (which really means to be EARLY_PERNODE_AREA_SIZE aiui)
wants to become a suitable ARRAY_SIZE().

This also doesn't look to be safe towards an interrupt kicking in, when
sooner or later some interrupt handling code may also access per-node
data.

> +        __pernode_offset[node] = p - __pernode_start;
> +    }
> +    return 0;
> +}
> +
> +presmp_initcall(init_pernode_areas);

Nit: We prefer no blank line between the function and such an annotation.

> @@ -617,7 +662,7 @@ static int __init numa_emulation(unsigned long start_pfn,
>  }
>  #endif
>  
> -void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
> +static void __init init_nodes(unsigned long start_pfn, unsigned long end_pfn)
>  {
>      unsigned int i;
>      paddr_t start = pfn_to_paddr(start_pfn);
> @@ -656,6 +701,12 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
>      setup_node_bootmem(0, start, end);
>  }
>  
> +void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +    init_nodes(start_pfn, end_pfn);
> +    early_init_pernode_areas(); /* With all nodes registered, init per_node() */
> +}

This file is built only when CONFIG_NUMA=y. Non-NUMA configurations, however,
also need to have a working per_node(), with only ever 0 passed in as the node
ID.

> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -152,4 +152,19 @@ static inline nodeid_t mfn_to_nid(mfn_t mfn)
>  
>  #define page_to_nid(pg) mfn_to_nid(page_to_mfn(pg))
>  
> +/* Per NUMA node data area handling based on per-cpu data area handling. */
> +extern unsigned long __pernode_offset[];
> +
> +#define DECLARE_PER_NODE(type, name) \
> +    extern __typeof__(type) pernode__ ## name
> +
> +#define __DEFINE_PER_NODE(attr, type, name) \
> +    attr __typeof__(type) pernode_ ## name

I would suggest to omit this as long as there's just a single use.

> +#define DEFINE_PER_NODE(type, name) \
> +    __DEFINE_PER_NODE(__section(".bss.pernode"), type, _ ## name)
> +
> +#define per_node(var, node)  \
> +    (*RELOC_HIDE(&pernode__##var, __pernode_offset[node]))

I'm glad you didn't add this_node() (yet). As per discussion with Andrew earlier
today, there may want to be a comment here as to its absence, explaining that
what "this node" is first needs determining. There can, after all, be a CPU, a
memory, and a device view. While for devices we may not want to use this_node(),
for memory it may well happen. Hence something entirely CPU-centric may not work.

Furthermore Andrew pointed out that the ACPI exposed granularity may not be
sufficient for all purposes. He suggested to make clear here that for the time
being all of this is entirely SRAT based. (Iirc there was some DT work towards
supporting NUMA there as well, but my impression is that this work was abandoned.)

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 2/7] xen/page_alloc: Simplify domain_adjust_tot_pages() further
  2025-09-07 16:15 ` [PATCH v3 2/7] xen/page_alloc: Simplify domain_adjust_tot_pages() further Bernhard Kaindl
@ 2025-09-23 15:49   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-09-23 15:49 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> When domain memory is allocated, domain_adjust_tot_pages(),
> also reduces the outstanding claim.
> 
> Replace the checks to not over-reduce the claim beyond 0
> by using min() which prevents the claim to become negative
> (and also not be over-conumed for the node and globally)

I'm okay with the code change now, but upon re-reading this it still
feels as if this was describing an issue with the original code that
is being corrected. Afaict there's no functional change here, and hence
the above may want re-wording accordingly.

> Cc: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>
> 
> ---
> Changes:
> - Was added as 2/7 in v2, the review by Jan Beulich is applied.

This, imo, isn't a useful ChangeLog: It requires readers to go look up
the comments on the earlier version. You want to say what has changed;
upon what basis the changes were is (again imo) secondary.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/7] xen/page_alloc: Add and track per_node(avail_pages)
  2025-09-07 16:15 ` [PATCH v3 3/7] xen/page_alloc: Add and track per_node(avail_pages) Bernhard Kaindl
@ 2025-09-23 15:55   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-09-23 15:55 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -486,6 +486,10 @@ static unsigned long node_need_scrub[MAX_NUMNODES];
>  static unsigned long *avail[MAX_NUMNODES];
>  static long total_avail_pages;
>  
> +/* Per-NUMA-node counts of free pages */
> +DECLARE_PER_NODE(unsigned long, avail_pages);
> +DEFINE_PER_NODE(unsigned long, avail_pages);

Why both a declare and a define, but no static? A declare, if needed, would
need to go into a header, I expect. Whereas if only this CU needs access, no
declare should be needed, but static be added to the define.

> @@ -1074,6 +1078,8 @@ static struct page_info *alloc_heap_pages(
>  
>      ASSERT(avail[node][zone] >= request);
>      avail[node][zone] -= request;
> +    ASSERT(per_node(avail_pages, node) >= request);
> +    per_node(avail_pages, node) -= request;

Seeing the avail[] adjustment in context: What's the difference of that to
per_node(avail_pages)? I don't think the (apparent?) redundancy is properly
explained in the description.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 4/7] xen/page_alloc: Add staking a NUMA node claim for a domain
  2025-09-07 16:15 ` [PATCH v3 4/7] xen/page_alloc: Add staking a NUMA node claim for a domain Bernhard Kaindl
@ 2025-09-23 16:09   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-09-23 16:09 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> Update domain_set_outstanding_pages() to domain_claim_pages() for
> staking claims for domains on NUMA nodes:
> 
> domain_claim_pages() is a handler for claiming pages, where its former
> name suggested that it just sets the domain's outstanding claims.
> 
> Actually, three different code locations do perform just this task:
> 
> Fix this using a helper to avoid repeating yourself (an anti-pattern)
> for just only updating the domain's outstanding pages is added as well:
> 
> It removes the need to repeat the same sequence of operations at three
> diffent places and helps to have a single location for adding multi-node
> claims. It also makes the code much shorter and easier to follow.
> 
> Fix the meaning of the claims argument of domain_claim_pages()
> for NUMA-node claims:
> 
> - For NUMA-node claims, we need to claim defined amounts of memory
>   on different NUMA nodes. Previously, the argument was a "reservation"
>   and the claim was made on the difference between d->tot_pages and
>   the reservations. Of course, the argument needed to be > d->tot_pages.
> 
>   This interacs badly with NUMA claims:
>   NUMA node claims are not related to potentially already allocated
>   memory and reducing the claim by already allocated memory would
>   not work in case d->tot_pages already has some amount of pages.
> 
> - Fix this by simply claiming the given amount of pages.
> 
> - Update the legacy caller of domain_claim_pages() accordingly by
>   moving the reduction of the claim by d->tot_pages to it:
> 
>   No change for the users of the legacy hypercall, and a usable
>   interface for staking NUMA claims.
> 
> Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>
> Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>

This looks the wrong way round, and then I expect a From: is also missing.

> ---
> Changes in v3:
> 
> - Renamed domain_set_outstanding_pages() and add check from review.
> - Reorganized v3, v4 and v5 as per review to avoid non-functional
>   changes:

What's v3, v4, and v5 here (when we're only at v3)?

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1682,7 +1682,20 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          rc = xsm_claim_pages(XSM_PRIV, d);
>  
>          if ( !rc )
> -            rc = domain_set_outstanding_pages(d, reservation.nr_extents);
> +        {
> +            unsigned long new_claim = reservation.nr_extents;
> +
> +            /*
> +             * For backwards compatibility, keep the meaning of nr_extents:
> +             * it is the target number of pages for the domain.
> +             * In case memory for the domain was allocated before, we must
> +             * substract the already allocated pages from the reservation.
> +             */
> +            if ( new_claim )
> +                new_claim -= domain_tot_pages(d);

This is now racy (and hence a functional change): Without holding the heap
lock, a domain's total pages can change behind you back.

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -492,6 +492,30 @@ DEFINE_PER_NODE(unsigned long, avail_pages);
>  
>  static DEFINE_SPINLOCK(heap_lock);
>  static long outstanding_claims; /* total outstanding claims by all domains */
> +DECLARE_PER_NODE(long, outstanding_claims);
> +DEFINE_PER_NODE(long, outstanding_claims);

See comment on the earlier patch.

> +#define domain_has_node_claim(d) (d->claim_node != NUMA_NO_NODE)
> +
> +static inline bool insufficient_memory(unsigned long request, nodeid_t node)

Except in special cases, no inline please for static functions in .c files.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -405,6 +405,7 @@ struct domain
>      unsigned int     outstanding_pages; /* pages claimed but not possessed */
>      unsigned int     max_pages;         /* maximum value for domain_tot_pages() */
>      unsigned int     extra_pages;       /* pages not included in domain_tot_pages() */
> +    nodeid_t         claim_node;        /* NUMA_NO_NODE for host-wide claims */

I don't quite understand the purpose of this field: It looks to be a
hidden parameter to domain_adjust_outstanding_claim(), yet then why isn't
is a real one?

As I'm also having a hard time following the description, I fear I have to
stay away from making further comments (on the main part of the code
changes), until I understand better what's (intended to be) going on here.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 5/7] xen/page_alloc: Pass node to adjust_tot_pages and check it
  2025-09-07 16:15 ` [PATCH v3 5/7] xen/page_alloc: Pass node to adjust_tot_pages and check it Bernhard Kaindl
@ 2025-11-13 14:13   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-11-13 14:13 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Andrew Cooper, Roger Pau Monné,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	Tamas K Lengyel, xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> @@ -2806,7 +2816,8 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
>                  }
>              }
>  
> -            drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
> +            drop_dom_ref = !domain_adjust_tot_pages(d, NUMA_NO_NODE,
> +                                                    -(1 << order));

As you touch this, can you take the opportunity and switch to using 1L here?

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 6/7] xen/page_alloc: Protect claimed memory against other allocations
  2025-09-07 16:15 ` [PATCH v3 6/7] xen/page_alloc: Protect claimed memory against other allocations Bernhard Kaindl
@ 2025-11-13 14:18   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-11-13 14:18 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Julien Grall, Roger Pau Monné, Stefano Stabellini, xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -980,9 +980,19 @@ static struct page_info *get_free_buddy(unsigned int zone_lo,
>      {
>          zone = zone_hi;
>          do {
> -            /* Check if target node can support the allocation. */
> -            if ( !avail[node] || (avail[node][zone] < (1UL << order)) )
> -                continue;
> +            unsigned long request = 1UL << order;
> +            /*
> +             * Check if this node is currently suitable for this allocation.
> +             * 1. It has sufficient memory in the requested zone and the
> +             * 2. request must fit in the unclaimed memory of the node minus
> +             *    outstanding claims, unless the allocation is made by a domain
> +             *    with sufficient node-claimed memory to cover the allocation.
> +             */
> +            if ( !avail[node] || (avail[node][zone] < request) ||
> +                 (insufficient_memory(node, request) &&
> +                  (!d || node != d->claim_node ||     /* a domain with claims */

What if ->claim_node is NUMA_NO_NODE?

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 7/7] xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory
  2025-09-07 16:15 ` [PATCH v3 7/7] xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory Bernhard Kaindl
@ 2025-11-13 14:28   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2025-11-13 14:28 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Alejandro Vallejo, Daniel P. Smith, Anthony PERARD, Andrew Cooper,
	Michal Orzel, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, Juergen Gross, Christian Lindig, David Scott,
	xen-devel

On 07.09.2025 18:15, Bernhard Kaindl wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -267,6 +267,35 @@ int get_domain_state(struct xen_domctl_get_domain_state *info, struct domain *d,
>      return rc;
>  }
>  
> +/* XEN_DOMCTL_claim_memory: Claim an amount of memory for a domain */
> +int claim_memory(struct domain *d, const struct xen_domctl_claim_memory *uinfo)
> +{
> +    memory_claim_t claim;
> +    int rc;
> +
> +    switch ( uinfo->nr_claims )
> +    {
> +        case 0:

Nit (style) Indentation (case labels ant indenting the same as the respective
opening figure brace).

Considering what follows I'm not quite sure though that using switch() is
appropriate here. Kind of depends on what the longer-term plans are.

> +            /* Cancel existing claim. */
> +            rc = domain_claim_pages(d, 0, 0);
> +            break;

This effect also wants mentioning in the public header. It's not what I would
expect (a no-op). Also, shouldn't NUMA_NO_NODE be passed here?

> +        case 1:
> +            /* Only single node claims supported at the moment. */

Isn't the comment misplaced? Comments usually describe what follows, and the
"nr_claims == 1" restriction starts already a line earlier. Maybe the comment
would best go on the same line as the case label.

> +            if ( copy_from_guest(&claim, uinfo->claims, 1) )
> +                return -EFAULT;
> +
> +            rc = domain_claim_pages(d, claim.node, claim.nr_pages);

claim.node needs bounds checking, for nodemask_test() to not overrun the bitmap.

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -1276,6 +1276,21 @@ struct xen_domctl_get_domain_state {
>      uint64_t unique_id;      /* Unique domain identifier. */
>  };
>  
> +struct xen_memory_claim {
> +    unsigned int node;      /* NUMA node, XC_NUMA_NO_NODE for a host claim */
> +    unsigned long nr_pages; /* Number of pages to claim */

Fixed-width types need using here and ...

> +};
> +typedef struct xen_memory_claim memory_claim_t;
> +DEFINE_XEN_GUEST_HANDLE(memory_claim_t);
> +
> +/* XEN_DOMCTL_claim_memory: Claim an amount of memory for a domain */
> +struct xen_domctl_claim_memory {
> +    /* IN: array of memory claims */
> +    XEN_GUEST_HANDLE_64(memory_claim_t) claims;
> +    /* IN: number of claims */
> +    unsigned int nr_claims;

... here.

Jan


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-11-13 14:29 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-07 16:15 [PATCH v3 0/7] NUMA: Add per-node domain-memory claims Bernhard Kaindl
2025-09-07 16:15 ` [PATCH v3 1/7] xen/numa: Add per_node() variables paralleling per_cpu() variables Bernhard Kaindl
2025-09-08 14:11   ` Jan Beulich
2025-09-07 16:15 ` [PATCH v3 2/7] xen/page_alloc: Simplify domain_adjust_tot_pages() further Bernhard Kaindl
2025-09-23 15:49   ` Jan Beulich
2025-09-07 16:15 ` [PATCH v3 3/7] xen/page_alloc: Add and track per_node(avail_pages) Bernhard Kaindl
2025-09-23 15:55   ` Jan Beulich
2025-09-07 16:15 ` [PATCH v3 4/7] xen/page_alloc: Add staking a NUMA node claim for a domain Bernhard Kaindl
2025-09-23 16:09   ` Jan Beulich
2025-09-07 16:15 ` [PATCH v3 5/7] xen/page_alloc: Pass node to adjust_tot_pages and check it Bernhard Kaindl
2025-11-13 14:13   ` Jan Beulich
2025-09-07 16:15 ` [PATCH v3 6/7] xen/page_alloc: Protect claimed memory against other allocations Bernhard Kaindl
2025-11-13 14:18   ` Jan Beulich
2025-09-07 16:15 ` [PATCH v3 7/7] xen: New hypercall to claim memory using XEN_DOMCTL_claim_memory Bernhard Kaindl
2025-11-13 14:28   ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.