All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
@ 2026-06-02  8:49 Bernhard Kaindl
  2026-06-02 11:40 ` Roger Pau Monné
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Bernhard Kaindl @ 2026-06-02  8:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Bernhard Kaindl, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Teddy Astie, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Tamas K Lengyel, Alejandro Vallejo,
	Marcus Granado

Using the 'u' debug key invokes dump_numa(), which walks each domain's
page list under page_alloc_lock to compute per-NUMA-node counts. On
domains with many pages, this O(pages) operation can hold the lock long
enough to trigger a watchdog timeout.

Replace the page-list walk with node_tot_pages[], a per-node counter
maintained in struct domain. This reduces dump_numa()'s per-domain work
from O(pages) to O(nodes).

Accounting via domain_adjust_tot_pages()

Most page-count updates flow through domain_adjust_tot_pages(). The
helper takes the affected NUMA node and updates both tot_pages and
node_tot_pages[node] in the same locked critical section.

A debug-only helper, assert_numa_page_count(), checks after each locked
update that the node totals sum to tot_pages.

Accounting in memory_exchange()

memory_exchange() cannot use the same helper. It is preemptible, may
fail and roll back part way through a chunk, and does not hold
page_alloc_lock for the whole operation. The final success-path value of
tot_pages is unchanged: by construction, the number of input pages stolen
equals the number of output pages assigned. domain_commit_page_deltas()
still applies the accumulated deltas to tot_pages, but their sum is zero.

Instead, it accumulates deferred per-node deltas in
node_tot_pages_adjustments[] for the current chunk:

  When steal_page() is called with MEMF_no_refcount, tot_pages is
  intentionally not decremented. A decrement of 1 is recorded for the
  input page's node.

  When assign_page() is called with MEMF_no_refcount, tot_pages is
  intentionally not incremented. An increment equal to the number of
  pages in the output extent is recorded.

At the end of a successful chunk, these deltas are committed under
page_alloc_lock. The net tot_pages delta is zero, while node_tot_pages[]
is updated to reflect the NUMA-node movement.

Correctness on failure paths:

  Invariant 1: Pages allocated with MEMF_no_owner are not counted in
  tot_pages until assign_page() succeeds. Freeing such pages with
  free_domheap_pages() is therefore accounting-neutral.

  Invariant 2: Pages stolen with MEMF_no_refcount remain counted in
  tot_pages until the deferred adjustment is committed. No window
  exists in which a stolen page is absent from d->page_list and
  already subtracted from tot_pages.

These invariants cover the failure cases:

  Input-side failure before output allocation: any input pages already
  stolen are still on in_chunk_list. The fail path attempts to assign them
  back to the domain. A successful reassign cancels the earlier negative
  delta; if the reassign fails because the domain is dying, the negative
  delta remains and is committed, reflecting that the page has been freed.

  OOM during output allocation (alloc_domheap_pages() returns NULL):
  output pages already allocated before the failure are on out_chunk_list,
  have no owner, and have never been counted in tot_pages. Freeing them
  does not change accounting. Stolen input pages are handled as above:
  reassignment cancels their negative deltas, while unreassigned pages are
  deducted because the domain is dying.

  assign_page() failure because the domain is dying: output pages
  assigned before the failure remain accounted to the domain and are
  reclaimed later by domain_relinquish_resources(). The accumulated
  deltas therefore contain positive entries for the successful output
  assignments and negative entries for all stolen input pages.
  Committing the net delta to both tot_pages and node_tot_pages[] leaves
  the accounting consistent with the pages the domain still owns.

  Post-assignment failure in guest_physmap_add_page() or while copying the
  new frame number back to the guest: at this point the output extent has
  already been assigned to the domain and the input pages for the chunk
  have already been freed. Input and output page counts are therefore
  equal, so the committed update only redistributes per-node counts and
  leaves tot_pages unchanged.

A functionally equivalent version of this change is included in the
XenServer 9 pre-release and is widely exercised there. Debug Xen
builds with the added consistency checks are also used in internal
end-to-end regression testing (XenRT). For upstream submission, the
change was additionally updated to support non-NUMA Xen builds and
reviewed again for correctness, including the failure paths.

Fixes: 4dff228603ba ("Walking the page lists needs the page_alloc lock")
Signed-off-by: Alejandro Vallejo <alejandro.garciavallejo@amd.com>
Signed-off-by: Marcus Granado <marcus.granado@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@citrix.com>
---
Brief history of this code base: Work was started by Marcus Granado,
adding per-node accounting to domain_adjust_tot_pages() using work
by Alejandro Vallejo <alejandro.garciavallejo@amd.com> as basis.

In a 2nd phase, Bernhard Kaindl and Roger Pau Monné implemented
accumulating of per-node deltas in memory_exchange(), including
handling across failure paths and consistency checks of the page
counters after each update with CONFIG_DEBUG enabled. Bernhard
updated dump_numa() to replace walking d->page_list under lock
and Roger fixed handling of hypercall preemption. Andrew Cooper
participated in this phase as well and Joash Robinson tested using
end-to-end tests exercising debug-key 'u' and checking its output.

In a 3rd phase, Bernhard Kaindl added explanatory comments, prepared
this commit message for review, fixed a failure path that can occur
when memory_exchange() fails mid-exchange due to OOM, fixed the code
for non-NUMA builds and consolidated loops that apply accumulated
deltas into a single helper.
---
 xen/arch/x86/mm.c             |  3 +-
 xen/arch/x86/mm/mem_sharing.c |  4 +-
 xen/common/domain.c           |  9 ++++
 xen/common/grant_table.c      |  4 +-
 xen/common/memory.c           | 79 +++++++++++++++++++++++++----------
 xen/common/numa.c             | 14 +------
 xen/common/page_alloc.c       | 40 ++++++++++++++++--
 xen/include/xen/mm.h          | 12 +++++-
 xen/include/xen/sched.h       | 24 +++++++++++
 9 files changed, 143 insertions(+), 46 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a158379e7734..a723a2c50a2f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4445,7 +4445,8 @@ int steal_page(
     page_list_del(page, &d->page_list);
 
     /* Unlink from original owner. */
-    if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
+    if ( !(memflags & MEMF_no_refcount) &&
+         !domain_adjust_tot_pages(d, page_to_nid(page), -1) )
         drop_dom_ref = true;
 
     nrspin_unlock(&d->page_alloc_lock);
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 5c7a0ff30e8b..89ae418be0ae 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -723,7 +723,7 @@ static int page_make_sharable(struct domain *d,
     if ( !validate_only )
     {
         page_set_owner(page, dom_cow);
-        drop_dom_ref = !domain_adjust_tot_pages(d, -1);
+        drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(page), -1);
         page_list_del(page, &d->page_list);
     }
 
@@ -769,7 +769,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
     ASSERT(page_get_owner(page) == dom_cow);
     page_set_owner(page, d);
 
-    if ( domain_adjust_tot_pages(d, 1) == 1 )
+    if ( domain_adjust_tot_pages(d, page_to_nid(page), 1) == 1 )
         get_knownalive_domain(d);
     page_list_add_tail(page, &d->page_list);
     nrspin_unlock(&d->page_alloc_lock);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 8cb4241b0511..0b6afba2acdb 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1558,6 +1558,15 @@ void domain_destroy(struct domain *d)
     /* Remove from the domlist/hash. */
     domlist_remove(d);
 
+    /*
+     * Final invariant check: all pages still owned by the dying domain must
+     * be accounted for per-node before complete_domain_destroy() reclaims
+     * them.  Debug-only; expands to a no-op in production builds.
+     */
+    nrspin_lock(&d->page_alloc_lock);
+    assert_numa_page_count(d);
+    nrspin_unlock(&d->page_alloc_lock);
+
     /* Schedule RCU asynchronous completion of domain destroy. */
     call_rcu(&d->rcu, complete_domain_destroy);
 }
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index ac9fed600101..298662c3d69e 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -2404,7 +2404,7 @@ gnttab_transfer(
         }
 
         /* Okay, add the page to 'e'. */
-        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )
+        if ( unlikely(domain_adjust_tot_pages(e, page_to_nid(page), 1) == 1) )
             get_knownalive_domain(e);
 
         /*
@@ -2430,7 +2430,7 @@ gnttab_transfer(
              * page in the page total
              */
             nrspin_lock(&e->page_alloc_lock);
-            drop_dom_ref = !domain_adjust_tot_pages(e, -1);
+            drop_dom_ref = !domain_adjust_tot_pages(e, page_to_nid(page), -1);
             nrspin_unlock(&e->page_alloc_lock);
 
             if ( okay /* i.e. e->is_dying due to the surrounding if() */ )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 9e4899f9fc63..00eeaf33aefc 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -673,6 +673,14 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
     struct domain *d;
     struct page_info *page;
 
+    /*
+     * Deferred accounting for the current exchange chunk.  Keep it at
+     * function scope because the fail: path also needs to commit or cancel
+     * partial work.  domain_commit_page_deltas() clears applied entries, so
+     * each new chunk starts with a zeroed accumulator.
+     */
+    long node_tot_pages_adjustments[MAX_NUMNODES] = {};
+
     if ( copy_from_guest(&exch, arg, 1) )
         return -EFAULT;
 
@@ -822,6 +830,12 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
                 }
 
                 page_list_add(page, &in_chunk_list);
+                /*
+                 * steal_page() with MEMF_no_refcount removes ownership but
+                 * leaves the domain page counts unchanged; record the
+                 * deferred -1 for this page's node.
+                 */
+                node_tot_pages_adjustments[page_to_nid(page)]--;
 #ifdef CONFIG_X86
                 put_gfn(d, gmfn + k);
 #endif
@@ -870,32 +884,31 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
             if ( assign_page(page, exch.out.extent_order, d,
                              MEMF_no_refcount) )
             {
-                unsigned long dec_count;
-                bool drop_dom_ref;
-
                 /*
-                 * Pages in in_chunk_list is stolen without
-                 * decreasing the tot_pages. If the domain is dying when
-                 * assign pages, we need decrease the count. For those pages
-                 * that has been assigned, it should be covered by
-                 * domain_relinquish_resources().
+                 * Input pages for this chunk were stolen and freed without
+                 * decreasing tot_pages.  Output extents already assigned
+                 * before this failure will be released later by
+                 * domain_relinquish_resources().  Commit the relative deltas
+                 * now so concurrent reservation changes made under
+                 * page_alloc_lock are preserved.  The net delta is strictly
+                 * negative on this branch, so a zero post-commit tot_pages
+                 * means the last reference must be dropped.
                  */
-                dec_count = (((1UL << exch.in.extent_order) *
-                              (1UL << in_chunk_order)) -
-                             (j * (1UL << exch.out.extent_order)));
-
-                nrspin_lock(&d->page_alloc_lock);
-                drop_dom_ref = (dec_count &&
-                                !domain_adjust_tot_pages(d, -dec_count));
-                nrspin_unlock(&d->page_alloc_lock);
-
-                if ( drop_dom_ref )
+                if ( !domain_commit_page_deltas(d, node_tot_pages_adjustments) )
                     put_domain(d);
 
                 free_domheap_pages(page, exch.out.extent_order);
                 goto dying;
             }
 
+            /*
+             * assign_page() with MEMF_no_refcount gives ownership without
+             * updating the domain page counts; record the deferred extent
+             * size for this output node.
+             */
+            node_tot_pages_adjustments[page_to_nid(page)] +=
+                1UL << exch.out.extent_order;
+
             if ( __copy_from_guest_offset(&gpfn, exch.out.extent_start,
                                           (i << out_chunk_order) + j, 1) )
             {
@@ -915,8 +928,19 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
         }
         BUG_ON( !(d->is_dying) && (j != (1UL << out_chunk_order)) );
 
+        /*
+         * If publishing output GFNs failed after assignment, fail: still
+         * needs to commit the zero-net per-node redistribution.
+         */
         if ( rc )
             goto fail;
+
+        /*
+         * Success: commit the per-node redistribution.  The helper also
+         * applies the deltas to tot_pages; the final value is unchanged
+         * because the exchange is size-neutral.
+         */
+        domain_commit_page_deltas(d, node_tot_pages_adjustments);
     }
 
     exch.nr_exchanged = exch.in.nr_extents;
@@ -925,22 +949,31 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
     rcu_unlock_domain(d);
     return rc;
 
-    /*
-     * Failed a chunk! Free any partial chunk work. Tell caller how many
-     * chunks succeeded.
-     */
+    /* Failed a chunk.  Free partial work and report completed chunks. */
  fail:
     /*
      * Reassign any input pages we managed to steal.  NB that if the assign
      * fails again, we're on the hook for freeing the page, since we've already
-     * cleared PGC_allocated.
+     * cleared PGC_allocated.  A successful reassignment cancels the -1 that
+     * was recorded when the page was stolen.
      */
     while ( (page = page_list_remove_head(&in_chunk_list)) )
+    {
         if ( assign_pages(page, 1, d, MEMF_no_refcount) )
         {
             BUG_ON(!d->is_dying);
             free_domheap_page(page);
         }
+        else
+            node_tot_pages_adjustments[page_to_nid(page)] += 1;
+    }
+
+    /*
+     * Commit remaining deltas.  If all stolen pages were reassigned, this is
+     * a zero-net update.  Otherwise the domain is dying and unreassigned
+     * pages are deducted.
+     */
+    domain_commit_page_deltas(d, node_tot_pages_adjustments);
 
  dying:
     rcu_unlock_domain(d);
diff --git a/xen/common/numa.c b/xen/common/numa.c
index ad75955a1622..9f38145579e0 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -737,26 +737,14 @@ static void cf_check dump_numa(unsigned char key)
     printk("Memory location of each domain:\n");
     for_each_domain ( d )
     {
-        const struct page_info *page;
-        unsigned int page_num_node[MAX_NUMNODES];
         const struct vnuma_info *vnuma;
 
         process_pending_softirqs();
 
         printk("%pd (total: %u):\n", d, domain_tot_pages(d));
 
-        memset(page_num_node, 0, sizeof(page_num_node));
-
-        nrspin_lock(&d->page_alloc_lock);
-        page_list_for_each ( page, &d->page_list )
-        {
-            i = page_to_nid(page);
-            page_num_node[i]++;
-        }
-        nrspin_unlock(&d->page_alloc_lock);
-
         for_each_online_node ( i )
-            printk("    Node %u: %u\n", i, page_num_node[i]);
+            printk("    Node %u: %u\n", i, d->node_tot_pages[i]);
 
         if ( !read_trylock(&d->vnuma_rwlock) )
             continue;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 1801afc96a0a..ab6f44a3f993 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -520,14 +520,45 @@ static unsigned long avail_heap_pages(
     return free_pages;
 }
 
-unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
+unsigned long domain_adjust_tot_pages(struct domain *d, nodeid_t node,
+                                      long pages)
 {
     ASSERT(rspin_is_locked(&d->page_alloc_lock));
     d->tot_pages += pages;
 
+#ifdef CONFIG_NUMA
+    ASSERT(node != NUMA_NO_NODE);
+    ASSERT(node_online(node));
+    d->node_tot_pages[node] += pages;
+    assert_numa_page_count(d);
+#endif
+
     return d->tot_pages;
 }
 
+unsigned long domain_commit_page_deltas(struct domain *d,
+                                        long adjustments[MAX_NUMNODES])
+{
+    unsigned long tot_pages;
+    nodeid_t node;
+
+    nrspin_lock(&d->page_alloc_lock);
+    for_each_online_node ( node )
+        if ( adjustments[node] )
+        {
+#ifdef CONFIG_NUMA
+            d->node_tot_pages[node] += adjustments[node];
+#endif
+            d->tot_pages += adjustments[node];
+            adjustments[node] = 0;
+        }
+    assert_numa_page_count(d);
+    tot_pages = d->tot_pages;
+    nrspin_unlock(&d->page_alloc_lock);
+
+    return tot_pages;
+}
+
 #ifdef CONFIG_SYSCTL
 void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages)
 {
@@ -2924,7 +2955,7 @@ int assign_pages(
             goto out;
         }
 
-        if ( unlikely(domain_adjust_tot_pages(d, nr) == nr) )
+        if ( unlikely(domain_adjust_tot_pages(d, page_to_nid(pg), nr) == nr) )
             get_knownalive_domain(d);
     }
 
@@ -3066,7 +3097,8 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
                 }
             }
 
-            drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
+            drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(pg),
+                                                    -(1 << order));
 
             rspin_unlock(&d->page_alloc_lock);
 
@@ -3274,7 +3306,7 @@ void free_domstatic_page(struct page_info *page)
 
     arch_free_heap_page(d, page);
 
-    drop_dom_ref = !domain_adjust_tot_pages(d, -1);
+    drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(page), -1);
 
     unprepare_staticmem_pages(page, 1, scrub_debug);
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index b3a35c4bc8d6..0737df02eda2 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -68,6 +68,7 @@
 #include <xen/types.h>
 #include <xen/list.h>
 #include <xen/spinlock.h>
+#include <xen/numa.h>
 #include <xen/perfc.h>
 #include <public/memory.h>
 
@@ -131,7 +132,16 @@ mfn_t xen_map_to_mfn(unsigned long va);
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns);
 /* Claim handling */
 unsigned long __must_check domain_adjust_tot_pages(struct domain *d,
-    long pages);
+                                                   nodeid_t node, long pages);
+/*
+ * Commit accumulated per-node page deltas to d->tot_pages and (under NUMA)
+ * d->node_tot_pages[] under page_alloc_lock.  Applied entries are cleared
+ * so the same accumulator can be reused for the next batch.  Returns the
+ * post-commit value of d->tot_pages; callers that may be releasing the last
+ * reference on the domain should drop it when the return value is zero.
+ */
+unsigned long domain_commit_page_deltas(struct domain *d,
+                                        long adjustments[MAX_NUMNODES]);
 int domain_set_claim_entries(struct domain *d, uint32_t nr_entries,
                              const struct xen_memory_claim *claim_set);
 int domain_get_claim_entries(struct domain *d, uint32_t *nr_entries,
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index f671e0c4c7b3..ae50f523e9f5 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -620,6 +620,11 @@ struct domain
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
 
+#ifdef CONFIG_NUMA
+    /* Distribution of tot_pages across NUMA nodes. */
+    unsigned int node_tot_pages[MAX_NUMNODES];
+#endif
+
     /* vNUMA topology accesses are protected by rwlock. */
     rwlock_t vnuma_rwlock;
     struct vnuma_info *vnuma;
@@ -698,6 +703,25 @@ static inline unsigned int domain_tot_pages(const struct domain *d)
     return d->tot_pages - d->extra_pages;
 }
 
+/*
+ * Debug-only consistency check: the per-node page counts must sum to
+ * d->tot_pages.  Compiled out unless both NUMA and debug builds are
+ * configured; caller must hold page_alloc_lock.
+ */
+static inline void assert_numa_page_count(const struct domain *d)
+{
+#if defined(CONFIG_NUMA) && defined(CONFIG_DEBUG)
+    unsigned int i, node_total = 0;
+
+    ASSERT(rspin_is_locked(&d->page_alloc_lock));
+
+    for_each_online_node ( i )
+        node_total += d->node_tot_pages[i];
+
+    ASSERT(node_total == d->tot_pages);
+#endif
+}
+
 /* Protect updates/reads (resp.) of domain_list and domain_hash. */
 extern spinlock_t domlist_update_lock;
 extern rcu_read_lock_t domlist_read_lock;
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
  2026-06-02  8:49 [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains Bernhard Kaindl
@ 2026-06-02 11:40 ` Roger Pau Monné
  2026-06-02 11:51 ` Jan Beulich
  2026-06-02 13:21 ` Andrew Cooper
  2 siblings, 0 replies; 7+ messages in thread
From: Roger Pau Monné @ 2026-06-02 11:40 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Teddy Astie,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	Tamas K Lengyel, Alejandro Vallejo, Marcus Granado

Hello,

This is missing a "From:" line that matches the first SoB in the
command message, if nit doesn't match the sender (which is the case
here).

On Tue, Jun 02, 2026 at 09:49:50AM +0100, Bernhard Kaindl wrote:
> Using the 'u' debug key invokes dump_numa(), which walks each domain's
> page list under page_alloc_lock to compute per-NUMA-node counts. On
> domains with many pages, this O(pages) operation can hold the lock long
> enough to trigger a watchdog timeout.
> 
> Replace the page-list walk with node_tot_pages[], a per-node counter
> maintained in struct domain. This reduces dump_numa()'s per-domain work
> from O(pages) to O(nodes).
> 
> Accounting via domain_adjust_tot_pages()
> 
> Most page-count updates flow through domain_adjust_tot_pages(). The
> helper takes the affected NUMA node and updates both tot_pages and
> node_tot_pages[node] in the same locked critical section.
> 
> A debug-only helper, assert_numa_page_count(), checks after each locked
> update that the node totals sum to tot_pages.
> 
> Accounting in memory_exchange()
> 
> memory_exchange() cannot use the same helper. It is preemptible, may
> fail and roll back part way through a chunk, and does not hold
> page_alloc_lock for the whole operation. The final success-path value of
> tot_pages is unchanged: by construction, the number of input pages stolen
> equals the number of output pages assigned. domain_commit_page_deltas()
> still applies the accumulated deltas to tot_pages, but their sum is zero.
> 
> Instead, it accumulates deferred per-node deltas in
> node_tot_pages_adjustments[] for the current chunk:
> 
>   When steal_page() is called with MEMF_no_refcount, tot_pages is
>   intentionally not decremented. A decrement of 1 is recorded for the
>   input page's node.
> 
>   When assign_page() is called with MEMF_no_refcount, tot_pages is
>   intentionally not incremented. An increment equal to the number of
>   pages in the output extent is recorded.
> 
> At the end of a successful chunk, these deltas are committed under
> page_alloc_lock. The net tot_pages delta is zero, while node_tot_pages[]
> is updated to reflect the NUMA-node movement.
> 
> Correctness on failure paths:
> 
>   Invariant 1: Pages allocated with MEMF_no_owner are not counted in
>   tot_pages until assign_page() succeeds. Freeing such pages with
>   free_domheap_pages() is therefore accounting-neutral.
> 
>   Invariant 2: Pages stolen with MEMF_no_refcount remain counted in
>   tot_pages until the deferred adjustment is committed. No window
>   exists in which a stolen page is absent from d->page_list and
>   already subtracted from tot_pages.
> 
> These invariants cover the failure cases:
> 
>   Input-side failure before output allocation: any input pages already
>   stolen are still on in_chunk_list. The fail path attempts to assign them
>   back to the domain. A successful reassign cancels the earlier negative
>   delta; if the reassign fails because the domain is dying, the negative
>   delta remains and is committed, reflecting that the page has been freed.
> 
>   OOM during output allocation (alloc_domheap_pages() returns NULL):
>   output pages already allocated before the failure are on out_chunk_list,
>   have no owner, and have never been counted in tot_pages. Freeing them
>   does not change accounting. Stolen input pages are handled as above:
>   reassignment cancels their negative deltas, while unreassigned pages are
>   deducted because the domain is dying.
> 
>   assign_page() failure because the domain is dying: output pages
>   assigned before the failure remain accounted to the domain and are
>   reclaimed later by domain_relinquish_resources(). The accumulated
>   deltas therefore contain positive entries for the successful output
>   assignments and negative entries for all stolen input pages.
>   Committing the net delta to both tot_pages and node_tot_pages[] leaves
>   the accounting consistent with the pages the domain still owns.
> 
>   Post-assignment failure in guest_physmap_add_page() or while copying the
>   new frame number back to the guest: at this point the output extent has
>   already been assigned to the domain and the input pages for the chunk
>   have already been freed. Input and output page counts are therefore
>   equal, so the committed update only redistributes per-node counts and
>   leaves tot_pages unchanged.

This commit message is way, way, too long and dense.  We need to
exercise some restrain and summarize the important bits.

> A functionally equivalent version of this change is included in the
> XenServer 9 pre-release and is widely exercised there. Debug Xen
> builds with the added consistency checks are also used in internal
> end-to-end regression testing (XenRT). For upstream submission, the
> change was additionally updated to support non-NUMA Xen builds and
> reviewed again for correctness, including the failure paths.

IMO the mentioning of XenServer 9 or XenRT should be a post-commit
message.

> Fixes: 4dff228603ba ("Walking the page lists needs the page_alloc lock")
> Signed-off-by: Alejandro Vallejo <alejandro.garciavallejo@amd.com>

It's not acceptable to change the original SoB, Alejandro SoB on this
patch is:

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>

And it must be kept this way.

> Signed-off-by: Marcus Granado <marcus.granado@citrix.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Bernhard Kaindl <bernhard.kaindl@citrix.com>
> ---
> Brief history of this code base: Work was started by Marcus Granado,
> adding per-node accounting to domain_adjust_tot_pages() using work
> by Alejandro Vallejo <alejandro.garciavallejo@amd.com> as basis.
> 
> In a 2nd phase, Bernhard Kaindl and Roger Pau Monné implemented
> accumulating of per-node deltas in memory_exchange(), including
> handling across failure paths and consistency checks of the page
> counters after each update with CONFIG_DEBUG enabled. Bernhard
> updated dump_numa() to replace walking d->page_list under lock
> and Roger fixed handling of hypercall preemption. Andrew Cooper
> participated in this phase as well and Joash Robinson tested using
> end-to-end tests exercising debug-key 'u' and checking its output.
> 
> In a 3rd phase, Bernhard Kaindl added explanatory comments, prepared
> this commit message for review, fixed a failure path that can occur
> when memory_exchange() fails mid-exchange due to OOM, fixed the code
> for non-NUMA builds and consolidated loops that apply accumulated
> deltas into a single helper.

Why do you speak of yourself in 3rd person on the post-commit remark?

> ---
>  xen/arch/x86/mm.c             |  3 +-
>  xen/arch/x86/mm/mem_sharing.c |  4 +-
>  xen/common/domain.c           |  9 ++++
>  xen/common/grant_table.c      |  4 +-
>  xen/common/memory.c           | 79 +++++++++++++++++++++++++----------
>  xen/common/numa.c             | 14 +------
>  xen/common/page_alloc.c       | 40 ++++++++++++++++--
>  xen/include/xen/mm.h          | 12 +++++-
>  xen/include/xen/sched.h       | 24 +++++++++++
>  9 files changed, 143 insertions(+), 46 deletions(-)
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index a158379e7734..a723a2c50a2f 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4445,7 +4445,8 @@ int steal_page(
>      page_list_del(page, &d->page_list);
>  
>      /* Unlink from original owner. */
> -    if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
> +    if ( !(memflags & MEMF_no_refcount) &&
> +         !domain_adjust_tot_pages(d, page_to_nid(page), -1) )
>          drop_dom_ref = true;
>  
>      nrspin_unlock(&d->page_alloc_lock);
> diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> index 5c7a0ff30e8b..89ae418be0ae 100644
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -723,7 +723,7 @@ static int page_make_sharable(struct domain *d,
>      if ( !validate_only )
>      {
>          page_set_owner(page, dom_cow);
> -        drop_dom_ref = !domain_adjust_tot_pages(d, -1);
> +        drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(page), -1);
>          page_list_del(page, &d->page_list);
>      }
>  
> @@ -769,7 +769,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
>      ASSERT(page_get_owner(page) == dom_cow);
>      page_set_owner(page, d);
>  
> -    if ( domain_adjust_tot_pages(d, 1) == 1 )
> +    if ( domain_adjust_tot_pages(d, page_to_nid(page), 1) == 1 )
>          get_knownalive_domain(d);
>      page_list_add_tail(page, &d->page_list);
>      nrspin_unlock(&d->page_alloc_lock);
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 8cb4241b0511..0b6afba2acdb 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1558,6 +1558,15 @@ void domain_destroy(struct domain *d)
>      /* Remove from the domlist/hash. */
>      domlist_remove(d);
>  
> +    /*
> +     * Final invariant check: all pages still owned by the dying domain must
> +     * be accounted for per-node before complete_domain_destroy() reclaims
> +     * them.  Debug-only; expands to a no-op in production builds.
> +     */

This should be a single line comment if anything:

/* Ensure per-node page counts match global one. */

> +    nrspin_lock(&d->page_alloc_lock);
> +    assert_numa_page_count(d);
> +    nrspin_unlock(&d->page_alloc_lock);
> +
>      /* Schedule RCU asynchronous completion of domain destroy. */
>      call_rcu(&d->rcu, complete_domain_destroy);
>  }
> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index ac9fed600101..298662c3d69e 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -2404,7 +2404,7 @@ gnttab_transfer(
>          }
>  
>          /* Okay, add the page to 'e'. */
> -        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )
> +        if ( unlikely(domain_adjust_tot_pages(e, page_to_nid(page), 1) == 1) )
>              get_knownalive_domain(e);
>  
>          /*
> @@ -2430,7 +2430,7 @@ gnttab_transfer(
>               * page in the page total
>               */
>              nrspin_lock(&e->page_alloc_lock);
> -            drop_dom_ref = !domain_adjust_tot_pages(e, -1);
> +            drop_dom_ref = !domain_adjust_tot_pages(e, page_to_nid(page), -1);
>              nrspin_unlock(&e->page_alloc_lock);
>  
>              if ( okay /* i.e. e->is_dying due to the surrounding if() */ )
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 9e4899f9fc63..00eeaf33aefc 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -673,6 +673,14 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>      struct domain *d;
>      struct page_info *page;
>  
> +    /*
> +     * Deferred accounting for the current exchange chunk.  Keep it at
> +     * function scope because the fail: path also needs to commit or cancel
> +     * partial work.  domain_commit_page_deltas() clears applied entries, so
> +     * each new chunk starts with a zeroed accumulator.

This is way to verbose: you don't need to mention it must be kept at
function scope, as it's obvious by it's usage.  Regarding the zeroing,
you might also zeor at the start of each chunk, and avoid doing it in
domain_commit_page_deltas() if you think that would be clearer (and
forego the need of a comment to clarify).

> +     */
> +    long node_tot_pages_adjustments[MAX_NUMNODES] = {};
> +
>      if ( copy_from_guest(&exch, arg, 1) )
>          return -EFAULT;
>  
> @@ -822,6 +830,12 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>                  }
>  
>                  page_list_add(page, &in_chunk_list);
> +                /*
> +                 * steal_page() with MEMF_no_refcount removes ownership but
> +                 * leaves the domain page counts unchanged; record the
> +                 * deferred -1 for this page's node.
> +                 */
> +                node_tot_pages_adjustments[page_to_nid(page)]--;
>  #ifdef CONFIG_X86
>                  put_gfn(d, gmfn + k);
>  #endif
> @@ -870,32 +884,31 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>              if ( assign_page(page, exch.out.extent_order, d,
>                               MEMF_no_refcount) )
>              {
> -                unsigned long dec_count;
> -                bool drop_dom_ref;
> -
>                  /*
> -                 * Pages in in_chunk_list is stolen without
> -                 * decreasing the tot_pages. If the domain is dying when
> -                 * assign pages, we need decrease the count. For those pages
> -                 * that has been assigned, it should be covered by
> -                 * domain_relinquish_resources().
> +                 * Input pages for this chunk were stolen and freed without
> +                 * decreasing tot_pages.  Output extents already assigned
> +                 * before this failure will be released later by
> +                 * domain_relinquish_resources().  Commit the relative deltas
> +                 * now so concurrent reservation changes made under
> +                 * page_alloc_lock are preserved.  The net delta is strictly
> +                 * negative on this branch, so a zero post-commit tot_pages
> +                 * means the last reference must be dropped.
>                   */
> -                dec_count = (((1UL << exch.in.extent_order) *
> -                              (1UL << in_chunk_order)) -
> -                             (j * (1UL << exch.out.extent_order)));
> -
> -                nrspin_lock(&d->page_alloc_lock);
> -                drop_dom_ref = (dec_count &&
> -                                !domain_adjust_tot_pages(d, -dec_count));
> -                nrspin_unlock(&d->page_alloc_lock);
> -
> -                if ( drop_dom_ref )
> +                if ( !domain_commit_page_deltas(d, node_tot_pages_adjustments) )
>                      put_domain(d);
>  
>                  free_domheap_pages(page, exch.out.extent_order);
>                  goto dying;
>              }
>  
> +            /*
> +             * assign_page() with MEMF_no_refcount gives ownership without
> +             * updating the domain page counts; record the deferred extent
> +             * size for this output node.
> +             */
> +            node_tot_pages_adjustments[page_to_nid(page)] +=
> +                1UL << exch.out.extent_order;
> +
>              if ( __copy_from_guest_offset(&gpfn, exch.out.extent_start,
>                                            (i << out_chunk_order) + j, 1) )
>              {
> @@ -915,8 +928,19 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>          }
>          BUG_ON( !(d->is_dying) && (j != (1UL << out_chunk_order)) );
>  
> +        /*
> +         * If publishing output GFNs failed after assignment, fail: still
> +         * needs to commit the zero-net per-node redistribution.
> +         */
>          if ( rc )
>              goto fail;
> +
> +        /*
> +         * Success: commit the per-node redistribution.  The helper also
> +         * applies the deltas to tot_pages; the final value is unchanged
> +         * because the exchange is size-neutral.
> +         */
> +        domain_commit_page_deltas(d, node_tot_pages_adjustments);
>      }
>  
>      exch.nr_exchanged = exch.in.nr_extents;
> @@ -925,22 +949,31 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>      rcu_unlock_domain(d);
>      return rc;
>  
> -    /*
> -     * Failed a chunk! Free any partial chunk work. Tell caller how many
> -     * chunks succeeded.
> -     */
> +    /* Failed a chunk.  Free partial work and report completed chunks. */
>   fail:
>      /*
>       * Reassign any input pages we managed to steal.  NB that if the assign
>       * fails again, we're on the hook for freeing the page, since we've already
> -     * cleared PGC_allocated.
> +     * cleared PGC_allocated.  A successful reassignment cancels the -1 that
> +     * was recorded when the page was stolen.
>       */
>      while ( (page = page_list_remove_head(&in_chunk_list)) )
> +    {
>          if ( assign_pages(page, 1, d, MEMF_no_refcount) )
>          {
>              BUG_ON(!d->is_dying);
>              free_domheap_page(page);
>          }
> +        else
> +            node_tot_pages_adjustments[page_to_nid(page)] += 1;
> +    }
> +
> +    /*
> +     * Commit remaining deltas.  If all stolen pages were reassigned, this is
> +     * a zero-net update.  Otherwise the domain is dying and unreassigned
> +     * pages are deducted.
> +     */
> +    domain_commit_page_deltas(d, node_tot_pages_adjustments);
>  
>   dying:
>      rcu_unlock_domain(d);
> diff --git a/xen/common/numa.c b/xen/common/numa.c
> index ad75955a1622..9f38145579e0 100644
> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -737,26 +737,14 @@ static void cf_check dump_numa(unsigned char key)
>      printk("Memory location of each domain:\n");
>      for_each_domain ( d )
>      {
> -        const struct page_info *page;
> -        unsigned int page_num_node[MAX_NUMNODES];
>          const struct vnuma_info *vnuma;
>  
>          process_pending_softirqs();
>  
>          printk("%pd (total: %u):\n", d, domain_tot_pages(d));
>  
> -        memset(page_num_node, 0, sizeof(page_num_node));
> -
> -        nrspin_lock(&d->page_alloc_lock);
> -        page_list_for_each ( page, &d->page_list )
> -        {
> -            i = page_to_nid(page);
> -            page_num_node[i]++;
> -        }
> -        nrspin_unlock(&d->page_alloc_lock);
> -
>          for_each_online_node ( i )
> -            printk("    Node %u: %u\n", i, page_num_node[i]);
> +            printk("    Node %u: %u\n", i, d->node_tot_pages[i]);

You are not holding any lock here (which I guess it's fine, because
this is just debug info), and hence you could call
process_pending_softirqs() while iteration over the nodes if you think
this might be an issue, ie:

nodeid_t nr = 0;
[...]
for_each_online_node ( i )
{
    printk("    Node %u: %u\n", i, d->node_tot_pages[i]);
    if ( !(++nr % 0x3f) )
        process_pending_softirqs();
}

>  
>          if ( !read_trylock(&d->vnuma_rwlock) )
>              continue;
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 1801afc96a0a..ab6f44a3f993 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -520,14 +520,45 @@ static unsigned long avail_heap_pages(
>      return free_pages;
>  }
>  
> -unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> +unsigned long domain_adjust_tot_pages(struct domain *d, nodeid_t node,
> +                                      long pages)
>  {
>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>      d->tot_pages += pages;
>  
> +#ifdef CONFIG_NUMA
> +    ASSERT(node != NUMA_NO_NODE);
> +    ASSERT(node_online(node));
> +    d->node_tot_pages[node] += pages;
> +    assert_numa_page_count(d);
> +#endif
> +
>      return d->tot_pages;
>  }
>  
> +unsigned long domain_commit_page_deltas(struct domain *d,
> +                                        long adjustments[MAX_NUMNODES])
> +{
> +    unsigned long tot_pages;
> +    nodeid_t node;
> +
> +    nrspin_lock(&d->page_alloc_lock);
> +    for_each_online_node ( node )
> +        if ( adjustments[node] )
> +        {
> +#ifdef CONFIG_NUMA
> +            d->node_tot_pages[node] += adjustments[node];
> +#endif
> +            d->tot_pages += adjustments[node];
> +            adjustments[node] = 0;
> +        }
> +    assert_numa_page_count(d);
> +    tot_pages = d->tot_pages;
> +    nrspin_unlock(&d->page_alloc_lock);
> +
> +    return tot_pages;
> +}
> +
>  #ifdef CONFIG_SYSCTL
>  void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages)
>  {
> @@ -2924,7 +2955,7 @@ int assign_pages(
>              goto out;
>          }
>  
> -        if ( unlikely(domain_adjust_tot_pages(d, nr) == nr) )
> +        if ( unlikely(domain_adjust_tot_pages(d, page_to_nid(pg), nr) == nr) )
>              get_knownalive_domain(d);
>      }
>  
> @@ -3066,7 +3097,8 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
>                  }
>              }
>  
> -            drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
> +            drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(pg),
> +                                                    -(1 << order));
>  
>              rspin_unlock(&d->page_alloc_lock);
>  
> @@ -3274,7 +3306,7 @@ void free_domstatic_page(struct page_info *page)
>  
>      arch_free_heap_page(d, page);
>  
> -    drop_dom_ref = !domain_adjust_tot_pages(d, -1);
> +    drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(page), -1);
>  
>      unprepare_staticmem_pages(page, 1, scrub_debug);
>  
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index b3a35c4bc8d6..0737df02eda2 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -68,6 +68,7 @@
>  #include <xen/types.h>
>  #include <xen/list.h>
>  #include <xen/spinlock.h>
> +#include <xen/numa.h>
>  #include <xen/perfc.h>
>  #include <public/memory.h>
>  
> @@ -131,7 +132,16 @@ mfn_t xen_map_to_mfn(unsigned long va);
>  int populate_pt_range(unsigned long virt, unsigned long nr_mfns);
>  /* Claim handling */
>  unsigned long __must_check domain_adjust_tot_pages(struct domain *d,
> -    long pages);
> +                                                   nodeid_t node, long pages);
> +/*
> + * Commit accumulated per-node page deltas to d->tot_pages and (under NUMA)
> + * d->node_tot_pages[] under page_alloc_lock.  Applied entries are cleared
> + * so the same accumulator can be reused for the next batch.  Returns the
> + * post-commit value of d->tot_pages; callers that may be releasing the last
> + * reference on the domain should drop it when the return value is zero.
> + */
> +unsigned long domain_commit_page_deltas(struct domain *d,
> +                                        long adjustments[MAX_NUMNODES]);

Given the (current) usage of domain_commit_page_deltas() it might be
better to simply return bool to signal whether the domain reference
should be released?  Callers don't care about the exact amount of
pages.

>  int domain_set_claim_entries(struct domain *d, uint32_t nr_entries,
>                               const struct xen_memory_claim *claim_set);
>  int domain_get_claim_entries(struct domain *d, uint32_t *nr_entries,
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index f671e0c4c7b3..ae50f523e9f5 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -620,6 +620,11 @@ struct domain
>      unsigned int last_alloc_node;
>      spinlock_t node_affinity_lock;
>  
> +#ifdef CONFIG_NUMA
> +    /* Distribution of tot_pages across NUMA nodes. */

It would be helpful to note this array is protected by the
page_alloc_lock lock.

> +    unsigned int node_tot_pages[MAX_NUMNODES];
> +#endif
> +
>      /* vNUMA topology accesses are protected by rwlock. */
>      rwlock_t vnuma_rwlock;
>      struct vnuma_info *vnuma;
> @@ -698,6 +703,25 @@ static inline unsigned int domain_tot_pages(const struct domain *d)
>      return d->tot_pages - d->extra_pages;
>  }
>  
> +/*
> + * Debug-only consistency check: the per-node page counts must sum to
> + * d->tot_pages.  Compiled out unless both NUMA and debug builds are
> + * configured; caller must hold page_alloc_lock.

This is way too verbose, and not very helpful.  It's obvious from the
guards below that the code is compiled out unless NUMA and DEBUG are
enabled.  I would just write:

/* Consistency check: ensure tot_pages == sum(node_tot_pages[]) */

> + */
> +static inline void assert_numa_page_count(const struct domain *d)
> +{
> +#if defined(CONFIG_NUMA) && defined(CONFIG_DEBUG)
> +    unsigned int i, node_total = 0;

i might be nodeid_t if we want to be type-accurate.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
  2026-06-02  8:49 [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains Bernhard Kaindl
  2026-06-02 11:40 ` Roger Pau Monné
@ 2026-06-02 11:51 ` Jan Beulich
  2026-06-02 13:09   ` Roger Pau Monné
  2026-06-02 13:22   ` Andrew Cooper
  2026-06-02 13:21 ` Andrew Cooper
  2 siblings, 2 replies; 7+ messages in thread
From: Jan Beulich @ 2026-06-02 11:51 UTC (permalink / raw)
  To: Bernhard Kaindl
  Cc: Andrew Cooper, Roger Pau Monné, Teddy Astie, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Tamas K Lengyel,
	Alejandro Vallejo, Marcus Granado, xen-devel

On 02.06.2026 10:49, Bernhard Kaindl wrote:
> Using the 'u' debug key invokes dump_numa(), which walks each domain's
> page list under page_alloc_lock to compute per-NUMA-node counts. On
> domains with many pages, this O(pages) operation can hold the lock long
> enough to trigger a watchdog timeout.

In addition to what Roger said: Is it really the lock holding that's a
problem here? That is, there would be no problem if there was no lock
involved in this O(pages) operation?

> Replace the page-list walk with node_tot_pages[], a per-node counter
> maintained in struct domain. This reduces dump_numa()'s per-domain work
> from O(pages) to O(nodes).

Alternative approch for consideration: Purge dump_numa()? This big a
change for making a keyhandler work better is somewhat questionable an
approach, imo. The keyhandler isn't there for use in production anyway,
it's (primarily) a debugging aid. If the data is still needed (and may
e.g. be useful on production systems), make a (preemptible) domctl or
sysctl or alike instead?

Jan


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
  2026-06-02 11:51 ` Jan Beulich
@ 2026-06-02 13:09   ` Roger Pau Monné
  2026-06-02 13:22   ` Andrew Cooper
  1 sibling, 0 replies; 7+ messages in thread
From: Roger Pau Monné @ 2026-06-02 13:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Bernhard Kaindl, Andrew Cooper, Teddy Astie, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Tamas K Lengyel,
	Alejandro Vallejo, Marcus Granado, xen-devel

On Tue, Jun 02, 2026 at 01:51:13PM +0200, Jan Beulich wrote:
> On 02.06.2026 10:49, Bernhard Kaindl wrote:
> > Using the 'u' debug key invokes dump_numa(), which walks each domain's
> > page list under page_alloc_lock to compute per-NUMA-node counts. On
> > domains with many pages, this O(pages) operation can hold the lock long
> > enough to trigger a watchdog timeout.
> 
> In addition to what Roger said: Is it really the lock holding that's a
> problem here? That is, there would be no problem if there was no lock
> involved in this O(pages) operation?
> 
> > Replace the page-list walk with node_tot_pages[], a per-node counter
> > maintained in struct domain. This reduces dump_numa()'s per-domain work
> > from O(pages) to O(nodes).
> 
> Alternative approch for consideration: Purge dump_numa()? This big a
> change for making a keyhandler work better is somewhat questionable an
> approach, imo. The keyhandler isn't there for use in production anyway,
> it's (primarily) a debugging aid. If the data is still needed (and may
> e.g. be useful on production systems), make a (preemptible) domctl or
> sysctl or alike instead?

At some point a similar change will be needed for per-node memory
claims.  While we might refuse just for dump_numa(), the intrusive
changes in memory_exchange() are a requirement for per-node claims
IIRC.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
  2026-06-02  8:49 [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains Bernhard Kaindl
  2026-06-02 11:40 ` Roger Pau Monné
  2026-06-02 11:51 ` Jan Beulich
@ 2026-06-02 13:21 ` Andrew Cooper
  2 siblings, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2026-06-02 13:21 UTC (permalink / raw)
  To: Bernhard Kaindl, xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Teddy Astie,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini,
	Tamas K Lengyel, Alejandro Vallejo, Marcus Granado

On 02/06/2026 9:49 am, Bernhard Kaindl wrote:
> diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> index 5c7a0ff30e8b..89ae418be0ae 100644
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -769,7 +769,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
>      ASSERT(page_get_owner(page) == dom_cow);
>      page_set_owner(page, d);
>  
> -    if ( domain_adjust_tot_pages(d, 1) == 1 )
> +    if ( domain_adjust_tot_pages(d, page_to_nid(page), 1) == 1 )
>          get_knownalive_domain(d);
>      page_list_add_tail(page, &d->page_list);
>      nrspin_unlock(&d->page_alloc_lock);
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 8cb4241b0511..0b6afba2acdb 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1558,6 +1558,15 @@ void domain_destroy(struct domain *d)
>      /* Remove from the domlist/hash. */
>      domlist_remove(d);
>  
> +    /*
> +     * Final invariant check: all pages still owned by the dying domain must
> +     * be accounted for per-node before complete_domain_destroy() reclaims
> +     * them.  Debug-only; expands to a no-op in production builds.
> +     */
> +    nrspin_lock(&d->page_alloc_lock);
> +    assert_numa_page_count(d);
> +    nrspin_unlock(&d->page_alloc_lock);
> +
>      /* Schedule RCU asynchronous completion of domain destroy. */
>      call_rcu(&d->rcu, complete_domain_destroy);
>  }

Why have you re-merged these two patches?   I explicitly split them in
the XenServer patchqueue, and they should remain separate.

~Andrew


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
  2026-06-02 11:51 ` Jan Beulich
  2026-06-02 13:09   ` Roger Pau Monné
@ 2026-06-02 13:22   ` Andrew Cooper
  2026-06-02 14:09     ` Bernhard Kaindl
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2026-06-02 13:22 UTC (permalink / raw)
  To: Jan Beulich, Bernhard Kaindl
  Cc: Andrew Cooper, Roger Pau Monné, Teddy Astie, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Tamas K Lengyel,
	Alejandro Vallejo, Marcus Granado, xen-devel

On 02/06/2026 12:51 pm, Jan Beulich wrote:
> On 02.06.2026 10:49, Bernhard Kaindl wrote:
>> Using the 'u' debug key invokes dump_numa(), which walks each domain's
>> page list under page_alloc_lock to compute per-NUMA-node counts. On
>> domains with many pages, this O(pages) operation can hold the lock long
>> enough to trigger a watchdog timeout.
> In addition to what Roger said: Is it really the lock holding that's a
> problem here? That is, there would be no problem if there was no lock
> involved in this O(pages) operation?
>
>> Replace the page-list walk with node_tot_pages[], a per-node counter
>> maintained in struct domain. This reduces dump_numa()'s per-domain work
>> from O(pages) to O(nodes).
> Alternative approch for consideration: Purge dump_numa()? This big a
> change for making a keyhandler work better is somewhat questionable an
> approach, imo. The keyhandler isn't there for use in production anyway,
> it's (primarily) a debugging aid. If the data is still needed (and may
> e.g. be useful on production systems), make a (preemptible) domctl or
> sysctl or alike instead?

Introducing d->node_tot_pages[] is a prerequisite for per-node claims. 
Tidying up dump_numa() is just a useful side effect.

From an upstream perspective, it does not want separating from the
per-node claims work.

~Andrew


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains
  2026-06-02 13:22   ` Andrew Cooper
@ 2026-06-02 14:09     ` Bernhard Kaindl
  0 siblings, 0 replies; 7+ messages in thread
From: Bernhard Kaindl @ 2026-06-02 14:09 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: Roger Pau Monne, Teddy Astie, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, Tamas K Lengyel,
	Alejandro Vallejo, Marcus Granado, xen-devel@lists.xenproject.org


> >> Replace the page-list walk with node_tot_pages[], a per-node counter
> >> maintained in struct domain. This reduces dump_numa()'s per-domain work
> >> from O(pages) to O(nodes).
>
> > Alternative approch for consideration: Purge dump_numa()? This big a
> > change for making a keyhandler work better is somewhat questionable an
> > approach, imo. The keyhandler isn't there for use in production anyway,
> > it's (primarily) a debugging aid. If the data is still needed (and may
> > e.g. be useful on production systems), make a (preemptible) domctl or
> > sysctl or alike instead?
> 
> Introducing d->node_tot_pages[] is a prerequisite for per-node claims.

That isn't actually the case.

Only of we want to subtracting d->node_tot_pages[node] from the claims.
But that isn't useful in my opinion. 

With multi-node claims (as requested by Jan and Roger), it becomed really
awkward to subtract d->node_tot_pages[node] from the requested claims for
getting the claims to install.

1) That's unnecessary complexity:

   The purpose of claims is to allow domain builders to claim
   memory before populating the physmap of a domain.

   a. A domain builder knows how much memory he needs to populate.

      It is natural to just claim this amount of memory for the physmap.

      It is pointless for a domain builder to:
      populate the physmap without claims and then stake a claims for it.

      It is very awkward to have to call a not yet upstream hypercall to
      get d->node_tot_pages[] for each node and then run this loop:

      for_each_pysmap_node( node )
          claim_request[node] = physmapsize[node] + node_tot_pages[node]

	only for Xen to subtract it again in the claims domctl:

      for_each_requested_claim_node( node )
          d->claim[node] = domctl.requested_claim[node] node_tot_pages[node]

So no, I'm not going to do this insanity.

> Tidying up dump_numa() is just a useful side effect.

Ack, the performance teams of customers were forced to rely on dump_numa()
to get diagnostic information about the NUMA memory distribution of domains,
it would break their tooling (as unsupported as that could be), so it would
be undesirable from a Xen user perspective to just yank it without a sufficient
deprecation period.

Bernhard

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-06-02 14:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-02  8:49 [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains Bernhard Kaindl
2026-06-02 11:40 ` Roger Pau Monné
2026-06-02 11:51 ` Jan Beulich
2026-06-02 13:09   ` Roger Pau Monné
2026-06-02 13:22   ` Andrew Cooper
2026-06-02 14:09     ` Bernhard Kaindl
2026-06-02 13:21 ` Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.