[PATCH 0/2] xen/mm: limit in-place scrubbing

* [PATCH 0/2] xen/mm: limit in-place scrubbing
@ 2026-01-08 17:55 Roger Pau Monne
  2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Roger Pau Monne @ 2026-01-08 17:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
	Anthony PERARD, Jan Beulich

Hello,

In XenServer we have seen the watchdog occasionally triggering during
domain creation if 1GB pages are scrubbed in-place during physmap
population.  The following series attempts to mitigate this by limiting
the in-place scrubbing during allocation to 2M pages, but it has some
drawbacks, see the post-commit remarks in patch 2.

I'm hoping someone might have a better idea, or that we converge on not
being able to do better than this for the time being.

Thanks, Roger.

Roger Pau Monne (2):
  xen/mm: add a NUMA node parameter to scrub_free_pages()
  xen/mm: limit non-scrubbed allocations to a specific order

 xen/arch/arm/domain.c   |  2 +-
 xen/arch/x86/domain.c   |  2 +-
 xen/common/memory.c     | 12 +++++++++
 xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
 xen/include/xen/mm.h    | 12 ++++++++-
 5 files changed, 74 insertions(+), 8 deletions(-)

-- 
2.51.0

* [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
  2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
@ 2026-01-08 17:55 ` Roger Pau Monne
  2026-01-09 10:22   ` Jan Beulich
  2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
  2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
  2 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monne @ 2026-01-08 17:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
	Anthony PERARD, Jan Beulich

Such a parameter allows requesting to scrub memory only from the specified
node.  If there's no memory to scrub from the requested node the function
returns false.  If the node is already being scrubbed from a different CPU
the function returns true so the caller can differentiate whether there's
still pending work to do.

No functional change intended.  Existing callers are switched to use the
new interface, albeit they all pass NUMA_NO_NODE to keep the current
behavior.
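
The return-value contract described above can be modelled outside of Xen.
What follows is a minimal sketch, not the hypervisor code:
scrub_free_pages_model(), claim_any_node(), the fixed-size arrays and
SCRUB_BATCH are illustrative assumptions.  The real function also handles
preemption and walks the per-zone free lists, and its final return value
may consider other nodes as well.

```c
#include <stdbool.h>

#define NUMA_NO_NODE 0xffU  /* sentinel for "no specific node" in this model */
#define MAX_NUMNODES 4U     /* arbitrary size for the model */
#define SCRUB_BATCH  16UL   /* hypothetical: pages scrubbed per call */

static unsigned long node_need_scrub[MAX_NUMNODES]; /* dirty pages per node */
static bool node_scrubbing[MAX_NUMNODES];           /* node claimed by a CPU */

/* Model of node_to_scrub(true): claim the first dirty, unclaimed node. */
static unsigned int claim_any_node(void)
{
    for ( unsigned int i = 0; i < MAX_NUMNODES; i++ )
        if ( node_need_scrub[i] && !node_scrubbing[i] )
        {
            node_scrubbing[i] = true;
            return i;
        }

    return NUMA_NO_NODE;
}

/* Returns true if scrubbing work may remain, false if there is none. */
static bool scrub_free_pages_model(unsigned int node)
{
    unsigned long done;

    if ( node != NUMA_NO_NODE )
    {
        if ( !node_need_scrub[node] )
            return false;               /* Nothing to scrub. */

        if ( node_scrubbing[node] )
            return true;                /* Another CPU is scrubbing it. */

        node_scrubbing[node] = true;
    }
    else
        node = claim_any_node();

    if ( node == NUMA_NO_NODE )
        return false;

    /* "Scrub" at most one batch, then release the node. */
    done = node_need_scrub[node] < SCRUB_BATCH ? node_need_scrub[node]
                                               : SCRUB_BATCH;
    node_need_scrub[node] -= done;
    node_scrubbing[node] = false;

    return node_need_scrub[node] != 0;
}
```

In this model a caller that passes a specific node gets true both when it
scrubbed only part of the node and when another CPU holds the node, so true
always reads as "there may still be work pending".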

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/arm/domain.c   |  2 +-
 xen/arch/x86/domain.c   |  2 +-
 xen/common/page_alloc.c | 17 ++++++++++++++---
 xen/include/xen/mm.h    |  3 ++-
 4 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 47973f99d935..dff7554417ea 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -75,7 +75,7 @@ static void noreturn idle_loop(void)
          * and then, after it is done, whether softirqs became pending
          * while we were scrubbing.
          */
-        else if ( !softirq_pending(cpu) && !scrub_free_pages() &&
+        else if ( !softirq_pending(cpu) && !scrub_free_pages(NUMA_NO_NODE) &&
                   !softirq_pending(cpu) )
             do_idle();
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 7632d5e2d62d..276c485a204f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -166,7 +166,7 @@ static void noreturn cf_check idle_loop(void)
          * and then, after it is done, whether softirqs became pending
          * while we were scrubbing.
          */
-        else if ( !softirq_pending(cpu) && !scrub_free_pages() &&
+        else if ( !softirq_pending(cpu) && !scrub_free_pages(NUMA_NO_NODE) &&
                   !softirq_pending(cpu) )
         {
             if ( guest )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 2efc11ce095f..248c44df32b3 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
     }
 }
 
-bool scrub_free_pages(void)
+bool scrub_free_pages(nodeid_t node)
 {
     struct page_info *pg;
     unsigned int zone;
     unsigned int cpu = smp_processor_id();
     bool preempt = false;
-    nodeid_t node;
     unsigned int cnt = 0;
 
-    node = node_to_scrub(true);
+    if ( node != NUMA_NO_NODE )
+    {
+        if ( !node_need_scrub[node] )
+            /* Nothing to scrub. */
+            return false;
+
+        if ( node_test_and_set(node, node_scrubbing) )
+            /* Another CPU is scrubbing it. */
+            return true;
+    }
+    else
+        node = node_to_scrub(true);
+
     if ( node == NUMA_NO_NODE )
         return false;
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 426362adb2f4..7067c9ec0405 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -65,6 +65,7 @@
 #include <xen/compiler.h>
 #include <xen/mm-frame.h>
 #include <xen/mm-types.h>
+#include <xen/numa.h>
 #include <xen/types.h>
 #include <xen/list.h>
 #include <xen/spinlock.h>
@@ -90,7 +91,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe);
 void xenheap_max_mfn(unsigned long mfn);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
-bool scrub_free_pages(void);
+bool scrub_free_pages(nodeid_t node);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
-- 
2.51.0



* [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
  2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
@ 2026-01-08 17:55 ` Roger Pau Monne
  2026-01-09 11:19   ` Jan Beulich
  2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
  2 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monne @ 2026-01-08 17:55 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

The current model of falling back to allocating unscrubbed pages and scrubbing
them in place at allocation time risks triggering the watchdog:

Watchdog timer detects that CPU55 is stuck!
----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
CPU:    55
RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
[...]
Xen call trace:
   [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
   [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
   [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
   [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
   [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
   [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970

The maximum allocation order on x86 is limited to 18, which means allocating
and possibly scrubbing 1G worth of memory in 4K chunks.
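
The arithmetic behind those figures can be checked with a small standalone
sketch.  The helper names order_to_pages() and order_to_bytes() are made up
for illustration, and PAGE_SHIFT is assumed to be 12 (4K pages):

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4K pages, as on x86 */

/* Number of 4K pages in an allocation of the given order. */
static inline uint64_t order_to_pages(unsigned int order)
{
    return UINT64_C(1) << order;
}

/* Size in bytes of an allocation of the given order. */
static inline uint64_t order_to_bytes(unsigned int order)
{
    return order_to_pages(order) << PAGE_SHIFT;
}
```

An order 18 allocation is 2^18 = 262144 pages, i.e. 1G, while order 9 is 512
pages, i.e. 2M.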

Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
currently set to 2M chunks.  However, such a limitation might cause
fragmentation in HVM p2m population during domain creation.  To prevent
that, introduce some extra logic in populate_physmap() that falls back to
preemptive page-scrubbing if the requested allocation cannot be fulfilled
and there's scrubbing work to do.  This approach is less fair than the
current one, but allows preemptive page scrubbing in the context of
populate_physmap() to attempt to avoid unnecessary page-shattering.
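
The fallback decision taken after a failed allocation can be sketched as a
standalone model.  This is only an illustration under stated assumptions:
scrub_pending() stands in for memory_scrub_pending(), scrub_then_retry() is a
hypothetical helper that does not exist in Xen, and the array size is
arbitrary.

```c
#include <stdbool.h>

#define NUMA_NO_NODE 0xffU /* sentinel for "any node" in this model */
#define MAX_NUMNODES 4U    /* arbitrary size for the model */

static unsigned long node_need_scrub[MAX_NUMNODES]; /* dirty pages per node */

/* Model of memory_scrub_pending(): work pending on a node, or on any node? */
static bool scrub_pending(unsigned int node)
{
    if ( node != NUMA_NO_NODE )
        return node_need_scrub[node] != 0;

    for ( unsigned int i = 0; i < MAX_NUMNODES; i++ )
        if ( node_need_scrub[i] )
            return true;

    return false;
}

/*
 * Decide what to do after a failed allocation.  Returns true if the caller
 * should scrub *node and preempt (retry via a continuation), false if the
 * allocation failure should be reported.
 */
static bool scrub_then_retry(unsigned int *node, bool exact_node)
{
    if ( scrub_pending(*node) )
        return true;

    /* Widen to all nodes only if the request was not pinned to one node. */
    if ( *node != NUMA_NO_NODE && !exact_node && scrub_pending(NUMA_NO_NODE) )
    {
        *node = NUMA_NO_NODE;
        return true;
    }

    return false;
}
```

A request pinned with MEMF_exact_node never widens the scrub to other nodes
in this model, matching the condition used in the populate_physmap() hunk.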

Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I'm not particularly happy with this approach, as it doesn't guarantee
progress for the callers.  IOW: a caller might do a lot of scrubbing, just
to get its pages stolen by a different concurrent thread doing
allocations.  However I'm not sure there's a better solution than resorting
to 2M allocations if there's not enough free memory that is scrubbed.

I'm having trouble seeing where we could temporarily store page(s) allocated
that need to be scrubbed before being assigned to the domain, in a way that
can be used by continuations, and that would allow Xen to keep track of
them in case the operation is never finished.  IOW: we would need to
account for cleanup of such temporary stash of pages in case the domain
never completes the hypercall, or is destroyed midway.

Otherwise we could add the option to switch back to scrubbing before
returning the pages to the free pool, but that's also problematic: the
current approach aims to scrub pages in the same NUMA node as the CPU that's
doing the scrubbing.  If we scrub in the context of the domain destruction
hypercall there's no attempt to scrub pages in the local NUMA node.
---
 xen/common/memory.c     | 12 ++++++++++++
 xen/common/page_alloc.c | 37 +++++++++++++++++++++++++++++++++++--
 xen/include/xen/mm.h    |  9 +++++++++
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 10becf7c1f4c..28b254e9d280 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
 
                 if ( unlikely(!page) )
                 {
+                    nodeid_t node = MEMF_get_node(a->memflags);
+
+                    if ( memory_scrub_pending(node) ||
+                         (node != NUMA_NO_NODE &&
+                          !(a->memflags & MEMF_exact_node) &&
+                          memory_scrub_pending(node = NUMA_NO_NODE)) )
+                    {
+                        scrub_free_pages(node);
+                        a->preempted = 1;
+                        goto out;
+                    }
+
                     gdprintk(XENLOG_INFO,
                              "Could not allocate order=%u extent: id=%d memflags=%#x (%u of %u)\n",
                              a->extent_order, d->domain_id, a->memflags,
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 248c44df32b3..d4dabc997c44 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -483,6 +483,20 @@ static heap_by_zone_and_order_t *_heap[MAX_NUMNODES];
 
 static unsigned long node_need_scrub[MAX_NUMNODES];
 
+bool memory_scrub_pending(nodeid_t node)
+{
+    nodeid_t i;
+
+    if ( node != NUMA_NO_NODE )
+        return node_need_scrub[node];
+
+    for_each_online_node ( i )
+        if ( node_need_scrub[i] )
+            return true;
+
+    return false;
+}
+
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
 
@@ -1007,8 +1021,18 @@ static struct page_info *alloc_heap_pages(
     }
 
     pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
-    /* Try getting a dirty buddy if we couldn't get a clean one. */
-    if ( !pg && !(memflags & MEMF_no_scrub) )
+    /*
+     * Try getting a dirty buddy if we couldn't get a clean one.  Limit the
+     * fallback to orders equal or below MAX_DIRTY_ORDER, as otherwise the
+     * non-preemptive scrubbing could trigger the watchdog.
+     */
+    if ( !pg && !(memflags & MEMF_no_scrub) &&
+         /*
+          * Allow any order unscrubbed allocations during boot time, we
+          * compensate by processing softirqs in the scrubbing loop below once
+          * irqs are enabled.
+          */
+         (order <= MAX_DIRTY_ORDER || system_state < SYS_STATE_active) )
         pg = get_free_buddy(zone_lo, zone_hi, order,
                             memflags | MEMF_no_scrub, d);
     if ( !pg )
@@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
             if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
             {
                 if ( !(memflags & MEMF_no_scrub) )
+                {
                     scrub_one_page(&pg[i], cold);
+                    /*
+                     * Use SYS_STATE_smp_boot explicitly; ahead of that state
+                     * interrupts are disabled.
+                     */
+                    if ( system_state == SYS_STATE_smp_boot &&
+                         !(dirty_cnt & 0xff) )
+                        process_pending_softirqs();
+                }
 
                 dirty_cnt++;
             }
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 7067c9ec0405..a37476a99f1b 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -92,6 +92,7 @@ void xenheap_max_mfn(unsigned long mfn);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
 bool scrub_free_pages(nodeid_t node);
+bool memory_scrub_pending(nodeid_t node);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
@@ -223,6 +224,14 @@ struct npfec {
 #else
 #define MAX_ORDER 20 /* 2^20 contiguous pages */
 #endif
+
+/* Max order when scrubbing pages at allocation time.  */
+#ifdef CONFIG_DOMU_MAX_ORDER
+# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
+#else
+# define MAX_DIRTY_ORDER 9
+#endif
+
 mfn_t acquire_reserved_page(struct domain *d, unsigned int memflags);
 
 /* Private domain structs for DOMID_XEN, DOMID_IO, etc. */
-- 
2.51.0



* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
  2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
  2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
  2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
@ 2026-01-09 10:15 ` Jan Beulich
  2026-01-09 10:29   ` Andrew Cooper
  2 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 10:15 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 08.01.2026 18:55, Roger Pau Monne wrote:
> In XenServer we have seen the watchdog occasionally triggering during
> domain creation if 1GB pages are scrubbed in-place during physmap
> population.

That's pretty extreme - writing to 1Gb of memory can't really take over 5s,
can it? Is there lock contention involved? Or is this when very many CPUs
try to do the same in parallel?

Jan

>  The following series attempts to mitigate this by limiting
> the in-place scrubbing during allocation to 2M pages, but it has some
> drawbacks, see the post-commit remarks in patch 2.
> 
> I'm hoping someone might have a better idea, or that we converge on not
> being able to do better than this for the time being.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (2):
>   xen/mm: add a NUMA node parameter to scrub_free_pages()
>   xen/mm: limit non-scrubbed allocations to a specific order
> 
>  xen/arch/arm/domain.c   |  2 +-
>  xen/arch/x86/domain.c   |  2 +-
>  xen/common/memory.c     | 12 +++++++++
>  xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
>  xen/include/xen/mm.h    | 12 ++++++++-
>  5 files changed, 74 insertions(+), 8 deletions(-)
> 



* Re: [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
  2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
@ 2026-01-09 10:22   ` Jan Beulich
  2026-01-09 14:46     ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 10:22 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 08.01.2026 18:55, Roger Pau Monne wrote:
> Such a parameter allows requesting to scrub memory only from the specified
> node.  If there's no memory to scrub from the requested node the function
> returns false.  If the node is already being scrubbed from a different CPU
> the function returns true so the caller can differentiate whether there's
> still pending work to do.

I'm really trying to understand both patches together, and peeking ahead I
don't understand the above, which looks to describe ...

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
>      }
>  }
>  
> -bool scrub_free_pages(void)
> +bool scrub_free_pages(nodeid_t node)
>  {
>      struct page_info *pg;
>      unsigned int zone;
>      unsigned int cpu = smp_processor_id();
>      bool preempt = false;
> -    nodeid_t node;
>      unsigned int cnt = 0;
>  
> -    node = node_to_scrub(true);
> +    if ( node != NUMA_NO_NODE )
> +    {
> +        if ( !node_need_scrub[node] )
> +            /* Nothing to scrub. */
> +            return false;
> +
> +        if ( node_test_and_set(node, node_scrubbing) )
> +            /* Another CPU is scrubbing it. */
> +            return true;

... these two return-s. My problem being that patch 2 doesn't use the
return value (while existing callers don't take this path). Is this then
"just in case" for now (and making the meaning of the return values
somewhat inconsistent for the function as a whole)?

Jan


* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
  2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
@ 2026-01-09 10:29   ` Andrew Cooper
  2026-01-09 11:32     ` Jan Beulich
  2026-01-09 12:31     ` Roger Pau Monné
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Cooper @ 2026-01-09 10:29 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne
  Cc: Andrew Cooper, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Anthony PERARD, xen-devel

On 09/01/2026 10:15 am, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
>> In XenServer we have seen the watchdog occasionally triggering during
>> domain creation if 1GB pages are scrubbed in-place during physmap
>> population.
> That's pretty extreme - writing to 1Gb of memory can't really take over 5s,
> can it?

Sure it can.

> Is there lock contention involved?

Almost certainly, and it's probably the more relevant aspect in this case.

> Or is this when very many CPUs
> try to do the same in parallel?

The scenario is reboot of a VM when Xapi is doing NUMA placement using
per-node claims.

In this case, even with sufficient scrubbed RAM on other nodes, you need
to take from the node you claimed on, which might need scrubbing.

The underlying problem is the need to do a long running operation in a
context where you cannot continue, and cannot (reasonably) fail.

~Andrew


* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
@ 2026-01-09 11:19   ` Jan Beulich
  2026-01-13 14:01     ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 11:19 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 08.01.2026 18:55, Roger Pau Monne wrote:
> The current model of falling back to allocating unscrubbed pages and scrubbing
> them in place at allocation time risks triggering the watchdog:
> 
> Watchdog timer detects that CPU55 is stuck!
> ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
> CPU:    55
> RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
> [...]
> Xen call trace:
>    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> 
> The maximum allocation order on x86 is limited to 18, which means allocating
> and possibly scrubbing 1G worth of memory in 4K chunks.
> 
> Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
> currently set to 2M chunks.  However, such a limitation might cause
> fragmentation in HVM p2m population during domain creation.  To prevent
> that, introduce some extra logic in populate_physmap() that falls back to
> preemptive page-scrubbing if the requested allocation cannot be fulfilled
> and there's scrubbing work to do.  This approach is less fair than the
> current one, but allows preemptive page scrubbing in the context of
> populate_physmap() to attempt to avoid unnecessary page-shattering.
> 
> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> I'm not particularly happy with this approach, as it doesn't guarantee
> progress for the callers.  IOW: a caller might do a lot of scrubbing, just
> to get its pages stolen by a different concurrent thread doing
> allocations.  However I'm not sure there's a better solution than resorting
> to 2M allocations if there's not enough free memory that is scrubbed.
> 
> I'm having trouble seeing where we could temporarily store page(s) allocated
> that need to be scrubbed before being assigned to the domain, in a way that
> can be used by continuations, and that would allow Xen to keep track of
> them in case the operation is never finished.  IOW: we would need to
> account for cleanup of such temporary stash of pages in case the domain
> never completes the hypercall, or is destroyed midway.

How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
indicate that state, with the actual page (and order plus scrub progress)
recorded in the target struct domain? Actually, maybe such an indicator
isn't needed at all: If the next invocation (continuation or not) finds
an in-progress allocation, it could simply use that rather than doing a
real allocation. (What to do if this isn't a continuation is less clear:
We could fail such requests [likely not an option unless we can reliably
tell original requests from continuations], or split the allocation if
the request is smaller, or free the allocation to then take the normal
path.) All of which of course only for "foreign" requests.

If the hypercall is never continued, we could refuse to unpause the
domain (with the allocation then freed normally when the domain gets
destroyed).

As another alternative, how about returning unscrubbed pages altogether
when it's during domain creation, requiring the tool stack to do the
scrubbing (potentially allowing it to skip some of it when pages are
fully initialized anyway, much like we do for Dom0 iirc)?

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
>  
>                  if ( unlikely(!page) )
>                  {
> +                    nodeid_t node = MEMF_get_node(a->memflags);
> +
> +                    if ( memory_scrub_pending(node) ||
> +                         (node != NUMA_NO_NODE &&
> +                          !(a->memflags & MEMF_exact_node) &&
> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> +                    {
> +                        scrub_free_pages(node);
> +                        a->preempted = 1;
> +                        goto out;
> +                    }

At least for order 0 requests there's no point in trying this. With the
current logic, actually for orders up to MAX_DIRTY_ORDER.

Further, from a general interface perspective, wouldn't we need to do the
same for at least XENMEM_increase_reservation?

> @@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
>              if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
>              {
>                  if ( !(memflags & MEMF_no_scrub) )
> +                {
>                      scrub_one_page(&pg[i], cold);
> +                    /*
> +                     * Use SYS_STATE_smp_boot explicitly; ahead of that state
> +                     * interrupts are disabled.
> +                     */
> +                    if ( system_state == SYS_STATE_smp_boot &&
> +                         !(dirty_cnt & 0xff) )
> +                        process_pending_softirqs();
> +                }
>  
>                  dirty_cnt++;
>              }

Yet an alternative consideration: When "cold" is true, couldn't we call
process_pending_softirqs() like you do here ( >= SYS_STATE_smp_boot then
of course), without any of the other changes? Of course that's worse
than a proper continuation, especially from the calling domain's pov.
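
For reference, the polling cadence of the quoted hunk can be modelled in
isolation.  This is a rough sketch under assumptions: scrub_with_polling() is
a hypothetical helper, every page is treated as dirty, and the poll callback
stands in for process_pending_softirqs().

```c
#include <stddef.h>

/*
 * Walk nr_pages dirty pages, invoking poll() whenever the count of pages
 * scrubbed before the current one is a multiple of 256.  Returns the number
 * of polling points reached.
 */
static size_t scrub_with_polling(size_t nr_pages, void (*poll)(void))
{
    size_t dirty_cnt = 0, polls = 0;

    for ( size_t i = 0; i < nr_pages; i++ )
    {
        /* scrub_one_page() would run here. */
        if ( !(dirty_cnt & 0xff) )
        {
            if ( poll )
                poll();
            polls++;
        }

        dirty_cnt++;
    }

    return polls;
}
```

Because the counter is incremented after the check, the first polling point
is reached right after the first page, and a fully dirty order 18 allocation
reaches 1024 of them.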

> @@ -223,6 +224,14 @@ struct npfec {
>  #else
>  #define MAX_ORDER 20 /* 2^20 contiguous pages */
>  #endif
> +
> +/* Max order when scrubbing pages at allocation time.  */
> +#ifdef CONFIG_DOMU_MAX_ORDER
> +# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
> +#else
> +# define MAX_DIRTY_ORDER 9
> +#endif

Using CONFIG_DOMU_MAX_ORDER rather than the command line overridable
domu_max_order means people couldn't even restore original behavior.

Jan


* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
  2026-01-09 10:29   ` Andrew Cooper
@ 2026-01-09 11:32     ` Jan Beulich
  2026-01-09 11:34       ` Andrew Cooper
  2026-01-09 12:31     ` Roger Pau Monné
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 11:32 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Anthony PERARD, xen-devel, Roger Pau Monne

On 09.01.2026 11:29, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> In XenServer we have seen the watchdog occasionally triggering during
>>> domain creation if 1GB pages are scrubbed in-place during physmap
>>> population.
>> That's pretty extreme - writing to 1Gb of memory can't really take over 5s,
>> can it?
> 
> Sure it can.

Under what unusual circumstances, or on what extremely slow hardware? (Of
course improperly set MTRRs could cause such, for example.)

>> Is there lock contention involved?
> 
> Almost certainly, and it's probably the more relevant aspect in this case.

Thing is - the scrubbing happens after alloc_heap_pages() has already
dropped the heap lock. And I can't spot the XENMEM_populate_physmap path
to take any locks outward from alloc_heap_pages(). And the domain's
page alloc lock (which in principle should be uncontended anyway unless
the toolstack tries to race with itself) is acquired only later.

If it was a lock contention problem, the first goal ought to be to move
the scrubbing outside of any (potentially contended) locks.

>> Or is this when very many CPUs
>> try to do the same in parallel?
> 
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.
> 
> In this case, even with sufficient scrubbed RAM on other nodes, you need
> to take from the node you claimed on which might need scrubbing.

Much like if there was an exact-node request without involving claims.

> The underlying problem is the need to do a long running operation in a
> context where you cannot continue, and cannot (reasonably) fail.

Right.

Jan

* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
  2026-01-09 11:32     ` Jan Beulich
@ 2026-01-09 11:34       ` Andrew Cooper
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Cooper @ 2026-01-09 11:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Anthony PERARD, xen-devel,
	Roger Pau Monne

On 09/01/2026 11:32 am, Jan Beulich wrote:
>>> Or is this when very many CPUs
>>> try to do the same in parallel?
>> The scenario is reboot of a VM when Xapi is doing NUMA placement using
>> per-node claims.
>>
>> In this case, even with sufficient scrubbed RAM on other nodes, you need
>> to take from the node you claimed on which might need scrubbing.
> Much like if there was an exact-node request without involving claims.
>
>> The underlying problem is the need to do a long running operation in a
>> context where you cannot continue, and cannot (reasonably) fail.
> Right.

Yeah - I think this is a scenario that could happen without NUMA
aspects, if the system is almost full.  I suspect we've just made it
easier to hit, or we've got better testing.  Hard to say.

~Andrew


* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
  2026-01-09 10:29   ` Andrew Cooper
  2026-01-09 11:32     ` Jan Beulich
@ 2026-01-09 12:31     ` Roger Pau Monné
  1 sibling, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-09 12:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jan Beulich, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Anthony PERARD, xen-devel

On Fri, Jan 09, 2026 at 10:29:20AM +0000, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
> > On 08.01.2026 18:55, Roger Pau Monne wrote:
> >> In XenServer we have seen the watchdog occasionally triggering during
> >> domain creation if 1GB pages are scrubbed in-place during physmap
> >> population.
> > That's pretty extreme - writing to 1Gb of memory can't really take over 5s,
> > can it?
> 
> Sure it can.
> 
> > Is there lock contention involved?
> 
> Almost certainly, and it's probably the more relevant aspect in this case.

Possibly.  I can ask Edwin to give me his reproduction.  There's also
the map_domain_page() aspect of this operation.  On big enough
systems this will cause a fair amount of stress to the map cache,
since each page is mapped, scrubbed and unmapped.  However, I don't
think the systems on which we have seen this are using the map
cache (it was on debug=n builds with less than 5TB of memory).

> > Or is this when very many CPUs
> > try to do the same in parallel?
> 
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.

Not exclusively.  We have reports of this also happening without any
claims or NUMA placements being used.

AFAICT it's possibly triggered when doing reboots of multiple VMs in
parallel, and all reports of it I've seen are on multi-node NUMA
systems.  I wonder if scrubbing a 1G remote page in 4K chunks is
killing the intra-node bandwidth.

Thanks, Roger.

* Re: [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
  2026-01-09 10:22   ` Jan Beulich
@ 2026-01-09 14:46     ` Roger Pau Monné
  2026-01-09 14:50       ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-09 14:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On Fri, Jan 09, 2026 at 11:22:39AM +0100, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
> > Such a parameter allows requesting to scrub memory only from the specified
> > node.  If there's no memory to scrub from the requested node the function
> > returns false.  If the node is already being scrubbed from a different CPU
> > the function returns true so the caller can differentiate whether there's
> > still pending work to do.
> 
> I'm really trying to understand both patches together, and peeking ahead I
> don't understand the above, which looks to describe ...
> 
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
> >      }
> >  }
> >  
> > -bool scrub_free_pages(void)
> > +bool scrub_free_pages(nodeid_t node)
> >  {
> >      struct page_info *pg;
> >      unsigned int zone;
> >      unsigned int cpu = smp_processor_id();
> >      bool preempt = false;
> > -    nodeid_t node;
> >      unsigned int cnt = 0;
> >  
> > -    node = node_to_scrub(true);
> > +    if ( node != NUMA_NO_NODE )
> > +    {
> > +        if ( !node_need_scrub[node] )
> > +            /* Nothing to scrub. */
> > +            return false;
> > +
> > +        if ( node_test_and_set(node, node_scrubbing) )
> > +            /* Another CPU is scrubbing it. */
> > +            return true;
> 
> ... these two return-s. My problem being that patch 2 doesn't use the
> return value (while existing callers don't take this path). Is this then
> "just in case" for now (and making the meaning of the return values
> somewhat inconsistent for the function as a whole)?

I've added those so that the function return values are consistent,
even if not consumed right now.  It would make no sense for the return
values to have a different meaning when the node parameter is !=
NUMA_NO_NODE.  Or at least that was my impression.

In fact an earlier version of patch 2 did consume those values.  I've
moved to a different approach, but I think it's good to keep the
return values consistent regardless of the input parameters.
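
To make the two early exits concrete, here is a standalone model of
just that part of the function.  It is only a sketch: the arrays stand
in for node_need_scrub[] and the node_scrubbing mask, NR_NODES is
arbitrary, claim_node() is a made-up name, and the flag update is not
atomic the way node_test_and_set() is.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4 /* arbitrary, for the model only */

static unsigned long need_scrub[NR_NODES]; /* models node_need_scrub[] */
static bool scrubbing[NR_NODES];           /* models the node_scrubbing mask */

/*
 * Models only the early exits taken for an explicit node.  Returns true
 * when the caller may go on to scrub (it now owns the node); otherwise
 * *ret holds the value scrub_free_pages() would return.
 */
bool claim_node(unsigned int node, bool *ret)
{
    assert(node < NR_NODES && ret);

    if ( !need_scrub[node] )
    {
        *ret = false; /* Nothing to scrub. */
        return false;
    }

    if ( scrubbing[node] )
    {
        *ret = true; /* Another CPU is scrubbing it. */
        return false;
    }

    scrubbing[node] = true; /* Plain store; the real code is atomic. */
    return true;
}
```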

Thanks, Roger.



* Re: [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
  2026-01-09 14:46     ` Roger Pau Monné
@ 2026-01-09 14:50       ` Jan Beulich
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 14:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 09.01.2026 15:46, Roger Pau Monné wrote:
> On Fri, Jan 09, 2026 at 11:22:39AM +0100, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> Such parameter allow requesting to scrub memory only from the specified
>>> node.  If there's no memory to scrub from the requested node the function
>>> returns false.  If the node is already being scrubbed from a different CPU
>>> the function returns true so the caller can differentiate whether there's
>>> still pending work to do.
>>
>> I'm really trying to understand both patches together, and peeking ahead I
>> don't understand the above, which looks to describe ...
>>
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
>>>      }
>>>  }
>>>  
>>> -bool scrub_free_pages(void)
>>> +bool scrub_free_pages(nodeid_t node)
>>>  {
>>>      struct page_info *pg;
>>>      unsigned int zone;
>>>      unsigned int cpu = smp_processor_id();
>>>      bool preempt = false;
>>> -    nodeid_t node;
>>>      unsigned int cnt = 0;
>>>  
>>> -    node = node_to_scrub(true);
>>> +    if ( node != NUMA_NO_NODE )
>>> +    {
>>> +        if ( !node_need_scrub[node] )
>>> +            /* Nothing to scrub. */
>>> +            return false;
>>> +
>>> +        if ( node_test_and_set(node, node_scrubbing) )
>>> +            /* Another CPU is scrubbing it. */
>>> +            return true;
>>
>> ... these two return-s. My problem being that patch 2 doesn't use the
>> return value (while existing callers don't take this path). Is this then
>> "just in case" for now (and making the meaning of the return values
>> somewhat inconsistent for the function as a whole)?
> 
> I've added those so that the function return values are consistent,
> even if not consumed right now, it would make no sense for the return
> values to have different meaning when the node parameter is !=
> NUMA_NO_NODE.  Or at least that was my impression.
> 
> In fact an earlier version of patch 2 did consume those values.  I've
> moved to a different approach, but I think it's good to keep the
> return values consistent regardless of the input parameters.

My point was though: The present "true" return doesn't mean "Another CPU
is scrubbing it." Instead it means "More work to do" aiui. That's similar
in a way, but not identical.

Jan



* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-09 11:19   ` Jan Beulich
@ 2026-01-13 14:01     ` Roger Pau Monné
  2026-01-14  8:48       ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-13 14:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
> > The current model of falling back to allocate unscrubbed pages and scrub
> > them in place at allocation time risks triggering the watchdog:
> > 
> > Watchdog timer detects that CPU55 is stuck!
> > ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
> > CPU:    55
> > RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> > RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
> > [...]
> > Xen call trace:
> >    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
> >    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
> >    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
> >    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
> >    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
> >    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> > 
> > The maximum allocation order on x86 is limited to 18, that means allocating
> > and scrubbing possibly 1G worth of memory in 4K chunks.
> > 
> > Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
> > currently set to 2M chunks.  However such limitation might cause
> > fragmentation in HVM p2m population during domain creation.  To prevent
> > that introduce some extra logic in populate_physmap() that fallback to
> > preemptive page-scrubbing if the requested allocation cannot be fulfilled
> > and there's scrubbing work to do.  This approach is less fair than the
> > current one, but allows preemptive page scrubbing in the context of
> > populate_physmap() to attempt to ensure unnecessary page-shattering.
> > 
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > I'm not particularly happy with this approach, as it doesn't guarantee
> > progress for the callers.  IOW: a caller might do a lot of scrubbing, just
> > to get it's pages stolen by a different concurrent thread doing
> > allocations.  However I'm not sure there's a better solution than resorting
> > to 2M allocations if there's not enough free memory that is scrubbed.
> > 
> > I'm having trouble seeing where we could temporary store page(s) allocated
> > that need to be scrubbed before being assigned to the domain, in a way that
> > can be used by continuations, and that would allow Xen to keep track of
> > them in case the operation is never finished.  IOW: we would need to
> > account for cleanup of such temporary stash of pages in case the domain
> > never completes the hypercall, or is destroyed midway.
> 
> How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
> indicate that state, with the actual page (and order plus scrub progress)
> recorded in the target struct domain? Actually, maybe such an indicator
> isn't needed at all: If the next invocation (continuation or not) finds
> an in-progress allocation, it could simply use that rather than doing a
> real allocation. (What to do if this isn't a continuation is less clear:
> We could fail such requests [likely not an option unless we can reliably
> tell original requests from continuations], or split the allocation if
> the request is smaller, or free the allocation to then take the normal
> path.) All of which of course only for "foreign" requests.
> 
> If the hypercall is never continued, we could refuse to unpause the
> domain (with the allocation then freed normally when the domain gets
> destroyed).

I have done something along these lines: introduced a couple of
stashing variables in the domain struct and stored the progress of
scrubbing in there.
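
Roughly the idea, as a standalone sketch rather than the actual code:
struct scrub_stash, its fields and scrub_stash_step() are hypothetical
names, and memset() stands in for the real page scrubbing.  Each
invocation scrubs a bounded number of pages and records how far it got,
so a continuation can resume from there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical per-domain stash; the field names are illustrative only. */
struct scrub_stash {
    unsigned char *base; /* start of the in-progress allocation */
    unsigned int order;  /* order of that allocation */
    unsigned long done;  /* pages scrubbed so far */
};

/*
 * Scrub at most 'budget' pages and return.  Returns true once the whole
 * extent is clean, false if the caller should set up a continuation.
 */
bool scrub_stash_step(struct scrub_stash *s, unsigned long budget)
{
    unsigned long total;

    assert(s && s->base && s->order < 8 * sizeof(total));
    total = 1UL << s->order;

    for ( ; s->done < total && budget; s->done++, budget-- )
        memset(s->base + s->done * PAGE_SIZE, 0, PAGE_SIZE);

    return s->done == total;
}
```

Cleanup of the stash when the hypercall is never continued, or the
domain is destroyed midway, is deliberately left out of the sketch.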

> As another alternative, how about returning unscrubbed pages altogether
> when it's during domain creation, requiring the tool stack to do the
> scrubbing (potentially allowing it to skip some of it when pages are
> fully initialized anyway, much like we do for Dom0 iirc)?

It's going to be difficult for the toolstack to figure out which pages
need to be scrubbed.  Wouldn't we need a way to tell it the unscrubbed
regions in a domain's physmap?

> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >  
> >                  if ( unlikely(!page) )
> >                  {
> > +                    nodeid_t node = MEMF_get_node(a->memflags);
> > +
> > +                    if ( memory_scrub_pending(node) ||
> > +                         (node != NUMA_NO_NODE &&
> > +                          !(a->memflags & MEMF_exact_node) &&
> > +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> > +                    {
> > +                        scrub_free_pages(node);
> > +                        a->preempted = 1;
> > +                        goto out;
> > +                    }
> 
> At least for order 0 requests there's no point in trying this. With the
> current logic, actually for orders up to MAX_DIRTY_ORDER.

Yes, otherwise we might force the CPU to do some scrubbing work when
it won't satisfy its allocation request anyway.
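
A sketch of the extra condition, under the assumption that requests up
to MAX_DIRTY_ORDER can always fall back to scrubbing dirty pages in
place, so a failure at those orders is not caused by pending scrub
work.  worth_preemptive_scrub() is a made-up helper; the two constants
are the fallback values from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER       20 /* generic fallback value from the patch */
#define MAX_DIRTY_ORDER  9 /* patch default without CONFIG_DOMU_MAX_ORDER */

/*
 * Only bail out to preemptive scrubbing when the failed request is too
 * big to have been satisfied from dirty pages scrubbed in place.
 */
bool worth_preemptive_scrub(unsigned int order, bool scrub_pending)
{
    assert(order <= MAX_ORDER);
    return scrub_pending && order > MAX_DIRTY_ORDER;
}
```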

> Further, from a general interface perspective, wouldn't we need to do the
> same for at least XENMEM_increase_reservation?

Possibly yes.  TBH I would also be fine with strictly limiting
XENMEM_increase_reservation to 2M order extents, even for the control
domain.  The physmap population is the only one that actually requires
bigger extents.

> > @@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
> >              if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
> >              {
> >                  if ( !(memflags & MEMF_no_scrub) )
> > +                {
> >                      scrub_one_page(&pg[i], cold);
> > +                    /*
> > +                     * Use SYS_STATE_smp_boot explicitly; ahead of that state
> > +                     * interrupts are disabled.
> > +                     */
> > +                    if ( system_state == SYS_STATE_smp_boot &&
> > +                         !(dirty_cnt & 0xff) )
> > +                        process_pending_softirqs();
> > +                }
> >  
> >                  dirty_cnt++;
> >              }
> 
> Yet an alternative consideration: When "cold" is true, couldn't we call
> process_pending_softirqs() like you do here ( >= SYS_STATE_smp_boot then
> of course), without any of the other changes? Of course that's worse
> than a proper continuation, especially from the calling domain's pov.

Overall I think it would be best to solve this with hypercall
continuations, in case we ever want to support pages bigger than 1G.
I know this has a lot of other implications, but it would be nice not
to add more baggage here.

The "cold" case is the typical scenario for domain building, and we
would block a control domain CPU for more than 5s, which seems
undesirable.

> > @@ -223,6 +224,14 @@ struct npfec {
> >  #else
> >  #define MAX_ORDER 20 /* 2^20 contiguous pages */
> >  #endif
> > +
> > +/* Max order when scrubbing pages at allocation time.  */
> > +#ifdef CONFIG_DOMU_MAX_ORDER
> > +# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
> > +#else
> > +# define MAX_DIRTY_ORDER 9
> > +#endif
> 
> Using CONFIG_DOMU_MAX_ORDER rather than the command line overridable
> domu_max_order means people couldn't even restore original behavior.

We likely want a separate command line option for this one, but given
your comments above we might want to explore other options.

Thanks, Roger.



* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-13 14:01     ` Roger Pau Monné
@ 2026-01-14  8:48       ` Jan Beulich
  2026-01-15 10:48         ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-14  8:48 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 13.01.2026 15:01, Roger Pau Monné wrote:
> On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> The current model of falling back to allocate unscrubbed pages and scrub
>>> them in place at allocation time risks triggering the watchdog:
>>>
>>> Watchdog timer detects that CPU55 is stuck!
>>> ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
>>> CPU:    55
>>> RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
>>> RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
>>> [...]
>>> Xen call trace:
>>>    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>>>    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>>>    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>>>    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>>>    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>>>    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
>>>
>>> The maximum allocation order on x86 is limited to 18, that means allocating
>>> and scrubbing possibly 1G worth of memory in 4K chunks.
>>>
>>> Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
>>> currently set to 2M chunks.  However such limitation might cause
>>> fragmentation in HVM p2m population during domain creation.  To prevent
>>> that introduce some extra logic in populate_physmap() that fallback to
>>> preemptive page-scrubbing if the requested allocation cannot be fulfilled
>>> and there's scrubbing work to do.  This approach is less fair than the
>>> current one, but allows preemptive page scrubbing in the context of
>>> populate_physmap() to attempt to ensure unnecessary page-shattering.
>>>
>>> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> I'm not particularly happy with this approach, as it doesn't guarantee
>>> progress for the callers.  IOW: a caller might do a lot of scrubbing, just
>>> to get it's pages stolen by a different concurrent thread doing
>>> allocations.  However I'm not sure there's a better solution than resorting
>>> to 2M allocations if there's not enough free memory that is scrubbed.
>>>
>>> I'm having trouble seeing where we could temporary store page(s) allocated
>>> that need to be scrubbed before being assigned to the domain, in a way that
>>> can be used by continuations, and that would allow Xen to keep track of
>>> them in case the operation is never finished.  IOW: we would need to
>>> account for cleanup of such temporary stash of pages in case the domain
>>> never completes the hypercall, or is destroyed midway.
>>
>> How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
>> indicate that state, with the actual page (and order plus scrub progress)
>> recorded in the target struct domain? Actually, maybe such an indicator
>> isn't needed at all: If the next invocation (continuation or not) finds
>> an in-progress allocation, it could simply use that rather than doing a
>> real allocation. (What to do if this isn't a continuation is less clear:
>> We could fail such requests [likely not an option unless we can reliably
>> tell original requests from continuations], or split the allocation if
>> the request is smaller, or free the allocation to then take the normal
>> path.) All of which of course only for "foreign" requests.
>>
>> If the hypercall is never continued, we could refuse to unpause the
>> domain (with the allocation then freed normally when the domain gets
>> destroyed).
> 
> I have done something along this lines, introduced a couple of
> stashing variables in the domain struct and stored the progress of
> scrubbing in there.
> 
>> As another alternative, how about returning unscrubbed pages altogether
>> when it's during domain creation, requiring the tool stack to do the
>> scrubbing (potentially allowing it to skip some of it when pages are
>> fully initialized anyway, much like we do for Dom0 iirc)?
> 
> It's going to be difficult for the toolstack to figure out which pages
> need to be scrubbed, we would need a way to tell it the unscrubbed
> regions in a domain physmap?

My thinking here was that the toolstack would have to assume everything
is unscrubbed, and it could avoid scrubbing only those pages which it
knows it fully fills with some data.
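
Purely as an illustration of that idea (no such toolstack interface
exists today; scrub_unfilled() and the 'filled' array are hypothetical,
and memset() stands in for the actual scrubbing):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/*
 * Treat every page as unscrubbed and zero it, except the ones the
 * builder knows it overwrites completely.  Returns the number of pages
 * that were scrubbed.
 */
size_t scrub_unfilled(unsigned char *mem, const bool *filled, size_t nr_pages)
{
    size_t i, scrubbed = 0;

    assert(mem && filled);

    for ( i = 0; i < nr_pages; i++ )
    {
        if ( filled[i] )
            continue; /* fully written by the builder, nothing can leak */

        memset(mem + i * PAGE_SIZE, 0, PAGE_SIZE);
        scrubbed++;
    }

    return scrubbed;
}
```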

>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
>>>  
>>>                  if ( unlikely(!page) )
>>>                  {
>>> +                    nodeid_t node = MEMF_get_node(a->memflags);
>>> +
>>> +                    if ( memory_scrub_pending(node) ||
>>> +                         (node != NUMA_NO_NODE &&
>>> +                          !(a->memflags & MEMF_exact_node) &&
>>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
>>> +                    {
>>> +                        scrub_free_pages(node);
>>> +                        a->preempted = 1;
>>> +                        goto out;
>>> +                    }
>>
>> At least for order 0 requests there's no point in trying this. With the
>> current logic, actually for orders up to MAX_DIRTY_ORDER.
> 
> Yes, otherwise we might force the CPU to do some scrubbing work when
> it won't satisfy it's allocation request anyway.
> 
>> Further, from a general interface perspective, wouldn't we need to do the
>> same for at least XENMEM_increase_reservation?
> 
> Possibly yes.  TBH I would also be fine with strictly limiting
> XENMEM_increase_reservation to 2M order extents, even for the control
> domain.  The physmap population is the only that actually requires
> bigger extents.

Hmm, that's an option, yes, but an ABI-changing one.

Jan



* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-14  8:48       ` Jan Beulich
@ 2026-01-15 10:48         ` Roger Pau Monné
  2026-01-15 10:56           ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-15 10:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Wed, Jan 14, 2026 at 09:48:59AM +0100, Jan Beulich wrote:
> On 13.01.2026 15:01, Roger Pau Monné wrote:
> > On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> >> On 08.01.2026 18:55, Roger Pau Monne wrote:
> >>> --- a/xen/common/memory.c
> >>> +++ b/xen/common/memory.c
> >>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >>>  
> >>>                  if ( unlikely(!page) )
> >>>                  {
> >>> +                    nodeid_t node = MEMF_get_node(a->memflags);
> >>> +
> >>> +                    if ( memory_scrub_pending(node) ||
> >>> +                         (node != NUMA_NO_NODE &&
> >>> +                          !(a->memflags & MEMF_exact_node) &&
> >>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> >>> +                    {
> >>> +                        scrub_free_pages(node);
> >>> +                        a->preempted = 1;
> >>> +                        goto out;
> >>> +                    }
> >>
> >> At least for order 0 requests there's no point in trying this. With the
> >> current logic, actually for orders up to MAX_DIRTY_ORDER.
> > 
> > Yes, otherwise we might force the CPU to do some scrubbing work when
> > it won't satisfy it's allocation request anyway.
> > 
> >> Further, from a general interface perspective, wouldn't we need to do the
> >> same for at least XENMEM_increase_reservation?
> > 
> > Possibly yes.  TBH I would also be fine with strictly limiting
> > XENMEM_increase_reservation to 2M order extents, even for the control
> > domain.  The physmap population is the only that actually requires
> > bigger extents.
> 
> Hmm, that's an option, yes, but an ABI-changing one.

I don't think it changes the ABI: Xen has always reserved the right to
block high order allocations.  See for example how max_order() has
different limits depending on the domain permissions, and I would not
consider those limits part of the ABI, they can be changed from the
command line.

Thanks, Roger.



* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-15 10:48         ` Roger Pau Monné
@ 2026-01-15 10:56           ` Jan Beulich
  2026-01-15 13:05             ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-15 10:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 15.01.2026 11:48, Roger Pau Monné wrote:
> On Wed, Jan 14, 2026 at 09:48:59AM +0100, Jan Beulich wrote:
>> On 13.01.2026 15:01, Roger Pau Monné wrote:
>>> On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
>>>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>>>> --- a/xen/common/memory.c
>>>>> +++ b/xen/common/memory.c
>>>>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
>>>>>  
>>>>>                  if ( unlikely(!page) )
>>>>>                  {
>>>>> +                    nodeid_t node = MEMF_get_node(a->memflags);
>>>>> +
>>>>> +                    if ( memory_scrub_pending(node) ||
>>>>> +                         (node != NUMA_NO_NODE &&
>>>>> +                          !(a->memflags & MEMF_exact_node) &&
>>>>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
>>>>> +                    {
>>>>> +                        scrub_free_pages(node);
>>>>> +                        a->preempted = 1;
>>>>> +                        goto out;
>>>>> +                    }
>>>>
>>>> At least for order 0 requests there's no point in trying this. With the
>>>> current logic, actually for orders up to MAX_DIRTY_ORDER.
>>>
>>> Yes, otherwise we might force the CPU to do some scrubbing work when
>>> it won't satisfy it's allocation request anyway.
>>>
>>>> Further, from a general interface perspective, wouldn't we need to do the
>>>> same for at least XENMEM_increase_reservation?
>>>
>>> Possibly yes.  TBH I would also be fine with strictly limiting
>>> XENMEM_increase_reservation to 2M order extents, even for the control
>>> domain.  The physmap population is the only that actually requires
>>> bigger extents.
>>
>> Hmm, that's an option, yes, but an ABI-changing one.
> 
> I don't think it changes the ABI: Xen has always reserved the right to
> block high order allocations.  See for example how max_order() has
> different limits depending on the domain permissions, and I would not
> consider those limits part of the ABI, they can be changed from the
> command line.

When the limits were introduced, we were aware this is an ABI change, albeit
a necessary one. You have a point however as to the command line control that
there now is.

Jan



* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-15 10:56           ` Jan Beulich
@ 2026-01-15 13:05             ` Roger Pau Monné
  0 siblings, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-15 13:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Thu, Jan 15, 2026 at 11:56:16AM +0100, Jan Beulich wrote:
> On 15.01.2026 11:48, Roger Pau Monné wrote:
> > On Wed, Jan 14, 2026 at 09:48:59AM +0100, Jan Beulich wrote:
> >> On 13.01.2026 15:01, Roger Pau Monné wrote:
> >>> On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> >>>> On 08.01.2026 18:55, Roger Pau Monne wrote:
> >>>>> --- a/xen/common/memory.c
> >>>>> +++ b/xen/common/memory.c
> >>>>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >>>>>  
> >>>>>                  if ( unlikely(!page) )
> >>>>>                  {
> >>>>> +                    nodeid_t node = MEMF_get_node(a->memflags);
> >>>>> +
> >>>>> +                    if ( memory_scrub_pending(node) ||
> >>>>> +                         (node != NUMA_NO_NODE &&
> >>>>> +                          !(a->memflags & MEMF_exact_node) &&
> >>>>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> >>>>> +                    {
> >>>>> +                        scrub_free_pages(node);
> >>>>> +                        a->preempted = 1;
> >>>>> +                        goto out;
> >>>>> +                    }
> >>>>
> >>>> At least for order 0 requests there's no point in trying this. With the
> >>>> current logic, actually for orders up to MAX_DIRTY_ORDER.
> >>>
> >>> Yes, otherwise we might force the CPU to do some scrubbing work when
> >>> it won't satisfy it's allocation request anyway.
> >>>
> >>>> Further, from a general interface perspective, wouldn't we need to do the
> >>>> same for at least XENMEM_increase_reservation?
> >>>
> >>> Possibly yes.  TBH I would also be fine with strictly limiting
> >>> XENMEM_increase_reservation to 2M order extents, even for the control
> >>> domain.  The physmap population is the only that actually requires
> >>> bigger extents.
> >>
> >> Hmm, that's an option, yes, but an ABI-changing one.
> > 
> > I don't think it changes the ABI: Xen has always reserved the right to
> > block high order allocations.  See for example how max_order() has
> > different limits depending on the domain permissions, and I would not
> > consider those limits part of the ABI, they can be changed from the
> > command line.
> 
> When the limits were introduced, we were aware this is an ABI change, albeit
> a necessary one. You have a point however as to the command line control that
> there now is.

In addition to what I've said above: the limit that I've introduced in
v2 only affects dirty allocations that require scrubbing.  If the
requested order is available and scrubbed the limit won't be enforced.
So the ABI is not changed in that regard, only unscrubbed pages past
a certain order are considered as not free.
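
In other words, something like the predicate below.  This is a sketch
of the rule as described here, not the v2 code: extent_counts_as_free()
and 'dirty_limit' are hypothetical names, and the bound of 20 is just
the generic MAX_ORDER value.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * 'dirty_limit' is a hypothetical name for the maximum order that may
 * be scrubbed in place at allocation time.
 */
bool extent_counts_as_free(unsigned int order, bool scrubbed,
                           unsigned int dirty_limit)
{
    assert(order <= 20); /* generic MAX_ORDER */

    /* Scrubbed extents are never limited; dirty ones only up to the limit. */
    return scrubbed || order <= dirty_limit;
}
```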

It's possibly best to move the conversation to the v2 proposal and
discuss the limit there.

Thanks, Roger.



Thread overview: 17+ messages
2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
2026-01-09 10:22   ` Jan Beulich
2026-01-09 14:46     ` Roger Pau Monné
2026-01-09 14:50       ` Jan Beulich
2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
2026-01-09 11:19   ` Jan Beulich
2026-01-13 14:01     ` Roger Pau Monné
2026-01-14  8:48       ` Jan Beulich
2026-01-15 10:48         ` Roger Pau Monné
2026-01-15 10:56           ` Jan Beulich
2026-01-15 13:05             ` Roger Pau Monné
2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
2026-01-09 10:29   ` Andrew Cooper
2026-01-09 11:32     ` Jan Beulich
2026-01-09 11:34       ` Andrew Cooper
2026-01-09 12:31     ` Roger Pau Monné
