[PATCH v3 0/3] xen/mm: limit in-place scrubbing

* [PATCH v3 0/3] xen/mm: limit in-place scrubbing
@ 2026-01-22 17:38 Roger Pau Monne
  2026-01-22 17:38 ` [PATCH v3 1/3] xen/mm: enforce SCRUB_DEBUG checks for MEMF_no_scrub allocations Roger Pau Monne
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Roger Pau Monne @ 2026-01-22 17:38 UTC
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Hello,

In XenServer we have seen the watchdog occasionally trigger during
domain creation when 1GB pages are scrubbed in-place during physmap
population.  The following series attempts to mitigate this by adding
preemption to page scrubbing in populate_physmap().  It also adds a new
limit, together with a command line option, on the maximum allocation
order when doing in-place scrubbing.  This is set by default to
CONFIG_PTDOM_MAX_ORDER.

Thanks, Roger.

Roger Pau Monne (3):
  xen/mm: enforce SCRUB_DEBUG checks for MEMF_no_scrub allocations
  xen/mm: allow deferred scrub of physmap populate allocated pages
  xen/mm: limit non-scrubbed allocations to a specific order

 docs/misc/xen-command-line.pandoc |  13 ++++
 xen/common/domain.c               |  28 +++++++++
 xen/common/memory.c               | 100 ++++++++++++++++++++++++++++--
 xen/common/page_alloc.c           |  30 +++++++--
 xen/include/xen/mm.h              |  14 +++++
 xen/include/xen/sched.h           |   5 ++
 6 files changed, 181 insertions(+), 9 deletions(-)

-- 
2.51.0




* [PATCH v3 1/3] xen/mm: enforce SCRUB_DEBUG checks for MEMF_no_scrub allocations
  2026-01-22 17:38 [PATCH v3 0/3] xen/mm: limit in-place scrubbing Roger Pau Monne
@ 2026-01-22 17:38 ` Roger Pau Monne
  2026-01-22 17:38 ` [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages Roger Pau Monne
  2026-01-22 17:38 ` [PATCH v3 3/3] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
  2 siblings, 0 replies; 11+ messages in thread
From: Roger Pau Monne @ 2026-01-22 17:38 UTC
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

The logic in alloc_heap_pages() only checks for scrubbing pattern
correctness when the caller doesn't pass MEMF_no_scrub in memflags.
However already scrubbed pages can be checked for correctness, regardless
of the caller having requested MEMF_no_scrub.

Relax the condition around the check_one_page() call, so that calls
with MEMF_no_scrub also check the correctness of pages marked as already
scrubbed when allocated.  This widens the coverage of the scrubbing
correctness checks to include MEMF_no_scrub allocations.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
After discussing with Jan I've deliberately omitted the tag:

Fixes: 0c5f2f9cefac ("mm: Make sure pages are scrubbed")

The intended approach might have been to ensure the caller of
alloc_heap_pages() gets properly scrubbed pages, rather than asserting the
internal state of free pages is as expected.
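
For illustration only (not part of the patch), the effect of the change on a
page that is already marked clean can be modelled as two predicates.  This is
a simplified sketch: MODEL_MEMF_NO_SCRUB and both function names are
hypothetical, and it assumes check_one_page() only verifies anything when
scrub_debug is enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for MEMF_no_scrub; the bit value is illustrative. */
#define MODEL_MEMF_NO_SCRUB (1u << 0)

/*
 * Pre-patch: is an already-clean page verified?  has_dirty models
 * first_dirty != INVALID_DIRTY_IDX.
 */
static bool clean_page_checked_before(bool has_dirty, bool scrub_debug,
                                      unsigned int memflags)
{
    bool no_scrub = memflags & MODEL_MEMF_NO_SCRUB;
    bool loop_entered = has_dirty || (scrub_debug && !no_scrub);

    /* The else branch was additionally guarded by !MEMF_no_scrub. */
    return loop_entered && !no_scrub && scrub_debug;
}

/* Post-patch: MEMF_no_scrub no longer suppresses the check. */
static bool clean_page_checked_after(bool has_dirty, bool scrub_debug,
                                     unsigned int memflags)
{
    bool loop_entered = has_dirty || scrub_debug;

    (void)memflags; /* The flag plays no role in the check any more. */

    return loop_entered && scrub_debug;
}
```

In this model the only inputs whose result differs are those with scrub_debug
enabled and MEMF_no_scrub set, which is the widening described above.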
---
 xen/common/page_alloc.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 2efc11ce095f..de1480316f05 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1105,8 +1105,7 @@ static struct page_info *alloc_heap_pages(
 
     spin_unlock(&heap_lock);
 
-    if ( first_dirty != INVALID_DIRTY_IDX ||
-         (scrub_debug && !(memflags & MEMF_no_scrub)) )
+    if ( first_dirty != INVALID_DIRTY_IDX || scrub_debug )
     {
         bool cold = d && d != current->domain;
 
@@ -1119,7 +1118,7 @@ static struct page_info *alloc_heap_pages(
 
                 dirty_cnt++;
             }
-            else if ( !(memflags & MEMF_no_scrub) )
+            else
                 check_one_page(&pg[i]);
         }
 
-- 
2.51.0




* [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
  2026-01-22 17:38 [PATCH v3 0/3] xen/mm: limit in-place scrubbing Roger Pau Monne
  2026-01-22 17:38 ` [PATCH v3 1/3] xen/mm: enforce SCRUB_DEBUG checks for MEMF_no_scrub allocations Roger Pau Monne
@ 2026-01-22 17:38 ` Roger Pau Monne
  2026-01-26 11:14   ` Jan Beulich
  2026-01-22 17:38 ` [PATCH v3 3/3] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
  2 siblings, 1 reply; 11+ messages in thread
From: Roger Pau Monne @ 2026-01-22 17:38 UTC
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

Physmap population needs to use pages as big as possible to reduce
p2m shattering.  However that triggers issues when big enough pages are not
yet scrubbed, and so scrubbing must be done at allocation time.  In some
scenarios with added contention the watchdog can trigger:

Watchdog timer detects that CPU55 is stuck!
----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
CPU:    55
RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
[...]
Xen call trace:
   [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
   [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
   [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
   [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
   [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
   [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970

Introduce a mechanism to preempt page scrubbing in populate_physmap().  It
relies on stashing the dirty page in the domain struct temporarily to
preempt to guest context, so the scrubbing can resume when the domain
re-enters the hypercall.  The added deferral mechanism will only be used for
domain construction, and is designed to be used with a single threaded
domain builder.  If the toolstack makes concurrent calls to
XENMEM_populate_physmap for the same target domain it will trash stashed
pages, resulting in slow domain physmap population.

Note a similar issue is present in increase reservation.  However that
hypercall is likely to only be used once the domain is already running and
the known implementations use 4K pages.  It will be dealt with in a separate
patch using a different approach, that will also take care of the
allocation in populate_physmap() once the domain is running.

Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
 - Introduce FREE_DOMHEAP_PAGE{,S}().
 - Remove j local counter.
 - Free page pending scrub in domain_kill() also.
 - Remove BUG_ON().
 - Reorder get_stashed_allocation() flow.
 - s/dirty/unscrubbed/ in a printk message.

Changes since v1:
 - New in this version, different approach than v1.
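
For illustration only (not part of the patch), the resumable scrubbing loop
can be sketched as a standalone model.  The names struct model_page,
scrub_resumable(), always_preempt() and never_preempt() are hypothetical; the
preempt_pending hook stands in for hypercall_preempt_check(), and clearing
need_scrub stands in for scrub_one_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page: only the "needs scrub" state is tracked. */
struct model_page {
    bool need_scrub;
};

/*
 * Scrub pages in [start, nr).  The preemption hook is polled once every 256
 * scrubbed pages, and never after the last page of the allocation.
 * Returns nr when done, otherwise the index at which to resume.
 */
static size_t scrub_resumable(struct model_page *pg, size_t nr, size_t start,
                              bool (*preempt_pending)(void))
{
    unsigned int dirty_cnt = 0;

    assert(start <= nr);

    for ( size_t j = start; j < nr; j++ )
    {
        if ( !pg[j].need_scrub )
            continue;

        pg[j].need_scrub = false; /* Stands in for scrub_one_page(). */

        if ( (j + 1) != nr && !(++dirty_cnt & 0xff) && preempt_pending() )
            return j + 1; /* The caller would stash the page with this index. */
    }

    return nr;
}

/* Trivial hooks for exercising the model. */
static bool always_preempt(void) { return true; }
static bool never_preempt(void) { return false; }
```

A caller would store the returned index alongside the allocation and pass it
back as start on the next invocation, so pages that were already scrubbed are
not visited again.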
---
 xen/common/domain.c     | 28 ++++++++++++
 xen/common/memory.c     | 97 ++++++++++++++++++++++++++++++++++++++++-
 xen/common/page_alloc.c |  2 +-
 xen/include/xen/mm.h    | 10 +++++
 xen/include/xen/sched.h |  5 +++
 5 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 376351b528c9..bc739571fdd5 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -722,6 +722,13 @@ static void _domain_destroy(struct domain *d)
 
     XVFREE(d->console);
 
+    if ( d->pending_scrub )
+    {
+        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
+        d->pending_scrub_order = 0;
+        d->pending_scrub_index = 0;
+    }
+
     argo_destroy(d);
 
     rangeset_domain_destroy(d);
@@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
         rspin_barrier(&d->domain_lock);
         argo_destroy(d);
         vnuma_destroy(d->vnuma);
+        /*
+         * Attempt to free any pages pending scrub early.  Toolstack can still
+         * trigger populate_physmap() operations at this point, and hence a
+         * final cleanup must be done in _domain_destroy().
+         */
+        rspin_lock(&d->page_alloc_lock);
+        if ( d->pending_scrub )
+        {
+            FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
+            d->pending_scrub_order = 0;
+            d->pending_scrub_index = 0;
+        }
+        rspin_unlock(&d->page_alloc_lock);
         domain_set_outstanding_pages(d, 0);
         /* fallthrough */
     case DOMDYING_dying:
@@ -1678,6 +1698,14 @@ int domain_unpause_by_systemcontroller(struct domain *d)
      */
     if ( new == 0 && !d->creation_finished )
     {
+        if ( d->pending_scrub )
+        {
+            printk(XENLOG_ERR
+                   "%pd: cannot be started with pending unscrubbed pages, destroying\n",
+                   d);
+            domain_crash(d);
+            return -EBUSY;
+        }
         d->creation_finished = true;
         arch_domain_creation_finished(d);
     }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 10becf7c1f4c..db20da1bcaaa 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
     a->nr_done = i;
 }
 
+/*
+ * Temporary storage for a domain assigned page that's not been fully scrubbed.
+ * Stored pages must be domheap ones.
+ *
+ * The stashed page can be freed at any time by Xen, the caller must pass the
+ * order and NUMA node requirement to the fetch function to ensure the
+ * currently stashed page matches its requirements.
+ */
+static void stash_allocation(struct domain *d, struct page_info *page,
+                             unsigned int order, unsigned int scrub_index)
+{
+    rspin_lock(&d->page_alloc_lock);
+
+    /*
+     * Drop any stashed allocation to accommodate the current one.  This
+     * interface is designed to be used for single-threaded domain creation.
+     */
+    if ( d->pending_scrub )
+        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
+
+    d->pending_scrub_index = scrub_index;
+    d->pending_scrub_order = order;
+    d->pending_scrub = page;
+
+    rspin_unlock(&d->page_alloc_lock);
+}
+
+static struct page_info *get_stashed_allocation(struct domain *d,
+                                                unsigned int order,
+                                                nodeid_t node,
+                                                unsigned int *scrub_index)
+{
+    struct page_info *page = NULL;
+
+    rspin_lock(&d->page_alloc_lock);
+
+    /*
+     * If there's a pending page to scrub check if it satisfies the current
+     * request.  If it doesn't keep it stashed and return NULL.
+     */
+    if ( d->pending_scrub && d->pending_scrub_order == order &&
+         (node == NUMA_NO_NODE || node == page_to_nid(d->pending_scrub)) )
+    {
+        page = d->pending_scrub;
+        *scrub_index = d->pending_scrub_index;
+
+        /*
+         * The caller now owns the page, clear stashed information.  Prevent
+         * concurrent usages of get_stashed_allocation() from returning the same
+         * page to different contexts.
+         */
+        d->pending_scrub_index = 0;
+        d->pending_scrub_order = 0;
+        d->pending_scrub = NULL;
+    }
+
+    rspin_unlock(&d->page_alloc_lock);
+    return page;
+}
+
 static void populate_physmap(struct memop_args *a)
 {
     struct page_info *page;
@@ -275,7 +335,18 @@ static void populate_physmap(struct memop_args *a)
             }
             else
             {
-                page = alloc_domheap_pages(d, a->extent_order, a->memflags);
+                unsigned int scrub_start = 0;
+                nodeid_t node =
+                    (a->memflags & MEMF_exact_node) ? MEMF_get_node(a->memflags)
+                                                    : NUMA_NO_NODE;
+
+                page = get_stashed_allocation(d, a->extent_order, node,
+                                              &scrub_start);
+
+                if ( !page )
+                    page = alloc_domheap_pages(d, a->extent_order,
+                        a->memflags | (d->creation_finished ? 0
+                                                            : MEMF_no_scrub));
 
                 if ( unlikely(!page) )
                 {
@@ -286,6 +357,30 @@ static void populate_physmap(struct memop_args *a)
                     goto out;
                 }
 
+                if ( !d->creation_finished )
+                {
+                    unsigned int dirty_cnt = 0;
+
+                    /* Check if there's anything to scrub. */
+                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
+                    {
+                        if ( !test_and_clear_bit(_PGC_need_scrub,
+                                                 &page[j].count_info) )
+                            continue;
+
+                        scrub_one_page(&page[j], true);
+
+                        if ( (j + 1) != (1U << a->extent_order) &&
+                             !(++dirty_cnt & 0xff) &&
+                             hypercall_preempt_check() )
+                        {
+                            a->preempted = 1;
+                            stash_allocation(d, page, a->extent_order, ++j);
+                            goto out;
+                        }
+                    }
+                }
+
                 if ( unlikely(a->memflags & MEMF_no_tlbflush) )
                 {
                     for ( j = 0; j < (1U << a->extent_order); j++ )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index de1480316f05..c9e82fd7ab62 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -792,7 +792,7 @@ static void page_list_add_scrub(struct page_info *pg, unsigned int node,
 # define scrub_page_cold clear_page_cold
 #endif
 
-static void scrub_one_page(const struct page_info *pg, bool cold)
+void scrub_one_page(const struct page_info *pg, bool cold)
 {
     void *ptr;
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 426362adb2f4..d80bfba6d393 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -145,6 +145,16 @@ unsigned long avail_node_heap_pages(unsigned int nodeid);
 #define alloc_domheap_page(d,f) (alloc_domheap_pages(d,0,f))
 #define free_domheap_page(p)  (free_domheap_pages(p,0))
 
+/* Free an allocation, and zero the pointer to it. */
+#define FREE_DOMHEAP_PAGES(p, o) do { \
+    void *_ptr_ = (p);                \
+    (p) = NULL;                       \
+    free_domheap_pages(_ptr_, o);     \
+} while ( false )
+#define FREE_DOMHEAP_PAGE(p) FREE_DOMHEAP_PAGES(p, 0)
+
+void scrub_one_page(const struct page_info *pg, bool cold);
+
 int online_page(mfn_t mfn, uint32_t *status);
 int offline_page(mfn_t mfn, int broken, uint32_t *status);
 int query_page_offline(mfn_t mfn, uint32_t *status);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 91d6a49daf16..735d5b76b411 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -661,6 +661,11 @@ struct domain
 
     /* Pointer to console settings; NULL for system domains. */
     struct domain_console *console;
+
+    /* Pointer to allocated domheap page that possibly needs scrubbing. */
+    struct page_info *pending_scrub;
+    unsigned int pending_scrub_order;
+    unsigned int pending_scrub_index;
 } __aligned(PAGE_SIZE);
 
 static inline struct page_list_head *page_to_list(
-- 
2.51.0




* Re: [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
  2026-01-22 17:38 ` [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages Roger Pau Monne
@ 2026-01-26 11:14   ` Jan Beulich
  2026-01-27 10:40     ` Roger Pau Monné
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2026-01-26 11:14 UTC
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 22.01.2026 18:38, Roger Pau Monne wrote:
> Physmap population has the need to use pages as big as possible to reduce
> p2m shattering.  However that triggers issues when big enough pages are not
> yet scrubbed, and so scrubbing must be done at allocation time.  On some
> scenarios with added contention the watchdog can trigger:
> 
> Watchdog timer detects that CPU55 is stuck!
> ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
> CPU:    55
> RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
> [...]
> Xen call trace:
>    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> 
> Introduce a mechanism to preempt page scrubbing in populate_physmap().  It
> relies on stashing the dirty page in the domain struct temporarily to
> preempt to guest context, so the scrubbing can resume when the domain
> re-enters the hypercall.  The added deferral mechanism will only be used for
> domain construction, and is designed to be used with a single threaded
> domain builder.  If the toolstack makes concurrent calls to
> XENMEM_populate_physmap for the same target domain it will trash stashed
> pages, resulting in slow domain physmap population.
> 
> Note a similar issue is present in increase reservation.  However that
> hypercall is likely to only be used once the domain is already running and
> the known implementations use 4K pages. It will be deal with in a separate
> patch using a different approach, that will also take care of the
> allocation in populate_physmap() once the domain is running.
> 
> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v2:
>  - Introduce FREE_DOMHEAP_PAGE{,S}().
>  - Remove j local counter.
>  - Free page pending scrub in domain_kill() also.

Yet still not right in domain_unpause_by_systemcontroller() as well. I.e. a
toolstack action is still needed after the crash to make the memory usable
again. If you made ...

> @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
>          rspin_barrier(&d->domain_lock);
>          argo_destroy(d);
>          vnuma_destroy(d->vnuma);
> +        /*
> +         * Attempt to free any pages pending scrub early.  Toolstack can still
> +         * trigger populate_physmap() operations at this point, and hence a
> +         * final cleanup must be done in _domain_destroy().
> +         */
> +        rspin_lock(&d->page_alloc_lock);
> +        if ( d->pending_scrub )
> +        {
> +            FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
> +            d->pending_scrub_order = 0;
> +            d->pending_scrub_index = 0;
> +        }
> +        rspin_unlock(&d->page_alloc_lock);

... this into a small helper function (usable even from _domain_destroy(),
as locking being used doesn't matter there), it would have negligible
footprint there.

As to the comment, not being a native speaker it still feels to me as if
moving "early" earlier (after "free") might help parsing of the 1st sentence.

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
>      a->nr_done = i;
>  }
>  
> +/*
> + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> + * Stored pages must be domheap ones.
> + *
> + * The stashed page can be freed at any time by Xen, the caller must pass the
> + * order and NUMA node requirement to the fetch function to ensure the
> + * currently stashed page matches it's requirements.
> + */
> +static void stash_allocation(struct domain *d, struct page_info *page,
> +                             unsigned int order, unsigned int scrub_index)
> +{
> +    rspin_lock(&d->page_alloc_lock);
> +
> +    /*
> +     * Drop any stashed allocation to accommodated the current one.  This
> +     * interface is designed to be used for single-threaded domain creation.
> +     */
> +    if ( d->pending_scrub )
> +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);

Didn't you indicate you'd move the freeing ...

> +    d->pending_scrub_index = scrub_index;
> +    d->pending_scrub_order = order;
> +    d->pending_scrub = page;
> +
> +    rspin_unlock(&d->page_alloc_lock);
> +}
> +
> +static struct page_info *get_stashed_allocation(struct domain *d,
> +                                                unsigned int order,
> +                                                nodeid_t node,
> +                                                unsigned int *scrub_index)
> +{

... into this function?

Jan



* Re: [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
  2026-01-26 11:14   ` Jan Beulich
@ 2026-01-27 10:40     ` Roger Pau Monné
  2026-01-27 11:06       ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Pau Monné @ 2026-01-27 10:40 UTC
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
> On 22.01.2026 18:38, Roger Pau Monne wrote:
> > Physmap population has the need to use pages as big as possible to reduce
> > p2m shattering.  However that triggers issues when big enough pages are not
> > yet scrubbed, and so scrubbing must be done at allocation time.  On some
> > scenarios with added contention the watchdog can trigger:
> > 
> > Watchdog timer detects that CPU55 is stuck!
> > ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
> > CPU:    55
> > RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> > RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
> > [...]
> > Xen call trace:
> >    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
> >    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
> >    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
> >    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
> >    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
> >    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> > 
> > Introduce a mechanism to preempt page scrubbing in populate_physmap().  It
> > relies on stashing the dirty page in the domain struct temporarily to
> > preempt to guest context, so the scrubbing can resume when the domain
> > re-enters the hypercall.  The added deferral mechanism will only be used for
> > domain construction, and is designed to be used with a single threaded
> > domain builder.  If the toolstack makes concurrent calls to
> > XENMEM_populate_physmap for the same target domain it will trash stashed
> > pages, resulting in slow domain physmap population.
> > 
> > Note a similar issue is present in increase reservation.  However that
> > hypercall is likely to only be used once the domain is already running and
> > the known implementations use 4K pages. It will be deal with in a separate
> > patch using a different approach, that will also take care of the
> > allocation in populate_physmap() once the domain is running.
> > 
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Changes since v2:
> >  - Introduce FREE_DOMHEAP_PAGE{,S}().
> >  - Remove j local counter.
> >  - Free page pending scrub in domain_kill() also.
> 
> Yet still not right in domain_unpause_by_systemcontroller() as well. I.e. a
> toolstack action is still needed after the crash to make the memory usable
> again. If you made ...

Oh, I've misread your previous reply and it seemed to me your
preference was to do it in domain_kill().

> > @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
> >          rspin_barrier(&d->domain_lock);
> >          argo_destroy(d);
> >          vnuma_destroy(d->vnuma);
> > +        /*
> > +         * Attempt to free any pages pending scrub early.  Toolstack can still
> > +         * trigger populate_physmap() operations at this point, and hence a
> > +         * final cleanup must be done in _domain_destroy().
> > +         */
> > +        rspin_lock(&d->page_alloc_lock);
> > +        if ( d->pending_scrub )
> > +        {
> > +            FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
> > +            d->pending_scrub_order = 0;
> > +            d->pending_scrub_index = 0;
> > +        }
> > +        rspin_unlock(&d->page_alloc_lock);
> 
> ... this into a small helper function (usable even from _domain_destroy(),
> as locking being used doesn't matter there), it would have negligible
> footprint there.
> 
> As to the comment, not being a native speaker it still feels to me as if
> moving "early" earlier (after "free") might help parsing of the 1st sentence.

I could also drop "early" completely from the sentence.  I've moved
the comment to the top of the newly introduced helper and reworded it
as:

/*
 * Called multiple times during domain destruction, to attempt to early free
 * any stashed pages to be scrubbed.  The call from _domain_destroy() is done
 * when the toolstack can no longer stash any pages.
 */

Let me know if that's OK.

> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
> >      a->nr_done = i;
> >  }
> >  
> > +/*
> > + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> > + * Stored pages must be domheap ones.
> > + *
> > + * The stashed page can be freed at any time by Xen, the caller must pass the
> > + * order and NUMA node requirement to the fetch function to ensure the
> > + * currently stashed page matches it's requirements.
> > + */
> > +static void stash_allocation(struct domain *d, struct page_info *page,
> > +                             unsigned int order, unsigned int scrub_index)
> > +{
> > +    rspin_lock(&d->page_alloc_lock);
> > +
> > +    /*
> > +     * Drop any stashed allocation to accommodated the current one.  This
> > +     * interface is designed to be used for single-threaded domain creation.
> > +     */
> > +    if ( d->pending_scrub )
> > +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> 
> Didn't you indicate you'd move the freeing ...
> 
> > +    d->pending_scrub_index = scrub_index;
> > +    d->pending_scrub_order = order;
> > +    d->pending_scrub = page;
> > +
> > +    rspin_unlock(&d->page_alloc_lock);
> > +}
> > +
> > +static struct page_info *get_stashed_allocation(struct domain *d,
> > +                                                unsigned int order,
> > +                                                nodeid_t node,
> > +                                                unsigned int *scrub_index)
> > +{
> 
> ... into this function?

I could add freeing to get_stashed_allocation(), but it seems
pointless, because the freeing in stash_allocation() will have to stay
to deal with concurrent callers.  Even if a context frees the stashed
page in get_stashed_allocation() there's no guarantee the field will
still be free when stash_allocation() is called, as another concurrent
thread might have stashed a page in the meantime.

I think it's best to consistently do it only in stash_allocation(), as
that's clearer.
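
As an illustration only (not code from the patch), the single-slot behaviour
can be modelled like this.  All names are hypothetical, the NUMA node check
is left out, and the locking is omitted: each function body is assumed to run
with the equivalent of d->page_alloc_lock held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an allocation; freed records a simulated free. */
struct model_alloc {
    unsigned int order;
    int freed;
};

/* Hypothetical model of the per-domain stash: a single slot. */
struct model_domain {
    struct model_alloc *pending;
};

static void model_stash(struct model_domain *d, struct model_alloc *a)
{
    assert(a != NULL);

    /*
     * The slot can have been refilled by another context since this caller
     * last looked at it, so the drop has to happen here.
     */
    if ( d->pending )
        d->pending->freed = 1; /* Stands in for free_domheap_pages(). */

    d->pending = a;
}

static struct model_alloc *model_get(struct model_domain *d,
                                     unsigned int order)
{
    struct model_alloc *a = NULL;

    /* A non-matching allocation stays stashed and NULL is returned. */
    if ( d->pending && d->pending->order == order )
    {
        a = d->pending;
        d->pending = NULL; /* Ownership moves to the caller. */
    }

    return a;
}
```

With two contexts interleaving, one can find the slot empty in model_get(),
the other can then stash, and the first still has to cope with an occupied
slot when it stashes in turn.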

Thanks, Roger.



* Re: [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
  2026-01-27 10:40     ` Roger Pau Monné
@ 2026-01-27 11:06       ` Jan Beulich
  2026-01-27 15:01         ` Roger Pau Monné
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2026-01-27 11:06 UTC
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 27.01.2026 11:40, Roger Pau Monné wrote:
> On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
>> On 22.01.2026 18:38, Roger Pau Monne wrote:
>>> Physmap population has the need to use pages as big as possible to reduce
>>> p2m shattering.  However that triggers issues when big enough pages are not
>>> yet scrubbed, and so scrubbing must be done at allocation time.  On some
>>> scenarios with added contention the watchdog can trigger:
>>>
>>> Watchdog timer detects that CPU55 is stuck!
>>> ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
>>> CPU:    55
>>> RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
>>> RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
>>> [...]
>>> Xen call trace:
>>>    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>>>    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>>>    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>>>    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>>>    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>>>    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
>>>
>>> Introduce a mechanism to preempt page scrubbing in populate_physmap().  It
>>> relies on stashing the dirty page in the domain struct temporarily to
>>> preempt to guest context, so the scrubbing can resume when the domain
>>> re-enters the hypercall.  The added deferral mechanism will only be used for
>>> domain construction, and is designed to be used with a single threaded
>>> domain builder.  If the toolstack makes concurrent calls to
>>> XENMEM_populate_physmap for the same target domain it will trash stashed
>>> pages, resulting in slow domain physmap population.
>>>
>>> Note a similar issue is present in increase reservation.  However that
>>> hypercall is likely to only be used once the domain is already running and
>>> the known implementations use 4K pages. It will be deal with in a separate
>>> patch using a different approach, that will also take care of the
>>> allocation in populate_physmap() once the domain is running.
>>>
>>> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> Changes since v2:
>>>  - Introduce FREE_DOMHEAP_PAGE{,S}().
>>>  - Remove j local counter.
>>>  - Free page pending scrub in domain_kill() also.
>>
>> Yet still not right in domain_unpause_by_systemcontroller() as well. I.e. a
>> toolstack action is still needed after the crash to make the memory usable
>> again. If you made ...
> 
> Oh, I've misread your previous reply and it seemed to me your
> preference was to do it in domain_kill().

I meant to (possibly) have it kept there and be done yet earlier as well.

>>> @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
>>>          rspin_barrier(&d->domain_lock);
>>>          argo_destroy(d);
>>>          vnuma_destroy(d->vnuma);
>>> +        /*
>>> +         * Attempt to free any pages pending scrub early.  Toolstack can still
>>> +         * trigger populate_physmap() operations at this point, and hence a
>>> +         * final cleanup must be done in _domain_destroy().
>>> +         */
>>> +        rspin_lock(&d->page_alloc_lock);
>>> +        if ( d->pending_scrub )
>>> +        {
>>> +            FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
>>> +            d->pending_scrub_order = 0;
>>> +            d->pending_scrub_index = 0;
>>> +        }
>>> +        rspin_unlock(&d->page_alloc_lock);
>>
>> ... this into a small helper function (usable even from _domain_destroy(),
>> as locking being used doesn't matter there), it would have negligible
>> footprint there.
>>
>> As to the comment, not being a native speaker it still feels to me as if
>> moving "early" earlier (after "free") might help parsing of the 1st sentence.
> 
> I could also drop "early" completely from the sentence.  I've moved
> the comment at the top of the newly introduced helper and reworded it
> as:
> 
> /*
>  * Called multiple times during domain destruction, to attempt to early free
>  * any stashed pages to be scrubbed.  The call from _domain_destroy() is done
>  * when the toolstack can no longer stash any pages.
>  */
> 
> Let me know if that's OK.

Fine with me.
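
For illustration only, a minimal user-space sketch of such a helper.  The
names model_domain, MODEL_FREE_PAGES() and model_free_pending_scrub() are
hypothetical stand-ins, not the code from the series, and the macro is
assumed to free the pages and clear the pointer in the way the
FREE_DOMHEAP_PAGES() use in the quoted hunk suggests.  Locking is left out;
in Xen the body would run under d->page_alloc_lock.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the relevant fields of struct domain. */
struct model_domain {
    void *pending_scrub;              /* stashed, partially scrubbed page */
    unsigned int pending_scrub_order;
    unsigned int pending_scrub_index;
    unsigned int nr_frees;            /* model only: counts actual frees */
};

/* Assumption: free the allocation, then clear the pointer. */
#define MODEL_FREE_PAGES(d, p) do { \
    free(p);                        \
    (p) = NULL;                     \
    (d)->nr_frees++;                \
} while ( 0 )

/*
 * Safe to call repeatedly: only the first call after a stash frees
 * anything, later calls find the slot empty and do nothing.
 */
static void model_free_pending_scrub(struct model_domain *d)
{
    if ( d->pending_scrub )
    {
        MODEL_FREE_PAGES(d, d->pending_scrub);
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }
}
```

Because the helper is idempotent, calling it from several points during
domain destruction costs little beyond a pointer check.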

>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
>>>      a->nr_done = i;
>>>  }
>>>  
>>> +/*
>>> + * Temporary storage for a domain assigned page that's not been fully scrubbed.
>>> + * Stored pages must be domheap ones.
>>> + *
>>> + * The stashed page can be freed at any time by Xen, the caller must pass the
>>> + * order and NUMA node requirement to the fetch function to ensure the
>>> + * currently stashed page matches its requirements.
>>> + */
>>> +static void stash_allocation(struct domain *d, struct page_info *page,
>>> +                             unsigned int order, unsigned int scrub_index)
>>> +{
>>> +    rspin_lock(&d->page_alloc_lock);
>>> +
>>> +    /*
>>> +     * Drop any stashed allocation to accommodate the current one.  This
>>> +     * interface is designed to be used for single-threaded domain creation.
>>> +     */
>>> +    if ( d->pending_scrub )
>>> +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
>>
>> Didn't you indicate you'd move the freeing ...
>>
>>> +    d->pending_scrub_index = scrub_index;
>>> +    d->pending_scrub_order = order;
>>> +    d->pending_scrub = page;
>>> +
>>> +    rspin_unlock(&d->page_alloc_lock);
>>> +}
>>> +
>>> +static struct page_info *get_stashed_allocation(struct domain *d,
>>> +                                                unsigned int order,
>>> +                                                nodeid_t node,
>>> +                                                unsigned int *scrub_index)
>>> +{
>>
>> ... into this function?
> 
> I could add freeing to get_stashed_allocation(), but it seems
> pointless, because the freeing in stash_allocation() will have to stay
> to deal with concurrent callers.  Even if a context frees the stashed
> page in get_stashed_allocation() there's no guarantee the field will
> still be free when stash_allocation() is called, as another concurrent
> thread might have stashed a page in the meantime.

Hmm, yes, yet still ...

> I think it's best to consistently do it only in stash_allocation(), as
> that's clearer.

... no, as (to me) "clearer" is only a secondary criterion here. What I'm
worried about is potentially holding back a 1Gb page when the new request is,
say, a 2Mb one, and then not having enough memory available just because
of that detained huge page.

In fact, if stash_allocation() finds the field re-populated despite
get_stashed_allocation() having cleared it, it's not quite clear which
of the two allocations should actually be undone. The other vCPU may be
quicker in retrying, and to avoid ping-pong freeing the new (local)
allocation rather than stashing it might possibly be better. Thoughts?

Jan



* Re: [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
  2026-01-27 11:06       ` Jan Beulich
@ 2026-01-27 15:01         ` Roger Pau Monné
  2026-01-27 15:49           ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Pau Monné @ 2026-01-27 15:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Tue, Jan 27, 2026 at 12:06:32PM +0100, Jan Beulich wrote:
> On 27.01.2026 11:40, Roger Pau Monné wrote:
> > On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
> >> On 22.01.2026 18:38, Roger Pau Monne wrote:
> >>> --- a/xen/common/memory.c
> >>> +++ b/xen/common/memory.c
> >>> @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
> >>>      a->nr_done = i;
> >>>  }
> >>>  
> >>> +/*
> >>> + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> >>> + * Stored pages must be domheap ones.
> >>> + *
> >>> + * The stashed page can be freed at any time by Xen, the caller must pass the
> >>> + * order and NUMA node requirement to the fetch function to ensure the
> >>> + * currently stashed page matches its requirements.
> >>> + */
> >>> +static void stash_allocation(struct domain *d, struct page_info *page,
> >>> +                             unsigned int order, unsigned int scrub_index)
> >>> +{
> >>> +    rspin_lock(&d->page_alloc_lock);
> >>> +
> >>> +    /*
> >>> +     * Drop any stashed allocation to accommodate the current one.  This
> >>> +     * interface is designed to be used for single-threaded domain creation.
> >>> +     */
> >>> +    if ( d->pending_scrub )
> >>> +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> >>
> >> Didn't you indicate you'd move the freeing ...
> >>
> >>> +    d->pending_scrub_index = scrub_index;
> >>> +    d->pending_scrub_order = order;
> >>> +    d->pending_scrub = page;
> >>> +
> >>> +    rspin_unlock(&d->page_alloc_lock);
> >>> +}
> >>> +
> >>> +static struct page_info *get_stashed_allocation(struct domain *d,
> >>> +                                                unsigned int order,
> >>> +                                                nodeid_t node,
> >>> +                                                unsigned int *scrub_index)
> >>> +{
> >>
> >> ... into this function?
> > 
> > I could add freeing to get_stashed_allocation(), but it seems
> > pointless, because the freeing in stash_allocation() will have to stay
> > to deal with concurrent callers.  Even if a context frees the stashed
> > page in get_stashed_allocation() there's no guarantee the field will
> > still be free when stash_allocation() is called, as another concurrent
> > thread might have stashed a page in the meantime.
> 
> Hmm, yes, yet still ...
> 
> > I think it's best to consistently do it only in stash_allocation(), as
> > that's clearer.
> 
> ... no, as (to me) "clearer" is only a secondary criterion here. What I'm
> worried about is potentially holding back a 1Gb page when the new request is,
> say, a 2Mb one, and then not having enough memory available just because
> of that detained huge page.

If that's really the case then either the caller is using a broken
toolstack that's making bogus populate physmap calls, or the caller is
attempting to populate the physmap in parallel and hasn't properly
checked whether there's enough free memory in the system.  In the
latter case the physmap population would end up failing anyway.

> In fact, if stash_allocation() finds the field re-populated despite
> get_stashed_allocation() having cleared it, it's not quite clear which
> of the two allocations should actually be undone. The other vCPU may be
> quicker in retrying, and to avoid ping-pong freeing the new (local)
> allocation rather than stashing it might possibly be better. Thoughts?

TBH I didn't give it much thought, as in any case progression when
attempting to populate the physmap in parallel will be far from
optimal.  If you prefer I can switch to the approach where the freeing
of the stashed page is done in get_stashed_allocation() and
stash_allocation() instead frees the current one if it finds the field
is already in use.
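
A rough user-space model of that alternative, for illustration only.  The
model_* names and the nr_freed counter are hypothetical, the NUMA match is
simplified to an equality check on a node field, and locking is omitted
(in Xen both functions would take d->page_alloc_lock):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct page_info and struct domain. */
struct model_page {
    unsigned int node;                /* NUMA node the page lives on */
};

struct model_domain {
    struct model_page *pending_scrub; /* stashed, partially scrubbed page */
    unsigned int pending_scrub_order;
    unsigned int pending_scrub_index; /* where scrubbing should resume */
    unsigned int nr_freed;            /* model only: counts freed pages */
};

static void model_free(struct model_domain *d, struct model_page *pg)
{
    free(pg);
    d->nr_freed++;
}

/*
 * Fetch side: always empties the slot.  A stashed page that does not match
 * the request is freed right away rather than being held back.
 */
static struct model_page *model_get_stashed(struct model_domain *d,
                                            unsigned int order,
                                            unsigned int node,
                                            unsigned int *scrub_index)
{
    struct model_page *pg = d->pending_scrub;
    unsigned int stashed_order = d->pending_scrub_order;
    unsigned int stashed_index = d->pending_scrub_index;

    assert(scrub_index != NULL);
    if ( !pg )
        return NULL;

    d->pending_scrub = NULL;
    d->pending_scrub_order = 0;
    d->pending_scrub_index = 0;

    if ( stashed_order == order && pg->node == node )
    {
        *scrub_index = stashed_index;
        return pg;
    }

    model_free(d, pg);
    return NULL;
}

/*
 * Stash side: if another caller filled the slot in the meantime, free the
 * local page instead of evicting theirs.  Returns 1 if the page was stashed.
 */
static int model_stash(struct model_domain *d, struct model_page *pg,
                       unsigned int order, unsigned int scrub_index)
{
    if ( d->pending_scrub )
    {
        model_free(d, pg);
        return 0;
    }

    d->pending_scrub = pg;
    d->pending_scrub_order = order;
    d->pending_scrub_index = scrub_index;
    return 1;
}
```

With this split a mismatching request (say order 9 while an order 18 page
is stashed) releases the large page on fetch, and a losing concurrent
stasher gives up its own allocation instead of ping-ponging the slot.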

Thanks, Roger.



* Re: [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages
  2026-01-27 15:01         ` Roger Pau Monné
@ 2026-01-27 15:49           ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2026-01-27 15:49 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 27.01.2026 16:01, Roger Pau Monné wrote:
> On Tue, Jan 27, 2026 at 12:06:32PM +0100, Jan Beulich wrote:
>> On 27.01.2026 11:40, Roger Pau Monné wrote:
>>> On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
>>>> On 22.01.2026 18:38, Roger Pau Monne wrote:
>>>>> --- a/xen/common/memory.c
>>>>> +++ b/xen/common/memory.c
>>>>> @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
>>>>>      a->nr_done = i;
>>>>>  }
>>>>>  
>>>>> +/*
>>>>> + * Temporary storage for a domain assigned page that's not been fully scrubbed.
>>>>> + * Stored pages must be domheap ones.
>>>>> + *
>>>>> + * The stashed page can be freed at any time by Xen, the caller must pass the
>>>>> + * order and NUMA node requirement to the fetch function to ensure the
>>>>> + * currently stashed page matches its requirements.
>>>>> + */
>>>>> +static void stash_allocation(struct domain *d, struct page_info *page,
>>>>> +                             unsigned int order, unsigned int scrub_index)
>>>>> +{
>>>>> +    rspin_lock(&d->page_alloc_lock);
>>>>> +
>>>>> +    /*
>>>>> +     * Drop any stashed allocation to accommodate the current one.  This
>>>>> +     * interface is designed to be used for single-threaded domain creation.
>>>>> +     */
>>>>> +    if ( d->pending_scrub )
>>>>> +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
>>>>
>>>> Didn't you indicate you'd move the freeing ...
>>>>
>>>>> +    d->pending_scrub_index = scrub_index;
>>>>> +    d->pending_scrub_order = order;
>>>>> +    d->pending_scrub = page;
>>>>> +
>>>>> +    rspin_unlock(&d->page_alloc_lock);
>>>>> +}
>>>>> +
>>>>> +static struct page_info *get_stashed_allocation(struct domain *d,
>>>>> +                                                unsigned int order,
>>>>> +                                                nodeid_t node,
>>>>> +                                                unsigned int *scrub_index)
>>>>> +{
>>>>
>>>> ... into this function?
>>>
>>> I could add freeing to get_stashed_allocation(), but it seems
>>> pointless, because the freeing in stash_allocation() will have to stay
>>> to deal with concurrent callers.  Even if a context frees the stashed
>>> page in get_stashed_allocation() there's no guarantee the field will
>>> still be free when stash_allocation() is called, as another concurrent
>>> thread might have stashed a page in the meantime.
>>
>> Hmm, yes, yet still ...
>>
>>> I think it's best to consistently do it only in stash_allocation(), as
>>> that's clearer.
>>
>> ... no, as (to me) "clearer" is only a secondary criterion here. What I'm
>> worried about is potentially holding back a 1Gb page when the new request is,
>> say, a 2Mb one, and then not having enough memory available just because
>> of that detained huge page.
> 
> If that's really the case then either the caller is using a broken
> toolstack that's making bogus populate physmap calls, or the caller is
> attempting to populate the physmap in parallel and hasn't properly
> checked whether there's enough free memory in the system.  In the
> latter case the physmap population would end up failing anyway.
> 
>> In fact, if stash_allocation() finds the field re-populated despite
>> get_stashed_allocation() having cleared it, it's not quite clear which
>> of the two allocations should actually be undone. The other vCPU may be
>> quicker in retrying, and to avoid ping-pong freeing the new (local)
>> allocation rather than stashing it might possibly be better. Thoughts?
> 
> TBH I didn't give it much thought, as in any case progression when
> attempting to populate the physmap in parallel will be far from
> optimal.  If you prefer I can switch to the approach where the freeing
> of the stashed page is done in get_stashed_allocation() and
> stash_allocation() instead frees the current one if it finds the field
> is already in use.

I'd prefer that, yes. Of course if others were to agree with your take ...

Jan



* [PATCH v3 3/3] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-22 17:38 [PATCH v3 0/3] xen/mm: limit in-place scrubbing Roger Pau Monne
  2026-01-22 17:38 ` [PATCH v3 1/3] xen/mm: enforce SCRUB_DEBUG checks for MEMF_no_scrub allocations Roger Pau Monne
  2026-01-22 17:38 ` [PATCH v3 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages Roger Pau Monne
@ 2026-01-22 17:38 ` Roger Pau Monne
  2026-01-26 11:21   ` Jan Beulich
  2 siblings, 1 reply; 11+ messages in thread
From: Roger Pau Monne @ 2026-01-22 17:38 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

The current logic allows for up to 1G pages to be scrubbed in place, which
can cause the watchdog to trigger in practice.  Reduce the limit for
in-place scrubbed allocations to a newly introduced define:
CONFIG_DIRTY_MAX_ORDER.  This currently defaults to CONFIG_PTDOM_MAX_ORDER
on all architectures.  Also introduce a command line option to set the
value.

Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
 - Move placement of the max-order-dirty option help.
 - Add note in memop-max-order about interactions.
 - Use CONFIG_PTDOM_MAX_ORDER as the default.

Changes since v1:
 - Split from previous patch.
 - Introduce a command line option to set the limit.
---
 docs/misc/xen-command-line.pandoc | 13 +++++++++++++
 xen/common/memory.c               |  3 ---
 xen/common/page_alloc.c           | 23 ++++++++++++++++++++++-
 xen/include/xen/mm.h              |  4 ++++
 4 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 15f7a315a4b5..3577e491e379 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1837,6 +1837,16 @@ presented as the number of bits needed to encode it. This must be at least
 one pending bit to be allocated.
 Defaults to 20 bits (to cover at most 1048576 interrupts).
 
+### max-order-dirty
+> `= <integer>`
+
+Specify the maximum allocation order allowed when scrubbing allocated pages
+in-place.  The allocation is non-preemptive, and hence the value must be kept
+low enough to avoid hogging the CPU for too long.
+
+Defaults to `CONFIG_DIRTY_MAX_ORDER` or if unset to `CONFIG_PTDOM_MAX_ORDER`.
+Note those are internal per-architecture defines not available from Kconfig.
+
 ### mce (x86)
 > `= <boolean>`
 
@@ -1878,6 +1888,9 @@ requests issued by the various kinds of domains (in this order:
 ordinary DomU, control domain, hardware domain, and - when supported
 by the platform - DomU with pass-through device assigned).
 
+Note orders here can be further limited by the value in `max-order-dirty` for
+allocations requesting pages to be scrubbed in-place.
+
 ### mmcfg (x86)
 > `= <boolean>[,amd-fam10]`
 
diff --git a/xen/common/memory.c b/xen/common/memory.c
index db20da1bcaaa..cf63bd077d42 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -56,9 +56,6 @@ struct memop_args {
 #ifndef CONFIG_CTLDOM_MAX_ORDER
 #define CONFIG_CTLDOM_MAX_ORDER CONFIG_PAGEALLOC_MAX_ORDER
 #endif
-#ifndef CONFIG_PTDOM_MAX_ORDER
-#define CONFIG_PTDOM_MAX_ORDER CONFIG_HWDOM_MAX_ORDER
-#endif
 
 static unsigned int __read_mostly domu_max_order = CONFIG_DOMU_MAX_ORDER;
 static unsigned int __read_mostly ctldom_max_order = CONFIG_CTLDOM_MAX_ORDER;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index c9e82fd7ab62..d2d5e4762d59 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -267,6 +267,13 @@ static PAGE_LIST_HEAD(page_offlined_list);
 /* Broken page list, protected by heap_lock. */
 static PAGE_LIST_HEAD(page_broken_list);
 
+/* Maximum order allowed for allocations with MEMF_no_scrub. */
+#ifndef CONFIG_DIRTY_MAX_ORDER
+# define CONFIG_DIRTY_MAX_ORDER CONFIG_PTDOM_MAX_ORDER
+#endif
+static unsigned int __ro_after_init dirty_max_order = CONFIG_DIRTY_MAX_ORDER;
+integer_param("max-order-dirty", dirty_max_order);
+
 /*************************
  * BOOT-TIME ALLOCATOR
  */
@@ -1008,7 +1015,13 @@ static struct page_info *alloc_heap_pages(
 
     pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
     /* Try getting a dirty buddy if we couldn't get a clean one. */
-    if ( !pg && !(memflags & MEMF_no_scrub) )
+    if ( !pg && !(memflags & MEMF_no_scrub) &&
+         /*
+          * Allow any order unscrubbed allocations during boot time, we
+          * compensate by processing softirqs in the scrubbing loop below once
+          * irqs are enabled.
+          */
+         (order <= dirty_max_order || system_state < SYS_STATE_active) )
         pg = get_free_buddy(zone_lo, zone_hi, order,
                             memflags | MEMF_no_scrub, d);
     if ( !pg )
@@ -1117,6 +1130,14 @@ static struct page_info *alloc_heap_pages(
                     scrub_one_page(&pg[i], cold);
 
                 dirty_cnt++;
+
+                /*
+                 * Use SYS_STATE_smp_boot explicitly; ahead of that state
+                 * interrupts are disabled.
+                 */
+                if ( system_state == SYS_STATE_smp_boot &&
+                     !(dirty_cnt & 0xff) )
+                    process_pending_softirqs();
             }
             else
                 check_one_page(&pg[i]);
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index d80bfba6d393..cf3796d4286d 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -232,6 +232,10 @@ struct npfec {
 #else
 #define MAX_ORDER 20 /* 2^20 contiguous pages */
 #endif
+#ifndef CONFIG_PTDOM_MAX_ORDER
+# define CONFIG_PTDOM_MAX_ORDER CONFIG_HWDOM_MAX_ORDER
+#endif
+
 mfn_t acquire_reserved_page(struct domain *d, unsigned int memflags);
 
 /* Private domain structs for DOMID_XEN, DOMID_IO, etc. */
-- 
2.51.0




* Re: [PATCH v3 3/3] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-22 17:38 ` [PATCH v3 3/3] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
@ 2026-01-26 11:21   ` Jan Beulich
  2026-01-27 10:45     ` Roger Pau Monné
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2026-01-26 11:21 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On 22.01.2026 18:38, Roger Pau Monne wrote:
> The current logic allows for up to 1G pages to be scrubbed in place, which
> can cause the watchdog to trigger in practice.  Reduce the limit for
> in-place scrubbed allocations to a newly introduced define:
> CONFIG_DIRTY_MAX_ORDER.  This currently defaults to CONFIG_PTDOM_MAX_ORDER
> on all architectures.  Also introduce a command line option to set the
> value.
> 
> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Apart from a nit (see below) looks technically okay to me now. Still I have
an uneasy feeling about introducing such a restriction, so I'm (still)
hesitant to ack the change.

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -267,6 +267,13 @@ static PAGE_LIST_HEAD(page_offlined_list);
>  /* Broken page list, protected by heap_lock. */
>  static PAGE_LIST_HEAD(page_broken_list);
>  
> +/* Maximum order allowed for allocations with MEMF_no_scrub. */
> +#ifndef CONFIG_DIRTY_MAX_ORDER
> +# define CONFIG_DIRTY_MAX_ORDER CONFIG_PTDOM_MAX_ORDER
> +#endif
> +static unsigned int __ro_after_init dirty_max_order = CONFIG_DIRTY_MAX_ORDER;
> +integer_param("max-order-dirty", dirty_max_order);

The comment may want to mention "post-boot", to account for ...

> @@ -1008,7 +1015,13 @@ static struct page_info *alloc_heap_pages(
>  
>      pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
>      /* Try getting a dirty buddy if we couldn't get a clean one. */
> -    if ( !pg && !(memflags & MEMF_no_scrub) )
> +    if ( !pg && !(memflags & MEMF_no_scrub) &&
> +         /*
> +          * Allow any order unscrubbed allocations during boot time, we
> +          * compensate by processing softirqs in the scrubbing loop below once
> +          * irqs are enabled.
> +          */
> +         (order <= dirty_max_order || system_state < SYS_STATE_active) )

... the system_state check here.

Jan



* Re: [PATCH v3 3/3] xen/mm: limit non-scrubbed allocations to a specific order
  2026-01-26 11:21   ` Jan Beulich
@ 2026-01-27 10:45     ` Roger Pau Monné
  0 siblings, 0 replies; 11+ messages in thread
From: Roger Pau Monné @ 2026-01-27 10:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, Jan 26, 2026 at 12:21:17PM +0100, Jan Beulich wrote:
> On 22.01.2026 18:38, Roger Pau Monne wrote:
> > The current logic allows for up to 1G pages to be scrubbed in place, which
> > can cause the watchdog to trigger in practice.  Reduce the limit for
> > in-place scrubbed allocations to a newly introduced define:
> > CONFIG_DIRTY_MAX_ORDER.  This currently defaults to CONFIG_PTDOM_MAX_ORDER
> > on all architectures.  Also introduce a command line option to set the
> > value.
> > 
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Apart from a nit (see below) looks technically okay to me now. Still I have
> an uneasy feeling about introducing such a restriction, so I'm (still)
> hesitant to ack the change.

OK, I understand that, and I'm not going to argue there's no risk.
Overall, even if this commit is not fully correct, it's a step in the
right direction IMO, we need to limit such allocations.  And for
callers that legitimately need bigger orders we will have to add
preemptive scrubbing like we do for populate physmap.

> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -267,6 +267,13 @@ static PAGE_LIST_HEAD(page_offlined_list);
> >  /* Broken page list, protected by heap_lock. */
> >  static PAGE_LIST_HEAD(page_broken_list);
> >  
> > +/* Maximum order allowed for allocations with MEMF_no_scrub. */
> > +#ifndef CONFIG_DIRTY_MAX_ORDER
> > +# define CONFIG_DIRTY_MAX_ORDER CONFIG_PTDOM_MAX_ORDER
> > +#endif
> > +static unsigned int __ro_after_init dirty_max_order = CONFIG_DIRTY_MAX_ORDER;
> > +integer_param("max-order-dirty", dirty_max_order);
> 
> The comment may want to mention "post-boot", to account for ...
> 
> > @@ -1008,7 +1015,13 @@ static struct page_info *alloc_heap_pages(
> >  
> >      pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
> >      /* Try getting a dirty buddy if we couldn't get a clean one. */
> > -    if ( !pg && !(memflags & MEMF_no_scrub) )
> > +    if ( !pg && !(memflags & MEMF_no_scrub) &&
> > +         /*
> > +          * Allow any order unscrubbed allocations during boot time, we
> > +          * compensate by processing softirqs in the scrubbing loop below once
> > +          * irqs are enabled.
> > +          */
> > +         (order <= dirty_max_order || system_state < SYS_STATE_active) )
> 
> ... the system_state check here.

Added.
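
For reference, the two checks quoted above can be modelled in isolation
roughly as follows.  This is a simplified sketch: the model_state enum only
mirrors the ordering of the states relied on here, the function names are
hypothetical, and the limit is passed as a parameter instead of being read
from dirty_max_order.

```c
#include <stdbool.h>

/* Simplified stand-in for Xen's system_state values, in boot order. */
enum model_state {
    MODEL_STATE_early_boot,
    MODEL_STATE_boot,
    MODEL_STATE_smp_boot,
    MODEL_STATE_active,
};

/*
 * May the allocator fall back to a dirty buddy that it then scrubs
 * in place?  Post-boot only up to max_order, any order during boot.
 */
static bool model_may_take_dirty_buddy(bool no_scrub, unsigned int order,
                                       unsigned int max_order,
                                       enum model_state state)
{
    return !no_scrub &&
           (order <= max_order || state < MODEL_STATE_active);
}

/*
 * During SMP boot, yield to softirqs once every 256 scrubbed pages.
 * Earlier states have interrupts disabled, so nothing is processed there.
 */
static bool model_should_process_softirqs(enum model_state state,
                                          unsigned int dirty_cnt)
{
    return state == MODEL_STATE_smp_boot && !(dirty_cnt & 0xff);
}
```

With a limit of 9, for example, an order 18 request is refused a dirty
buddy once the system is active but still allowed one during boot.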

