* [PATCH 0/2] xen/mm: limit in-place scrubbing
@ 2026-01-08 17:55 Roger Pau Monne
2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Roger Pau Monne @ 2026-01-08 17:55 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
Anthony PERARD, Jan Beulich
Hello,
In XenServer we have seen the watchdog occasionally triggering during
domain creation if 1GB pages are scrubbed in-place during physmap
population. The following series attempts to mitigate this by limiting
the in-place scrubbing during allocation to 2M pages, but it has some
drawbacks, see the post-commit remarks in patch 2.
I'm hoping someone might have a better idea, or that we converge on the
view that we can't do better than this for the time being.
Thanks, Roger.
Roger Pau Monne (2):
xen/mm: add a NUMA node parameter to scrub_free_pages()
xen/mm: limit non-scrubbed allocations to a specific order
xen/arch/arm/domain.c | 2 +-
xen/arch/x86/domain.c | 2 +-
xen/common/memory.c | 12 +++++++++
xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
xen/include/xen/mm.h | 12 ++++++++-
5 files changed, 74 insertions(+), 8 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
@ 2026-01-08 17:55 ` Roger Pau Monne
2026-01-09 10:22 ` Jan Beulich
2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
2 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monne @ 2026-01-08 17:55 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Stefano Stabellini, Julien Grall,
Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
Anthony PERARD, Jan Beulich

Such a parameter allows requesting to scrub memory only from the specified
node. If there's no memory to scrub from the requested node the function
returns false. If the node is already being scrubbed from a different CPU
the function returns true so the caller can differentiate whether there's
still pending work to do.

No functional change intended. Existing callers are switched to use the
new interface, albeit they all pass NUMA_NO_NODE to keep the current
behavior.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/arm/domain.c | 2 +-
xen/arch/x86/domain.c | 2 +-
xen/common/page_alloc.c | 17 ++++++++++++++---
xen/include/xen/mm.h | 3 ++-
4 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 47973f99d935..dff7554417ea 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -75,7 +75,7 @@ static void noreturn idle_loop(void)
          * and then, after it is done, whether softirqs became pending
          * while we were scrubbing.
          */
-        else if ( !softirq_pending(cpu) && !scrub_free_pages() &&
+        else if ( !softirq_pending(cpu) && !scrub_free_pages(NUMA_NO_NODE) &&
                   !softirq_pending(cpu) )
             do_idle();
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 7632d5e2d62d..276c485a204f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -166,7 +166,7 @@ static void noreturn cf_check idle_loop(void)
          * and then, after it is done, whether softirqs became pending
          * while we were scrubbing.
          */
-        else if ( !softirq_pending(cpu) && !scrub_free_pages() &&
+        else if ( !softirq_pending(cpu) && !scrub_free_pages(NUMA_NO_NODE) &&
                   !softirq_pending(cpu) )
         {
             if ( guest )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 2efc11ce095f..248c44df32b3 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
     }
 }
 
-bool scrub_free_pages(void)
+bool scrub_free_pages(nodeid_t node)
 {
     struct page_info *pg;
     unsigned int zone;
     unsigned int cpu = smp_processor_id();
     bool preempt = false;
-    nodeid_t node;
     unsigned int cnt = 0;
 
-    node = node_to_scrub(true);
+    if ( node != NUMA_NO_NODE )
+    {
+        if ( !node_need_scrub[node] )
+            /* Nothing to scrub. */
+            return false;
+
+        if ( node_test_and_set(node, node_scrubbing) )
+            /* Another CPU is scrubbing it. */
+            return true;
+    }
+    else
+        node = node_to_scrub(true);
+
     if ( node == NUMA_NO_NODE )
         return false;
 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 426362adb2f4..7067c9ec0405 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -65,6 +65,7 @@
 #include <xen/compiler.h>
 #include <xen/mm-frame.h>
 #include <xen/mm-types.h>
+#include <xen/numa.h>
 #include <xen/types.h>
 #include <xen/list.h>
 #include <xen/spinlock.h>
@@ -90,7 +91,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe);
 void xenheap_max_mfn(unsigned long mfn);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
-bool scrub_free_pages(void);
+bool scrub_free_pages(nodeid_t node);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
--
2.51.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
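As a rough illustration of the two early returns described in the commit
message of patch 1, the following stand-alone model may help. It is not Xen
code: model_need_scrub[], model_scrubbing[] and model_claim_node() are
made-up stand-ins for node_need_scrub[], the node_scrubbing mask and the
checks at the top of scrub_free_pages(), and the plain bool array is not
atomic the way node_test_and_set() is.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4

/* Pages still waiting to be scrubbed per node (models node_need_scrub[]). */
static unsigned long model_need_scrub[MODEL_MAX_NODES];
/* Set while some CPU is scrubbing a node (models the node_scrubbing mask). */
static bool model_scrubbing[MODEL_MAX_NODES];

/*
 * Models only the early-exit checks for a caller-specified node.  Returns
 * true when the caller may go on to scrub `node`.  Otherwise *ret holds the
 * value the function in the patch would return straight away.
 */
bool model_claim_node(unsigned int node, bool *ret)
{
    assert(node < MODEL_MAX_NODES);

    if ( !model_need_scrub[node] )
    {
        *ret = false;               /* Nothing to scrub. */
        return false;
    }

    if ( model_scrubbing[node] )
    {
        *ret = true;                /* Another CPU is scrubbing it. */
        return false;
    }

    /* Non-atomic stand-in for a successful test-and-set: node is now ours. */
    model_scrubbing[node] = true;
    return true;
}
```

A caller that only looks at *ret cannot tell "this CPU left work behind" from
"another CPU holds the node", since both report true.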
* Re: [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
@ 2026-01-09 10:22 ` Jan Beulich
2026-01-09 14:46 ` Roger Pau Monné
0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 10:22 UTC (permalink / raw)
To: Roger Pau Monne
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 08.01.2026 18:55, Roger Pau Monne wrote:
> Such parameter allow requesting to scrub memory only from the specified
> node. If there's no memory to scrub from the requested node the function
> returns false. If the node is already being scrubbed from a different CPU
> the function returns true so the caller can differentiate whether there's
> still pending work to do.

I'm really trying to understand both patches together, and peeking ahead I
don't understand the above, which looks to describe ...

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
>      }
>  }
>
> -bool scrub_free_pages(void)
> +bool scrub_free_pages(nodeid_t node)
>  {
>      struct page_info *pg;
>      unsigned int zone;
>      unsigned int cpu = smp_processor_id();
>      bool preempt = false;
> -    nodeid_t node;
>      unsigned int cnt = 0;
>
> -    node = node_to_scrub(true);
> +    if ( node != NUMA_NO_NODE )
> +    {
> +        if ( !node_need_scrub[node] )
> +            /* Nothing to scrub. */
> +            return false;
> +
> +        if ( node_test_and_set(node, node_scrubbing) )
> +            /* Another CPU is scrubbing it. */
> +            return true;

... these two return-s. My problem being that patch 2 doesn't use the
return value (while existing callers don't take this path). Is this then
"just in case" for now (and making the meaning of the return values
somewhat inconsistent for the function as a whole)?

Jan
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
2026-01-09 10:22 ` Jan Beulich
@ 2026-01-09 14:46 ` Roger Pau Monné
2026-01-09 14:50 ` Jan Beulich
0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-09 14:46 UTC (permalink / raw)
To: Jan Beulich
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On Fri, Jan 09, 2026 at 11:22:39AM +0100, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
> > Such parameter allow requesting to scrub memory only from the specified
> > node. If there's no memory to scrub from the requested node the function
> > returns false. If the node is already being scrubbed from a different CPU
> > the function returns true so the caller can differentiate whether there's
> > still pending work to do.
>
> I'm really trying to understand both patches together, and peeking ahead I
> don't understand the above, which looks to describe ...
>
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
> >      }
> >  }
> >
> > -bool scrub_free_pages(void)
> > +bool scrub_free_pages(nodeid_t node)
> >  {
> >      struct page_info *pg;
> >      unsigned int zone;
> >      unsigned int cpu = smp_processor_id();
> >      bool preempt = false;
> > -    nodeid_t node;
> >      unsigned int cnt = 0;
> >
> > -    node = node_to_scrub(true);
> > +    if ( node != NUMA_NO_NODE )
> > +    {
> > +        if ( !node_need_scrub[node] )
> > +            /* Nothing to scrub. */
> > +            return false;
> > +
> > +        if ( node_test_and_set(node, node_scrubbing) )
> > +            /* Another CPU is scrubbing it. */
> > +            return true;
>
> ... these two return-s. My problem being that patch 2 doesn't use the
> return value (while existing callers don't take this path). Is this then
> "just in case" for now (and making the meaning of the return values
> somewhat inconsistent for the function as a whole)?

I've added those so that the function return values are consistent,
even if not consumed right now, it would make no sense for the return
values to have different meaning when the node parameter is !=
NUMA_NO_NODE. Or at least that was my impression.

In fact an earlier version of patch 2 did consume those values. I've
moved to a different approach, but I think it's good to keep the
return values consistent regardless of the input parameters.

Thanks, Roger.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages()
2026-01-09 14:46 ` Roger Pau Monné
@ 2026-01-09 14:50 ` Jan Beulich
0 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 14:50 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 09.01.2026 15:46, Roger Pau Monné wrote:
> On Fri, Jan 09, 2026 at 11:22:39AM +0100, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> Such parameter allow requesting to scrub memory only from the specified
>>> node. If there's no memory to scrub from the requested node the function
>>> returns false. If the node is already being scrubbed from a different CPU
>>> the function returns true so the caller can differentiate whether there's
>>> still pending work to do.
>>
>> I'm really trying to understand both patches together, and peeking ahead I
>> don't understand the above, which looks to describe ...
>>
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -1339,16 +1339,27 @@ static void cf_check scrub_continue(void *data)
>>>      }
>>>  }
>>>
>>> -bool scrub_free_pages(void)
>>> +bool scrub_free_pages(nodeid_t node)
>>>  {
>>>      struct page_info *pg;
>>>      unsigned int zone;
>>>      unsigned int cpu = smp_processor_id();
>>>      bool preempt = false;
>>> -    nodeid_t node;
>>>      unsigned int cnt = 0;
>>>
>>> -    node = node_to_scrub(true);
>>> +    if ( node != NUMA_NO_NODE )
>>> +    {
>>> +        if ( !node_need_scrub[node] )
>>> +            /* Nothing to scrub. */
>>> +            return false;
>>> +
>>> +        if ( node_test_and_set(node, node_scrubbing) )
>>> +            /* Another CPU is scrubbing it. */
>>> +            return true;
>>
>> ... these two return-s. My problem being that patch 2 doesn't use the
>> return value (while existing callers don't take this path). Is this then
>> "just in case" for now (and making the meaning of the return values
>> somewhat inconsistent for the function as a whole)?
>
> I've added those so that the function return values are consistent,
> even if not consumed right now, it would make no sense for the return
> values to have different meaning when the node parameter is !=
> NUMA_NO_NODE. Or at least that was my impression.
>
> In fact an earlier version of patch 2 did consume those values. I've
> moved to a different approach, but I think it's good to keep the
> return values consistent regardless of the input parameters.

My point was though: The present "true" return doesn't mean "Another CPU
is scrubbing it." Instead it means "More work to do" aiui. That's similar
in a way, but not identical.

Jan
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
@ 2026-01-08 17:55 ` Roger Pau Monne
2026-01-09 11:19 ` Jan Beulich
2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
2 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monne @ 2026-01-08 17:55 UTC (permalink / raw)
To: xen-devel
Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
Jan Beulich, Julien Grall, Stefano Stabellini

The current model of falling back to allocate unscrubbed pages and scrub
them in place at allocation time risks triggering the watchdog:

Watchdog timer detects that CPU55 is stuck!
----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
CPU: 55
RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
[...]
Xen call trace:
   [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
   [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
   [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
   [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
   [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
   [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970

The maximum allocation order on x86 is limited to 18, that means allocating
and scrubbing possibly 1G worth of memory in 4K chunks.

Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
currently set to 2M chunks. However such limitation might cause
fragmentation in HVM p2m population during domain creation. To prevent
that introduce some extra logic in populate_physmap() that falls back to
preemptive page-scrubbing if the requested allocation cannot be fulfilled
and there's scrubbing work to do. This approach is less fair than the
current one, but allows preemptive page scrubbing in the context of
populate_physmap() to attempt to avoid unnecessary page-shattering.

Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I'm not particularly happy with this approach, as it doesn't guarantee
progress for the callers. IOW: a caller might do a lot of scrubbing, just
to get its pages stolen by a different concurrent thread doing
allocations. However I'm not sure there's a better solution than resorting
to 2M allocations if there's not enough free memory that is scrubbed.

I'm having trouble seeing where we could temporarily store page(s) allocated
that need to be scrubbed before being assigned to the domain, in a way that
can be used by continuations, and that would allow Xen to keep track of
them in case the operation is never finished. IOW: we would need to
account for cleanup of such temporary stash of pages in case the domain
never completes the hypercall, or is destroyed midway.

Otherwise we could add the option to switch back to scrubbing before
returning the pages to the free pool, but that's also problematic: the
current approach aims to scrub pages in the same NUMA node as the CPU
that's doing the scrubbing. If we scrub in the context of the domain
destruction hypercall there's no attempt to scrub pages in the local NUMA
node.
---
xen/common/memory.c | 12 ++++++++++++
xen/common/page_alloc.c | 37 +++++++++++++++++++++++++++++++++++--
xen/include/xen/mm.h | 9 +++++++++
3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 10becf7c1f4c..28b254e9d280 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
 
                 if ( unlikely(!page) )
                 {
+                    nodeid_t node = MEMF_get_node(a->memflags);
+
+                    if ( memory_scrub_pending(node) ||
+                         (node != NUMA_NO_NODE &&
+                          !(a->memflags & MEMF_exact_node) &&
+                          memory_scrub_pending(node = NUMA_NO_NODE)) )
+                    {
+                        scrub_free_pages(node);
+                        a->preempted = 1;
+                        goto out;
+                    }
+
                     gdprintk(XENLOG_INFO,
                              "Could not allocate order=%u extent: id=%d memflags=%#x (%u of %u)\n",
                              a->extent_order, d->domain_id, a->memflags,
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 248c44df32b3..d4dabc997c44 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -483,6 +483,20 @@ static heap_by_zone_and_order_t *_heap[MAX_NUMNODES];
 
 static unsigned long node_need_scrub[MAX_NUMNODES];
 
+bool memory_scrub_pending(nodeid_t node)
+{
+    nodeid_t i;
+
+    if ( node != NUMA_NO_NODE )
+        return node_need_scrub[node];
+
+    for_each_online_node ( i )
+        if ( node_need_scrub[i] )
+            return true;
+
+    return false;
+}
+
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;
 
@@ -1007,8 +1021,18 @@ static struct page_info *alloc_heap_pages(
     }
 
     pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
-    /* Try getting a dirty buddy if we couldn't get a clean one. */
-    if ( !pg && !(memflags & MEMF_no_scrub) )
+    /*
+     * Try getting a dirty buddy if we couldn't get a clean one. Limit the
+     * fallback to orders equal or below MAX_DIRTY_ORDER, as otherwise the
+     * non-preemptive scrubbing could trigger the watchdog.
+     */
+    if ( !pg && !(memflags & MEMF_no_scrub) &&
+         /*
+          * Allow any order unscrubbed allocations during boot time, we
+          * compensate by processing softirqs in the scrubbing loop below once
+          * irqs are enabled.
+          */
+         (order <= MAX_DIRTY_ORDER || system_state < SYS_STATE_active) )
         pg = get_free_buddy(zone_lo, zone_hi, order,
                             memflags | MEMF_no_scrub, d);
     if ( !pg )
@@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
         if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
         {
             if ( !(memflags & MEMF_no_scrub) )
+            {
                 scrub_one_page(&pg[i], cold);
+                /*
+                 * Use SYS_STATE_smp_boot explicitly; ahead of that state
+                 * interrupts are disabled.
+                 */
+                if ( system_state == SYS_STATE_smp_boot &&
+                     !(dirty_cnt & 0xff) )
+                    process_pending_softirqs();
+            }
 
             dirty_cnt++;
         }
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 7067c9ec0405..a37476a99f1b 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -92,6 +92,7 @@ void xenheap_max_mfn(unsigned long mfn);
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
 void free_xenheap_pages(void *v, unsigned int order);
 bool scrub_free_pages(nodeid_t node);
+bool memory_scrub_pending(nodeid_t node);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
@@ -223,6 +224,14 @@ struct npfec {
 #else
 #define MAX_ORDER 20 /* 2^20 contiguous pages */
 #endif
+
+/* Max order when scrubbing pages at allocation time. */
+#ifdef CONFIG_DOMU_MAX_ORDER
+# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
+#else
+# define MAX_DIRTY_ORDER 9
+#endif
+
 mfn_t acquire_reserved_page(struct domain *d, unsigned int memflags);
 
 /* Private domain structs for DOMID_XEN, DOMID_IO, etc. */
--
2.51.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
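To put numbers on the orders mentioned in patch 2, and to show the shape of
the new condition in alloc_heap_pages(), here is a small stand-alone sketch.
It assumes 4K pages and uses the patch's fallback value of 9 for
MAX_DIRTY_ORDER. The MODEL_* names and both helpers are illustrative only,
and `booting` merely stands in for "system_state < SYS_STATE_active".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT      12 /* 4K pages, as on x86 */
#define MODEL_MAX_DIRTY_ORDER 9  /* fallback value used by the patch */

/* Bytes covered by a single allocation of the given order. */
uint64_t model_order_bytes(unsigned int order)
{
    assert(order + MODEL_PAGE_SHIFT < 64);
    return (uint64_t)1 << (order + MODEL_PAGE_SHIFT);
}

/*
 * Models the extra condition applied once no clean buddy was found and the
 * caller did not pass MEMF_no_scrub: a dirty buddy (scrubbed in place,
 * without preemption) is only acceptable for small orders, or during boot.
 */
bool model_dirty_fallback_allowed(unsigned int order, bool booting)
{
    return order <= MODEL_MAX_DIRTY_ORDER || booting;
}
```

With these assumptions order 9 covers 2 MiB (512 pages) and order 18 covers
1 GiB (262144 pages), which is the in-place scrubbing run the watchdog trace
points at.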
* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
@ 2026-01-09 11:19 ` Jan Beulich
2026-01-13 14:01 ` Roger Pau Monné
0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 11:19 UTC (permalink / raw)
To: Roger Pau Monne
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, xen-devel

On 08.01.2026 18:55, Roger Pau Monne wrote:
> The current model of falling back to allocate unscrubbed pages and scrub
> them in place at allocation time risks triggering the watchdog:
>
> Watchdog timer detects that CPU55 is stuck!
> ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
> CPU: 55
> RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
> [...]
> Xen call trace:
>    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
>
> The maximum allocation order on x86 is limited to 18, that means allocating
> and scrubbing possibly 1G worth of memory in 4K chunks.
>
> Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
> currently set to 2M chunks. However such limitation might cause
> fragmentation in HVM p2m population during domain creation. To prevent
> that introduce some extra logic in populate_physmap() that fallback to
> preemptive page-scrubbing if the requested allocation cannot be fulfilled
> and there's scrubbing work to do. This approach is less fair than the
> current one, but allows preemptive page scrubbing in the context of
> populate_physmap() to attempt to ensure unnecessary page-shattering.
>
> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> I'm not particularly happy with this approach, as it doesn't guarantee
> progress for the callers. IOW: a caller might do a lot of scrubbing, just
> to get it's pages stolen by a different concurrent thread doing
> allocations. However I'm not sure there's a better solution than resorting
> to 2M allocations if there's not enough free memory that is scrubbed.
>
> I'm having trouble seeing where we could temporary store page(s) allocated
> that need to be scrubbed before being assigned to the domain, in a way that
> can be used by continuations, and that would allow Xen to keep track of
> them in case the operation is never finished. IOW: we would need to
> account for cleanup of such temporary stash of pages in case the domain
> never completes the hypercall, or is destroyed midway.

How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
indicate that state, with the actual page (and order plus scrub progress)
recorded in the target struct domain? Actually, maybe such an indicator
isn't needed at all: If the next invocation (continuation or not) finds
an in-progress allocation, it could simply use that rather than doing a
real allocation. (What to do if this isn't a continuation is less clear:
We could fail such requests [likely not an option unless we can reliably
tell original requests from continuations], or split the allocation if
the request is smaller, or free the allocation to then take the normal
path.) All of which of course only for "foreign" requests.

If the hypercall is never continued, we could refuse to unpause the
domain (with the allocation then freed normally when the domain gets
destroyed).

As another alternative, how about returning unscrubbed pages altogether
when it's during domain creation, requiring the tool stack to do the
scrubbing (potentially allowing it to skip some of it when pages are
fully initialized anyway, much like we do for Dom0 iirc)?

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
>
>                  if ( unlikely(!page) )
>                  {
> +                    nodeid_t node = MEMF_get_node(a->memflags);
> +
> +                    if ( memory_scrub_pending(node) ||
> +                         (node != NUMA_NO_NODE &&
> +                          !(a->memflags & MEMF_exact_node) &&
> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> +                    {
> +                        scrub_free_pages(node);
> +                        a->preempted = 1;
> +                        goto out;
> +                    }

At least for order 0 requests there's no point in trying this. With the
current logic, actually for orders up to MAX_DIRTY_ORDER.

Further, from a general interface perspective, wouldn't we need to do the
same for at least XENMEM_increase_reservation?

> @@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
>          if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
>          {
>              if ( !(memflags & MEMF_no_scrub) )
> +            {
>                  scrub_one_page(&pg[i], cold);
> +                /*
> +                 * Use SYS_STATE_smp_boot explicitly; ahead of that state
> +                 * interrupts are disabled.
> +                 */
> +                if ( system_state == SYS_STATE_smp_boot &&
> +                     !(dirty_cnt & 0xff) )
> +                    process_pending_softirqs();
> +            }
>
>              dirty_cnt++;
>          }

Yet an alternative consideration: When "cold" is true, couldn't we call
process_pending_softirqs() like you do here ( >= SYS_STATE_smp_boot then
of course), without any of the other changes? Of course that's worse
than a proper continuation, especially from the calling domain's pov.

> @@ -223,6 +224,14 @@ struct npfec {
>  #else
>  #define MAX_ORDER 20 /* 2^20 contiguous pages */
>  #endif
> +
> +/* Max order when scrubbing pages at allocation time. */
> +#ifdef CONFIG_DOMU_MAX_ORDER
> +# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
> +#else
> +# define MAX_DIRTY_ORDER 9
> +#endif

Using CONFIG_DOMU_MAX_ORDER rather than the command line overridable
domu_max_order means people couldn't even restore original behavior.

Jan
^ permalink raw reply [flat|nested] 17+ messages in thread
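The review remark about low orders can be pictured with the stand-alone
model below. It combines the node selection from the populate_physmap() hunk
with the suggested order check. This is only a sketch under assumptions:
pending_pages[], MODEL_NO_NODE and both functions are invented for
illustration, and the order test reflects the reviewer's suggestion rather
than anything in the posted patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES       4
#define MODEL_NO_NODE         0xffu /* stand-in for NUMA_NO_NODE */
#define MODEL_MAX_DIRTY_ORDER 9

/* Pages awaiting scrubbing per node (models node_need_scrub[]). */
static unsigned long pending_pages[MODEL_MAX_NODES];

/* Models memory_scrub_pending(): one node, or any node for MODEL_NO_NODE. */
bool model_scrub_pending(unsigned int node)
{
    unsigned int i;

    if ( node != MODEL_NO_NODE )
    {
        assert(node < MODEL_MAX_NODES);
        return pending_pages[node] != 0;
    }

    for ( i = 0; i < MODEL_MAX_NODES; i++ )
        if ( pending_pages[i] )
            return true;

    return false;
}

/*
 * Decide whether a failed allocation should scrub and request a
 * continuation.  On success *node names the node to scrub, which may have
 * been widened to MODEL_NO_NODE when the request is not node-exact.
 */
bool model_scrub_then_retry(unsigned int order, unsigned int *node,
                            bool exact_node)
{
    /* Small orders already take dirty buddies, so scrubbing cannot help. */
    if ( order <= MODEL_MAX_DIRTY_ORDER )
        return false;

    if ( model_scrub_pending(*node) )
        return true;

    if ( *node != MODEL_NO_NODE && !exact_node &&
         model_scrub_pending(MODEL_NO_NODE) )
    {
        *node = MODEL_NO_NODE;
        return true;
    }

    return false;
}
```

The early return encodes the reasoning that an allocation failure at or below
MAX_DIRTY_ORDER means there is no suitable memory at all, dirty or clean.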
* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-09 11:19 ` Jan Beulich
@ 2026-01-13 14:01 ` Roger Pau Monné
2026-01-14 8:48 ` Jan Beulich
0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-13 14:01 UTC (permalink / raw)
To: Jan Beulich
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, xen-devel

On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
> > The current model of falling back to allocate unscrubbed pages and scrub
> > them in place at allocation time risks triggering the watchdog:
> >
> > Watchdog timer detects that CPU55 is stuck!
> > ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
> > CPU: 55
> > RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> > RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
> > [...]
> > Xen call trace:
> >    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
> >    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
> >    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
> >    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
> >    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
> >    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> >
> > The maximum allocation order on x86 is limited to 18, that means allocating
> > and scrubbing possibly 1G worth of memory in 4K chunks.
> >
> > Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
> > currently set to 2M chunks. However such limitation might cause
> > fragmentation in HVM p2m population during domain creation. To prevent
> > that introduce some extra logic in populate_physmap() that fallback to
> > preemptive page-scrubbing if the requested allocation cannot be fulfilled
> > and there's scrubbing work to do. This approach is less fair than the
> > current one, but allows preemptive page scrubbing in the context of
> > populate_physmap() to attempt to ensure unnecessary page-shattering.
> >
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > I'm not particularly happy with this approach, as it doesn't guarantee
> > progress for the callers. IOW: a caller might do a lot of scrubbing, just
> > to get it's pages stolen by a different concurrent thread doing
> > allocations. However I'm not sure there's a better solution than resorting
> > to 2M allocations if there's not enough free memory that is scrubbed.
> >
> > I'm having trouble seeing where we could temporary store page(s) allocated
> > that need to be scrubbed before being assigned to the domain, in a way that
> > can be used by continuations, and that would allow Xen to keep track of
> > them in case the operation is never finished. IOW: we would need to
> > account for cleanup of such temporary stash of pages in case the domain
> > never completes the hypercall, or is destroyed midway.
>
> How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
> indicate that state, with the actual page (and order plus scrub progress)
> recorded in the target struct domain? Actually, maybe such an indicator
> isn't needed at all: If the next invocation (continuation or not) finds
> an in-progress allocation, it could simply use that rather than doing a
> real allocation. (What to do if this isn't a continuation is less clear:
> We could fail such requests [likely not an option unless we can reliably
> tell original requests from continuations], or split the allocation if
> the request is smaller, or free the allocation to then take the normal
> path.) All of which of course only for "foreign" requests.
>
> If the hypercall is never continued, we could refuse to unpause the
> domain (with the allocation then freed normally when the domain gets
> destroyed).

I have done something along these lines, introduced a couple of stashing
variables in the domain struct and stored the progress of scrubbing in
there.

> As another alternative, how about returning unscrubbed pages altogether
> when it's during domain creation, requiring the tool stack to do the
> scrubbing (potentially allowing it to skip some of it when pages are
> fully initialized anyway, much like we do for Dom0 iirc)?

It's going to be difficult for the toolstack to figure out which pages
need to be scrubbed, we would need a way to tell it the unscrubbed
regions in a domain physmap?

> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >
> >                  if ( unlikely(!page) )
> >                  {
> > +                    nodeid_t node = MEMF_get_node(a->memflags);
> > +
> > +                    if ( memory_scrub_pending(node) ||
> > +                         (node != NUMA_NO_NODE &&
> > +                          !(a->memflags & MEMF_exact_node) &&
> > +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> > +                    {
> > +                        scrub_free_pages(node);
> > +                        a->preempted = 1;
> > +                        goto out;
> > +                    }
>
> At least for order 0 requests there's no point in trying this. With the
> current logic, actually for orders up to MAX_DIRTY_ORDER.

Yes, otherwise we might force the CPU to do some scrubbing work when it
won't satisfy its allocation request anyway.

> Further, from a general interface perspective, wouldn't we need to do the
> same for at least XENMEM_increase_reservation?

Possibly yes. TBH I would also be fine with strictly limiting
XENMEM_increase_reservation to 2M order extents, even for the control
domain. The physmap population is the only one that actually requires
bigger extents.

> > @@ -1115,7 +1139,16 @@ static struct page_info *alloc_heap_pages(
> >          if ( test_and_clear_bit(_PGC_need_scrub, &pg[i].count_info) )
> >          {
> >              if ( !(memflags & MEMF_no_scrub) )
> > +            {
> >                  scrub_one_page(&pg[i], cold);
> > +                /*
> > +                 * Use SYS_STATE_smp_boot explicitly; ahead of that state
> > +                 * interrupts are disabled.
> > +                 */
> > +                if ( system_state == SYS_STATE_smp_boot &&
> > +                     !(dirty_cnt & 0xff) )
> > +                    process_pending_softirqs();
> > +            }
> >
> >              dirty_cnt++;
> >          }
>
> Yet an alternative consideration: When "cold" is true, couldn't we call
> process_pending_softirqs() like you do here ( >= SYS_STATE_smp_boot then
> of course), without any of the other changes? Of course that's worse
> than a proper continuation, especially from the calling domain's pov.

Overall I think it would be best to solve this with hypercall
continuations, in case we even want to support pages bigger than 1G. I
know this has a lot of other implications, but would be nice to not add
more baggage here.

The "cold" case is the typical scenario for domain building, and we
would block a control domain CPU for more than 5s which seems
undesirable.

> > @@ -223,6 +224,14 @@ struct npfec {
> >  #else
> >  #define MAX_ORDER 20 /* 2^20 contiguous pages */
> >  #endif
> > +
> > +/* Max order when scrubbing pages at allocation time. */
> > +#ifdef CONFIG_DOMU_MAX_ORDER
> > +# define MAX_DIRTY_ORDER CONFIG_DOMU_MAX_ORDER
> > +#else
> > +# define MAX_DIRTY_ORDER 9
> > +#endif
>
> Using CONFIG_DOMU_MAX_ORDER rather than the command line overridable
> domu_max_order means people couldn't even restore original behavior.

We likely want a separate command line option for this one, but given
your comments above we might want to explore other options.

Thanks, Roger.
^ permalink raw reply [flat|nested] 17+ messages in thread
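The stash-plus-continuation idea discussed in this exchange might be pictured
roughly as follows. This is a hypothetical stand-alone model, not the code
Roger refers to: struct scrub_stash, its fields, model_scrub_step() and the
batch size are all assumptions made for illustration, and memset() stands in
for scrub_one_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_BATCH     256u /* pages scrubbed per step; arbitrary choice */

/* Hypothetical per-domain stash; field names are illustrative only. */
struct scrub_stash {
    unsigned char *base; /* start of the allocation being scrubbed */
    unsigned int order;  /* the allocation covers 1 << order pages */
    unsigned long done;  /* pages scrubbed so far */
};

/*
 * Scrub at most MODEL_BATCH further pages.  Returns true once the whole
 * allocation is clean, false when the caller should arrange a continuation
 * and call again later with the same stash.
 */
bool model_scrub_step(struct scrub_stash *s)
{
    unsigned long total, end;

    assert(s->base != NULL);

    total = 1ul << s->order;
    end = s->done + MODEL_BATCH;
    if ( end > total )
        end = total;

    for ( ; s->done < end; s->done++ )
        memset(s->base + s->done * MODEL_PAGE_SIZE, 0, MODEL_PAGE_SIZE);

    return s->done == total;
}
```

Because progress lives in the stash rather than on the stack, a preempted
caller resumes where it stopped, and whoever tears down the domain can find
and free a half-scrubbed allocation.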
* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-13 14:01 ` Roger Pau Monné
@ 2026-01-14 8:48 ` Jan Beulich
2026-01-15 10:48 ` Roger Pau Monné
0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-14 8:48 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, xen-devel

On 13.01.2026 15:01, Roger Pau Monné wrote:
> On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> The current model of falling back to allocate unscrubbed pages and scrub
>>> them in place at allocation time risks triggering the watchdog:
>>>
>>> Watchdog timer detects that CPU55 is stuck!
>>> ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
>>> CPU: 55
>>> RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
>>> RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
>>> [...]
>>> Xen call trace:
>>> [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>>> [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>>> [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>>> [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>>> [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>>> [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
>>>
>>> The maximum allocation order on x86 is limited to 18, which means allocating
>>> and scrubbing possibly 1G worth of memory in 4K chunks.
>>>
>>> Start by limiting dirty allocations to CONFIG_DOMU_MAX_ORDER, which is
>>> currently set to 2M chunks. However such limitation might cause
>>> fragmentation in HVM p2m population during domain creation. To prevent
>>> that introduce some extra logic in populate_physmap() that falls back to
>>> preemptive page-scrubbing if the requested allocation cannot be fulfilled
>>> and there's scrubbing work to do. This approach is less fair than the
>>> current one, but allows preemptive page scrubbing in the context of
>>> populate_physmap() to attempt to avoid unnecessary page-shattering.
>>>
>>> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>> ---
>>> I'm not particularly happy with this approach, as it doesn't guarantee
>>> progress for the callers. IOW: a caller might do a lot of scrubbing, just
>>> to get its pages stolen by a different concurrent thread doing
>>> allocations. However I'm not sure there's a better solution than resorting
>>> to 2M allocations if there's not enough free memory that is scrubbed.
>>>
>>> I'm having trouble seeing where we could temporarily store page(s) allocated
>>> that need to be scrubbed before being assigned to the domain, in a way that
>>> can be used by continuations, and that would allow Xen to keep track of
>>> them in case the operation is never finished. IOW: we would need to
>>> account for cleanup of such temporary stash of pages in case the domain
>>> never completes the hypercall, or is destroyed midway.
>>
>> How about stealing a bit from the range above MEMOP_EXTENT_SHIFT to
>> indicate that state, with the actual page (and order plus scrub progress)
>> recorded in the target struct domain? Actually, maybe such an indicator
>> isn't needed at all: If the next invocation (continuation or not) finds
>> an in-progress allocation, it could simply use that rather than doing a
>> real allocation. (What to do if this isn't a continuation is less clear:
>> We could fail such requests [likely not an option unless we can reliably
>> tell original requests from continuations], or split the allocation if
>> the request is smaller, or free the allocation to then take the normal
>> path.) All of which of course only for "foreign" requests.
>>
>> If the hypercall is never continued, we could refuse to unpause the
>> domain (with the allocation then freed normally when the domain gets
>> destroyed).
>
> I have done something along these lines: I introduced a couple of
> stashing variables in the domain struct and stored the progress of
> scrubbing in there.
>
>> As another alternative, how about returning unscrubbed pages altogether
>> when it's during domain creation, requiring the tool stack to do the
>> scrubbing (potentially allowing it to skip some of it when pages are
>> fully initialized anyway, much like we do for Dom0 iirc)?
>
> It's going to be difficult for the toolstack to figure out which pages
> need to be scrubbed; we would need a way to tell it the unscrubbed
> regions in a domain physmap?

My thinking here was that the toolstack would have to assume everything is
unscrubbed, and it could avoid scrubbing only those pages which it knows it
fully fills with some data.

>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
>>>
>>>                  if ( unlikely(!page) )
>>>                  {
>>> +                    nodeid_t node = MEMF_get_node(a->memflags);
>>> +
>>> +                    if ( memory_scrub_pending(node) ||
>>> +                         (node != NUMA_NO_NODE &&
>>> +                          !(a->memflags & MEMF_exact_node) &&
>>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
>>> +                    {
>>> +                        scrub_free_pages(node);
>>> +                        a->preempted = 1;
>>> +                        goto out;
>>> +                    }
>>
>> At least for order 0 requests there's no point in trying this. With the
>> current logic, actually for orders up to MAX_DIRTY_ORDER.
>
> Yes, otherwise we might force the CPU to do some scrubbing work when
> it won't satisfy its allocation request anyway.
>
>> Further, from a general interface perspective, wouldn't we need to do the
>> same for at least XENMEM_increase_reservation?
>
> Possibly yes. TBH I would also be fine with strictly limiting
> XENMEM_increase_reservation to 2M order extents, even for the control
> domain. The physmap population is the only one that actually requires
> bigger extents.

Hmm, that's an option, yes, but an ABI-changing one.

Jan
^ permalink raw reply [flat|nested] 17+ messages in thread
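The idea discussed in this message, recording an in-progress allocation together with its order and scrub progress so that a continuation can resume it, can be modelled in a few lines of standalone C. This is a toy sketch under stated assumptions and not the code from the series: `struct pending_alloc`, `scrub_step()` and `TOY_PAGE_SIZE` are hypothetical names, plain memory stands in for domain pages, and the per-call `budget` stands in for whatever preemption check the real code would use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4K pages */

/* Hypothetical per-domain stash describing a partially scrubbed allocation. */
struct pending_alloc {
    unsigned char *base;  /* start of the allocated, still dirty range */
    unsigned int order;   /* the allocation covers 2^order pages */
    size_t scrubbed;      /* pages scrubbed so far, kept across calls */
};

/*
 * Scrub at most 'budget' further pages, then return. The result is true
 * once the whole allocation is clean; false means the caller should
 * preempt and call again later, resuming from pa->scrubbed.
 */
static bool scrub_step(struct pending_alloc *pa, size_t budget)
{
    size_t total = (size_t)1 << pa->order;

    for ( ; budget > 0 && pa->scrubbed < total; budget-- )
    {
        memset(pa->base + pa->scrubbed * TOY_PAGE_SIZE, 0, TOY_PAGE_SIZE);
        pa->scrubbed++;
    }

    return pa->scrubbed == total;
}
```

Because the progress lives in the stash rather than on the stack, an operation that is never continued leaves a well-defined object behind, which is what makes the cleanup on domain destruction mentioned above tractable.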
* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-14 8:48 ` Jan Beulich
@ 2026-01-15 10:48 ` Roger Pau Monné
2026-01-15 10:56 ` Jan Beulich
0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-15 10:48 UTC (permalink / raw)
To: Jan Beulich
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, xen-devel

On Wed, Jan 14, 2026 at 09:48:59AM +0100, Jan Beulich wrote:
> On 13.01.2026 15:01, Roger Pau Monné wrote:
> > On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> >> On 08.01.2026 18:55, Roger Pau Monne wrote:
> >>> --- a/xen/common/memory.c
> >>> +++ b/xen/common/memory.c
> >>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >>>
> >>>                  if ( unlikely(!page) )
> >>>                  {
> >>> +                    nodeid_t node = MEMF_get_node(a->memflags);
> >>> +
> >>> +                    if ( memory_scrub_pending(node) ||
> >>> +                         (node != NUMA_NO_NODE &&
> >>> +                          !(a->memflags & MEMF_exact_node) &&
> >>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> >>> +                    {
> >>> +                        scrub_free_pages(node);
> >>> +                        a->preempted = 1;
> >>> +                        goto out;
> >>> +                    }
> >>
> >> At least for order 0 requests there's no point in trying this. With the
> >> current logic, actually for orders up to MAX_DIRTY_ORDER.
> >
> > Yes, otherwise we might force the CPU to do some scrubbing work when
> > it won't satisfy its allocation request anyway.
> >
> >> Further, from a general interface perspective, wouldn't we need to do the
> >> same for at least XENMEM_increase_reservation?
> >
> > Possibly yes. TBH I would also be fine with strictly limiting
> > XENMEM_increase_reservation to 2M order extents, even for the control
> > domain. The physmap population is the only one that actually requires
> > bigger extents.
>
> Hmm, that's an option, yes, but an ABI-changing one.

I don't think it changes the ABI: Xen has always reserved the right to
block high order allocations. See for example how max_order() has
different limits depending on the domain permissions, and I would not
consider those limits part of the ABI; they can be changed from the
command line.

Thanks, Roger.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-15 10:48 ` Roger Pau Monné
@ 2026-01-15 10:56 ` Jan Beulich
2026-01-15 13:05 ` Roger Pau Monné
0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-15 10:56 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, xen-devel

On 15.01.2026 11:48, Roger Pau Monné wrote:
> On Wed, Jan 14, 2026 at 09:48:59AM +0100, Jan Beulich wrote:
>> On 13.01.2026 15:01, Roger Pau Monné wrote:
>>> On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
>>>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>>>> --- a/xen/common/memory.c
>>>>> +++ b/xen/common/memory.c
>>>>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
>>>>>
>>>>>                  if ( unlikely(!page) )
>>>>>                  {
>>>>> +                    nodeid_t node = MEMF_get_node(a->memflags);
>>>>> +
>>>>> +                    if ( memory_scrub_pending(node) ||
>>>>> +                         (node != NUMA_NO_NODE &&
>>>>> +                          !(a->memflags & MEMF_exact_node) &&
>>>>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
>>>>> +                    {
>>>>> +                        scrub_free_pages(node);
>>>>> +                        a->preempted = 1;
>>>>> +                        goto out;
>>>>> +                    }
>>>>
>>>> At least for order 0 requests there's no point in trying this. With the
>>>> current logic, actually for orders up to MAX_DIRTY_ORDER.
>>>
>>> Yes, otherwise we might force the CPU to do some scrubbing work when
>>> it won't satisfy its allocation request anyway.
>>>
>>>> Further, from a general interface perspective, wouldn't we need to do the
>>>> same for at least XENMEM_increase_reservation?
>>>
>>> Possibly yes. TBH I would also be fine with strictly limiting
>>> XENMEM_increase_reservation to 2M order extents, even for the control
>>> domain. The physmap population is the only one that actually requires
>>> bigger extents.
>>
>> Hmm, that's an option, yes, but an ABI-changing one.
>
> I don't think it changes the ABI: Xen has always reserved the right to
> block high order allocations. See for example how max_order() has
> different limits depending on the domain permissions, and I would not
> consider those limits part of the ABI; they can be changed from the
> command line.

When the limits were introduced, we were aware this is an ABI change, albeit
a necessary one. You have a point however as to the command line control that
there now is.

Jan
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order
2026-01-15 10:56 ` Jan Beulich
@ 2026-01-15 13:05 ` Roger Pau Monné
0 siblings, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-15 13:05 UTC (permalink / raw)
To: Jan Beulich
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, xen-devel

On Thu, Jan 15, 2026 at 11:56:16AM +0100, Jan Beulich wrote:
> On 15.01.2026 11:48, Roger Pau Monné wrote:
> > On Wed, Jan 14, 2026 at 09:48:59AM +0100, Jan Beulich wrote:
> >> On 13.01.2026 15:01, Roger Pau Monné wrote:
> >>> On Fri, Jan 09, 2026 at 12:19:26PM +0100, Jan Beulich wrote:
> >>>> On 08.01.2026 18:55, Roger Pau Monne wrote:
> >>>>> --- a/xen/common/memory.c
> >>>>> +++ b/xen/common/memory.c
> >>>>> @@ -279,6 +279,18 @@ static void populate_physmap(struct memop_args *a)
> >>>>>
> >>>>>                  if ( unlikely(!page) )
> >>>>>                  {
> >>>>> +                    nodeid_t node = MEMF_get_node(a->memflags);
> >>>>> +
> >>>>> +                    if ( memory_scrub_pending(node) ||
> >>>>> +                         (node != NUMA_NO_NODE &&
> >>>>> +                          !(a->memflags & MEMF_exact_node) &&
> >>>>> +                          memory_scrub_pending(node = NUMA_NO_NODE)) )
> >>>>> +                    {
> >>>>> +                        scrub_free_pages(node);
> >>>>> +                        a->preempted = 1;
> >>>>> +                        goto out;
> >>>>> +                    }
> >>>>
> >>>> At least for order 0 requests there's no point in trying this. With the
> >>>> current logic, actually for orders up to MAX_DIRTY_ORDER.
> >>>
> >>> Yes, otherwise we might force the CPU to do some scrubbing work when
> >>> it won't satisfy its allocation request anyway.
> >>>
> >>>> Further, from a general interface perspective, wouldn't we need to do the
> >>>> same for at least XENMEM_increase_reservation?
> >>>
> >>> Possibly yes. TBH I would also be fine with strictly limiting
> >>> XENMEM_increase_reservation to 2M order extents, even for the control
> >>> domain. The physmap population is the only one that actually requires
> >>> bigger extents.
> >>
> >> Hmm, that's an option, yes, but an ABI-changing one.
> >
> > I don't think it changes the ABI: Xen has always reserved the right to
> > block high order allocations. See for example how max_order() has
> > different limits depending on the domain permissions, and I would not
> > consider those limits part of the ABI; they can be changed from the
> > command line.
>
> When the limits were introduced, we were aware this is an ABI change, albeit
> a necessary one. You have a point however as to the command line control that
> there now is.

In addition to what I've said above: the limit that I've introduced in
v2 only affects dirty allocations that require scrubbing. If the
requested order is available and scrubbed the limit won't be enforced.
So the ABI is not changed in that regard: only unscrubbed pages past a
certain order are considered as not free.

It's possibly best to move the conversation to the v2 proposal and
discuss the limit there.

Thanks, Roger.
^ permalink raw reply [flat|nested] 17+ messages in thread
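The two rules stated in prose in this exchange can be written down as predicates. This is a sketch of one reading of the descriptions above, not the v2 code (which is not part of this thread): `chunk_counts_as_free()` and `worth_preempting_to_scrub()` are hypothetical names, and the limit is passed in as a parameter, with 9 being the fallback value of `MAX_DIRTY_ORDER` in the quoted v1 hunk.

```c
#include <stdbool.h>

/*
 * Roger's description of v2: the order limit applies only to chunks that
 * still need scrubbing. An already scrubbed chunk is usable at any order.
 */
static bool chunk_counts_as_free(unsigned int order, bool needs_scrub,
                                 unsigned int max_dirty_order)
{
    return !needs_scrub || order <= max_dirty_order;
}

/*
 * Jan's remark on the populate_physmap() fallback: preempting to scrub
 * only helps requests above the dirty-order limit, since smaller ones
 * could have been satisfied from dirty memory in the first place.
 */
static bool worth_preempting_to_scrub(unsigned int order, bool scrub_pending,
                                      unsigned int max_dirty_order)
{
    return scrub_pending && order > max_dirty_order;
}
```

With a limit of 9, a clean order 18 chunk is still handed out, a dirty one is not, and a failed order 0 request never triggers the scrub-and-retry path.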
* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
2026-01-08 17:55 [PATCH 0/2] xen/mm: limit in-place scrubbing Roger Pau Monne
2026-01-08 17:55 ` [PATCH 1/2] xen/mm: add a NUMA node parameter to scrub_free_pages() Roger Pau Monne
2026-01-08 17:55 ` [PATCH 2/2] xen/mm: limit non-scrubbed allocations to a specific order Roger Pau Monne
@ 2026-01-09 10:15 ` Jan Beulich
2026-01-09 10:29 ` Andrew Cooper
2 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 10:15 UTC (permalink / raw)
To: Roger Pau Monne
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Andrew Cooper, Anthony PERARD, xen-devel

On 08.01.2026 18:55, Roger Pau Monne wrote:
> In XenServer we have seen the watchdog occasionally triggering during
> domain creation if 1GB pages are scrubbed in-place during physmap
> population.

That's pretty extreme - writing to 1GB of memory can't really take over 5s,
can it? Is there lock contention involved? Or is this when very many CPUs
try to do the same in parallel?

Jan

> The following series attempts to mitigate this by limiting
> the in-place scrubbing during allocation to 2M pages, but it has some
> drawbacks, see the post-commit remarks in patch 2.
>
> I'm hoping someone might have a better idea, or that we converge on not
> being able to do better than this for the time being.
>
> Thanks, Roger.
>
> Roger Pau Monne (2):
> xen/mm: add a NUMA node parameter to scrub_free_pages()
> xen/mm: limit non-scrubbed allocations to a specific order
>
> xen/arch/arm/domain.c   |  2 +-
> xen/arch/x86/domain.c   |  2 +-
> xen/common/memory.c     | 12 +++++++++
> xen/common/page_alloc.c | 54 +++++++++++++++++++++++++++++++++++++----
> xen/include/xen/mm.h    | 12 ++++++++-
> 5 files changed, 74 insertions(+), 8 deletions(-)
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
2026-01-09 10:15 ` [PATCH 0/2] xen/mm: limit in-place scrubbing Jan Beulich
@ 2026-01-09 10:29 ` Andrew Cooper
2026-01-09 11:32 ` Jan Beulich
2026-01-09 12:31 ` Roger Pau Monné
0 siblings, 2 replies; 17+ messages in thread
From: Andrew Cooper @ 2026-01-09 10:29 UTC (permalink / raw)
To: Jan Beulich, Roger Pau Monne
Cc: Andrew Cooper, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Anthony PERARD, xen-devel

On 09/01/2026 10:15 am, Jan Beulich wrote:
> On 08.01.2026 18:55, Roger Pau Monne wrote:
>> In XenServer we have seen the watchdog occasionally triggering during
>> domain creation if 1GB pages are scrubbed in-place during physmap
>> population.
> That's pretty extreme - writing to 1GB of memory can't really take over 5s,
> can it?

Sure it can.

> Is there lock contention involved?

Almost certainly, and it's probably the more relevant aspect in this case.

> Or is this when very many CPUs
> try to do the same in parallel?

The scenario is reboot of a VM when Xapi is doing NUMA placement using
per-node claims.

In this case, even with sufficient scrubbed RAM on other nodes, you need
to take from the node you claimed on, which might need scrubbing.

The underlying problem is the need to do a long running operation in a
context where you cannot continue, and cannot (reasonably) fail.

~Andrew
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
2026-01-09 10:29 ` Andrew Cooper
@ 2026-01-09 11:32 ` Jan Beulich
2026-01-09 11:34 ` Andrew Cooper
2026-01-09 12:31 ` Roger Pau Monné
1 sibling, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-01-09 11:32 UTC (permalink / raw)
To: Andrew Cooper
Cc: Stefano Stabellini, Julien Grall, Bertrand Marquis, Michal Orzel,
Volodymyr Babchuk, Anthony PERARD, xen-devel, Roger Pau Monne

On 09.01.2026 11:29, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> In XenServer we have seen the watchdog occasionally triggering during
>>> domain creation if 1GB pages are scrubbed in-place during physmap
>>> population.
>> That's pretty extreme - writing to 1GB of memory can't really take over 5s,
>> can it?
>
> Sure it can.

Under what unusual circumstances, or on what extremely slow hardware? (Of
course improperly set MTRRs could cause such, for example.)

>> Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Thing is - the scrubbing happens after alloc_heap_pages() has already dropped
the heap lock. And I can't spot the XENMEM_populate_physmap path to take any
locks outward from alloc_heap_pages(). And the domain's page alloc lock
(which in principle should be uncontended anyway unless the toolstack tries
to race with itself) is acquired only later.

If it was a lock contention problem, the first goal ought to be to move the
scrubbing outside of any (potentially contended) locks.

>> Or is this when very many CPUs
>> try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.
>
> In this case, even with sufficient scrubbed RAM on other nodes, you need
> to take from the node you claimed on, which might need scrubbing.

Much like if there was an exact-node request without involving claims.

> The underlying problem is the need to do a long running operation in a
> context where you cannot continue, and cannot (reasonably) fail.

Right.

Jan
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
2026-01-09 11:32 ` Jan Beulich
@ 2026-01-09 11:34 ` Andrew Cooper
0 siblings, 0 replies; 17+ messages in thread
From: Andrew Cooper @ 2026-01-09 11:34 UTC (permalink / raw)
To: Jan Beulich
Cc: Andrew Cooper, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Anthony PERARD, xen-devel,
Roger Pau Monne

On 09/01/2026 11:32 am, Jan Beulich wrote:
>>> Or is this when very many CPUs
>>> try to do the same in parallel?
>> The scenario is reboot of a VM when Xapi is doing NUMA placement using
>> per-node claims.
>>
>> In this case, even with sufficient scrubbed RAM on other nodes, you need
>> to take from the node you claimed on which might need scrubbing.
> Much like if there was an exact-node request without involving claims.
>
>> The underlying problem is the need to do a long running operation in a
>> context where you cannot continue, and cannot (reasonably) fail.
> Right.

Yeah - I think this is a scenario that could happen without NUMA
aspects, if the system is almost full.

I suspect we've just made it easier to hit, or we've got better
testing. Hard to say.

~Andrew
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] xen/mm: limit in-place scrubbing
2026-01-09 10:29 ` Andrew Cooper
2026-01-09 11:32 ` Jan Beulich
@ 2026-01-09 12:31 ` Roger Pau Monné
1 sibling, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2026-01-09 12:31 UTC (permalink / raw)
To: Andrew Cooper
Cc: Jan Beulich, Stefano Stabellini, Julien Grall, Bertrand Marquis,
Michal Orzel, Volodymyr Babchuk, Anthony PERARD, xen-devel

On Fri, Jan 09, 2026 at 10:29:20AM +0000, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
> > On 08.01.2026 18:55, Roger Pau Monne wrote:
> >> In XenServer we have seen the watchdog occasionally triggering during
> >> domain creation if 1GB pages are scrubbed in-place during physmap
> >> population.
> > That's pretty extreme - writing to 1GB of memory can't really take over 5s,
> > can it?
>
> Sure it can.
>
> > Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Possibly. I can tell Edwin to give me his reproduction.

There's also the map_domain_page() aspect of this operation. On big
enough systems this will cause a fair amount of stress to the map
cache, since each page is mapped, scrubbed and unmapped. I don't
think, however, that the systems on which we have seen this are using
the map cache (it was on debug=n builds with less than 5TB of memory).

> > Or is this when very many CPUs
> > try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.

Not exclusively. We have reports of this also happening without any
claims or NUMA placements being used. AFAICT it's possibly triggered
when doing reboots of multiple VMs in parallel, and all the reports of
it I've seen are on multi-node NUMA systems. I wonder if scrubbing a 1G
remote page in 4K chunks is killing the inter-node bandwidth.

Thanks, Roger.
^ permalink raw reply [flat|nested] 17+ messages in thread
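To put the "can it really take over 5s" question in numbers, the following standalone sketch computes the sustained scrub rate and the per-page time budget below which a single in-place scrub would exceed a given timeout. It is plain arithmetic rather than a measurement: `min_rate_bytes_per_sec()` and `per_page_budget_ns()` are hypothetical helpers, the 5s value is simply the figure participants quote in this thread, and 4K pages are assumed.

```c
#include <stdint.h>

/* Slowest sustained rate (bytes/s, rounded down) that still finishes in time. */
static uint64_t min_rate_bytes_per_sec(uint64_t bytes, uint64_t timeout_s)
{
    return bytes / timeout_s;
}

/*
 * Average time budget per page (ns, rounded down) when 2^order pages must
 * be mapped, scrubbed and unmapped one at a time within the timeout.
 */
static uint64_t per_page_budget_ns(unsigned int order, uint64_t timeout_s)
{
    return timeout_s * 1000000000ULL / ((uint64_t)1 << order);
}
```

For a 1 GiB (order 18) allocation and a 5s timeout this works out to roughly 205 MiB/s, or about 19 microseconds per 4K page including the map and unmap, which helps explain why contention or remote-node traffic is suspected rather than raw memory speed.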