[PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
@ 2025-01-08 11:30 yangge1116
  2025-01-13  8:47 ` Barry Song
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: yangge1116 @ 2025-01-08 11:30 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang, hannes,
	liuzixing, yangge

From: yangge <yangge1116@126.com>

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.

During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long term GUP cannot allocate memory from CMA area, so a maximum of
16 GB of no-CMA memory on a NUMA node can be used as virtual machine
memory. There is 16GB of free CMA memory on a NUMA node, which is
sufficient to pass the order-0 watermark check, causing the
__compaction_suitable() function to  consistently return true.
However, if there aren't enough migratable pages available, performing
memory compaction is also meaningless. Besides checking whether
the order-0 watermark is met, __compaction_suitable() also needs
to determine whether there are sufficient migratable pages available
for memory compaction.

For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here

When the 16G of non-CMA memory on a single node is exhausted, we will
fallback to allocating memory on other nodes. In order to quickly
fallback to remote nodes, we should skip memory compaction when
migratable pages are insufficient. After this fix, it only takes a
few tens of seconds to start a 32GB virtual machine with device
passthrough functionality.

Signed-off-by: yangge <yangge1116@126.com>
---

V3:
- fix build error

V2:
- consider unevictable folios

 mm/compaction.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..a9f1261 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
 				  int highest_zoneidx,
 				  unsigned long wmark_target)
 {
+	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
+	unsigned long sum, nr_pinned;
 	unsigned long watermark;
+
+	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
+		node_page_state(pgdat, NR_INACTIVE_ANON) +
+		node_page_state(pgdat, NR_ACTIVE_FILE) +
+		node_page_state(pgdat, NR_ACTIVE_ANON) +
+		node_page_state(pgdat, NR_UNEVICTABLE);
+
+	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
+		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
+
+	/*
+	 * Gup-pinned pages are non-migratable. After subtracting these pages,
+	 * we need to check if the remaining pages are sufficient for memory
+	 * compaction.
+	 */
+	if ((sum - nr_pinned) < (1 << order))
+		return false;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
 	 * isolate free pages for migration targets. This means that the
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-08 11:30 [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages yangge1116
@ 2025-01-13  8:47 ` Barry Song
  2025-01-13  9:02   ` Ge Yang
  2025-01-13 15:46 ` Johannes Weiner
  2025-01-14 11:21 ` Vlastimil Babka
  2 siblings, 1 reply; 11+ messages in thread
From: Barry Song @ 2025-01-13  8:47 UTC (permalink / raw)
  To: yangge1116
  Cc: akpm, linux-mm, linux-kernel, david, baolin.wang, hannes,
	liuzixing

On Thu, Jan 9, 2025 at 12:31 AM <yangge1116@126.com> wrote:
>
> From: yangge <yangge1116@126.com>
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to  consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is also meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
>
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> When the 16G of non-CMA memory on a single node is exhausted, we will
> fallback to allocating memory on other nodes. In order to quickly
> fallback to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
>
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>
> V3:
> - fix build error
>
> V2:
> - consider unevictable folios
>
>  mm/compaction.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..a9f1261 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>                                   int highest_zoneidx,
>                                   unsigned long wmark_target)
>  {
> +       pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
> +       unsigned long sum, nr_pinned;
>         unsigned long watermark;
> +
> +       sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +               node_page_state(pgdat, NR_INACTIVE_ANON) +
> +               node_page_state(pgdat, NR_ACTIVE_FILE) +
> +               node_page_state(pgdat, NR_ACTIVE_ANON) +
> +               node_page_state(pgdat, NR_UNEVICTABLE);
> +
> +       nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +               node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
> +

Does the sum of all LRU pages equal non-CMA memory?
I'm quite confused for two reasons:
1. CMA pages can be LRU pages.
2. Free pages might not belong to any LRUs.


> +       /*
> +        * Gup-pinned pages are non-migratable. After subtracting these pages,
> +        * we need to check if the remaining pages are sufficient for memory
> +        * compaction.
> +        */
> +       if ((sum - nr_pinned) < (1 << order))
> +               return false;
> +
>         /*
>          * Watermarks for order-0 must be met for compaction to be able to
>          * isolate free pages for migration targets. This means that the
> --
> 2.7.4
>
>

Thanks
barry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-13  8:47 ` Barry Song
@ 2025-01-13  9:02   ` Ge Yang
  2025-01-13 10:05     ` Barry Song
  0 siblings, 1 reply; 11+ messages in thread
From: Ge Yang @ 2025-01-13  9:02 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, david, baolin.wang, hannes,
	liuzixing



在 2025/1/13 16:47, Barry Song 写道:
> On Thu, Jan 9, 2025 at 12:31 AM <yangge1116@126.com> wrote:
>>
>> From: yangge <yangge1116@126.com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to  consistently return true.
>> However, if there aren't enough migratable pages available, performing
>> memory compaction is also meaningless. Besides checking whether
>> the order-0 watermark is met, __compaction_suitable() also needs
>> to determine whether there are sufficient migratable pages available
>> for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16G of non-CMA memory on a single node is exhausted, we will
>> fallback to allocating memory on other nodes. In order to quickly
>> fallback to remote nodes, we should skip memory compaction when
>> migratable pages are insufficient. After this fix, it only takes a
>> few tens of seconds to start a 32GB virtual machine with device
>> passthrough functionality.
>>
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>   mm/compaction.c | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>                                    int highest_zoneidx,
>>                                    unsigned long wmark_target)
>>   {
>> +       pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +       unsigned long sum, nr_pinned;
>>          unsigned long watermark;
>> +
>> +       sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +               node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +               node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +               node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +               node_page_state(pgdat, NR_UNEVICTABLE);
>> +
>> +       nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +               node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>> +
> 
> Does the sum of all LRU pages equal non-CMA memory?
> I'm quite confused for two reasons:
> 1. CMA pages can be LRU pages.
> 2. Free pages might not belong to any LRUs.
NO.

If all the pages in the LRU are pinned, it seems unnecessary to perform 
memory compaction, as the migration of pinned pages is unlikely to succeed.
Besides checking whether the order-0 watermark is met, 
__compaction_suitable() also needs to determine whether there are 
sufficient migratable pages available for memory compaction.
> 
> 
>> +       /*
>> +        * Gup-pinned pages are non-migratable. After subtracting these pages,
>> +        * we need to check if the remaining pages are sufficient for memory
>> +        * compaction.
>> +        */
>> +       if ((sum - nr_pinned) < (1 << order))
>> +               return false;
>> +
>>          /*
>>           * Watermarks for order-0 must be met for compaction to be able to
>>           * isolate free pages for migration targets. This means that the
>> --
>> 2.7.4
>>
>>
> 
> Thanks
> barry


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-13  9:02   ` Ge Yang
@ 2025-01-13 10:05     ` Barry Song
  2025-01-13 11:23       ` Ge Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Barry Song @ 2025-01-13 10:05 UTC (permalink / raw)
  To: Ge Yang; +Cc: akpm, linux-mm, linux-kernel, david, baolin.wang, hannes,
	liuzixing

On Mon, Jan 13, 2025 at 10:04 PM Ge Yang <yangge1116@126.com> wrote:
>
>
>
> 在 2025/1/13 16:47, Barry Song 写道:
> > On Thu, Jan 9, 2025 at 12:31 AM <yangge1116@126.com> wrote:
> >>
> >> From: yangge <yangge1116@126.com>
> >>
> >> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> >> of memory. I have configured 16GB of CMA memory on each NUMA node,
> >> and starting a 32GB virtual machine with device passthrough is
> >> extremely slow, taking almost an hour.
> >>
> >> During the start-up of the virtual machine, it will call
> >> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> >> Long term GUP cannot allocate memory from CMA area, so a maximum of
> >> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> >> memory. There is 16GB of free CMA memory on a NUMA node, which is
> >> sufficient to pass the order-0 watermark check, causing the
> >> __compaction_suitable() function to  consistently return true.
> >> However, if there aren't enough migratable pages available, performing
> >> memory compaction is also meaningless. Besides checking whether
> >> the order-0 watermark is met, __compaction_suitable() also needs
> >> to determine whether there are sufficient migratable pages available
> >> for memory compaction.
> >>
> >> For costly allocations, because __compaction_suitable() always
> >> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> >> place, resulting in excessively long virtual machine startup times.
> >> Call trace:
> >> __alloc_pages_slowpath
> >>      if (compact_result == COMPACT_SKIPPED ||
> >>          compact_result == COMPACT_DEFERRED)
> >>          goto nopage; // should exit __alloc_pages_slowpath() from here
> >>
> >> When the 16G of non-CMA memory on a single node is exhausted, we will
> >> fallback to allocating memory on other nodes. In order to quickly
> >> fallback to remote nodes, we should skip memory compaction when
> >> migratable pages are insufficient. After this fix, it only takes a
> >> few tens of seconds to start a 32GB virtual machine with device
> >> passthrough functionality.
> >>
> >> Signed-off-by: yangge <yangge1116@126.com>
> >> ---
> >>
> >> V3:
> >> - fix build error
> >>
> >> V2:
> >> - consider unevictable folios
> >>
> >>   mm/compaction.c | 20 ++++++++++++++++++++
> >>   1 file changed, 20 insertions(+)
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index 07bd227..a9f1261 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
> >>                                    int highest_zoneidx,
> >>                                    unsigned long wmark_target)
> >>   {
> >> +       pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
> >> +       unsigned long sum, nr_pinned;
> >>          unsigned long watermark;
> >> +
> >> +       sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> >> +               node_page_state(pgdat, NR_INACTIVE_ANON) +
> >> +               node_page_state(pgdat, NR_ACTIVE_FILE) +
> >> +               node_page_state(pgdat, NR_ACTIVE_ANON) +
> >> +               node_page_state(pgdat, NR_UNEVICTABLE);
> >> +
> >> +       nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> >> +               node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
> >> +
> >
> > Does the sum of all LRU pages equal non-CMA memory?
> > I'm quite confused for two reasons:
> > 1. CMA pages can be LRU pages.
> > 2. Free pages might not belong to any LRUs.
> NO.
>
> If all the pages in the LRU are pinned, it seems unnecessary to perform
> memory compaction, as the migration of pinned pages is unlikely to succeed.
> Besides checking whether the order-0 watermark is met,
> __compaction_suitable() also needs to determine whether there are
> sufficient migratable pages available for memory compaction.

Ok, but I am not convinced that this is a correct patch. If all your
CMA pages are
used by userspace—in other words, they are in LRUs—the sum could become
quite large, and `nr_pinned` might include non-CMA pages. In that case,
`sum - nr_pinned` would also be quite large. The "return false" logic wouldn't
work as intended.

I suspect the issue seems to have disappeared simply because your CMA is
not being used at all.

> >
> >
> >> +       /*
> >> +        * Gup-pinned pages are non-migratable. After subtracting these pages,
> >> +        * we need to check if the remaining pages are sufficient for memory
> >> +        * compaction.
> >> +        */
> >> +       if ((sum - nr_pinned) < (1 << order))
> >> +               return false;
> >> +
> >>          /*
> >>           * Watermarks for order-0 must be met for compaction to be able to
> >>           * isolate free pages for migration targets. This means that the
> >> --
> >> 2.7.4
> >>
> >>
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-13 10:05     ` Barry Song
@ 2025-01-13 11:23       ` Ge Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Ge Yang @ 2025-01-13 11:23 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, david, baolin.wang, hannes,
	liuzixing



在 2025/1/13 18:05, Barry Song 写道:
> On Mon, Jan 13, 2025 at 10:04 PM Ge Yang <yangge1116@126.com> wrote:
>>
>>
>>
>> 在 2025/1/13 16:47, Barry Song 写道:
>>> On Thu, Jan 9, 2025 at 12:31 AM <yangge1116@126.com> wrote:
>>>>
>>>> From: yangge <yangge1116@126.com>
>>>>
>>>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>>>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>>>> and starting a 32GB virtual machine with device passthrough is
>>>> extremely slow, taking almost an hour.
>>>>
>>>> During the start-up of the virtual machine, it will call
>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>>>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>>>> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
>>>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>>>> sufficient to pass the order-0 watermark check, causing the
>>>> __compaction_suitable() function to  consistently return true.
>>>> However, if there aren't enough migratable pages available, performing
>>>> memory compaction is also meaningless. Besides checking whether
>>>> the order-0 watermark is met, __compaction_suitable() also needs
>>>> to determine whether there are sufficient migratable pages available
>>>> for memory compaction.
>>>>
>>>> For costly allocations, because __compaction_suitable() always
>>>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>>>> place, resulting in excessively long virtual machine startup times.
>>>> Call trace:
>>>> __alloc_pages_slowpath
>>>>       if (compact_result == COMPACT_SKIPPED ||
>>>>           compact_result == COMPACT_DEFERRED)
>>>>           goto nopage; // should exit __alloc_pages_slowpath() from here
>>>>
>>>> When the 16G of non-CMA memory on a single node is exhausted, we will
>>>> fallback to allocating memory on other nodes. In order to quickly
>>>> fallback to remote nodes, we should skip memory compaction when
>>>> migratable pages are insufficient. After this fix, it only takes a
>>>> few tens of seconds to start a 32GB virtual machine with device
>>>> passthrough functionality.
>>>>
>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>> ---
>>>>
>>>> V3:
>>>> - fix build error
>>>>
>>>> V2:
>>>> - consider unevictable folios
>>>>
>>>>    mm/compaction.c | 20 ++++++++++++++++++++
>>>>    1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index 07bd227..a9f1261 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>>>                                     int highest_zoneidx,
>>>>                                     unsigned long wmark_target)
>>>>    {
>>>> +       pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>>>> +       unsigned long sum, nr_pinned;
>>>>           unsigned long watermark;
>>>> +
>>>> +       sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>>>> +               node_page_state(pgdat, NR_INACTIVE_ANON) +
>>>> +               node_page_state(pgdat, NR_ACTIVE_FILE) +
>>>> +               node_page_state(pgdat, NR_ACTIVE_ANON) +
>>>> +               node_page_state(pgdat, NR_UNEVICTABLE);
>>>> +
>>>> +       nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>>>> +               node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
>>>> +
>>>
>>> Does the sum of all LRU pages equal non-CMA memory?
>>> I'm quite confused for two reasons:
>>> 1. CMA pages can be LRU pages.
>>> 2. Free pages might not belong to any LRUs.
>> NO.
>>
>> If all the pages in the LRU are pinned, it seems unnecessary to perform
>> memory compaction, as the migration of pinned pages is unlikely to succeed.
>> Besides checking whether the order-0 watermark is met,
>> __compaction_suitable() also needs to determine whether there are
>> sufficient migratable pages available for memory compaction.
> 
> Ok, but I am not convinced that this is a correct patch. If all your
> CMA pages are
> used by userspace—in other words, they are in LRUs—the sum could become
> quite large, and `nr_pinned` might include non-CMA pages. In that case,
> `sum - nr_pinned` would also be quite large. The "return false" logic wouldn't
> work as intended.
> 
> I suspect the issue seems to have disappeared simply because your CMA is
> not being used at all.
> 
Part of the CMA has been used. Due to __compaction_suitable() always 
returning true, it triggers swapping, which evicts the already-used CMA 
pages to disk, ultimately resulting in only pinned pages remaining in 
the LRU (Least Recently Used) list.

>>>
>>>
>>>> +       /*
>>>> +        * Gup-pinned pages are non-migratable. After subtracting these pages,
>>>> +        * we need to check if the remaining pages are sufficient for memory
>>>> +        * compaction.
>>>> +        */
>>>> +       if ((sum - nr_pinned) < (1 << order))
>>>> +               return false;
>>>> +
>>>>           /*
>>>>            * Watermarks for order-0 must be met for compaction to be able to
>>>>            * isolate free pages for migration targets. This means that the
>>>> --
>>>> 2.7.4
>>>>
>>>>
>>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-08 11:30 [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages yangge1116
  2025-01-13  8:47 ` Barry Song
@ 2025-01-13 15:46 ` Johannes Weiner
  2025-01-14  2:51   ` Ge Yang
  2025-01-14 11:21 ` Vlastimil Babka
  2 siblings, 1 reply; 11+ messages in thread
From: Johannes Weiner @ 2025-01-13 15:46 UTC (permalink / raw)
  To: yangge1116
  Cc: akpm, linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
	liuzixing, Vlastimil Babka

CC Vlastimil

On Wed, Jan 08, 2025 at 07:30:54PM +0800, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
> 
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
> 
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to  consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is also meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
> 
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
> 
> When the 16G of non-CMA memory on a single node is exhausted, we will
> fallback to allocating memory on other nodes. In order to quickly
> fallback to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
> 
> Signed-off-by: yangge <yangge1116@126.com>
> ---
> 
> V3:
> - fix build error
> 
> V2:
> - consider unevictable folios
> 
>  mm/compaction.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..a9f1261 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>  				  int highest_zoneidx,
>  				  unsigned long wmark_target)
>  {
> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
> +	unsigned long sum, nr_pinned;
>  	unsigned long watermark;
> +
> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
> +		node_page_state(pgdat, NR_UNEVICTABLE);

What about PAGE_MAPPING_MOVABLE pages that aren't on this list? For
example, zsmalloc backend pages can be a large share of allocated
memory, and they are compactable. You would give up on compaction
prematurely and cause unnecessary allocation failures.

That scenario is way more common than the one you're trying to fix.

I think trying to make this list complete, and maintaining it, is
painstaking and error prone. And errors are hard to detect: they will
just manifest as spurious failures in higher order requests that you'd
need to catch with tracing enabled in the right moments.

So I'm not a fan of this approach.

Compaction is already skipped when previous runs were not successful.
See defer_compaction() and compaction_deferred(). Why is this not
helping here?

> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);

Likewise, as Barry notes, not all pinned pages are necessarily LRU
pages. remap_vmalloc_range() pages come to mind. You can't do subset
math on potentially disjunct sets.

> +	/*
> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
> +	 * we need to check if the remaining pages are sufficient for memory
> +	 * compaction.
> +	 */
> +	if ((sum - nr_pinned) < (1 << order))
> +		return false;
> +

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-13 15:46 ` Johannes Weiner
@ 2025-01-14  2:51   ` Ge Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Ge Yang @ 2025-01-14  2:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
	liuzixing, Vlastimil Babka



在 2025/1/13 23:46, Johannes Weiner 写道:
> CC Vlastimil
> 
> On Wed, Jan 08, 2025 at 07:30:54PM +0800, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to  consistently return true.
>> However, if there aren't enough migratable pages available, performing
>> memory compaction is also meaningless. Besides checking whether
>> the order-0 watermark is met, __compaction_suitable() also needs
>> to determine whether there are sufficient migratable pages available
>> for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16G of non-CMA memory on a single node is exhausted, we will
>> fallback to allocating memory on other nodes. In order to quickly
>> fallback to remote nodes, we should skip memory compaction when
>> migratable pages are insufficient. After this fix, it only takes a
>> few tens of seconds to start a 32GB virtual machine with device
>> passthrough functionality.
>>
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>   mm/compaction.c | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>   				  int highest_zoneidx,
>>   				  unsigned long wmark_target)
>>   {
>> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +	unsigned long sum, nr_pinned;
>>   	unsigned long watermark;
>> +
>> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +		node_page_state(pgdat, NR_UNEVICTABLE);
> 
> What about PAGE_MAPPING_MOVABLE pages that aren't on this list? For
> example, zsmalloc backend pages can be a large share of allocated
> memory, and they are compactable. You would give up on compaction
> prematurely and cause unnecessary allocation failures.
> 
Yes, indeed, there are pages that are not in the LRU list but support 
migration. Currently, technologies such as balloon, z3fold, and zsmalloc 
are utilizing such pages. I feel that we could add an item to 
node_stat_item to keep statistics on these pages.

> That scenario is way more common than the one you're trying to fix.
> 
> I think trying to make this list complete, and maintaining it, is
> painstaking and error prone. And errors are hard to detect: they will
> just manifest as spurious failures in higher order requests that you'd
> need to catch with tracing enabled in the right moments.
> 
> So I'm not a fan of this approach.
> 
> Compaction is already skipped when previous runs were not successful.
> See defer_compaction() and compaction_deferred(). Why is this not
> helping here?

  if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
                                  status == COMPACT_PARTIAL_SKIPPED))
           defer_compaction(zone, order);

When prio != COMPACT_PRIO_ASYNC, defer_compaction(zone, order) will be 
executed. In the __alloc_page_slowpath() function, during the first 
execution of __alloc_pages_direct_compact(), prio is equal to 
COMPACT_PRIO_ASYNC, and therefore defer_compaction(zone, order) will not 
be executed. Instead, it will eventually proceed to the time-consuming 
__alloc_pages_direct_reclaim(). This can be avoided in scenarios where 
memory compaction is not suitable.

> 
>> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
> 
> Likewise, as Barry notes, not all pinned pages are necessarily LRU
> pages. remap_vmalloc_range() pages come to mind. You can't do subset
> math on potentially disjunct sets.

Indeed, some problem scenarios are unsolvable currently, but there are 
some scenarios that can be resolved through this approach. Currently, we 
haven't come up with a better solution yet.

> 
>> +	/*
>> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
>> +	 * we need to check if the remaining pages are sufficient for memory
>> +	 * compaction.
>> +	 */
>> +	if ((sum - nr_pinned) < (1 << order))
>> +		return false;
>> +


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-08 11:30 [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages yangge1116
  2025-01-13  8:47 ` Barry Song
  2025-01-13 15:46 ` Johannes Weiner
@ 2025-01-14 11:21 ` Vlastimil Babka
  2025-01-14 12:24   ` Ge Yang
  2 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2025-01-14 11:21 UTC (permalink / raw)
  To: yangge1116, akpm
  Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang, hannes,
	liuzixing

On 1/8/25 12:30, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
> 
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
> 
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to  consistently return true.
> However, if there aren't enough migratable pages available, performing
> memory compaction is also meaningless. Besides checking whether
> the order-0 watermark is met, __compaction_suitable() also needs
> to determine whether there are sufficient migratable pages available
> for memory compaction.
> 
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
> 
> When the 16G of non-CMA memory on a single node is exhausted, we will
> fallback to allocating memory on other nodes. In order to quickly
> fallback to remote nodes, we should skip memory compaction when
> migratable pages are insufficient. After this fix, it only takes a
> few tens of seconds to start a 32GB virtual machine with device
> passthrough functionality.
> 
> Signed-off-by: yangge <yangge1116@126.com>
> ---
> 
> V3:
> - fix build error
> 
> V2:
> - consider unevictable folios
> 
>  mm/compaction.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..a9f1261 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>  				  int highest_zoneidx,
>  				  unsigned long wmark_target)
>  {
> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
> +	unsigned long sum, nr_pinned;
>  	unsigned long watermark;
> +
> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
> +		node_page_state(pgdat, NR_UNEVICTABLE);

In addition to what Johannes pointed out, these are whole-node numbers and
compaction works on a zone level.

> +
> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);

Statistics of *events* used to derive current *state*... I don't think we do
that anywhere else? I'm not sure if we make sure vmstat events may never be
missed, as they are only for statistics. IIUC we allow some rare races to
have less expensive synchronization?

But anyway let's try looking for a different solution.

Assuming this is a THP allocation attempt (__GFP_THISNODE even?) and we are
in the "For costly allocations, try direct compaction first" part of
__alloc_pages_slowpath() right?
Hopefully also when done from the pin_user_pages_remote(..., FOLL_LONGTERM,
...) context the allocation gfp_mask correctly lacks __GFP_MOVABLE? I guess
it has to, otherwise it would allocate from the CMA pageblocks.

Then I wonder if we could use the real allocation context to determine
watermarks, as __compaction_suitable() is passing ALLOC_CMA instead because
it's checking only for migration targets, which have to be CMA compatible by
definition. But we could use the real unmovable allocation context to have
__zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass
the order-0 check anymore once the non-CMA part is exhausted.

There's some risk that in some different scenario the compaction could in
fact migrate pages from the exhausted non-CMA part of the zone to the CMA
part and succeed, and we'll skip it instead. But that should be rare?

Anyway given that concern I'm not sure about changing
__compaction_suitable() for every caller like this. We could (at least
initially) target this heuristic only for COMPACT_PRIO_ASYNC which is being
used for this THP opportunistic attempt.

So for example:
- add a new bool flag to compact_control that is true for COMPACT_PRIO_ASYNC
- pass cc pointer to compaction_suit_allocation_order()
- in that function, add another check if the the new cc flag is true,
between the current zone_watermark_ok() and compaction_suitable() checks,
which works like __compaction_suitable() but uses alloc_flags (which should
not be ALLOC_CMA in our pinned allocation case) instead of ALLOC_CMA, return
COMPACT_SKIPPED if it fails.

> +	/*
> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
> +	 * we need to check if the remaining pages are sufficient for memory
> +	 * compaction.
> +	 */
> +	if ((sum - nr_pinned) < (1 << order))
> +		return false;
> +
>  	/*
>  	 * Watermarks for order-0 must be met for compaction to be able to
>  	 * isolate free pages for migration targets. This means that the


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-14 11:21 ` Vlastimil Babka
@ 2025-01-14 12:24   ` Ge Yang
  2025-01-14 12:51     ` Vlastimil Babka
  0 siblings, 1 reply; 11+ messages in thread
From: Ge Yang @ 2025-01-14 12:24 UTC (permalink / raw)
  To: Vlastimil Babka, akpm
  Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang, hannes,
	liuzixing



在 2025/1/14 19:21, Vlastimil Babka 写道:
> On 1/8/25 12:30, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum of
>> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to  consistently return true.
>> However, if there aren't enough migratable pages available, performing
>> memory compaction is also meaningless. Besides checking whether
>> the order-0 watermark is met, __compaction_suitable() also needs
>> to determine whether there are sufficient migratable pages available
>> for memory compaction.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> When the 16G of non-CMA memory on a single node is exhausted, we will
>> fallback to allocating memory on other nodes. In order to quickly
>> fallback to remote nodes, we should skip memory compaction when
>> migratable pages are insufficient. After this fix, it only takes a
>> few tens of seconds to start a 32GB virtual machine with device
>> passthrough functionality.
>>
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>
>> V3:
>> - fix build error
>>
>> V2:
>> - consider unevictable folios
>>
>>   mm/compaction.c | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..a9f1261 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2383,7 +2383,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
>>   				  int highest_zoneidx,
>>   				  unsigned long wmark_target)
>>   {
>> +	pg_data_t __maybe_unused *pgdat = zone->zone_pgdat;
>> +	unsigned long sum, nr_pinned;
>>   	unsigned long watermark;
>> +
>> +	sum = node_page_state(pgdat, NR_INACTIVE_FILE) +
>> +		node_page_state(pgdat, NR_INACTIVE_ANON) +
>> +		node_page_state(pgdat, NR_ACTIVE_FILE) +
>> +		node_page_state(pgdat, NR_ACTIVE_ANON) +
>> +		node_page_state(pgdat, NR_UNEVICTABLE);
> 
> In addition to what Johannes pointed out, these are whole-node numbers and
> compaction works on a zone level.
> 
>> +
>> +	nr_pinned = node_page_state(pgdat, NR_FOLL_PIN_ACQUIRED) -
>> +		node_page_state(pgdat, NR_FOLL_PIN_RELEASED);
> 
> Statistics of *events* used to derive current *state*... I don't think we do
> that anywhere else? I'm not sure if we make sure vmstat events may never be
> missed, as they are only for statistics. IIUC we allow some rare races to
> have less expensive synchronization?
> 
> But anyway let's try looking for a different solution.
> 
> Assuming this is a THP allocation attempt (__GFP_THISNODE even?)
Yes, Transparent Huge Pages are allocated using the __GFP_THISNODE flag.
  and we are
> in the "For costly allocations, try direct compaction first" part of
> __alloc_pages_slowpath() right?
Yes, memory is being allocated using the following memory allocation 
strategy:

static struct page *alloc_pages_mpol()
{
      page = __alloc_frozen_pages_noprof(__GFP_THISNODE,...); // 1, try
to allocate THP only on local node

      if (page || !(gpf & __GFP_DIRECT_RECLAIM))
          return page;

      page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);//2,
fall back to remote NUMA nodes
}

> Hopefully also when done from the pin_user_pages_remote(..., FOLL_LONGTERM,
> ...) context the allocation gfp_mask correctly lacks __GFP_MOVABLE? 
yes.
I guess
> it has to, otherwise it would allocate from the CMA pageblocks.
> 
> Then I wonder if we could use the real allocation context to determine
> watermarks, as __compaction_suitable() is passing ALLOC_CMA instead because
> it's checking only for migration targets, which have to be CMA compatible by
> definition. But we could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass
> the order-0 check anymore once the non-CMA part is exhausted.
> 
> There's some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the CMA
> part and succeed, and we'll skip it instead. But that should be rare?
> 
Below is the previous discussion：
https://lore.kernel.org/lkml/1734436004-1212-1-git-send-email-yangge1116@126.com/
> Anyway given that concern I'm not sure about changing
> __compaction_suitable() for every caller like this. We could (at least
> initially) target this heuristic only for COMPACT_PRIO_ASYNC which is being
> used for this THP opportunistic attempt.
> 
> So for example:
> - add a new bool flag to compact_control that is true for COMPACT_PRIO_ASYNC
> - pass cc pointer to compaction_suit_allocation_order()
> - in that function, add another check if the the new cc flag is true,
> between the current zone_watermark_ok() and compaction_suitable() checks,
> which works like __compaction_suitable() but uses alloc_flags (which should
> not be ALLOC_CMA in our pinned allocation case) instead of ALLOC_CMA, return
> COMPACT_SKIPPED if it fails.
> 
I will send a new version of the patch based on the suggestions here. 
Thank you.
>> +	/*
>> +	 * Gup-pinned pages are non-migratable. After subtracting these pages,
>> +	 * we need to check if the remaining pages are sufficient for memory
>> +	 * compaction.
>> +	 */
>> +	if ((sum - nr_pinned) < (1 << order))
>> +		return false;
>> +
>>   	/*
>>   	 * Watermarks for order-0 must be met for compaction to be able to
>>   	 * isolate free pages for migration targets. This means that the


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-14 12:24   ` Ge Yang
@ 2025-01-14 12:51     ` Vlastimil Babka
  2025-01-15  9:17       ` Ge Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2025-01-14 12:51 UTC (permalink / raw)
  To: Ge Yang, akpm
  Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang, hannes,
	liuzixing

On 1/14/25 13:24, Ge Yang wrote:
>> Hopefully also when done from the pin_user_pages_remote(..., FOLL_LONGTERM,
>> ...) context the allocation gfp_mask correctly lacks __GFP_MOVABLE? 
> yes.
> I guess
>> it has to, otherwise it would allocate from the CMA pageblocks.
>> 
>> Then I wonder if we could use the real allocation context to determine
>> watermarks, as __compaction_suitable() is passing ALLOC_CMA instead because
>> it's checking only for migration targets, which have to be CMA compatible by
>> definition. But we could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass
>> the order-0 check anymore once the non-CMA part is exhausted.
>> 
>> There's some risk that in some different scenario the compaction could in
>> fact migrate pages from the exhausted non-CMA part of the zone to the CMA
>> part and succeed, and we'll skip it instead. But that should be rare?
>> 
> Below is the previous discussion：
> https://lore.kernel.org/lkml/1734436004-1212-1-git-send-email-yangge1116@126.com/

Right so Johannes had the same concern.

>> Anyway given that concern I'm not sure about changing
>> __compaction_suitable() for every caller like this. We could (at least
>> initially) target this heuristic only for COMPACT_PRIO_ASYNC which is being
>> used for this THP opportunistic attempt.
>> 
>> So for example:
>> - add a new bool flag to compact_control that is true for COMPACT_PRIO_ASYNC
>> - pass cc pointer to compaction_suit_allocation_order()
>> - in that function, add another check if the the new cc flag is true,
>> between the current zone_watermark_ok() and compaction_suitable() checks,
>> which works like __compaction_suitable() but uses alloc_flags (which should
>> not be ALLOC_CMA in our pinned allocation case) instead of ALLOC_CMA, return
>> COMPACT_SKIPPED if it fails.
>> 
> I will send a new version of the patch based on the suggestions here. 
> Thank you.

Yeah that way should hopefully limit the concern sufficiently. Maybe we
could also add costly_order condition in addition to COMPACT_PRIO_ASYNC
condition to set the new compact_control flag. But only __GFP_NORETRY
allocations should be affected in the immediate "goto nopage" when
compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY anyway
and won't fail without trying to compact-migrate the non-CMA pageblocks into
CMA pageblocks first, so it should be fine.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages
  2025-01-14 12:51     ` Vlastimil Babka
@ 2025-01-15  9:17       ` Ge Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Ge Yang @ 2025-01-15  9:17 UTC (permalink / raw)
  To: Vlastimil Babka, akpm
  Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang, hannes,
	liuzixing



在 2025/1/14 20:51, Vlastimil Babka 写道:
> On 1/14/25 13:24, Ge Yang wrote:
>>> Hopefully also when done from the pin_user_pages_remote(..., FOLL_LONGTERM,
>>> ...) context the allocation gfp_mask correctly lacks __GFP_MOVABLE?
>> yes.
>> I guess
>>> it has to, otherwise it would allocate from the CMA pageblocks.
>>>
>>> Then I wonder if we could use the real allocation context to determine
>>> watermarks, as __compaction_suitable() is passing ALLOC_CMA instead because
>>> it's checking only for migration targets, which have to be CMA compatible by
>>> definition. But we could use the real unmovable allocation context to have
>>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass
>>> the order-0 check anymore once the non-CMA part is exhausted.
>>>
>>> There's some risk that in some different scenario the compaction could in
>>> fact migrate pages from the exhausted non-CMA part of the zone to the CMA
>>> part and succeed, and we'll skip it instead. But that should be rare?
>>>
>> Below is the previous discussion：
>> https://lore.kernel.org/lkml/1734436004-1212-1-git-send-email-yangge1116@126.com/
> 
> Right so Johannes had the same concern.
> 
>>> Anyway given that concern I'm not sure about changing
>>> __compaction_suitable() for every caller like this. We could (at least
>>> initially) target this heuristic only for COMPACT_PRIO_ASYNC which is being
>>> used for this THP opportunistic attempt.
>>>
>>> So for example:
>>> - add a new bool flag to compact_control that is true for COMPACT_PRIO_ASYNC
>>> - pass cc pointer to compaction_suit_allocation_order()
>>> - in that function, add another check if the the new cc flag is true,
>>> between the current zone_watermark_ok() and compaction_suitable() checks,
>>> which works like __compaction_suitable() but uses alloc_flags (which should
>>> not be ALLOC_CMA in our pinned allocation case) instead of ALLOC_CMA, return
>>> COMPACT_SKIPPED if it fails.
>>>
>> I will send a new version of the patch based on the suggestions here.
>> Thank you.
> 
> Yeah that way should hopefully limit the concern sufficiently. Maybe we
> could also add costly_order condition in addition to COMPACT_PRIO_ASYNC
> condition to set the new compact_control flag. But only __GFP_NORETRY
> allocations should be affected in the immediate "goto nopage" when
> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY anyway
> and won't fail without trying to compact-migrate the non-CMA pageblocks into
> CMA pageblocks first, so it should be fine.
Ok, thanks.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-01-15  9:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-08 11:30 [PATCH V3] mm: compaction: skip memory compaction when there are not enough migratable pages yangge1116
2025-01-13  8:47 ` Barry Song
2025-01-13  9:02   ` Ge Yang
2025-01-13 10:05     ` Barry Song
2025-01-13 11:23       ` Ge Yang
2025-01-13 15:46 ` Johannes Weiner
2025-01-14  2:51   ` Ge Yang
2025-01-14 11:21 ` Vlastimil Babka
2025-01-14 12:24   ` Ge Yang
2025-01-14 12:51     ` Vlastimil Babka
2025-01-15  9:17       ` Ge Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).