Re: [PATCH V7] mm, compaction: don't use ALLOC_CMA for unmovable allocations

From: Johannes Weiner <hannes@cmpxchg.org>
To: Ge Yang <yangge1116@126.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	21cnbao@gmail.com, david@redhat.com,
	baolin.wang@linux.alibaba.com, vbabka@suse.cz,
	liuzixing@hygon.cn
Subject: Re: [PATCH V7] mm, compaction: don't use ALLOC_CMA for unmovable allocations
Date: Tue, 17 Dec 2024 22:29:36 -0500
Message-ID: <20241218032936.GB37530@cmpxchg.org>
In-Reply-To: <93cf1aee-70df-426f-a3c0-1db8068bd59a@126.com>

On Wed, Dec 18, 2024 at 10:15:06AM +0800, Ge Yang wrote:
> 
> 
> On 2024/12/17 23:55, Johannes Weiner wrote:
> > Hello Yangge,
> > 
> > On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
> >> From: yangge <yangge1116@126.com>
> >>
> >> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
> >> in __compaction_suitable()") allows compaction to proceed when the
> >> free pages required for compaction reside in CMA pageblocks, it's
> >> possible that __compaction_suitable() always returns true, which is
> >> not acceptable in some cases.
> >>
> >> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> >> of memory. I have configured 16GB of CMA memory on each NUMA node,
> >> and starting a 32GB virtual machine with device passthrough is
> >> extremely slow, taking almost an hour.
> >>
> >> During the start-up of the virtual machine, it will call
> >> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> >> Long term GUP cannot allocate memory from the CMA area, so a maximum
> >> of 16GB of non-CMA memory on a NUMA node can be used as virtual
> >> machine memory. Since there is 16GB of free CMA memory on the NUMA
> >> node, the order-0 watermark is always met for compaction, so
> >> __compaction_suitable() always returns true, even if the node is
> >> unable to allocate non-CMA memory for the virtual machine.
> >>
> >> For costly allocations, because __compaction_suitable() always
> >> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> >> place, resulting in excessively long virtual machine startup times.
> >> Call trace:
> >> __alloc_pages_slowpath
> >>      if (compact_result == COMPACT_SKIPPED ||
> >>          compact_result == COMPACT_DEFERRED)
> >>          goto nopage; // should exit __alloc_pages_slowpath() from here
> >>
> >> Other unmovable allocations, like dma_buf, which can be large in a
> >> Linux system, are also unable to allocate memory from CMA, and these
> >> allocations suffer from the same problems described above. In order
> >> to quickly fall back to a remote node, we should remove ALLOC_CMA in
> >> both __compaction_suitable() and __isolate_free_page() for unmovable
> >> allocations. After this fix, starting a 32GB virtual machine with
> >> device passthrough takes only a few seconds.
> > 
> > The symptom is obviously bad, but I don't understand this fix.
> > 
> > The reason we do ALLOC_CMA is that, even for unmovable allocations,
> > you can create space in non-CMA space by moving migratable pages over
> > to CMA space. This is not a property we want to lose. But I also don't
> > see how it would interfere with your scenario.
> 
> The __alloc_pages_slowpath() function was originally intended to exit at
> place 1, but because __compaction_suitable() always returns true, it
> exits at place 2 instead. This significantly increases the execution
> time of __alloc_pages_slowpath().
> 
> Call trace:
>   __alloc_pages_slowpath
>        if (compact_result == COMPACT_SKIPPED ||
>           compact_result == COMPACT_DEFERRED)
>            goto nopage; // place 1
>        __alloc_pages_direct_reclaim() // Reclaim is very expensive
>        __alloc_pages_direct_compact()
>        if (gfp_mask & __GFP_NORETRY)
>            goto nopage; // place 2
> 
> Every memory allocation goes through the slower path above, which
> ultimately leads to significantly longer virtual machine startup times.
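
For reference, the two exit points in the quoted trace can be modeled in
a few lines of userspace C. This is only a sketch, not kernel code: the
model_* names, the enum values and struct slowpath_trace are made up for
illustration, it assumes every allocation attempt fails, and it ignores
the other conditions the real __alloc_pages_slowpath() checks before
place 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's compaction results. */
enum compact_result_model {
	MODEL_COMPACT_SKIPPED,
	MODEL_COMPACT_DEFERRED,
	MODEL_COMPACT_OTHER,	/* compaction ran but produced no page */
};

enum exit_point {
	EXIT_PLACE_1,	/* bail out before direct reclaim */
	EXIT_PLACE_2,	/* bail out after reclaim and a second compaction */
	EXIT_RETRY,	/* neither early exit applies */
};

struct slowpath_trace {
	enum exit_point exit;
	int reclaim_calls;	/* expensive direct reclaim passes paid for */
};

/* Follows only the branches shown in the quoted call trace. */
struct slowpath_trace
model_slowpath(enum compact_result_model first_compact, bool noretry)
{
	struct slowpath_trace t = { .exit = EXIT_RETRY, .reclaim_calls = 0 };

	if (first_compact == MODEL_COMPACT_SKIPPED ||
	    first_compact == MODEL_COMPACT_DEFERRED) {
		t.exit = EXIT_PLACE_1;
		return t;
	}

	t.reclaim_calls++;	/* stands in for __alloc_pages_direct_reclaim() */
	/* __alloc_pages_direct_compact() would run here and, by assumption, fail */

	if (noretry)
		t.exit = EXIT_PLACE_2;
	return t;
}
```

In this model a first compaction result of SKIPPED or DEFERRED returns
at place 1 with zero reclaim passes, while any other result with
__GFP_NORETRY set pays for one reclaim pass before returning at place 2.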

I still don't follow. Why do you want the allocation to fail?

The changelog says this is in order to fall back quickly to other
nodes. But there is a full node walk in get_page_from_freelist()
before the allocator even engages reclaim. There is something missing
from the story still.

But regardless - surely you can see that we can't make the allocator
generally weaker on large requests just because they happen to be
optional in your specific case?

> > There is the compaction_suitable() check in should_compact_retry(),
> > but that only applies when COMPACT_SKIPPED. IOW, it should only happen
> > when compaction_suitable() just now returned false. IOW, a race
> > condition. Which is why it's also not subject to limited retries.
> > 
> > What's the exact condition that traps the allocator inside the loop?
> The should_compact_retry() function was not executed; the slowness here
> was mainly due to the execution of __alloc_pages_direct_reclaim().

Ok.
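
For reference, the watermark behavior described in the changelog can be
sketched in userspace C. This is a deliberately simplified model and not
the kernel's __zone_watermark_ok(): struct zone_model, its fields and
model_order0_watermark_ok() are hypothetical names, and the comparison
ignores lowmem reserves and everything else the real check accounts for.
It only shows why counting free CMA pages lets the order-0 check pass
when no non-CMA memory is left.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced view of a zone's free-page accounting. */
struct zone_model {
	unsigned long free_pages;	/* all free pages, CMA included */
	unsigned long free_cma_pages;	/* free pages in CMA pageblocks */
	unsigned long watermark;	/* pages that must stay free */
};

/*
 * Order-0 check: with alloc_cma set, free CMA pages count as usable;
 * without it they are subtracted first.
 */
bool model_order0_watermark_ok(const struct zone_model *z,
			       unsigned long needed, bool alloc_cma)
{
	unsigned long usable = z->free_pages;

	assert(z->free_cma_pages <= z->free_pages);
	if (!alloc_cma)
		usable -= z->free_cma_pages;

	return usable >= z->watermark + needed;
}
```

With 16GB of free memory that is entirely CMA (4194304 pages of 4KiB),
the model passes when CMA pages are counted and fails when they are not,
which is the difference the patch relies on for unmovable allocations.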

Thread overview: 7+ messages
2024-12-17 11:46 [PATCH V7] mm, compaction: don't use ALLOC_CMA for unmovable allocations yangge1116
2024-12-17 15:55 ` Johannes Weiner
2024-12-18  2:15   ` Ge Yang
2024-12-18  3:29     ` Johannes Weiner [this message]
2024-12-18  3:56       ` Ge Yang
2024-12-18  4:00       ` Ge Yang
2024-12-18  7:57   ` Baolin Wang