Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb

From: Michal Hocko <mhocko@suse.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, muchun.song@linux.dev,
	osalvador@suse.de, david@redhat.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
Date: Fri, 2 Feb 2024 10:55:05 +0100
Message-ID: <Zby7-dTtPIy2k5pj@tiehlicka>
In-Reply-To: <f1606912-5bcc-46be-b4f4-666149eab7bd@linux.alibaba.com>

On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
> > > Agreed. So how about the change below?
> > > (1) disallow falling back to other nodes when handling in-use hugetlb, which
> > > can ensure consistent behavior in handling hugetlb.
> > 
> > I can see two cases here. alloc_contig_range which is an internal kernel
> > user and then we have memory offlining. The former shouldn't break the
> > per-node hugetlb pool reservations, the latter might not have any other
> > choice (the whole node could get offline and that resembles breaking cpu
> > affinity if the cpu is gone).
> 
> IMO, that is not always true for memory offlining: when handling a free
> hugetlb, it disallows falling back, which is inconsistent.

It's been some time since I've looked into that code, so I am not 100% sure how
the free pool is currently handled. The above is the way I _think_ it
should work from the usability POV.

> Not only memory offlining, but also longterm pinning (in
> migrate_longterm_unpinnable_pages()) and memory failure (in
> soft_offline_in_use_page()) can break the per-node hugetlb pool
> reservations.

That is bad.

> > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > users expectations but hugetlb migration already tries hard to allocate
> > a replacement hugetlb so the system must be under a heavy memory
> > pressure if that fails, right? Is it possible that the hugetlb
> > reservation is just overshot here? Maybe the memory is just terribly
> > fragmented though?
> > 
> > Could you be more specific about numbers in your failure case?
> 
> Sure. Our customer's machine contains several NUMA nodes, and the system
> reserves a large amount of CMA memory, 50% of the total memory, which
> is used for the virtual machine. It also reserves lots of hugetlb pages,
> which can occupy 50% of the CMA. So before the virtual machine starts, the
> hugetlb can use 50% of the CMA, but when the virtual machine starts, the
> CMA will be used by the virtual machine and the hugetlb should be migrated
> out of the CMA.

Would it make more sense for hugetlb pages to _not_ use CMA in this
case? I mean, would it be better overall if the hugetlb pool was
preallocated before the CMA is reserved? I do realize this is just
working around the current limitations, but it could be better than
nothing.

> Because the system has several nodes, one node's memory can be exhausted,
> which makes the hugetlb migration fail when the __GFP_THISNODE flag is set.

Is the workload NUMA aware? I.e. do you bind virtual machines to
specific nodes?
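
For illustration only, the difference between a node-strict replacement
allocation and one that may fall back can be modelled in userspace as
below. This is a minimal sketch, not kernel code: struct node_pool,
alloc_replacement(), NR_NODES and NO_NODE are hypothetical names, and the
fallback walks nodes in id order, whereas a real fallback would follow
the zonelist order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4      /* hypothetical number of NUMA nodes */
#define NO_NODE  (-1)   /* returned when no replacement page is found */

/* Hypothetical model: free huge pages currently available on each node. */
struct node_pool {
	unsigned long free_huge[NR_NODES];
};

/*
 * Pick a node for the replacement huge page.
 *
 * strict == true models a __GFP_THISNODE-style request: only the
 * preferred node is tried, so an exhausted node means failure.
 * strict == false models a request that may fall back to any other
 * node that still has a free huge page.
 *
 * Returns the node id the page was taken from, or NO_NODE.
 */
static int alloc_replacement(struct node_pool *pool, int preferred, bool strict)
{
	assert(pool != NULL);
	assert(preferred >= 0 && preferred < NR_NODES);

	if (pool->free_huge[preferred] > 0) {
		pool->free_huge[preferred]--;
		return preferred;
	}

	if (strict)
		return NO_NODE;

	/* Simplification: scan in node id order, not by NUMA distance. */
	for (int nid = 0; nid < NR_NODES; nid++) {
		if (pool->free_huge[nid] > 0) {
			pool->free_huge[nid]--;
			return nid;
		}
	}
	return NO_NODE;
}
```

With node 0 exhausted, the strict variant fails even though node 1 still
has free pages, while the fallback variant succeeds at the cost of
changing the per-node distribution, which is the trade-off discussed
above.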

-- 
Michal Hocko
SUSE Labs
