From: "Zi Yan" <ziy@nvidia.com>
To: "David Hildenbrand" <david@redhat.com>,
"Juan Yescas" <jyescas@google.com>
Cc: "Barry Song" <21cnbao@gmail.com>, <linux-mm@kvack.org>,
<muchun.song@linux.dev>, <rppt@kernel.org>, <osalvador@suse.de>,
<akpm@linux-foundation.org>, <lorenzo.stoakes@oracle.com>,
"Jann Horn" <jannh@google.com>, <Liam.Howlett@oracle.com>,
<minchan@kernel.org>, <jaewon31.kim@samsung.com>,
<charante@codeaurora.org>,
"Suren Baghdasaryan" <surenb@google.com>,
"Kalesh Singh" <kaleshsingh@google.com>,
"T.J. Mercier" <tjmercier@google.com>,
"Isaac Manjarres" <isaacmanjarres@google.com>,
<iamjoonsoo.kim@lge.com>, <quic_charante@quicinc.com>
Subject: Re: mm: CMA reservations require 32MiB alignment in 16KiB page size kernels instead of 8MiB in 4KiB page size kernel.
Date: Wed, 22 Jan 2025 07:49:35 -0500
Message-ID: <D78M4QZP8MSU.3DQ1YR81U2EFS@nvidia.com>
In-Reply-To: <6d13a5e9-bdff-435b-ad7a-3a3a550738b0@redhat.com>
On Wed Jan 22, 2025 at 3:11 AM EST, David Hildenbrand wrote:
> On 22.01.25 03:24, Zi Yan wrote:
>> On Tue Jan 21, 2025 at 9:08 PM EST, Juan Yescas wrote:
>>> On Mon, Jan 20, 2025 at 9:59 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 20.01.25 16:29, Zi Yan wrote:
>>>>> On Mon Jan 20, 2025 at 3:14 AM EST, David Hildenbrand wrote:
>>>>>> On 20.01.25 01:39, Zi Yan wrote:
>>>>>>> On Sun Jan 19, 2025 at 6:55 PM EST, Barry Song wrote:
>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> However, with this workaround, we can't use transparent huge pages.
>>>>>>>>>>>>
>>>>>>>>>>>> Is the CMA_MIN_ALIGNMENT_BYTES alignment requirement only there to support huge pages?
>>>>>>>>> No. CMA_MIN_ALIGNMENT_BYTES is limited by CMA_MIN_ALIGNMENT_PAGES, which
>>>>>>>>> is equal to pageblock size. Enabling THP just bumps the pageblock size.
>>>>>>>>
>>>
>>> Thanks, I can see the initialization in include/linux/pageblock-flags.h
>>>
>>> #define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>>>
>>>>>>>> Currently, THP might be mTHP, which can have a significantly smaller
>>>>>>>> size than 32MB. For
>>>>>>>> example, on arm64 systems with a 16KiB page size, a 2MB CONT-PTE mTHP
>>>>>>>> is possible.
>>>>>>>> Additionally, mTHP relies on the CONFIG_TRANSPARENT_HUGEPAGE configuration.
>>>>>>>>
>>>>>>>> I wonder if it's possible to enable CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>> without necessarily
>>>>>>>> using 32MiB THP. If we use other sizes, such as 64KiB, perhaps a large
>>>>>>>> pageblock size wouldn't
>>>>>>>> be necessary?
>>>
>>> Do you mean with mTHP? We haven't explored that option.
>>
>> Yes. Unless your applications have special demands for PMD THPs, 2MB
>> mTHP should work.
>>
>>>
>>>>>>>
>>>>>>> I think this should work by reducing MAX_PAGE_ORDER like Juan did for
>>>>>>> the experiment. But MAX_PAGE_ORDER is a macro right now, Kconfig needs
>>>>>>> to be changed and kernel needs to be recompiled. Not sure if it is OK
>>>>>>> for Juan's use case.
>>>>>>
>>>
>>> The main goal is to reserve only the necessary CMA memory for the
>>> drivers, which is
>>> usually the same for 4kb and 16kb page size kernels.
>>
>> Got it. Based on your experiment, you changed MAX_PAGE_ORDER to get the
>> minimal CMA alignment size. Can you deploy that kernel to production?
>> If yes, you can use mTHP instead of PMD THP and still get the CMA
>> alignment you want.
>>
>>>
>>>>>>
>>>>>> IIRC, we set pageblock size == THP size because this is the granularity
>>>>>> we want to optimize defragmentation for. ("try keep pageblock
>>>>>> granularity of the same memory type: movable vs. unmovable")
>>>>>
>>>>> Right. In the past, it was optimized for PMD THP. Now we have mTHP. If the
>>>>> user does not care about PMD THP (32MB in the ARM64 16KB base page case) and
>>>>> mTHP (2MB mTHP here) is good enough, reducing pageblock size works.
>>>>>
>>>>>>
>>>>>> However, the buddy already supports having different pagetypes for large
>>>>>> allocations.
>>>>>
>>>>> Right. To be clear, only MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, and
>>>>> MIGRATE_MOVABLE can be merged.
>>>>
>>>> Yes! And a THP cannot span partial MIGRATE_CMA, which would be fine.
>>>>
>>>>>
>>>>>>
>>>>>> So we could leave MAX_ORDER alone and try adjusting the pageblock size
>>>>>> in these setups. pageblock size is already variable on some
>>>>>> architectures IIRC.
>>>>>
>>>
>>> Which values would work for the CMA_MIN_ALIGNMENT_BYTES macro? In the
>>> 16KiB page size kernel,
>>> I tried these 2 configurations:
>>>
>>> #define CMA_MIN_ALIGNMENT_BYTES (2048 * CMA_MIN_ALIGNMENT_PAGES)
>>>
>>> and
>>>
>>> #define CMA_MIN_ALIGNMENT_BYTES (4096 * CMA_MIN_ALIGNMENT_PAGES)
>>>
>>> with both of them, the kernel failed to boot.
>>
>> CMA_MIN_ALIGNMENT_BYTES needs to be PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES.
>> So you need to adjust CMA_MIN_ALIGNMENT_PAGES, which is set by pageblock
>> size. pageblock size is determined by pageblock order, which is
>> affected by MAX_PAGE_ORDER.
>
> Yes, most importantly we must not exceed MAX_PAGE_ORDER. Going smaller
> is the common case.
>
>>
>>>
>>>>> Making pageblock size a boot time variable? We might want to warn
>>>>> sysadmin/user that >pageblock_order THP/mTHP creation will suffer.
>>>>
>>>> Yes, some way to configure it.
>>>>
>>>>>
>>>>>>
>>>>>> We'd only have to check if all of the THP logic can deal with pageblock
>>>>>> size < THP size.
>>>>>
>>>
>>> The reason that THP was disabled in my experiment is because this
>>> assertion failed
>>>
>>> mm/huge_memory.c
>>> /*
>>> * hugepages can't be allocated by the buddy allocator
>>> */
>>> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_PAGE_ORDER);
>>>
>>> when
>>>
>>> config ARCH_FORCE_MAX_ORDER
>>> int
>>> .....
>>> default "8" if ARM64_16K_PAGES
>>>
>>
>> You can remove that BUILD_BUG_ON and turn on mTHP and see if mTHP works.
>>
>>>
>>>>> Probably yes, pageblock should be independent of THP logic, although
>>>>> compaction (used to create THPs) logic is based on pageblock.
>>>>
>>>> Right. As raised in the past, we need a higher level mechanism that
>>>> tries to group pageblocks together during compaction/conversion to limit
>>>> fragmentation on a higher level.
>>>>
>>>> I assume that many use cases would be fine with not using 32MB/512MB
>>>> THPs at all for now -- and instead using 2 MB ones. Of course, for very
>>>> large installations it might be different.
>>>>
>>>>>>
>>>>>> This issue is even more severe on arm64 with 64k (pageblock = 512MiB).
>>>>>
>>>
>>> I agree, and if ARCH_FORCE_MAX_ORDER is configured to the max value we get:
>>>
>>> PAGE_SIZE | max MAX_PAGE_ORDER | CMA_MIN_ALIGNMENT_BYTES
>>> 4KiB      | 15                 | 4KiB * 32KiB = 128MiB
>>> 16KiB     | 13                 | 16KiB * 8KiB = 128MiB
>>> 64KiB     | 13                 | 64KiB * 8KiB = 512MiB
>>>
>>>>> This is also good for virtio-mem, since the offline memory block size
>>>>> can also be reduced. I remember you complained about it before.
>>>>
>>>> Yes, yes, yes! :)
>>>>
>>
>> David's proposal should work in general, but might take a non-trivial
>> amount of work:
>>
>> 1. keep pageblock size always at 4MB for all arch.
>
> My proposal was to leave it unchanged for most archs, but allow for
> overriding it on aarch64 as a first step.
Got it. Makes sense.
>
> s390x is happy with 1MiB, x86 with 2MiB. It's aarch64 that does
> questionable things :)
>
> CONFIG_HUGETLB_PAGE_SIZE_VARIABLE already allows for variable
> pageblock_order. That whole code likely needs some love, but most of it
> should already be there.
>
>
> In the future, I could imagine just going for a smaller pageblock size
> on aarch64, and handling fragmentation avoidance for larger THPs (512
> MiB really is close to 1 GiB on x86) differently, not using pageblocks.
Right. That is what I meant by "...compaction, to work on a different
range, independent of pageblock". But based on Juan's reply[1], 32MB PMD
THP is still needed for the deployment. That means "the future" needs to
be done to fully satisfy Juan's needs. :)
--
Best Regards,
Yan, Zi