From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: David Hildenbrand <david@redhat.com>, Daniel Gomez <d@kruces.com>,
Daniel Gomez <da.gomez@samsung.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>,
akpm@linux-foundation.org, hughd@google.com,
wangkefeng.wang@huawei.com, 21cnbao@gmail.com,
ryan.roberts@arm.com, ioworker0@gmail.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Date: Wed, 6 Nov 2024 11:17:43 +0800 [thread overview]
Message-ID: <fc6bc624-e84b-41c7-9deb-040f7ce8ee41@linux.alibaba.com> (raw)
In-Reply-To: <01f91147-c66f-4501-bd55-3ff04088e337@redhat.com>
On 2024/11/5 22:56, David Hildenbrand wrote:
> On 05.11.24 13:45, Baolin Wang wrote:
>>
>>
>> On 2024/10/31 18:46, David Hildenbrand wrote:
>> [snip]
>>
>>>>> I don't like that:
>>>>>
>>>>> (a) there is no way to explicitly enable/name that new behavior.
>>>>
>>>> But this is similar to other file systems that enable large folios
>>>> (setting mapping_set_large_folios()), and I haven't seen any other file
>>>> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
>>>> a bit special?
>>>
>>> I'm afraid I don't have the energy to explain once more why I think
>>> tmpfs is not just like any other file system in some cases.
>>>
>>> And distributions are rather careful when it comes to something like
>>> this ...
>>>
>>>>
>>>> If we all agree that tmpfs is a bit special when using huge pages, then
>>>> fine, a Kconfig option might be needed.
>>>>
>>>>> (b) "always" etc. are only concerned about PMDs.
>>>>
>>>> Yes, currently maintain the same semantics as before, in case users
>>>> still expect THPs.
>>>
>>> Again, I don't think that is a reasonable approach to make PMD-sized
>>> ones special here. It will all get seriously confusing and inconsistent.
>>
>> I agree PMD-sized should not be special. This is all for backward
>> compatibility with the ‘huge=’ mount option, and adding a new kconfig is
>> also for this purpose.
>>
>>> THPs are opportunistic after all, and page fault behavior will remain
>>> unchanged (PMD-sized) for now. And even if we support other sizes during
>>> page faults, we'd like start with the largest size (PMD-size) first, and
>>> it likely might just all work better than before.
>>>
>>> Happy to learn where this really makes a difference.
>>>
>>> Of course, if you change the default behavior (which you are planning),
>>> it's ... a changed default.
>>>
>>> If there are reasons to have more tunables regarding the sizes to use,
>>> then it should not be limited to PMD-size.
>>
>> I have tried to modify the code according to your suggestion (not tested
>> yet). These are what you had in mind?
>>
>> static inline unsigned int
>> shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
>> loff_t write_end)
>> {
>> unsigned int order;
>> size_t size;
>>
>> if (!mapping_large_folio_support(mapping) || !write_end)
>> return 0;
>>
>> /* Calculate the write size based on the write_end */
>> size = write_end - (index << PAGE_SHIFT);
>> order = filemap_get_order(size);
>> if (!order)
>> return 0;
>>
>> /* If we're not aligned, allocate a smaller folio */
>> if (index & ((1UL << order) - 1))
>> order = __ffs(index);
>>
>> order = min_t(size_t, order, MAX_PAGECACHE_ORDER);
>> return order > 0 ? BIT(order + 1) - 1 : 0;
>> }
>>
>> static unsigned int shmem_huge_global_enabled(struct inode *inode,
>> pgoff_t index,
>> loff_t write_end, bool
>> shmem_huge_force,
>> unsigned long vm_flags)
>> {
>> bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
>> unsigned long within_size_orders;
>> unsigned int order;
>> loff_t i_size;
>>
>> if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
>> return 0;
>> if (!S_ISREG(inode->i_mode))
>> return 0;
>> if (shmem_huge == SHMEM_HUGE_DENY)
>> return 0;
>> if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
>> return BIT(HPAGE_PMD_ORDER);
>>
>> switch (SHMEM_SB(inode->i_sb)->huge) {
>> case SHMEM_HUGE_NEVER:
>> return 0;
>> case SHMEM_HUGE_ALWAYS:
>> if (is_shmem ||
>> IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>> return BIT(HPAGE_PMD_ORDER);
>>
>> return shmem_mapping_size_order(inode->i_mapping,
>> index, write_end);
>> case SHMEM_HUGE_WITHIN_SIZE:
>> if (is_shmem ||
>> IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
>> within_size_orders = BIT(HPAGE_PMD_ORDER);
>> else
>> within_size_orders =
>> shmem_mapping_size_order(inode->i_mapping,
>> index, write_end);
>>
>> order = highest_order(within_size_orders);
>> while (within_size_orders) {
>> index = round_up(index + 1, 1 << order);
>> i_size = max(write_end, i_size_read(inode));
>> i_size = round_up(i_size, PAGE_SIZE);
>> if (i_size >> PAGE_SHIFT >= index)
>> return within_size_orders;
>>
>> order = next_order(&within_size_orders, order);
>> }
>> fallthrough;
>> case SHMEM_HUGE_ADVISE:
>> if (vm_flags & VM_HUGEPAGE) {
>> if (is_shmem ||
>> IS_ENABLED(USE_ONLY_THP_FOR_TMPFS))
>> return BIT(HPAGE_PMD_ORDER);
>>
>> return
>> shmem_mapping_size_order(inode->i_mapping,
>> index,
>> write_end);
>> }
>> fallthrough;
>> default:
>> return 0;
>> }
>> }
>>
>> 1) Add a new 'CONFIG_USE_ONLY_THP_FOR_TMPFS' kconfig to keep ‘huge=’
>> mount option compatibility.
>> 2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
>> then will get the possible huge orders based on the write size.
>> 3) For tmpfs mmap() fault, always use PMD-sized huge order.
>> 4) For shmem, ignore the write size logic and always use PMD-sized THP
>> to check if the global huge is enabled.
>>
>> However, in case 2), if 'huge=always' and write size is less than 4K, so
>> we will allocate small pages, that will break the 'huge' semantics?
>> Maybe it's not something to worry too much about.
>
> Probably I didn't express clearly what I think we should, because this is
> not quite what I had in mind.
>
> I would use the CONFIG_USE_ONLY_THP_FOR_TMPFS way of doing it only if
> really required. As raised, if someone needs finer control, providing that
> only for a single size is rather limiting.
OK. I misunderstood your points.
> This is what I hope we can do (doc update to show what I mean):
Thanks for updating the doc. I'd like to include them in the next version.
> diff --git a/Documentation/admin-guide/mm/transhuge.rst
> b/Documentation/admin-guide/mm/transhuge.rst
> index 5034915f4e8e8..d7d1a9acdbfc5 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -349,11 +349,24 @@ user, the PMD_ORDER hugepage policy will be
> overridden. If the policy for
> PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
> default to ``never``.
>
> -Hugepages in tmpfs/shmem
> -========================
> +tmpfs/shmem
> +===========
>
> -You can control hugepage allocation policy in tmpfs with mount option
> -``huge=``. It can have following values:
> +Traditionally, tmpfs only supported a single huge page size ("PMD").
> Today,
> +it also supports smaller sizes just like anonymous memory, often referred
> +to as "multi-size THP" (mTHP). Huge pages of any size are commonly
> +represented in the kernel as "large folios".
> +
> +While there is fine control over the huge page sizes to use for the
> internal
> +shmem mount (see below), ordinary tmpfs mounts will make use of all
> +available huge page sizes without any control over the exact sizes,
> +behaving more like other file systems.
> +
> +tmpfs mounts
> +------------
> +
> +The THP allocation policy for tmpfs mounts can be adjusted using the mount
> +option: ``huge=``. It can have following values:
>
> always
> Attempt to allocate huge pages every time we need a new page;
> @@ -368,19 +381,20 @@ within_size
> advise
> Only allocate huge pages if requested with fadvise()/madvise();
>
> -The default policy is ``never``.
> +Remember, that the kernel may use huge pages of all available sizes, and
> +that no fine control as for the internal tmpfs mount is available.
> +
> +The default policy in the past was ``never``, but it can now be adjusted
> +using the CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_ALWAYS,
> +CONFIG_TMPFS_TRANSPARENT_HUGEPAGE_NEVER etc.
>
> ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
> ``huge=never`` will not attempt to break up huge pages at all, just
> stop more
> from being allocated.
>
> -There's also sysfs knob to control hugepage allocation policy for internal
> -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
> -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
> -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> -
> -In addition to policies listed above, shmem_enabled allows two further
> -values:
> +In addition to policies listed above, the sysfs knob
> +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
> +allocation policy of tmpfs mounts, when set to the following values:
>
> deny
> For use in emergencies, to force the huge option off from
> @@ -388,13 +402,26 @@ deny
> force
> Force the huge option on for all - very useful for testing;
>
> -Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to
> -control mTHP allocation:
> -'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
> -and its value for each mTHP is essentially consistent with the global
> -setting. An 'inherit' option is added to ensure compatibility with these
> -global settings. Conversely, the options 'force' and 'deny' are dropped,
> -which are rather testing artifacts from the old ages.
> +
> +shmem / internal tmpfs
> +----------------------
> +
> +The mount internal tmpfs mount is used for SysV SHM, memfds, shared
> anonymous
> +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
> +
> +To control the THP allocation policy for this internal tmpfs mount, the
> +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
> +per THP size in
> +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
> +can be used.
> +
> +The global knob has the same semantics as the ``huge=`` mount options
> +for tmpfs mounts, except that the different huge page sizes can be
> controlled
> +individually, and will only use the setting of the global knob when the
> +per-size knob is set to 'inherit'.
> +
> +The options 'force' and 'deny' are dropped for the individual sizes, which
> +are rather testing artifacts from the old ages.
>
> always
> Attempt to allocate <size> huge pages every time we need a new page;
> diff --git a/Documentation/filesystems/tmpfs.rst
> b/Documentation/filesystems/tmpfs.rst
> index 56a26c843dbe9..10de8f706d07b 100644
>
>
>
> There is this question of "do we need the old way of doing it and only
> allocate PMDs". For that, likely a config similar to the one you propose
> might
> make sense, but I would want to see if there is real demand for that. In
> particular:
> for whom the smaller sizes are a problem when bigger (PMD) sizes were
> enabled in the past.
I am also not sure if such a case exists. I can remove this kconfig for
now, and we can consider it again if someone really complains this in
the future.
next prev parent reply other threads:[~2024-11-06 3:17 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-10 9:58 [RFC PATCH v3 0/4] Support large folios for tmpfs Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 1/4] mm: factor out the order calculation into a new helper Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 2/4] mm: shmem: change shmem_huge_global_enabled() to return huge order bitmap Baolin Wang
2024-10-13 3:30 ` kernel test robot
2024-10-16 9:30 ` Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 3/4] mm: shmem: add large folio support to the write and fallocate paths for tmpfs Baolin Wang
2024-10-10 9:58 ` [RFC PATCH v3 4/4] docs: tmpfs: add documention for 'write_size' huge option Baolin Wang
2024-10-16 7:49 ` [RFC PATCH v3 0/4] Support large folios for tmpfs Kefeng Wang
2024-10-16 9:29 ` Baolin Wang
2024-10-16 13:45 ` Kefeng Wang
2024-10-17 9:52 ` Baolin Wang
2024-10-16 14:06 ` Matthew Wilcox
2024-10-17 9:34 ` Baolin Wang
2024-10-17 11:26 ` Kirill A. Shutemov
2024-10-21 6:24 ` Baolin Wang
2024-10-21 8:54 ` Kirill A. Shutemov
2024-10-21 13:34 ` Daniel Gomez
2024-10-22 3:41 ` Baolin Wang
2024-10-22 15:31 ` David Hildenbrand
2024-10-23 8:04 ` Baolin Wang
2024-10-23 9:27 ` David Hildenbrand
2024-10-24 10:49 ` Daniel Gomez
2024-10-24 10:52 ` Daniel Gomez
2024-10-25 2:56 ` Baolin Wang
2024-10-25 20:21 ` David Hildenbrand
2024-10-28 9:48 ` David Hildenbrand
2024-10-31 3:43 ` Baolin Wang
2024-10-31 8:53 ` David Hildenbrand
2024-10-31 10:04 ` Baolin Wang
2024-10-31 10:46 ` David Hildenbrand
2024-11-05 12:45 ` Baolin Wang
2024-11-05 14:56 ` David Hildenbrand
2024-11-06 3:17 ` Baolin Wang [this message]
2024-10-31 10:46 ` David Hildenbrand
2024-10-28 21:56 ` Daniel Gomez
2024-10-29 12:20 ` David Hildenbrand
2024-10-22 3:34 ` Baolin Wang
2024-10-22 10:06 ` Kirill A. Shutemov
2024-10-23 9:25 ` Baolin Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fc6bc624-e84b-41c7-9deb-040f7ce8ee41@linux.alibaba.com \
--to=baolin.wang@linux.alibaba.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=d@kruces.com \
--cc=da.gomez@samsung.com \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=ioworker0@gmail.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ryan.roberts@arm.com \
--cc=wangkefeng.wang@huawei.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.