From: David Hildenbrand <david@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>,
Daniel Gomez <d@kruces.com>, Daniel Gomez <da.gomez@samsung.com>,
"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>,
akpm@linux-foundation.org, hughd@google.com,
wangkefeng.wang@huawei.com, 21cnbao@gmail.com,
ryan.roberts@arm.com, ioworker0@gmail.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Date: Thu, 31 Oct 2024 11:46:20 +0100
Message-ID: <99a3cc07-bdc3-48e2-ab5c-6f4de1bd2e7b@redhat.com>
In-Reply-To: <2782890e-09dc-46bd-ab86-1f8974c7eb7a@linux.alibaba.com>
>>> I am still worried about adding a new kconfig option, which might
>>> complicate the tmpfs controls further.
>>
>> Why exactly?
>
> There will be more options to control huge pages allocation for tmpfs,
> which may confuse users and make life harder? Yes, we can add some
> documentation, but I'm still a bit cautious about this.

If it's just changing the default from "huge=never" to "huge=X", I don't
see a big problem here. Again, we already do that for anon THPs.

If we make more behavior depend on that (which I don't think we should
be doing), I agree that it would be more controversial.

[..]
>>>
>>>> That should probably do as a first shot; I assume people will want more
>>>> control over which size to use, especially during page faults, but that
>>>> can likely be added later.
>>
>> I know, it puts you in a bad position because there are different
>> opinions floating around. But let's try to find something that is
>> reasonable and still acceptable. And let's hope that Hugh will voice an
>> opinion :D
>
> Yes, I am also waiting to see if Hugh has any inputs :)
We keep saying that ... I have to find a way to summon him :)
>
>>> After some discussions, I think the first step is to achieve two goals:
>>> 1) Try to make tmpfs use large folios like other file systems, that
>>> means we should avoid adding more complex control options (per Matthew).
>>> 2) Still need to maintain compatibility with the 'huge=' mount option (per
>>> Kirill), as I also remembered we have customers who use
>>> 'huge=within_size' to allocate THPs for better performance.
>>
>>>
>>> Based on these considerations, my first step is to neither add a new
>>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>>> tmpfs, but rather to change the default huge allocation behavior for
>>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>>> allow the huge folios allocation based on the write size. As a result,
>>> the behavior of huge pages for tmpfs will change as follows:
>>>
>>> no 'huge=' set: can allocate any size huge folios based on write size
>>> huge=never: no huge folios of any size
>>> huge=always: only PMD sized THP allocation as before
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> I don't like that:
>>
>> (a) there is no way to explicitly enable/name that new behavior.
>
> But this is similar to other file systems that enable large folios
> (setting mapping_set_large_folios()), and I haven't seen any other file
> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
> a bit special?
I'm afraid I don't have the energy to explain once more why I think
tmpfs is not just like any other file system in some cases.
And distributions are rather careful when it comes to something like
this ...
>
> If we all agree that tmpfs is a bit special when using huge pages, then
> fine, a Kconfig option might be needed.
>
>> (b) "always" etc. are only concerned about PMDs.
>
> Yes, this currently maintains the same semantics as before, in case users
> still expect THPs.

Again, I don't think it is a reasonable approach to make PMD-sized
ones special here. It will all get seriously confusing and inconsistent.

THPs are opportunistic after all, and page fault behavior will remain
unchanged (PMD-sized) for now. And even if we support other sizes during
page faults, we'd likely start with the largest size (PMD-size) first, and
it likely might just all work better than before.

Happy to learn where this really makes a difference.

Of course, if you change the default behavior (which you are planning),
it's ... a changed default.

If there are reasons to have more tunables regarding the sizes to use,
then it should not be limited to PMD-size.

>
>> So again, I suggest:
>>
>> huge=never: No THPs of any size
>> huge=always: THPs of any size
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>>
>> "huge=" default depends on a Kconfig option.
>>
>> With that we:
>>
>> (1) Maximize the cases where we will use large folios of any sizes
>> (which Willy cares about).
>> (2) Have a way to disable them completely (which I care about).
>> (3) Allow distros to keep the default unchanged.
>>
>> Likely, for now we will only try allocating PMD-sized THPs during page
>> faults, and allocate different sizes only during write(). So the effect
>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>> completely unchanged even with "huge=always".
>>
>> It will get more tricky once we change that behavior as well, but that's
>> something to likely figure out, if it is a real problem, on a different
>> day :)
>>
>>
>> I really preferred using the sysfs toggles (as discussed with Hugh in
>> the meeting back then), but I can also understand why we at least want
>> to try making tmpfs behave more like other file systems. But I'm a bit
>> more careful to not ignore the cases where it really isn't like any
>> other file system.
>
> That's also my previous thought, but Matthew is strongly against that.
> Let's go step by step.
Yes, I understand his view as well.
But I won't blindly agree to the "tmpfs is just like any other file
system" opinion :)
>
>> If we start making PMD-sized THPs special in any non-configurable way,
>> then we are effectively *worse* off than allowing them to be configured
>> properly. So if someone voices "but we want only PMD-sized ones", the
>> next one will say "but we only want cont-pte-sized ones" and then we
>> should provide an option to control the actual sizes to use differently,
>> in some way. But let's see if that is even required.
>
> Yes, I agree. So what I am thinking is, the 'huge=' option should be
> gradually deprecated in the future and eventually tmpfs can allocate
> large folios of any size by default.
Let's be realistic, it won't get removed any time soon. ;)
So changing "huge=always" etc. semantics to reflect our new size
options, and then try changing the default (with the option for
people/distros to have the old default) is a reasonable approach, at
least to me.
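
To make the suggested semantics concrete, here is a toy model in Python. It is only a sketch of the table and the write()/page-fault split discussed in this thread, not kernel code. The names effective_huge(), candidate_orders() and DEFAULT_HUGE are hypothetical, 4 KiB base pages and a PMD order of 9 are assumed, and the i_size check for "within_size" is deliberately left out.

```python
PAGE_SHIFT = 12   # assumption: 4 KiB base pages
PMD_ORDER = 9     # assumption: 2 MiB PMD-sized THPs

# Hypothetical stand-in for the Kconfig-selected default of "huge=".
DEFAULT_HUGE = "never"

VALID_HUGE = ("never", "always", "fadvise", "within_size")


def effective_huge(mount_opt=None, default=DEFAULT_HUGE):
    """An explicit huge= mount option wins; otherwise use the build-time default."""
    huge = mount_opt if mount_opt is not None else default
    if huge not in VALID_HUGE:
        raise ValueError(f"unknown huge= value: {huge!r}")
    return huge


def candidate_orders(huge, path, write_len=0, advised=False):
    """Return the folio orders to try, largest first; [] means base pages only.

    path is "fault" (page fault) or "write" (write()/fallocate()).
    The i_size check implied by "within_size" is not modelled here.
    """
    if path not in ("fault", "write"):
        raise ValueError(f"unknown path: {path!r}")
    if huge == "never":
        return []
    # Per the table in this thread, both options need fadvise/madvise.
    if huge in ("fadvise", "within_size") and not advised:
        return []
    if path == "fault":
        # Page faults keep trying PMD-sized THPs only, for now.
        return [PMD_ORDER]
    # write(): the largest order whose folio size does not exceed the write
    # length, capped at PMD order, then every smaller order as a fallback.
    pages = write_len >> PAGE_SHIFT
    max_order = min(PMD_ORDER, pages.bit_length() - 1)
    return list(range(max_order, 0, -1))
```

With this model, candidate_orders("always", "write", write_len=64 * 1024) yields [4, 3, 2, 1], while candidate_orders("always", "fault") stays at [9], which is why workloads that primarily mmap() tmpfs files would see no change.
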
I'm trying to stay open-minded here, but the proposal I heard so far is
not particularly appealing.
--
Cheers,
David / dhildenb