Re: [RFC PATCH v3 0/4] Support large folios for tmpfs

From: David Hildenbrand <david@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>,
	Daniel Gomez <d@kruces.com>, Daniel Gomez <da.gomez@samsung.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>,
	akpm@linux-foundation.org, hughd@google.com,
	wangkefeng.wang@huawei.com, 21cnbao@gmail.com,
	ryan.roberts@arm.com, ioworker0@gmail.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
Date: Thu, 31 Oct 2024 11:46:20 +0100
Message-ID: <99a3cc07-bdc3-48e2-ab5c-6f4de1bd2e7b@redhat.com>
In-Reply-To: <2782890e-09dc-46bd-ab86-1f8974c7eb7a@linux.alibaba.com>

>>> I am still worried about adding a new kconfig option, which might
>>> complicate the tmpfs controls further.
>>
>> Why exactly?
> 
> There will be more options to control huge pages allocation for tmpfs,
> which may confuse users and make life harder? Yes, we can add some
> documentation, but I'm still a bit cautious about this.

If it's just changing the default from "huge=never" to "huge=X", I don't 
see a big problem here. Again, we already do that for anon THPs.

If we make more behavior depend on that (which I don't think we should 
be doing), I agree that it would be more controversial.

[..]

>>>
>>>> That should probably do as a first shot; I assume people will want more
>>>> control over which size to use, especially during page faults, but that
>>>> can likely be added later.
>>
>> I know, it puts you in a bad position because there are different
>> opinions floating around. But let's try to find something that is
>> reasonable and still acceptable. And let's hope that Hugh will voice an
>> opinion :D
> 
> Yes, I am also waiting to see if Hugh has any inputs :)

We keep saying that ... I have to find a way to summon him :)

> 
>>> After some discussions, I think the first step is to achieve two goals:
>>> 1) Try to make tmpfs use large folios like other file systems, that
>>> means we should avoid adding more complex control options (per Matthew).
>>> 2) Still need to maintain compatibility with the 'huge=' mount option (per
>>> Kirill), as I also remembered we have customers who use
>>> 'huge=within_size' to allocate THPs for better performance.
>>
>>>
>>> Based on these considerations, my first step is to neither add a new
>>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>>> tmpfs, but rather to change the default huge allocation behavior for
>>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>>> allow the huge folios allocation based on the write size. As a result,
>>> the behavior of huge pages for tmpfs will change as follows:
>>> no 'huge=' set: can allocate any size huge folios based on write size
>>> huge=never: no any size huge folios
>>> huge=always: only PMD sized THP allocation as before
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>
>> I don't like that:
>>
>> (a) there is no way to explicitly enable/name that new behavior.
> 
> But this is similar to other file systems that enable large folios
> (setting mapping_set_large_folios()), and I haven't seen any other file
> systems supporting large folios requiring a new Kconfig. Maybe tmpfs is
> a bit special?

I'm afraid I don't have the energy to explain once more why I think 
tmpfs is not just like any other file system in some cases.

And distributions are rather careful when it comes to something like 
this ...

> 
> If we all agree that tmpfs is a bit special when using huge pages, then
> fine, a Kconfig option might be needed.
> 
>> (b) "always" etc. are only concerned about PMDs.
> 
> Yes, currently maintain the same semantics as before, in case users
> still expect THPs.

Again, I don't think that is a reasonable approach to make PMD-sized 
ones special here. It will all get seriously confusing and inconsistent.

THPs are opportunistic after all, and page fault behavior will remain 
unchanged (PMD-sized) for now. And even if we support other sizes during 
page faults, we'd likely start with the largest size (PMD-size) first, and 
it likely might just all work better than before.

Happy to learn where this really makes a difference.

Of course, if you change the default behavior (which you are planning), 
it's ... a changed default.

If there are reasons to have more tunables regarding the sizes to use, 
then it should not be limited to PMD-size.

> 
>> So again, I suggest:
>>
>> huge=never: No THPs of any size
>> huge=always: THPs of any size
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>>
>> "huge=" default depends on a Kconfig option.
>>
>> With that we:
>>
>> (1) Maximize the cases where we will use large folios of any sizes
>>       (which Willy cares about).
>> (2) Have a way to disable them completely (which I care about).
>> (3) Allow distros to keep the default unchanged.
>>
>> Likely, for now we will only try allocating PMD-sized THPs during page
>> faults, and allocate different sizes only during write(). So the effect
>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>> completely unchanged even with "huge=always".
>>
>> It will get more tricky once we change that behavior as well, but that's
>> something to likely figure out if it is a real problem on a different
>> day :)
>>
>>
>> I really preferred using the sysfs toggles (as discussed with Hugh in
>> the meeting back then), but I can also understand why we at least want
>> to try making tmpfs behave more like other file systems. But I'm a bit
>> more careful to not ignore the cases where it really isn't like any
>> other file system.
> 
> That's also my previous thought, but Matthew is strongly against that.
> Let's go step by step.

Yes, I understand his view as well.

But I won't blindly agree to the "tmpfs is just like any other file 
system" opinion :)

> 
>> If we start making PMD-sized THPs special in any non-configurable way,
>> then we are effectively *worse* off than allowing them to be configured
>> properly. So if someone voices "but we want only PMD-sized" ones, the
>> next one will say "but we only want cont-pte-sized ones" and then we
>> should provide an option to control the actual sizes to use differently,
>> in some way. But let's see if that is even required.
> 
> Yes, I agree. So what I am thinking is, the 'huge=' option should be
> gradually deprecated in the future and eventually tmpfs can allocate any
> size large folios as default.

Let's be realistic, it won't get removed any time soon. ;)

So changing "huge=always" etc. semantics to reflect our new size 
options, and then try changing the default (with the option for 
people/distros to have the old default) is a reasonable approach, at 
least to me.
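
To make the suggested semantics concrete, here is a rough sketch of how the
candidate folio orders could be picked under that proposal. It is purely
illustrative and not kernel code: the function name, its parameters, and the
PMD order of 9 (4 KiB base pages, 2 MiB PMD) are assumptions, the option names
simply mirror the list quoted above, and "within_size" is left out because the
exact i_size check is not spelled out in this thread.

```python
PMD_ORDER = 9  # assumption: 4 KiB base pages with 2 MiB PMD-sized THPs


def candidate_orders(huge, path, advised=False, kconfig_default="never"):
    """Return the large-folio orders (largest first) that may be tried.

    huge:  "never", "always", "fadvise", or None if no 'huge=' was given.
    path:  "fault" for page faults, "write" for write().
    advised: whether fadvise/madvise asked for huge pages (hypothetical flag).
    kconfig_default: stand-in for the Kconfig-selected default.
    """
    if huge is None:
        huge = kconfig_default      # "huge=" default depends on Kconfig
    if huge == "never":
        return []                   # no THPs of any size
    if huge not in ("always", "fadvise"):
        raise ValueError(f"unsupported huge= value in this sketch: {huge!r}")
    if huge == "fadvise" and not advised:
        return []                   # like "always", but only when advised
    if path == "fault":
        return [PMD_ORDER]          # faults only try PMD size for now
    if path == "write":
        # any size during write(), starting with the largest
        return list(range(PMD_ORDER, 0, -1))
    raise ValueError(f"unknown path: {path!r}")
```

With this model, a workload that only mmap()s tmpfs files sees the same
PMD-sized attempts with "huge=always" as before; only the write() path gains
the smaller orders.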

I'm trying to stay open-minded here, but the proposal I heard so far is 
not particularly appealing.

-- 
Cheers,

David / dhildenb
