From: Lance Yang <lance.yang@linux.dev>
To: ziy@nvidia.com
Cc: baolin.wang@linux.alibaba.com, ljs@kernel.org,
willy@infradead.org, songliubraving@fb.com, clm@fb.com,
dsterba@suse.com, viro@zeniv.linux.org.uk, brauner@kernel.org,
jack@suse.cz, akpm@linux-foundation.org, david@kernel.org,
Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
dev.jain@arm.com, baohua@kernel.org, lance.yang@linux.dev,
vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, shuah@kernel.org, linux-btrfs@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
Date: Sat, 28 Mar 2026 00:22:52 +0800 [thread overview]
Message-ID: <20260327162252.57553-1-lance.yang@linux.dev> (raw)
In-Reply-To: <6FCC0430-C98D-4D7C-8C53-F7722F1BDC4A@nvidia.com>
On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote:
>On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote:
>
>> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
>>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>>>>> also applies to shmem.
>>>>>>>>
>>>>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>>>>
>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>>>>> ---
>>>>>>>> mm/khugepaged.c | 7 +++++--
>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>>>>> --- a/mm/khugepaged.c
>>>>>>>> +++ b/mm/khugepaged.c
>>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>>>> int nr_none = 0;
>>>>>>>> bool is_shmem = shmem_file(file);
>>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>>> + return SCAN_FAIL;
>>>>>>>
>>>>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>>>>> is similar to anonymous memory. That means it will not call
>>>>>>> mapping_set_large_folios() for anonymous shmem.
>>>>>>>
>>>>>>> So I think the check should be:
>>>>>>>
>>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>>>> return SCAN_FAIL;
>>>>>>
>>>>>> Hmm but in shmem_init() we have:
>>>>>>
>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>>>>> else
>>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>>>>
>>>>>> /*
>>>>>> * Default to setting PMD-sized THP to inherit the global setting and
>>>>>> * disable all other multi-size THPs.
>>>>>> */
>>>>>> if (!shmem_orders_configured)
>>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>>>>> #endif
>>>>>>
>>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>>>>> shmem_enabled_store() updates that if necessary.
>>>>>>
>>>>>> So we're still fine right?
>>>>>>
>>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>>>>> __shmem_get_inode() which has:
>>>>>>
>>>>>> if (sbinfo->huge)
>>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>>>
>>>>>> Shared for both anon shmem and tmpfs-style shmem.
>>>>>>
>>>>>> So I think it's fine as-is.
>>>>>
>>>>> I'm afraid not. Sorry, I should have been clearer.
>>>>>
>>>>> First, anonymous shmem large order allocation is dynamically controlled via
>>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>>>>> the mTHP interfaces
>>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>>>>
>>>>> This means that during anonymous shmem initialization, these interfaces
>>>>> might be set to 'never'. so it will not call mapping_set_large_folios()
>>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>>>>
>>>>> Even if shmem large order allocation is subsequently enabled via the
>>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>>>>> again.
>>>>
>>>> I see your point, oh this is all a bit of a mess...
>>>>
>>>> It feels like entirely the wrong abstraction anyway, since at best you're
>>>> getting a global 'is enabled'.
>>>>
>>>> I guess what happened before was we'd never call into this with ! r/o thp for fs
>>>> && ! is_shmem.
>>>
>>> Right.
>>>
>>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
>>>> suggestion is correct I think actualyl.
>>>>
>>>> I do hate:
>>>>
>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>
>>>> As a bit of code though. It's horrible.
>>>
>>> Indeed.
>>>
>>>> Let's abstract that...
>>>>
>>>> It'd be nice if we could find a way to clean things up in the lead up to changes
>>>> in series like this instead of sticking with the mess, but I guess since it
>>>> mostly removes stuff that's ok for now.
>>>
>>> I think this check can be removed from this patch.
>>>
>>> During the khugepaged's scan, it will call thp_vma_allowable_order() to
>>> check if the VMA is allowed to collapse into a PMD.
>>>
>>> Specifically, within the call chain thp_vma_allowable_order() ->
>>> __thp_vma_allowable_orders(), shmem is checked via
>>> shmem_allowable_huge_orders(), while other FSes are checked via
>>> file_thp_enabled().
>
>But for madvise(MADV_COLLAPSE) case, IIRC, it ignores shmem huge config
>and can perform collapse anyway. This means without !is_shmem the check
>will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since
Right. That will break MADV_COLLAPSE, IIUC.
For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the
TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by
whether the inode mapping got mapping_set_large_folios() at creation
time.
Using mmap(MAP_SHARED | MAP_ANONYMOUS):
- create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never
- collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always
With the !is_shmem guard, collapse succeeds. Without it, the same setup
fails with -EINVAL.
Thanks,
Lance
>I was in that TVA_FORCED_COLLAPSE email thread but does not remember
>everything there.
>
>
>>
>> It sucks not to have an assert. Maybe in that case make it a
>> VM_WARN_ON_ONCE().
>
>Will do that as I replied to David already.
>
>>
>> I hate that you're left tracing things back like that...
>>
>>>
>>> For those other filesystems, Patch 5 has already added the following check,
>>> which I think is sufficient to filter out those FSes that do not support
>>> large folios:
>>>
>>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
>>> return false;
>>
>> 2 < 5, we won't tolerate bisection hazards.
>>
>>>
>>>
>>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>>>>> check for allowed large orders, rather than relying on
>>>>> mapping_max_folio_order().
>>>>>
>>>>> The mapping_max_folio_order() is intended to control large page allocation
>>>>> only for tmpfs mounts. Therefore, I find the current code confusing and
>>>>> think it needs to be fixed:
>>>>>
>>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>>>>> mapping_set_large_folios(inode->i_mapping);
>>>>
>
>Hi Baolin,
>
>Do you want to send a fix for this?
>
>Also I wonder how I can distinguish between anonymous shmem code and tmpfs code.
>I thought they are the same thing except that they have different user interface,
>but it seems that I was wrong.
>
>
>Best Regards,
>Yan, Zi
>
>
next prev parent reply other threads:[~2026-03-27 16:23 UTC|newest]
Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-27 1:42 [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig Zi Yan
2026-03-27 1:42 ` [PATCH v1 01/10] mm: remove READ_ONLY_THP_FOR_FS Kconfig option Zi Yan
2026-03-27 11:45 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:33 ` David Hildenbrand (Arm)
2026-03-27 14:39 ` Zi Yan
2026-03-27 1:42 ` [PATCH v1 02/10] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check Zi Yan
2026-03-27 7:29 ` Lance Yang
2026-03-27 7:35 ` Lance Yang
2026-03-27 9:44 ` Baolin Wang
2026-03-27 12:02 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:45 ` Baolin Wang
2026-03-27 14:12 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:26 ` Baolin Wang
2026-03-27 14:31 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:00 ` Zi Yan
2026-03-27 16:22 ` Lance Yang [this message]
2026-03-27 16:30 ` Zi Yan
2026-03-28 2:29 ` Baolin Wang
2026-03-27 12:07 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:15 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:46 ` Zi Yan
2026-03-27 13:37 ` David Hildenbrand (Arm)
2026-03-27 14:43 ` Zi Yan
2026-03-27 1:42 ` [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users Zi Yan
2026-03-27 9:32 ` Lance Yang
2026-03-27 12:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:58 ` David Hildenbrand (Arm)
2026-03-27 14:23 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:05 ` Zi Yan
2026-04-01 14:35 ` David Hildenbrand (Arm)
2026-04-01 15:32 ` Zi Yan
2026-04-01 19:15 ` David Hildenbrand (Arm)
2026-04-01 20:33 ` Zi Yan
2026-04-02 14:35 ` David Hildenbrand (Arm)
2026-04-02 14:38 ` Zi Yan
2026-03-27 1:42 ` [PATCH v1 04/10] fs: remove nr_thps from struct address_space Zi Yan
2026-03-27 12:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:00 ` David Hildenbrand (Arm)
2026-03-30 3:06 ` Lance Yang
2026-03-27 1:42 ` [PATCH v1 05/10] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled() Zi Yan
2026-03-27 12:42 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:12 ` Zi Yan
2026-03-27 15:29 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:43 ` Zi Yan
2026-03-27 16:08 ` Lorenzo Stoakes (Oracle)
2026-03-27 16:12 ` Zi Yan
2026-03-27 16:14 ` Lorenzo Stoakes (Oracle)
2026-03-29 4:07 ` WANG Rui
2026-03-30 11:17 ` Lorenzo Stoakes (Oracle)
2026-03-30 14:35 ` Zi Yan
2026-03-30 16:09 ` WANG Rui
2026-03-30 16:19 ` Matthew Wilcox
2026-04-01 14:38 ` David Hildenbrand (Arm)
2026-04-01 14:53 ` Darrick J. Wong
2026-04-10 15:06 ` Andres Freund
2026-04-10 15:18 ` Zi Yan
2026-04-10 15:38 ` Andres Freund
2026-04-13 8:46 ` Jan Kara
2026-03-27 1:42 ` [PATCH v1 06/10] mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS Zi Yan
2026-03-27 12:50 ` Lorenzo Stoakes (Oracle)
2026-03-30 9:15 ` Lance Yang
2026-03-27 1:42 ` [PATCH v1 07/10] mm/truncate: use folio_split() in truncate_inode_partial_folio() Zi Yan
2026-03-27 3:33 ` Lance Yang
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 15:35 ` Zi Yan
2026-03-28 9:54 ` kernel test robot
2026-03-28 9:54 ` kernel test robot
2026-03-27 1:42 ` [PATCH v1 08/10] fs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FS Zi Yan
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 09/10] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged Zi Yan
2026-03-27 13:05 ` Lorenzo Stoakes (Oracle)
2026-03-27 1:42 ` [PATCH v1 10/10] selftests/mm: remove READ_ONLY_THP_FOR_FS from comments in guard-regions Zi Yan
2026-03-27 13:06 ` Lorenzo Stoakes (Oracle)
2026-03-27 13:46 ` [PATCH v1 00/10] Remove READ_ONLY_THP_FOR_FS Kconfig David Hildenbrand (Arm)
2026-03-27 14:26 ` Zi Yan
2026-03-27 14:27 ` Lorenzo Stoakes (Oracle)
2026-03-27 14:30 ` Zi Yan
2026-04-05 17:38 ` Nico Pache
2026-04-06 1:59 ` Zi Yan
2026-04-06 16:17 ` Nico Pache
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260327162252.57553-1-lance.yang@linux.dev \
--to=lance.yang@linux.dev \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=brauner@kernel.org \
--cc=clm@fb.com \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=dsterba@suse.com \
--cc=jack@suse.cz \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=npache@redhat.com \
--cc=rppt@kernel.org \
--cc=ryan.roberts@arm.com \
--cc=shuah@kernel.org \
--cc=songliubraving@fb.com \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.