public inbox for linux-mm@kvack.org
From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Zi Yan <ziy@nvidia.com>
Cc: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Song Liu <songliubraving@fb.com>, Chris Mason <clm@fb.com>,
	David Sterba <dsterba@suse.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Shuah Khan <shuah@kernel.org>,
	linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v1 03/10] mm: fs: remove filemap_nr_thps*() functions and their users
Date: Thu, 2 Apr 2026 16:35:57 +0200
Message-ID: <d1fa25cb-083a-4afc-afce-a62929acbb33@kernel.org>
In-Reply-To: <44DEB48D-589B-493D-A278-4896BDB58564@nvidia.com>

On 4/1/26 22:33, Zi Yan wrote:
> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
> 
>> On 4/1/26 17:32, Zi Yan wrote:
>>>
>>>
>>> Let me think.
>>>
>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>>> inode->i_writecount atomically, which makes inode_is_open_for_write()
>>> return true. Then, do_dentry_open() also truncates all pages
>>> if filemap_nr_thps() is not zero. This pairs with khugepaged first calling
>>> filemap_nr_thps_inc() and then checking inode_is_open_for_write(), to
>>> prevent opening a fd for write while there is a read-only THP.
>>>
>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>>> on FSes with large folio support (to be precise THP support). If a fd
>>> is opened for write before inode_is_open_for_write() check, khugepaged
>>> will stop. It is fine. But if a fd is opened for write after
>>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>>> THP and the fd can be written at the same time.
>>
>> Exactly, that's the race I mean.
>>
>>>
>>> I notice that fd write requires locking the to-be-written folio first
>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>>> f_ops->write() has the same locking requirement) and khugepaged has already
>>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>>> fd is opened for write after inode_is_open_for_write() check, its write
>>> will wait for khugepaged collapse and see a new THP. Since the FS
>>> supports THP, writing to the new THP should be fine.
>>>
>>> Let me know if my analysis above makes sense. If yes, I will add it
>>> to the commit message and add a succinct comment about it before
>>> inode_is_open_for_write().
>>
>> khugepaged code is the only code that replaces folios in the pagecache
>> by other folios. So my main concern is if that is problematic on
>> concurrent write access.
> 
> folio_split() does it too, although it replaces a large folio with
> a bunch of after-split folios. It is a kinda reverse process of
> collapse_file().

Right. But with a split, you won't start looking at a small folio and
suddenly find something larger in its place.

> 
> 
>>
>> You argue that the folio lock is sufficient. That's certainly true for
>> individual folios, but I am more concerned about the replacement part.
> 
> For the replacement part, both old and new folios are locked during
> the process. A parallel writer uses filemap_get_entry() to get the folio
> from mapping, but all such callers check folio->mapping after acquiring the
> folio lock, except mincore_page(), which is a reader. A writer can see
> either the old folio or the new folio during the process:
> 
> 1. if it sees the old one, it waits on the old folio lock. After
> it acquires the lock, it sees old_folio->mapping is NULL, no longer
> matches the original mapping. The writer will try again.
> 
> 2. if it sees the new one, it waits on the new folio lock. After
> it acquires the lock, it sees new_folio->mapping matches the
> original mapping and proceeds to its writes.
> 
> 3. if khugepaged needs to do a rollback, the old folio will stay
> the same and the writer will see the old one after it gets the old
> folio lock.
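
The three cases quoted above follow the usual lookup, lock, recheck
pattern. Below is a minimal userspace sketch of that pattern in C. It is
not kernel code: struct folio, struct mapping, writer_get_locked_folio(),
collapse() and run_scenario() are hypothetical stand-ins, the "lock" is a
plain flag, and the race is made deterministic by a hook that runs at the
point where a real writer would sleep on the folio lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 4

struct mapping;

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct folio {
	struct mapping *mapping;	/* NULL once removed from the page cache */
	bool locked;
	int id;
};

struct mapping {
	struct folio *slots[NR_SLOTS];
};

/* Models what khugepaged does while the writer sleeps on the folio lock. */
typedef void (*race_hook)(struct mapping *m, void *arg);

/* Replace every small folio in the mapping with one large folio. */
static void collapse(struct mapping *m, void *arg)
{
	struct folio *new_folio = arg;

	for (size_t i = 0; i < NR_SLOTS; i++) {
		m->slots[i]->mapping = NULL;	/* old folio leaves the mapping */
		m->slots[i] = new_folio;
	}
	new_folio->mapping = m;
}

/* Writer side: look up, lock, recheck folio->mapping, retry on mismatch. */
static struct folio *writer_get_locked_folio(struct mapping *m, size_t index,
					     race_hook hook, void *arg,
					     int *retries)
{
	for (;;) {
		struct folio *folio = m->slots[index];

		if (!folio)
			return NULL;
		if (hook) {		/* runs once, "while waiting" for the lock */
			hook(m, arg);
			hook = NULL;
		}
		assert(!folio->locked);	/* the collapse has released its locks */
		folio->locked = true;
		if (folio->mapping == m)
			return folio;	/* still in the page cache: proceed */
		folio->locked = false;	/* stale folio: unlock and look up again */
		(*retries)++;
	}
}

enum scenario { SEES_OLD_THEN_COLLAPSE, SEES_NEW, ROLLBACK };

struct result {
	int folio_id;
	int retries;
};

static struct result run_scenario(enum scenario s)
{
	struct mapping m;
	struct folio small[NR_SLOTS];
	struct folio large = { .mapping = NULL, .locked = false, .id = 100 };
	struct folio *got = NULL;
	struct result r = { .folio_id = -1, .retries = 0 };

	for (size_t i = 0; i < NR_SLOTS; i++) {
		small[i] = (struct folio){ .mapping = &m, .locked = false,
					   .id = (int)i };
		m.slots[i] = &small[i];
	}

	switch (s) {
	case SEES_OLD_THEN_COLLAPSE:	/* case 1: collapse finishes during the wait */
		got = writer_get_locked_folio(&m, 2, collapse, &large, &r.retries);
		break;
	case SEES_NEW:			/* case 2: lookup happens after the collapse */
		collapse(&m, &large);
		got = writer_get_locked_folio(&m, 2, NULL, NULL, &r.retries);
		break;
	case ROLLBACK:			/* case 3: collapse rolled back, nothing changed */
		got = writer_get_locked_folio(&m, 2, NULL, NULL, &r.retries);
		break;
	}
	if (got)
		r.folio_id = got->id;
	return r;
}
```

In this sketch, case 1 ends up on the large folio (id 100) after one
retry, case 2 gets the large folio without retrying, and case 3 gets the
original small folio (id 2) without retrying.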

I am primarily wondering what would happen if someone traverses the
pagecache and has already found and processed 3 small folios, when
suddenly there is a large folio that covers the 3 small folios processed
before.

I suspect that is fine, because such code likely already has to deal with
concurrent truncation+population whenever the relevant locks are dropped.

Just raising it.

> 
>>
>> I don't have anything concrete, primarily just pointing out that this is
>> a change that might unlock some code paths that could not have been
>> triggered before.
> 
> Yes, the concern makes sense.
> 
> BTW, Claude is trying to convince me that even inode_is_open_for_write()
> is unnecessary since 1) folio_test_dirty() before it has
> made sure the folio is clean, 2) try_to_unmap() and the locked folio prevent
> further writes.
> 
> But then we find a hole between folio_test_dirty() and
> try_to_unmap() where a write via a writable mmap PTE can dirty the folio
> after folio_test_dirty() but before try_to_unmap(). To remove that hole,
> the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...)))”
> check needs to be moved after try_to_unmap(). With that, all to-be-collapsed
> folios will be clean, unmapped, and locked, where unmapped means
> writes via mmap need to fault and take the folio lock, and locked means
> writes via mmap and write() need to wait until the folio is unlocked.
> 
> Let me know if my reasoning makes sense. It is definitely worth the time
> and effort to ensure this patchset does not introduce any unexpected race
> condition or issue.

Makes sense.

Please clearly spell out that there is a slight change now, where we
might be collapsing after the file has been opened for write. Then you
can document that the folio locks should be protecting us from that.

That implies that collapsing in writable files could likely "easily" be
done in the future.

-- 
Cheers,

David

