* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-02 16:54 [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing Pedro Falcato
@ 2026-07-02 17:24 ` Zi Yan
2026-07-03 2:53 ` Lance Yang
2026-07-03 3:49 ` Baolin Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Zi Yan @ 2026-07-02 17:24 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, linux-mm, linux-kernel, linux-fsdevel, stable,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
Song Liu, Eric Hagberg, Gregg Leventhal
On 2 Jul 2026, at 12:54, Pedro Falcato wrote:
> As-is, khugepaged and writable-file opening exclude each other. A file
> cannot be open writeable and have THPs (because the filesystem is not aware
> of them). khugepaged will never collapse file pages for files that are
> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
> particular file is dropped. This is fine because nothing could've been
> dirtied.
>
> However, there is an edge-case: collapse_file() might not be able to
> coexist with concurrent writers, but it can coexist with dirty folios
> (from previous writers). Therefore, the following can happen:
>
> open(file, O_RDWR)
> write(file)
> close(file)
> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
> open(file, O_RDWR)
> nr_thps > 0
> truncate_inode_pages()
> /* THPs are cleared out, but so are the dirty folios */
>
> When this edge-case happens, there is data loss, as the dirty folios are
> fully discarded.
>
> Fix it by fully writing back the page cache (and waiting) when collapsing
> file THPs. Doing so provides the guarantee that no dirty folio will be
> observed while there are active THPs. To fully ensure this is safe, the
> invalidate_lock needs to be held while doing the writeout, so that
> do_dentry_open()'s page cache truncation excludes this write-and-wait.
>
> Cc: stable@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Song Liu <song@kernel.org>
> Cc: Eric Hagberg <ehagberg@janestreet.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
> Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
> Tested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
> This patch is written against 7.1.0 (because the code no longer exists in mainline).
>
> Zi, I kept your Tested-by, but I had to move some things around and
> use the invalidate lock. Please re-test if you can.
Tested it again on top of v6.12 (the patch applied cleanly) and the issue
is gone. My Tested-by still holds. :)
>
> mm/khugepaged.c | 39 +++++++++++++++++++++++++--------------
> 1 file changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b8452dbdb043..0707d719a270 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2094,32 +2094,43 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> goto xa_unlocked;
> }
>
> - if (!is_shmem) {
> +xa_locked:
> + xas_unlock_irq(&xas);
> +xa_unlocked:
> +
> + /*
> + * If collapse is successful, flush must be done now before copying.
> + * If collapse is unsuccessful, does flush actually need to be done?
> + * Do it anyway, to clear the state.
> + */
> + try_to_unmap_flush();
> +
> + if (result == SCAN_SUCCEED && !is_shmem) {
> + /*
> + * invalidate_lock as shared excludes against concurrent opens
> + * in do_dentry_open() truncating the page cache. This is
> + * particularly important if there are dirty folios in transit.
> + */
> + filemap_invalidate_lock_shared(mapping);
> filemap_nr_thps_inc(mapping);
> /*
> * Paired with the fence in do_dentry_open() -> get_write_access()
> * to ensure i_writecount is up to date and the update to nr_thps
> * is visible. Ensures the page cache will be truncated if the
> - * file is opened writable.
> + * file is opened writable. If collapse looks to be successful,
> + * flush any dirty pages out the page cache. With the nr_thps
> + * incremented, there won't be any new writers (nor new dirties).
> */
> smp_mb();
> - if (inode_is_open_for_write(mapping->host)) {
> + if (inode_is_open_for_write(mapping->host) || filemap_write_and_wait(mapping)) {
> result = SCAN_FAIL;
> filemap_nr_thps_dec(mapping);
> + filemap_invalidate_unlock_shared(mapping);
> + goto rollback;
> }
> + filemap_invalidate_unlock_shared(mapping);
> }
>
> -xa_locked:
> - xas_unlock_irq(&xas);
> -xa_unlocked:
> -
> - /*
> - * If collapse is successful, flush must be done now before copying.
> - * If collapse is unsuccessful, does flush actually need to be done?
> - * Do it anyway, to clear the state.
> - */
> - try_to_unmap_flush();
> -
> if (result == SCAN_SUCCEED && nr_none &&
> !shmem_charge(mapping->host, nr_none))
> result = SCAN_FAIL;
> --
> 2.54.0
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-02 17:24 ` Zi Yan
@ 2026-07-03 2:53 ` Lance Yang
2026-07-03 9:19 ` Pedro Falcato
0 siblings, 1 reply; 12+ messages in thread
From: Lance Yang @ 2026-07-03 2:53 UTC (permalink / raw)
To: Zi Yan, Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
linux-mm, linux-kernel, linux-fsdevel, stable, Alexander Viro,
Christian Brauner, Jan Kara, Matthew Wilcox, Song Liu,
Eric Hagberg, Gregg Leventhal
On 2026/7/3 01:24, Zi Yan wrote:
> On 2 Jul 2026, at 12:54, Pedro Falcato wrote:
>
>> As-is, khugepaged and writable-file opening exclude each other. A file
>> cannot be open writeable and have THPs (because the filesystem is not aware
>> of them). khugepaged will never collapse file pages for files that are
>> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
>> particular file is dropped. This is fine because nothing could've been
>> dirtied.
>>
>> However, there is an edge-case: collapse_file() might not be able to
>> coexist with concurrent writers, but it can coexist with dirty folios
>> (from previous writers). Therefore, the following can happen:
>>
>> open(file, O_RDWR)
>> write(file)
>> close(file)
>> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
>> open(file, O_RDWR)
>> nr_thps > 0
>> truncate_inode_pages()
>> /* THPs are cleared out, but so are the dirty folios */
>>
>> When this edge-case happens, there is data loss, as the dirty folios are
>> fully discarded.
>>
>> Fix it by fully writing back the page cache (and waiting) when collapsing
>> file THPs. Doing so provides the guarantee that no dirty folio will be
>> observed while there are active THPs. To fully ensure this is safe, the
>> invalidate_lock needs to be held while doing the writeout, so that
>> do_dentry_open()'s page cache truncation excludes this write-and-wait.
>>
>> Cc: stable@vger.kernel.org
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Song Liu <song@kernel.org>
>> Cc: Eric Hagberg <ehagberg@janestreet.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
>> Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
>> Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
>> Tested-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>> ---
>> This patch is written against 7.1.0 (because the code no longer exists in mainline).
>>
>> Zi, I kept your Tested-by, but I had to move some things around and
>> use the invalidate lock. Please re-test if you can.
>
> Tested it again on top of v6.12 (the patch applied cleanly) and the issue
> is gone. My Tested-by still holds. :)
Since READ_ONLY_THP_FOR_FS is gone from mainline, just to confirm: does this
only affect stable kernels, right?
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-03 2:53 ` Lance Yang
@ 2026-07-03 9:19 ` Pedro Falcato
0 siblings, 0 replies; 12+ messages in thread
From: Pedro Falcato @ 2026-07-03 9:19 UTC (permalink / raw)
To: Lance Yang
Cc: Zi Yan, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, linux-kernel, linux-fsdevel, stable,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
Song Liu, Eric Hagberg, Gregg Leventhal
On Fri, Jul 03, 2026 at 10:53:24AM +0800, Lance Yang wrote:
>
>
> On 2026/7/3 01:24, Zi Yan wrote:
> > On 2 Jul 2026, at 12:54, Pedro Falcato wrote:
> >
> > > As-is, khugepaged and writable-file opening exclude each other. A file
> > > cannot be open writeable and have THPs (because the filesystem is not aware
> > > of them). khugepaged will never collapse file pages for files that are
> > > opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
> > > particular file is dropped. This is fine because nothing could've been
> > > dirtied.
> > >
> > > However, there is an edge-case: collapse_file() might not be able to
> > > coexist with concurrent writers, but it can coexist with dirty folios
> > > (from previous writers). Therefore, the following can happen:
> > >
> > > open(file, O_RDWR)
> > > write(file)
> > > close(file)
> > > madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
> > > open(file, O_RDWR)
> > > nr_thps > 0
> > > truncate_inode_pages()
> > > /* THPs are cleared out, but so are the dirty folios */
> > >
> > > When this edge-case happens, there is data loss, as the dirty folios are
> > > fully discarded.
> > >
> > > Fix it by fully writing back the page cache (and waiting) when collapsing
> > > file THPs. Doing so provides the guarantee that no dirty folio will be
> > > observed while there are active THPs. To fully ensure this is safe, the
> > > invalidate_lock needs to be held while doing the writeout, so that
> > > do_dentry_open()'s page cache truncation excludes this write-and-wait.
> > >
> > > Cc: stable@vger.kernel.org
> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Song Liu <song@kernel.org>
> > > Cc: Eric Hagberg <ehagberg@janestreet.com>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> > > Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
> > > Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
> > > Tested-by: Zi Yan <ziy@nvidia.com>
> > > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > > ---
> > > This patch is written against 7.1.0 (because the code no longer exists in mainline).
> > >
> > > Zi, I kept your Tested-by, but I had to move some things around and
> > > use the invalidate lock. Please re-test if you can.
> >
> > Tested it again on top of v6.12 (the patch applied cleanly) and the issue
> > is gone. My Tested-by still holds. :)
Awesome, thanks Zi!
>
> Since READ_ONLY_THP_FOR_FS is gone from mainline, just to confirm: does this
> only affect stable kernels, right?
Correct, current mainline doesn't play any sort of fun games with
truncation and the code is fully gone from do_dentry_open().
--
Pedro
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-02 16:54 [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing Pedro Falcato
2026-07-02 17:24 ` Zi Yan
@ 2026-07-03 3:49 ` Baolin Wang
2026-07-03 8:45 ` Lance Yang
2026-07-03 5:11 ` Lance Yang
2026-07-03 8:55 ` David Hildenbrand (Arm)
3 siblings, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2026-07-03 3:49 UTC (permalink / raw)
To: Pedro Falcato, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, linux-mm, linux-kernel, linux-fsdevel, stable,
Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
Song Liu, Eric Hagberg, Zi Yan, Gregg Leventhal
On 7/3/26 12:54 AM, Pedro Falcato wrote:
> As-is, khugepaged and writable-file opening exclude each other. A file
> cannot be open writeable and have THPs (because the filesystem is not aware
> of them). khugepaged will never collapse file pages for files that are
> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
> particular file is dropped. This is fine because nothing could've been
> dirtied.
>
> However, there is an edge-case: collapse_file() might not be able to
> coexist with concurrent writers, but it can coexist with dirty folios
> (from previous writers). Therefore, the following can happen:
>
> open(file, O_RDWR)
> write(file)
> close(file)
> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
> open(file, O_RDWR)
> nr_thps > 0
> truncate_inode_pages()
> /* THPs are cleared out, but so are the dirty folios */
>
> When this edge-case happens, there is data loss, as the dirty folios are
> fully discarded.
>
> Fix it by fully writing back the page cache (and waiting) when collapsing
> file THPs. Doing so provides the guarantee that no dirty folio will be
> observed while there are active THPs. To fully ensure this is safe, the
> invalidate_lock needs to be held while doing the writeout, so that
> do_dentry_open()'s page cache truncation excludes this write-and-wait.
Thanks for explaining the race, and it looks reasonable to me. One nit
below.
> Cc: stable@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Song Liu <song@kernel.org>
> Cc: Eric Hagberg <ehagberg@janestreet.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
> Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
> Tested-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
> This patch is written against 7.1.0 (because the code no longer exists in mainline).
>
> Zi, I kept your Tested-by, but I had to move some things around and
> use the invalidate lock. Please re-test if you can.
>
> mm/khugepaged.c | 39 +++++++++++++++++++++++++--------------
> 1 file changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b8452dbdb043..0707d719a270 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2094,32 +2094,43 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> goto xa_unlocked;
> }
>
> - if (!is_shmem) {
> +xa_locked:
> + xas_unlock_irq(&xas);
> +xa_unlocked:
> +
> + /*
> + * If collapse is successful, flush must be done now before copying.
> + * If collapse is unsuccessful, does flush actually need to be done?
> + * Do it anyway, to clear the state.
> + */
> + try_to_unmap_flush();
> +
> + if (result == SCAN_SUCCEED && !is_shmem) {
Actually, the operations below only for those mappings that do not
support large folios. For mappings with large folio support,
filemap_nr_thps() always returns 0, so the race described in the commit
message won't happen. We can add mapping_large_folio_support() here to
filter them out.
if (result == SCAN_SUCCEED && !is_shmem &&
!mapping_large_folio_support(mapping)) {
Otherwise LGTM.
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-03 3:49 ` Baolin Wang
@ 2026-07-03 8:45 ` Lance Yang
2026-07-03 9:17 ` Pedro Falcato
0 siblings, 1 reply; 12+ messages in thread
From: Lance Yang @ 2026-07-03 8:45 UTC (permalink / raw)
To: Baolin Wang, Pedro Falcato
Cc: Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
linux-mm, Andrew Morton, linux-kernel, linux-fsdevel, stable,
Alexander Viro, Lorenzo Stoakes, Christian Brauner, Jan Kara,
Matthew Wilcox, Song Liu, Eric Hagberg, Zi Yan, Gregg Leventhal,
David Hildenbrand
On 2026/7/3 11:49, Baolin Wang wrote:
>
>
> On 7/3/26 12:54 AM, Pedro Falcato wrote:
>> As-is, khugepaged and writable-file opening exclude each other. A file
>> cannot be open writeable and have THPs (because the filesystem is not
>> aware
>> of them). khugepaged will never collapse file pages for files that are
>> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
>> particular file is dropped. This is fine because nothing could've been
>> dirtied.
>>
>> However, there is an edge-case: collapse_file() might not be able to
>> coexist with concurrent writers, but it can coexist with dirty folios
>> (from previous writers). Therefore, the following can happen:
>>
>> open(file, O_RDWR)
>> write(file)
>> close(file)
>> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
>> open(file, O_RDWR)
>> nr_thps > 0
>> truncate_inode_pages()
>> /* THPs are cleared out, but so are the dirty folios */
>>
>> When this edge-case happens, there is data loss, as the dirty folios are
>> fully discarded.
>>
>> Fix it by fully writing back the page cache (and waiting) when collapsing
>> file THPs. Doing so provides the guarantee that no dirty folio will be
>> observed while there are active THPs. To fully ensure this is safe, the
>> invalidate_lock needs to be held while doing the writeout, so that
>> do_dentry_open()'s page cache truncation excludes this write-and-wait.
>
> Thanks for explaining the race, and it looks reasonable to me. One nit
> below.
>
>> Cc: stable@vger.kernel.org
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Song Liu <song@kernel.org>
>> Cc: Eric Hagberg <ehagberg@janestreet.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-
>> shmem) FS")
>> Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
>> Closes: https://lore.kernel.org/linux-mm/
>> CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
>> Tested-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>> ---
>> This patch is written against 7.1.0 (because the code no longer exists
>> in mainline).
>>
>> Zi, I kept your Tested-by, but I had to move some things around and
>> use the invalidate lock. Please re-test if you can.
>>
>> mm/khugepaged.c | 39 +++++++++++++++++++++++++--------------
>> 1 file changed, 25 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b8452dbdb043..0707d719a270 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2094,32 +2094,43 @@ static enum scan_result collapse_file(struct
>> mm_struct *mm, unsigned long addr,
>> goto xa_unlocked;
>> }
>> - if (!is_shmem) {
>> +xa_locked:
>> + xas_unlock_irq(&xas);
>> +xa_unlocked:
>> +
>> + /*
>> + * If collapse is successful, flush must be done now before copying.
>> + * If collapse is unsuccessful, does flush actually need to be done?
>> + * Do it anyway, to clear the state.
>> + */
>> + try_to_unmap_flush();
>> +
>> + if (result == SCAN_SUCCEED && !is_shmem) {
>
> Actually, the operations below only for those mappings that do not
> support large folios. For mappings with large folio support,
> filemap_nr_thps() always returns 0, so the race described in the commit
> message won't happen. We can add mapping_large_folio_support() here to
> filter them out.
>
> if (result == SCAN_SUCCEED && !is_shmem && !
> mapping_large_folio_support(mapping)) {
>
Right! nr_thps only gets updated when !mapping_large_folio_support(mapping).
For mappings that do support large folios, writable open won't see
nr_thps > 0, so no truncate_inode_pages() for that case :)
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-03 8:45 ` Lance Yang
@ 2026-07-03 9:17 ` Pedro Falcato
0 siblings, 0 replies; 12+ messages in thread
From: Pedro Falcato @ 2026-07-03 9:17 UTC (permalink / raw)
To: Lance Yang
Cc: Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, Andrew Morton, linux-kernel, linux-fsdevel,
stable, Alexander Viro, Lorenzo Stoakes, Christian Brauner,
Jan Kara, Matthew Wilcox, Song Liu, Eric Hagberg, Zi Yan,
Gregg Leventhal, David Hildenbrand
On Fri, Jul 03, 2026 at 04:45:34PM +0800, Lance Yang wrote:
>
>
> On 2026/7/3 11:49, Baolin Wang wrote:
> >
> >
> > On 7/3/26 12:54 AM, Pedro Falcato wrote:
> > > As-is, khugepaged and writable-file opening exclude each other. A file
> > > cannot be open writeable and have THPs (because the filesystem is
> > > not aware
> > > of them). khugepaged will never collapse file pages for files that are
> > > opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
> > > particular file is dropped. This is fine because nothing could've been
> > > dirtied.
> > >
> > > However, there is an edge-case: collapse_file() might not be able to
> > > coexist with concurrent writers, but it can coexist with dirty folios
> > > (from previous writers). Therefore, the following can happen:
> > >
> > > open(file, O_RDWR)
> > > write(file)
> > > close(file)
> > > madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
> > > open(file, O_RDWR)
> > > nr_thps > 0
> > > truncate_inode_pages()
> > > /* THPs are cleared out, but so are the dirty folios */
> > >
> > > When this edge-case happens, there is data loss, as the dirty folios are
> > > fully discarded.
> > >
> > > Fix it by fully writing back the page cache (and waiting) when collapsing
> > > file THPs. Doing so provides the guarantee that no dirty folio will be
> > > observed while there are active THPs. To fully ensure this is safe, the
> > > invalidate_lock needs to be held while doing the writeout, so that
> > > do_dentry_open()'s page cache truncation excludes this write-and-wait.
> >
> > Thanks for explaining the race, and it looks reasonable to me. One nit
> > below.
> >
> > > Cc: stable@vger.kernel.org
> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Song Liu <song@kernel.org>
> > > Cc: Eric Hagberg <ehagberg@janestreet.com>
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-
> > > shmem) FS")
> > > Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
> > > Closes: https://lore.kernel.org/linux-mm/
> > > CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
> > > Tested-by: Zi Yan <ziy@nvidia.com>
> > > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > > ---
> > > This patch is written against 7.1.0 (because the code no longer
> > > exists in mainline).
> > >
> > > Zi, I kept your Tested-by, but I had to move some things around and
> > > use the invalidate lock. Please re-test if you can.
> > >
> > > mm/khugepaged.c | 39 +++++++++++++++++++++++++--------------
> > > 1 file changed, 25 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index b8452dbdb043..0707d719a270 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2094,32 +2094,43 @@ static enum scan_result collapse_file(struct
> > > mm_struct *mm, unsigned long addr,
> > > goto xa_unlocked;
> > > }
> > > - if (!is_shmem) {
> > > +xa_locked:
> > > + xas_unlock_irq(&xas);
> > > +xa_unlocked:
> > > +
> > > + /*
> > > + * If collapse is successful, flush must be done now before copying.
> > > + * If collapse is unsuccessful, does flush actually need to be done?
> > > + * Do it anyway, to clear the state.
> > > + */
> > > + try_to_unmap_flush();
> > > +
> > > + if (result == SCAN_SUCCEED && !is_shmem) {
> >
> > Actually, the operations below only for those mappings that do not
> > support large folios. For mappings with large folio support,
> > filemap_nr_thps() always returns 0, so the race described in the commit
> > message won't happen. We can add mapping_large_folio_support() here to
> > filter them out.
> >
> > if (result == SCAN_SUCCEED && !is_shmem && !
> > mapping_large_folio_support(mapping)) {
> >
>
> Right! nr_thps only gets updated when !mapping_large_folio_support(mapping).
>
> For mappings that do support large folios, writable open won't see
> nr_thps > 0, so no truncate_inode_pages() for that case :)
Yep, thanks for the suggestions. Willy also suggested this, and I didn't get
why at the time, but looking closely at nr_thps_inc/dec, those helpers only
do something when !mapping_large_folio_support(). Fun...
I'll fix it up when sending to stable (or a possible v2).
--
Pedro
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-02 16:54 [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing Pedro Falcato
2026-07-02 17:24 ` Zi Yan
2026-07-03 3:49 ` Baolin Wang
@ 2026-07-03 5:11 ` Lance Yang
2026-07-03 9:18 ` Pedro Falcato
2026-07-03 8:55 ` David Hildenbrand (Arm)
3 siblings, 1 reply; 12+ messages in thread
From: Lance Yang @ 2026-07-03 5:11 UTC (permalink / raw)
To: pfalcato
Cc: akpm, david, ljs, baolin.wang, liam, npache, ryan.roberts,
dev.jain, baohua, lance.yang, linux-mm, linux-kernel,
linux-fsdevel, stable, viro, brauner, jack, willy, song, ehagberg,
ziy, gleventhal
On Thu, Jul 02, 2026 at 05:54:09PM +0100, Pedro Falcato wrote:
>As-is, khugepaged and writable-file opening exclude each other. A file
>cannot be open writeable and have THPs (because the filesystem is not aware
>of them). khugepaged will never collapse file pages for files that are
>opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
>particular file is dropped. This is fine because nothing could've been
>dirtied.
>
>However, there is an edge-case: collapse_file() might not be able to
>coexist with concurrent writers, but it can coexist with dirty folios
>(from previous writers). Therefore, the following can happen:
>
>open(file, O_RDWR)
>write(file)
>close(file)
>madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
>open(file, O_RDWR)
> nr_thps > 0
> truncate_inode_pages()
> /* THPs are cleared out, but so are the dirty folios */
>
>When this edge-case happens, there is data loss, as the dirty folios are
>fully discarded.
Well spotted, thanks!
>
>Fix it by fully writing back the page cache (and waiting) when collapsing
>file THPs. Doing so provides the guarantee that no dirty folio will be
>observed while there are active THPs. To fully ensure this is safe, the
>invalidate_lock needs to be held while doing the writeout, so that
>do_dentry_open()'s page cache truncation excludes this write-and-wait.
>
>Cc: stable@vger.kernel.org
>Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>Cc: Christian Brauner <brauner@kernel.org>
>Cc: Jan Kara <jack@suse.cz>
>Cc: Matthew Wilcox <willy@infradead.org>
>Cc: Song Liu <song@kernel.org>
>Cc: Eric Hagberg <ehagberg@janestreet.com>
>Cc: Zi Yan <ziy@nvidia.com>
>Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
>Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
>Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
>Tested-by: Zi Yan <ziy@nvidia.com>
>Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>---
Tested on v7.1.2. I no longer see the data loss with this patch applied.
Tested-by: Lance Yang <lance.yang@linux.dev>
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-03 5:11 ` Lance Yang
@ 2026-07-03 9:18 ` Pedro Falcato
0 siblings, 0 replies; 12+ messages in thread
From: Pedro Falcato @ 2026-07-03 9:18 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ljs, baolin.wang, liam, npache, ryan.roberts,
dev.jain, baohua, linux-mm, linux-kernel, linux-fsdevel, stable,
viro, brauner, jack, willy, song, ehagberg, ziy, gleventhal
On Fri, Jul 03, 2026 at 01:11:29PM +0800, Lance Yang wrote:
>
> On Thu, Jul 02, 2026 at 05:54:09PM +0100, Pedro Falcato wrote:
> >As-is, khugepaged and writable-file opening exclude each other. A file
> >cannot be open writeable and have THPs (because the filesystem is not aware
> >of them). khugepaged will never collapse file pages for files that are
> >opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
> >particular file is dropped. This is fine because nothing could've been
> >dirtied.
> >
> >However, there is an edge-case: collapse_file() might not be able to
> >coexist with concurrent writers, but it can coexist with dirty folios
> >(from previous writers). Therefore, the following can happen:
> >
> >open(file, O_RDWR)
> >write(file)
> >close(file)
> >madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
> >open(file, O_RDWR)
> > nr_thps > 0
> > truncate_inode_pages()
> > /* THPs are cleared out, but so are the dirty folios */
> >
> >When this edge-case happens, there is data loss, as the dirty folios are
> >fully discarded.
>
> Well spotted, thanks!
Well, Gregg deserves a lot of the credit :)
>
> >
> >Fix it by fully writing back the page cache (and waiting) when collapsing
> >file THPs. Doing so provides the guarantee that no dirty folio will be
> >observed while there are active THPs. To fully ensure this is safe, the
> >invalidate_lock needs to be held while doing the writeout, so that
> >do_dentry_open()'s page cache truncation excludes this write-and-wait.
> >
> >Cc: stable@vger.kernel.org
> >Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> >Cc: Christian Brauner <brauner@kernel.org>
> >Cc: Jan Kara <jack@suse.cz>
> >Cc: Matthew Wilcox <willy@infradead.org>
> >Cc: Song Liu <song@kernel.org>
> >Cc: Eric Hagberg <ehagberg@janestreet.com>
> >Cc: Zi Yan <ziy@nvidia.com>
> >Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> >Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
> >Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
> >Tested-by: Zi Yan <ziy@nvidia.com>
> >Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> >---
>
> Tested on v7.1.2. I no longer see the data loss with this patch applied.
>
> Tested-by: Lance Yang <lance.yang@linux.dev>
Thanks!
--
Pedro
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-02 16:54 [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing Pedro Falcato
` (2 preceding siblings ...)
2026-07-03 5:11 ` Lance Yang
@ 2026-07-03 8:55 ` David Hildenbrand (Arm)
2026-07-03 9:02 ` Lance Yang
3 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-07-03 8:55 UTC (permalink / raw)
To: Pedro Falcato, Andrew Morton, Lorenzo Stoakes
Cc: Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, linux-mm, linux-kernel, linux-fsdevel,
stable, Alexander Viro, Christian Brauner, Jan Kara,
Matthew Wilcox, Song Liu, Eric Hagberg, Zi Yan, Gregg Leventhal
On 7/2/26 18:54, Pedro Falcato wrote:
> As-is, khugepaged and writable-file opening exclude each other. A file
> cannot be open writeable and have THPs (because the filesystem is not aware
> of them). khugepaged will never collapse file pages for files that are
> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
> particular file is dropped. This is fine because nothing could've been
> dirtied.
>
> However, there is an edge-case: collapse_file() might not be able to
> coexist with concurrent writers, but it can coexist with dirty folios
> (from previous writers). Therefore, the following can happen:
>
> open(file, O_RDWR)
> write(file)
> close(file)
Okay, folios are dirty.
> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
collapse_file() has
if (!is_shmem && (folio_test_dirty(folio) ||
folio_test_writeback(folio))) {
...
result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
goto out_unlock;
}
Making us abort collapse.
What am I missing?
--
Cheers,
David
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-03 8:55 ` David Hildenbrand (Arm)
@ 2026-07-03 9:02 ` Lance Yang
2026-07-03 9:20 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 12+ messages in thread
From: Lance Yang @ 2026-07-03 9:02 UTC (permalink / raw)
To: david
Cc: pfalcato, akpm, ljs, baolin.wang, liam, npache, ryan.roberts,
dev.jain, baohua, lance.yang, linux-mm, linux-kernel,
linux-fsdevel, stable, viro, brauner, jack, willy, song, ehagberg,
ziy, gleventhal
On Fri, Jul 03, 2026 at 10:55:42AM +0200, David Hildenbrand (Arm) wrote:
>On 7/2/26 18:54, Pedro Falcato wrote:
>> As-is, khugepaged and writable-file opening exclude each other. A file
>> cannot be open writeable and have THPs (because the filesystem is not aware
>> of them). khugepaged will never collapse file pages for files that are
>> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
>> particular file is dropped. This is fine because nothing could've been
>> dirtied.
>>
>> However, there is an edge-case: collapse_file() might not be able to
>> coexist with concurrent writers, but it can coexist with dirty folios
>> (from previous writers). Therefore, the following can happen:
>>
>> open(file, O_RDWR)
>> write(file)
>> close(file)
>
>Okay, folios are dirty.
>
>> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
>
>collapse_file() has
>
> if (!is_shmem && (folio_test_dirty(folio) ||
> folio_test_writeback(folio))) {
> ...
> result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
> goto out_unlock;
> }
>
>Making us abort collapse.
>
>What am I missing?
Hmm ... dirty folios can be outside the range being collapsed ...
For example:
write/dirty: [6M, 8M)
MADV_COLLAPSE: [0M, 2M)
collapse_file() only checks the folios in the collapse range, so the
dirty/writeback check passes for [0M, 2M). But after that, for the old
READ_ONLY_THP_FOR_FS case, nr_thps gets bumped for the mapping.
Then a later writable open can hit ...
filemap_nr_thps(mapping)
-> truncate_inode_pages(mapping, 0)
and that drops page cache for the whole mapping, including the dirty
folios at [6M, 8M) ...
Cheers, Lance
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
2026-07-03 9:02 ` Lance Yang
@ 2026-07-03 9:20 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-07-03 9:20 UTC (permalink / raw)
To: Lance Yang
Cc: pfalcato, akpm, ljs, baolin.wang, liam, npache, ryan.roberts,
dev.jain, baohua, linux-mm, linux-kernel, linux-fsdevel, stable,
viro, brauner, jack, willy, song, ehagberg, ziy, gleventhal
On 7/3/26 11:02, Lance Yang wrote:
>
> On Fri, Jul 03, 2026 at 10:55:42AM +0200, David Hildenbrand (Arm) wrote:
>> On 7/2/26 18:54, Pedro Falcato wrote:
>>> As-is, khugepaged and writable-file opening exclude each other. A file
>>> cannot be open writeable and have THPs (because the filesystem is not aware
>>> of them). khugepaged will never collapse file pages for files that are
>>> opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
>>> particular file is dropped. This is fine because nothing could've been
>>> dirtied.
>>>
>>> However, there is an edge-case: collapse_file() might not be able to
>>> coexist with concurrent writers, but it can coexist with dirty folios
>>> (from previous writers). Therefore, the following can happen:
>>>
>>> open(file, O_RDWR)
>>> write(file)
>>> close(file)
>>
>> Okay, folios are dirty.
>>
>>> madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
>>
>> collapse_file() has
>>
>> if (!is_shmem && (folio_test_dirty(folio) ||
>> folio_test_writeback(folio))) {
>> ...
>> result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>> goto out_unlock;
>> }
>>
>> Making us abort collapse.
>>
>> What am I missing?
>
> Hmm ... dirty folios can be outside the range being collapsed ...
>
> For example:
>
> write/dirty: [6M, 8M)
> MADV_COLLAPSE: [0M, 2M)
>
> collapse_file() only checks the folios in the collapse range, so the
> dirty/writeback check passes for [0M, 2M). But after that, for the old
> READ_ONLY_THP_FOR_FS case, nr_thps gets bumped for the mapping.
Ahh, so it's unrelated folios, thanks for clarifying that.
("some non-dirty range" documents that)
All makes sense to me, thanks!
--
Cheers,
David
^ permalink raw reply [flat|nested] 12+ messages in thread