From: Pedro Falcato <pfalcato@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>,
"Liam R. Howlett" <liam@infradead.org>,
Nico Pache <npache@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, Pedro Falcato <pfalcato@suse.de>,
stable@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Matthew Wilcox <willy@infradead.org>, Song Liu <song@kernel.org>,
Eric Hagberg <ehagberg@janestreet.com>, Zi Yan <ziy@nvidia.com>,
Gregg Leventhal <gleventhal@janestreet.com>
Subject: [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing
Date: Thu, 2 Jul 2026 17:54:09 +0100 [thread overview]
Message-ID: <20260702165409.164568-1-pfalcato@suse.de> (raw)
As-is, khugepaged and writable-file opening exclude each other. A file
cannot be open writeable and have THPs (because the filesystem is not aware
of them). khugepaged will never collapse file pages for files that are
opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that
particular file is dropped. This is fine because nothing could've been
dirtied.
However, there is an edge-case: collapse_file() might not be able to
coexist with concurrent writers, but it can coexist with dirty folios
(from previous writers). Therefore, the following can happen:
open(file, O_RDWR)
write(file)
close(file)
madvise(file_mapping, MADV_COLLAPSE, some non-dirty range)
open(file, O_RDWR)
nr_thps > 0
truncate_inode_pages()
/* THPs are cleared out, but so are the dirty folios */
When this edge-case happens, there is data loss, as the dirty folios are
fully discarded.
Fix it by fully writing back the page cache (and waiting) when collapsing
file THPs. Doing so provides the guarantee that no dirty folio will be
observed while there are active THPs. To fully ensure this is safe, the
invalidate_lock needs to be held while doing the writeout, so that
do_dentry_open()'s page cache truncation excludes this write-and-wait.
Cc: stable@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: Eric Hagberg <ehagberg@janestreet.com>
Cc: Zi Yan <ziy@nvidia.com>
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Reported-by: Gregg Leventhal <gleventhal@janestreet.com>
Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/
Tested-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
This patch is written against 7.1.0 (because the code no longer exists in mainline).
Zi, I kept your Tested-by, but I had to move some things around and
use the invalidate lock. Please re-test if you can.
mm/khugepaged.c | 39 +++++++++++++++++++++++++--------------
1 file changed, 25 insertions(+), 14 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..0707d719a270 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2094,32 +2094,43 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
goto xa_unlocked;
}
- if (!is_shmem) {
+xa_locked:
+ xas_unlock_irq(&xas);
+xa_unlocked:
+
+ /*
+ * If collapse is successful, flush must be done now before copying.
+ * If collapse is unsuccessful, does flush actually need to be done?
+ * Do it anyway, to clear the state.
+ */
+ try_to_unmap_flush();
+
+ if (result == SCAN_SUCCEED && !is_shmem) {
+ /*
+ * invalidate_lock as shared excludes against concurrent opens
+ * in do_dentry_open() truncating the page cache. This is
+ * particularly important if there are dirty folios in transit.
+ */
+ filemap_invalidate_lock_shared(mapping);
filemap_nr_thps_inc(mapping);
/*
* Paired with the fence in do_dentry_open() -> get_write_access()
* to ensure i_writecount is up to date and the update to nr_thps
* is visible. Ensures the page cache will be truncated if the
- * file is opened writable.
+ * file is opened writable. If collapse looks to be successful,
+ * flush any dirty pages out the page cache. With the nr_thps
+ * incremented, there won't be any new writers (nor new dirties).
*/
smp_mb();
- if (inode_is_open_for_write(mapping->host)) {
+ if (inode_is_open_for_write(mapping->host) || filemap_write_and_wait(mapping)) {
result = SCAN_FAIL;
filemap_nr_thps_dec(mapping);
+ filemap_invalidate_unlock_shared(mapping);
+ goto rollback;
}
+ filemap_invalidate_unlock_shared(mapping);
}
-xa_locked:
- xas_unlock_irq(&xas);
-xa_unlocked:
-
- /*
- * If collapse is successful, flush must be done now before copying.
- * If collapse is unsuccessful, does flush actually need to be done?
- * Do it anyway, to clear the state.
- */
- try_to_unmap_flush();
-
if (result == SCAN_SUCCEED && nr_none &&
!shmem_charge(mapping->host, nr_none))
result = SCAN_FAIL;
--
2.54.0
next reply other threads:[~2026-07-03 1:10 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-02 16:54 Pedro Falcato [this message]
2026-07-02 17:24 ` [PATCH stable] mm/khugepaged: write all dirty file folios when collapsing Zi Yan
2026-07-03 2:53 ` Lance Yang
2026-07-03 9:19 ` Pedro Falcato
2026-07-03 3:49 ` Baolin Wang
2026-07-03 8:45 ` Lance Yang
2026-07-03 9:17 ` Pedro Falcato
2026-07-03 5:11 ` Lance Yang
2026-07-03 9:18 ` Pedro Falcato
2026-07-03 8:55 ` David Hildenbrand (Arm)
2026-07-03 9:02 ` Lance Yang
2026-07-03 9:20 ` David Hildenbrand (Arm)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260702165409.164568-1-pfalcato@suse.de \
--to=pfalcato@suse.de \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=brauner@kernel.org \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=ehagberg@janestreet.com \
--cc=gleventhal@janestreet.com \
--cc=jack@suse.cz \
--cc=lance.yang@linux.dev \
--cc=liam@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=npache@redhat.com \
--cc=ryan.roberts@arm.com \
--cc=song@kernel.org \
--cc=stable@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox