From: Pedro Falcato <pfalcato@suse.de>
To: Gregg Leventhal <gleventhal@janestreet.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Song Liu <song@kernel.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
Eric Hagberg <ehagberg@janestreet.com>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>, Zi Yan <ziy@nvidia.com>
Subject: Re: Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache
Date: Tue, 30 Jun 2026 19:31:04 +0100 [thread overview]
Message-ID: <akQIIav_zaK6RmIC@pedro-suse.lan> (raw)
In-Reply-To: <CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2840 bytes --]
+CC some relevant THP folks
Quick note, your email client's spacing seems to be all over the place, making
this extremely hard to read.
On Tue, Jun 30, 2026 at 01:01:53PM -0400, Gregg Leventhal wrote:
> Hello,
>
> We (Gregg Leventhal <gleventhal@janestreet.com> and Eric Hagberg
>
> <ehagberg@janestreet.com>) have a reproducible data-loss issue involving file
>
> THPs and write-open, impacting filesystems that do not support
> writable large folios.
>
>
> Attached are:
>
>
> - thp_write_open_cancel_dirty_repro.c
>
> - thp-open-writeback-before-purge.patch
>
>
>
> Summary
>
> =======
>
>
> On an affected 6.12 kernel with CONFIG_READ_ONLY_THP_FOR_FS=y, a file can
>
> contain read-only file THPs installed by khugepaged / MADV_COLLAPSE. When that
>
> same file is later opened for write, do_dentry_open() notices
>
> filemap_nr_thps() and drops the page cache:
>
>
> /*
>
> * XXX: Huge page cache doesn't support writing yet. Drop all page
>
> * cache for this file before processing writes.
>
> */
>
> if (f->f_mode & FMODE_WRITE) {
>
> if (filemap_nr_thps(inode->i_mapping)) {
>
> struct address_space *mapping = inode->i_mapping;
>
>
> filemap_invalidate_lock(inode->i_mapping);
>
> unmap_mapping_range(mapping, 0, 0, 0);
>
> truncate_inode_pages(mapping, 0);
>
> filemap_invalidate_unlock(inode->i_mapping);
>
> }
>
> }
Ugh, this is embarassing. So, good news: this code doesn't exist anymore
in mainline! Bad news: it exists on every other upstream-stable-maintained
release :|
FWIW I don't think your fix works, there's still a race there (what if
you write and wait, then someone dirties a folio, then you truncate the
pagecache? you lost data again.). I'm attaching a very quick WIP patch
that I wrote against 6.12 LTS (again, this does not exist in mainline).
I _think_ we want to go roughly in that direction, either here or in
collapse file paths. There are still problems which are invasive and
I haven't dealt with (GUP and other "temporary" folio releases being
the main one). Some of these problems may simply make it so opening
these files writable may fail (there is certainly, AFAIK, no way of
waiting for GUP and other temporary folio holders).
We would probably be served with a custom loop that forcibly yanks
only THPs out the pagecache, though. But that requires a bit more
code for a stable-only issue...
Anyway, the patch is obviously ungood and uncromulent and is only
here for a rough conversation starter. I don't think it works and
it will probably never work. mapping invalidation is simply too
best-effort for something that Just Needs(tm) to work.
--
Pedro
[-- Attachment #2: 0001-foo.patch --]
[-- Type: text/x-patch, Size: 6161 bytes --]
From 22c7255577e1efbca5186fa3a3afadf714743647 Mon Sep 17 00:00:00 2001
From: Pedro Falcato <pfalcato@suse.de>
Date: Tue, 30 Jun 2026 19:13:20 +0100
Subject: [PATCH] foo
Not-Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
fs/open.c | 15 ++--------
include/linux/pagemap.h | 1 +
mm/fadvise.c | 2 +-
mm/internal.h | 3 +-
mm/truncate.c | 63 +++++++++++++++++++++++++++++++++++++++--
5 files changed, 68 insertions(+), 16 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index be7b55260a75..8feaf87c06b8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -985,18 +985,9 @@ static int do_dentry_open(struct file *f,
* cache will fail.
*/
if (filemap_nr_thps(inode->i_mapping)) {
- struct address_space *mapping = inode->i_mapping;
-
- filemap_invalidate_lock(inode->i_mapping);
- /*
- * unmap_mapping_range just need to be called once
- * here, because the private pages is not need to be
- * unmapped mapping (e.g. data segment of dynamic
- * shared libraries here).
- */
- unmap_mapping_range(mapping, 0, 0, 0);
- truncate_inode_pages(mapping, 0);
- filemap_invalidate_unlock(inode->i_mapping);
+ error = filemap_truncate_thps(inode);
+ if (error)
+ goto cleanup_all;
}
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..401b03970f68 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -67,6 +67,7 @@ static inline int filemap_write_and_wait(struct address_space *mapping)
{
return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
}
+int filemap_truncate_thps(struct inode *inode);
/**
* filemap_set_wb_err - set a writeback error on an address_space
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 588fe76c5a14..c44a7a11eee2 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -156,7 +156,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
lru_add_drain();
mapping_try_invalidate(mapping, start_index, end_index,
- &nr_failed);
+ &nr_failed, NULL);
/*
* The failures may be due to the folio being
diff --git a/mm/internal.h b/mm/internal.h
index 3bfc1dc2d7ea..83e3bbbe18a6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -407,7 +407,8 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
loff_t end);
long mapping_evict_folio(struct address_space *mapping, struct folio *folio);
unsigned long mapping_try_invalidate(struct address_space *mapping,
- pgoff_t start, pgoff_t end, unsigned long *nr_failed);
+ pgoff_t start, pgoff_t end, unsigned long *nr_failed,
+ pgoff_t *first_fail);
/**
* folio_evictable - Test whether a folio is evictable.
diff --git a/mm/truncate.c b/mm/truncate.c
index fb5c20b57bd4..3efcadd2be4f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -490,12 +490,14 @@ EXPORT_SYMBOL(truncate_inode_pages_final);
* @start: the offset 'from' which to invalidate
* @end: the offset 'to' which to invalidate (inclusive)
* @nr_failed: How many folio invalidations failed
+ * @first_fail: What was the first offset to fail invalidation?
*
* This function is similar to invalidate_mapping_pages(), except that it
* returns the number of folios which could not be evicted in @nr_failed.
*/
unsigned long mapping_try_invalidate(struct address_space *mapping,
- pgoff_t start, pgoff_t end, unsigned long *nr_failed)
+ pgoff_t start, pgoff_t end, unsigned long *nr_failed,
+ pgoff_t *first_fail)
{
pgoff_t indices[PAGEVEC_SIZE];
struct folio_batch fbatch;
@@ -504,6 +506,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
unsigned long count = 0;
int i;
bool xa_has_values = false;
+ bool has_failed = false;
folio_batch_init(&fbatch);
while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
@@ -529,6 +532,9 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
/* Likely in the lru cache of a remote CPU */
if (nr_failed)
(*nr_failed)++;
+ if (!has_failed && first_fail)
+ *first_fail = folio_pgoff(folio);
+ has_failed = true;
}
count += ret;
}
@@ -560,7 +566,7 @@ unsigned long mapping_try_invalidate(struct address_space *mapping,
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
- return mapping_try_invalidate(mapping, start, end, NULL);
+ return mapping_try_invalidate(mapping, start, end, NULL, NULL);
}
EXPORT_SYMBOL(invalidate_mapping_pages);
@@ -864,3 +870,56 @@ void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
truncate_inode_pages_range(mapping, lstart, lend);
}
EXPORT_SYMBOL(truncate_pagecache_range);
+
+int filemap_truncate_thps(struct inode *inode)
+{
+ struct address_space *mapping = inode->i_mapping;
+ pgoff_t start_index = 0, first_fail;
+ unsigned long nr_failed = 0;
+ int err;
+
+ while (filemap_nr_thps(mapping)) {
+ nr_failed = 0;
+ first_fail = 0;
+ filemap_invalidate_lock(mapping);
+ /*
+ * unmap_mapping_range just need to be called once
+ * here, because the private pages is not need to be
+ * unmapped mapping (e.g. data segment of dynamic
+ * shared libraries here).
+ */
+ unmap_mapping_range(mapping, 0, 0, 0);
+ lru_add_drain();
+ mapping_try_invalidate(mapping, start_index, LLONG_MAX,
+ &nr_failed, &first_fail);
+ filemap_invalidate_unlock(mapping);
+ if (!nr_failed)
+ break;
+ /*
+ * The failures may be due to the folio being
+ * in the LRU cache of a remote CPU. Drain all
+ * caches, do writeback and try again.
+ */
+ lru_add_drain_all();
+ /*
+ * We now know that up to first_fail, there are no THPs. Start
+ * from there to ensure forward progress.
+ */
+ start_index = first_fail;
+
+ /*
+ * Attempt to writeback. If it fails, it's ok to fail the open. There's not
+ * much we can do in that case.
+ */
+ err = filemap_write_and_wait_range(mapping, start_index, LLONG_MAX);
+ if (err)
+ return err;
+ }
+
+ /*
+ * It should not be possible to hit this case after the above loop
+ * completes.
+ */
+ WARN_ON_ONCE(filemap_nr_thps(mapping));
+ return 0;
+}
--
2.55.0
next prev parent reply other threads:[~2026-06-30 18:31 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-30 17:01 Subject: [BUG/RFC] write-open file THP cache purge can discard dirty page cache Gregg Leventhal
2026-06-30 17:18 ` Gregg Leventhal
2026-06-30 18:31 ` Pedro Falcato [this message]
2026-06-30 18:49 ` Pedro Falcato
2026-06-30 19:55 ` Pedro Falcato
2026-06-30 22:34 ` Matthew Wilcox
2026-06-30 22:48 ` Zi Yan
2026-07-01 12:05 ` Pedro Falcato
2026-07-01 11:54 ` Matthew Wilcox
2026-07-01 12:04 ` Pedro Falcato
2026-07-01 12:48 ` Pedro Falcato
2026-07-01 13:07 ` Gregg Leventhal
2026-07-01 14:23 ` Pedro Falcato
2026-06-30 18:36 ` Matthew Wilcox
2026-06-30 19:05 ` Zi Yan
2026-06-30 19:07 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=akQIIav_zaK6RmIC@pedro-suse.lan \
--to=pfalcato@suse.de \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=david@kernel.org \
--cc=ehagberg@janestreet.com \
--cc=gleventhal@janestreet.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=song@kernel.org \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox